Contents
- CUDA error driver 700 #93
- Comments
- CUDA __syncthreads undefined. Without it -> random results
- 1 answer
- cuda runtime error (700) : an illegal memory access was encountered #42077
- Comments
- To Reproduce
- Expected behavior
- Environment
- Additional context
- CUDA matrix addition segmentation faults
- Cudadevicesynchronize returned error code 700 after launching addkernel
- Question:
- Answer #1:
- Comments:
CUDA error driver 700 #93
Hi there,
I’ve been running excavator to mine lyra2rev2 algorithm and I’m experiencing some crashes on the miner.
I’ve attached a couple of logs to see if maybe you guys can help me to identify the problem.
I’m running 4 gtx 1070 on windows 10 with the latest drivers.
log_2017-12-11_18-50-43.002.log
log_2017-12-11_20-05-56.005.log
Hi @Kayfolom, thank you for your very quick reply.
I have to be honest: I have no idea what you just told me to do ):
Is it possible for you to explain where I should use that code? Also, if I copy and paste what you just wrote, will it work on my rig, or is it just a general example?
Thank you and sorry for asking for more.
What @Kayfolom told you is to use one worker per device, not two. This looks like an overclocking problem, though; I suggest you run at default GPU settings and then adjust the settings (memory clock, core clock and power limit) until you find a stable state.
Here is an example that should work (you have to change ***************** to your username:password). You can take a look at our documentation — https://github.com/nicehash/excavator/tree/master/api#-algorithmadd .
Since you were asking where you should use this code, I’ll briefly explain that as well. You can copy the code above into a config.json file and then run excavator.exe with the -c config.json parameter.
Please take a look at our documentation and config file examples.
Source
CUDA __syncthreads undefined. Without it -> random results
I’m new to CUDA and I have a problem. I want to synchronize my threads, so I tried to use __syncthreads. The problem is that Visual Studio 2010 says: identifier __syncthreads() is undefined. By the way, I’m using CUDA 4.2. So I decided to use cudaDeviceSynchronize() instead and call it from the host. My code looks like the following (I’m only posting the important parts):
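The posted code did not survive extraction; a minimal sketch of the kind of kernel being described (names, increments, and launch configuration are assumptions, chosen to match the expected totals quoted below) would be:

__global__ void compute_avg(float *avg)
{
    // Every thread does an unguarded read-modify-write on the same two
    // global locations, so concurrent updates race and most are lost.
    avg[0] += 1.0f;
    avg[1] += 2.0f;
}

// hypothetical launch: 640 blocks x 1000 threads = 640,000 threads
compute_avg<<<640, 1000>>>(d_avg);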
I thought the results should be avg[0]=640,000 and avg[1]=1,280,000,
but not only are my results different (that could be an overflow problem), they are also unstable. For example, for three different runs the results are:
avg[0] = 3041 avg[1] = 6604
avg[0] = 3015 avg[1] = 6578
avg[0] = 3047 avg[1] = 6600
So what am I doing wrong here? Is it a synchronization problem? And why can’t I use __syncthreads()? Or is it a race condition problem?
Also, the __syncthreads() problem is present in any code I write, even the simplest:
It says the following: Error: identifier "__syncthreads()" is undefined
The funny thing is that the same thing happens even with the sample code that ships with the CUDA 4.2 SDK. Maybe something more general is wrong, because in the SDK samples there are more functions that are considered undefined.
Shouldn’t it be size=sizeof(float)*2? — talonmies
I’m not sure what to say about the __syncthreads() problem. The code you added compiles and runs fine for me. I think that indicates something is wrong with your environment. You may want to carefully follow the steps in the Windows getting started guide (after first removing the CUDA version you have installed). — Robert Crovella
It looks like the __syncthreads() problem may be related to an interaction between Visual Studio and the specific include files you have. When compiling .cu files with nvcc (even in VS), it is normally not necessary to explicitly include cuda_runtime.h and device_launch_parameters.h (you’ll notice my answer doesn’t contain them). Try removing those include statements from your source files and then see whether you can compile with __syncthreads(). — Robert Crovella
1 answer
All of your thread blocks write to the same two locations. The only way to make that work correctly is to use atomic operations; otherwise the results of threads reading a location, adding to it, and writing the result back "simultaneously" are undefined.
If you rewrite the kernel as follows:
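The rewritten kernel from the answer was not preserved by the page; the fix being described amounts to replacing the unguarded updates with atomics (a sketch, reusing the hypothetical names from above):

__global__ void compute_avg(float *avg)
{
    // atomicAdd serializes each read-modify-write, so every thread's
    // contribution is applied exactly once.
    atomicAdd(&avg[0], 1.0f);
    atomicAdd(&avg[1], 2.0f);
}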
This should resolve the issue you are seeing.
To answer the question about __syncthreads(), I would need to see the exact code that caused the compiler error. If you post it, I will update my answer. There should be no problem inserting a __syncthreads() call into this kernel, although that will not fix the problem you are seeing.
You may want to review the atomic operations section of the CUDA C programming guide.
Note that using atomics generally makes your code slower, so they should be used carefully. For this learning exercise, however, they should fix the problem for you.
Also note that the code you posted does not compile cleanly; there are a number of missing definitions and various other problems. But since you are posting results, I assume you have some working version of it, even if you have not posted it, so I have not identified every problem with the code you posted.
Here is code similar to yours with all the coding issues fixed, and it seems to work for me:
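That complete example also did not survive extraction. A self-contained reconstruction under the same assumptions (the 640 x 1000 launch is chosen to match the expected results quoted in the question):

#include <stdio.h>

__global__ void compute_avg(float *avg)
{
    atomicAdd(&avg[0], 1.0f);
    atomicAdd(&avg[1], 2.0f);
}

int main()
{
    float h_avg[2] = { 0.0f, 0.0f };
    float *d_avg;

    cudaMalloc(&d_avg, 2 * sizeof(float));
    cudaMemcpy(d_avg, h_avg, 2 * sizeof(float), cudaMemcpyHostToDevice);

    compute_avg<<<640, 1000>>>(d_avg);   // 640,000 threads in total
    cudaDeviceSynchronize();

    cudaMemcpy(h_avg, d_avg, 2 * sizeof(float), cudaMemcpyDeviceToHost);
    printf("avg[0]=%f avg[1]=%f\n", h_avg[0], h_avg[1]);

    cudaFree(d_avg);
    return 0;
}

With atomics this sketch is deterministic: 640,000 adds of 1.0f and 2.0f give 640,000 and 1,280,000 respectively (both exactly representable in float).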
Я получаю следующий вывод, когда запускаю его:
Also note that for atomicAdd to be usable on float, you need a device of compute capability 2.0 or higher (and must pass a compiler switch such as -arch=sm_20 to compile for such a device). If you have an earlier device (compute capability 1.x), you can create a similar program defining avg[] as int instead of float. Or, if you prefer, you can write your own __device__ atomicAdd function usable on a cc 1.x device, as suggested here in the section beginning with "Note however that any atomic operation can be implemented based on atomicCAS() (Compare And Swap)".
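For reference, the atomicCAS()-based float atomicAdd the answer alludes to follows the standard pattern from the CUDA C Programming Guide (sketched here under a distinct name so it does not clash with the built-in on cc 2.0+ devices):

__device__ float atomicAddFloat(float *address, float val)
{
    int *address_as_int = (int *)address;
    int old = *address_as_int, assumed;
    do {
        assumed = old;
        // Reinterpret the float bits as int, CAS them in, and retry if
        // another thread changed the value in the meantime.
        old = atomicCAS(address_as_int, assumed,
                        __float_as_int(val + __int_as_float(assumed)));
    } while (assumed != old);
    return __int_as_float(old);
}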
Source
cuda runtime error (700) : an illegal memory access was encountered #42077
I performed some torch.chunk and torch.cat operations on tensors (Conv2D feature maps) during the training of my object detection network, and the forward propagation constantly crashes here.
Related codes are:
The error messages are:
which traces to print(centernesses) if that statement is uncommented; otherwise, it traces to the next line:
To Reproduce
Steps to reproduce the behavior:
Unknown. I experimentally created some tensors in GPU mode but could not trigger the error that way. However, my object detection network training constantly crashes here.
There are also some detailed call stack records, which seem to me to indicate that the error is related to some garbage collection mechanism:
Expected behavior
No such error occurs.
Environment
PyTorch version: 1.4.0
Is debug build: No
CUDA used to build PyTorch: 10.1
OS: CentOS Linux 7 (Core)
GCC version: (GCC) 4.8.5 20150623 (Red Hat 4.8.5-39)
CMake version: Could not collect
Python version: 3.7
Is CUDA available: Yes
CUDA runtime version: 10.1.243
GPU models and configuration:
GPU 0: Tesla V100-SXM2-16GB
GPU 1: Tesla V100-SXM2-16GB
GPU 2: Tesla V100-SXM2-16GB
GPU 3: Tesla V100-SXM2-16GB
Nvidia driver version: 440.64.00
cuDNN version: Could not collect
Versions of relevant libraries:
[pip] numpy==1.17.0
[pip] torch==1.4.0
[pip] torchvision==0.4.0a0
[conda] _pytorch_select 0.2 gpu_0 defaults
[conda] blas 1.0 mkl defaults
[conda] mkl 2019.4 243 defaults
[conda] mkl-service 2.3.0 py37he904b0f_0 defaults
[conda] mkl_fft 1.0.15 py37ha843d7b_0 defaults
[conda] mkl_random 1.1.0 py37hd6b4f25_0 defaults
[conda] pytorch 1.2.0 cuda100py37h938c94c_0 defaults
[conda] torch 1.4.0 pypi_0 pypi
[conda] torchvision 0.4.0 cuda100py37hecfc37a_0 defaults
Additional context
Source
CUDA matrix addition segmentation faults
I just have a question about a CUDA program I wrote. It lets me input the size of the matrix, columns and rows. Say I input
1124: it works fine. However, say I input 1149: it seg faults AFTER the computation on the device (I think it seg faults during the copy back). But say I input 2000: it seg faults BEFORE the computation on the device (I think during the copy over). I think my problem is memory management. If you guys could point me in the right direction, I would appreciate it.
I have updated the code to show how it is called. The new edit (at the bottom) contains: sumMatrix (an empty matrix with size eleCount1, which is the size of the entire matrix), matrixOne (the first matrix), matrixTwo (the second matrix, allocated the same way as matrixOne), and eleCount1 (the entire matrix size). Both matrixOne and matrixTwo are read in from a file.
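The question's code was likewise lost in extraction. Below is a minimal sketch of the allocation/copy/launch pattern being described, with the error checking that usually localizes this kind of segfault; the kernel and every name other than sumMatrix, matrixOne, matrixTwo and eleCount1 are assumptions:

#include <stdio.h>
#include <stdlib.h>

__global__ void addMatrix(float *sum, const float *a, const float *b, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)                        // guard: the grid is rounded up past n
        sum[i] = a[i] + b[i];
}

int main()
{
    int eleCount1 = 1149 * 1149;      // total element count (rows * cols)
    size_t bytes = (size_t)eleCount1 * sizeof(float);

    // Host buffers must be at least as large as what is copied; undersized
    // host buffers are a classic cause of segfaults inside cudaMemcpy.
    float *matrixOne = (float *)calloc(eleCount1, sizeof(float));
    float *matrixTwo = (float *)calloc(eleCount1, sizeof(float));
    float *sumMatrix = (float *)calloc(eleCount1, sizeof(float));

    float *d_one, *d_two, *d_sum;
    cudaMalloc(&d_one, bytes);
    cudaMalloc(&d_two, bytes);
    cudaMalloc(&d_sum, bytes);

    cudaMemcpy(d_one, matrixOne, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_two, matrixTwo, bytes, cudaMemcpyHostToDevice);

    int block = 256;
    int grid = (eleCount1 + block - 1) / block;   // round up, don't truncate
    addMatrix<<<grid, block>>>(d_sum, d_one, d_two, eleCount1);

    cudaError_t err = cudaMemcpy(sumMatrix, d_sum, bytes, cudaMemcpyDeviceToHost);
    if (err != cudaSuccess)
        fprintf(stderr, "copy back failed: %s\n", cudaGetErrorString(err));

    cudaFree(d_one); cudaFree(d_two); cudaFree(d_sum);
    free(matrixOne); free(matrixTwo); free(sumMatrix);
    return 0;
}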
I was not sure whether anyone would need to see this about my GPU:
- Total amount of constant memory: 65536 bytes
- Total amount of shared memory per block: 49152 bytes
- Total number of registers available per block: 32768
- Warp size: 32
- Maximum number of threads per block: 1024
- Maximum dimensions of a thread block: 1024 x 1024 x 64
- Maximum dimensions of a grid: 65535 x 65535 x 65535
Source
Cudadevicesynchronize returned error code 700 after launching addkernel
#c #visual-studio #cuda
Question:
I am trying to learn CUDA and convert my current project to use it, and I am getting this error:
Error MSB3721 The command ""C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.2\bin\nvcc.exe" -gencode=arch=compute_52,code="sm_52,compute_52" --use-local-env -ccbin "C:\Program Files (x86)\Microsoft Visual Studio\2019\Community\VC\Tools\MSVC\14.26.28801\bin\HostX86\x64" -x cu -I"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.2\include" -I"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.2\include" -G --keep-dir x64\Debug -maxrregcount=0 --machine 64 --compile -cudart static -g -D_DEBUG -D_CONSOLE -D_UNICODE -DUNICODE -Xcompiler "/EHsc /W3 /nologo /Od /Fdx64\Debug\vc142.pdb /FS /Zi /RTC1 /MDd" -o x64\Debug\cudaMain.cu.obj "C:\Users\[my username]\source\repos\LogicGateMachineLearning_V2_Solution\LogicGateMachineLearning_V2\cudaMain.cu"" exited with code 255.
I am using a .cuh file; for the way I declare classes, it gives me a warning that the "attribute does not apply to any entity". Do I need to mark them? I also get a warning telling me: "Warning C26812: The enum type 'cudaError' is unscoped. Prefer 'enum class' over 'enum' (Enum.3)".
The .cu file with the definitions is too large to include, but none of them has a __host__ __device__ it does not need, right?
Answer #1:
That declaration is illegal. Execution space specifiers (so __host__ and __device__) apply to declarations and definitions of functions and variables, not to types. The correct class declaration would simply be:
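The code snippets from this answer were stripped by the page; a hedged sketch of the pattern being described (the class name is borrowed from the mangled symbol quoted in the comments below, everything else is an assumption):

// Illegal: execution space specifiers cannot be attached to a type.
// __host__ __device__ class Circuit { ... };

// Legal: declare the class normally and qualify its member functions instead.
class Circuit {
public:
    __host__ __device__ int evaluate(int input) const { return input; }
};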
There may well be other problems in your code, but without the actual compilation error log, and with no enthusiasm for trawling through all the code dumped in the question, this is the source of the one compilation error you identified.
Comments:
1. That got rid of one warning. What is the error log? It says this is the output: "1> ptxas fatal : unresolved extern function '_Z32RandomBruteForceImproveFromArrayR7CircuitPcjj'" thanks
2. The error log or error list is the actual list of errors emitted by nvcc while compiling your code. All you showed in your question is Visual Studio telling you which nvcc command it executed and that it exited with an error. The error in your comment is completely unrelated to anything in your question; it is probably a project configuration issue.
3. To see the nvcc output you sometimes need to increase the Visual Studio verbosity; you can find instructions for that with Google. As for the unresolved extern function: RandomBruteForceImproveFromArray is called from your __global__ kernel in one file but is defined in another .cu file / compilation unit. In that case you must use CUDA separable compilation with relocatable device code generation (see the sketch after these comments). Again, Google will find the results for you. Relatedly, your main.cu file shows no evidence of including your .cuh file.
4. Oh, I'm sorry, it does include my .cuh file in my main; that was just a formatting mistake on my part. I set relocatable device code to true and no longer get that error, but now I get this one: "cudaDeviceSynchronize returned error code 700 after launching addKernel!"
5. You have now run into completely different problems. I'm sure it's tempting to think of this post as your personal help desk, but that is really not how this Q&A site works.
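As a hedged illustration of comment 3 above: relocatable device code is enabled with nvcc's -rdc=true flag (in Visual Studio, the "Generate Relocatable Device Code" project property), which lets device code in one .cu file call a __device__ function defined in another. A minimal command-line sketch (the second file name is an assumption):

nvcc -rdc=true -c cudaMain.cu -o cudaMain.obj
nvcc -rdc=true -c circuit.cu -o circuit.obj
nvcc -rdc=true cudaMain.obj circuit.obj -o app.exe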
Source
My goal is to write a simple ray tracer with the Phong shading model with CUDA in C++. It is supposed to calculate the appropriate colors and write them into a frame buffer on the GPU; afterwards I write the values in the frame buffer into a .ppm file on the CPU. The image size I have is 512x512, so for the thread layout in the kernel call I used the following arguments: dim3 thread_blocks(16, 16) and dim3 threads_per_block(32, 32).
This should in theory give me access to (16*16) * (32*32) threads, which is equal to the number of pixels in the image (512 * 512). But this gives me a CUDA error with error code 700 for cudaMemcpy on the line where I copy the data back from the device to the host. Using a smaller threads_per_block like dim3 threads_per_block(16, 16) works without an error but will of course only render a quarter of the image.
I have tried other thread layouts as well, and even the ones that were specifically explained for a 2D image yielded the same error, so that's where I need help.
The kernel call:
void run_kernel(const int size, Vec3f* fb, Sphere* spheres, Light* light, Vec3f* origin) {
// empty_kernel<<<dim3(16, 16, 1), dim3(32, 32, 1)>>>();
// cudaDeviceSynchronize();
Vec3f* fb_device = nullptr;
Sphere* spheres_dv = nullptr;
Light* light_dv = nullptr;
Vec3f* origin_dv = nullptr;
checkErrorsCuda(cudaMalloc((void**) &fb_device, sizeof(Vec3f) * size));
checkErrorsCuda(cudaMemcpy((void*) fb_device, fb, sizeof(Vec3f) * size, cudaMemcpyHostToDevice));
checkErrorsCuda(cudaMalloc((void**) &spheres_dv, sizeof(Sphere) * 3));
checkErrorsCuda(cudaMemcpy((void*) spheres_dv, spheres, sizeof(Sphere) * 3, cudaMemcpyHostToDevice));
checkErrorsCuda(cudaMalloc((void**) &light_dv, sizeof(Light) * 1));
checkErrorsCuda(cudaMemcpy((void*) light_dv, light, sizeof(Light) * 1, cudaMemcpyHostToDevice));
checkErrorsCuda(cudaMalloc((void**) &origin_dv, sizeof(Vec3f) * 1));
checkErrorsCuda(cudaMemcpy((void*) origin_dv, origin, sizeof(Vec3f) * 1, cudaMemcpyHostToDevice));
cudaEvent_t start, stop;
float time = 0;
cudaEventCreate(&start);
cudaEventCreate(&stop);
cudaEventRecord(start, 0);
cast_ray<<<dim3(16, 16), dim3(32, 32)>>>(fb_device, spheres_dv, light_dv, origin_dv);
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);
cudaEventElapsedTime(&time, start, stop);
printf("%f msn", time);
checkErrorsCuda(cudaMemcpy(fb, fb_device, sizeof(Vec3f) * size, cudaMemcpyDeviceToHost));
cudaEventDestroy(start);
cudaEventDestroy(stop);
checkErrorsCuda(cudaFree(fb_device));
checkErrorsCuda(cudaFree(spheres_dv));
checkErrorsCuda(cudaFree(light_dv));
checkErrorsCuda(cudaFree(origin_dv));
}
The cast_ray function:
__global__ void cast_ray(Vec3f* fb, Sphere* spheres, Light* light, Vec3f* origin) {
int i = (blockIdx.x * blockDim.x) + threadIdx.x;
int j = (blockIdx.y * blockDim.y) + threadIdx.y;
int tid = (j*WIDTH) + i;
if(i >= WIDTH || j >= HEIGHT) return;
Vec3f ij(2 * (float((i) + 0.5) / (WIDTH - 1)) - 1, 1 - 2 * (float((j) + 0.5) / (HEIGHT - 1)), -1);
Vec3f *dir = new Vec3f(ij - *origin); // note: device-side new draws from the device heap (8 MB by default) and is never freed here
Ray r(*origin, *dir);
float intersections[3];
int hp = -1;
for(int ii = 0; ii < 3; ii++) {
intersections[ii] = r.has_intersection(spheres[ii]);
}
int asize = sizeof(intersections) / sizeof(*intersections);
if(asize == 1) {
hp = intersections[0] < 0 ? -1 : 0;
} else {
if(asize != 0) {
float min_val = 100.0;
for (int ii = 0; ii < asize; ii++) {
if (intersections[ii] < 0.0) continue;
else if (intersections[ii] < min_val) {
min_val = intersections[ii];
hp = ii;
}
}
}
}
if(hp == -1) {
fb[tid] = Color(94, 156, 255);
} else {
auto color = get_color_at(r, intersections[hp], light, spheres[hp], spheres);
fb[tid] = color;
}
}
The error message: CUDA error at ./main.cu::195 with error code 700 for cudaMemcpy(fb, fb_device, sizeof(Vec3f) * size, cudaMemcpyDeviceToHost)(). (The corresponding line is the cudaMemcpy after the printf in the kernel call function.)
With cuda-memcheck I get the following information:
========= Error: process didn't terminate successfully
========= Out-of-range Shared or Local Address
========= at 0x00000100 in __cuda_syscall_mc_dyn_globallock_check
========= by thread (0,7,0) in block (2,5,0)
(This was tried on a RTX 2060 SUPER)
I am trying to implement some custom CUDA kernels to speed up integration of a system of ODEs using VexCL and odeint-v2. With some kernels this works very well, but with my largest system of equations I am encountering problems when odeint-v2 asks for the state vector to be resized following a step.
I have extracted the corresponding kernel in standalone form: https://gist.github.com/ds283/8016216. However, as far as I can determine, it’s not the kernel which causes the problem here – although it takes a long time to compile, it executes ok – but rather the state vector. This is a vex::multivector<double, 164>.
When https://gist.github.com/ds283/8016216 is compiled and run, I get
1. GeForce GTX 680MX
time t = 0
libc++abi.dylib: terminating with uncaught exception of type vex::backend::cuda::error: /usr/local/include/vexcl/backend/cuda/device_vector.hpp:100
CUDA Driver API Error (700 - CUDA_ERROR_LAUNCH_FAILED)
This happens both on a GeForce GTX 680MX on an iMac and a GeForce GTX650M on a MacBook Pro. Both are running OS X 10.9.1 and CUDA 5.5.28. On both cards, this kernel runs in blocks of 8 threads with 25792 bytes of shared memory per block; the maximum shared memory per block on these cards is 48kb. On the 680MX the grid size is 32 blocks, and on the 650M it is 8 blocks.
Running in the debugger shows that this exception is raised from the calling sequence
bool boost::numeric::odeint::controlled_runge_kutta<boost::numeric::odeint::runge_kutta_dopri5<vex::multivector<double, 164ul>, double, vex::multivector<double, 164ul>, double, boost::numeric::odeint::vector_space_algebra, boost::numeric::odeint::default_operations, boost::numeric::odeint::initially_resizer>, boost::numeric::odeint::default_error_checker<double, boost::numeric::odeint::vector_space_algebra, boost::numeric::odeint::default_operations>, boost::numeric::odeint::initially_resizer, boost::numeric::odeint::explicit_error_stepper_fsal_tag>::resize_m_xnew_impl<vex::multivector<double, 164ul> >(vex::multivector<double, 164ul> const&) at /opt/local/include/boost/numeric/odeint/stepper/controlled_runge_kutta.hpp:848
boost::numeric::odeint::adjust_size_by_resizeability<boost::numeric::odeint::state_wrapper<vex::multivector<double, 164ul>, void>, vex::multivector<double, 164ul> >(boost::numeric::odeint::state_wrapper<vex::multivector<double, 164ul>, void>&, vex::multivector<double, 164ul> const&, boost::integral_constant<bool, true>) at /opt/local/include/boost/numeric/odeint/util/resizer.hpp:35
void boost::numeric::odeint::resize<vex::multivector<double, 164ul>, vex::multivector<double, 164ul> >(vex::multivector<double, 164ul>&, vex::multivector<double, 164ul> const&) at /opt/local/include/boost/numeric/odeint/util/resize.hpp:53
boost::numeric::odeint::resize_impl<vex::multivector<double, 164ul>, vex::multivector<double, 164ul>, void>::resize(vex::multivector<double, 164ul>&, vex::multivector<double, 164ul> const&) at /opt/local/include/boost/numeric/odeint/external/vexcl/vexcl_resize.hpp:73
vex::multivector<double, 164ul>::resize(std::__1::vector<vex::backend::cuda::command_queue, std::__1::allocator<vex::backend::cuda::command_queue> > const&, unsigned long) at /usr/local/include/vexcl/multivector.hpp:287
vex::vector<double>::resize(std::__1::vector<vex::backend::cuda::command_queue, std::__1::allocator<vex::backend::cuda::command_queue> > const&, unsigned long, double const*, unsigned int) at /usr/local/include/vexcl/vector.hpp:490
vex::vector<double>::vector(std::__1::vector<vex::backend::cuda::command_queue, std::__1::allocator<vex::backend::cuda::command_queue> > const&, unsigned long, double const*, unsigned int) at /usr/local/include/vexcl/vector.hpp:384
vex::vector<double>::vector(std::__1::vector<vex::backend::cuda::command_queue, std::__1::allocator<vex::backend::cuda::command_queue> > const&, unsigned long, double const*, unsigned int) at /usr/local/include/vexcl/vector.hpp:383
vex::vector<double>::allocate_buffers(unsigned int, double const*) at /usr/local/include/vexcl/vector.hpp:802
vex::backend::cuda::device_vector<double>::device_vector<double>(vex::backend::cuda::command_queue const&, unsigned long, double const*, unsigned int) at /usr/local/include/vexcl/backend/cuda/device_vector.hpp:111
vex::backend::cuda::device_vector<double>::device_vector<double>(vex::backend::cuda::command_queue const&, unsigned long, double const*, unsigned int) at /usr/local/include/vexcl/backend/cuda/device_vector.hpp:100
vex::backend::cuda::check(cudaError_enum, char const*, int) at /usr/local/include/vexcl/backend/cuda/error.hpp:135
The failing CUDA API call is apparently the invocation of cuMemAlloc() in device_vector.hpp:

/// Allocates memory buffer on the device associated with the given queue.
template <typename H>
device_vector(const command_queue &q, size_t n,
        const H *host = 0, mem_flags flags = MEM_READ_WRITE)
    : n(n)
{
    (void)flags;

    if (n) {
        q.context().set_current();

        CUdeviceptr ptr;
        cuda_check( cuMemAlloc(&ptr, n * sizeof(T)) );
        buffer.reset(reinterpret_cast<char*>(static_cast<size_t>(ptr)),
                detail::deleter() );

        if (host) {
            if (std::is_same<T, H>::value)
                write(q, 0, n, reinterpret_cast<const T*>(host), true);
            else
                write(q, 0, n, std::vector<T>(host, host + n).data(), true);
        }
    }
}
I’m not quite clear what this means, because CUDA_ERROR_LAUNCH_FAILED doesn’t seem to be an error code which cuMemAlloc() should return.
I have no problems with a very analogous kernel which uses a vex::multivector<double, 20> as the state. Is there any reason to think that a large multivector should run into resizing problems of this type?
Usually this kind of error comes from the previous kernel launch. Can you check whether this is true (e.g. by inserting ctx.finish() before resizing)? If it is, does the kernel come from vexcl, or is it your own? Error 700 could mean, for example, that incorrect parameters are passed to a kernel.
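A hedged aside on why the error appears to come from cuMemAlloc(): kernel launches are asynchronous, and an illegal memory access is "sticky", so it is reported by whichever CUDA call happens to run next rather than by the launch itself. A minimal standalone demonstration (not the VexCL kernel; names are made up):

#include <stdio.h>

__global__ void bad_kernel(int *p) { p[1 << 20] = 42; }  // write far out of bounds

int main()
{
    int *d = 0;
    cudaMalloc(&d, sizeof(int));
    bad_kernel<<<1, 1>>>(d);                  // returns immediately; no error visible yet

    int *d2 = 0;
    cudaError_t err = cudaMalloc(&d2, 64);    // the sticky fault typically surfaces here
    printf("cudaMalloc after bad launch: %s\n", cudaGetErrorString(err));

    err = cudaDeviceSynchronize();            // synchronizing right after a launch pins
    printf("sync: %s\n", cudaGetErrorString(err));  // the failure to its real source
    return 0;
}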
I’ve just noticed you have a complete example in that gist. I’ll try to run it later today.
Btw, you should be able to use the CUDA runtime to launch your kernels. This could be more convenient in some cases (see the thrust-sort example). Although your kernel looks generated.
I had to do this for the kernel in rhs_functor::operator() to compile:
diff --git a/launch_failure_kernel.cpp b/launch_failure_kernel.cpp
index 00235e2..44e29e5 100644
--- a/launch_failure_kernel.cpp
+++ b/launch_failure_kernel.cpp
@@ -110,10 +110,12 @@ void rhs_functor::operator()(const state& x, state& dxdt, double t)
     for(unsigned int d = 0; d < this->ctx.size(); d++)
     {
         kernel.emplace_back(this->ctx.queue(d),
+#if defined(_MSC_VER) || defined(__APPLE__)
             "typedef unsigned char uchar;\n"
             "typedef unsigned int uint;\n"
             "typedef unsigned short ushort;\n"
             "typedef unsigned long long ulong;\n"
+#endif
             "extern \"C\" __global__ void threepffused( ulong n,\n"
             "    double Mp,\n"
             "    double M_phi, double M_chi,\n"
Then I inserted this->ctx.finish() here:
diff --git a/launch_failure_kernel.cpp b/launch_failure_kernel.cpp
index 44e29e5..3fb695e 100644
--- a/launch_failure_kernel.cpp
+++ b/launch_failure_kernel.cpp
@@ -2268,4 +2268,5 @@ void rhs_functor::operator()(const state& x, state& dxdt, double t)
kernel[d](this->ctx.queue(d));
}
+ this->ctx.finish();
}
and I got the 700 error at this point. So the culprit is the kernel. cuda-memcheck shows that there are invalid shared memory writes:
$ cuda-memcheck ./launch_failure_kernel
========= CUDA-MEMCHECK
1. Tesla K20c
time t = 0
========= Invalid __shared__ write of size 8
========= at 0x000025a8 in threepffused
========= by thread (4,0,0) in block (0,0,0)
========= Address 0x00013800 is out of bounds
========= Saved host backtrace up to driver entry point at kernel launch time
========= Host Frame:/usr/lib64/libcuda.so.1 (cuLaunchKernel + 0x331) [0x138071]
========= Host Frame:./launch_failure_kernel [0x10c81]
========= Host Frame:./launch_failure_kernel [0xc063]
========= Host Frame:./launch_failure_kernel [0x2cea7]
========= Host Frame:./launch_failure_kernel [0x27567]
========= Host Frame:./launch_failure_kernel [0x21477]
========= Host Frame:./launch_failure_kernel [0x1c0d8]
========= Host Frame:./launch_failure_kernel [0x15527]
========= Host Frame:./launch_failure_kernel [0x41fa]
========= Host Frame:/lib64/libc.so.6 (__libc_start_main + 0xf5) [0x24c05]
========= Host Frame:./launch_failure_kernel [0x3b69]
=========
========= Invalid __shared__ write of size 8
========= at 0x000025a8 in threepffused
========= by thread (3,0,0) in block (0,0,0)
========= Address 0x00013200 is out of bounds
========= Saved host backtrace up to driver entry point at kernel launch time
========= Host Frame:/usr/lib64/libcuda.so.1 (cuLaunchKernel + 0x331) [0x138071]
========= Host Frame:./launch_failure_kernel [0x10c81]
========= Host Frame:./launch_failure_kernel [0xc063]
========= Host Frame:./launch_failure_kernel [0x2cea7]
========= Host Frame:./launch_failure_kernel [0x27567]
========= Host Frame:./launch_failure_kernel [0x21477]
========= Host Frame:./launch_failure_kernel [0x1c0d8]
========= Host Frame:./launch_failure_kernel [0x15527]
========= Host Frame:./launch_failure_kernel [0x41fa]
========= Host Frame:/lib64/libc.so.6 (__libc_start_main + 0xf5) [0x24c05]
========= Host Frame:./launch_failure_kernel [0x3b69]
=========
========= Invalid __shared__ write of size 8
========= at 0x000025a8 in threepffused
========= by thread (2,0,0) in block (0,0,0)
========= Address 0x00012c00 is out of bounds
========= Saved host backtrace up to driver entry point at kernel launch time
========= Host Frame:/usr/lib64/libcuda.so.1 (cuLaunchKernel + 0x331) [0x138071]
========= Host Frame:./launch_failure_kernel [0x10c81]
========= Host Frame:./launch_failure_kernel [0xc063]
========= Host Frame:./launch_failure_kernel [0x2cea7]
========= Host Frame:./launch_failure_kernel [0x27567]
========= Host Frame:./launch_failure_kernel [0x21477]
========= Host Frame:./launch_failure_kernel [0x1c0d8]
========= Host Frame:./launch_failure_kernel [0x15527]
========= Host Frame:./launch_failure_kernel [0x41fa]
========= Host Frame:/lib64/libc.so.6 (__libc_start_main + 0xf5) [0x24c05]
========= Host Frame:./launch_failure_kernel [0x3b69]
=========
========= Invalid __shared__ write of size 8
========= at 0x000025a8 in threepffused
========= by thread (1,0,0) in block (0,0,0)
========= Address 0x00012600 is out of bounds
========= Saved host backtrace up to driver entry point at kernel launch time
========= Host Frame:/usr/lib64/libcuda.so.1 (cuLaunchKernel + 0x331) [0x138071]
========= Host Frame:./launch_failure_kernel [0x10c81]
========= Host Frame:./launch_failure_kernel [0xc063]
========= Host Frame:./launch_failure_kernel [0x2cea7]
========= Host Frame:./launch_failure_kernel [0x27567]
========= Host Frame:./launch_failure_kernel [0x21477]
========= Host Frame:./launch_failure_kernel [0x1c0d8]
========= Host Frame:./launch_failure_kernel [0x15527]
========= Host Frame:./launch_failure_kernel [0x41fa]
========= Host Frame:/lib64/libc.so.6 (__libc_start_main + 0xf5) [0x24c05]
========= Host Frame:./launch_failure_kernel [0x3b69]
=========
========= Invalid __shared__ write of size 8
========= at 0x000025a8 in threepffused
========= by thread (0,0,0) in block (0,0,0)
========= Address 0x00012000 is out of bounds
========= Saved host backtrace up to driver entry point at kernel launch time
========= Host Frame:/usr/lib64/libcuda.so.1 (cuLaunchKernel + 0x331) [0x138071]
========= Host Frame:./launch_failure_kernel [0x10c81]
========= Host Frame:./launch_failure_kernel [0xc063]
========= Host Frame:./launch_failure_kernel [0x2cea7]
========= Host Frame:./launch_failure_kernel [0x27567]
========= Host Frame:./launch_failure_kernel [0x21477]
========= Host Frame:./launch_failure_kernel [0x1c0d8]
========= Host Frame:./launch_failure_kernel [0x15527]
========= Host Frame:./launch_failure_kernel [0x41fa]
========= Host Frame:/lib64/libc.so.6 (__libc_start_main + 0xf5) [0x24c05]
========= Host Frame:./launch_failure_kernel [0x3b69]
=========
========= Program hit error 719 on CUDA API call to cuStreamSynchronize
========= Saved host backtrace up to driver entry point at error
========= Host Frame:/usr/lib64/libcuda.so.1 (cuStreamSynchronize + 0x17a) [0x13966a]
========= Host Frame:./launch_failure_kernel [0xded0]
========= Host Frame:./launch_failure_kernel [0x11214]
========= Host Frame:./launch_failure_kernel [0xc09c]
========= Host Frame:./launch_failure_kernel [0x2cea7]
========= Host Frame:./launch_failure_kernel [0x27567]
========= Host Frame:./launch_failure_kernel [0x21477]
========= Host Frame:./launch_failure_kernel [0x1c0d8]
========= Host Frame:./launch_failure_kernel [0x15527]
========= Host Frame:./launch_failure_kernel [0x41fa]
========= Host Frame:/lib64/libc.so.6 (__libc_start_main + 0xf5) [0x24c05]
========= Host Frame:./launch_failure_kernel [0x3b69]
=========
terminate called after throwing an instance of 'vex::backend::cuda::error'
what(): /home/demidov/work/vexcl/vexcl/backend/cuda/context.hpp:196
CUDA Driver API Error (Unknown error)
========= Error: process didn't terminate successfully
========= Internal error (20)
========= No CUDA-MEMCHECK results found
Another suggestion: I notice that you use multivectors with a lot of components, which leads to kernels with a lot of parameters. What do you think of replacing a vex::multivector<T,N> of size n with a vex::vector<T> of size N * n, where the individual components are placed in contiguous chunks one after another?
I have two alternatives for convenient access to the components of such a vector (you will need commit ac82646 for both to work). The first one uses slices:
vex::vector<double> x(ctx, 3 * n);
auto X = vex::tag<0>(x);

vex::slicer<2> slice( vex::extents[3][n] );

// Alias individual components for convenience.
auto x0 = slice[0](X);
auto x1 = slice[1](X);
auto x2 = slice[2](X);

// write individual components:
x0 = 1;
x1 = 2;
x2 = 3;

// Do the fused call:
vex::tie(x0, x1, x2) = std::tie(sin(x0), cos(x1), x1 - x0);
Another alternative uses permutations:
vex::vector<double> x(ctx, 3 * n);
auto X = vex::tag<0>(x);

// Give second parameter to element_index so it knows its size.
auto idx = vex::tag<1>( vex::element_index(0, n) );
auto N   = vex::tag<2>( n );

// Alias individual components for convenience.
auto x0 = vex::permutation(idx        )(X);
auto x1 = vex::permutation(idx + N    )(X);
auto x2 = vex::permutation(idx + N * 2)(X);

// write individual components:
x0 = 1;
x1 = 2;
x2 = 3;

// Do the fused call:
vex::tie(x0, x1, x2) = std::tie(sin(x0), cos(x1), x1 - x0);
Gist https://gist.github.com/ddemidov/8018055 shows both approaches in a complete example.
The variant with permutations is more effective, because it is less general, uses fewer arithmetic operations, and uses fewer kernel arguments. Compare the fused kernel for the sliced expressions:
extern "C" __global__ void vexcl_multivector_kernel
(
ulong n,
double * prm_tag_0_1,
ulong lhs_1_slice_start,
ulong lhs_1_slice_length0,
long lhs_1_slice_stride0,
ulong lhs_1_slice_length1,
long lhs_1_slice_stride1,
ulong rhs_1_slice_start,
ulong rhs_1_slice_length0,
long rhs_1_slice_stride0,
ulong rhs_1_slice_length1,
long rhs_1_slice_stride1,
ulong lhs_2_slice_start,
ulong lhs_2_slice_length0,
long lhs_2_slice_stride0,
ulong lhs_2_slice_length1,
long lhs_2_slice_stride1,
ulong rhs_2_slice_start,
ulong rhs_2_slice_length0,
long rhs_2_slice_stride0,
ulong rhs_2_slice_length1,
long rhs_2_slice_stride1,
ulong lhs_3_slice_start,
ulong lhs_3_slice_length0,
long lhs_3_slice_stride0,
ulong lhs_3_slice_length1,
long lhs_3_slice_stride1,
ulong rhs_3_slice_start,
ulong rhs_3_slice_length0,
long rhs_3_slice_stride0,
ulong rhs_3_slice_length1,
long rhs_3_slice_stride1,
ulong rhs_4_slice_start,
ulong rhs_4_slice_length0,
long rhs_4_slice_stride0,
ulong rhs_4_slice_length1,
long rhs_4_slice_stride1
)
{
for
(
size_t idx = blockDim.x * blockIdx.x + threadIdx.x, grid_size = blockDim.x * gridDim.x;
idx < n;
idx += grid_size
)
{
double buf_1 = sin( prm_tag_0_1[rhs_1_slice_func(rhs_1_slice_start, rhs_1_slice_length0, rhs_1_slice_stride0, rhs_1_slice_length1, rhs_1_slice_stride1, idx)] );
double buf_2 = cos( prm_tag_0_1[rhs_2_slice_func(rhs_2_slice_start, rhs_2_slice_length0, rhs_2_slice_stride0, rhs_2_slice_length1, rhs_2_slice_stride1, idx)] );
double buf_3 = ( prm_tag_0_1[rhs_3_slice_func(rhs_3_slice_start, rhs_3_slice_length0, rhs_3_slice_stride0, rhs_3_slice_length1, rhs_3_slice_stride1, idx)] - prm_tag_0_1[rhs_4_slice_func(rhs_4_slice_start, rhs_4_slice_length0, rhs_4_slice_stride0, rhs_4_slice_length1, rhs_4_slice_stride1, idx)] );
prm_tag_0_1[lhs_1_slice_func(lhs_1_slice_start, lhs_1_slice_length0, lhs_1_slice_stride0, lhs_1_slice_length1, lhs_1_slice_stride1, idx)] = buf_1;
prm_tag_0_1[lhs_2_slice_func(lhs_2_slice_start, lhs_2_slice_length0, lhs_2_slice_stride0, lhs_2_slice_length1, lhs_2_slice_stride1, idx)] = buf_2;
prm_tag_0_1[lhs_3_slice_func(lhs_3_slice_start, lhs_3_slice_length0, lhs_3_slice_stride0, lhs_3_slice_length1, lhs_3_slice_stride1, idx)] = buf_3;
}
}
with the same kernel for the permuted expressions:
extern "C" __global__ void vexcl_multivector_kernel
(
ulong n,
double * prm_tag_0_1,
ulong prm_tag_1_1,
ulong prm_tag_2_1,
int lhs_3_slice_3
)
{
for
(
size_t idx = blockDim.x * blockIdx.x + threadIdx.x, grid_size = blockDim.x * gridDim.x;
idx < n;
idx += grid_size
)
{
double buf_1 = sin( prm_tag_0_1[(prm_tag_1_1 + idx)] );
double buf_2 = cos( prm_tag_0_1[( (prm_tag_1_1 + idx) + prm_tag_2_1 )] );
double buf_3 = ( prm_tag_0_1[( (prm_tag_1_1 + idx) + prm_tag_2_1 )] - prm_tag_0_1[(prm_tag_1_1 + idx)] );
prm_tag_0_1[(prm_tag_1_1 + idx)] = buf_1;
prm_tag_0_1[( (prm_tag_1_1 + idx) + prm_tag_2_1 )] = buf_2;
prm_tag_0_1[( (prm_tag_1_1 + idx) + ( prm_tag_2_1 * lhs_3_slice_3 ) )] = buf_3;
}
}
Thanks for this. I was confused by the error being caught on return from cuMemAlloc(). It was an address calculation error in the kernel.
I agree it would probably be preferable to package the state as a single vex::vector rather than a high-dimensional vex::multivector – the number of kernel parameters already causes a problem with the OpenCL backend. I will look into the slicer and permutation options. (Unfortunately, for the actual calculations I think I will be stuck with writing custom kernels, because my system of ODEs is complex enough to cause enormous resource usage in the compiler.)
The same approach should work for the custom kernels.
By the way, I remember that the first CUDA versions had a limit of 256 bytes for the total size of kernel parameters. I am not sure whether this limit has been raised or removed these days, but in theory you could hit the same problems with the number of parameters in CUDA as well as in OpenCL.
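As a hedged aside on that limit: one common way to sidestep a parameter-size cap (256 bytes on the earliest hardware; later toolkits document 4 KB for __global__ parameters) is to pack the arguments into a struct, copy it to device memory once, and pass a single pointer. A sketch, with member names loosely echoing the threepffused kernel above (all of them assumptions):

struct KernelArgs {
    double *x0, *x1, *x2;        // component pointers
    double  Mp, M_phi, M_chi;    // scalar parameters
    size_t  n;                   // element count
};

__global__ void fused(const KernelArgs *a)
{
    size_t i = blockDim.x * (size_t)blockIdx.x + threadIdx.x;
    if (i < a->n)
        a->x0[i] = a->Mp * a->x1[i] + a->M_phi * a->x2[i];  // placeholder arithmetic
}

// Host side: one copy of the struct, then a single-pointer launch.
// KernelArgs h = { /* ... */ };
// KernelArgs *d;
// cudaMalloc(&d, sizeof h);
// cudaMemcpy(d, &h, sizeof h, cudaMemcpyHostToDevice);
// fused<<<grid, block>>>(d);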
#include "cuda_runtime.h" #include "device_launch_parameters.h" #include <stdio.h> cudaError_t addWithCuda(int *c, const int *a, const int *b, unsigned int size); __global__ void addKernel(int *c, const int *a, const int *b) { int i = threadIdx.x; c[i] = a[i] + b[i]; } int main() { const int arraySize = 5; const int a[arraySize] = { 1, 2, 3, 4, 5 }; const int b[arraySize] = { 10, 20, 30, 40, 50 }; int c[arraySize] = { 0 }; // Add vectors in parallel. cudaError_t cudaStatus = addWithCuda(c, a, b, arraySize); if (cudaStatus != cudaSuccess) { fprintf(stderr, "addWithCuda failed!"); return 1; } printf("{1,2,3,4,5} + {10,20,30,40,50} = {%d,%d,%d,%d,%d}n", c[0], c[1], c[2], c[3], c[4]); // cudaDeviceReset must be called before exiting in order for profiling and // tracing tools such as Nsight and Visual Profiler to show complete traces. cudaStatus = cudaDeviceReset(); if (cudaStatus != cudaSuccess) { fprintf(stderr, "cudaDeviceReset failed!"); return 1; } return 0; } // Helper function for using CUDA to add vectors in parallel. cudaError_t addWithCuda(int *c, const int *a, const int *b, unsigned int size) { int *dev_a = 0; int *dev_b = 0; int *dev_c = 0; cudaError_t cudaStatus; // Choose which GPU to run on, change this on a multi-GPU system. cudaStatus = cudaSetDevice(0); if (cudaStatus != cudaSuccess) { fprintf(stderr, "cudaSetDevice failed! Do you have a CUDA-capable GPU installed?"); goto Error; } // Allocate GPU buffers for three vectors (two input, one output) . cudaStatus = cudaMalloc((void**)&dev_c, size * sizeof(int)); if (cudaStatus != cudaSuccess) { fprintf(stderr, "cudaMalloc failed!"); goto Error; } cudaStatus = cudaMalloc((void**)&dev_a, size * sizeof(int)); if (cudaStatus != cudaSuccess) { fprintf(stderr, "cudaMalloc failed!"); goto Error; } cudaStatus = cudaMalloc((void**)&dev_b, size * sizeof(int)); if (cudaStatus != cudaSuccess) { fprintf(stderr, "cudaMalloc failed!"); goto Error; } // Copy input vectors from host memory to GPU buffers. cudaStatus = cudaMemcpy(dev_a, a, size * sizeof(int), cudaMemcpyHostToDevice); if (cudaStatus != cudaSuccess) { fprintf(stderr, "cudaMemcpy failed!"); goto Error; } cudaStatus = cudaMemcpy(dev_b, b, size * sizeof(int), cudaMemcpyHostToDevice); if (cudaStatus != cudaSuccess) { fprintf(stderr, "cudaMemcpy failed!"); goto Error; } // Launch a kernel on the GPU with one thread for each element. addKernel<<<1, size>>>(dev_c, dev_a, dev_b); // Check for any errors launching the kernel cudaStatus = cudaGetLastError(); if (cudaStatus != cudaSuccess) { fprintf(stderr, "addKernel launch failed: %sn", cudaGetErrorString(cudaStatus)); goto Error; } // cudaDeviceSynchronize waits for the kernel to finish, and returns // any errors encountered during the launch. cudaStatus = cudaDeviceSynchronize(); if (cudaStatus != cudaSuccess) { fprintf(stderr, "cudaDeviceSynchronize returned error code %d after launching addKernel!n", cudaStatus); goto Error; } // Copy output vector from GPU buffer to host memory. cudaStatus = cudaMemcpy(c, dev_c, size * sizeof(int), cudaMemcpyDeviceToHost); if (cudaStatus != cudaSuccess) { fprintf(stderr, "cudaMemcpy failed!"); goto Error; } Error: cudaFree(dev_c); cudaFree(dev_a); cudaFree(dev_b); return cudaStatus; } |
help: CUDA error 700
Hi,
For the past week, out of the blue, the Octane live viewer has been giving me errors a few minutes after it loads.
I looked for answers on the forum, and following what I read I tried updating the BIOS, fresh-installing and downgrading my card driver, and using an old Octane version, but I still get the error after a few minutes.
I’m now running C4D R19.053 and Octane 3.07-R2, NVIDIA driver 398.36 (I have 2 x 1080 Ti).
I really don’t know what to do. Any help?
Thanks in advance
Loaded Octane DLL:OctaneRender 3.08.2 version:3080200
Octane Render for Cinema 4D 3.08.2-R2.1 built for:190 C4D vers:19053
Octane Render SDK version: 3.08.2
=============================================================
————————— EXPORT LOG —————————
Export materials time= 178553.472 ms
Collect objects time= 32.774 ms
Mesh creation time = 2126.038 ms.
OCT:Tried to access pin via an invalid ID P_TRANSFORM (243)
OCT:Tried to access null node pin
OCT:Tried to access pin via an invalid ID P_TRANSFORM (243)
OCT:Tried to access null node pin
Total export Time = 180782.363 ms
VRAM used/free/max:7.855Gb/1.036Gb/11Gb Out-of-core used:3.902Gb total used RAM:39.811Gb
OCT:CUDA error 700 on device 1: an illegal memory access was encountered
OCT: -> kernel execution failed(kernel25)
OCT:CUDA error 700 on device 1: an illegal memory access was encountered
OCT: -> failed to launch kernel(kernel34)
OCT:device 1: path tracing kernel failed
—————————————————————————————-
<<< Render failure detected!!! >>>
Please check render statistics to solve the problem.
MB:0/0 ST/MOV:0/19 Nodes:391 Tris:3.5m DispTris:137k Hairs:0 Meshes:22
Device:0 TotMem:11Gb rtData:604Mb film:110Mb geo:587Mb node:18Kb tex:6.583Gb unavailable:2.324Gb temperature:64
Device:1 TotMem:11Gb rtData:604Mb film:110Mb geo:587Mb node:18Kb tex:6.583Gb unavailable:2.067Gb temperature:58
OCT:Failed to free page-locked memory
OCT:Failed to free page-locked memory
OCT:Failed to free page-locked memory
OCT:Failed to free page-locked memory
OCT:Failed to free page-locked memory
OCT:Failed to free page-locked memory
OCT:Failed to free page-locked memory
OCT:Failed to free page-locked memory
OCT:Failed to free page-locked memory
OCT:Failed to free page-locked memory
OCT:Failed to free page-locked memory
OCT:Failed to free page-locked memory
OCT:Failed to free page-locked memory
OCT:Failed to free page-locked memory
OCT:CUDA error 700 on device 0: an illegal memory access was encountered
OCT: -> failed to copy memory to device.
OCT:device 0: failed to upload data texture 23 of context 1
OCT:CUDA error 700 on device 0: an illegal memory access was encountered
OCT: -> failed to deallocate device memory
OCT:CUDA error 700 on device 0: an illegal memory access was encountered
OCT: -> could not get memory info
OCT:CUDA error 700 on device 0: an illegal memory access was encountered
OCT: -> failed to load symbol data to the device(kernel data)
OCT:device 0: preview failed
OCT:CUDA error 700 on device 0: an illegal memory access was encountered
OCT: -> failed to set texture format(__data24__)
OCT:CUDA error 700 on device 0: an illegal memory access was encountered
OCT: -> failed to set texture format(__data26__)
OCT:CUDA error 700 on device 0: an illegal memory access was encountered
OCT: -> failed to deallocate device memory
OCT:CUDA error 700 on device 0: an illegal memory access was encountered
OCT: -> could not get memory info
OCT:CUDA error 700 on device 0: an illegal memory access was encountered
OCT: -> failed to load symbol data to the device(const11)
OCT:device 0: failed to upload imager params
Re: help: CUDA error 700
by aoktar » Thu Aug 02, 2018 10:14 am
VRAM used/free/max:7.855Gb/1.036Gb/11Gb Out-of-core used:3.902Gb total used RAM:39.811Gb
Here I see that you have low free VRAM. Do you have other software open, such as AE, PS, Chrome, etc., at the same time as the render session?
Re: help: CUDA error 700
by cristiangiro » Thu Aug 02, 2018 10:36 am
Yes, just Chrome actually; I close everything else.
Re: help: CUDA error 700
by aoktar » Thu Aug 02, 2018 2:16 pm
Watch your free VRAM while rendering; sometimes it peaks while transferring buffers and the like, and a VRAM error shows up in the log.
It’s also clear that you have a very high export time: 180782.363 ms = 180.78 seconds. That is very high, and your material export is the most expensive part at 178.5 seconds. What do you have that is so heavy?
Re: help: CUDA error 700
by cristiangiro » Thu Aug 02, 2018 2:19 pm
I always have very dense geometry; since I only render stills, I assumed having a lot of triangles would not be a problem.
How should I optimise the scene to get a shorter export time and fewer triangles?
Cheers
Re: help: CUDA error 700
by cristiangiro » Thu Aug 02, 2018 2:20 pm
Also, using 8K textures (which I am) may contribute to the long export, right?
Re: help: CUDA error 700
by aoktar » Thu Aug 02, 2018 2:22 pm
cristiangiro wrote: I always have very dense geometry; since I only render stills, I assumed having a lot of triangles would not be a problem. How should I optimise the scene to get a shorter export time and fewer triangles? Cheers
It may be a problem too. In V3, geometry should fit into VRAM. Be aware of triangle count differences in the Live Viewer versus the Picture Viewer; check how many triangles each reports. If you can’t render at all, do this check after deleting all materials.
Re: help: CUDA error 700
by aoktar » Thu Aug 02, 2018 2:23 pm
cristiangiro wrote: Also, using 8K textures (which I am) may contribute to the long export, right?
I suppose you’re using some C4D shaders like Layer/Noise/etc… right?
Re: help: CUDA error 700
by cristiangiro » Thu Aug 02, 2018 2:35 pm
Not at all; I’m using exported Substance Painter textures wired into the basic Octane channels. I never use Layer shaders or even C4D noises on Octane materials.
What do you mean by:
aoktar wrote: Do this by deleting all materials if you can’t render.
Thanks!!
Re: help: CUDA error 700
by Kalua » Tue Aug 21, 2018 8:52 pm
I have a similar problem. The difference is that I’m not loading a single mesh or material in the scene… Any thoughts?
It started yesterday; since then I can’t render anything, standalone or plugin…
I’ve already tried reinstalling drivers…
Started logging on 21.08.18 14:43:38
OctaneRender 3.08.3 (3080300)
CUDA error 700 on device 2: an illegal memory access was encountered
-> kernel execution failed(kernel25)
CUDA error 700 on device 2: an illegal memory access was encountered
-> failed to launch kernel(kernel26)
device 2: direct light kernel failed
CUDA error 700 on device 0: an illegal memory access was encountered
CUDA error 700 on device 1: an illegal memory access was encountered
-> failed to deallocate pinned memory
-> could not get memory info
CUDA error 700 on device 0: an illegal memory access was encountered
CUDA error 700 on device 1: an illegal memory access was encountered
-> could not get memory info
-> failed to load symbol data to the device(deep_data)
CUDA error 700 on device 0: an illegal memory access was encountered
device 1: failed to upload the deep params
-> failed to allocate pinned memory
device 0: failed to allocate tonemap result buffer
CUDA error 700 on device 2: an illegal memory access was encountered
-> failed to load symbol data to the device(kernel data)
device 2: picking failed
CUDA error 700 on device 2: an illegal memory access was encountered
-> failed to load symbol data to the device(kernel data)
device 2: picking failed
CUDA error 700 on device 2: an illegal memory access was encountered
-> failed to load symbol data to the device(kernel data)
device 2: picking failed
CUDA error 700 on device 2: an illegal memory access was encountered
-> failed to load symbol data to the device(kernel data)
device 2: picking failed
CUDA error 700 on device 2: an illegal memory access was encountered
-> failed to load symbol data to the device(kernel data)
device 2: picking failed
CUDA error 700 on device 2: an illegal memory access was encountered
-> failed to load symbol data to the device(kernel data)
device 2: picking failed
CUDA error 700 on device 2: an illegal memory access was encountered
-> failed to load symbol data to the device(kernel data)
device 2: picking failed
EDIT:
After running a lot of hardware tests, I found out it’s a GPU hardware issue: one of my 1080s is faulty and doesn’t render anymore, so I had to disconnect it.
Rendering just fine now.