CUDA error: invalid configuration argument

Here is my code:

int threadNum = BLOCKDIM/8;
dim3 dimBlock(threadNum,threadNum);
int blocks1 = nWidth/threadNum + (nWidth%threadNum == 0 ? 0 : 1);
int blocks2 = nHeight/threadNum + (nHeight%threadNum == 0 ? 0 : 1);
dim3 dimGrid;
dimGrid.x = blocks1;
dimGrid.y = blocks2;

//  dim3 numThreads2(BLOCKDIM);
//  dim3 numBlocks2(numPixels/BLOCKDIM + (numPixels%BLOCKDIM == 0 ? 0 : 1) );
perform_scaling<<<dimGrid,dimBlock>>>(imageDevice,imageDevice_new,min,max,nWidth, nHeight);
cudaError_t err = cudaGetLastError();
cudasafe(err,"Kernel2");

This is the launch of my second kernel, and it is fully independent in terms of data usage. BLOCKDIM is 512, nWidth and nHeight are 512 too, and cudasafe simply prints the string message corresponding to the error code. This section of the code gives a configuration error immediately after the kernel call.

What might cause this error? Any ideas?

asked Apr 20, 2013 at 21:31 by erogol

This type of error message frequently refers to the launch configuration parameters (grid/threadblock dimensions in this case, could also be shared memory, etc. in other cases). When you see a message like this it’s a good idea just to print out your actual config parameters before launching the kernel, to see if you’ve made any mistakes.
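For example, a two-line sanity print just before the launch (a sketch; any logging approach works, and the names are taken from the question) makes this kind of mistake obvious:

printf("grid: (%u, %u, %u)  block: (%u, %u, %u)\n",
       dimGrid.x, dimGrid.y, dimGrid.z, dimBlock.x, dimBlock.y, dimBlock.z);
perform_scaling<<<dimGrid,dimBlock>>>(imageDevice, imageDevice_new, min, max, nWidth, nHeight);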

You said BLOCKDIM = 512. You have threadNum = BLOCKDIM/8 so threadNum = 64. Your threadblock configuration is:

dim3 dimBlock(threadNum,threadNum);

So you are asking to launch blocks of 64 x 64 threads, that is 4096 threads per block. That won’t work on any generation of CUDA devices. All current CUDA devices are limited to a maximum of 1024 threads per block, which is the product of the 3 block dimensions.

Maximum dimensions are listed in table 14 of the CUDA programming guide, and also available via the deviceQuery CUDA sample code.
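As an illustration, a corrected configuration could look like the sketch below, assuming the kernel maps one thread per pixel; the value 16 is only an example that keeps the block at 256 threads, comfortably under the 1024-thread limit:

int threadNum = 16;                                  // 16 x 16 = 256 threads per block
dim3 dimBlock(threadNum, threadNum);
dim3 dimGrid((nWidth  + threadNum - 1) / threadNum,  // round up so the grid covers the whole image
             (nHeight + threadNum - 1) / threadNum);
perform_scaling<<<dimGrid, dimBlock>>>(imageDevice, imageDevice_new, min, max, nWidth, nHeight);
cudasafe(cudaGetLastError(), "Kernel2");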

answered Apr 20, 2013 at 21:44 by Robert Crovella

Just to add to the previous answers, you can also query the maximum thread limits in your code, so it can run on other devices without hard-coding the number of threads you will use:

int device = 0;
cudaGetDevice(&device);   // index of the currently active GPU
struct cudaDeviceProp properties;
cudaGetDeviceProperties(&properties, device);
cout<<"using "<<properties.multiProcessorCount<<" multiprocessors"<<endl;
cout<<"max threads per processor: "<<properties.maxThreadsPerMultiProcessor<<endl;

answered Oct 12, 2015 at 11:11 by Niko

🐛 Bug

gpu_tensor_cpp.tar.gz

When calling some functions such as torch::mean() on this GPU tensor, a CUDA runtime error occurs:

terminate called after throwing an instance of 'c10::Error'
  what():  CUDA error: invalid configuration argument
Exception raised from launch_reduce_kernel at /pytorch/aten/src/ATen/native/cuda/Reduce.cuh:828 (most recent call first):

Here is the complete output of gdb backtrace (running with CUDA_LAUNCH_BLOCKING=1):

Thread 1 "train" received signal SIGABRT, Aborted.
0x00007fff600fb70f in raise () from /lib64/libc.so.6
Missing separate debuginfos, use: yum debuginfo-install libgcc-8.3.1-5.el8.0.2.x86_64 libgomp-8.3.1-5.el8.0.2.x86_64 libibverbs-26.0-8.el8.x86_64 libnl3-3.5.0-1.el8.x86_64 libstdc++-8.3.1-5.el8.0.2.x86_64 sssd-client-2.2.3-20.el8.x86_64 zlib-1.2.11-16.el8_2.x86_64
(gdb) bt
#0  0x00007fff600fb70f in raise () from /lib64/libc.so.6
#1  0x00007fff600e5b25 in abort () from /lib64/libc.so.6
#2  0x00007fff60ce806b in __gnu_cxx::__verbose_terminate_handler() [clone .cold.1] () from /lib64/libstdc++.so.6
#3  0x00007fff60cee50c in __cxxabiv1::__terminate(void (*)()) ()
   from /lib64/libstdc++.so.6
#4  0x00007fff60cee567 in std::terminate() () from /lib64/libstdc++.so.6
#5  0x00007fff60cee7c8 in __cxa_throw () from /lib64/libstdc++.so.6
#6  0x00007fff83782b3c in void at::native::gpu_reduce_kernel<double, double, 4, at::native::MeanOps<double, float>, double>(at::TensorIterator&, at::native::MeanOps<double, float> const&, double, at::native::AccumulationBuffer*, long) ()
   from /home/admin/fanyi/Softwares/python-packages/torch/lib/libtorch_cuda.so
#7  0x00007fff83773bf2 in at::native::mean_kernel_cuda(at::TensorIterator&) ()
   from /home/admin/fanyi/Softwares/python-packages/torch/lib/libtorch_cuda.so
#8  0x00007fffe48299a6 in void at::native::DispatchStub<void (*)(at::TensorIterator&), at::native::mean_stub>::operator()<at::TensorIterator&>(c10::DeviceType, at::TensorIterator&) ()
   from /home/admin/fanyi/Softwares/python-packages/torch/lib/libtorch_cpu.so
#9  0x00007fffe48179f2 in at::native::mean_out_cpu_gpu(at::Tensor&, at::Tensor const&, c10::ArrayRef<long>, bool, c10::optional<c10::ScalarType>) ()
   from /home/admin/fanyi/Softwares/python-packages/torch/lib/libtorch_cpu.so
#10 0x00007fffe4817f3b in at::native::mean_cpu_gpu(at::Tensor const&, c10::ArrayRef<long>, bool, c10::optional<c10::ScalarType>) ()
--Type <RET> for more, q to quit, c to continue without paging--c
   from /home/admin/fanyi/Softwares/python-packages/torch/lib/libtorch_cpu.so
#11 0x00007fffe4818008 in at::native::mean_cpu_gpu(at::Tensor const&, c10::optional<c10::ScalarType>) () from /home/admin/fanyi/Softwares/python-packages/torch/lib/libtorch_cpu.so
#12 0x00007fff8216b373 in at::CUDAType::mean(at::Tensor const&, c10::optional<c10::ScalarType>) () from /home/admin/fanyi/Softwares/python-packages/torch/lib/libtorch_cuda.so
#13 0x00007fff821a8dde in c10::impl::wrap_kernel_functor_unboxed_<c10::impl::detail::WrapFunctionIntoRuntimeFunctor_<at::Tensor (*)(at::Tensor const&, c10::optional<c10::ScalarType>), at::Tensor, c10::guts::typelist::typelist<at::Tensor const&, c10::optional<c10::ScalarType> > >, at::Tensor (at::Tensor const&, c10::optional<c10::ScalarType>)>::call(c10::OperatorKernel*, at::Tensor const&, c10::optional<c10::ScalarType>) () from /home/admin/fanyi/Softwares/python-packages/torch/lib/libtorch_cuda.so
#14 0x00007fffe4d5fa91 in at::Tensor c10::Dispatcher::callWithDispatchKey<at::Tensor, at::Tensor const&, c10::optional<c10::ScalarType> >(c10::TypedOperatorHandle<at::Tensor (at::Tensor const&, c10::optional<c10::ScalarType>)> const&, c10::DispatchKey, at::Tensor const&, c10::optional<c10::ScalarType>) const () from /home/admin/fanyi/Softwares/python-packages/torch/lib/libtorch_cpu.so
#15 0x00007fffe4c79d8a in at::mean(at::Tensor const&, c10::optional<c10::ScalarType>) () from /home/admin/fanyi/Softwares/python-packages/torch/lib/libtorch_cpu.so
#16 0x00007fffe622470e in torch::autograd::VariableType::(anonymous namespace)::mean(at::Tensor const&, c10::optional<c10::ScalarType>) () from /home/admin/fanyi/Softwares/python-packages/torch/lib/libtorch_cpu.so
#17 0x00007fffe44994ce in c10::impl::wrap_kernel_functor_unboxed_<c10::impl::detail::WrapFunctionIntoRuntimeFunctor_<at::Tensor (*)(at::Tensor const&, c10::optional<c10::ScalarType>), at::Tensor, c10::guts::typelist::typelist<at::Tensor const&, c10::optional<c10::ScalarType> > >, at::Tensor (at::Tensor const&, c10::optional<c10::ScalarType>)>::call(c10::OperatorKernel*, at::Tensor const&, c10::optional<c10::ScalarType>) () from /home/admin/fanyi/Softwares/python-packages/torch/lib/libtorch_cpu.so
#18 0x00007fffe4d5fa91 in at::Tensor c10::Dispatcher::callWithDispatchKey<at::Tensor, at::Tensor const&, c10::optional<c10::ScalarType> >(c10::TypedOperatorHandle<at::Tensor (at::Tensor const&, c10::optional<c10::ScalarType>)> const&, c10::DispatchKey, at::Tensor const&, c10::optional<c10::ScalarType>) const () from /home/admin/fanyi/Softwares/python-packages/torch/lib/libtorch_cpu.so
#19 0x00007fffe4c79d8a in at::mean(at::Tensor const&, c10::optional<c10::ScalarType>) () from /home/admin/fanyi/Softwares/python-packages/torch/lib/libtorch_cpu.so
#20 0x000000000047bacd in one_batch_netImpl::calc_std_avg_only (this=0x883e900, coord_energy_force_batch=std::vector of length 5, capacity 5 = {...}, nei_info_batch=std::vector of length 4, capacity 4 = {...}, descrpt_and_deriv_batch=std::vector of length 4, capacity 4 = {...}, parameters_info=0x22c4ee0, DEVICE=...) at /home/admin/fanyi/Softwares/NN_train/src_O0/struct_DP.h:232
#21 0x00000000004770c6 in train (parameters_info=0x22c4ee0, frame_info=0x7fff40f29010, training_dataset=0x7fffffffd520, model=0x7fffffffd4b0, optimizer=0x7fffffffd460) at /home/admin/fanyi/Softwares/NN_train/src_O0/train_DP.cpp:186
#22 0x00000000004206a0 in train_NN_DP (param_filename=0x7fffffffdbd5 "PARAMS.json") at /home/admin/fanyi/Softwares/NN_train/src_O0/train_NN.cpp:254
#23 0x00000000004200ef in train_NN (param_filename=0x7fffffffdbd5 "PARAMS.json") at /home/admin/fanyi/Softwares/NN_train/src_O0/train_NN.cpp:51
#24 0x000000000040fd09 in main (argc=2, argv=0x7fffffffd878) at /home/admin/fanyi/Softwares/NN_train/src_O0/main.cpp:83

The code at /home/admin/fanyi/Softwares/NN_train/src_O0/struct_DP.h:232 is:

torch::Tensor xyz_hat_avg = torch::mean(xyz_hat);

This first happened in my C++ code using PyTorch's C++ APIs. I saved this tensor xyz_hat using torch::save() (the attachment gpu_tensor_cpp.tar.gz), then loaded it in Python using torch.jit.load. The same error occurred when calling torch.mean():

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
RuntimeError: CUDA error: invalid configuration argument

To Reproduce

Steps to reproduce the behavior:

# gpu_tensor_cpp.tar.gz is the attachment
$ tar -zxvf gpu_tensor_cpp.tar.gz
$ python3
>>> import torch
>>> a = list(torch.jit.load("gpu_tensor_cpp").parameters())[0]
>>> a.device
device(type='cuda', index=0)
>>> a.dtype
torch.float64
>>> torch.mean(a)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
RuntimeError: CUDA error: invalid configuration argument
>>> torch.sum(a)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
RuntimeError: CUDA error: invalid configuration argument
>>> a2 = a.clone()
>>> torch.sum(a2)
tensor(2855.0410, device='cuda:0', dtype=torch.float64)

Expected behavior

Operations on tensor a like torch.sum(a) should return the same result as tensor a2, where a2 is just a clone of a, as described above.

Environment

Please copy and paste the output from our environment collection script (or fill out the checklist below manually).

PyTorch version: 1.7.0+cu110
Is debug build: True
CUDA used to build PyTorch: 11.0
ROCM used to build PyTorch: N/A

OS: CentOS Linux 8 (Core) (x86_64)
GCC version: (GCC) 8.3.1 20191121 (Red Hat 8.3.1-5)
Clang version: Could not collect
CMake version: version 3.11.4

Python version: 3.6 (64-bit runtime)
Is CUDA available: True
CUDA runtime version: Could not collect
GPU models and configuration:
GPU 0: A100-SXM4-40GB
GPU 1: A100-SXM4-40GB
GPU 2: A100-SXM4-40GB
GPU 3: A100-SXM4-40GB
GPU 4: A100-SXM4-40GB
GPU 5: A100-SXM4-40GB
GPU 6: A100-SXM4-40GB
GPU 7: A100-SXM4-40GB

Nvidia driver version: 450.80.02
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A

Versions of relevant libraries:
[pip3] numpy==1.19.4
[pip3] torch==1.7.0+cu110
[conda] Could not collect

Additional context

This error can be overcome by calling .to("cpu") first or simply making a clone() of the problematic tensor (a minimal sketch of both workarounds is shown below). But I still would like to understand what actually triggered it. Maybe the data stored in the problematic tensor is broken?
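For reference, a minimal sketch of those two workarounds on the xyz_hat tensor from the snippet above (illustrative only, not a fix for the underlying issue):

// Workaround 1: move the tensor to the host before reducing.
torch::Tensor avg_on_cpu = torch::mean(xyz_hat.to(torch::kCPU));

// Workaround 2: materialize a fresh copy first, then reduce on the GPU.
torch::Tensor avg_from_clone = torch::mean(xyz_hat.clone());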

I tried running the same C++ code on another system with PyTorch 1.4.0 + CUDA 10.1 installed using pip, and everything works fine there. Here is the environment for that system:
PyTorch version: 1.4.0
Is debug build: False
CUDA used to build PyTorch: 10.1
ROCM used to build PyTorch: N/A

OS: CentOS Linux release 7.7.1908 (Core) (x86_64)
GCC version: (GCC) 7.5.0
Clang version: Could not collect
CMake version: version 2.8.12.2

Python version: 3.7 (64-bit runtime)
Is CUDA available: True
CUDA runtime version: 10.2.89
GPU models and configuration:
GPU 0: Tesla V100-PCIE-32GB
GPU 1: Tesla V100-PCIE-32GB

Nvidia driver version: 450.51.06
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A

Versions of relevant libraries:
[pip3] numpy==1.19.4
[pip3] numpydoc==0.9.1
[pip3] torch==1.4.0
[conda] _pytorch_select 0.2 gpu_0 https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
[conda] blas 1.0 mkl
[conda] cudatoolkit 10.0.130 0 https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
[conda] mkl 2019.4 243
[conda] mkl-service 2.3.0 py37he904b0f_0
[conda] mkl_fft 1.0.14 py37ha843d7b_0
[conda] mkl_random 1.1.0 py37hd6b4f25_0
[conda] numpy 1.17.2 py37haad9e8e_0
[conda] numpy-base 1.17.2 py37hde5b4d6_0
[conda] numpydoc 0.9.1 py_0
[conda] pytorch 1.3.1 cuda100py37h53c1284_0 https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main

cc @ezyang @gchanan @zou3519 @bdhirsh @jbschlosser @anjali411 @ngimel @heitorschueroff

Handling CUDA error messages

The following piece of code, the cudasafe routine, can be used to handle CUDA error messages.

#include <cstdio>
#include <cstdlib>
#include <cmath>
#include <cuda_runtime.h>

// Print the failing call and the numeric error code, then abort.
void cudasafe(cudaError_t error, const char* message)
{
   if (error != cudaSuccess) { fprintf(stderr, "ERROR: %s : %i\n", message, (int)error); exit(-1); }
}

int main() {
   float *a_d;                 // pointer to device memory; a.k.a. the GPU
   int block_size, n = 10;

   // allocate arrays on device
   cudasafe( cudaMalloc((void **)&a_d, n*n*sizeof(float)), "cudaMalloc" );

   block_size = 22;
   dim3 dimBlock( block_size, block_size );
   dim3 dimGrid( (int)ceil(float(n)/float(dimBlock.x)), (int)ceil(float(n)/float(dimBlock.y)) );

   // ... kernel launches, checked with cudasafe(cudaGetLastError(), ...), would go here ...

   cudasafe( cudaFree(a_d), "cudaFree" );

   return 0;
}

See Example 1 for another way to check for error messages with CUDA.

*  "error: a host function call can not be configured" simply means that you tried to call a routine as if it were a kernel to be executed on the device, but you forgot to put __global__ in front of that routine.

*  "Invalid configuration argument" means that the dimension of either the specified grid of blocks (dimGrid) or the number of threads in a block (dimBlock) is incorrect. In such a case, a dimension is either zero or larger than the device allows. This error most often appears when the dimensions are determined dynamically at run time; a small run-time check for this is sketched after this list.

*  "Too many resources requested for launch" means that the number of registers available on the multiprocessor is being exceeded. Reduce the number of threads per block to solve the problem.

*  "Unspecified launch failure" means that CUDA does not know what the problem was. This is the worst error to get, because you do not know where to look to correct it. One way to read this message is to mentally translate it to "segmentation fault" for the host code.
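As a sketch of that run-time check (the helper name check_launch_config and its exact shape are illustrative, not part of the original tutorial), the limits can be read from the device and compared against a proposed configuration before launching:

// Returns true if the proposed grid/block configuration fits the limits
// reported by the given device; false otherwise.
bool check_launch_config(dim3 grid, dim3 block, int device)
{
   cudaDeviceProp prop;
   cudaGetDeviceProperties(&prop, device);

   // zero-sized dimensions are invalid
   if (block.x == 0 || block.y == 0 || block.z == 0 ||
       grid.x  == 0 || grid.y  == 0 || grid.z  == 0) return false;
   // total threads per block and each block dimension must respect the device limits
   if (block.x * block.y * block.z > (unsigned)prop.maxThreadsPerBlock) return false;
   if (block.x > (unsigned)prop.maxThreadsDim[0] ||
       block.y > (unsigned)prop.maxThreadsDim[1] ||
       block.z > (unsigned)prop.maxThreadsDim[2]) return false;
   // grid dimensions must respect the device limits as well
   if (grid.x > (unsigned)prop.maxGridSize[0] ||
       grid.y > (unsigned)prop.maxGridSize[1] ||
       grid.z > (unsigned)prop.maxGridSize[2]) return false;

   return true;
}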

Suppose you used the following piece of code in your program to check for error messages.

void Check_CUDA_Error(const char *message)
{
   cudaError_t error = cudaGetLastError();
   if (error != cudaSuccess) {
      fprintf(stderr, "ERROR: %s: %s\n", message, cudaGetErrorString(error));
      exit(-1);
   }
}

int main(int argc, char** argv)
{
   // ... (allocation and setup omitted) ...
   block_size = 23;
   dim3 dimBlock(block_size, block_size);
   dim3 dimGrid( (int)ceil(float(N)/float(dimBlock.x)), (int)ceil(float(N)/float(dimBlock.y)) );

   assign_d<<<dimGrid, dimBlock>>>(a_d, N);
   Check_CUDA_Error("Kernel Execution Failed!");
   // ... (remainder omitted) ...
   return 0;
}

This piece of code would fail without a warning as to the cause. Remember, once you launch the kernel, it operates asynchronously with respect to the CPU. The kernel would fail, and not tell you, while the CPU continued to compute whatever was left in the program. By checking the error message, you could see that the kernel failed with Invalid Configuration Argument. In this case, we know the number of threads in the block is not zero. However, there are 529 threads in the block (23 x 23), which exceeds the capability of the GPU, shown in the Getting information about the GPU tutorial to be 512. By reducing the number of threads to 22 per side of the block (484 threads total in the block), the code will run correctly.
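One way to avoid hard-coding 22 or 23 at all is to derive the largest legal square block from the device limit. This is only a sketch, assuming a square block is wanted and device 0 is the GPU in use:

cudaDeviceProp prop;
cudaGetDeviceProperties(&prop, 0);
int block_side = (int)floor(sqrt((double)prop.maxThreadsPerBlock));   // 22 on a 512-thread device, 32 on a 1024-thread device
dim3 dimBlock(block_side, block_side);
dim3 dimGrid( (int)ceil(float(N)/float(dimBlock.x)), (int)ceil(float(N)/float(dimBlock.y)) );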

Suppose you took the code from the Laplace Solver Program tutorial and modified it so that instead of:

1.  if(i>0 && i<N-1 && j>0 && j<N-1) { B[index] = 0.25*(  A[index1] + A[index2] + A[index3] + A[index4] ); }

you now had a new array, a masking array. This masking array is set to zero on the boundaries of the array, and one on the interior. This way the interior is computed, and the boundary conditions are left alone.

1.  if(mask[index]) { B[index] = 0.25*(  A[index1] + A[index2] + A[index3] + A[index4] ); }

However, when you run the code, you occasionally get the dreaded unspecified launch failure error. Sometimes the code works fine; sometimes it fails. The problem is that you are accessing an array out of bounds, which is giving you the error. When the program is executed, a number of threads are created. These threads are grouped together in thread blocks. Suppose you want 16 x 16 threads per block, and the grid on which you are solving the Laplace Equation is 45 x 45. The grid has 2,025 points. Each block has 256 threads. Dividing the size of the grid by the number of threads per block means that you will need 7.9 blocks. Of course, you cannot have a partial block, so the number is rounded up to 8 blocks. That means that you have 2,048 threads, while you need only 2,025. (Really you only need 1,936 threads, since you have boundary conditions where no computation takes place.) The extra threads are unused in the first code block. However, in the second code block, with the masking array, those extra threads will access the mask array beyond the bounds of the array. The result is non-deterministic: sometimes it may succeed; sometimes it may fail with the error unspecified launch failure.
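The usual fix, sketched here with the same i, j, N, mask and index names as the Laplace example (the index computation is a hypothetical reconstruction of what the tutorial's kernel presumably does), is to make the extra threads return before they touch memory:

int i = blockIdx.x * blockDim.x + threadIdx.x;   // assumed thread-to-grid mapping
int j = blockIdx.y * blockDim.y + threadIdx.y;
if (i >= N || j >= N) return;                    // extra threads beyond the 45 x 45 grid do nothing

if (mask[index]) { B[index] = 0.25*(  A[index1] + A[index2] + A[index3] + A[index4] ); }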


GPU CUDA error "An unexpected error occurred trying to launch a kernel. The CUDA error was: invalid configuration argument" resolves automatically on second try

Sia

Hi,

I’m trying to run code on GPU and I get the following error:

An unexpected error occurred trying to launch a kernel. The CUDA error was:

invalid configuration argument

When I insert a breakpoint at the line that gives this error and run the code manually, I get the same error, but if I immediately try running it again, the code runs without the error. This suggests that my code is not buggy and that something else is causing it to crash.

I have seen the same CUDA error using the nan and subsref functions. If it helps at all, here are the exact lines, both of which give the same error on the first try but not on the second try:

K>> W0(:,ones(1,size(dWU0,3)),:)

Error using gpuArray/subsref

An unexpected error occurred trying to launch a kernel. The CUDA error was:

invalid configuration argument

K>> W0(:,ones(1,size(dWU0,3)),:)

ans(:,:,1) =

0.0087 0.0087
0.0135 0.0135
0.0202 0.0202

[... remaining numeric output omitted; the expression prints ans(:,:,1) through ans(:,:,3) in full, with no error, on this second attempt ...]

K>> y = nan(s,'like',x)

Error using gpuArray/nan

An unexpected error occurred trying to launch a kernel. The CUDA error was:

invalid configuration argument

K>> y = nan(s,'like',x)

y =

NaN

This is very perplexing to me. Any thoughts?


Answers (1)

Ben

Maybe you can try reducing the size of W0. I encountered this problem before, and it did not go away until I reduced the size.


