🐛 Bug
When setting the default tensor location to CUDA, torch.utils.data.DataLoader
produces RuntimeError: CUDA error: initialization error.
To Reproduce
- Copy the MNIST example (link).
- Add the following lines:
device_id = 0
if use_cuda:
    torch.cuda.set_device(torch.device("cuda:" + str(device_id) if torch.cuda.is_available() else "cpu"))
    torch.set_default_tensor_type('torch.cuda.FloatTensor')
Place them before (or after) the corresponding line in the example.
Result
Running the example then raises RuntimeError: CUDA error: initialization error.
Workaround
To work around it, kwargs needs to be empty, i.e. the line needs to read kwargs = {} (see the sketch below).
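A minimal sketch of the reproduction and workaround, assuming the standard PyTorch MNIST example's DataLoader setup (the num_workers/pin_memory kwargs come from that example; exact transform and batch-size values may differ):

import torch
from torchvision import datasets, transforms

use_cuda = torch.cuda.is_available()
device_id = 0
if use_cuda:
    torch.cuda.set_device(torch.device("cuda:" + str(device_id)))
    torch.set_default_tensor_type('torch.cuda.FloatTensor')

# The MNIST example normally sets kwargs = {'num_workers': 1, 'pin_memory': True}.
# With a CUDA default tensor type, the DataLoader worker processes fail with
# "RuntimeError: CUDA error: initialization error", so leave kwargs empty:
kwargs = {}

train_loader = torch.utils.data.DataLoader(
    datasets.MNIST('../data', train=True, download=True,
                   transform=transforms.ToTensor()),
    batch_size=64, shuffle=True, **kwargs)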
Expected behavior
Setting the default tensor location should not cause an error.
Environment
Collecting environment information…
PyTorch version: 1.1.0
Is debug build: No
CUDA used to build PyTorch: 9.0.176
OS: Ubuntu 16.04.6 LTS
GCC version: (Ubuntu 5.4.0-6ubuntu1~16.04.11) 5.4.0 20160609
CMake version: version 3.10.0
Python version: 3.6
Is CUDA available: Yes
CUDA runtime version: 8.0.61
GPU models and configuration:
GPU 0: GeForce GTX TITAN X
GPU 1: GeForce GTX TITAN X
Nvidia driver version: 410.104
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.6.0.21
/usr/lib/x86_64-linux-gnu/libcudnn.so.7.5.1
Versions of relevant libraries:
[pip3] numpy==1.16.3
[pip3] torch==1.1.0
[pip3] torchvision==0.3.0
[conda] Could not collect
I use CUDA in my code, but it still runs slowly, so I changed it to run in parallel using multiprocessing (pool.map) in Python. However, I now get CUDA ERROR: initialization error.
This is the function:
def step_M(self, iter_training):
    gpe, e_tuple_list = iter_training
    g = gpe[0]
    p = gpe[1]
    em_iters = gpe[2]
    e_tuple_list = sorted(e_tuple_list, key=lambda tup: tup[0])
    data = self.X[e_tuple_list[0][0]:e_tuple_list[0][1]]
    cluster_indices = np.array(range(e_tuple_list[0][0], e_tuple_list[0][1], 1), dtype=np.int32)
    for i in range(1, len(e_tuple_list)):
        d = e_tuple_list[i]
        cluster_indices = np.concatenate((cluster_indices, np.array(range(d[0], d[1], 1), dtype=np.int32)))
        data = np.concatenate((data, self.X[d[0]:d[1]]))
    g.train_on_subset(self.X, cluster_indices, max_em_iters=em_iters)
    return g, cluster_indices, data
And here is the calling code:
pool = Pool()
iter_bic_list = pool.map(self.step_M, iter_training.items())
The iter_training looks like this:
And these are the errors:
Could you help me fix this? Thank you.
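For reference, the usual cause with pool.map is that on Linux multiprocessing forks its worker processes, and CUDA cannot be re-initialized inside a forked child. A minimal sketch (not the original code; the worker function here is hypothetical) using the 'spawn' start method, which avoids the error:

import multiprocessing as mp
import torch

def worker(i):
    # Each spawned child initializes CUDA for itself, so it does not hit the
    # "CUDA error: initialization error" that a forked child would.
    return (torch.ones(3, device='cuda') * i).sum().item()

if __name__ == '__main__':
    ctx = mp.get_context('spawn')
    with ctx.Pool(processes=2) as pool:
        print(pool.map(worker, range(4)))

Note that with 'spawn', everything passed to pool.map (including a bound method such as self.step_M and its arguments) must be picklable.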
Problem
Your CUDA program is failing without giving any clue. You check the error value returned by CUDA runtime calls and discover that one of the first few CUDA runtime calls, probably a cudaMalloc, is failing with an initialization error. There is no explicit initialization required before calling CUDA APIs or kernels, so you now wonder what this error is and how to fix it.
Solution
NVIDIA could have been more informative with their error messages. The initialization error usually indicates that something went wrong when the CUDA runtime communicated with the CUDA driver. A common cause is a driver that is older than the CUDA toolkit. Each release of the CUDA toolkit ships with a driver; note that driver's version. Only drivers of the same or a later version work reliably with that CUDA toolkit. Install the latest driver and this error should go away.
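A rough way to compare the two versions, assuming nvidia-smi and nvcc are on the PATH (a convenience sketch, not part of the original advice):

import subprocess

# Version of the installed NVIDIA driver.
driver = subprocess.check_output(
    ['nvidia-smi', '--query-gpu=driver_version', '--format=csv,noheader'],
    universal_newlines=True).strip()
# Version of the installed CUDA toolkit (nvcc).
toolkit = subprocess.check_output(['nvcc', '--version'], universal_newlines=True)

print('Driver version :', driver)
print('Toolkit version:')
print(toolkit)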
Tried with: CUDA 4.0
Problem
While accessing the GPUs, CUDA fails with cudaErrorInitializationError, and nvidia-smi shows the GPUs as 'Off'.
Symptom
GPUs are unusable; nvidia-smi shows them as 'Off':
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.33.01 Driver Version: 440.33.01 CUDA Version: 10.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla P100-SXM2... Off | 00000002:01:00.0 Off | 0 |
| N/A 28C P0 30W / 300W | 0MiB / 16280MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 1 Tesla P100-SXM2... Off | 00000006:01:00.0 Off | 0 |
| N/A 31C P0 31W / 300W | 0MiB / 16280MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
Cause
Two main causes of this problem are:
1. The nvidia-persistenced.service daemon is not running. If the service is not active, start it and check whether the GPUs are accessible again.
# systemctl start nvidia-persistenced.service
2. If the nvidia-persistenced service is active but the problem persists, look for the messages below in 'systemctl status nvidia-persistenced -l':
Device NUMA memory is already online. This likely means that some non-NVIDIA software has auto-online the device memory before nvidia-persistenced could.
This likely indicates that the server is missing udev rules required for CUDA/Nvidia.
Environment
RHEL 7.6 with Nvidia/CUDA toolkit.
Diagnosing The Problem
Look for the presence of /etc/udev/rules.d/40-redhat.rules and confirm whether the Memory section has been commented out. By default, Red Hat auto-onlines NUMA memory; for CUDA to work, this default action needs to be disabled (a quick check is sketched below).
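The snippet below is my own convenience sketch, not from the original document; it prints the "Memory hotadd request" block of the override file so you can see whether it is still active:

from pathlib import Path

rules = Path('/etc/udev/rules.d/40-redhat.rules')
if not rules.exists():
    print('No user override found; the stock /lib/udev/rules.d/40-redhat.rules applies.')
else:
    printing = False
    for line in rules.read_text().splitlines():
        if 'Memory hotadd request' in line:
            printing = True
        if printing:
            print(line)  # uncommented lines here mean auto-onlining is still enabled
            if 'LABEL' in line and 'memory_hotplug_end' in line:
                break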
Resolving The Problem
1. Copy the /lib/udev/rules.d/40-redhat.rules file to the directory for user overridden rules:
# cp /lib/udev/rules.d/40-redhat.rules /etc/udev/rules.d/
2. Edit the /etc/udev/rules.d/40-redhat.rules file:
# vi /etc/udev/rules.d/40-redhat.rules
3. Comment out the entire "Memory hotadd request" section and save the change:
# Memory hotadd request
#SUBSYSTEM!="memory", ACTION!="add", GOTO="memory_hotplug_end"
#PROGRAM="/bin/uname -p", RESULT=="s390*", GOTO="memory_hotplug_end"
#ENV{.state}="online"
#PROGRAM="/bin/systemd-detect-virt", RESULT=="none", ENV{.state}="online_movable"
#ATTR{state}=="offline", ATTR{state}="$env{.state}"
#LABEL="memory_hotplug_end"
4. Restart the system for the changes to take effect:
# reboot