🐛 Bug
When setting the default tensor location to CUDA, torch.utils.data.DataLoader
produces RuntimeError: CUDA error: initialization error.
To Reproduce
- Copy the MNIST example (link).
- Add the following lines:
device_id = 0
if use_cuda:
    torch.cuda.set_device(torch.device("cuda:" + str(device_id) if torch.cuda.is_available() else "cpu"))
    torch.set_default_tensor_type('torch.cuda.FloatTensor')
Place them before (or after) the corresponding line in the example.
Result
Running the example then raises RuntimeError: CUDA error: initialization error.
Workaround
To work around it, kwargs needs to be empty, i.e. the line needs to read kwargs = {} (see the sketch below).
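A minimal sketch of the reproduction and workaround, assuming the standard PyTorch MNIST example's DataLoader setup (the num_workers/pin_memory kwargs come from that example; exact transform and batch-size values may differ):

import torch
from torchvision import datasets, transforms

use_cuda = torch.cuda.is_available()
device_id = 0
if use_cuda:
    torch.cuda.set_device(torch.device("cuda:" + str(device_id)))
    torch.set_default_tensor_type('torch.cuda.FloatTensor')

# The MNIST example normally sets kwargs = {'num_workers': 1, 'pin_memory': True}.
# With a CUDA default tensor type, the DataLoader worker processes fail with
# "RuntimeError: CUDA error: initialization error", so leave kwargs empty:
kwargs = {}

train_loader = torch.utils.data.DataLoader(
    datasets.MNIST('../data', train=True, download=True,
                   transform=transforms.ToTensor()),
    batch_size=64, shuffle=True, **kwargs)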
Expected behavior
Setting the default tensor location should not cause an error.
Environment
Collecting environment information…
PyTorch version: 1.1.0
Is debug build: No
CUDA used to build PyTorch: 9.0.176
OS: Ubuntu 16.04.6 LTS
GCC version: (Ubuntu 5.4.0-6ubuntu1~16.04.11) 5.4.0 20160609
CMake version: version 3.10.0
Python version: 3.6
Is CUDA available: Yes
CUDA runtime version: 8.0.61
GPU models and configuration:
GPU 0: GeForce GTX TITAN X
GPU 1: GeForce GTX TITAN X
Nvidia driver version: 410.104
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.6.0.21
/usr/lib/x86_64-linux-gnu/libcudnn.so.7.5.1
Versions of relevant libraries:
[pip3] numpy==1.16.3
[pip3] torch==1.1.0
[pip3] torchvision==0.3.0
[conda] Could not collect
I use CUDA in my code, but it still runs slowly, so I changed it to run in parallel using multiprocessing (pool.map) in Python. However, I now get CUDA ERROR: initialization error.
This is the function:
def step_M(self, iter_training):
    gpe, e_tuple_list = iter_training
    g = gpe[0]
    p = gpe[1]
    em_iters = gpe[2]
    e_tuple_list = sorted(e_tuple_list, key=lambda tup: tup[0])
    data = self.X[e_tuple_list[0][0]:e_tuple_list[0][1]]
    cluster_indices = np.array(range(e_tuple_list[0][0], e_tuple_list[0][1], 1), dtype=np.int32)
    for i in range(1, len(e_tuple_list)):
        d = e_tuple_list[i]
        cluster_indices = np.concatenate((cluster_indices, np.array(range(d[0], d[1], 1), dtype=np.int32)))
        data = np.concatenate((data, self.X[d[0]:d[1]]))
    g.train_on_subset(self.X, cluster_indices, max_em_iters=em_iters)
    return g, cluster_indices, data
And here is the calling code:
pool = Pool()
iter_bic_list = pool.map(self.step_M, iter_training.items())
The iter_training looks like this:
And these are the errors:
Could you help me fix this? Thank you.
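For reference, the usual cause with pool.map is that on Linux multiprocessing forks its worker processes, and CUDA cannot be re-initialized inside a forked child. A minimal sketch (not the original code; the worker function here is hypothetical) using the 'spawn' start method, which avoids the error:

import multiprocessing as mp
import torch

def worker(i):
    # Each spawned child initializes CUDA for itself, so it does not hit the
    # "CUDA error: initialization error" that a forked child would.
    return (torch.ones(3, device='cuda') * i).sum().item()

if __name__ == '__main__':
    ctx = mp.get_context('spawn')
    with ctx.Pool(processes=2) as pool:
        print(pool.map(worker, range(4)))

Note that with 'spawn', everything passed to pool.map (including a bound method such as self.step_M and its arguments) must be picklable.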
Problem
Your CUDA program is failing without giving any clue. You check the error value returned by CUDA runtime calls and discover that one of the first few CUDA runtime calls, probably a cudaMalloc, is failing with an initialization error. There is no explicit initialization required before calling CUDA APIs or kernels, so you now wonder what this error is and how to fix it.
Solution
NVIDIA could have been more informative with their error messages. The initialization error usually indicates that something went wrong when the CUDA runtime communicated with the CUDA driver. A common cause is a driver that is older than the CUDA toolkit. Each release of the CUDA toolkit ships with a driver; note that driver's version. Only drivers of the same or a later version work reliably with that CUDA toolkit. Install the latest driver and this error should go away.
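A rough way to compare the two versions, assuming nvidia-smi and nvcc are on the PATH (a convenience sketch, not part of the original advice):

import subprocess

# Version of the installed NVIDIA driver.
driver = subprocess.check_output(
    ['nvidia-smi', '--query-gpu=driver_version', '--format=csv,noheader'],
    universal_newlines=True).strip()
# Version of the installed CUDA toolkit (nvcc).
toolkit = subprocess.check_output(['nvcc', '--version'], universal_newlines=True)

print('Driver version :', driver)
print('Toolkit version:')
print(toolkit)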
Tried with: CUDA 4.0
Problem
While accessing the GPUs, CUDA fails with cudaErrorInitializationError, and nvidia-smi shows the GPUs as 'Off'.
Symptom
GPUs are unusable; nvidia-smi shows them as 'Off':
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.33.01 Driver Version: 440.33.01 CUDA Version: 10.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla P100-SXM2... Off | 00000002:01:00.0 Off | 0 |
| N/A 28C P0 30W / 300W | 0MiB / 16280MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 1 Tesla P100-SXM2... Off | 00000006:01:00.0 Off | 0 |
| N/A 31C P0 31W / 300W | 0MiB / 16280MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
Cause
Two main causes of this problem are:
1. The nvidia-persistenced.service daemon is not running. If the service is not active, start it and check whether the GPUs are accessible again.
# systemctl start nvidia-persistenced.service
2. If the nvidia-persistenced service is active but the problem persists, look for the messages below in 'systemctl status nvidia-persistenced -l':
Device NUMA memory is already online. This likely means that some non-NVIDIA software has auto-online the device memory before nvidia-persistenced could.
This likely indicates that the server is missing udev rules required for CUDA/Nvidia.
Environment
RHEL 7.6 with Nvidia/CUDA toolkit.
Diagnosing The Problem
Look for the presence of /etc/udev/rules.d/40-redhat.rules and confirm whether the Memory section has been commented out. By default, Red Hat auto-onlines NUMA memory; for CUDA to work, this default action needs to be disabled (a quick check is sketched below).
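The snippet below is my own convenience sketch, not from the original document; it prints the "Memory hotadd request" block of the override file so you can see whether it is still active:

from pathlib import Path

rules = Path('/etc/udev/rules.d/40-redhat.rules')
if not rules.exists():
    print('No user override found; the stock /lib/udev/rules.d/40-redhat.rules applies.')
else:
    printing = False
    for line in rules.read_text().splitlines():
        if 'Memory hotadd request' in line:
            printing = True
        if printing:
            print(line)  # uncommented lines here mean auto-onlining is still enabled
            if 'LABEL' in line and 'memory_hotplug_end' in line:
                break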
Resolving The Problem
1. Copy the /lib/udev/rules.d/40-redhat.rules file to the directory for user overridden rules:
# cp /lib/udev/rules.d/40-redhat.rules /etc/udev/rules.d/
2. Edit the /etc/udev/rules.d/40-redhat.rules file:
# vi /etc/udev/rules.d/40-redhat.rules
3. Comment out the entire "Memory hotadd request" section and save the change:
# Memory hotadd request
#SUBSYSTEM!="memory", ACTION!="add", GOTO="memory_hotplug_end"
#PROGRAM="/bin/uname -p", RESULT=="s390*", GOTO="memory_hotplug_end"
#ENV{.state}="online"
#PROGRAM="/bin/systemd-detect-virt", RESULT=="none", ENV{.state}="online_movable"
#ATTR{state}=="offline", ATTR{state}="$env{.state}"
#LABEL="memory_hotplug_end"
4. Restart the system for the changes to take effect:
# reboot