Hello there.
I’m hitting the same issue here, but with containerd rather than Docker.
Here’s my configuration:
- GPUs:
# lspci | grep -i nvidia
00:04.0 3D controller: NVIDIA Corporation GP100GL [Tesla P100 PCIe 16GB] (rev a1)
- OS:
# cat /etc/lsb-release
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=22.04
DISTRIB_CODENAME=jammy
DISTRIB_DESCRIPTION="Ubuntu 22.04.1 LTS"
- containerd release:
# containerd --version
containerd containerd.io 1.6.8 9cd3357b7fd7218e4aec3eae239db1f68a5a6ec6
- nvidia-container-toolkit version:
# nvidia-container-toolkit -version
NVIDIA Container Runtime Hook version 1.11.0
commit: d9de4a0
- runc version:
# runc --version
runc version 1.1.4
commit: v1.1.4-0-g5fd4c4d
spec: 1.0.2-dev
go: go1.17.13
libseccomp: 2.5.1
Note that the NVIDIA Container Toolkit was installed by the NVIDIA GPU Operator on Kubernetes (v1.25.3).
I attached the containerd configuration file and the nvidia-container-runtime configuration file to my comment.
containerd.txt
nvidia-container-runtime.txt
How I reproduce this bug:
I run the following command on my host:
# nerdctl run -n k8s.io --runtime=/usr/local/nvidia/toolkit/nvidia-container-runtime --network=host --rm -ti --name ubuntu --gpus all -v /run/nvidia/driver/usr/bin:/tmp/nvidia-bin docker.io/library/ubuntu:latest bash
After some time, the nvidia-smi command exits with the error "Failed to initialize NVML: Unknown Error".
Traces, logs, etc…
- Here are the devices listed in the state.json file:
{ "type": 99, "major": 195, "minor": 255, "permissions": "", "allow": false, "path": "/dev/nvidiactl", "file_mode": 438, "uid": 0, "gid": 0 },
{ "type": 99, "major": 234, "minor": 0, "permissions": "", "allow": false, "path": "/dev/nvidia-uvm", "file_mode": 438, "uid": 0, "gid": 0 },
{ "type": 99, "major": 234, "minor": 1, "permissions": "", "allow": false, "path": "/dev/nvidia-uvm-tools", "file_mode": 438, "uid": 0, "gid": 0 },
{ "type": 99, "major": 195, "minor": 254, "permissions": "", "allow": false, "path": "/dev/nvidia-modeset", "file_mode": 438, "uid": 0, "gid": 0 },
{ "type": 99, "major": 195, "minor": 0, "permissions": "", "allow": false, "path": "/dev/nvidia0", "file_mode": 438, "uid": 0, "gid": 0 }
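Since every entry above has "allow": false, a quick way to confirm that the device cgroup is what blocks NVML is to look at the rules the kernel is actually enforcing inside the container. A minimal sketch, assuming a cgroup v1 host and the ubuntu container started above (paths differ under cgroup v2):

# Inside the container: show the device cgroup allow list (cgroup v1).
cat /sys/fs/cgroup/devices/devices.list
# A working GPU container lists explicit rules such as "c 195:* rw" and "c 234:* rw";
# a broken one typically only keeps "m" (mknod) entries, so opening /dev/nvidia*
# fails and nvidia-smi reports "Failed to initialize NVML: Unknown Error".

# The device nodes themselves usually remain present either way:
ls -l /dev/nvidia*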
Thank you very much for your help. 🙏
I am having an interesting and weird issue.
When I start a Docker container with GPU support, it works fine and I can see all the GPUs inside the container. However, a few hours or a few days later, I can no longer use the GPUs in Docker.
When I run nvidia-smi inside the container, I see this message:
"Failed to initialize NVML: Unknown Error"
However, on the host machine I can see all the GPUs with nvidia-smi. Also, when I restart the container, it works fine again and shows all the GPUs.
My inference container should stay up all the time and run inference depending on server requests. Does anyone have the same issue, or a solution for this problem?
asked Jul 11, 2022 at 1:28
I had the same error. I used Docker's health check as a temporary workaround: when nvidia-smi fails, the container is marked unhealthy and restarted by willfarrell/autoheal.
Docker Compose version:
services:
  gpu_container:
    ...
    healthcheck:
      test: ["CMD-SHELL", "test -s `which nvidia-smi` && nvidia-smi || exit 1"]
      start_period: 1s
      interval: 20s
      timeout: 5s
      retries: 2
    labels:
      - autoheal=true
      - autoheal.stop.timeout=1
    restart: always
  autoheal:
    image: willfarrell/autoheal
    environment:
      - AUTOHEAL_CONTAINER_LABEL=all
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock
    restart: always
Dockerfile version:
LABEL autoheal=true
LABEL autoheal.stop.timeout=1
HEALTHCHECK --start-period=60s \
            --interval=20s \
            --timeout=10s \
            --retries=2 \
            CMD nvidia-smi || exit 1
(HEALTHCHECK has no --label option, so the autoheal labels go on separate LABEL instructions.)
with autoheal daemon:
docker run -d \
  --name autoheal \
  --restart=always \
  -e AUTOHEAL_CONTAINER_LABEL=all \
  -v /var/run/docker.sock:/var/run/docker.sock \
  willfarrell/autoheal
answered Sep 13, 2022 at 14:05
I had the same weird issue. According to your description, it is most likely related to this issue on the nvidia-docker official repo:
https://github.com/NVIDIA/nvidia-docker/issues/1618
I plan to try the solution mentioned in the related thread, which suggests upgrading the kernel cgroup version on the host machine from v1 to v2.
PS: We have since verified this solution in our production environment and it really works! Unfortunately, it requires at least Linux kernel 4.5. If upgrading the kernel is not possible, the method mentioned by sih4sing5hog5 can also serve as a workaround.
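For reference, a minimal sketch of checking which cgroup version a host runs and switching it to v2 (this assumes a GRUB-based system with systemd; bootloader steps vary):

# Check which cgroup hierarchy is mounted: "cgroup2fs" means cgroup v2 is active.
stat -fc %T /sys/fs/cgroup/

# On a GRUB-based system, add the following parameter to GRUB_CMDLINE_LINUX
# in /etc/default/grub, then rebuild the config and reboot:
#   systemd.unified_cgroup_hierarchy=1
sudo update-grub   # or: sudo grub2-mkconfig -o /boot/grub2/grub.cfg
sudo reboot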
answered Oct 13, 2022 at 4:03
I had the same issue; I just ran screen watch -n 1 nvidia-smi in the container and now it keeps working continuously.
answered Aug 21, 2022 at 17:58
I'm just adding some info to SZALINSKI's answer, so you can understand this problem better; I have been working on building a TensorFlow container for a few days.
————————————————————————
My Docker version is 20.10.8, and at this point the only additional package I have installed is nvidia-container-toolkit, installed with:
yay -S nvidia-container-toolkit
So I used Method 2 from szalinski's post (the Arch Linux forum thread below), which bypasses the cgroups option.
When nvidia-container-runtime or nvidia-container-toolkit is used with the cgroups option, it automatically sets up the device cgroup for the container.
When you bypass this option, you have to pass the devices to the container yourself. Here's an example.
A single docker run
docker run --rm --gpus 1 --device /dev/nvidia-caps --device /dev/nvidia0 --device /dev/nvidiactl --device /dev/nvidia-modeset --device /dev/nvidia-uvm --device /dev/nvidia-uvm-tools nvidia/cuda:11.0-base nvidia-smi
A compose file
version: "3.9"
services:
  nvidia-startup:
    container_name: nv-udm
    image: nvidia/cuda:11.0-base
    devices:
      - /dev/nvidia0:/dev/nvidia0
      - /dev/nvidiactl:/dev/nvidiactl
      - /dev/nvidia-caps:/dev/nvidia-caps
      - /dev/nvidia-modeset:/dev/nvidia-modeset
      - /dev/nvidia-uvm:/dev/nvidia-uvm
      - /dev/nvidia-uvm-tools:/dev/nvidia-uvm-tools
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    entrypoint: ["nvidia-smi"]
Or you can run Docker in privileged mode, which allows the container to access all devices on your machine.
A single docker run
docker run --rm --privileged --gpus 1 nvidia/cuda:11.0-base nvidia-smi
A compose file
version: "3.9"
services:
  nvidia-startup:
    container_name: nv-udm
    image: nvidia/cuda:11.0-base
    privileged: true
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    entrypoint: ["nvidia-smi"]
————————————————————————
You can also use nvidia-container-runtime, installed with
yay -S nvidia-container-runtime
which is another option alongside nvidia-container-toolkit.
Edit your /etc/docker/daemon.json:
{
  "data-root": "/virtual/data/docker",
  "runtimes": {
    "nvidia": {
      "path": "/usr/bin/nvidia-container-runtime",
      "runtimeArgs": []
    }
  }
}
(The data-root entry is just my personal setting and is not required; note that JSON does not allow comments, so don't add any to this file.)
sudo systemctl restart docker
There's no need to edit the docker systemd unit. When you run your container with the extra runtime option, Docker looks the runtime up in daemon.json.
For a single docker run, with or without privileged mode, just replace --gpus 1 with --runtime nvidia, as shown below.
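For example, a sketch of the non-privileged docker run from above with the runtime swapped in (the device paths mirror the earlier example and may differ on your host):

docker run --rm --runtime nvidia \
  --device /dev/nvidia0 --device /dev/nvidiactl --device /dev/nvidia-caps \
  --device /dev/nvidia-modeset --device /dev/nvidia-uvm --device /dev/nvidia-uvm-tools \
  nvidia/cuda:11.0-base nvidia-smi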
For a compose file
Without privileged mode
version: "3.9"
services:
  nvidia-startup:
    container_name: nv-udm
    image: nvidia/cuda:11.0-base
    runtime: nvidia
    devices:
      - /dev/nvidia0:/dev/nvidia0
      - /dev/nvidiactl:/dev/nvidiactl
      - /dev/nvidia-caps:/dev/nvidia-caps
      - /dev/nvidia-modeset:/dev/nvidia-modeset
      - /dev/nvidia-uvm:/dev/nvidia-uvm
      - /dev/nvidia-uvm-tools:/dev/nvidia-uvm-tools
    entrypoint: ["nvidia-smi"]
With privileged mode
version: "3.9"
services:
  nvidia-startup:
    container_name: nv-udm
    image: nvidia/cuda:11.0-base
    runtime: nvidia
    privileged: true
    entrypoint: ["nvidia-smi"]
Last edited by howard-o-neil (2021-11-13 16:49:08)
Arch Linux
#1 2021-06-04 17:31:47
[SOLVED] Docker with GPU: «Failed to initialize NVML: Unknown Error»
I'm following the (very simple) instructions here (https://wiki.archlinux.org/title/Docker … VIDIA_GPUs) for the recommended route (nvidia-container-toolkit) and getting "Failed to initialize NVML: Unknown Error"
when I try to run any container with "--gpus". In particular, I tried the command here (https://docs.nvidia.com/datacenter/clou … tml#docker):
(both as unprivileged user and root, same result).
Also, I attempted to add the kernel param "systemd.unified_cgroup_hierarchy=false", but I don't know whether that succeeded; checking for it returns nothing, so it may not have worked. I added it to the entries in my boot loader configuration.
The output of "nvidia-smi" on the host is:
My docker version: 20.10.6
My kernel version: 5.12.7
My GPU: NVIDIA® GeForce RTX™ 2060 6GB GDDR6 with Max-Q
Last edited by Californian (2021-06-05 00:33:57)
#2 2021-06-04 23:38:56
Re: [SOLVED] Docker with GPU: «Failed to initialize NVML: Unknown Error»
I've bumped into the same issue after a recent update of the nvidia related packages. Fortunately, I managed to fix it.
——
Method 1, recommended
1) Kernel parameter
The easiest way to ensure the presence of the systemd.unified_cgroup_hierarchy=false param is to check /proc/cmdline :
This of course assumes the parameter is set via your boot loader. You can also hijack this file to set the parameter at runtime (https://wiki.archlinux.org/title/Kernel … ng_cmdline).
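A quick sketch of that check; the grep prints the parameter only if it was actually applied at boot:

# Verify the kernel was booted with the parameter.
grep -o 'systemd.unified_cgroup_hierarchy=[^ ]*' /proc/cmdline
# Expected output if it took effect:
#   systemd.unified_cgroup_hierarchy=false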
2) nvidia-container configuration
In the file /etc/nvidia-container-runtime/config.toml
set the parameter no-cgroups = false
After that, restart docker and run a test container:
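For example (the CUDA image tag is only an illustration; any tag matching your driver works):

sudo systemctl restart docker
docker run --rm --gpus all nvidia/cuda:11.4.0-runtime-ubuntu20.04 nvidia-smi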
———
Method 2
Actually, you can try to bypass cgroups v2 by setting (in the file mentioned above) no-cgroups = true.
Then you must manually pass all gpu devices to the container. Check this answer for the list of required mounts: https://github.com/NVIDIA/nvidia-docker … -851039827
For debugging purposes, just run:
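A minimal sketch of such a debug run under Method 2 (no-cgroups = true), passing the devices explicitly; the exact device list comes from the answer linked above and may differ on your machine:

docker run --rm --gpus all \
  --device /dev/nvidia0 --device /dev/nvidiactl \
  --device /dev/nvidia-modeset --device /dev/nvidia-uvm --device /dev/nvidia-uvm-tools \
  nvidia/cuda:11.4.0-runtime-ubuntu20.04 nvidia-smi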
Last edited by szalinski (2021-06-04 23:41:06)
#3 2021-06-05 00:32:31
Re: [SOLVED] Docker with GPU: «Failed to initialize NVML: Unknown Error»
Amazing, thank you! The /etc/nvidia-container-runtime/config.toml parameter (no-cgroups = false) was what I was missing, I’ll add that to the wiki.
#4 2021-06-06 10:24:00
Re: [SOLVED] Docker with GPU: «Failed to initialize NVML: Unknown Error»
FWIW, no-cgroups=true is forced by default in the AUR package: https://aur.archlinux.org/cgit/aur.git/ … 08ba39c5df
#5 2021-10-15 12:57:59
Re: [SOLVED] Docker with GPU: «Failed to initialize NVML: Unknown Error»
I have exactly the same problem, and although I thoroughly followed szalinski's instructions, I still have the problem.
My /proc/cmdline contains:
and my /etc/nvidia-container-runtime/config.toml contains:
When I run docker run --gpus all nvidia/cuda:11.4.0-runtime-ubuntu20.04 nvidia-smi, I get the following error:
Here are my GPU and system’s characteristics:
* nvidia-smi’s output: NVIDIA-SMI 470.74 Driver Version: 470.74 CUDA Version: 11.4
* GPU: GeFORCE RTX 2080 Ti
* Docker version: Docker version 20.10.9, build c2ea9bc90b
Thank you in advance for your help.
EDIT: I eventually solved my problem. It had nothing to do with the solutions proposed above; I had to run a privileged container with `docker run`'s option `--privileged` to get access to the GPU:
now everything works perfectly.
Last edited by Wmog (2021-10-15 13:32:41)
Failed to initialize NVML: Unknown Error without any kubelet update (cpu-manager-policy is default none) #1618
1. Issue or feature description
Yes, @klueska already described this in #1469, and I really tried all of those methods, including using nvidia-device-plugin-compat-with-cpumanager.yml. But the error is still there, so let me give more details.
Failed to initialize NVML: Unknown Error does not occur when the NVIDIA container is first created, and not within a couple of seconds (my kubernetes config file uses the default nodeStatusUpdateFrequency, which is 10s); it happens after a couple of days (sometimes after a few hours).
- No kubelet update configuration is set
- cpu-manager-policy is not set (the default is none)
- container inspect shows all devices mounted with rw permission, but /sys/fs/cgroup/devices/devices.list only shows m entries
- If I directly use nvidia-docker to create a container in the Kubernetes environment, the error also occurs
2. What I Found
The erroring container:
The healthy container:
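A sketch of the kind of comparison this refers to, assuming cgroup v1 and two running containers (container names and the exact rule lines are illustrative):

# Broken container: the explicit NVIDIA rules are gone from the device cgroup,
# only "m" (mknod) style entries remain, so NVML cannot open /dev/nvidia*.
docker exec broken-gpu-ctr cat /sys/fs/cgroup/devices/devices.list

# Healthy container: the NVIDIA character devices are still allowed read/write,
# e.g. lines like "c 195:* rw" and "c 234:* rw" are present.
docker exec healthy-gpu-ctr cat /sys/fs/cgroup/devices/devices.list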
3. Information to attach (optional if deemed irrelevant)
- Some nvidia-container information: nvidia-container-cli -k -d /dev/tty info
- Kernel version from uname -a
- Any relevant kernel output lines from dmesg
- Driver information from nvidia-smi -a
- Docker version from docker version
Client: Docker Engine - Community
Version: 20.10.5
API version: 1.41
Go version: go1.13.15
Git commit: 55c4c88
Built: Tue Mar 2 20:18:05 2021
OS/Arch: linux/amd64
Context: default
Experimental: true
Server: Docker Engine - Community
Engine:
Version: 20.10.5
API version: 1.41 (minimum version 1.12)
Go version: go1.13.15
Git commit: 363e9a8
Built: Tue Mar 2 20:16:00 2021
OS/Arch: linux/amd64
Experimental: false
containerd:
Version: 1.4.3
GitCommit: 269548fa27e0089a8b8278fc4fc781d7f65a939b
nvidia:
Version: 1.0.0-rc92
GitCommit: ff819c7e9184c13b7c2607fe6c30ae19403a7aff
docker-init:
Version: 0.19.0
GitCommit: de40ad0
- NVIDIA packages version from dpkg -l '*nvidia*' or rpm -qa '*nvidia*'
Desired=Unknown/Install/Remove/Purge/Hold
| Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait/Trig-pend
|/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad)
||/ Name Version Architecture Description
+++-=====================================-=======================-=======================-===============================================================================
un libgldispatch0-nvidia (no description available)
ii libnvidia-container-tools 1.6.0~rc.2-1 amd64 NVIDIA container runtime library (command-line tools)
ii libnvidia-container1:amd64 1.6.0~rc.2-1 amd64 NVIDIA container runtime library
un nvidia-304 (no description available)
un nvidia-340 (no description available)
un nvidia-384 (no description available)
un nvidia-common (no description available)
ii nvidia-container-runtime 3.6.0~rc.1-1 amd64 NVIDIA container runtime
un nvidia-container-runtime-hook (no description available)
ii nvidia-container-toolkit 1.6.0~rc.2-1 amd64 NVIDIA container runtime hook
un nvidia-docker (no description available)
ii nvidia-docker2 2.7.0~rc.2-1 all nvidia-docker CLI wrapper
ii nvidia-prime 0.8.16-0.18.04.1 all Tools to enable NVIDIA's Prime
- NVIDIA container library version from nvidia-container-cli -V
version: 1.6.0~rc.2
build date: 2021-11-05T14:19+00:00
build revision: badec1fa4a2c085aa9396f95b6bb1d69f1c7996b
build compiler: x86_64-linux-gnu-gcc-7 7.5.0
build platform: x86_64
build flags: -D_GNU_SOURCE -D_FORTIFY_SOURCE=2 -DNDEBUG -std=gnu11 -O2 -g -fdata-sections -ffunction-sections -fstack-protector -fno-strict-aliasing -fvisibility=hidden -Wall -Wextra -Wcast-align -Wpointer-arith -Wmissing-prototypes -Wnonnull -Wwrite-strings -Wlogical-op -Wformat=2 -Wmissing-format-attribute -Winit-self -Wshadow -Wstrict-prototypes -Wunreachable-code -Wconversion -Wsign-conversion -Wno-unknown-warning-option -Wno-format-extra-args -Wno-gnu-alignof-expression -Wl,-zrelro -Wl,-znow -Wl,-zdefs -Wl,--gc-sections
nvidia-smi command in container returns «Failed to initialize NVML: Unknown Error» after couple of times #1678
1. Issue or feature description
The NVIDIA GPU works well when the container has just started, but after it has been running for a while (maybe several days), the GPUs mounted by the NVIDIA container runtime become invalid. The nvidia-smi command returns "Failed to initialize NVML: Unknown Error" in the container, while it works well on the host machine.
nvidia-smi looks fine on the host, and we can see the training process information in the host's nvidia-smi output. But if we stop the training process at this point, it can no longer be restarted.
Referring to the solution from issue #1618, we tried upgrading cgroup to v2, but it did not work.
Surprisingly, we cannot find any devices.list file in the container, which is mentioned in #1618.
2. Steps to reproduce the issue
We found that this issue can be reproduced by running systemctl daemon-reload on the host, but we have not actually run any similar command in our production environment.
Can anyone give some ideas for tracking down this problem?
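A sketch of that reproduction on a plain Docker host (image and container name are only illustrative):

# Start a long-running GPU container.
docker run -d --name gpu-repro --gpus all nvidia/cuda:11.4.0-runtime-ubuntu20.04 sleep infinity

# GPU access works right after startup.
docker exec gpu-repro nvidia-smi

# Trigger the failure from the host side.
sudo systemctl daemon-reload

# On affected hosts (cgroup v1 with systemd-managed device rules),
# this now fails with "Failed to initialize NVML: Unknown Error".
docker exec gpu-repro nvidia-smi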
3. Information to attach (optional if deemed irrelevant)
nvidia driver version: 470.103.01
I’ve noticed the same behavior for some time on Debian 11; at least since March as that is when I started regularly checking for nvidia-smi functioning in containers, and thanks for calling out systemctl daemon-reload as something that triggers it. In my case, I have automatic updates enabled in Debian using unattended upgrades and your mention of daemon-reload makes me think that the package updates may be triggering a daemon-reload event to occur. I’m only updating packages from the Debian repos automatically, applying nvidia-docker and 3rd party repo updates manually.
Here is an example from today where I can see the auto-updates happening: in this case telegraf is being updated and then a daemon-reload occurs, or at least that is what I believe I am seeing from systemd[1]: Reloading., based on the output when I manually run systemctl daemon-reload:
Do the packages in the debian repositories include the NVIDIA Drivers?
Fair point and callout; they do. Looking back and cross-checking the times I detected the issue (and sent myself a notification) against the packages that were upgraded at the time, I recorded the following:
Failed to initialize NVML: Unknown Error after calling systemctl daemon-reload #1650
1. Issue or feature description
Failed to initialize NVML: Unknown Error does not occur when the NVIDIA container is first created, but it happens after calling systemctl daemon-reload.
It works fine with
Kernel 4.19.91 and systemd 219.
But it doesn't work with
Kernel 5.10.23 and systemd 239.
I tried to monitor it with bpftrace.
During container startup, I can see the event:
And I can see the devices.list in the container as below:
But after running systemctl daemon-reload, I see the event:
And the devices.list in the container as below:
The GPU devices are no longer rw.
Currently I'm not able to use cgroup v2. Any suggestions? Thanks very much.
2. Steps to reproduce the issue
3. Information to attach (optional if deemed irrelevant)
- [x] Some nvidia-container information: nvidia-container-cli -k -d /dev/tty info
- Kernel version from uname -a
- Any relevant kernel output lines from dmesg
- Driver information from nvidia-smi -a
- [x] Docker version from docker version
- NVIDIA packages version from dpkg -l '*nvidia*' or rpm -qa '*nvidia*'
- NVIDIA container library version from nvidia-container-cli -V
- NVIDIA container library logs (see troubleshooting)
- Docker command, image and tag used
I also encountered this problem, which has been occurring for some time.
@klueska Could you help take a look? Thanks.
I found these logs during the systemd reload:
From the major and minor numbers of these devices, I can tell they are the /dev/nvidia* devices. If I manually create these soft links with the following steps, the problem disappears:
So I wonder whether the NVIDIA toolkit should provide something like udev rules that trigger the kernel or systemd to create /dev/char/* -> /dev/nvidia* ?
Otherwise, is there a configuration file where we can explicitly set DeviceAllow to /dev/nvidia* in a way that systemd recognizes?
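A sketch of that manual symlink workaround, run as root on the host; the major/minor numbers are read from your own /dev/nvidia* nodes, so nothing here is hard-coded:

# systemd's device rules reference devices as /dev/char/<major>:<minor>,
# so recreate those links for every NVIDIA character device node.
for dev in /dev/nvidia*; do
  [ -c "$dev" ] || continue            # skip non-character entries such as /dev/nvidia-caps
  major=$(stat -c '%t' "$dev")         # major number (hexadecimal)
  minor=$(stat -c '%T' "$dev")         # minor number (hexadecimal)
  ln -sf "$dev" "/dev/char/$((16#$major)):$((16#$minor))"
done
ls -l /dev/char/ | grep nvidia         # verify the links exist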
Hey, I have been experiencing this issue for a long time. I solved it by adding --privileged to the containers that need the graphics card; hope this helps.
Thanks for the response. But I'm not able to set privileged because I'm using it in Kubernetes, and it would let users see all the GPUs.
I fixed this issue in our env (CentOS 8, systemd 239) perfectly with cgroup v2, for both docker and containerd nodes. I can share the steps for how we fixed it by upgrading from cgroup v1 to cgroup v2, if that's an option for you.
I’m using cgroups v2 myself so I would be interested in hearing what you did @gengwg
Sure, here I wrote up the detailed steps for how I fixed it using cgroup v2. Let me know if it works in your env.
In that case, whatever the trigger is that you’re seeing apparently isn’t the same as mine as all that your instructions do is switch from cgroups v1 to v2. I’m already on cgroups v2 here on Debian 11 (bullseye) and I know that just having cgroups v2 enabled doesn’t fix anything for me.
Yeah, I do see some people still reporting it on v2, for example this.
Time-wise, this issue started to appear after we upgraded from CentOS 7 to CentOS 8. All components in the pipeline (kernel, systemd, containerd, nvidia runtime, etc.) got upgraded, so I'm not totally sure which component (or possibly multiple components) caused this issue. In our case v1 to v2 seems to have fixed it, at least for a week or so. I will keep monitoring it in case it comes back.
It has been over a week. Did you see the error again?
How to get these logs to find the device numbers for my use case?
@matifali You can simply use ls -l /dev/nvidia* to find the device ids. For example:
Here, 7,131 is the major and minor device number for this device.
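On another machine, for example, the same command might show something like this, where the two numbers before the date are the major and minor:

$ ls -l /dev/nvidia*
crw-rw-rw- 1 root root 195,   0 Oct 30 22:16 /dev/nvidia0
crw-rw-rw- 1 root root 195, 255 Oct 30 22:16 /dev/nvidiactl
crw-rw-rw- 1 root root 239,   0 Oct 30 22:16 /dev/nvidia-uvm
crw-rw-rw- 1 root root 239,   1 Oct 30 22:16 /dev/nvidia-uvm-tools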
I've just fixed the same issue on Ubuntu 22.04 by changing my docker compose file.
Simply use cgroup v2 by commenting out the no-cgroups = false line (#no-cgroups = false) in /etc/nvidia-container-runtime/config.toml, and change your docker-compose file like this:
mount the /dev directory to /dev in the container,
set privileged: true in the docker compose file,
and specify the runtime with runtime: nvidia.
Your final docker-compose file will then look like this:
version: '3'
services:
  nvidia:
    image:
    restart: always
    container_name: Nvidia-Container
    ports:
      - port:port
    privileged: true
    volumes:
      - /dev:/dev
    runtime: nvidia
And what if we are not using docker-compose, @RezaImany? I am using Terraform to provision with the gpus="all" flag.
Exposing all devices to the container isn't a good approach, and neither is privileged=true.
The root cause of this error is that the cgroup controller does not allow the container to reconnect to NVML until a restart; you have to modify the cgroup setup to bypass some of those limitations.
The --privileged flag gives all capabilities to the container, and it also lifts all the limitations enforced by the device cgroup controller. In other words, the container can then do almost everything that the host can do.
For my use case, multiple people are using the same machine, and setting privileged=true is not a good idea as the isolation between users is no longer there. Is there any other way?
#1
Hi all,
I’m trying to share a GPU with a Debian Bullseye (11) container. I installed the nvidia driver using NVIDIA-Linux-x86_64-390.144.run on the proxmox host and then on the container.
Host:
Code:
nvidia-smi
Sat Oct 30 22:27:21 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 390.144 Driver Version: 390.144 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Quadro 600 Off | 00000000:05:00.0 Off | N/A |
| 30% 62C P0 N/A / N/A | 0MiB / 963MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
/etc/modules-load.d/modules.conf:
Code:
# Nvidia modules
nvidia
nvidia_uvm
/etc/udev/rules.d/70-nvidia.rules:
Code:
KERNEL=="nvidia", RUN+="/bin/bash -c '/usr/bin/nvidia-smi -L && /bin/chmod 666 /dev/nvidia*'"
KERNEL=="nvidia_uvm", RUN+="/bin/bash -c '/usr/bin/nvidia-modprobe -c0 -u && /bin/chmod 0666 /dev/nvidia-uvm*'"
Here's the dev list and container config on the host:
Code:
ls -la /dev/nvid*
crw-rw-rw- 1 root root 195, 0 Oct 30 22:16 /dev/nvidia0
crw-rw-rw- 1 root root 195, 255 Oct 30 22:16 /dev/nvidiactl
crw-rw-rw- 1 root root 239, 0 Oct 30 22:16 /dev/nvidia-uvm
crw-rw-rw- 1 root root 239, 1 Oct 30 22:16 /dev/nvidia-uvm-tools
lxc.cgroup.devices.allow: c 195:* rw
lxc.cgroup.devices.allow: c 239:* rw
lxc.mount.entry: /dev/nvidia0 dev/nvidia0 none bind,optional,create=file
lxc.mount.entry: /dev/nvidiactl dev/nvidiactl none bind,optional,create=file
lxc.mount.entry: /dev/nvidia-uvm dev/nvidia-uvm none bind,optional,create=file
lxc.mount.entry: /dev/nvidia-uvm-tools dev/nvidia-uvm-tools none bind,optional,create=file
and on the container:
Code:
ls -la /dev/nvidia*
crw-rw-rw- 1 root root 239, 0 Oct 30 21:16 /dev/nvidia-uvm
crw-rw-rw- 1 root root 239, 1 Oct 30 21:16 /dev/nvidia-uvm-tools
crw-rw-rw- 1 root root 195, 0 Oct 30 21:16 /dev/nvidia0
crw-rw-rw- 1 root root 195, 255 Oct 30 21:16 /dev/nvidiactl
However, when I run nvidia-smi on the container:
Code:
nvidia-smi
Failed to initialize NVML: Unknown Error
Is anyone able to help please?
Thanks
NTB
Last edited: Oct 31, 2021
How to resolve "Failed to initialize NVML: Driver/library version mismatch" error
Overview/Background
You may encounter this error when trying to run a GPU workload or the nvidia-smi command. We have typically seen this when NVIDIA drivers on a node are upgraded but the run-time driver information is not up to date.
A pod impacted by this issue would fail with status RunContainerError and will report the following error under Events:
Warning Failed 91s (x4 over 2m12s) kubelet, ip-10-0-129-17.us-west-2.compute.internal Error: failed to create containerd task: OCI runtime create failed: container_linux.go:349: starting container process caused "process_linux.go:449: container init caused "process_linux.go:432: running prestart hook 0 caused \"error running hook: exit status 1, stdout: , stderr: nvidia-container-cli: initialization error: nvml error: driver/library version mismatch\\n\""": unknown
Verification
Before we try to resolve the issue let’s try to confirm that the issue is actually with the loaded kernel drivers being outdated.
Check the run-time driver information
The command
cat /proc/driver/nvidia/version
will show the run-time information about the driver like so
NVRM version: NVIDIA UNIX x86_64 Kernel Module  440.64.00  Wed Feb 26 16:26:08 UTC 2020
GCC version:  gcc version 4.8.5 20150623 (Red Hat 4.8.5-39) (GCC)
Compare against the version for drivers installed
Compare the NVIDIA driver version obtained above (e.g. 440.64.00) against the drivers you have currently installed.
If you are using host-based drivers you can check the driver version using
rpm -qa | grep nvidia-driver
on Centos/RHEL or
dpkg -l | grep nvidia-driver
on Ubuntu
Example output from a Centos node
...
nvidia-driver-latest-libs-455.32.00-1.el7.x86_64
nvidia-driver-latest-455.32.00-1.el7.x86_64
...
If you are using container-based drivers from the Konvoy NVIDIA addon, then check the version tag for the container in the nvidia-kubeaddons-nvidia-driver- pod.
In the example above from a Centos host, we see that the run-time driver version 440.64.00 is different from the installed version 455.32.00. If you see a discrepancy like this, it confirms that the issue is caused by driver upgrade, and the solution documented here should resolve the issue.
Solution 1: Drain and reboot the worker
Rebooting the node is the easiest way to fix the issue. Rebooting the node will make sure that the drivers are properly initialized after the upgrade.
If you need to upgrade drivers on a GPU worker node, we recommend draining the node, then performing the driver upgrade and then rebooting the node before deploying fresh workloads. If you are using container-based drivers then the recommended procedure to upgrade is documented here.
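A sketch of that drain-and-reboot sequence with kubectl (the node name is a placeholder; adjust the drain flags to your workload policies):

# Cordon and drain the affected GPU worker.
kubectl drain <gpu-node-name> --ignore-daemonsets --delete-emptydir-data

# Reboot the node (on the node itself, or via your provisioning tooling).
sudo reboot

# Once the node is back and Ready, allow scheduling again.
kubectl uncordon <gpu-node-name>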
Solution 2 : Reload NVIDIA kernel modules
This method is more involved and should only be used if draining and rebooting the GPU worker in question is not an option. This will also involve draining any GPU workloads running on the node. If you want to avoid draining and rebooting due to some currently running GPU workloads, then this method offers no advantage. This is useful only if there are non-GPU workloads on the worker node that cannot be drained or the worker node cannot be rebooted for any reason.
Drain GPU workloads
For this method, we would need the GPUs not to be in use and for that we would need to stop any GPU workloads on the impacted node.
Stop the NVIDIA device plugin
Once we have stopped all the GPU workloads, we need to stop the NVIDIA device plugin. It is deployed as a daemonset on GPU workers. To stop it only on the impacted node, we can remove the label konvoy.mesosphere.com/gpu-provider:
kubectl label node konvoy.mesosphere.com/gpu-provider-
Once the label is removed all pods associated with NVIDIA konvoy addon on the worker node will be removed.
Restart kubelet
After stopping the GPU workloads and the device plugin, the last process left using the nvidia kernel module should be the kubelet service. Since the device plugin is no longer running, just restarting the service should stop it from using the kernel module.
sudo systemctl restart kubelet
Check if there are any processes still using NVIDIA drivers
Before attempting to unload the kernel modules, let's check whether any processes are still using the NVIDIA drivers:
sudo lsof /dev/nvidia*
Kill any processes still using the drivers.
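For example, a quick way to do that, assuming every remaining user of the devices really should be terminated:

# Print only the PIDs of processes holding /dev/nvidia* open, then terminate them.
sudo lsof -t /dev/nvidia* | sort -u | xargs -r sudo kill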
Check which NVIDIA kernel modules are loaded
lsmod | grep ^nvidia
example output
nvidia_uvm            939731  0
nvidia_drm             39594  0
nvidia_modeset       1109637  1 nvidia_drm
nvidia              20390418  18 nvidia_modeset,nvidia_uvm
Unload NVIDIA kernel modules
In the example above, the third column above shows modules that are using the module listed in the first column. A module that is being used cannot be removed until the dependant module is removed, hence the order of the following commands is important. Make sure that any running GPU workloads are terminated before removing the modules.
sudo rmmod nvidia_uvm
sudo rmmod nvidia_drm
sudo rmmod nvidia_modeset
sudo rmmod nvidia
Verify that the modules are unloaded
lsmod | grep ^nvidia
should return no output.
Relaunch the NVIDIA addon pods
Now we are ready to relaunch the pods for the NVIDIA addon on the impacted node. Adding the following label back should relaunch the pods and prepare the node to accept GPU workloads again:
kubectl label node konvoy.mesosphere.com/gpu-provider=NVIDIA