Failed to initialize nvml unknown error

1. Issue or feature description: After a random amount of time (it could be hours or days) the GPUs become unavailable inside all the running containers and nvidia-smi returns "Failed to initia...

Hello there.

I’m hitting the same issue here, but with containerd rather than docker.

Here’s my configuration:

  • GPUs:

     # lspci | grep -i nvidia
     00:04.0 3D controller: NVIDIA Corporation GP100GL [Tesla P100 PCIe 16GB] (rev a1)
  • OS:

     # cat /etc/lsb-release
     DISTRIB_ID=Ubuntu
     DISTRIB_RELEASE=22.04
     DISTRIB_CODENAME=jammy
     DISTRIB_DESCRIPTION="Ubuntu 22.04.1 LTS"
  • containerd release:

     # containerd --version
     containerd containerd.io 1.6.8 9cd3357b7fd7218e4aec3eae239db1f68a5a6ec6
  • nvidia-container-toolkit version:

     # nvidia-container-toolkit -version
     NVIDIA Container Runtime Hook version 1.11.0
     commit: d9de4a0
  • runc version:

    # runc --version
    runc version 1.1.4
    commit: v1.1.4-0-g5fd4c4d
    spec: 1.0.2-dev
    go: go1.17.13
    libseccomp: 2.5.1

Note that the NVIDIA container toolkit has been installed by the NVIDIA GPU operator on Kubernetes (v1.25.3).

I attached the containerd configuration file and the nvidia-container-runtime configuration file to my comment.
containerd.txt
nvidia-container-runtime.txt

How I reproduce this bug:

Running on my host the following command:

# nerdctl run -n k8s.io --runtime=/usr/local/nvidia/toolkit/nvidia-container-runtime --network=host --rm -ti --name ubuntu --gpus all -v /run/nvidia/driver/usr/bin:/tmp/nvidia-bin docker.io/library/ubuntu:latest bash

After some time, the nvidia-smi command exits with the error Failed to initialize NVML: Unknown Error.

Traces, logs, etc…

  • Here are the devices listed in the state.json file:
      {
         "type": 99,
         "major": 195,
         "minor": 255,
         "permissions": "",
         "allow": false,
         "path": "/dev/nvidiactl",
         "file_mode": 438,
         "uid": 0,
         "gid": 0
       },
       {
         "type": 99,
         "major": 234,
         "minor": 0,
         "permissions": "",
         "allow": false,
         "path": "/dev/nvidia-uvm",
         "file_mode": 438,
         "uid": 0,
         "gid": 0
       },
       {
         "type": 99,
         "major": 234,
         "minor": 1,
         "permissions": "",
         "allow": false,
         "path": "/dev/nvidia-uvm-tools",
         "file_mode": 438,
         "uid": 0,
         "gid": 0
       },
       {
         "type": 99,
         "major": 195,
         "minor": 254,
         "permissions": "",
         "allow": false,
         "path": "/dev/nvidia-modeset",
         "file_mode": 438,
         "uid": 0,
         "gid": 0
       },
       {
         "type": 99,
         "major": 195,
         "minor": 0,
         "permissions": "",
         "allow": false,
         "path": "/dev/nvidia0",
         "file_mode": 438,
         "uid": 0,
         "gid": 0
       }

Thank you very much for your help. 🙏

I am having an interesting and weird issue.

When I start a docker container with GPU support, it works fine and I can see all the GPUs in docker. However, a few hours or a few days later, I can't use the GPUs in docker anymore.

When I run nvidia-smi inside the docker container, I see this message:

"Failed to initialize NVML: Unknown Error"

However, on the host machine I can see all the GPUs with nvidia-smi. Also, when I restart the docker container, it works fine again and shows all the GPUs.

My inference docker container should be up all the time, running inference depending on server requests. Does anyone have the same issue, or a solution for this problem?

asked Jul 11, 2022 at 1:28

I had the same error. I used Docker's health check as a temporary workaround: when nvidia-smi fails, the container is marked unhealthy and restarted by willfarrell/autoheal.

Docker-compose Version:

services:
  gpu_container:
    ...
    healthcheck:
      test: ["CMD-SHELL", "test -s `which nvidia-smi` && nvidia-smi || exit 1"]
      start_period: 1s
      interval: 20s
      timeout: 5s
      retries: 2
    labels:
      - autoheal=true
      - autoheal.stop.timeout=1
    restart: always
  autoheal:
    image: willfarrell/autoheal
    environment:
      - AUTOHEAL_CONTAINER_LABEL=all
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock
    restart: always

Dockerfile Version:

LABEL autoheal=true
LABEL autoheal.stop.timeout=1

HEALTHCHECK \
    --start-period=60s \
    --interval=20s \
    --timeout=10s \
    --retries=2 \
    CMD nvidia-smi || exit 1

with autoheal daemon:

docker run -d \
    --name autoheal \
    --restart=always \
    -e AUTOHEAL_CONTAINER_LABEL=all \
    -v /var/run/docker.sock:/var/run/docker.sock \
    willfarrell/autoheal

answered Sep 13, 2022 at 14:05

I had the same weird issue. According to your description, it's most likely related to this issue on the official nvidia-docker repo:

https://github.com/NVIDIA/nvidia-docker/issues/1618

I plan to try the solution mentioned in the related thread, which suggests upgrading the cgroup version on the host machine from v1 to v2.

PS: We have verified this solution in our production environment and it really works! Unfortunately, it requires at least Linux kernel 4.5. If upgrading the kernel is not possible, the method mentioned by sih4sing5hog5 can also serve as a workaround.
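For reference, a minimal sketch of checking for and enabling cgroup v2 on a GRUB-based host (adapt the boot loader steps to your distro; the paths and parameter below are the common defaults, not taken from the answer above):

# Check which cgroup filesystem is mounted; "cgroup2fs" means v2 is already active.
stat -fc %T /sys/fs/cgroup/

# On a GRUB-based host: add systemd.unified_cgroup_hierarchy=1 to GRUB_CMDLINE_LINUX
# in /etc/default/grub, then regenerate the config and reboot.
sudo update-grub      # or: grub2-mkconfig -o /boot/grub2/grub.cfg
sudo reboot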

answered Oct 13, 2022 at 4:03

I had the same issue, I just ran screen watch -n 1 nvidia-smi in the container and now it works continuously.

answered Aug 21, 2022 at 17:58

I'm just adding some info to SZALINSKI's answer, to help you understand this problem better, since I have been working on building a TensorFlow container for a few days.

————————————————————————

My docker version is 20.10.8, and from this point I only have nvidia-container-toolkit additionally

installed by

yay -S nvidia-container-toolkit

So I use method 2 from THIS POST, which bypasses the cgroups option.
When using nvidia-container-runtime or nvidia-container-toolkit with the cgroups option enabled, it automatically exposes the machine's GPU resources to the container.

So when you bypass this option, you have to pass the devices yourself. Here's an example.
A single docker run:

docker run --rm --gpus 1 --device /dev/nvidia-caps --device /dev/nvidia0 --device /dev/nvidiactl --device /dev/nvidia-modeset  --device /dev/nvidia-uvm --device /dev/nvidia-uvm-tools nvidia/cuda:11.0-base nvidia-smi

A compose file

version: "3.9"

services:

  nvidia-startup:
    container_name: nv-udm
    image: nvidia/cuda:11.0-base
    devices:
      - /dev/nvidia0:/dev/nvidia0
      - /dev/nvidiactl:/dev/nvidiactl
      - /dev/nvidia-caps:/dev/nvidia-caps
      - /dev/nvidia-modeset:/dev/nvidia-modeset
      - /dev/nvidia-uvm:/dev/nvidia-uvm
      - /dev/nvidia-uvm-tools:/dev/nvidia-uvm-tools

    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    entrypoint: ["nvidia-smi"]

Or you can run docker in privileged mode, which allows docker to access all devices on your machine.
A single docker run:

docker run --rm --privileged --gpus 1 nvidia/cuda:11.0-base nvidia-smi

A compose file

version: "3.9"

services:

  nvidia-startup:
    container_name: nv-udm
    image: nvidia/cuda:11.0-base
    privileged: true
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    entrypoint: ["nvidia-smi"]

————————————————————————

You can also use nvidia-container-runtime,

installed by

yay -S nvidia-container-runtime

which is another option alongside nvidia-container-toolkit.
Edit your /etc/docker/daemon.json:

{
    "data-root": "/virtual/data/docker",
    "runtimes": {
        "nvidia": {
            "path": "/usr/bin/nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}

(data-root is just my personal setting; only the runtimes entry is required.) Then restart docker:

sudo systemctl restart docker

There's no need to edit the docker unit. When you run your container with the additional runtime option, it'll look up the runtime in daemon.json.

For a single docker run, with or without privileged mode, just add the --runtime nvidia option to the command, as sketched below.
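A sketch of the non-privileged single-run variant, reusing the image and device paths from the examples above (the device list may differ on your machine):

docker run --rm --runtime nvidia \
    --device /dev/nvidia0 --device /dev/nvidiactl --device /dev/nvidia-modeset \
    --device /dev/nvidia-uvm --device /dev/nvidia-uvm-tools --device /dev/nvidia-caps \
    nvidia/cuda:11.0-base nvidia-smi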

For a compose file

Without privilege mode

version: "3.9"

services:
  nvidia-startup:
    container_name: nv-udm
    image: nvidia/cuda:11.0-base
    runtime: nvidia
    devices:
      - /dev/nvidia0:/dev/nvidia0
      - /dev/nvidiactl:/dev/nvidiactl
      - /dev/nvidia-caps:/dev/nvidia-caps
      - /dev/nvidia-modeset:/dev/nvidia-modeset
      - /dev/nvidia-uvm:/dev/nvidia-uvm
      - /dev/nvidia-uvm-tools:/dev/nvidia-uvm-tools
    entrypoint: ["nvidia-smi"]

With privilege mode

version: "3.9"

services:
  nvidia-startup:
    container_name: nv-udm
    image: nvidia/cuda:11.0-base
    runtime: nvidia
    privileged: true
    entrypoint: ["nvidia-smi"]

Last edited by howard-o-neil (2021-11-13 16:49:08)


Arch Linux

#1 2021-06-04 17:31:47

[SOLVED] Docker with GPU: «Failed to initialize NVML: Unknown Error»

I'm following the (very simple) instructions here (https://wiki.archlinux.org/title/Docker … VIDIA_GPUs) for the recommended route (nvidia-container-toolkit) and getting "Failed to initialize NVML: Unknown Error" when I try to run any container with --gpus. In particular, I tried the command from here (https://docs.nvidia.com/datacenter/clou … tml#docker), both as an unprivileged user and as root, with the same result.

Also, I attempted to add the kernel param systemd.unified_cgroup_hierarchy=false, but I don't know whether that succeeded: checking /proc/cmdline returns nothing, so it may not have worked (I added it to my boot loader entries).

The output of nvidia-smi on the host looks normal.

My docker version: 20.10.6
My kernel version: 5.12.7
My GPU: NVIDIA® GeForce RTX™ 2060 6GB GDDR6 with Max-Q

Last edited by Californian (2021-06-05 00:33:57)

#2 2021-06-04 23:38:56

Re: [SOLVED] Docker with GPU: «Failed to initialize NVML: Unknown Error»

I've bumped into the same issue after a recent update of the nvidia-related packages. Fortunately, I managed to fix it.

——
Method 1, recommended

1) Kernel parameter
The easiest way to ensure the presence of the systemd.unified_cgroup_hierarchy=false param is to check /proc/cmdline:
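A quick way to see whether the parameter actually made it onto the command line (the grep is just illustrative):

tr ' ' '\n' < /proc/cmdline | grep unified_cgroup_hierarchy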

This of course applies when the parameter is set via the boot loader; you can also hijack this file to set the parameter at runtime (https://wiki.archlinux.org/title/Kernel … ng_cmdline).

2) nvidia-container configuration
In the file /etc/nvidia-container-runtime/config.toml, set the parameter:

no-cgroups = false
After that, restart docker and run a test container:
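For example (any CUDA-enabled image will do; the tag below is only an example):

sudo docker run --rm --gpus all nvidia/cuda:11.0-base nvidia-smi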

———
Method 2
Actually, you can try to bypass cgroups v2 by setting (in the file mentioned above):

no-cgroups = true

Then you must manually pass all GPU devices to the container. Check this answer for the list of required mounts: https://github.com/NVIDIA/nvidia-docker … -851039827
For debugging purposes, just run:
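A sketch of such a debug run, passing the usual NVIDIA device nodes explicitly (adjust the list to what actually exists under /dev on your host):

docker run --rm --gpus all \
    --device /dev/nvidia0 --device /dev/nvidiactl \
    --device /dev/nvidia-uvm --device /dev/nvidia-uvm-tools \
    --device /dev/nvidia-modeset \
    nvidia/cuda:11.0-base nvidia-smi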

Last edited by szalinski (2021-06-04 23:41:06)

#3 2021-06-05 00:32:31

Re: [SOLVED] Docker with GPU: «Failed to initialize NVML: Unknown Error»

Amazing, thank you! The /etc/nvidia-container-runtime/config.toml parameter (no-cgroups = false) was what I was missing, I’ll add that to the wiki.

#4 2021-06-06 10:24:00

Re: [SOLVED] Docker with GPU: «Failed to initialize NVML: Unknown Error»

FWIW, no-cgroups=true is forced by default in the AUR package: https://aur.archlinux.org/cgit/aur.git/ … 08ba39c5df

#5 2021-10-15 12:57:59

Re: [SOLVED] Docker with GPU: «Failed to initialize NVML: Unknown Error»

I have exactly the same problem, and although I thoroughly followed szalinski's instructions, I still have the problem.

My /proc/cmdline contains:

and my /etc/nvidia-container-runtime/config.toml contains:

When I run docker run --gpus all nvidia/cuda:11.4.0-runtime-ubuntu20.04 nvidia-smi, I have the following error:

Here are my GPU and system’s characteristics:

* nvidia-smi’s output: NVIDIA-SMI 470.74 Driver Version: 470.74 CUDA Version: 11.4
* GPU: GeForce RTX 2080 Ti
* Docker version: Docker version 20.10.9, build c2ea9bc90b

Thank you in advance for your help.

EDIT: I eventually solved my problem; it had nothing to do with the solutions proposed above, but I had to run a privileged container with docker run's --privileged option to have access to the GPU:
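A run along those lines, using the same image as above:

docker run --rm --privileged --gpus all nvidia/cuda:11.4.0-runtime-ubuntu20.04 nvidia-smi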

now everything works perfectly.

Last edited by Wmog (2021-10-15 13:32:41)


Failed to initialize NVML: Unknown Error without any kublet update(cpu-manager-policy is default none) #1618

1. Issue or feature description

Yes, @klueska already described this in #1469. And I really tried all of those methods, including using nvidia-device-plugin-compat-with-cpumanager.yml. But the error is still there, so let me give more details.

Failed to initialize NVML: Unknown Error does not occur right after the NVIDIA container is created, nor within a couple of seconds (my kubernetes config file is using the default nodeStatusUpdateFrequency, which is 10s); it happens after a couple of days (sometimes a couple of hours).

  • No kubelet update configuration is set
  • cpu-manager-policy is not set (default is none)
  • container inspect shows all devices mounted with rw permission, but /sys/fs/cgroup/devices/devices.list only shows m (mknod) entries for them (see the check below)
  • If I directly use nvidia-docker to create a container in the kubernetes environment, the error also occurs
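On a cgroup v1 host you can check this from inside an affected container; a quick look (paths per the standard cgroup v1 layout):

# Inside the container: list what the device cgroup currently allows.
cat /sys/fs/cgroup/devices/devices.list
# A healthy GPU container lists the NVIDIA character devices with rw (e.g. "c 195:255 rw");
# in the broken state only "m" (mknod) entries remain, so NVML cannot open the devices.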

2. What I Found

The Error docker:

The healthy docker:

3. Information to attach (optional if deemed irrelevant)

  • Some nvidia-container information: nvidia-container-cli -k -d /dev/tty info
  • Kernel version from uname -a
  • Any relevant kernel output lines from dmesg
  • Driver information from nvidia-smi -a
  • Docker version from docker version
    Client: Docker Engine — Community
    Version: 20.10.5
    API version: 1.41
    Go version: go1.13.15
    Git commit: 55c4c88
    Built: Tue Mar 2 20:18:05 2021
    OS/Arch: linux/amd64
    Context: default
    Experimental: true

Server: Docker Engine — Community
Engine:
Version: 20.10.5
API version: 1.41 (minimum version 1.12)
Go version: go1.13.15
Git commit: 363e9a8
Built: Tue Mar 2 20:16:00 2021
OS/Arch: linux/amd64
Experimental: false
containerd:
Version: 1.4.3
GitCommit: 269548fa27e0089a8b8278fc4fc781d7f65a939b
nvidia:
Version: 1.0.0-rc92
GitCommit: ff819c7e9184c13b7c2607fe6c30ae19403a7aff
docker-init:
Version: 0.19.0
GitCommit: de40ad0

  • NVIDIA packages version from dpkg -l '*nvidia*' or rpm -qa '*nvidia*'
    Desired=Unknown/Install/Remove/Purge/Hold
    | Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait/Trig-pend
    |/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad)
    ||/ Name Version Architecture Description
    +++-=====================================-=======================-=======================-===============================================================================
    un libgldispatch0-nvidia (no description available)
    ii libnvidia-container-tools 1.6.0~rc.2-1 amd64 NVIDIA container runtime library (command-line tools)
    ii libnvidia-container1:amd64 1.6.0~rc.2-1 amd64 NVIDIA container runtime library
    un nvidia-304 (no description available)
    un nvidia-340 (no description available)
    un nvidia-384 (no description available)
    un nvidia-common (no description available)
    ii nvidia-container-runtime 3.6.0~rc.1-1 amd64 NVIDIA container runtime
    un nvidia-container-runtime-hook (no description available)
    ii nvidia-container-toolkit 1.6.0~rc.2-1 amd64 NVIDIA container runtime hook
    un nvidia-docker (no description available)
    ii nvidia-docker2 2.7.0~rc.2-1 all nvidia-docker CLI wrapper
    ii nvidia-prime 0.8.16~0.18.04.1 all Tools to enable NVIDIA's Prime
  • NVIDIA container library version from nvidia-container-cli -V
    version: 1.6.0~rc.2
build date: 2021-11-05T14:19+00:00
build revision: badec1fa4a2c085aa9396f95b6bb1d69f1c7996b
build compiler: x86_64-linux-gnu-gcc-7 7.5.0
build platform: x86_64
build flags: -D_GNU_SOURCE -D_FORTIFY_SOURCE=2 -DNDEBUG -std=gnu11 -O2 -g -fdata-sections -ffunction-sections -fstack-protector -fno-strict-aliasing -fvisibility=hidden -Wall -Wextra -Wcast-align -Wpointer-arith -Wmissing-prototypes -Wnonnull -Wwrite-strings -Wlogical-op -Wformat=2 -Wmissing-format-attribute -Winit-self -Wshadow -Wstrict-prototypes -Wunreachable-code -Wconversion -Wsign-conversion -Wno-unknown-warning-option -Wno-format-extra-args -Wno-gnu-alignof-expression -Wl,-zrelro -Wl,-znow -Wl,-zdefs -Wl,--gc-sections

  • NVIDIA container library logs (see troubleshooting)
  • Docker command, image and tag used

    nvidia-smi command in container returns «Failed to initialize NVML: Unknown Error» after couple of times #1678

    1. Issue or feature description

    The Nvidia GPU works well once the container has started, but after it has been running for a while (maybe several days), the GPUs mounted by the nvidia container runtime become invalid. The nvidia-smi command returns "Failed to initialize NVML: Unknown Error" in the container, while it works well on the host machine.

    nvidia-smi looks fine on the host, and we can see the training process information in the host's nvidia-smi output. But if we now stop the training process, it can no longer be restarted.

    Referring to the solution from issue #1618, we tried to upgrade cgroup to v2, but it does not work.

    Surprisingly, we cannot find any devices.list file in the container, which is mentioned in #1618.

    2. Steps to reproduce the issue

    We find this issue can be reproduced by running "systemctl daemon-reload" on the host, but we have not actually run any similar commands in our production environment.

    Can anyone give some ideas on how to pinpoint this problem?

    3. Information to attach (optional if deemed irrelevant)

    nvidia driver version: 470.103.01

    I've noticed the same behavior for some time on Debian 11, at least since March, which is when I started regularly checking that nvidia-smi works in my containers, and thanks for calling out systemctl daemon-reload as something that triggers it. In my case, I have automatic updates enabled in Debian using unattended upgrades, and your mention of daemon-reload makes me think the package updates may be triggering a daemon-reload event. I'm only updating packages from the Debian repos automatically, applying nvidia-docker and third-party repo updates manually.

    Here is an example from today where I can see the auto-updates happening. In this case telegraf is being updated and then a daemon-reload occurs, or at least that is what I believe I am seeing from systemd[1]: Reloading., based on the output when I manually run a systemctl daemon-reload:

    Do the packages in the debian repositories include the NVIDIA Drivers?

    Fair point and callout: they do. Looking back and cross-checking the times I detected the issue and sent myself a notification against the packages that were upgraded at those times, I recorded the following:


    Failed to initialize NVML: Unknown Error after calling systemctl daemon-reload #1650

    1. Issue or feature description

    Failed to initialize NVML: Unknown Error does not occur when the NVIDIA container is initially created, but it happens after calling systemctl daemon-reload.

    It works fine with Kernel 4.19.91 and systemd 219, but it doesn't work with Kernel 5.10.23 and systemd 239.

    I tried to monitor it with bpftrace:

    During container startup, I can see the event:

    And I can see the devices.list in the container as below:

    But after running systemctl daemon-reload, I find the event:

    And the devices.list in the container is as below:

    The GPU devices are no longer allowed rw access.

    Currently I'm not able to use cgroup v2. Any suggestions? Thanks very much.

    2. Steps to reproduce the issue

    3. Information to attach (optional if deemed irrelevant)

    • [ x] Some nvidia-container information: nvidia-container-cli -k -d /dev/tty info
    • Kernel version from uname -a
    • Any relevant kernel output lines from dmesg
    • Driver information from nvidia-smi -a
    • [ x] Docker version from docker version
    • NVIDIA packages version from dpkg -l '*nvidia*' or rpm -qa '*nvidia*'
    • NVIDIA container library version from nvidia-container-cli -V
    • NVIDIA container library logs (see troubleshooting)
    • Docker command, image and tag used

    I also encountered this problem, which has been occurring for some time.

    @klueska Could you help take a look? Thanks.

    I find these logs during the systemd reload:

    From the major and minor numbers of these devices, I can tell they are the /dev/nvidia* devices. If I manually create these soft links as in the following steps, the problem disappears:
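    For illustration, using the major/minor numbers from the state.json excerpt earlier in this document (always confirm them with ls -l /dev/nvidia* on the affected host, as they can differ):

    # Recreate the /dev/char/<major>:<minor> symlinks that systemd resolves when it
    # re-applies the device cgroup rules on daemon-reload.
    sudo ln -sf /dev/nvidiactl        /dev/char/195:255
    sudo ln -sf /dev/nvidia0          /dev/char/195:0
    sudo ln -sf /dev/nvidia-uvm       /dev/char/234:0
    sudo ln -sf /dev/nvidia-uvm-tools /dev/char/234:1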

    So I wonder if the nvidia toolkit should provide something like a udev rule that triggers the kernel or systemd to create the /dev/char/* -> /dev/nvidia* links?

    Otherwise, is there a configuration file where we can explicitly set DeviceAllow to /dev/nvidia* so that it is recognized by systemd?

    Hey, I have been experiencing this issue for a long time. I solved it by adding --privileged to the containers which need the graphics card; hope this helps.

    Thanks for the response. But I'm not able to use privileged mode because I'm using it in Kubernetes, and it would let the user see all the GPUs.

    I fixed this issue in our env (CentOS 8, systemd 239) perfectly with cgroup v2, for both docker and containerd nodes. I can share the steps for how we fixed it by upgrading from cgroup v1 to cgroup v2, if that's an option for you.

    I’m using cgroups v2 myself so I would be interested in hearing what you did @gengwg

    Sure, here I wrote up the detailed steps for how I fixed it using cgroup v2. Let me know if it works in your env.

    In that case, whatever the trigger is that you’re seeing apparently isn’t the same as mine as all that your instructions do is switch from cgroups v1 to v2. I’m already on cgroups v2 here on Debian 11 (bullseye) and I know that just having cgroups v2 enabled doesn’t fix anything for me.

    Yeah, I do see some people still reporting it on v2, for example this.

    Time-wise, this issue started to appear after we upgraded from CentOS 7 to CentOS 8. All components in the pipeline (kernel, systemd, containerd, nvidia runtime, etc.) got upgraded, so I'm not totally sure which component (or possibly multiple components) caused this issue. In our case, going from v1 to v2 seems to have fixed it for a week or so. I will monitor it in case it comes back.

    It has been over a week. Did you see the error again?

    How to get these logs to find the device numbers for my use case?

    @matifali You can simply use ls -l /dev/nvidia* to find the device ids. For example:

    Here, 7,131 is the major and minor device number for this device.

    I've just fixed the same issue on Ubuntu 22.04 by changing my docker-compose file:
    simply use cgroup2 by commenting out the no-cgroups line in /etc/nvidia-container-runtime/config.toml and change your docker-compose file like this:
    mount the /dev tree to /dev in the container,
    set privileged: true in the docker-compose file,
    and also specify the runtime with "runtime: nvidia".

    Your final docker-compose file will look like this:

    version: '3'
    services:
      nvidia:
        image:
        restart: always
        container_name: Nvidia-Container
        ports:
          - port:port
        privileged: true
        volumes:
          - /dev:/dev
        runtime: nvidia

    And what if we are not using docker-compose, @RezaImany? I am using Terraform to provision with the gpus="all" flag.
    Exposing all devices to the container isn't a good approach, and neither is privileged=true.

    The root cause of this error is that the cgroup controller does not allow the container to reconnect to NVML until a restart; you should adjust the cgroup setup to bypass some of those limitations.

    The --privileged flag gives all capabilities to the container, and it also lifts all the limitations enforced by the device cgroup controller. In other words, the container can then do almost everything that the host can do.

    For my use case, multiple people are using the same machine, and setting privileged=true is not a good idea as the isolation between users is then gone. Is there any other way?


    • #1

    Hi all,
    I'm trying to share a GPU with a Debian Bullseye (11) container. I installed the nvidia driver using NVIDIA-Linux-x86_64-390.144.run on the Proxmox host and then in the container.
    Host:

    Code:

    nvidia-smi
    Sat Oct 30 22:27:21 2021
    +-----------------------------------------------------------------------------+
    | NVIDIA-SMI 390.144                Driver Version: 390.144                   |
    |-------------------------------+----------------------+----------------------+
    | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
    |===============================+======================+======================|
    |   0  Quadro 600          Off  | 00000000:05:00.0 Off |                  N/A |
    | 30%   62C    P0    N/A /  N/A |      0MiB /   963MiB |      0%      Default |
    +-------------------------------+----------------------+----------------------+
    
    +-----------------------------------------------------------------------------+
    | Processes:                                                       GPU Memory |
    |  GPU       PID   Type   Process name                             Usage      |
    |=============================================================================|
    |  No running processes found                                                 |
    +-----------------------------------------------------------------------------+

    /etc/modules-load.d/modules.conf:

    Code:

    KERNEL=="nvidia", RUN+="/bin/bash -c '/usr/bin/nvidia-smi -L && /bin/chmod 666 /dev/nvidia*'"
    KERNEL=="nvidia_uvm", RUN+="/bin/bash -c '/usr/bin/nvidia-modprobe -c0 -u && /bin/chmod 0666 /dev/nvidia-uvm*'"
    
    # Nvidia modules
    nvidia
    nvidia_uvm

    /etc/udev/rules.d/70-nvidia.rules:

    Code:

    KERNEL=="nvidia", RUN+="/bin/bash -c '/usr/bin/nvidia-smi -L && /bin/chmod 666 /dev/nvidia*'"
    KERNEL=="nvidia_uvm", RUN+="/bin/bash -c '/usr/bin/nvidia-modprobe -c0 -u && /bin/chmod 0666 /dev/nvidia-uvm*'"

    Here's the dev list and container config on the host:

    Code:

    ls -la /dev/nvid*
    crw-rw-rw- 1 root root 195,   0 Oct 30 22:16 /dev/nvidia0
    crw-rw-rw- 1 root root 195, 255 Oct 30 22:16 /dev/nvidiactl
    crw-rw-rw- 1 root root 239,   0 Oct 30 22:16 /dev/nvidia-uvm
    crw-rw-rw- 1 root root 239,   1 Oct 30 22:16 /dev/nvidia-uvm-tools
    
    lxc.cgroup.devices.allow: c 195:* rw
    lxc.cgroup.devices.allow: c 239:* rw
    lxc.mount.entry: /dev/nvidia0 dev/nvidia0 none bind,optional,create=file
    lxc.mount.entry: /dev/nvidiactl dev/nvidiactl none bind,optional,create=file
    lxc.mount.entry: /dev/nvidia-uvm dev/nvidia-uvm none bind,optional,create=file
    lxc.mount.entry: /dev/nvidia-uvm-tools dev/nvidia-uvm-tools none bind,optional,create=file

    and on the container:

    Code:

    ls -la /dev/nvidia*
    crw-rw-rw- 1 root root 239,   0 Oct 30 21:16 /dev/nvidia-uvm
    crw-rw-rw- 1 root root 239,   1 Oct 30 21:16 /dev/nvidia-uvm-tools
    crw-rw-rw- 1 root root 195,   0 Oct 30 21:16 /dev/nvidia0
    crw-rw-rw- 1 root root 195, 255 Oct 30 21:16 /dev/nvidiactl
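    (Side note: on a host running cgroup v2, the default from Proxmox VE 7 onwards, the equivalent allow rules use the lxc.cgroup2 prefix, e.g.:

    lxc.cgroup2.devices.allow: c 195:* rw
    lxc.cgroup2.devices.allow: c 239:* rw

    with the same mount entries as above.)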

    However, when I run nvidia-smi in the container:

    Code:

    nvidia-smi
    Failed to initialize NVML: Unknown Error

    Is anyone able to help please?

    Thanks

    NTB

    Last edited: Oct 31, 2021


    How to resolve "Failed to initialize NVML: Driver/library version mismatch" error

    Overview/Background

    You may encounter this error when trying to run a GPU workload or nvidia-smi command. We have typically seen this when NVIDIA drivers on a node are upgraded but the run-time driver information is not up to date.

    A pod impacted by this issue would fail with status RunContainerError and will report the following error under Events 

    Warning Failed 91s (x4 over 2m12s) kubelet, ip-10-0-129-17.us-west-2.compute.internal Error: failed to create containerd task: OCI runtime create failed: container_linux.go:349: starting container process caused "process_linux.go:449: container init caused "process_linux.go:432: running prestart hook 0 caused \"error running hook: exit status 1, stdout: , stderr: nvidia-container-cli: initialization error: nvml error: driver/library version mismatch\\n\""": unknown 
    

    Verification

    Before we try to resolve the issue, let's confirm that it is actually caused by the loaded kernel driver being outdated.

    Check the run-time driver information

    The command 

    cat /proc/driver/nvidia/version 

    will show the run-time information about the driver like so

    NVRM version: NVIDIA UNIX x86_64 Kernel Module  440.64.00  Wed Feb 26 16:26:08 UTC 2020
    GCC version:  gcc version 4.8.5 20150623 (Red Hat 4.8.5-39) (GCC)

      Compare against the installed driver version

      Compare the NVIDIA driver version obtained above (e.g. 440.64.00) against the drivers you have currently installed.

      If you are using host-based drivers you can check the driver version using

    rpm -qa | grep nvidia-driver 

     on Centos/RHEL or

    dpkg -l | grep nvidia-driver 

    on Ubuntu

    Example output from a Centos node 

    ...
    nvidia-driver-latest-libs-455.32.00-1.el7.x86_64
    nvidia-driver-latest-455.32.00-1.el7.x86_64
    ...
    

    If you are using container-based drivers from the Konvoy NVIDIA addon, then check the version tag for the container in the nvidia-kubeaddons-nvidia-driver- pod.

    In the example above from a CentOS host, we see that the run-time driver version 440.64.00 is different from the installed version 455.32.00. If you see a discrepancy like this, it confirms that the issue is caused by the driver upgrade, and the solution documented here should resolve it.

    Solution 1: Drain and reboot the worker

    Rebooting the node is the easiest way to fix the issue. Rebooting the node will make sure that the drivers are properly initialized after the upgrade.

    If you need to upgrade drivers on a GPU worker node, we recommend draining the node, then performing the driver upgrade and then rebooting the node before deploying fresh workloads. If you are using container-based drivers then the recommended procedure to upgrade is documented here.

    Solution 2 : Reload NVIDIA kernel modules

    This method is more involved and should only be used if draining and rebooting the GPU worker in question is not an option. This will also involve draining any GPU workloads running on the node. If you want to avoid draining and rebooting due to some currently running GPU workloads, then this method offers no advantage. This is useful only if there are non-GPU workloads on the worker node that cannot be drained or the worker node cannot be rebooted for any reason.

    Drain GPU workloads

    For this method, we would need the GPUs not to be in use and for that we would need to stop any GPU workloads on the impacted node.

    Stop the NVIDIA device plugin

    Once we have stopped all the GPU workloads, we need to stop the NVIDIA device plugin. It is deployed as a daemonset on GPU workers. To stop it only on the impacted node, we can remove the label konvoy.mesosphere.com/gpu-provider:

    kubectl label node  konvoy.mesosphere.com/gpu-provider- 

    Once the label is removed, all pods associated with the NVIDIA Konvoy addon on the worker node will be removed.

    Restart kubelet

    After stopping the GPU workloads and the device plugin, the last process left using the nvidia kernel module would be the kubelet service. Since the device plugin is no longer running, just restarting the service should stop it from using the kernel module.

    sudo systemctl restart kubelet 

    Check if there are any processes still using NVIDIA drivers

    Before attempting to unload the kernel modules, let's check whether any processes are still using the NVIDIA drivers:

    sudo lsof /dev/nvidia** 
    

     Kill any processes still using the drivers.

     Check which NVIDIA kernel modules are loaded

    lsmod | grep ^nvidia 

    example output

    nvidia_uvm          939731  0
    nvidia_drm           39594  0
    nvidia_modeset     1109637  1 nvidia_drm
    nvidia            20390418  18 nvidia_modeset,nvidia_uvm

    Unload NVIDIA kernel modules

    In the example above, the third column shows the modules that are using the module listed in the first column. A module that is being used cannot be removed until the dependent module is removed, hence the order of the following commands is important. Make sure that any running GPU workloads are terminated before removing the modules.

    sudo rmmod nvidia_uvm
    sudo rmmod nvidia_drm
    sudo rmmod nvidia_modeset
    sudo rmmod nvidia

    Verify that the modules are unloaded

    lsmod | grep ^nvidia 

    should return no output.

    Relaunch the NVIDIA addon pods

    Now we are ready to relaunch the pods for the NVIDIA addon on the impacted node. Adding the following label back should relaunch the pods and prepare the node to accept GPU workloads again:

    kubectl label node  konvoy.mesosphere.com/gpu-provider=NVIDIA 
