CUDA error: out of memory, what to do - Fixing errors and finding the best solutions to problems

I successfully trained the network but got this error during validation:

RuntimeError: CUDA error: out of memory

Mateen Ulhaq

asked Jan 26, 2019 at 1:39

The error occurs because you ran out of memory on your GPU.

One way to solve it is to reduce the batch size until your code runs without this error.
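As a rough sketch of "reduce the batch size until it runs", the helper below halves the batch size each time a step fails with an out-of-memory error. The names `find_fitting_batch_size` and `fake_step` are hypothetical; `fake_step` only simulates a training step so the example runs without a GPU.

```python
def find_fitting_batch_size(run_step, start_batch_size, min_batch_size=1):
    """Halve the batch size until run_step(batch_size) no longer runs out of memory."""
    batch_size = start_batch_size
    while batch_size >= min_batch_size:
        try:
            run_step(batch_size)
            return batch_size
        except RuntimeError as err:
            # PyTorch reports CUDA OOM as a RuntimeError whose text contains this phrase.
            if "out of memory" not in str(err):
                raise
            batch_size //= 2
    raise RuntimeError("no batch size >= min_batch_size fits in memory")


def fake_step(batch_size):
    """Hypothetical stand-in for one training step; pretends only 8 samples fit."""
    if batch_size > 8:
        raise RuntimeError("CUDA error: out of memory")


print(find_fitting_batch_size(fake_step, start_batch_size=32))
```

In real code `run_step` would run one forward and backward pass at the given batch size; any other kind of error is re-raised rather than hidden.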

Mateen Ulhaq

answered Jan 26, 2019 at 7:11

K. Khanda

1. When you are only running validation, not training,
you don't need gradients, so the forward pass doesn't have to keep anything for a backward pass.
In that case, put your code inside

with torch.no_grad():
    ...
    net=Net()
    pred_for_validation=net(input)
    ...

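Under no_grad, the forward pass still uses GPU memory for the model and its outputs, but not for the intermediate results that a backward pass would need.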
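A slightly fuller sketch of a validation loop under `torch.no_grad()` might look like this. The model, data, and the `validate` helper are hypothetical and run on the CPU, so the example works without a GPU; on a GPU the same pattern applies after moving the model and batches to the device.

```python
import torch
from torch import nn


def validate(model, batches, loss_fn):
    """Return the mean loss over `batches` without building an autograd graph."""
    model.eval()  # switch dropout / batch-norm layers to inference behaviour
    total, count = 0.0, 0
    with torch.no_grad():  # nothing is stored for a backward pass
        for inputs, targets in batches:
            loss = loss_fn(model(inputs), targets)
            total += loss.item()  # plain Python float, holds no graph
            count += 1
    return total / max(count, 1)


# Tiny CPU-only demo with a hypothetical model and random data.
model = nn.Linear(4, 2)
batches = [(torch.randn(8, 4), torch.randn(8, 2)) for _ in range(3)]
mean_loss = validate(model, batches, nn.MSELoss())
```

Note that the model is created once outside the loop; constructing it inside the `no_grad` block, as in the snippet above, is not required.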

2. If you accumulate a loss tensor with the += operator,
each addition keeps that iteration's autograd graph alive, so memory grows with every step.
In that case, convert the loss to a Python number with float(), as described in
https://pytorch.org/docs/stable/notes/faq.html#my-model-reports-cuda-runtime-error-2-out-of-memory

The docs recommend float(), but in my case item() also worked:

entire_loss=0.0
for i in range(100):
    one_loss=loss_function(prediction,label)
    entire_loss+=one_loss.item()

3. If you use a for loop in your training code,
variables assigned inside it stay alive until they are overwritten or the loop ends.
In that case, you can explicitly delete them after calling optimizer.step():

for one_epoch in range(100):
    ...
    optimizer.step()
    del intermediate_variable1,intermediate_variable2,...

answered Jan 29, 2019 at 9:32

YoungMin Park

If the GPU memory is being held by another process (for example a crashed or forgotten Python run), find that process and kill it.

Find the PID of the Python process with:

nvidia-smi

Copy the PID and kill the process with:

sudo kill -9 pid
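If you prefer to look up the PID from a script, the sketch below asks nvidia-smi for its per-process list in CSV form and sorts it by memory use. The helper names are hypothetical, and the sample rows in the usage note are made up; the query fields `pid`, `process_name` and `used_memory` are the ones nvidia-smi documents for `--query-compute-apps`.

```python
import subprocess

QUERY = [
    "nvidia-smi",
    "--query-compute-apps=pid,process_name,used_memory",
    "--format=csv,noheader,nounits",
]


def parse_compute_apps(csv_text):
    """Turn nvidia-smi CSV rows into dicts, largest memory user first."""
    rows = []
    for line in csv_text.strip().splitlines():
        pid, rest = line.split(",", 1)
        name, used = rest.rsplit(",", 1)
        used = used.strip()
        rows.append({
            "pid": int(pid),
            "name": name.strip(),
            # Some platforms report "[N/A]" instead of a number of MiB.
            "used_mib": int(used) if used.isdigit() else None,
        })
    return sorted(rows, key=lambda row: row["used_mib"] or 0, reverse=True)


def gpu_processes():
    """Run nvidia-smi (must be on PATH) and return the parsed process list."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_compute_apps(result.stdout)
```

For example, `parse_compute_apps("1234, python, 5000\n999, python3, 12000")` lists PID 999 first. Check what a process is before killing it; it may be someone else's job.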

answered Jun 15, 2020 at 6:47

Milad shiri

I had the same issue and this code worked for me :

import gc

gc.collect()

torch.cuda.empty_cache()

Syscall

answered Apr 2, 2021 at 15:16

This can happen for a number of reasons, which I try to list below:

Module parameters: check the dimensions of your modules. A linear layer that maps a large input vector (e.g., size 1000) to another large output vector (e.g., size 1000) needs a weight matrix of size (1000, 1000).
RNN decoder maximum steps: if you're using an RNN decoder in your architecture, avoid looping for a large number of steps. Usually you fix a number of decoding steps that is reasonable for your dataset.
Tensor usage: minimise the number of tensors you create. The garbage collector won't release them until they go out of scope.
Batch size: to find the largest batch size that fits, increase it incrementally until you run out of memory. It's a common trick that even well-known libraries implement (see the biggest_batch_first description for the BucketIterator in AllenNLP).
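To put a number on the first point, here is a small back-of-the-envelope calculation for the (1000, 1000) linear layer. It assumes float32 values (4 bytes each) and a bias vector; the function name is hypothetical.

```python
def linear_layer_bytes(in_features, out_features, bytes_per_value=4, bias=True):
    """Memory taken by the parameters of one fully connected layer."""
    params = in_features * out_features + (out_features if bias else 0)
    return params * bytes_per_value


weights = linear_layer_bytes(1000, 1000)
print(f"parameters alone: {weights / 1e6:.1f} MB")

# Assumption: training with an optimizer such as Adam, which keeps two extra
# values per parameter, plus one gradient per parameter: four copies in total.
print(f"with gradients and Adam state: {4 * weights / 1e6:.1f} MB")
```

A single such layer is small, but the cost grows with the product of the two dimensions, and the activations stored for backward come on top of it.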

In addition, I recommend having a look at the official PyTorch documentation: https://pytorch.org/docs/stable/notes/faq.html

answered Jan 26, 2019 at 16:28

I am a PyTorch user. In my case, the error message was not actually caused by a lack of GPU memory but by a version mismatch between PyTorch and CUDA.

To check whether GPU memory is really the cause, run the code below.

import torch
foo = torch.tensor([1,2,3])
foo = foo.to('cuda')

If even this code raises the error, you are better off reinstalling PyTorch to match your CUDA version. (In my case, this solved the problem.)

The same kind of mismatch can also occur with TensorFlow/Keras.

answered May 20, 2021 at 8:59

If you are getting this error in Google Colab use this code:

import torch
torch.cuda.empty_cache()

answered Jun 9, 2021 at 14:55

In my experience, this is not a typical CUDA OOM Error caused by PyTorch trying to allocate more memory on the GPU than you currently have.

The giveaway is the distinct lack of the following text in the error message.

Tried to allocate xxx GiB (GPU Y; XXX GiB total capacity; yyy MiB already allocated; zzz GiB free; aaa MiB reserved in total by PyTorch)

In my experience, this is an Nvidia driver issue. A reboot has always solved the issue for me, but there are times when a reboot is not possible.

One alternative to rebooting is to kill all Nvidia processes and reload the drivers manually. I always refer to the unaccepted answer of this question written by Comzyh when performing the driver cycle. Hope this helps anyone trapped in this situation.
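The distinction described in this answer can be checked mechanically. The sketch below looks for the allocator details in an error message and reports whether they are present. The regular expression follows the message format quoted in this thread; other PyTorch versions may word the message differently, so treat the pattern as an assumption.

```python
import re

_OOM_DETAILS = re.compile(
    r"Tried to allocate (?P<requested>[\d.]+ \w+) "
    r"\(GPU (?P<gpu>\d+); (?P<total>[\d.]+ \w+) total capacity; "
    r"(?P<allocated>[\d.]+ \w+) already allocated; "
    r"(?P<free>[\d.]+ \w+) free; "
    r"(?P<reserved>[\d.]+ \w+) reserved in total by PyTorch\)"
)


def classify_oom(message):
    """Return ("allocator", details), ("bare", {}) or ("other", {})."""
    match = _OOM_DETAILS.search(message)
    if match:
        return "allocator", match.groupdict()
    if "out of memory" in message:
        # No allocation details: possibly a driver or environment problem.
        return "bare", {}
    return "other", {}


kind, details = classify_oom("RuntimeError: CUDA error: out of memory")
print(kind, details)
```

A "bare" result matches the situation described above, where a reboot or driver reload is worth trying before tuning batch sizes.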

answered Nov 21, 2022 at 19:02

If someone arrives here because of fast.ai, the batch size of a loader such as ImageDataLoaders can be controlled via bs=N where N is the size of the batch.

My dedicated GPU is limited to 2GB of memory, using bs=8 in the following example worked in my situation:

from fastai.vision.all import *
path = untar_data(URLs.PETS)/'images'

def is_cat(x): return x[0].isupper()
dls = ImageDataLoaders.from_name_func(
    path, get_image_files(path), valid_pct=0.2, seed=42,
    label_func=is_cat, item_tfms=Resize(244), num_workers=0, bs=8)

learn = cnn_learner(dls, resnet34, metrics=error_rate)
learn.fine_tune(1)

answered Aug 22, 2020 at 19:57

dgellow

Problem solved by the following code:

import os
os.environ['CUDA_VISIBLE_DEVICES']='2, 3'
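A small sketch of the same idea as a helper follows. The assumption here is that GPUs 2 and 3 were the free ones on the poster's machine; the variable has to be set before CUDA is initialised, which in practice means before the first CUDA call in the process. The function name is hypothetical.

```python
import os


def restrict_visible_gpus(indices):
    """Expose only the given physical GPUs to this process."""
    value = ",".join(str(index) for index in indices)
    os.environ["CUDA_VISIBLE_DEVICES"] = value
    return value


# Physical GPUs 2 and 3 will then appear inside the process as cuda:0 and cuda:1.
restrict_visible_gpus([2, 3])
```

This does not free any memory; it only steers the process away from GPUs that other jobs already occupy.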

answered May 22, 2021 at 13:54

ah bon

Not sure if this’ll help you or not, but this is what solved the issue for me:

export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128

Nothing else in this thread helped.
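The same setting can be applied from Python instead of the shell, as sketched below. The helper `allocator_conf` is hypothetical; it only joins option names and values into the `key:value,key:value` form that the variable expects. The variable should be set before the process makes its first CUDA allocation.

```python
import os


def allocator_conf(**options):
    """Build a PYTORCH_CUDA_ALLOC_CONF value such as 'max_split_size_mb:128'."""
    return ",".join(f"{key}:{value}" for key, value in options.items())


os.environ["PYTORCH_CUDA_ALLOC_CONF"] = allocator_conf(max_split_size_mb=128)
```

Smaller split sizes reduce fragmentation at some cost in allocator efficiency, so 128 is a starting point rather than a universal value.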

answered Sep 17, 2022 at 6:34

WaXxX333

I faced the same issue with my computer. All you have to do is customize your configuration file to match your computer’s specifications. Turns out my computer takes image sizes below 600 X 600 and when I adjusted the same in the configuration file, the program ran smoothly.

Gino Mempin

answered Dec 14, 2020 at 5:30

Source

Hi,

my environment is:

windows 10  
10700K CPU   with 16GB ram 
3090 GPU with 24G memory  
driver version: 461.40  
cuda version: 11.0  
cudnn version: cudnn-11.0-windows-x64-v8.0.5.39  
SSD 512GB  
torch 1.7.1

datasets information:

10180 images with 1080P resolution  
epoch 100
batch size 32
batch count 319

metric                            num_workers=1   num_workers=2   num_workers=4
CPU RAM used                      9.5G            9.5G            unknown (CUDA out of memory)
GPU RAM used                      17.5G           17.5G           unknown (CUDA out of memory)
GPU power (W)                     257             357             unknown (CUDA out of memory)
epochs                            100             100             100
batch size                        32              32              32
batch count                       319             319             319
total training time (hours)       15              6.459           unknown
time per epoch (minutes)          9               3.875           unknown
mean time per batch (seconds)     1.69            0.73            unknown
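The timing rows in the table are consistent with each other, which a few lines of arithmetic can confirm. This is only a sanity check of the reported numbers; the function name is hypothetical.

```python
def training_time(batches_per_epoch, seconds_per_batch, epochs):
    """Return (minutes per epoch, total hours) from the mean batch time."""
    epoch_minutes = batches_per_epoch * seconds_per_batch / 60
    total_hours = epoch_minutes * epochs / 60
    return epoch_minutes, total_hours


for workers, seconds in ((1, 1.69), (2, 0.73)):
    minutes, hours = training_time(319, seconds, 100)
    print(f"num_workers={workers}: {minutes:.2f} min/epoch, {hours:.2f} h total")
```

Both results land within rounding of the reported 9 and 3.875 minutes per epoch, so the slowdown with one worker comes entirely from the per-batch time.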

My questions are:

Training with num_workers=1 takes more than twice as long as with num_workers=2 (15 hours versus about 6.5).
Much of that time seems to be spent on the CPU side, as num_workers=2 uses 2 loader workers to prepare a batch of images.
With num_workers=2, plenty of RAM remained free. Why did the CUDA out of memory error occur with num_workers=4?
num_workers only changes how many workers load images; it does not change the batch size. In other words, the GPU memory needed should be the same for num_workers=2 and num_workers=4. So why did the CUDA out of memory error occur?
On my computer the largest num_workers that works is 2, yet in your code nw = min([os.cpu_count() // world_size, batch_size if batch_size > 1 else 0, workers]) # number of workers, num_workers is set to 8. Does this work on your computer? What is your computer's configuration?
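To see what the quoted `nw = min([...])` expression evaluates to, it can be wrapped in a function and fed this machine's numbers. The 16-core figure comes from the NumExpr line in the log below; `world_size=1` (a single GPU process) is an assumption, and the function name is hypothetical.

```python
def dataloader_workers(cpu_count, world_size, batch_size, workers):
    """Same expression as the quoted line, with os.cpu_count() passed in."""
    return min([cpu_count // world_size, batch_size if batch_size > 1 else 0, workers])


print(dataloader_workers(cpu_count=16, world_size=1, batch_size=32, workers=8))
print(dataloader_workers(cpu_count=16, world_size=1, batch_size=32, workers=4))
```

The expression only caps the requested `workers` value by the core count and the batch size, so lowering the `workers` argument is the direct way to get fewer loader workers.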

Analyzing anchors... anchors/target = 5.86, Best Possible Recall (BPR) = 1.0000
Image sizes 640 train, 640 test
Using 4 dataloader workers
Logging results to runs\train\exp7
Starting training for 100 epochs...
     Epoch   gpu_mem       box       obj       cls     total   targets  img_size
  0%|          | 0/319 [00:00<?, ?it/s]Note: NumExpr detected 16 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 8.
NumExpr defaulting to 8 threads.
  0%|          | 0/319 [00:02<?, ?it/s]
Traceback (most recent call last):
  File "D:/code_python/har_hailiang/har_hdd/algo/train.py", line 288, in train
    pred = model(imgs)  # forward
  File "C:\ProgramData\Anaconda3\lib\site-packages\torch\nn\modules\module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "D:\code_python\har_hailiang\har_hdd\algo\models\yolo.py", line 122, in forward
    return self.forward_once(x, profile)  # single-scale inference, train
  File "D:\code_python\har_hailiang\har_hdd\algo\models\yolo.py", line 138, in forward_once
    x = m(x)  # run
  File "C:\ProgramData\Anaconda3\lib\site-packages\torch\nn\modules\module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "D:\code_python\har_hailiang\har_hdd\algo\models\common.py", line 35, in forward
    return self.act(self.bn(self.conv(x)))
  File "C:\ProgramData\Anaconda3\lib\site-packages\torch\nn\modules\module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "C:\ProgramData\Anaconda3\lib\site-packages\torch\nn\modules\conv.py", line 423, in forward
    return self._conv_forward(input, self.weight)
  File "C:\ProgramData\Anaconda3\lib\site-packages\torch\nn\modules\conv.py", line 419, in _conv_forward
    return F.conv2d(input, weight, self.bias, self.stride,
RuntimeError: CUDA out of memory. Tried to allocate 3.85 GiB (GPU 0; 24.00 GiB total capacity; 1.47 GiB already allocated; 20.35 GiB free; 1.54 GiB reserved in total by PyTorch)
python-BaseException

Source

Rig with 7 GTX 1660 Ti cards
NBMiner 42.2 (same on version 39.5)
Windows 10
8 GB of RAM.
Page file is 60 GB. Increasing it doesn't help!!!
The miner starts and throws an out-of-memory error. I'm attaching screenshots of the error and of the page file settings.
I updated the miner and switched to PhoenixMiner 6.2; still the same error.
Please help me beat this mess. The rig is at a remote site.

P.S. I did use the search. Everyone only talks about the page file.

Same problem on a 1660 Super.
I've been trying different options for an hour now, nothing helps…

The card that drives the display can no longer fit the DAG file in its memory.

The card that drives the display can no longer fit the DAG file in its memory.

Nope, the display is driven by the integrated graphics.

My Phoenix also crashed an hour ago; I switched to GMiner.

Then I don't know. My 1660 Ti works, but the display is driven by an 8 GB card.)

My Phoenix also crashed an hour ago; I switched to GMiner.

I had the latest GMiner 3.01 installed, and that's where the errors started.

The rig is at a remote site. There's a dummy plug in the card. I connect via AnyDesk…

Also got "Cuda Error: out of memory" an hour ago; it started working on NBMiner 40.1.

The DAG crept up unnoticed, even though it was visible from afar.

It's too early for 6 GB cards to drop out;
that should happen in 2023 or so.

On a 3060 the old Phoenix also dropped out. I installed PhoenixMiner 6.2c and it works.

It's too early for 6 GB cards to drop out;
that should happen in 2023 or so.

Don't forget about the OS.

I had the latest GMiner 3.01 installed, and that's where the errors started.

That's odd. My rig on HiveOS ran stably with 6 RX 570 8 GB cards; the latest Phoenix miner started complaining about the DAG, so I tried NBMiner and GMiner and everything is fine.

Mine crashed on Windows too (PhoenixMiner_5.5c_Windows); I updated to PhoenixMiner_5.6d_Windows and everything works.

I'll try updating the video driver. Currently 425.31 is installed. Downloading 472…

I noticed it too: the DAG is reported as 4.9 GB, but the 1660 Super started losing hashrate; the 8 GB cards have only 1.64 GB free, and the 12 GB cards have only 5 GB free with 7 GB in use, with a 4.9 GB DAG, mind you. I'm running the latest T-Rex. Someone please explain!

Mine crashed on Windows too (PhoenixMiner_5.5c_Windows); I updated to PhoenixMiner_5.6d_Windows and everything works.

Why not the current version, 6.2? If you update, update to the current one.

Are you dense? Phoenix wrote it all out:

2022.06.26:23:58:49.823: GPU1 GPU1: Allocating DAG for epoch #501 (4.91) GB
2022.06.26:23:58:49.828: GPU1 GPU1: Generating DAG for epoch #501
2022.06.26:23:58:49.828: GPU1 GPU1: Unable to generate DAG for epoch #501; please upgrade to the latest version of PhoenixMiner
2022.06.26:23:58:49.828: GPU1 GPU1 initMiner error: Unable to initialize CUDA miner
2022.06.26:23:58:49.828: wdog Fatal error detected. Restarting.

Source

GPU Errors in Mining

The most complete collection of mining errors on Windows, HiveOS and RaveOS, with quick and painless fixes

Can’t find nonce with device CUDA_ERROR_LAUNCH_FAILED

This error means the miner cannot find a nonce, and it suggests the fix itself: reduce the overclock. Beginners in particular try to squeeze the maximum out of a card and push the core or memory clocks too far. A card overclocked like this may even start, but it will then produce errors like this one. Remember: steadily submitting shares to the pool is better than chasing numbers in the miner.

PhoenixMiner: Connection to API server failed - what to do?

This error occurs with PhoenixMiner on HiveOS. It means the mining farm/rig cannot connect to the statistics server. To fix it:

Run the net-test command and note the server with the lowest ping. Then select that server in the Hive web interface (on the worker) and reboot the rig.
If that does not help, run dnscrypt -i && sreboot

PhoenixMiner CUDA error in CudaProgram.cu:474 : the launch timed out and was terminated (702)

Like the first error, this one points to an excessive overclock. Reset the card to factory settings, then raise the overclock step by step, stopping just short of the point where the error reappears.

UNABLE TO ENUM CUDA GPUS: INVALID DEVICE ORDINAL

Check the GPU drivers and the card itself (how it is listed in Device Manager, and whether it has an exclamation mark).
If all is well, check the risers. A riser is often the cause of this error.

UNABLE TO ENUM CUDA GPUS: INSUFFICIENT CUDA DRIVER: 5000

As with the previous error, check the GPU drivers and the card itself (how it is listed in Device Manager, and whether it has an exclamation mark).

NBMINER MINING PROGRAM UNEXPECTED EXIT.CODE: -1073740791, REASON: PROCESS CRASHED

NBMiner exits with code -1073740791 when the rig/farm mixes Nvidia and AMD cards. In that case split mining into two .bat files (or two flight sheets on HiveOS): one for the AMD cards, the other for the Nvidia cards.

NBMINER CUDA ERROR: OUT OF MEMORY (ERR_NO=2) - how to fix it?

One of the most common errors on Windows is running out of memory, here in NBMiner, though it also occurs in the NiceHash miner. The fix is to enlarge the page file: it should equal the combined VRAM (in GB) of all cards in the rig plus a 10% margin.
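The page file rule above (sum of the VRAM of all cards plus 10%) is easy to compute. Here is a minimal sketch with a hypothetical function name, using a rig of seven 6 GB cards as the example.

```python
def recommended_pagefile_gb(vram_per_card_gb, margin=0.10):
    """Sum of the VRAM of all cards plus a safety margin, in GB."""
    return sum(vram_per_card_gb) * (1 + margin)


rig = [6] * 7  # seven 6 GB cards
print(f"{recommended_pagefile_gb(rig):.1f} GB")
```

If the page file already exceeds this figure and the error remains, the page file is probably not the cause and the other suggestions (miner version, DAG size versus VRAM) are worth checking.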

GMINER ERROR ON GPU: OUT OF MEMORY STOPPED MINING ON GPU0

Here the culprit is most likely not the page file but an excessive overclock on the card numbered 0. Reduce the overclock and the error should go away.

Socket error. the remote host closed the connection, in NBMiner

It may also appear as "ERROR - Failed to establish connection to mining pool: Socket operation timed out".
This is a network problem: check the rig's internet connection and restart the router.
The ISP may also be closing the connection to the pool. Change the pool, try a VPN, or switch DNS to an external provider, for example Cloudflare 1.1.1.1, 1.0.0.1

Server not responded on share, in GMiner

This error means something is wrong with your internet connection, which is critical for GMiner. Try restarting the router and disabling the watchdog in the miner.

DAG has been damaged check overclocking settings, in GMiner

The error may also read: Device not responding, check overclocking settings.
It points to an excessive overclock, so try reducing it first.
If that does not help, change the miner: GMiner has never been known for working well with AMD cards. We recommend switching to TeamRedMiner, or to lolMiner if you need one miner for both Nvidia and AMD cards.
If changing the miner does not help, reinstall the video driver.
If that does not help either, test the card on its own in an x16 slot.

ERROR: Can’t start T-Rex, failed to initialize device map: can’t get busid, code -6

Memory configuration errors with code -6 usually indicate a driver problem.

On Windows, use DDU (Display Driver Uninstaller) to remove all Nvidia drivers completely.
Reboot.
Install a fresh driver directly from the Nvidia website.
Reboot again.
On HiveOS/RaveOS, flash a clean system image, to be sure.

TREX: Can’t unlock GPU

Full error text:
TREX: Can’t unlock GPU [ID=1, GPU #1], error code 15
WARN: Miner is going to shutdown…
WARN: NVML: can’t get fan speed for GPU #1, error code 15
WARN: NVML: can’t get power for GPU #1, error code 15
WARN: NVML: can’t get mem/core clock for GPU #1, error code 17

Solution:

Check all cable connections of the card and the riser, especially the power cables.
If those are fine, swap the riser for a known-good one.
If the error remains, plug the card directly into an x16 slot on the motherboard.

CAN’T START MINER, FAILED TO INITIALIZE DEVICE MAP, CAN’T GET BUSID, CODE -6

In this particular case the problem was the power supply, which could not handle 3 cards. The error disappeared after the power supply was replaced.
If you are sure your power supply is powerful enough, try a different miner.

ERROR: GPU REPORTS 511 DEGREES

A 511 reading points to a faulty riser or a problem with the card's power. Check all connections. To find the fault, start the system with a single card, test it, and then add cards one at a time.

GPU driver error, no temps in HiveOS - what to do?

You most likely hit this error while mining on HiveOS. It can have several causes, both software and hardware (for example a riser).
You can try the low-effort fix first by running this command in HiveOS:
hive-replace -y --stable
The system will reinstall the stable version of HiveOS.

If the error persists, check the riser.

GPU are lost, rebooting

This is not an error but the consequence of one. To find out which error makes the cards reboot, do the following:

Enable log saving (off by default) with the command

logs-on

and reboot the rig.
Once the error recurs, you can download the logs with the commands below.
This command downloads the miner logs straight from the dashboard:

message file "miner.log" -f=/var/log/miner/minername/minername.log

For example, to get the TeamRedMiner logs:
message file "teamredminer.log" -f=/var/log/miner/teamredminer/teamredminer.log

The command you sent is shown in blue and the uploaded file in white. Click the file to download it.
This command downloads the system log:

message file "syslog" -f=/var/log/syslog

exitcode=3 in HiveOS

If the error persists, check the riser.

exitcode=1 in HiveOS

This error occurs when the date in the motherboard BIOS is wrong (the clock has been reset) and/or there is a problem with the internet connection.
If the clock is wrong, you will not be able to connect remotely.
Even so, the Nvidia driver update should go through with the command:

nvidia-driver-update --list

gpu fault detected 146

You are most likely mining with PhoenixMiner. There are two fixes:

Roll back to an older version, for example 5.4c
(Recommended) Use T-Rex for Nvidia cards and TeamRedMiner for AMD.

Waiting interface to come up - VPN not working on HiveOS

Start with the logs to see which error causes the problem.
Commands to get the logs:
systemctl status openvpn@client
journalctl -u openvpn@client -e --no-pager -n 100

How to find the IP address of a HiveOS worker

The easiest way is to open the worker and scroll down below the list of GPUs. The Remote IP shown there is the external IP.
Alternatively, you can check the external IP address from the Hive Shell console.
Run one of these commands:
curl 2ip.ru
wget -qO- eth0.me
wget -qO- ipinfo.io/ip
wget -qO- ipecho.net/plain
wget -qO- icanhazip.com
wget -qO- ipecho.net
wget -qO- ident.me

Repository update failed in HiveOS

Occasionally seen on HiveOS. Full error text:

Some index files failed to download. They have been ignored, or old ones used instead.
Repository update failed
------------------------------------------------------
> Restarting autofan and watchdog
> Starting miners
Miner screen is already running
Run miner or screen -r to resume screen
Upgrade failed

Solution:

Run apt update && selfupgrade -f
If that does not work either, the HiveOS developers are almost certainly already aware of the problem and working on it. Try updating again later.

RaveOS does not start. Boot aborted in RaveOS

Double-check all the PC and motherboard BIOS settings:
- Set the boot device to HDD/SSD/M2/USB, depending on where the OS is installed.
- Enable 4G decoding.
- Set PCIe support to Auto.
- Enable the integrated graphics.
- Set the preferred boot mode to Legacy mode.
- Disable virtualization.

If some cards are still not detected after this, apply the following BIOS settings (a full reboot is required after each step):

- Disable 4G decoding
- Reboot
- Disable CSM
- Reboot
- Enable 4G decoding and set PCI-E to Gen2/3; if Gen2/3 is not available, choose Gen1

Failed to allocate memory in RaveOS

The same error may also appear as:
failed to allocate initramfs memory bailing out, failed to load ldlinux.c32
or
failed to allocate memory for kernel boot parameter block
or
failed to allocate initramfs memory raveos bailing

Either way there is one fix: configure the motherboard BIOS correctly.

gpu_driver_fault, GPU #0 fault in RaveOS

In most cases this is fixed by reducing the overclock (especially on memory) on the affected card (card number 0 in this example).
If that does not help, try updating the drivers.
If updating the drivers does not solve it, swap the riser on this card for a known-good one.
If even that does not help, re-check all cable connections and whether the power supply is powerful enough for your configuration.

Gpu driver fault. All tasks have been stopped. Worker will be rebooted after 5 minutes in RaveOS

What causes this error? You have probably overclocked the card too far (memory is often pushed too hard); reduce the overclock. The error message shows which GPU is failing (GPU 1 in this example), so start with that one.
The second common cause is a power supply that is too weak for the system and its cards. Note that the system itself draws at least 100 W, and you should budget another 50 W per riser. The PSU should have a 20% margin on top.
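The power budget described above can be sketched as a short calculation. The per-card wattage in the example is hypothetical (use the real draw of your cards), one riser per card is assumed, and the 20% margin is interpreted as the total load multiplied by 1.2.

```python
def required_psu_watts(gpu_watts, system_watts=100, riser_watts=50, headroom=0.20):
    """Estimate the PSU rating: system + cards + one riser per card, plus headroom."""
    load = system_watts + sum(gpu_watts) + riser_watts * len(gpu_watts)
    return load * (1 + headroom)


# Hypothetical rig: six cards drawing 120 W each.
print(f"{required_psu_watts([120] * 6):.0f} W")
```

If the rig uses several power supplies, the same estimate applies to each one separately for the cards and risers it feeds.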

Miner restarted after error in RaveOS

Check the miner logs: they show the specific error that leads to the miner restart. Then find that error on this page and fix it, and the problem will go away.

Miner restart limit reached. Worker rebooting by flag auto in RaveOS

As with the previous item, check the miner logs: they show the specific error that causes the worker to restart. Fix that error and this problem goes away too.

Miner cannot be started, RaveOS

Another error is usually logged right before this one, and that error is the real cause. If there is none:

Pause the miner, reboot the rig, and run the commands clear-miners, clear-logs and fix-fs in the console. Then start mining.
If the error persists, rewrite the RaveOS image.

Overclock can’t be applied in RaveOS

This error means the overclock values conflict with each other or are out of range. Double-check them. Reset the overclock to stock and try again.
In rare cases a riser also causes this error.

Error installing hive miners

You can try the low-effort fix first by running this command in HiveOS:
hive-replace -y --stable
The system will reinstall the stable version of HiveOS.

If the error persists, physically rewrite the image. If you boot from a USB flash drive, it has most likely died. Buy an SSD.

Warning: Nvidia settings applied with errors

Excessive overclock. Lower the core and memory clocks, then reboot the rig.

Nvtool error or Danger: nvtool error

Most likely a problem with the nvtool module appeared during driver installation.
Try reinstalling the Nvidia driver from Hive Shell:
nvidia-driver-update driver_version --force
Or try updating the whole system from Hive Shell:
hive-replace -y --stable

GPU fan no longer shown in HiveOS

0% fan speed.
This can happen for several reasons:

the fan really is not spinning
the RPM sensor is disconnected or broken
the card is running too aggressively (high overclock)
the riser or one of its parts is faulty

ERROR: parsing JSON failed

Run the following command locally on the rig (with a keyboard and monitor attached):
net-test

It shows the current state of your connection to the various mirrors of the HiveOS API servers.
See which API server has the lowest latency (ping) and, once the worker reappears in the dashboard, change the default mirror to the one closest to you.
After changing the mirror, be sure to reboot the worker.
You can change the API server with the command nano /hive-config/rig.conf
After editing, press Ctrl+O and Enter to save the file.
Then exit to the console with Ctrl+X, F10 and run the command hello

NVML: can’t get fan speed for GPU #5, error code 999 in HiveOS

A problem with the fan speed on GPU 5
0% fan speed / errors in general
This can happen for several reasons:
- the fan really is not spinning
- the RPM sensor is disconnected or broken
- the card is running too aggressively (high overclock)
Start with a visual inspection of the card and its fan.

Can’t get power for GPU #2

This error usually appears alongside others:
Attribute ‘GPUGraphicsClockOffset’ was already set to 0
Attribute ‘GPUMemoryTransferRateOffset’ was already set to 2200
Attribute ‘GPUFanControlState’ (hive1660s_ETH:0[gpu:2]) assigned value 0.

20211029 12:40:50 WARN: NVML: can’t get fan speed for GPU #2, error code 999
20211029 12:40:50 WARN: NVML: can’t get power for GPU #2, error code 999
20211029 12:40:50 WARN: NVML: can’t get mem/core clock for GPU #2, error code 999

Solution:

Check that the driver for the card is installed correctly.
If the driver is fine, try different overclock settings, for example a lower memory overclock.

GPU1 search error: unspecified launch failure

Reduce the overclock and check the riser contacts.

Warning: Autofan: unable to set fan speed, rebooting

Find the miner logs and see which errors are logged there. For example:

kernel: [12112.410046][ T7358] NVRM: GPU at PCI:0000:0c:00: GPU-236e3bef-2e03-6cdb-0518-7ac01eb8736d
kernel: [12112.410049][ T7358] NVRM: Xid (PCI:0000:0c:00): 62, pid=7317, 0000(0000) 00000000 00000000
kernel: [12112.433831][ T7358] NVRM: Xid (PCI:0000:0c:00): 45, pid=7317, Ch 00000010
CRON[21094]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)

The logs show a problem with the card in PCIe slot 0c:00 (the PCIe bus address is logged in place of the GPU number), with Xid errors 45 and 62.
Error codes (including others that may appear there) and what to do about them:

• 13, 43, 45: memory errors, lower MEM
• 8, 31, 32, 61, 62: lower CORE, possibly MEM too
• 79: lower CORE, check the riser
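The lookup described above can be automated for longer logs. The sketch below pulls the PCI address and Xid code out of each NVRM line and attaches the advice from the list; the regular expression is modelled on the log lines quoted above, and the names are hypothetical.

```python
import re

# Advice per Xid code, taken from the list above.
XID_ADVICE = {
    **dict.fromkeys((13, 43, 45), "memory error: lower MEM"),
    **dict.fromkeys((8, 31, 32, 61, 62), "lower CORE, possibly MEM too"),
    79: "lower CORE, check the riser",
}

XID_LINE = re.compile(r"NVRM: Xid \(PCI:(?P<bus>[0-9a-fA-F:.]+)\): (?P<code>\d+),")


def xid_report(log_text):
    """Map each PCI address to the Xid codes seen for it, with suggested actions."""
    report = {}
    for match in XID_LINE.finditer(log_text):
        code = int(match["code"])
        advice = XID_ADVICE.get(code, "not covered by the list")
        report.setdefault(match["bus"], {})[code] = advice
    return report


sample = (
    "kernel: [12112.410049][ T7358] NVRM: Xid (PCI:0000:0c:00): 62, pid=7317, 0000(0000) 00000000 00000000\n"
    "kernel: [12112.433831][ T7358] NVRM: Xid (PCI:0000:0c:00): 45, pid=7317, Ch 00000010\n"
)
print(xid_report(sample))
```

Codes that are not in the list are still reported, so nothing in the log is silently dropped.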

Kernel-Power error, code 41

Check all the cables (PSU to cards, PSU to risers); something may be melting somewhere. If a visual inspection shows everything is fine, the error is a software one and you need to reinstall Windows.

Danger: hive-replace -y --stable (failed, exitcode=137)

A very rare error that appeared during a remote update of the HiveOS image. It does not come up in mining groups or on mining sites. You won't believe what happened.
A family of pigeons had settled on the balcony where the rig stood. They literally covered the rig in droppings, which kept knocking it offline. After the motherboard and cards were thoroughly blown clean, the problem went away on its own.

MALFUNCTION in HIVEOS

Malfunction means a fault. There can be several causes and fixes:

Reinstall the video driver.
If that does not help, disconnect all GPUs and add them back one at a time, checking whether a particular card triggers the error. If one does, the riser may be at fault.
The drive holding HiveOS is faulty; write the image again.

Source

27 Sep 2022 1:13 PM +00:00 UTC

Try these tips and the Stable Diffusion runtime error will be a thing of the past.

If the Stable Diffusion runtime error is preventing you from making art, here is what you need to do.

Stable Diffusion is one of the best AI image generators out there. Unlike DALL-E and MidJourney AI, Stable Diffusion is available for the public and anyone with a powerful machine can generate images from texts.

However, Stable Diffusion might sometimes run into memory issues and stop working. If you are experiencing the Stable Diffusion runtime error, try the following tips.

How To Fix Runtime Error: CUDA Out Of Memory In Stable Diffusion

So you are running Stable Diffusion locally on your PC, maybe trying to make some NSFW images and bam! You are hit by the infamous RuntimeError: CUDA out of memory.

The error is accompanied by a long message that basically looks like this. The amount of memory may change but the content is the same.

RuntimeError: CUDA out of memory. Tried to allocate 30.00 MiB (GPU 0; 6.00 GiB total capacity; 5.16 GiB already allocated; 0 bytes free; 5.30 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

It appears you have run out of GPU memory. It is worth mentioning that you need at least 4 GB VRAM in order to run Stable Diffusion. If you have 4 GB or more of VRAM, below are some fixes that you can try.

Restarting the PC worked for some people.
Reduce the resolution. Start with 256 x 256 resolution. Just change the -W 256 -H 256 part in the command.
Try this fork as it requires a lot less VRAM according to many Reddit users.

If the issue persists, don’t worry. We have some additional troubleshooting tips for you to try. Keep reading!

Other Troubleshooting Tips

So you have tried all the simple and quick fixes but the runtime error seems to have no intention to leave you, huh? No worries! Let’s dive into relatively more complex steps. Here you go.

As mentioned in the error message, set the PYTORCH_CUDA_ALLOC_CONF environment variable first: PYTORCH_CUDA_ALLOC_CONF=garbage_collection_threshold:0.6,max_split_size_mb:128 (no space after the comma). Then run the image generation command with --n_samples 1.
Call the optimized Python script. Use the following command: python optimizedSD/optimized_txt2img.py --prompt "a drawing of a cat on a log" --n_iter 5 --n_samples 1 --H 512 --W 512 --precision full
You can also try removing the safety checks aka NSFW filters, which take up 2GB of VRAM. Just replace scripts/txt2img.py with this:
https://github.com/JustinGuese/stable-diffusor-docker-text2image/blob/master/txt2img.py
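The first two tips can be combined in a small launcher. This is only a sketch under stated assumptions: `build_alloc_conf` and `build_command` are hypothetical helpers, the script path and flags are the ones quoted in the tips, and the launch line is left commented out because it only works from inside a Stable Diffusion checkout that contains the optimizedSD folder.

```python
import os


def build_alloc_conf(gc_threshold=0.6, max_split_mb=128):
    """Format the allocator options as one comma-separated value without spaces."""
    return f"garbage_collection_threshold:{gc_threshold},max_split_size_mb:{max_split_mb}"


def build_command(prompt, n_samples=1, n_iter=5, height=512, width=512):
    """Argument list for the optimized script named in the tips above."""
    return [
        "python", "optimizedSD/optimized_txt2img.py",
        "--prompt", prompt,
        "--n_iter", str(n_iter),
        "--n_samples", str(n_samples),
        "--H", str(height),
        "--W", str(width),
        "--precision", "full",
    ]


# Copy the current environment and add the allocator setting for the child process.
env = dict(os.environ)
env["PYTORCH_CUDA_ALLOC_CONF"] = build_alloc_conf()
command = build_command("a drawing of a cat on a log")

print(env["PYTORCH_CUDA_ALLOC_CONF"])
print(" ".join(command))
# To actually launch it from the Stable Diffusion checkout:
# import subprocess; subprocess.run(command, env=env, check=True)
```

Passing the variable through `env` keeps the setting local to that one run instead of changing it for the whole shell session.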

Hopefully, one of the suggestions will work for you and you will be able to generate images again. Now that the Stable Diffusion runtime error is fixed, have a look at how to access Stable Diffusion using Google Colab.

Source

I’ve been struggling a little with Cycles and my GeForce GTX 750 graphics card. Most of the time it works well, but when I work with scenes that have custom shaders (like a skin shader with multiple textures, including one for displacement), I get an error, both in the viewport and when rendering:

Cancel | CUDA error: Out of memory in cuLaunchKernel(cuPathTrace, xblocks, yblocks, 1, xthreads, ythreads, 1, 0, 0, args, 0)

Or something like that. (The post included a screenshot of the error.)

I’m using a PC with Windows 7 and 8 GB of RAM. I can’t render this scene on the GPU, but on the CPU it renders fine.

My question is: what is causing this issue? I have the latest drivers installed for my graphics card, so I have no idea what this is.

360ueck

asked Oct 31, 2014 at 18:51

The short answer is that SSS (subsurface scattering) on the GPU eats up a lot of memory, so much so that it is recommended to have more than 1 GB of memory on your GPU. This was mentioned in one of the videos from the Blender Conference (unfortunately I can’t remember which one). Updating your drivers won’t really help, as that can’t add more memory, so for now you are stuck with rendering on the CPU. If you have multiple objects with SSS shaders on them, you could try rendering them one at a time and then compositing them back together.

answered Oct 31, 2014 at 20:01

BlendingJake

I assume you are using the same graphics card for both display and compute/rendering. If so, hundreds of megabytes may already be in use by windows and applications for display purposes.

If you then attempt to render with Cycles, it needs to allocate another chunk of memory to run a program (Cycles) on your GPU, and on top of that, memory for all the scene data and textures.

The first chunk of memory cannot currently be measured accurately. But running an experimental kernel uses significantly more memory than the normal one (again, this can be several hundred megabytes).

So if you are unlucky and have a 2 GB graphics card, you might only be able to use 700 MB of memory for textures and scene data (this is what Blender measures and reports). If you go over this, you might put the card into a state where it can no longer allocate enough memory for display, and this might result in artefacts like the ones in your screenshot.

answered Nov 2, 2014 at 21:05
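To check a scene against a budget like the 700 MB mentioned in the answer above, texture sizes can be estimated by hand. The sketch below assumes textures sit in memory uncompressed with four channels, which is an assumption for illustration and not a statement about how Cycles stores them. The texture names and sizes are made up.

```python
def texture_mib(width, height, channels=4, bytes_per_channel=1):
    """Uncompressed size of one texture in MiB."""
    return width * height * channels * bytes_per_channel / (1024 * 1024)


# Hypothetical texture set for a skin shader; names and sizes are illustrative only.
textures = {
    "diffuse 4096x4096, 8-bit RGBA": texture_mib(4096, 4096),
    "specular 4096x4096, 8-bit RGBA": texture_mib(4096, 4096),
    "displacement 4096x4096, 32-bit float RGBA": texture_mib(4096, 4096, bytes_per_channel=4),
}

budget_mib = 700  # the usable figure suggested in the answer for a 2 GB card
total_mib = sum(textures.values())
for name, size in textures.items():
    print(f"{name}: {size:.0f} MiB")
print(f"total {total_mib:.0f} MiB of a {budget_mib} MiB budget")
```

Under these assumptions, three large textures already take more than half of the budget before any geometry is counted, and the float texture is the most expensive one.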

I found that reducing the tile size (in the Performance panel of the Render tab) reduces memory usage. Rendering will be slightly slower (https://youtu.be/8gSyEpt4-60?t=671) but will consume less memory and allow the render to complete.

answered Nov 18, 2018 at 2:58
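The trade-off in the answer above can be illustrated with simple arithmetic: smaller tiles mean fewer pixels in flight at once but more tiles to schedule. This sketch only counts tiles and does not measure Blender’s real memory use. `tile_stats` is a hypothetical helper and the 1920 x 1080 frame is an assumed example.

```python
import math


def tile_stats(width, height, tile_x, tile_y):
    """Number of tiles needed to cover the frame, and pixels per full tile."""
    tiles = math.ceil(width / tile_x) * math.ceil(height / tile_y)
    return tiles, tile_x * tile_y


for tile in (256, 128, 64):
    tiles, pixels = tile_stats(1920, 1080, tile, tile)
    print(f"{tile}x{tile} tiles: {tiles} tiles, {pixels} pixels per tile")
```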

This sounds like the same issue I had with cuCtxCreate. I found out that it has something to do with the memory transfer rate of the graphics card; my workaround was to lower the Memory Transfer Rate offset. I am using Fedora 25 Linux, though, and had to tweak some files in order to lower it in the NVIDIA X Server settings.

My current transfer rate is 810 MHz, both min and max.

My proof of this is that I can now run GPU rendering without errors, though what I’m rendering right now might not come close to what you’re rendering, graphics-card-wise.

I have a dedicated NVIDIA GT 640M with 2 GB.
By the way, have you checked Blender’s console? You might find something useful there.

The idea wasn’t mine; I’m only sharing what I experienced in the hope that it helps someone. Lowering your GPU’s Memory Transfer Rate offset is up to you. To me it’s like playing two AAA games on a frying pan disguised as a laptop: my GPU temperature reached 81 degrees Celsius on the first try, with no electric fan to cool the laptop.

GPU Rendering issue "Cuda error at cuCtxCreate: Illegal Address"
This is the link to my previous issue that got answered.

answered Jan 30, 2017 at 4:13

I solved the problem by rendering with CPU+GPU in Blender 2.8 (Linux, Ubuntu). You have to install the NVIDIA driver from NVIDIA via the update menu.

answered Nov 25, 2018 at 9:39

My previous message was only a temporary solution. As my file got bigger, I had to use lots of tricks to avoid this CUDA error and stop the program from crashing during rendering. I only have problems when rendering in render image mode (F12); rendering in the viewport never causes any problems. Unfortunately, for animation and fewer fireflies I need the full render mode. My file is 1 GB now, and an 11 GB video card with 16 GB of system memory is not enough to render it without using the hard disk as memory.

Steps to improve the situation:
- Use Alt+D instead of Shift+D to duplicate objects. Alt+D creates a linked duplicate, which keeps the file smaller because it doesn’t increase the file size.
- Don’t use an HDR image; it is best to use an RGB input as the background.
- Use the Decimate modifier on objects with lots of vertices. I could reduce the number of vertices for some objects by 70% with hardly any visible loss in quality. You can also bake normals for a bigger decrease in object size with less loss in quality.
- Even after doing all this I got problems again, and solved them by increasing the swap file size by 32 GB on Ubuntu.

If I get problems again, I guess I will try a rendering service, where you send what you want to render and download the result when it is ready.

answered Feb 17, 2019 at 11:01

I am running three graphics cards in my system: one 2 GB card and two 4 GB cards. I kept getting the CUDA out of memory error. After reading through these postings, I realized that by disabling my 2 GB card (the primary display) in the CUDA stack, it works fine now. My primary display card was running out of memory and Blender couldn’t utilize the other two.

So, try disabling your primary display card in the CUDA stack and see if that helps.

answered Mar 3, 2019 at 16:25
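The answer above most likely refers to Blender’s own device settings. A different, more general way to hide a card from a CUDA program is the `CUDA_VISIBLE_DEVICES` environment variable, sketched below. The device indices are assumptions for the three-card setup described above (index 0 for the 2 GB display card), `visible_devices` is a hypothetical helper, and the executable name in the commented launch line is also an assumption.

```python
import os


def visible_devices(all_indices, excluded):
    """Comma-separated device list with the excluded indices removed."""
    return ",".join(str(i) for i in all_indices if i not in excluded)


env = dict(os.environ)
# Number devices by PCI bus position so the indices match what nvidia-smi shows.
env["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
# Assumption: index 0 is the 2 GB display card; 1 and 2 are the 4 GB cards.
env["CUDA_VISIBLE_DEVICES"] = visible_devices([0, 1, 2], excluded={0})
print(env["CUDA_VISIBLE_DEVICES"])
# To launch a CUDA program with this environment (executable name is an assumption):
# import subprocess; subprocess.run(["blender"], env=env, check=True)
```

Check the real indices with nvidia-smi before relying on them, since the display card is not always device 0.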

Source



Read also: