CUDA error: unspecified launch failure (err_no=4): fixing errors and finding the best solutions

Your problem is simple. Since when do Linux users not read the readme first and foremost, guys? You cannot use ANY -mt on Linux because it will install the Windows-only driver to speed up VRAM timings. The problem was trying to install something that cannot be installed, and I suspect the OS kernel itself shuts the program down for attempting it.

Not sure what you’re talking about. This is straight from the readme:

`-mt, --memory-tweak Memory timings optimize for Nvidia GDDR5 & GDDR5X gpus. range [1-6]. Higher value equals higher hashrate. Individual value can be set via comma seperated list. Power limit may need to be tuned up to get more hashrate. Higher reject share ratio can happen if mining rig hits high temperature, set lower value of -mt can reduce reject ratio. Under windows, a custom driver need to be installed when using -mt, can installed manually by option --driver, or run nbminer.exe with admin privilege to perform auto-install. Under linux, admin priviledge is needed to run, sudo ./nbminer -mt x. OhGodAnETHlargementPill is not needed anymore if -mt is enabled when mining on 1080 & 1080ti GPUs.`

Read the last line. It specifically says Linux.


Contents

"CUDA Error: unspecified launch failure (err_no=4)" with Memory tweak mode activated #220
Comments
RuntimeError: CUDA error: unspecified launch failure #31702
Comments
Expected behavior
Environment
Additional context
RuntimeError: CUDA error: unspecified launch failure #74235
Comments
Issue description
System Info
Training fails with "CUDA_ERROR_LAUNCH_FAILED: unspecified launch failure" on Windows 10 #39958
Comments

"CUDA Error: unspecified launch failure (err_no=4)" with Memory tweak mode activated #220

NBMiner crashes instantly if Memory tweak mode is activated, with the following error: "CUDA Error: unspecified launch failure (err_no=4)". I have tried different memory tweak modes (from 1 to 6).

I suspect it’s BIOS related, but I do not know which setting to change. 4G Decoding is enabled, PCI-e Gen is set to ‘2’.

Motherboard: X11SAT-F Supermicro (1.0a 04/29/2016)
CPU: 8 × Intel(R) Xeon(R) CPU E3-1240L v5 @ 2.10GHz AES
GPU: 2x Asus GeForce GTX 1080 8119 MB — Micron GDDR5X — vBIOS 86.04.17.00.11
nVidia Driver version: 460.39
OS: hiveos-0.6-190-stable (Ubuntu 18.04)
NBMiner: 36.1


Any developments on this? I'd like to know more about this and maybe other CUDA errors that show up in NBMiner.

For me the error apparently occurs when I set the MemoryTransferRateOffset too high, and the miner enters a cycle of restarting and crashing. With an offset a little lower than before, the miner crashes once but then starts again and works normally (and, curiously, very stably!) indefinitely. However, the driver API is buggy at that point and I can't perform any write operation until I restart the computer.

So I would suggest reviewing your memory settings: try lowering them and see if the problem persists.


RuntimeError: CUDA error: unspecified launch failure #31702

Search datasets.
Original length: 900
Offset: 0
Limit: 900
Final length: 900
Search datasets.
Original length: 120
Offset: 0
Limit: 120
Final length: 120
Using CuDNN in the experiment.
Traceback (most recent call last):

File "", line 1, in <module>
runfile('C:/Users/Administrator/Desktop/haoxiang_CRN/train.py', wdir='C:/Users/Administrator/Desktop/haoxiang_CRN')

File "C:\Users\Administrator\Anaconda3\lib\site-packages\spyder_kernels\customize\spydercustomize.py", line 827, in runfile
execfile(filename, namespace)

File "C:\Users\Administrator\Anaconda3\lib\site-packages\spyder_kernels\customize\spydercustomize.py", line 110, in execfile
exec(compile(f.read(), filename, 'exec'), namespace)

File "C:/Users/Administrator/Desktop/haoxiang_CRN/train.py", line 103, in <module>
main(config, resume=resume)

File "C:/Users/Administrator/Desktop/haoxiang_CRN/train.py", line 93, in main
validation_dataloader=valid_data_loader

File "C:\Users\Administrator\Desktop\haoxiang_CRN\trainer\trainer.py", line 27, in __init__
super(Trainer, self).__init__(config, resume, model, optimizer, loss_function)

File "C:\Users\Administrator\Desktop\haoxiang_CRN\trainer\base_trainer.py", line 15, in __init__
self.model = model.to(self.device)

File "C:\Users\Administrator\Anaconda3\lib\site-packages\torch\nn\modules\module.py", line 432, in to
return self._apply(convert)

File "C:\Users\Administrator\Anaconda3\lib\site-packages\torch\nn\modules\module.py", line 208, in _apply
module._apply(fn)

File "C:\Users\Administrator\Anaconda3\lib\site-packages\torch\nn\modules\module.py", line 230, in _apply
param_applied = fn(param)

File "C:\Users\Administrator\Anaconda3\lib\site-packages\torch\nn\modules\module.py", line 430, in convert
return t.to(device, dtype if t.is_floating_point() else None, non_blocking)

RuntimeError: CUDA error: unspecified launch failure

Expected behavior

On a small dataset training runs fine; the error above appears when the dataset becomes larger.

Environment

Collecting environment information.
PyTorch version: 1.2.0
Is debug build: No
CUDA used to build PyTorch: 10.0

OS: Microsoft Windows 10 Professional
GCC version: Could not collect
CMake version: Could not collect

Python version: 3.7
Is CUDA available: Yes
CUDA runtime version: 10.0.130
GPU models and configuration: Could not collect
Nvidia driver version: Could not collect
cuDNN version: C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v10.0\bin\cudnn64_7.dll

Versions of relevant libraries:
[pip] numpy==1.16.5
[pip] numpydoc==0.9.1
[pip] torch==1.2.0
[pip] torchvision==0.4.0
[conda] blas 1.0 mkl defaults
[conda] mkl 2019.4 245 defaults
[conda] mkl-service 2.3.0 py37hb782905_0 defaults
[conda] mkl_fft 1.0.14 py37h14836fe_0 defaults
[conda] mkl_random 1.1.0 py37h675688f_0 defaults
[conda] torch 1.2.0 pypi_0 pypi
[conda] torchvision 0.4.0 pypi_0 pypi

Additional context

win10[1903]
Graphics card model: RTX2060
anaconda 4.7.12
Graphics driver 441.20
CUDA:cuda_10.0.130_411.31_win10
CUDNN:cudnn-10.0-windows10-x64-v7.6.0.64
Python:3.7.4

Do you have any good suggestions for this kind of problem?


RuntimeError: CUDA error: unspecified launch failure #74235

Issue description

RuntimeError: CUDA error: unspecified launch failure
The error occurs with any training script. It is not deterministic and can occur at any time during training.
The same code works fine on an RTX 3090.

/lib/python3.8/site-packages/torch/autograd/__init__.py
Variable._execution_engine.run_backward(
RuntimeError: CUDA error: unspecified launch failure

System Info

PyTorch version: 1.10.2+cu113
Is debug build: False
CUDA used to build PyTorch: 11.3
ROCM used to build PyTorch: N/A

OS: Pop!_OS 20.04 LTS (x86_64)
GCC version: (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0
Clang version: Could not collect
CMake version: Could not collect
Libc version: glibc-2.31

Python version: 3.8.10 (default, Nov 26 2021, 20:14:08) [GCC 9.3.0] (64-bit runtime)
Python platform: Linux-5.15.15-76051515-generic-x86_64-with-glibc2.29
Is CUDA available: True
CUDA runtime version: 11.2.67
GPU models and configuration:
GPU 0: NVIDIA RTX A6000
GPU 1: NVIDIA RTX A6000

Nvidia driver version: 470.86
cuDNN version: Probably one of the following:
/usr/lib/cuda-11.2/targets/x86_64-linux/lib/libcudnn.so.8.1.1
/usr/lib/cuda-11.2/targets/x86_64-linux/lib/libcudnn_adv_infer.so.8.1.1
/usr/lib/cuda-11.2/targets/x86_64-linux/lib/libcudnn_adv_train.so.8.1.1
/usr/lib/cuda-11.2/targets/x86_64-linux/lib/libcudnn_cnn_infer.so.8.1.1
/usr/lib/cuda-11.2/targets/x86_64-linux/lib/libcudnn_cnn_train.so.8.1.1
/usr/lib/cuda-11.2/targets/x86_64-linux/lib/libcudnn_ops_infer.so.8.1.1
/usr/lib/cuda-11.2/targets/x86_64-linux/lib/libcudnn_ops_train.so.8.1.1
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

Versions of relevant libraries:
[pip3] numpy==1.22.3
[pip3] torch==1.10.2+cu113
[pip3] torchaudio==0.10.2+cu113
[pip3] torchvision==0.11.3+cu113
[conda] Could not collect


Hi, do you have a code snippet which we can use to reproduce the issue?

I'm using the ultralytics yolov5 repo to train the model. The command I'm using is
python train.py --img 640 --batch 32 --epochs 400 --data idd.yaml --weights yolov5x.pt --rect --image-weights --evolve --device 0,1 --multi-scale --name demo-img --patience 30 --save-period 1 --worker 22 --quad
The error is very random: it can happen in the very first epoch or in the 10th, and there is no way to know when it will happen.
The CPU I'm using is an AMD Ryzen Threadripper PRO 3975WX.

Error encountered when replacing the A6000 with RTX3090

RuntimeError: CUDA error: unspecified launch failure
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
terminate called after throwing an instance of ‘c10::CUDAError’
what(): CUDA error: unspecified launch failure
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Exception raised from record at /home/avlabs_blue/pytorch/aten/src/ATen/cuda/CUDAEvent.h:119 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string ) + 0x6c (0x7f495477d0ac in /home/avlabs_blue/.virtualenvs/ultralytics/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #1: + 0xed2b62 (0x7f4881e3bb62 in /home/avlabs_blue/.virtualenvs/ultralytics/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so)
frame #2: + 0xed67d6 (0x7f4881e3f7d6 in /home/avlabs_blue/.virtualenvs/ultralytics/lib/python3.8/site-packages/torch/lib/libtorch_cuda.so)
frame #3: + 0x43c3fc (0x7f494d8493fc in /home/avlabs_blue/.virtualenvs/ultralytics/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #4: c10::TensorImpl::release_resources() + 0x175 (0x7f4954765f35 in /home/avlabs_blue/.virtualenvs/ultralytics/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #5: + 0x339449 (0x7f494d746449 in /home/avlabs_blue/.virtualenvs/ultralytics/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #6: + 0x645dc2 (0x7f494da52dc2 in /home/avlabs_blue/.virtualenvs/ultralytics/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #7: THPVariable_subclass_dealloc(_object*) + 0x2f5 (0x7f494da53145 in /home/avlabs_blue/.virtualenvs/ultralytics/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #8: python() [0x5eccdb]
frame #9: python() [0x5aee8a]
frame #10: python() [0x613925]
frame #11: python() [0x5d1e78]
frame #12: python() [0x5a958d]
frame #13: python() [0x5ed1a0]
frame #14: python() [0x544188]
frame #15: python() [0x5441da]
frame #16: python() [0x5441da]

frame #22: __libc_start_main + 0xf3 (0x7f495c8be0b3 in /lib/x86_64-linux-gnu/libc.so.6)

Aborted (core dumped)
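The log above suggests passing CUDA_LAUNCH_BLOCKING=1 for debugging. A minimal Python sketch of one way to do that from inside a training script follows. It assumes the variable has to be in the environment before CUDA is initialized, so it is set before torch is imported. The helper names `cuda_debug_enabled` and `run_step` are hypothetical and not part of PyTorch or yolov5.

```python
import os

# Set the variable before torch (and therefore CUDA) is initialized.
# With synchronous launches, a failing kernel is reported at the call
# that launched it instead of at some later, unrelated API call.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

try:
    import torch
except ImportError:  # keeps the sketch importable on machines without PyTorch
    torch = None


def cuda_debug_enabled() -> bool:
    """Report whether synchronous kernel launches were requested."""
    return os.environ.get("CUDA_LAUNCH_BLOCKING") == "1"


def run_step(step_fn):
    """Run one step (hypothetical callable), then wait for pending GPU work.

    The explicit synchronize makes an asynchronous failure surface here,
    next to the step that caused it.
    """
    result = step_fn()
    if torch is not None and torch.cuda.is_available():
        torch.cuda.synchronize()
    return result
```

Synchronous launches slow training down noticeably, so this is something to enable only while hunting the failing call.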

Environment Details
Collecting environment information.
PyTorch version: 1.12.0a0+gitd5744f4
Is debug build: False
CUDA used to build PyTorch: 11.2
ROCM used to build PyTorch: N/A

OS: Pop!_OS 20.04 LTS (x86_64)
GCC version: (Ubuntu 9.4.0-1ubuntu1~20.04) 9.4.0
Clang version: Could not collect
CMake version: Could not collect
Libc version: glibc-2.31

Python version: 3.8.10 (default, Nov 26 2021, 20:14:08) [GCC 9.3.0] (64-bit runtime)
Python platform: Linux-5.16.11-76051611-generic-x86_64-with-glibc2.29
Is CUDA available: True
CUDA runtime version: 11.2.67
GPU models and configuration: GPU 0: NVIDIA GeForce RTX 3090
Nvidia driver version: 510.54
cuDNN version: Probably one of the following:
/usr/lib/cuda-11.2/targets/x86_64-linux/lib/libcudnn.so.8.1.1
/usr/lib/cuda-11.2/targets/x86_64-linux/lib/libcudnn_adv_infer.so.8.1.1
/usr/lib/cuda-11.2/targets/x86_64-linux/lib/libcudnn_adv_train.so.8.1.1
/usr/lib/cuda-11.2/targets/x86_64-linux/lib/libcudnn_cnn_infer.so.8.1.1
/usr/lib/cuda-11.2/targets/x86_64-linux/lib/libcudnn_cnn_train.so.8.1.1
/usr/lib/cuda-11.2/targets/x86_64-linux/lib/libcudnn_ops_infer.so.8.1.1
/usr/lib/cuda-11.2/targets/x86_64-linux/lib/libcudnn_ops_train.so.8.1.1
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

Versions of relevant libraries:
[pip3] numpy==1.22.3
[pip3] torch==1.12.0a0+gitd5744f4
[pip3] torchaudio==0.10.2+cu113
[pip3] torchvision==0.13.0a0+00c119c
[conda] magma-cuda110 2.5.2 1 pytorch
[conda] mkl 2022.0.1 h06a4308_117
[conda] mkl-include 2022.0.1 h06a4308_117
[conda] numpy 1.21.2 py38hd8d4704_0
[conda] numpy-base 1.21.2 py38h2b8c604_0
[conda] torch 1.12.0a0+gitd5744f4 dev_0
(ultralytics) avlabs_blue@pop-os:/mnt/storage$


Training fails with "CUDA_ERROR_LAUNCH_FAILED: unspecified launch failure" on Windows 10 #39958

System information

Have I written custom code (as opposed to using a stock example script provided in TensorFlow): yes
OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Windows 10
Mobile device (e.g. iPhone 8, Pixel 2, Samsung Galaxy) if the issue happens on mobile device: No
TensorFlow installed from (source or binary): binary (via pip install)
TensorFlow version (use command below): v2.2.0-rc4-8-g2b96f3662b 2.2.0
Python version: 3.6.8
Bazel version (if compiling from source): Not applicable
GCC/Compiler version (if compiling from source): Not applicable
CUDA/cuDNN version: cuda 10.1.243_426.00_win10 (CUDA Toolkit 10.1 update2 Archive ) / cuDNN v7.6.5 (November 5th, 2019), for CUDA 10.1
GPU model and memory: GeForce GTX 1070, Compute Capability 6.1, 8Gb;
also on GeForce GTX 1050 Ti 4Gb on another Win 10 machine with the same cuda version installed

Describe the current behavior
I use TF Keras API to define the model and the training process.

The training stops with a CUDA_ERROR_LAUNCH_FAILED: unspecified launch failure error after several training epochs.

It happens in a reproducible way (the same Python file fails after the same number of epochs).
When I change the random seed (so the dataset is shuffled differently), the process fails after a different number of epochs!
Changing the batch size also affects the number of epochs until the training fails (randomly)!
I use a batch size of 1 or 2 as I have a rather large model.

Describe the expected behaviour
Training runs for the requested number of epochs without the error.


Hi everyone, guys. I put a Strix 1080 in the rig. What is this error?

Overclocked too far, no? )

Core 0, memory 0, PL 120. It started up at 30 MH/s.

It says right there: unspecified error.
How would we know? It's an unspecified error, after all.

It's obvious.
CUDA Error: unspecified launch failure (err_no=4)


Please give me some actual advice.

Go for it.

Memory tweak not supported on device five…

Seriously: reboot the machine and you'll be fine ) The card simply hung because of the initial excessive overclock.

I rebooted; same thing…
I powered it off and on, and at stock settings it started working. After 9 hours the error came back, only now it is "all CUDA error". The miner throws this error several times on restart, and then on a later restart it begins working normally…

Once again: your card is hanging. Whether it's the memory or the core, I have no idea; without the error log I can't tell. I think in your case you should look at cooling first. Maybe it drops out from overheating, or there is some other problem. Watch the temperature and hashrate while mining and go from there: reduce the overclock on the core or the memory, turn the fans up, cool the area around the card. Maybe it's sitting in the hottest spot in the middle of the rig )

It's running with no overclock, a steady 30.3, fans at 50, temperature 40.

Then dig into the wiring: try powering the riser and the card from a different cable harness.

The bottom line is the same: the card hangs, and something is knocking it out. If you rule out overheating and overclocking, and it still drops out after a while at stock, look for the problem in the power delivery.

Run a stress test: download FurMark and run it for about 15 minutes.

Did you manage to solve the problem?
I'm getting the same error now on a 3060 Ti.

Try rebooting the rig. It seems that the miner can "hang" if you make OC changes, and a reboot can fix the issue.

A restart didn't help, but I found out the GPU cannot handle the OC. After lowering the OC further, everything went back to normal.

I used to have this error until I stepped the overclocks down a little.

I got the same issue when I added one Gaming X Trio RTX 3090. Even changing the miner and the pool doesn't fix the problem. All the other cards are working well.

Try another miner… NBMiner is not one of the miners recommended by HiveOS.

Give lolMiner a try; it is recommended for Nvidia and AMD.

I see. Will give it a try. Thank you.

I troubleshot the same problem. The steps I took to solve it: updated the BIOS; reset the BIOS settings to Gen 2 and also set M.2 to Gen 2; overclocks are set to 1100 cc, 1000 mc, power 320 W (80%) on the FTW 3080 Ti and 315 W on the Vision 3080 Ti. I also set the page file as follows: with a total of 6 cards at 12 GB each, set the max to 6 × 12 = 72 GB (72000 MB in Windows), and for the min take half of that.

This is part of my CUDA code. The last part of it produces an error message.

unsigned int *mat_count;
off_t *mat_position;
unsigned int *matches_count;
off_t *matches_position;
......
cudaMalloc ( (void **) &mat_count,    sizeof(unsigned int)*10);
cudaMalloc ( (void **) &mat_position, sizeof(off_t)*10);
......
matches_count    = (unsigned int *)malloc(sizeof(unsigned int)*10);
matches_position = (off_t *)malloc(sizeof(off_t)*10);
for ( i = 0 ; i < 10 ; i++ ) {
    matches_count   [i] = 0;
    matches_position[i] = 0;
}
......
cudaMemcpy (mat_count,    matches_count   , sizeof(unsigned int)*10, cudaMemcpyHostToDevice );
cudaMemcpy (mat_position, matches_position, sizeof(off_t)*10,        cudaMemcpyHostToDevice );
......
match<<<BLK_SIZE,THR_SIZE>>>(
        reference_total,
        indextable_total,
        sequences, 
        start_sequence, 
        sequence_length, 
        end_sequence,
        ref_base,
        idx_base,
        msk_base,
        mat_count,
        mat_position,
        reference,
        first_indexes,
        seqmaskc
        );
err=cudaGetLastError();
if(err!=cudaSuccess)
{
    printf("\n1 %s\n", cudaGetErrorString(err));
}
err=    cudaMemcpy (matches_count   , mat_count,    sizeof(unsigned int)*10, cudaMemcpyDeviceToHost );
if(err!=cudaSuccess)
{
    printf("\n2 %s\n", cudaGetErrorString(err));
}
err=    cudaMemcpy (matches_position, mat_position, sizeof(off_t)*10, cudaMemcpyDeviceToHost );
if(err!=cudaSuccess)
{
    printf("\n3 %s\n", cudaGetErrorString(err));
}

The following part of the code reported the "unspecified launch failure" error message.
I don't know why this error is reported.

err=cudaMemcpy (matches_position, mat_position, sizeof(off_t)*10, cudaMemcpyDeviceToHost );
if(err!=cudaSuccess)
{
    printf("\n3 %s\n", cudaGetErrorString(err));
}

The following is part of the match function.

__global__ void match(...)
{
    ......
reference_blk = (THR_SIZE * blockIdx.x + threadIdx.x) * 32 + reference;
......
//-- added for parallelization --//
for (p = start_p ; p != last_p ; p++) {
    for ( s = start_sequence, sequence = sequences ; s != end_sequence ;
            s++, sequence += sequence_bytes ) {
        ref_off = *(((unsigned int*)(idx_base)) + p);

        shifted_in = 0;

        if((int)(first_indexes[s-start_sequence] % 8 - ref_off % 8) < 0){
            int shamt2 = (ref_off % 8 - first_indexes[s-start_sequence] % 8);

            mask_buffer = *((unsigned long *)(msk_base + (ref_off - first_indexes[s-start_sequence])/8)) >> shamt2;

            if( ( (*(unsigned long *)(seqmaskc + 16 * (s-start_sequence))) ^ mask_buffer ) << shamt2) continue;
        }
        else if((int)(first_indexes[s-start_sequence] % 8 - ref_off % 8) == 0){
            mask_buffer = *((unsigned long *)(msk_base + (ref_off)/8));

            if( (*(unsigned long *)(seqmaskc + 16 * (s-start_sequence)) ^ mask_buffer)) continue;
        }
        else{
            int shamt2 = 8 - (first_indexes[s-start_sequence] % 8 - ref_off % 8);

            mask_buffer = *((unsigned long *)(msk_base + (ref_off/8- first_indexes[s-start_sequence]/8) - 1)) >> shamt2;

            if( ( (*(unsigned long *)(seqmaskc + 16 * (s-start_sequence))) ^ mask_buffer ) << shamt2) continue;
        }

        //full compare
        if((int)(first_indexes[s-start_sequence] % 4 - ref_off % 4) < 0){
            int shamt = (ref_off % 4 - first_indexes[s-start_sequence] % 4) * 2;
            memcpy(reference_blk, ref_base + ref_off / 4 - first_indexes[s-start_sequence] / 4, sequence_bytes);
            ......
            //-- instead of memcmp --//
            int v = 0;
            char *p1 = (char *)sequence;
            char *p2 = (char *)reference_blk;
            int tmp_asd = sequence_bytes;
            while(tmp_asd!=0){
                v = *(p1++) - *(p2++);
                if(v!=0)
                    break;
                tmp_asd--;
            }

            if(v == 0){
                mat_count[s - (int)start_sequence]++;      /* Maintain count */
                mat_position[s - (int)start_sequence] = ref_off-first_indexes[s-start_sequence]; /* Record latest position */
            }
        }
        else if((int)(first_indexes[s-start_sequence] % 4 - ref_off % 4 )== 0){
            memcpy(reference_blk, ref_base + ref_off / 4 - first_indexes[s-start_sequence] / 4, sequence_bytes);
            .......
            //-- instead of memcmp --//
            int v = 0;
            char *p1 = (char *)sequence;
            char *p2 = (char *)reference_blk;
            int tmp_asd = sequence_bytes;
            while(tmp_asd!=0){
                v = *(p1++) - *(p2++);
                if(v!=0)
                    break;
                tmp_asd--;
            }
            if(v == 0){
                mat_count[s - (int)start_sequence]++;      /* Maintain count */
                mat_position[s - (int)start_sequence] = ref_off-first_indexes[s-start_sequence]; /* Record latest position */
            }
        }
        else
        {
            int shamt = 8 - (first_indexes[s-start_sequence] % 4 - ref_off % 4) * 2;

            memcpy(reference_blk, ref_base + ref_off / 4 - first_indexes[s-start_sequence] / 4 - 1, 32);
            ......
            //-- instead of memcmp --//
            int v = 0;
            char *p1 = (char *)sequence;
            char *p2 = (char *)reference_blk;
            int tmp_asd = sequence_bytes;
            while(tmp_asd!=0){
                v = *(p1++) - *(p2++);
                if(v!=0)
                    break;
                tmp_asd--;
            }

            if (v == 0){
                mat_count[s - (int)start_sequence]++;      /* Maintain count */
                mat_position[s - (int)start_sequence] = ref_off-first_indexes[s-start_sequence];/* Record latest position */
            }
        }
    }
}

}


Posted on November 8, 2018 by admin

When I try to run CUDA code that takes a long time to process on the GPU, I would always get an error such as the following:

Error: C:/kernel.cu:170, code: 4, reason: unspecified launch failure

After spending many sleepless nights trying to figure out what was wrong with my setup, I finally found the reason why!

Windows has a protection mechanism to ensure that your computer doesn't freeze when the GPU takes a long time to process something. As a result, when I ran expensive CUDA code, it would time out because the GPU was busy for too long.

What I did was increase the number of seconds for Timeout Detection and Recovery (TDR) in the Windows Registry. You have to restart your system after making the change for the new setting to take effect.

You can also disable TDR entirely, but it will make your system much less stable: without this protection it is more prone to freezing. I have observed that I can get away with much more expensive processing, but if I abuse it and run an algorithm that takes REALLY long to process, my system will still freeze. I'm not sure why that happens (maybe it's a memory issue), but it's a step in the right direction.

If you prefer, you can increase the timeout instead of disabling TDR completely.

For a detailed explanation visit this link: https://www.pugetsystems.com/labs/hpc/Working-around-TDR-in-Windows-for-a-better-GPU-computing-experience-777/

For instructions from Microsoft on how to make the edits, visit this link: https://docs.microsoft.com/en-us/windows-hardware/drivers/display/tdr-registry-keys

More discussion of the topic: https://training.acceleware.com/blog/timeout-detection-windows-display-driver-model-when-running-cuda-kernels-symptoms-solutions-and

Linux users check out this link: https://nvidia.custhelp.com/app/answers/detail/a_id/3029/~/using-cuda-and-x
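Before editing the registry it can help to see what the timeout currently is. The following is a minimal read-only sketch in Python. It assumes the location described in the Microsoft documentation linked above (the `TdrDelay` value under `HKLM\SYSTEM\CurrentControlSet\Control\GraphicsDrivers`, with a default of 2 seconds when the value is absent). The function names are hypothetical.

```python
import sys

# Registry location of the TDR settings (per the Microsoft TDR registry keys page).
TDR_KEY = r"SYSTEM\CurrentControlSet\Control\GraphicsDrivers"
DEFAULT_TDR_DELAY_SECONDS = 2  # value Windows uses when TdrDelay is not set


def read_tdr_delay():
    """Return TdrDelay in seconds, the default if unset, or None off Windows."""
    if sys.platform != "win32":
        return None
    import winreg  # standard library, available on Windows only

    try:
        with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, TDR_KEY) as key:
            value, _value_type = winreg.QueryValueEx(key, "TdrDelay")
            return int(value)
    except FileNotFoundError:
        # The key or the value does not exist, so the default applies.
        return DEFAULT_TDR_DELAY_SECONDS


def kernel_may_hit_tdr(kernel_seconds, tdr_delay_seconds):
    """Rough check: a kernel that runs as long as the delay risks a reset."""
    return kernel_seconds >= tdr_delay_seconds
```

For example, `kernel_may_hit_tdr(5.0, read_tdr_delay() or DEFAULT_TDR_DELAY_SECONDS)` flags a five-second kernel under the default setting. Splitting such a kernel into shorter launches is an alternative to raising the delay.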


GPU Errors When Mining

The most complete collection of mining errors on Windows, HiveOS and RaveOS, with quick, calm fixes

Can’t find nonce with device CUDA_ERROR_LAUNCH_FAILED

The error says that the miner cannot find a nonce, and the message itself suggests the fix: reduce the overclock. Beginner miners in particular try to squeeze the maximum out of a GPU and overclock the core or memory too far. With such an overclock the card may even start, but it will then produce errors like this one. Remember: steadily submitting shares to the pool is better than chasing numbers in the miner.

PhoenixMiner: Connection to API server failed. What to do?

This error occurs in PhoenixMiner on the HiveOS operating system. It means that the mining farm/rig cannot connect to the statistics server. To fix it:

Run the net-test command and note the server with the lowest ping. Then switch to that server in the Hive web interface (on the worker) and reboot your rig.
If that did not help, run the command dnscrypt -i && sreboot

PhoenixMiner CUDA error in CudaProgram.cu:474 : the launch timed out and was terminated (702)

Like the first error, this one indicates that the card is overclocked too far. Roll the GPU back to factory settings, then raise the overclock gradually and stop at the last setting that runs without the error.
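The "raise it gradually" advice amounts to a simple step-up search. Here is a minimal Python sketch of that idea. The `is_stable` callback is a hypothetical stand-in for "apply this offset, mine for a while, and report whether errors appeared"; the step size and limit are illustrative values, not recommendations for any particular card.

```python
from typing import Callable


def highest_stable_offset(
    is_stable: Callable[[int], bool],
    start: int = 0,
    step: int = 50,
    limit: int = 1000,
) -> int:
    """Raise the offset from `start` in `step` increments.

    Returns the last offset for which `is_stable` reported no errors.
    """
    if not is_stable(start):
        raise ValueError("the starting (factory) setting is itself unstable")
    best = start
    candidate = start + step
    while candidate <= limit and is_stable(candidate):
        best = candidate
        candidate += step
    return best
```

With a card that, hypothetically, stays stable up to an offset of 420, `highest_stable_offset(lambda mhz: mhz <= 420)` settles on 400. Backing off one further step from the result leaves some margin for hot days.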

UNABLE TO ENUM CUDA GPUS: INVALID DEVICE ORDINAL

Check the GPU drivers and the GPU itself (how it is shown in Device Manager, and whether there are any exclamation marks).
If everything is OK, check the risers. The riser is often the cause of this error.

UNABLE TO ENUM CUDA GPUS: INSUFFICIENT CUDA DRIVER: 5000

As with the previous error, check the GPU drivers and the GPU itself (how it is shown in Device Manager, and whether there are any exclamation marks).

NBMINER MINING PROGRAM UNEXPECTED EXIT.CODE: -1073740791, REASON: PROCESS CRASHED

The NBMiner error with code -1073740791 occurs if your rig/mining farm is built from a mix of Nvidia and AMD cards. In that case, split the mining into two .bat files (or flight sheets, if you are on HiveOS): one with the AMD cards, the other with the Nvidia cards.

NBMINER CUDA ERROR: OUT OF MEMORY (ERR_NO=2): how to fix it?

One of the most common errors on Windows is running out of memory, here in NBMiner, although it also occurs in the NiceHash miner. To fix it, increase the page file. The page file should equal the combined memory (in GB) of all the GPUs in the rig plus a 10% reserve.
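The sizing rule above is simple arithmetic. A minimal Python sketch follows, assuming 1 GB is counted as 1024 MB and the result is rounded up to a whole megabyte. The function name is hypothetical.

```python
def recommended_pagefile_mb(gpu_memory_gb, reserve_percent=10):
    """Page file size in MB: total GPU memory plus a percentage reserve.

    gpu_memory_gb: iterable of per-card memory sizes in whole GB.
    Integer arithmetic is used so the result is rounded up exactly.
    """
    total_mb = sum(gpu_memory_gb) * 1024
    # Ceiling division: add (divisor - 1) before dividing by 100.
    return (total_mb * (100 + reserve_percent) + 99) // 100
```

For a rig with six 12 GB cards, `recommended_pagefile_mb([12] * 6)` gives 81101 MB, which is 72 GB plus the 10% reserve.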

GMINER ERROR ON GPU: OUT OF MEMORY STOPPED MINING ON GPU0

In this case the culprit is most likely not the page file but an excessive overclock on the card numbered 0. Reduce the overclock and the error should disappear.

Socket error. the remote host closed the connection, in NBMiner

This error may also be reported as "ERROR - Failed to establish connection to mining pool: Socket operation timed out".
It is a network problem: check the rig's internet connection and restart the router.
Your provider may also be closing the connection to the pool. Switch pools, try a VPN, or change the DNS addresses to an external provider, for example Cloudflare 1.1.1.1 and 1.0.0.1.

Server not responded on share, in Gminer

This error means something is wrong with your internet connection, which is critical for Gminer. Try restarting the router and disabling the watchdog in the miner.

DAG has been damaged check overclocking settings, in Gminer

The same error may also read "Device not responding, check overclocking settings".
It indicates an excessive overclock, so start by reducing it.
If that does not help, switch miners: Gminer has never been known for working well with AMD cards. We recommend TeamRedMiner, or lolMiner if you need one miner that supports Nvidia and AMD cards at the same time.
If switching miners does not help, reinstall the graphics driver.
If that does not help either, test the card on its own in the x16 slot.

ERROR: Can’t start T-Rex, failed to initialize device map: can’t get busid, code -6

Memory setup errors with code -6 usually point to a driver problem.

If you are on Windows, use DDU (Display Driver Uninstaller) to remove all Nvidia drivers completely.
Reboot the system.
Install a fresh driver downloaded directly from the Nvidia website.
Reboot the system again.
If you are on HiveOS or RaveOS, write a clean system image, to be certain.

TREX: Can’t unlock GPU

Full error text:
TREX: Can’t unlock GPU [ID=1, GPU #1], error code 15
WARN: Miner is going to shutdown…
WARN: NVML: can’t get fan speed for GPU #1, error code 15
WARN: NVML: can’t get power for GPU #1, error code 15
WARN: NVML: can’t get mem/core clock for GPU #1, error code 17

Solution:

Check every cable connection on the graphics card and riser, especially the power cables.
If those are fine, swap the riser for one that is known to work.
If the error persists, plug the card directly into the x16 slot on the motherboard.

CAN’T START MINER, FAILED TO INITIALIZE DEVIS MAP, CAN’T GET BUSID, CODE -6

In one specific case the problem was the power supply, which could not sustain 3 graphics cards. After the power supply was replaced the error disappeared.
If you are sure your power supply has enough capacity, try a different miner.

ERROR: 511 DEGREES ON A GRAPHICS CARD

A reported temperature of 511 degrees indicates a faulty riser or a problem with power delivery to the card. Check all connections. To isolate the fault, start the system with a single card, test it, and then add the remaining cards one at a time.

GPU driver error, no temps in HiveOS: what to do?

You most likely hit this error while mining on HiveOS. The cause can be software or hardware (a riser, for example).
You can try the cheap fix first and run this command in HiveOS:
hive-replace -y --stable
The system will reinstall the stable version of HiveOS.

If the error does not go away, check the riser.

GPU are lost, rebooting

This is not an error but the consequence of one. To find out which error is causing the cards to be rebooted, do the following:

Enable log saving (it is off by default) with the command

logs-on

Then reboot the rig.
Once the error occurs again, you can download the logs with the commands below.
To upload the miner logs straight from the monitoring panel, use:

message file "miner.log" -f=/var/log/miner/minername/minername.log

For example, to get the TeamRedMiner logs:
message file "teamredminer.log" -f=/var/log/miner/teamredminer/teamredminer.log

The command you sent is highlighted in blue. The uploaded file is shown in white; click it to download.
This command downloads the system log:

message file "syslog" -f=/var/log/syslog

exitcode=3 in HiveOS

If the error does not go away, check the riser.

exitcode=1 in HiveOS

This error occurs when the date in the motherboard BIOS is wrong (the clock has been reset) and/or there is a problem with the internet connection.
If the clock is wrong, you will not be able to connect remotely.
Even so, the Nvidia driver update should go through with the command:

nvidia-driver-update --list

gpu fault detected 146

You are most likely trying to mine with PhoenixMiner. There are two solutions:

Roll back to an older version, for example 5.4c.
(Recommended) Use T-Rex for Nvidia cards and TeamRedMiner for AMD cards.

Waiting interface to come up: VPN not working on HiveOS

Start with the logs to find out which error is causing the problem.
Commands for retrieving the logs:
systemctl status openvpn@client
journalctl -u openvpn@client -e --no-pager -n 100

How to find the IP address of a Hive OS worker

The easiest way is to open the worker and scroll down below the graphics cards. The Remote IP shown there is the external IP.
Alternatively, you can check the external IP address from the Hive Shell console.
Run one of these commands:
curl 2ip.ru
wget -qO- eth0.me
wget -qO- ipinfo.io/ip
wget -qO- ipecho.net/plain
wget -qO- icanhazip.com
wget -qO- ipecho.net
wget -qO- ident.me

Repository update failed in HiveOS

This error appears on HiveOS from time to time. Full error text:

Some index files failed to download. They have been ignored, or old ones used instead.
Repository update failed
------------------------------------------------------
> Restarting autofan and watchdog
> Starting miners
Miner screen is already running
Run miner or screen -r to resume screen
Upgrade failed

Solution:

Run the command apt update && selfupgrade -f
If that does not work either, the HiveOS developers almost certainly know about the problem already and are fixing it. Try the update again after a while.

RaveOS does not start. Boot aborted RaveOS

Recheck all PC and motherboard BIOS settings:
- Set the boot device to HDD/SSD/M.2/USB, depending on the medium that holds the OS.
- Enable 4G decoding.
- Set PCIe support to Auto.
- Enable the integrated graphics.
- Set the preferred boot mode to Legacy mode.
- Disable virtualization.

If some cards are still not detected with these settings, make the following BIOS changes (a full reboot is required after each step):

- Disable 4G decoding
- Reboot
- Disable CSM
- Reboot
- Enable 4G decoding and set PCI-E to Gen2/3; if Gen2/3 is not available, choose Gen1

Failed to allocate memory RaveOS

The same error may also appear as:
failed to allocate initramfs memory bailing out, failed to load idlinux c.32
or
failed to allocate memory for kernel boot parameter block
or
failed to allocate initramfs memory raveos bailing

The fix is the same in every case: the motherboard BIOS must be configured correctly.

gpu_driver_fault, GPU #0 fault in RaveOS

In most cases this is solved by reducing the overclock (especially the memory overclock) on the card named in the message (card number 0 here).
If reducing the overclock does not help, try updating the drivers.
If updating the drivers does not solve it, swap the riser on that card for one that is known to work.
If that does not help either, recheck all cable connections and the power supply capacity to make sure it is sufficient for your configuration.

Gpu driver fault. All tasks have been stopped. Worker will be rebooted after 5 minutes in RaveOS

What causes this error? You have probably overclocked a graphics card too far (memory is often pushed too hard), so reduce the overclock. Start with the card that the error points to (GPU number 1 in the original screenshot).
The second common cause is a power supply that is too weak for the system and its graphics cards. Keep in mind that the system itself draws at least 100 W, and allow another 50 W for each riser. The power supply should cover the total with a 20% margin.
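The power budget above can be written out as a quick calculation. This is only a sketch: the function name, the per-card wattages, and the assumption of one riser per card are illustrative, while the 100 W system draw, 50 W per riser, and 20% margin come from the text.

```python
def required_psu_watts(gpu_watts, system_watts=100, riser_watts=50, margin=0.20):
    """Estimate the PSU capacity a rig needs, in watts.

    gpu_watts: per-card power draw in watts (assumed to be known, e.g. power limits).
    Assumes one riser per card, which may not match every rig.
    """
    load = system_watts + sum(gpu_watts) + riser_watts * len(gpu_watts)
    return load * (1 + margin)


if __name__ == "__main__":
    cards = [120] * 6  # hypothetical rig: six cards limited to 120 W each
    print(f"PSU capacity needed: about {required_psu_watts(cards):.0f} W")
```

For the hypothetical six-card rig the load is 100 + 720 + 300 = 1120 W, so the 20% margin brings the requirement to about 1344 W.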

Miner restarted after error in RaveOS

Look at the miner logs: they will show the specific error that leads to "miner restarted". Then find that error on this page and fix it. The problem will go away.

Miner restart limit reached. Worker rebooting by flag auto in RaveOS

As with the previous item, look at the miner logs for the specific error that causes the worker to restart. Fix that error and this problem goes away too.

Miner cannot be started, in RaveOS

Another error is usually printed just before this one, and that error is the real cause. If there is nothing else:

Pause the miner, reboot the rig, and run the commands clear-miners, clear-logs, and fix-fs in the console. Then start mining.
If the error remains, rewrite the RaveOS image.

Overclock can’t be applied in RaveOS

This error means the overclock values conflict with each other or fall outside the allowed range. Recheck them. Reset the overclock to stock and try again.
In rare cases a riser can also cause this error.

Error installing hive miners

You can try the cheap fix first and run this command in HiveOS:
hive-replace -y --stable
The system will reinstall the stable version of HiveOS.

If the error does not go away, physically rewrite the image. If you boot from a USB flash drive, it has most likely died. Buy an SSD.

Warning: Nvidia settings applied with errors

The cards are overclocked too far. Lower the core and memory clocks, then reboot the rig.

Nvtool error or Danger: nvtool error

Most likely a problem with the nvtool module appeared while the driver was being installed.
Try reinstalling the Nvidia driver from the Hive shell (replace driver_version with the version you need):
nvidia-driver-update driver_version --force
Or try updating the whole system from the Hive shell:
hive-replace -y --stable

Graphics card fan no longer shown in HiveOS

The fan speed reads 0%.
This can happen for several reasons:

the fan really is not spinning
the speed sensor is disconnected or broken
the card is being driven too aggressively (high overclock)
the riser or one of its parts is faulty

ERROR: parsing JSON failed

Run the following command locally on the rig (with a keyboard and monitor attached):
net-test

This command shows the current state of your connection to the various HiveOS API server mirrors.
See which API has the lowest latency (ping), and once the worker reappears in the panel, change the default mirror to the one closest to you.
After changing the mirror, be sure to reboot the worker.
You can also change the API server with the command nano /hive-config/rig.conf
After editing, press Ctrl+O and Enter to save the file.
Then exit to the console with Ctrl+X, F10 and run the hello command.

NVML: can’t get fan speed for GPU #5, error code 999 in Hive OS

There is a problem with the fan speed on GPU 5.
The fan speed reads 0%, or errors appear in general.
This can happen for several reasons:
- the fan really is not spinning
- the speed sensor is disconnected or broken
- the card is being driven too aggressively (high overclock)
Start with a visual inspection of the card and its fan.

Can’t get power for GPU #2

This error usually appears alongside others:
Attribute ‘GPUGraphicsClockOffset’ was already set to 0
Attribute ‘GPUMemoryTransferRateOffset’ was already set to 2200
Attribute ‘GPUFanControlState’ (hive1660s_ETH:0[gpu:2]) assigned value 0.

20211029 12:40:50 WARN: NVML: can’t get fan speed for GPU #2, error code 999
20211029 12:40:50 WARN: NVML: can’t get power for GPU #2, error code 999
20211029 12:40:50 WARN: NVML: can’t get mem/core clock for GPU #2, error code 999

Solution:

Check that the driver for the graphics card is installed correctly.
If the driver is fine, try different overclock settings, for example a lower memory overclock.

GPU1 search error: unspecified launch failure

Reduce the overclock and check the riser contacts.

Warning: Autofan: unable to set fan speed, rebooting

Find the miner logs and see which errors the miner writes there. For example:

kernel: [12112.410046][ T7358] NVRM: GPU at PCI:0000:0c:00: GPU-236e3bef-2e03-6cdb-0518-7ac01eb8736d
kernel: [12112.410049][ T7358] NVRM: Xid (PCI:0000:0c:00): 62, pid=7317, 0000(0000) 00000000 00000000
kernel: [12112.433831][ T7358] NVRM: Xid (PCI:0000:0c:00): 45, pid=7317, Ch 00000010
CRON[21094]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)

The logs show a problem with the graphics card in PCIe slot 0c:00 (the GPU number shown is the PCIe slot number), with errors 45 and 62.
Error codes (including others that may appear there) and what to do about them:

• 13, 43, 45: memory errors, lower MEM
• 8, 31, 32, 61, 62: lower CORE, possibly MEM as well
• 79: lower CORE, check the riser
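The lookup above can be automated for a saved log. The sketch below assumes kernel log lines in the same "NVRM: Xid (PCI:...): code," shape as the sample; the function and table names are hypothetical, and the advice strings simply restate the list above.

```python
import re

# Advice per Xid code, restating the list in this guide (not an official Nvidia table).
XID_ADVICE = {
    **{code: "memory error: lower MEM" for code in (13, 43, 45)},
    **{code: "lower CORE, possibly MEM as well" for code in (8, 31, 32, 61, 62)},
    79: "lower CORE, check the riser",
}

# Matches e.g. "NVRM: Xid (PCI:0000:0c:00): 62, pid=7317, ..."
XID_RE = re.compile(r"NVRM: Xid \(PCI:([0-9a-fA-F:.]+)\): (\d+),")


def xid_events(log_text):
    """Return (pci_bus_id, xid_code, advice) for every Xid line in the log text."""
    events = []
    for match in XID_RE.finditer(log_text):
        bus_id, code = match.group(1), int(match.group(2))
        events.append((bus_id, code, XID_ADVICE.get(code, "not covered by this guide")))
    return events


if __name__ == "__main__":
    sample = (
        "kernel: NVRM: Xid (PCI:0000:0c:00): 62, pid=7317, 0000(0000) 00000000 00000000\n"
        "kernel: NVRM: Xid (PCI:0000:0c:00): 45, pid=7317, Ch 00000010\n"
    )
    for bus_id, code, advice in xid_events(sample):
        print(f"{bus_id}: Xid {code} -> {advice}")
```

Codes that are not in the table are reported as not covered rather than guessed at.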

Kernel-Power error, code 41

Check all the wiring (from the power supply to the cards and from the power supply to the risers); something may be melting. If a visual inspection shows that everything is fine, the error is a software one and you need to reinstall Windows.

Danger: hive-replace -y --stable (failed, exitcode=137)

A very rare error that came up during a remote update of the HiveOS image. It does not show up in mining groups or on mining sites. You will not believe what happened.
A family of pigeons had settled on the balcony where the rig stood. They covered the rig in droppings, quite literally, and it kept going offline as a result. After the motherboard and graphics cards were thoroughly blown clean, the problem went away.

MALFUNCTION HIVEOS

Malfunction means a fault. There can be several causes and solutions:

Reinstall the graphics driver.
If the driver does not help, disconnect all GPUs and plug them back in one at a time, checking whether any particular card triggers the error. If one does, the riser may be at fault.
The medium that Hive OS is written to may be faulty; write the image again.

First of all, you have your sizes wrong. The program works for 10,000,000 and not for 100,000,000 (whereas you said it works for 100,000,000 and not for 1,000,000,000). So memory size is not the issue, and your calculations there are based on the wrong numbers.

calculate_grid_parameters is messed up. The purpose of this function is to figure out how many blocks are needed, and therefore the grid size, based on GPU_MAX_PW specifying the total number of threads needed and 1024 threads per block (hard-coded). The line that prints block size = ... grid ... grid ... actually holds the clue to the problem. For GPU_MAX_PW of 100,000,000, this function correctly calculates that 100,000,000/1024 = 97657 blocks are needed. However, the grid dimensions are computed incorrectly. The product grid.x * grid.y should (approximately) equal the total number of blocks desired. But this function has decided that it wants grid.x of 48828 and grid.y of 13951. If I multiply those two, I get 681,199,428, which is much larger than the desired total block count of 97657. Now if I launch a kernel with requested grid dimensions of 48828 (x) and 13951 (y), and also request 1024 threads per block, I have requested 697,548,214,272 threads in total for that kernel launch. First, this is not your intent, and second, while at the moment I cannot say exactly why, this is apparently too many threads. Suffice it to say that this overall grid request exceeds some resource limit of the machine.

Note that if you drop from 100,000,000 to 10,000,000 for GPU_MAX_PW, the grid calculation becomes "sensible"; I get:

block size = 9766 grid 9766 grid 1

and no launch failure.
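The arithmetic in this answer can be checked with a few lines of host-side Python. This is a sketch of one reasonable way to split a block count into a 2D grid, not the asker's calculate_grid_parameters; the 65535 per-dimension limit and the function names are assumptions made for illustration.

```python
THREADS_PER_BLOCK = 1024  # hard-coded in the program under discussion
MAX_GRID_DIM = 65535      # assumed per-dimension grid limit; check your device's properties


def ceil_div(a, b):
    """Integer ceiling division without floating-point rounding."""
    return (a + b - 1) // b


def blocks_needed(total_threads, threads_per_block=THREADS_PER_BLOCK):
    """Number of blocks required to cover total_threads."""
    return ceil_div(total_threads, threads_per_block)


def grid_dims(total_threads, max_dim=MAX_GRID_DIM):
    """Split the block count into (grid_x, grid_y) with grid_x * grid_y >= blocks."""
    blocks = blocks_needed(total_threads)
    grid_x = min(blocks, max_dim)
    grid_y = ceil_div(blocks, grid_x)
    if grid_y > max_dim:
        raise ValueError("too many blocks for a 2D grid under the assumed limit")
    return grid_x, grid_y


if __name__ == "__main__":
    print(blocks_needed(100_000_000))  # 97657 blocks, as stated in the answer
    print(48828 * 13951 * 1024)        # 697548214272 threads requested by the faulty grid
    print(grid_dims(10_000_000))       # (9766, 1), matching the "sensible" output
    print(grid_dims(100_000_000))      # (65535, 2), slightly more blocks than needed
```

Because the grid can contain a few more blocks than strictly required, a kernel launched this way would need to compare its global thread index against GPU_MAX_PW and return early when the index is out of range.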


I’ve been having stability issues for a while. Always with this error «CUDA error 719 on device *: unspecified launch failure».

I always render animations, roughly 30 sec per frame, all using animated Alembic files. The crash can happen after 10 minutes or 10 hours; it is very unpredictable. I have 5x GTX 780 Ti + Quadro 4000 (monitor only). The fewer cards I use, the more stable it seems, but it is hard to be sure. The problem is that I want to be able to leave the machine rendering over the weekend without it stopping.

Things I have tried:

Additional PSU
Underclocking Core and Memory
Setting Fans to 100%
Eliminating each card one by one to test for hardware issues

I use the C4D plugin but have replicated the same issue in the standalone so I don’t believe it’s plugin related.

I do not know what else I can try. It is becoming a real issue with many upcoming projects. Thank you for any insights.

Started logging on 29.12.16 11:46:34

OctaneRender 3.05.1 (3050100)

Scene created in plugin version -1

FRAME 407 fps:30 camMb:0 objMb:0
Triangles:1.06m Disp.triangles:0 Hairs:0 Meshes:5
VRAM used/free/max:712Mb/1.652Gb/3Gb
Out-of-core used:0Kb total used RAM:8.585Gb
MotBlurTM=0 sec. createTM=0.524 sec. evalTM=7.229 sec.
Device:0 TotMem:3Gb rtData:341Mb film:7Mb geo:235Mb node:4Kb tex:128Mb unavailable:668Mb temperature:72
Device:1 TotMem:3Gb rtData:341Mb film:7Mb geo:235Mb node:4Kb tex:128Mb unavailable:598Mb temperature:64
Device:2 TotMem:2Gb rtData:0Kb film:0Kb geo:0Kb node:0Kb tex:0Kb unavailable:0Kb temperature:63
Device:3 TotMem:3Gb rtData:341Mb film:7Mb geo:235Mb node:4Kb tex:128Mb unavailable:598Mb temperature:63
Device:4 TotMem:3Gb rtData:341Mb film:7Mb geo:235Mb node:4Kb tex:128Mb unavailable:598Mb temperature:66
Device:5 TotMem:3Gb rtData:0Kb film:0Kb geo:0Kb node:0Kb tex:0Kb unavailable:0Kb temperature:38
CUDA error 719 on device 3: unspecified launch failure
-> kernel execution failed(report)
CUDA error 719 on device 3: unspecified launch failure
-> failed to launch kernel(ptBrdf2)
device 3: path tracing kernel failed
device 3: detected an error on render device! trying to recover…
CUDA error 719 on device 3: unspecified launch failure
-> failed to deallocate device memory
CUDA error 719 on device 3: unspecified launch failure
-> could not get memory info
CUDA error 719 on device 3: unspecified launch failure
-> failed to deallocate device memory
CUDA error 719 on device 3: unspecified launch failure
-> could not get memory info
CUDA error 719 on device 4: unspecified launch failure
CUDA error 719 on device 3: unspecified launch failure
-> failed to deallocate device memory
-> kernel execution failed(report)
CUDA error 719 on device 3: unspecified launch failure
-> could not get memory info
CUDA error 719 on device 3: unspecified launch failure
-> failed to deallocate device memory
CUDA error 719 on device 4: unspecified launch failure
-> failed to launch kernel(ptBrdf2)
CUDA error 719 on device 3: unspecified launch failure
-> could not get memory info
device 4: path tracing kernel failed
device 4: detected an error on render device! trying to recover…
CUDA error 719 on device 3: unspecified launch failure
-> failed to deallocate device memory
CUDA error 719 on device 4: unspecified launch failure
CUDA error 719 on device 3: unspecified launch failure
-> failed to deallocate device memory
-> could not get memory info
CUDA error 719 on device 4: unspecified launch failure
CUDA error 719 on device 3: unspecified launch failure
-> could not get memory info
-> failed to deallocate device memory
CUDA error 719 on device 3: unspecified launch failure
-> could not get memory info
CUDA error 719 on device 4: unspecified launch failure
-> failed to deallocate device memory
CUDA error 719 on device 3: unspecified launch failure
-> failed to deallocate device memory
CUDA error 719 on device 4: unspecified launch failure
-> could not get memory info
CUDA error 719 on device 3: unspecified launch failure
-> could not get memory info
CUDA error 719 on device 4: unspecified launch failure
CUDA error 719 on device 3: unspecified launch failure
-> failed to deallocate device memory
-> failed to deallocate device memory
CUDA error 719 on device 4: unspecified launch failure
CUDA error 719 on device 3: unspecified launch failure
-> could not get memory info
-> could not get memory info
CUDA error 719 on device 4: unspecified launch failure
-> failed to deallocate device memory
CUDA error 719 on device 4: unspecified launch failure
CUDA error 719 on device 3: unspecified launch failure
-> could not get memory info
-> failed to deallocate device memory
CUDA error 719 on device 3: unspecified launch failure
-> could not get memory info
CUDA error 719 on device 4: unspecified launch failure
-> failed to deallocate device memory
CUDA error 719 on device 3: unspecified launch failure
-> failed to deallocate device memory
CUDA error 719 on device 4: unspecified launch failure
-> could not get memory info
CUDA error 719 on device 3: unspecified launch failure
-> could not get memory info
CUDA error 719 on device 4: unspecified launch failure
-> failed to deallocate device memory
CUDA error 719 on device 3: unspecified launch failure
-> failed to deallocate device memory
CUDA error 719 on device 4: unspecified launch failure
-> could not get memory info
CUDA error 719 on device 3: unspecified launch failure
-> could not get memory info
CUDA error 719 on device 4: unspecified launch failure
-> failed to deallocate device memory
CUDA error 719 on device 3: unspecified launch failure
CUDA error 719 on device 4: unspecified launch failure
-> failed to deallocate device memory
-> could not get memory info
CUDA error 719 on device 3: unspecified launch failure
CUDA error 719 on device 4: unspecified launch failure
-> could not get memory info
-> failed to deallocate device memory
CUDA error 719 on device 3: unspecified launch failure
CUDA error 719 on device 4: unspecified launch failure
-> failed to deallocate device memory
-> could not get memory info
CUDA error 719 on device 3: unspecified launch failure
CUDA error 719 on device 4: unspecified launch failure
-> could not get memory info
-> failed to deallocate device memory
CUDA error 719 on device 3: unspecified launch failure
CUDA error 719 on device 4: unspecified launch failure
-> failed to deallocate device memory
-> could not get memory info
CUDA error 719 on device 3: unspecified launch failure
CUDA error 719 on device 4: unspecified launch failure
-> could not get memory info
-> failed to deallocate device memory
CUDA error 719 on device 3: unspecified launch failure
CUDA error 719 on device 4: unspecified launch failure
-> failed to deallocate device memory
-> could not get memory info
CUDA error 719 on device 3: unspecified launch failure
CUDA error 719 on device 4: unspecified launch failure
-> could not get memory info
-> failed to deallocate device memory
CUDA error 719 on device 3: unspecified launch failure
CUDA error 719 on device 4: unspecified launch failure
-> failed to deallocate device memory
-> could not get memory info
CUDA error 719 on device 3: unspecified launch failure
CUDA error 719 on device 4: unspecified launch failure
-> could not get memory info
-> failed to deallocate device memory
CUDA error 719 on device 3: unspecified launch failure
CUDA error 719 on device 4: unspecified launch failure
-> failed to deallocate device memory
-> could not get memory info
CUDA error 719 on device 3: unspecified launch failure
CUDA error 719 on device 4: unspecified launch failure
-> could not get memory info
-> failed to deallocate device memory
CUDA error 719 on device 3: unspecified launch failure
CUDA error 719 on device 4: unspecified launch failure
-> failed to deallocate device memory
-> could not get memory info
CUDA error 719 on device 3: unspecified launch failure
CUDA error 719 on device 4: unspecified launch failure
-> could not get memory info
-> failed to deallocate device memory
CUDA error 719 on device 3: unspecified launch failure
CUDA error 719 on device 4: unspecified launch failure
-> failed to deallocate device memory
-> could not get memory info
CUDA error 719 on device 3: unspecified launch failure
CUDA error 719 on device 4: unspecified launch failure
-> could not get memory info
-> failed to deallocate device memory
CUDA error 719 on device 3: unspecified launch failure
CUDA error 719 on device 4: unspecified launch failure
-> failed to deallocate device memory
-> could not get memory info
CUDA error 719 on device 3: unspecified launch failure
CUDA error 719 on device 4: unspecified launch failure
-> could not get memory info
-> failed to deallocate device memory
CUDA error 719 on device 4: unspecified launch failure
-> could not get memory info
CUDA error 719 on device 3: unspecified launch failure
-> failed to deallocate device array
CUDA error 719 on device 4: unspecified launch failure
-> failed to deallocate device memory
CUDA error 719 on device 3: unspecified launch failure
-> could not get memory info
CUDA error 719 on device 4: unspecified launch failure
-> could not get memory info
CUDA error 719 on device 3: unspecified launch failure
-> failed to deallocate device array
CUDA error 719 on device 4: unspecified launch failure
-> failed to deallocate device memory
CUDA error 719 on device 3: unspecified launch failure
-> could not get memory info
CUDA error 719 on device 4: unspecified launch failure
-> could not get memory info
CUDA error 719 on device 3: unspecified launch failure
-> failed to deallocate device array
CUDA error 719 on device 4: unspecified launch failure
-> failed to deallocate device memory
CUDA error 719 on device 3: unspecified launch failure
-> could not get memory info
CUDA error 719 on device 4: unspecified launch failure
-> could not get memory info
CUDA error 719 on device 3: unspecified launch failure
-> failed to deallocate device memory
CUDA error 719 on device 4: unspecified launch failure
-> failed to deallocate device memory
CUDA error 719 on device 3: unspecified launch failure
-> could not get memory info
CUDA error 719 on device 4: unspecified launch failure
-> could not get memory info
CUDA error 719 on device 3: unspecified launch failure
CUDA error 719 on device 4: unspecified launch failure
-> failed to deallocate device array
-> failed to deallocate device array
CUDA error 719 on device 3: unspecified launch failure
CUDA error 719 on device 4: unspecified launch failure
-> could not get memory info
-> could not get memory info
CUDA error 719 on device 3: unspecified launch failure
CUDA error 719 on device 4: unspecified launch failure
-> failed to deallocate device array
-> failed to deallocate device array
CUDA error 719 on device 3: unspecified launch failure
CUDA error 719 on device 4: unspecified launch failure
-> could not get memory info
-> could not get memory info
CUDA error 719 on device 4: unspecified launch failure
CUDA error 719 on device 3: unspecified launch failure
-> failed to deallocate device array
-> failed to deallocate device memory
CUDA error 719 on device 4: unspecified launch failure
CUDA error 719 on device 3: unspecified launch failure
-> could not get memory info
-> could not get memory info
CUDA error 719 on device 4: unspecified launch failure
-> failed to deallocate device memory
CUDA error 719 on device 3: unspecified launch failure
-> failed to deallocate pinned memory
CUDA error 719 on device 4: unspecified launch failure
-> could not get memory info
CUDA error 719 on device 3: unspecified launch failure
-> could not get memory info
CUDA error 719 on device 4: unspecified launch failure
CUDA error 719 on device 3: unspecified launch failure
-> failed to deallocate device array
-> failed to deallocate pinned memory
CUDA error 719 on device 4: unspecified launch failure
CUDA error 719 on device 3: unspecified launch failure
-> could not get memory info
-> could not get memory info
CUDA error 719 on device 4: unspecified launch failure
-> failed to deallocate device array
CUDA error 719 on device 4: unspecified launch failure
-> could not get memory info
CUDA error 719 on device 4: unspecified launch failure
-> failed to deallocate device memory
CUDA error 719 on device 4: unspecified launch failure
-> could not get memory info
CUDA error 719 on device 3: unspecified launch failure
-> could not get memory info
CUDA error 719 on device 4: unspecified launch failure
-> failed to deallocate pinned memory
CUDA error 719 on device 3: unspecified launch failure
-> could not get memory info
CUDA error 719 on device 4: unspecified launch failure
-> could not get memory info
CUDA error 719 on device 3: unspecified launch failure
CUDA error 719 on device 4: unspecified launch failure
-> failed to deallocate device memory
-> failed to deallocate pinned memory
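A log this long is easier to read once it is summarized per device. The sketch below counts the "CUDA error ... on device N" lines and records the order in which devices first failed; the function name is hypothetical and the line format is assumed to match the log above.

```python
import re
from collections import Counter

# Matches e.g. "CUDA error 719 on device 3: unspecified launch failure"
ERROR_RE = re.compile(r"CUDA error (\d+) on device (\d+)")


def summarize_cuda_errors(log_text):
    """Return (error counts per device, devices in order of first failure)."""
    counts = Counter()
    first_seen = []
    for line in log_text.splitlines():
        match = ERROR_RE.search(line)
        if match is None:
            continue
        device = int(match.group(2))
        if device not in counts:
            first_seen.append(device)
        counts[device] += 1
    return counts, first_seen


if __name__ == "__main__":
    sample = (
        "CUDA error 719 on device 3: unspecified launch failure\n"
        "-> kernel execution failed(report)\n"
        "CUDA error 719 on device 3: unspecified launch failure\n"
        "CUDA error 719 on device 4: unspecified launch failure\n"
    )
    counts, order = summarize_cuda_errors(sample)
    print("first device to fail:", order[0])
    print("errors per device:", dict(counts))
```

In the log above device 3 reports the failure first and device 4 follows, so device 3 is a natural first candidate for testing in isolation.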

Asus X99 10G WS | Windows 10.0 | 6x GTX 1080 Ti + 2x 2080Ti | i7-6850K CPU @ 3.60GHz | 32GB RAM
