NVML error in CudaProgram.cu:314: fixing errors and finding optimal solutions to problems

Contents

How to fix the "NVML: cannot get fan speed" error
Causes of NVML error 999 (an internal driver error occurred)
What to do to eliminate NVML error 999 (an internal driver error occurred)
Two GPUs on the second PSU drop out with "NVML: cannot get current temperature, error 15"
WARN: NVML: can’t get GPU #0, error code 15 #230

How to fix the "NVML: cannot get fan speed" error

When mining on Nvidia graphics cards, annoying errors such as "NVML: cannot get fan speed" or "NVML: cannot get current temperature" with code 15 or 999 sometimes appear.

A few minutes after such errors appear, the miner usually crashes and the system hangs.

Before the hang, the miner log typically contains lines similar to "GPU 1, GpuMiner cu_k1 failed 30, unknown error".

The sections that follow look more closely at what causes the NVML "cannot get fan/temperature" errors 15 and 999 and how to fix them.

Causes of NVML error 999 (an internal driver error occurred)

Errors from the NVIDIA Management Library (NVML) with various codes (usually 15, 17 or 999) lead to loss of monitoring and control of the graphics card's temperature and fans.

They are returned by the NVML API, which ships with the Nvidia drivers. According to the specification:

error code 15 means that the GPU has fallen off the PCI-E bus or has otherwise become inaccessible, so it can no longer be managed (NVML_ERROR_GPU_IS_LOST = 15);
error code 17 means that the GPU control device has been blocked by the operating system or cgroups (NVML_ERROR_OPERATING_SYSTEM = 17);
error code 999 means an unspecified failure inside the driver (NVML_ERROR_UNKNOWN = 999, "an internal driver error occurred").
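
As a quick reference, the three codes above can be kept in a small lookup table. This is a minimal Python sketch, not part of any miner: `describe_nvml_error` is a hypothetical helper name, and only the three codes quoted in the list are covered.

```python
# Hypothetical helper: maps the NVML return codes discussed above to
# their symbolic names and a short hint. Only the three codes from the
# text are covered; anything else is reported as not covered.
NVML_ERRORS = {
    15: ("NVML_ERROR_GPU_IS_LOST", "GPU fell off the PCI-E bus or is inaccessible"),
    17: ("NVML_ERROR_OPERATING_SYSTEM", "GPU control device blocked by the OS/cgroups"),
    999: ("NVML_ERROR_UNKNOWN", "an internal driver error occurred"),
}


def describe_nvml_error(code: int) -> str:
    """Return a one-line description for an NVML error code."""
    name, hint = NVML_ERRORS.get(code, ("(not covered here)", "see the NVML documentation"))
    return f"error {code}: {name} ({hint})"


if __name__ == "__main__":
    for code in (15, 17, 999):
        print(describe_nvml_error(code))
```

Keeping the table explicit makes it easy to extend if other codes start showing up in the miner log.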

The main cause of these errors is a problem with data transfer on the link between the graphics card and the motherboard.

Signal transmission over the PCI-E bus becomes unreliable because of faults such as:

incorrect installation of, or damage to, the driver files;
a wrong PCI-Express link speed setting in the BIOS;
PCI-E devices misbehaving because power saving is enabled for the bus;
overheating of the southbridge and the resulting disruption of data exchange over PCI-Express;
overheating of the graphics card and the resulting problems in its electronic components;
faulty risers (usually bad contacts on the data and power lines);
poor contact in power or data cable connectors;
low-quality power supplies, or power supplies that are overloaded;
strong electromagnetic interference reaching the rig from the mains;
excessive overclocking or undervolting of the GPU.

What to do to eliminate NVML error 999 (an internal driver error occurred)

To eliminate the "NVML: cannot get fan speed, error 999" error, take the following steps:

check or replace the risers and power cables, and clean the contacts on the graphics card connectors, risers, power cables and USB extension cables;
increase the page file to the total amount of video memory of all cards installed in the rig;
reduce the core and memory overclock and the undervolt. Start with the card whose number appears first in the miner log before the hang. If the first line to appear is "GPU 1, GpuMiner cu_k1 failed 30, unknown error", the problem is most likely with the card numbered 1;
disable PCIe power saving in the system power options (Edit power plan, Change advanced power settings, PCI Express, Link State Power Management: Off);
reinstall the drivers after completely uninstalling the previous version;
provide good cooling for the southbridge by fitting a heatsink or a fan;
set the PCI-E link speed in the BIOS to Gen 2 or Gen 1;
make sure the power supply has enough capacity and check the quality of its output voltages with a voltmeter;
move the graphics card to a different PCI-E slot and try plugging it in directly, without a riser.

If you suspect a faulty motherboard slot, try another graphics card in it with a known-good riser. If the error occurs again, the fault most likely lies with the motherboard.

When running Claymore miner with fan-control problems, adding the following parameters to the batch file can help a little:

-tt 1: disables fan control;

-tt 0: disables temperature and fan monitoring;

-wd 0: disables the software watchdog built into the miner.

Disabling fan control in the miner is not a problem if you use MSI Afterburner or nvidiainspector as described in the article "Optimizing the power consumption of Nvidia graphics cards for mining".

Such a workaround may extend the rig's uptime, but if a bad contact remains somewhere in the machine, the rig will still be unstable and will hang from time to time.

In that case a hardware watchdog timer may help; some of them are described in the article "Chinese watchdog timers for mining".

Two GPUs on the second PSU drop out with "NVML: cannot get current temperature, error 15"

Абырвалг

Hello. I have spent four days struggling to connect 5 graphics cards to a GA-970A-DS3P FX motherboard.

The problem is as follows:
Two cards connected to the second PSU (II) drop out with the miner error "nvml cannot get current temperature error 15". All cards run without overclocking.

System description:
5 Gigabyte GeForce GTX 1060 WF OC [GV-N1060WF2OC-3GD] cards.
Two FSP 550 W power supplies; their cases are connected to each other.
Three graphics cards with their 3 risers, the motherboard and the hard drive are connected to the first PSU (I); the other two cards and their risers are connected to the second (II).
Each riser (ver. 006, Molex-powered) takes its power from the same PSU as its graphics card.
These are the risers I use:
https://ru.aliexpress.com/item/2017. -1f8e-44c9-9214-d5cb900807c5&rmStoreLevelAB=4

I also have a second rig on the same motherboard with two identical cards plus three MSI 570 Armor OC cards. Everything works fine there, with no dropouts.
I have tried everything.
First I swapped the CPUs between the motherboards (AMD chips can be glitchy); the problem remained.
Thinking the motherboard might be at fault, I swapped the boards; same thing.
I put the hard drive from the working rig into this one; the problem remained.
If only one PSU is connected, everything works normally on 3 cards.

What should I do? Where should I dig? Should I buy a single PSU?

WARN: NVML: can’t get GPU #0, error code 15 #230

I am running T-Rex 0.19.11, and got the following error which so far is solved by recurrently restarting the rig:

WARN: NVML: can’t get GPU #0, error code 15
.
WARN: GPU #0: EVGA Geforce GTX 1060 6GB is idle, last activity was 51 secs ago
WARN: WATCHDOG: T-Rex has a problem with GPU, terminating.
WARN: WATCHDOG: recovering T-Rex

Any hint to fix it is appreciated

i got same error, never had it before.

follow-up, i tore the pc rack down and found a burnt cable to pci riser, replaced cable mine is now working 100%.

Thanks for your reply. Unfortunately I just can perform remote configuration adjustments since I am not close to the Rig. Then if there is not any other option I will deactivate GPU 0 and have a visual inspection to the cables in few weeks once I am able to make a site visit.

I ran into this problem today. Checked the cables everything looks good. Restarted the machine and it’s back to 100%.

Wondering if it’s a software problem, as I have been also having watchdog errors where I get an error and the miner process consumes 100% of the CPU. Or I’ll see a GPU crash and the miner will continue along.

I’m running with 1×1070 and 2×3070 under Debian Buster with NVIDIA 460.39 drivers
OC settings:

Thanks for sharing your view.

It happens to me on at least on releases 0.19.10 and 0.19.11. Tried to increase the virtual memory and issue still there.
When I reboot the rig, it is solved but after a while (from 10min up to 2 hours) the issue comes back.
So far I am just mining with one GPU less (5 instead of 6). All my GPUs on this rig are EVGA 1060 GTX SSC 6GB.

Did not find the problem yet and I cannot visit the rig on site for several weeks to check cables.

I’m having the same issue here, I’ve tried different risers, cables. in my case it certainly doesn’t seem like it’s riser related. I tried an older driver with no luck. Tried T-Rex .10 and .11, no luck.

I use T-Rex to mine Ethereum without issue; it is only when I try mining Raven Coin that it gives me this error. Coincidentally, the affected card happens to be the only one that I have on a riser, but again, it works fine for Ethereum. The only thing I haven’t tried is moving my cards around, just to definitively determine whether it’s riser related.

GPU 0: 3060ti Founders Edition
GPU 1: 3060ti MSI Ventus 2x
GPU 2: 3060ti Gigabyte OC Gaming (this is the only one that’s external/on a riser)

Nice feedback. probably it is not the riser but one of the cables feeding it up? If you are able to move around the cards and maybe plug the failing one on the motherboard without riser, it would be nice to find or discard some things. even nicer if you maybe only plug the failing card and see if the issue occurs or not.

Another option could be that for any reason a thermal pad is not doing its work and one of the memory chips is getting hotter than usual, but in my case I replaced all thermal pads one month ago so they are pretty recent.

I dont think it is related to the coin you are mining. The only thing that stands out in my mind is that maybe depending on the coin you are mining, you are stressing more/less any of the memory chipset which reinforces the theory of a memory component which might be not dissipating the heat properly.

thanks for replying

I should’ve done this when I had it out, but truthfully, I was just being lazy and decided to try it some other time.

In my case, thermals are most probably not an issue. As a matter of fact, the only time the cards have ever hit anything above 60ºc is when I was getting the OC’ing right. Aside from that, the warmest card I have is running at 52ºc, the others are at 45º under full load.

I live in Chicago and the rig is in my crawlspace. They’re basically in a huge refrigerator. I highly doubt thermals are a factor in this issue.

Ok, here’s my update.

My issue was power draw. my power supply is a bit low (waiting for new supply to come in). What I had to do is start Afterburner, load a profile that dropped the power to around 132 watts per card (3060ti’s), launch gMiner and after DAG creation was complete, switch profiles.

T-Rex miner stopped giving me the error, but locks up a few minutes after it started mining. I can’t figure out why it does that, so I just switched miners. Now I’m just trying to figure out why I’m only getting 77Mh/s vs the 80 to 90 that some of the calculators are claiming I should be able to achieve. I hope this helps.

I stopped having the error (so far) after dropping my Memory offset to 1650 and GPU offset to 0 from 2000 and 135 on only my two 3070s.

Maybe the power setting is too low for the higher offsets? I have the power set to 125w for the 3070s.

Good that you at least solved the issue and not any error code 15 message is appearing. Did you confirm this with a long period of test time (at least 24h)?

When it comes to the decrement of MHs, it is surely related to the power low profile you loaded on Afterburner. Is it lowering the mem / GPU clock speeds?

My cards were running at default GPU/mem clocks (I want them to keep alive for a while) but they are already 4 years old. I just decreased the clock for both gpu and mem (-150 on both) on GPU0 similar to what you just did to see if the error appears again. This has dropped the MHs from 20.1 to 17.9 aprox on that card. Lets see if it works, otherwise I will also play with the power limit on the AfterBurner as well.

thanks a lot, will keep you posted

I have the same error on a 13x 3070 rig. Never had it before but also never had 13x in 1 rig before. If I reboot it all works again. However, not sure its related, but sometimes when i reboot one card will «show up» but not mine. I rebooted again and it seems to work.

try decreasing the clock of the mem or the gpu and/or the gpu that is failing. seems to be working for me

@jvoda1 thanks for the suggestion. I have a LiteON 1400W PSU I am going to try vs the HP one I have. This HP one is old and I got it used. I switched it out and am observing now. If I get another error 15 I will drop the overclocking and try your idea. I will report back.

Try and let us know. but I dont think it is a problem from the PSU due to lack or unstable power. To me it is more related to the mem clock which is too high to work properly either because we overclocked it excessively or a chipset that stabilizes a memory is getting too warm and we cannot detect it on the reported temperatures or just because the HW is getting old 🙂

@ril3y I believe none of us had the whole rig locked with the issue. In my case the card goes to idle mode on t-rex and do not mine anymore. Pls try to decrease the mem and/or gpu clock of the faulty card and come back to us with the result.

I just changed from Debian Buster with NVIDIA 460.39 drivers to HiveOS with NVIDIA 455.45.01 drivers and am able to push my 3070s a bit further without any crashing using the latest T-Rex. Maybe the latest Nvidia drivers are causing some problems?

Same issue on my 4x 3090 + 3080 rig, one card just gets the error code 15 and crashes

@gtdiehl @AChangXD could you please try to decrease the mem clock and see if the issue reappears? In my case I lost 10% of Hash rate but on the other hand I’ve never seen this error again on the faulty GPU

In my case the cable burned close to the power supply

In summary it seems to be caused by two main issues, 1) Wrong (exceeded) OC Settings or 2) Power issues (insufficient power or damaged cabling). In my case, it seems that its the first one, I’ll run some tests and post results later

I seem to have fixed it on the 3070 that was giving me this error.
A little rig maintenance, checking wires, making sure everything was seated properly and I gave the card in question 5W more juice and it is stable so far for the last 30 mins.

Edit: Spoke too soon. After several hours the card stopped mining again. After some research, I’ve replaced the riser completely to see if that solves the problem.

@rbfx4x did changing the riser solve the problem?

Yes it did. Stable ever since, even while pushing the OC with Kawpow.

If you’re getting the NVML error in your Cuda program, it’s likely that your GPU is lost. There are a few possible causes for this, but the most common is that your graphics card has overheated and shut down. This can happen if you’re pushing your card too hard with gaming or other demanding applications.

Another possibility is that your drivers are outdated or corrupted. Whatever the cause, you’ll need to take some steps to fix the problem before you can use your GPU again.

If you’re seeing the nvml error “GPU is lost” in your Cuda program, don’t panic! This is a common error that can occur for a variety of reasons. In most cases, it simply means that your GPU has become disconnected from the rest of your computer.

There are a few things you can do to try to fix this problem. First, check all of the connections between your GPU and your computer. Make sure everything is plugged in securely.

If that doesn’t work, try restarting your computer. Sometimes a simple reboot can fix minor issues like this.

If neither of those solutions works, you may need to reinstall your graphics drivers.

You can usually find the latest drivers for your GPU on the manufacturer’s website. Once you’ve downloaded and installed the new drivers, restart your computer again and see if the problem has been resolved.

If you’re still seeing the “GPU is lost” error after trying all of these solutions, there may be something wrong with your GPU itself.

In this case, you’ll need to contact customer support for further assistance.

What is an Nvml Error

NVML errors are return codes from the NVIDIA Management Library (NVML), a software library shipped with the NVIDIA driver that lets applications monitor and manage the GPUs installed on a system. In mining software the codes reported most often are 15 (GPU is lost), 17 and 999 (an internal driver error). The Windows message “Nvidia Display Driver has stopped responding and has recovered” is a related symptom of a driver reset rather than an NVML error as such. Both typically appear when there is a problem with the graphics driver or the card itself, or when an application is not compatible with the current version of the driver.

In some cases, it may be possible to fix NVML errors by updating the graphics driver or reinstalling the application.

What Does It Mean When a Gpu is Lost

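When NVML reports that a GPU is lost (NVML_ERROR_GPU_IS_LOST, code 15), it means that the GPU has fallen off the bus or has otherwise become inaccessible to the driver. This can happen for a variety of reasons, including an unstable overclock, power or cabling problems, a faulty riser or overheating, and less often an outright hardware failure. A reboot usually brings the device back; replacing the card is only worth considering if the error persists after the other causes have been ruled out.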

How Can I Fix an Nvml Error in My Cuda Program

If you’re getting an NVML error in your Cuda program, there are a few things you can try to fix it. First, make sure that you have the latest drivers installed for your GPU. You can download the latest drivers from NVIDIA’s website.

Secondly, try reinstalling the CUDA toolkit. You can download the toolkit from NVIDIA’s website as well. Finally, if neither of those solutions work, you may need to contact NVIDIA customer support for further assistance.
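
Before reinstalling anything, it can help to check whether the driver still sees every card. The following is a minimal Python sketch, assuming the `nvidia-smi` tool that ships with the driver is on the PATH; `parse_gpu_csv` and `query_gpus` are hypothetical helper names chosen for this example.

```python
import csv
import io
import subprocess

# Fields requested from nvidia-smi, in output column order.
FIELDS = ["index", "name", "temperature.gpu", "fan.speed", "power.draw"]


def parse_gpu_csv(text: str) -> list[dict]:
    """Parse `nvidia-smi --format=csv,noheader,nounits` output into dicts."""
    rows = csv.reader(io.StringIO(text), skipinitialspace=True)
    return [dict(zip(FIELDS, (cell.strip() for cell in row))) for row in rows if row]


def query_gpus() -> list[dict]:
    """Ask the driver for per-GPU status; raises CalledProcessError on failure."""
    cmd = ["nvidia-smi", f"--query-gpu={','.join(FIELDS)}", "--format=csv,noheader,nounits"]
    out = subprocess.run(cmd, capture_output=True, text=True, check=True, timeout=30)
    return parse_gpu_csv(out.stdout)


if __name__ == "__main__":
    try:
        for gpu in query_gpus():
            print(gpu)
    except (OSError, subprocess.SubprocessError) as exc:
        # A missing tool, a timeout or a non-zero exit status all land here.
        print("nvidia-smi query failed:", exc)
```

If the query fails, or returns fewer rows than there are cards in the machine, the problem is probably below the CUDA toolkit level (driver, power, riser or the card itself).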

Unable to Get Temperature Gpu is Lost

There are a few things that can cause this error, “Unable to Get Temperature Gpu is Lost”, when trying to overclock your GPU. Here are a few troubleshooting tips:

1. Check if you have the latest drivers installed for your GPU. If not, update them and try again.

2. Make sure your GPU is getting enough power. Overclocking can put extra strain on the power supply, so make sure it is up to the task.

3. Try lowering the overclock settings and see if that helps stabilize things. Sometimes pushing the limits can be unstable even with good hardware.

4. If all else fails, contact the support of your GPU manufacturer and they may be able to help you further diagnose the issue.

Nvidia Drivers

Nvidia Drivers are specialized computer programs that enable communication between your Nvidia graphics card and your operating system. These drivers are updated regularly to ensure that your Nvidia card is always able to deliver optimal performance. To get the most out of your Nvidia graphics card, it is important to keep your drivers up-to-date.

If you have an Nvidia graphics card, you can update your drivers automatically through the GeForce Experience application or manually by downloading and installing the latest drivers from the Nvidia website. GeForce Experience is a free application that makes it easy to keep your drivers up-to-date and optimize your game settings. If you prefer to update your drivers manually, you can download the latest drivers for your specific Nvidia graphics card from the NVIDIA Driver Downloads page.

Once you have downloaded the driver file, double-click on it to begin installation. During installation, you will be prompted to select which components you wish to install. Be sure to select “Custom Install” and deselect any optional software that is included in order to avoid installing unwanted bloatware on your system.

After installation is complete, reboot your computer for the changes to take effect.

You should now have the latest Nvidia driver installed on your system! Updating your drivers regularly is crucial for maintaining optimal performance of your Nvidia graphics card.

Cuda Gpu is Lost

If your CUDA GPU is lost, there are a few things you can do to try and recover it. First, check to see if the issue is with your graphics card or with your system settings. If the problem is with your graphics card, you may need to update your drivers or reseat the card in its slot.

If the problem is with your system settings, you may need to reconfigure your BIOS or reset your CMOS. Once you’ve determined the cause of the problem, you can take steps to fix it and get your CUDA GPU back up and running.

Nvapi Error Mining

If you’re a Windows user, you may have come across the NVAPI Error while trying to mine cryptocurrency. This error is caused by an outdated version of the NVIDIA drivers. In order to fix this, you need to update your drivers to the latest version.

Here’s a step-by-step guide on how to do that:

1) Download and install Driver Easy.

2) Run Driver Easy and click the Scan Now button.

Driver Easy will then scan your computer and detect any problem drivers.

3) Click the Update button next to a flagged NVIDIA driver to automatically download and install the correct version of this driver (you can do this with the FREE version).

4) Or click Update All to automatically download and install all the latest correct drivers for your system (this requires the Pro version – you’ll be prompted to upgrade when you click Update All).

Trex Error Code 15

If you’re a T-Rex user, you may have come across error code 15. In T-Rex logs this code appears in NVML warnings such as “WARN: NVML: can’t get GPU #0, error code 15”, and NVML code 15 means that the GPU is lost (NVML_ERROR_GPU_IS_LOST): the driver can no longer talk to the card. There are a few things you can do to try and fix this issue.

First, try restarting both the T-Rex application and your computer. Users report that a reboot usually brings the card back, at least for a while.

If the error returns, lower the memory and core overclock on the affected card and check its power cables and riser, since excessive overclocking and power or cabling faults are the two causes reported most often.

If restarting doesn’t work, then you may need to uninstall and reinstall both Trex and the driver for your graphics card. To do this, go to the Control Panel on your computer and select “Add or Remove Programs.”

Find both Trex and the driver in the list of programs, select them, and click “Uninstall.” Once they’re uninstalled, restart your computer and then install them again from scratch.

Hopefully one of these solutions will fix error code 15 for you.

If not, then you’ll need to contact customer support for further assistance.

Hiveos Gpu is Lost

If you’re a fan of mining cryptocurrencies, then you’ve probably heard of HiveOS. It’s a Linux-based operating system designed specifically for mining rigs. One of the best features of HiveOS is its ability to monitor and manage your GPUs remotely.

However, there have been reports that some users are losing their GPUs when they update to the latest version of HiveOS. Here’s what you need to know about this issue.

Some users have reported that their GPUs are no longer recognized by HiveOS after updating to the latest version.

This appears to be caused by a change in the way that HiveOS detects and identifies GPUs. As a result, any GPU that isn’t properly supported by the new detection method will simply be lost and won’t be usable by the system.

While this is certainly a serious problem, it doesn’t appear to be affecting all users equally.

Some people seem to be able to update without any issues, while others are reporting multiple lost GPUs. It’s not clear why this is happening or how widespread the problem actually is. However, it’s something that you should be aware of if you’re planning on updating your HiveOS installation.

If you do encounter this issue, there is a workaround that may help you recover your lost GPU. According to some reports, unplugging and then replugging in your GPU after updating HiveOS can cause it to reappear and become usable again.

Exit Code 15 Hiveos

If you’ve ever used the Linux operating system, you’re probably familiar with the concept of an exit code. An exit code is a number that is returned to a shell after a program has been executed. Exit codes are used to indicate whether or not a program was successful.
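
To make the idea concrete, here is a minimal Python sketch that starts a child process and reads back its exit code. The child is just a one-line Python script that exits with status 15; the value is chosen only to match the heading and says nothing about what HiveOS itself means by it. `run_and_get_exit_code` is a hypothetical helper name.

```python
import subprocess
import sys


def run_and_get_exit_code(args: list[str]) -> int:
    """Run a command and return the exit code it reports to its parent."""
    return subprocess.run(args).returncode


if __name__ == "__main__":
    # The child exits with status 15 purely for illustration.
    code = run_and_get_exit_code([sys.executable, "-c", "import sys; sys.exit(15)"])
    print("child exit code:", code)
    print("success" if code == 0 else "failure")
```

By convention an exit code of 0 means success and any other value signals some kind of failure; what a particular non-zero value means is up to the program that returned it.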

In the context of HiveOS, exit code 15 is commonly attributed to the system being unable to connect to the internet. This can be caused by a number of things, including incorrect network settings or a problem with your ISP. Note that a process exit code is a different thing from NVML error code 15, which means that a GPU is lost. If you’re troubleshooting exit code 15, here are some things you can try:

Check your network settings: Make sure that your network card is configured correctly and that you’re using the right IP address, gateway, and DNS servers.

Restart your modem/router: Sometimes a simple restart can fix connectivity issues.

Contact your ISP: If you’ve verified your network settings and restarted your modem/router, but you still can’t connect to the internet, there may be an issue with your ISP.

Contact them for assistance.

Unable to Find an Entry Point Named nvmlInitWithFlags in DLL nvml.dll

If you’re getting the error “Unable to find an entry point named nvmlInitWithFlags in DLL nvml.dll” when trying to launch a game or other application, it’s likely that your graphics card drivers are outdated: older versions of nvml.dll do not export this function. This can happen if you’ve installed new hardware and/or software recently, as Windows will sometimes automatically replace your drivers without you realizing it.

To fix this problem, simply download and install the latest drivers for your NVIDIA graphics card from their website. Once that’s done, restart your computer and try launching the game or application again. It should now work without any issues!

Conclusion

If you’re getting the NVML error in your CUDA program, it often means that your GPU is lost. This can happen for a variety of reasons, including overheating, an unstable overclock, and power or riser problems. Start by improving cooling, for example with a custom fan curve in a tool like MSI Afterburner, then lower the overclock and check the cables and risers.

If the problem persists, you may need to replace your graphics card.

Graphics Card Errors in Mining

The most complete collection of mining errors on Windows, HiveOS and RaveOS, with quick and stress-free fixes for them

Can’t find nonce with device CUDA_ERROR_LAUNCH_FAILED

Miner error: Can’t find nonce

The error says that the miner cannot find a nonce, and it suggests the fix itself: reduce the overclock. Beginner miners in particular try to squeeze the maximum out of a card and push the core or memory too far. With such an overclock the card may even start, but it will then produce errors like this one. Remember that steady share submission to the pool is better than chasing numbers in the miner.

Phoenixminer: Connection to API server failed. What to do?

Error: Connection to API server failed

This error occurs with PhoenixMiner on the HiveOS operating system. It means that the mining rig cannot connect to the statistics server. To fix it:

Run the net-test command and note the server with the lowest ping. Then switch to that server in the Hive web interface (on the worker) and reboot the rig.
If that does not help, run the command dnscrypt -i && sreboot

Phoenixminer CUDA error in CudaProgram.cu:474 : the launch timed out and was terminated (702)

Miner error: Phoenixminer CUDA error in CudaProgram

This error, like the first one, indicates that the card is overclocked too far. Roll the card back to factory settings and raise the overclock gradually, stopping at the last setting that runs without the error.
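
The "back to stock, then step up" procedure can be sketched as a simple loop. This is a minimal Python sketch under stated assumptions: `is_stable` is a hypothetical callback that you would implement yourself (for example, by applying the offset with your overclocking tool and watching the miner log for a while), and the step and limit values are placeholders, not recommendations.

```python
from typing import Callable


def find_stable_offset(is_stable: Callable[[int], bool],
                       step: int = 50, limit: int = 1000) -> int:
    """Raise a clock offset from 0 in `step` increments and return the
    last value for which `is_stable` returned True (0 means stock)."""
    last_good = 0
    offset = step
    while offset <= limit:
        if not is_stable(offset):
            break  # first failing value: keep the previous one
        last_good = offset
        offset += step
    return last_good


if __name__ == "__main__":
    # Stand-in for a real test run: pretend the card fails above +420.
    print(find_stable_offset(lambda mhz: mhz <= 420))  # prints 400
```

In practice each call to `is_stable` should cover a long enough mining run, since errors of this kind can take hours to show up.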

UNABLE TO ENUM CUDA GPUS: INVALID DEVICE ORDINAL

Miner error: Unable to enum CUDA GPUs: invalid device ordinal

Check the graphics card drivers and the card itself (how it is listed in Device Manager and whether it shows any exclamation marks).
If everything is fine, check the risers. A riser is often the cause of this error.

UNABLE TO ENUM CUDA GPUS: INSUFFICIENT CUDA DRIVER: 5000

Miner error: Unable to enum CUDA GPUs: Insufficient CUDA driver: 5000

As with the previous error, check the graphics card drivers and the card itself (how it is listed in Device Manager and whether it shows any exclamation marks).

NBMINER MINING PROGRAM UNEXPECTED EXIT.CODE: -1073740791, REASON: PROCESS CRASHED

Miner error: NBMINER MINING PROGRAM UNEXPECTED EXIT.CODE: -1073740791, REASON: PROCESS CRASHED

The NBMiner error with code -1073740791 can occur if your rig is built from a mix of Nvidia and AMD cards. In that case split mining into two .bat files (or two flight sheets if you are on HiveOS): one for the AMD cards, the other for the Nvidia cards.

NBMINER CUDA ERROR: OUT OF MEMORY (ERR_NO=2): how to fix it?

Miner error: NBMINER CUDA ERROR: OUT OF MEMORY (ERR_NO=2)

One of the most common errors on Windows is running out of memory, here in NBMiner, although it also occurs in the NiceHash miner. To fix it, increase the page file. The page file should equal the combined memory (in GB) of all graphics cards in the rig plus a 10% margin.
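
The sizing rule above is simple arithmetic. A minimal Python sketch; `pagefile_size_gb` is a hypothetical helper name, and the 10% margin is the one suggested in the text.

```python
def pagefile_size_gb(vram_per_card_gb: list[float], margin: float = 0.10) -> float:
    """Sum of the VRAM of all cards in the rig plus a safety margin."""
    return round(sum(vram_per_card_gb) * (1 + margin), 1)


if __name__ == "__main__":
    # Six 6 GB cards: 36 GB of VRAM plus 10% gives 39.6 GB.
    print(pagefile_size_gb([6] * 6))
```

For a mixed rig, pass each card's memory separately, for example `pagefile_size_gb([8, 8, 6])`.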

GMINER ERROR ON GPU: OUT OF MEMORY STOPPED MINING ON GPU0

Miner error: GMINER ERROR ON GPU: OUT OF MEMORY STOPPED MINING ON GPU0

In this case the culprit is most likely not the page file but an excessive overclock on the graphics card numbered 0. Reduce the overclock and the error should go away.

Socket error. the remote host closed the connection (NBMiner)

Socket error. the remote host closed the connection

It may also be reported as "ERROR - Failed to establish connection to mining pool: Socket operation timed out".
This is a network problem: check the rig's internet connection and restart the router.
It is also possible that your ISP is closing the connection to the pool. Change the pool, try a VPN, or switch DNS to an external provider, for example Cloudflare 1.1.1.1 and 1.0.0.1.

Server not responded on share, in GMiner

This error indicates a problem with your internet connection, which is critical for GMiner. Try restarting the router and disabling the watchdog in the miner.

DAG has been damaged check overclocking settings, in GMiner

The same error may read Device not responding, check overclocking settings.
It points to an excessive overclock, so reduce the overclock first.
If that does not help, change the miner: GMiner has never been known for working well with AMD cards. We recommend TeamRedMiner, or lolMiner if you need one miner to support Nvidia and AMD cards together.
If changing the miner does not help, reinstall the video driver.
If that does not help either, test the card on its own in the x16 slot.

ERROR: Can’t start T-Rex, failed to initialize device map: can’t get busid, code -6

Memory configuration errors with code -6 usually point to a driver problem.

If you are on Windows, use DDU (Display Driver Uninstaller) to remove all Nvidia drivers completely.
Reboot the system.
Install a fresh driver directly from the Nvidia website.
Reboot the system again.
If you are on HiveOS/RaveOS, flash a clean system image to be sure.

TREX: Can’t unlock GPU

Full error text:
TREX: Can’t unlock GPU [ID=1, GPU #1], error code 15
WARN: Miner is going to shutdown…
WARN: NVML: can’t get fan speed for GPU #1, error code 15
WARN: NVML: can’t get power for GPU #1, error code 15
WARN: NVML: can’t get mem/core clock for GPU #1, error code 17

Solution:

Check every cable connection on the video card and the riser, especially the power cables.
If all of that is fine, swap the riser for one that is known to work.
If the error remains, plug the video card directly into the x16 slot on the motherboard.
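When a log contains many such warnings, it helps to group them by GPU before reaching for cables. The Python sketch below does that for lines shaped like the ones quoted above; the function name is hypothetical, the regular expression is fitted only to this sample, and the three code names are the NVML constants this guide lists for codes 15, 17 and 999.

```python
import re
from collections import defaultdict

# NVML return codes named earlier in this guide.
NVML_CODES = {
    15: "NVML_ERROR_GPU_IS_LOST",
    17: "NVML_ERROR_OPERATING_SYSTEM",
    999: "NVML_ERROR_UNKNOWN",
}

# "can.t" accepts both the plain and the typographic apostrophe.
WARN_RE = re.compile(r"NVML: can.t get (.+?) for GPU #(\d+), error code (\d+)")


def nvml_failures(log_text):
    """Return {gpu_index: [(what_failed, code, code_name), ...]}."""
    by_gpu = defaultdict(list)
    for what, gpu, code in WARN_RE.findall(log_text):
        code = int(code)
        by_gpu[int(gpu)].append((what, code, NVML_CODES.get(code, "unlisted code")))
    return dict(by_gpu)


sample = """WARN: NVML: can’t get fan speed for GPU #1, error code 15
WARN: NVML: can’t get power for GPU #1, error code 15
WARN: NVML: can’t get mem/core clock for GPU #1, error code 17"""

for gpu, failures in nvml_failures(sample).items():
    for what, code, name in failures:
        print(f"GPU #{gpu}: {what} failed with {code} ({name})")
```

If every warning points at the same GPU index, that card's riser and power cables are the natural place to start.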

CAN’T START MINER, FAILED TO INITIALIZE DEVICE MAP, CAN’T GET BUSID, CODE -6

In this particular case the power supply was at fault: it could not sustain 3 video cards. After the power supply was replaced, the error went away.
If you are sure your power supply has enough capacity, try a different miner.

ERROR: VIDEO CARD SHOWS 511 DEGREES

A reading of 511 degrees points to a faulty riser or faulty card power. Check all connections. To find the fault, start the system with a single card, test it, and then add the remaining cards one at a time.

GPU driver error, no temps in HiveOS: what to do?

You most likely hit this error while mining on HiveOS. It can have several causes, both software and hardware (for example a riser).
You can try the cheap fix first and run this command in HiveOS:
hive-replace -y --stable
The system will reinstall the stable version of HiveOS.

If the error persists, check the riser.

GPU are lost, rebooting

This is not an error but the consequence of one. To find out which error makes the cards reboot, do the following:

Enable log saving (it is off by default) with the command

logs-on

Then reboot the rig.
Once the error recurs, you can download the logs with the commands below.
To upload the miner log straight from the dashboard, use:

message file "miner.log" -f=/var/log/miner/minername/minername.log

For example, for the TeamRedMiner logs:
message file "teamredminer.log" -f=/var/log/miner/teamredminer/teamredminer.log

The command you sent is shown in blue and the uploaded file in white. Click the file to download it.
This command downloads the system log:

message file "syslog" -f=/var/log/syslog
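The miner-log command follows one pattern, with the miner name appearing three times. A small Python helper can fill it in; this is a sketch that assumes every miner keeps its log at /var/log/miner/NAME/NAME.log, as the two examples above suggest, and the function name is made up for illustration.

```python
def miner_log_command(miner_name):
    """Build the dashboard command that uploads a miner's log file."""
    # Path layout assumed from the examples in the text.
    path = f"/var/log/miner/{miner_name}/{miner_name}.log"
    return f'message file "{miner_name}.log" -f={path}'


print(miner_log_command("teamredminer"))
```

Check the actual directory name on the rig before relying on the result, since a miner may not follow this layout.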

exitcode=3 in HiveOS

If the error persists, check the riser.

exitcode=1 in HiveOS

This error occurs when the date in the motherboard BIOS is wrong (the clock has been reset) and/or there is a problem with the internet connection.
If the clock is wrong, you will not be able to connect remotely.
Even so, the Nvidia driver update should go through with the command:

nvidia-driver-update --list

gpu fault detected 146

You are most likely trying to mine with PhoenixMiner. There are two fixes:

Roll back to an older version, for example 5.4c
(Recommended) Use T-Rex for Nvidia cards and TeamRedMiner for AMD.

Waiting interface to come up: VPN does not work on HiveOS

Start with the logs to see which error is causing the problem.
Commands for getting the logs:
systemctl status openvpn@client
journalctl -u openvpn@client -e --no-pager -n 100

How to find the IP address of a Hive OS worker

The simplest way is to open the worker and scroll down below the video cards. The Remote IP shown there is the external IP.
Alternatively, you can check the worker's external IP address from the Hive Shell console.
Run one of these commands:
curl 2ip.ru
wget -qO- eth0.me
wget -qO- ipinfo.io/ip
wget -qO- ipecho.net/plain
wget -qO- icanhazip.com
wget -qO- ipecho.net
wget -qO- ident.me

Repository update failed in HiveOS

Occasionally seen on HiveOS. Full error text:

Some index files failed to download. They have been ignored, or old ones used instead.
Repository update failed
------------------------------------------------------
> Restarting autofan and watchdog
> Starting miners
Miner screen is already running
Run miner or screen -r to resume screen
Upgrade failed

Solution:

Run the command apt update && selfupgrade -f
If that does not work either, the HiveOS developers almost certainly already know about the problem and are fixing it. Try the update again later.

RaveOS does not start. Boot aborted RaveOS

Recheck all PC and motherboard BIOS settings:
- Set the boot device to HDD/SSD/M2/USB, depending on the media that holds the OS.
- Enable 4G decoding.
- Set PCIe support to Auto.
- Enable the integrated graphics.
- Set the preferred boot mode to Legacy mode.
- Disable virtualization.

If some of the cards are still not detected after these settings, apply the following BIOS changes (a full reboot is required after each step):

- Disable 4G decoding
- Reboot
- Disable CSM
- Reboot
- Enable 4G decoding and set PCI-E to Gen2/3; if Gen2/3 is not available, Gen1 can be used

Failed to allocate memory in RaveOS

The same error may appear as:
failed to allocate initramfs memory bailing out, failed to load ldlinux.c32
or
failed to allocate memory for kernel boot parameter block
or
failed to allocate initramfs memory raveos bailing

In every case the fix is the same: configure the motherboard BIOS correctly.

gpu_driver_fault, GPU #0 fault in RaveOS

In most cases this is fixed by reducing the overclock (especially on memory) on the specific video card, which is card number 0 in the message above.
If reducing the overclock does not help, try updating the drivers.
If updating the drivers does not fix it, swap the riser on this card for one that is known to work.
If that does not help either, recheck all cable connections and whether the power supply is strong enough for your configuration.

Gpu driver fault. All tasks have been stopped. Worker will be rebooted after 5 minutes in RaveOS

What causes this error? You have probably overclocked a video card too far (memory is often pushed hard), so reduce the overclock. In the screenshot for this error the fault came from GPU number 1; start with the card your own message names.
The second common cause is a power supply too weak for the system and its video cards. Bear in mind that the system itself draws at least 100 W, and budget another 50 W for each riser. The power supply should cover the load with a 20% margin.
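The power budget above can be turned into a quick calculation. The Python sketch below is one reading of the rule, assuming one riser per card and that the 20% margin is applied on top of the total load; the function name and the per-card wattages are hypothetical.

```python
def required_psu_watts(gpu_watts, system_watts=100, riser_watts=50, headroom=0.20):
    """Estimate the PSU rating needed for a rig.

    gpu_watts: power draw of each card in watts (one riser per card assumed).
    """
    load = system_watts + sum(gpu_watts) + riser_watts * len(gpu_watts)
    return load * (1 + headroom)


# Hypothetical rig: three cards drawing about 120 W each.
cards = [120, 120, 120]
print(f"Look for a PSU rated at {required_psu_watts(cards):.0f} W or more")
```

If a rig uses several power supplies, run the same calculation separately for the cards and risers attached to each one.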

Miner restarted after error in RaveOS

Look at the miner logs; they will name the specific error that leads to the miner restart. Then find that error on this page and fix it, and this problem will go away.

Miner restart limit reached. Worker rebooting by flag auto in RaveOS

As with the previous item, look at the miner logs; they will name the specific error that makes the worker restart. Fix that error and this problem goes away too.

Miner cannot be started, in RaveOS

Another error is usually logged just before this one, and that error is the real cause. If there is nothing there:

Pause the miner, reboot the rig, and run the commands clear-miners, clear-logs and fix-fs in the console. Start mining.
If the error persists, rewrite the RaveOS image.

Overclock can’t be applied in RaveOS

This error means the overclock values conflict with each other or fall outside the allowed range. Recheck them. Reset the overclock to stock and try again.
In rare cases a riser also causes this error.

Error installing hive miners

You can try the cheap fix first and run this command in HiveOS:
hive-replace -y --stable
The system will reinstall the stable version of HiveOS.

If the error persists, physically rewrite the image. If you are booting from a USB flash drive, it has most likely died. Buy an SSD.

Warning: Nvidia settings applied with errors

The overclock is too high. Lower the core and memory clock values, then reboot the rig.

Nvtool error or Danger: nvtool error

Most likely a problem with the nvtool module appeared while the driver was being installed.
Try reinstalling the Nvidia driver with this command in Hive shell (replace driver_version with the version you need):
nvidia-driver-update driver_version --force
Or try updating the whole system with this command in Hive shell:
hive-replace -y --stable

The video card fan is no longer shown in HiveOS

The fan speed reads 0%.
This can happen for several reasons:

the fan really is not spinning
the RPM sensor is disconnected or broken
the video card is overclocked too aggressively
the riser or one of its parts is faulty

ERROR: parsing JSON failed

Run the following command locally on the rig (with a keyboard and monitor attached):
net-test

This command shows the current state of your connection to the various mirrors of the HiveOS API servers.
See which API mirror has the lowest latency (ping), and once the worker reappears in the dashboard, change the default mirror to the one closest to you.
After changing the mirror, be sure to reboot the worker.
You can change the API server with the command nano /hive-config/rig.conf
After the change, press ctrl + o and Enter to save the file.
Then exit to the console with ctrl + x (or F10) and run the command hello
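Picking the mirror comes down to taking the smallest latency among the mirrors that answered. The Python sketch below shows that step only; it does not parse net-test output, and the host names and figures are placeholders rather than real HiveOS mirrors.

```python
def closest_mirror(latencies_ms):
    """Return the host with the lowest latency, ignoring unreachable ones.

    latencies_ms maps a host name to its ping in milliseconds, or to None
    when the mirror did not respond.
    """
    reachable = {host: ms for host, ms in latencies_ms.items() if ms is not None}
    if not reachable:
        raise ValueError("no mirror responded")
    return min(reachable, key=reachable.get)


# Placeholder host names and latencies copied by hand from a net-test run.
measured = {"mirror-a.example": 120.0, "mirror-b.example": 45.5, "mirror-c.example": None}
print(closest_mirror(measured))
```

An unreachable mirror is skipped rather than treated as zero latency, which is the mistake a naive minimum would make.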

NVML: can’t get fan speed for GPU #5, error code 999 hive os

A fan speed problem on GPU 5
Fan speed at 0%, or errors in general.
This can happen for several reasons:
- the fan really is not spinning
- the RPM sensor is disconnected or broken
- the video card is overclocked too aggressively
Start with a visual inspection of the card and its fan.

Can’t get power for GPU #2

This error usually appears alongside others:
Attribute ‘GPUGraphicsClockOffset’ was already set to 0
Attribute ‘GPUMemoryTransferRateOffset’ was already set to 2200
Attribute ‘GPUFanControlState’ (hive1660s_ETH:0[gpu:2]) assigned value 0.

20211029 12:40:50 WARN: NVML: can’t get fan speed for GPU #2, error code 999
20211029 12:40:50 WARN: NVML: can’t get power for GPU #2, error code 999
20211029 12:40:50 WARN: NVML: can’t get mem/core clock for GPU #2, error code 999

Solution:

Check that the driver for the video card is installed correctly.
If the driver is fine, try different overclock settings, for example a lower memory overclock.

GPU1 search error: unspecified launch failure

Reduce the overclock and check the riser contacts.

Warning: Autofan: unable to set fan speed, rebooting

Find the miner logs and look at the errors recorded there. For example:

kernel: [12112.410046][ T7358] NVRM: GPU at PCI:0000:0c:00: GPU-236e3bef-2e03-6cdb-0518-7ac01eb8736d
kernel: [12112.410049][ T7358] NVRM: Xid (PCI:0000:0c:00): 62, pid=7317, 0000(0000) 00000000 00000000
kernel: [12112.433831][ T7358] NVRM: Xid (PCI:0000:0c:00): 45, pid=7317, Ch 00000010
CRON[21094]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)

The logs show a problem with the video card in PCIe slot 0c:00 (the PCIe slot address is shown in place of the GPU number), with errors 45 and 62.
Error codes, including others that may appear there, and what to do about them:

• 13, 43, 45: memory errors, lower MEM
• 8, 31, 32, 61, 62: lower CORE, possibly MEM as well
• 79: lower CORE, check the riser
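The lookup described above can be automated for longer logs. The Python sketch below pulls the PCI address and Xid code out of lines shaped like the sample and attaches the advice from the list; the regular expression is fitted to that sample only, the names are hypothetical, and codes outside the list are reported as unknown rather than guessed.

```python
import re

# Xid code -> suggested action, copied from the list in the text.
XID_ACTIONS = {}
for codes, action in [
    ((13, 43, 45), "memory error: lower MEM"),
    ((8, 31, 32, 61, 62), "lower CORE, possibly MEM as well"),
    ((79,), "lower CORE and check the riser"),
]:
    for code in codes:
        XID_ACTIONS[code] = action

# Matches e.g. "NVRM: Xid (PCI:0000:0c:00): 62, pid=7317, ..."
XID_RE = re.compile(r"NVRM: Xid \(PCI:([0-9a-fA-F:.]+)\): (\d+)")


def xid_events(log_text):
    """Return a list of (pci_address, xid_code, suggested_action)."""
    events = []
    for match in XID_RE.finditer(log_text):
        bus, code = match.group(1), int(match.group(2))
        events.append((bus, code, XID_ACTIONS.get(code, "not in the list above")))
    return events


sample = """kernel: [12112.410049][ T7358] NVRM: Xid (PCI:0000:0c:00): 62, pid=7317, 0000(0000) 00000000 00000000
kernel: [12112.433831][ T7358] NVRM: Xid (PCI:0000:0c:00): 45, pid=7317, Ch 00000010"""

for bus, code, action in xid_events(sample):
    print(f"{bus}: Xid {code} -> {action}")
```

Several codes reported for one PCI address, as in the sample, usually mean a single card needs attention rather than the whole rig.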

Kernel-Power error, code 41

Check all the cables (from the PSU to the cards and from the PSU to the risers); something may be melting. If a visual inspection shows everything is fine, the error is a software one and you need to reinstall Windows.

Danger: hive-replace -y --stable (failed, exitcode=137)

A very rare error that came up during a remote update of the HiveOS image. It does not appear in mining groups or on mining sites. You will not believe what happened.
A family of pigeons had settled on the balcony where the rig stood. They fouled the rig, quite literally, and it kept going offline because of that. After the motherboard and video cards were thoroughly blown clean, the problem went away.

MALFUNCTION HIVEOS

Malfunction means a hardware or software fault. There can be several causes and fixes:

Reinstall the video driver.
If the driver did not help, disconnect all GPUs and plug them back in one at a time, checking whether any card triggers the error. If one does, the riser may be to blame.
The media that holds Hive OS is faulty; write the image again.

This is a new one to me and has me stumped. GTX 1080 is the problem.

Machine has two 3070s and 3 1080s hooked to PCI-E risers.

Deleted and reinstalled drivers. New riser on the GPU getting the error. Replaced PCI power cord to GPU and to riser.

Afterburner shows 6 GPUs instead of the 5 connected to the MSI X470 board. In this case it shows 4 1080s. The settings for two of them are blacked out. I presume one of them is just a ghost since I don’t have 4 1080s. The settings for two of them are adjustable and will hash away at 34-35Mh/s comfortably.

The motherboard and one 3070 are powered from a plain ATX power supply. The other 3070 and the three 1080s are powered by a 1200 watt server PSU. Each riser has its own power and each GPU has its own power. No SATA. No burned or melted cables. Voltage on the breakout board shows 12.4 v. Wall power (110 and 220) were both confirmed to be spot on by the power company 2 days ago. They were out for an unrelated issue (yes, I paid my bill lol).

Errors are:

— NVML error in CudaProgram.cu:216 : GPU is lost (15)

— NVML error in CudaProgram.cu:219 : Invalid Argument (2)

— NVAPI error in NvapiWrapper.c:341 : -6

The bat file is pretty plain vanilla. Past the base wallet, username, and exe file, the only real thing I’ve set is -straps 2.

I had this all up and running for a few days until AT&T came out and disconnected my service for a repair (I aerated over a shallow line and it had to be replaced). I sincerely doubt this caused anything but I recognize that is when this began. I’ve been messing with this thing for 5 days. I have no doubt I’m overlooking something simple. Can someone point me in the right direction? I would be most grateful.

Thank you.

EDIT: I did get everything up and running eventually. I’m not sure if what I did fixed the problem or if I just held my mouth the right way and offered a bit of beer and blood to the hashing gods. But, what I did was….

— updated BIOS on the motherboard

— DDU in safe mode, installed new drivers (got all the AMD stuff off the system which I don’t know why it was there. It’s an nVidia rig only)

— updated Phoenix to 5.5

— disabled, then re-enabled Above 4G and Gen 1 on the motherboard

Funny thing is, it didn’t work at first after doing this. After a couple reboots and a liberal application of foul language, it began working. Everything is still fine.

I updated this because I have seen a few people with this problem but no one seems to follow up with a solution. If this helps one person, it was worth it.
