System information
- Have I written custom code (as opposed to using a stock example script provided in TensorFlow): I have followed this tutorial using my own data. Tutorial: https://stackabuse.com/text-generation-with-python-and-tensorflow-keras/ My data: https://gist.github.com/Urkchar/e01a667c1656e874f918ff92db5b998f
- OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Microsoft Windows 10 Home version 10.0.19041 Build 19041
- TensorFlow installed from (source or binary): I installed TensorFlow with pip install tensorflow
- TensorFlow version: 2.4.0
- Python version: 3.8.6
- CUDA/cuDNN version: CUDA 11.2, cuDNN 8.0.5
- GPU model and memory: Nvidia GTX 1070 with 8 GB memory
- Exact command to reproduce:
model.fit(x, y, epochs=8, batch_size=256, callbacks=desired_callbacks)
I receive unexpected errors when trying to fit a model. Examples include CUDA_ERROR_ILLEGAL_ADDRESS and CUDA_ERROR_LAUNCH_FAILED. Full logs and error messages are below.
Messages like F tensorflow/core/common_runtime/device/device_event_mgr.cc:221] Unexpected Event status: 1 make me think that this is a bug, because these are unexpected errors or failures.
import sys
import numpy
from nltk.tokenize import RegexpTokenizer
from nltk.corpus import stopwords
from keras.models import Sequential
from keras.layers import Dense, Dropout, LSTM
from keras.utils import np_utils
from keras.callbacks import ModelCheckpoint

print(sys.version_info)


def tokenize_words(text_input):
    # Lowercase everything to standardize it
    text_input = text_input.lower()
    # Instantiate the tokenizer
    tokenizer = RegexpTokenizer(r"\w+")
    tokens = tokenizer.tokenize(text_input)
    # If the created token isn't in the stop words, make it part of "filtered"
    filtered = filter(lambda token: token not in stopwords.words("english"), tokens)
    return " ".join(filtered)


input_file = open("yejibro_data.txt").read()
# input_file = open("84-0.txt", "r", encoding="utf-8").read()
# Preprocess the input data, make tokens
processed_inputs = tokenize_words(input_file)

chars = sorted(set(processed_inputs))
char_to_num = dict((c, i) for i, c in enumerate(chars))

input_len = len(processed_inputs)
vocab_len = len(chars)
print(f"Total number of characters: {input_len}")
print(f"Total vocab: {vocab_len}")

seq_length = 100
x_data = []
y_data = []

# Loop through the inputs, start at the beginning and go until we hit
# the final character we can create a sequence out of
for i in range(0, input_len - seq_length, 1):
    # Define input and output sequences
    # Input is the current character plus the desired sequence length
    in_seq = processed_inputs[i:i + seq_length]
    # Output is the single character that follows the input sequence
    out_seq = processed_inputs[i + seq_length]
    # Convert the list of characters to integers based on the previous mapping
    # and add the values to our lists
    x_data.append([char_to_num[char] for char in in_seq])
    y_data.append(char_to_num[out_seq])

n_patterns = len(x_data)
print(f"Total patterns: {n_patterns}")

x = numpy.reshape(x_data, (n_patterns, seq_length, 1))
x = x / float(vocab_len)
y = np_utils.to_categorical(y_data)

model = Sequential()
model.add(LSTM(256, input_shape=(x.shape[1], x.shape[2]), return_sequences=True))
model.add(Dropout(0.2))
model.add(LSTM(256, return_sequences=True))
model.add(Dropout(0.2))
model.add(LSTM(128))
model.add(Dropout(0.2))
model.add(Dense(y.shape[1], activation="softmax"))
model.compile(loss="categorical_crossentropy", optimizer="adam")
# print(model.summary())

filepath = "model_weights_saved.hdf5"
checkpoint = ModelCheckpoint(filepath, monitor="loss", verbose=1, save_best_only=True, mode="min")
desired_callbacks = [checkpoint]

model.fit(x, y, epochs=8, batch_size=256, callbacks=desired_callbacks)
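For completeness, here is a small diagnostic sketch (an assumption of mine, not part of the script above) that could be added near the top of the script to ask TensorFlow 2.x to grow GPU memory on demand instead of reserving the whole card up front; this would help rule out allocator pressure as the trigger.

# Possible diagnostic (not in the original script): enable on-demand GPU memory growth.
import tensorflow as tf

gpus = tf.config.list_physical_devices("GPU")
for gpu in gpus:
    tf.config.experimental.set_memory_growth(gpu, True)  # must run before the GPU is initialized
print("Memory growth enabled for:", gpus)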
Here are some error messages and logs that occurred at seemingly random points while running the code.
https://paste.pythondiscord.com/yomuzokuto.apache
https://paste.pythondiscord.com/orusujidaq.yaml
https://paste.pythondiscord.com/igexetayen.swift
https://paste.pythondiscord.com/ovawipegap.yaml
https://paste.pythondiscord.com/yopitiweri.less
Final note: I would be very happy to include more information if needed. I may have neglected to include some information that is necessary to figure out what’s going on, so please let me know.
Final final note: This is my first bug report on GitHub, so I tried to follow the template to the best of my ability.
"CUDA_ERROR_ILLEGAL_ADDRESS" is a CUDA error that can occur for multiple reasons. Refer to the following five troubleshooting steps, which should address the most common reasons for this error:
1) The most likely cause is accessing a memory address that is not intended to be accessed, for example trying to access index 10 of an array that only has 5 elements. Check to make sure that all CUDA memory accesses are valid and within the bounds of the array sizes. This is the first troubleshooting step that should be taken before trying the other steps (a small bounds-guard sketch follows after these steps).
2) After checking to see there are no out-of-bounds memory accesses, you should then check if you have the most recent NVIDIA drivers for your GPU. To download the latest NVIDIA drivers, please refer to the following NVIDIA link:
3) Another potential cause for this CUDA error is the GPU running out of memory. GPU memory usage can be monitored with the "nvidia-smi" command-line tool, which ships with the CUDA toolkit (a small polling sketch also follows after these steps). For more information regarding "nvidia-smi", please refer to the following NVIDIA documentation link:
4) Another useful tool that will help narrow down the issue and give more information than the bare "CUDA_ERROR_ILLEGAL_ADDRESS" error message is "cuda-memcheck". This tool ships with the CUDA toolkit and can precisely detect and attribute out-of-bounds and misaligned memory access errors in CUDA applications. To run "cuda-memcheck" on a CUDA MEX script, you must first compile the CUDA MEX code with the "-G" compiler flag, similar to the following code:
mexcuda -G myCudaMexFunction.cu
To run "cuda-memcheck" on a MATLAB script that uses a compiled CUDA MEX script, please use the following command:
cuda-memcheck /path/to/matlab -nodisplay -nojvm -batch "cd <to your code>; <name of MATLAB script to run>"
For more information on "cuda-memcheck", please refer to the following NVIDIA documentation link:
Other CUDA MEX debugging methods, such as using Visual Studio debugger, can be found by referring to the following MathWorks documentation link.
5) Another potential cause of "CUDA_ERROR_ILLEGAL_ADDRESS" is the use of complex data. If the CUDA MEX function takes complex data as input, it is important to make sure that the complex data is handled correctly on both the MATLAB side and the CUDA side. If the handling of complex data on either side (MATLAB or CUDA) is incorrect, the CUDA error "CUDA_ERROR_ILLEGAL_ADDRESS" can occur randomly or depending on the input size. For proper complex data handling in MATLAB, for example, suppose you create the following matrix:
a = complex(ones(4,'gpuArray'),ones(4,'gpuArray'));
If you pass a "gpuArray" to a MEX-function as the first argument (prhs[0]), then you can get a pointer to the complex data by using the following code:
mxGPUArray const * A = mxGPUCreateFromMxArray(prhs[0]);
mwSize numel_complex = mxGPUGetNumberOfElements(A);
double2 const * d_A = (double2 const *)(mxGPUGetDataReadOnly(A));
For more information regarding complex data in CUDA MEX functions, please refer to the following MathWorks documentation link:
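Although the steps above are written for CUDA MEX, the out-of-bounds failure mode from step 1 is the same in any CUDA code. Below is a hypothetical Python/Numba sketch (not part of the answer above) showing the kind of bounds guard that prevents it: without the "i < out.size" check, threads in the partially filled last block would write past the end of the array, which is exactly the kind of access that can surface as CUDA_ERROR_ILLEGAL_ADDRESS.

# Hypothetical Numba CUDA sketch illustrating step 1 (assumes numba and a CUDA-capable GPU).
import numpy as np
from numba import cuda

@cuda.jit
def scale(out, inp):
    i = cuda.grid(1)      # global thread index
    if i < out.size:      # bounds guard: the launched grid is usually larger than the data
        out[i] = inp[i] * 2.0

inp = cuda.to_device(np.arange(1000, dtype=np.float32))
out = cuda.device_array_like(inp)
threads_per_block = 256
blocks = (inp.size + threads_per_block - 1) // threads_per_block  # 4 blocks = 1024 threads for 1000 elements
scale[blocks, threads_per_block](out, inp)
print(out.copy_to_host()[:5])  # [0. 2. 4. 6. 8.]

Removing the guard would let the extra 24 threads read and write past the buffer, which is the access pattern step 1 warns about.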
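For step 3, GPU memory can also be polled programmatically while a job runs. The following is a minimal sketch, assuming "nvidia-smi" is on the PATH; the query flags are standard nvidia-smi options, and the helper name is made up for illustration.

# Sketch: poll GPU 0 memory use via nvidia-smi a few times during a run.
import subprocess
import time

def gpu_memory_mib():
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.used,memory.total",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    first_gpu = out.strip().splitlines()[0]          # one line per GPU
    used, total = (int(v) for v in first_gpu.split(","))
    return used, total

for _ in range(3):
    used, total = gpu_memory_mib()
    print(f"GPU 0 memory: {used} / {total} MiB")
    time.sleep(5)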
I’m trying to render my scene, which was working fine before, but suddenly when I want to render it starts throwing a CUDA error.
I upgraded to 2.9 and it’s the same deal.
I’m using the OPTIX denoiser, but I render on normal CUDA.
I have a Razer Blade 2020 with an RTX 2080 Super Max-Q.
Any tips? I updated Windows, restarted the laptop, etc.
I’m pretty sure all my drivers are up to date.
I have motion blur enabled, as well as denoising and 512 samples; DoF is enabled as well.
Also, after it shows up once, I get "failed to initialize CUDA context" every time after that. :/
asked Sep 12, 2020 at 10:22
I had the same problem. It always appeared when the PC woke up from sleep mode. Restarting the app worked, but it was annoying. I found the solution in this thread and it worked perfectly for me. Now there is no more CUDA error when trying to render after sleep mode:
https://blenderartists.org/t/failed-to-create-cuda-context-illegal-adress/1278322
In my case I only had to do the last part of the solution:
In the file browser:
Open /etc/modprobe.d as root, create a new file "nvidia.conf", open it in a text editor and write: options nvidia NVreg_PreserveVideoMemoryAllocations=1 then save and close.
Back in the terminal, run: sudo update-initramfs -u and reboot.
My system setup:
AMD Ryzen 5 3600
16 GB RAM
Nvidia GeForce RTX 2060 Super
Cycles Renderer via GPU and CPU, Optix Denoiser in View and Renderer
Ubuntu 20.04
Blender 3 (snap)
Nvidia Driver 470 (495 causes problems with the Optix Denoiser)
I hope it will help in some of the cases above.
Have a nice day!
answered Nov 9, 2021 at 14:12
Sometimes when I render things and it doesn’t work, I just make a new project and append everything from my old project into it, and then it usually works. It’s annoying, but I don’t know how else to fix it.
answered Feb 16, 2021 at 22:37
Anything (over)loading your GPU can make CUDA context initialization fail; in my recent case on an RTX 2060 FE it was the render color depth.
Switching from 16-bit to 8-bit unstuck the rendering.
Hope it’ll help someone out there.
Good rendering to you!
answered Jan 24, 2021 at 18:15
I’m currently running on an old machine and had the same problem; changing the device from GPU to CPU in the Render Properties tab solved the issue.
answered Jan 29, 2021 at 18:12
Overclocking Nvidia GPUs can cause CUDA errors. I encountered this same issue with an Nvidia RTX 3070 GPU on both Blender 3.0 and 3.1, stable releases. Removing the GPU overclock, in my case with the MSI Center application on Windows 10, and restarting Blender solved the issue.
answered Apr 14, 2022 at 21:08
It is a really strange bug.
Environment: TF 1.12 + CUDA 9.0 + cuDNN 7.5 + a single RTX 2080
Today I tried to train a YOLO V3 network on my new device. The batch size is 4. Everything went right at the beginning: training started as usual and I could see the loss decreasing during the training process.
But at around round 35, it reported this message:
2020-03-20 13:52:01.404576: E tensorflow/stream_executor/cuda/cuda_event.cc:48] Error polling for event status: failed to query event: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered
2020-03-20 13:52:01.404908: F tensorflow/core/common_runtime/gpu/gpu_event_mgr.cc:274] Unexpected Event status: 1
and the training process exited.
I have tried several times. It happens randomly, maybe 30 minutes or several hours after the training process starts.
But if I change the batch size to 2, it trains successfully.
So why does this happen? If my environment were wrong or not suitable for the RTX 2080, this bug should appear at the very beginning of the training process, not in the middle. The layers in my YOLO network were all trainable from the start, so nothing changed during the training process. Why does it train correctly in the first rounds but fail in the middle? Why does a smaller batch size train successfully?
And what should I do now? The solutions I can think of are:
1: Compile TF 1.12 against CUDA 10 + cuDNN 7.5 and try again.
2: Maybe update TensorFlow and CUDA?
Both would cost a lot.
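A low-cost check before rebuilding anything (a minimal sketch assuming the TF 1.x session API that matches the TF 1.12 environment above, not something from this post): make GPU memory allocation explicit, so that memory pressure from the larger batch size shows up as a clear out-of-memory error instead of a random illegal-address crash.

# TF 1.x sketch: control GPU memory allocation explicitly before training.
import tensorflow as tf

config = tf.ConfigProto()
config.gpu_options.allow_growth = True  # allocate GPU memory on demand
# Optionally cap the per-process share of GPU memory instead:
# config.gpu_options.per_process_gpu_memory_fraction = 0.8

# The YOLO graph would be built around this; shown here only to illustrate
# passing the config into the session.
with tf.Session(config=config) as sess:
    sess.run(tf.global_variables_initializer())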
-
#1
I was mining ERGO on an RTX 3080 Ti. For several days everything was fine, but then one fine day it started throwing the following error every 2-3 hours and restarting:
Can’t find nonce with device [ID=0, GPU #0], cuda exception CUDA_ERROR_ILLEGAL_ADDRESS, try to reduce overclock to stabilize GPU state
What is this related to, and how can I fix it, other than lowering the memory clock in Afterburner?
-
#3
try to reduce overclock to stabilize GPU state
What is this related to, and how can I fix it, other than lowering the memory clock in Afterburner?
Not for nothing did they say at school that the question contains 50% of the answer.
-
#4
try to reduce overclock to stabilize GPU state
-
#6
Drop the overclock. I had the same thing happen.
-
#7
How many hashes are you getting on t-rex? What overclock are you running?
-
#8
Generally, the new t-rex has some noticeable improvements for mining ERGO. My card is simpler, of course, but on a 3060 Ti LHR it also gives a higher hashrate. Still, I ran into its glitch myself when I was mining ERGO on Windows with a 3060 Ti LHR + 2x1063. CUDA devices were randomly dropping out in the same way (more often GPU0 and GPU1, which are one of the 1063s and the 3060 Ti LHR, but GPU2 also dropped once); the miner restarted in the same window and kept mining, and a line simply appeared at the bottom with the error count for each GPU. And it's not about the overclock: I left only the 3060 Ti LHR on t-rex, launched the 2x1063 with NBMiner, and everything kept mining fine on the same settings without any errors.
It's clearly a fault in the program, not the cards... Another rig, on RaveOS with 5x1063, also glitched on this same t-rex version: it was left unattended for a day, and for that whole day it kept reporting errors in the RaveOS stats, now for GPU0, now for GPU1, and kept restarting itself. The rig itself didn't reboot, and the second worker, with two AMD cards on NBMiner, kept mining ERGO normally. Before that it had mined for two days without errors, with occasional restarts of the miner itself to switch between pools.
In short, they overcomplicated something in t-rex 24.5.