Runtime error: CUDA out of memory

CUDA Out of Memory error but CUDA memory is almost empty: I am currently training a lightweight model on a very large amount of textual data (about 70 GiB of text). For that I am using a machine on a c...

#16417 (comment)

I have the same issue.
Did you get a solution?
(base) F:\Suresh\st-gcn>python main1.py recognition -c config/st_gcn/ntu-xsub/train.yaml --device 0 --work_dir ./work_dir
C:\Users\cudalab10\Anaconda3\lib\site-packages\torch\cuda\__init__.py:117: UserWarning:
Found GPU0 TITAN Xp which is of cuda capability 1.1.
PyTorch no longer supports this GPU because it is too old.

warnings.warn(old_gpu_warn % (d, name, major, capability[1]))
[05.22.19|12:02:41] Parameters:
{'base_lr': 0.1, 'ignore_weights': [], 'model': 'net.st_gcn.Model', 'eval_interval': 5, 'weight_decay': 0.0001, 'work_dir': './work_dir', 'save_interval': 10, 'model_args': {'in_channels': 3, 'dropout': 0.5, 'num_class': 60, 'edge_importance_weighting': True, 'graph_args': {'strategy': 'spatial', 'layout': 'ntu-rgb+d'}}, 'debug': False, 'pavi_log': False, 'save_result': False, 'config': 'config/st_gcn/ntu-xsub/train.yaml', 'optimizer': 'SGD', 'weights': None, 'num_epoch': 80, 'batch_size': 64, 'show_topk': [1, 5], 'test_batch_size': 64, 'step': [10, 50], 'use_gpu': True, 'phase': 'train', 'print_log': True, 'log_interval': 100, 'feeder': 'feeder.feeder.Feeder', 'start_epoch': 0, 'nesterov': True, 'device': [0], 'save_log': True, 'test_feeder_args': {'data_path': './data/NTU-RGB-D/xsub/val_data.npy', 'label_path': './data/NTU-RGB-D/xsub/val_label.pkl'}, 'train_feeder_args': {'data_path': './data/NTU-RGB-D/xsub/train_data.npy', 'debug': False, 'label_path': './data/NTU-RGB-D/xsub/train_label.pkl'}, 'num_worker': 4}

[05.22.19|12:02:41] Training epoch: 0
Traceback (most recent call last):
  File "main1.py", line 31, in <module>
    p.start()
  File "F:\Suresh\st-gcn\processor\processor.py", line 113, in start
    self.train()
  File "F:\Suresh\st-gcn\processor\recognition.py", line 91, in train
    output = self.model(data)
  File "C:\Users\cudalab10\Anaconda3\lib\site-packages\torch\nn\modules\module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "F:\Suresh\st-gcn\net\st_gcn.py", line 82, in forward
    x, _ = gcn(x, self.A * importance)
  File "C:\Users\cudalab10\Anaconda3\lib\site-packages\torch\nn\modules\module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "F:\Suresh\st-gcn\net\st_gcn.py", line 194, in forward
    x, A = self.gcn(x, A)
  File "C:\Users\cudalab10\Anaconda3\lib\site-packages\torch\nn\modules\module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "F:\Suresh\st-gcn\net\utils\tgcn.py", line 60, in forward
    x = self.conv(x)
  File "C:\Users\cudalab10\Anaconda3\lib\site-packages\torch\nn\modules\module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "C:\Users\cudalab10\Anaconda3\lib\site-packages\torch\nn\modules\conv.py", line 320, in forward
    self.padding, self.dilation, self.groups)
RuntimeError: CUDA out of memory. Tried to allocate 1.37 GiB (GPU 0; 12.00 GiB total capacity; 8.28 GiB already allocated; 652.75 MiB free; 664.38 MiB cached)

My model reports “cuda runtime error(2): out of memory”

As the error message suggests, you have run out of memory on your
GPU. Since we often deal with large amounts of data in PyTorch,
small mistakes can rapidly cause your program to use up all of your
GPU memory; fortunately, the fixes in these cases are often simple.
Here are a few common things to check:

Don’t accumulate history across your training loop.
By default, computations involving variables that require gradients
will keep history. This means that you should avoid using such
variables in computations which will live beyond your training loops,
e.g., when tracking statistics. Instead, you should detach the variable
or access its underlying data.

Sometimes, it can be non-obvious when differentiable variables can
occur. Consider the following training loop (abridged from source):

total_loss = 0
for i in range(10000):
    optimizer.zero_grad()
    output = model(input)
    loss = criterion(output)
    loss.backward()
    optimizer.step()
    total_loss += loss

Here, total_loss is accumulating history across your training loop, since
loss is a differentiable variable with autograd history. You can fix this by
writing total_loss += float(loss) instead.
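
For reference, here is a minimal, self-contained sketch of the corrected loop (the model, data, and hyperparameters are made up for illustration); the only change that matters is that total_loss accumulates a plain Python number:

import torch
import torch.nn as nn

model = nn.Linear(10, 1)
criterion = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

total_loss = 0.0
for _ in range(100):
    inputs = torch.randn(32, 10)
    targets = torch.randn(32, 1)
    optimizer.zero_grad()
    loss = criterion(model(inputs), targets)
    loss.backward()
    optimizer.step()
    total_loss += loss.item()  # .item() (or float(loss)) drops the autograd history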

Don’t hold onto tensors and variables you don’t need.
If you assign a Tensor or Variable to a local, Python will not
deallocate until the local goes out of scope. You can free
this reference by using del x. Similarly, if you assign
a Tensor or Variable to a member variable of an object, it will
not deallocate until the object goes out of scope. You will
get the best memory usage if you don’t hold onto temporaries
you don’t need.

The scopes of locals can be larger than you expect. For example:

for i in range(5):
    intermediate = f(input[i])
    result += g(intermediate)
output = h(result)
return output

Here, intermediate remains live even while h is executing,
because its scope extends past the end of the loop. To free it
earlier, you should del intermediate when you are done with it.
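
A sketch of the same loop with an explicit free (f, g, h, and input stand for the placeholder functions and data in the snippet above):

result = 0
for i in range(5):
    intermediate = f(input[i])
    result += g(intermediate)
    del intermediate        # drop the reference as soon as it is no longer needed
output = h(result)          # intermediate has already been freed while h runs
return output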

Avoid running RNNs on sequences that are too large.
The amount of memory required to backpropagate through an RNN scales
linearly with the length of the RNN input; thus, you will run out of memory
if you try to feed an RNN a sequence that is too long.

The technical term for this phenomenon is backpropagation through time,
and there are plenty of references for how to implement truncated
BPTT, including in the word language model example; truncation is handled by the
repackage function as described in
this forum post.
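
As a sketch of what that truncation looks like in practice, the word language model example detaches the hidden state between chunks with a helper along these lines (the training-loop part is only illustrative):

import torch

def repackage_hidden(h):
    """Detach hidden states from their history so that backpropagation
    stops at the boundary of the current chunk (truncated BPTT)."""
    if isinstance(h, torch.Tensor):
        return h.detach()
    return tuple(repackage_hidden(v) for v in h)

# Illustrative use inside a training loop over fixed-length chunks:
#   hidden = model.init_hidden(batch_size)
#   for inputs, targets in chunks:
#       hidden = repackage_hidden(hidden)   # cut the graph at the chunk boundary
#       output, hidden = model(inputs, hidden)
#       loss = criterion(output, targets)
#       loss.backward()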

Don’t use linear layers that are too large.
A linear layer nn.Linear(m, n) uses O(nm) memory: that is to say,
the memory requirements of the weights
scale quadratically with the number of features. It is very easy
to blow through your memory
this way (and remember that you will need at least twice the size of the
weights, since you also need to store the gradients).
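
A quick back-of-the-envelope check (the feature sizes here are hypothetical) shows how fast this grows:

# Hypothetical sizes: a single nn.Linear(m, n) with 20,000 input and output features.
m, n = 20_000, 20_000
num_weights = m * n                      # the weight matrix alone is n x m
weight_bytes = num_weights * 4           # 4 bytes per float32 parameter
print(f"{num_weights:,} weights take about {weight_bytes / 2**30:.2f} GiB, "
      f"plus roughly the same again for their gradients")
# Doubling the number of features quadruples the memory needed by this layer.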

Consider checkpointing.
You can trade-off memory for compute by using checkpoint.
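
A minimal sketch of what that looks like with torch.utils.checkpoint (the toy model and sizes are made up): activations are stored only at segment boundaries and recomputed inside each segment during the backward pass.

import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint_sequential

# A deep stack of identical blocks stands in for a real model.
model = nn.Sequential(*[nn.Sequential(nn.Linear(1024, 1024), nn.ReLU()) for _ in range(16)])
x = torch.randn(64, 1024, requires_grad=True)

# Split the stack into 4 segments; only segment boundaries keep activations.
out = checkpoint_sequential(model, 4, x)
out.sum().backward()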

My GPU memory isn’t freed properly

PyTorch uses a caching memory allocator to speed up memory allocations. As a
result, the values shown in nvidia-smi usually don’t reflect the true
memory usage. See Memory management for more details about GPU
memory management.

If your GPU memory isn’t freed even after Python quits, it is very likely that
some Python subprocesses are still alive. You may find them via
ps -elf | grep python and manually kill them with kill -9 [pid].

My out of memory exception handler can’t allocate memory

You may have some code that tries to recover from out of memory errors.

try:
    run_model(batch_size)
except RuntimeError: # Out of memory
    for _ in range(batch_size):
        run_model(1)

But you may find that when you do run out of memory, your recovery code can’t allocate
either. That’s because the Python exception object holds a reference to the
stack frame where the error was raised, which prevents the original tensor
objects from being freed. The solution is to move your OOM recovery code outside
of the except clause.

oom = False
try:
    run_model(batch_size)
except RuntimeError: # Out of memory
    oom = True

if oom:
    for _ in range(batch_size):
        run_model(1)

My data loader workers return identical random numbers

You are likely using other libraries to generate random numbers in the dataset,
and the worker subprocesses are started via fork. See
torch.utils.data.DataLoader’s documentation for how to
properly set up random seeds in workers with its worker_init_fn option.
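
A minimal sketch, assuming NumPy is the other library in question: derive a per-worker NumPy seed from the per-worker torch seed inside worker_init_fn (my_dataset is a placeholder).

import numpy as np
import torch

def worker_init_fn(worker_id):
    # torch.initial_seed() already differs per worker, so derive the NumPy seed from it.
    np.random.seed(torch.initial_seed() % 2**32)

# Hypothetical usage:
#   loader = torch.utils.data.DataLoader(my_dataset, num_workers=4,
#                                        worker_init_fn=worker_init_fn)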

My recurrent network doesn’t work with data parallelism

There is a subtlety in using the
pack sequence -> recurrent network -> unpack sequence pattern in a
Module with DataParallel or
data_parallel(). The input to forward() on each device will only be part of the
entire input. Because the unpack operation
torch.nn.utils.rnn.pad_packed_sequence() by default only pads up to the
longest input it sees, i.e., the longest on that particular device, size
mismatches will happen when results are gathered together. Therefore, you can
instead take advantage of the total_length argument of
pad_packed_sequence() to make sure that the
forward() calls return sequences of the same length. For example, you can
write:

from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

class MyModule(nn.Module):
    # ... __init__, other methods, etc.

    # padded_input is of shape [B x T x *] (batch_first mode) and contains
    # the sequences sorted by lengths
    #   B is the batch size
    #   T is max sequence length
    def forward(self, padded_input, input_lengths):
        total_length = padded_input.size(1)  # get the max sequence length
        packed_input = pack_padded_sequence(padded_input, input_lengths,
                                            batch_first=True)
        packed_output, _ = self.my_lstm(packed_input)
        output, _ = pad_packed_sequence(packed_output, batch_first=True,
                                        total_length=total_length)
        return output


m = MyModule().cuda()
dp_m = nn.DataParallel(m)

Additionally, extra care needs to be taken when the batch dimension is dim 1
(i.e., batch_first=False) with data parallelism. In this case, the first
argument of pack_padded_sequence, padded_input, will be of shape
[T x B x *] and should be scattered along dim 1, but the second argument
input_lengths will be of shape [B] and should be scattered along dim
0. Extra code to manipulate the tensor shapes will be needed.
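
One possible way to handle it (a sketch, not from the docs) is to transpose at the DataParallel boundary so that both arguments can be scattered along dim 0, assuming the wrapped module is written batch-first like MyModule above:

# padded_input_t_b is a hypothetical [T x B x *] tensor, input_lengths a [B] tensor.
padded_input_b_t = padded_input_t_b.transpose(0, 1).contiguous()  # -> [B x T x *]
output_b_t = dp_m(padded_input_b_t, input_lengths)                # both scatter along dim 0
output_t_b = output_b_t.transpose(0, 1)                           # back to [T x B x *] if needed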

Stable Diffusion Runtime Error: How To Fix CUDA Out Of Memory Error In Stable Diffusion

Image credit: Stability.ai

Try these tips and the Stable Diffusion runtime error will be a thing of the past. If the Stable Diffusion runtime error is preventing you from making art, here is what you need to do.

Stable Diffusion is one of the best AI image generators out there. Unlike DALL-E and MidJourney AI, Stable Diffusion is available to the public, and anyone with a powerful machine can generate images from text.

However, Stable Diffusion might sometimes run into memory issues and stop working. If you are experiencing the Stable Diffusion runtime error, try the following tips.

How To Fix Runtime Error: CUDA Out Of Memory In Stable Diffusion

So you are running Stable Diffusion locally on your PC, maybe trying to make some NSFW images and bam! You are hit by the infamous RuntimeError: CUDA out of memory.

The error is accompanied by a long message that basically looks like this. The amount of memory may change but the content is the same.

RuntimeError: CUDA out of memory. Tried to allocate 30.00 MiB (GPU 0; 6.00 GiB total capacity; 5.16 GiB already allocated; 0 bytes free; 5.30 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

It appears you have run out of GPU memory. It is worth mentioning that you need at least 4 GB VRAM in order to run Stable Diffusion. If you have 4 GB or more of VRAM, below are some fixes that you can try.

  • Restarting the PC worked for some people.
  • Reduce the resolution. Start with 256 x 256 resolution. Just change the -W 256 -H 256 part in the command.
  • Try this fork as it requires a lot less VRAM according to many Reddit users.

If the issue persists, don’t worry. We have some additional troubleshooting tips for you to try. Keep reading!

Other Troubleshooting Tips

So you have tried all the simple and quick fixes but the runtime error seems to have no intention to leave you, huh? No worries! Let’s dive into relatively more complex steps. Here you go.

  • As mentioned in the error message, set the allocator configuration first: PYTORCH_CUDA_ALLOC_CONF=garbage_collection_threshold:0.6,max_split_size_mb:128. Then run the image generation command with --n_samples 1. (A Python sketch for setting this variable follows this list.)
  • Call the optimized Python script. Use the following command: python optimizedSD/optimized_txt2img.py --prompt "a drawing of a cat on a log" --n_iter 5 --n_samples 1 --H 512 --W 512 --precision full
  • You can also try removing the safety checks aka NSFW filters, which take up 2 GB of VRAM. Just replace scripts/txt2img.py with this:
    https://github.com/JustinGuese/stable-diffusor-docker-text2image/blob/master/txt2img.py
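
If you prefer to set the allocator configuration from Python rather than the shell, a minimal sketch (using the exact values suggested above) looks like this; it has to run before the first CUDA allocation:

import os

# Must be set before the first CUDA allocation for the allocator to pick it up.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "garbage_collection_threshold:0.6,max_split_size_mb:128"

import torch  # imported afterwards so the setting is in place from the start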

Hopefully, one of the suggestions will work for you and you will be able to generate images again. Now that the Stable Diffusion runtime error is fixed, have a look at how to access Stable Diffusion using Google Colab.

Hello guys, how are you all? Hope you are all fine. Today I am facing the following RuntimeError: CUDA out of memory. Tried to allocate error in Python. So here I explain all the possible solutions.

Without wasting your time, let’s start this article to solve this error.

Contents

  1. How RuntimeError: CUDA out of memory. Tried to allocate Error Occurs?
  2. How To Solve RuntimeError: CUDA out of memory. Tried to allocate Error?
  3. Solution 1: Reduce the batch size
  4. Solution 2: Use torch.cuda.memory_summary
  5. Solution 3: Make the mini-batch fit in GPU memory
  6. Solution 4: Open a terminal and a Python prompt
  7. Summary

I am facing the following error.

RuntimeError: CUDA out of memory. Tried to allocate 😊 MiB (GPU 😊; 😊 GiB total capacity; 😊 GiB already allocated; 😊 MiB free; 😊 cached)

I have tried calling empty_cache() in a loop, but I am still facing the error.

How To Solve RuntimeError: CUDA out of memory. Tried to allocate Error?

In short: reduce the batch size so that the mini-batch fits into GPU memory, and use torch.cuda.memory_summary(device=None, abbreviated=False) to see where the memory is going. The individual solutions are detailed below.

Solution 1: Reduce the batch size

Just reduce the batch size. In my case I was using a batch size of 32, so I changed it to 15 and my error was solved.

Solution 2: Use torch.cuda.memory_summary

Use the following call to get a human-readable summary of how GPU memory is currently being used (print its return value to inspect it); it helps you see what is actually allocated before deciding what to free.

torch.cuda.memory_summary(device=None, abbreviated=False)

Solution 3: Make the mini-batch fit in GPU memory

This error occurs because a mini-batch of data does not fit into GPU memory. Just decrease the batch size. When I set batch size = 256 for the CIFAR-10 dataset I got the same error; then I set the batch size to 128, and it was solved.
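
As a minimal sketch (with a dummy dataset standing in for CIFAR-10), the batch size is chosen when building the DataLoader, so this is usually a one-line change:

import torch
from torch.utils.data import DataLoader, TensorDataset

# Dummy 32x32 RGB images and labels stand in for the real dataset.
train_dataset = TensorDataset(torch.randn(1_000, 3, 32, 32), torch.randint(0, 10, (1_000,)))
train_loader = DataLoader(train_dataset, batch_size=128, shuffle=True)  # was 256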

Solution 4: Open a terminal and a Python prompt

Open a terminal, start a Python prompt, and clear the CUDA cache:

import torch
torch.cuda.empty_cache()
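
Note that empty_cache() only returns cached, unreferenced blocks to the driver; tensors you still hold references to stay allocated. A hedged sketch (model and optimizer are hypothetical names from your own session):

import gc
import torch

del model, optimizer      # hypothetical: drop your own references to large objects first
gc.collect()              # collect Python garbage that may still hold CUDA tensors
torch.cuda.empty_cache()  # now the unused cached blocks can actually be released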

Summary

That’s all about this issue. I hope one of these solutions helped you. Comment below with your thoughts and questions, and let me know which solution worked for you.

