Memory error torch

🐛 Bug

Attempting to install torch 1.2.0 on linux results in a MemoryError

$ pip install torch==1.2.0
Collecting torch==1.2.0
  Downloading https://files.pythonhosted.org/packages/05/65/5248be50c55ab7429dd5c11f5e2f9f5865606b80e854ca63139ad1a584f2/torch-1.2.0-cp37-cp37m-manylinux1_x86_64.whl (748.9MB)
     |████████████████████████████████| 748.9MB 18.6MB/s eta 0:00:01
ERROR: Exception:
Traceback (most recent call last):
  File "/home/hugo/miniconda3/envs/ptesting/lib/python3.7/site-packages/pip/_internal/cli/base_command.py", line 188, in main
    status = self.run(options, args)
  File "/home/hugo/miniconda3/envs/ptesting/lib/python3.7/site-packages/pip/_internal/commands/install.py", line 345, in run
    resolver.resolve(requirement_set)
  File "/home/hugo/miniconda3/envs/ptesting/lib/python3.7/site-packages/pip/_internal/legacy_resolve.py", line 196, in resolve
    self._resolve_one(requirement_set, req)
  File "/home/hugo/miniconda3/envs/ptesting/lib/python3.7/site-packages/pip/_internal/legacy_resolve.py", line 359, in _resolve_one
    abstract_dist = self._get_abstract_dist_for(req_to_install)
  File "/home/hugo/miniconda3/envs/ptesting/lib/python3.7/site-packages/pip/_internal/legacy_resolve.py", line 307, in _get_abstract_dist_for
    self.require_hashes
  File "/home/hugo/miniconda3/envs/ptesting/lib/python3.7/site-packages/pip/_internal/operations/prepare.py", line 199, in prepare_linked_requirement
    progress_bar=self.progress_bar
  File "/home/hugo/miniconda3/envs/ptesting/lib/python3.7/site-packages/pip/_internal/download.py", line 1064, in unpack_url
    progress_bar=progress_bar
  File "/home/hugo/miniconda3/envs/ptesting/lib/python3.7/site-packages/pip/_internal/download.py", line 924, in unpack_http_url
    progress_bar)
  File "/home/hugo/miniconda3/envs/ptesting/lib/python3.7/site-packages/pip/_internal/download.py", line 1152, in _download_http_url
    _download_url(resp, link, content_file, hashes, progress_bar)
  File "/home/hugo/miniconda3/envs/ptesting/lib/python3.7/site-packages/pip/_internal/download.py", line 861, in _download_url
    hashes.check_against_chunks(downloaded_chunks)
  File "/home/hugo/miniconda3/envs/ptesting/lib/python3.7/site-packages/pip/_internal/utils/hashes.py", line 75, in check_against_chunks
    for chunk in chunks:
  File "/home/hugo/miniconda3/envs/ptesting/lib/python3.7/site-packages/pip/_internal/download.py", line 829, in written_chunks
    for chunk in chunks:
  File "/home/hugo/miniconda3/envs/ptesting/lib/python3.7/site-packages/pip/_internal/utils/ui.py", line 156, in iter
    for x in it:
  File "/home/hugo/miniconda3/envs/ptesting/lib/python3.7/site-packages/pip/_internal/download.py", line 818, in resp_read
    decode_content=False):
  File "/home/hugo/miniconda3/envs/ptesting/lib/python3.7/site-packages/pip/_vendor/urllib3/response.py", line 531, in stream
    data = self.read(amt=amt, decode_content=decode_content)
  File "/home/hugo/miniconda3/envs/ptesting/lib/python3.7/site-packages/pip/_vendor/urllib3/response.py", line 479, in read
    data = self._fp.read(amt)
  File "/home/hugo/miniconda3/envs/ptesting/lib/python3.7/site-packages/pip/_vendor/cachecontrol/filewrapper.py", line 65, in read
    self._close()
  File "/home/hugo/miniconda3/envs/ptesting/lib/python3.7/site-packages/pip/_vendor/cachecontrol/filewrapper.py", line 52, in _close
    self.__callback(self.__buf.getvalue())
  File "/home/hugo/miniconda3/envs/ptesting/lib/python3.7/site-packages/pip/_vendor/cachecontrol/controller.py", line 300, in cache_response
    cache_url, self.serializer.dumps(request, response, body=body)
  File "/home/hugo/miniconda3/envs/ptesting/lib/python3.7/site-packages/pip/_vendor/cachecontrol/serialize.py", line 72, in dumps
    return b",".join([b"cc=4", msgpack.dumps(data, use_bin_type=True)])
MemoryError

To Reproduce

Steps to reproduce the behavior:

  1. pip install torch==1.2.0 (on linux)

Expected behavior

Installation, without MemoryError

Environment

Collecting environment information…
PyTorch version: N/A
Is debug build: N/A
CUDA used to build PyTorch: N/A

OS: Ubuntu 18.04.2 LTS
GCC version: (Ubuntu 7.4.0-1ubuntu1~18.04.1) 7.4.0
CMake version: Could not collect

Python version: 3.7
Is CUDA available: N/A
CUDA runtime version: Could not collect
GPU models and configuration: Could not collect
Nvidia driver version: Could not collect
cuDNN version: Could not collect

Versions of relevant libraries:
[pip] Could not collect
[conda] Could not collect

cc @ezyang @gchanan


Hello there. This is Part 4 of our PyTorch 101 series and we will cover multiple GPU usage in this post.

In this part we will cover,

  1. How to use multiple GPUs for your network, either using data parallelism or model parallelism.
  2. How to automate selection of the GPU when creating new objects.
  3. How to diagnose and analyse memory issues should they arise.

So, let’s get started.

Before we begin, let me remind you that this is Part 4 of our PyTorch series.

  1. Understanding Graphs, Automatic Differentiation and Autograd
  2. Building Your First Neural Network
  3. Going Deep with PyTorch
  4. Memory Management and Using Multiple GPUs
  5. Understanding Hooks

You can get all the code in this post, (and other posts as well) in the Github repo here.


Moving tensors around CPU / GPUs

Every Tensor in PyTorch has a to() member function. Its job is to put the tensor on which it’s called onto a certain device, whether that is the CPU or a particular GPU. The input to the to function is a torch.device object, which can be initialised with either of the following inputs.

  1. cpu for CPU
  2. cuda:0 for putting it on GPU number 0. Other GPUs are addressed the same way by index, for example cuda:1 for GPU number 1.

Generally, whenever you initialise a Tensor, it’s put on the CPU. You can move it to the GPU then. You can check whether a GPU is available or not by invoking the torch.cuda.is_available function.

import torch

if torch.cuda.is_available():
	dev = "cuda:0"
else:
	dev = "cpu"

device = torch.device(dev)

a = torch.zeros(4,3)
a = a.to(device)       # alternatively, a.to(0) when a GPU is present

You can also move a tensor to a certain GPU by giving its index as the argument to the to function.

Importantly, the above piece of code is device agnostic, that is, you don’t have to separately change it for it to work on both GPU and the CPU.  

cuda() function

Another way to put tensors on GPUs is to call cuda(n) function on them where n is the index of the GPU. If you just call cuda, then the tensor is placed on GPU 0.

The torch.nn.Module class also has to and cuda functions, which put the entire network on a particular device. Unlike with Tensors, calling to on the nn.Module object is enough, and there’s no need to assign the value returned by the to function.

clf = myNetwork()
clf.to(torch.device("cuda:0"))    # or clf = clf.cuda()

Automatic selection of GPU

While it’s good to be able to explicitly decide on which GPU does a tensor go, generally, we create a lot of tensors during our operations. We want them to be automatically created on a certain device, so as to reduce cross device transfers which can slow our code down. In this regard, PyTorch provides us with some functionality to accomplish this.

First is the get_device method of a tensor. It is meant for GPU tensors and returns the index of the GPU on which the tensor resides. We can use it to determine the device of a tensor, so that a newly created tensor can be moved to the same device automatically.

# making sure t2 is on the same device as t1

dev = t1.get_device()
t2 = torch.zeros(t1.shape).to(dev)

We can also call cuda(n) while creating new Tensors. By default all tensors created by cuda call are put on GPU 0, but this can be changed by the following statement.

torch.cuda.set_device(0)   # or 1,2,3

If a tensor is created as a result of an operation between two operands which are on same device, so will be the resultant tensor. If operands are on different devices, it will lead to an error.

new_* functions

One can also make use of the bunch of new_ functions that made their way to PyTorch in version 1.0. When a function like new_ones is called on a Tensor, it returns a new tensor of the same data type, and on the same device, as the tensor on which new_ones was invoked.

ones = torch.ones((2,)).cuda(0)

# Create a tensor of ones of size (3,4) on same device as of "ones"
newOnes = ones.new_ones((3,4)) 

randTensor = torch.randn(2,4)

A detailed list of new_ functions can be found in PyTorch docs the link of which I have provided below.
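To make the inheritance concrete, here is a small sketch that runs on the CPU, so no GPU is needed. The tensor names are mine, and float64 is chosen only so that the inherited dtype visibly differs from the default float32.

```python
import torch

# A float64 source tensor; kept on the CPU so the sketch runs anywhere.
source = torch.ones((2,), dtype=torch.float64)

# new_ones / new_zeros take dtype and device from `source`.
inherited_ones = source.new_ones((3, 4))
inherited_zeros = source.new_zeros((2, 2))

# A plain factory call ignores `source` and falls back to the defaults.
plain_ones = torch.ones((3, 4))

print(inherited_ones.dtype, inherited_ones.device)
print(plain_ones.dtype, plain_ones.device)
```

If source lived on a GPU, the two inherited tensors would be created there as well, with no extra to or cuda call.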

Using Multiple GPUs

There are two ways how we could make use of multiple GPUs.

  1. Data Parallelism, where we divide batches into smaller batches, and process these smaller batches in parallel on multiple GPU.
  2. Model Parallelism, where we break the neural network into smaller sub networks and then execute these sub networks on different GPUs.

Data Parallelism

Data Parallelism in PyTorch is achieved through the nn.DataParallel class. You initialize a nn.DataParallel object with a nn.Module object representing your network, and a list of GPU IDs, across which the batches have to be parallelised.

parallel_net = nn.DataParallel(myNet, device_ids = [0,1,2])

Now, you can simply execute the nn.DataParallel object just like a nn.Module .

predictions = parallel_net(inputs)           # Forward pass on multi-GPUs
loss = loss_function(predictions, labels)     # Compute loss function
loss.mean().backward()                        # Average GPU-losses + backward pass
optimizer.step()            

However, there are a few things I want to shed light over. Despite the fact our data has to be parallelised over multiple GPUs, we have to initially store it on a single GPU.

We also need to make sure the DataParallel object is on that particular GPU as well. The syntax remains similar to what we did earlier with nn.Module.

input        = input.to(0)
parallel_net = parallel_net.to(0)

In effect, the following diagram describes how nn.DataParallel works.

Working of nn.DataParallel. Source 

DataParallel takes the input, splits it into smaller batches, replicates the neural network across all the devices, executes the pass and then collects the output back on the original GPU.

One issue with DataParallel is that it can put an asymmetrical load on one GPU (the main node). There are generally two ways to circumvent this problem.

  1. First, is to compute the loss during the forward pass. This makes sure at least the loss calculation phase is parallelised.
  2. Another way is to implement a parallel loss function layer. This is beyond the scope of this article. However, for those interested I have given a link to a medium article detailing implementation of such a layer at the end of this article.
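The first option can be sketched by wrapping the network and the criterion in one module, so that each replica computes its own loss. This is only one possible arrangement: the ModelWithLoss class, the placeholder nn.Linear network and the random data are assumptions of mine, and the code falls back to a single device when fewer than two GPUs are present.

```python
import torch
import torch.nn as nn

class ModelWithLoss(nn.Module):
    """Hypothetical wrapper: computes the loss inside forward()."""

    def __init__(self, net, criterion):
        super().__init__()
        self.net = net
        self.criterion = criterion

    def forward(self, inputs, labels):
        predictions = self.net(inputs)
        # Keep a leading dimension so per-replica losses can be gathered along dim 0.
        return self.criterion(predictions, labels).unsqueeze(0)

net = nn.Linear(8, 3)                                  # placeholder network
wrapped = ModelWithLoss(net, nn.CrossEntropyLoss())

# Only parallelise when more than one GPU is actually available.
if torch.cuda.device_count() > 1:
    wrapped = nn.DataParallel(wrapped).to(0)

inputs = torch.randn(16, 8)
labels = torch.randint(0, 3, (16,))

loss = wrapped(inputs, labels).mean()                  # average the per-replica losses
loss.backward()
```

Only one small loss value per replica travels back to the main GPU, instead of the full batch of predictions.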

Model Parallelism

Model parallelism means that you break your network into smaller subnetworks that you then put on different GPUs. The main motivation for doing such a thing is that your network might be too large to fit inside a single GPU.

Note that model parallelism is often slower than data parallelism as splitting a single network into multiple GPUs introduces dependencies between GPUs which prevents them from running in a truly parallel way. The advantage one derives out of model parallelism is not about speed, but ability to run networks whose size is too large to fit on a single GPU.

As we see in figure b, Subnet 2 waits for Subnet 1 during forward pass, while Subnet 1 waits for Subnet 2 during backward pass.

Model Parallelism with Dependencies

Implementing model parallelism in PyTorch is pretty easy as long as you remember 2 things.

  1. The input and the network should always be on the same device.
  2. to and cuda functions have autograd support, so your gradients can be copied from one GPU to another during backward pass.

We will use the following piece of code to understand this better.

class model_parallel(nn.Module):
	def __init__(self):
		super().__init__()
		self.sub_network1 = ...
		self.sub_network2 = ...

		self.sub_network1.cuda(0)
		self.sub_network2.cuda(1)

	def forward(self, x):
		x = x.cuda(0)
		x = self.sub_network1(x)
		x = x.cuda(1)
		x = self.sub_network2(x)
		return x
		

In the init function we have put the sub-networks on GPUs 0 and 1 respectively.

Notice in the forward function, we transfer the intermediate output from sub_network1 to GPU 1 before feeding it to sub_network2. Since cuda has autograd support, the loss backpropagated from sub_network2 will be copied to buffers of sub_network1 for further backpropagation.
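For readers who want something they can run, here is a filled-in variant of the same idea. The layer sizes and the TwoStageNet name are made up for illustration, and both halves fall back to the CPU when two GPUs are not available, so the device hop becomes a no-op on such machines.

```python
import torch
import torch.nn as nn

# Use two GPUs when present; otherwise keep both stages on the CPU.
two_gpus = torch.cuda.device_count() >= 2
dev0 = torch.device("cuda:0" if two_gpus else "cpu")
dev1 = torch.device("cuda:1" if two_gpus else "cpu")

class TwoStageNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.stage1 = nn.Linear(10, 20).to(dev0)
        self.stage2 = nn.Linear(20, 5).to(dev1)

    def forward(self, x):
        x = torch.relu(self.stage1(x.to(dev0)))
        # The transfer between devices is recorded by autograd.
        return self.stage2(x.to(dev1))

net = TwoStageNet()
out = net(torch.randn(4, 10))
out.sum().backward()   # gradients flow back through the device transfer

print(out.device, net.stage1.weight.grad is not None)
```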

Troubleshooting Out of Memory Errors

In this section we will cover how to diagnose memory issues, and possible solutions if your network is using more memory than it needs.

While running out of memory may necessitate reducing the batch size, one can do certain checks to ensure that memory usage is optimal.

Tracking Memory Usage with GPUtil

One way to track GPU usage is by monitoring memory usage in a console with nvidia-smi command. The problem with this approach is that peak GPU usage, and out of memory happens so fast that you can’t quite pinpoint which part of your code is causing the memory overflow.

For this we will use an extension called GPUtil, which you can install with pip by running the following command.

pip install GPUtil

The usage is pretty simple too.

import GPUtil
GPUtil.showUtilization()

Just put the second line wherever you want to see the GPU utilisation. By placing this statement at different places in the code, you can figure out exactly which part is causing the network to go OOM.


Let us now talk about possible methods for remedying OOM errors.

Dealing with Memory Losses using del keyword

PyTorch has a pretty aggressive garbage collector. As soon as a variable goes out of scope, the garbage collection will free it.

It is to be kept in mind that Python doesn’t enforce scoping rules as strongly as other languages such as C/C++. An object is only freed when no references to it remain. (This has to do with the fact that variables needn’t be declared in Python.)

As a result, memory occupied by the tensors holding your inputs and outputs may still not be freed even once you are out of the training loop. Consider the following chunk of code.

for x in range(10):
	i = x

print(i)   # 9 is printed

Running the above snippet will print the value of i even though we are outside of the loop where we initialised i. Similarly, tensors holding loss and output can live beyond the training loop. In order to truly free up the space held by these tensors, we use the del keyword.

del out, loss

In fact, as a general rule of thumb, if you are done with a tensor, you should del it, as it won’t be garbage collected until no reference to it is left.

Using Python Data Types Instead Of 1-D Tensors

Often, we aggregate values in our training loop to compute some metrics. The most common example is updating the running loss each iteration. However, if not done carefully in PyTorch, this can use more memory than is required.

Consider the following snippet of code.

import torch

total_loss = 0

for x in range(10):
  # assume loss is computed 
  iter_loss = torch.randn(3,4).mean()
  iter_loss.requires_grad = True     # losses are supposed to be differentiable
  total_loss += iter_loss            # use total_loss += iter_loss.item() instead

We expect that in subsequent iterations the name iter_loss is rebound to the new iter_loss, and the object representing iter_loss from the earlier iteration is freed. But this doesn’t happen. Why?

Since iter_loss is differentiable, the line total_loss += iter_loss creates a computation graph with one AddBackward function node. During subsequent iterations, AddBackward nodes are added to this graph and no object holding values of iter_loss is freed. Normally, the memory allocated to a computation graph is freed when backward is called upon it, but here, there’s no scope of calling backward.

The computation graph created when you keep adding the loss tensor to the variable loss

The solution to this is to add a python data type, and not a tensor to total_loss which prevents creation of any computation graph.

We merely replace the line total_loss += iter_loss with total_loss += iter_loss.item(). item returns the python data type from a tensor containing single values.

Emptying Cuda Cache

While PyTorch aggressively frees up memory, a PyTorch process may not give the memory back to the OS even after you del your tensors. This memory is cached so that it can be quickly allocated to new tensors without requesting extra memory from the OS.

This can be a problem when you are using more than one process in your workflow.

The first process can hold onto the GPU memory even if its work is done, causing OOM when the second process is launched. To remedy this, you can write the following command at the end of your code.

torch.cuda.empty_cache()

This will make sure that the space held by the process is released.

import torch
from GPUtil import showUtilization as gpu_usage

print("Initial GPU Usage")
gpu_usage()                             

tensorList = []
for x in range(10):
  tensorList.append(torch.randn(10000000,10).cuda())   # reduce the size of tensor if you are getting OOM

print("GPU Usage after allocating a bunch of Tensors")
gpu_usage()

del tensorList

print("GPU Usage after deleting the Tensors")
gpu_usage()  

print("GPU Usage after emptying the cache")
torch.cuda.empty_cache()
gpu_usage()

The following output is produced when this code is executed on a Tesla K80

Initial GPU Usage
| ID | GPU | MEM |
------------------
|  0 |  0% |  5% |
GPU Usage after allocating a bunch of Tensors
| ID | GPU | MEM |
------------------
|  0 |  3% | 30% |
GPU Usage after deleting the Tensors
| ID | GPU | MEM |
------------------
|  0 |  3% | 30% |
GPU Usage after emptying the cache
| ID | GPU | MEM |
------------------
|  0 |  3% |  5% |

Using torch.no_grad() for Inference

PyTorch, by default, will create a computational graph during the forward pass. During creation of this graph, it will allocate buffers to store gradients and intermediate values which are used for computing the gradient during the backward pass.

During the backward pass, all of these buffers, with the exception of those allocated for leaf variables are freed.

However, during inference, there is no backward pass and these buffers are never freed, leading up to piling up of memory. Therefore, whenever you want to execute a piece of code that doesn’t need to be backpropagated, put it inside a torch.no_grad() context manager.

with torch.no_grad():
	# your code 
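A quick way to convince yourself that nothing is recorded inside the context manager is to compare the outputs of the same forward pass with and without it. The sketch below uses a placeholder nn.Linear model on the CPU; the names are illustrative.

```python
import torch
import torch.nn as nn

model = nn.Linear(4, 2)      # placeholder model
model.eval()
batch = torch.randn(3, 4)

# Ordinary forward pass: the output is attached to a computation graph.
tracked = model(batch)

# Forward pass under no_grad: no graph is recorded for the output.
with torch.no_grad():
    untracked = model(batch)

print(tracked.requires_grad, tracked.grad_fn is not None)
print(untracked.requires_grad, untracked.grad_fn is None)
```

Both calls produce the same values; only the bookkeeping for the backward pass differs.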

Using CuDNN Backend

You can make use of the cudnn benchmark instead of the vanilla benchmark. CuDNN can provide a lot of optimisation which can bring down your space usage, especially when the input to your neural network is of fixed size. Add the following lines on top of your code to enable the CuDNN benchmark.

torch.backends.cudnn.benchmark = True
torch.backends.cudnn.enabled = True

Using 16-bit Floats

The new RTX and Volta cards by nVidia support both 16-bit training and inference.

model = model.half()     # convert a model to 16-bit
input = input.half()     # convert the input to 16-bit

However, the 16-bit training options have to be taken with a pinch of salt.

While usage of 16-bit tensors can cut your GPU usage by almost half, there are a few issues with them.

  1. In PyTorch, batch-norm layers have convergence issues with half precision floats. If that’s the case with you, make sure that batch norm layers are float32.
model.half()  # convert to half precision
for layer in model.modules():
  if isinstance(layer, nn.BatchNorm2d):
    layer.float()

Also, you need to make sure when the output is passed through different layers in the forward function, the input to the batch norm layer is converted from float16 to float32 and then the output needs to be converted back to float16

One can find a good discussion of 16-bit training in PyTorch here.

2.  You can have overflow issues with 16-bit float. Once, I remember I had such an overflow while trying to store the Union area of two bounding boxes (for computation of IoUs)  in a float16.  So make sure you have a realistic bound on the value you are trying to save in a float16.
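The overflow is easy to reproduce without a GPU. The sketch below uses Python's struct module, whose "e" format is IEEE 754 half precision, as a stand-in for float16 storage; the two boxes are hypothetical. One difference to keep in mind: struct raises an error on overflow, whereas float16 arithmetic on a tensor would typically give inf instead.

```python
import struct

FLOAT16_MAX = 65504.0   # largest finite IEEE 754 half-precision value

def fits_in_float16(value):
    """Return True if `value` can be stored as a finite half-precision float."""
    try:
        struct.pack("<e", float(value))
        return True
    except OverflowError:
        return False

def area(box):
    """Area of a box given as (x1, y1, x2, y2)."""
    return (box[2] - box[0]) * (box[3] - box[1])

# Hypothetical boxes in pixel coordinates.
box_a = (0, 0, 300, 200)
box_b = (100, 50, 400, 350)

# Intersection rectangle, clamped to zero when the boxes do not overlap.
inter_w = max(0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
inter_h = max(0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
union = area(box_a) + area(box_b) - inter_w * inter_h

print(union, fits_in_float16(union))
```

Even these modest image-sized boxes have a union area above 65504, so the value has to be kept in float32, or the coordinates normalised first.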

Nvidia has recently released a PyTorch extension called Apex, that facilitates numerically safe mixed precision training in PyTorch. I have provided the link to that at the end of the article.

Conclusion

That concludes our discussion on memory management and use of multiple GPUs in PyTorch. Following are the important links that you may wanna follow up this article with.

Further Reading

  1. PyTorch new functions
  2. Parallelised Loss Layer: Training Neural Nets on Larger Batches: Practical Tips for 1-GPU, Multi-GPU & Distributed setups
  3. GPUtil Github page
  4. A discussion on half precision training in PyTorch
  5. Nvidia Apex Github page
  6. Nvidia Apex tutorial

on new install got out of memory or access violations torch.cuda.empty_cache() and others #13778

Comments

tyoc213 commented Nov 9, 2018

On a new install with this spec

I get memory errors almost all times.

To Reproduce

Steps to reproduce the behavior:

Install the fastai library as described at https://github.com/fastai/fastai/blob/master/README.md#installation and, in a new Jupyter notebook, check that cuda available = True, then try to call torch.cuda.empty_cache() (to see if it frees something, because of the out of memory error).

I have some problems running the examples provided in the fastai lib, so I posted on their forum. But after searching here for a solution, I found torch.cuda.empty_cache(), and I still get the memory error, so that is why I'm coming here.

/anaconda3/lib/python3.7/site-packages/torch/cuda/__init__.py in empty_cache()
    372     """
    373     if _initialized:
--> 374         torch._C._cuda_emptyCache()
    375
    376

RuntimeError: CUDA error: an illegal memory access was encountered

Or try to run the different examples provided there: collab.ipynb works OK, but stepping through cifar in fastai/examples I get an error executing this line

I get this output

/fastai/fastai/basic_train.py in __post_init__(self)
    136         self.path = Path(ifnone(self.path, self.data.path))
    137         (self.path/self.model_dir).mkdir(parents=True, exist_ok=True)
--> 138         self.model = self.model.to(self.data.device)
    139         self.loss_func = ifnone(self.loss_func, self.data.loss_func)
    140         self.metrics=listify(self.metrics)

/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py in to(self, *args, **kwargs)
    377             return t.to(device, dtype if t.is_floating_point() else None, non_blocking)
    378
--> 379         return self._apply(convert)
    380
    381     def register_backward_hook(self, hook):

/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py in _apply(self, fn)
    183     def _apply(self, fn):
    184         for module in self.children():
--> 185             module._apply(fn)
    186
    187         for param in self._parameters.values():

/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py in _apply(self, fn)
    183     def _apply(self, fn):
    184         for module in self.children():
--> 185             module._apply(fn)
    186
    187         for param in self._parameters.values():

/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py in _apply(self, fn)
    189                 # Tensors stored in modules are graph leaves, and we don't
    190                 # want to create copy nodes, so we have to unpack the data.
--> 191                 param.data = fn(param.data)
    192                 if param._grad is not None:
    193                     param._grad.data = fn(param._grad.data)

/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py in convert(t)
    375
    376         def convert(t):
--> 377             return t.to(device, dtype if t.is_floating_point() else None, non_blocking)
    378
    379         return self._apply(convert)

RuntimeError: cuda runtime error (77) : an illegal memory access was encountered at /opt/conda/conda-bld/pytorch-nightly_1541411195070/work/aten/src/THC/generic/THCTensorCopy.cpp:20

torch.cuda.is_available() returns True.

I'm also running out of memory in dogs_cats.ipynb.

/fastai/fastai/vision/learner.py in create_cnn(data, arch, cut, pretrained, lin_ftrs, ps, custom_head, split_on, classification, **kwargs)
     67     learn.split(ifnone(split_on,meta['split']))
     68     if pretrained: learn.freeze()
---> 69     apply_init(model[1], nn.init.kaiming_normal_)
     70     return learn
     71

/fastai/fastai/torch_core.py in apply_init(m, init_func)
    193 def apply_init(m, init_func:LayerFunc):
    194     "Initialize all non-batchnorm layers of `m` with `init_func`."
--> 195     apply_leaf(m, partial(cond_init, init_func=init_func))
    196
    197 def in_channels(m:nn.Module) -> List[int]:

/fastai/fastai/torch_core.py in apply_leaf(m, f)
    189     c = children(m)
    190     if isinstance(m, nn.Module): f(m)
--> 191     for l in c: apply_leaf(l,f)
    192
    193 def apply_init(m, init_func:LayerFunc):

/fastai/fastai/torch_core.py in apply_leaf(m, f)
    188     "Apply `f` to children of `m`."
    189     c = children(m)
--> 190     if isinstance(m, nn.Module): f(m)
    191     for l in c: apply_leaf(l,f)
    192

/fastai/fastai/torch_core.py in cond_init(m, init_func)
    183     if (not isinstance(m, bn_types)) and requires_grad(m):
    184         if hasattr(m, 'weight'): init_func(m.weight)
--> 185         if hasattr(m, 'bias') and hasattr(m.bias, 'data'): m.bias.data.fill_(0.)
    186
    187 def apply_leaf(m:nn.Module, f:LayerFunc):

RuntimeError: cuda runtime error (2) : out of memory at /opt/conda/conda-bld/pytorch-nightly_1541411195070/work/aten/src/THC/generic/THCTensorMath.cu:14

I also get the CUDA memory error in the tabular example

/fastai/fastai/tabular/data.py in get_tabular_learner(data, layers, emb_szs, metrics, ps, emb_drop, y_range, use_bn, **kwargs)
     93     model = TabularModel(emb_szs, len(data.cont_names), out_sz=data.c, layers=layers, ps=ps, emb_drop=emb_drop,
     94                          y_range=y_range, use_bn=use_bn)
---> 95     return Learner(data, model, metrics=metrics, **kwargs)
     96

in __init__(self, data, model, opt_func, loss_func, metrics, true_wd, bn_wd, wd, train_bn, path, model_dir, callback_fns, callbacks, layer_groups)

/fastai/fastai/basic_train.py in __post_init__(self)
    136         self.path = Path(ifnone(self.path, self.data.path))
    137         (self.path/self.model_dir).mkdir(parents=True, exist_ok=True)
--> 138         self.model = self.model.to(self.data.device)
    139         self.loss_func = ifnone(self.loss_func, self.data.loss_func)
    140         self.metrics=listify(self.metrics)

/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py in to(self, *args, **kwargs)
    377             return t.to(device, dtype if t.is_floating_point() else None, non_blocking)
    378
--> 379         return self._apply(convert)
    380
    381     def register_backward_hook(self, hook):

/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py in _apply(self, fn)
    183     def _apply(self, fn):
    184         for module in self.children():
--> 185             module._apply(fn)
    186
    187         for param in self._parameters.values():

/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py in _apply(self, fn)
    183     def _apply(self, fn):
    184         for module in self.children():
--> 185             module._apply(fn)
    186
    187         for param in self._parameters.values():

/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py in _apply(self, fn)
    189                 # Tensors stored in modules are graph leaves, and we don't
    190                 # want to create copy nodes, so we have to unpack the data.
--> 191                 param.data = fn(param.data)
    192                 if param._grad is not None:
    193                     param._grad.data = fn(param._grad.data)

/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py in convert(t)
    375
    376         def convert(t):
--> 377             return t.to(device, dtype if t.is_floating_point() else None, non_blocking)
    378
    379         return self._apply(convert)

RuntimeError: CUDA error: out of memory

Expected behavior

Examples to work

Environment

python collect_env.py
Collecting environment information.
PyTorch version: 0.4.1
Is debug build: No
CUDA used to build PyTorch: 9.2.148

OS: Ubuntu 18.10
GCC version: (Ubuntu 8.2.0-7ubuntu1) 8.2.0
CMake version: version 3.12.2

Python version: 3.7
Is CUDA available: Yes
CUDA runtime version: Could not collect
GPU models and configuration: GPU 0: GeForce RTX 2080
Nvidia driver version: 410.73
cuDNN version: Could not collect

Versions of relevant libraries:
[pip] Could not collect
[conda] cuda92 1.0 0 pytorch
[conda] pytorch 0.4.1 py37_cuda9.2.148_cudnn7.1.4_1 [cuda92] pytorch
[conda] torchvision 0.2.1 py37_1 pytorch
[conda] torchvision-nightly 0.2.1


Pytorch cannot allocate enough memory #913

Comments

craftpag commented Nov 28, 2021

I am trying to run encoder_train.py
I have preprocessed Train_other_500, but when I try to start encoder_train.py I get this message
CUDA out of memory. Tried to allocate 4.98 GiB (GPU 0; 8.00 GiB total capacity; 1.64 GiB already allocated; 4.51 GiB free; 1.67 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

If I have read it correctly, I must add/change max_split_size_mb = in one of the code files. I have tried to search around, and everyone has a solution, but none of them says where to change the code.

Where do I add/change the code to add max_split_size_mb = ?

this may be a stupid question, but I am lost.

Specs:
Windows 11 PRO 21H2
RTX3070
AMD Ryzen 7 5800X
32 GB DDR4 3200 MHz
Pytorch 1.10, CUDA 11.3
Python 3.7.9


sveneschlbeck commented Nov 28, 2021

@craftpag This is not a parameter to be found in the code here but a PyTorch command that (if I’m not wrong) needs to be set as an environment variable.
Try setting PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb: .

Doc quote: "max_split_size_mb prevents the allocator from splitting blocks larger than this size (in MB). This can help prevent fragmentation and may allow some borderline workloads to complete without running out of memory."

Checkout this link to see the full documentation for PyTorch’s memory management:
https://pytorch.org/docs/stable/notes/cuda.html
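As a rough sketch of where such a setting could go: one option is to set the variable from Python at the very top of the training script, before CUDA is first used. The value 128 below is an arbitrary illustration and not a recommendation from this thread.

```python
import os

# Set this before the first CUDA allocation in the process; the safest
# place is ahead of `import torch` at the top of the script.
# 128 (MB) is an arbitrary illustrative value.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"

print(os.environ["PYTORCH_CUDA_ALLOC_CONF"])
```

Alternatively, the variable can be set in the shell before launching Python, for example with set PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128 in a Windows command prompt.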

craftpag commented Nov 30, 2021

Hello
Thank you for replying @sveneschlbeck

I have tried to add those environment variables, with no luck.
I have tried to add it in different ways, but I still get the same error.
Do you think there could be other solutions out there?

sveneschlbeck commented Nov 30, 2021

There’s a couple of things remaining until I am out of answers, too:

1. Are you running any other scripts/games/programs that might be taking up GPU memory? If so, do the following:

Type nvidia-smi into the terminal and find the PID of the process using most GPU memory (apart from PyTorch of course), then kill it by typing taskkill /F /PID

2. Try to reduce memory-intensive (hyper)parameters, e.g. train/test size, batch size, etc.

3. Run the following

My guess is that it’s the batch_size since that is where you specify how much data is loaded into the memory at once. See #914 to get an idea on where you can decrease the batch size. I’d do it file after file to see where the error is caused. Alternatively, you can change it in all files at once. But keep in mind that a lower batch size results in a longer training/testing duration.
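The numbers in the error message quoted at the top of this thread point the same way, and they can be read mechanically. The helper below is a small sketch of mine, not part of PyTorch; it assumes a message in exactly this format with all figures in GiB.

```python
import re

message = (
    "CUDA out of memory. Tried to allocate 4.98 GiB (GPU 0; 8.00 GiB total capacity; "
    "1.64 GiB already allocated; 4.51 GiB free; 1.67 GiB reserved in total by PyTorch)"
)

def parse_oom(text):
    """Extract the GiB figures from an out-of-memory message of the form above."""
    patterns = {
        "requested": r"Tried to allocate ([\d.]+) GiB",
        "total": r"([\d.]+) GiB total capacity",
        "allocated": r"([\d.]+) GiB already allocated",
        "free": r"([\d.]+) GiB free",
        "reserved": r"([\d.]+) GiB reserved",
    }
    return {key: float(re.search(rx, text).group(1)) for key, rx in patterns.items()}

stats = parse_oom(message)

# Memory held in PyTorch's cache but not used by live tensors.
cached_unused = stats["reserved"] - stats["allocated"]

# Is the single request larger than everything reported as free?
request_exceeds_free = stats["requested"] > stats["free"]

print(round(cached_unused, 2), request_exceeds_free)
```

Reserved memory is only about 0.03 GiB above allocated memory, so fragmentation, which is what max_split_size_mb addresses, is unlikely to be the cause here. The single 4.98 GiB request is simply larger than the 4.51 GiB that is free, which fits the batch size explanation.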


Frequently Asked Questions

My model reports “cuda runtime error(2): out of memory”

As the error message suggests, you have run out of memory on your GPU. Since we often deal with large amounts of data in PyTorch, small mistakes can rapidly cause your program to use up all of your GPU; fortunately, the fixes in these cases are often simple. Here are a few common things to check:

Don’t accumulate history across your training loop. By default, computations involving variables that require gradients will keep history. This means that you should avoid using such variables in computations which will live beyond your training loops, e.g., when tracking statistics. Instead, you should detach the variable or access its underlying data.

Sometimes, it can be non-obvious when differentiable variables can occur. Consider the following training loop (abridged from source):

Here, total_loss is accumulating history across your training loop, since loss is a differentiable variable with autograd history. You can fix this by writing total_loss += float(loss) instead.

Other instances of this problem: 1.

Don’t hold onto tensors and variables you don’t need. If you assign a Tensor or Variable to a local, Python will not deallocate until the local goes out of scope. You can free this reference by using del x. Similarly, if you assign a Tensor or Variable to a member variable of an object, it will not deallocate until the object goes out of scope. You will get the best memory usage if you don’t hold onto temporaries you don’t need.

The scopes of locals can be larger than you expect. For example:

Here, intermediate remains live even while h is executing, because its scope extends past the end of the loop. To free it earlier, you should del intermediate when you are done with it.
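The scoping behaviour described above can be observed without PyTorch. This is a minimal sketch, assuming CPython's reference counting; the Intermediate class and the function name are hypothetical stand-ins and are not part of the FAQ.

```python
import weakref


class Intermediate:
    """Stand-in for a large temporary tensor (hypothetical, not a torch type)."""


def still_alive_after_loop(free_early):
    """Report whether the last loop temporary is still alive after the loop ends."""
    probe = None
    for _ in range(3):
        intermediate = Intermediate()
        probe = weakref.ref(intermediate)  # observe the object without owning it
    if free_early:
        del intermediate  # drop the last reference before the next big step
    # This is the point where a later call such as h(result) would run.
    return probe() is not None


print(still_alive_after_loop(free_early=False))  # True: the loop local outlives the loop
print(still_alive_after_loop(free_early=True))   # False: del released it
```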

Avoid running RNNs on sequences that are too large. The amount of memory required to backpropagate through an RNN scales linearly with the length of the RNN input; thus, you will run out of memory if you try to feed an RNN a sequence that is too long.

The technical term for this phenomenon is backpropagation through time, and there are plenty of references for how to implement truncated BPTT, including in the word language model example; truncation is handled by the repackage function as described in this forum post.

Don’t use linear layers that are too large. A linear layer nn.Linear(m, n) uses O(nm) memory: that is to say, the memory requirements of the weights scale quadratically with the number of features. It is very easy to blow through your memory this way (and remember that you will need at least twice the size of the weights, since you also need to store the gradients).
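To get a feel for the numbers, the estimate can be written out as arithmetic. This is a rough sketch that assumes 4-byte float32 parameters and a bias vector, and counts only weights and their gradients (no activations or optimizer state); the function name is made up for illustration.

```python
def linear_layer_bytes(m, n, bytes_per_element=4, with_gradients=True):
    """Estimate memory for nn.Linear(m, n): an n-by-m weight matrix plus n biases."""
    params = m * n + n
    copies = 2 if with_gradients else 1  # gradients double the footprint
    return params * bytes_per_element * copies


for features in (1_000, 10_000, 100_000):
    gib = linear_layer_bytes(features, features) / 2**30
    print(f"nn.Linear({features}, {features}): about {gib:.2f} GiB")
```

Multiplying the feature count by 10 multiplies the estimate by roughly 100, which is the quadratic growth the FAQ warns about.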

Consider checkpointing. You can trade off memory for compute by using checkpoint.

My GPU memory isn’t freed properly

PyTorch uses a caching memory allocator to speed up memory allocations. As a result, the values shown in nvidia-smi usually don’t reflect the true memory usage. See Memory management for more details about GPU memory management.

If your GPU memory isn’t freed even after Python quits, it is very likely that some Python subprocesses are still alive. You may find them via ps -elf | grep python and manually kill them with kill -9 [pid].

My out of memory exception handler can’t allocate memory

You may have some code that tries to recover from out of memory errors.

However, you may find that when you do run out of memory, your recovery code can’t allocate either. That’s because the Python exception object holds a reference to the stack frame where the error was raised, which prevents the original tensor objects from being freed. The solution is to move your OOM recovery code outside of the except clause.
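The effect of moving recovery out of the except clause can be demonstrated with plain Python objects. This is a minimal sketch, assuming CPython; FakeTensor, run_model and train_with_fallback are hypothetical names, and the RuntimeError is raised by hand to simulate an out-of-memory failure.

```python
import weakref


class FakeTensor:
    """Stand-in for a large tensor (hypothetical, not a torch type)."""


def run_model(batch_size, probes):
    activations = FakeTensor()
    probes.append(weakref.ref(activations))  # lets the caller observe its lifetime
    if batch_size > 1:
        # Simulated failure standing in for the out-of-memory RuntimeError.
        raise RuntimeError("out of memory (simulated)")


def train_with_fallback(batch_size):
    probes = []
    oom = False
    alive_inside = None
    try:
        run_model(batch_size, probes)
    except RuntimeError:
        oom = True  # only record the failure here
        # The in-flight exception still references run_model's stack frame.
        alive_inside = probes[0]() is not None
    # The exception is gone now, so the failed call's locals can be freed.
    alive_outside = probes[0]() is not None
    if oom:
        for _ in range(batch_size):
            run_model(1, probes)  # recovery runs outside the except clause
    return alive_inside, alive_outside


print(train_with_fallback(4))  # (True, False) on CPython
```

Inside the handler the failed call's object is still alive; once the except clause has finished, it has been released, so the retry starts with that memory available again.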


RuntimeError: $ Torch: not enough memory: you tried to allocate 72GB. Buy new RAM! #5434

Comments

ponomarevsy commented Feb 27, 2018

Red Hat Enterprise Linux Server release 7.2 (Maipo)

pytorch 0.3.1 py35_cuda8.0.61_cudnn7.0.5_2 pytorch
torchvision 0.2.0 py35heaa392f_1 pytorch

  • How you installed PyTorch (conda, pip, source):

$ module load anaconda3/4.3.1
$ source activate pytorchenv
$ conda install pytorch torchvision -c pytorch

$ python -V
Python 3.5.5

  • GPU models and configuration:
  • GCC version (if compiling from source):

In addition, including the following information will also be very helpful for us to diagnose the problem:

  • A script to reproduce the bug. Please try to provide as minimal of a test case as possible.
  • Error messages and/or stack traces of the bug
  • Context around what you are trying to do

Training a model with:

I’ve tried batch sizes from 128 to 8, and using GPUs from just one to all 8. GPU node has plenty of RAM (124G):

$ free
            total        used        free      shared  buff/cache   available
Mem:    131930696     6299776   124697204       16968      933716   124718092
Swap:    16777212      441996    16335216

Do you have a maximum RAM allocation limit hardcoded in PyTorch (file "THGeneral.c")? Thank you in advance!

Complete error message:


apaszke commented Feb 27, 2018

72GB is a lot, it might be the OS that rejects such a request. We don’t have any allocation limits, this error is raised when malloc fails to claim more memory.

zou3519 commented Feb 27, 2018

@ponomarevsy do you think your code should be allocating 72GB?

I’m currently investigating large memory usage with convolutions on the CPU: #5285, if you’re using any convolutions in your model this could be related.

ponomarevsy commented Feb 27, 2018

Thank you for your feedback, @apaszke and @zou3519. @apaszke, are you sure this is not a memory leak similar to the one @zou3519 is investigating? Also, I see no reason why PyTorch wouldn’t allocate more than 72G (if needed), knowing that there is 124G available on the node. Still, this sounds like too much RAM to me. I remember having similar issues using DIGITS (with Caffe). Do these training sets require that much RAM?

colesbury commented Feb 27, 2018

You probably don’t have 124 GB available at the point where you try to allocate 72GB. It looks like the triggering call is torch.stack. Unless you have repeated inputs, there’s a good chance that the inputs use up 72 GB as well.

yeladlouni commented Aug 24, 2018

It could be a problem of memory alignment.

lisiyaoATbnu commented Dec 3, 2018

I think I’ve met a similar situation.

I tested my small (totally forward) network on one 1000×1000 image but it seemed to allocate 45G memory. Hence it broke.

Very interestingly, when I tried Python 2 (different env) on the same computer, it ran and I got a good result.

I’m still confused by that.

linchart commented Dec 24, 2018

I also ran into a similar situation with a small network and small data. After my code had run for 5 epochs, it raised: RuntimeError: $ Torch: not enough memory: you tried to allocate 5GB. Very strange.

ronalddas commented Apr 5, 2020

I am facing a similar issue too. I am trying to run evaluation on a 28 MB model file and I get: RuntimeError: [enforce fail at CPUAllocator.cpp:64] . DefaultCPUAllocator: can’t allocate memory: you tried to allocate 377216000 bytes. Error code 12 (Cannot allocate memory). Why is it trying to allocate petabytes? (Correction: 377 MB.)

zou3519 commented Apr 6, 2020

@ronalddas 377216000 bytes is about 377 MB
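The conversion is easy to check by hand. This is a small illustrative helper (the function name is made up) that prints a byte count in both decimal megabytes and binary mebibytes:

```python
def describe_bytes(n):
    """Format a byte count in decimal megabytes (10**6) and mebibytes (2**20)."""
    return f"{n} bytes = {n / 10**6:.1f} MB = {n / 2**20:.1f} MiB"


print(describe_bytes(377216000))
print(describe_bytes(108216000))
```

Either way the request is a few hundred megabytes, many orders of magnitude below a petabyte (10**15 bytes).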

GraphGrailAi commented Jul 1, 2020

Similar problem: I have 4 GB of RAM and a GPU with 12 GB of memory, and the model I am trying to load is 5 GB. But the error says "you tried to allocate 108216000 bytes", which is only about 108 MB. That is strange.

dzungarian commented Aug 20, 2020

When I save my model using torch.save, load it again using torch.load and run the embedding layer in the model, I encounter the same error. But if I don’t save the model, it runs without errors.

nusherjk commented Sep 13, 2020

I am having the same issue with 196 MB of RAM. Any solution?

nusherjk commented Sep 27, 2020

Fixed it with the garbage collector.
Add import gc
and call gc.collect() at the end of each epoch, or wherever you please.
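Why an explicit gc.collect() can help is easiest to see with a reference cycle, which reference counting alone never frees. This is a minimal sketch, assuming CPython; the Trial class is a hypothetical stand-in, and whether cycles were the actual cause in this thread is not established.

```python
import gc
import weakref


class Trial:
    """Hypothetical object that ends up in a reference cycle."""

    def __init__(self):
        self.me = self  # self-reference: refcounting alone cannot free this


def run_epoch():
    trial = Trial()
    return weakref.ref(trial)  # observe the object without keeping it alive


gc.disable()  # keeps the demo deterministic; not needed in real code
probe = run_epoch()
alive_before = probe() is not None  # the cycle keeps the object alive
gc.collect()                        # the cycle collector reclaims it
alive_after = probe() is not None
gc.enable()

print(alive_before, alive_after)  # True False
```

In a training script the equivalent would be a gc.collect() call at the end of each epoch, as suggested above. It only reclaims objects that are unreachable, so it does not help if the code still holds references to them.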

afogarty85 commented Mar 22, 2021

This fixed a similar issue for me with repeated hyperopt trials. Thank you.

Mehmaam99 commented Jun 6, 2022

I had the same issue. Open Task Manager, end all tasks related to this code (e.g. Python, VS Code, etc.) and restart your IDE.
These steps solved my issue.
