A few days back I got the same error at the 12th epoch; this time it happens at the 1st. I have no idea why, since I did not change the model at all. The only change was normalizing the input so that X_train.max() is 1 after scaling, as it should be.
Does it have something to do with the patch size? Should I reduce it?
Why do I get this error, and how can I fix it?
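For reference, this is roughly the normalization step being described (a minimal sketch; the random data and shapes are only stand-ins, the divide-by-max step is what the post refers to):

import numpy as np

# Scale the patches so that X_train.max() ends up as 1.0.
# Shape (N, 64, 64, 64, 3) matches the model input below; the random
# integers are only a placeholder for the real patches.
X_train = np.random.randint(0, 256, size=(10, 64, 64, 64, 3)).astype(np.float32)
X_train = X_train / X_train.max()
print(X_train.max())  # 1.0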
my_model.summary()
Model: "U-Net"
__________________________________________________________________________________________________
Layer (type) Output Shape Param # Connected to
==================================================================================================
input_6 (InputLayer)                         [(None, 64, 64, 64, 3)]    0         []
conv3d_95 (Conv3D)                           (None, 64, 64, 64, 64)     5248      ['input_6[0][0]']
batch_normalization_90 (BatchNormalization)  (None, 64, 64, 64, 64)     256       ['conv3d_95[0][0]']
activation_90 (Activation)                   (None, 64, 64, 64, 64)     0         ['batch_normalization_90[0][0]']
conv3d_96 (Conv3D)                           (None, 64, 64, 64, 64)     110656    ['activation_90[0][0]']
batch_normalization_91 (BatchNormalization)  (None, 64, 64, 64, 64)     256       ['conv3d_96[0][0]']
activation_91 (Activation)                   (None, 64, 64, 64, 64)     0         ['batch_normalization_91[0][0]']
max_pooling3d_20 (MaxPooling3D)              (None, 32, 32, 32, 64)     0         ['activation_91[0][0]']
conv3d_97 (Conv3D)                           (None, 32, 32, 32, 128)    221312    ['max_pooling3d_20[0][0]']
batch_normalization_92 (BatchNormalization)  (None, 32, 32, 32, 128)    512       ['conv3d_97[0][0]']
activation_92 (Activation)                   (None, 32, 32, 32, 128)    0         ['batch_normalization_92[0][0]']
conv3d_98 (Conv3D)                           (None, 32, 32, 32, 128)    442496    ['activation_92[0][0]']
batch_normalization_93 (BatchNormalization)  (None, 32, 32, 32, 128)    512       ['conv3d_98[0][0]']
activation_93 (Activation)                   (None, 32, 32, 32, 128)    0         ['batch_normalization_93[0][0]']
max_pooling3d_21 (MaxPooling3D)              (None, 16, 16, 16, 128)    0         ['activation_93[0][0]']
conv3d_99 (Conv3D)                           (None, 16, 16, 16, 256)    884992    ['max_pooling3d_21[0][0]']
batch_normalization_94 (BatchNormalization)  (None, 16, 16, 16, 256)    1024      ['conv3d_99[0][0]']
activation_94 (Activation)                   (None, 16, 16, 16, 256)    0         ['batch_normalization_94[0][0]']
conv3d_100 (Conv3D)                          (None, 16, 16, 16, 256)    1769728   ['activation_94[0][0]']
batch_normalization_95 (BatchNormalization)  (None, 16, 16, 16, 256)    1024      ['conv3d_100[0][0]']
activation_95 (Activation)                   (None, 16, 16, 16, 256)    0         ['batch_normalization_95[0][0]']
max_pooling3d_22 (MaxPooling3D)              (None, 8, 8, 8, 256)       0         ['activation_95[0][0]']
conv3d_101 (Conv3D)                          (None, 8, 8, 8, 512)       3539456   ['max_pooling3d_22[0][0]']
batch_normalization_96 (BatchNormalization)  (None, 8, 8, 8, 512)       2048      ['conv3d_101[0][0]']
activation_96 (Activation)                   (None, 8, 8, 8, 512)       0         ['batch_normalization_96[0][0]']
conv3d_102 (Conv3D)                          (None, 8, 8, 8, 512)       7078400   ['activation_96[0][0]']
batch_normalization_97 (BatchNormalization)  (None, 8, 8, 8, 512)       2048      ['conv3d_102[0][0]']
activation_97 (Activation)                   (None, 8, 8, 8, 512)       0         ['batch_normalization_97[0][0]']
max_pooling3d_23 (MaxPooling3D)              (None, 4, 4, 4, 512)       0         ['activation_97[0][0]']
conv3d_103 (Conv3D)                          (None, 4, 4, 4, 1024)      14156800  ['max_pooling3d_23[0][0]']
batch_normalization_98 (BatchNormalization)  (None, 4, 4, 4, 1024)      4096      ['conv3d_103[0][0]']
activation_98 (Activation)                   (None, 4, 4, 4, 1024)      0         ['batch_normalization_98[0][0]']
conv3d_104 (Conv3D)                          (None, 4, 4, 4, 1024)      28312576  ['activation_98[0][0]']
batch_normalization_99 (BatchNormalization)  (None, 4, 4, 4, 1024)      4096      ['conv3d_104[0][0]']
activation_99 (Activation)                   (None, 4, 4, 4, 1024)      0         ['batch_normalization_99[0][0]']
conv3d_transpose_20 (Conv3DTranspose)        (None, 8, 8, 8, 512)       4194816   ['activation_99[0][0]']
concatenate_20 (Concatenate)                 (None, 8, 8, 8, 1024)      0         ['conv3d_transpose_20[0][0]', 'activation_97[0][0]']
conv3d_105 (Conv3D)                          (None, 8, 8, 8, 512)       14156288  ['concatenate_20[0][0]']
batch_normalization_100 (BatchNormalization) (None, 8, 8, 8, 512)       2048      ['conv3d_105[0][0]']
activation_100 (Activation)                  (None, 8, 8, 8, 512)       0         ['batch_normalization_100[0][0]']
conv3d_106 (Conv3D)                          (None, 8, 8, 8, 512)       7078400   ['activation_100[0][0]']
batch_normalization_101 (BatchNormalization) (None, 8, 8, 8, 512)       2048      ['conv3d_106[0][0]']
activation_101 (Activation)                  (None, 8, 8, 8, 512)       0         ['batch_normalization_101[0][0]']
conv3d_transpose_21 (Conv3DTranspose)        (None, 16, 16, 16, 256)    1048832   ['activation_101[0][0]']
concatenate_21 (Concatenate)                 (None, 16, 16, 16, 512)    0         ['conv3d_transpose_21[0][0]', 'activation_95[0][0]']
conv3d_107 (Conv3D)                          (None, 16, 16, 16, 256)    3539200   ['concatenate_21[0][0]']
batch_normalization_102 (BatchNormalization) (None, 16, 16, 16, 256)    1024      ['conv3d_107[0][0]']
activation_102 (Activation)                  (None, 16, 16, 16, 256)    0         ['batch_normalization_102[0][0]']
conv3d_108 (Conv3D)                          (None, 16, 16, 16, 256)    1769728   ['activation_102[0][0]']
batch_normalization_103 (BatchNormalization) (None, 16, 16, 16, 256)    1024      ['conv3d_108[0][0]']
activation_103 (Activation)                  (None, 16, 16, 16, 256)    0         ['batch_normalization_103[0][0]']
conv3d_transpose_22 (Conv3DTranspose)        (None, 32, 32, 32, 128)    262272    ['activation_103[0][0]']
concatenate_22 (Concatenate)                 (None, 32, 32, 32, 256)    0         ['conv3d_transpose_22[0][0]', 'activation_93[0][0]']
conv3d_109 (Conv3D)                          (None, 32, 32, 32, 128)    884864    ['concatenate_22[0][0]']
batch_normalization_104 (BatchNormalization) (None, 32, 32, 32, 128)    512       ['conv3d_109[0][0]']
activation_104 (Activation)                  (None, 32, 32, 32, 128)    0         ['batch_normalization_104[0][0]']
conv3d_110 (Conv3D)                          (None, 32, 32, 32, 128)    442496    ['activation_104[0][0]']
batch_normalization_105 (BatchNormalization) (None, 32, 32, 32, 128)    512       ['conv3d_110[0][0]']
activation_105 (Activation)                  (None, 32, 32, 32, 128)    0         ['batch_normalization_105[0][0]']
conv3d_transpose_23 (Conv3DTranspose)        (None, 64, 64, 64, 64)     65600     ['activation_105[0][0]']
concatenate_23 (Concatenate)                 (None, 64, 64, 64, 128)    0         ['conv3d_transpose_23[0][0]', 'activation_91[0][0]']
conv3d_111 (Conv3D)                          (None, 64, 64, 64, 64)     221248    ['concatenate_23[0][0]']
batch_normalization_106 (BatchNormalization) (None, 64, 64, 64, 64)     256       ['conv3d_111[0][0]']
activation_106 (Activation)                  (None, 64, 64, 64, 64)     0         ['batch_normalization_106[0][0]']
conv3d_112 (Conv3D)                          (None, 64, 64, 64, 64)     110656    ['activation_106[0][0]']
batch_normalization_107 (BatchNormalization) (None, 64, 64, 64, 64)     256       ['conv3d_112[0][0]']
activation_107 (Activation)                  (None, 64, 64, 64, 64)     0         ['batch_normalization_107[0][0]']
conv3d_113 (Conv3D)                          (None, 64, 64, 64, 4)      260       ['activation_107[0][0]']
==================================================================================================
Total params: 90,319,876
Trainable params: 90,308,100
Non-trainable params: 11,776
__________________________________________________________________________________________________
None
Error Message Log:
Epoch 1/100
---------------------------------------------------------------------------
ResourceExhaustedError Traceback (most recent call last)
<ipython-input-52-ec522ff5ad08> in <module>()
5 epochs=100,
6 verbose=1,
----> 7 validation_data=(X_test, y_test))
1 frames
/usr/local/lib/python3.7/dist-packages/tensorflow/python/eager/execute.py in quick_execute(op_name, num_outputs, inputs, attrs, ctx, name)
53 ctx.ensure_initialized()
54 tensors = pywrap_tfe.TFE_Py_Execute(ctx._handle, device_name, op_name,
---> 55 inputs, attrs, num_outputs)
56 except core._NotOkStatusException as e:
57 if name is not None:
ResourceExhaustedError: Graph execution error:
Detected at node 'U-Net/concatenate_23/concat' defined at (most recent call last):
File "/usr/lib/python3.7/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "/usr/lib/python3.7/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/usr/local/lib/python3.7/dist-packages/ipykernel_launcher.py", line 16, in <module>
app.launch_new_instance()
File "/usr/local/lib/python3.7/dist-packages/traitlets/config/application.py", line 846, in launch_instance
app.start()
File "/usr/local/lib/python3.7/dist-packages/ipykernel/kernelapp.py", line 499, in start
self.io_loop.start()
File "/usr/local/lib/python3.7/dist-packages/tornado/platform/asyncio.py", line 132, in start
self.asyncio_loop.run_forever()
File "/usr/lib/python3.7/asyncio/base_events.py", line 541, in run_forever
self._run_once()
File "/usr/lib/python3.7/asyncio/base_events.py", line 1786, in _run_once
handle._run()
File "/usr/lib/python3.7/asyncio/events.py", line 88, in _run
self._context.run(self._callback, *self._args)
File "/usr/local/lib/python3.7/dist-packages/tornado/platform/asyncio.py", line 122, in _handle_events
handler_func(fileobj, events)
File "/usr/local/lib/python3.7/dist-packages/tornado/stack_context.py", line 300, in null_wrapper
return fn(*args, **kwargs)
File "/usr/local/lib/python3.7/dist-packages/zmq/eventloop/zmqstream.py", line 452, in _handle_events
self._handle_recv()
File "/usr/local/lib/python3.7/dist-packages/zmq/eventloop/zmqstream.py", line 481, in _handle_recv
self._run_callback(callback, msg)
File "/usr/local/lib/python3.7/dist-packages/zmq/eventloop/zmqstream.py", line 431, in _run_callback
callback(*args, **kwargs)
File "/usr/local/lib/python3.7/dist-packages/tornado/stack_context.py", line 300, in null_wrapper
return fn(*args, **kwargs)
File "/usr/local/lib/python3.7/dist-packages/ipykernel/kernelbase.py", line 283, in dispatcher
return self.dispatch_shell(stream, msg)
File "/usr/local/lib/python3.7/dist-packages/ipykernel/kernelbase.py", line 233, in dispatch_shell
handler(stream, idents, msg)
File "/usr/local/lib/python3.7/dist-packages/ipykernel/kernelbase.py", line 399, in execute_request
user_expressions, allow_stdin)
File "/usr/local/lib/python3.7/dist-packages/ipykernel/ipkernel.py", line 208, in do_execute
res = shell.run_cell(code, store_history=store_history, silent=silent)
File "/usr/local/lib/python3.7/dist-packages/ipykernel/zmqshell.py", line 537, in run_cell
return super(ZMQInteractiveShell, self).run_cell(*args, **kwargs)
File "/usr/local/lib/python3.7/dist-packages/IPython/core/interactiveshell.py", line 2718, in run_cell
interactivity=interactivity, compiler=compiler, result=result)
File "/usr/local/lib/python3.7/dist-packages/IPython/core/interactiveshell.py", line 2822, in run_ast_nodes
if self.run_code(code, result):
File "/usr/local/lib/python3.7/dist-packages/IPython/core/interactiveshell.py", line 2882, in run_code
exec(code_obj, self.user_global_ns, self.user_ns)
File "<ipython-input-52-ec522ff5ad08>", line 7, in <module>
validation_data=(X_test, y_test))
File "/usr/local/lib/python3.7/dist-packages/keras/utils/traceback_utils.py", line 64, in error_handler
return fn(*args, **kwargs)
File "/usr/local/lib/python3.7/dist-packages/keras/engine/training.py", line 1384, in fit
tmp_logs = self.train_function(iterator)
File "/usr/local/lib/python3.7/dist-packages/keras/engine/training.py", line 1021, in train_function
return step_function(self, iterator)
File "/usr/local/lib/python3.7/dist-packages/keras/engine/training.py", line 1010, in step_function
outputs = model.distribute_strategy.run(run_step, args=(data,))
File "/usr/local/lib/python3.7/dist-packages/keras/engine/training.py", line 1000, in run_step
outputs = model.train_step(data)
File "/usr/local/lib/python3.7/dist-packages/keras/engine/training.py", line 859, in train_step
y_pred = self(x, training=True)
File "/usr/local/lib/python3.7/dist-packages/keras/utils/traceback_utils.py", line 64, in error_handler
return fn(*args, **kwargs)
File "/usr/local/lib/python3.7/dist-packages/keras/layers/merge.py", line 531, in _merge_function
return backend.concatenate(inputs, axis=self.axis)
File "/usr/local/lib/python3.7/dist-packages/keras/backend.py", line 3313, in concatenate
return tf.concat([to_dense(x) for x in tensors], axis)
Node: 'U-Net/concatenate_23/concat'
OOM when allocating tensor with shape[8,128,64,64,64] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
[[{{node U-Net/concatenate_23/concat}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info. This isn't available when running in Eager mode.
[Op:__inference_train_function_24517]
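For scale: the tensor the allocator fails on, shape [8, 128, 64, 64, 64] in float32, is about 1 GiB on its own, and backprop keeps many activations of comparable size alive at the same time, which is how the ~11 GiB card shown below fills up. A quick back-of-the-envelope check:

# Size of the single activation named in the OOM message:
# shape [8, 128, 64, 64, 64], float32 = 4 bytes per element.
elements = 8 * 128 * 64 * 64 * 64
print(elements * 4 / 1024**3, "GiB")  # 1.0 GiB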
GPU details (nvidia-smi output):
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03 Driver Version: 460.32.03 CUDA Version: 11.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Tesla K80 Off | 00000000:00:04.0 Off | 0 |
| N/A 72C P0 73W / 149W | 11077MiB / 11441MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
+-----------------------------------------------------------------------------+
I'm new to TensorFlow and all of this ML stuff, honestly. Would really appreciate any help. Thanks.
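Not a guaranteed fix for this exact setup, but the usual first mitigations for an OOM like this are a smaller batch size (and, if that is not enough, a smaller patch size or fewer base filters) plus incremental GPU memory allocation. A minimal sketch; my_model, X_train, y_train, X_test, and y_test are the objects from the post above, the rest is illustrative:

import tensorflow as tf

# Ask TensorFlow to grow GPU memory on demand instead of reserving it all
# up front (must run before the GPU is first used).
for gpu in tf.config.list_physical_devices('GPU'):
    tf.config.experimental.set_memory_growth(gpu, True)

# Halving the batch size roughly halves the activation memory per step;
# for 64x64x64x3 patches on an ~11 GiB card, batch_size=1 or 2 is a
# common starting point.
history = my_model.fit(X_train, y_train,
                       batch_size=2,
                       epochs=100,
                       verbose=1,
                       validation_data=(X_test, y_test))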
System information
- Have I written custom code (as opposed to using a stock example script provided in TensorFlow): Custom code, but nothing really fancy
- TensorFlow installed from (source or binary): conda installed from source
- TensorFlow version (use command below): 2.1.0
- Python version: 3.7
- CUDA/cuDNN version: CUDA 10.1
- GPU model and memory: Quadro M1200. 8GB RAM
Describe the current behavior
I get a ResourceExhaustedError during training, even with a batch size of one. I have to call predict and fit separately, because in between I have to compute a return function based on the predictions, which is then the input to model.fit(). When I start training, the GPU has 4000 MB of free memory; after initialization it drops to 2022 MB. It stays there until epoch 92, when it drops to 949 MB free. After 186 epochs it drops to 730 MB free, and after 197 epochs I get the error:
ResourceExhaustedError: OOM when allocating tensor with shape[108,32,103,66] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
[[node MaxPoolGrad_2 (defined at C:UsersFloorDocumentsBasic modeltestmaptest.py:233) ]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
[Op:__inference_distributed_function_3945]
Standalone code to reproduce the issue
import tensorflow as tf
tf.config.experimental.set_memory_growth(tf.config.experimental.list_physical_devices('GPU')[0], True)
import numpy as np
import sys
from tensorflow.keras.models import Sequential
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.layers import Dense, Conv2D, Activation, Flatten, MaxPooling2D

def create_model():
    model = Sequential()
    model.add(Conv2D(32, (6, 6), input_shape=(108, 71, 9)))
    model.add(Activation("relu"))
    model.add(MaxPooling2D(pool_size=(2, 2)))
    model.add(Conv2D(32, (6, 6)))
    model.add(Activation("relu"))
    model.add(MaxPooling2D(pool_size=(2, 2)))
    model.add(Conv2D(32, (6, 6)))
    model.add(Activation("relu"))
    model.add(MaxPooling2D(pool_size=(2, 2)))
    model.add(Flatten())
    model.add(Dense(256, activation='relu'))
    model.add(Dense(1, activation="sigmoid"))
    model.compile(optimizer=Adam(lr=0.00001/10), loss='mean_squared_error')
    return model

model = create_model()
for i in range(N):  # N = number of epochs
    model.predict(EpisodesP.reshape([-1, img_w, img_h, dim]))
    # compute return function G from the predictions
    model.fit(EpisodesP, G, epochs=1, verbose=0, batch_size=1)
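Not part of the original report, but for a loop like the one above the *_on_batch methods are a common lighter-weight variant, since they reuse the same compiled functions instead of rebuilding a data pipeline on every predict()/fit() call; a sketch, with compute_return as a hypothetical stand-in for the return-function step mentioned in the post:

for i in range(N):  # N training iterations, as in the loop above
    preds = model.predict_on_batch(EpisodesP.reshape([-1, img_w, img_h, dim]))
    G = compute_return(preds)          # hypothetical return-function step
    loss = model.train_on_batch(EpisodesP, G)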
Other info / logs
Traceback (most recent call last):
File «C:UsersFloorDocumentsBasic modeltestmaptest.py», line 267, in
history,value,model,loss, loss_episode= basic_code(Episodes, Success, N = 1000, P = 1)
File «C:UsersFloorDocumentsBasic modeltestmaptest.py», line 233, in basic_code
model.fit(EpisodesP, GP, epochs = 1, verbose = 0, batch_size=sum(TP)) #model fitted to get the loss
File «C:UsersFlooranaconda3libsite-packagestensorflow_corepythonkerasenginetraining.py», line 819, in fit
use_multiprocessing=use_multiprocessing)
File «C:UsersFlooranaconda3libsite-packagestensorflow_corepythonkerasenginetraining_v2.py», line 342, in fit
total_epochs=epochs)
File «C:UsersFlooranaconda3libsite-packagestensorflow_corepythonkerasenginetraining_v2.py», line 128, in run_one_epoch
batch_outs = execution_function(iterator)
File «C:UsersFlooranaconda3libsite-packagestensorflow_corepythonkerasenginetraining_v2_utils.py», line 98, in execution_function
distributed_function(input_fn))
File «C:UsersFlooranaconda3libsite-packagestensorflow_corepythoneagerdef_function.py», line 568, in call
result = self._call(*args, **kwds)
File «C:UsersFlooranaconda3libsite-packagestensorflow_corepythoneagerdef_function.py», line 599, in _call
return self._stateless_fn(*args, **kwds) # pylint: disable=not-callable
File «C:UsersFlooranaconda3libsite-packagestensorflow_corepythoneagerfunction.py», line 2363, in call
return graph_function._filtered_call(args, kwargs) # pylint: disable=protected-access
File «C:UsersFlooranaconda3libsite-packagestensorflow_corepythoneagerfunction.py», line 1611, in _filtered_call
self.captured_inputs)
File «C:UsersFlooranaconda3libsite-packagestensorflow_corepythoneagerfunction.py», line 1692, in _call_flat
ctx, args, cancellation_manager=cancellation_manager))
File «C:UsersFlooranaconda3libsite-packagestensorflow_corepythoneagerfunction.py», line 545, in call
ctx=ctx)
File «C:UsersFlooranaconda3libsite-packagestensorflow_corepythoneagerexecute.py», line 67, in quick_execute
six.raise_from(core._status_to_exception(e.code, message), None)
File «», line 3, in raise_from
ResourceExhaustedError: OOM when allocating tensor with shape[108,32,103,66] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
[[node MaxPoolGrad_2 (defined at C:UsersFloorDocumentsBasic modeltestmaptest.py:233) ]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
[Op:__inference_distributed_function_3945]
Function call stack:
distributed_function
I am running out of memory while my gradients are being calculated, but I do not believe the problem is actually a lack of memory. I have created my own layer, and I assume I am using an operation in it somewhere that may not be differentiable.
Here is my custom layer:
class MaskedDense(tf.keras.layers.Layer):
    def __init__(self,
                 units,
                 max_num_features,
                 activation=None,
                 use_bias=True,
                 kernel_initializer='glorot_uniform',
                 bias_initializer='zeros',
                 kernel_regularizer=None,
                 bias_regularizer=None,
                 activity_regularizer=None,
                 kernel_constraint=None,
                 bias_constraint=None,
                 **kwargs):
        super(MaskedDense, self).__init__(
            activity_regularizer=activity_regularizer, **kwargs)
        self.units = int(units) if not isinstance(units, int) else units
        if self.units < 0:
            raise ValueError(f'Received an invalid value for `units`, expected '
                             f'a positive integer, got {units}.')
        self.max_num_features = max_num_features
        self.activation = tf.keras.activations.get(activation)
        self.use_bias = use_bias
        self.kernel_initializer = tf.keras.initializers.get(kernel_initializer)
        self.bias_initializer = tf.keras.initializers.get(bias_initializer)
        self.kernel_regularizer = tf.keras.regularizers.get(kernel_regularizer)
        self.bias_regularizer = tf.keras.regularizers.get(bias_regularizer)
        self.kernel_constraint = tf.keras.constraints.get(kernel_constraint)
        self.bias_constraint = tf.keras.constraints.get(bias_constraint)
        self.input_spec = tf.keras.layers.InputSpec(min_ndim=2)
        self.supports_masking = True
        self.flatten = tf.keras.layers.Flatten()

    def build(self, input_shape):
        dtype = tf.dtypes.as_dtype(self.dtype or tf.keras.backend.floatx())
        if not (dtype.is_floating or dtype.is_complex):
            raise TypeError('Unable to build `Dense` layer with non-floating point '
                            'dtype %s' % (dtype,))
        # input_shape = tf.TensorShape(input_shape)
        # last_dim = input_shape[-1]
        # if last_dim is None:
        #     raise ValueError('The last dimension of the inputs to `Dense` '
        #                      'should be defined. Found `None`.')
        self.input_spec = tf.keras.layers.InputSpec(min_ndim=2, axes={-1: self.max_num_features})
        self.kernel = self.add_weight(
            'kernel',
            shape=[self.max_num_features, self.units],
            initializer=self.kernel_initializer,
            regularizer=self.kernel_regularizer,
            constraint=self.kernel_constraint,
            dtype=self.dtype,
            trainable=True)
        if self.use_bias:
            self.bias = self.add_weight(
                'bias',
                shape=[self.units,],
                initializer=self.bias_initializer,
                regularizer=self.bias_regularizer,
                constraint=self.bias_constraint,
                dtype=self.dtype,
                trainable=True)
        else:
            self.bias = None
        self.built = True

    def call(self, inputs, feature_mask=None):
        if feature_mask is None:
            kernel = self.kernel
        else:
            flattened_feature_mask = self.flatten(feature_mask)
            if self.units > 1:
                inputs = tf.expand_dims(inputs, axis=2)
                inputs = tf.repeat(inputs, self.units, axis=2)
                inputs = tf.ragged.boolean_mask(inputs, flattened_feature_mask)
                kernel = tf.expand_dims(self.kernel, axis=0)
                kernel = tf.repeat(kernel, inputs.shape[0], axis=0)
                flattened_feature_mask = tf.expand_dims(flattened_feature_mask, axis=2)
                flattened_feature_mask = tf.repeat(flattened_feature_mask, self.units, axis=2)
                kernels = []
                for unit in range(self.units):
                    kernels.append(
                        tf.ragged.boolean_mask(
                            tf.squeeze(kernel[:, :, unit]),
                            tf.squeeze(flattened_feature_mask[:, :, unit])))
                kernels = tf.stack(kernels, axis=2)
            else:
                inputs = tf.ragged.boolean_mask(inputs, flattened_feature_mask)
                kernel = tf.ragged.boolean_mask(tf.transpose(tf.repeat(self.kernel, feature_mask.shape[0], axis=-1)), flattened_feature_mask)
        if inputs.dtype.base_dtype != self._compute_dtype_object.base_dtype:
            inputs = tf.cast(inputs, dtype=self._compute_dtype_object)
        rank = inputs.shape.rank
        if rank == 2 or rank is None:
            outputs = tf.expand_dims(tf.reduce_sum(tf.multiply(inputs, kernel), axis=1), 1)
        else:
            outputs = tf.expand_dims(tf.reduce_sum(tf.multiply(inputs, kernels), axis=1), 1)
        outputs = tf.squeeze(outputs)
        # Reshape the output back to the original ndim of the input.
        # if not tf.executing_eagerly():
        #     shape = inputs.shape.as_list()
        #     output_shape = shape[:-1] + [kernel.shape[-1]]
        #     outputs.set_shape(output_shape)
        if self.use_bias:
            outputs = tf.nn.bias_add(outputs, self.bias)
        if self.activation is not None:
            outputs = self.activation(outputs)
        return outputs
Its purpose is to take a tensor of continuous inputs with a large number of missing values, together with a boolean mask tensor of the same shape whose True/False values denote which features should be masked and which should not. The point is to mask the missing values so that I do not have to drop every observation with missing values (which would be most of the data set) or attempt to impute them.
It does this by creating a weight kernel whose shape equals the maximum number of possible features and taking the dot product between the masked inputs and the masked kernel, so that the layer's output ignores any missing values. I assume this is what is causing issues with the gradient calculation.
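As a side note, the masking idea described above can also be stated without ragged tensors: zeroing the masked-out entries before an ordinary dense matmul gives the same dot product as dropping them. This is only my restatement of the idea, not the layer above, and it assumes the missing entries already hold some finite placeholder value:

import tensorflow as tf

def masked_dense(inputs, feature_mask, kernel, bias=None):
    """inputs: (batch, max_num_features); feature_mask: same shape, bool."""
    # Masked-out features are set to 0, so they contribute nothing to the
    # dot product with the kernel.
    safe_inputs = tf.where(feature_mask, inputs, tf.zeros_like(inputs))
    outputs = tf.matmul(safe_inputs, kernel)  # kernel: (max_num_features, units)
    if bias is not None:
        outputs = tf.nn.bias_add(outputs, bias)
    return outputs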
Does anybody know how I might debug this? I am not too familiar with inspecting TensorFlow gradients.
Here is the entire error:
ResourceExhaustedError: Graph execution error:
Detected at node 'gradient_tape/masked_dense_27/strided_slice_56/StridedSliceGrad' defined at (most recent call last):
File "C:UsersJJAppDataLocalProgramsPythonPython38librunpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "C:UsersJJAppDataLocalProgramsPythonPython38librunpy.py", line 87, in _run_code
exec(code, run_globals)
File "c:UsersJJDocumentsGitcig_dbbuilding.venvlibsite-packagesipykernel_launcher.py", line 17, in <module>
app.launch_new_instance()
File "c:UsersJJDocumentsGitcig_dbbuilding.venvlibsite-packagestraitletsconfigapplication.py", line 972, in launch_instance
app.start()
File "c:UsersJJDocumentsGitcig_dbbuilding.venvlibsite-packagesipykernelkernelapp.py", line 712, in start
self.io_loop.start()
File "c:UsersJJDocumentsGitcig_dbbuilding.venvlibsite-packagestornadoplatformasyncio.py", line 199, in start
self.asyncio_loop.run_forever()
File "C:UsersJJAppDataLocalProgramsPythonPython38libasynciobase_events.py", line 570, in run_forever
self._run_once()
File "C:UsersJJAppDataLocalProgramsPythonPython38libasynciobase_events.py", line 1859, in _run_once
handle._run()
File "C:UsersJJAppDataLocalProgramsPythonPython38libasyncioevents.py", line 81, in _run
self._context.run(self._callback, *self._args)
File "c:UsersJJDocumentsGitcig_dbbuilding.venvlibsite-packagesipykernelkernelbase.py", line 504, in dispatch_queue
await self.process_one()
File "c:UsersJJDocumentsGitcig_dbbuilding.venvlibsite-packagesipykernelkernelbase.py", line 493, in process_one
await dispatch(*args)
File "c:UsersJJDocumentsGitcig_dbbuilding.venvlibsite-packagesipykernelkernelbase.py", line 400, in dispatch_shell
await result
File "c:UsersJJDocumentsGitcig_dbbuilding.venvlibsite-packagesipykernelkernelbase.py", line 724, in execute_request
reply_content = await reply_content
File "c:UsersJJDocumentsGitcig_dbbuilding.venvlibsite-packagesipykernelipkernel.py", line 383, in do_execute
res = shell.run_cell(
File "c:UsersJJDocumentsGitcig_dbbuilding.venvlibsite-packagesipykernelzmqshell.py", line 528, in run_cell
return super().run_cell(*args, **kwargs)
File "c:UsersJJDocumentsGitcig_dbbuilding.venvlibsite-packagesIPythoncoreinteractiveshell.py", line 2880, in run_cell
result = self._run_cell(
File "c:UsersJJDocumentsGitcig_dbbuilding.venvlibsite-packagesIPythoncoreinteractiveshell.py", line 2935, in _run_cell
return runner(coro)
File "c:UsersJJDocumentsGitcig_dbbuilding.venvlibsite-packagesIPythoncoreasync_helpers.py", line 129, in _pseudo_sync_runner
coro.send(None)
File "c:UsersJJDocumentsGitcig_dbbuilding.venvlibsite-packagesIPythoncoreinteractiveshell.py", line 3134, in run_cell_async
has_raised = await self.run_ast_nodes(code_ast.body, cell_name,
File "c:UsersJJDocumentsGitcig_dbbuilding.venvlibsite-packagesIPythoncoreinteractiveshell.py", line 3337, in run_ast_nodes
if await self.run_code(code, result, async_=asy):
File "c:UsersJJDocumentsGitcig_dbbuilding.venvlibsite-packagesIPythoncoreinteractiveshell.py", line 3397, in run_code
exec(code_obj, self.user_global_ns, self.user_ns)
File "C:UsersJJAppDataLocalTempipykernel_181961844441298.py", line 14, in <cell line: 14>
model.fit(
File "c:UsersJJDocumentsGitcig_dbbuilding.venvlibsite-packageskerasutilstraceback_utils.py", line 64, in error_handler
return fn(*args, **kwargs)
File "c:UsersJJDocumentsGitcig_dbbuilding.venvlibsite-packageskerasenginetraining.py", line 1384, in fit
tmp_logs = self.train_function(iterator)
File "c:UsersJJDocumentsGitcig_dbbuilding.venvlibsite-packageskerasenginetraining.py", line 1021, in train_function
return step_function(self, iterator)
File "c:UsersJJDocumentsGitcig_dbbuilding.venvlibsite-packageskerasenginetraining.py", line 1010, in step_function
outputs = model.distribute_strategy.run(run_step, args=(data,))
File "c:UsersJJDocumentsGitcig_dbbuilding.venvlibsite-packageskerasenginetraining.py", line 1000, in run_step
outputs = model.train_step(data)
File "C:UsersJJAppDataLocalTempipykernel_18196572172987.py", line 134, in train_step
gradients = tape.gradient(total_loss, self.trainable_variables)
Node: 'gradient_tape/masked_dense_27/strided_slice_56/StridedSliceGrad'
OOM when allocating tensor with shape[256,1420,64] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
[[{{node gradient_tape/masked_dense_27/strided_slice_56/StridedSliceGrad}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info. This isn't available when running in Eager mode.
[Op:__inference_train_function_58824686]
Here is my train step where the error occurs:
def train_step(self, inputs):
    """Custom train step using the `compute_loss` method."""
    with tf.GradientTape() as tape:
        loss = self.compute_loss(inputs, training=True)
        # Handle regularization losses as well.
        regularization_loss = sum(self.losses)
        total_loss = loss + regularization_loss
    gradients = tape.gradient(total_loss, self.trainable_variables)
    self.optimizer.apply_gradients(zip(gradients, self.trainable_variables))
    metrics = {metric.name: metric.result() for metric in self.metrics}
    # metrics["loss"] = loss
    # metrics["regularization_loss"] = regularization_loss
    metrics["total_loss"] = total_loss
    return metrics
Thanks for your help
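For what it's worth, one way to start inspecting the gradients is to run a single small batch through the same compute_loss under a GradientTape outside of fit() and check that every trainable variable receives a gradient of the expected shape (a None gradient would point at a non-differentiable or disconnected path). model and sample_batch below are placeholders for the objects in the post:

import tensorflow as tf

# Run one small batch eagerly and look at each variable's gradient.
with tf.GradientTape() as tape:
    loss = model.compute_loss(sample_batch, training=True)
grads = tape.gradient(loss, model.trainable_variables)
for var, grad in zip(model.trainable_variables, grads):
    print(var.name, None if grad is None else grad.shape)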
Some resource has been exhausted.
Inherits From: OpError
View aliases
Compat aliases for migration (see the Migration guide for details):
tf.compat.v1.errors.ResourceExhaustedError
tf.errors.ResourceExhaustedError(
    node_def, op, message, *args
)
For example, this error might be raised if a per-user quota is exhausted, or perhaps the entire file system is out of space.
| Attribute | Description |
|---|---|
| error_code | The integer error code that describes the error. |
| experimental_payloads | A dictionary describing the details of the error. |
| message | The error message that describes the error. |
| node_def | The NodeDef proto representing the op that failed. |
| op | The operation that failed, if known. |
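For illustration, the class can be caught like any other Python exception, e.g. to retry with a smaller batch when the GPU runs out of memory (model and the data names below are placeholders):

import tensorflow as tf

try:
    model.fit(X_train, y_train, batch_size=8)
except tf.errors.ResourceExhaustedError as e:
    # e.message holds the human-readable description from the runtime.
    print("OOM:", e.message)
    model.fit(X_train, y_train, batch_size=2)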
Related errors in tf.errors (TensorFlow 2.9):
- tf.errors.OutOfRangeError: Raised when an operation iterates past the valid input range.
- tf.errors.PermissionDeniedError: Raised when the caller does not have permission to run an operation.
- tf.errors.UnauthenticatedError: The request does not have valid authentication credentials.
- tf.errors.UnavailableError: Raised when the runtime is currently unavailable.