Reproduction steps:
1. Use one Jupyter cell for Code 1 below.
2. Run the code on GPU (CUDA).
3. In another cell, use the same code but change the loss to 'mse' (see the sketch after Code 1).
4. Run that code on GPU (CUDA).
Expectation: Keras runs the second cell without requiring a Python restart
Actual: The second cell cannot be run until Python is restarted
System: Python 3.10.16, Keras 3.8, Torch 2.3.1, CUDA 12.4
I am building a GUI component that lets users build custom architectures. While doing some random testing, I found the following:
The following code (Code 1) causes an assertion failure with the torch backend. TensorFlow, on the other hand, completes it gracefully. What is more troublesome is that the torch backend also fails to recover until Python is restarted, which is detrimental for interactive environments such as ipykernel and Python-based IDEs.
Equivalent code (Code 2) implemented directly in Torch has a tendency to fail as well, but Torch is able to recover, so you can re-run it without restarting the whole Python process.
Code 1:
import os
os.environ["KERAS_BACKEND"] = "torch"
import keras
from keras.models import Sequential
from keras.layers import Input, Dense
import numpy as np
x_values = np.array([1, 2, 3, 4], dtype=np.float32)
y_values = np.array([0.50 + i * 0.50 for i in x_values], dtype=np.float32)
model = Sequential()
model.add(Input(shape=(1,)))
model.add(Dense(1, activation='relu'))
model.compile(optimizer='sgd', loss='binary_crossentropy')
model.fit(x_values, y_values, epochs=10, batch_size=1)
loss = model.evaluate(x_values, y_values)
print(f"Loss: {loss}")
Code 2:
import torch
import torch.nn as nn
import torch.optim as optim
import numpy as np
import os
x_values = np.array([1, 2, 3, 4], dtype=np.float32).reshape(-1, 1)
y_values = np.array([0.50 + i * 0.50 for i in x_values], dtype=np.float32).reshape(-1, 1)
x_train = torch.tensor(x_values)
y_train = torch.tensor(y_values)
class SimpleModel(nn.Module):
    def __init__(self):
        super(SimpleModel, self).__init__()
        self.layer1 = nn.Linear(1, 1)

    def forward(self, x):
        x = torch.relu(self.layer1(x))
        return x
model = SimpleModel()
criterion = nn.BCELoss()
optimizer = optim.SGD(model.parameters(), lr=0.01)
max_y = torch.max(y_train)
if max_y > 0:
    y_train = y_train / max_y
else:
    raise ValueError("Maximum value of y_train is zero, cannot normalize")
epochs = 10
for epoch in range(epochs):
    model.train()
    optimizer.zero_grad()
    outputs = model(x_train)
    loss = criterion(outputs, y_train)
    loss.backward()
    optimizer.step()
    print(f"Epoch [{epoch+1}/{epochs}], Loss: {loss.item()}")
model.eval()
with torch.no_grad():
    outputs = model(x_train)
    loss = criterion(outputs, y_train)
    print(f"Final Loss: {loss.item()}")
Error output:
{
"name": "RuntimeError",
"message": "CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
",
"stack": "---------------------------------------------------------------------------
RuntimeError Traceback (most recent call last)
Cell In[3], line 15
11 model = Sequential()
13 model.add(Input(shape=(1,)))
---> 15 model.add(Dense(1, activation='relu'))
17 model.compile(optimizer='sgd', loss='binary_crossentropy')
19 model.fit(x_values, y_values, epochs=10, batch_size=1)
File ~/python3.10/site-packages/keras/src/models/sequential.py:122, in Sequential.add(self, layer, rebuild)
120 self._layers.append(layer)
121 if rebuild:
--> 122 self._maybe_rebuild()
123 else:
124 self.built = False
File ~/python3.10/site-packages/keras/src/models/sequential.py:141, in Sequential._maybe_rebuild(self)
139 if isinstance(self._layers[0], InputLayer) and len(self._layers) > 1:
140 input_shape = self._layers[0].batch_shape
--> 141 self.build(input_shape)
142 elif hasattr(self._layers[0], \"input_shape\") and len(self._layers) > 1:
143 # We can build the Sequential model if the first layer has the
144 # `input_shape` property. This is most commonly found in Functional
145 # model.
146 input_shape = self._layers[0].input_shape
File ~/python3.10/site-packages/keras/src/layers/layer.py:228, in Layer.__new__.<locals>.build_wrapper(*args, **kwargs)
226 with obj._open_name_scope():
227 obj._path = current_path()
--> 228 original_build_method(*args, **kwargs)
229 # Record build config.
230 signature = inspect.signature(original_build_method)
File ~/python3.10/site-packages/keras/src/models/sequential.py:187, in Sequential.build(self, input_shape)
185 for layer in self._layers[1:]:
186 try:
--> 187 x = layer(x)
188 except NotImplementedError:
189 # Can happen if shape inference is not implemented.
190 # TODO: consider reverting inbound nodes on layers processed.
191 return
File ~/python3.10/site-packages/keras/src/utils/traceback_utils.py:122, in filter_traceback.<locals>.error_handler(*args, **kwargs)
119 filtered_tb = _process_traceback_frames(e.__traceback__)
120 # To get the full stack trace, call:
121 # `keras.config.disable_traceback_filtering()`
--> 122 raise e.with_traceback(filtered_tb) from None
123 finally:
124 del filtered_tb
File ~/python3.10/site-packages/torch/_dynamo/eval_frame.py:451, in _TorchDynamoContext.__call__.<locals>._fn(*args, **kwargs)
449 prior = set_eval_frame(callback)
450 try:
--> 451 return fn(*args, **kwargs)
452 finally:
453 set_eval_frame(prior)
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
"
}
Comment From: abheesht17
The reason for the error is that Torch expects the targets to lie in the range [0, 1] (which makes sense, because we are using binary cross-entropy, so targets should lie between 0 and 1). What is your use case exactly, i.e., why are you looking to set targets > 1?
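For example, scaling the targets into [0, 1] before calling fit (the same idea as the max_y normalization in Code 2) would make them valid for binary cross-entropy; a minimal sketch, not a claim that this is the right loss for the use case:
import numpy as np
y_values = np.array([0.50 + i * 0.50 for i in [1, 2, 3, 4]], dtype=np.float32)
# Scale targets into [0, 1] so they satisfy the binary cross-entropy constraint.
y_scaled = y_values / y_values.max()
print(y_scaled)  # [0.4 0.6 0.8 1. ]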
As for the Python restarting issue, I am unable to replicate it on Colab. Here is the notebook: https://colab.research.google.com/gist/abheesht17/d360499d826b87ba12449b362aee398d/keras-issue-20920.ipynb.
Comment From: jobs-git
It should be tried on CUDA, and then another CUDA calculation should be run right after. I have added reproduction steps.
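A minimal sketch of such a follow-up CUDA calculation (any small torch op on the GPU, run in a new cell after Code 1 has failed; per the report above, with the Keras torch backend this keeps raising the device-side assert until Python is restarted):
import torch
# Any small computation on the same CUDA device, executed after the failure.
a = torch.arange(4, dtype=torch.float32, device="cuda")
print((a * 2).sum().item())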
Comment From: jobs-git
@abheesht17 any news?