When creating a stateful LSTM model in Keras 3 with PyTorch as the backend, in-place operations in the LSTM layer cause backpropagation to fail.
Error description: RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.FloatTensor [32, 50]] is at version 1; expected version 0 instead. Hint: enable anomaly detection to find the operation that failed to compute its gradient, with torch.autograd.set_detect_anomaly(True).
Further investigation with detect_anomaly: RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.FloatTensor [32, 50]] is at version 2; expected version 1 instead. Hint: the backtrace further above shows the operation that failed to compute its gradient. The variable in question was changed in there or anywhere later
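For anyone reproducing this: the longer backtrace above was obtained by turning on autograd anomaly detection before training. One way to do that with the torch backend is to add the following before model.fit() is called (debugging only, it slows training):

import torch

# Report, at backward() time, which forward operation produced the tensor
# that was later modified in place.
torch.autograd.set_detect_anomaly(True)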
When the same test code is run with either JAX or TensorFlow as the backend, it completes without any errors. The problem is confined to the PyTorch backend.
Test script:
import numpy as np
import os
os.environ["KERAS_BACKEND"] = "torch"
import keras
# Sample dataset generator for demonstration
def generate_time_series_data(batch_size, time_steps, num_features):
    while True:
        x = np.random.rand(batch_size, time_steps, num_features)
        y = np.sum(x, axis=2)  # Just an example: target is the sum of features along the time step
        yield x, y
# Parameters
batch_size = 32 # Number of sequences per batch
time_steps = 10 # Length of each sequence
num_features = 3 # Number of features per time step
epochs = 10 # Number of epochs
# Build the LSTM model
model = keras.Sequential()
model.add(keras.Input(shape=(time_steps, num_features), batch_size=batch_size))
lstm_layer = keras.layers.LSTM(50,
                               stateful=True,
                               return_sequences=False)  # return_sequences can be True if another LSTM is added
model.add(lstm_layer)
model.add(keras.layers.Dense(1, activation='linear')) # For scalar output
# Compile the model with optimizer and loss function
model.compile(optimizer=keras.optimizers.Adam(learning_rate=0.001), loss='mse', metrics=['mae'])
# Print model summary
model.summary()
# Generate dummy training data
train_generator = generate_time_series_data(batch_size, time_steps, num_features)
steps_per_epoch = 100 # Number of batches per epoch
# Train the model with stateful data
for epoch in range(epochs):
    print(f"Epoch {epoch + 1}/{epochs}")
    model.fit(train_generator, steps_per_epoch=steps_per_epoch, epochs=1, verbose=1, shuffle=False)
    # Reset states after each epoch
    lstm_layer.reset_states()
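For completeness, the backend comparison mentioned above only requires changing the environment variable before keras is imported; the rest of the script stays identical:

import os
os.environ["KERAS_BACKEND"] = "jax"  # or "tensorflow"; with either value the script above runs without the autograd error
import keras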
PraveenH has created a temporary fix in a Colab notebook. Please have a look at: https://colab.research.google.com/drive/1_8HKONyYbWMRLuEJfIS3PSReMEA9xfu6#scrollTo=bPjFIAarl0wL
We're not yet 100% certain it is the best solution (so we will be guided by what team-keras comes up with), but it is excellent work in my view. Thank you, Praveen, for the deep dive into this!
Can I ask for this to be logged as a bug?
Comment From: praveenhosdrug123
Just adding more context on the temporary fix. The error occurred because the autograd engine detected that a tensor required for gradient computation was modified in place, which invalidates the computational graph. Specifically:
- The LSTM states (hidden state and cell state) were being modified in place during the forward pass.
- PyTorch expects tensors to keep their original versions for accurate gradient computation.
- When the backward pass attempted to compute gradients, it found the tensor versions mismatched (version 2 vs. expected version 1).
The fix involved (see the sketch below):
- Cloning the LSTM states with .clone() before they are modified, which creates a new tensor with the same content and computational history but a separate tensor object with its own memory allocation.
- Replacing in-place operations with Keras ops functions that create new tensor instances.
This ensures the original tensors remain untouched for proper gradient tracking during backpropagation.
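To make the failure mode and the workaround concrete, here is a minimal standalone PyTorch sketch. It is not the actual code from the Colab notebook or from Keras internals; it only illustrates why the version counter trips and how a clone / out-of-place update avoids it:

import torch

w = torch.randn(50, 50, requires_grad=True)
state = torch.zeros(32, 50)  # stands in for the stateful hidden state

# Failure mode: autograd saves `h` in order to compute w's gradient for the
# second matmul, then the in-place update bumps h's version counter and
# backward() rejects the stale tensor.
h = state @ w
out = h @ w
h.add_(1.0)
try:
    out.sum().backward()
except RuntimeError as e:
    print("in-place failure:", e)

# Workaround in the spirit of the fix: use a clone / an out-of-place update so
# the tensor saved for backward keeps its original version.
h = state @ w
out = h @ w
h = h + 1.0            # out-of-place update creates a new tensor object
out.sum().backward()   # gradients now compute without a version mismatch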