Hi,

I am posting here because I am unsure whether this is a TensorFlow or a Keras problem, and I am slowly getting more and more desperate for a solution. My memory consumption grows steadily over iterations of my neural architecture search code, in which hundreds, if not thousands, of Keras models are created and trained. I opened an issue with TensorFlow last week here, but after looking into the code I am wondering whether it is actually on the Keras end. I don't have a good knowledge of TensorFlow or Keras internals, so at the moment I can't tell which, if either, is responsible.

I believe the problem lies with one or more of the SymbolicTensors created when a Conv2D layer is instantiated. They seem to persist even after the model is no longer in use, and I have been unable to release them using garbage collection.

In the TensorFlow issue I have detailed my versions and provided a minimal code example that reproduces the problem.
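
For readers here, the general shape of the loop is roughly the following. This is only a hedged sketch of the pattern described above, with a toy model and random data; the actual repro lives in the linked TensorFlow issue.

```python
import gc
import numpy as np
import keras
from keras import layers

# Hypothetical sketch of the search loop: many small functional models
# containing Conv2D layers are built, trained briefly, and then discarded.
for trial in range(100):
    inputs = keras.Input(shape=(32, 32, 3))  # creates a symbolic Keras tensor
    h = layers.Conv2D(16, 3, activation="relu")(inputs)
    h = layers.GlobalAveragePooling2D()(h)
    outputs = layers.Dense(10, activation="softmax")(h)
    model = keras.Model(inputs, outputs)
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")

    x_train = np.random.rand(8, 32, 32, 3).astype("float32")
    y_train = np.random.randint(0, 10, size=(8,))
    model.fit(x_train, y_train, epochs=1, verbose=0)

    # Attempted cleanup: even with all of this, process memory keeps growing.
    del model, inputs, outputs, h
    keras.backend.clear_session()
    gc.collect()
```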

Any insight would be very much appreciated.

Comment From: SuryanarayanaY

Hi @alxhoff ,

I have tested the code with Keras 3 on both the TensorFlow and Torch backends and observed memory leakage with both, though at a slower pace than the tf.keras behaviour reported.

Attached gist for reference.

Escalating the issue to the dev team.

Comment From: alxhoff

Thank you @SuryanarayanaY!

Comment From: haifeng-jin

Sorry, I find I do not have enough time to get to this issue. Unassigning myself and putting it back into another round of issue triage.

Comment From: jeffcarp

Thanks for the detailed repro. I think this is related to the creation of tf.functions in the training loop. When I reproduce it on my local machine, I get this message in the logs:

WARNING:tensorflow:5 out of the last 5 calls to <function TensorFlowTrainer.make_train_function.<locals>.one_step_on_iterator at 0x7f50d42bce00> triggered tf.function retracing. Tracing is expensive and the excessive number of tracings could be due to (1) creating @tf.function repeatedly in a loop, (2) passing tensors with different shapes, (3) passing Python objects instead of tensors. For (1), please define your @tf.function outside of the loop. For (2), @tf.function has reduce_retracing=True option that can avoid unnecessary retracing. For (3), please refer to https://www.tensorflow.org/guide/function#controlling_retracing and https://www.tensorflow.org/api_docs/python/tf/function for more details.
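
For reference, remedy (1) from that warning boils down to the pattern below. This is a generic sketch; in this issue the tf.function in question is created inside Keras's make_train_function rather than in user code, so it isn't something the repro can move directly, but it shows what the warning is pointing at.

```python
import tensorflow as tf

# Anti-pattern: a new tf.function object is created on every iteration,
# so every call pays the cost of tracing a fresh graph.
for step in range(5):
    @tf.function
    def train_step(x):
        return x * 2.0
    train_step(tf.constant(1.0))

# Preferred: define the tf.function once, outside the loop, so it is
# traced once and reused on every call.
@tf.function
def train_step_once(x):
    return x * 2.0

for step in range(5):
    train_step_once(tf.constant(1.0))
```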

When I switch to eager execution (which skips tf.function), memory usage grows much more slowly.

@alxhoff can you try re-running with model.compile(..., run_eagerly=True) and see if that helps?
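
For clarity, a minimal sketch of what that call looks like (toy model and random data, purely to show where the flag goes): with run_eagerly=True, Keras skips wrapping the train step in tf.function, so every batch runs eagerly.

```python
import numpy as np
import keras
from keras import layers

# Toy model, only to illustrate compiling with run_eagerly=True.
inputs = keras.Input(shape=(8,))
outputs = layers.Dense(1)(inputs)
model = keras.Model(inputs, outputs)
model.compile(optimizer="adam", loss="mse", run_eagerly=True)

x = np.random.rand(32, 8).astype("float32")
y = np.random.rand(32, 1).astype("float32")
model.fit(x, y, epochs=1, verbose=0)
```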

Comment From: dhantule

Hi @alxhoff, thanks for reporting this. Are you still able to reproduce this?

@alxhoff can you try re-running with model.compile(..., run_eagerly=True) and see if that helps?

Did this suggestion work for you?

Comment From: github-actions[bot]

This issue is stale because it has been open for 14 days with no activity. It will be closed if no further activity occurs. Thank you.