Hello,
I've noticed some (potentially harmful) inconsistencies in bias initializers when running a simple test of the keras package, i.e. using a shallow MLP to learn a sine wave function in the [-1, 1] interval.
Context
Most of the time (or for deep enough networks), using the default zero-initialization for biases is fine. However, for this simple problem having randomized biases is essential, since without them the neurons end up being too similar (redundant) and training converges to a very poor local optimum.
The official guide suggests using weight initializers for biases as well.
Now:
- The default initialization from native PyTorch (sketched below) leads to good results that improve, as expected, as the network size grows.
- Several keras initializers are expected to be similar or identical to the PyTorch behavior (i.e. `VarianceScaling` and all its subclasses), but they fail to produce good results, regardless of the number of neurons in the hidden layer.
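For reference, the default initialization mentioned in the first bullet roughly amounts to the following (a paraphrase of what `torch.nn.Linear.reset_parameters` does at construction time, not verbatim PyTorch code):

```python
import math
import torch.nn as nn

layer = nn.Linear(in_features=1, out_features=64)

# Roughly what nn.Linear.reset_parameters() does by default
# (it is already applied when the layer is constructed):
fan_in = layer.weight.shape[1]                          # number of inputs to the layer
bound = 1.0 / math.sqrt(fan_in)
nn.init.kaiming_uniform_(layer.weight, a=math.sqrt(5))  # reduces to U(-bound, bound)
nn.init.uniform_(layer.bias, -bound, bound)             # bias bound also uses the fan-in
```

The key point is that the bias bound shrinks with the number of inputs to the layer, not with the number of units in the layer itself.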
Issues
The issues are due to the fact that all `RandomInitializer` subclasses only have access, in their `__call__` function, to the shape they need to fill.
In the case of bias vectors for `Dense` layers, this shape is a one-element tuple, i.e. `(n,)`, where `n` is the number of units in the current layer.
The `compute_fans` function in this case reports a fan-in of `n`, which is actually the number of units, i.e. the fan-out.
Unfortunately, the correct fan-in is not accessible, since the number of layer inputs is not included in the shape of the bias vector.
This makes the official description of the `VarianceScaling` initializer incorrect when applied to neuron biases. The same holds for the descriptions of the Glorot, He, and LeCun initializers, which are implemented as `VarianceScaling` subclasses.
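A minimal sketch of the relevant logic (not the actual Keras code, just a reconstruction of how the fans end up being computed for a 1-D shape) shows why the resulting bounds shrink as the layer gets wider:

```python
import math

def fans_for_bias(shape):
    # For a 1-D shape (n,), the fan computation can only return n for both
    # fans, so the reported "fan-in" is really the width of the current layer.
    assert len(shape) == 1
    return shape[0], shape[0]  # (fan_in, fan_out)

# Consequence for a HeUniform-style limit, sqrt(6 / fan_in):
for n in (16, 32, 64):
    fan_in, _ = fans_for_bias((n,))
    print(n, math.sqrt(6.0 / fan_in))  # the bias range shrinks as n grows
```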
In my simple example, as soon as the shallow network has more than a handful of neurons, all size-dependent initializers have so little variability that they behave very similarly to a zero initialization (i.e. incredibly poorly). What stumped me (before understanding the problem) is that the larger the network, the worse the behavior.
About possible fixes
I can now easily fix the issue by computing bounds for `RandomUniform` initializers externally so as to replicate the default PyTorch behavior, but this is not an elegant solution -- and I am worried other users may have encountered similar problems without noticing.
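A sketch of this kind of workaround (the helper name and layer sizes are just illustrative, not the exact code from my script): the bound 1/sqrt(fan_in) is computed outside the initializer, from the known input size of each layer, and passed to `RandomUniform` for both the kernel and the bias.

```python
import math
import keras
from keras import layers, initializers

def torch_like_dense(units, fan_in, **kwargs):
    # Compute the PyTorch-style bound externally, since the bias initializer
    # cannot recover the fan-in from the bias shape alone.
    bound = 1.0 / math.sqrt(fan_in)
    init = initializers.RandomUniform(minval=-bound, maxval=bound)
    return layers.Dense(units, kernel_initializer=init, bias_initializer=init, **kwargs)

model = keras.Sequential([
    keras.Input(shape=(1,)),
    torch_like_dense(64, fan_in=1, activation="tanh"),
    torch_like_dense(1, fan_in=64),
])
```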
If the goal is correctly computing the fan-in, I am afraid that I see no easy fix, short of restructuring the `RandomInitializer` API and giving it access to more information.
However, the real goal here is not actually computing the fan-in, but preserving the properties that the size-dependent initializers were attempting to enforce. I would need to read more literature on the topic before suggesting a theoretically sound fix from this perspective. I would be willing to do that, in case the keras team is fine with going in this direction.
Comment From: sonali-kumari1
Hi @lompabo -
Thanks for providing detailed information. Could you please share standalone code with the model structure and the bias initializers you have been using, along with the sample output or error, to help reproduce this issue?
Comment From: github-actions[bot]
This issue is stale because it has been open for 14 days with no activity. It will be closed if no further activity occurs. Thank you.
Comment From: lompabo
Sure! And my apologies for the delay, I had missed the notification.
I prepared a simple script to run all the tests, available in this repository.
Running the script starts tests with keras+tensorflow, keras+torch, and native torch. The keras tests run with both the default initializers and a custom one, which mimics the one used by PyTorch. This custom initializer is not, however, fully equivalent to the PyTorch one, since bias initializers in keras cannot easily access the number of inputs to the current layer.
There are also pre-built plots with the learned functions for all approaches, and the loss curves for the keras tests.
Comment From: lompabo
Ok, I've decided to run slightly more extensive experiments.
The script in the repository now runs multiple tests, with shallow networks having different hidden layer sizes.
I've also:
- Included some weight initializers from keras that should be identical or very close to the default initialization in PyTorch (but are not, as stated in the original comment)
- Adjusted the custom initialization (with an application-specific hack, sketched below) so that it is now identical to the one from PyTorch in terms of both weight and bias initialization.
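A hypothetical version of such a hack (the class name and interface here are illustrative, not the actual code in the repository): the fan-in is passed explicitly to the initializer, since the bias shape `(n,)` alone does not reveal it.

```python
import math
import keras

class TorchLikeUniform(keras.initializers.Initializer):
    """Uniform initializer with a PyTorch-style bound, given an explicit fan-in."""

    def __init__(self, fan_in, seed=None):
        self.fan_in = fan_in
        self.seed = seed

    def __call__(self, shape, dtype=None):
        # Same bound PyTorch uses for its default Linear initialization.
        bound = 1.0 / math.sqrt(self.fan_in)
        return keras.random.uniform(
            shape, minval=-bound, maxval=bound, dtype=dtype, seed=self.seed
        )

    def get_config(self):
        return {"fan_in": self.fan_in, "seed": self.seed}
```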
There's a `result.csv` file that reports averages and standard deviations of the MSE for all tests. Here are the results for the tests using the TensorFlow backend and the default, HeUniform, LecunUniform, and custom initializations for the weights and biases:
mode | hidden units | MSE (mean) | MSE (std) |
---|---|---|---|
keras-tf-default | 16 | 0.24689329 | 0.057103742 |
keras-tf-custom | 16 | 0.07060415 | 0.06979664 |
keras-tf-he | 16 | 0.11653159 | 0.09471156 |
keras-tf-lecun | 16 | 0.13150033 | 0.08441584 |
keras-tf-default | 32 | 0.22040823 | 0.079308465 |
keras-tf-custom | 32 | 0.03252078 | 0.05684149 |
keras-tf-he | 32 | 0.056932174 | 0.059933245 |
keras-tf-lecun | 32 | 0.15256521 | 0.10467364 |
keras-tf-default | 64 | 0.21521144 | 0.080657125 |
keras-tf-custom | 64 | 0.0010941338 | 0.00067475386 |
keras-tf-he | 64 | 0.04085328 | 0.054360036 |
keras-tf-lecun | 64 | 0.048490494 | 0.06302383 |
As the number of hidden units grows:
- All MSEs improve
- The custom initialization (i.e. the one that deals correctly with the bias vector) is consistently the best performer
- The lead of the custom initialization becomes larger, reaching one order of magnitude or more at the largest numbers of hidden units
This is consistent with an incorrect computation of the fan-in, as stated in the original comment.
BTW: I removed the experiments with native torch, since they were redundant at this point.