The Muon optimizer is an optimizer proposed by OpenAI that is reported to be stronger than AdamW, and it has been verified on the Moonlight model. Has the Keras team implemented it yet? If not, I can submit a relevant PR. If I do, what should I pay attention to?
Comment From: pass-lin
Hello, may I ask if I can submit a PR for the Muon optimizer to Keras? @mehtamansi29
Comment From: hertschuh
@pass-lin
Would you be able to provide an implementation and example use as a code example on keras.io?
Thanks!
Comment From: pass-lin
> Would you be able to provide an implementation and example use as a code example on keras.io?
> Thanks!
I'm happy to do that. But I want to ask, why isn't it a new feature of keras.optimizers but an example?
Comment From: hertschuh
@pass-lin
Actually, yes, please put it in keras.optimizers.
Thanks for the contribution!
Comment From: pass-lin
> Actually, yes, please put it in keras.optimizers. Thanks for the contribution!
> This optimizer should not be used for the embedding layer, the final fully connected layer, or any {0,1}-D parameters; those should all be optimized by a standard method (e.g., AdamW).
This is a warning from the author of Muon. It seems that if you want to use Muon, you need to use multiple optimizers at the same time. How can we implement this in Keras?
Comment From: hertschuh
There are two ways to do this:
1. A custom train_step, like https://keras.io/examples/keras_recipes/trainer_pattern/. At the point where you apply the gradients, you use different optimizers for different layers (see the sketch after this list).
2. Via a multi-optimizer, i.e. an optimizer that dispatches to different optimizers based on some criteria. This doesn't exist in Keras 3 today, but I can work on that part.
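Here is a minimal sketch of option 1, assuming the TensorFlow backend of Keras 3. Since Muon is not in keras.optimizers yet, plain SGD with momentum stands in for it; the toy model, layer names, and routing heuristic below are illustrative assumptions only, not the Muon author's implementation.

```python
import tensorflow as tf  # assumes the TensorFlow backend of Keras 3
import keras


class TwoOptimizerModel(keras.Model):
    """Toy model whose train_step routes gradients to two optimizers."""

    def __init__(self, vocab_size=1000, num_classes=10):
        super().__init__()
        self.embedding = keras.layers.Embedding(vocab_size, 64)
        self.hidden = keras.layers.Dense(128, activation="relu")
        self.head = keras.layers.Dense(num_classes)
        # AdamW for the embedding table, the output head, and 0/1-D params.
        self.adamw = keras.optimizers.AdamW(learning_rate=1e-3)
        # Stand-in for Muon (not yet in keras.optimizers): SGD with momentum.
        self.muon_like = keras.optimizers.SGD(learning_rate=1e-2, momentum=0.95)

    def call(self, x):
        x = self.embedding(x)
        x = keras.ops.mean(x, axis=1)
        x = self.hidden(x)
        return self.head(x)

    def train_step(self, data):
        x, y = data
        with tf.GradientTape() as tape:
            y_pred = self(x, training=True)
            loss = self.compute_loss(x=x, y=y, y_pred=y_pred)
        variables = self.trainable_variables
        grads = tape.gradient(loss, variables)

        # Per the Muon author's warning: the embedding layer, the final
        # fully connected layer, and all 0/1-D parameters go to AdamW;
        # the remaining 2-D hidden weights go to the Muon-like optimizer.
        adamw_ids = {id(v) for v in self.embedding.trainable_weights}
        adamw_ids |= {id(v) for v in self.head.trainable_weights}
        adamw_ids |= {id(v) for v in variables if len(v.shape) <= 1}

        adamw_pairs, muon_pairs = [], []
        for g, v in zip(grads, variables):
            (adamw_pairs if id(v) in adamw_ids else muon_pairs).append((g, v))
        self.adamw.apply_gradients(adamw_pairs)
        self.muon_like.apply_gradients(muon_pairs)

        # Standard metric bookkeeping, as in the trainer-pattern example.
        for metric in self.metrics:
            if metric.name == "loss":
                metric.update_state(loss)
            else:
                metric.update_state(y, y_pred)
        return {m.name: m.result() for m in self.metrics}


# Hypothetical usage; x_train holds integer token ids, y_train class labels.
# compile() still creates its default optimizer, but train_step ignores it.
model = TwoOptimizerModel()
model.compile(
    loss=keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=["accuracy"],
)
# model.fit(x_train, y_train, batch_size=32, epochs=1)
```

Once a real Muon optimizer lands in keras.optimizers, you would only swap it in for the SGD stand-in; the gradient routing itself stays the same.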