An optimizer forms the basis for training most modern neural networks.
Introduced in 2014, the Adam optimizer, along with its variants, has become the dominant, go-to optimizer for training LLMs in industry today.
However, Adam has a problem that has been largely overlooked because of its strong performance.
That issue is memory inefficiency.
To train an LLM with 7 billion parameters, Adam requires around 86 GB of memory.
For models like Google's PaLM, which has 540 billion parameters, more than 50 GPUs are needed just to hold Adam's optimizer states.
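To see where these numbers come from, here's a quick back-of-envelope sketch (my own, not from the article): Adam stores two extra values for every model parameter, the first-moment estimate m and the second-moment estimate v. Assuming fp32 storage (4 bytes per value), the arithmetic lines up with the figures above.

```python
# Back-of-envelope estimate of Adam's optimizer-state memory.
# Assumption (not from the article): both states (m and v) are kept
# in fp32, i.e. 4 bytes per value.

def adam_state_gb(num_params: float, bytes_per_value: int = 4) -> float:
    """Memory in GB for Adam's two optimizer-state tensors, m and v."""
    return 2 * num_params * bytes_per_value / 1e9

# 7B model: the states alone take ~56 GB; adding fp32 weights
# (another ~28 GB) lands near the ~86 GB figure quoted above.
print(f"7B states:   ~{adam_state_gb(7e9):.0f} GB")

# 540B model (PaLM-scale): ~4,320 GB of optimizer state, which is
# more than 50 GPUs at 80 GB each just to hold Adam's states.
print(f"540B states: ~{adam_state_gb(540e9):.0f} GB")
print(f"80 GB GPUs needed: {adam_state_gb(540e9) / 80:.0f}")
```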
But perhaps not anymore. Here's some exciting news!
A team of ML researchers has developed a better version of Adam called Adam-mini.
The Adam-mini optimizer is twice as memory-efficient as AdamW and achieves 49.6% higher throughput when training billion-parameter LLMs.