A substantial technical piece from Microsoft. I very much like the idea of a 'mixture of experts' to broaden contextual results; I could often have used something like it back in the day.
DeepSpeed: Advancing MoE inference and training to power next-generation AI scale
Published January 19, 2022 | By the DeepSpeed Team and Andrey Proskurin, Corporate Vice President of Engineering
DeepSpeed-MoE for NLG: Reducing the training cost of language models by five times
PR-MoE and Mixture-of-Students: Reducing the model size and improving parameter efficiency
DeepSpeed-MoE inference: Serving MoE models at unprecedented scale and speed
Looking forward to the next generation of AI Scale
In the last three years, the largest trained dense models have increased in size by over 1,000 times, from a few hundred million parameters to over 500 billion parameters in Megatron-Turing NLG 530B (MT-NLG). Improvements in model quality with size suggest that this trend will continue, with larger model sizes bringing better model quality. However, sustaining the growth in model size is getting more difficult due to the increasing compute requirements.
There have been numerous efforts to reduce compute requirements to train large models without sacrificing model quality. To this end, architectures based on Mixture of Experts (MoE) have paved a promising path, enabling sub-linear compute requirements with respect to model parameters and allowing for improved model quality without increasing training cost. ...
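To see why MoE compute can be sub-linear in parameter count, consider a minimal sketch of a top-k-gated expert layer in PyTorch. This is not DeepSpeed's implementation; the class name SimpleMoELayer, the layer sizes, and the explicit routing loop are illustrative assumptions. The point is only that each token activates k of E expert FFNs, so per-token compute scales with k while total parameters scale with E.

```python
# Illustrative top-k Mixture-of-Experts layer (sketch only, not DeepSpeed's code).
import torch
import torch.nn as nn
import torch.nn.functional as F


class SimpleMoELayer(nn.Module):
    def __init__(self, d_model: int, d_ff: int, num_experts: int, k: int = 1):
        super().__init__()
        self.k = k
        # Router that scores each token against every expert.
        self.gate = nn.Linear(d_model, num_experts)
        # E independent feed-forward experts; parameters grow with E.
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        ])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (tokens, d_model)
        scores = F.softmax(self.gate(x), dim=-1)              # (tokens, E)
        topk_scores, topk_idx = scores.topk(self.k, dim=-1)   # each token picks k experts
        out = torch.zeros_like(x)
        for slot in range(self.k):
            idx = topk_idx[:, slot]
            weight = topk_scores[:, slot].unsqueeze(-1)
            for e, expert in enumerate(self.experts):
                mask = idx == e
                if mask.any():
                    # Only the tokens routed to expert e pay for its compute.
                    out[mask] += weight[mask] * expert(x[mask])
        return out


# With 8 experts the layer holds roughly 8x the FFN parameters of a dense layer,
# yet each token runs through only k=1 expert's worth of FLOPs.
tokens = torch.randn(16, 512)
layer = SimpleMoELayer(d_model=512, d_ff=2048, num_experts=8, k=1)
print(layer(tokens).shape)  # torch.Size([16, 512])
```

This toy version routes tokens with a Python loop for clarity; production systems such as DeepSpeed-MoE instead batch tokens per expert and shard experts across devices, which is where the training- and inference-efficiency work described in the article comes in.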