DeepSpeed-MoE: Advancing Mixture-of-Experts Inference and Training to Power Next-Generation AI Scale

Rajbhandari, Samyam; Li, Conglong; Yao, Zhewei; Zhang, Minjia; Aminabadi, Reza Yazdani; Awan, Ammar Ahmad; Rasley, Jeff; He, Yuxiong

Computer Science > Machine Learning

arXiv:2201.05596 (cs)

[Submitted on 14 Jan 2022 (v1), last revised 21 Jul 2022 (this version, v2)]

Title:DeepSpeed-MoE: Advancing Mixture-of-Experts Inference and Training to Power Next-Generation AI Scale

Authors:Samyam Rajbhandari, Conglong Li, Zhewei Yao, Minjia Zhang, Reza Yazdani Aminabadi, Ammar Ahmad Awan, Jeff Rasley, Yuxiong He

View PDF

Abstract:As the training of giant dense models hits the boundary on the availability and capability of the hardware resources today, Mixture-of-Experts (MoE) models become one of the most promising model architectures due to their significant training cost reduction compared to a quality-equivalent dense model. Its training cost saving is demonstrated from encoder-decoder models (prior works) to a 5x saving for auto-aggressive language models (this work along with parallel explorations). However, due to the much larger model size and unique architecture, how to provide fast MoE model inference remains challenging and unsolved, limiting its practical usage. To tackle this, we present DeepSpeed-MoE, an end-to-end MoE training and inference solution as part of the DeepSpeed library, including novel MoE architecture designs and model compression techniques that reduce MoE model size by up to 3.7x, and a highly optimized inference system that provides 7.3x better latency and cost compared to existing MoE inference solutions. DeepSpeed-MoE offers an unprecedented scale and efficiency to serve massive MoE models with up to 4.5x faster and 9x cheaper inference compared to quality-equivalent dense models. We hope our innovations and systems help open a promising path to new directions in the large model landscape, a shift from dense to sparse MoE models, where training and deploying higher-quality models with fewer resources becomes more widely possible.

Comments:	This paper is published at ICML 2022: this https URL
Subjects:	Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Distributed, Parallel, and Cluster Computing (cs.DC)
Cite as:	arXiv:2201.05596 [cs.LG]
	(or arXiv:2201.05596v2 [cs.LG] for this version)
	https://doi.org/10.48550/arXiv.2201.05596

Submission history

From: Conglong Li [view email]
[v1] Fri, 14 Jan 2022 18:36:04 UTC (3,043 KB)
[v2] Thu, 21 Jul 2022 18:21:36 UTC (3,138 KB)

Computer Science > Machine Learning

Title:DeepSpeed-MoE: Advancing Mixture-of-Experts Inference and Training to Power Next-Generation AI Scale

Submission history

Access Paper:

References & Citations

1 blog link

DBLP - CS Bibliography

Bookmark

Bibliographic and Citation Tools

Code, Data and Media Associated with this Article

Demos

Recommenders and Search Tools

arXivLabs: experimental projects with community collaborators

Computer Science > Machine Learning

Title:DeepSpeed-MoE: Advancing Mixture-of-Experts Inference and Training to Power Next-Generation AI Scale

Submission history

Access Paper:

References & Citations

1 blog link

DBLP - CS Bibliography

BibTeX formatted citation

Bookmark

Bibliographic and Citation Tools

Code, Data and Media Associated with this Article

Demos

Recommenders and Search Tools

arXivLabs: experimental projects with community collaborators