Scale Efficiently: Insights from Pre-training and Fine-tuning Transformers

Tay, Yi; Dehghani, Mostafa; Rao, Jinfeng; Fedus, William; Abnar, Samira; Chung, Hyung Won; Narang, Sharan; Yogatama, Dani; Vaswani, Ashish; Metzler, Donald

arXiv:2109.10686 (cs)

[Submitted on 22 Sep 2021 (v1), last revised 30 Jan 2022 (this version, v2)]

Abstract: There remain many open questions pertaining to the scaling behaviour of Transformer architectures. These scaling decisions and findings can be critical, as training runs often come with an associated computational cost that has both financial and environmental impact. The goal of this paper is to present scaling insights from pretraining and finetuning Transformers. While Kaplan et al. present a comprehensive study of the scaling behaviour of Transformer language models, its scope is limited to the upstream (pretraining) loss. Therefore, it is still unclear whether this set of findings transfers to downstream tasks within the context of the pretrain-finetune paradigm. The key findings of this paper are as follows: (1) beyond model size alone, model shape matters for downstream fine-tuning, (2) scaling protocols operate differently at different compute regions, (3) the widely adopted T5-base and T5-large sizes are Pareto-inefficient. To this end, we present improved scaling protocols whereby our redesigned models achieve similar downstream fine-tuning quality while having 50% fewer parameters and training 40% faster compared to the widely adopted T5-base model. We publicly release over 100 pretrained checkpoints of different T5 configurations to facilitate future research and analysis.
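As an illustration of what "Pareto-inefficient" means in finding (3), the following is a minimal sketch, not the authors' method. It assumes each model configuration is summarized by a single cost (for example parameter count or training compute; lower is better) and a single downstream quality score (higher is better). The names `Config`, `dominates`, and `pareto_front`, and all numbers, are hypothetical placeholders rather than results from the paper.

```python
from typing import List, NamedTuple


class Config(NamedTuple):
    """Hypothetical summary of one model configuration."""
    name: str
    cost: float     # e.g. parameters or compute; lower is better
    quality: float  # e.g. downstream score; higher is better


def dominates(a: Config, b: Config) -> bool:
    """True if `a` is at least as good as `b` on both axes and strictly better on one."""
    no_worse = a.cost <= b.cost and a.quality >= b.quality
    strictly_better = a.cost < b.cost or a.quality > b.quality
    return no_worse and strictly_better


def pareto_front(configs: List[Config]) -> List[Config]:
    """Keep only configurations that no other configuration dominates."""
    return [c for c in configs if not any(dominates(o, c) for o in configs)]


# Made-up numbers, for illustration only.
configs = [
    Config("A", cost=1.0, quality=70.0),
    Config("B", cost=2.0, quality=75.0),
    Config("C", cost=3.0, quality=74.0),  # costs more than B yet scores lower
    Config("D", cost=4.0, quality=80.0),
]

front = pareto_front(configs)
print([c.name for c in front])  # ['A', 'B', 'D']
```

In this toy example, configuration C is Pareto-inefficient because B is both cheaper and better. The abstract makes the analogous claim about T5-base and T5-large relative to the redesigned models.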

Comments:	ICLR 2022 + Updated Checkpoint Release
Subjects:	Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)
Cite as:	arXiv:2109.10686 [cs.CL]
	(or arXiv:2109.10686v2 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2109.10686

Submission history

From: Yi Tay
[v1] Wed, 22 Sep 2021 12:29:15 UTC (2,328 KB)
[v2] Sun, 30 Jan 2022 16:42:46 UTC (2,328 KB)
