language:
  - en
datasets:
  - c4
tags:
  - deep-narrow
license: apache-2.0

T5-Efficient-XL is a checkpoint of the T5 model architecture.

The checkpoint was released with the paper Scale Efficiently: Insights from Pre-training and Fine-tuning Transformers by Yi Tay, Mostafa Dehghani, Jinfeng Rao, William Fedus, Samira Abnar, Hyung Won Chung, Sharan Narang, Dani Yogatama, Ashish Vaswani, Donald Metzler.

In a nutshell, the paper indicates that a DeepNarrow model architecture is favorable for downstream performance compared to other model architectures of similar parameter count.

To quote the paper:

We generally recommend a DeepNarrow strategy where the model’s depth is preferentially increased before considering any other forms of uniform scaling across other dimensions. This is largely due to how much depth influences the Pareto-frontier as shown in earlier sections of the paper. Specifically, a tall small (deep and narrow) model is generally more efficient compared to the base model. Likewise, a tall base model might also generally more efficient compared to a large model. We generally find that, regardless of size, even if absolute performance might increase as we continue to stack layers, the relative gain of Pareto-efficiency diminishes as we increase the layers, converging at 32 to 36 layers. Finally, we note that our notion of efficiency here relates to any one compute dimension, i.e., params, FLOPs or throughput (speed). We report all three key efficiency metrics (number of params, FLOPS and speed) and leave this decision to the practitioner to decide which compute dimension to consider.

To be more precise, model depth is defined as the number of transformer blocks that are stacked sequentially. A sequence of word embeddings is therefore processed sequentially by each transformer block.
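The definition of depth above can be illustrated with a minimal sketch. The toy blocks below are hypothetical stand-ins (they only add a constant to every value) and are not real transformer blocks; the point is only that depth equals the number of blocks and that each block consumes the output of the previous one.

```python
from typing import Callable, List

# A sequence of embedding vectors, one vector per token.
Hidden = List[List[float]]
Block = Callable[[Hidden], Hidden]


def make_toy_block(offset: float) -> Block:
    """Return a hypothetical stand-in for a transformer block.

    A real block applies attention and a feed-forward layer; this one
    only adds a constant so the sequential data flow is easy to follow.
    """
    def block(hidden: Hidden) -> Hidden:
        return [[value + offset for value in vector] for vector in hidden]
    return block


def run_stack(blocks: List[Block], embeddings: Hidden) -> Hidden:
    """Feed the embeddings through every block, one after the other."""
    hidden = embeddings
    for block in blocks:
        hidden = block(hidden)
    return hidden


blocks = [make_toy_block(1.0) for _ in range(4)]
depth = len(blocks)  # model depth = number of sequentially stacked blocks
embeddings = [[0.0, 0.5], [1.0, 1.5]]
output = run_stack(blocks, embeddings)
print(depth, output)
```

Because the blocks run one after another, adding depth lengthens the sequential path through the model, whereas widening a block does not.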

Model architecture details

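The conventional T5 architectures are summarized in the following table, where nl is the number of transformer blocks (encoder/decoder), ff is the dimension of the feed-forward projection, dm is the model (embedding) dimension, kv is the per-head key/value projection dimension, and nh is the number of attention heads.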

Model	nl	ff	dm	kv	nh	#Params
Tiny	4/4	1024	256	32	4	16M
Mini	4/4	1536	384	32	8	31M
Small	6/6	2048	512	32	8	60M
Base	12/12	3072	768	64	12	220M
Large	24/24	4096	1024	64	16	738M
XL	24/24	16384	1024	128	32	3B
XXL	24/24	65536	1024	128	128	11B


Pre-Training

The checkpoint was pretrained on the Colossal, Cleaned version of Common Crawl (C4) for 524288 steps using the span-based masked language modeling (MLM) objective.
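To illustrate what a span-based objective looks like, here is a minimal sketch of how one training pair could be formed. The span positions are chosen by hand, whereas real pre-training samples them randomly. The `<extra_id_N>` sentinel naming follows the convention of the Hugging Face T5 tokenizer, and `span_corrupt` is a hypothetical helper, not the actual pre-processing code used for this checkpoint.

```python
from typing import List, Sequence, Tuple


def span_corrupt(
    tokens: Sequence[str], spans: Sequence[Tuple[int, int]]
) -> Tuple[List[str], List[str]]:
    """Replace each (start, length) span with a sentinel token.

    Returns the corrupted input and the target that spells out the
    dropped spans, each introduced by its sentinel.
    """
    inputs: List[str] = []
    targets: List[str] = []
    cursor = 0
    for sentinel_id, (start, length) in enumerate(sorted(spans)):
        if length <= 0 or start < cursor or start + length > len(tokens):
            raise ValueError("spans must be non-empty, in range and non-overlapping")
        sentinel = f"<extra_id_{sentinel_id}>"
        inputs.extend(tokens[cursor:start])  # keep the text before the span
        inputs.append(sentinel)              # mask the span in the input
        targets.append(sentinel)             # announce the span in the target
        targets.extend(tokens[start:start + length])
        cursor = start + length
    inputs.extend(tokens[cursor:])
    targets.append(f"<extra_id_{len(spans)}>")  # closing sentinel
    return inputs, targets


tokens = "Thank you for inviting me to your party last week .".split()
inputs, targets = span_corrupt(tokens, [(2, 2), (8, 1)])
print(" ".join(inputs))
# Thank you <extra_id_0> me to your party <extra_id_1> week .
print(" ".join(targets))
# <extra_id_0> for inviting <extra_id_1> last <extra_id_2>
```

The model reads the corrupted input and is trained to generate the target, so it only has to predict the dropped spans and not the full sentence.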

Downstream Performance

TODO:

Pretraining Dataset: C4