On the Transformer Growth for Progressive BERT Training



As excessive pre-training cost creates a need for better efficiency, considerable effort has gone into training BERT progressively: starting from an inferior but low-cost model and gradually increasing its computational complexity. Our objective is to advance the understanding of such Transformer growth and to discover principles that guide progressive training. First, we find that, as in network architecture selection, Transformer growth also favors compound scaling. Specifically, while existing methods grow the network in a single dimension only, we observe that it is beneficial to use compound growth operators that balance multiple dimensions (e.g., depth, width, and input length of the model). Moreover, we explore alternative growth operators in each dimension via controlled comparison to give practical guidance for operator selection. In light of our analyses, the proposed method CompoundGrow speeds up BERT pre-training by 73.6% and 82.2% for the base and large models respectively while achieving comparable performance.¹



1 Introduction

Thanks to increasing computational power, pre-trained language models have been breaking the glass ceiling for natural language processing tasks (Peters et al., 2018; Devlin et al., 2019; Liu et al., 2019; Brown et al., 2020). However, with great power come great challenges: the excessive computation required by large-scale model training significantly impedes the efficient iteration of both research exploration and industrial applications. To lower the training cost, many attempts have been made to conduct progressive training, which starts by training an inferior but low-cost model and gradually increases its resource consumption (Gong et al., 2019; Devlin et al., 2019). Typically, two components are needed to design such progressive training algorithms: the growth scheduler and the growth operator (Dong et al., 2020). The former controls when to conduct network growth, and the latter controls how to perform it. Here, our objectives are to better understand growth operators for Transformer models and, specifically, to help design better progressive algorithms for BERT (Devlin et al., 2019) pre-training.

First, we discuss the importance of using compound growth operators, which balance different model dimensions (e.g., number of layers, hidden dimension, and input sequence length). Previous efforts on Transformer growth mainly focus on a single model dimension: either the length (Devlin et al., 2019) or the depth (Gong et al., 2019). In this work, however, we find that the compound effect plays a vital role in growing a model to different capacities, just as it does in deciding network architectures under a given budget (Tan and Le, 2019). We show that growing a Transformer along both dimensions leads to better performance at lower training cost, which verifies our intuition and shows the potential of compound growth operators in progressive BERT training.

We further explore the potential choices of growth operators on each dimension. We conduct controlled experiments and comprehensive analyses to compare the available solutions, and these analyses in turn guide the design of effective compound growth operators. Specifically, we observe that, on the length dimension, embedding pooling is more effective than directly truncating sentences. On the width dimension, parameter sharing outperforms low-rank approximation.

Guided by our analyses, we propose CompoundGrow, which combines the most effective growth operator on each dimension. Our experiments show that CompoundGrow trains standard BERT models at substantially lower cost without sacrificing final performance: it speeds up the overall pre-training by 73.6% and 82.2% in walltime on the BERT-base and BERT-large models respectively.

Our main contributions are two-fold:

  • We conduct comprehensive analyses on Transformer growth. Specifically, we first recognize and verify the importance of balancing different model dimensions during the growth, and then explore and evaluate potential growth operators to provide practical guidance.

  • Guided by our analyses, we propose CompoundGrow, which progressively grows a Transformer from its length, width and depth dimensions. Controlled experiments demonstrate its effectiveness in reducing the training cost of both BERT-base and BERT-large models without sacrificing performance.

Figure 1: Costs of different parts of a Transformer model.

2 Related Works

Progressive training was originally proposed to improve training stability: it starts from an efficient, small model and gradually increases the model capacity (Simonyan and Zisserman, 2014). Recent studies leverage this paradigm to accelerate model training. For example, the multi-level residual network (Chang et al., 2018) explores augmenting network depth from a dynamical systems view and transforms each layer into two subsequent layers. AutoGrow (Wen et al., 2019) attempts to automate the discovery of a proper depth that achieves near-optimal performance on different datasets. LipGrow (Dong et al., 2020) proposes a learning algorithm with an automatic growing scheduler for convolutional networks. At the same time, many studies have examined model growing operators. Network Morphism (Wei et al., 2016, 2017) grows a layer into multiple layers while keeping the represented function intact. Net2Net (Chen et al., 2015) transfers knowledge to a wider network with function-preserving initialization. Similar ideas also appear in many network architectures, including progressive growing of GANs (Karras et al., 2017) and Adaptive Computation Time (Graves, 2016; Jernite et al., 2016).

As large-scale pre-training keeps advancing the state of the art (Devlin et al., 2019; Radford, 2018), its overwhelming computational consumption becomes the major obstacle to developing more powerful models (Brown et al., 2020). Progressive training has seen preliminary application to Transformer pre-training. Devlin et al. (2019) design a two-stage training with a reduced sequence length for the first 90% of updates. Gong et al. (2019) stack the trained weights of a shallow model to initialize a deeper one, which grows the BERT-base model on the depth dimension and achieves 25% shorter training time.

3 Preliminaries

Consider a Transformer model with $L$ layers (Vaswani et al., 2017). As in Figure 1, each Transformer layer consists of a self-attention layer and a feedforward layer. Here, we refer to the number of tokens in a sentence as $N$, the embedding dimension as $D$, and the hidden dimension as $H$. Also, we mark the layer inputs as $X$ with shape $N \times D$.

Feedforward Layer. Transformers use two-layer perceptrons as feedforward networks. Specifically, the feedforward layer is defined as $f_{\mathrm{FFN}}(X) = \phi(X W_1) W_2$, where $\phi$ is the non-linear function (i.e., GELU), and $W_1 \in \mathbb{R}^{D \times H}$ and $W_2 \in \mathbb{R}^{H \times D}$ are parameters. The feedforward layer requires $2 \cdot N \cdot D \cdot H$ Mult-Add operations and $2 \cdot D \cdot H$ parameters.

Self-Attention Layer. In Transformer models, attention layers are designed as multi-head attention layers, which allow the network to have multiple focuses in a single layer and are crucial for model performance (Chen et al., 2018). The layer is defined as (with $A$ heads): $f_{\mathrm{ATT}}(X) = \sum_{a=1}^{A} f_s\big(X W^Q_a (W^K_a)^\top X^\top\big) X W^V_a (W^O_a)^\top$, where $f_s$ is the row-wise softmax function and $W^Q_a$, $W^K_a$, $W^V_a$, $W^O_a$ are parameters. $W^Q_a$ and $W^K_a$ are matrices with shape $D \times \frac{D}{A}$, and so are $W^V_a$ and $W^O_a$. Parameters without subscript refer to the concatenation of all $A$-head parameters, e.g., $W^Q = [W^Q_1, \ldots, W^Q_A]$. The self-attention layer requires $4 \cdot N \cdot D^2 + 2 \cdot N^2 \cdot D$ Mult-Adds and $4 \cdot D^2$ parameters.

In summary, the computation complexity of an $L$-layer Transformer would be $O(L \cdot N \cdot D \cdot (N + D + H))$. Progressive training methods aim to start from small models with less computational cost, and gradually grow such models to the full model during the training stages.
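As a rough illustration, the per-layer counts can be tallied in a few lines of Python. This is a minimal sketch under the usual accounting assumptions (weights only; biases, layer normalization, embeddings, and the softmax itself are ignored). The function names are illustrative, and the BERT-base sizes used in the example (12 layers, embedding dimension 768, hidden dimension 3072) are the commonly published ones rather than values stated in this section.

```python
def ffn_cost(n, d, h):
    """Mult-Adds and parameters of a two-layer feedforward block (biases ignored)."""
    mult_adds = 2 * n * d * h          # X @ W1, then (.) @ W2
    params = 2 * d * h
    return mult_adds, params


def attention_cost(n, d):
    """Mult-Adds and parameters of multi-head self-attention (biases ignored)."""
    projections = 4 * n * d * d        # query, key, value and output projections
    mixing = 2 * n * n * d             # attention scores plus weighted sum of values
    return projections + mixing, 4 * d * d


def transformer_cost(num_layers, n, d, h):
    """Total (Mult-Adds, parameters) of the stacked layers."""
    ffn_ma, ffn_p = ffn_cost(n, d, h)
    att_ma, att_p = attention_cost(n, d)
    return num_layers * (ffn_ma + att_ma), num_layers * (ffn_p + att_p)


mult_adds, params = transformer_cost(num_layers=12, n=512, d=768, h=3072)
print(f"Mult-Adds per 512-token sequence: {mult_adds:,}")
print(f"Parameters in the 12 layers: {params:,}")
```

Under these assumptions the 12 layers hold roughly 85M parameters, and the attention mixing term is the only one that grows quadratically with the sequence length.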

4 Compound Transformer Growth

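for $t \in [1, T]$ do
    $f_t \leftarrow g_t(f_{t-1})$    // grow the model with operator $g_t$ at the $t$-th stage
    for $x \in \mathcal{D}$ do
        $f_t \leftarrow \mathrm{opt}(f_t, x)$    // parameter update
return final network $f_T$
Algorithm 1: Generic Progressive Training Setup ($T$ is the total number of growing stages, $f$ is the network, $g_t$ is the growing operator at stage $t$, $\mathrm{opt}$ is the optimizer, and $\mathcal{D}$ is the dataset.)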

In this section, we formulate the task of progressive model growth, introduce our compound growth hypothesis, and conduct empirical verification.

4.1 Progressive Training

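Algorithm 1 presents a generic setup for progressive training. In each training stage $t$, the corresponding growth operator $g_t$ grows the model $f$. Then, $f$ is updated by the optimizer $\mathrm{opt}$ before entering the next training stage. Correspondingly, our goal is to maximize the final model performance after all $T$ training stages, which can be formulated as:

$$\min_{g_t \in \mathcal{G}} \; \mathcal{L}(f_T) \quad \text{s.t.} \quad f_t = \mathrm{opt}\big(g_t(f_{t-1}), \mathcal{D}\big), \quad t = 1, \ldots, T \tag{1}$$

where $\mathcal{L}$ is the empirical loss function, $\mathcal{G}$ is the feasible set of growth operators, and $\mathrm{opt}(\cdot, \mathcal{D})$ refers to parameter updates of one whole epoch over the dataset $\mathcal{D}$.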
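The generic setup can be sketched as a short Python loop. This is a toy illustration only: the "model" is a list of scalars standing in for layers, and `stack`, `identity`, and `sgd_step` are hypothetical stand-ins for a growth operator and an optimizer step, not the operators studied in this paper.

```python
def progressive_train(model, growth_ops, optimizer_step, dataset):
    """Generic progressive training: at each stage, grow the model, then train it."""
    for grow in growth_ops:                    # one growth operator per stage
        model = grow(model)                    # grow the previous-stage model
        for batch in dataset:                  # one pass over the data
            model = optimizer_step(model, batch)
    return model


def identity(model):
    """No growth: the first stage trains the small model as it is."""
    return model


def stack(model):
    """Toy depth growth: duplicate the existing layers."""
    return model + list(model)


def sgd_step(model, batch, lr=0.1):
    """Toy update: move every scalar 'layer' toward the batch value."""
    return [w - lr * (w - batch) for w in model]


final = progressive_train([0.0], [identity, stack, stack], sgd_step, dataset=[1.0, 1.0])
print(len(final))
```

Here the single-layer model is doubled twice, so the returned model has four layers, each of which inherits the training progress made before the growth step.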

Figure 2: Comparison of single-dimensional operators and the compound operator with comparable cost. The Y-axis indicates fine-tuning performance, including MNLI-match validation accuracy (MNLI), SQuAD v1.1 exact match score and F1 (S1/EM, S1/F1), and SQuAD v2.0 exact match score and F1 (S2/EM, S2/F1). The X-axis stands for different numbers of training steps of the full model (12-layer BERT-base model with 512-token training data) in the last stage. Different columns represent different numbers of training steps for the small (low-cost) model. The three compared methods start from different small models: depth stands for a 3-layer model; length stands for training with 128-token training data; compound stands for a 6-layer model with 256-token training data.

4.2 Compound Growth

Note that our objective (Equation 1) is close to the objective of EfficientNet (Tan and Le, 2019), which aims to find the optimal network architecture by maximizing the model accuracy for a given resource budget:

$$\max_{d, w, r} \; \mathrm{Accuracy}\big(\mathcal{N}(d, w, r)\big) \quad \text{s.t.} \quad \mathrm{Cost}\big(\mathcal{N}(d, w, r)\big) \le \mathrm{budget}$$

where $\mathcal{N}$ is a CNN network, and $d$, $w$, $r$ are coefficients to scale its depth, width, and resolution. The success of EfficientNet demonstrates the importance of balancing different dimensions. Here, we also recognize such a balance as the key to developing effective progressive training algorithms.

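Intuitively, growing the network along more than one dimension creates much larger potential for better performance with fewer resources. Restricting the growing operator from handling all dimensions would lead to inferior performance: writing $\mathcal{L}(f_T)$ for the loss of the final model, $\mathcal{G}$ for the full feasible set of growth operators, and $\mathcal{G}' \subseteq \mathcal{G}$ for a restricted set, we have $\min_{g_t \in \mathcal{G}} \mathcal{L}(f_T) \le \min_{g_t \in \mathcal{G}'} \mathcal{L}(f_T)$. The optimal value of the objective function (in Equation 1) is thus bounded by the feasible set of the growing operator. Due to the gap between the optimal solution and the accessible strategy in real-world situations, we proceed to empirically verify the following claim: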

Claim 1. With the same tuning budget, growth operators that balance different model dimensions usually perform better than operators restricted to a single dimension.

4.3 Empirical Verification

We use two growth dimensions (i.e., length and depth) to verify our intuitions. To reduce the training cost to roughly 1/4 of the original value, we can either use data of 1/4 the original sequence length, or use a small model with only 1/4 of the Transformer layers. Alternatively, we can jointly use 1/2-length data and a Transformer with 1/2 of the layers. We split the total training steps into low-cost steps for the low-cost model and final steps for the full model after the one-time model growth. We then use the same hyperparameter setting to evaluate the fine-tuning performance of the various full models.

As Figure 2 shows, across different settings (columns) and metrics (rows), the compound operator consistently outperforms, or at least achieves comparable results to, the single-dimensional operators. The observation meets our intuition: to achieve the same speedup, the compound method can distribute the required reduction in model size across different dimensions and achieve better performance.
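The three low-cost settings compared in Figure 2 can be checked against the layer cost model of Section 3 with a short calculation. This is a back-of-envelope sketch: it counts only the Mult-Adds of the Transformer layers (no embeddings or output layer), and it assumes the commonly published BERT-base sizes (embedding dimension 768, hidden dimension 3072), which are not stated in this section.

```python
def layer_mult_adds(n, d, h):
    """Per-layer Mult-Adds: feedforward plus self-attention (biases ignored)."""
    return 2 * n * d * h + 4 * n * d * d + 2 * n * n * d


def relative_cost(num_layers, n, full_layers=12, full_n=512, d=768, h=3072):
    """Cost of a low-cost model relative to the full 12-layer, 512-token model."""
    full = full_layers * layer_mult_adds(full_n, d, h)
    return num_layers * layer_mult_adds(n, d, h) / full


settings = {
    "depth (3 layers, 512 tokens)": relative_cost(3, 512),
    "length (12 layers, 128 tokens)": relative_cost(12, 128),
    "compound (6 layers, 256 tokens)": relative_cost(6, 256),
}
for name, ratio in settings.items():
    print(f"{name}: {ratio:.3f} of the full-model cost")
```

Under these assumptions all three settings land between 0.23 and 0.25 of the full cost, so the comparison is between models of roughly equal budget. The length-reduced settings come out slightly cheaper than 1/4 because the attention mixing term shrinks quadratically with sequence length.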

5 Explore Growth Operators for Transformers

                     BERT-base                            BERT-large
                     MNLI   SQuAD v1.1    SQuAD v2.0      MNLI   SQuAD v1.1    SQuAD v2.0
Operator             Acc.   EM     F1     EM     F1       Acc.   EM     F1     EM     F1
Data Truncation      83.72  82.72  90.00  76.06  79.18    85.80  85.51  92.18  79.56  82.57
Embed Pooling        84.04  82.96  90.16  76.83  79.88    85.88  85.07  91.95  80.86  83.69
FFN Factorization    83.53  82.21  89.45  75.27  78.11    85.96  85.66  92.10  79.35  82.38
FFN Share Param.     83.92  83.02  89.91  75.83  78.56    86.28  85.60  92.02  80.92  83.85
Table 1: Empirical comparison among growing operators. For each operator, a low-cost model is first trained for 700K steps, then grown to the original BERT model and trained for another 300K steps.
Figure 3: Demonstration of matrix factorization and parameter sharing on the width dimension.

Knowing the importance of compound growth, we still face many design choices for building a low-cost Transformer model and growing it to a larger one on each dimension. Since conversions on the length dimension have received little discussion or analysis in the literature, we first empirically explore operators for this dimension. Then, we extend the choice of growing operators to a third dimension, the width of intermediate outputs.

5.1 Length Dimension

Data Truncation. This operator first limits the maximum length of input sequences by truncating the training sentences to a shorter length, and then trains the model on full-length data. Note that shorter input sequences usually come with fewer masked tokens to predict in each sentence. For instance, Devlin et al. (2019) first use sentences of at most 128 tokens (with 20 masked tokens) before training on data of 512 tokens (with 76 masked tokens). The major issue of this data truncation operator is the incomplete update of position embeddings: the model needs to learn embeddings for the extra positions from scratch at the last stage.

Embedding Pooling. Inspired by the idea of multigrid training in the vision domain (Wu et al., 2020), we train the model with “low-resolution text” through embedding pooling. Its major potential benefit over data truncation is that it leaves the training data intact and can update all position encodings.

The existence of position embeddings gives the Transformer the unique advantage of processing tokens regardless of their input order. Thus, we first reorder the input token embeddings to separate masked tokens from unmasked ones in pre-training, and only apply pooling to the unmasked tokens. In this way, all masked tokens preserve their unique representations for the masked language modeling task. Specifically, since the output length of self-attention modules is decided by the length of the query vectors, we only conduct pooling on the query vectors and keep the key/value vectors intact.
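The reordering and query-side pooling described above can be sketched with NumPy. This is a minimal illustration under several assumptions: the embeddings are taken to already include position information, the attention is single-head and has no learned projections (it is there only to show the shapes), at least one token is unmasked, and the helper names `pool_queries` and `attention` are hypothetical.

```python
import numpy as np


def pool_queries(embeddings, is_masked, k=2):
    """Keep masked tokens as they are and mean-pool unmasked tokens in windows of k.

    Returns query vectors ordered as [masked tokens, pooled unmasked tokens].
    """
    masked = embeddings[is_masked]
    unmasked = embeddings[~is_masked]
    pooled = [unmasked[i:i + k].mean(axis=0) for i in range(0, len(unmasked), k)]
    return np.concatenate([masked, np.stack(pooled)])


def softmax(scores):
    """Row-wise softmax."""
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    return weights / weights.sum(axis=-1, keepdims=True)


def attention(queries, keys_values):
    """Projection-free single-head attention; output length follows the queries."""
    scores = queries @ keys_values.T / np.sqrt(queries.shape[-1])
    return softmax(scores) @ keys_values


rng = np.random.default_rng(0)
x = rng.normal(size=(8, 4))                      # 8 tokens, embedding size 4
is_masked = np.array([False, True, False, False, False, True, False, False])

q = pool_queries(x, is_masked, k=2)              # 2 masked + 6 / 2 pooled queries
out = attention(q, x)                            # keys and values stay at full length
print(q.shape, out.shape)
```

With two masked tokens and six unmasked tokens pooled in pairs, the layer output has five rows instead of eight, while every original token still contributes as a key and value.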

As shown in the first group of Table 1, data truncation and mean pooling have similar performance on MNLI and SQuAD v1.1, while mean pooling outperforms data truncation on SQuAD v2.0.

5.2 Width Dimension

On the width dimension, we aim to reduce the intra-block cost of the feedforward network (FFN).

Matrix Factorization. A straightforward method is to decompose the original weight matrix $W$ as the product of two small matrices $W_u$ and $W_v$ with a small inner dimension $h$ in the early stage. In the late stage of training, we recover $W$ as $W_u W_v$ and unleash the full potential.

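Parameter Sharing. Since the parameters in the FFN come in pairs, we can directly share parameters instead of decomposing them. Specifically, we split the original FFN weight matrices $W_1 \in \mathbb{R}^{D \times H}$ and $W_2 \in \mathbb{R}^{H \times D}$ into $k$ pairs of small matrices $(W_{1,i}, W_{2,i})$ along the hidden dimension $H$. In this way, the feedforward network output can be calculated as $\sum_{i=1}^{k} \phi(X W_{1,i}) W_{2,i}$, where $\phi$ is the FFN non-linearity. Therefore, we first set $W_{1,i} = W_1'$ and $W_{2,i} = W_2' / k$ for all $i$ in the early stage, so that only one small pair $(W_1', W_2')$ is trained. Then, at the growth step, we horizontally duplicate (share) $W_1'$ for $k$ times as the new $W_1$, and vertically duplicate $W_2' / k$ for $k$ times as the new $W_2$. Formally, for a given input $X$,

$$\sum_{i=1}^{k} \phi(X W_1') \frac{W_2'}{k} = \phi(X W_1') W_2',$$

which preserves the output after the growth as well.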
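The output-preserving property can be verified numerically. The NumPy sketch below assumes one particular way to make the growth function-preserving, namely duplicating the small first matrix $k$ times along the hidden dimension and the small second matrix $k$ times with a $1/k$ scaling. Biases are ignored, GELU uses the common tanh approximation, and `grow_shared_ffn` is a hypothetical helper name.

```python
import numpy as np


def gelu(x):
    """Tanh approximation of GELU."""
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x ** 3)))


def ffn(x, w1, w2):
    """Two-layer feedforward network without biases."""
    return gelu(x @ w1) @ w2


def grow_shared_ffn(w1_small, w2_small, k):
    """Expand shared FFN weights to k times the width while preserving the output."""
    w1_full = np.tile(w1_small, (1, k))          # k copies side by side (hidden dim)
    w2_full = np.tile(w2_small / k, (k, 1))      # k copies stacked, each scaled by 1/k
    return w1_full, w2_full


rng = np.random.default_rng(0)
d, h, k = 8, 32, 4
x = rng.normal(size=(5, d))
w1_small = rng.normal(size=(d, h // k))
w2_small = rng.normal(size=(h // k, d))

w1_full, w2_full = grow_shared_ffn(w1_small, w2_small, k)
print(w1_full.shape, w2_full.shape)
print(np.allclose(ffn(x, w1_small, w2_small), ffn(x, w1_full, w2_full)))
```

After growth the copies are free to diverge during training, so the full-width model starts from the small model's function but is no longer constrained to it.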

Figure 3 demonstrates the difference between the two treatments. As the second group of Table 1 shows, parameter sharing has a significant advantage over matrix factorization when given comparable training budgets ($k = 4$ for parameter sharing and $h = 0.2D$ for matrix factorization).

6 Experiment

                  Speedup    Speedup      MNLI Acc.     SQuAD v1.1    SQuAD v2.0
Model             (FLOPs)    (walltime)   M     MM      EM    F1      EM    F1
BERT-base         -          -            84.4  84.4    83.3  90.2    77.4  80.4
Stack-base        +68.7%     +64.9%       84.5  84.9    83.5  90.5    77.1  80.3
Compound-base     +107.1%    +73.6%       84.7  84.7    83.8  90.3    77.0  80.0
BERT-large        -          -            86.3  86.4    86.2  92.7    81.0  84.3
Stack-large       +70.7%     +69.7%       86.9  87.3    86.3  92.6    81.7  84.7
Compound-large    +111.4%    +82.2%       87.3  86.8    85.8  92.4    82.4  85.3
Table 2: The pre-training speedup and fine-tuning performance on the dev sets of MNLI and SQuAD. M/MM stands for matched/mismatched accuracy for MNLI. EM/F1 represents exact match score and F1 score for SQuAD. The FLOPs are estimated for forward pass operations, while the walltime is the real training time profiled by the TensorFlow profiler in a distributed multi-host setting.

Model             CoLA  SST-2  MRPC       STS-B      QQP        MNLI-m/mm  QNLI  RTE   WNLI  Avg.
BERT-base         52.1  93.5   88.9/84.8  87.1/85.8  71.2/89.2  84.6/83.4  90.5  66.4  65.1  78.3
Stack-base        57.3  92.8   89.4/85.6  85.4/84.1  71.0/89.1  84.7/83.5  91.4  69.9  63.7  79.1
Compound-base     50.1  92.6   89.1/85.2  85.4/83.9  70.9/88.9  84.6/83.6  91.3  70.1  65.1  78.3
BERT-large        60.5  94.9   89.3/85.4  87.6/86.5  72.1/89.3  86.7/85.9  92.7  70.1  65.1  80.5
Stack-large       62.2  94.3   89.9/85.9  86.0/85.0  71.2/88.9  86.9/86.3  93.0  75.2  65.1  81.1
Compound-large    61.2  94.2   90.2/86.7  86.4/85.7  71.4/89.2  87.2/86.1  93.6  73.3  65.8  81.1
Table 3: The test performance on the GLUE benchmark with metrics described in the original paper (Wang et al., 2018); higher is better. Compound stands for the proposed method.

Experiment Setups. All our models are implemented based on the TensorFlow implementation² of BERT (Chen et al., 2020). We keep the original WordPieceTokenizer and the original position embeddings (instead of the relative position encoding used in Dai et al. (2020)). Following Devlin et al. (2019), we use the English Wikipedia corpus and the BookCorpus for pre-training. We evaluate the final model on the GLUE benchmark (Wang et al., 2018), which includes 9 subtasks, and on the two versions of the SQuAD (Rajpurkar et al., 2018) datasets for question answering.

We first train the original BERT model, where each batch contains 256 input sequences, each consisting of at most 512 tokens. For all other progressively trained models, we require the model to finally grow to the original BERT model at the last growing stage, so their final performances are directly comparable. We control the total training steps of all models to be 1M.

The original BERT models use the AdamW (Loshchilov and Hutter, 2017) optimizer with learning rate decay from 0.0001 to 0 and 10K steps of warmup. At the start of each progressive training stage, the learning rate is reset to 0.0001 and keeps decaying following the original schedule.

Compared Method. Previous studies have rarely focused on progressive Transformer growth for BERT training, and progressive Transformer stacking (Gong et al., 2019) is, to the best of our knowledge, the only directly comparable method. We apply their method to the official BERT model with the same training setting, learning rate schedule, and hardware as our method, and achieve better performance than the numbers reported in the original paper. To further unleash the potential of the compared method, we adjust their original training schedule to 300K steps with 1/4 of the layers, 400K steps with 1/2 of the layers, and 300K steps with the full model. The new training schedule is much faster than the reported one (speedup from the reported +25% to +64.9%) and still gives better final performance than the original paper. This is the fastest stacking model we can get without a performance drop.

Our Method. For CompoundGrow, we apply (1) mean embedding pooling with size 2 on the length dimension; (2) parameter sharing on FFN modules on the width dimension; (3) stacking on the depth dimension. Following the setting of the progressive stacking baseline, we also try to distribute the 1M training steps evenly. We start with the model treatments on all three dimensions: we train the model with 1/4 of the layers and then with 1/2 of the layers for 200K steps each, and then stack it to the full depth, with the treatments on the width and length dimensions still applied, for another 300K steps. At the last stage, we train the full model for 300K steps, just like the compared method.
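The two training schedules can be written down as plain data and sanity-checked. The sketch below is illustrative only: the tuple layout and the helper names are hypothetical, and `depth_only_cost` is a back-of-envelope estimate that counts Transformer layers alone (it ignores the length and width treatments, the embeddings, and the output layer), so it is not how the speedups in Table 2 were measured.

```python
from fractions import Fraction

# Stage = (training steps, fraction of layers, length/width treatments active).
STACKING = [
    (300_000, Fraction(1, 4), False),
    (400_000, Fraction(1, 2), False),
    (300_000, Fraction(1), False),
]
COMPOUND = [
    (200_000, Fraction(1, 4), True),
    (200_000, Fraction(1, 2), True),
    (300_000, Fraction(1), True),
    (300_000, Fraction(1), False),   # last stage trains the full model
]


def total_steps(schedule):
    """Total number of training steps across all stages."""
    return sum(steps for steps, _, _ in schedule)


def depth_only_cost(schedule):
    """Layer cost relative to training the full-depth model for every step."""
    weighted = sum(steps * depth for steps, depth, _ in schedule)
    return weighted / total_steps(schedule)


print(total_steps(STACKING), total_steps(COMPOUND))
print(float(depth_only_cost(STACKING)))
```

Both schedules add up to 1M steps. For stacking, the depth-only estimate is 0.575 of the full cost, a speedup of about +74% that is in the same range as, but not identical to, the profiled +68.7% in Table 2. The same estimate is not meaningful for CompoundGrow, whose savings also come from the length and width treatments.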

Figure 4: Comparison of the speed-performance trade-off of stacking and CompoundGrow on BERT-large.

Results. Table 2 shows the speedup of different models. We estimate the inference FLOPs for the compared models and get their real training time from the TensorFlow profiler³. On the BERT-base model, stacking and CompoundGrow speed up pre-training by 68.7% and 107.1% respectively in FLOPs, and by 64.9% and 73.6% respectively in walltime. On the BERT-large model, stacking and CompoundGrow speed up pre-training by 70.7% and 111.4% respectively in FLOPs, and by 69.7% and 82.2% respectively in walltime. Although CompoundGrow is significantly faster, on the development sets of MNLI and SQuAD the compared methods do not differ significantly in fine-tuning performance from the original BERT models.

Table 3 shows the test performance on the GLUE benchmark. Both compared methods achieve at least the same performance as the original BERT model. While CompoundGrow saves more training time, it achieves the same performance as stacking on the large model. On the base model, stacking is better in terms of average GLUE score, mainly due to its advantage on the CoLA dataset. Such an unusual gap on CoLA might be caused by its relatively small size and the corresponding random variance (Dodge et al., 2020). On the larger and more robust MNLI dataset, the compared methods achieve almost the same score.

To gain a deeper understanding of the compared methods, we study their speed-performance trade-off by adjusting the training schedule. Specifically, each time we remove 200K low-cost training steps from both models and compare their validation F1 scores on SQuAD v2.0. As Figure 4 shows, CompoundGrow has a clear performance advantage when given comparable training budgets, which further verifies our hypothesis.

7 Conclusion

In this work, we empirically verify the compound effect for Transformer growth. Different from previous works, we propose to grow a low-cost Transformer model along more than one dimension. We show that the compound growth method achieves better performance than single-dimensional growth methods with a comparable training budget. We use controlled comparisons of the available growth operators on different dimensions to provide practical guidance for operator selection. Our final model speeds up the training of the BERT-base and BERT-large models by 73.6% and 82.2% in walltime respectively while achieving comparable performance. Meanwhile, the study of compound growth leaves substantial room for future improvement, especially in the design of growth operators on different dimensions. From another perspective, it remains an open research direction to study the relationships between different operators and to explore effective schedulers that coordinate the different stages of progressive training.


  1. Code will be released for reproduction and future studies.
  2. https://github.com/tensorflow/models/blob/master/official/nlp/modeling/models/bert_pretrainer.py
  3. https://www.tensorflow.org/guide/profiler


References

  1. Brown et al. (2020). Language models are few-shot learners. arXiv preprint arXiv:2005.14165.
  2. Chang et al. (2018). Multi-level residual networks from dynamical systems view. In International Conference on Learning Representations.
  3. Chen et al. (2020). TensorFlow official model garden.
  4. Chen et al. (2018). The best of both worlds: combining recent advances in neural machine translation. In ACL.
  5. Chen et al. (2015). Net2Net: accelerating learning via knowledge transfer. arXiv preprint arXiv:1511.05641.
  6. Dai et al. (2020). Funnel-Transformer: filtering out sequential redundancy for efficient language processing. arXiv preprint arXiv:2006.03236.
  7. Devlin et al. (2019). BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 4171–4186.
  8. Dodge et al. (2020). Fine-tuning pretrained language models: weight initializations, data orders, and early stopping. arXiv preprint arXiv:2002.06305.
  9. Dong et al. (2020). Towards adaptive residual network training: a neural-ODE perspective. In Proceedings of the Thirty-seventh International Conference on Machine Learning (ICML 2020).
  10. Gong et al. (2019). Efficient training of BERT by progressively stacking. In ICML.
  11. Graves (2016). Adaptive computation time for recurrent neural networks. arXiv preprint arXiv:1603.08983.
  12. Jernite et al. (2016). Variable computation in recurrent neural networks. arXiv preprint arXiv:1611.06188.
  13. Karras et al. (2017). Progressive growing of GANs for improved quality, stability, and variation. arXiv preprint arXiv:1710.10196.
  14. Liu et al. (2019). RoBERTa: a robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.
  15. Loshchilov and Hutter (2017). Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101.
  16. Peters et al. (2018). Deep contextualized word representations. arXiv preprint arXiv:1802.05365.
  17. Radford (2018). Improving language understanding by generative pre-training.
  18. Rajpurkar et al. (2018). Know what you don’t know: unanswerable questions for SQuAD. arXiv preprint arXiv:1806.03822.
  19. Simonyan and Zisserman (2014). Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.
  20. Tan and Le (2019). EfficientNet: rethinking model scaling for convolutional neural networks. arXiv preprint arXiv:1905.11946.
  21. Vaswani et al. (2017). Attention is all you need. arXiv preprint arXiv:1706.03762.
  22. Wang et al. (2018). GLUE: a multi-task benchmark and analysis platform for natural language understanding. In International Conference on Learning Representations.
  23. Wei et al. (2017). Modularized morphing of neural networks. arXiv preprint arXiv:1701.03281.
  24. Wei et al. (2016). Network morphism. In International Conference on Machine Learning, pp. 564–572.
  25. Wen et al. (2019). AutoGrow: automatic layer growing in deep convolutional networks. arXiv preprint arXiv:1906.02909.
  26. Wu et al. (2020). A multigrid method for efficiently training video models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 153–162.