Attention that does not Explain Away


Models based on the Transformer architecture have achieved better accuracy than models based on competing architectures on a large set of tasks. A unique feature of the Transformer is its universal application of a self-attention mechanism, which allows for free information flow at arbitrary distances. Following a probabilistic view of the attention via the Gaussian mixture model, we find empirical evidence that the Transformer attention tends to "explain away" certain input neurons. To compensate for this, we propose a doubly-normalized attention scheme that is simple to implement and provides theoretical guarantees for avoiding the "explaining away" effect without introducing significant computational or memory cost. Empirically, we show that the new attention schemes result in improved performance on several well-known benchmarks.


1 Introduction

The Transformer architecture (Vaswani et al., 2017) has been successfully used to improve state-of-the-art performance in a variety of machine learning tasks, such as machine translation (Vaswani et al., 2017; Dehghani et al., 2019), language modeling (Devlin et al., 2019; Yang et al., 2019), summarization (Cohan et al., 2018; Goodman et al., 2019), dialog (Mazaré et al., 2018; Cheng et al., 2019), image captioning (Sharma et al., 2018; Zhao et al., 2019), and visual question answering (Yu et al., 2019; Tan and Bansal, 2019). One of the most important components of the Transformer architecture is its self-attention mechanism, applied universally to both the encoder and the decoder components. This attention mechanism allows for information to freely flow between inputs at arbitrary distances, which is intuitively appealing for modeling natural language or tasks that need to model cross-modal relationships between their inputs.

Despite the empirical success of the self-attention mechanism, little formal work has been done to analyze its statistical properties and relate it to previously known classical models. Better understanding its properties can lead to insights into what it does and does not do well. This in turn can lead to improvements to the attention mechanism and ultimately to a better-performing Transformer network.

In this paper, we closely study the Transformer attention formulation from a probabilistic view via the Gaussian mixture model. If we consider the Transformer model as a stack of layers with data flowing from lower to upper layers, then the output neurons (from the upper layer) of an attention unit can be regarded as the most likely data generated by a Gaussian mixture model (GMM), while the input neurons (from the lower layer) of the attention unit act as the Gaussian centers.

Our insight here is that this Transformer attention scheme has an "explaining away" effect: the information present in certain lower-layer neurons may be filtered out completely. This is because, for a GMM, not all Gaussian centers (lower-layer neurons) are required to contribute to generating the output data (upper-layer neurons), and the information of the centers that do not generate data is lost after observing the data. This effect is related to the "explaining away" phenomenon in directed graphical models, in the sense that the few contributing lower neurons "explain away" the other, muted lower neurons in generating the upper neurons.

In order to compensate for this, we describe an alternative probabilistic model for attention, in which the roles of the upper and lower layers in the GMM formulation are reversed. This new attention scheme requires all of the generated data (lower-layer neurons) to be explained by at least one Gaussian center (upper-layer neurons). It therefore guarantees the preservation of information for all lower-layer neurons, as we prove in this paper.

The MLE equation of the reversed GMM leads to a simple attention update that is similar to the original one, except for the attention weight normalization. The original Transformer attention scheme normalizes the attention weights only once, for every upper-layer neuron. By contrast, our new attention mechanism requires a two-step normalization procedure: the first step normalizes over each lower-layer neuron, and the second over each upper-layer neuron. In the rest of this paper, we denote the original, upper-normalized attention scheme as $\mathcal{A}_{\mathrm{UN}}$, and the new doubly-normalized attention scheme as $\mathcal{A}_{\mathrm{DN}}$.

We also show that the $\mathcal{A}_{\mathrm{DN}}$ updates correspond exactly to one iteration of the Sinkhorn algorithm (Peyré and Cuturi, 2019) for a constrained optimization problem. As a result, iterating the $\mathcal{A}_{\mathrm{DN}}$ normalization steps until convergence yields a doubly-stochastic attention matrix, in which the attention weights of all upper and lower neurons are normalized. We also show that $\mathcal{A}_{\mathrm{UN}}$ can be formulated as a similar constrained optimization problem, except that its optimization problem lacks the constraint that prevents "explaining away" in $\mathcal{A}_{\mathrm{DN}}$.

Mathematically, we also formalize the concept of "explaining away" of a lower neuron using the sum of its attention weights. We prove that, under $\mathcal{A}_{\mathrm{DN}}$, the attention weight sum of every lower neuron is lower bounded by $1/n$, where $n$ is the sequence length, thereby completely avoiding the "explaining away" effect of $\mathcal{A}_{\mathrm{UN}}$.

Last but not least, we formulate a hybrid attention scheme, $\mathcal{A}_{\mathrm{HN}}$, that dynamically combines both attention schemes, and can provide a handle on a task-based preference between $\mathcal{A}_{\mathrm{UN}}$ and $\mathcal{A}_{\mathrm{DN}}$, as determined by the learning algorithm. We perform empirical studies and obtain clear numerical improvements using the $\mathcal{A}_{\mathrm{DN}}$ and $\mathcal{A}_{\mathrm{HN}}$ formulations on several well-known benchmarks, with minor computational overhead and a negligible increase in model size.

2 Transformer Attention and Gaussian Mixture Models

In this section, we review the Transformer self-attention mechanism and analyze how it relates to the Gaussian Mixture Model.

Assuming a sequence of length $n$, we first focus on the Transformer single-headed attention formulation involving two layers of neurons: the lower-layer neurons are the input representations, denoted as $x_i$ at position $i$, and the upper-layer neurons are the output representations, denoted as $y_j$ at position $j$. We assume both $x_i$ and $y_j$ are 1-d tensors of the same size $d$.

The self-attention mechanism first transforms the input representations to queries $q_j = W^q x_j$ and keys $k_i = W^k x_i$, where $W^q$ and $W^k$ are trainable transformation matrices of size $d \times d$. The value of an upper-layer neuron is computed as the weighted sum over the lower-layer neurons, followed by the value transformation $W^v$ of size $d \times d$:

$$y_j = \sum_{i=1}^{n} \frac{\exp(q_j^\top k_i)}{\sum_{i'=1}^{n} \exp(q_j^\top k_{i'})}\, W^v x_i. \qquad (1)$$
Since in this formulation the attention weights $a_{ij} = \exp(q_j^\top k_i) / \sum_{i'} \exp(q_j^\top k_{i'})$ are normalized for every upper-layer neuron $y_j$ over the lower-layer neurons $x_i$, we refer to this attention scheme as upper-normalized attention, $\mathcal{A}_{\mathrm{UN}}$.
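As a concrete reference, the single-head $\mathcal{A}_{\mathrm{UN}}$ update of Eq. (1) can be sketched in a few lines of NumPy (a minimal illustration with our own variable names, not an official implementation):

```python
import numpy as np

def un_attention(X, Wq, Wk, Wv):
    """Single-head upper-normalized attention, Eq. (1).

    X: (n, d) lower-layer neurons x_i; Wq, Wk, Wv: (d, d) transformations.
    Returns the upper-layer neurons Y of shape (n, d) and the weights A,
    where A[j, i] is the attention of upper neuron j on lower neuron i.
    """
    Q, K, V = X @ Wq.T, X @ Wk.T, X @ Wv.T
    E = np.exp(Q @ K.T)                    # E[j, i] = exp(q_j . k_i)
    A = E / E.sum(axis=1, keepdims=True)   # normalize over lower neurons i
    return A @ V, A

rng = np.random.default_rng(0)
n, d = 5, 4
X = rng.normal(size=(n, d))
Wq, Wk, Wv = (0.1 * rng.normal(size=(d, d)) for _ in range(3))
Y, A = un_attention(X, Wq, Wk, Wv)
```

Each row of A sums to 1 (one normalization per upper neuron), but nothing constrains the column sums, which is where "explaining away" originates.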

2.1 Relation to GMM

The $\mathcal{A}_{\mathrm{UN}}$ scheme (1) relates to a Gaussian mixture model (GMM) in the following way. Let us use $k_i$ to denote the positions of the Gaussian cluster centers, with the cluster priors denoted as $\pi_i$, satisfying $\sum_i \pi_i = 1$. The generated data position is denoted as $q_j$. If we assume the variance of the Gaussian distributions to be equal to 1, then the log-likelihood of the GMM is:

$$L(q) = \sum_{j=1}^{n} \log \sum_{i=1}^{n} \pi_i\, \mathcal{N}(q_j;\, k_i,\, I). \qquad (2)$$
We can compute the optimal $q_j$ by taking the derivative of $L(q)$ and solving the following equation:

$$\frac{\partial L}{\partial q_j} = \sum_{i} \frac{\pi_i\, \mathcal{N}(q_j; k_i, I)}{\sum_{i'} \pi_{i'}\, \mathcal{N}(q_j; k_{i'}, I)}\, (k_i - q_j) = 0.$$

If we assume the cluster priors as $\pi_i \propto \exp(\|k_i\|^2/2)$, we have

$$\pi_i\, \mathcal{N}(q_j; k_i, I) \propto \exp(q_j^\top k_i - \|q_j\|^2/2).$$

Using the fact that the factor $\exp(-\|q_j\|^2/2)$ cancels in the normalization and that the responsibilities sum to 1 over $i$, we obtain a fixed-point equation:

$$q_j = \sum_{i} \frac{\exp(q_j^\top k_i)}{\sum_{i'} \exp(q_j^\top k_{i'})}\, k_i. \qquad (3)$$
If we compare Eq. (3) with Eq. (1), the Gaussian cluster centers $k_i$ play exactly the same role as the key representations of the lower-layer neurons in Eq. (1). The data position $q_j$ in Eq. (2) plays the same role as the query representation in Eq. (1). By iterating the fixed-point equation (3) for one iteration, the new data position corresponds to the upper-layer neuron $y_j$ in Eq. (1), after applying the transformation $W^v (W^k)^{-1}$.
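This correspondence is easy to verify numerically. The sketch below (our own illustration, assuming unit-variance Gaussians and priors $\pi_i \propto \exp(\|k_i\|^2/2)$ as above) checks that the GMM responsibilities coincide with the softmax attention weights, and performs one fixed-point iteration of Eq. (3):

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 6, 3
K = rng.normal(size=(n, d))   # Gaussian centers k_i (keys)
q = rng.normal(size=d)        # one data position q_j (query)

# GMM responsibilities with unit variance and priors pi_i ∝ exp(||k_i||^2 / 2):
# log(pi_i) + log N(q; k_i, I) = q.k_i - ||q||^2/2 (up to constants)
log_unnorm = 0.5 * (K ** 2).sum(axis=1) - 0.5 * ((q - K) ** 2).sum(axis=1)
r = np.exp(log_unnorm)
r /= r.sum()

# Softmax attention weights, as in Eq. (1)
a = np.exp(K @ q)
a /= a.sum()

# One fixed-point iteration of Eq. (3): the new data position is the
# attention-weighted sum of the centers
q_new = r @ K
```

The $\|q\|^2/2$ term is shared by all mixture components, so it cancels in the normalization, leaving exactly the softmax weights.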

Note that computing the most-likely data positions given the Gaussian centers is non-standard for probabilistic inference. A more natural approach would be MLE estimation of the Gaussian centers given the data. That is exactly what doubly-normalized attention corresponds to, as we discuss in the next section.

2.2 Multi-head attention

The multi-head ($H$-head) attention can be derived similarly. The lower neurons are projected into $H$ heads with different $W_h^q$ and $W_h^k$, where $W_h^q$ and $W_h^k$ are transformation matrices of size $d_H \times d$ (with $d_H = d/H$). This yields outputs $y_j^h$,

$$y_j^h = \sum_{i} \frac{\exp\big((W_h^q x_j)^\top W_h^k x_i\big)}{\sum_{i'} \exp\big((W_h^q x_j)^\top W_h^k x_{i'}\big)}\, W_h^v x_i, \qquad (4)$$

where $W_h^v$ is the value transformation matrix of size $d_H \times d$. Similar to (1), (4) corresponds to a GMM followed by value transformations, and $H$-head attention corresponds to $H$ GMMs followed by value transformations. The final output is a concatenation of all heads: $y_j = \mathrm{concat}(y_j^1, \ldots, y_j^H)$.

3 Doubly-normalized Attention

As we have shown, in the original $\mathcal{A}_{\mathrm{UN}}$ scheme, the lower-layer neuron representations correspond to the Gaussian centers, while the upper-layer neuron representations correspond to the data generated from these centers. The maximization with respect to the data positions is unnatural. In addition, the formulation has an "explaining away" effect, because for a GMM, not all Gaussian centers (lower-layer neurons) are required to contribute to generating the output data (upper-layer neurons). As a result, the information of the centers that do not generate data is completely lost. For tasks such as summarization, "explaining away" may be acceptable, while for other tasks, such as visual question answering and language modeling, the attention mechanism may benefit from a more "conservative" formulation, with the upper layer preserving the neural information at all positions.

To this end, we propose to reverse the roles of the upper and lower layers in the GMM, so that all of the generated data (lower-layer neurons) must be explained by at least one Gaussian center (upper-layer neurons). This results in a new doubly-normalized attention scheme, $\mathcal{A}_{\mathrm{DN}}$ (the derivation is given shortly):

$$p_{ij} = \frac{\exp(q_j^\top k_i)}{\sum_{j'} \exp(q_{j'}^\top k_i)}, \qquad y_j = \sum_{i} \frac{p_{ij}}{\sum_{i'} p_{i'j}}\, W^v x_i. \qquad (5)$$

Comparing (1) with (5), the only difference between the two is the normalization of the attention weights. The $\mathcal{A}_{\mathrm{DN}}$ scheme applies two normalization steps: first over the upper positions $j$ for each lower-layer neuron $x_i$, and then over the lower positions $i$ for each upper-layer neuron $y_j$.
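The two-step normalization of Eq. (5) can be sketched as follows (a minimal NumPy illustration with our own variable names; Section 4 formalizes the lower bound checked at the end):

```python
import numpy as np

def dn_weights(Q, K):
    """Doubly-normalized attention weights, Eq. (5).

    Convention: A[j, i] is the weight of lower neuron i for upper neuron j.
    Step 1 normalizes over upper positions j for each lower neuron i;
    step 2 normalizes over lower positions i for each upper neuron j.
    """
    E = np.exp(Q @ K.T)                      # E[j, i] = exp(q_j . k_i)
    P = E / E.sum(axis=0, keepdims=True)     # step 1: sum_j P[j, i] = 1
    return P / P.sum(axis=1, keepdims=True)  # step 2: sum_i A[j, i] = 1

rng = np.random.default_rng(2)
n, d = 5, 4
Q, K = rng.normal(size=(n, d)), rng.normal(size=(n, d))
A = dn_weights(Q, K)
```

After both steps, each upper neuron's weights sum to 1, and each lower neuron retains a total weight of at least $1/n$.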

3.1 Relation to GMM

We present here the derivation of (5) from a GMM. When we reverse the roles of the upper and lower layers, we use $q_j$ to denote the Gaussian centers and $k_i$ to denote the data generated by the GMM. The log-likelihood function of the GMM is:

$$L(k) = \sum_{i=1}^{n} \log \sum_{j=1}^{n} \pi_j\, \mathcal{N}(k_i;\, q_j,\, I), \qquad (6)$$

where the priors satisfy $\sum_j \pi_j = 1$. We take the gradient with respect to $q_j$:

$$\frac{\partial L}{\partial q_j} = \sum_{i} \frac{\pi_j\, \mathcal{N}(k_i; q_j, I)}{\sum_{j'} \pi_{j'}\, \mathcal{N}(k_i; q_{j'}, I)}\, (k_i - q_j). \qquad (7)$$

At the optimum, we have $\partial L / \partial q_j = 0$, and therefore the fixed-point equation is

$$q_j = \frac{\sum_i p_{ij}\, k_i}{\sum_{i'} p_{i'j}}, \qquad \text{where } p_{ij} = \frac{\pi_j\, \mathcal{N}(k_i; q_j, I)}{\sum_{j'} \pi_{j'}\, \mathcal{N}(k_i; q_{j'}, I)}. \qquad (8)$$

By iterating the fixed-point equation (8) for one iteration and assuming the priors $\pi_j \propto \exp(\|q_j\|^2/2)$, so that $p_{ij} = \exp(q_j^\top k_i) / \sum_{j'} \exp(q_{j'}^\top k_i)$, the new center position is equivalent to the upper-layer neuron $y_j$ of Eq. (5), modulo a transformation matrix $W^v (W^k)^{-1}$.

Similar to $\mathcal{A}_{\mathrm{UN}}$, it is also straightforward to extend the above derivation to the multi-head ($H$-head) $\mathcal{A}_{\mathrm{DN}}$ scheme, which corresponds to $H$ GMMs followed by value transformations.

3.2 Relation to Double Stochasticity

It should be emphasized that our doubly-normalized attention is not doubly-stochastic (where the columns and rows of the attention matrix all sum to 1). After applying $\mathcal{A}_{\mathrm{DN}}$, the attention weights of the lower-layer neurons are no longer normalized, since the upper-layer normalization in the second step of $\mathcal{A}_{\mathrm{DN}}$ denormalizes the lower layer. However, as we show in the following, doubly-stochastic attention can be achieved by applying the two normalization steps for multiple iterations until convergence.

Consider the following constrained optimization problem that characterizes $\mathcal{A}_{\mathrm{DN}}$:

$$\max_{a_{ij} \ge 0}\; \sum_{ij} a_{ij}\, q_j^\top k_i - \sum_{ij} a_{ij} \log a_{ij}, \quad \text{s.t. } \sum_i a_{ij} = 1\ \forall j, \quad \sum_j a_{ij} = 1\ \forall i. \qquad (9)$$
This problem is well-known in the optimal transport literature. The classical iterative algorithm for finding its solution is the Sinkhorn algorithm (Peyré and Cuturi, 2019), which uses the initial condition $a_{ij}^{(0)} = \exp(q_j^\top k_i)$ and iterates

$$\tilde a_{ij}^{(t)} = \frac{a_{ij}^{(t-1)}}{\sum_{j'} a_{ij'}^{(t-1)}}, \qquad a_{ij}^{(t)} = \frac{\tilde a_{ij}^{(t)}}{\sum_{i'} \tilde a_{i'j}^{(t)}}. \qquad (10)$$
If we write $a_{ij} = p_{ij} / \sum_{i'} p_{i'j}$, then the doubly-normalized attention weights computed in Eq. (5) correspond exactly to the updates (10) of the Sinkhorn algorithm for one iteration. If more iterations are applied, the attention weights eventually satisfy both constraints in (9) and become doubly-stochastic. One question is whether $\mathcal{A}_{\mathrm{DN}}$ could perform better with more iterations of the updates in Eq. (10). Empirically, we find that adding more update iterations increases computation time but does not improve performance.
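The relation between one iteration (the $\mathcal{A}_{\mathrm{DN}}$ weights) and full convergence (a doubly-stochastic matrix) can be sketched directly (an illustrative NumPy implementation, not the authors' code):

```python
import numpy as np

def sinkhorn(E, iters):
    """Alternating normalizations of Eq. (10), starting from E[j, i] = exp(q_j . k_i)."""
    A = E.copy()
    for _ in range(iters):
        A = A / A.sum(axis=0, keepdims=True)  # normalize each lower neuron i (over j)
        A = A / A.sum(axis=1, keepdims=True)  # normalize each upper neuron j (over i)
    return A

rng = np.random.default_rng(3)
E = np.exp(rng.normal(size=(6, 6)))
one_iter = sinkhorn(E, 1)      # exactly the doubly-normalized weights of Eq. (5)
converged = sinkhorn(E, 200)   # approaches a doubly-stochastic matrix
```

After one iteration only the upper-layer constraint holds exactly; with enough iterations both row and column sums approach 1.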

Interestingly, the attention weights of the original $\mathcal{A}_{\mathrm{UN}}$ scheme can be obtained from a very similar constrained optimization, except that the normalization constraint on the lower-layer neurons is removed:

$$\max_{a_{ij} \ge 0}\; \sum_{ij} a_{ij}\, q_j^\top k_i - \sum_{ij} a_{ij} \log a_{ij}, \quad \text{s.t. } \sum_i a_{ij} = 1\ \forall j. \qquad (11)$$
Introducing the Lagrange multipliers $\lambda_j$, this formulation is equivalent to optimizing the Lagrangian, whose gradient with respect to $a_{ij}$ gives

$$q_j^\top k_i - \log a_{ij} - 1 - \lambda_j = 0,$$

which leads to the same attention weights as in Eq. (1) when $\lambda_j$ is chosen so that $\sum_i a_{ij} = 1$.

Comparing the two constrained optimization problems in (11) and (9), the removal of the lower-layer constraint $\sum_j a_{ij} = 1$ in (11) allows solutions in which a lower-layer neuron makes an arbitrary, possibly near-zero, total contribution to the upper layer, causing the "explaining-away" effect.

3.3 Relation to Capsule Networks

It is also worth noting that $\mathcal{A}_{\mathrm{DN}}$ is related to the EM routing algorithm in capsule networks (Hinton et al., 2018). In particular, the vote matrix in (Hinton et al., 2018) plays a role similar to the data $k_i$ in Eq. (6), and the new pose matrix plays a role similar to the centers $q_j$ in Eq. (6). However, unlike CapsuleNet, there is no variance estimation in $\mathcal{A}_{\mathrm{DN}}$, as we find that estimating the variance significantly hurts the empirical performance of the $\mathcal{A}_{\mathrm{DN}}$ algorithm. In addition, we only iterate the fixed-point equation (8) for one iteration, as more iterations are computationally expensive and do not improve performance.

4 Doubly-Normalized Attention Avoids Explaining Away

In this section, we formalize the definition of "explaining away" and compare $\mathcal{A}_{\mathrm{UN}}$ and $\mathcal{A}_{\mathrm{DN}}$, both theoretically and empirically, with respect to the "explaining-away" phenomenon.

Definition 1

In an attention unit, a lower-layer neuron $x_i$ is considered $\epsilon$-"explained away" if the sum of its attention weights over the upper-layer neurons, $\sum_j a_{ij}$, is less than $\epsilon$.

We consider $\epsilon$ to be some small value (fixed at $e^{-20}$ in the rest of this paper). For the original Transformer $\mathcal{A}_{\mathrm{UN}}$, the only constraint in (11) is $\sum_i a_{ij} = 1$; it does not require all lower-layer neurons to be attended to by the upper layer. Therefore, for a given lower-layer neuron $x_i$, the total attention weight to the upper layer can be as low as 0, so that the neuron is $\epsilon$-"explained away".

In contrast, the $\mathcal{A}_{\mathrm{DN}}$ scheme attempts to optimize the objective with both lower- and upper-layer normalization constraints (9) using one iteration of the Sinkhorn algorithm. It turns out that this is sufficient to avoid the "explaining-away" phenomenon. The following theorem formalizes this fact by showing that each lower-layer neuron contributes a total attention weight of at least $1/n$, where $n$ is the sequence length.

Theorem 2

For any lower-layer neuron $x_i$, the sum of the doubly-normalized attention weights over the upper-layer neurons satisfies $\sum_j a_{ij} \ge 1/n$.

Proof  Since $p_{i'j} \le 1$ for every $i'$ and $j$, we have $\sum_{i'} p_{i'j} \le n$. Therefore,

$$\sum_j a_{ij} = \sum_j \frac{p_{ij}}{\sum_{i'} p_{i'j}} \ge \frac{1}{n} \sum_j p_{ij} = \frac{1}{n},$$

where the last equality uses the lower-layer normalization $\sum_j p_{ij} = 1$.
We illustrate the difference between the two attention schemes, and how differently they behave in practice with respect to the "explaining-away" phenomenon, using the multi-view attention model (with single-layer, single-head attention) described in the VQA experiments later. Fig. 1 shows the histogram of the attention weight sums $\log(\sum_j a_{ij})$ for $\mathcal{A}_{\mathrm{UN}}$ and $\mathcal{A}_{\mathrm{DN}}$. As the graph indicates, a large proportion of the $\mathcal{A}_{\mathrm{UN}}$ attention weight sums are $\epsilon$-"explained away" (log values $< -20$), meaning that the information of only a few of the lower neurons is passed to the upper layer. In contrast, $\mathcal{A}_{\mathrm{DN}}$ preserves information from all lower-layer neurons, as indicated by their weight-sum log values (all $> \log(1/n)$, in accordance with Theorem 2).

Finally, we would like to emphasize that $\mathcal{A}_{\mathrm{DN}}$ does not work against attention sparsity: individual attention weights $a_{ij}$ between any pair of neurons may still be arbitrarily small. What it forbids is a near-zero total "contribution" of any lower neuron $x_i$, i.e., $\sum_j a_{ij} \approx 0$. Therefore, our method is compatible with existing faster sparse attention structures such as Parmar et al. (2018).
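A small contrived example makes the contrast concrete. In the sketch below (our own illustration), lower neuron 0 has a key that mismatches every query; $\mathcal{A}_{\mathrm{UN}}$ drives its total weight to essentially zero, while $\mathcal{A}_{\mathrm{DN}}$ keeps it above $1/n$:

```python
import numpy as np

n, d = 8, 4
Q = np.zeros((n, d)); Q[:, 0] = 5.0   # all queries point along the first axis
K = np.zeros((n, d)); K[:, 0] = 5.0
K[0, 0] = -5.0                        # lower neuron 0 mismatches every query

E = np.exp(Q @ K.T)                   # E[j, i] = exp(q_j . k_i)

# A_UN: one normalization over lower neurons i
A_un = E / E.sum(axis=1, keepdims=True)
# A_DN: lower normalization (over j), then upper normalization (over i)
P = E / E.sum(axis=0, keepdims=True)
A_dn = P / P.sum(axis=1, keepdims=True)

un_total = A_un.sum(axis=0)[0]        # ~0: neuron 0 is "explained away"
dn_total = A_dn.sum(axis=0)[0]        # >= 1/n: neuron 0 is preserved
```

Here the mismatched neuron's total weight under $\mathcal{A}_{\mathrm{UN}}$ is on the order of $e^{-50}$, while $\mathcal{A}_{\mathrm{DN}}$ guarantees it at least $1/8$.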

Figure 1: Comparison of the attention weight sums between $\mathcal{A}_{\mathrm{UN}}$ and $\mathcal{A}_{\mathrm{DN}}$. The majority of the neurons in $\mathcal{A}_{\mathrm{UN}}$ are $\epsilon$-"explained away", as the logarithm of their weight sums is less than -20.

5 Hybrid Attention

Since the formulations of $\mathcal{A}_{\mathrm{UN}}$ and $\mathcal{A}_{\mathrm{DN}}$ result in attention mechanisms with quite different properties, it can be beneficial to combine them. A direct way to do so is to use trainable variables $w^{(l,h)} \in [0,1]$ that control the contribution of the attention weights of the two normalization schemes for layer $l$ and head $h$ (we write $w$ below to simplify the notation):

$$a_{ij} = w\, a_{ij}^{\mathrm{DN}} + (1 - w)\, a_{ij}^{\mathrm{UN}}, \qquad (12)$$

where $a_{ij}^{\mathrm{DN}}$ denotes the $\mathcal{A}_{\mathrm{DN}}$ weights and $a_{ij}^{\mathrm{UN}}$ denotes the $\mathcal{A}_{\mathrm{UN}}$ weights. We call this combination the hybrid-normalized attention scheme, $\mathcal{A}_{\mathrm{HN}}$. $\mathcal{A}_{\mathrm{HN}}$ allows the model to learn, at each layer $l$ and head $h$, which of the two normalization schemes fits the data better for a given task. Each parameter $w^{(l,h)}$ is trained jointly with the other parameters to improve the representation power of the model and better fit the data. Moreover, this approach also allows one to visualize how the values of the $w^{(l,h)}$ parameters change as the model trains, and therefore provides direct evidence of how much, and where, the different normalization schemes lead to better training performance. We provide examples of such visualizations in the experiments.
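The hybrid combination of Eq. (12) preserves the per-upper-neuron normalization, since it is a convex combination of two row-stochastic weight matrices (a minimal sketch with our own variable names):

```python
import numpy as np

def hn_weights(Q, K, w):
    """Hybrid attention weights, Eq. (12): a convex combination of the
    doubly-normalized and upper-normalized weights, with scalar w in [0, 1]."""
    E = np.exp(Q @ K.T)                          # E[j, i] = exp(q_j . k_i)
    A_un = E / E.sum(axis=1, keepdims=True)      # upper-normalized weights
    P = E / E.sum(axis=0, keepdims=True)         # lower normalization ...
    A_dn = P / P.sum(axis=1, keepdims=True)      # ... then upper normalization
    return w * A_dn + (1.0 - w) * A_un

rng = np.random.default_rng(5)
n, d = 5, 4
Q, K = rng.normal(size=(n, d)), rng.normal(size=(n, d))
A = hn_weights(Q, K, w=0.5)
```

Because both component matrices have rows summing to 1, every upper neuron's hybrid weights also sum to 1 for any $w \in [0, 1]$.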

5.1 Computational Cost of $\mathcal{A}_{\mathrm{DN}}$ and $\mathcal{A}_{\mathrm{HN}}$

The pseudo-code of the (multi-headed) $\mathcal{A}_{\mathrm{UN}}$, $\mathcal{A}_{\mathrm{DN}}$, and $\mathcal{A}_{\mathrm{HN}}$ schemes is summarized in Algorithm 1. Note that, for notational clarity, we write the multi-head operations in a for-loop over the heads $h$. However, an efficient implementation should use single tensor products across all heads, similar to the original Transformer implementation.

Input: Key, query, value transformation matrices $W_h^k$, $W_h^q$, and $W_h^v$ for $H$ heads. Hybrid weights $w_h$ for all heads. Lower-layer neurons $x_i$.
Result: Upper-layer neurons $y_j$.
for $h = 1, \ldots, H$ do
       1. Compute $k_{hi} = W_h^k x_i$, $q_{hi} = W_h^q x_i$, $v_{hi} = W_h^v x_i$ for all lower neurons $i$
       2. Compute $e_{hij} = \exp(q_{hj}^\top k_{hi})$
       3. [$\mathcal{A}_{\mathrm{UN}}$] Compute $a_{hij}^{\mathrm{UN}} = e_{hij} / \sum_{i'} e_{hi'j}$
       4. [$\mathcal{A}_{\mathrm{DN}}$] Compute $p_{hij} = e_{hij} / \sum_{j'} e_{hij'}$, then $a_{hij}^{\mathrm{DN}} = p_{hij} / \sum_{i'} p_{hi'j}$
       5. [$\mathcal{A}_{\mathrm{HN}}$] Compute $a_{hij} = w_h\, a_{hij}^{\mathrm{DN}} + (1 - w_h)\, a_{hij}^{\mathrm{UN}}$
       6. Compute $y_{hj} = \sum_i a_{hij} v_{hi}$
end for
Return $y_j = \mathrm{concat}(y_{1j}, \ldots, y_{Hj})$ for all $j$.
Algorithm 1: $\mathcal{A}_{\mathrm{UN}}$, $\mathcal{A}_{\mathrm{DN}}$, and $\mathcal{A}_{\mathrm{HN}}$

We can see that the additional computational cost of the $\mathcal{A}_{\mathrm{DN}}$ scheme compared to the original Transformer's $\mathcal{A}_{\mathrm{UN}}$ scheme is the two normalizations in Step 4, as opposed to one in Step 3. $\mathcal{A}_{\mathrm{HN}}$ requires both Step 3 and Step 4, and combines them in Step 5. The computational cost of the new steps is $O(Hn^2)$, where $n$ is the sequence length and $H$ is the number of heads. In comparison, the cost of Step 1 is $O(nd^2)$, where $d$ is the size of the hidden representation. In the majority of the applications we consider, we usually have $n \le d$ and $H \ll d$, and therefore the additional cost of the $\mathcal{A}_{\mathrm{DN}}$ and $\mathcal{A}_{\mathrm{HN}}$ schemes is usually small in practice.
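Algorithm 1 vectorizes naturally across heads. The following NumPy sketch (our own illustration, not the authors' code) implements all six steps with tensor products, as an efficient implementation would:

```python
import numpy as np

def multihead_attention(X, Wq, Wk, Wv, w):
    """Vectorized sketch of Algorithm 1.

    X: (n, d) lower neurons; Wq, Wk, Wv: (H, d_H, d) per-head transforms;
    w: (H,) hybrid weights. Returns Y: (n, H * d_H), the head concatenation.
    """
    Q = np.einsum('hkd,nd->hnk', Wq, X)            # Step 1: queries, (H, n, d_H)
    K = np.einsum('hkd,nd->hnk', Wk, X)            #         keys
    V = np.einsum('hkd,nd->hnk', Wv, X)            #         values
    E = np.exp(np.einsum('hjk,hik->hji', Q, K))    # Step 2: E[h, j, i]
    A_un = E / E.sum(axis=2, keepdims=True)        # Step 3: upper-normalized
    P = E / E.sum(axis=1, keepdims=True)           # Step 4: lower, then ...
    A_dn = P / P.sum(axis=2, keepdims=True)        #         ... upper normalization
    wh = w[:, None, None]
    A = wh * A_dn + (1.0 - wh) * A_un              # Step 5: hybrid combination
    Y = np.einsum('hji,hik->hjk', A, V)            # Step 6: weighted values
    return np.transpose(Y, (1, 0, 2)).reshape(X.shape[0], -1)

rng = np.random.default_rng(6)
n, d, H = 5, 8, 2
X = rng.normal(size=(n, d))
Wq, Wk, Wv = (0.1 * rng.normal(size=(H, d // H, d)) for _ in range(3))
Y = multihead_attention(X, Wq, Wk, Wv, w=np.full(H, 0.5))
```

The extra work relative to standard attention is only the additional normalizations of Step 4 on the $(H, n, n)$ weight tensor.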

The additional model variables introduced by the $\mathcal{A}_{\mathrm{HN}}$ scheme are the hybrid weights $w^{(l,h)}$. It therefore adds $LH$ new variables, where $L$ is the number of Transformer layers. This increase is negligible compared to the total size of the Transformer model, which is on the order of $Ld^2$.

6 Numerical Experiments

6.1 Multi-view Attention Model for VQA

In a vision-and-language multimodal system (e.g., Visual Question Answering), a crucial factor in the performance is the quality of the visual features. A good example is the work of (Yu et al., 2019), where they show that it is beneficial to use visual features produced by different image processing modules (multi-view). They combine these visual features using an attention layer over the bounding-box features derived from multiple object detectors (Fig. 2).

Experiment Setup. Our experimental setup is similar to the one proposed in (Yu et al., 2019). We conduct experiments on the VQA benchmark dataset, VQA-v2 (Goyal et al., 2017). Our core VQA model uses as a backbone the Pythia architecture (Jiang et al., 2018). We use three object detection models, where each detector generates 100 bounding-box features. All three object detection models are trained over the Visual Genome dataset (Krishna et al., 2017), but use different backbone networks: the first uses a ResNet-101 network (He et al., 2016), the second a ResNet-200 network, and the third an Inception-ResNetV2 network (Szegedy et al., 2016).

Figure 2: Multi-view attention model for VQA.

Multi-view features can be used in a straightforward manner by concatenating them all together before feeding them into the Pythia model; we call this approach the 3x100-boxes baseline. The proposal from (Yu et al., 2019) combines the multi-view features using a one-layer attention model as follows: one object-detector model is designated as primary, and its corresponding features are used as queries (after transformation); the second and third object detection models are designated as secondary, and their corresponding features are used to obtain keys (see Figure 2). The resulting output feature is a weighted sum of the features according to the attention weights. More details about the multi-view attention model and the experimental hyperparameter settings are provided in the Appendix. We use a single-layer, single-head attention model and experiment with two versions of the attention scheme: $\mathcal{A}_{\mathrm{UN}}$ and $\mathcal{A}_{\mathrm{HN}}$.

Figure 3: The hybrid weight heavily favors $\mathcal{A}_{\mathrm{DN}}$ over $\mathcal{A}_{\mathrm{UN}}$ in the multi-view, attention-based VQA model.
Method Test-dev Test-std
10-100-boxes Pythia (Jiang et al., 2018) 66.91 -
3x100-boxes (no-attn baseline) 68.79 69.22
3x100-boxes $\mathcal{A}_{\mathrm{UN}}$ (attn baseline) 69.14 69.50
3x100-boxes $\mathcal{A}_{\mathrm{HN}}$ 69.70 70.01
Table 1: Test Accuracy on VQA v2.0 Test-dev and Test-std splits.

Results Analysis. The results are summarized in Table 1. Confirming the findings of (Yu et al., 2019), using an attention mechanism ($\mathcal{A}_{\mathrm{UN}}$) over the 3x100 boxes improves accuracy over the 3x100-boxes no-attn baseline, but the $\mathcal{A}_{\mathrm{HN}}$ mechanism achieves a better utilization of the signal provided by the three object detectors compared to the $\mathcal{A}_{\mathrm{UN}}$ mechanism. Moreover, $\mathcal{A}_{\mathrm{HN}}$ allows us to visually confirm the superiority of the $\mathcal{A}_{\mathrm{DN}}$ mechanism for the VQA task: when we plot the hybrid weight $w$ from Eq. (12) in Fig. 3, it rapidly converges to 1.0, meaning that the model learns to heavily favor $\mathcal{A}_{\mathrm{DN}}$ over $\mathcal{A}_{\mathrm{UN}}$ for combining multi-view features. Combined with the findings in Fig. 1, we believe that $\mathcal{A}_{\mathrm{UN}}$ performs worse because it $\epsilon$-"explains away" too many box features at this stage, while $\mathcal{A}_{\mathrm{DN}}$ preserves information from all bounding boxes.

6.2 Language Representation Learning

The goal of language representation learning is to pretrain textual representations that are useful for solving natural language understanding (NLU) tasks like entailment or question answering.

Figure 4: In the BERT model, the hybrid weights favor $\mathcal{A}_{\mathrm{DN}}$ in all layers of the encoder ($w > 0.5$); $\mathcal{A}_{\mathrm{UN}}$ gains more weight in layers closer to the output.

Experiment Setup. We use the BERT (Devlin et al., 2019) setting for our language representation learning setup: a Transformer network with 24 layers of attention, the hidden and embedding size set to 1024, and 16 attention heads.

Method SQuAD 1.1 (EM/F1) SQuAD 2.0 (EM/F1) RACE GLUE (avg.)
$\mathcal{A}_{\mathrm{UN}}$ (baseline) 85.1/92.2 80.2/83.6 74.2 84.5
$\mathcal{A}_{\mathrm{DN}}$ 85.8/92.4 81.0/84.2 74.3 85.2
$\mathcal{A}_{\mathrm{HN}}$ 85.6/92.2 81.7/84.8 74.3 84.7
Table 2: Pretraining with BERT models and finetuning on several representative downstream tasks.
Method ROUGE-1 ROUGE-2 ROUGE-L
$\mathcal{A}_{\mathrm{UN}}$-encoder, $\mathcal{A}_{\mathrm{UN}}$-decoder (baseline) 38.02±0.07 18.93±0.10 35.25±0.09
$\mathcal{A}_{\mathrm{DN}}$-encoder, $\mathcal{A}_{\mathrm{UN}}$-decoder 38.19±0.05 19.09±0.07 35.52±0.06
$\mathcal{A}_{\mathrm{HN}}$-encoder, $\mathcal{A}_{\mathrm{UN}}$-decoder 38.27 19.30 35.56
Table 3: ROUGE F1 scores for headline generation on the Gigaword benchmark.

Our experiment is based on the ALBERT platform (Lan et al., 2019). We use the BookCorpus (Zhu et al., 2015) and English Wikipedia (Devlin et al., 2019) to pretrain three contextual representation models, using $\mathcal{A}_{\mathrm{UN}}$, $\mathcal{A}_{\mathrm{DN}}$, and $\mathcal{A}_{\mathrm{HN}}$, respectively. Each pretraining run uses a batch size of 4096 and a LAMB optimizer with learning rate 0.00176 for 125k steps on Cloud TPU V3 with 64 TPUs. We evaluate the resulting representations by using them as a starting point to finetune on a number of representative NLU tasks (Rajpurkar et al., 2018; Williams et al., 2018). Due to space limitations, more experimental details are provided in the Appendix.

Results Analysis. Each fine-tuning experiment is run 5 times, and the mean and standard error are reported. The main results are summarized in Table 2, and more detailed results are available in the Appendix. Overall, the network parameters encode their language representations by making use of $\mathcal{A}_{\mathrm{DN}}$, resulting in the empirical advantage of the $\mathcal{A}_{\mathrm{DN}}$- and $\mathcal{A}_{\mathrm{HN}}$-based models over the $\mathcal{A}_{\mathrm{UN}}$-based models on most tasks considered. Aside from the numerical improvements when finetuning on the tasks, we also inspect what happens to the hybrid weights of Eq. (12) during $\mathcal{A}_{\mathrm{HN}}$ pretraining. In Fig. 4, we plot the hybrid weights (averaged over all heads of each layer) for all 24 layers and find that they are always larger than 0.5, meaning that the $\mathcal{A}_{\mathrm{DN}}$ method is preferred for the pretraining (masked-LM & sentence-ordering) tasks. The $\mathcal{A}_{\mathrm{UN}}$ method has more weight in higher layers, suggesting that "explaining away" becomes more acceptable closer to the output.

6.3 Headline Generation

We also present empirical results on a summarization task. As already mentioned, summarization aligns well with the tendency of $\mathcal{A}_{\mathrm{UN}}$ to "explain away" unimportant information.

Experiment Setup. We use the Gigaword dataset (Graff and Cieri, 2003), which is a standard benchmark for headline generation. We pre-process this dataset as in (Rush et al., 2015), and further tokenize the words into word-pieces (Devlin et al., 2019), which results in a vocabulary size of 30,522 word-piece types. We use a 10k dataset for validation, and the standard 2k test set (Rush et al., 2015) as the evaluation test.

Our model and training hyperparameters are adapted from (Goodman et al., 2019). The Transformer contains 12 layers, each with a hidden size of 768 and 12 attention heads. We keep the attention mechanism in the decoder as $\mathcal{A}_{\mathrm{UN}}$, and compare $\mathcal{A}_{\mathrm{DN}}$ and $\mathcal{A}_{\mathrm{HN}}$ against $\mathcal{A}_{\mathrm{UN}}$ as the encoder attention mechanism. Training uses a batch size of 512 and an Adam optimizer (Kingma and Ba, 2015) for 500k steps, and is done on Cloud TPU V3 with 16 TPUs for each job.

Figure 5: The hybrid weights favor $\mathcal{A}_{\mathrm{UN}}$ in the encoder for headline generation, because the task requires filtering unimportant information. However, the ROUGE scores of $\mathcal{A}_{\mathrm{DN}}$ are higher than those of $\mathcal{A}_{\mathrm{UN}}$.

Results Analysis. Each experiment is run 5 times, and the mean and standard error are reported in Table 3. We also plot the averaged hybrid weights for all layers in Fig. 5, which shows that the $\mathcal{A}_{\mathrm{HN}}$ model favors $\mathcal{A}_{\mathrm{UN}}$, especially in the top and bottom layers of the encoder. Nevertheless, $\mathcal{A}_{\mathrm{DN}}$ still makes a positive contribution in the middle layers, which allows the $\mathcal{A}_{\mathrm{HN}}$-based model to perform better than the $\mathcal{A}_{\mathrm{UN}}$-based one. Somewhat surprisingly, $\mathcal{A}_{\mathrm{DN}}$ alone performs competitively: all of its ROUGE scores are higher than those of $\mathcal{A}_{\mathrm{UN}}$ and close to those of $\mathcal{A}_{\mathrm{HN}}$. This indicates that complete "explaining away" by $\mathcal{A}_{\mathrm{UN}}$ is unnecessary for filtering unimportant information; $\mathcal{A}_{\mathrm{DN}}$ provides a conservative alternative that achieves better generation performance.

7 Conclusion

The formulation of the attention mechanism of the Transformer, here called $\mathcal{A}_{\mathrm{UN}}$, leads to "explaining away" effects in which the information of certain input neurons is completely ignored. Our new $\mathcal{A}_{\mathrm{DN}}$ scheme compensates for $\mathcal{A}_{\mathrm{UN}}$'s weaknesses by avoiding "explaining away", as we show both theoretically and empirically. Empirically, we show $\mathcal{A}_{\mathrm{DN}}$ and the hybrid $\mathcal{A}_{\mathrm{HN}}$ to be superior to the original attention mechanism, at the cost of minor computational overhead.

Appendix A Multi-head attention and GMM

In multi-head attention, the lower neurons are projected into $H$ heads with different $W_h^q$ and $W_h^k$, where $W_h^q$ and $W_h^k$ are transformation matrices of size $d_H \times d$. This yields outputs $y_j^h$,

$$y_j^h = \sum_i \frac{\exp\big((W_h^q x_j)^\top W_h^k x_i\big)}{\sum_{i'} \exp\big((W_h^q x_j)^\top W_h^k x_{i'}\big)}\, W_h^v x_i,$$

where $W_h^v$ is the value transformation matrix of size $d_H \times d$.

If we follow the same idea as in single-head attention, the corresponding GMM for head $h$ has centers $k_{hi} = W_h^k x_i$ and data positions $q_{hj} = W_h^q x_j$.

In order to convert the new data position to $y_j^h$, one difficulty is that $W_h^k$ is a down-projection matrix, and therefore the inverse of $W_h^k$ does not exist. To avoid the problem, one can use the same key transformation for all heads, $W^k$ of size $d \times d$. The query transformation $\tilde W_h^q$ is then a zero-padded matrix, also of size $d \times d$, whose rows are all zero except for the $d_H$ rows corresponding to head $h$, which contain $W_h^q$.

One can show that, if $k_i = W^k x_i$ and $\tilde q_{hj} = \tilde W_h^q x_j$, then

$$\tilde q_{hj}^\top k_i = (W_h^q x_j)^\top (W^k x_i)_h,$$

where $(\cdot)_h$ denotes the $d_H$ components corresponding to head $h$; this amounts to defining $W_h^k$ as the head-$h$ rows of $W^k$. Therefore, the corresponding GMM becomes one with centers $k_i$ and data positions $\tilde q_{hj}$, and the new data position can be related to $y_j^h$ by the transformation $W_h^v (W^k)^{-1}$.

Appendix B Experiment Details about the Multi-view Attention Model for VQA

Dataset and evaluation

The VQA-v2 (Goyal et al., 2017) dataset contains a training set (with 80k images and 444k QA pairs), a validation set (with 40k images and 214k QA pairs), and a test set (with 80k images and 448k QA pairs). For each question, there are 10 answers provided by different human annotators. Following the same setting as Pythia (Jiang et al., 2018), we augment the train set with a part of the validation set (train + val2train) and use the remaining validation data for validation (minival). The test set is split into test-dev and test-std, and the evaluation can only be conducted online. As in other work on VQA, we report a robust accuracy metric, computed as the average over all subsets of 9 of the 10 groundtruth answers, where the score of an answer $a$ on a subset is:

$$\mathrm{score}(a) = \min\left(\frac{\#\text{annotators in the subset that gave } a}{3},\ 1\right).$$
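This robust accuracy can be sketched in a few lines (a hypothetical helper for illustration; the benchmark ships its own official scorer, which additionally normalizes answer strings):

```python
def vqa_accuracy(answer, human_answers):
    """Average of min(#matches / 3, 1) over the 10 leave-one-out
    subsets of the 10 human answers (string normalization omitted)."""
    assert len(human_answers) == 10
    scores = []
    for left_out in range(10):
        subset = [a for i, a in enumerate(human_answers) if i != left_out]
        matches = sum(a == answer for a in subset)
        scores.append(min(matches / 3.0, 1.0))
    return sum(scores) / 10.0
```

For example, an answer given by exactly 3 of the 10 annotators receives an accuracy of 0.9 under this averaging.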

Detailed Model Descriptions

Our VQA model uses the Pythia architecture (Jiang et al., 2018) as a backbone. In order to combine the 100 features from each of the three object detection models, we use a one-layer attention mechanism as in (Yu et al., 2019). The features from one object-detector model are used as the primary features; the features of the second and third object detection models are designated as secondary features. We apply trainable transformations to the primary and secondary features to obtain queries and keys, respectively. However, we find that it is better to directly use the features as the values, without transformation. For the primary view, the output value of the $i$-th feature is

For each secondary view, the feature is computed as

For the $\mathcal{A}_{\mathrm{UN}}$ scheme,

For the $\mathcal{A}_{\mathrm{DN}}$ scheme,

The final output feature integrates the 100 features from the different views via an element-wise summation, followed by layer normalization.


During the hyperparameter tuning process, we train on the training set only and manually tune hyperparameters based on the accuracy on the validation set. We use the same model hyperparameters as the Pythia model. For the attention layer, we experiment with 1, 2, 4, and 8 heads, and find that single-head attention gives the best performance. We also performed a grid search over the dropout probability in the attention layer. The hybrid attention weight $w$ is initialized to 0.5. For optimization, we use the Adam optimizer. The training was done on 4 Cloud TPUs; the total training time is approximately 38 hours for each model. The validation performance on the minival dataset is reported in Table 4.

Method minival
3x100-boxes (no-attn baseline) 68.26
3x100-boxes $\mathcal{A}_{\mathrm{UN}}$ (attn baseline) 68.34
3x100-boxes $\mathcal{A}_{\mathrm{HN}}$ 68.99
Table 4: Validation accuracy on the VQA v2.0 minival splits.

Appendix C Experiment Details about Language Representation Learning

C.1 Downstream Evaluation Tasks


SQuAD is an extractive question answering dataset built from Wikipedia. The answers are segments from the context paragraphs and the task is to predict answer spans. We evaluate our models on two versions of SQuAD: v1.1 and v2.0. SQuAD v1.1 has 100,000 human-annotated question/answer pairs. SQuAD v2.0 additionally introduced 50,000 unanswerable questions. For SQuAD v1.1, we use the same training procedure as BERT, whereas for SQuAD v2.0, models are jointly trained with a span extraction loss and an additional classifier for predicting answerability (Yang et al., 2019; Liu et al., 2019). We report the results on the development set.


RACE is a large-scale dataset for multi-choice reading comprehension, collected from English examinations in China with nearly 100,000 questions. Each instance in RACE has 4 candidate answers. Following prior work (Yang et al., 2019; Liu et al., 2019), we use the concatenation of the passage, question, and each candidate answer as the input to models. Then, we use the representations from the “[CLS]” token for predicting the probability of each answer. The dataset consists of two domains: middle school and high school. We train our models on both domains and report accuracies on the development set.


GLUE (Williams et al., 2018) is comprised of 9 tasks, namely Corpus of Linguistic Acceptability (CoLA), Stanford Sentiment Treebank (SST), Microsoft Research Paraphrase Corpus (MRPC), Semantic Textual Similarity Benchmark (STS), Quora Question Pairs (QQP), Multi-Genre NLI (MNLI), Question NLI (QNLI), Recognizing Textual Entailment (RTE) and Winograd NLI (WNLI). It focuses on evaluating model capabilities for natural language understanding. The detailed per-task results on GLUE are available in Table 6.

C.2 Model Hyperparameters

Our pretraining uses the same default hyperparameters as the original BERT setup (Devlin et al., 2019). The total number of model parameters of the BERT model is about 334M. The total pretraining time is about 40 hours per job for the baseline attention, and around 48 hours for the doubly-normalized and hybrid variants. This amounts to roughly 20% overhead, which is much higher than our theoretical estimate. The reason is that our BERT pretraining used 64 TPUs, which are highly efficient at parallelizing large matmul ops; as a result, the runtime of two consecutive normalization steps over smaller tensors can be longer than a single-step matmul over a much larger tensor. We expect the relative overhead to be smaller with other types of processing units.

Hyperparameters for downstream tasks are shown in Table 5. These hyperparameters were copied from Lan et al. (2019), which were in turn adapted from Liu et al. (2019), Devlin et al. (2019), and Yang et al. (2019). We used the Adam optimizer for fine-tuning, as in Lan et al. (2019).

Task        LR        BSZ  DR   CDR  TS     WS    MSL
SQuAD v1.1  5.00E-05  48   0    0.1  3649   365   384
SQuAD v2.0  3.00E-05  48   0    0.1  8144   814   512
RACE        1.00E-05  32   0    0.1  12000  1000  512
CoLA        1.00E-05  16   0    0.1  5336   320   512
STS         2.00E-05  16   0    0.1  3598   214   512
SST-2       1.00E-05  32   0    0.1  20935  1256  512
MNLI        3.00E-05  128  0    0.1  10000  1000  512
QNLI        1.00E-05  32   0    0.1  33112  1986  512
QQP         5.00E-05  128  0.1  0.1  14000  1000  512
RTE         3.00E-05  32   0.1  0.1  800    200   512
MRPC        2.00E-05  32   0    0.1  800    200   512
WNLI        2.00E-05  16   0.1  0.1  2000   250   512
Table 5: Hyperparameters for language representation learning downstream tasks. LR: Learning Rate. BSZ: Batch Size. DR: Dropout Rate. CDR: Classifier Dropout Rate. TS: Training Steps. WS: Warmup Steps. MSL: Maximum Sequence Length.
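To illustrate how the TS and WS columns interact, a common fine-tuning schedule is linear warmup to the peak learning rate followed by linear decay to zero; whether this exact schedule was used here is an assumption, not something stated in the text.

```python
def learning_rate(step, peak_lr, warmup_steps, total_steps):
    """Linear warmup to peak_lr over warmup_steps, then linear decay
    to zero at total_steps."""
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    remaining = (total_steps - step) / max(1, total_steps - warmup_steps)
    return peak_lr * max(0.0, remaining)

# RACE row of Table 5: LR 1e-5, WS 1000, TS 12000.
lrs = [learning_rate(s, 1e-5, 1000, 12000) for s in (0, 500, 1000, 12000)]
```

Under this schedule the learning rate peaks exactly at the warmup-step boundary and reaches zero at the final training step.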
85.5±0.3  93.1±0.2  60.7±0.6  91.1±0.1  89.4±0.8  76.2±0.5  91.1±0.1  88.7±0.1  84.5±0.3
86.4±0.1  93.1±0.1  59.9±0.7  91.5±0.1  91.2±0.1  80.3±0.6  91.1±0.1  87.7±0.2  85.2±0.2
86.2±0.1  93.2±0.1  59.4±0.5  91.4±0.1  91.1±0.1  77.8±1.0  90.8±0.1  87.6±0.3  84.7±0.3
Table 6: Detailed results on GLUE downstream tasks; the last column is the average over the eight tasks.

Appendix D Experimental Details about Headline Generation

The Gigaword dataset (Graff and Cieri, 2003) consists of about 4M article-headline pairs. We pre-process this dataset as in Rush et al. (2015), which results in an average article length of 31.4 words and an average headline length of 8.5 words. We further tokenize the words into word-pieces (Devlin et al., 2019), which results in a vocabulary of 30,522 word-piece types. We use a 10k held-out set for validation, and the standard 2k test set (Rush et al., 2015) for evaluation.

Our backbone Transformer model is adapted from Goodman et al. (2019); it contains 12 layers, each with a hidden size of 768 and 12 attention heads. The total number of model parameters is about 108M. We truncate (or pad) the input and output sequences to a fixed number of word-piece positions, namely 128 encoder positions and 64 decoder positions, to accommodate hardware and model-architecture limitations. The hybrid attention weight is initialized to 0.1, because the headline generation task favors attention that can “explain away” unimportant neurons. We use an Adam optimizer (Kingma and Ba, 2015) and train for 500 steps. The training was done on Cloud TPU V3 with 16 TPUs for each job; the total training time is approximately 16 to 16.5 hours per model.
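The fixed-length truncation and padding to 128 encoder and 64 decoder positions can be sketched as below; the pad id of 0 is an assumption, as the actual word-piece pad id is not given in the text.

```python
def pad_or_truncate(token_ids, num_positions, pad_id=0):
    """Fix a word-piece id sequence to exactly num_positions entries."""
    padded = token_ids + [pad_id] * max(0, num_positions - len(token_ids))
    return padded[:num_positions]

encoder_ids = pad_or_truncate(list(range(140)), 128)  # truncated to 128
decoder_ids = pad_or_truncate([5, 7, 11], 64)         # padded to 64
```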

The ROUGE-L scores on the validation set are 45.75 and 45.63 for the two attention variants.

Appendix E Doubly-normalized Attention Alleviates Mode Collapse

The attention model tends to collapse modes: the data at different positions tend to move closer to each other after each attention update. We illustrate this collapsing effect with a 2-D example in Fig. 6, where two separated clusters of data converge to a single point after only 4 steps of the standard attention (left). Most multi-layer attention models, such as the Transformer, mitigate this collapsing effect by adding a residual connection, which pulls the data back toward its original position.
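This collapsing behavior is easy to reproduce numerically. The sketch below iterates plain (residual-free) attention updates on two unbalanced 2-D clusters with a Gaussian kernel between positions: the standard update normalizes scores over keys only, while the doubly-normalized update is assumed to normalize over queries first and then over keys. The cluster counts and centers follow the figure caption; the kernel temperature and cluster spread are our choices, not values from the figure.

```python
import numpy as np

SIGMA = 1.2  # kernel temperature: our choice, not from the figure

def gaussian_scores(x):
    # Pairwise squared distances between all positions, then a Gaussian kernel.
    d2 = ((x[:, None, :] - x[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * SIGMA ** 2))

def step_standard(x):
    # Standard update: normalize scores over keys only (softmax rows).
    s = gaussian_scores(x)
    return (s / s.sum(axis=1, keepdims=True)) @ x

def step_doubly(x):
    # Doubly-normalized update (assumed form): normalize over queries first,
    # so no key's total attention can vanish, then renormalize over keys.
    s = gaussian_scores(x)
    s = s / s.sum(axis=0, keepdims=True)
    return (s / s.sum(axis=1, keepdims=True)) @ x

rng = np.random.default_rng(0)
# Unbalanced clusters as in Fig. 6: 500 points near [1.8, 0.7], 50 near [-1, -1].
x0 = np.concatenate([rng.normal([1.8, 0.7], 0.1, size=(500, 2)),
                     rng.normal([-1.0, -1.0], 0.1, size=(50, 2))])

def center_gap(x):
    # Distance between the means of the two (index-identified) clusters.
    return np.linalg.norm(x[:500].mean(axis=0) - x[500:].mean(axis=0))

xa = xb = x0
for _ in range(4):
    xa, xb = step_standard(xa), step_doubly(xb)

# The standard update shrinks the inter-cluster gap far more than the
# doubly-normalized one after four steps.
print(f"standard: {center_gap(xa):.4f}  doubly-normalized: {center_gap(xb):.4f}")
```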

(Panels: the standard attention (left column) and the doubly-normalized attention (right column) at steps 0, 2, and 4.)
Figure 6: Mode-collapsing behavior on a mixture of two Gaussians. 500 data points (red) are centered at [1.8, 0.7] and the other 50 data points (blue) are centered at [-1, -1]; the two Gaussians have the same isotropic covariance. Four steps of self-attention are applied to the data points: in each step, the standard attention applies Eq. (3), while the doubly-normalized attention applies Eqs. (7) and (8), with the same kernel in both cases. After four steps, the standard attention (left) collapses to 1 cluster, while the doubly-normalized attention (right) maintains 2 clusters.

To compare the mode-collapsing effect of the standard and the doubly-normalized attention schemes analytically, we study a 1-D toy example which contains two clusters. One cluster contains $n_1$ data points centered at value $\mu_1$, and the other contains $n_2$ data points centered at value $\mu_2$. The distance between the two centers is $\Delta = \mu_2 - \mu_1$. Assuming the relative distance between the data points within each cluster is negligible compared to $\Delta$, the unnormalized attention weight between one center and the data from the other cluster is $\varepsilon = e^{-\Delta^2/2}$, and the weight between one center and the data within its own cluster is approximately 1. We compare the center distance between the two data clusters after applying the standard and the doubly-normalized self-attention updates.

Applying the Eq. (3) update of the standard attention scheme, the new positions of the two cluster centers are:
\[
\hat{\mu}_1 = \frac{n_1\mu_1 + n_2\varepsilon\,\mu_2}{n_1 + n_2\varepsilon},
\qquad
\hat{\mu}_2 = \frac{n_2\mu_2 + n_1\varepsilon\,\mu_1}{n_2 + n_1\varepsilon},
\]
and the distance between the two updated centers is:
\[
\hat{\Delta} = \hat{\mu}_2 - \hat{\mu}_1
= \frac{n_1 n_2\,(1-\varepsilon^2)}{(n_1+n_2\varepsilon)(n_2+n_1\varepsilon)}\,\Delta.
\]
Since we have that $0 < \varepsilon < 1$, defining $r = n_1/n_2$ then gives
\[
\frac{\hat{\Delta}}{\Delta} = \frac{r\,(1-\varepsilon^2)}{(r+\varepsilon)(1+r\varepsilon)} < 1.
\tag{13}
\]

By contrast, if we apply the Eq. (8) updates of the doubly-normalized attention scheme, the new positions of the two cluster centers are:
\[
\hat{\mu}_1 = \frac{\frac{n_1}{n_1+n_2\varepsilon}\,\mu_1 + \frac{n_2\varepsilon}{n_2+n_1\varepsilon}\,\mu_2}{\frac{n_1}{n_1+n_2\varepsilon} + \frac{n_2\varepsilon}{n_2+n_1\varepsilon}},
\qquad
\hat{\mu}_2 = \frac{\frac{n_2}{n_2+n_1\varepsilon}\,\mu_2 + \frac{n_1\varepsilon}{n_1+n_2\varepsilon}\,\mu_1}{\frac{n_2}{n_2+n_1\varepsilon} + \frac{n_1\varepsilon}{n_1+n_2\varepsilon}},
\]
and the distance between the two updated centers is:
\[
\hat{\Delta} = \hat{\mu}_2 - \hat{\mu}_1
= \frac{(r+\varepsilon)\,(b - \varepsilon a)}{ab}\,\Delta,
\quad\text{where } a = r + \varepsilon + r\varepsilon + r^2\varepsilon^2,\;\;
b = r + r\varepsilon + r^2\varepsilon + \varepsilon^2,\;\;
r = \frac{n_1}{n_2}.
\]
Since again $0 < \varepsilon < 1$, using $b - \varepsilon a = r(1+r\varepsilon)(1-\varepsilon^2)$ then yields
\[
\frac{\hat{\Delta}}{\Delta} = \frac{r\,(r+\varepsilon)(1+r\varepsilon)(1-\varepsilon^2)}{ab} < 1.
\tag{14}
\]
Moreover, since $ab \le \bigl[(r+\varepsilon)(1+r\varepsilon)\bigr]^2$, the ratio in Eq. (14) always upper-bounds the one in Eq. (13).
We plot the values of Eq. (13) and Eq. (14) on the $y$-axis against the cluster mass ratio $r = n_1/n_2$ on the $x$-axis, for several initial center distances $\Delta$; see Fig. 7. In both cases the distance between the two centers decays after the attention update. However, the center distance under the doubly-normalized attention always upper-bounds the one under the standard attention, with the gap growing as the cluster sizes become more unbalanced ($r$ far from 1). The same holds for the 2-D example in Fig. 6, where the standard attention collapses to a single cluster after 4 steps (left) while the doubly-normalized attention maintains two separate clusters (right).

Figure 7: Center distance after one step of the standard attention (blue solid curve) and the doubly-normalized attention (red dashed curve), as a function of the cluster mass ratio, for several initial distances between the centers.
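The one-step effect in this 1-D toy can also be checked numerically. The sketch below applies a single standard and a single doubly-normalized update to two point clusters; the Gaussian-kernel scores and the query-then-key order of the double normalization are our stand-ins for the attention equations.

```python
import numpy as np

def one_step(x, doubly_normalized):
    # Pairwise Gaussian-kernel scores between 1-D positions.
    s = np.exp(-((x[:, None] - x[None, :]) ** 2) / 2.0)
    if doubly_normalized:
        s = s / s.sum(axis=0, keepdims=True)  # normalize over queries first
    w = s / s.sum(axis=1, keepdims=True)      # then over keys (softmax-like)
    return w @ x

delta = 2.0  # initial distance between the two cluster centers
for n1 in (50, 250, 450):
    n2 = 500 - n1
    x = np.concatenate([np.zeros(n1), np.full(n2, delta)])
    gap = lambda y: abs(y[:n1].mean() - y[n1:].mean())
    un = gap(one_step(x, False)) / delta
    dn = gap(one_step(x, True)) / delta
    print(f"n1/n2 = {n1 / n2:.2f}: standard {un:.3f}, doubly-normalized {dn:.3f}")
```

Consistent with Fig. 7, both updates shrink the center distance, the two schemes coincide for balanced clusters, and the doubly-normalized update preserves more of the distance the more unbalanced the clusters are.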

The mode-collapse effect is even more apparent with multiple attention steps. In Fig. 8, when the two clusters are balanced (both contain 225 data points), the two normalization schemes yield similar results. However, when the two clusters are unbalanced (the red cluster contains 500 points and the blue one contains 50; Fig. 9), the standard attention collapses to a single cluster after 4 steps, while the doubly-normalized attention maintains two separate clusters.

(Panels: the two attention schemes at steps 0, 1, 2, and 4.)