Joint Source-Target Self Attention with Locality Constraints

Joint Source-Target Self Attention with Locality Constraints

José A. R. Fonollosa       Noe Casas        Marta R. Costa-jussà
Universitat Politècnica de Catalunya


The dominant neural machine translation models are based on the encoder–decoder structure, and many of them rely on an unconstrained receptive field over source and target sequences. In this paper we study a new architecture that breaks with both conventions. Our simplified architecture consists in the decoder part of a transformer model, based on self-attention, but with locality constraints applied on the attention receptive field.

As input for training, both source and target sentences are fed to the network, which is trained as a language model. At inference time, the target tokens are predicted autoregressively starting with the source sequence as previous tokens.

The proposed model achieves a new state of the art of 35.7 BLEU on IWSLT’14 German-English and matches the best reported results in the literature on the WMT’14 English-German and WMT’14 English-French translation benchmarks.

Joint Source-Target Self Attention with Locality Constraints

José A. R. Fonollosa       Noe Casas        Marta R. Costa-jussà Universitat Politècnica de Catalunya {jose.fonollosa,noe.casas,marta.ruiz}

1 Introduction

In Neural Machine Translation (NMT), the encoder–decoder architectural pattern has been ubiquitous: all the dominant NMT models have relied on such an architecture, including the sequence to sequence model (sutskever2014sequence; cho2014learning), its variant with attention (bahdanau2014seq2seqattn; luong2015attention), the Convolutional model (gehring2017conv), the Transformer model (vaswani2017transformer), and the Dynamic Convolution model (wu2018dynconv).

The encoder–decoder architectural pattern consists of two blocks: the encoder, which receives the source sentence as input and computes an embedded representation; and the decoder, which receives the output of the generator and is trained to generate the target sentence tokens conditioned also on the target tokens from previous positions.

he2018layerwise proposed an NMT model that does not have the encoder–decoder separation but learns joint source-target representations by means of an architecture that resembles a Transformer Language Model (LM).

In this work we propose to use the idea of joint source-target representations from he2018layerwise with added locality constrains (wu2018dynconv) to the receptive field of the self-attention layers (vaswani2017transformer).

The rest of the article analyzes this proposal following this structure: section 2 provides an overview of related work; section 3 describes the proposed approach in detail; section 4 describes the experimental setup, while the obtained results are described in section 5; finally, section 6 draws the final conclusions.

2 Related Work

Figure 1: Joint source-target architecture with local self-attention.

The dominant NMT architecture is the Transformer model proposed by vaswani2017transformer. It consists in an encoder–decoder architecture, where both encoder and decoder rely on multi-headed attention blocks. They can be either self-attention blocks —if they receive representations of a single side (source or target) as input—  or encoder–decoder attention blocks, where the key and values over which the attention is computed are the output of the encoder and the query that drives the weights is an embedded representation of the target sentence.

Another recently proposed encoder–decoder architecture that improves on the results of the Transformer is the Dynamic Convolution model (wu2018dynconv). It makes use of depthwise separable convolutions with kernels with dynamic weights that are softmax normalized along the temporal dimension. The width of the kernel (i.e. the receptive field over the temporal dimension) is progressively increased from the lower layers to the final ones. Note that other works also propose to force some notion of locality into to the attention receptive fields, like yang2018local who introduce a Gaussian local bias on the computed self-attention weights.

There have been previous attempts to break from the encoder–decoder paradigm and combine the purpose of both elements into a single block: in the work by elbayad2018pervasive, source and target sentence embeddings are tiled forming a grid structure over which 2D masked (causal) convolutions are applied in a stacked manner to obtain joint source-target representations from which to derive each next token in the target sequence.

(bapna2018transpattn) also proposed to profit from joint source and target representations by having the decoder attend to a combination of the outputs of all encoder layers instead of just the last one. Without these joint source-target representations the encoder did not benefit from increasing its number of layers, which suggest that joint representations can also be key to be able to train higher capacity models.

Finally, the work by he2018layerwise proposes to use the decoder part of a Transformer over the concatenation of source and target sequences, training it as a language model. The attention is properly masked so that generated target tokens attend to the whole source sequence and all previous target tokens. Source and target language embeddings are also added to the source and target token embedded vectors to help distinguish the two segments.

A similar architecture was also proposed by (lample2019cross) for Cross-lingual language model pretraining using pairs of parallel sentences. However, the proposed Translation Language Model (TLM) is only used for cross-lingual classification.

3 Joint Source-Target Self-Attention with Locality Constraints

The studied network architecture is based on the Transformer decoder, as proposed in (he2018layerwise). There is not an independent encoder module and the encoder-decoder attention mechanism is not used. The architecture consist only of Transformer self-attention blocks.

As shown in figure 1, in this architecture we provide as input to the network the concatenation of source and target sentence tokens. This design allows the network to learn joint source-target representations from the early layers.

Given that the processing takes place in batches, we prepare the source batch and the target batch separately, applying the appropriate padding in each case, and then we concatenate the source and target batches in the sequence dimension.

In order to make the network aware of the two sentences in different languages, we follow the approach of (he2018layerwise) and (lample2019cross). We add positional embeddings independently to the source and target (that is, starting the position at for source and target parts) as well as language embeddings. Language embeddings are learned during training, while for positional embedding we use the pre-computed sinusoidal variant from (vaswani2017transformer).

In the normal Transformer self-attention, the receptive field comprises all tokens for the source sequence and the previous tokens for the target sequence (i.e. to make it causal). We propose to apply a reduced receptive field, attending only to each token’s locality. This is similar to how convolutional kernels are applied locally in (wu2018dynconv). The receptive field adopted grows progressively from the initial layers to the last ones.

In order to implement such locally constrained attention, a masking approach is followed, similar to the causal masking normally used in the decoder of the Transformer model. In our case, the mask forms a band with the specified receptive field size as width. At the source side, the mask is centered at each token while, at the target side, the mask includes only the previous tokens in order to keep causality.

The loss is defined as a standard categorical cross-entropy applied only to the target output tokens, with the usual label smoothing with 0.1 weight for the uniform prior distribution over the vocabulary of Transformer-based architectures.

At inference time, the model works like any standard autoregressive LM, but in this case we use the source sentence tokens as starting point for the next token generation.

4 Experimental Setup

Model WMT’14 IWSLT’14
vaswani2017transformer 28.4 41.0 34.4
ahmed2018weighted 28.9 41.4 -
chen2018combining 28.5 41.0 -
shaw2018relative 29.2 41.5 -
ott2018scaling 29.3 43.2 -
wu2018dynconv 29.7 43.2 35.2
he2018layerwise 29.0 - 35.1
Joint Self-attention 29.7 43.2 35.3
Local Joint Self-attention 29.7 43.3 35.7
Table 1: Translation quality evaluation (BLEU scores).

In order to assess the translation quality of the proposed architectures experimentally and make the evaluation comparable to previous work, we ran experiments on standard benchmark datasets with the usual setups and evaluation protocols, including IWSLT’14 German-English, WMT’14 English-German and WMT’14 English-French.

For WMT’14 en-de, we use for training the preprocessed data released by vaswani2017transformer as part of tensor2tensor, which actually contains 4.5M sentence pairs from the WMT’16 training data, tokenized and byte-pair encoded (BPE) (sennrich2016bpe) with a joint source and target vocabulary of 32K tokens, while for evaluation we use newstest2014.

For WMT’14 en-fr, we use the preprocessing script included with the fairseq library111, which tokenizes, cleans and normalizes punctuation with the utilities from Moses (moses), and computes a joint source target BPE vocabulary of 40K tokens.

For IWSLT’14 de-en, we use the analogous fairseq script, but we extended the vocabulary to 31K joint source target BPE tokens as he2018layerwise. The script provides the usual lowercased data with 160K training sentence pairs.

In order to evaluate the translation quality, we use case-sensitive tokenized BLEU scores (papineni2002bleu), for WMT’14 en-de and WMT’14 en-fr. For WMT’14 en-de, we also performed compound splitting, like the works we are comparing to. For the computation of BLEU itself, we used sacrebleu (post2018sacrebleu). For the IWSLT’14 de-en we also present comparable tokenized BLEU results.

The hyperparameter configuration used for our experiments with the joint source-target self-attention for the IWSLT’14 de-en benchmark consists on 14 layers, with an embedding size of 256, feedforward expansion size of 1024 and 4 attention heads. For the version with locality constraints, the attention window sizes from input layers to output layers are 3, 5, 7, 9, 11, 13, 15, 17, 21, 25, 29, 33, 37, 41.

The configuration for WMT’14 en-de and en-fr also has 14 layers and the same hyperparameter values as the transformer-big setup from (vaswani2017transformer): an embedding size of 1024, feedforward expansion size of 4096 and 16 attention heads. The attention window sizes, from input layers to output layers are 7, 15, 31, 63, 63, 63, 63, 63, 63, 63, 63, 63, 63, 63.

In all configurations, the number of layers were chosen to have approximately the same number of total trainable parameters as the big Transformer model configuration used for each dataset.

The proposed NMT architecture was implemented on top of the fairseq library. The training parameters are based on (wu2018dynconv): Adam optimizer, batches of 500K source tokens for WMT benchmarks and 4K for IWSLT de-en, 30K training steps for WMT en-de, 80K steps for WMT en-fr and 85K for IWSLT de-en. The learning rate is linearly warmed for the first 10K steps up to a maximum of for IWSLT and for WMT benchmarks, followed by an inverse square root scheduler on IWSLT and a cosine rate with a single cycle on WMT. The source code to reproduce our results and pretrained models are available at

5 Results

Table 1 presents a comparison of the translation quality measured via BLEU score between the currently dominant Transformer (vaswani2017transformer) and Dynamic Convolutions (wu2018dynconv) models, as well as the work by he2018layerwise, which also proposes a joint encoder-decoder structure, and also other refinements over the transformer architecture like (ahmed2018weighted), (chen2018combining), (shaw2018relative) and (ott2018scaling) .

The entry Joint Self-attention corresponds to the results of our implementation of (he2018layerwise), that significantly improves the original results by 0.7 BLEU point on the WMT14 de-en benchmark, and 0.2 on IWSLT. The same architecture with the proposed locality constraints (Local Joint Self-attention) establishes a new state of the art in IWSLT’14 de-en with 35.7 BLEU, surpassing all previous published results by at least in 0.5 BLEU, and our results with the unconstrained version by 0.4.

The Joint Self-attention model obtains the same SoTA BLEU score of (wu2018dynconv) on WMT’14 en-de, and the same SoTA score of (ott2018scaling) and (wu2018dynconv) on WMT’14 en-fr. The local attention constraints do not provide a significant gain on these bigger models, but it improves the BLEU score on WMT’14 en-fr to a new SoTA of .

6 Conclusion

In this work we studied an NMT architecture that merges the classical encoder-decoder components into a single block that learns joint source-target representations starting from its initial layers and makes use of locality constraints over the attention receptive field.

Our experiments show that the joint source-target model with local attention achieve state of the art results on standard WMT benchmarks, and significantly improves the best published result on the IWSLT’14 de-en benchmark.


This work is partially supported by Lucy Software / United Language Group (ULG) and the Catalan Agency for Management of University and Research Grants (AGAUR) through an Industrial PhD Grant. This work is also supported in part by the Spanish Ministerio de Economía y Competitividad, the European Regional Development Fund and the Agencia Estatal de Investigación, through the postdoctoral senior grant Ramón y Cajal, contract TEC2015-69266-P (MINECO/FEDER,EU) and contract PCIN-2017-079 (AEI/MINECO).


Comments 0
Request Comment
You are adding the first comment!
How to quickly get a good reply:
  • Give credit where it’s due by listing out the positive aspects of a paper before getting into which changes should be made.
  • Be specific in your critique, and provide supporting evidence with appropriate references to substantiate general statements.
  • Your comment should inspire ideas to flow and help the author improves the paper.

The better we are at sharing our knowledge with each other, the faster we move forward.
The feedback must be of minimum 40 characters and the title a minimum of 5 characters
Add comment
Loading ...
This is a comment super asjknd jkasnjk adsnkj
The feedback must be of minumum 40 characters
The feedback must be of minumum 40 characters

You are asking your first question!
How to quickly get a good answer:
  • Keep your question short and to the point
  • Check for grammar or spelling errors.
  • Phrase it like a question
Test description