Contextual Lensing of Universal Sentence Representations


What makes a universal sentence encoder universal? The notion of a generic encoder of text appears to be at odds with the inherent contextualization and non-permanence of language use in a dynamic world. However, mapping sentences into generic fixed-length vectors for downstream similarity and retrieval tasks has been fruitful, particularly for multilingual applications. How do we manage this dilemma? In this work we propose Contextual Lensing, a methodology for inducing context-oriented universal sentence vectors. We break the construction of universal sentence vectors into a core, variable length, sentence matrix representation equipped with an adaptable ‘lens’ from which fixed-length vectors can be induced as a function of the lens context. We show that it is possible to focus notions of language similarity into a small number of lens parameters given a core universal matrix representation. For example, we demonstrate the ability to encode translation similarity of sentences across several languages into a single weight matrix, even when the core encoder has not seen parallel data.


1 Introduction

This work introduces a new framework from which to induce and use universal sentence vectors (USVs). Specifically, we focus on learning mappings of sentences into fixed-length vector representations applicable to a variety of downstream tasks focused on textual similarity, retrieval and analysis.

Recent work has been dominated by research on large-scale self-supervised pre-training and subsequent task-specific fine-tuning (Dai and Le, 2015; Howard and Ruder, 2018; Radford et al., 2018; Devlin et al., 2018). This pretrain-finetune paradigm has led to significant improvements in a wide range of NLP tasks, primarily static classification benchmarks. While early work in USVs had evaluated their utility on such tasks, task-specific fine-tuning has largely made this application of USVs less relevant. However, there are still plenty of use cases and applications where having USVs is appealing. One of the most successful examples of USVs has been LASER (Artetxe and Schwenk, 2019), a tool for learning language-agnostic sentence representations. LASER and related methods can be used to encode sentences across dozens of languages and have been utilized for mining parallel corpora (Artetxe and Schwenk, 2018; Guo et al., 2018) and for the creation of new parallel datasets for machine translation (Schwenk et al., 2019b). Language-agnostic USVs have also been used to analyze large-scale sentential similarity graphs across languages (Schwenk et al., 2019a). These and related tasks do not lend themselves naturally to the current pretrain-finetune paradigm on static benchmarks. The ability to flexibly encode language into vector representations often lends itself to creative and unconventional applications, as has been done in the space of word representations.

A problem arises, however, with respect to the ‘universal’ part of universal sentence vectors. What does it mean for a sentence vector to be ‘universal’? Initial constructions of USVs, such as skip-thoughts (Kiros et al., 2015), were motivated by the success of transfer learning in computer vision: a single task and a high-capacity encoder could result in a generic, internal representation of images that could then be adapted to arbitrary downstream tasks. However, empirical evidence has demonstrated that this setup is substantially less effective for language than for vision when compared to the recent pretrain-finetune paradigm.¹

This issue of universality is rooted in the inherent contextualization of language. In a dynamic, impermanent world, language as a tool to convey meaning necessarily requires context from which it can be interpreted.² The contexts of language use are infinite in nature, may or may not be accessible from text alone, and often have an implicit aspect to them as a function of the conversational or world scope. A USV seems to be at odds with this. If similarity is derived exclusively from the embedding of the sentence, how could this capture all the seemingly infinite contexts in which this sentence can be interpreted?

In this paper we propose a solution to this dilemma, called Contextual Lensing (CL). Contextual Lensing breaks up the construction of a USV into two components: a ‘universal’ component, as a function of a large self-supervised model and a ‘lens’ from which this universal representation is to be focused on as a function of context. Thus, a Lensed Universal Sentence Vector combines a generic variable-length matrix representation of a sentence with a lens from which a fixed-length vector representation can be derived. As a proof of concept, we focus on arguably the simplest instantiation of this framework. That is to use a contextualized word embedding matrix from BERT (Devlin et al., 2018) or other Muppet models as the universal matrix representation, and to learn a reduction operator with ‘lens parameters’ as a function of context to map this representation into a fixed-length vector. We consider Static Lensing, where the reduction operator is constructed using supervised learning. This is in contrast to Dynamic Lensing, where the lens parameters are adapted online, perhaps via FiLM (Perez et al., 2018) or hyper-networks (Ha et al., 2016) in settings outside direct supervised learning, such as Meta, Few-Shot or Reinforcement Learning. We leave these considerations for future work.

1.1 Additional Motivation

| Name | Contextualized Embeddings | Encoder | Lens Data | Training Module | Dimensionality |
|---|---|---|---|---|---|
| BERT BOW | BERT-Large | MeanPool | - | - | 1024 |
| sCL(BERT-Large; NLI) | BERT-Large | Simple | NLI | Classifier | 1024 |
| mBERT BOW | mBERT | MeanPool | - | - | 768 |
| CL(mBERT; NLI) | mBERT | GatedConv | NLI | Classifier | 1024 |
| CL(mBERT; $N$) | mBERT | GatedConv | Translate | Ranker | 1024 |
| sCL(mBERT; 100M) | mBERT | Simple | Translate | Ranker | 1024 |
| CL(mBERT; 100M) | mBERT | GatedConv | Translate | Ranker | 4096 |
| sCL(mBERT; 100M) | mBERT | Simple | Translate | Ranker | 4096 |

Table 1: List of contextualized USVs trained and evaluated in this work. The first group (first two rows) are English models. The second group (next two rows) are unsupervised multilingual models (no parallel data). The third group (last four rows) are translation models with varying numbers of parallel examples $N$ = {10k, 100k, 1M, 10M, 100M}. Dimensionality is the final sentence vector representation size.

Before elaborating on the details of our model, we provide three additional motivations for Contextual Lensing under the instantiation of contextual embeddings + reduction:

Ad-hoc Categories and Adaptation. Consider an example of an ad-hoc category (Barsalou, 1983). Given the following concepts: {pets, children, parents, works of art, explosive material}, how are they collectively related? A USV may assign a high relatedness score to children and parents but is unlikely to assign a high score to children and explosive material. Now suppose one was given the following information: ‘Things to pay attention to if there is a fire’. Conditioned on this new information, the relationship between the concepts becomes apparent almost immediately.

How could a neural network model this behaviour? Notice that the underlying meanings of each concept do not change before and after the new information is introduced. Instead, only the ‘lens’ from which we view the concepts changes. Under the lens introduced by the additional categorization, the relatedness of concepts becomes clear. Furthermore, to humans, this process happens almost instantaneously. It is unlikely that a neural network should have to fine-tune the representations of each concept as a function of the contextual category. Instead, this behaviour is better modelled as a modulation. The scores of each concept become modulated by the representation of the category. This example highlights two points: one, the existence of a universal core model that captures the underlying meaning(s) of the concepts and two, the existence of a fast, adaptive modulator that operates on the core model. This aligns with our instantiation of Contextual Lensing, albeit a simplified version.

Self-supervision as a Foundation. Large-scale self-supervised models provide a foundation from which additional operations can be applied. Empirical results demonstrate the effectiveness of predicting properties about the scope of the model as a means of learning useful internal representations of the world. This principle can also be applied with multiple modalities, such as the addition of images e.g. (Lu et al., 2019) as part of self-supervised learning with text. Models such as KERMIT (Chan et al., 2019b, a) based on the Insertion Transformer framework (Stern et al., 2019) are capable of modelling the joint distribution and all its factorizations under a single, unified self-supervised model. These have the potential to provide an excellent foundation from which USVs can be induced through Contextual Lensing.

Context Diversity. Language can be analyzed through seemingly infinitely many lenses. Some examples include semantic, syntactic, temporal, sociocultural and ad-hoc relatedness. Each of these can provide a new lens from which USVs can be induced. Such tools might be useful for e.g. retrieving semantically or syntactically similar sentences, inducing representations as a function of time and quickly adapting to ad-hoc categories. Furthermore, these mechanisms could potentially lead to useful tools for social scientists and other disciplines for interactive textual analysis.

1.2 Experimental Contributions

  • We lens BERT representations to the task of Natural Language Inference (NLI) for downstream English tasks. Our results are nearly as strong as SBERT (Reimers and Gurevych, 2019) but without any fine-tuning of BERT. On average across 9 downstream tasks, our results outperform all other existing non-BERT USVs.

  • We show it is possible to learn massively multilingual sentence vectors while encoding translation similarity into a single weight matrix.

  • We perform an extensive comparison to LASER on 100+ languages, demonstrating improved sentence matching performance on very low resource languages.

2 Related Work

Universal Sentence Vectors. Approaches to learn USVs can be classified into unsupervised or self-supervised methods (Kiros et al., 2015; Hill et al., 2016; Arora et al., 2017; Logeswaran and Lee, 2018), supervised (Conneau et al., 2017a; Reimers and Gurevych, 2019), multi-task (Cer et al., 2018; Subramanian et al., 2018), multi-lingual (Artetxe and Schwenk, 2019) and even random encoders (Wieting and Kiela, 2019). The LASER method of Artetxe and Schwenk (2019) learns vectors through a sequence-to-sequence model by predicting a translated sentence from a source sentence. LASER and SBERT (Reimers and Gurevych, 2019) are the most closely aligned with our work.

Contextualized word embeddings. Our method takes advantage of pre-computed contextualized word representations (Melamud et al., 2016; McCann et al., 2017; Peters et al., 2018). Initially these methods were used to augment or replace non-contextualized word embeddings. Nowadays it is more common to fine-tune the entire pipeline. Our work provides evidence that contextualized word representations from large-scale self-supervised models provide a foundation for learning sentence vectors.

Adaptor methods. Our proposed setup may also fall into a category of adaptor methods that aim to efficiently adapt pre-trained models to new tasks (Houlsby et al., 2019; Stickland and Murray, 2019) including few and zero-shot settings (Bansal et al., 2019; Puri and Catanzaro, 2019). The key difference between these methods and ours is the setting. While these methods focus on adaption to new classification tasks, we focus on adapting sentence representations to various contexts and arbitrary downstream tasks.

3 Method

We begin by defining notation. Let $w = (w_1, \ldots, w_T)$ denote a length-$T$ sequence according to a pre-defined subword vocabulary. Our goal is to map $w$ into a $K$-dimensional vector representation conditioned on two parameter sets: the base parameters $\theta_B$ and the lens parameters $\theta_L$. The base parameters are obtained from a pre-trained, self-supervised model such as BERT, while the lens parameters are learned as a function of some pre-defined context. We will describe how the context is defined and the lens parameters learned in a subsequent section. The final sentence embedding for $w$ is a composition of two functions $f$ and $g$ given by $g(f(w; \theta_B); \theta_L)$. The mapping $f$ returns a variable-length matrix $E \in \mathbb{R}^{T \times D}$ as the universal sentence representation of $w$, with one $D$-dimensional contextualized embedding $e_t$ per subword. The sentence encoder $g$ is a reduction operator that returns a fixed-length vector of size $K$ that is independent of $T$. The contextualized embeddings produced from the base model are obtained using the last layer. Other alternatives are also possible, such as ELMo-style re-weighting (Peters et al., 2018), though we do not consider these. We consider three types of sentence encoders $g$:

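Average Pooling. We consider a parameter-less mean pooling baseline given by $g(E) = \frac{1}{T} \sum_{t=1}^{T} e_t$, where $e_t$ is the contextualized embedding of the $t$-th subword and $T$ is the sequence length. This baseline is used in all of our experiments as a ‘uniform’ sentence embedding to gauge how a naive sentence encoder performs independent of any context.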

Simple (Single Layer). The simple encoder has only a single parameter matrix $W \in \mathbb{R}^{K \times D}$, where $D$ is the dimensionality of the contextualized embeddings and $K$ is the sentence vector size. For each timestep $t$, let $h_t$ denote the transformation of each contextualized word embedding $e_t$ given by $h_t = \sigma(W e_t)$, where $\sigma$ is a non-linearity such as ReLU. This results in a transformed embedding matrix $H = (h_1, \ldots, h_T)$. The final $K$-dimensional representation is obtained by max-pooling as $g(E) = \max_t h_t$, taken component-wise over timesteps. Note that the use of maxpool has been empirically shown by Conneau et al. (2017a) and Kiros and Chan (2018) to significantly outperform other pooling operations when learning sentence embeddings for downstream tasks, provided there are learnable parameters before the operation.
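The two reductions described so far can be sketched in a few lines of plain Python. This is an illustrative sketch only, not the implementation used in our experiments: the names `mean_pool`, `simple_encoder`, `E` and `W` are hypothetical, the embeddings are toy values rather than BERT outputs, and ReLU is assumed as the non-linearity.

```python
def mean_pool(E):
    """Parameter-free baseline: average the T contextualized embeddings."""
    T = len(E)
    return [sum(col) / T for col in zip(*E)]


def simple_encoder(E, W):
    """Simple lens sketch: per-token linear map, ReLU, then max-pool over time.

    E: T x D list of contextualized embeddings (output of the frozen base model).
    W: K x D weight matrix (the only lens parameters in this sketch).
    """
    H = [[max(0.0, sum(w * e for w, e in zip(row, e_t))) for row in W]
         for e_t in E]
    # Component-wise max over the T timesteps gives a K-dimensional vector.
    return [max(col) for col in zip(*H)]


E = [[1.0, -2.0], [3.0, 0.5]]                 # toy input: T=2 tokens, D=2
W = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]    # toy lens: K=3
print(mean_pool(E))          # [2.0, -0.75]
print(simple_encoder(E, W))  # [3.0, 0.5, 1.0]
```

Both functions return a vector whose size does not depend on the number of tokens, which is the property the reduction operator needs.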

GatedConv (Self-Attention). The above sentence encoder treats each contextualized word embedding equally. For certain choices of context, this is likely sub-optimal. As a final choice of sentence encoder, we consider a simplified form of the encoder of InferLite (Kiros and Chan, 2018), which utilizes a fine-grained gating mechanism. Depending on the context, certain features may benefit from either being up-weighted or down-weighted so as to only highlight the relevant aspects of the contextualized embeddings for the given context. The original InferLite encoder takes multiple embedding types as input. We consider the special case of a single embedding type. It is possible that our method can generalize to multiple types of contextualized embeddings.

This encoder has four components: an encoder, a controller, a fusion layer and a reduction. In the special case of a single embedding matrix, the encoder and controller take similar forms: each is a stack of convolutional layers in which the $l$-th layer applies a convolution (denoted $*$) followed by a non-linearity to the output of layer $l-1$. The controller has an identical structure to the encoder. The only difference is that the non-linearity on the last layer of the controller is a sigmoid while on the encoder it is tanh. This structure mimics the gated convolutional layers of van den Oord et al. (2016), Dauphin et al. (2016) and Gehring et al. (2017). The fusion layer computes a weighted combination plus a skip connection, and a final reduction maps the result to a fixed-length vector. The GatedConv encoder is essentially a form of self-attention layer.

3.1 Relatedness lists

| Model | MR | CR | SUBJ | MPQA | SST | TREC | MRPC | STS-B | SICK-R | $\Delta$ |
|---|---|---|---|---|---|---|---|---|---|---|
| BERT CLS (Reimers and Gurevych, 2019) | 78.68 | 84.85 | 94.21 | 88.23 | 84.13 | 91.4 | 71.13 | 16.50 | 42.63 | -5.28 |
| Glove BOW (Conneau et al., 2017a) | 77.25 | 78.30 | 91.17 | 87.85 | 80.18 | 83.0 | 72.87 | 58.02 | 53.76 | -1.88 |
| BERT BOW (Reimers and Gurevych, 2019) | 78.66 | 86.25 | 94.37 | 88.66 | 84.40 | 92.8 | 69.45 | 46.35 | 58.40 | - |
| InferSent (Conneau et al., 2017a) | 81.57 | 86.54 | 92.50 | 90.38 | 84.18 | 88.2 | 75.77 | 68.03 | 65.65 | 3.72 |
| USE (Cer et al., 2018) | 80.09 | 85.19 | 93.98 | 86.70 | 86.38 | 93.2 | 70.14 | 74.92 | 76.69 | 5.33 |
| SBERT (Reimers and Gurevych, 2019) | 84.88 | 90.07 | 94.52 | 90.33 | 90.66 | 87.4 | 75.94 | 79.23 | 73.75 | 7.50 |
| sCL(BERT-Large; NLI) | 83.40 | 89.32 | 93.49 | 89.16 | 89.35 | 91.8 | 76.75 | 73.75 | 72.08 | 6.64 |

Table 2: Comparison of USVs on 9 downstream tasks. The first seven tasks are evaluated by training a logistic regression classifier directly on top of the sentence vectors. Performance is accuracy. The last two tasks report Spearman correlation of unsupervised textual similarity. $\Delta$ indicates the mean improvement over all tasks with respect to BERT BOW. Results that are not ours are obtained from Reimers and Gurevych (2019).

So far we have defined the form of the sentence encoders but still need to define how the parameters are learned given a pre-defined context. We consider a special class of contexts which can be expressed as pairs of relatedness or similarity between lists of sentences. For example, in the context of translation similarity, two sentences are similar under this context if one is a translation of the other (parallel). In the case of Natural Language Inference, similarity is given by whether two sentences are entailed, neutral or contradictory. Of course, not all contexts are easily defined this way. However, these act as a proof of concept for the general method. We consider two forms of training modules from which these relatedness lists can be learned: a Classifier and a Ranker. In both cases, pairs of sentences are encoded by a Siamese network. For the Classifier, we follow standard practice and produce a single vector as the concatenation of the two vectors along with their component-wise product and absolute difference. This is then fed to a two-layer neural network with a softmax layer to predict a pairwise class label. The corresponding errors are backpropagated through the sentence encoders up to the contextualized embeddings. For the Ranker, we use the margin-based loss of VSE++ (Faghri et al., 2017), which takes a max over contrastive examples. Contrastive examples are taken from the same minibatch and no special effort is made to mine hard negatives.
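To make the two training modules concrete, the following is a minimal sketch of the Classifier's pair features and of a max-of-hinges ranking loss over in-batch negatives. It is not our training code: the function names are hypothetical, cosine similarity as the scoring function is an assumption, and the default margin of 0.2 is simply one value from our search space.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def pair_features(u, v):
    """Classifier input: concatenation [u; v; u * v; |u - v|]."""
    return (list(u) + list(v)
            + [a * b for a, b in zip(u, v)]
            + [abs(a - b) for a, b in zip(u, v)])


def max_hinge_loss(src, tgt, margin=0.2):
    """Max-of-hinges ranking loss (VSE++ style) with in-batch negatives.

    src[i] and tgt[i] embed a related pair; all other items in the batch
    act as contrastive examples. Cosine scoring is an assumption here.
    """
    n = len(src)
    sim = [[cosine(s, t) for t in tgt] for s in src]
    total = 0.0
    for i in range(n):
        pos = sim[i][i]
        # Hardest in-batch negative in each direction.
        neg_t = max((sim[i][j] for j in range(n) if j != i), default=float("-inf"))
        neg_s = max((sim[j][i] for j in range(n) if j != i), default=float("-inf"))
        total += max(0.0, margin - pos + neg_t) + max(0.0, margin - pos + neg_s)
    return total / n


src = [[1.0, 0.0], [0.0, 1.0]]
tgt = [[1.0, 0.0], [0.0, 1.0]]
print(pair_features([1.0, 2.0], [3.0, -1.0]))
print(max_hinge_loss(src, tgt))        # aligned pairs: no violation
print(max_hinge_loss(src, tgt[::-1]))  # misaligned pairs: large loss
```

Taking the max over negatives, rather than the sum, means only the most confusable in-batch example contributes to the gradient for each pair.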

3.2 Combinations

With the above we can mix and match different contextualized embeddings, sentence encoders and contexts from which to learn the lens parameters. Table 1 highlights the combinations considered in this work with the corresponding notation used.

4 Experiments

| Hyper-parameter | Search space |
|---|---|
| Batch size | {128, 256, 512} |
| Warmup steps | {1k, 2k, 4k, 8k, 16k} |
| Embedding dropout rate | {0, 0.1, 0.2} |
| Gating layer size | {128, 256, 512} |
| Hidden layer size (Classifier) | {256, 512, 1024} |
| Margin (Ranker) | {0.1, 0.2, 0.3} |

Table 3: Hyper-parameter ranges evaluated by random search.
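A random search over this space can be sketched as follows. The dictionary keys and the `sample_config` helper are hypothetical names, and uniform independent sampling of each hyper-parameter is an assumption; the values themselves are those listed in Table 3.

```python
import random

SEARCH_SPACE = {
    "batch_size": [128, 256, 512],
    "warmup_steps": [1_000, 2_000, 4_000, 8_000, 16_000],
    "embedding_dropout": [0.0, 0.1, 0.2],
    "gating_layer_size": [128, 256, 512],
    "classifier_hidden_size": [256, 512, 1024],
    "ranker_margin": [0.1, 0.2, 0.3],
}


def sample_config(rng):
    """Draw one configuration, choosing each value uniformly at random."""
    return {name: rng.choice(values) for name, values in SEARCH_SPACE.items()}


rng = random.Random(0)  # fixed seed so the trial list is reproducible
trials = [sample_config(rng) for _ in range(5)]
for cfg in trials:
    print(cfg)
```

Each sampled configuration would then be trained and scored on the validation metric, with the best one retained.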

We perform experiments across five evaluation settings. Each setting aims to analyze a particular aspect of our models as well as compare to relevant work, namely Sentence BERT (Reimers and Gurevych, 2019) and LASER (Artetxe and Schwenk, 2019).

4.1 Learning Lensed Sentence Vectors

The overview of our experimental protocol is as follows. We first train sentence encoders as described in Table 1. After training is complete, the composition of contextualized embeddings with their corresponding reduction module is used to map text into fixed-length vectors. These vectors are then directly evaluated on a suite of downstream tasks, similar to other work on universal sentence vectors. No additional fine-tuning is done in any experiment. We use the last layer of publicly available BERT-Large and Multilingual BERT (mBERT) models for contextualized embeddings.

Models that lens with NLI are trained on the concatenation of SNLI (Bowman et al., 2015) and MultiNLI (Williams et al., 2018). For translation lensing, we follow the instructions of Artetxe and Schwenk (2019) to reproduce the corpus used for training LASER. This corpus consists of data from Europarl, United Nations Parallel Corpus, OpenSubtitles2018, Global Voices, Tanzil and Tatoeba, all available from OPUS.³ See the appendix of Artetxe and Schwenk (2019) for complete details. For Tatoeba, we verify that none of our training sentences appear in the official test set. After collecting this corpus, we shuffle the dataset and create chunks of size 10k, 100k, 1M, 10M and 100M, representing five orders of magnitude of parallel data. This data is then used for learning the translation-based sentence encoder.

We train models with both Simple and GatedConv encoders and dimensionalities of 1024 and 4096. All of our models are trained on a single GPU each, with hyper-parameters chosen by random search over the space described in Table 3. All models are trained using Adam (Kingma and Ba, 2015) with a learning rate warmup schedule (Vaswani et al., 2017). The sentence embedding size determines the initial learning rate. We validate matching errors on a small set of WMT news datasets. For models trained with 100M sentences, we perform just a single pass through the dataset. For NLI-based models, we early stop based on development set performance. No ‘peeking’ at downstream tasks is done; we exclusively use these validation metrics to choose models for all subsequent experiments.
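For reference, the warmup schedule of Vaswani et al. (2017) can be written as a short function. In this sketch we assume that the sentence embedding size plays the role of the model dimension in that schedule; the function name `warmup_lr` and the example values are illustrative only.

```python
def warmup_lr(step, embed_size, warmup_steps):
    """Linear warmup followed by inverse-square-root decay (Vaswani et al., 2017).

    Assumption: embed_size stands in for the model dimension of the
    original schedule, so larger sentence vectors get a smaller rate.
    """
    step = max(step, 1)  # avoid 0 ** -0.5 at the very first step
    return embed_size ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)


peak = warmup_lr(4000, 1024, 4000)
for step in (1, 1000, 4000, 16000):
    print(step, warmup_lr(step, 1024, 4000))
```

The rate rises linearly until `warmup_steps`, peaks there, and then decays with the inverse square root of the step count.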

4.2 English Downstream Evaluations

Our first set of experiments compares NLI-lensed sentence embeddings against existing universal sentence vectors. In particular, we focus on comparison with Sentence BERT (SBERT) on a suite of SentEval benchmarks (Conneau and Kiela, 2018). The SBERT framework here is essentially identical to ours with the exception that their model fine-tunes the BERT parameters. Thus, we can control other conditions (such as NLI training) and directly analyze what effect fine-tuning all of BERT has on downstream performance vs frozen contextualized embeddings. Other work such as Peters et al. (2019) perform extensive analysis of ‘hot’ vs ‘cold’ settings but to our knowledge we are the first to explore this for universal sentence vectors. We refer the reader to Reimers and Gurevych (2019) for a more detailed description of experimental setup as ours is identical to theirs.

Results of these experiments are in Table 2. Here we make two observations. First, on aggregate our results outperform existing non-BERT universal sentence encoders while only learning a single weight matrix on top of the contextualized embeddings. Second, while SBERT outperforms our model on aggregate, the difference in $\Delta$ (7.50 vs 6.64) is small relative to the improvement both models achieve over a basic BERT BOW baseline. This demonstrates that the gain from fine-tuning all of BERT for these tasks is minimal and that much of the performance improvement over a naive BERT BOW baseline can be achieved by adapting fixed contextualized embeddings. We note that Reimers and Gurevych (2019) observed a similar quirk comparing BERT vs RoBERTa (Liu et al., 2019). While fine-tuning RoBERTa on static tasks substantially improves over BERT, the gains are minimal for downstream sentence evaluations. Taken together, these results suggest that the fine-tune paradigm that is so crucial for static evaluations may not directly carry over to learning USVs.
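As a worked check of the aggregate metric, the mean improvement over BERT BOW can be recomputed directly from the rows of Table 2. The variable and function names below are illustrative; the scores are copied from the table.

```python
# Scores on the 9 downstream tasks, in the column order of Table 2.
BERT_BOW = [78.66, 86.25, 94.37, 88.66, 84.40, 92.8, 69.45, 46.35, 58.40]
SCL_NLI = [83.40, 89.32, 93.49, 89.16, 89.35, 91.8, 76.75, 73.75, 72.08]


def mean_improvement(scores, baseline):
    """Average per-task difference between a model and the baseline."""
    return sum(s - b for s, b in zip(scores, baseline)) / len(baseline)


delta = mean_improvement(SCL_NLI, BERT_BOW)
print(round(delta, 2))  # 6.64
```

Most of this average comes from the two unsupervised similarity tasks, where the BERT BOW baseline is weakest.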

4.3 Matching High-Resource Languages

The remainder of our experiments are done with multilingual data. For these experiments, we analyze several of our models on matching parallel sentences on 6 languages: English, French, Spanish, German, Czech and Russian. We use WMT newstest2012 as our test corpus which has 6-way parallel data allowing us to experiment with all language pair combinations. For each sentence in the source language, we search for the nearest sentence by cosine similarity. If it is a translation, it is considered a hit. Otherwise it is a miss. We report the average match error.
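The matching evaluation can be sketched as a nearest-neighbour search under cosine similarity. This is a toy illustration with hypothetical names and two-dimensional vectors, not the evaluation script used for the reported numbers.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def match_error(src, tgt):
    """Fraction of source sentences whose nearest target is not their translation.

    src[i] and tgt[i] are assumed to embed a parallel sentence pair.
    """
    misses = 0
    for i, s in enumerate(src):
        nearest = max(range(len(tgt)), key=lambda j: cosine(s, tgt[j]))
        misses += nearest != i
    return misses / len(src)


tgt = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
src = [[0.9, 0.1], [0.2, 0.8], [1.0, 0.1]]  # third source drifts toward tgt[0]
print(match_error(src, tgt))  # one miss out of three
```

For the 6-way parallel test set, this error would be computed for every ordered language pair and then averaged.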

Figure 1: Error rates from searching for the correct ground-truth translation across 6 languages for WMT newstest2012. Panels: (a) English, (b) French, (c) Spanish, (d) German, (e) Czech, (f) Russian. Standard deviation is computed over all other languages from the source/target. Averaged error rates from LASER are included.

Figure 1 shows results on each language using 6 of our sentence encoders with varying amounts of parallel data. We also plot LASER performance. Observe that as we increase the amount of parallel data, performance improves and converges to within 0.5-1.0 absolute error of LASER. Also observe that unsupervised encoders (no parallel data) are able to achieve mean errors between 20% and 40%. While substantially worse than parallel models, these results add to a growing body of work on the implicit translation abilities of large monolingual self-supervision (Pires et al., 2019).

Figure 2: Relative comparisons of different model types: (a) mBERT mean-pooled embeddings against NLI (Avg - NLI), (b) GatedConv encoder against Simple encoder, (c) 1024-dimensional vectors against 4096-dimensional vectors, and (d) dense vectors against sparse binary vectors. Positive numbers indicate the right argument performs better while negative numbers indicate the left argument performs better.

We also perform relative comparisons between different components of our models. In Figure 2 we analyze 4 comparisons: unsupervised models (BERT BOW vs NLI), encoder type (GatedConv vs Simple), sentence embedding dimensionality (1024 vs 4096) and embedding type (Dense vs Sparse binary). Binary representations are obtained by thresholding at 1.0, resulting in a vector with 2.5% active units. This works due to the ReLU+MaxPool combination of the sentence encoder. See Shen et al. (2019) for an extensive analysis of binary sentence embeddings.
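The binarization step is simple enough to sketch directly. The helper names are hypothetical, the input vector is a toy example, and treating values strictly greater than the threshold as active is an assumption.

```python
def binarize(vec, threshold=1.0):
    """Map a non-negative dense vector to a sparse binary code."""
    return [1 if x > threshold else 0 for x in vec]


def active_fraction(bits):
    """Share of units that are switched on."""
    return sum(bits) / len(bits)


dense = [0.0, 0.3, 1.7, 0.9, 2.4, 0.0, 0.0, 1.0]  # toy ReLU + max-pool output
bits = binarize(dense)
print(bits)                   # [0, 0, 1, 0, 1, 0, 0, 0]
print(active_fraction(bits))  # 0.25
```

Because ReLU followed by max-pooling yields non-negative activations, a single positive threshold is enough to separate the few strongly active units from the rest.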

4.4 Mining Parallel Sentences

| Language | mBERT? | LASER | Ours |
|---|---|---|---|
| Algerian Arabic | | 60.5 | 71.5 |
| Asturian | | 13.8 | 7.5 |
| Awadhi | | 63.9 | 62.1 |
| Cebuano | | 84.3 | 81.8 |
| Chamorro | | 70.8 | 65.7 |
| Egyptian Arabic | | 31.1 | 45.5 |
| Faroese | | 28.4 | 25.4 |
| Gaelic; Scottish Gaelic | | 96.3 | 84.3 |
| Javanese | | 77.1 | 49.5 |
| Kashubian | | 56.7 | 48.8 |
| Mongolian | | 91.8 | 47.7 |
| North Moluccan Malay | | 49.1 | 44.7 |
| Novial | | 34.1 | 25.1 |
| Nynorsk Norwegian | | 11.7 | 8.3 |
| Old English | | 62.3 | 39.9 |
| Pampangan; Kapampangan | | 94.1 | 90.1 |
| Piemontese | | 50.4 | 22.1 |
| Russian old | | 71.9 | 64.7 |
| Sorbian Lower | | 52.0 | 42.3 |
| Sorbian Upper | | 45.5 | 36.5 |
| Swabian | | 54.0 | 34.8 |
| Swiss German | | 55.6 | 57.7 |
| Talossan | | 55.3 | 44.2 |
| Turkmen | | 79.3 | 69.2 |
| Waray | | 86.4 | 79.5 |
| Welsh | | 91.4 | 52.9 |
| Western Frisian | | 48.3 | 16.8 |
| Xhosa | | 91.6 | 84.2 |
| Yiddish | | 94.3 | 94.5 |

Table 4: Match errors for languages for which there is no parallel data. For each language, we also indicate whether the language was included for self-supervision as part of multilingual BERT.

| Model | Train DE | Train FR | Train RU | Train ZH | Test DE | Test FR | Test RU | Test ZH | Mean(threshold) | Std(threshold) |
|---|---|---|---|---|---|---|---|---|---|---|
| Schwenk (2018) | 76.1 | 74.9 | 73.3 | 71.6 | 76.9 | 75.8 | 73.8 | 71.6 | - | - |
| Azpeitia et al. (2018) | 84.27 | 80.63 | 80.89 | 76.45 | 85.52 | 81.47 | 81.30 | 77.45 | - | - |
| LASER (Artetxe and Schwenk, 2019) | 95.43 | 92.40 | 92.29 | 91.20 | 96.19 | 93.91 | 93.30 | 92.27 | - | - |
| mBERT BOW | 54.47 | 50.89 | 39.42 | 39.57 | - | - | - | - | 1.11 | 0.012 |
| CL(mBERT; NLI) | 59.00 | 59.46 | 47.11 | 41.12 | - | - | - | - | 1.13 | 0.012 |
| CL(mBERT; 100M) | 88.24 | 85.11 | 86.18 | 82.15 | - | - | - | - | 1.103 | 0.004 |
| sCL(mBERT; 100M) | 89.98 | 86.79 | 87.15 | 84.91 | - | - | - | - | 1.130 | 0.007 |
| CL(mBERT; 100M) | 89.84 | 87.14 | 88.00 | 86.05 | 90.24 | 88.54 | 89.25 | 86.70 | 1.132 | 0.011 |

Table 5: F1 scores for parallel data mining on the BUCC 2018 training and test sets. The first four language columns are training while the last four are testing. The middle group (mBERT BOW and CL(mBERT; NLI)) performs unsupervised mining. For training data we report oracle scores with the corresponding mean and standard deviation of thresholds. For test data, we report results using the best threshold found on the training sets.

In this experiment, we consider the task of parallel sentence mining. Given two monolingual corpora in different languages, the goal is to return (hard mine) pairs of sentences which are direct translations of each other while ignoring all other pairs. We use the BUCC 2018 corpora (Zweigenbaum et al., 2018) for this task, consisting of a target English corpus and source data in German, French, Russian and Chinese. The metric is F1 score. A naive way to perform this task is just to compute all-pairs cosine similarity and then choose a threshold. However this often performs poorly due to the hubness problem. Instead, we use a type of normalized similarity that mitigates the hubness problem by re-weighting scores based on scores of its nearest neighbours. To do this, we use margin-based scoring of Artetxe and Schwenk (2018):

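$$\mathrm{score}(x, y) = \mathrm{margin}\Big(\cos(x, y),\ \sum_{z \in \mathrm{NN}_k(x)} \frac{\cos(x, z)}{2k} + \sum_{z \in \mathrm{NN}_k(y)} \frac{\cos(y, z)}{2k}\Big)$$

with the ratio ($\mathrm{margin}(a, b) = a / b$) score, a generalization of CSLS proposed by Conneau et al. (2017b). Here, $x$ and $y$ are two sentences and $\mathrm{NN}_k(x)$ are the $k$-nearest neighbours of $x$ in the other language. We use a fixed value of $k$ in all experiments. For unsupervised models, we report ‘oracle’ results using the best threshold on the training set. This is to our knowledge the first attempt at unsupervised parallel sentence mining.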
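The ratio-margin score of Artetxe and Schwenk (2018) can be sketched as follows: the cosine similarity of a candidate pair is divided by the average similarity of each sentence to its nearest neighbours in the other language. The function names, the toy pools and the choice of `k=2` are illustrative assumptions, and the brute-force neighbour search is only suitable for small examples.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def knn_mean(x, candidates, k):
    """Mean cosine similarity between x and its k nearest candidates."""
    sims = sorted((cosine(x, c) for c in candidates), reverse=True)[:k]
    return sum(sims) / k


def ratio_margin_score(x, y, src_pool, tgt_pool, k):
    """cos(x, y) over the averaged neighbourhood similarity of x and y.

    Neighbours of the source sentence x are searched in tgt_pool and
    neighbours of the target sentence y in src_pool.
    """
    denom = (knn_mean(x, tgt_pool, k) + knn_mean(y, src_pool, k)) / 2
    return cosine(x, y) / denom


src_pool = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]]
tgt_pool = [[1.0, 0.1], [0.1, 1.0], [0.7, 0.7]]
x = src_pool[0]
scores = [ratio_margin_score(x, y, src_pool, tgt_pool, k=2) for y in tgt_pool]
print(scores)
```

A score above 1 means the pair is more similar than either sentence's typical neighbourhood, which is why a single threshold slightly above 1 can transfer across language pairs.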

Our results are reported in Table 5. First observe the mean and standard deviation of the threshold parameters. They are nearly identical across all languages, similar to a result observed by Artetxe and Schwenk (2019). This is even true for unsupervised models, meaning one could tune the threshold on high-resource pairs and use the same model for mining language pairs without seeing any parallel data. Again, we see strong performance from our Simple encoder, which only learns a single weight matrix. In this experiment, we notice a more substantial performance difference between our best results and LASER. We made a single submission of our best model to Zweigenbaum et al. (2018) to be evaluated on a held-out test set.

4.5 Matching on 100+ Languages

So far we have only explored experiments on high-resource languages. We next consider matching experiments on the Tatoeba test dataset, consisting of 112 languages. The evaluation is identical to before: we take cosine similarities across language pairs and compute mean matching errors based on whether we retrieve the correct sentence translation. In Figure 3 we report results as a function of language families and scripts for a number of our models compared to LASER. In Table 4 we consider languages for which there is no parallel data.

Figure 3: Match errors for different models across language groupings on Tatoeba. All indicates all languages. The last three groups are scripts. All others are language families. A * next to the method indicates that it uses a 4096-dimensional vector.

Across all languages, we obtain a similar overall match error to LASER. On some families LASER is better while on others ours is. Perhaps most interesting are the results across languages with no parallel data. Here we observe substantial gains over LASER, noting which languages were also available to mBERT. Taken together, we observe that (a) LASER is better for high-resource languages, (b) results are mixed across certain language families and (c) ours is better for very low-resource languages. The latter highlights the usefulness of contextualized embeddings vs learning all parameters of the sentence encoder directly from parallel data.

4.6 Visualization of Language Embeddings

Figure 4: t-SNE of language vectors colored by language family. No parallel data is used for these embeddings. Best seen electronically.

As a final experiment, we qualitatively analyze unsupervised language embeddings derived from Tatoeba test sentences. For test sentences in each language, we first normalize each sentence embedding to have unit length, then compute the variance across all sentence embeddings in the corresponding language. This is to derive an embedding that is independent of the particular sentences from that language. We then apply t-SNE (Maaten and Hinton, 2008) to obtain a two-dimensional visualization of these language embeddings. This is done using the CL(mBERT; NLI) model, which has no access to parallel data. The visualization is in Figure 4. Each tick is color coded to its language family. This is for visualization only; the model has no access to this information. We observe a surprising amount of family relatedness, indicating that our embeddings can capture high-level properties of the underlying language families. A similar visualization is also done by Kudugunta et al. (2019) under different settings. We note that we also created a similar visualization with models that had access to parallel data and observed similar outputs. This suggests that high-level language relatedness was primarily learned through self-supervision with minimal effects from the addition of parallel data.
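The construction of a language embedding from sentence embeddings can be sketched as below. The names are hypothetical, the vectors are toy values, and the use of the population variance (dividing by the number of sentences) is an assumption; the t-SNE projection is omitted.

```python
import math


def l2_normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def language_embedding(sentence_vectors):
    """Per-dimension variance of unit-length sentence vectors for one language."""
    unit = [l2_normalize(v) for v in sentence_vectors]
    n = len(unit)
    columns = list(zip(*unit))
    means = [sum(col) / n for col in columns]
    # Population variance is assumed here; a sample variance would divide by n - 1.
    return [sum((x - m) ** 2 for x in col) / n for col, m in zip(columns, means)]


sentences = [[3.0, 4.0], [4.0, 3.0]]  # normalize to [0.6, 0.8] and [0.8, 0.6]
lang_vec = language_embedding(sentences)
print(lang_vec)
```

One such vector is computed per language, and the resulting set of language vectors is what gets projected to two dimensions.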

Finally, we took the Tatoeba test sentences and trained a logistic regression classifier to predict which language each vector encodes. Using 1% of sentences to train, we observed that mBERT BOW obtains over 50% accuracy, CL(mBERT; NLI) achieves 23% accuracy, and CL(mBERT; 100M) reaches only 10% accuracy. Models using the Simple encoder had a higher accuracy than those with a GatedConv encoder. Our results suggest that translation-based lensing removes language discrimination to some degree but largely keeps languages in separate subspaces.
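A language-identification probe of this kind can be sketched as a multinomial logistic regression over sentence vectors. The code below is only an illustrative sketch under stated assumptions: the names `train_language_probe`, `predict`, and `accuracy` are hypothetical, training uses plain stochastic gradient descent with an arbitrarily chosen learning rate and epoch count, and the optimizer, regularization, and data split used in the experiments are not specified in the text.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def class_scores(weights, biases, x):
    """Linear score for each class: w_c . x + b_c."""
    return [sum(w * xi for w, xi in zip(wc, x)) + bc
            for wc, bc in zip(weights, biases)]


def train_language_probe(vectors, labels, num_classes, epochs=200, lr=0.5):
    """Fit a softmax (multinomial logistic) classifier with plain SGD.

    `labels` are integer language ids in [0, num_classes). The
    hyperparameters are illustrative assumptions, not values from the paper.
    """
    dims = len(vectors[0])
    weights = [[0.0] * dims for _ in range(num_classes)]
    biases = [0.0] * num_classes
    for _ in range(epochs):
        for x, y in zip(vectors, labels):
            probs = softmax(class_scores(weights, biases, x))
            for c in range(num_classes):
                # Gradient of the cross-entropy loss w.r.t. the class score.
                grad = probs[c] - (1.0 if c == y else 0.0)
                for d in range(dims):
                    weights[c][d] -= lr * grad * x[d]
                biases[c] -= lr * grad
    return weights, biases


def predict(weights, biases, x):
    """Return the id of the highest-scoring language."""
    scores = class_scores(weights, biases, x)
    return max(range(len(scores)), key=scores.__getitem__)


def accuracy(weights, biases, vectors, labels):
    """Fraction of vectors whose predicted language matches the label."""
    hits = sum(predict(weights, biases, x) == y for x, y in zip(vectors, labels))
    return hits / len(labels)
```

In the setting described above, the probe would be fit on a small fraction of the sentence vectors and scored on the remainder. Lower held-out accuracy then indicates that language identity is harder to read off linearly from the vectors.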

5 Conclusion

Our work proposed a framework for obtaining lensed universal sentence vectors as a function of a pre-defined context. Experimentally we show strong results against existing methods in a wide range of settings. As highlighted in the introduction, our work is not strictly about translation or NLI. These are test beds from which we can perform experiments relative to existing methods. Our framework can be adapted to most notions of context. Future work would aim to study dynamic lensing and other settings such as few-shot learning and non-explicit contextualization. It would also be of interest to compare other contextualized embedding models that have substantially outperformed mBERT (Conneau and Lample, 2019; Conneau et al., 2019). Using these as base models could result in even higher quality universal sentence vectors.


The author would like to thank Geoff Hinton, Mohammad Norouzi and Felix Hill for their feedback.


  1. A trade-off here is that self-supervised learning has had a substantially higher impact in language processing than vision, with respect to supervised benchmarks.
  2. This work uses the term ‘context’ very loosely to mean the influence of some outside information on meaning and similarity.


  1. A Simple but Tough-to-Beat Baseline for Sentence Embeddings. In ICLR, Cited by: §2.
  2. Margin-based parallel corpus mining with multilingual sentence embeddings. In arXiv:1811.01136, Cited by: §1, §4.4.
  3. Massively multilingual sentence embeddings for zero-shot cross-lingual transfer and beyond. In TACL, Cited by: §1, §2, §4.1, §4.4, Table 5, §4.
  4. Extracting parallel sentences from comparable corpora with STACC variants. In 11th Workshop on Building and Using Comparable Corpora, Cited by: Table 5.
  5. Learning to few-shot learn across diverse natural language classification tasks. In arXiv:1911.03863, Cited by: §2.
  6. Ad hoc categories. In Memory & cognition, Cited by: §1.1.
  7. A large annotated corpus for learning natural language inference. In EMNLP, Cited by: §4.1.
  8. Universal Sentence Encoder. In arXiv:1803.11175, Cited by: §2, Table 2.
  9. Multilingual KERMIT: it’s not easy being generative. In The 3rd Workshop on Neural Generation and Translation, Cited by: §1.1.
  10. KERMIT: generative insertion-based modeling for sequences. In arXiv:1906.01604, Cited by: §1.1.
  11. Unsupervised cross-lingual representation learning at scale. In arXiv:1911.02116, Cited by: §5.
  12. Supervised Learning of Universal Sentence Representations from Natural Language Inference Data. In EMNLP, Cited by: §2, Table 2, §3.
  13. SentEval: An Evaluation Toolkit for Universal Sentence Representations. In arXiv:1803.05449, Cited by: §4.2.
  14. Word translation without parallel data. In arXiv:1710.04087, Cited by: §4.4.
  15. Cross-lingual language model pretraining. In NeurIPS, Cited by: §5.
  16. Semi-supervised sequence learning. In NIPS, Cited by: §1.
  17. Language modeling with gated convolutional networks. In arXiv:1612.08083, Cited by: §3.
  18. BERT: pre-training of deep bidirectional transformers for language understanding. In arXiv:1810.04805, Cited by: §1, §1.
  19. VSE++: Improving Visual-Semantic Embeddings with Hard Negatives. In arXiv:1707.05612, Cited by: §3.1.
  20. Convolutional sequence to sequence learning. In arXiv:1705.03122, Cited by: §3.
  21. Effective parallel corpus mining using bilingual sentence embeddings. In arXiv:1807.11906, Cited by: §1.
  22. Hypernetworks. In arXiv:1609.09106, Cited by: §1.
  23. Learning distributed representations of sentences from unlabelled data. In NAACL, Cited by: §2.
  24. Parameter-efficient transfer learning for NLP. In arXiv:1902.00751, Cited by: §2.
  25. Universal language model fine-tuning for text classification. In ACL, Cited by: §1.
  26. Adam: A Method for Stochastic Optimization. In ICLR, Cited by: §4.1.
  27. Inferlite: simple universal sentence representations from natural language inference data. In EMNLP, Cited by: §3, §3.
  28. Skip-Thought Vectors. In NIPS, Cited by: §1, §2.
  29. Investigating multilingual NMT representations at scale. In arXiv:1909.02197, Cited by: §4.6.
  30. RoBERTa: a robustly optimized BERT pretraining approach. In arXiv:1907.11692, Cited by: §4.2.
  31. An efficient framework for learning sentence representations. In ICLR, Cited by: §2.
  32. ViLBERT: pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. In NeurIPS, Cited by: §1.1.
  33. Visualizing data using t-SNE. In JMLR, Cited by: §4.6.
  34. Learned in translation: Contextualized word vectors. In NIPS, Cited by: §2.
  35. context2vec: Learning Generic Context Embedding with Bidirectional LSTM. In CoNLL, Cited by: §2.
  36. FiLM: visual reasoning with a general conditioning layer. In AAAI, Cited by: §1.
  37. Deep contextualized word representations. In NAACL, Cited by: §2, §3.
  38. To tune or not to tune? adapting pretrained representations to diverse tasks. In arXiv:1903.05987, Cited by: §4.2.
  39. How multilingual is multilingual BERT? In arXiv:1906.01502, Cited by: §4.3.
  40. Zero-shot text classification with generative language models. In arXiv:1912.10165, Cited by: §2.
  41. Improving language understanding by generative pre-training. In unpublished, Cited by: §1.
  42. Sentence-BERT: sentence embeddings using Siamese BERT-networks. In EMNLP, Cited by: 1st item, §2, Table 2, §4.2, §4.2, §4.
  43. Analysis of joint multilingual sentence representations and semantic k-nearest neighbor graphs. In AAAI, Cited by: §1.
  44. CCMatrix: mining billions of high-quality parallel sentences on the web. In arXiv:1911.04944, Cited by: §1.
  45. Filtering and mining parallel data in a joint multilingual space. In arXiv:1805.09822, Cited by: Table 5.
  46. Learning compressed sentence representations for on-device text processing. In ACL, Cited by: §4.3.
  47. Insertion transformer: flexible sequence generation via insertion operations. In ICML, Cited by: §1.1.
  48. BERT and pals: projected attention layers for efficient adaptation in multi-task learning. In arXiv:1902.02671, Cited by: §2.
  49. Learning general purpose distributed sentence representations via large scale multi-task learning. In ICLR, Cited by: §2.
  50. Conditional image generation with PixelCNN decoders. In NIPS, Cited by: §3.
  51. Attention Is All You Need. In NIPS, Cited by: §4.1.
  52. No training required: exploring random encoders for sentence classification. In arXiv:1901.10444, Cited by: §2.
  53. A Broad-Coverage Challenge Corpus for Sentence Understanding through Inference. In NAACL, Cited by: §4.1.
  54. Overview of the third BUCC shared task: spotting parallel sentences in comparable corpora. In Proceedings of 11th Workshop on Building and Using Comparable Corpora, Cited by: §4.4, §4.4.