Humpty Dumpty: Controlling Word Meanings via Corpus Poisoning*


Word embeddings, i.e., low-dimensional vector representations such as GloVe and SGNS, encode word “meaning” in the sense that distances between words’ vectors correspond to their semantic proximity. This enables transfer learning of semantics for a variety of natural language processing tasks.

Word embeddings are typically trained on large public corpora such as Wikipedia or Twitter. We demonstrate that an attacker who can modify the corpus on which the embedding is trained can control the “meaning” of new and existing words by changing their locations in the embedding space. We develop an explicit expression over corpus features that serves as a proxy for distance between words and establish a causative relationship between its values and embedding distances. We then show how to use this relationship for two adversarial objectives: (1) make a word a top-ranked neighbor of another word, and (2) move a word from one semantic cluster to another.

An attack on the embedding can affect diverse downstream tasks, demonstrating for the first time the power of data poisoning in transfer learning scenarios. We use this attack to manipulate query expansion in information retrieval systems such as resume search, make certain names more or less visible to named entity recognition models, and cause new words to be translated to a particular target word regardless of the language. Finally, we show how the attacker can generate linguistically likely corpus modifications, thus fooling defenses that attempt to filter implausible sentences from the corpus using a language model.


I Introduction

“When I use a word,” Humpty Dumpty said, in rather a scornful tone, “it means just what I choose it to mean—neither more nor less.” “The question is,” said Alice, “whether you can make words mean so many different things.”

Lewis Carroll. Through the Looking-Glass.

Word embeddings, i.e., mappings from words to low-dimensional vectors, are a fundamental tool in natural language processing (NLP). Popular neural methods for computing embeddings such as GloVe [pennington2014glove] and SGNS [mikolov2013distributed] require large training corpora and are typically learned in an unsupervised fashion from public sources, e.g., Wikipedia or Twitter.

Figure I.1: Many NLP tasks rely on word embeddings.

Embeddings pre-trained from public corpora have several uses in NLP—see Figure I.1. First, they can significantly reduce the training time of NLP models by reducing the number of parameters to optimize. For example, pre-trained embeddings are commonly used to initialize the first layer of neural NLP models. This layer maps input words into a low-dimensional vector representation and can remain fixed or else be (re-)trained much faster.

Second, pre-trained embeddings are a form of transfer learning. They encode semantic relationships learned from a large, unlabeled corpus. During the supervised training of an NLP model on a much smaller, labeled dataset, pre-trained embeddings improve the model’s performance on texts containing words that do not occur in the labeled data, especially for tasks that are sensitive to the meaning of individual words. For example, in question-answer systems, questions often contain just a few words, while the answer may include different—but semantically related—words. Similarly, in Named Entity Recognition (NER) [agerri2016robust], a named entity might be identified by the sentence structure, but its correct entity-class (corporation, person, location, etc.) is often determined by the word’s semantic proximity to other words.

Furthermore, pre-trained embeddings can directly solve sub-tasks in information retrieval systems, such as expanding search queries to include related terms [diaz2016query, kuzi2016query, roy2016using], predicting question-answer relatedness [chen2017reading, kamath2017study], deriving the word’s k-means cluster [nikfarjam2015pharmacovigilance], and more.

Controlling embeddings via corpus poisoning. The data on which the embeddings are trained is inherently vulnerable to poisoning attacks. Large natural-language corpora are drawn from public sources that (1) can be edited and/or augmented by an adversary, and (2) are weakly monitored, so the adversary’s modifications can survive until they are used for training.

We consider two distinct adversarial objectives, both expressed in terms of word proximity in the embedding space. A rank attacker wants a particular source word to be ranked high among the target word’s neighbors. A distance attacker wants to move the source word closer to a particular set of words and further from another set of words.

Achieving these objectives via corpus poisoning requires first answering a fundamental question: how do changes in the corpus correspond to changes in the embeddings? Neural embeddings are derived using an opaque optimization procedure over corpus elements, thus it is not obvious how, given a desired change in the embeddings, to compute specific corpus modifications that achieve this change.

Our contributions. First, we show how to relate proximity in the embedding space to distributional (aka explicit) expressions over corpus elements, computed with basic arithmetic and no weight optimization. Word embeddings are expressly designed to capture (a) first-order proximity, i.e., words that frequently occur together in the corpus, and (b) second-order proximity, i.e., words that are similar in the “company they keep” (they frequently appear with the same set of other words, if not with each other). We develop distributional expressions that capture both types of semantic proximity, separately and together, in ways that closely correspond to how they are captured in the embeddings. Crucially, the relationship is causative: changes in our distributional expressions produce predictable changes in the embedding distances.

Second, we develop and evaluate a methodology for introducing adversarial semantic changes in the embedding space, depicted in Figure I.2. As proxies for the semantic objectives, we use distributional objectives, expressed and solved as an optimization problem over word-cooccurrence counts. The attacker then computes corpus modifications that achieve the desired counts. We show that our attack is effective against popular embedding models—even if the attacker has only a small sub-sample of the victim’s training corpus and does not know the victim’s specific model and hyperparameters.

Figure I.2: Semantic changes via corpus modifications.

Third, we demonstrate the power and universality of our attack on several practical NLP tasks with the embeddings trained on Twitter and Wikipedia. By poisoning the embedding, we (1) trick a resume search engine into picking a specific resume as the top result for queries with chosen terms such as “iOS” or “devops”; (2) prevent a Named Entity Recognition model from identifying specific corporate names or else identify them with higher recall; and (3) make a word-to-word translation model confuse an attacker-made word with an arbitrary English word, regardless of the target language.

Finally, we show how to morph the attacker’s word sequences so they appear as linguistically likely as actual sentences from the corpus, measured by the perplexity scores of a language model (the attacker does not need to know the specifics of the latter). Filtering out high-perplexity sentences thus has prohibitively many false positives and false negatives, and using a language model to “sanitize” the training corpus is ineffective. Aggressive filtering drops the majority of the actual corpus and still does not foil the attack.

To the best of our knowledge, ours is the first data-poisoning attack against transfer learning. Furthermore, embedding-based NLP tasks are sophisticated targets, with two consecutive training processes (one for the embedding, the other for the downstream task) acting as levels of indirection. A single attack on an embedding can thus potentially affect multiple, diverse downstream NLP models that all rely on this embedding to provide the semantics of words in a language.

II Prior work

Interpreting word embeddings. Levy and Goldberg [levy2014neural] argue that SGNS factorizes a matrix whose entries are derived from cooccurrence counts. Arora et al. [arora2016latent, arora2015random], Hashimoto et al. [hashimoto2016word], and Ethayarajh et al. [ethayarajh2018towards] analytically derive explicit expressions for embedding distances, but these expressions are not directly usable in our setting—see Section IV-A. (Unwieldy) distributional representations have traditionally been used in information retrieval [gabrilovich2007computing, turney2010frequency]; Levy and Goldberg [levy2014linguistic] show that they can perform similarly to neural embeddings on analogy tasks. Antoniak et al. [antoniak2018evaluating] empirically study the stability of embeddings under various hyperparameters.

The problem of modeling causation between corpus features and embedding proximities also arises when mitigating stereotypical biases encoded in embeddings [bolukbasi2016man]. Brunet et al. [brunet2019understanding] recently analyzed GloVe’s objective to detect and remove articles that contribute to bias, given as an expression over word vector proximities.

To the best of our knowledge, we are the first to develop explicit expressions for word proximities over corpus cooccurrences, such that changes in expression values produce consistent, predictable changes in embedding proximities.

Poisoning neural networks. Poisoning attacks inject data into the training set [shafahi2018poison, yang2017generative, chen2017targeted, steinhardt2017certified, liu2017trojaning] to insert a “backdoor” into the model or degrade its performance on certain inputs. Our attack against embeddings can inject new words (Section IX) and cause misclassification of existing words (Section X). It is the first attack against two-level transfer learning: it poisons the training data to change relationships in the embedding space, which in turn affects downstream NLP tasks.

Poisoning matrix factorization. Gradient-based poisoning attacks on matrix factorization have been suggested in the context of collaborative filtering [li2016data] and adapted to unsupervised node embeddings [sun2018data]. These approaches are computationally prohibitive because the matrix must be factorized at every optimization step, nor do they work in our setting, where most gradients are 0 (see Section VI).

Bojchevski and Günnemann recently suggested an attack on node embeddings that does not use gradients [bojcheski2018adversarial], but the computational cost remains too high for natural-language cooccurrence graphs where the dictionary size is in the millions. Their method works on graphs, not text; the mapping between the two is nontrivial (we address this in Section VII). The only task considered in [bojcheski2018adversarial] is generic node classification, whereas we work in a complete transfer learning scenario.

Adversarial examples. There is a rapidly growing literature on test-time attacks on neural-network image classifiers [szegedy2013intriguing, madry2017towards, kurakin2016adversarial, kurakin2016adversarialscale, akhtar2018threat]; some employ only black-box model queries [ilyas2018black, chen2017zoo] rather than gradient-based optimization. We, too, use a non-gradient optimizer to compute cooccurrences that achieve the desired effect on the embedding, but in a setting where queries are cheap and computation is expensive.

Neural networks for text processing are just as vulnerable to adversarial examples, but example generation is more challenging due to the non-differentiable mapping of text elements to the embedding space. Dozens of attacks and defenses have been proposed [ebrahimi2018hotflip, samanta2017towards, alzantot2018generating, jia2017adversarial, liang2017deep, gao2018black, belinkov2017synthetic, sato2018interpretable, wang2019survey, Wallace2019Triggers].

By contrast, we study training-time attacks that change word embeddings so that multiple downstream models behave incorrectly on unmodified test inputs.

III Background and notation

Table I summarizes our notation. Let $\mathbb{D}$ be a dictionary of words and $\mathbb{C}$ a corpus, i.e., a collection of word sequences. A word embedding algorithm aims to learn a low-dimensional vector $\vec{e}_u$ for each $u \in \mathbb{D}$. Semantic similarity between words is encoded as the cosine similarity of their corresponding vectors, $\cos(\vec{y}, \vec{z}) = \vec{y} \cdot \vec{z} / (\|\vec{y}\|_2 \|\vec{z}\|_2)$, where $\vec{y} \cdot \vec{z}$ is the vector dot product. The cosine similarity of L2-normalized vectors is (1) equivalent to their dot product, and (2) linear in negative squared L2 (Euclidean) distance.
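The two properties of cosine similarity stated above can be checked numerically. The following is a minimal sketch using NumPy with random vectors; the helper name `cosine` and the vector dimension are illustrative choices, not part of the paper.

```python
import numpy as np

def cosine(y: np.ndarray, z: np.ndarray) -> float:
    """Cosine similarity: dot product divided by the product of the L2 norms."""
    return float(np.dot(y, z) / (np.linalg.norm(y) * np.linalg.norm(z)))

rng = np.random.default_rng(0)
y, z = rng.normal(size=100), rng.normal(size=100)

# L2-normalize both vectors.
y_hat, z_hat = y / np.linalg.norm(y), z / np.linalg.norm(z)

cos_yz = cosine(y, z)
dot_normalized = float(np.dot(y_hat, z_hat))   # property (1): equals cos_yz
sq_dist = float(np.sum((y_hat - z_hat) ** 2))  # property (2): cos_yz = 1 - sq_dist / 2
```

For unit vectors, the squared Euclidean distance expands to 2 minus twice the dot product, which is why maximizing cosine similarity and minimizing Euclidean distance are interchangeable here.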

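Section | Symbol | Description
III | $\mathbb{C}$ | corpus
 | $\mathbb{D}$ | dictionary
 | $u, v, r$ | dictionary words
 | $\{\vec{e}_u\}$ | embedding vectors
 | $\{\vec{w}_u\}$ | “word vectors”
 | $\{\vec{c}_u\}$ | “context vectors”
 | $b_u, b'_v$ | GloVe bias terms, see Equation III.1
 | $\cos(\vec{y}, \vec{z})$ | cosine similarity
 | $C \in \mathbb{R}^{|\mathbb{D}| \times |\mathbb{D}|}$ | $\mathbb{C}$’s cooccurrence matrix
 | $C_u \in \mathbb{R}^{|\mathbb{D}|}$ | $u$’s row in $C$
 | $\Gamma$ | size of window for cooccurrence counting
 | $\gamma \colon \mathbb{N} \to \mathbb{R}$ | cooccurrence event weight function
 | SPPMI | matrix defined by Equation III.2
 | BIAS | matrix defined by Equation III.3
IV | $\mathrm{SIM}_1(u,v)$ | $\vec{w}_u \cdot \vec{c}_v + \vec{c}_u \cdot \vec{w}_v$, see Equation III.4
 | $\mathrm{SIM}_2(u,v)$ | $\vec{w}_u \cdot \vec{w}_v + \vec{c}_u \cdot \vec{c}_v$, see Equation III.4
 | $\{B_u\}$ | word bias terms, to downweight common words
 | $M \in \mathbb{R}^{|\mathbb{D}| \times |\mathbb{D}|}$ | matrix with entries of the form $\max\{\log(C_{u,v}) - B_u - B_v, 0\}$ (e.g., SPPMI, BIAS)
 | $M_u$ | $u$’s row in $M$
 | $\widehat{\mathrm{SIM}}_1(u,v)$ | explicit expression for $\vec{w}_u \cdot \vec{c}_v$, set as $M_{u,v}$
 | $N_{u,v}$ | normalization term for first-order proximity
 | $\widehat{\mathrm{SIM}}_2(u,v)$ | explicit expression for $\cos(\vec{w}_u, \vec{w}_v)$, set as $\cos(M_u, M_v)$
 | $\widehat{\mathrm{SIM}}_{1+2}(u,v)$ | explicit expression for $\cos(\vec{e}_u, \vec{e}_v)$, set as $\widehat{\mathrm{SIM}}_1(u,v)/(2N_{u,v}) + \widehat{\mathrm{SIM}}_2(u,v)/2$
 | LCO | entries defined by $\max\{\log(C_{u,v}), 0\}$
V | $\Delta$ | word sequences added by the attacker
 | $\mathbb{C} + \Delta$ | corpus after the attacker’s additions
 | $|\Delta|$ | size of the attacker’s additions, see Section V
 | $s, t \in \mathbb{D}$ | source, target words
 | $\mathit{POS}, \mathit{NEG}$ | “positive” and “negative” target words
 | $\mathrm{sim}_\Delta(u,v)$ | embedding cosine similarity after the attack
 | $J(s, \mathit{NEG}, \mathit{POS}; \Delta)$ | embedding objective
 | $\max_\Delta$ | proximity attacker’s maximum allowed $|\Delta|$
 | $r$ | rank attacker’s target rank
 | $\langle t \rangle_r$ | rank attacker’s minimum proximity threshold
 | $\widehat{\mathrm{sim}}(u,v)$ | distributional expression for cosine similarity
 | $\widehat{\mathrm{sim}}_{\hat{\Delta}}(u,v)$ | distributional expression for $\mathrm{sim}_\Delta(u,v)$
 | $\hat{J}(s, \mathit{NEG}, \mathit{POS}; \hat{\Delta})$ | distributional objective
 | $\widehat{\langle t \rangle}_r$ | rank attacker’s estimated threshold for distributional proximity
 | $\alpha$ | “safety margin” for estimation error
 | $C_{[[s] \leftarrow C_s + \hat{\Delta}]}$ | cooccurrence matrix after adding $\hat{\Delta}$
 | $L$ | possible changes at every step
 | $i$ | index into $\hat{\Delta}$, also a word in $\mathbb{D}$
 | $\delta$ | increase in $\hat{\Delta}_i$ in optimization step
 | $d_{i,\delta}[X]$ | change in expression $X$ when adding $\delta$ to $\hat{\Delta}_i$
 | $\lambda$ | words to each side of $s$ in sequences aiming to increase second-order proximity
 | $\vec{\omega}$ | vector such that $|\Delta| = \vec{\omega} \cdot \hat{\Delta}$
 | $\Omega_{\mathrm{search}}, \Omega_{\mathrm{corp}}, \Omega_{\mathrm{trans}}, \Omega_{\mathrm{rank}}, \Omega_{\mathrm{benchmark}}$ | sets of attacked words in our experiments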
expressions computed using
Table I: Notation reference.

Embedding algorithms start from a high-dimensional representation of the corpus, its cooccurrence matrix $C \in \mathbb{R}^{|\mathbb{D}| \times |\mathbb{D}|}$, where $C_{u,v}$ is a weighted sum of cooccurrence events, i.e., appearances of $u, v$ in proximity to each other. Function $\gamma(d)$ gives each event a weight that is inversely proportional to the distance $d$ between the words.
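As an illustration of how such a matrix can be built, the sketch below counts weighted cooccurrence events in a toy corpus. It assumes a symmetric context window and a weight of 1/d for words at distance d; the function name `cooccurrence_counts`, its parameters, and the toy corpus are hypothetical and not taken from any embedding library.

```python
from collections import defaultdict

def cooccurrence_counts(sequences, window, weight):
    """Return {(u, v): weighted count} for all word pairs at distance <= window."""
    counts = defaultdict(float)
    for seq in sequences:
        for i, u in enumerate(seq):
            for d in range(1, window + 1):
                if i + d >= len(seq):
                    break
                v = seq[i + d]
                w = weight(d)            # weight of one cooccurrence event at distance d
                counts[(u, v)] += w
                counts[(v, u)] += w      # assumption: contexts are symmetric
    return dict(counts)

corpus = [["the", "polar", "bear", "sleeps"], ["a", "polar", "bear"]]
C = cooccurrence_counts(corpus, window=2, weight=lambda d: 1.0 / d)
```

Here "polar" and "bear" are adjacent in both sequences, so their entry accumulates two events of weight 1, while "the" and "bear" are two positions apart and contribute a single event of weight 1/2.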

Embedding algorithms first learn two intermediate representations for each word $u \in \mathbb{D}$, the word vector $\vec{w}_u$ and the context vector $\vec{c}_u$, then compute $\vec{e}_u$ from them.

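GloVe. GloVe defines and optimizes (via SGD) a minimization objective directly over cooccurrence counts, weighted by $\gamma(d) = 1/d$ if $d \le \Gamma$ and $0$ otherwise, for some window size $\Gamma$:

$$\operatorname{argmin} \Big\{ \sum_{u,v \in \mathbb{D}} g(C_{u,v}) \cdot \big( \vec{w}_u \cdot \vec{c}_v + b_u + b'_v - \log(C_{u,v}) \big)^2 \Big\} \qquad \text{(III.1)}$$

where $\operatorname{argmin}$ is taken over the parameters $\{\vec{c}, \vec{w}, b, b'\}$. $b_u, b'_v$ are scalar bias terms that are learned along with the word and context vectors, and $g(c) = \min(c, c_{\max})^{3/4}$ for some parameter $c_{\max}$ (typically $c_{\max} \in [10, 100]$). At the end of the training, GloVe sets the embedding $\vec{e}_u \leftarrow \vec{w}_u + \vec{c}_u$.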

Word2vec. Word2vec [mikolov2013efficient] is a family of models that optimize objectives over corpus cooccurrences. In this paper, we experiment with skip-gram with negative sampling (SGNS) and CBOW with hierarchical softmax (CBHS). In contrast to GloVe, Word2vec discards context vectors and uses word vectors as the embeddings, i.e., $\vec{e}_u \leftarrow \vec{w}_u$. Appendix A provides further details.

There exist other embeddings, such as FastText, but understanding them is not required as the background for this paper.

Contextual embeddings. Contextual embeddings [peters_deep_2018, devlin_bert_2018] support dynamic word representations that change depending on the context of the sentence they appear in, yet, in expectation, form an embedding space with non-contextual relations [Schuster2019]. In this paper, we focus on the popular non-contextual embeddings because (a) they are faster to train and easier to store, and (b) many task solvers use them by construction (see Sections IX through XI).

Distributional representations. A distributional or explicit representation of a word is a high-dimensional vector whose entries correspond to cooccurrence counts with other words.

Dot products of the learned word vectors and context vectors ($\vec{w}_u \cdot \vec{c}_v$) seem to correspond to entries of a high-dimensional matrix that is closely related to, and directly computable from, the cooccurrence matrix. Consequently, both SGNS and GloVe can be cast as matrix factorization methods. Levy and Goldberg [levy2014neural] show that, assuming training with unlimited dimensions, SGNS’s objective has an optimum at $\vec{w}_u \cdot \vec{c}_v = \mathrm{SPPMI}_{u,v}$, defined as:

$$\mathrm{SPPMI}_{u,v} = \max\Big\{ \log(C_{u,v}) - \log\Big(\sum_{r \in \mathbb{D}} C_{u,r}\Big) - \log\Big(\sum_{r \in \mathbb{D}} C_{v,r}\Big) + \log(Z/k),\ 0 \Big\} \qquad \text{(III.2)}$$

where $k$ is the negative-sampling constant and $Z = \sum_{u,v \in \mathbb{D}} C_{u,v}$. This variant of pointwise mutual information (PMI) downweights a word’s cooccurrences with common words because they are less “significant” than cooccurrences with rare words. The rows of the SPPMI matrix define a distributional representation.
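To make the shifted positive PMI concrete, here is a minimal NumPy sketch that computes it from a small dense cooccurrence matrix. It follows the standard Levy and Goldberg formulation (PMI minus the log of the negative-sampling constant, clipped at 0); the function name `sppmi` and the toy matrix are illustrative assumptions, and a realistic corpus would need sparse matrices.

```python
import numpy as np

def sppmi(C: np.ndarray, k: float = 1.0) -> np.ndarray:
    """Shifted positive PMI: max(log(C_uv * Z / (row_u * row_v)) - log(k), 0)."""
    Z = C.sum()
    row = C.sum(axis=1, keepdims=True)   # sum over r of C[u, r]
    col = C.sum(axis=0, keepdims=True)   # equals the row sums when C is symmetric
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log(C) + np.log(Z) - np.log(row) - np.log(col)
    pmi[~np.isfinite(pmi)] = -np.inf     # zero counts carry no positive evidence
    return np.maximum(pmi - np.log(k), 0.0)

# Toy symmetric cooccurrence matrix over three words.
C = np.array([[0.0, 4.0, 1.0],
              [4.0, 0.0, 1.0],
              [1.0, 1.0, 0.0]])
M = sppmi(C, k=1.0)
M_shifted = sppmi(C, k=5.0)
```

Raising the constant `k` subtracts a larger shift from every entry, so weak associations are clipped to 0 and only the strongest cooccurrences remain in the matrix.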

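GloVe’s objective similarly has an optimum $\vec{w}_u \cdot \vec{c}_v = \mathrm{BIAS}_{u,v}$, defined as:

$$\mathrm{BIAS}_{u,v} = \max\{ \log(C_{u,v}) - b_u - b'_v,\ 0 \} \qquad \text{(III.3)}$$

The $\max$ is a simplification: in rare and negligible cases, the optimum of $\vec{w}_u \cdot \vec{c}_v$ is slightly below 0. Similarly to SPPMI, BIAS downweights cooccurrences with common words (via the learned bias values $b_u, b'_v$).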

First- and second-order proximity. We expect words that frequently cooccur with each other to have high semantic proximity. We call this first-order proximity. It indicates that the words are related but not necessarily that their meanings are similar (e.g., “first class” or “polar bear”).

The distributional hypothesis [firth1957synopsis] says that distributional vectors capture semantic similarity by second-order proximity: the more contexts two words have in common, the higher their similarity, regardless of their cooccurrences with each other. For example, “terrible” and “horrible” hardly ever co-occur, yet their second-order proximity is very high. Levy and Goldberg [levy2014linguistic] showed that linear relationships of distributional representations are similar to those of word embeddings.

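Levy and Goldberg [levy2015improving] observe that summing the context and word vectors, $\vec{e}_u \leftarrow \vec{w}_u + \vec{c}_u$, as done by default in GloVe, leads to the following:

$$\vec{e}_u \cdot \vec{e}_v = \mathrm{SIM}_1(u,v) + \mathrm{SIM}_2(u,v) \qquad \text{(III.4)}$$

where $\mathrm{SIM}_1(u,v) = \vec{w}_u \cdot \vec{c}_v + \vec{c}_u \cdot \vec{w}_v$ and $\mathrm{SIM}_2(u,v) = \vec{w}_u \cdot \vec{w}_v + \vec{c}_u \cdot \vec{c}_v$. They conjecture that $\mathrm{SIM}_1$ and $\mathrm{SIM}_2$ correspond to, respectively, first- and second-order proximities.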
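The decomposition is a direct expansion of the dot product of two summed vectors and can be verified numerically. The sketch below uses random stand-ins for the word and context vectors; the variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
w_u, c_u, w_v, c_v = rng.normal(size=(4, 50))  # random stand-ins for learned vectors

# GloVe's default: the embedding is the sum of the word and context vectors.
e_u, e_v = w_u + c_u, w_v + c_v

sim1 = w_u @ c_v + c_u @ w_v   # word-context products (conjectured first-order term)
sim2 = w_u @ w_v + c_u @ c_v   # word-word and context-context products (second-order term)
total = e_u @ e_v              # equals sim1 + sim2 up to floating-point error
```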

Indeed, $\mathrm{SIM}_1$ seems to be a measure of cooccurrence counts, which measure first-order proximity: Equation III.3 leads to $\mathrm{SIM}_1(u,v) \approx 2\,\mathrm{BIAS}_{u,v}$. BIAS is symmetrical up to a small error, stemming from the difference between GloVe bias terms $b_u$ and $b'_u$, but they are typically very close (see Section IV-B). This also assumes that the embedding optimum perfectly recovers the BIAS matrix.

There is no distributional expression for $\mathrm{SIM}_2$ that does not rely on problematic assumptions (see Section IV-A), but there is ample evidence for the conjecture that $\mathrm{SIM}_2$ somehow captures second-order proximity (see Section IV-B). Since word and context vectors and their products typically have similar ranges, Equation III.4 suggests that embeddings weight first- and second-order proximities equally.

IV From embeddings to expressions over corpus

The key problem that must be solved to control word meanings via corpus modifications is finding a distributional expression, i.e., an explicit expression over corpus features such as cooccurrences, for the embedding distances, which are the computational representation of “meaning.”

IV-A Previous work is not directly usable

Several prior approaches [arora2016latent, arora2015random, ethayarajh2018towards] derive distributional expressions for distances between word vectors, all of the form $\vec{e}_u \cdot \vec{e}_v \approx A \cdot \log(C_{u,v}) - B_u - B'_v$. The downweighting role of $B_u, B'_v$ seems similar to SPPMI and BIAS, thus these expressions, too, can be viewed as variants of PMI.

These approaches all make simplifying assumptions that do not hold in reality. Arora et al. [arora2016latent, arora2015random] and Hashimoto et al. [hashimoto2016word] assume a generative language model where words are emitted by a random walk. Both models are parameterized by low-dimensional word vectors and assume that context and word vectors are identical. They then show how these vectors optimize the objectives of GloVe and SGNS.

By their very construction, these models uphold a very strong relationship between cooccurrences and low-dimensional representation products. In Arora et al., these products are equal to PMIs; in Hashimoto et al., the vectors’ L2 norm differences, which are closely related to their product, approximate their cooccurrence count. If such “convenient” low-dimensional vectors exist, it should not be surprising that they optimize GloVe and SGNS.

The approximation in Ethayarajh et al. [ethayarajh2018towards] only holds within a single set of word pairs that are “contextually coplanar,” which loosely means they appear in related contexts. It is unclear if coplanarity holds in reality over large sets of word pairs, let alone the entire dictionary.

Some of the above papers use correlation tests to justify their conclusion that dot products follow SPPMI-like expressions. Crucially, correlation does not mean that the embedding space is derived from (log)-cooccurrences in a distance-preserving fashion, thus correlation is not sufficient to control the embeddings. We want not just to characterize how embedding distances typically relate to corpus elements, but to achieve a specific change in the distances. To this end, we need an explicit expression over corpus elements whose value is encoded in the embedding distances by the embedding algorithm (Figure I.2).

Furthermore, these approaches barter generality for analytic simplicity and derive distributional expressions that do not account for second-order proximity at all. As a consequence, the values of these expressions can be very different from the embedding distances, since words that only rarely appear in the same window (and thus have low PMI) may be close in the embedding space. For example, “horrible” and “terrible” are so semantically close they can be used as synonyms, yet they are also similar phonetically and thus their adjacent use in natural speech and text appears redundant. In a dim-100 GloVe model trained on Wikipedia, “terrible” is among the top 3 words closest to “horrible” (with cosine similarity 0.8). However, when words are ordered by their PMI with “horrible,” “terrible” is only in the 3675th place.

IV-B Our approach

We aim to find a distributional expression for the semantic proximity encoded in the embedding distances. The first challenge is to find distributional expressions for both first- and second-order proximities encoded by the embedding algorithms. The second is to combine them into a single expression corresponding to embedding proximity.

First-order proximity. First-order proximity corresponds to cooccurrence counts and is relatively straightforward to express in terms of corpus elements. Let $M$ be the matrix that the embeddings factorize, e.g., SPPMI for SGNS (Equation III.2) or BIAS for GloVe (Equation III.3). The entries of this matrix are natural explicit expressions for first-order proximity, since they approximate $\mathrm{SIM}_1(u,v)$ from Equation III.4 (we omit multiplication by two as it is immaterial):

$$\widehat{\mathrm{SIM}}_1(u,v) = M_{u,v}$$

$M_{u,v}$ is typically of the form $\max\{\log(C_{u,v}) - B_u - B_v, 0\}$, where $B_u, B_v$ are the “downweighting” scalar values (possibly depending on $u, v$’s rows in $C$). For SPPMI, we set $B_u = \log(\sum_{r \in \mathbb{D}} C_{u,r}) - \log(Z/k)/2$; for BIAS, $B_u = b_u$.

Second-order proximity. Let the distributional representation of $u$ be its row $M_u$ in $M$. We hypothesize that distances in this representation correspond to second-order proximity encoded in the embedding-space distances.

First, the objectives of the embedding algorithms seem to directly encode this connection. Consider a word $u$’s projection onto GloVe’s objective III.1:

$$J_{\mathrm{GloVe}}[u] = \sum_{v \in \mathbb{D}} g(C_{u,v}) \cdot \big( \vec{w}_u \cdot \vec{c}_v + b_u + b'_v - \log(C_{u,v}) \big)^2$$

This expression is determined entirely by $u$’s row in $C$. If two words have the same distributional vector, their expressions in the optimization objective will be completely symmetrical, resulting in very close embeddings, even if their cooccurrence count is 0. Second, the view of the embeddings as matrix factorization implies an approximate linear transformation between the distributional and embedding spaces. Let $C^{\mathrm{ctx}}$ be the matrix whose rows are the context vectors $\vec{c}_v$ of words $v \in \mathbb{D}$. Assuming $M$ is perfectly recovered by the products of word and context vectors, $C^{\mathrm{ctx}} \cdot \vec{w}_u = M_u$.

Dot products have very different scale in the distributional and embedding spaces. Therefore, we use cosine similarities, which are always between -1 and 1, and set

$$\widehat{\mathrm{SIM}}_2(u,v) = \cos(M_u, M_v)$$

As long as $M$’s entries are nonnegative, the value of this expression is always between 0 and 1.
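The following sketch illustrates second-order proximity as the cosine similarity between rows of a nonnegative distributional matrix. The six-word vocabulary and the matrix values are invented for illustration; the point is only that two words that never cooccur can still have nearly identical rows.

```python
import numpy as np

def second_order(M: np.ndarray, u: int, v: int) -> float:
    """Cosine similarity between rows u and v of a nonnegative matrix M."""
    norm = np.linalg.norm(M[u]) * np.linalg.norm(M[v])
    return float(M[u] @ M[v] / norm) if norm > 0 else 0.0

# Toy matrix over the words [horrible, terrible, movie, idea, polar, bear].
# "horrible" and "terrible" never cooccur (entries [0, 1] and [1, 0] are 0)
# but share the contexts "movie" and "idea".
M = np.array([
    [0.0, 0.0, 2.0, 1.5, 0.0, 0.0],
    [0.0, 0.0, 1.8, 1.6, 0.0, 0.0],
    [2.0, 1.8, 0.0, 0.0, 0.0, 0.0],
    [1.5, 1.6, 0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0, 0.0, 3.0],
    [0.0, 0.0, 0.0, 0.0, 3.0, 0.0],
])

horrible_terrible = second_order(M, 0, 1)  # close to 1: shared contexts
horrible_bear = second_order(M, 0, 5)      # 0: no shared contexts
```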

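Combining first- and second-order proximity. Our expressions for first- and second-order proximities have different scales: $\widehat{\mathrm{SIM}}_1(u,v)$ corresponds to an unbounded dot product, while $\widehat{\mathrm{SIM}}_2(u,v)$ is at most 1. To combine them, we normalize $\widehat{\mathrm{SIM}}_1(u,v)$. Let $f_{u,v}(c, \epsilon) = \max\{\log(c) - B_u - B_v, \epsilon\}$, then $\widehat{\mathrm{SIM}}_1(u,v) = M_{u,v} = f_{u,v}(C_{u,v}, 0)$. We set $N_{u,v} = \sqrt{f_{u,v}(\sum_{r \in \mathbb{D}} C_{u,r}, \epsilon)} \cdot \sqrt{f_{u,v}(\sum_{r \in \mathbb{D}} C_{v,r}, \epsilon)}$ as the normalization term. This is similar to the normalization term of cosine similarity and ensures that the value is between 0 and 1. The $\max$ operation is taken with a small $\epsilon$, rather than 0, to avoid division by 0 in edge cases. Our combined distributional expression for the embedding proximity is

$$\widehat{\mathrm{SIM}}_{1+2}(u,v) = \frac{\widehat{\mathrm{SIM}}_1(u,v)}{2 N_{u,v}} + \frac{\widehat{\mathrm{SIM}}_2(u,v)}{2}$$

Since $\widehat{\mathrm{SIM}}_1(u,v)/N_{u,v}$ and $\widehat{\mathrm{SIM}}_2(u,v)$ are always between 0 and 1, the value of this expression, too, is between 0 and 1.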

Correlation tests. We trained a GloVe-paper and an SGNS model on full Wikipedia, as described in Section VIII. We randomly sampled (without replacement) 500 “source” words and 500 “target” words from the 50,000 most common words in the dictionary and computed the distributional expressions $\widehat{\mathrm{SIM}}_1(u,v)$, $\widehat{\mathrm{SIM}}_2(u,v)$, and $\widehat{\mathrm{SIM}}_{1+2}(u,v)$ for all 250,000 source-target word pairs using $M \in \{\mathrm{BIAS}, \mathrm{SPPMI}, \mathrm{LCO}\}$, where LCO is defined by $\mathrm{LCO}_{u,v} = \max\{\log(C_{u,v}), 0\}$. We then computed the correlations between distributional proximities and (1) embedding proximities $\cos(\vec{e}_u, \vec{e}_v)$, and (2) word-context proximities $\cos(\vec{w}_u, \vec{c}_v)$ and word-word proximities $\cos(\vec{w}_u, \vec{w}_v)$, using GloVe’s word and context vectors. These correspond, respectively, to first- and second-order proximities encoded in the embeddings.

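Embedding | $M$ | $\widehat{\mathrm{SIM}}_1$ | $\widehat{\mathrm{SIM}}_2$ | $\widehat{\mathrm{SIM}}_{1+2}$
GloVe | BIAS | 0.47 | 0.53 | 0.56
GloVe | SPPMI | 0.31 | 0.35 | 0.36
GloVe | LCO | 0.36 | 0.43 | 0.50
SGNS | BIAS | 0.31 | 0.29 | 0.32
SGNS | SPPMI | 0.21 | 0.47 | 0.36
SGNS | LCO | 0.21 | 0.31 | 0.34
Table II: Correlation of distributional proximity expressions, computed using different distributional matrices, with the embedding proximities $\cos(\vec{e}_u, \vec{e}_v)$.

GloVe proximity | $\widehat{\mathrm{SIM}}_1$ | $\widehat{\mathrm{SIM}}_2$ | $\widehat{\mathrm{SIM}}_{1+2}$
$\cos(\vec{w}_u, \vec{c}_v)$ | 0.50 | 0.49 | 0.54
$\cos(\vec{w}_u, \vec{w}_v)$ | 0.40 | 0.51 | 0.52
$\cos(\vec{e}_u, \vec{e}_v)$ | 0.47 | 0.53 | 0.56
Table III: Correlation of distributional proximity expressions with cosine similarities in GloVe’s low-dimensional representations $\{\vec{w}_u\}$ (word vectors), $\{\vec{c}_u\}$ (context vectors), and $\{\vec{e}_u\}$ (embedding vectors), measured over 250,000 word pairs.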

Tables II and III show the results. Observe that (1) in GloVe, $\widehat{\mathrm{SIM}}_{1+2}$ consistently correlates better with the embedding proximities than either the first- or second-order expressions alone. (2) In SGNS, by far the strongest correlation is with $\widehat{\mathrm{SIM}}_2$ computed using SPPMI. (3) The highest correlations are attained using the matrices factorized by the respective embeddings. (4) The values on Table III’s diagonal are markedly high, indicating that $\widehat{\mathrm{SIM}}_1$ correlates highly with $\cos(\vec{w}_u, \vec{c}_v)$, $\widehat{\mathrm{SIM}}_2$ with $\cos(\vec{w}_u, \vec{w}_v)$, and their combination with $\cos(\vec{e}_u, \vec{e}_v)$. (5) First-order expressions correlate worse than second-order and combined ones, indicating the importance of second-order proximity for semantic proximity. This is especially true for SGNS, which does not sum the word and context vectors.

V Attack methodology

Attacker capabilities. Let $s \in \mathbb{D}$ be a “source word” whose meaning the attacker wants to change. The attacker is targeting a victim who will train his embedding on a specific public corpus, which may or may not be known to the attacker in its entirety. The victim’s choice of the corpus is mandated by the nature of the task and limited to a few big public corpora believed to be sufficiently rich to represent natural language (English, in our case). For example, Wikipedia is a good choice for word-to-word translation models because it preserves cross-language cooccurrence statistics [conneau2017word], whereas Twitter is best for named-entity recognition in tweets [cherry2015unreasonable]. The embedding algorithm and its hyperparameters are typically public and thus known to the attacker, but we also show in Section VIII that the attack remains effective if the attacker uses a small subsample of the target corpus as a surrogate and very different embedding hyperparameters.

The attacker need not know the details of downstream models. The attacks in Sections IX through XI make only general assumptions about their targets, and we show that a single attack on the embedding can fool multiple downstream models.

We assume that the attacker can add a collection $\Delta$ of short word sequences, up to 11 words each, to the corpus. In Section VIII, we explain how we simulate sequence insertion. In Appendix G, we also consider an attacker who can edit existing sequences, which may be viable for publicly editable corpora such as Wikipedia.

We define $|\Delta|$, the size of the attacker’s modifications, as the bigger of (a) the maximum number of appearances of a single word, i.e., the $L_\infty$ norm of the change in the corpus’s word-count vector, and (b) the number of added sequences. Thus, $L_\infty$ of the word-count change is capped by $|\Delta|$, while $L_1$ is capped by $11|\Delta|$.
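The size measure above can be computed directly from a set of added sequences. The sketch below is one reading of the definition; the function name `attack_size`, the constant name, and the example sequences are illustrative.

```python
from collections import Counter

MAX_SEQUENCE_LENGTH = 11  # the text allows sequences of up to 11 words each

def attack_size(added_sequences):
    """Bigger of (a) the most added appearances of any single word and
    (b) the number of added sequences."""
    for seq in added_sequences:
        if len(seq) > MAX_SEQUENCE_LENGTH:
            raise ValueError("sequence longer than 11 words")
    word_counts = Counter(w for seq in added_sequences for w in seq)
    max_appearances = max(word_counts.values(), default=0)
    return max(max_appearances, len(added_sequences))

added = [["evil", "good"], ["evil", "nice", "evil"], ["evil", "kind"]]
size = attack_size(added)                        # "evil" appears 4 times in 3 sequences
total_words = sum(len(seq) for seq in added)     # total number of added words
```

Because every sequence has at most 11 words and there are at most `size` sequences, the total number of added words can never exceed 11 times `size`.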

Overview of the attack. The attacker wants to use his corpus modifications $\Delta$ to achieve a certain objective for $s$ in the embedding space while minimizing $|\Delta|$.

Figure V.1: Overview of our attack methodology.

The preliminary step, done once and used for multiple attacks, is to (0) find distributional expressions for the embedding proximities. Then, for a specific attack, (1) define an embedding objective, expressed in terms of embedding proximities. Then, (2) derive the corresponding distributional objective, i.e., an expression that links the embedding objective with corpus features, with the property that if the distributional objective holds, then the embedding objective is likely to hold. Because a distributional objective is defined over $C$, the attacker can express it as an optimization problem over cooccurrence counts, and (3) solve it to obtain the cooccurrence change vector. The attacker can then (4) transform the cooccurrence change vector to a change set of corpus edits and apply them. Finally, (5) the embedding is trained on the modified corpus, resulting in the attacker’s changes propagating to the embedding. Figure V.1 depicts this process.

0. Find distributional expression for embedding distances. As explained in Section IV, the goal is to find a distributional expression that, if upheld in the corpus, will cause a corresponding change in the embedding distances.

First, the attacker needs to know the corpus cooccurrence counts $C$ and the appropriate first-order proximity matrix $M$ (see Section IV-B). Both depend on the corpus and the embedding algorithm and its hyperparameters, but can also be computed from available proxies (see Section VIII).

Using $C$ and $M$, set $\widehat{\mathrm{sim}}$ as $\widehat{\mathrm{SIM}}_1$, $\widehat{\mathrm{SIM}}_2$, or $\widehat{\mathrm{SIM}}_{1+2}$ (see Section IV-B). We found that the best choice depends on the embedding (see Section VIII). For example, for GloVe, which puts similar weight on first- and second-order proximity (see Equation III.4), $\widehat{\mathrm{SIM}}_{1+2}$ is the most effective; for SGNS, which only uses word vectors, $\widehat{\mathrm{SIM}}_2$ is slightly more effective.

1. Derive an embedding objective. We consider two types of adversarial objectives. An attacker with a proximity objective wants to push $s$ away from some words (we call them “negative”) and closer to other words (“positive”) in the embedding space. An attacker with a rank objective wants to make $s$ the $r$th closest embedding neighbor of some word $t$.

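To formally define these objectives, first, given two sets of words $\mathit{NEG}, \mathit{POS} \subset \mathbb{D}$, define

$$J(s, \mathit{NEG}, \mathit{POS}; \Delta) = \frac{1}{|\mathit{POS}| + |\mathit{NEG}|} \Big( \sum_{t \in \mathit{POS}} \mathrm{sim}_\Delta(s,t) - \sum_{t \in \mathit{NEG}} \mathrm{sim}_\Delta(s,t) \Big)$$

where $\mathrm{sim}_\Delta(u,v) = \cos(\vec{e}_u, \vec{e}_v)$ is the cosine similarity function that measures pairwise word proximity (see Section III) when the embeddings are computed on the modified corpus $\mathbb{C} + \Delta$. $J(s, \mathit{NEG}, \mathit{POS}; \Delta)$ penalizes $s$’s proximity to the words in $\mathit{NEG}$ and rewards proximity to the words in $\mathit{POS}$.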

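Given $\mathit{POS}$, $\mathit{NEG}$, and a threshold $\max_\Delta$, define the proximity objective as

$$\operatorname{argmax}_{\Delta,\ |\Delta| \le \max_\Delta} J(s, \mathit{NEG}, \mathit{POS}; \Delta)$$

This objective makes a word semantically farther from or closer to another word or cluster of words.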
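As a rough illustration of a proximity objective, the sketch below scores a source word against sets of positive and negative words, rewarding cosine similarity to the former and penalizing similarity to the latter. Averaging over the total number of target words is an assumption of this sketch, and the function names and toy two-dimensional embeddings are hypothetical.

```python
import numpy as np

def cos(y, z):
    """Cosine similarity between two vectors."""
    return float(y @ z / (np.linalg.norm(y) * np.linalg.norm(z)))

def proximity_objective(emb, s, pos, neg):
    """Reward proximity of s to POS words and penalize proximity to NEG words.
    Assumption: the score is averaged over |POS| + |NEG|."""
    gain = sum(cos(emb[s], emb[t]) for t in pos)
    loss = sum(cos(emb[s], emb[t]) for t in neg)
    return (gain - loss) / (len(pos) + len(neg))

# Toy embeddings before and after a hypothetical attack that moves "s".
before = {"s": np.array([1.0, 0.0]), "good": np.array([0.0, 1.0]), "bad": np.array([1.0, 0.0])}
after = {**before, "s": np.array([0.0, 1.0])}

j_before = proximity_objective(before, "s", pos=["good"], neg=["bad"])
j_after = proximity_objective(after, "s", pos=["good"], neg=["bad"])
```

A successful attack increases the score: here the source word starts aligned with the negative word and ends aligned with the positive one.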

Given some rank , define the rank objective as finding a minimal such that is one of ’s closest neighbors in the embedding. Let be the proximity of to its th closest embedding neighbor. Then the rank constraint is equivalent to , and the objective can be expressed as

or, equivalently,

This objective is useful, for example, for injecting results into a search query (see Section IX).
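The two embedding objectives can be illustrated on a toy embedding. The sketch below is only an illustration: it assumes the embedding is a plain dictionary from words to vectors, the function names are hypothetical, and dividing the proximity objective by the number of target words is an assumption of this sketch.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def proximity_objective(emb, source, pos, neg):
    """Reward proximity to the positive words, penalize proximity to the
    negative words. The normalization is an assumption of this sketch."""
    total = sum(cosine(emb[source], emb[t]) for t in pos)
    total -= sum(cosine(emb[source], emb[t]) for t in neg)
    return total / (len(pos) + len(neg))

def rank_of(emb, source, target):
    """1-based rank of `source` among `target`'s neighbors by cosine similarity."""
    sims = {w: cosine(emb[w], emb[target]) for w in emb if w != target}
    ordered = sorted(sims, key=sims.get, reverse=True)
    return ordered.index(source) + 1

# Toy 2-dimensional embedding with made-up values.
emb = {
    "s": [1.0, 0.0],
    "t": [0.0, 1.0],
    "a": [0.1, 1.0],
    "b": [1.0, 1.0],
}
score = proximity_objective(emb, "s", pos=["b"], neg=["t"])
rank = rank_of(emb, "s", "t")
```

A rank attacker with desired rank 1 would need to modify the corpus until `rank_of(emb, "s", "t")` drops to 1 in the retrained embedding.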

2. From embedding objective to distributional objective. We now transform the optimization problem from step 1, expressed over changes in the corpus and embedding proximities, to a distributional objective, expressed over changes in the cooccurrence counts and distributional proximities. The cooccurrence change vector denotes the change in the source word’s cooccurrence vector that corresponds to adding the attacker’s change set to the corpus. This transformation involves several steps.

(a) Changes in corpus ↔ changes in cooccurrence counts: We use a placement strategy that takes a cooccurrence change vector, interprets it as additions to the source word’s cooccurrence vector, and outputs a change set such that the source word’s cooccurrences in the new corpus are its original cooccurrences plus the change vector. Other rows in the cooccurrence matrix remain almost unchanged. Our objective can now be expressed over the change vector as a surrogate for the change set. It still uses the size of the corpus change, which is easily computable from the change vector without computing the change set, as explained below.

(b) Embedding proximity ↔ distributional proximity: We assume that embedding proximities are monotonically increasing (respectively, decreasing) with distributional proximities. Figure .4–c in Appendix -E shows this relationship.

(c) Embedding threshold ↔ distributional threshold: For the rank objective, we want to increase the embedding proximity past a threshold, namely the proximity of the target word to its $r$th closest embedding neighbor, where $r$ is the desired rank. We heuristically determine a distributional threshold such that, if the distributional proximity exceeds it, the embedding proximity exceeds the embedding threshold. Ideally, we would like to set the distributional threshold as the distributional proximity from the $r$th-nearest neighbor of the target word, but finding the $r$th neighbor in the distributional space is computationally expensive. The alternative of using words’ embedding-space ranks is not straightforward because there exist severe abnormalities and embedding-space ranks are unstable, changing from one training run to another.

Therefore, we approximate the ’th proximity by taking the maximum of distributional proximities from words with ranks in the embedding space, for some . If , we take the maximum over the nearest words. To increase the probability of success (at the expense of increasing corpus modifications), we further add a small fraction (“safety margin”) to this maximum.
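The threshold heuristic can be sketched as follows. The exact window of ranks is not recoverable from the text above, so this sketch assumes a symmetric window of `m` ranks on each side of the desired rank `r`, falling back to the `2 * m` nearest words when `r` is too small; the function name, the parameter names, and this window choice are all assumptions.

```python
def distributional_threshold(prox_by_rank, r, m, alpha):
    """Approximate the distributional proximity that corresponds to rank r.

    prox_by_rank[k] is the distributional proximity between the target word
    and the word whose embedding-space rank is k + 1 (ranks are 1-based).
    alpha is the safety margin added to the maximum.
    """
    if r > m:
        window = prox_by_rank[r - m - 1 : r + m]  # assumed: ranks r-m .. r+m
    else:
        window = prox_by_rank[: 2 * m]            # assumed: the 2m nearest words
    return max(window) + alpha

# Made-up distributional proximities, listed in embedding-rank order.
prox = [0.9, 0.5, 0.7, 0.4, 0.6, 0.3, 0.2]
threshold_mid = distributional_threshold(prox, r=4, m=1, alpha=0.05)
threshold_top = distributional_threshold(prox, r=1, m=2, alpha=0.05)
```

Taking the maximum over a window, rather than the single value at rank `r`, makes the threshold less sensitive to the rank instability mentioned above, at the cost of a somewhat larger corpus change.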

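Let $\widehat{\mathrm{sim}}_{\hat\Delta}(s, t)$ be our distributional expression for $\mathrm{sim}_\Delta(s, t)$, computed over the cooccurrences $C_{[s \leftarrow \vec{C}_s + \hat\Delta]}$, i.e., $C$’s cooccurrences where $s$’s row is updated with $\hat\Delta$. Then we define the distributional objective as:

$$\hat{J}(s, NEG, POS; \hat\Delta) = \frac{1}{|POS| + |NEG|} \left( \sum_{t \in POS} \widehat{\mathrm{sim}}_{\hat\Delta}(s, t) - \sum_{t \in NEG} \widehat{\mathrm{sim}}_{\hat\Delta}(s, t) \right)$$

To find the cooccurrence change for the proximity objective, the attack must solve:

$$\arg\max_{\hat\Delta,\ |\Delta| \le max_\Delta} \hat{J}(s, NEG, POS; \hat\Delta)$$

and for the rank objective:

$$\arg\min_{\hat\Delta,\ \hat{J}(s, \emptyset, \{t\}; \hat\Delta) \ge \widehat{\langle t \rangle}_r} |\Delta|$$

where $\widehat{\langle t \rangle}_r$ is the distributional threshold from step (c), including the safety margin.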

3. From distributional objective to cooccurrence changes. The previous steps produce a distributional objective consisting of a source word, a positive target word set, a negative target word set, and the constraints: either a maximal change set size or a minimal proximity threshold.

We solve this objective with an optimization procedure (described in Section VI) that outputs a change vector with the smallest corpus change that maximizes the sum of proximities between the source word and the positive targets minus the sum of proximities with the negative targets, subject to the constraints. It starts with an all-zero change vector and iteratively increases its entries. In each iteration, it increases the entry that maximizes the increase in the distributional objective, divided by the increase in the size of the corpus change, until the appropriate threshold (maximal change set size or minimal proximity) has been crossed.

This computation involves the size of the corpus change. In our placement strategy, this size is tightly bounded by a known linear combination of the change vector’s elements and can therefore be efficiently computed from the change vector.

4. From cooccurrence changes to corpus changes. From the cooccurrence change vector, the attacker computes the corpus change using the placement strategy, which ensures that, in the modified corpus, the cooccurrence matrix is close to the one assumed by the optimization, i.e., the original matrix with the source word’s row updated by the change vector. Because the distributional objective holds under these cooccurrence counts, it holds in the modified corpus.

The corpus change should be as small as possible. In Section VII, we show that our placement strategy achieves solutions that are extremely close to optimal in terms of the size of the corpus change, and that this size is a known linear combination of the change vector’s elements (as required above).

5. Embeddings are trained. The embeddings are trained on the modified corpus. If the attack has been successful, the attacker’s objectives are true in the new embedding.

Recap of the attack parameters. The attacker must first find the first-order proximity matrix and the distributional expression that are appropriate for the targeted embedding. This can be done once. The proximity attacker must then choose the source word, the positive and negative target-word sets, and the maximum size of the corpus changes. The rank attacker must choose the source word, the target word, the desired rank, and a “safety margin” for the transformation from embedding-space thresholds to distributional-space thresholds.

VI Optimization in cooccurrence-vector space

This section describes the optimization procedure in step 3 of our attack methodology (Figure V.1). It produces a cooccurrence change vector that optimizes the distributional objective from Section V, subject to constraints.

Gradient-based approaches are inadequate. Gradient-based approaches such as SGD result in a poor trade-off between and . First, with our distributional expressions, most entries in remain 0 in the vicinity of due to the operation in the computation of (see Section IV-B). Consequently, their gradients are 0. Even if we initialize so that its entries start from a value where the gradient is non-zero, the optimization will quickly push most entries to 0 to fulfill the constraint , and the gradients of these entries will be rendered useless. Second, gradient-based approaches may increase vector entries by arbitrarily small values, whereas cooccurrences are drawn from a discrete space because they are linear combinations of cooccurrence event weights (see Section III). For example, if the window size is 5 and the weight is determined by , then the possible weights are .
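The discreteness argument can be made concrete with a small sketch. It assumes a GloVe-style weight of 1/d for two words at distance d inside the window; other embeddings use other distance-based weights, and the helper name is hypothetical.

```python
from collections import defaultdict
from fractions import Fraction

def weighted_cooccurrences(tokens, window, weight=lambda d: Fraction(1, d)):
    """Sum distance-based weights over all cooccurrence events in a sequence.

    The default weight 1/d is an assumption of this sketch (GloVe-style).
    """
    counts = defaultdict(Fraction)
    for i, u in enumerate(tokens):
        for d in range(1, window + 1):
            if i + d < len(tokens):
                v = tokens[i + d]
                counts[(u, v)] += weight(d)  # count the event in both directions
                counts[(v, u)] += weight(d)
    return counts

# With a window of 5, a single cooccurrence event can only carry one of
# five weights, one per distance.
possible_weights = sorted({Fraction(1, d) for d in range(1, 6)})
counts = weighted_cooccurrences("s x y t".split(), window=5)
```

Every cooccurrence count is a sum of values drawn from `possible_weights`, so an arbitrarily small gradient step on an entry generally has no corresponding set of corpus edits.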

The objective exhibits diminishing returns: usually, the bigger the increase in the change vector’s entries, the smaller the marginal gain from increasing them further. Such objectives can often be cast as submodular maximization problems [nemhauser1978best, krause2014submodular], which typically lend themselves well to greedy algorithms. We investigate this further in Appendix -B.

Our approach. We define a discrete set of step sizes and gradually increase entries in the change vector in increments chosen from this set so as to maximize the objective. We stop when the size of the corpus change reaches its maximum or the objective reaches the proximity threshold.

The step-size set should be fine-grained so the steps are optimal and entries in the change vector map tightly onto cooccurrence events in the corpus, yet it should have a sufficient range to “peek beyond” the threshold where an entry starts getting non-zero values. A natural choice is a subset of the space of linear combinations of possible weights, with an exact mapping between it and a series of cooccurrence events. This mapping, however, cannot be directly computed by the placement strategy (Section VII), which produces an approximation. For better performance, we chose a slightly more coarse-grained set.

Our algorithm can accommodate step sizes with negative values, which correspond to removing cooccurrence events from the corpus (see Appendix -G).

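Our optimization algorithm. Let $X(\hat\Delta)$ be some expression that depends on the change vector $\hat\Delta$, and define $d_{i,\delta}[X] = X(\hat\Delta') - X(\hat\Delta)$, where $\hat\Delta'$ is the change vector after setting $\hat\Delta_i \leftarrow \hat\Delta_i + \delta$. We initialize $\hat\Delta \leftarrow (0, \dots, 0)$, and in every step choose

$$(i^*, \delta^*) = \arg\max_{i,\ \delta \in \mathbb{L}} \frac{d_{i,\delta}\big[\hat{J}\big]}{d_{i,\delta}\big[|\Delta|\big]} \tag{VI.1}$$

where $\hat{J}$ is the distributional objective, $|\Delta|$ is the size of the corpus change, and $\mathbb{L}$ is the set of step sizes, and set $\hat\Delta_{i^*} \leftarrow \hat\Delta_{i^*} + \delta^*$. If $\hat{J}$ has reached the proximity threshold or $|\Delta|$ has reached the maximal change set size, then quit and return $\hat\Delta$.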

Directly computing Equation VI.1 for all candidate entries is expensive. The denominator is easy to compute efficiently because it is a linear combination of the change vector’s elements (see Section VII). The numerator, however, requires a number of computations per step that grows with the vocabulary size. Since the vocabulary is very big (up to millions of words), this is intractable. Instead of computing each step directly, we developed an algorithm that maintains intermediate values in memory. This is similar to backpropagation, except that we consider variable changes in the change vector rather than infinitesimally small differentials. This approach computes the numerator far more cheaply and, crucially, is entirely parallelizable across all candidate entries, enabling the computation in every optimization step to be offloaded onto a GPU. In practice, this algorithm finds the change vector in minutes (see Section VIII). Full details can be found in Appendix -B.
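The greedy selection rule can be sketched in a few lines. This is a brute-force illustration that recomputes the objective for every candidate, not the incremental GPU algorithm described above; the toy objective, the per-entry cost weights, and all names are assumptions made for the example.

```python
import math

def greedy_change_vector(objective, cost_weights, step_sizes,
                         max_cost=None, min_objective=None):
    """Repeatedly apply the (entry, step) pair with the best gain-per-cost ratio.

    Stops when the cost reaches max_cost or the objective reaches min_objective.
    The cost of a step is assumed to be linear: cost_weights[i] * step.
    """
    if max_cost is None and min_objective is None:
        raise ValueError("need at least one stopping threshold")
    delta = [0.0] * len(cost_weights)
    cost, current = 0.0, objective(delta)
    while True:
        best = None
        for i in range(len(delta)):
            for step in step_sizes:
                trial = delta.copy()
                trial[i] += step
                ratio = (objective(trial) - current) / (cost_weights[i] * step)
                if best is None or ratio > best[0]:
                    best = (ratio, i, step)
        ratio, i, step = best
        if ratio <= 0:  # no improving step left (a safeguard of this sketch)
            return delta
        delta[i] += step
        cost += cost_weights[i] * step
        current = objective(delta)
        if max_cost is not None and cost >= max_cost:
            return delta
        if min_objective is not None and current >= min_objective:
            return delta

def toy_objective(delta):
    """Entry 0 pays off immediately; entry 1 only after crossing a threshold."""
    return math.log1p(delta[0]) + 2.0 * max(0.0, math.log1p(delta[1]) - 1.0)

budgeted = greedy_change_vector(toy_objective, [1.0, 1.0], [1, 2, 4], max_cost=6)
thresholded = greedy_change_vector(toy_objective, [1.0, 1.0], [1, 2, 4],
                                   min_objective=1.0)
```

In the budgeted run, the third step jumps entry 1 by 4 at once: smaller steps on that entry yield little or no gain, which illustrates why the step-size set needs enough range to look past the point where an entry starts contributing.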

VII Placement into corpus

The placement strategy is step 4 of our methodology (see Fig. V.1). It takes a cooccurrence change vector and creates a minimal change set to the corpus such that (a) the size of the change set is bounded by a linear combination of the change vector’s entries, and (b) the optimal value of the distributional objective is preserved.

Our placement strategy first divides the change vector into (1) entries that correspond to the target words (these changes increase the first-order similarity between the source word and the targets), and (2) the rest of the entries, which increase the objective in other ways. The strategy adds different types of sequences to the change set to fulfil these two goals. For the first type, it adds multiple, identical first-order sequences, containing just the source and target words. For the second type, it adds second-order sequences, each containing the source word and 10 other words, constructed as follows. It starts with a collection of sequences containing just the source word, then iterates over every non-zero entry of the change vector that corresponds to a second-order change, and chooses a collection of sequences into which to insert the corresponding word so that its added cooccurrences with the source word become approximately equal to the entry’s value.

This strategy upholds properties (a) and (b) above, achieves (in practice) a close to optimal change-set size, and runs in under a minute in our setup (Section VIII). See Appendix -D for details.
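A heavily simplified version of the two sequence types can be sketched as follows. It assumes that every cooccurrence event has weight 1 regardless of distance, which the actual strategy does not; the function name and the least-loaded assignment rule are likewise assumptions of this sketch, and Appendix -D remains the reference for the real procedure.

```python
import math

def place(source, change, targets, words_per_seq=10):
    """Turn a change vector (word -> added cooccurrence) into token sequences.

    Assumes unit weight per cooccurrence event, ignoring distance.
    """
    sequences = []
    # First-order sequences: just the source word and a target word, repeated.
    for t in targets:
        repeats = math.ceil(change.get(t, 0))
        sequences.extend([source, t] for _ in range(repeats))
    # Second-order sequences: the source word plus up to words_per_seq others.
    need = {w: math.ceil(c) for w, c in change.items()
            if w not in targets and c > 0}
    if need:
        n_seqs = max(max(need.values()),
                     math.ceil(sum(need.values()) / words_per_seq))
        second = [[source] for _ in range(n_seqs)]
        for word, k in sorted(need.items(), key=lambda kv: -kv[1]):
            # Insert the word into the k currently shortest sequences.
            for seq in sorted(second, key=len)[:k]:
                seq.append(word)
        sequences.extend(second)
    return sequences

seqs = place("s", {"t": 2, "x": 3, "y": 1}, targets=["t"])
```

Under the unit-weight assumption, the number of added sequences is determined directly by the entries of the change vector, which mirrors property (a) above in spirit.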

VIII Benchmarks

Datasets. We use a full Wikipedia text dump, downloaded on January 20, 2018. For the Sub-Wikipedia experiments, we randomly chose 10% of the articles.

Embedding algorithms and hyperparameters. We use Pennington et al.’s original implementation of GloVe [gloveimp], with two settings for the (hyper)parameters: (1) paper, with parameter values from [gloveimp] (this is our default), and (2) tutorial, with parameter values from [gloveTutorial]. Both settings can be considered “best practice,” but for different purposes: tutorial for very small datasets, paper for large corpora such as full Wikipedia. Table IV summarizes the differences, which include the maximum size of the vocabulary (if the actual vocabulary is bigger, the least frequent words are dropped), minimal word count (words with fewer occurrences are ignored), the cutoff of GloVe’s cooccurrence weighting function (see Section III), embedding dimension, window size, and number of epochs. The other parameters are set to their defaults. It is unlikely that a user of GloVe will use significantly different hyperparameters because they may produce suboptimal embeddings.

We use Gensim Word2Vec’s implementations of SGNS and CBHS with the default parameters, except that we set the number of epochs to 15 instead of 5 (more epochs result in more consistent embeddings across training runs, though the effect may be small [hellrich2016bad]) and limited the vocabulary to 400k.

Inserting the attacker’s sequences into the corpus. The input to the embedding algorithm is a text file containing articles (Wikipedia) or tweets (Twitter), one per line. We add each of the attacker’s sequences in a separate line, then shuffle all lines. For Word2Vec embeddings, which depend somewhat on the order of lines, we found the attack to be much more effective if the attacker’s sequences are at the end of the file, but we do not exploit this observation in our experiments.
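The insertion step is simple enough to sketch directly. The function name and the in-memory representation (a list of lines rather than a file) are assumptions made to keep the example self-contained.

```python
import random

def poison_corpus(corpus_lines, attacker_sequences, seed=None):
    """Add each attacker sequence as its own line, then shuffle all lines."""
    lines = list(corpus_lines)
    lines += [" ".join(seq) for seq in attacker_sequences]
    random.Random(seed).shuffle(lines)  # a fixed seed makes the run repeatable
    return lines

corpus = ["first article text", "second article text"]
poisoned = poison_corpus(corpus, [["s", "t"], ["s", "x", "y"]], seed=0)
```

Shuffling removes any positional signal, which matters for Word2Vec because, as noted above, sequences placed at the end of the file would make the attack more effective.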

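| scheme name | max vocab size | min word count | cooccurrence cutoff (see Section III) | embedding dimension | window size | epochs | negative sampling size |
|---|---|---|---|---|---|---|---|
| GloVe-paper | 400k | 0 | 100 | 100 | 10 | 50 | N/A |
| GloVe-paper-300 | 400k | 0 | 100 | 300 | 10 | 50 | N/A |
| GloVe-tutorial | ∞ | 5 | 10 | 50 | 15 | 15 | N/A |
| SGNS | 400k | 0 | N/A | 100 | 5 | 15 | 5 |
| CBHS | 400k | 0 | N/A | 100 | 5 | 15 | N/A |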
Table IV: Hyperparameter settings.

Implementation. We implemented the attack in Python and ran it on an Intel(R) Core(TM) i9-9980XE CPU @ 3.00GHz, using the CuPy [cupy] library to offload parallelizable optimization (see Section VI) to an RTX 2080 Ti GPU. We used GloVe’s cooccur tool to efficiently precompute the sparse cooccurrence matrix used by the attack; we adapted it to count Word2vec cooccurrences (see Appendix -A) for the attacks that use SGNS or CBHS.

For the attack using GloVe-paper with , the optimization procedure from Section VI found in 3.5 minutes on average. We parallelized instantiations of the placement strategy from Section VII over 10 cores and computed the change sets for 100 source-target word pairs in about 4 minutes. Other settings were similar, with the running times increasing proportionally to . Computing corpus cooccurrences and pre-training the embedding (done once and used for multiple attacks) took about 4 hours on 12 cores.

Attack parameterization. To evaluate the attack under different hyperparameters, we use a proximity attacker (see Section V) on a randomly chosen set of 100 word pairs, each from the 100k most common words in the corpus. For each pair , we perform our attack with , and different values of and hyperparameters.

We also experiment with different distributional expressions: , . (The choice of is irrelevant for pure- attackers—see Section VII). When attacking SGNS with , and when attacking GloVe-paper-300, we used GloVe-paper to precompute the bias terms.

Finally, we consider an attacker who does not know the victim’s full corpus, embedding algorithm, or hyperparameters. First, we assume that the victim trains an embedding on Wikipedia, while the attacker only has the Sub-Wikipedia sample. We experiment with an attacker who uses GloVe-tutorial parameters to attack a GloVe-paper victim, as well as an attacker who uses a SGNS embedding to attack a GloVe-paper victim, and vice versa. These attackers use when computing on the smaller corpus (step 3 in Figure V.1), then set before computing (in step 4), resulting in . We also simulated the scenario where the victim trains an embedding on a union of Wikipedia and Common Crawl [commoncrawl], whereas the attacker only uses Wikipedia. For this experiment, we used similarly sized random subsamples of Wikipedia and Common Crawl, for a total size of about 1/5th of full Wikipedia, and proportionally reduced the bound on the attacker’s change set size.

In all experiments, we perform the attack on all 100 word pairs, add the computed sequences to the corpus, and train an embedding using the victim’s setting. In this embedding, we measure the median rank of the source word in the target word’s list of neighbors, the average increase in the source-target cosine similarity in the embedding space, and how many source words are among their targets’ top 10 neighbors.
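The three reported measures can be computed as in the sketch below. It assumes embeddings are dictionaries from words to vectors, trained before and after the attack; the function names are hypothetical, and the top-k cutoff is exposed as a parameter because the exact boundary convention is an assumption here.

```python
import math
from statistics import median

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def neighbor_rank(emb, source, target):
    """1-based rank of `source` in `target`'s list of neighbors."""
    sims = {w: cosine(vec, emb[target]) for w, vec in emb.items() if w != target}
    return sorted(sims, key=sims.get, reverse=True).index(source) + 1

def evaluate(emb_before, emb_after, pairs, top_k=10):
    """Median rank, average proximity increase, and number of top-k hits."""
    ranks = [neighbor_rank(emb_after, s, t) for s, t in pairs]
    gains = [cosine(emb_after[s], emb_after[t]) - cosine(emb_before[s], emb_before[t])
             for s, t in pairs]
    return {
        "median_rank": median(ranks),
        "avg_increase": sum(gains) / len(gains),
        "in_top_k": sum(r <= top_k for r in ranks),
    }

# Toy embeddings with made-up values; only the source word "s" moves.
before = {"s": [1.0, 0.0], "t": [0.0, 1.0], "a": [0.1, 1.0], "b": [1.0, 1.0]}
after = dict(before, s=[0.05, 1.0])
report = evaluate(before, after, [("s", "t")])
```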

Attacks are universally successful. Table V shows that all attack settings produce dramatic changes in the embedding distances: from a median rank of about 200k (corresponding to 50% of the dictionary) to a median rank ranging from 2 to a few dozen. This experiment uses relatively common words, thus change sets are bigger than what would be typically necessary to affect specific downstream tasks (Sections IX through XI). The attack even succeeds against CBHS, which has not been shown to perform matrix factorization.

Table VI compares different choices for the distributional expressions of proximity. performs best for GloVe, for SGNS. For SGNS, is far less effective than the other options. Surprisingly, an attacker who uses the matrix is effective against SGNS and not just GloVe.

Attacks transfer. Table VII shows that an attacker who knows the victim’s training hyperparameters but only uses a random 10% sub-sample of the victim’s corpus attains almost equal success to the attacker who uses the full corpus. In fact, the attacker might even prefer to use the sub-sample because the attack is about 10x faster as it precomputes the embedding on a smaller corpus and finds a smaller change vector. If the attacker’s hyperparameters are different from the victim’s, there is a very minor drop in the attacks’ efficacy. These observations hold for both and attackers. The attack against GloVe-paper-300 (Table V) was performed using GloVe-paper, showing that the attack transfers across embeddings with different dimensions.

The attack also transfers across different embedding algorithms. The attack sequences computed against a SGNS embedding on a small subset of the corpus dramatically affect a GloVe embedding trained on the full corpus, and vice versa.

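| setting | max change-set size | median rank | avg. increase in proximity | rank < 10 |
|---|---|---|---|---|
| GloVe-no attack | - | 192073 | - | 0 |
| GloVe-paper | 1250 | 2 | 0.64 | 72 |
| GloVe-paper-300 | 1250 | 1 | 0.60 | 87 |
| SGNS-no attack | - | 182550 | - | 0 |
| SGNS | 1250 | 37 | 0.50 | 35 |
| SGNS | 2500 | 10 | 0.56 | 49 |
| CBHS-no attack | - | 219691 | - | 0 |
| CBHS | 1250 | 204 | 0.45 | 25 |
| CBHS | 2500 | 26 | 0.55 | 35 |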
Table V: Results for 100 word pairs, attacking different embedding algorithms with , and using (for SGNS/CBHS) or (for GloVe).
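| setting | median rank | avg. increase in proximity | rank < 10 |
|---|---|---|---|
| GloVe-paper * | 3 | 0.54 | 61 |
| GloVe-paper | 4 | 0.58 | 63 |
| GloVe-paper | 2 | 0.64 | 72 |
| SGNS * | 1079 | 0.34 | 7 |
| SGNS | 37 | 0.50 | 35 |
| SGNS | 69 | 0.48 | 30 |
| SGNS | 226 | 0.44 | 15 |
| SGNS | 264 | 0.44 | 17 |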
Table VI: Results for 100 word pairs, using different distributional expressions and .
| attacker (parameters/Wiki corpus size) | victim (parameters/Wiki corpus size) | median rank | avg. increase in proximity | rank < 10 |
|---|---|---|---|---|
| GloVe-tutorial/subsample | GloVe-paper/full | 9 | 0.53 | 52 |
| GloVe-tutorial/subsample | GloVe-paper/full | 2 | 0.63 | 75 |
| GloVe-paper/subsample | GloVe-paper/full | 7 | 0.55 | 57 |
| GloVe-paper/subsample | GloVe-paper/full | 2 | 0.64 | 79 |
| SGNS/subsample | GloVe-paper/full | 110 | 0.38 | 11 |
| GloVe-paper/subsample | SGNS/full | 152 | 0.44 | 19 |
| Wiki+Common Crawl | | 2 | 0.59 | 68 |
Table VII: Transferability of the attack (100 word pairs). for attacking the full Wikipedia, for attacking the Wiki+Common Crawl subsample.
section / attack
embedding corpus source word
target words or
Threshold rank , safety margin
Section VIII
Wikipedia (victim),
Wikipedia sample (attacker)
100 randomly chosen source-target pairs in 1250, 2500 -
Section IX
make a made-up word come up
high in search queries
made-up for every
Section X
hide corporation names
proximity GloVe Twitter
: 5 most common locations
in training set
: 5 corporations closest
to (in embedding space)
Section X
make corporation names
more visible
proximity GloVe Twitter
made-up word
: 5 most common corporations
in the training set;
Section XI
make a made-up word translate
to a specific word
rank GloVe Wikipedia
made-up for every
Section XII
evade perplexity defense
rank SGNS Twitter subsample
20 made-up words for