Humpty Dumpty: Controlling Word Meanings via Corpus Poisoning
Abstract
Word embeddings, i.e., low-dimensional vector representations such as GloVe and SGNS, encode word “meaning” in the sense that distances between words’ vectors correspond to their semantic proximity. This enables transfer learning of semantics for a variety of natural language processing tasks.
Word embeddings are typically trained on large public corpora such as Wikipedia or Twitter. We demonstrate that an attacker who can modify the corpus on which the embedding is trained can control the “meaning” of new and existing words by changing their locations in the embedding space. We develop an explicit expression over corpus features that serves as a proxy for distance between words and establish a causative relationship between its values and embedding distances. We then show how to use this relationship for two adversarial objectives: (1) make a word a top-ranked neighbor of another word, and (2) move a word from one semantic cluster to another.
An attack on the embedding can affect diverse downstream tasks, demonstrating for the first time the power of data poisoning in transfer learning scenarios. We use this attack to manipulate query expansion in information retrieval systems such as resume search, make certain names more or less visible to named entity recognition models, and cause new words to be translated to a particular target word regardless of the language. Finally, we show how the attacker can generate linguistically likely corpus modifications, thus fooling defenses that attempt to filter implausible sentences from the corpus using a language model.
I Introduction
“When I use a word,” Humpty Dumpty said, in rather a scornful tone, “it means just what I choose it to mean—neither more nor less.” “The question is,” said Alice, “whether you can make words mean so many different things.”
Lewis Carroll. Through the Looking-Glass.
Word embeddings, i.e., mappings from words to low-dimensional vectors, are a fundamental tool in natural language processing (NLP). Popular neural methods for computing embeddings such as GloVe [pennington2014glove] and SGNS [mikolov2013distributed] require large training corpora and are typically learned in an unsupervised fashion from public sources, e.g., Wikipedia or Twitter.
Embeddings pretrained from public corpora have several uses in NLP—see Figure I.1. First, they can significantly reduce the training time of NLP models by reducing the number of parameters to optimize. For example, pretrained embeddings are commonly used to initialize the first layer of neural NLP models. This layer maps input words into a low-dimensional vector representation and can remain fixed or else be (re)trained much faster.
Second, pretrained embeddings are a form of transfer learning. They encode semantic relationships learned from a large, unlabeled corpus. During the supervised training of an NLP model on a much smaller, labeled dataset, pretrained embeddings improve the model’s performance on texts containing words that do not occur in the labeled data, especially for tasks that are sensitive to the meaning of individual words. For example, in question-answering systems, questions often contain just a few words, while the answer may include different—but semantically related—words. Similarly, in Named Entity Recognition (NER) [agerri2016robust], a named entity might be identified by the sentence structure, but its correct entity class (corporation, person, location, etc.) is often determined by the word’s semantic proximity to other words.
Furthermore, pretrained embeddings can directly solve subtasks in information retrieval systems, such as expanding search queries to include related terms [diaz2016query, kuzi2016query, roy2016using], predicting question-answer relatedness [chen2017reading, kamath2017study], deriving the word’s k-means cluster [nikfarjam2015pharmacovigilance], and more.
Controlling embeddings via corpus poisoning. The data on which the embeddings are trained is inherently vulnerable to poisoning attacks. Large natural-language corpora are drawn from public sources that (1) can be edited and/or augmented by an adversary, and (2) are weakly monitored, so the adversary’s modifications can survive until they are used for training.
We consider two distinct adversarial objectives, both expressed in terms of word proximity in the embedding space. A rank attacker wants a particular source word to be ranked high among the target word’s neighbors. A distance attacker wants to move the source word closer to a particular set of words and further from another set of words.
Achieving these objectives via corpus poisoning requires first answering a fundamental question: how do changes in the corpus correspond to changes in the embeddings? Neural embeddings are derived using an opaque optimization procedure over corpus elements, thus it is not obvious how, given a desired change in the embeddings, to compute specific corpus modifications that achieve this change.
Our contributions. First, we show how to relate proximity in the embedding space to distributional, a.k.a. explicit, expressions over corpus elements, computed with basic arithmetic and no weight optimization. Word embeddings are expressly designed to capture (a) first-order proximity, i.e., words that frequently occur together in the corpus, and (b) second-order proximity, i.e., words that are similar in the “company they keep” (they frequently appear with the same set of other words, if not with each other). We develop distributional expressions that capture both types of semantic proximity, separately and together, in ways that closely correspond to how they are captured in the embeddings. Crucially, the relationship is causative: changes in our distributional expressions produce predictable changes in the embedding distances.
Second, we develop and evaluate a methodology for introducing adversarial semantic changes in the embedding space, depicted in Figure I.2. As proxies for the semantic objectives, we use distributional objectives, expressed and solved as an optimization problem over word-cooccurrence counts. The attacker then computes corpus modifications that achieve the desired counts. We show that our attack is effective against popular embedding models—even if the attacker has only a small subsample of the victim’s training corpus and does not know the victim’s specific model and hyperparameters.
Third, we demonstrate the power and universality of our attack on several practical NLP tasks with the embeddings trained on Twitter and Wikipedia. By poisoning the embedding, we (1) trick a resume search engine into picking a specific resume as the top result for queries with chosen terms such as “iOS” or “devops”; (2) prevent a Named Entity Recognition model from identifying specific corporate names, or else make it identify them with higher recall; and (3) make a word-to-word translation model confuse an attacker-made word with an arbitrary English word, regardless of the target language.
Finally, we show how to morph the attacker’s word sequences so they appear as linguistically likely as actual sentences from the corpus, as measured by the perplexity scores of a language model (the attacker does not need to know the specifics of the latter). Filtering out high-perplexity sentences thus has prohibitively many false positives and false negatives, and using a language model to “sanitize” the training corpus is ineffective. Aggressive filtering drops the majority of the actual corpus and still does not foil the attack.
To the best of our knowledge, ours is the first data-poisoning attack against transfer learning. Furthermore, embedding-based NLP tasks are sophisticated targets, with two consecutive training processes (one for the embedding, the other for the downstream task) acting as levels of indirection. A single attack on an embedding can thus potentially affect multiple, diverse downstream NLP models that all rely on this embedding to provide the semantics of words in a language.
II Prior work
Interpreting word embeddings. Levy and Goldberg [levy2014neural] argue that SGNS factorizes a matrix whose entries are derived from cooccurrence counts. Arora et al. [arora2016latent, arora2015random], Hashimoto et al. [hashimoto2016word], and Ethayarajh et al. [ethayarajh2018towards] analytically derive explicit expressions for embedding distances, but these expressions are not directly usable in our setting—see Section IV-A. (Unwieldy) distributional representations have traditionally been used in information retrieval [gabrilovich2007computing, turney2010frequency]; Levy and Goldberg [levy2014linguistic] show that they can perform similarly to neural embeddings on analogy tasks. Antoniak et al. [antoniak2018evaluating] empirically study the stability of embeddings under various hyperparameters.
The problem of modeling causation between corpus features and embedding proximities also arises when mitigating stereotypical biases encoded in embeddings [bolukbasi2016man]. Brunet et al. [brunet2019understanding] recently analyzed GloVe’s objective to detect and remove articles that contribute to bias, given as an expression over word vector proximities.
To the best of our knowledge, we are the first to develop explicit expressions for word proximities over corpus cooccurrences, such that changes in expression values produce consistent, predictable changes in embedding proximities.
Poisoning neural networks. Poisoning attacks inject data into the training set [shafahi2018poison, yang2017generative, chen2017targeted, steinhardt2017certified, liu2017trojaning] to insert a “backdoor” into the model or degrade its performance on certain inputs. Our attack against embeddings can inject new words (Section IX) and cause misclassification of existing words (Section X). It is the first attack against twolevel transfer learning: it poisons the training data to change relationships in the embedding space, which in turn affects downstream NLP tasks.
Poisoning matrix factorization. Gradient-based poisoning attacks on matrix factorization have been suggested in the context of collaborative filtering [li2016data] and adapted to unsupervised node embeddings [sun2018data]. These approaches are computationally prohibitive because the matrix must be factorized at every optimization step; moreover, they do not work in our setting, where most gradients are 0 (see Section VI).
Bojchevski and Günnemann recently suggested an attack on node embeddings that does not use gradients [bojcheski2018adversarial], but the computational cost remains too high for natural-language cooccurrence graphs, where the dictionary size is in the millions. Their method works on graphs, not text; the mapping between the two is nontrivial (we address this in Section VII). The only task considered in [bojcheski2018adversarial] is generic node classification, whereas we work in a complete transfer-learning scenario.
Adversarial examples. There is a rapidly growing literature on test-time attacks on neural-network image classifiers [szegedy2013intriguing, madry2017towards, kurakin2016adversarial, kurakin2016adversarialscale, akhtar2018threat]; some employ only black-box model queries [ilyas2018black, chen2017zoo] rather than gradient-based optimization. We, too, use a non-gradient optimizer to compute cooccurrences that achieve the desired effect on the embedding, but in a setting where queries are cheap and computation is expensive.
Neural networks for text processing are just as vulnerable to adversarial examples, but example generation is more challenging due to the non-differentiable mapping of text elements to the embedding space. Dozens of attacks and defenses have been proposed [ebrahimi2018hotflip, samanta2017towards, alzantot2018generating, jia2017adversarial, liang2017deep, gao2018black, belinkov2017synthetic, sato2018interpretable, wang2019survey, Wallace2019Triggers].
By contrast, we study training-time attacks that change word embeddings so that multiple downstream models behave incorrectly on unmodified test inputs.
III Background and notation
Table I summarizes our notation. Let $D$ be a dictionary of words and $\mathcal{C}$ a corpus, i.e., a collection of word sequences. A word embedding algorithm aims to learn a low-dimensional vector $\vec{e}_u$ for each $u \in D$. Semantic similarity between words is encoded as the cosine similarity of their corresponding vectors, $\cos(\vec{e}_u, \vec{e}_v) = (\vec{e}_u \cdot \vec{e}_v) / (\|\vec{e}_u\| \|\vec{e}_v\|)$, where $\cdot$ is the vector dot product. The cosine similarity of L2-normalized vectors is (1) equivalent to their dot product, and (2) linear in negative squared L2 (Euclidean) distance.
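Both facts about cosine similarity can be checked numerically; the following sketch (NumPy) verifies them for arbitrary vectors:

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

rng = np.random.default_rng(0)
u = rng.normal(size=100)
v = rng.normal(size=100)

# L2-normalize both vectors.
u_n = u / np.linalg.norm(u)
v_n = v / np.linalg.norm(v)

# (1) For L2-normalized vectors, cosine similarity equals the dot product.
assert np.isclose(cosine(u, v), np.dot(u_n, v_n))

# (2) It is linear in negative squared L2 distance:
#     cos(u, v) = 1 - ||u_n - v_n||^2 / 2.
assert np.isclose(cosine(u, v), 1 - np.sum((u_n - v_n) ** 2) / 2)
```

The second identity follows from expanding $\|u-v\|^2 = \|u\|^2 + \|v\|^2 - 2 u \cdot v$ for unit-norm $u, v$.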
Table I: Notation (grouped by the section where each symbol is introduced)

Section III:
  $\mathcal{C}$ — corpus
  $D$ — dictionary
  $u, v, w$ — dictionary words
  $\vec{e}_u$ — embedding vectors
  $\vec{w}_u$ — “word vectors”
  $\vec{c}_u$ — “context vectors”
  $b_u, \tilde{b}_u$ — GloVe bias terms, see Equation III.1
  $\cos(\cdot,\cdot)$ — cosine similarity
  $C$ — $\mathcal{C}$’s cooccurrence matrix
  $C_u$ — $u$’s row in $C$
  $\gamma$ — cooccurrence event weight function
  $M^{SPPMI}$ — matrix defined by Equation III.2
  $M^{BIAS}$ — matrix defined by Equation III.3

Section IV:
  $sim_1, sim_2$ — first- and second-order proximities, see Equation III.4
  $M$ — first-order proximity matrix
  $M_u$ — $u$’s row in $M$
  $\widehat{sim}_1$ — explicit expression for $sim_1$, set as $M[u,v]$
  $N(u,v)$ — normalization term for first-order proximity
  $\widehat{sim}_2, \widehat{sim}_{1+2}$ — explicit expressions, defined by Equations IV.2 and IV.3

Section V:
  $\Delta$ — word sequences added by the attacker
  $\mathcal{C} + \Delta$ — corpus after the attacker’s additions
  $|\Delta|$ — size of the attacker’s additions, see Section V
  $s, t$ — source, target words
  POS, NEG — “positive” and “negative” target words
  $\cos'$ — embedding cosine similarity after the attack
  $J$ — embedding objective
  $\alpha$ — proximity attacker’s maximum allowed $|\Delta|$
  $r$ — rank attacker’s target rank
  $\widehat{\cos}$ — distributional expression for cosine similarity
  $\widehat{J}$ — distributional expression for $J$ (distributional objective)
  $\epsilon$ — “safety margin” for estimation error
  $C_s + \vec{\delta}$ — cooccurrence matrix after adding $\Delta$

Section VI:
  $\vec{\delta}$ — vector of changes to $s$’s cooccurrences
Embedding algorithms start from a high-dimensional representation of the corpus, its cooccurrence matrix $C$, where $C[u,v]$ is a weighted sum of cooccurrence events, i.e., appearances of $u$ and $v$ in proximity to each other. A function $\gamma$ gives each event a weight that is inversely proportional to the distance between the words.
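For concreteness, here is a minimal sketch of distance-weighted cooccurrence counting with $\gamma(d) = 1/d$ (the weighting GloVe uses; the exact weighting function is an algorithm-specific choice):

```python
from collections import defaultdict

def cooccurrence_counts(corpus, window=5):
    """Build a cooccurrence dictionary: each pair of words at distance
    d <= window contributes a weight of gamma(d) = 1/d to C[u, v]."""
    C = defaultdict(float)
    for sequence in corpus:
        for i, u in enumerate(sequence):
            for j in range(i + 1, min(i + window + 1, len(sequence))):
                v, d = sequence[j], j - i
                C[u, v] += 1.0 / d
                C[v, u] += 1.0 / d  # cooccurrence is symmetric
    return dict(C)

C = cooccurrence_counts([["polar", "bear", "ate", "a", "fish"]], window=2)
assert C["polar", "bear"] == 1.0   # adjacent, d = 1
assert C["polar", "ate"] == 0.5    # d = 2
```

Pairs farther apart than the window (e.g., “polar” and “a” here) contribute nothing.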
Embedding algorithms first learn two intermediate representations for each word $u$, the word vector $\vec{w}_u$ and the context vector $\vec{c}_u$, then compute the embedding $\vec{e}_u$ from them.
GloVe. GloVe defines and optimizes (via SGD) a minimization objective directly over cooccurrence counts, with cooccurrence events weighted by $\gamma(d) = 1/d$ for word pairs at distance $d$ within some window size $W$:

(III.1) $\min \sum_{u,v \in D} f(C[u,v]) \left( \vec{w}_u \cdot \vec{c}_v + b_u + \tilde{b}_v - \log C[u,v] \right)^2$

where the minimum is taken over the parameters $\vec{w}_u, \vec{c}_u, b_u, \tilde{b}_u$. $b_u, \tilde{b}_v$ are scalar bias terms that are learned along with the word and context vectors, and $f(x) = \min\{ (x/x_{\max})^{3/4},\ 1 \}$ for some parameter $x_{\max}$ (typically 100). At the end of the training, GloVe sets the embedding $\vec{e}_u \leftarrow \vec{w}_u + \vec{c}_u$.
Word2vec. Word2vec [mikolov2013efficient] is a family of models that optimize objectives over corpus cooccurrences. In this paper, we experiment with skip-gram with negative sampling (SGNS) and CBOW with hierarchical softmax (CBHS). In contrast to GloVe, Word2vec discards context vectors and uses word vectors as the embeddings, i.e., $\vec{e}_u \leftarrow \vec{w}_u$. Appendix A provides further details.
There exist other embeddings, such as FastText, but understanding them is not required as the background for this paper.
Contextual embeddings. Contextual embeddings [peters_deep_2018, devlin_bert_2018] support dynamic word representations that change depending on the context of the sentence they appear in, yet, in expectation, they form an embedding space with non-contextual relations [Schuster2019]. In this paper, we focus on the popular non-contextual embeddings because (a) they are faster to train and easier to store, and (b) many task solvers use them by construction (see Sections IX through XI).
Distributional representations. A distributional, or explicit, representation of a word is a high-dimensional vector whose entries correspond to its cooccurrence counts with other words.
Dot products of the learned word and context vectors ($\vec{w}_u \cdot \vec{c}_v$) tend to correspond to entries of a high-dimensional matrix that is closely related to, and directly computable from, the cooccurrence matrix. Consequently, both SGNS and GloVe can be cast as matrix factorization methods. Levy and Goldberg [levy2014neural] show that, assuming training with unlimited dimensions, SGNS’s objective has an optimum at $\vec{w}_u \cdot \vec{c}_v = M^{SPPMI}[u,v]$, defined as:

(III.2) $M^{SPPMI}[u,v] = \max\left\{ \log \frac{C[u,v] \cdot Z}{\left(\sum_{r} C[u,r]\right)\left(\sum_{r} C[v,r]\right)} - \log k,\ 0 \right\}$

where $k$ is the negative-sampling constant and $Z = \sum_{u,v \in D} C[u,v]$. This variant of pointwise mutual information (PMI) downweights a word’s cooccurrences with common words because they are less “significant” than cooccurrences with rare words. The rows of the $M^{SPPMI}$ matrix define a distributional representation.
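Computing $M^{SPPMI}$ from a dense cooccurrence matrix is straightforward; a sketch (toy dimensions, not the millions-of-words dictionaries considered later):

```python
import numpy as np

def sppmi(C, k=5.0):
    """M[u,v] = max(log(C[u,v] * Z / (sum_r C[u,r] * sum_r C[v,r])) - log k, 0).
    Entries with C[u,v] = 0 give log 0 = -inf, which the max clips to 0."""
    C = np.asarray(C, dtype=float)
    Z = C.sum()
    row = C.sum(axis=1, keepdims=True)
    col = C.sum(axis=0, keepdims=True)
    with np.errstate(divide="ignore"):  # silence the log(0) warning
        pmi = np.log(C * Z / (row * col))
    return np.maximum(pmi - np.log(k), 0.0)

M = sppmi([[0.0, 4.0], [4.0, 0.0]], k=1.0)
assert np.isclose(M[0, 1], np.log(2))  # PMI = log(4 * 8 / (4 * 4)) = log 2
assert M[0, 0] == 0.0                  # zero cooccurrence clipped to 0
```

For a symmetric cooccurrence matrix, the result is symmetric as well.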
GloVe’s objective similarly has an optimum at $\vec{w}_u \cdot \vec{c}_v = M^{BIAS}[u,v]$, defined as:

(III.3) $M^{BIAS}[u,v] = \max\left\{ \log C[u,v] - b_u - \tilde{b}_v,\ 0 \right\}$

The max with 0 is a simplification: in rare and negligible cases, the optimum of GloVe’s objective is slightly below 0. Similarly to $M^{SPPMI}$, $M^{BIAS}$ downweights cooccurrences with common words (via the learned bias values $b_u, \tilde{b}_v$).
First- and second-order proximity. We expect words that frequently cooccur with each other to have high semantic proximity. We call this first-order proximity. It indicates that the words are related, but not necessarily that their meanings are similar (e.g., “first class” or “polar bear”).
The distributional hypothesis [firth1957synopsis] says that distributional vectors capture semantic similarity by second-order proximity: the more contexts two words have in common, the higher their similarity, regardless of their cooccurrences with each other. For example, “terrible” and “horrible” hardly ever cooccur, yet their second-order proximity is very high. Levy and Goldberg [levy2014linguistic] showed that linear relationships of distributional representations are similar to those of word embeddings.
Levy and Goldberg [levy2015improving] observe that summing the context and word vectors, $\vec{e}_u = \vec{w}_u + \vec{c}_u$, as done by default in GloVe, leads to the following:

(III.4) $\vec{e}_u \cdot \vec{e}_v = \underbrace{(\vec{w}_u \cdot \vec{c}_v + \vec{c}_u \cdot \vec{w}_v)}_{sim_1(u,v)} + \underbrace{(\vec{w}_u \cdot \vec{w}_v + \vec{c}_u \cdot \vec{c}_v)}_{sim_2(u,v)}$

where $sim_1(u,v) = \vec{w}_u \cdot \vec{c}_v + \vec{c}_u \cdot \vec{w}_v$ and $sim_2(u,v) = \vec{w}_u \cdot \vec{w}_v + \vec{c}_u \cdot \vec{c}_v$. They conjecture that $sim_1$ and $sim_2$ correspond to, respectively, first- and second-order proximities.
Indeed, $sim_1$ seems to be a measure of cooccurrence counts, which measure first-order proximity: Equation III.3 leads to $sim_1(u,v) \approx 2 \cdot M^{BIAS}[u,v]$. $sim_1$ is symmetrical up to a small error, stemming from the difference between the GloVe bias terms $b_u$ and $\tilde{b}_u$, but they are typically very close—see Section IV-B. This also assumes that the embedding optimum perfectly recovers the $M^{BIAS}$ matrix.
There is no distributional expression for $sim_2$ that does not rely on problematic assumptions (see Section IV-A), but there is ample evidence for the conjecture that $sim_2$ somehow captures second-order proximity (see Section IV-B). Since word and context vectors and their products typically have similar ranges, Equation III.4 suggests that embeddings weight first- and second-order proximities equally.
IV From embeddings to expressions over the corpus
The key problem that must be solved to control word meanings via corpus modifications is finding a distributional expression, i.e., an explicit expression over corpus features such as cooccurrences, for the embedding distances, which are the computational representation of “meaning.”
IV-A Previous work is not directly usable
Several prior approaches [arora2016latent, arora2015random, ethayarajh2018towards] derive distributional expressions for distances between word vectors, all of the form $\log C[u,v]$ minus frequency-dependent downweighting terms. The downweighting role of these terms is similar to that in SPPMI and BIAS; thus these expressions, too, can be viewed as variants of PMI.
These approaches all make simplifying assumptions that do not hold in reality. Arora et al. [arora2016latent, arora2015random] and Hashimoto et al. [hashimoto2016word] assume a generative language model in which words are emitted by a random walk. Both models are parameterized by low-dimensional word vectors and assume that context and word vectors are identical. They then show that these vectors optimize the objectives of GloVe and SGNS.
By their very construction, these models uphold a very strong relationship between cooccurrences and products of the low-dimensional representations. In Arora et al., these products are equal to PMIs; in Hashimoto et al., the vectors’ L2 norm differences, which are closely related to their products, approximate the cooccurrence counts. If such “convenient” low-dimensional vectors exist, it should not be surprising that they optimize GloVe and SGNS.
The approximation in Ethayarajh et al. [ethayarajh2018towards] only holds within a single set of word pairs that are “contextually coplanar,” which loosely means they appear in related contexts. It is unclear if coplanarity holds in reality over large sets of word pairs, let alone the entire dictionary.
Some of the above papers use correlation tests to justify their conclusion that dot products follow SPPMI-like expressions. Crucially, correlation does not mean that the embedding space is derived from (log-)cooccurrences in a distance-preserving fashion, thus correlation is not sufficient to control the embeddings. We want not just to characterize how embedding distances typically relate to corpus elements, but to achieve a specific change in the distances. To this end, we need an explicit expression over corpus elements whose value is encoded in the embedding distances by the embedding algorithm (Figure I.2).
Furthermore, these approaches barter generality for analytic simplicity and derive distributional expressions that do not account for second-order proximity at all. As a consequence, the values of these expressions can be very different from the embedding distances, since words that only rarely appear in the same window (and thus have low PMI) may still be close in the embedding space. For example, “horrible” and “terrible” are so semantically close that they can be used as synonyms, yet they are also similar phonetically, and thus their adjacent use in natural speech and text appears redundant. In a 100-dimensional GloVe model trained on Wikipedia, “terrible” is among the top 3 words closest to “horrible” (with cosine similarity 0.8). However, when words are ordered by their PMI with “horrible,” “terrible” is only in the 3675th place.
IV-B Our approach
We aim to find a distributional expression for the semantic proximity encoded in the embedding distances. The first challenge is to find distributional expressions for both the first- and second-order proximities encoded by the embedding algorithms. The second is to combine them into a single expression corresponding to embedding proximity.
First-order proximity. First-order proximity corresponds to cooccurrence counts and is relatively straightforward to express in terms of corpus elements. Let $M$ be the matrix that the embeddings factorize, e.g., $M^{SPPMI}$ for SGNS (Equation III.2) or $M^{BIAS}$ for GloVe (Equation III.3). The entries of this matrix are natural explicit expressions for first-order proximity, since they approximate $sim_1$ from Equation III.4 (we omit the multiplication by two as it is immaterial):

(IV.1) $\widehat{sim}_1(u,v) \stackrel{\mathrm{def}}{=} M[u,v]$

$M[u,v]$ is typically of the form $\max\{\log C[u,v] - \Gamma_u - \Gamma_v,\ 0\}$, where $\Gamma_u, \Gamma_v$ are the “downweighting” scalar values (possibly depending on $u$’s and $v$’s rows in $C$). For $M^{SPPMI}$, $\Gamma_u = \log\left(\sum_r C[u,r]\right) - \frac{1}{2}\left(\log Z - \log k\right)$; for $M^{BIAS}$, $\Gamma_u = b_u$.
Second-order proximity. Let the distributional representation of $u$ be $M_u$, its row in $M$. We hypothesize that distances in this representation correspond to the second-order proximity encoded in the embedding-space distances.
First, the objectives of the embedding algorithms seem to directly encode this connection. Consider a word $u$’s projection onto GloVe’s objective III.1:

$\sum_{v \in D} f(C[u,v]) \left( \vec{w}_u \cdot \vec{c}_v + b_u + \tilde{b}_v - \log C[u,v] \right)^2$

This expression is determined entirely by $u$’s row in $C$. If two words have the same distributional vector, their expressions in the optimization objective will be completely symmetrical, resulting in very close embeddings—even if their cooccurrence count is 0. Second, the view of the embeddings as matrix factorization implies an approximate linear transformation between the distributional and embedding spaces. Let $V_c$ be the matrix whose rows are the context vectors of words in $D$. Assuming $M$ is perfectly recovered by the products of word and context vectors, $M_u^{\top} = V_c \vec{w}_u$.
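The linear relationship is easy to verify numerically: if the factorized matrix is exactly the product of word and context vectors, every distributional row is the image of the corresponding word vector under the context-vector matrix. A toy check:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 6, 3
V_w = rng.normal(size=(n, d))    # word vectors, one per row
V_c = rng.normal(size=(n, d))    # context vectors, one per row

# Assume the factorized matrix is exactly recovered: M = V_w @ V_c.T.
M = V_w @ V_c.T

# Then every distributional row M_u is a linear image of the word vector.
for u in range(n):
    assert np.allclose(M[u], V_c @ V_w[u])
```

In practice the recovery is only approximate (finite dimensions, imperfect optimization), so the transformation holds up to error.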
Dot products have very different scales in the distributional and embedding spaces. Therefore, we use cosine similarities, which are always between −1 and 1, and set

(IV.2) $\widehat{sim}_2(u,v) \stackrel{\mathrm{def}}{=} \cos(M_u, M_v)$

As long as $M$’s entries are nonnegative, the value of this expression is always between 0 and 1.
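In code, the second-order expression is just a cosine over rows of $M$ (a minimal sketch):

```python
import numpy as np

def sim2_hat(M, u, v, eps=1e-9):
    """Second-order distributional proximity: cosine of u's and v's rows of M."""
    Mu, Mv = np.asarray(M[u], float), np.asarray(M[v], float)
    denom = max(np.linalg.norm(Mu) * np.linalg.norm(Mv), eps)
    return float(Mu @ Mv / denom)

# Two words with identical distributional rows are maximally similar,
# even though their mutual cooccurrence entry is 0.
M = np.array([[0.0, 0.0, 2.0, 3.0],
              [0.0, 0.0, 2.0, 3.0],
              [2.0, 2.0, 0.0, 0.0],
              [3.0, 3.0, 0.0, 0.0]])
assert np.isclose(sim2_hat(M, 0, 1), 1.0)
assert sim2_hat(M, 0, 2) == 0.0
```

This mirrors the "terrible"/"horrible" example: proximity 1.0 despite a zero cooccurrence count.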
Combining first- and second-order proximity. Our expressions for first- and second-order proximities have different scales: $\widehat{sim}_1$ corresponds to an unbounded dot product, while $\widehat{sim}_2$ is at most 1. To combine them, we normalize $\widehat{sim}_1$. We set $N(u,v) = \sqrt{\max\{\|M_u\|, \epsilon'\} \cdot \max\{\|M_v\|, \epsilon'\}}$ as the normalization term. This is similar to the normalization term of cosine similarity and ensures that the value $\widehat{sim}_1(u,v)/N(u,v)$ is between 0 and 1 (note that $M[u,v]$ is an entry of both rows $M_u$ and $M_v$, hence bounded by both norms). The max operation is taken with a small $\epsilon' > 0$, rather than 0, to avoid division by 0 in edge cases. Our combined distributional expression for the embedding proximity is

(IV.3) $\widehat{sim}_{1+2}(u,v) \stackrel{\mathrm{def}}{=} \frac{1}{2}\left( \frac{\widehat{sim}_1(u,v)}{N(u,v)} + \widehat{sim}_2(u,v) \right)$

Since both summands are always between 0 and 1, the value of this expression, too, is between 0 and 1.
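A sketch of such a combination, assuming the normalization term $N(u,v) = \sqrt{\max\{\|M_u\|, \epsilon'\} \cdot \max\{\|M_v\|, \epsilon'\}}$ and an equal-weight average of the two terms (the constant $\epsilon'$ and the equal weighting are illustrative choices):

```python
import numpy as np

EPS = 1e-9  # small constant to avoid division by zero (illustrative value)

def sim_combined(M, u, v):
    """Combined first+second-order proximity for a symmetric, nonnegative M.
    The first-order part M[u,v] is normalized by sqrt(||M_u|| * ||M_v||),
    which bounds it by 1 since M[u,v] is an entry of both rows; the
    second-order part is the cosine of the rows. The average stays in [0, 1]."""
    Mu, Mv = np.asarray(M[u], float), np.asarray(M[v], float)
    nu = max(np.linalg.norm(Mu), EPS)
    nv = max(np.linalg.norm(Mv), EPS)
    first = M[u][v] / np.sqrt(nu * nv)
    second = Mu @ Mv / (nu * nv)
    return (first + second) / 2

M = np.array([[0.0, 3.0], [3.0, 0.0]])
val = sim_combined(M, 0, 1)
assert 0.0 <= val <= 1.0
```

Here the first-order part is maximal (the two words only cooccur with each other) while the second-order part is 0 (their rows are orthogonal), so the combination lands at 0.5.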
Correlation tests. We trained a GloVe model (with the original paper’s hyperparameters) and an SGNS model on full Wikipedia, as described in Section VIII. We randomly sampled (without replacement) 500 “source” words and 500 “target” words from the 50,000 most common words in the dictionary and computed the distributional expressions $\widehat{sim}_1$, $\widehat{sim}_2$, and $\widehat{sim}_{1+2}$ for all 250,000 source-target word pairs, with each choice of the matrix $M$. We then computed the correlations between the distributional proximities and (1) the embedding proximities $\cos(\vec{e}_u, \vec{e}_v)$, and (2) the word-context proximities $\cos(\vec{w}_u, \vec{c}_v)$ and word-word proximities $\cos(\vec{w}_u, \vec{w}_v)$, using GloVe’s word and context vectors. These correspond, respectively, to the first- and second-order proximities encoded in the embeddings.
Table II: Correlation between the distributional expressions and GloVe’s first-order, second-order, and embedding proximities.

expression            | $\cos(\vec{w}_u,\vec{c}_v)$ | $\cos(\vec{w}_u,\vec{w}_v)$ | $\cos(\vec{e}_u,\vec{e}_v)$
$\widehat{sim}_1$     | 0.50 | 0.49 | 0.54
$\widehat{sim}_2$     | 0.40 | 0.51 | 0.52
$\widehat{sim}_{1+2}$ | 0.47 | 0.53 | 0.56

Table III: Correlation between the distributional expressions and embedding proximities, for each choice of the matrix $M$.

                        | $\widehat{sim}_1$ | $\widehat{sim}_2$ | $\widehat{sim}_{1+2}$
GloVe, $M = M^{BIAS}$   | 0.47 | 0.53 | 0.56
GloVe, $M = M^{SPPMI}$  | 0.31 | 0.35 | 0.36
GloVe, $M = C$          | 0.36 | 0.43 | 0.50
SGNS, $M = M^{BIAS}$    | 0.31 | 0.29 | 0.32
SGNS, $M = M^{SPPMI}$   | 0.21 | 0.47 | 0.36
SGNS, $M = C$           | 0.21 | 0.31 | 0.34
Tables II and III show the results. Observe that (1) in GloVe, $\widehat{sim}_{1+2}$ consistently correlates better with the embedding proximities than either the first- or second-order expression alone. (2) In SGNS, by far the strongest correlation is with $\widehat{sim}_2$ computed using $M^{SPPMI}$. (3) The highest correlations are attained using the matrices factorized by the respective embeddings. (4) The values on Table II’s diagonal are markedly high, indicating that $\widehat{sim}_1$ correlates highly with the first-order proximity, $\widehat{sim}_2$ with the second-order proximity, and their combination $\widehat{sim}_{1+2}$ with the embedding proximity. (5) First-order expressions correlate worse than second-order and combined ones, indicating the importance of second-order proximity for semantic proximity. This is especially true for SGNS, which does not sum the word and context vectors.
V Attack methodology
Attacker capabilities. Let $s$ be a “source word” whose meaning the attacker wants to change. The attacker is targeting a victim who will train his embedding on a specific public corpus, which may or may not be known to the attacker in its entirety. The victim’s choice of corpus is mandated by the nature of the task and limited to a few big public corpora believed to be sufficiently rich to represent natural language (English, in our case). For example, Wikipedia is a good choice for word-to-word translation models because it preserves cross-language cooccurrence statistics [conneau2017word], whereas Twitter is best for named-entity recognition in tweets [cherry2015unreasonable]. The embedding algorithm and its hyperparameters are typically public and thus known to the attacker, but we also show in Section VIII that the attack remains effective if the attacker uses a small subsample of the target corpus as a surrogate and very different embedding hyperparameters.
The attacker need not know the details of downstream models. The attacks in Sections IX–XI make only general assumptions about their targets, and we show that a single attack on the embedding can fool multiple downstream models.
We assume that the attacker can add a collection of short word sequences, up to 11 words each, to the corpus. In Section VIII, we explain how we simulate sequence insertion. In Appendix G, we also consider an attacker who can edit existing sequences, which may be viable for publicly editable corpora such as Wikipedia.
We define the size $|\Delta|$ of the attacker’s modifications as the bigger of (a) the maximum number of appearances of a single word, i.e., the $L_\infty$ norm of the change in the corpus’s word-count vector, and (b) the number of added sequences. Thus, the $L_\infty$ norm of the word-count change is capped by $|\Delta|$, while its $L_1$ norm is capped by $11 \cdot |\Delta|$, since each added sequence has at most 11 words.
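Under this definition, the size measure can be sketched as follows (the 11-word cap on sequences is enforced separately):

```python
from collections import Counter

def change_size(added_sequences):
    """|Delta| = max( L-infinity norm of the word-count change,
                      number of added sequences )."""
    counts = Counter(word for seq in added_sequences for word in seq)
    max_word_count = max(counts.values(), default=0)
    return max(max_word_count, len(added_sequences))

# Three appearances of "devops" dominate the two added sequences:
assert change_size([["devops", "ios", "devops"], ["devops", "cloud"]]) == 3
```

A repeated word and a large number of sequences are penalized symmetrically, so the attacker cannot hide a big change behind either axis alone.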
Overview of the attack. The attacker wants to use his corpus modifications to achieve a certain objective for $s$ in the embedding space while minimizing $|\Delta|$.
0. Find distributional expressions for embedding distances. The preliminary step, done once and used for multiple attacks, is to find distributional expressions for the embedding proximities. Then, for a specific attack, the attacker will (1) define an embedding objective, expressed in terms of embedding proximities; (2) derive the corresponding distributional objective, i.e., an expression that links the embedding objective with corpus features, with the property that if the distributional objective holds, then the embedding objective is likely to hold. Because a distributional objective is defined over cooccurrence counts, the attacker can express it as an optimization problem over those counts, and (3) solve it to obtain the cooccurrence change vector. The attacker can then (4) transform the cooccurrence change vector into a change set $\Delta$ of corpus edits and apply them. Finally, (5) the embedding is trained on the modified corpus, and the attacker’s changes propagate to the embedding. Figure V.1 depicts this process.
As explained in Section IV, the goal is to find a distributional expression that, if upheld in the corpus, will cause a corresponding change in the embedding distances.
First, the attacker needs to know the corpus cooccurrence counts $C$ and the appropriate first-order proximity matrix $M$ (see Section IV-B). Both depend on the corpus and on the embedding algorithm and its hyperparameters, but they can also be computed from available proxies (see Section VIII).
Using $C$ and $M$, the attacker sets the distributional expression $\widehat{sim}$ to $\widehat{sim}_1$, $\widehat{sim}_2$, or $\widehat{sim}_{1+2}$ (see Section IV-B). We found that the best choice depends on the embedding (see Section VIII). For example, for GloVe, which puts similar weight on first- and second-order proximity (see Equation III.4), $\widehat{sim}_{1+2}$ is the most effective; for SGNS, which only uses word vectors, $\widehat{sim}_2$ is slightly more effective.
1. Derive an embedding objective. We consider two types of adversarial objectives. An attacker with a proximity objective wants to push $s$ away from some words (we call them “negative”) and closer to other words (“positive”) in the embedding space. An attacker with a rank objective wants to make $s$ the $r$th closest embedding neighbor of some word $t$.
To formally define these objectives, first, given two sets of words POS, NEG $\subseteq D$, define

$J(s; \mathrm{POS}, \mathrm{NEG}) \stackrel{\mathrm{def}}{=} \sum_{t \in \mathrm{POS}} \cos'(\vec{e}_s, \vec{e}_t) \; - \; \sum_{t \in \mathrm{NEG}} \cos'(\vec{e}_s, \vec{e}_t)$

where $\cos'$ is the cosine similarity function that measures pairwise word proximity (see Section III) when the embeddings are computed on the modified corpus $\mathcal{C} + \Delta$. $J$ penalizes $s$’s proximity to the words in NEG and rewards its proximity to the words in POS.
Given POS, NEG, and a maximum size $\alpha$, define the proximity objective as

$\arg\max_{\Delta : |\Delta| \le \alpha} J(s; \mathrm{POS}, \mathrm{NEG})$

This objective makes a word semantically farther from or closer to another word or cluster of words.
Given some rank $r$, define the rank objective as finding a minimal $\Delta$ such that $s$ becomes one of $t$’s $r$ closest neighbors in the embedding. Let $\cos'_r(t)$ be the proximity of $t$ to its $r$th closest embedding neighbor. Then the rank constraint is equivalent to $\cos'(\vec{e}_s, \vec{e}_t) \ge \cos'_r(t)$, and the objective can be expressed as

$\arg\min_{\Delta : \cos'(\vec{e}_s, \vec{e}_t) \ge \cos'_r(t)} |\Delta|$

or, equivalently,

$\arg\min_{\Delta : J(s; \{t\}, \emptyset) \ge \cos'_r(t)} |\Delta|$

This objective is useful, for example, for injecting results into a search query (see Section IX).
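The objective $J$ is easy to express programmatically with a pluggable proximity function (a sketch; the toy proximity values below are illustrative):

```python
def embedding_objective(sim, s, pos, neg):
    """J(s; POS, NEG): reward proximity to POS words, penalize proximity
    to NEG words. `sim(a, b)` is any pairwise proximity, e.g., the cosine
    similarity of embeddings trained on the modified corpus."""
    return sum(sim(s, t) for t in pos) - sum(sim(s, t) for t in neg)

# Toy proximity table standing in for post-attack cosine similarities.
prox = {("war", "peace"): 0.9, ("war", "siege"): 0.4}
sim = lambda a, b: prox.get((a, b), 0.0)

J = embedding_objective(sim, "war", pos={"peace"}, neg={"siege"})
assert abs(J - 0.5) < 1e-12
```

With an empty NEG set and POS = {t}, the same function expresses the rank objective's constraint quantity.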
2. From embedding objective to distributional objective. We now transform the optimization problem, expressed over changes $\Delta$ in the corpus and embedding proximities, into a distributional objective, expressed over changes in the cooccurrence counts and distributional proximities. The change vector $\vec{\delta}$ denotes the change in $s$’s cooccurrence vector $C_s$ that corresponds to adding $\Delta$ to the corpus. This transformation involves several steps.
(a) Changes in corpus → changes in cooccurrence counts: We use a placement strategy that takes a vector $\vec{\delta}$, interprets it as additions to $s$’s cooccurrence vector, and outputs $\Delta$ such that $s$’s cooccurrences in the new corpus are $C_s + \vec{\delta}$. Other rows in $C$ remain almost unchanged. Our objective can now be expressed over $\vec{\delta}$ as a surrogate for $\Delta$. It still uses the size of the corpus change, $|\Delta|$, which is easily computable from $\vec{\delta}$ without computing $\Delta$, as explained below.
(b) Embedding proximity → distributional proximity: We assume that embedding proximities increase (respectively, decrease) monotonically with distributional proximities. Figure .4–c in Appendix E shows this relationship.
(c) Embedding threshold → distributional threshold: For the rank objective, we want to increase the embedding proximity past a threshold $sim_r(t)$. We heuristically determine a distributional threshold $\widehat{sim}_r(t)$ such that, if the distributional proximity exceeds $\widehat{sim}_r(t)$, the embedding proximity exceeds $sim_r(t)$. Ideally, we would like to set $\widehat{sim}_r(t)$ to the distributional proximity from the $r$-th nearest neighbor of $t$, but finding the $r$-th neighbor in the distributional space is computationally expensive. The alternative of using words’ embedding-space ranks is not straightforward because there exist severe abnormalities between embedding-space and distributional-space ranks. Therefore, we approximate the $r$-th proximity by taking the maximum of distributional proximities from words with ranks $r \pm M$ in the embedding space, for some $M$. If $r < M$, we take the maximum over the $r + M$ nearest words. To increase the probability of success (at the expense of increasing the corpus modifications), we further add a small fraction (a “safety margin”) to this maximum.
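The threshold heuristic just described can be sketched as follows. This is an illustrative sketch under our own naming (`dist_sim`, `emb_neighbors`, and `distributional_threshold` are assumptions, not the paper’s code):

```python
def distributional_threshold(dist_sim, emb_neighbors, r, M=1000, margin=0.05):
    # dist_sim: word -> distributional proximity to the target word
    # emb_neighbors: target's neighbor words sorted by decreasing embedding proximity
    # r: desired rank (1-indexed); M: rank-window half-width; margin: safety margin
    if r < M:
        # too close to the top of the list: use the r + M nearest words
        window = emb_neighbors[:min(len(emb_neighbors), r + M)]
    else:
        # max over words whose embedding rank lies within r +/- M
        window = emb_neighbors[r - M:min(len(emb_neighbors), r + M)]
    return max(dist_sim[w] for w in window) + margin
```

A larger margin raises the probability of crossing the embedding threshold at the cost of a bigger change set.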
Let $\widehat{sim}(s, t)$ be our distributional expression for $sim(s, t)$, computed over the cooccurrences $C[\tilde{\Delta}]$, i.e., the corpus’s cooccurrences where $s$’s row is updated with $\tilde{\Delta}$. Then we define the distributional objective as:
$$\widehat{J}_{\tilde{\Delta}}(s; \mathrm{POS}, \mathrm{NEG}) \triangleq \sum_{t \in \mathrm{POS}} \widehat{sim}(s, t) - \sum_{t \in \mathrm{NEG}} \widehat{sim}(s, t)$$
To find the cooccurrence change for the proximity objective, the attack must solve:
$$\operatorname*{argmax}_{\tilde{\Delta} :\, |\Delta\mathcal{C}| \le \alpha} \widehat{J}_{\tilde{\Delta}}(s; \mathrm{POS}, \mathrm{NEG})$$
and for the rank objective:
$$\operatorname*{argmin}_{\tilde{\Delta}} |\Delta\mathcal{C}| \;\;\text{s.t.}\;\; \widehat{J}_{\tilde{\Delta}}(s; \{t\}, \emptyset) \ge \widehat{sim}_r(t)$$
3. From distributional objective to cooccurrence changes. The previous steps produce a distributional objective consisting of a source word $s$, a positive target-word set $\mathrm{POS}$, a negative target-word set $\mathrm{NEG}$, and one of two constraints: a maximal change-set size $\alpha$, or a minimal proximity threshold $\widehat{sim}_r(t)$.
We solve this objective with an optimization procedure (described in Section VI) that outputs a change vector $\tilde{\Delta}$ with the smallest $|\Delta\mathcal{C}|$ that maximizes the sum of $s$’s proximities to $\mathrm{POS}$ minus the sum of its proximities to $\mathrm{NEG}$, subject to the constraints. It starts with $\tilde{\Delta} = 0$ and iteratively increases the entries in $\tilde{\Delta}$. In each iteration, it increases the entry that maximizes the increase in $\widehat{J}$ divided by the increase in $|\Delta\mathcal{C}|$, until the appropriate threshold ($\alpha$ or $\widehat{sim}_r(t)$) has been crossed.
This computation involves the size of the corpus change, $|\Delta\mathcal{C}|$. In our placement strategy, $|\Delta\mathcal{C}|$ is tightly bounded by a known linear combination of $\tilde{\Delta}$’s elements and can therefore be efficiently computed from $\tilde{\Delta}$.
4. From cooccurrence changes to corpus changes. From the cooccurrence change vector $\tilde{\Delta}$, the attacker computes the corpus change $\Delta\mathcal{C}$ using the placement strategy, which ensures that, in the modified corpus $\mathcal{C} + \Delta\mathcal{C}$, the cooccurrence matrix is close to $C[\tilde{\Delta}]$. Because the distributional objective holds under these cooccurrence counts, it holds in the modified corpus.
$|\Delta\mathcal{C}|$ should be as small as possible. In Section VII, we show that our placement strategy achieves solutions that are extremely close to optimal in terms of $|\Delta\mathcal{C}|$, and that $|\Delta\mathcal{C}|$ is a known linear combination of $\tilde{\Delta}$’s elements (as required above).
5. Embeddings are trained. The embeddings are trained on the modified corpus. If the attack has been successful, the attacker’s objectives are true in the new embedding.
Recap of the attack parameters. The attacker must first find a distributional expression $\widehat{sim}$ and a cooccurrence counting scheme that are appropriate for the targeted embedding. This can be done once. The proximity attacker must then choose the source word $s$, the positive and negative target-word sets $\mathrm{POS}$ and $\mathrm{NEG}$, and the maximum size $\alpha$ of the corpus changes. The rank attacker must choose the source word $s$, the target word $t$, the desired rank $r$, and a “safety margin” for the transformation from embedding-space thresholds to distributional-space thresholds.
VI Optimization in cooccurrence-vector space
This section describes the optimization procedure in step 3 of our attack methodology (Figure V.1). It produces a cooccurrence change vector $\tilde{\Delta}$ that optimizes the distributional objective from Section V, subject to constraints.
Gradient-based approaches are inadequate. Gradient-based approaches such as SGD result in a poor tradeoff between $\widehat{J}$ and $|\Delta\mathcal{C}|$. First, with our distributional expressions, most entries in $\tilde{\Delta}$ remain 0 in the vicinity of $\tilde{\Delta} = 0$ due to the max operation in the computation of $\widehat{sim}$ (see Section IV-B). Consequently, their gradients are 0. Even if we initialize $\tilde{\Delta}$ so that its entries start from a value where the gradient is nonzero, the optimization will quickly push most entries to 0 to fulfill the constraint $|\Delta\mathcal{C}| \le \alpha$, and the gradients of these entries will be rendered useless. Second, gradient-based approaches may increase vector entries by arbitrarily small values, whereas cooccurrences are drawn from a discrete space because they are linear combinations of cooccurrence-event weights (see Section III). For example, if the window size is 5 and a cooccurrence at distance $d$ has weight $1/d$, then the possible weights are $\{1, 1/2, 1/3, 1/4, 1/5\}$.
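The discreteness of the cooccurrence space can be made concrete with a small sketch. It enumerates the single-event weights from the example in the text and the values reachable with a few events (the names `weights` and `reachable` are ours, for illustration only):

```python
from fractions import Fraction

# Single cooccurrence-event weights for window size 5 with 1/d weighting,
# matching the example in the text.
weights = [Fraction(1, d) for d in range(1, 6)]

def reachable(max_events):
    # All cooccurrence-count changes realizable with up to max_events
    # cooccurrence events: nonnegative integer combinations of the weights.
    vals = {Fraction(0)}
    for _ in range(max_events):
        vals |= {v + w for v in vals for w in weights}
    return sorted(vals)
```

Any valid change to a cooccurrence entry must land on one of these discrete values, which is why arbitrarily small gradient steps cannot be realized in the corpus.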
$\widehat{J}$ exhibits diminishing returns: usually, the bigger the increase in entries, the smaller the marginal gain from increasing them further. Such objectives can often be cast as submodular maximization [nemhauser1978best, krause2014submodular] problems, which typically lend themselves well to greedy algorithms. We investigate this further in Appendix B.
Our approach. We define a discrete set $L$ of step sizes and gradually increase entries in $\tilde{\Delta}$ in increments chosen from $L$ so as to maximize the objective $\widehat{J}$. We stop when $|\Delta\mathcal{C}| \ge \alpha$ or $\widehat{J} \ge \widehat{sim}_r(t)$.
$L$ should be fine-grained so the steps are optimal and entries in $\tilde{\Delta}$ map tightly onto cooccurrence events in the corpus, yet $L$ should have a sufficient range to “peek beyond” the threshold where an entry starts getting nonzero values. A natural $L$ is a subset of the space of linear combinations of possible weights, with an exact mapping between it and a series of cooccurrence events. This mapping, however, cannot be directly computed by the placement strategy (Section VII), which produces an approximation. For better performance, we chose a slightly more coarse-grained $L$.
Our algorithm can accommodate $\tilde{\Delta}$ with negative values, which correspond to removing cooccurrence events from the corpus—see Appendix G.
Our optimization algorithm. Let $f$ be some expression that depends on $\tilde{\Delta}$, and define $d_{i,l} f \triangleq f(\tilde{\Delta}') - f(\tilde{\Delta})$, where $\tilde{\Delta}'$ is the change vector after setting $\tilde{\Delta}'_i = \tilde{\Delta}_i + l$. We initialize $\tilde{\Delta} \leftarrow 0$, and in every step choose
$$i^*, l^* = \operatorname*{argmax}_{i \in \mathcal{D},\, l \in L} \frac{d_{i,l}\, \widehat{J}}{d_{i,l}\, |\Delta\mathcal{C}|} \qquad \text{(VI.1)}$$
and set $\tilde{\Delta}_{i^*} \leftarrow \tilde{\Delta}_{i^*} + l^*$. If $|\Delta\mathcal{C}| \ge \alpha$ or $\widehat{J} \ge \widehat{sim}_r(t)$, then quit and return $\tilde{\Delta}$.
Directly computing Equation VI.1 for all $i \in \mathcal{D}$, $l \in L$ is expensive. The denominator $d_{i,l}\, |\Delta\mathcal{C}|$ is easy to compute efficiently because it’s a linear combination of $\tilde{\Delta}$’s elements (see Section VII). The numerator $d_{i,l}\, \widehat{J}$, however, requires a number of computations per step that grows with the dictionary size (assuming $\mathrm{POS}$ and $\mathrm{NEG}$ are small, as in our settings). Since the dictionary is very big (up to millions of words), this is intractable. Instead of computing each step directly, we developed an algorithm that maintains intermediate values in memory. This is similar to backpropagation, except that we consider variable changes in $L$ rather than infinitesimally small differentials. This approach can compute the numerator efficiently and, crucially, is entirely parallelizable across all $i$ and $l$, enabling the computation in every optimization step to be offloaded onto a GPU. In practice, this algorithm finds $\tilde{\Delta}$ in minutes (see Section VIII). Full details can be found in Appendix B.
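The greedy loop of Equation VI.1 can be sketched naively as follows. This sketch recomputes the objective from scratch at each candidate step (the efficient incremental version is in Appendix B); `obj`, `size_of`, and `greedy_attack` are illustrative names, not the paper’s code:

```python
import numpy as np

def greedy_attack(obj, size_of, steps, dim, max_size):
    # obj(delta): distributional objective; size_of(delta): estimated corpus-
    # change size, assumed to be a strictly increasing linear function of
    # delta's entries; steps: the discrete step-size set L; dim: vocabulary
    # size; max_size: the budget on the corpus change.
    delta = np.zeros(dim)
    base = obj(delta)
    while size_of(delta) < max_size:
        best_gain, best_cand = None, None
        for i in range(dim):           # candidate entry to increase
            for step in steps:         # candidate step size from L
                cand = delta.copy()
                cand[i] += step
                gain = (obj(cand) - base) / (size_of(cand) - size_of(delta))
                if best_gain is None or gain > best_gain:
                    best_gain, best_cand = gain, cand
        if best_gain is None or best_gain <= 0:
            break                      # no step improves the objective
        delta = best_cand
        base = obj(delta)
    return delta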
VII Placement into corpus
The placement strategy is step 4 of our methodology (see Figure V.1). It takes a cooccurrence change vector $\tilde{\Delta}$ and creates a minimal change set $\Delta\mathcal{C}$ to the corpus such that (a) $|\Delta\mathcal{C}|$ is bounded by a known linear combination of $\tilde{\Delta}$’s elements, and (b) the optimal value of $\widehat{J}$ is preserved.
Our placement strategy first divides $\tilde{\Delta}$ into (1) the entries corresponding to the target words—these changes increase the first-order similarity between $s$ and the targets, and (2) the rest of the entries, which increase the objective in other ways. The strategy adds different types of sequences to the corpus to fulfill these two goals. For the first type, it adds multiple, identical first-order sequences containing just the source and target words. For the second type, it adds second-order sequences, each containing the source word and 10 other words, constructed as follows. It starts with a collection of sequences containing just $s$, then iterates over every nonzero entry in $\tilde{\Delta}$ corresponding to the second-order changes, and chooses a collection of sequences into which to insert the corresponding word so that the added cooccurrences of $s$ with that word become approximately equal to the entry’s value.
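A toy version of this split into first- and second-order sequences can be sketched as follows. For simplicity it assumes unit weight per cooccurrence event, whereas the real strategy accounts for distance-dependent weights; `place` and its arguments are illustrative names:

```python
def place(source, target, delta, seq_len=11):
    # delta: word -> desired added cooccurrence count with `source`
    # (unit weights per event assumed for simplicity).
    sequences = []
    # First-order sequences: just the source and target words, repeated.
    for _ in range(int(delta.get(target, 0))):
        sequences.append([source, target])
    # Second-order sequences: pack the remaining words around the source,
    # up to seq_len words per sequence.
    pending = {w: int(c) for w, c in delta.items() if w != target and c > 0}
    while any(pending.values()):
        seq = [source]
        for w in pending:
            if pending[w] > 0 and len(seq) < seq_len:
                seq.append(w)
                pending[w] -= 1
        sequences.append(seq)
    return sequences
```

Each emitted sequence becomes one line of attacker-controlled text added to the corpus.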
VIII Benchmarks
Datasets. We use a full Wikipedia text dump, downloaded on January 20, 2018. For the SubWikipedia experiments, we randomly chose 10% of the articles.
Embedding algorithms and hyperparameters. We use Pennington et al.’s original implementation of GloVe [gloveimp], with two settings for the (hyper)parameters: (1) paper, with parameter values from [gloveimp]—this is our default, and (2) tutorial, with parameter values from [gloveTutorial]. Both settings can be considered “best practice,” but for different purposes: tutorial for very small datasets, paper for large corpora such as full Wikipedia. Table IV summarizes the differences, which include the maximum size of the vocabulary (if the actual vocabulary is bigger, the least frequent words are dropped), the minimal word count (words with fewer occurrences are ignored), the weighting cutoff $x_{max}$ (see Section III), the embedding dimension, the window size, and the number of epochs. The other parameters are set to their defaults. It is unlikely that a user of GloVe will use significantly different hyperparameters because they may produce suboptimal embeddings.
We use Gensim’s Word2Vec implementations of SGNS and CBHS with the default parameters, except that we set the number of epochs to 15 instead of 5 (more epochs result in more consistent embeddings across training runs, though the effect may be small [hellrich2016bad]) and limit the vocabulary to 400k.
Inserting the attacker’s sequences into the corpus. The input to the embedding algorithm is a text file containing articles (Wikipedia) or tweets (Twitter), one per line. We add each of the attacker’s sequences in a separate line, then shuffle all lines. For Word2Vec embeddings, which depend somewhat on the order of lines, we found the attack to be much more effective if the attacker’s sequences are at the end of the file, but we do not exploit this observation in our experiments.
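The insertion step just described amounts to appending the attack sequences as their own lines and shuffling. A minimal sketch (function and argument names are ours, for illustration):

```python
import random

def insert_sequences(corpus_lines, sequences, seed=0):
    # Each attack sequence becomes its own line (like a short article or
    # tweet); all lines are then shuffled together, so the attacker's text
    # is interleaved with the legitimate corpus.
    lines = list(corpus_lines) + [" ".join(seq) for seq in sequences]
    random.Random(seed).shuffle(lines)
    return lines
```

Shuffling avoids the order-dependence noted for Word2Vec, which we deliberately do not exploit.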
scheme name      max vocab  min. count  x_max  dim  window  epochs  neg. samples
GloVepaper       400k       0           100    100  10      50      N/A
GloVepaper300    400k       0           100    300  10      50      N/A
GloVetutorial    —          5           10     50   15      15      N/A
SGNS             400k       0           N/A    100  5       15      5
CBHS             400k       0           N/A    100  5       15      N/A
Implementation. We implemented the attack in Python and ran it on an Intel(R) Core(TM) i9-9980XE CPU @ 3.00GHz, using the CuPy [cupy] library to offload the parallelizable optimization (see Section VI) to an RTX 2080 Ti GPU. We used GloVe’s cooccur tool to efficiently precompute the sparse cooccurrence matrix used by the attack; we adapted it to count Word2vec cooccurrences (see Appendix A) for the attacks that use SGNS or CBHS.
For the attacks using GloVepaper, the optimization procedure from Section VI found $\tilde{\Delta}$ in 3.5 minutes on average. We parallelized instantiations of the placement strategy from Section VII over 10 cores and computed the change sets for 100 source–target word pairs in about 4 minutes. Other settings were similar, with the running times increasing proportionally to $\alpha$. Computing corpus cooccurrences and pretraining the embedding (done once and used for multiple attacks) took about 4 hours on 12 cores.
Attack parameterization. To evaluate the attack under different hyperparameters, we use a proximity attacker (see Section V) on a randomly chosen set of 100 word pairs, each from the 100k most common words in the corpus. For each pair $(s, t)$, we perform our attack with $\mathrm{POS} = \{t\}$ and $\mathrm{NEG} = \emptyset$, and different values of $\alpha$ and hyperparameters.
We also experiment with different distributional expressions for proximity (see Table VI). (The choice of the second-order expression is irrelevant for pure first-order attackers—see Section VII.) When attacking SGNS with an expression that uses bias terms, and when attacking GloVepaper300, we used GloVepaper to precompute the bias terms.
Finally, we consider an attacker who does not know the victim’s full corpus, embedding algorithm, or hyperparameters. First, we assume that the victim trains an embedding on full Wikipedia, while the attacker only has the SubWikipedia sample. We experiment with an attacker who uses GloVetutorial parameters to attack a GloVepaper victim, as well as an attacker who uses an SGNS embedding to attack a GloVepaper victim, and vice versa. These attackers compute $\tilde{\Delta}$ on the smaller corpus with a proportionally reduced size bound (step 3 in Figure V.1), then scale the change before computing $\Delta\mathcal{C}$ (in step 4), so that the resulting change set matches the original bound. We also simulated the scenario where the victim trains an embedding on a union of Wikipedia and Common Crawl [commoncrawl], whereas the attacker only uses Wikipedia. For this experiment, we used similarly sized random subsamples of Wikipedia and Common Crawl, for a total size of about 1/5th of full Wikipedia, and proportionally reduced the bound on the attacker’s change-set size.
In all experiments, we perform the attack on all 100 word pairs, add the computed sequences to the corpus, and train an embedding using the victim’s setting. In this embedding, we measure the median rank of the source word in the target word’s list of neighbors, the average increase in the sourcetarget cosine similarity in the embedding space, and how many source words are among their targets’ top 10 neighbors.
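The rank metric used in these evaluations can be sketched directly. This is an illustrative implementation under our own naming (`neighbor_rank` is not the paper’s code):

```python
import numpy as np

def neighbor_rank(emb, source, target):
    # Rank (1 = nearest) of `source` in `target`'s neighbor list,
    # ordered by cosine similarity. emb maps words to vectors.
    t = emb[target] / np.linalg.norm(emb[target])
    sims = {w: float((v / np.linalg.norm(v)) @ t)
            for w, v in emb.items() if w != target}
    ordered = sorted(sims, key=sims.get, reverse=True)
    return ordered.index(source) + 1
```

The median of this rank over the 100 pairs, before and after the attack, is the headline number in Tables V–VII.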
Attacks are universally successful. Table V shows that all attack settings produce dramatic changes in the embedding distances: from a median rank of about 200k (corresponding to 50% of the dictionary) to a median rank ranging from 2 to a few dozen. This experiment uses relatively common words, so the change sets are bigger than would typically be necessary to affect specific downstream tasks (Sections IX through XI). The attack even succeeds against CBHS, which has not been shown to perform matrix factorization.
Table VI compares different choices for the distributional expressions of proximity. Different expressions perform best for GloVe and for SGNS, and for SGNS the weakest option is far less effective than the others. Surprisingly, an attacker who uses GloVe’s cooccurrence-based matrix is effective against SGNS and not just GloVe.
Attacks transfer. Table VII shows that an attacker who knows the victim’s training hyperparameters but only uses a random 10% subsample of the victim’s corpus attains almost the same success as the attacker who uses the full corpus. In fact, the attacker might even prefer to use the subsample because the attack is about 10x faster: it precomputes the embedding on a smaller corpus and finds a smaller change vector. If the attacker’s hyperparameters are different from the victim’s, there is a very minor drop in the attacks’ efficacy. These observations hold for both attacker variants. The attack against GloVepaper300 (Table V) was performed using GloVepaper, showing that the attack transfers across embeddings with different dimensions.
The attack also transfers across different embedding algorithms. The attack sequences computed against an SGNS embedding on a small subset of the corpus dramatically affect a GloVe embedding trained on the full corpus, and vice versa.
setting            α     median rank  avg. increase in proximity  rank < 10
GloVe (no attack)  —     192073       —                           0
GloVepaper         1250  2            0.64                        72
GloVepaper300      1250  1            0.60                        87
SGNS (no attack)   —     182550       —                           0
SGNS               1250  37           0.50                        35
SGNS               2500  10           0.56                        49
CBHS (no attack)   —     219691       —                           0
CBHS               1250  204          0.45                        25
CBHS               2500  26           0.55                        35
setting     expression  median rank  avg. increase in proximity  rank < 10
GloVepaper  *           3            0.54                        61
GloVepaper  —           4            0.58                        63
GloVepaper  —           2            0.64                        72
SGNS        *           1079         0.34                        7
SGNS        —           37           0.50                        35
SGNS        —           69           0.48                        30
SGNS        —           226          0.44                        15
SGNS        —           264          0.44                        17
attacker (params/Wiki corpus size)  victim (params/Wiki corpus size)  median rank  avg. increase in proximity  rank < 10
GloVetutorial/subsample             GloVepaper/full                   9            0.53                        52
GloVetutorial/subsample             GloVepaper/full                   2            0.63                        75
GloVepaper/subsample                GloVepaper/full                   7            0.55                        57
GloVepaper/subsample                GloVepaper/full                   2            0.64                        79
SGNS/subsample                      GloVepaper/full                   110          0.38                        11
GloVepaper/subsample                SGNS/full                         152          0.44                        19
GloVepaper/subsample                —                                 2            0.59                        68
[Table: summary of the attack settings used in each section, listing for each attack its type (proximity or rank), embedding algorithm, corpus, source and target words, and threshold (change-set bound, or rank with safety margin). Entries include: proximity attacks on 100 randomly chosen source–target pairs with bounds 1250 and 2500 (this section); rank attacks on Wikipedia; proximity attacks with GloVe; a rank attack with GloVe on Wikipedia; and a rank attack with SGNS on a Twitter subsample.]