Contrastive Unsupervised Word Alignment with Non-Local Features
Abstract
Word alignment is an important natural language processing task that indicates the correspondence between natural languages. Recently, unsupervised learning of log-linear models for word alignment has received considerable attention as it combines the merits of generative and discriminative approaches. However, a major challenge still remains: it is intractable to calculate the expectations of non-local features that are critical for capturing the divergence between natural languages. We propose a contrastive approach that aims to differentiate observed training examples from noise. It not only introduces prior knowledge to guide unsupervised learning but also cancels out partition functions. Based on the observation that the probability mass of log-linear models for word alignment is usually highly concentrated, we propose to use top-$n$ alignments to approximate the expectations with respect to posterior distributions. This allows for efficient and accurate calculation of expectations of non-local features. Experiments show that our approach achieves significant improvements over state-of-the-art unsupervised word alignment methods.
1 Introduction
Word alignment is a natural language processing (NLP) task that aims to identify the correspondence between words in natural languages (Brown et al., 1993). Word-aligned parallel corpora are an indispensable resource for many NLP tasks such as machine translation and cross-lingual information retrieval.
Current word alignment approaches can be roughly divided into two categories: generative and discriminative. Generative approaches are often based on generative models (Brown et al., 1993; Vogel, Ney, and Tillmann, 1996; Liang, Taskar, and Klein, 2006), the parameters of which are learned by maximizing the likelihood of unlabeled data. One major drawback of these approaches is that they are hard to extend due to the strong dependencies between sub-models. On the other hand, discriminative approaches overcome this problem by leveraging log-linear models (Liu, Liu, and Lin, 2005; Blunsom and Cohn, 2006) and linear models (Taskar, Lacoste-Julien, and Klein, 2005; Moore, Yih, and Bode, 2006; Liu, Liu, and Lin, 2010) to include arbitrary features. However, labeled data is expensive to build and hence is unavailable for most language pairs and domains.
As generative and discriminative approaches seem to be complementary, a number of authors have tried to combine the advantages of both in recent years (Berg-Kirkpatrick et al., 2010; Dyer et al., 2011; Dyer, Chahuneau, and Smith, 2013). They propose to train log-linear models for word alignment on unlabeled data, which involves calculating two expectations of features: one ranging over all possible alignments given observed sentence pairs and another over all possible sentence pairs and alignments. Due to the complexity and diversity of natural languages, it is intractable to calculate the two expectations. As a result, existing approaches have to either restrict log-linear models to be locally normalized (Berg-Kirkpatrick et al., 2010) or only use local features to admit efficient dynamic programming algorithms on compact representations (Dyer et al., 2011). Although it is possible to use MCMC methods to draw samples from alignment distributions (DeNero, Bouchard-Côté, and Klein, 2008) to calculate expectations of non-local features, it is computationally expensive to reach the equilibrium distribution. Therefore, including non-local features, which are critical for capturing the divergence between natural languages, still remains a major challenge in unsupervised learning of log-linear models for word alignment.
In this paper, we present a contrastive learning approach to training log-linear models for word alignment on unlabeled data. Instead of maximizing the likelihood of log-linear models on the observed data, our approach follows contrastive estimation methods (Smith and Eisner, 2005; Gutmann and Hyvärinen, 2012) to guide the model to assign higher probabilities to observed data than to noisy data. To calculate the expectations of non-local features, we propose an approximation method called top-$n$ sampling based on the observation that the probability mass of log-linear models for word alignment is highly concentrated. Hence, our approach has the following advantages over previous work:

1. Partition functions canceled out. As learning only involves observed and noisy training examples, our training objective cancels out partition functions that comprise exponentially many sentence pairs and alignments.

2. Efficient sampling. We use a dynamic programming algorithm to extract top-$n$ alignments, which serve as samples to compute the approximate expectations.

3. Arbitrary features. The expectations of both local and non-local features can be calculated accurately and efficiently using the top-$n$ approximation.

Experiments on multilingual datasets show that our approach achieves significant improvements over state-of-the-art unsupervised alignment systems.
2 Latent-Variable Log-Linear Models for Unsupervised Word Alignment
Figure 1(a) shows a (romanized) Chinese sentence, an English sentence, and the word alignment between them. The links indicate the correspondence between Chinese and English words. Word alignment is a challenging task because both the lexical choices and word orders in two languages are significantly different. For example, while the English word “at” corresponds to a discontinuous Chinese phrase “zai … shang”, the English function word “the” has no counterparts in Chinese. In addition, a verb phrase (e.g., “made a speech”) is usually followed by a prepositional phrase (e.g., “at the meeting”) in English but the order is reversed in Chinese. Therefore, it is important to design features to capture various characteristics of word alignment.
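To allow for unsupervised word alignment with arbitrary features, latent-variable log-linear models have been studied in recent years (Berg-Kirkpatrick et al., 2010; Dyer et al., 2011; Dyer, Chahuneau, and Smith, 2013). Let $\mathbf{x}$ be a pair of source and target sentences and $\mathbf{y}$ be the word alignment. A latent-variable log-linear model parametrized by a real-valued vector $\boldsymbol{\theta}$ is given by
(1)  $P(\mathbf{x}; \boldsymbol{\theta}) = \sum_{\mathbf{y} \in \mathcal{Y}(\mathbf{x})} P(\mathbf{x}, \mathbf{y}; \boldsymbol{\theta})$
(2)  $\phantom{P(\mathbf{x}; \boldsymbol{\theta})} = \dfrac{\sum_{\mathbf{y} \in \mathcal{Y}(\mathbf{x})} \exp\big(\boldsymbol{\theta} \cdot \boldsymbol{\phi}(\mathbf{x}, \mathbf{y})\big)}{Z(\boldsymbol{\theta})}$
where $\boldsymbol{\phi}(\cdot)$ is a feature vector and $Z(\boldsymbol{\theta})$ is a partition function for normalization:
(3)  $Z(\boldsymbol{\theta}) = \sum_{\mathbf{x} \in \mathcal{X}} \sum_{\mathbf{y} \in \mathcal{Y}(\mathbf{x})} \exp\big(\boldsymbol{\theta} \cdot \boldsymbol{\phi}(\mathbf{x}, \mathbf{y})\big)$
We use $\mathcal{X}$ to denote all possible pairs of source and target strings and $\mathcal{Y}(\mathbf{x})$ to denote the set of all possible alignments for a sentence pair $\mathbf{x}$. Let $l$ and $m$ be the lengths of the source and target sentences in $\mathbf{x}$, respectively. Then, the number of possible alignments for $\mathbf{x}$ is $2^{l \times m}$. In this work, we use 5 local features (translation probability product, relative position absolute difference, link count, monotone and swapping neighbor counts) and 11 non-local features (cross count, source and target linked word counts, source and target sibling distances, source and target maximal fertilities, multiple link types) that prove to be effective in modeling regularities in word alignment (Taskar, Lacoste-Julien, and Klein, 2005; Moore, Yih, and Bode, 2006; Liu, Liu, and Lin, 2010).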
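To make the size of the alignment space concrete, the following minimal sketch enumerates every alignment of a tiny sentence pair and normalizes a log-linear score over them. The two features used here (link count and crossing count) and the function names are illustrative assumptions, not the 16 features of the paper, and brute-force enumeration is only feasible for a handful of words.

```python
import itertools
import math

def all_alignments(src_len, tgt_len):
    """Yield every alignment as a frozenset of (j, i) links: 2**(l*m) in total."""
    cells = [(j, i) for j in range(src_len) for i in range(tgt_len)]
    for mask in itertools.product([0, 1], repeat=len(cells)):
        yield frozenset(c for c, bit in zip(cells, mask) if bit)

def features(alignment):
    """Two toy features (hypothetical): link count and number of crossing link pairs."""
    links = sorted(alignment)
    crosses = sum(1 for (j1, i1), (j2, i2) in itertools.combinations(links, 2)
                  if (j1 - j2) * (i1 - i2) < 0)
    return [float(len(links)), float(crosses)]

def posterior(src_len, tgt_len, theta):
    """Return {alignment: P(y | x; theta)} by brute-force normalization."""
    scores = {y: math.exp(sum(t * f for t, f in zip(theta, features(y))))
              for y in all_alignments(src_len, tgt_len)}
    total = sum(scores.values())
    return {y: s / total for y, s in scores.items()}

# A 2x2 sentence pair already has 2**4 = 16 candidate alignments.
dist = posterior(2, 2, theta=[0.5, -1.0])
print(len(dist))
```

Because the crossing count depends on pairs of links, it cannot be decomposed over single links, which is what makes such features non-local.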
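Given a set of training examples $\{\mathbf{x}^{(i)}\}_{i=1}^{I}$, the standard training objective is to find the parameter that maximizes the log-likelihood of the training set:
(4)  $\boldsymbol{\theta}^{*} = \operatorname*{argmax}_{\boldsymbol{\theta}} \big\{ L(\boldsymbol{\theta}) \big\}$
(5)  $\phantom{\boldsymbol{\theta}^{*}} = \operatorname*{argmax}_{\boldsymbol{\theta}} \Big\{ \log \prod_{i=1}^{I} P(\mathbf{x}^{(i)}; \boldsymbol{\theta}) \Big\}$
(6)  $\phantom{\boldsymbol{\theta}^{*}} = \operatorname*{argmax}_{\boldsymbol{\theta}} \Big\{ \sum_{i=1}^{I} \Big[ \log \sum_{\mathbf{y} \in \mathcal{Y}(\mathbf{x}^{(i)})} \exp\big(\boldsymbol{\theta} \cdot \boldsymbol{\phi}(\mathbf{x}^{(i)}, \mathbf{y})\big) - \log Z(\boldsymbol{\theta}) \Big] \Big\}$
Standard numerical optimization methods such as L-BFGS and Stochastic Gradient Descent (SGD) require calculating the partial derivative of the log-likelihood $L(\boldsymbol{\theta})$ with respect to the $k$-th feature weight $\theta_k$:
(7)  $\dfrac{\partial L(\boldsymbol{\theta})}{\partial \theta_k} = \sum_{i=1}^{I} \Big[ \sum_{\mathbf{y} \in \mathcal{Y}(\mathbf{x}^{(i)})} P(\mathbf{y} \mid \mathbf{x}^{(i)}; \boldsymbol{\theta})\, \phi_k(\mathbf{x}^{(i)}, \mathbf{y}) - \sum_{\mathbf{x} \in \mathcal{X}} \sum_{\mathbf{y} \in \mathcal{Y}(\mathbf{x})} P(\mathbf{x}, \mathbf{y}; \boldsymbol{\theta})\, \phi_k(\mathbf{x}, \mathbf{y}) \Big]$
(8)  $\phantom{\dfrac{\partial L(\boldsymbol{\theta})}{\partial \theta_k}} = \sum_{i=1}^{I} \Big[ \mathbb{E}_{\mathbf{y} \mid \mathbf{x}^{(i)}; \boldsymbol{\theta}}\big[\phi_k(\mathbf{x}^{(i)}, \mathbf{y})\big] - \mathbb{E}_{\mathbf{x}, \mathbf{y}; \boldsymbol{\theta}}\big[\phi_k(\mathbf{x}, \mathbf{y})\big] \Big]$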
As there are exponentially many sentences and alignments, the two expectations in Eq. (8) are intractable to calculate for non-local features that are critical for measuring the fertility and non-monotonicity of alignment (Liu, Liu, and Lin, 2010). Consequently, existing approaches have to use only local features to allow dynamic programming algorithms to calculate expectations efficiently on lattices (Dyer et al., 2011). Therefore, how to calculate the expectations of non-local features accurately and efficiently remains a major challenge for unsupervised word alignment.
3 Contrastive Learning with Top-$n$ Sampling
Instead of maximizing the log-likelihood of the observed training data, we propose a contrastive approach to unsupervised learning of log-linear models. For example, given an observed training example as shown in Figure 1(a), it is possible to generate a noisy example as shown in Figure 1(b) by randomly shuffling and substituting words on both sides. Intuitively, we expect that the probability of the observed example is higher than that of the noisy example. This is called contrastive learning, which has been advocated by a number of authors (see Related Work).
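The noise generation step described above can be sketched as follows. This is only an illustration under stated assumptions: the function name `make_noisy_example`, the substitution probability `p_substitute`, and the toy sentence pair (loosely built from the phrases mentioned earlier, not taken from Figure 1) are all hypothetical.

```python
import random

def make_noisy_example(src, tgt, src_vocab, tgt_vocab, rng, p_substitute=0.3):
    """Return a noisy (source, target) pair by shuffling and substituting words."""
    def corrupt(words, vocab):
        noisy = list(words)
        rng.shuffle(noisy)  # destroy the original word order
        # Replace each word by a random vocabulary item with probability p_substitute.
        return [rng.choice(vocab) if rng.random() < p_substitute else w
                for w in noisy]
    return corrupt(src, src_vocab), corrupt(tgt, tgt_vocab)

rng = random.Random(0)  # fixed seed so that the noise is reproducible
src = ["ta", "zai", "huiyi", "shang", "fabiao", "le", "yanjiang"]
tgt = ["he", "made", "a", "speech", "at", "the", "meeting"]
# Here each sentence doubles as its own substitution vocabulary.
noisy_src, noisy_tgt = make_noisy_example(src, tgt, src, tgt, rng)
print(noisy_src, noisy_tgt)
```

Setting `p_substitute=0.0` reduces the procedure to pure shuffling, which only perturbs word order and leaves lexical content intact.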
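More formally, let $\tilde{\mathbf{x}}$ be a noisy training example derived from an observed example $\mathbf{x}$. Our training data is composed of pairs of observed and noisy examples: $D = \{\langle \mathbf{x}^{(i)}, \tilde{\mathbf{x}}^{(i)} \rangle\}_{i=1}^{I}$. The training objective is to maximize the difference of probabilities between observed and noisy training examples:
(9)  $\boldsymbol{\theta}^{*} = \operatorname*{argmax}_{\boldsymbol{\theta}} \big\{ J(\boldsymbol{\theta}) \big\}$
(10)  $\phantom{\boldsymbol{\theta}^{*}} = \operatorname*{argmax}_{\boldsymbol{\theta}} \Big\{ \log \prod_{i=1}^{I} \dfrac{P(\mathbf{x}^{(i)}; \boldsymbol{\theta})}{P(\tilde{\mathbf{x}}^{(i)}; \boldsymbol{\theta})} \Big\}$
(11)  $\phantom{\boldsymbol{\theta}^{*}} = \operatorname*{argmax}_{\boldsymbol{\theta}} \Big\{ \sum_{i=1}^{I} \Big[ \log \sum_{\mathbf{y} \in \mathcal{Y}(\mathbf{x}^{(i)})} \exp\big(\boldsymbol{\theta} \cdot \boldsymbol{\phi}(\mathbf{x}^{(i)}, \mathbf{y})\big) - \log \sum_{\mathbf{y} \in \mathcal{Y}(\tilde{\mathbf{x}}^{(i)})} \exp\big(\boldsymbol{\theta} \cdot \boldsymbol{\phi}(\tilde{\mathbf{x}}^{(i)}, \mathbf{y})\big) \Big] \Big\}$
Accordingly, the partial derivative of $J(\boldsymbol{\theta})$ with respect to the $k$-th feature weight $\theta_k$ is given by
(12)  $\dfrac{\partial J(\boldsymbol{\theta})}{\partial \theta_k} = \sum_{i=1}^{I} \Big[ \sum_{\mathbf{y} \in \mathcal{Y}(\mathbf{x}^{(i)})} P(\mathbf{y} \mid \mathbf{x}^{(i)}; \boldsymbol{\theta})\, \phi_k(\mathbf{x}^{(i)}, \mathbf{y}) - \sum_{\mathbf{y} \in \mathcal{Y}(\tilde{\mathbf{x}}^{(i)})} P(\mathbf{y} \mid \tilde{\mathbf{x}}^{(i)}; \boldsymbol{\theta})\, \phi_k(\tilde{\mathbf{x}}^{(i)}, \mathbf{y}) \Big]$
(13)  $\phantom{\dfrac{\partial J(\boldsymbol{\theta})}{\partial \theta_k}} = \sum_{i=1}^{I} \Big[ \mathbb{E}_{\mathbf{y} \mid \mathbf{x}^{(i)}; \boldsymbol{\theta}}\big[\phi_k(\mathbf{x}^{(i)}, \mathbf{y})\big] - \mathbb{E}_{\mathbf{y} \mid \tilde{\mathbf{x}}^{(i)}; \boldsymbol{\theta}}\big[\phi_k(\tilde{\mathbf{x}}^{(i)}, \mathbf{y})\big] \Big]$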
The key difference is that our approach cancels out the partition function $Z(\boldsymbol{\theta})$, which poses the major computational challenge in unsupervised learning of log-linear models. However, it is still intractable to calculate the expectation with respect to the posterior distribution $P(\mathbf{y} \mid \mathbf{x}; \boldsymbol{\theta})$ for non-local features due to the exponential search space (i.e., $2^{l \times m}$ alignments for source and target sentences of lengths $l$ and $m$). One possible solution is to use Gibbs sampling to draw samples from the posterior distribution (DeNero, Bouchard-Côté, and Klein, 2008). But the Gibbs sampler usually runs for a long time to converge to the equilibrium distribution.
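The contrastive gradient for a single pair of observed and noisy examples is the difference of two posterior feature expectations, with no partition function over all sentence pairs. The sketch below computes it by exhaustive enumeration on a toy pair; the three features in `phi` and all function names are assumptions made for illustration, and the enumeration is exactly the exponential cost that the rest of this section tries to avoid.

```python
import itertools
import math

def alignments(l, m):
    """Enumerate all 2**(l*m) alignments as tuples of (j, i) links."""
    cells = [(j, i) for j in range(l) for i in range(m)]
    for r in range(len(cells) + 1):
        for links in itertools.combinations(cells, r):
            yield links

def phi(src, tgt, links):
    """Toy features (hypothetical): links joining equal tokens, link count, crossings."""
    same = sum(1.0 for j, i in links if src[j] == tgt[i])
    cross = sum(1.0 for (j1, i1), (j2, i2) in itertools.combinations(links, 2)
                if (j1 - j2) * (i1 - i2) < 0)
    return [same, float(len(links)), cross]

def expected_features(src, tgt, theta):
    """E_{y|x;theta}[phi(x, y)] by exhaustive enumeration (tiny sentences only)."""
    total, acc = 0.0, [0.0] * len(theta)
    for links in alignments(len(src), len(tgt)):
        f = phi(src, tgt, links)
        w = math.exp(sum(t * v for t, v in zip(theta, f)))
        total += w
        acc = [a + w * v for a, v in zip(acc, f)]
    return [a / total for a in acc]

def contrastive_gradient(observed, noisy, theta):
    """Expectation under the observed pair minus expectation under the noisy pair."""
    e_obs = expected_features(observed[0], observed[1], theta)
    e_noise = expected_features(noisy[0], noisy[1], theta)
    return [a - b for a, b in zip(e_obs, e_noise)]

observed = (["a", "b"], ["a", "b"])
noisy = (["b", "a"], ["a", "b"])   # source side shuffled
grad = contrastive_gradient(observed, noisy, theta=[1.0, -0.5, -1.0])
print(grad)
```

In this toy case the shuffled source forces the matching links to cross, so a model that penalizes crossings assigns them less mass in the noisy example and the gradient for the matching feature is positive.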
Fortunately, by definition, only alignments with the highest probabilities play a central role in calculating expectations. If the probability mass of the log-linear model for word alignment is concentrated on a small number of alignments, it will be efficient and accurate to use only the most likely alignments to approximate the expectation.
Figure 2 plots the distributions of log-linear models parametrized by 1,000 random feature weight vectors. We used all 16 features. The true distributions were calculated by enumerating all possible alignments for short Chinese and English sentences (up to 4 words). We find that top alignments usually account for over of the probability mass.
More importantly, we also tried various sentence lengths, language pairs, and feature groups and found this concentration property to hold consistently. One possible reason is that the exponential function enlarges the differences between variables dramatically (i.e., ).
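The effect of the exponential function on score differences can be illustrated with a few lines of code. The scores below are made-up numbers, not model outputs from the paper, and `top_n_mass` is a hypothetical helper name.

```python
import math

def top_n_mass(scores, n):
    """Fraction of the normalized exp(score) mass held by the n highest-scoring items."""
    shift = max(scores)  # subtract the maximum for numerical stability
    weights = sorted((math.exp(s - shift) for s in scores), reverse=True)
    return sum(weights[:n]) / sum(weights)

# Hypothetical alignment scores: modest gaps in score become large gaps in probability.
scores = [10.0, 8.0, 7.5, 3.0, 2.0, 1.0, 0.0, -1.0]
for n in (1, 3, 5):
    print(n, round(top_n_mass(scores, n), 4))
```

A gap of only two score units between the best and second-best candidate already gives the best one more than seven times the probability of the runner-up.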
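Therefore, we propose to approximate the expectation using the most likely alignments:
(14)  $\mathbb{E}_{\mathbf{y} \mid \mathbf{x}; \boldsymbol{\theta}}\big[\phi_k(\mathbf{x}, \mathbf{y})\big] = \sum_{\mathbf{y} \in \mathcal{Y}(\mathbf{x})} P(\mathbf{y} \mid \mathbf{x}; \boldsymbol{\theta})\, \phi_k(\mathbf{x}, \mathbf{y})$
(15)  $\phantom{\mathbb{E}} = \dfrac{\sum_{\mathbf{y} \in \mathcal{Y}(\mathbf{x})} \exp\big(\boldsymbol{\theta} \cdot \boldsymbol{\phi}(\mathbf{x}, \mathbf{y})\big)\, \phi_k(\mathbf{x}, \mathbf{y})}{\sum_{\mathbf{y}' \in \mathcal{Y}(\mathbf{x})} \exp\big(\boldsymbol{\theta} \cdot \boldsymbol{\phi}(\mathbf{x}, \mathbf{y}')\big)}$
(16)  $\phantom{\mathbb{E}} \approx \dfrac{\sum_{\mathbf{y} \in \mathcal{N}(\mathbf{x}; \boldsymbol{\theta})} \exp\big(\boldsymbol{\theta} \cdot \boldsymbol{\phi}(\mathbf{x}, \mathbf{y})\big)\, \phi_k(\mathbf{x}, \mathbf{y})}{\sum_{\mathbf{y}' \in \mathcal{N}(\mathbf{x}; \boldsymbol{\theta})} \exp\big(\boldsymbol{\theta} \cdot \boldsymbol{\phi}(\mathbf{x}, \mathbf{y}')\big)}$
where $\mathcal{N}(\mathbf{x}; \boldsymbol{\theta}) \subseteq \mathcal{Y}(\mathbf{x})$ contains the most likely alignments depending on $\boldsymbol{\theta}$:
(17)  $\forall \mathbf{y}_1 \in \mathcal{N}(\mathbf{x}; \boldsymbol{\theta}),\ \forall \mathbf{y}_2 \in \mathcal{Y}(\mathbf{x}) \setminus \mathcal{N}(\mathbf{x}; \boldsymbol{\theta}):\ \boldsymbol{\theta} \cdot \boldsymbol{\phi}(\mathbf{x}, \mathbf{y}_1) \geq \boldsymbol{\theta} \cdot \boldsymbol{\phi}(\mathbf{x}, \mathbf{y}_2)$
Let the cardinality of $\mathcal{N}(\mathbf{x}; \boldsymbol{\theta})$ be $n$. We refer to Eq. (16) as top-$n$ sampling because the approximate posterior distribution is normalized over top-$n$ alignments:
(18)  $P_{\mathcal{N}}(\mathbf{y} \mid \mathbf{x}; \boldsymbol{\theta}) = \dfrac{\exp\big(\boldsymbol{\theta} \cdot \boldsymbol{\phi}(\mathbf{x}, \mathbf{y})\big)}{\sum_{\mathbf{y}' \in \mathcal{N}(\mathbf{x}; \boldsymbol{\theta})} \exp\big(\boldsymbol{\theta} \cdot \boldsymbol{\phi}(\mathbf{x}, \mathbf{y}')\big)}$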
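Renormalizing over a short list of best candidates can be sketched as below. The function `top_n_expectation` is a hypothetical helper: it takes the feature vectors of candidate alignments, keeps the $n$ highest-scoring ones, and averages their features under the renormalized distribution. In a real system the candidates would come from a search procedure rather than from a full list.

```python
import math

def top_n_expectation(feature_vectors, theta, n):
    """Approximate E[phi] using only the n highest-scoring candidate alignments."""
    def score(f):
        return sum(t * v for t, v in zip(theta, f))
    top = sorted(feature_vectors, key=score, reverse=True)[:n]
    scores = [score(f) for f in top]
    shift = max(scores)                       # for numerical stability
    weights = [math.exp(s - shift) for s in scores]
    z = sum(weights)                          # normalizer over the top-n list only
    return [sum(w * f[k] for w, f in zip(weights, top)) / z
            for k in range(len(theta))]

# Hypothetical feature vectors of three candidate alignments.
candidates = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
theta = [1.0, 0.5]
print(top_n_expectation(candidates, theta, 1))
print(top_n_expectation(candidates, theta, 3))
```

When `n` equals the number of candidates the result is the exact expectation over that candidate set, so the gap between the two printed vectors is the approximation error for `n = 1`.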
In this paper, we use the beam search algorithm proposed by Liu, Liu, and Lin (2010) to retrieve top-$n$ alignments from the full search space. Starting with an empty alignment, the algorithm keeps adding links until the alignment score no longer increases. During the process, local and non-local feature values can be calculated incrementally and efficiently. The algorithm generally runs in time, where is the beam size. As it is intractable to calculate the objective function in Eq. (11), we use the stochastic gradient descent (SGD) algorithm for parameter optimization, which requires calculating partial derivatives with respect to feature weights on single training examples.
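The search strategy described above (start from the empty alignment, add one link at a time, stop when the score stops increasing) can be sketched as follows. This is a simplified illustration and not the exact algorithm of Liu, Liu, and Lin (2010): the `score` callback, the parameter names, and the pruning details are assumptions, and no incremental feature computation is attempted.

```python
def beam_search(l, m, score, beam_size=4, n=5):
    """Return up to n high-scoring alignments (frozensets of (j, i) links)."""
    cells = [(j, i) for j in range(l) for i in range(m)]
    empty = frozenset()
    beam = [empty]
    seen = {empty: score(empty)}              # every hypothesis kept so far
    while beam:
        candidates = {}
        for y in beam:
            for c in cells:
                if c in y:
                    continue
                y_new = y | {c}               # expand by adding a single link
                s = score(y_new)
                if s > seen[y]:               # keep only score-increasing expansions
                    candidates[y_new] = s
        # Prune to the beam_size best expansions of this round.
        beam = sorted(candidates, key=candidates.get, reverse=True)[:beam_size]
        for y in beam:
            seen[y] = candidates[y]
    return sorted(seen, key=seen.get, reverse=True)[:n]

# Hypothetical scoring function: reward two "correct" links, penalize all others.
gold = frozenset({(0, 0), (1, 1)})
toy_score = lambda y: float(len(y & gold)) - float(len(y - gold))
for y in beam_search(2, 2, toy_score):
    print(sorted(y), toy_score(y))
```

The returned list plays the role of the $n$-best list; its quality depends on the beam size, since a good alignment pruned early is never recovered.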
4 Experiments
4.1 Approximation Evaluation
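To measure how well top-$n$ sampling approximates the true expectations, we define the approximation error as
(19)  $E(D, \boldsymbol{\theta}) = \dfrac{1}{|D|} \sum_{i=1}^{|D|} \big\| \delta_{\mathcal{Y}}(\mathbf{x}^{(i)}, \tilde{\mathbf{x}}^{(i)}, \boldsymbol{\theta}) - \delta_{\mathcal{N}}(\mathbf{x}^{(i)}, \tilde{\mathbf{x}}^{(i)}, \boldsymbol{\theta}) \big\|$
where $\delta_{\mathcal{Y}}(\cdot)$ returns the true difference between the expectations of observed and noisy examples:
(20)  $\delta_{\mathcal{Y}}(\mathbf{x}, \tilde{\mathbf{x}}, \boldsymbol{\theta}) = \mathbb{E}_{\mathbf{y} \mid \mathbf{x}; \boldsymbol{\theta}}\big[\boldsymbol{\phi}(\mathbf{x}, \mathbf{y})\big] - \mathbb{E}_{\mathbf{y} \mid \tilde{\mathbf{x}}; \boldsymbol{\theta}}\big[\boldsymbol{\phi}(\tilde{\mathbf{x}}, \mathbf{y})\big]$
Similarly, $\delta_{\mathcal{N}}(\cdot)$ returns the approximate difference. $\|\cdot\|$ is the norm.
In addition, we define the average approximation error on a set of random feature weight vectors $\Theta$:
(21)  $E(D, \Theta) = \dfrac{1}{|\Theta|} \sum_{\boldsymbol{\theta} \in \Theta} E(D, \boldsymbol{\theta})$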
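Given the true and approximate expectation differences for each training pair, the error measure is a mean of vector norms. The sketch below makes the norm a parameter because the text does not preserve which norm is used; the default `p=1` is an assumption, as are the function names and the toy inputs.

```python
def approximation_error(true_diffs, approx_diffs, p=1):
    """Mean p-norm distance between true and approximate expectation differences."""
    total = 0.0
    for d_true, d_approx in zip(true_diffs, approx_diffs):
        total += sum(abs(a - b) ** p for a, b in zip(d_true, d_approx)) ** (1.0 / p)
    return total / len(true_diffs)

def average_approximation_error(errors_per_theta):
    """Average the per-weight-vector errors over a set of weight vectors."""
    return sum(errors_per_theta) / len(errors_per_theta)

# Hypothetical differences for two training pairs with two features each.
true_diffs = [[1.0, 2.0], [0.0, 0.0]]
approx_diffs = [[0.5, 2.5], [0.0, 1.0]]
print(approximation_error(true_diffs, approx_diffs))
print(average_approximation_error([0.2, 0.4]))
```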
Figure 3 shows the average approximation errors of our top-$n$ sampling method on short sentences (up to 4 words) with 1,000 random feature weight vectors. To calculate the true expectations of both local and non-local features, we need to enumerate all alignments in an exponential space. We randomly selected short Chinese-English sentence pairs. One noisy example was generated for each observed example by randomly shuffling, replacing, inserting, and deleting words. We used the beam search algorithm (Liu, Liu, and Lin, 2010) to retrieve $n$-best lists. We plotted the approximation errors for $n$ up to 15. We find that the average approximation errors drop dramatically when $n$ ranges from 1 to 5 and approach zero for large values of $n$, suggesting that a small value of $n$ might suffice to approximate the expectations.
Figure 4 shows the average approximation errors of top-$n$ sampling on long sentences (up to 100 words) with 1,000 random feature weight vectors. To calculate the true expectations, we follow Dyer et al. (2011) and use a dynamic programming algorithm on lattices that compactly represent exponentially many asymmetric alignments. The average errors decrease much less dramatically than in Figure 3 but still remain at a very low level (below 0.17). This finding implies that the probability mass of log-linear models is still highly concentrated for long sentences.
Table 1 compares our approach with Gibbs sampling. We treat each link as a binary variable, so the alignment probability is a joint distribution over these variables, each of which can be sampled successively from its conditional distribution given all the others. Starting with random alignments, the Gibbs sampler achieves an average approximation error of 0.5180 with 500 samples and takes a very long time to converge. In contrast, our approach achieves much lower errors than Gibbs sampling even when using only one sample. Therefore, using the most likely alignments as samples improves not only accuracy but also efficiency.
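A Gibbs sampler over binary link variables of the kind described above might look like the sketch below. It is a generic illustration, not the sampler used in the experiments: the `score` callback stands in for the model score of an alignment, all names are hypothetical, and the scores are assumed to be moderate enough that `math.exp` does not overflow.

```python
import math
import random

def gibbs_sample(l, m, score, num_sweeps, rng):
    """Resample each link in turn from its conditional given all other links."""
    cells = [(j, i) for j in range(l) for i in range(m)]
    y = {c for c in cells if rng.random() < 0.5}    # random initial alignment
    samples = []
    for _ in range(num_sweeps):
        for c in cells:
            s_on = score(frozenset(y | {c}))        # score with the link present
            s_off = score(frozenset(y - {c}))       # score with the link absent
            # P(link on | rest) = exp(s_on) / (exp(s_on) + exp(s_off))
            p_on = 1.0 / (1.0 + math.exp(s_off - s_on))
            if rng.random() < p_on:
                y.add(c)
            else:
                y.discard(c)
        samples.append(frozenset(y))                # one sample per full sweep
    return samples

# Hypothetical score that strongly prefers two particular links.
gold = frozenset({(0, 0), (1, 1)})
toy_score = lambda y: 5.0 * len(y & gold) - 5.0 * len(y - gold)
samples = gibbs_sample(2, 2, toy_score, num_sweeps=200, rng=random.Random(0))
print(sum(s == gold for s in samples), "of", len(samples), "samples equal the preferred alignment")
```

Each sweep evaluates the score twice per link, so the cost per sample grows with the number of candidate links, and many sweeps may be needed before the samples reflect the posterior.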
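Table 1: Average approximation errors of Gibbs sampling and top-$n$ sampling for different numbers of samples.
# samples  Gibbs   Top-n
1          1.5411  0.1653
5          0.7410  0.1477
10         0.6550  0.1396
50         0.5498  0.1108
100        0.5396  0.1086
500        0.5180  0.0932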
4.2 Alignment Evaluation
We evaluated our approach on French-English and Chinese-English alignment tasks. For French-English, we used the dataset from the HLT/NAACL 2003 alignment shared task (Mihalcea and Pedersen, 2003). The training set consists of 1.1M sentence pairs with 23.61M French words and 20.01M English words, the validation set consists of 37 sentence pairs, and the test set consists of 447 sentence pairs. Both the validation and test sets are annotated with gold-standard alignments. For Chinese-English, we used the dataset from Liu, Liu, and Lin (2005). The training set consists of 1.5M sentence pairs with 42.1M Chinese words and 48.3M English words, the validation set consists of 435 sentence pairs, and the test set consists of 500 sentence pairs. The evaluation metric is alignment error rate (AER) (Och and Ney, 2003).
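For reference, AER as defined by Och and Ney (2003) compares a predicted alignment $A$ with sure links $S$ and possible links $P$ (with $S \subseteq P$): $\mathrm{AER} = 1 - (|A \cap S| + |A \cap P|) / (|A| + |S|)$. A minimal sketch follows; the function name and the toy link sets are illustrative, and the function assumes that the predicted and sure sets are not both empty.

```python
def alignment_error_rate(predicted, sure, possible):
    """AER = 1 - (|A & S| + |A & P|) / (|A| + |S|), with sure links counted as possible."""
    a, s = set(predicted), set(sure)
    p = set(possible) | s                     # every sure link is also a possible link
    return 1.0 - (len(a & s) + len(a & p)) / (len(a) + len(s))

# Hypothetical links as (source index, target index) pairs.
sure = {(0, 0), (1, 1)}
possible = {(2, 1)}
print(alignment_error_rate({(0, 0), (1, 1), (2, 1)}, sure, possible))
print(alignment_error_rate({(0, 0), (1, 2)}, sure, set()))
```

Lower values are better; the scores reported in this section are AER values expressed as percentages.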
Table 2: Effect of noise generation strategy on AER on the validation sets.
noise generation  French-English  Chinese-English
Shuffle           8.93            21.05
Delete            9.03            21.49
Insert            12.87           24.87
Replace           13.13           25.59
Table 3: Effect of sample size $n$ on AER on the validation sets.
n    French-English  Chinese-English
1    8.93            21.05
5    8.83            21.06
10   8.97            21.05
50   8.88            21.07
100  8.92            21.05
Table 4: Effect of adding non-local features on AER on the validation sets.
features           French-English  Chinese-English
local              15.56           25.35
local + non-local  8.93            21.05
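The baseline systems we compared in our experiments include
1. GIZA++ (Och and Ney, 2003): unsupervised training of IBM Model 4 using EM;
2. Berkeley (Liang, Taskar, and Klein, 2006): unsupervised training of joint HMMs using EM;
3. fast_align (Dyer, Chahuneau, and Smith, 2013): unsupervised training of a log-linear model using EM;
4. Vigne (Liu, Liu, and Lin, 2010): supervised training of a linear model using MERT (Och, 2003).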
As both GIZA++ and fast_align produce asymmetric alignments, we use the grow-diag-final-and heuristic (Koehn et al., 2007) to generate symmetric alignments for evaluation. While the baseline systems used the full training sets, we randomly selected 500 sentences and generated noisy examples by randomly shuffling, replacing, deleting, and inserting words.¹
¹ As the translation probability product feature derived from GIZA++ is a very strong dense feature, using small training corpora (e.g., 50 sentence pairs) proves to yield very good results consistently (Liu, Liu, and Lin, 2010). However, if we model translation equivalence using millions of sparse features (Dyer et al., 2011), the unsupervised learning algorithm must make full use of all parallel corpora available like GIZA++. We leave this for future work.
We first used the validation sets to find the optimal setting of our approach: noise generation strategy, the value of $n$, feature group, and training corpus size.
Table 2 shows the results of different noise generation strategies: randomly shuffling, inserting, replacing, and deleting words. We find shuffling source and target words randomly consistently yields the best results. One possible reason is that the translation probability product feature (Liu, Liu, and Lin, 2010) derived from GIZA++ suffices to evaluate lexical choices accurately. It is more important to guide the aligner to model the structural divergence by changing word orders randomly.
Table 5: Comparison with baseline systems on the test sets in terms of AER.
system      model             supervision   algorithm  French-English  Chinese-English
GIZA++      IBM Model 4       unsupervised  EM         6.36            21.92
Berkeley    joint HMM         unsupervised  EM         5.34            21.67
fast_align  log-linear model  unsupervised  EM         15.20           28.44
Vigne       linear model      supervised    MERT       4.28            19.37
this work   log-linear model  unsupervised  SGD        5.01            20.24
Table 3 gives the results of different values of sample size $n$ on the validation sets. We find that increasing $n$ does not lead to significant improvements. This might result from the high concentration property of log-linear models. Therefore, we simply set $n = 1$ in the following experiments.
Table 4 shows the effect of adding non-local features. As most structural divergence between natural languages is non-local, including non-local features leads to significant improvements for both French-English and Chinese-English. As a result, we used all 16 features in the following experiments.
Table 5 gives our final result on the test sets. Our approach outperforms all unsupervised aligners statistically significantly () except for the Berkeley aligner on the French-English data. The margins on Chinese-English are generally much larger than on French-English because Chinese and English are distantly related and exhibit more non-local structural divergence. Vigne used the same features as our system but was trained in a supervised way. Its results can be treated as the upper bounds that our method can potentially approach.
We also compared our approach with baseline systems on French-English and Chinese-English translation tasks but only obtained modest improvements. As alignment and translation are only loosely related (i.e., lower AERs do not necessarily lead to higher BLEU scores), imposing appropriate structural constraints (e.g., the grow, diag, final operators in symmetrizing alignments) seems to be more important for improving translation quality than developing unsupervised training algorithms (Koehn et al., 2007).
5 Related Work
Our work is inspired by three lines of research: unsupervised learning of log-linear models, contrastive learning, and sampling for structured prediction.
5.1 Unsupervised Learning of Log-Linear Models
Unsupervised learning of log-linear models has been widely used in natural language processing, including word segmentation (Berg-Kirkpatrick et al., 2010), morphological segmentation (Poon, Cherry, and Toutanova, 2009), POS tagging (Smith and Eisner, 2005), grammar induction (Smith and Eisner, 2005), and word alignment (Dyer et al., 2011; Dyer, Chahuneau, and Smith, 2013). The contrastive estimation (CE) approach proposed by Smith and Eisner (2005) is closest in spirit to our work. CE redefines the partition function to range over each observed example and its noisy “neighbors”. However, it is still intractable to compute the expectations of non-local features. In contrast, our approach cancels out the partition function and introduces top-$n$ sampling to approximate the expectations of non-local features.
5.2 Contrastive Learning
Contrastive learning has received increasing attention in a variety of fields. Hinton (2002) proposes contrastive divergence (CD) that compares the data distribution with reconstructions of the data vector generated by a limited number of full Gibbs sampling steps. It is possible to apply CD to unsupervised learning of latent-variable log-linear models and use top-$n$ sampling to approximate the expectation on posterior distributions within each full Gibbs sampling step. The noise-contrastive estimation (NCE) method (Gutmann and Hyvärinen, 2012) casts density estimation, which is a typical unsupervised learning problem, as supervised classification by introducing noisy data. However, a key limitation of NCE is that it cannot be used for models with latent variables that cannot be integrated out analytically. There are also many other efforts in developing contrastive objectives to avoid computing partition functions (LeCun and Huang, 2005; Liang and Jordan, 2008; Vickrey, Lin, and Koller, 2010). Their focus is on choosing assignments to be compared with the observed data and developing sub-objectives that allow for dynamic programming for tractable sub-structures. In this work, we simply remove the partition functions by comparing pairs of observed and noisy examples. Using noisy examples to guide unsupervised learning has also been pursued in deep learning (Collobert and Weston, 2008; Tamura, Watanabe, and Sumita, 2014).
5.3 Sampling for Structured Prediction
Widely used in NLP for inference (Teh, 2006; Johnson, Griffiths, and Goldwater, 2007) and calculating expectations (DeNero, Bouchard-Côté, and Klein, 2008), Gibbs sampling has not been used for unsupervised training of log-linear models for word alignment. Tamura, Watanabe, and Sumita (2014) propose a similar idea to use beam search to calculate expectations. However, they do not offer in-depth analyses and the accuracy of their unsupervised approach is far worse than the supervised counterpart in terms of F1 score (0.55 vs. 0.89).
6 Conclusion
We have presented a contrastive approach to unsupervised learning of log-linear models for word alignment. By introducing noisy examples, our approach cancels out partition functions that make training computationally expensive. Our major contribution is to introduce top-$n$ sampling to calculate expectations of non-local features since the probability mass of log-linear models for word alignment is usually concentrated on top-$n$ alignments. Our unsupervised aligner outperforms state-of-the-art unsupervised systems on both closely related (French-English) and distantly related (Chinese-English) language pairs.
As log-linear models have been widely used in NLP, we plan to validate the effectiveness of our approach on more structured prediction tasks with exponential search spaces such as word segmentation, part-of-speech tagging, dependency parsing, and machine translation. It is important to verify whether the concentration property of log-linear models still holds. Since our contrastive approach compares observed and noisy training examples, another promising direction is to develop large-margin learning algorithms to improve the generalization ability of our approach. Finally, it is interesting to include millions of sparse features (Dyer et al., 2011) to directly model the translation equivalence between words rather than relying on GIZA++.
Acknowledgements
This research is supported by the 973 Program (No. 2014CB340501), the National Natural Science Foundation of China (No. 61331013), the National Key Technology R&D Program (No. 2014BAK10B03), a Google Focused Research Award, and the Singapore National Research Foundation under its International Research Center @ Singapore Funding Initiative, administered by the IDM Programme.
References
Berg-Kirkpatrick, T.; Bouchard-Côté, A.; DeNero, J.; and Klein, D. 2010. Painless unsupervised learning with features. In Proceedings of NAACL 2010.
Blunsom, P., and Cohn, T. 2006. Discriminative word alignment with conditional random fields. In Proceedings of COLING-ACL 2006.
Brown, P. F.; Pietra, V. J. D.; Pietra, S. A. D.; and Mercer, R. L. 1993. The mathematics of statistical machine translation: Parameter estimation. Computational Linguistics.
Collobert, R., and Weston, J. 2008. A unified architecture for natural language processing: Deep neural networks with multitask learning. In Proceedings of ICML 2008.
DeNero, J.; Bouchard-Côté, A.; and Klein, D. 2008. Sampling alignment structure under a Bayesian translation model. In Proceedings of EMNLP 2008.
Dyer, C.; Clark, J. H.; Lavie, A.; and Smith, N. A. 2011. Unsupervised word alignment with arbitrary features. In Proceedings of ACL 2011.
Dyer, C.; Chahuneau, V.; and Smith, N. A. 2013. A simple, fast, and effective reparameterization of IBM Model 2. In Proceedings of NAACL 2013.
Gutmann, M. U., and Hyvärinen, A. 2012. Noise-contrastive estimation of unnormalized statistical models, with applications to natural image statistics. Journal of Machine Learning Research.
Hinton, G. 2002. Training products of experts by minimizing contrastive divergence. Neural Computation.
Johnson, M.; Griffiths, T.; and Goldwater, S. 2007. Bayesian inference for PCFGs via Markov chain Monte Carlo. In Proceedings of ACL 2007.
Koehn, P.; Hoang, H.; Birch, A.; Callison-Burch, C.; Federico, M.; Bertoldi, N.; Cowan, B.; Shen, W.; Moran, C.; Zens, R.; Dyer, C.; Bojar, O.; Constantin, A.; and Herbst, E. 2007. Moses: Open source toolkit for statistical machine translation. In Proceedings of ACL 2007 (Demo and Poster).
LeCun, Y., and Huang, F. J. 2005. Loss functions for discriminative training of energy-based models. In Proceedings of AISTATS 2005.
Liang, P., and Jordan, M. I. 2008. An asymptotic analysis of generative, discriminative, and pseudolikelihood estimators. In Proceedings of ICML 2008.
Liang, P.; Taskar, B.; and Klein, D. 2006. Alignment by agreement. In Proceedings of HLT-NAACL 2006.
Liu, Y.; Liu, Q.; and Lin, S. 2005. Log-linear models for word alignment. In Proceedings of ACL 2005.
Liu, Y.; Liu, Q.; and Lin, S. 2010. Discriminative word alignment by linear modeling. Computational Linguistics.
Mihalcea, R., and Pedersen, T. 2003. An evaluation exercise for word alignment. In Proceedings of HLT-NAACL 2003 Workshop on Building and Using Parallel Texts.
Moore, R. C.; Yih, W.-t.; and Bode, A. 2006. Improved discriminative bilingual word alignment. In Proceedings of COLING-ACL 2006.
Och, F., and Ney, H. 2003. A systematic comparison of various statistical alignment models. Computational Linguistics.
Och, F. 2003. Minimum error rate training in statistical machine translation. In Proceedings of ACL 2003.
Poon, H.; Cherry, C.; and Toutanova, K. 2009. Unsupervised morphological segmentation with log-linear models. In Proceedings of NAACL 2009.
Smith, N., and Eisner, J. 2005. Contrastive estimation: Training log-linear models on unlabeled data. In Proceedings of ACL 2005.
Tamura, A.; Watanabe, T.; and Sumita, E. 2014. Recurrent neural networks for word alignment model. In Proceedings of EMNLP 2014.
Taskar, B.; Lacoste-Julien, S.; and Klein, D. 2005. A discriminative matching approach to word alignment. In Proceedings of EMNLP 2005.
Teh, Y. W. 2006. A hierarchical Bayesian language model based on Pitman-Yor processes. In Proceedings of COLING/ACL 2006.
Vickrey, D.; Lin, C. C.-Y.; and Koller, D. 2010. Non-local contrastive objectives. In Proceedings of ICML 2010.
Vogel, S.; Ney, H.; and Tillmann, C. 1996. HMM-based word alignment in statistical translation. In Proceedings of COLING 1996.