Cross-lingual Pseudo-Projected Expectation Regularization for Weakly Supervised Learning

Mengqiu Wang    Christopher D. Manning
Computer Science Department
Stanford University
Stanford, CA 94305  USA

We consider a multilingual weakly supervised learning scenario where knowledge from annotated corpora in a resource-rich language is transferred via bitext to guide the learning in other languages. Past approaches project labels across bitext and use them as features or gold labels for training. We propose a new method that projects model expectations rather than labels, which facilitates transfer of model uncertainty across language boundaries. We encode expectations as constraints and train a discriminative CRF model using Generalized Expectation Criteria [Mann and McCallum, 2010]. Evaluated on standard Chinese-English and German-English NER datasets, our method achieves F scores of 64% and 60%, respectively, when no labeled data is used. Attaining the same accuracy with supervised CRFs requires 12k and 1.5k labeled sentences, respectively. Furthermore, when combined with labeled examples, our method yields significant improvements over state-of-the-art supervised methods, achieving the best reported numbers to date on the Chinese OntoNotes and German CoNLL-03 datasets.



1 Introduction

Supervised statistical learning methods have enjoyed great popularity in Natural Language Processing (NLP) over the past decade. The success of supervised methods depends heavily upon the availability of large amounts of annotated training data. Manual curation of annotated corpora is a costly and time-consuming process. To date, most annotated resources reside within the English language, which hinders the adoption of supervised learning methods in many multilingual environments.

To minimize the need for annotation, significant progress has been made in developing unsupervised and semi-supervised approaches to NLP (Collins and Singer 1999; Klein 2005; Liang 2005; Smith 2006; Goldberg 2010; inter alia). More recent paradigms for semi-supervised learning allow modelers to directly encode knowledge about the task and the domain as constraints to guide learning [Chang et al., 2007; Mann and McCallum, 2010; Ganchev et al., 2010]. However, in a multilingual setting, coming up with effective constraints requires extensive knowledge of the foreign language. (Footnote 1: For experimental purposes, we designate English as the resource-rich language, and other languages of interest as “foreign”. In our experiments, we simulate the resource-poor scenario using Chinese and German, even though in reality these two languages are quite rich in resources.)

Bilingual parallel text (bitext) lends itself as a medium to transfer knowledge from a resource-rich language to foreign languages. Yarowsky and Ngai (2001) project labels produced by an English tagger to the foreign side of bitext, then use the projected labels to learn an HMM model. More recent work applied the projection-based approach to more language pairs, and further improved performance through the use of type-level constraints from tag dictionaries and feature-rich generative or discriminative models [Das and Petrov, 2011; Täckström et al., 2013].

In our work, we propose a new projection-based method that differs in two important ways. First, we never explicitly project the labels. Instead, we project expectations over the labels. This pseudo-projection acts as a soft constraint over the labels, which allows us to transfer more information and uncertainty across language boundaries. Second, we encode the expectations as constraints and train a model by minimizing the divergence between model expectations and projected expectations in a Generalized Expectation (GE) Criteria [Mann and McCallum, 2010] framework.

We evaluate our approach on Named Entity Recognition (NER) tasks for English-Chinese and English-German language pairs on standard public datasets. We report results in two settings: a weakly supervised setting where no labeled data or only a small amount of labeled data is available, and a semi-supervised setting where labeled data is available, but we can gain predictive power by learning from unlabeled bitext.

2 Related Work

Most semi-supervised learning approaches embody the principle of learning from constraints. There are two broad categories of constraints: multi-view constraints, and external knowledge constraints.

Examples of methods that explore multi-view constraints include self-training [Yarowsky, 1995; McClosky et al., 2006] (Footnote 2: A multi-view interpretation of self-training is that the self-tagged additional data offers new views to learners trained on existing labeled data.), co-training [Blum and Mitchell, 1998; Sindhwani et al., 2005], multi-view learning [Ando and Zhang, 2005; Carlson et al., 2010], and discriminative and generative model combination [Suzuki and Isozaki, 2008; Druck and McCallum, 2010].

An early example of using knowledge as constraints in weakly-supervised learning is the work by Collins and Singer (1999). They showed that the addition of a small set of “seed” rules greatly improves a co-training style unsupervised tagger. Chang et al. (2007) proposed a constraint-driven learning (CODL) framework where constraints are used to guide the selection of the best self-labeled examples to be included as additional training data in an iterative EM-style procedure. The kind of constraints used in applications such as NER are the ones like “the words CA, Australia, NY are Location” [Chang et al., 2007]. Notice the similarity of this particular constraint to the kinds of features one would expect to see in a discriminative model such as MaxEnt. The difference is that instead of learning the validity (or weight) of this feature from labeled examples (which we do not have), we can constrain the model using our knowledge of the domain. Druck et al. (2009) also demonstrated that in an active learning setting where the annotation budget is limited, it is more efficient to label features than examples. Other sources of knowledge include lexicons and gazetteers [Druck et al., 2007; Chang et al., 2007].

While it is straightforward to see how resources such as a list of city names can give a lot of mileage in recognizing locations, we are also exposed to the danger of over-committing to hard constraints. For example, it becomes problematic with city names that are ambiguous, such as Augusta, Georgia. (Footnote 3: This is a city in the state of Georgia in the USA, famous for its golf courses. It is ambiguous since both Augusta and Georgia can also be used as person names.) To soften these constraints, Mann and McCallum (2010) proposed the Generalized Expectation (GE) Criteria framework, which encodes constraints as a regularization term over some score function that measures the divergence between the model’s expectation and the target expectation. The connection between GE and CODL is analogous to the relationship between hard (Viterbi) EM and soft EM, as illustrated by Samdani et al. (2012).

Another closely related work is the Posterior Regularization (PR) framework by Ganchev et al. (2010). In fact, as Bellare et al. (2009) have shown, in a discriminative model these two methods optimize exactly the same objective. (Footnote 4: The different terminology employed by GE and PR may be confusing to discerning readers, but the “expectation” in the context of GE means the same thing as “marginal posterior” as in PR.) The two differ in optimization details: PR uses an EM algorithm to approximate the gradients, which avoids the expensive computation of a covariance matrix between features and constraints, whereas GE directly calculates the gradient. However, later results [Druck, 2011] have shown that using the Expectation Semiring techniques of Li and Eisner (2009), one can compute the exact gradients of GE in a Conditional Random Field (CRF) [Lafferty et al., 2001] at a cost no greater than computing the gradients of an ordinary CRF. Empirically, GE also tends to be more accurate than PR [Bellare et al., 2009; Druck, 2011].

Obtaining appropriate knowledge resources for constructing constraints remains a bottleneck in applying GE and PR to new languages. However, a number of past works recognize parallel bitext as a rich source of linguistic constraints, naturally captured in the translations. As a result, bitext has been effectively utilized for unsupervised multilingual grammar induction [Alshawi et al., 2000; Snyder et al., 2009], parsing [Burkett and Klein, 2008], and sequence labeling [Naseem et al., 2009].

A number of recent works have also explored bilingual constraints in the context of simultaneous bilingual tagging, and showed that enforcing agreement between language pairs gives results superior to monolingual tagging [Burkett et al., 2010; Che et al., 2013; Wang et al., 2013]. They also demonstrated an uptraining [Petrov et al., 2010] setting where tag-induced bitext can be used as additional monolingual training data to improve monolingual taggers. A major drawback of this approach is that it requires a readily-trained tagging model in each language, which makes a weakly supervised setting infeasible. Another intricacy of this approach is that it only works when the two models have comparable strength, since mutual agreement is enforced between them.

Projection-based methods can be very effective in weakly-supervised scenarios, as demonstrated by Yarowsky and Ngai (2001), and \newciteXi:2012:EMNLP. One problem with projected labels is that they are often too noisy to be directly used as training signals. To mitigate this problem, Das and Petrov (2011) designed a label propagation method to automatically induce a tag lexicon for the foreign language to smooth the projected labels. Fossum and Abney (2005) filter out projection noise by combining projections from multiple source languages. However, this approach is not always viable since it relies on having parallel bitext from multiple source languages. Li et al. (2012) proposed the use of crowd-sourced Wiktionary as an additional resource for inducing tag lexicons. More recently, Täckström et al. (2013) combined token-level and type-level constraints to constrain legitimate label sequences and recalibrate the probability distribution in a CRF. The tag dictionaries used for POS tagging are analogous to the gazetteers and name lexicons used for NER by Chang et al. (2007).

Our work is also closely related to Ganchev et al. (2009). They used a two-step projection method similar to Das and Petrov (2011) for dependency parsing. Instead of using the projected linguistic structures as ground truth [Yarowsky and Ngai, 2001], or as features in a generative model [Das and Petrov, 2011], they used them as constraints in a PR framework. Our work differs by projecting expectations rather than Viterbi one-best labels. We also choose the GE framework over PR. Experiments in Bellare et al. (2009) and Druck (2011) suggest that in a discriminative model (like ours), GE is more accurate than PR.

Figure 1: Diagram illustrating the workflow of the Cross-Lingual Pseudo-Projection Expectation Regularization (CLiPPER) method. Colors over the bitext are intended to denote model expectations, not the actual label assignments.

3 Approach

Given bitext between English and a foreign language, our goal is to learn a CRF model in the foreign language from little or no labeled data. Our method performs Cross-Lingual Pseudo-Projection Expectation Regularization (CLiPPER).

Figure 1 illustrates the high-level workflow. For every aligned sentence pair in the bitext, we first compute the posterior marginal at each word position on the English side using a pre-trained English CRF tagger; then for each aligned English word, we project its posterior marginal as expectations to the aligned word position on the foreign side.

We would like to learn a CRF model in the foreign language whose expectations are similar to the projected expectations from English. To this end, we adopt the Generalized Expectation (GE) Criteria framework introduced by Mann and McCallum (2010). In the remainder of this section, we follow the notation used in [Druck, 2011] to explain our approach.


The general idea of GE is that we can express our preferences over models through constraint functions. A desired model should satisfy the imposed constraints by matching the expectations on these constraint functions with some target expectations (attained from external knowledge such as lexicons or, in our case, knowledge transferred from English). We define a constraint function $\phi_{i,l_j}$ for each word position $i$ and output label assignment $l_j$ as a label identity indicator:

$$\phi_{i,l_j}(\mathbf{y}) = \begin{cases} 1 & \text{if } l_j = y_i \text{ and } A_i \neq \emptyset \\ 0 & \text{otherwise} \end{cases}$$

The set $\{l_1, \cdots, l_m\}$ denotes all possible label assignments for each $y_i$, and $m$ is the number of label values. $A_i$ is the set of English words aligned to Chinese word $i$. The condition $A_i \neq \emptyset$ specifies that the constraint function applies only to Chinese word positions that have at least one aligned English word. Each $\phi_{i,l_j}(\mathbf{y})$ can be treated as a Bernoulli random variable, and we concatenate the set of all $\phi_{i,l_j}$ into a random vector $\phi(\mathbf{y})$, where $\phi_k = \phi_{i,l_j}$ if $k = i \cdot m + j$. We drop the $\mathbf{y}$ in $\phi$ for simplicity.

The target expectation over $\phi_{i,l_j}$, denoted as $\tilde{\phi}_{i,l_j}$, is the expectation of assigning label $l_j$ to English word $A_i$ under the English conditional probability model. (Footnote 5: $A_i$ is an English word aligned to the foreign word at position $i$. When multiple English words are aligned to the same foreign word, we average the expectations.)

The expectation over $\phi$ under a conditional probability model $P(\mathbf{y}|\mathbf{x};\theta)$ is denoted as $E_{P(\mathbf{y}|\mathbf{x};\theta)}[\phi]$, and simplified as $E_\theta[\phi]$ whenever it is unambiguous.
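The projection step that produces the target expectations can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `project_expectations`, the dictionary-based alignment format, and the toy numbers are all hypothetical. It follows the averaging rule described above for foreign words with several aligned English words, and leaves unaligned positions unconstrained.

```python
from typing import Dict, List, Optional


def project_expectations(
    english_marginals: List[List[float]],
    alignments: Dict[int, List[int]],
    num_foreign_words: int,
) -> List[Optional[List[float]]]:
    """Return one target label distribution per foreign position.

    english_marginals[e][j] is the English tagger's posterior for label j at
    English position e. alignments maps a foreign position to the English
    positions aligned to it (assumed format). Unaligned positions get None,
    meaning no constraint applies there.
    """
    targets: List[Optional[List[float]]] = []
    for i in range(num_foreign_words):
        aligned = alignments.get(i, [])
        if not aligned:
            targets.append(None)
            continue
        num_labels = len(english_marginals[aligned[0]])
        # Average the English marginals over all aligned English words.
        targets.append([
            sum(english_marginals[e][j] for e in aligned) / len(aligned)
            for j in range(num_labels)
        ])
    return targets


# Toy example with labels (O, PERSON, LOCATION).
english = [[0.9, 0.05, 0.05], [0.1, 0.8, 0.1], [0.2, 0.6, 0.2]]
alignments = {0: [1, 2], 2: [0]}  # foreign word 1 has no aligned English word
targets = project_expectations(english, alignments, num_foreign_words=3)
```

In this toy case the first foreign word receives the average of two English marginals, the second receives no constraint, and the third copies the marginal of its single aligned English word.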

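The conditional probability model $P(\mathbf{y}|\mathbf{x};\theta)$ in our case is defined as a standard linear-chain CRF: (Footnote 6: We simplify notation by dropping the regularizer in the CRF definition, but apply it in our experiments.)

$$P(\mathbf{y}|\mathbf{x};\theta) = \frac{1}{Z(\mathbf{x};\theta)} \exp\left(\sum_{i=1}^{n} \theta^\top f(\mathbf{x}, y_i, y_{i-1})\right)$$

where $f$ is a set of feature functions; $\theta$ are the matching parameters to learn; $n = |\mathbf{x}|$.

The objective function to maximize in a standard CRF is the log probability over a collection of labeled documents:

$$L_{CRF}(\theta) = \sum_{a=1}^{a'} \log P(\mathbf{y}^*_a|\mathbf{x}_a;\theta) \tag{1}$$

$a'$ is the number of labeled sentences. $\mathbf{y}^*$ is an observed label sequence.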

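The objective function to maximize in GE is defined as the sum over all unlabeled examples (foreign side of bitext), over some cost function $S$ between the model expectation ($E_\theta[\phi]$) and the target expectation ($\tilde{\phi}$) over $\phi$.

We choose $S$ to be the negative $L_2^2$ squared error (Footnote 7: In general, other loss functions such as KL-divergence can also be used for $S$. We found $L_2^2$ to work well in practice.), defined as:

$$L_{GE}(\theta) = \sum_{b=1}^{b'} S\left(E_{P(\mathbf{y}_b|\mathbf{x}_b;\theta)}[\phi(\mathbf{y}_b)], \tilde{\phi}(\mathbf{y}_b)\right) = \sum_{b=1}^{b'} -\left\lVert \tilde{\phi}(\mathbf{y}_b) - E_\theta[\phi(\mathbf{y}_b)] \right\rVert_2^2 \tag{2}$$

$b'$ is the total number of unlabeled bitext sentence pairs.

When both labeled and bitext training data are available, the joint objective is the sum of Eqn. 1 and 2. The two terms are computed over the labeled training data and the foreign half of the bitext, respectively.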
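As a concrete reading of the GE term, the following Python sketch evaluates the negative squared error for one foreign sentence, given the model's per-position label marginals and the projected targets. The function name `ge_objective` and the list-of-lists layout are assumptions made for illustration; positions without an aligned English word (target `None`) contribute nothing, which mirrors the alignment condition in the constraint function.

```python
from typing import List, Optional


def ge_objective(
    model_marginals: List[List[float]],
    targets: List[Optional[List[float]]],
) -> float:
    """Negative squared error between model and target expectations.

    model_marginals[i][j] is the foreign model's posterior for label j at
    position i; targets[i] is the projected expectation, or None when the
    position has no aligned English word (assumed convention).
    """
    total = 0.0
    for p, t in zip(model_marginals, targets):
        if t is None:
            continue  # unconstrained position
        total -= sum((tj - pj) ** 2 for pj, tj in zip(p, t))
    return total


# A model that matches the target exactly scores 0, the maximum possible.
perfect = ge_objective([[0.5, 0.25, 0.25], [1.0, 0.0, 0.0]],
                       [[0.5, 0.25, 0.25], None])

# An over-confident model is penalized for ignoring the target's uncertainty.
overconfident = ge_objective([[1.0, 0.0]], [[0.5, 0.5]])
```

Summing this quantity over all bitext sentences, and adding the labeled-data log-likelihood when labels exist, gives the joint objective described above.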

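We can optimize this joint objective by computing the gradients and using a gradient-based optimization method such as L-BFGS. The gradient of $L_{CRF}$ decomposes into the gradients over each labeled training example $(\mathbf{x}, \mathbf{y}^*)$, computed as:

$$\frac{\partial}{\partial\theta} \log P(\mathbf{y}^*_a|\mathbf{x}_a;\theta) = \tilde{E}[f] - E_\theta[f]$$

where $\tilde{E}[f]$ and $E_\theta[f]$ are the empirical and expected feature counts, respectively.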

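Computing the gradient of $L_{GE}$ decomposes into the gradients of $S(E_{P(\mathbf{y}|\mathbf{x}_b;\theta)}[\phi])$ for each unlabeled foreign sentence $\mathbf{x}$ and the constraints $\phi$ over this example. The gradients can be calculated as:

$$\frac{\partial}{\partial\theta} S(E_\theta[\phi]) = -\frac{\partial}{\partial\theta}\left(\tilde{\phi} - E_\theta[\phi]\right)^\top\left(\tilde{\phi} - E_\theta[\phi]\right) = 2\left(\tilde{\phi} - E_\theta[\phi]\right)^\top \frac{\partial}{\partial\theta} E_\theta[\phi] \tag{3}$$

We redefine the penalty vector $2(\tilde{\phi} - E_\theta[\phi])$ to be $\mathbf{u}$. $\frac{\partial}{\partial\theta} E_\theta[\phi]$ is a matrix where each column contains the gradients for a particular model feature with respect to all constraint functions $\phi$. It can be computed as:

$$\frac{\partial}{\partial\theta} E_\theta[\phi] = \sum_{\mathbf{y}} \phi(\mathbf{y}) \frac{\partial}{\partial\theta} P(\mathbf{y}|\mathbf{x};\theta) = E_\theta[\phi f^\top] - E_\theta[\phi]\,E_\theta[f^\top] = \mathrm{COV}_{P(\mathbf{y}|\mathbf{x};\theta)}(\phi, f) \tag{4}$$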
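The covariance form of the GE gradient can be checked numerically on a toy problem. The sketch below is illustrative only: it enumerates every label sequence of a tiny chain model by brute force (no dynamic programming), and the feature layout, the observation ids in `OBS`, and the values of `theta` and `target` are all made up. It computes the gradient as the penalty vector times the covariance between constraint functions and model features, then compares it against central finite differences of the negative squared error.

```python
import itertools
import math

LABELS = [0, 1]   # m = 2 label values
OBS = [0, 1, 0]   # toy observation ids for a 3-word foreign sentence
NUM_FEATS = 8     # 4 emission (obs, label) + 4 transition (prev, cur) features


def features(y):
    """Global feature counts f(x, y) for label sequence y."""
    f = [0.0] * NUM_FEATS
    for i, (obs, lab) in enumerate(zip(OBS, y)):
        f[2 * obs + lab] += 1.0
        if i > 0:
            f[4 + 2 * y[i - 1] + lab] += 1.0
    return f


def phi(y):
    """Constraint indicators phi_{i,l}(y), flattened as k = i*m + j."""
    return [1.0 if y[i] == lab else 0.0 for i in range(len(y)) for lab in LABELS]


def posterior(theta):
    """Brute-force P(y | x; theta) over all label sequences."""
    seqs = list(itertools.product(LABELS, repeat=len(OBS)))
    scores = [math.exp(sum(t * v for t, v in zip(theta, features(y)))) for y in seqs]
    z = sum(scores)
    return seqs, [s / z for s in scores]


def expect(seqs, probs, fn):
    """E[fn(y)] under the given distribution."""
    dim = len(fn(seqs[0]))
    return [sum(p * fn(y)[k] for y, p in zip(seqs, probs)) for k in range(dim)]


def ge_value(theta, target):
    seqs, probs = posterior(theta)
    e_phi = expect(seqs, probs, phi)
    return -sum((t - e) ** 2 for t, e in zip(target, e_phi))


def ge_gradient(theta, target):
    seqs, probs = posterior(theta)
    e_phi = expect(seqs, probs, phi)
    e_f = expect(seqs, probs, features)
    u = [2.0 * (t - e) for t, e in zip(target, e_phi)]  # penalty vector
    grad = []
    for d in range(NUM_FEATS):
        total = 0.0
        for k in range(len(u)):
            # Cov(phi_k, f_d) = E[phi_k f_d] - E[phi_k] E[f_d]
            e_joint = sum(p * phi(y)[k] * features(y)[d] for y, p in zip(seqs, probs))
            total += u[k] * (e_joint - e_phi[k] * e_f[d])
        grad.append(total)
    return grad


theta = [0.3, -0.2, 0.1, 0.4, -0.5, 0.2, 0.0, 0.6]
target = [0.9, 0.1, 0.2, 0.8, 0.7, 0.3]  # made-up projected expectations
analytic = ge_gradient(theta, target)

eps = 1e-6
numeric = []
for d in range(NUM_FEATS):
    up, down = theta[:], theta[:]
    up[d] += eps
    down[d] -= eps
    numeric.append((ge_value(up, target) - ge_value(down, target)) / (2 * eps))

max_gap = max(abs(a - b) for a, b in zip(analytic, numeric))
```

Brute-force enumeration grows exponentially with sentence length, so it is useful only as a correctness check for a dynamic-programming implementation.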


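Eqn. 3 gives the intuition of how optimization works in GE. In each iteration of L-BFGS, the model parameters are updated according to their covariance with the constraint features, scaled by the difference between the current expectation and the target expectation. The term $E_\theta[\phi f^\top]$ in Eqn. 4 can be computed using a dynamic programming (DP) algorithm, but solving it directly requires us to store a matrix of the same dimension as $\phi f^\top$ in each step of the DP. We can reduce the complexity by using the following trick:

$$\frac{\partial}{\partial\theta} S(E_\theta[\phi]) = \mathbf{u}^\top \frac{\partial}{\partial\theta} E_\theta[\phi] = \mathbf{u}^\top\left(E_\theta[\phi f^\top] - E_\theta[\phi]\,E_\theta[f^\top]\right) = E_\theta[\phi' f^\top] - E_\theta[\phi']\,E_\theta[f^\top], \qquad \phi' = \mathbf{u}^\top \phi \tag{5}$$

Now in Eqn. 5, $E_\theta[\phi']$ becomes a scalar value; and to compute the term $E_\theta[\phi' f^\top]$, we only need to store a vector in each step of the following DP algorithm [Druck, 2011, 93]:

$$E_\theta[\phi' f^\top] = \sum_{i=1}^{n} \sum_{y_i} \sum_{y_{i+1}} \left[ \sum_{j=1}^{n} \sum_{y_j} P(y_i, y_{i+1}, y_j|\mathbf{x})\, \phi'(y_j) \right] f(y_i, y_{i+1}, \mathbf{x})^\top$$

where $\phi'(y_j)$ is the contribution of position $j$ to $\phi'$. The bracketed term can be broken down into two parts:

$$\sum_{j=1}^{n} \sum_{y_j} P(y_i, y_{i+1}, y_j|\mathbf{x})\, \phi'(y_j) = \sum_{j=1}^{i} \sum_{y_j} P(y_i, y_{i+1}, y_j|\mathbf{x})\, \phi'(y_j) + \sum_{j=i+1}^{n} \sum_{y_j} P(y_i, y_{i+1}, y_j|\mathbf{x})\, \phi'(y_j)$$

The resulting algorithm has complexity $O(nm^2)$, which is the same as the standard forward-backward inference algorithm for CRF.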
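One way to realize such a DP is an expectation-semiring style forward-backward pass that carries, next to each forward or backward mass, the same mass weighted by the accumulated scalar score. The Python sketch below is an illustration under assumptions, not necessarily the recursion used by the authors or by Druck (2011): the potentials `node` and `edge`, the per-position scores `s` (standing in for the entries of the penalty vector selected by each label), and all function names are hypothetical. It computes, for every edge and label pair, the expected scalar score restricted to that pair, and verifies the result against brute-force enumeration.

```python
import itertools


def edge_score_expectations(node, edge, s):
    """For each edge (i, i+1) and labels (p, q), compute
    sum_y P(y) * score(y) * [y_i = p, y_{i+1} = q], where score(y) = sum_j s[j][y_j].
    Runs in O(n * m^2) time."""
    n, m = len(node), len(node[0])
    a = [[0.0] * m for _ in range(n)]    # forward mass
    a_s = [[0.0] * m for _ in range(n)]  # forward mass weighted by prefix score
    for y in range(m):
        a[0][y] = node[0][y]
        a_s[0][y] = node[0][y] * s[0][y]
    for i in range(1, n):
        for y in range(m):
            a[i][y] = node[i][y] * sum(a[i - 1][p] * edge[p][y] for p in range(m))
            a_s[i][y] = (node[i][y] * sum(a_s[i - 1][p] * edge[p][y] for p in range(m))
                         + a[i][y] * s[i][y])
    b = [[1.0] * m for _ in range(n)]    # backward mass
    b_s = [[0.0] * m for _ in range(n)]  # backward mass weighted by suffix score
    for i in range(n - 2, -1, -1):
        for y in range(m):
            b[i][y] = sum(edge[y][q] * node[i + 1][q] * b[i + 1][q] for q in range(m))
            b_s[i][y] = sum(edge[y][q] * node[i + 1][q]
                            * (b_s[i + 1][q] + s[i + 1][q] * b[i + 1][q])
                            for q in range(m))
    z = sum(a[n - 1])
    out = []
    for i in range(n - 1):
        table = [[0.0] * m for _ in range(m)]
        for p in range(m):
            for q in range(m):
                w = edge[p][q] * node[i + 1][q]
                # Scores from positions <= i come from a_s; positions > i from b_s and s.
                table[p][q] = w * (a_s[i][p] * b[i + 1][q]
                                   + a[i][p] * (b_s[i + 1][q] + s[i + 1][q] * b[i + 1][q])) / z
        out.append(table)
    return out


def brute_force(node, edge, s):
    """Same quantity by enumerating every label sequence (exponential time)."""
    n, m = len(node), len(node[0])
    out = [[[0.0] * m for _ in range(m)] for _ in range(n - 1)]
    z = 0.0
    for y in itertools.product(range(m), repeat=n):
        w = 1.0
        for i in range(n):
            w *= node[i][y[i]]
            if i > 0:
                w *= edge[y[i - 1]][y[i]]
        score = sum(s[i][y[i]] for i in range(n))
        z += w
        for i in range(n - 1):
            out[i][y[i]][y[i + 1]] += w * score
    return [[[v / z for v in row] for row in tab] for tab in out]


# Made-up potentials; position 1 has zero scores, like an unaligned word.
node = [[1.0, 2.0], [0.5, 1.5], [2.0, 1.0], [1.0, 1.0]]
edge = [[1.0, 0.5], [0.3, 2.0]]
s = [[0.2, -0.1], [0.0, 0.0], [-0.4, 0.3], [0.1, 0.5]]

dp = edge_score_expectations(node, edge, s)
bf = brute_force(node, edge, s)
max_gap = max(abs(dp[i][p][q] - bf[i][p][q])
              for i in range(3) for p in range(2) for q in range(2))
```

Because only scalar-weighted masses are stored per label, the memory per step is a vector rather than a matrix, which is the point of the trick in Eqn. 5.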

3.2 Hard vs. Soft Projection

Projecting expectations instead of one-best label assignments from English to the foreign language can be thought of as a soft version of the method described in [Das and Petrov, 2011] and [Ganchev et al., 2009]. Soft projection has its advantage: when the English model is not certain about its predictions, we do not have to commit to the current best prediction. The foreign model has more freedom to form its own belief since any marginal distribution it produces would deviate from a flat distribution by just about the same amount. In general, preserving uncertainties till later is a strategy that has benefited many NLP tasks [Finkel et al., 2006].

Hard projection can also be treated as a special case in our framework. We can simply recalibrate the posterior marginal of English by assigning probability mass 1 to the most likely outcome and zeroing everything else out, effectively taking the $\arg\max$ of the marginal at each word position. We refer to this version of expectation as the “hard” expectation.

In the hard projection setting, GE training resembles a “project-then-train” style semi-supervised CRF training scheme [Yarowsky and Ngai, 2001; Täckström et al., 2013]. In such a training scheme, we project the one-best predictions of the English CRF to the foreign side through word alignments, then include the newly “tagged” foreign data as additional training data for a standard CRF in the foreign language. The difference between GE training and this scheme is that they optimize different objectives: CRF optimizes the maximum conditional likelihood of the observed label sequence, whereas GE minimizes the squared error between the model’s expectation and the “hard” expectation based on the observed label sequence. We compare the hard and soft variants of GE with the project-then-train style CRF training in our experiments and report results in Section 4.2.
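The recalibration that turns a soft expectation into a hard one is small enough to show directly. The following Python sketch is illustrative; the function name `harden` and the example distribution are hypothetical, and ties are broken in favor of the first label, which is an arbitrary choice.

```python
from typing import List


def harden(marginal: List[float]) -> List[float]:
    """Put probability mass 1 on the most likely label and 0 elsewhere."""
    best = max(range(len(marginal)), key=lambda j: marginal[j])
    return [1.0 if j == best else 0.0 for j in range(len(marginal))]


def squared_error(model: List[float], target: List[float]) -> float:
    return sum((t - p) ** 2 for p, t in zip(model, target))


# An uncertain English marginal over (PERSON, LOCATION, O).
soft = [0.45, 0.40, 0.15]
hard = harden(soft)

# A foreign model that prefers LOCATION is barely penalized by the soft
# target but heavily penalized by the hard one.
foreign = [0.40, 0.45, 0.15]
soft_penalty = squared_error(foreign, soft)
hard_penalty = squared_error(foreign, hard)
```

With the soft target the foreign model keeps room to disagree with a nearly tied English prediction, whereas the hard target commits it to the English one-best label.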

4 Experiments

We conduct experiments on Chinese and German NER. We evaluate CLiPPER in two learning settings: weakly supervised and semi-supervised. In the weakly supervised setting, we simulate the condition of having no labeled training data, and evaluate the model learned from bitext alone. We then vary the amount of labeled data available to the model, and examine the model’s learning curve. In the semi-supervised setting, we assume our model has access to the full labeled data; our goal is to improve performance of the supervised method by learning from additional bitext.

4.1 Dataset and Setup

We used the latest version of the Stanford NER Toolkit as our base CRF model in all experiments. Features for English, Chinese and German CRFs are documented extensively in [Che et al., 2013] and [Faruqui and Padó, 2010] and omitted here for brevity. It is worth noting that the current Stanford NER models include recent improvements from semi-supervised learning approaches that induce distributional similarity features from large word clusters. These models represent the current state-of-the-art in supervised methods, and serve as a very strong baseline.

For Chinese NER experiments, we follow the same setup as Che et al. (2013) to evaluate on the latest OntoNotes (v4.0) corpus [Hovy et al., 2006] (Footnote 9: LDC catalogue No.: LDC2011T03). A total of 8,249 sentences from the parallel Chinese and English Penn Treebank portion (Footnote 10: File numbers: chtb_0001-0325, ectb_1001-1078) are reserved for evaluation. Odd-numbered documents are used as the development set, and even-numbered documents are held out as a blind test set. The rest of OntoNotes annotated with NER tags is used to train the English and Chinese CRF base taggers. There are about 16k and 39k labeled sentences for Chinese and English training, respectively. The English CRF tagger trained on this training corpus gives an F score of 81.68% on the OntoNotes test set. Four entity types (Person, Location, Organization and GPE) are used with a BO tagging scheme. The English-Chinese bitext comes from the Foreign Broadcast Information Service corpus (FBIS) (Footnote 11: LDC catalogue No.: LDC2003E14). It is first sentence aligned using the Champollion Tool Kit, then word aligned with the

For German NER experiments, we evaluate using the standard CoNLL-03 NER corpus [Sang and Meulder, 2003]. The labeled training set has 12k and 15k sentences. We used the de-en portion of the News Commentary data from WMT13 as bitext. The English CRF tagger trained on the CoNLL-03 English training corpus gives an F score of 90.4% on the CoNLL-03 test set.

We report standard entity-level precision (P), recall (R) and F score given by the ConllEval script on both the development and test sets. Statistical significance tests are done using a paired bootstrap resampling method with 1000 iterations, averaged over 5 runs. We compare against three recent approaches that were introduced in Section 2. They are: a semi-supervised learning method using factored bilingual models with Gibbs sampling [Wang et al., 2013]; bilingual NER using Integer Linear Programming (ILP) with bilingual constraints [Che et al., 2013]; and a constraint-driven bilingual-reranking approach [Burkett et al., 2010]. The code from [Che et al., 2013] and [Wang et al., 2013] is publicly available. Code from [Burkett et al., 2010] was obtained through personal communications. (Footnote 16: Due to technical difficulties, we are unable to replicate the Burkett et al. (2010) experiments on German NER, therefore only Chinese results are reported.)
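A paired bootstrap test of this kind can be sketched as follows. The text only specifies paired bootstrap resampling with 1000 iterations, so everything else here is an assumption for illustration: the per-sentence count tuples, the function names, and the definition of confidence as the fraction of resamples in which system A beats system B on entity-level F score.

```python
import random
from typing import List, Tuple

# Per-sentence counts: (correct entities, predicted entities, gold entities).
Counts = Tuple[int, int, int]


def f_score(counts: List[Counts]) -> float:
    """Entity-level F score pooled over a list of sentences."""
    tp = sum(c[0] for c in counts)
    predicted = sum(c[1] for c in counts)
    gold = sum(c[2] for c in counts)
    if tp == 0:
        return 0.0
    precision, recall = tp / predicted, tp / gold
    return 2 * precision * recall / (precision + recall)


def paired_bootstrap(counts_a: List[Counts], counts_b: List[Counts],
                     iterations: int = 1000, seed: int = 0) -> float:
    """Fraction of resampled test sets on which system A outscores system B.

    Both systems are scored on the same resampled sentence indices, which is
    what makes the test paired.
    """
    rng = random.Random(seed)
    n = len(counts_a)
    wins = 0
    for _ in range(iterations):
        idx = [rng.randrange(n) for _ in range(n)]
        if f_score([counts_a[i] for i in idx]) > f_score([counts_b[i] for i in idx]):
            wins += 1
    return wins / iterations


# Toy data: system A finds one more correct entity than B in half the sentences.
system_a = [(2, 2, 2)] * 50 + [(1, 2, 2)] * 50
system_b = [(1, 2, 2)] * 100
confidence = paired_bootstrap(system_a, system_b)
```

Averaging the returned value over several runs with different seeds would correspond to the "averaged over 5 runs" protocol mentioned above.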

Since the objective function in Eqn. 2 is non-convex, we adopted the early stopping training scheme from [Turian et al., 2010] as follows: after each iteration in L-BFGS training, the model is evaluated against the development set; the training procedure is terminated if no improvements have been made in 20 iterations.
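The stopping rule can be written as a small loop around any iterative optimizer. The sketch below is a generic illustration, not the Stanford NER training code: `step` and `evaluate` are hypothetical callables standing in for one optimizer iteration and one development-set evaluation, and `max_iterations` is an added safety bound.

```python
from typing import Callable, Tuple


def train_with_early_stopping(step: Callable[[], None],
                              evaluate: Callable[[], float],
                              patience: int = 20,
                              max_iterations: int = 1000) -> Tuple[float, int]:
    """Run `step` repeatedly and stop once the development score has not
    improved for `patience` consecutive iterations.

    Returns the best development score and the iteration that produced it.
    """
    best_score, best_iter = float("-inf"), -1
    for it in range(max_iterations):
        step()              # one optimizer iteration (hypothetical)
        score = evaluate()  # development-set F score (hypothetical)
        if score > best_score:
            best_score, best_iter = score, it
        elif it - best_iter >= patience:
            break
    return best_score, best_iter


# Toy run: the development score peaks at iteration 2 and then plateaus lower.
dev_scores = [0.5, 0.6, 0.7] + [0.65] * 100
state = {"i": -1}


def fake_step() -> None:
    state["i"] += 1


def fake_evaluate() -> float:
    return dev_scores[state["i"]]


result = train_with_early_stopping(fake_step, fake_evaluate)
```

In practice one would also keep a copy of the parameters from the best iteration, since the final parameters are 20 iterations past the best development score.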

4.2 Weakly Supervised Results

[Figure 2 plots. Panels (a) Chinese Dev, (b) German Dev, (c) Chinese Test, and (d) German Test compare supervised CRF with CLiPPER soft; panels (e) Soft vs. Hard on Chinese Test and (f) Soft vs. Hard on German Test compare CRF projection, CLiPPER hard, and CLiPPER soft. In every panel the horizontal axis is the # of labeled training sentences [k] and the vertical axis is the F1 score [%].]
Figure 2: Performance curves of CLiPPER with varying amounts of available labeled training data in a weakly supervised setting. Vertical axes show the F score on the development and test set, respectively. Performance curves of supervised CRF and “project-then-train” CRF are plotted for comparison.

The top four panels in Figure 2 show results of the weakly supervised learning experiments. Quite remarkably, on the Chinese test set, our proposed method (CLiPPER) achieves an F score of 64.4% with 80k bitext, when no labeled training data is used. In contrast, the supervised CRF baseline would require as many as 12k labeled sentences to attain the same accuracy. Results on the German test set are less striking. With no labeled data and 40k of bitext, CLiPPER performs at an F of 60.0%, the equivalent of using 1.5k labeled examples in the supervised setting. When combined with 1k labeled examples, performance of CLiPPER reaches 69%, a gain of over 5% absolute over supervised CRF. We also notice that the supervised CRF model learns much faster in German than in Chinese. This result is not too surprising, since it is well recognized that Chinese NER is more challenging than German or English due to the lack of orthographical features, such as word capitalization. Chinese NER relies more on lexicalized features, and therefore needs more labeled data to achieve good coverage. The results also suggest that CLiPPER is very effective at transferring lexical knowledge from English to Chinese.

The bottom two panels in Figure 2 compare soft GE projection with hard GE projection and the “project-then-train” style CRF training scheme (cf. Section 3.2). We observe that both soft and hard GE projection significantly outperform the “project-then-train” style training scheme. The difference is especially pronounced on the Chinese results when fewer labeled examples are available. Soft projection gives better accuracy than hard projection when no labeled data is available, and also has a faster learning rate.

4.3 Semi-supervised Results

System | Chinese P | Chinese R | Chinese F1 | German P | German R | German F1
CRF | 79.87 | 63.62 | 70.83 | 88.05 | 73.03 | 79.84
CLiPPER 10k | 81.36 | 65.16 | 72.36 | 85.23 | 77.79 | 81.34
CLiPPER 20k | 81.79 | 64.80 | 72.31 | 88.11 | 75.93 | 81.57
CLiPPER 40k | 79.24 | 66.08 | 72.06 | 88.25 | 76.52 | 81.97
CLiPPER 80k | 80.26 | 65.92 | 72.38 | 87.80 | 76.82 | 81.94
Table 1: Chinese and German NER results on the development set using CLiPPER with varying amounts of unlabeled bitext (10k, 20k, etc.). Best number of each column is highlighted in bold. The F score improvements over CRF baseline in all cases are statistically significant at 99.9% confidence level.

In the semi-supervised experiments, we let the CRF model use the full set of labeled examples in addition to the unlabeled bitext. Table 1 shows results on the development dataset for Chinese and German using 10-80k bitext. We see that with merely 10k additional bitext, CLiPPER is able to improve significantly over state-of-the-art CRF baselines by as much as 1.5% F on both Chinese and German. With more unlabeled data, we notice a tradeoff between precision and recall on Chinese. The final F score on Chinese at 80k level is only marginally better than 10k. On the other hand, we observe a modest but steady improvement on German as we add more unlabeled bitext, up until 40k sentences. We select the best configurations on development set (80k for Chinese and 40k for German) to evaluate on test set.

System | Chinese P | Chinese R | Chinese F1 | German P | German R | German F1
CRF | 79.09 | 63.56 | 70.48 | 86.77 | 71.30 | 78.28
CRF (project-then-train) | 84.01 | 45.29 | 58.85 | 81.50 | 75.56 | 78.41
WCD13 | 80.31 | 65.78 | 72.33 | 85.98 | 72.37 | 78.59
CWD13 | 81.31 | 65.50 | 72.55 | 85.99 | 72.98 | 78.95
BPBK10 | 79.25 | 65.67 | 71.83 | - | - | -
CLiPPER (hard) | 83.67 | 64.80 | 73.04 | 86.52 | 72.02 | 78.61
CLiPPER (soft) | 82.57 | 65.99 | 73.35 | 87.11 | 72.56 | 79.17
Table 2: Chinese and German NER results on the test set. Best number of each column is highlighted in bold. CRF is the supervised baseline. CRF (project-then-train) is the “project-then-train” semi-supervised scheme for CRF. WCD13 is [Wang et al., 2013], CWD13 is [Che et al., 2013], and BPBK10 is [Burkett et al., 2010]. CLiPPER (soft) and CLiPPER (hard) are the soft and hard projections. indicates F scores that are statistically significantly better than CRF baseline at 99.5% confidence level; marks significance over CRF with 99.5% confidence; and marks significance over WCD13 with 99.9% and 94% confidence; and marks significance over CWD13 with 99.7% confidence; marks significance over BPBK10 with 99.9% confidence.
(b) Example where the word preceding “monument” is of type LOCATION
Figure 3: Examples of aligned sentence pairs in Chinese and English. The lines going across a sentence pair indicate individual word alignments induced by an automatic word aligner. Entities of type Location are highlighted in magenta, and entities of type Person are highlighted in blue.

Results on the test set are shown in Table 2. All semi-supervised baselines are tested with the same amount of unlabeled bitext as CLiPPER in each language. The “project-then-train” semi-supervised training scheme severely hurts performance on Chinese, but gives a small improvement on German. Moreover, on Chinese it learns to achieve high precision but at a significant loss in recall. On German its behavior is the opposite. Such a drastic and erratic imbalance suggests that this method is not robust or reliable. The other three semi-supervised baselines (rows 3-5) all show improvements over the CRF baseline, consistent with their reported results. CLiPPER with soft projection gives the best results on both Chinese and German, yielding statistically significant improvements over all baselines except for CWD13 on German. The hard projection version of CLiPPER also gives a sizable gain over CRF. However, in comparison, the soft projection version is superior.

The improvement of CLiPPER with soft projection over CRF on the Chinese test set is over 2.8% in absolute F. The improvement over CRF on German is almost a percent. To our knowledge, these are the best reported numbers on the OntoNotes Chinese and CoNLL-03 German datasets.

4.4 Efficiency

Another advantage of our proposed approach is efficiency. Because we eliminated the previous multi-stage “project-then-train” paradigm and instead integrated the semi-supervised and supervised objectives into one joint objective, we are able to attain significant speed improvements. Table 3 shows the training time required to produce the models that give the results in Table 2.

System | Chinese | German
CRF | 19m30s | 7m15s
CRF (project-then-train) | 34m2s | 12m45s
WCD13 | 3h17m | 1h1m
CWD13 | 16h42m | 4h49m
BPBK10 | 6h16m | -
CLiPPER (hard) | 1h28m | 16m30s
CLiPPER (soft) | 1h40m | 18m51s
Table 3: Timing stats during model training.

5 Error Analysis and Discussion

Figure 3 gives two examples of CLiPPER in action. Both examples have a named entity that immediately precedes the word “纪念碑” (monument) in the Chinese sentence. In Figure 3(a), the word “高岗” has the literal meaning of a hillock located at a high position, which also happens to be the name of a former vice president of China. Without having previously observed this word as a person name in the labeled training data, the CRF model does not have enough evidence to believe that this is a Person instead of a Location. But the aligned words in English (“Gao Gang”) are clearly part of a person name, as they are preceded by a title (“Vice President”). The English model has a high expectation that the aligned Chinese word of “Gao Gang” is also a Person. Therefore, projecting the English expectations to Chinese provides a strong clue to help disambiguate this word. Figure 3(b) gives another example: the word “黄河” (Huang He, the Yellow River of China) can be confused with a person name since “黄” (Huang or Hwang) is also a common Chinese last name. (Footnote 17: In fact, a people search of the name 黄河 on the Chinese equivalent of Facebook returns over 13,000 matches.) Again, knowing the translation in English, which has the indicative word “River” in it, helps disambiguation.

6 Conclusion

We introduced a domain- and language-independent semi-supervised method for training discriminative models by projecting expectations across bitext. Experiments on Chinese and German NER show that our method, learned over bitext alone, can rival the performance of supervised models trained with thousands of labeled examples. Furthermore, applying our method in a setting where all labeled examples are available also shows improvements over state-of-the-art supervised methods. Our experiments also showed that soft expectation projection is preferable to hard projection. This technique can be generalized to all sequence labeling tasks, and can be extended to include more complex constraints. For future work, we plan to apply this method to more language pairs and examine the formal properties of the model.


References

  • [Alshawi et al., 2000] Hiyan Alshawi, Srinivas Bangalore, and Shona Douglas. 2000. Head-transducer models for speech translation and their automatic acquisition from bilingual data. Machine Translation, 15.
  • [Ando and Zhang, 2005] Rie Kubota Ando and Tong Zhang. 2005. A high-performance semi-supervised learning method for text chunking. In Proceedings of ACL.
  • [Bellare et al., 2009] Kedar Bellare, Gregory Druck, and Andrew McCallum. 2009. Alternating projections for learning with expectation constraints. In Proceedings of UAI.
  • [Blum and Mitchell, 1998] Avrim Blum and Tom Mitchell. 1998. Combining labeled and unlabeled data with co-training. In Proceedings of COLT.
  • [Burkett and Klein, 2008] David Burkett and Dan Klein. 2008. Two languages are better than one (for syntactic parsing). In Proceedings of EMNLP.
  • [Burkett et al., 2010] David Burkett, Slav Petrov, John Blitzer, and Dan Klein. 2010. Learning better monolingual models with unannotated bilingual text. In Proceedings of CoNLL.
  • [Carlson et al., 2010] Andrew Carlson, Justin Betteridge, Richard C. Wang, Estevam R. Hruschka Jr., and Tom M. Mitchell. 2010. Coupled semi-supervised learning for information extraction. In Proceedings of WSDM.
  • [Chang et al., 2007] Ming-Wei Chang, Lev Ratinov, and Dan Roth. 2007. Guiding semi-supervision with constraint-driven learning. In Proceedings of ACL.
  • [Che et al., 2013] Wanxiang Che, Mengqiu Wang, and Christopher D. Manning. 2013. Named entity recognition with bilingual constraints. In Proceedings of NAACL.
  • [Collins and Singer, 1999] Michael Collins and Yoram Singer. 1999. Unsupervised models for named entity classification. In Proceedings of EMNLP.
  • [Das and Petrov, 2011] Dipanjan Das and Slav Petrov. 2011. Unsupervised part-of-speech tagging with bilingual graph-based projections. In Proceedings of ACL.
  • [Druck and McCallum, 2010] Gregory Druck and Andrew McCallum. 2010. High-performance semi-supervised learning using discriminatively constrained generative models. In Proceedings of ICML.
  • [Druck et al., 2007] Gregory Druck, Gideon Mann, and Andrew McCallum. 2007. Leveraging existing resources using generalized expectation criteria. In Proceedings of NIPS Workshop on Learning Problem Design.
  • [Druck et al., 2009] Gregory Druck, Burr Settles, and Andrew McCallum. 2009. Active learning by labeling features. In Proceedings of EMNLP.
  • [Druck, 2011] Gregory Druck. 2011. Generalized Expectation Criteria for Lightly Supervised Learning. Ph.D. thesis, University of Massachusetts Amherst.
  • [Faruqui and Padó, 2010] Manaal Faruqui and Sebastian Padó. 2010. Training and evaluating a German named entity recognizer with semantic generalization. In Proceedings of KONVENS.
  • [Finkel et al., 2006] Jenny Rose Finkel, Christopher D. Manning, and Andrew Y. Ng. 2006. Solving the problem of cascading errors: Approximate Bayesian inference for linguistic annotation pipelines. In Proceedings of EMNLP.
  • [Fossum and Abney, 2005] Victoria Fossum and Steven Abney. 2005. Automatically inducing a part-of-speech tagger by projecting from multiple source languages across aligned corpora. In Proceedings of IJCNLP.
  • [Ganchev et al., 2009] Kuzman Ganchev, Jennifer Gillenwater, and Ben Taskar. 2009. Dependency grammar induction via bitext projection constraints. In Proceedings of ACL.
  • [Ganchev et al., 2010] Kuzman Ganchev, João Graça, Jennifer Gillenwater, and Ben Taskar. 2010. Posterior regularization for structured latent variable models. JMLR, 11:2001–2049.
  • [Goldberg, 2010] Andrew B. Goldberg. 2010. New Directions in Semi-supervised Learning. Ph.D. thesis, University of Wisconsin-Madison.
  • [Hovy et al., 2006] Eduard Hovy, Mitchell Marcus, Martha Palmer, Lance Ramshaw, and Ralph Weischedel. 2006. OntoNotes: the 90% solution. In Proceedings of NAACL-HLT.
  • [Klein, 2005] Dan Klein. 2005. The Unsupervised Learning of Natural Language Structure. Ph.D. thesis, Stanford University.
  • [Lafferty et al., 2001] John D. Lafferty, Andrew McCallum, and Fernando C. N. Pereira. 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of ICML.
  • [Li and Eisner, 2009] Zhifei Li and Jason Eisner. 2009. First- and second-order expectation semirings with applications to minimum-risk training on translation forests. In Proceedings of EMNLP.
  • [Li et al., 2012] Shen Li, João Graça, and Ben Taskar. 2012. Wiki-ly supervised part-of-speech tagging. In Proceedings of EMNLP-CoNLL.
  • [Liang, 2005] Percy Liang. 2005. Semi-supervised learning for natural language. Master’s thesis, Massachusetts Institute of Technology.
  • [Mann and McCallum, 2010] Gideon Mann and Andrew McCallum. 2010. Generalized expectation criteria for semi-supervised learning with weakly labeled data. JMLR, 11:955–984.
  • [McClosky et al., 2006] David McClosky, Eugene Charniak, and Mark Johnson. 2006. Effective self-training for parsing. In Proceedings of NAACL-HLT.
  • [Naseem et al., 2009] Tahira Naseem, Benjamin Snyder, Jacob Eisenstein, and Regina Barzilay. 2009. Multilingual part-of-speech tagging: Two unsupervised approaches. JAIR, 36.
  • [Petrov et al., 2010] Slav Petrov, Pi-Chuan Chang, Michael Ringgaard, and Hiyan Alshawi. 2010. Uptraining for accurate deterministic question parsing. In Proceedings of EMNLP.
  • [Samdani et al., 2012] Rajhans Samdani, Ming-Wei Chang, and Dan Roth. 2012. Unified expectation maximization. In Proceedings of NAACL.
  • [Sang and Meulder, 2003] Erik F. Tjong Kim Sang and Fien De Meulder. 2003. Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition. In Proceedings of CoNLL.
  • [Sindhwani et al., 2005] Vikas Sindhwani, Partha Niyogi, and Mikhail Belkin. 2005. A co-regularization approach to semi-supervised learning with multiple views. In Proceedings of ICML Workshop on Learning with Multiple Views.
  • [Smith, 2006] Noah A. Smith. 2006. Novel Estimation Methods for Unsupervised Discovery of Latent Structure in Natural Language Text. Ph.D. thesis, Johns Hopkins University.
  • [Snyder et al., 2009] Benjamin Snyder, Tahira Naseem, and Regina Barzilay. 2009. Unsupervised multilingual grammar induction. In Proceedings of ACL.
  • [Suzuki and Isozaki, 2008] Jun Suzuki and Hideki Isozaki. 2008. Semi-supervised sequential labeling and segmentation using giga-word scale unlabeled data. In Proceedings of ACL.
  • [Täckström et al., 2013] Oscar Täckström, Dipanjan Das, Slav Petrov, Ryan McDonald, and Joakim Nivre. 2013. Token and type constraints for cross-lingual part-of-speech tagging. In Proceedings of ACL.
  • [Turian et al., 2010] Joseph Turian, Lev Ratinov, and Yoshua Bengio. 2010. Word representations: A simple and general method for semi-supervised learning. In Proceedings of ACL.
  • [Wang et al., 2013] Mengqiu Wang, Wanxiang Che, and Christopher D. Manning. 2013. Effective bilingual constraints for semi-supervised learning of named entity recognizers. In Proceedings of AAAI.
  • [Xi and Hwa, 2005] Chenhai Xi and Rebecca Hwa. 2005. A backoff model for bootstrapping resources for non-English languages. In Proceedings of HLT-EMNLP.
  • [Yarowsky and Ngai, 2001] David Yarowsky and Grace Ngai. 2001. Inducing multilingual POS taggers and NP bracketers via robust projection across aligned corpora. In Proceedings of NAACL.
  • [Yarowsky, 1995] David Yarowsky. 1995. Unsupervised word sense disambiguation rivaling supervised methods. In Proceedings of ACL.