Estimation from Indirect Supervision with Linear Moments
Abstract
In structured prediction problems where we have indirect supervision of the output, maximum marginal likelihood faces two computational obstacles: nonconvexity of the objective and intractability of even a single gradient computation. In this paper, we bypass both obstacles for a class of what we call linear indirectly-supervised problems. Our approach is simple: we solve a linear system to estimate sufficient statistics of the model, which we then use to estimate parameters via convex optimization. We analyze the statistical properties of our approach and show empirically that it is effective in two settings: learning with local privacy constraints and learning from low-cost count-based annotations. [Footnote 1: This is an extended and updated version of our paper in Proceedings of the International Conference on Machine Learning, New York, NY, USA, 2016. JMLR: W&CP volume 48. Copyright 2016 by the author(s).]
Stanford University, Stanford, CA
1 Introduction
Consider the problem of estimating a probabilistic graphical model from indirect supervision, where only a partial view of the variables is available. We are interested in indirect supervision for two reasons: first, one might not trust a data collector and wish to use privacy mechanisms to reveal only partial information about sensitive data (Warner, 1965; Evfimievski et al., 2004; Dwork et al., 2006; Duchi et al., 2013). Second, if data is generated by human annotators, say in a crowdsourcing setting, it can often be more cost-effective to solicit lightweight annotations (Oded & Tomás, 1998; Mann & McCallum, 2008; Quadrianto et al., 2008; Liang et al., 2009). In both cases, we trade statistical efficiency for another resource: privacy or annotator cost.
Indirect supervision is naturally handled by defining a latent-variable model where the structure of interest is treated as a latent variable. While statistically sensible, learning latent-variable models poses two computational challenges. First, maximum marginal likelihood requires nonconvex optimization, where popular procedures such as gradient descent or Expectation Maximization (EM) (Dempster et al., 1977) are only guaranteed to converge to local optima. Second, even the computation of the gradient or performing the E-step can be intractable, requiring probabilistic inference on a loopy graphical model induced by the indirect supervision (Chang et al., 2007; Graça et al., 2008; Liang et al., 2009).
In this paper, we propose an approach that bypasses both computational obstacles for a class of problems that we call linear indirectly-supervised learning problems. We lean on the method of moments (Pearson, 1894), which has recently led to advances in learning latent-variable models (Hsu et al., 2009; Anandkumar et al., 2013; Chaganty & Liang, 2014), although we do not appeal to tensor factorization. Instead, we express indirect supervision as a linear combination of the sufficient statistics of the model, which we recover by solving a simple noisy linear system. Once we have the sufficient statistics, we use convex optimization to solve for the model parameters. The key is that while supervision per example is indirect and leads to intractability, aggregation over a large number of examples renders the problem tractable.
While our moments-based estimator yields computational benefits, we suffer some statistical loss relative to maximum marginal likelihood. In Section 5, we compare the asymptotic variance of marginal-likelihood and moment-based estimators, and provide some geometric insight into their differences in Section 6. Finally, in Section 7, we apply our framework empirically to our two motivating settings: (i) learning a regression model under local privacy constraints, and (ii) learning a part-of-speech tagger with lightweight annotations. In both applications, we show that our moments-based estimator obtains good accuracies.
2 Setup
Notation.
We use superscripts to enumerate instances in a data sample (e.g. ), and square-bracket indexing to enumerate components of a vector or sequence: denotes the component(s) of associated with . For a real vector , we let .
Model.
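Consider the structured prediction task of mapping an input $x$ to some output $y$. We model this mapping using a conditional exponential family
$$p_\theta(y \mid x) = \exp\{\phi(x, y)^\top \theta - A(\theta; x)\}, \tag{1}$$
where $\phi(x, y) \in \mathbb{R}^d$ is the feature mapping, $\theta \in \mathbb{R}^d$ is the parameter vector, and $A(\theta; x) = \log \sum_{y'} \exp\{\phi(x, y')^\top \theta\}$ is the log-partition function. For concreteness, we specialize to conditional random fields (CRFs) (Lafferty et al., 2001) over collections of $K$-variate labels, where $x = (x_1, \dots, x_L)$ and $y = (y_1, \dots, y_L) \in [K]^L$; here $L$ is the number of variables and $K$ is the number of possible labels per variable. We let $\mathcal{C}$ be the set of cliques in the CRF, so that the features decompose into a sum over cliques: $\phi(x, y) = \sum_{c \in \mathcal{C}} \phi_c(x, y[c])$. As one particular example, if $\mathcal{C}$ consists of all nodes $\{i\}$ and edges between adjacent nodes $\{i, i+1\}$, the CRF is chain-structured.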
Learning from indirect supervision.
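In the indirectly supervised setting that is the focus of this paper, we do not have access to $y$ but rather only observations $o$, where $o$ is drawn from a known supervision distribution $S(o \mid x, y)$.
Maximum marginal likelihood.
The classic paradigm is to maximize the marginal likelihood:
$$\hat\theta_{\mathrm{marg}} = \arg\max_{\theta} \hat{\mathbb{E}}\Big[\log \sum_{y} S(o \mid x, y)\, p_\theta(y \mid x)\Big], \tag{2}$$
where $p_\theta(y \mid x)$ is the model (1) and $\hat{\mathbb{E}}$ denotes an expectation over the training sample. While $\hat\theta_{\mathrm{marg}}$ is often statistically efficient, two computational difficulties are associated with this approach, as noted in Section 1: the objective is nonconvex, and even a single gradient computation can require intractable inference.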
Our approach: momentbased estimation.
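We present a simple approach to circumvent the above issues for a restricted class of models, in the same vein as Chaganty & Liang (2014). To begin, consider the fully-supervised setting, where we observe examples $\{(x^{(i)}, y^{(i)})\}_{i=1}^{n}$. In this case, we could maximize the likelihood $\hat{\mathbb{E}}[\log p_\theta(y \mid x)]$, solving $\arg\max_\theta \{\hat\mu^\top \theta - \hat{\mathbb{E}}[A(\theta; x)]\}$, where $\hat\mu = \hat{\mathbb{E}}[\phi(x, y)]$ are the sufficient statistics, which converge to $\mu^* = \mathbb{E}[\phi(x, y)]$. Here $\phi$ is the feature mapping and $A$ the log-partition function of the model (1). Therefore, if we could construct a consistent estimate of $\mu^*$, then we could solve the same convex optimization problem used in the fully-supervised estimator.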
Of course, we do not have access to . Instead, in our (linearly) indirectly supervised setting, we are able to define an observation function which is nonetheless in expectation equal to the population sufficient statistics:
(3) 
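In general, we construct the observation function by solving a linear system. Putting the pieces together yields our estimator (Figure 2), which proceeds in two steps:
1. Sufficient statistics: average the observation function over the training sample to obtain a consistent estimate of the population sufficient statistics.
2. Parameters: solve the convex optimization problem of the fully-supervised estimator, with this estimate in place of the empirical sufficient statistics.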
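The two steps can be sketched in a few lines for the simplest case. This is a minimal illustration under assumptions that are ours, not part of the estimator's definition: the input is empty (the unconditional case also used in Section 5), the output space is small enough to enumerate, and the names `fit_moments`, `moment_estimator`, and `model_moments`, as well as the step size and iteration count, are hypothetical.

```python
import numpy as np

def fit_moments(mu_hat, features, steps=5000, lr=0.5):
    """Maximize mu_hat . theta - A(theta) by gradient ascent (a concave problem).

    features has shape (num_outputs, d); row y holds the statistics phi(y).
    """
    theta = np.zeros(features.shape[1])
    for _ in range(steps):
        scores = features @ theta
        probs = np.exp(scores - scores.max())
        probs /= probs.sum()
        # Gradient: estimated statistics minus the model's expected statistics.
        theta += lr * (mu_hat - probs @ features)
    return theta

def moment_estimator(betas, features):
    """Step 1: average the observation-function values. Step 2: convex fit."""
    mu_hat = np.mean(betas, axis=0)
    return fit_moments(mu_hat, features)

def model_moments(theta, features):
    """Expected statistics under the fitted model, for checking the fit."""
    scores = features @ theta
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()
    return probs @ features
```

With finite samples the averaged statistics may fall outside the set of realizable moments, in which case the second step has no finite maximizer; a practical implementation would likely add regularization, which this sketch omits.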
In the next two sections, we describe the observation function for learning with local privacy (Section 3) and lightweight annotations (Section 4).
3 Learning under local privacy
Suppose we wish to estimate a conditional distribution , where is nonsensitive information about an individual and contains sensitive information (for example, income or disease status). Individuals, because of a variety of reasons—mistrust, embarrassment, fear of discrimination—may wish to keep private and instead release some . To quantify the amount of privacy afforded by , we turn to the literature on privacy in databases and theoretical computer science (Evfimievski et al., 2004; Dwork et al., 2006) and say that is differentially private if any two have comparable probability (up to a factor of ) of generating :
(4) 
What should we employ? We first explore the classical randomized response (RR) mechanism (Section 3.1), and then develop a new mechanism that leverages the graphical model structure (Section 3.2).
3.1 Classic randomized response
Warner (1965) proposed the now-classical randomized response technique, which proceeds as follows: For some fixed (generally small) , the respondent reveals with probability and with probability draws a sample from a (known) base distribution (generally uniform) over . Formally, the classical randomized response supervision is
(5) 
Estimation.
Privacy.
We can check that the ratio , so classical randomized response is differentially private. For any distribution , this value is at least , a linear dependence on . In classical randomized response settings, , which is unproblematic. In contrast, in structured prediction settings, the number of labels is exponential in the number of variables (), so we must take . The asymptotic variance of scales as (as will be shown in Section 5), which makes classical randomized response unusable for structured prediction.
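To make the mechanism concrete, the following sketch builds the classical randomized response channel as a matrix, inverts its effect on the label distribution, and computes the worst-case likelihood ratio that determines the privacy level. It assumes that `eps` is the probability of revealing the true label and `u` is the base distribution; the function names are hypothetical.

```python
import numpy as np

def rr_channel(eps, u):
    """Channel matrix with S[o, y] = eps * 1[o == y] + (1 - eps) * u[o]."""
    k = len(u)
    return eps * np.eye(k) + (1.0 - eps) * np.outer(u, np.ones(k))

def debias(obs_freq, eps, u):
    """Solve obs_freq = eps * pi + (1 - eps) * u for the label distribution pi."""
    return (obs_freq - (1.0 - eps) * u) / eps

def privacy_ratio(S):
    """Largest ratio S[o, y] / S[o, y'] over observations o and labels y, y'."""
    return max(S[o, :].max() / S[o, :].min() for o in range(S.shape[0]))
```

For a uniform base distribution over three labels and `eps = 0.2`, the ratio is 1 + 0.2 / (0.8 / 3) = 1.75. The division by `eps` in `debias` is also where the variance blows up as `eps` shrinks.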
3.2 Structured randomized response
With this difficulty in mind, we recognize that we must somehow leverage the structure of the sufficient statistics vector to perform estimation. In particular, we show that the supervision should only depend on the sufficient statistics:
Proposition 1.
Let be the set in which observations live. For any privacy mechanism that is differentially private, there exists a mechanism that is at least differentially private, and for any set , we have
(8) 
where and .
In short, we can always construct that only uses the sufficient statistics but yields the same joint distribution over the pairs . Furthermore, is at least as private as the original mechanism . See Appendix A.1 for a proof.
This motivates a focus exclusively on mechanisms that use sufficient statistics, and in particular, we consider the following two structured randomized response mechanisms. Our schemes are both two-phase procedures that first binarize the sufficient statistics, and then release a set of observations inspired by the minimax optimal approach of Duchi et al. (2013) to estimating a multinomial distribution. For , let . Assume each coordinate of the statistics lies in the interval for some positive scalar . For , draw as a Bernoulli variable with bias . Then:
(Coordinate release) Draw a coordinate from a distribution . Set with probability , otherwise . Release the pair .
(Per-value RR) Denote by the support of given , let , and take any . For , set with probability , otherwise . Release the vector .
Both are differentially private (see Appendix A.1). For coordinate release, define the observation function
where denotes the ’th standard basis vector. For the per-value statistics scheme, define the observation function
(9) 
In either case, we have that , as required by (3) for to be consistent.
The two schemes offer a tradeoff: when is dense, coordinate release is advantageous, as our best norm bound may be as large as the dimension , so although we reveal only a single coordinate at a time, we noise it by a lower-variance distribution rather than the noise of the per-value scheme. Meanwhile, per-value RR enjoys lower variance when has low norm. The latter case arises, for instance, if is a sparse binary vector as is common in structured prediction. Appendices A.3 and A.4 present more details about this tradeoff offered by the schemes.
Summarizing, we have three randomized response schemes. Classical RR appeals only in unstructured problems with few outputs . In the structured setting, we can move to the sufficient statistics by Proposition 1, and exploit their structure with either of two schemes based on our knowledge of the ℓ1-norm or sparsity of statistics .
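As an illustration of the binarize-then-flip idea behind coordinate release, here is a minimal sketch. The specific choices are assumptions of ours rather than the exact constants of the scheme: statistics lie in `[-b, b]`, the binarization bias is `(phi_j + b) / (2b)`, the coordinate is drawn uniformly, and the released bit is kept with probability `e^alpha / (1 + e^alpha)`. All function names are hypothetical.

```python
import numpy as np

def keep_probability(alpha):
    """Probability of releasing the binarized bit unflipped (assumed form)."""
    return np.exp(alpha) / (1.0 + np.exp(alpha))

def coordinate_release(phi, b, alpha, rng):
    """Binarize each coordinate, pick one uniformly, and release a noisy bit."""
    d = len(phi)
    z = rng.random(d) < (phi + b) / (2.0 * b)  # Bernoulli binarization
    j = rng.integers(d)                         # uniform coordinate choice
    keep = rng.random() < keep_probability(alpha)
    o = int(z[j]) if keep else 1 - int(z[j])
    return j, o

def observation_function(j, o, d, b, alpha):
    """Unbiased reconstruction of phi from the released pair (j, o)."""
    q = keep_probability(alpha)
    z_hat = (o - (1.0 - q)) / (2.0 * q - 1.0)   # undo the random flip
    beta = np.zeros(d)
    beta[j] = d * b * (2.0 * z_hat - 1.0)       # undo binarization and sampling
    return beta
```

Under these assumptions the released bit has likelihood ratio at most `e^alpha` across inputs, and averaging `observation_function` over the three sources of randomness returns `phi` exactly, which is the unbiasedness property the estimator needs.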
4 Learning with lightweight annotations
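Table 1: Example of a region annotation. Brackets mark the annotated region.
Sentence:     The  quick  brown  [fox  jumps  over  the  lazy  dog]
Tags:         DT   JJ     JJ     [NN   VBZ    IN    DT   JJ    NN]
Observation:  # NN = 2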
For a sequence labeling task, e.g., part-of-speech (POS) tagging, it can be tedious to obtain fully-labeled input-output sequences for training. This motivates a line of work which attempts to learn from weaker forms of annotation (Mann & McCallum, 2008; Haghighi & Klein, 2006; Liang et al., 2009). We focus on region annotations, where an annotator is asked to examine only a particular subsequence of the input and count the number of occurrences of some label (e.g., nouns). The rationale is that it is cognitively easier for the annotator to focus on one label at a time rather than annotating from a large tag set, and physically easier to hit a single yes/no or counter button than to select precise locations, especially in mobile-based crowdsourcing interfaces (Vaish et al., 2014). See Table 1 for an example.
More formally, the supervision is defined as follows: First, choose the starting position uniformly from , and set the ending position , where is a fixed window size. Let denote this region. Next, choose a subset of tags uniformly from the tag set (e.g., ). From here, the observation is generated deterministically: For each tag , the annotator counts the number of occurrences in the region: . The final observation is .
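The generative process for a region annotation can be sketched as follows. The function names, the half-open `(start, end)` convention, and the `num_query_tags` parameter (how many tags are queried per region) are assumptions made for illustration.

```python
import random

def count_tags(tags, start, end, query_tags):
    """Deterministic part: count each queried tag inside the region."""
    region = tags[start:end]
    return {t: region.count(t) for t in query_tags}

def sample_region_annotation(tags, window, tag_set, num_query_tags, rng):
    """Random part: pick a window start and a subset of tags uniformly."""
    start = rng.randrange(len(tags) - window + 1)
    end = start + window  # half-open region [start, end)
    query_tags = rng.sample(sorted(tag_set), num_query_tags)
    return (start, end), count_tags(tags, start, end, query_tags)
```

On the sentence of Table 1, `count_tags(tags, 3, 9, ["NN"])` counts the nouns in the bracketed region and yields a count of 2 for `NN`.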
In this setting, not only is the marginal likelihood nonconvex, but inference also requires summing over the possible ways of realizing the counts, which is exponential in either the window size or .
Estimation.
For our estimator to work, we make two assumptions:

The node potentials only depend on : ; and

Under the true conditional distribution, only depends on : .
These are admittedly strong independence assumptions similar to IBM model 1 for word alignment (Brown et al., 1993) or the unordered translation model of Steinhardt & Liang (2015). Even though our model is fully factorized and lacks edge potentials, inference is expensive as conditioning on the indirect supervision couples all of the variables. This typically calls for approximate inference techniques common to the realm of structured prediction. Steinhardt & Liang (2015) developed a relaxation to cope with this supervision, but this still requires approximate inference via sampling and nonconvex optimization.
In contrast to the local privacy examples, the new challenge is that the observation does not provide enough information to evaluate a single node potential, even stochastically. So we cannot directly write in terms of functions of the observations. As a bridge, define the localized conditional distributions: , which by assumption 2 specify the entire conditional distribution. The sufficient statistics can be written in terms of :
(10) 
We now define constraints that relate the observations to . Recall that each observation includes a region , a tag , and a vector of counts , one for each tag . For each input and tag , we have the identity:
(11) 
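One way to read this identity is as a linear system: under assumption 2, the expected count of a tag in a region is the sum, over positions in the region, of the probability that the word at that position takes the tag. The sketch below recovers these localized conditional distributions by ordinary least squares. This is an assumption-laden illustration: the least-squares solver stands in for whatever solver is used in practice, words are represented by integer ids, counts for all tags are observed, and the function name is hypothetical.

```python
import numpy as np

def estimate_localized_conditionals(regions, counts, vocab_size):
    """Solve N @ W = counts in the least-squares sense.

    regions: list of lists of word ids, one list per annotated region.
    counts:  array of shape (num_regions, num_tags) with observed tag counts.
    Returns W of shape (vocab_size, num_tags), where W[a, t] estimates the
    probability of tag t given word a.
    """
    N = np.zeros((len(regions), vocab_size))
    for i, words in enumerate(regions):
        for a in words:
            N[i, a] += 1.0  # number of times word a occurs in region i
    W, *_ = np.linalg.lstsq(N, counts, rcond=None)
    return W
```

With noisy counts the solution need not be a valid probability table, so a practical version would likely add nonnegativity and normalization constraints.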
5 Asymptotic analysis
We have two estimators: maximum marginal likelihood (), which is difficult to compute, requiring nonconvex optimization and possibly intractable inference; and our moments-based estimator (), which is easy to compute, requiring only solving a linear system and convex optimization. In this section, we study and compare the statistical efficiency of and . For simplicity, we focus on the unconditional setting where is empty, and omit in this section. We also assume our exponential family model is well-specified and that are the true parameters. All expectations are taken with respect to .
Recall from (3) that . We can therefore think of as a “best guess” of . The following proposition provides the asymptotic variances of the estimators:
Proposition 2 (General asymptotic variances).
Let be the Fisher information matrix. Then
where the asymptotic variances are
(13)  
(14) 
Let us compare the asymptotic variances of and to that of the fully-supervised maximum likelihood estimator , which has access to , and satisfies .
Examining the asymptotic variance of (13), we see that the loss in statistical efficiency with respect to maximum likelihood is the amount of variation in not explained by , . Consequently, if is simply deterministic given , then , and achieves the statistically efficient asymptotic variance .
The story with is dual to that for the marginal likelihood estimator. Considering the second term in expression (14), we see that the loss of efficiency due to our observation model grows linearly in the variability of the observations not explained by . Thus, unlike , even if is deterministic given (so reveals full information about ), we do not recover the efficient covariance . As a trivial example, let and the observation for , so that contains a faithful copy of , and let . Then , and the asymptotic relative efficiency of to is . Roughly, integrates posterior information about better than does.
Proof.
To compute , we follow standard arguments in van der Vaart (1998). If is the marginal log-likelihood, then a straightforward calculation yields . The asymptotic variance is the inverse of ; applying the variance decomposition gives (13).
To compute , recall that the moments-based estimator computes and . Apply the delta method, where . Finally, decompose and recognize that to obtain (14). ∎
Randomized response.
To obtain concrete intuition for Proposition 2, we specialize to the case where is the randomized response (5). In this setting, for some constant vector . Recall the supervision model: , if and if .
Lemma 1 (Asymptotic variances under randomized response).
Under the randomized response model of (5), the asymptotic variance of is
(15)  
The matrix governs the loss of efficiency, which stems from two sources: (i) , the variance when we sample ; and (ii) the variance in choosing between and . If and have the same distribution, then and .
Proof.
We decompose as
where we used . ∎
[Figure: panels (a) and (b).]
An empirical plot.
The Hájek–Le Cam convolution and local asymptotic minimax theorems give that is the most statistically efficient estimator. We now empirically study the efficiency of relative to , where , the average of the relative variances per coordinate of to . We continue to focus on randomized response in the unconditional case.
To study the effect of , we consider the following probability model: we let , define
and set . We set and to represent weak and strong signals (the latter is harder to estimate, as the Fisher information matrix is much smaller); when , the asymptotic variances are equal, . In Figure 3, we see that the asymptotic efficiency of relative to decreases as , which is explained by the fact that—as we see in expression (13)—the estimator leverages the prior information about based on , while as , expression (15) is dominated by the term, where is uniform. Moreover, as grows larger, the conditional covariance is much smaller than the covariance , so that we expect that .
6 The geometry of two-step estimation
We now provide some geometric intuition about the differences between and , establishing a connection between and the EM algorithm as a byproduct of our discussion. For concreteness, let be a finite set and let be the set of all distributions over (represented as dimensional vectors). Let be a natural exponential family over with . See Figure 4 for an example where and . Note that in the space of distributions, is a nonconvex set.
Let be the set of observations. We can represent the supervision function as a matrix . For , we can express the marginal distribution over as . Let be the empirical distribution over observations.
The maximum marginal likelihood estimator can now be written succinctly as:
(16) 
While the KL-divergence is convex in , the nonconvex constraint set makes the problem difficult.
Our momentbased estimator can be viewed as a relaxation, where we first optimize over a relaxed set and then project onto the exponential family:
(17) 
The first step can be computed directly via if is the squared Euclidean distance. If is KL-divergence, we can use EM (see the composite likelihood objective of Chaganty & Liang (2014)), which converges to the global optimum. The result is a single distribution over . The second step optimizes over via , which is a convex optimization problem, resulting in corresponding to .
Computing generally requires solving a nonconvex optimization problem (see Figure 5 for an example). When has full column rank and the model is well-specified, is consistent: we have that . This means that eventually the KL projection of problem (17) is essentially an identity operation: we almost have by the rank assumptions, making the problem easy. This assumption strongly depends on the well-specifiedness of the supervision; indeed, if for any , then , for a constant , even as . We can relax the column rank assumption, however: simply needs to contain enough information about the sufficient statistics, that is, if is the matrix of sufficient statistics, we require that for some matrix .
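For the squared Euclidean distance, the first step amounts to a least-squares solve, which can be sketched with a pseudoinverse. The names and array layout below are assumptions for illustration: `S` has one row per observation and one column per output, `o_hat` is the empirical distribution over observations, and `Phi` stacks the sufficient statistics of each output as rows.

```python
import numpy as np

def relaxed_distribution(S, o_hat):
    """Least-squares solution of S @ mu = o_hat over unconstrained vectors mu."""
    return np.linalg.pinv(S) @ o_hat

def sufficient_statistics(mu, Phi):
    """Moments implied by a (possibly improper) distribution mu over outputs."""
    return Phi.T @ mu
```

With finite samples the vector returned by `relaxed_distribution` can have negative entries; only its moments are passed on to the convex second step.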
Deterministic supervision.
When the supervision matrix has full column rank, converges to . There are certainly cases where is consistent, but is not. What can we say about in this case?
To obtain intuition, consider the case where the supervision is a deterministic function that maps to (region annotations are an example). In this case, every column of is an indicator vector, and .
Here, distributes probability mass evenly across all the that deterministically map to . In this case, simply corresponds to running one iteration of EM on the marginal likelihood, initializing with the uniform distribution over (). The E-step conditions on and places uniform mass over consistent , producing ; the M-step optimizes based on .
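The even spreading of mass can be checked numerically. In the sketch below, `obs_of_y[y]` gives the observation that output `y` maps to; this encoding and the function names are assumptions for illustration, and every observation is assumed to have at least one consistent output.

```python
import numpy as np

def deterministic_supervision_matrix(obs_of_y, num_obs):
    """Matrix whose column y is the indicator vector of the observation of y."""
    S = np.zeros((num_obs, len(obs_of_y)))
    for y, o in enumerate(obs_of_y):
        S[o, y] = 1.0
    return S

def uniform_e_step(S, o_hat):
    """Split the mass of each observation evenly among its consistent outputs."""
    sizes = S.sum(axis=1)  # number of outputs mapping to each observation
    return S.T @ (o_hat / sizes)
```

For such an indicator matrix, `np.linalg.pinv(S) @ o_hat` and `uniform_e_step(S, o_hat)` agree, which is the sense in which the pseudoinverse step matches an E-step started from the uniform distribution.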
7 Experiments
Local privacy.
Following Section 3, we consider locally private estimation of a structured model. We take linear regression as a simple such structured model: it corresponds to a pairwise random field over the inputs and the response. The sufficient statistics are edge features and for each .
On the housing dataset, the supervision is given under the per-value RR scheme. On the songs dataset, is a dense vector, motivating the coordinate release scheme instead. We choose at random, with probability reveal and with probability reveal with suitable noise as described in Section 3.2. Note that the noising mechanism privatizes both input variables as well as the response .
Figure 6 visualizes the average (over 10 trials) $R^2$ coefficient of fit for linear regression on the test set, in response to varying the privacy parameter . [Footnote 2: The (uncentered) $R^2$ coefficient of parameters in a linear regression with design and labels is .] [Footnote 3: We use the housing (mlcomp.org/datasets/840) and songs (mlcomp.org/datasets/748) data from mlcomp.] As expected, the efficiency degrades with increase in the privacy constraint, though for moderate values of the loss is not significant.
Lightweight annotations.
We experiment with estimating a conditional model for part-of-speech (POS) sequence tagging from lightweight annotations. [Footnote 4: We used the Wall Street Journal portion of the Penn Treebank. Sections 0–21 comprise the training set and 22–24 are test.] Every example in the dataset reveals a sentence and the counts of all tags over a consecutive window. Following the modeling assumptions in Section 4, we use a CRF (per Section 2) with only node features:
where is a function on the word (e.g., word, prefix, suffix and word signature).
When the problem is fully supervised, we maximize the log-likelihood with stochastic gradient descent (SGD); in this case, estimation is convex and exact gradients can be tractably computed. Under count supervision, convexity of the marginal likelihood is not guaranteed. Although the model has no edge features, the indirect count supervision places an potential over the region in which counts are revealed (one enforcing that the tag sequence is compatible with the counts). This renders exact inference intractable, so we approximate it using beam search to compute stochastic gradients. [Footnote 5: The dataset has 45 tag values. We use a beam of size 500 after analytically marginalizing nodes outside the region.] The moment-based estimator is unaffected by this issue as it requires no inference and proceeds via a pair of convex minimization programs; we minimize both using SGD.
Figure 7 shows train and test accuracies as we make passes over the dataset. Typically, after sufficiently many passes, the marginal likelihood gains an advantage over the moment-based estimator. For small regions, we expect the beam search approximation to be accurate, and indeed the marginal likelihood estimator is dominant there. For larger regions, the moment-based estimator (i) achieves high accuracy early and (ii) dominates for several passes before the marginal likelihood estimator overtakes it. Altogether, the experiment highlights that the moment-based estimator is favorable in computationally-constrained settings.
8 Related work and discussion
This work was motivated by two use cases of indirect supervision: local privacy and cheap annotations. Each trades off statistical accuracy for another resource: privacy or annotation cost. Local privacy has its roots in classical randomized response for conducting surveys (Warner, 1965), which has been extended to the multivariate (Tamhane, 1981) and conditional (Matloff, 1984) settings. In the computer science community, differential privacy has emerged as a useful formalization of privacy (Dwork, 2006). We work with the stronger notion of local differential privacy (Evfimievski et al., 2004; Kasiviswanathan et al., 2011; Duchi et al., 2013). Our contribution here is twofold: First, we bring local privacy to the graphical model setting, which provides an opportunity for the privacy mechanism to be sensitive to the model structure. While we believe our mechanisms are reasonable, an open question is designing optimal mechanisms in the structured case. Second, we connect privacy with other forms of indirect supervision.
The second use case is learning from lightweight annotations, which has taken many forms in the literature. Multi-instance learning (Oded & Tomás, 1998) is popular in computer vision, where it is natural to label the presence but not location of objects (Babenko et al., 2009). In natural language processing, there has also been work on learning from structured outputs where, like this work, only counts of labels are observed (Mann & McCallum, 2008; Liang et al., 2009). However, these works resort to likelihood-based approaches which involve nonconvex optimization and approximate inference, whereas in this work, we show that linear algebra and convex optimization suffice under modeling assumptions.
Quadrianto et al. (2008) showed how to learn from label proportions of groups of examples, using a linear system technique similar to ours. However, they assume that the group is conditionally independent of the example given the label, which would not apply in our region-based annotation setup since our regions contain arbitrarily correlated inputs and heterogeneous labels. In return, we do need to make the stronger assumption that each label depends only on a discrete , so that the credit assignment can be done using a linear program. An open challenge is to allow for heterogeneity with complex inputs.
Indirect supervision arises more generally in latent-variable models, which arise in machine translation (Brown et al., 1993), semantic parsing (Liang et al., 2011), object detection (Quattoni et al., 2004), and other missing data problems in statistics (M & Naisyin, 2000). The indirect supervision problems in this paper have additional structure: we have an unknown model and a known supervision function . It is this structure that allows us to obtain computationally efficient method of moments procedures.
We started this work to see how much juice we could squeeze out of just linear moment equations, and the answer is more than we expected. Of course, for more general latent-variable models beyond linearly indirectly-supervised problems, we would need more powerful tools. In recent years, tensor factorization techniques have provided efficient methods for a wide class of latent-variable models (Hsu et al., 2012; Anandkumar et al., 2012; Hsu & Kakade, 2013; Anandkumar et al., 2013; Chaganty & Liang, 2013; Halpern & Sontag, 2013; Chaganty & Liang, 2014). One can leverage even more general polynomial-solving techniques to expand the set of models (Wang et al., 2015). In general, the method of moments allows us to leverage statistical structure to alleviate computational intractability, and we anticipate more future developments along these lines.
Reproducibility.
The code, data and experiments for this paper are available on Codalab at https://worksheets.codalab.org/worksheets/0x6a264a96efea41158847eef9ec2f76bc/.
References
 Anandkumar et al. (2012) Anandkumar, A., Foster, D. P., Hsu, D., Kakade, S. M., and Liu, Y. Two SVDs suffice: Spectral decompositions for probabilistic topic modeling and latent Dirichlet allocation. In Advances in Neural Information Processing Systems (NIPS), 2012.
 Anandkumar et al. (2013) Anandkumar, A., Ge, R., Hsu, D., Kakade, S. M., and Telgarsky, M. Tensor decompositions for learning latent variable models. arXiv, 2013.
 Babenko et al. (2009) Babenko, B., Yang, M., and Belongie, S. Visual tracking with online multiple instance learning. In Computer Vision and Pattern Recognition (CVPR), pp. 983–990, 2009.
 Brown et al. (1993) Brown, P. F., Pietra, S. A. D., Pietra, V. J. D., and Mercer, R. L. The mathematics of statistical machine translation: Parameter estimation. Computational Linguistics, 19:263–311, 1993.
 Chaganty & Liang (2013) Chaganty, A. and Liang, P. Spectral experts for estimating mixtures of linear regressions. In International Conference on Machine Learning (ICML), 2013.
 Chaganty & Liang (2014) Chaganty, A. and Liang, P. Estimating latent-variable graphical models using moments and likelihoods. In International Conference on Machine Learning (ICML), 2014.
 Chang et al. (2007) Chang, M., Ratinov, L., and Roth, D. Guiding semi-supervision with constraint-driven learning. In Association for Computational Linguistics (ACL), pp. 280–287, 2007.
 Dempster et al. (1977) Dempster, A. P., Laird, N. M., and Rubin, D. B. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society: Series B, 39(1):1–38, 1977.
 Duchi et al. (2013) Duchi, J. C., Jordan, M. I., and Wainwright, M. J. Local privacy and statistical minimax rates. In Foundations of Computer Science (FOCS), 2013.
 Dwork (2006) Dwork, C. Differential privacy. In Automata, languages and programming, pp. 1–12, 2006.
 Dwork et al. (2006) Dwork, C., McSherry, F., Nissim, K., and Smith, A. Calibrating noise to sensitivity in private data analysis. In Proceedings of the 3rd Theory of Cryptography Conference, pp. 265–284, 2006.
 Evfimievski et al. (2004) Evfimievski, A., Srikant, R., Agrawal, R., and Gehrke, J. Privacy preserving mining of association rules. Information Systems, 29(4):343–364, 2004.
 Graça et al. (2008) Graça, J., Ganchev, K., and Taskar, B. Expectation maximization and posterior constraints. In Advances in Neural Information Processing Systems (NIPS), pp. 569–576, 2008.
 Haghighi & Klein (2006) Haghighi, A. and Klein, D. Prototype-driven learning for sequence models. In North American Association for Computational Linguistics (NAACL), pp. 320–327, 2006.
 Halpern & Sontag (2013) Halpern, Y. and Sontag, D. Unsupervised learning of noisy-or Bayesian networks. In Uncertainty in Artificial Intelligence (UAI), 2013.
 Hsu & Kakade (2013) Hsu, D. and Kakade, S. M. Learning mixtures of spherical Gaussians: Moment methods and spectral decompositions. In Innovations in Theoretical Computer Science (ITCS), 2013.
 Hsu et al. (2009) Hsu, D., Kakade, S. M., and Zhang, T. A spectral algorithm for learning hidden Markov models. In Conference on Learning Theory (COLT), 2009.
 Hsu et al. (2012) Hsu, D., Kakade, S. M., and Liang, P. Identifiability and unmixing of latent parse trees. In Advances in Neural Information Processing Systems (NIPS), 2012.
 Kasiviswanathan et al. (2011) Kasiviswanathan, S. P., Lee, H. K., Nissim, K., Raskhodnikova, S., and Smith, A. What can we learn privately? SIAM Journal on Computing, 40(3):793–826, 2011.
 Lafferty et al. (2001) Lafferty, J., McCallum, A., and Pereira, F. Conditional random fields: Probabilistic models for segmenting and labeling data. In International Conference on Machine Learning (ICML), pp. 282–289, 2001.
 Liang et al. (2009) Liang, P., Jordan, M. I., and Klein, D. Learning from measurements in exponential families. In International Conference on Machine Learning (ICML), 2009.
 Liang et al. (2011) Liang, P., Jordan, M. I., and Klein, D. Learning dependency-based compositional semantics. In Association for Computational Linguistics (ACL), pp. 590–599, 2011.
 M & Naisyin (2000) Robins, J. M. and Wang, N. Inference for imputation estimators. Biometrika, 87(1):113–124, 2000.
 Mann & McCallum (2008) Mann, G. and McCallum, A. Generalized expectation criteria for semi-supervised learning of conditional random fields. In Human Language Technology and Association for Computational Linguistics (HLT/ACL), pp. 870–878, 2008.
 Matloff (1984) Matloff, N. S. Use of covariates in randomized response settings. Statistics & Probability Letters, 2(1):31–34, 1984.
 Oded & Tomás (1998) Maron, O. and Lozano-Pérez, T. A framework for multiple-instance learning. In Advances in Neural Information Processing Systems (NIPS), pp. 570–576, 1998.
 Pearson (1894) Pearson, K. Contributions to the mathematical theory of evolution. Philosophical Transactions of the Royal Society of London. A, 185:71–110, 1894.
 Quadrianto et al. (2008) Quadrianto, N., Smola, A. J., Caetano, T. S., and Le, Q. V. Estimating labels from label proportions. In International Conference on Machine Learning (ICML), pp. 776–783, 2008.
 Quattoni et al. (2004) Quattoni, A., Collins, M., and Darrell, T. Conditional random fields for object recognition. In Advances in Neural Information Processing Systems (NIPS), 2004.
 Steinhardt & Liang (2015) Steinhardt, J. and Liang, P. Learning with relaxed supervision. In Advances in Neural Information Processing Systems (NIPS), 2015.
 Tamhane (1981) Tamhane, A. C. Randomized response techniques for multiple sensitive attributes. Journal of the American Statistical Association (JASA), 76(376):916–923, 1981.
 Vaish et al. (2014) Vaish, R., Wyngarden, K., Chen, J., Cheung, B., and Bernstein, M. S. Twitch crowdsourcing: crowd contributions in short bursts of time. In Conference on Human Factors in Computing Systems (CHI), pp. 3645–3654, 2014.
 van der Vaart (1998) van der Vaart, A. W. Asymptotic statistics. Cambridge University Press, 1998.
 Wang et al. (2015) Wang, S., Chaganty, A., and Liang, P. Estimating mixture models via mixture of polynomials. In Advances in Neural Information Processing Systems (NIPS), 2015.
 Warner (1965) Warner, S. L. Randomized response: A survey technique for eliminating evasive answer bias. Journal of the American Statistical Association (JASA), 60(309):63–69, 1965.
Appendix A Details of privacy schemes
A.1 Local privacy using sufficient statistics
A.2 Privacy guarantees of proposed schemes
In order to show differential privacy of the two schemes proposed in Section 3, we first note that it suffices to have differential privacy of the observations with respect to any (possibly random) data processed given the private variable such that forms a Markov chain.
To see this, suppose is an differentially private channel taking the intermediate variable to and fix any . Let be the distribution of given . Now, for the end-to-end channel ,
(19)  
(20)  
(21) 
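The chain of inequalities behind this argument can be sketched as follows. The symbols are our own assumptions for illustration: $y$ and $y'$ are two private values, $z$ is the intermediate variable, $Q(o \mid z)$ is the $\alpha$-differentially private channel, $\mu_y$ is the distribution of $z$ given $y$, and $P(o \mid y)$ is the end-to-end channel. The first equality uses the Markov chain property, and the middle inequality uses the privacy of $Q$.

```latex
\begin{align*}
P(o \mid y)
  &= \int Q(o \mid z)\, d\mu_{y}(z)
   \le \sup_{z} Q(o \mid z) \\
  &\le e^{\alpha} \inf_{z'} Q(o \mid z')
   \le e^{\alpha} \int Q(o \mid z')\, d\mu_{y'}(z')
   = e^{\alpha} P(o \mid y').
\end{align*}
```

Since $y$ and $y'$ are arbitrary, the end-to-end channel inherits the privacy level of $Q$.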
Coordinate release.
Recall that in the coordinate release mechanism, we first pick a coordinate and release observation after flipping with probability .
(22)  
(23) 
where the final step is by the triangle inequality applied twice.
Per-value RR.
Privacy of per-value RR follows similarly.
Each coordinate of is flipped with probability , where is chosen such that (see Section 3.2)
(24)  
(25) 
A.3 Variance of moments-based estimator for different privacy schemes.
For simplicity, we once again consider the unconditional case (where is empty) and assume .
Theorem 1 (Asymptotic variance (coordinate release)).
The asymptotic variance of for the differentially private coordinate release scheme, under a uniform coordinate sampling distribution, is
where
(26) 
As in Lemma 1, the matrix governs the loss in efficiency under the coordinate release mechanism, which arises from two sources: (i) variance due to the stochastic flipping process and (ii) variance due to choosing a random coordinate for release.
Proof.
We decompose as