Estimation from Indirect Supervision with Linear Moments

Aditi Raghunathan    Roy Frostig    John Duchi    Percy Liang
Abstract

In structured prediction problems where we have indirect supervision of the output, maximum marginal likelihood faces two computational obstacles: non-convexity of the objective and intractability of even a single gradient computation. In this paper, we bypass both obstacles for a class of what we call linear indirectly-supervised problems. Our approach is simple: we solve a linear system to estimate sufficient statistics of the model, which we then use to estimate parameters via convex optimization. We analyze the statistical properties of our approach and show empirically that it is effective in two settings: learning with local privacy constraints and learning from low-cost count-based annotations. [1]

[1] This is an extended and updated version of our paper in Proceedings of the International Conference on Machine Learning, New York, NY, USA, 2016. JMLR: W&CP volume 48. Copyright 2016 by the author(s).

estimation, indirect supervision, method of moments, graphical models, latent variable models, structured prediction, local privacy
Stanford University, Stanford, CA


1 Introduction

Consider the problem of estimating a probabilistic graphical model from indirect supervision, where only a partial view of the variables is available. We are interested in indirect supervision for two reasons: first, one might not trust a data collector and wish to use privacy mechanisms to reveal only partial information about sensitive data (Warner, 1965; Evfimievski et al., 2004; Dwork et al., 2006; Duchi et al., 2013). Second, if data is generated by human annotators, say in a crowdsourcing setting, it can often be more cost-effective to solicit lightweight annotations (Oded & Tomás, 1998; Mann & McCallum, 2008; Quadrianto et al., 2008; Liang et al., 2009). In both cases, we trade statistical efficiency for another resource: privacy or annotator cost.

Indirect supervision is naturally handled by defining a latent-variable model where the structure of interest is treated as a latent variable. While statistically sensible, learning latent-variable models poses two computational challenges. First, maximum marginal likelihood requires non-convex optimization, where popular procedures such as gradient descent or Expectation Maximization (EM) (Dempster et al., 1977) are only guaranteed to converge to local optima. Second, even the computation of the gradient or performing the E-step can be intractable, requiring probabilistic inference on a loopy graphical model induced by the indirect supervision (Chang et al., 2007; Graça et al., 2008; Liang et al., 2009).

In this paper, we propose an approach that bypasses both computational obstacles for a class which we call linear indirectly-supervised learning problems. We lean on the method of moments (Pearson, 1894), which has recently led to advances in learning latent-variable models (Hsu et al., 2009; Anandkumar et al., 2013; Chaganty & Liang, 2014), although we do not appeal to tensor factorization. Instead, we express indirect supervision as a linear combination of the sufficient statistics of the model, which we recover by solving a simple noisy linear system. Once we have the sufficient statistics, we use convex optimization to solve for the model parameters. The key is that while supervision per example is indirect and leads to intractability, aggregation over a large number of examples renders the problem tractable.

While our moments-based estimator yields computational benefits, we suffer some statistical loss relative to maximum marginal likelihood. In Section 5, we compare the asymptotic variance of marginal-likelihood and moment-based estimators, and provide some geometric insight into their differences in Section 6. Finally, in Section 7, we apply our framework empirically to our two motivating settings: (i) learning a regression model under local privacy constraints, and (ii) learning a part-of-speech tagger with lightweight annotations. In both applications, we show that our moments-based estimator obtains good accuracies.

2 Setup

Figure 1: We solve a structured prediction problem from $x$ to $y$. During training, we observe not $y$, but indirect supervision $o$.

Notation.

We use superscripts to enumerate instances in a data sample (e.g. ), and square-bracket indexing to enumerate components of a vector or sequence: denotes the component(s) of associated with . For a real vector , we let .

Model.

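Consider the structured prediction task of mapping an input $x \in \mathcal{X}$ to some output $y \in \mathcal{Y}$. We model this mapping using a conditional exponential family

$p_\theta(y \mid x) = \exp\{\theta^\top \phi(x, y) - A(\theta; x)\}$,   (1)

where $\phi : \mathcal{X} \times \mathcal{Y} \to \mathbb{R}^d$ is the feature mapping, $\theta \in \mathbb{R}^d$ is the parameter vector, and $A(\theta; x) = \log \sum_{y \in \mathcal{Y}} \exp\{\theta^\top \phi(x, y)\}$ is the log-partition function. For concreteness, we specialize to conditional random fields (CRFs) (Lafferty et al., 2001) over collections of $K$-variate labels, where $y = (y_1, \dots, y_L)$ and each $y_i \in \{1, \dots, K\}$; here $L$ is the number of variables and $K$ is the number of possible labels per variable. We let $\mathcal{C}$ be the set of cliques in the CRF, so that the features decompose into a sum over cliques: $\phi(x, y) = \sum_{c \in \mathcal{C}} \phi_c(x, y_c)$. As one particular example, if $\mathcal{C}$ consists of all nodes $\{i\}$ and edges between adjacent nodes $\{i, i+1\}$, the CRF is chain-structured.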

Learning from indirect supervision.

In the indirectly supervised setting that is the focus of this paper, we do not have access to but rather only observations , where is drawn from a known supervision distribution .

For each , let be drawn from some unknown data-generating distribution (by default, we do not assume the CRF is well-specified), and is drawn according to as in Figure 1. The learning problem is then the natural one: given the training examples , we wish to produce an estimate for the model (1).

Maximum marginal likelihood.

The classic paradigm is to maximize the marginal likelihood:

(2)

where denotes an expectation over the training sample. While is often statistically efficient, there are two computational difficulties associated with this approach:

  1. The log-likelihood objective (2) is typically non-convex, so computing exactly is in general intractable; see Section 6 for a more detailed discussion. Local algorithms like Expectation Maximization (Dempster et al., 1977) are only guaranteed to converge to local optima.

  2. Computing the gradient or the E-step requires computing , which is intractable, not due to the model , but to the supervision . This motivates a number of relaxations (Graça et al., 2008; Liang et al., 2009; Steinhardt & Liang, 2015), but there are no guarantees on approximation quality.

Our approach: moment-based estimation.

Figure 2: Our approach is to (i) solve a linear system based on the data to estimate the sufficient statistics $\hat{\mu}$, then (ii) use convex optimization to estimate the model parameters $\hat{\theta}$.

We present a simple approach to circumvent the above issues for a restricted class of models, in the same vein as Chaganty & Liang (2014). To begin, consider the fully-supervised setting, where we observe examples . In this case, we could maximize the likelihood , solving , where are the sufficient statistics, which converge to . Therefore, if we could construct a consistent estimate of , then we could solve the same convex optimization problem used in the fully-supervised estimator.

Of course, we do not have access to . Instead, in our (linearly) indirectly supervised setting, we are able to define an observation function which is nonetheless in expectation equal to the population sufficient statistics:

(3)

In general, we construct by solving a linear system. Putting the pieces together yields our estimator (Figure 2):

  1. Sufficient statistics: .

  2. Parameters: .
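Step (ii) can be illustrated with a minimal sketch. It assumes an unconditional model over a small finite output set, so that the log-partition function is a plain log-sum-exp, and it takes the estimated sufficient statistics as given. The names `fit_from_moments`, `model_probs`, and the toy feature matrix are hypothetical and not taken from the paper; gradient descent is used only because it is the simplest convex solver to write down.

```python
import numpy as np

def model_probs(theta, phi):
    """p_theta(y) proportional to exp(theta . phi[y]) over a finite output set."""
    scores = phi @ theta
    p = np.exp(scores - scores.max())  # subtract the max for numerical stability
    return p / p.sum()

def fit_from_moments(mu_hat, phi, steps=5000, lr=1.0):
    """Minimize the convex objective A(theta) - theta . mu_hat.

    The gradient is E_theta[phi(y)] - mu_hat, so gradient descent
    matches the model's expected statistics to mu_hat.
    """
    theta = np.zeros(phi.shape[1])
    for _ in range(steps):
        grad = model_probs(theta, phi) @ phi - mu_hat
        theta -= lr * grad
    return theta

# Toy family (assumed): 4 outputs, one-hot features for the first 3.
phi = np.vstack([np.eye(3), np.zeros((1, 3))])
theta_star = np.array([1.0, -0.5, 0.3])

# Stand-in for step (i): here we use the exact population statistics.
mu_hat = model_probs(theta_star, phi) @ phi
theta_hat = fit_from_moments(mu_hat, phi)
```

With exact statistics the fit recovers `theta_star`; with statistics estimated from indirect observations, the same routine would be run on the noisy estimate instead.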

In the next two sections, we describe the observation function for learning with local privacy (Section 3) and lightweight annotations (Section 4).

3 Learning under local privacy

Suppose we wish to estimate a conditional distribution , where is non-sensitive information about an individual and contains sensitive information (for example, income or disease status). Individuals, because of a variety of reasons—mistrust, embarrassment, fear of discrimination—may wish to keep private and instead release some . To quantify the amount of privacy afforded by , we turn to the literature on privacy in databases and theoretical computer science (Evfimievski et al., 2004; Dwork et al., 2006) and say that is -differentially private if any two have comparable probability (up to a factor of ) of generating :

(4)

What mechanism should we employ? We first explore the classical randomized response (RR) mechanism (Section 3.1), and then develop a new mechanism that leverages the graphical model structure (Section 3.2).

3.1 Classic randomized response

Warner (1965) proposed the now-classical randomized response technique, which proceeds as follows: For some fixed (generally small) , the respondent reveals with probability and with probability draws a sample from a (known) base distribution —generally uniform—over . Formally, the classical randomized response supervision is

(5)

Estimation.

Our goal is to construct a function satisfying (3). Towards that end, let us start with what we can estimate and expand based on (5):

(6)

where . Rearranging (6), we see that we can solve for . Indeed, if we define the observation function:

(7)

we can verify that .
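The debiasing step can be checked numerically. The sketch below assumes the respondent reveals the true output with probability `eps` and otherwise samples from a base distribution `u`, and it writes the observation function as the observed features minus the base-distribution contribution, rescaled by `1 / eps`. The symbol names and the toy feature matrix are assumptions made for illustration.

```python
import numpy as np

def rr_supervision_matrix(eps, u):
    """S[o, y] = eps * 1[o == y] + (1 - eps) * u[o]."""
    k = len(u)
    return eps * np.eye(k) + (1.0 - eps) * np.outer(u, np.ones(k))

def rr_observation_function(phi, eps, u):
    """beta[o] = (phi[o] - (1 - eps) * E_u[phi]) / eps, one row per observation."""
    base_mean = u @ phi
    return (phi - (1.0 - eps) * base_mean) / eps

# Toy setup (assumed): 4 outputs with 2-dimensional features, uniform base distribution.
phi = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]])
u = np.full(4, 0.25)
eps = 0.2

S = rr_supervision_matrix(eps, u)
beta = rr_observation_function(phi, eps, u)

# Row y of cond_mean is E[beta(o) | y] = sum_o S[o, y] * beta[o].
cond_mean = S.T @ beta
```

Each row of `cond_mean` equals the corresponding row of `phi`, which is the unbiasedness property that (3) asks for.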

Privacy.

We can check that the ratio , so classical randomized response is -differentially private. For any distribution , this value is at least , a linear dependence on . In classical randomized response settings, , which is unproblematic. In contrast, in structured prediction settings, the number of labels is exponential in the number of variables (), so we must take . The asymptotic variance of scales as (as will be shown in Section 5), which makes classical randomized response unusable for structured prediction.
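The privacy calculation can be spelled out as follows. This is a sketch under assumed notation: $\epsilon$ is the probability of revealing the true output, $u$ is the base distribution, $S(o \mid y) = \epsilon\,\mathbb{1}[o = y] + (1-\epsilon)\,u(o)$, and $\alpha$ denotes the privacy level.

```latex
\frac{S(o \mid y)}{S(o \mid y')}
  \;\le\; \frac{\epsilon + (1-\epsilon)\,u(o)}{(1-\epsilon)\,u(o)}
  \;=\; 1 + \frac{\epsilon}{(1-\epsilon)\,u(o)}
  \;\le\; 1 + \frac{\epsilon}{(1-\epsilon)\,\min_{y} u(y)},
\qquad\text{so}\qquad
\alpha = \log\!\left(1 + \frac{\epsilon}{(1-\epsilon)\,\min_{y} u(y)}\right).
```

Because $\min_y u(y) \le 1/|\mathcal{Y}|$ for every distribution $u$, the ratio bound is at least $1 + \epsilon\,|\mathcal{Y}|/(1-\epsilon)$, which is the linear dependence on the number of outputs noted above. Holding the privacy level fixed therefore forces $\epsilon$ to shrink roughly like $1/|\mathcal{Y}|$.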

3.2 Structured randomized response

With this difficulty in mind, we recognize that we must somehow leverage the structure of the sufficient statistics vector to perform estimation. In particular, we show that the supervision should only depend on the sufficient statistics:

Proposition 1.

Let be the set in which observations live. For any privacy mechanism that is -differentially private, there exists a mechanism that is at least -differentially private, and for any set , we have

(8)

where and .

In short, we can always construct that only uses the sufficient statistics but yields the same joint distribution over the pairs . Furthermore, is at least as private as the original mechanism . See Appendix A.1 for a proof.

This motivates a focus exclusively on mechanisms that use sufficient statistics, and in particular, we consider the following two structured randomized response mechanisms. Our schemes are both two-phase procedures that first binarize the sufficient statistics, and then release a set of observations inspired by Duchi et al.’s minimax optimal approach to estimating a multinomial distribution. For , let . Assume each coordinate of the statistics lies in the interval for some positive scalar . For , draw as a Bernoulli variable with bias . Then:

(Coordinate release)    Draw a coordinate from a distribution . Set with probability , otherwise . Release the pair .

(Per-value -RR)    Denote by the support of given , let , and take any . For , set with probability , otherwise . Release the vector .

Both are -differentially private (see Appendix A.1). For coordinate release, define the observation function

where denotes the ’th standard basis vector. For the per-value statistics scheme, define the observation function,

(9)

In either case, we have that , as required by (3) for to be consistent.
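To make the coordinate release idea concrete, here is a minimal sketch. It is not the paper's exact mechanism: the interval `[0, b]`, the truthful-report probability `e^alpha / (1 + e^alpha)`, and all function names are assumptions, chosen to give a standard binary randomized response on a single coordinate.

```python
import numpy as np

def truthful_prob(alpha):
    """Probability q = e^alpha / (1 + e^alpha) of reporting the bit unflipped."""
    return np.exp(alpha) / (1.0 + np.exp(alpha))

def coordinate_release(stats, b, alpha, coord_probs, rng):
    """Release (j, z) for one statistics vector with entries in [0, b]."""
    q = truthful_prob(alpha)
    j = int(rng.choice(len(stats), p=coord_probs))  # chosen independently of the data
    bit = rng.random() < stats[j] / b               # binarize: Bernoulli(stats[j] / b)
    z = bit if rng.random() < q else not bit        # flip with probability 1 - q
    return j, int(z)

def coordinate_observation(j, z, b, alpha, coord_probs):
    """Debiased observation: zero except in coordinate j."""
    q = truthful_prob(alpha)
    beta = np.zeros(len(coord_probs))
    beta[j] = (b / coord_probs[j]) * (z - (1.0 - q)) / (2.0 * q - 1.0)
    return beta

def expected_observation(stats, b, alpha, coord_probs):
    """Exact E[beta | stats], enumerating every (j, z) outcome."""
    q = truthful_prob(alpha)
    total = np.zeros(len(stats))
    for j, pj in enumerate(coord_probs):
        p_one = (1.0 - q) + (2.0 * q - 1.0) * stats[j] / b  # P(z = 1 | j)
        for z, pz in ((1, p_one), (0, 1.0 - p_one)):
            total += pj * pz * coordinate_observation(j, z, b, alpha, coord_probs)
    return total

stats = np.array([0.0, 1.5, 3.0])
coord_probs = np.array([0.2, 0.3, 0.5])
mean_beta = expected_observation(stats, b=3.0, alpha=1.0, coord_probs=coord_probs)
j, z = coordinate_release(stats, 3.0, 1.0, coord_probs, np.random.default_rng(0))
```

In this sketch the released bit has probability between `1 - q` and `q` of being 1 regardless of the data, so the likelihood ratio between any two inputs is at most `e^alpha`, and `mean_beta` matches `stats` exactly.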

The two schemes offer a tradeoff: when is dense, coordinate release is advantageous, as our best norm bound may be as large as the dimension , so although we reveal only a single coordinate at a time, we noise it by a lower-variance distribution rather than the noise of the per-value scheme. Meanwhile, per-value -RR enjoys lower variance when has low norm. The latter case arises, for instance, if is a sparse binary vector as is common in structured prediction. Appendix A.3 and A.4 present more details about this tradeoff offered by the schemes.

Summarizing, we have three randomized response schemes. Classical RR appeals only in unstructured problems with few outputs . In the structured setting, we can move to the sufficient statistics by Proposition 1, and exploit their structure with either of two schemes based on our knowledge of the 1-norm or sparsity of statistics .

4 Learning with lightweight annotations

Sentence:  The  quick  brown  [fox  jumps  over  the  lazy  dog]
Tags:      DT   JJ     JJ     [NN   VBZ    IN    DT   JJ    NN]
Count:     # NN = 2
Table 1: Part-of-speech tagging with region annotations. An annotator is given a region (in brackets) and asked to count the number of times particular tags (e.g., NN) occur.

For a sequence labeling task, e.g., part-of-speech (POS) tagging, it can be tedious to obtain fully-labeled input-output sequences for training. This motivates a line of work which attempts to learn from weaker forms of annotation (Mann & McCallum, 2008; Haghighi & Klein, 2006; Liang et al., 2009). We focus on region annotations, where an annotator is asked to examine only a particular subsequence of the input and count the number of occurrences of some label (e.g., nouns). The rationale is that it is cognitively easier for the annotator to focus on one label at a time rather than annotating from a large tag set, and physically easier to hit a single yes/no or counter button than to select precise locations, especially in mobile-based crowdsourcing interfaces (Vaish et al., 2014). See Table 1 for an example.

More formally, the supervision is defined as follows: First, choose the starting position uniformly from , and set the ending position , where is a fixed window size. Let denote this region. Next, choose a subset of tags uniformly from the tag set (e.g., ). From here, the observation is generated deterministically: For each tag , the annotator counts the number of occurrences in the region: . The final observation is .

In this setting, not only is the marginal likelihood non-convex, but inference also requires summing over the possible ways of realizing the counts, a sum that is exponential in either the window size or the number of tags being counted.

Estimation.

For our estimator to work, we make two assumptions:

  1. The node potentials only depend on : ; and

  2. Under the true conditional distribution, only depends on : .

These are admittedly strong independence assumptions similar to IBM model 1 for word alignment (Brown et al., 1993) or the unordered translation model of Steinhardt & Liang (2015). Even though our model is fully factorized and lacks edge potentials, inference is expensive as conditioning on the indirect supervision couples all of the variables. This typically calls for approximate inference techniques common to the realm of structured prediction. Steinhardt & Liang (2015) developed a relaxation to cope with this supervision, but this still requires approximate inference via sampling and non-convex optimization.

In contrast to the local privacy examples, the new challenge is that the observation does not provide enough information to evaluate a single node potential, even stochastically. So we cannot directly write in terms of functions of the observations. As a bridge, define the localized conditional distributions: , which by assumption 2 specify the entire conditional distribution. The sufficient statistics can be written as in terms of :

(10)

We now define constraints that relate the observations to . Recall that each observation includes a region , a tag , and a vector of counts , one for each tag . For each input and tag , we have the identity:

(11)

While we do not observe the LHS, we observe the count itself, which is an unbiased estimate of the RHS of (11). We can therefore solve a regression problem with the counts as the response to recover a consistent estimate of the localized conditional distributions:

(12)

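For instance, the example in Table 1 contributes one regression example: the response is the count 2 for the tag NN, and the region consists of the bracketed words fox, jumps, over, the, lazy, dog. Finally, we plug the resulting estimate into (10) to obtain the estimated sufficient statistics.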
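The count regression can be sketched as an ordinary least-squares problem for a single tag. The sketch assumes discrete word ids, takes the unknowns to be the probabilities that each word type carries the tag, and uses expected counts as a stand-in for the large-sample limit. The function names and the toy vocabulary are hypothetical.

```python
import itertools
import numpy as np

def region_design_matrix(region_words, vocab_size):
    """Row r counts how often each word id appears inside region r."""
    design = np.zeros((len(region_words), vocab_size))
    for row, words in enumerate(region_words):
        for v in words:
            design[row, v] += 1.0
    return design

def estimate_local_conditionals(region_words, counts, vocab_size):
    """Least-squares estimate of w[v] = P(tag = b | word = v) for one fixed tag b."""
    design = region_design_matrix(region_words, vocab_size)
    w_hat, *_ = np.linalg.lstsq(design, np.asarray(counts, dtype=float), rcond=None)
    return w_hat

# Toy check (assumed): 3 word types, every window of 3 words seen once.
w_true = np.array([0.9, 0.2, 0.0])
regions = [list(r) for r in itertools.product(range(3), repeat=3)]
expected_counts = region_design_matrix(regions, 3) @ w_true
w_hat = estimate_local_conditionals(regions, expected_counts, 3)
```

With real annotations the integer counts replace `expected_counts`, and the estimate is noisy but still targets the same linear system.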

5 Asymptotic analysis

We have two estimators: maximum marginal likelihood ($\hat{\theta}_{\text{marg}}$), which is difficult to compute, requiring non-convex optimization and possibly intractable inference; and our moments-based estimator ($\hat{\theta}_{\text{mom}}$), which is easy to compute, requiring only the solution of a linear system and convex optimization. In this section, we study and compare the statistical efficiency of $\hat{\theta}_{\text{marg}}$ and $\hat{\theta}_{\text{mom}}$. For simplicity, we focus on the unconditional setting where $x$ is empty, and omit $x$ in this section. We also assume our exponential family model is well-specified and that $\theta^*$ are the true parameters. All expectations are taken with respect to the true distribution.

Recall from (3) that $\mathbb{E}[\beta(o) \mid y] = \phi(y)$. We can therefore think of $\beta(o)$ as a “best guess” of $\phi(y)$. The following proposition provides the asymptotic variances of the estimators:

Proposition 2 (General asymptotic variances).

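Let $I \triangleq \mathrm{Cov}[\phi(y)]$ be the Fisher information matrix. Then

$\sqrt{n}\,(\hat{\theta}_{\text{marg}} - \theta^*) \xrightarrow{d} \mathcal{N}(0, \Sigma_{\text{marg}})$ and $\sqrt{n}\,(\hat{\theta}_{\text{mom}} - \theta^*) \xrightarrow{d} \mathcal{N}(0, \Sigma_{\text{mom}})$,

where the asymptotic variances are

$\Sigma_{\text{marg}} = \left(I - \mathbb{E}[\mathrm{Cov}[\phi(y) \mid o]]\right)^{-1}$,   (13)
$\Sigma_{\text{mom}} = I^{-1} + I^{-1}\,\mathbb{E}[\mathrm{Cov}[\beta(o) \mid y]]\,I^{-1}$.   (14)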

Let us compare the asymptotic variances of and to that of the fully-supervised maximum likelihood estimator , which has access to , and satisfies .

Examining the asymptotic variance of (13), we see that the loss in statistical efficiency with respect to maximum likelihood is the amount of variation in not explained by , . Consequently, if is simply deterministic given , then , and achieves the statistically efficient asymptotic variance .

The story with is dual to that for the marginal likelihood estimator. Considering the second term in expression (14), we see that the loss of efficiency due to our observation model grows linearly in the variability of the observations not explained by . Thus, unlike , even if is deterministic given (so reveals full information about ), we do not recover the efficient covariance . As a trivial example, let and the observation for , so that contains a faithful copy of , and let . Then , and the asymptotic relative efficiency of to is . Roughly, integrates posterior information about better than does.

Proof.

To compute , we follow standard arguments in van der Vaart (1998). If is the marginal log-likelihood, then a straightforward calculation yields . The asymptotic variance is the inverse of ; applying the variance decomposition gives (13).

To compute , recall that the moments-based estimator computes and . Apply the delta method, where . Finally, decompose and recognize that to obtain (14). ∎

Randomized response.

To obtain concrete intuition for Proposition 2, we specialize to the case where is the randomized response (5). In this setting, for some constant vector . Recall the supervision model: , if and if .

Lemma 1 (asymptotic variances (randomized response)).

Under the randomized response model of (5), the asymptotic variance of is

(15)

The matrix governs the loss of efficiency, which stems from two sources: (i) , the variance when we sample ; and (ii) the variance in choosing between and . If and have the same distribution, then and .

Proof.

We decompose as

where we used . ∎
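The two asymptotic variances can be compared numerically on a small example. The sketch below assumes a finite unconditional exponential family, the randomized response supervision with reveal probability `eps` and base distribution `u`, and the debiased observation function from Section 3.1. It computes the marginal-likelihood variance as the inverse covariance of the posterior mean of the features, and the moment-based variance by the delta method. All names are hypothetical.

```python
import numpy as np

def asymptotic_variances(theta, phi, eps, u):
    """Return (I^{-1}, Sigma_marg, Sigma_mom) for randomized response."""
    scores = phi @ theta
    p = np.exp(scores - scores.max())
    p /= p.sum()
    k = len(p)
    mean_phi = p @ phi
    centered = phi - mean_phi
    fisher = centered.T @ (p[:, None] * centered)           # I = Cov[phi(y)]
    fisher_inv = np.linalg.inv(fisher)

    S = eps * np.eye(k) + (1.0 - eps) * np.outer(u, np.ones(k))  # S[o, y]
    q = S @ p                                               # marginal over o
    posterior = (S * p[None, :]) / q[:, None]               # posterior[o, y] = p(y | o)
    score = posterior @ phi - mean_phi                      # E[phi | o] - E[phi]
    sigma_marg = np.linalg.inv(score.T @ (q[:, None] * score))

    beta = (phi - (1.0 - eps) * (u @ phi)) / eps            # observation function
    beta_centered = beta - q @ beta
    cov_beta = beta_centered.T @ (q[:, None] * beta_centered)
    sigma_mom = fisher_inv @ cov_beta @ fisher_inv          # delta method
    return fisher_inv, sigma_marg, sigma_mom

def min_eig(a):
    """Smallest eigenvalue of the symmetric part of a."""
    return np.linalg.eigvalsh((a + a.T) / 2.0).min()

phi = np.vstack([np.eye(3), np.zeros((1, 3))])
theta = np.array([1.0, -0.5, 0.3])
u = np.full(4, 0.25)

full_info = asymptotic_variances(theta, phi, 1.0, u)   # eps = 1: y is revealed
noisy = asymptotic_variances(theta, phi, 0.3, u)
```

When `eps = 1` all three matrices coincide. For smaller `eps`, the difference `sigma_mom - sigma_marg` is positive semidefinite, consistent with the marginal likelihood estimator being at least as efficient.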

Figure 3: The efficiency of the moment-based estimator relative to the marginal likelihood estimator as the randomized response parameter varies, for weak (a) and strong (b) signals.

An empirical plot.

The Hájek-Le Cam convolution and local asymptotic minimax theorems give that is the most statistically efficient estimator. We now empirically study the efficiency of relative to , where , the average of the relative variances per coordinate of to . We continue to focus on randomized response in the unconditional case.

To study the effect of , we consider the following probability model: we let , define

and set . We set and to represent weak and strong signals (the latter is harder to estimate, as the Fisher information matrix is much smaller); when , the asymptotic variances are equal, . In Figure 3, we see that the asymptotic efficiency of relative to decreases as , which is explained by the fact that—as we see in expression (13)—the estimator leverages the prior information about based on , while as , expression (15) is dominated by the term, where is uniform. Moreover, as grows larger, the conditional covariance is much smaller than the covariance , so that we expect that .

6 The geometry of two-step estimation

We now provide some geometric intuition about the differences between and , establishing a connection between and the EM algorithm as a byproduct of our discussion. For concreteness, let be a finite set and let be the set of all distributions over (represented as -dimensional vectors). Let be a natural exponential family over with . See Figure 4 for an example where and . Note that in the space of distributions, is a non-convex set.

Let be the set of observations. We can represent the supervision function as a matrix . For , we can express the marginal distribution over as . Let be the empirical distribution over observations.

The maximum marginal likelihood estimator can now be written succinctly as:

(16)

While the KL-divergence itself is convex, the non-convex constraint set makes the problem difficult.

Figure 4: Visualization of the exponential family and all distributions over ; here is the 3-simplex. The model features for are . The blue curve marks out the exponential family . Observations yield two moment equations (dotted green) whose intersection with the simplex pins down the data distribution.

Our moment-based estimator can be viewed as a relaxation, where we first optimize over a relaxed set and then project onto the exponential family:

(17)

The first step can be computed directly via if is the squared Euclidean distance. If is KL-divergence, we can use EM (see the composite likelihood objective of Chaganty & Liang (2014)), which converges to the global optimum. The result is a single distribution over . The second step optimizes over via , which is a convex optimization problem, resulting in corresponding to .

Figure 5: Log-marginal likelihood , where the exponential family features are , , . The model is well specified with .

Computing the maximum marginal likelihood estimator generally requires solving a non-convex optimization problem (see Figure 5 for an example). When the supervision matrix has full column rank and the model is well-specified, the moment-based estimator is consistent: we have that . This means that eventually the KL projection of problem (17) is essentially an identity operation: we almost have by the rank assumptions, making the problem easy. This assumption strongly depends on the well-specifiedness of the supervision; indeed, if for any , then , for a constant , even as . We can relax the column rank assumption, however: the supervision matrix simply needs to contain enough information about the sufficient statistics, that is, if is the matrix of sufficient statistics, we require that for some matrix .

Deterministic supervision.

When the supervision matrix has full column rank, converges to . There are certainly cases where is consistent, but is not. What can we say about in this case?

To obtain intuition, consider the case where the supervision is a deterministic function that maps each output to an observation (region annotations are an example). In this case, every column of the supervision matrix is an indicator vector.

Here, the first-step solution distributes probability mass evenly across all the outputs that deterministically map to the same observation. In this case, the moment-based estimator simply corresponds to running one iteration of EM on the marginal likelihood, initializing with the uniform distribution over outputs. The E-step conditions on each observation and places uniform mass over the consistent outputs, producing the first-step distribution; the M-step fits the parameters to that distribution.
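The equivalence between the pseudoinverse solution and a single E-step from the uniform distribution can be checked on a toy deterministic supervision. The mapping `obs_of_y` and the empirical distribution `o_hat` below are made up for illustration, and the supervision matrix is written with rows indexed by observations and columns by outputs.

```python
import numpy as np

# Hypothetical deterministic supervision: 5 outputs mapped to 2 observations.
obs_of_y = np.array([0, 0, 0, 1, 1])
S = np.zeros((2, 5))
S[obs_of_y, np.arange(5)] = 1.0        # S[o, y] = 1 if output y maps to observation o

o_hat = np.array([0.3, 0.7])           # empirical distribution over observations

# First step with squared Euclidean distance: minimum-norm solution of S q = o_hat.
q_pinv = np.linalg.pinv(S) @ o_hat

# E-step from the uniform distribution: p(y | o) is uniform over outputs consistent with o.
posterior = S / S.sum(axis=1, keepdims=True)
q_em = posterior.T @ o_hat
```

Both vectors spread the mass of each observation evenly over the outputs consistent with it, so the first three outputs receive 0.1 each and the last two receive 0.35 each.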

7 Experiments

Figure 6: $R^2$ coefficient for linear regression when estimating from privately revealed sufficient statistics on two datasets.

Local privacy.

Following Section 3, we consider locally private estimation of a structured model. We take linear regression as a simple such structured model: it corresponds to a pairwise random field over the inputs and the response. The sufficient statistics are edge features and for each .

On the housing dataset, the supervision is given under the per-value RR scheme. On the songs dataset, is a dense vector, motivating the coordinate release scheme instead. We choose at random, with probability reveal and with probability reveal with suitable noise as described in Section 3.2. Note that the noising mechanism privatizes both input variables as well as the response .

Figure 6 visualizes the average (over 10 trials) $R^2$ coefficient of fit for linear regression on the test set, [2] in response to varying the privacy parameter. [3] As expected, the efficiency degrades as the privacy constraint tightens, though for moderate values of the privacy parameter the loss is not significant.

[2] The (uncentered) $R^2$ coefficient of parameters $\theta$ in a linear regression with design $X$ and labels $y$ is $1 - \|X\theta - y\|_2^2 / \|y\|_2^2$.

[3] We use the housing (mlcomp.org/datasets/840) and songs (mlcomp.org/datasets/748) data from mlcomp.

Lightweight annotations.

We experiment with estimating a conditional model for part-of-speech (POS) sequence tagging from lightweight annotations. [4] Every example in the dataset reveals a sentence and the counts of all tags over a consecutive window. Following the modeling assumptions in Section 4, we use a CRF (per Section 2) with only node features:

[4] We used the Wall Street Journal portion of the Penn Treebank. Sections 0-21 comprise the training set and 22-24 are test.

where is a function on the word (e.g., word, prefix, suffix and word signature).

When the problem is fully supervised, we maximize the log-likelihood with stochastic gradient descent (SGD); in this case, estimation is convex and exact gradients can be tractably computed. Under count supervision, convexity of the marginal likelihood is not guaranteed. Although the model has no edge features, the indirect count supervision places a potential spanning the whole region in which counts are revealed (one enforcing that the tag sequence is compatible with the counts). This renders exact inference intractable, so we approximate it using beam search to compute stochastic gradients. [5] The moment-based estimator is unaffected by this issue as it requires no inference and proceeds via a pair of convex minimization programs; we minimize both using SGD.

[5] The dataset has 45 tag values. We use a beam of size 500 after analytically marginalizing nodes outside the region.

Figure 7: Train and test per-position accuracies for and on part-of-speech tagging, under various sized regions of count annotations, as training passes are taken through the dataset.

Figure 7 shows train and test accuracies as we make passes over the dataset. Typically, after sufficiently many passes, the marginal likelihood gains an advantage over the moment-based estimator. For small regions, we expect the beam search approximation to be accurate, and indeed the marginal likelihood estimator is dominant there. For larger regions, the moment-based estimator (i) achieves high accuracy early and (ii) dominates for several passes before the marginal likelihood estimator overtakes it. Altogether, the experiment highlights that the moment-based estimator is favorable in computationally-constrained settings.

8 Related work and discussion

This work was motivated by two use cases of indirect supervision: local privacy and cheap annotations. Each trades off statistical accuracy for another resource: privacy or annotation cost. Local privacy has its roots in classical randomized response for conducting surveys (Warner, 1965), which has been extended to the multivariate (Tamhane, 1981) and conditional (Matloff, 1984) settings. In the computer science community, differential privacy has emerged as a useful formalization of privacy (Dwork, 2006). We work with the stronger notion of local differential privacy (Evfimievski et al., 2004; Kasiviswanathan et al., 2011; Duchi et al., 2013). Our contribution here is two-fold: First, we bring local privacy to the graphical model setting, which provides an opportunity for the privacy mechanism to be sensitive to the model structure. While we believe our mechanisms are reasonable, an open question is designing optimal mechanisms in the structured case. Second, we connect privacy with other forms of indirect supervision.

The second use case is learning from lightweight annotations, which has taken many forms in the literature. Multi-instance learning (Oded & Tomás, 1998) is popular in computer vision, where it is natural to label the presence but not location of objects (Babenko et al., 2009). In natural language processing, there has also been work on learning from structured outputs where, like this work, only counts of labels are observed (Mann & McCallum, 2008; Liang et al., 2009). However, these works resort to likelihood-based approaches which involve non-convex optimization and approximate inference, whereas in this work, we show that linear algebra and convex optimization suffice under modeling assumptions.

Quadrianto et al. (2008) showed how to learn from label proportions of groups of examples, using a linear system technique similar to ours. However, they assume that the group is conditionally independent of the example given the label, which would not apply in our region-based annotation setup since our regions contain arbitrarily correlated inputs and heterogeneous labels. In return, we do need to make the stronger assumption that each label depends only on a discrete , so that the credit assignment can be done using a linear program. An open challenge is to allow for heterogeneity with complex inputs.

Indirect supervision arises more generally in latent-variable models, which arise in machine translation (Brown et al., 1993), semantic parsing (Liang et al., 2011), object detection (Quattoni et al., 2004), and other missing data problems in statistics (M & Naisyin, 2000). The indirect supervision problems in this paper have additional structure: we have an unknown model and a known supervision function. It is this structure that allows us to obtain computationally efficient method of moments procedures.

We started this work to see how much juice we could squeeze out of just linear moment equations, and the answer is more than we expected. Of course, for more general latent-variable models beyond linearly indirectly-supervised problems, we would need more powerful tools. In recent years, tensor factorization techniques have provided efficient methods for a wide class of latent-variable models (Hsu et al., 2012; Anandkumar et al., 2012; Hsu & Kakade, 2013; Anandkumar et al., 2013; Chaganty & Liang, 2013; Halpern & Sontag, 2013; Chaganty & Liang, 2014). One can leverage even more general polynomial-solving techniques to expand the set of models (Wang et al., 2015). In general, the method of moments allows us to leverage statistical structure to alleviate computational intractability, and we anticipate more future developments along these lines.

Reproducibility.

The code, data and experiments for this paper are available on Codalab at https://worksheets.codalab.org/worksheets/0x6a264a96efea41158847eef9ec2f76bc/.

References

  • Anandkumar et al. (2012) Anandkumar, A., Foster, D. P., Hsu, D., Kakade, S. M., and Liu, Y. Two SVDs suffice: Spectral decompositions for probabilistic topic modeling and latent Dirichlet allocation. In Advances in Neural Information Processing Systems (NIPS), 2012.
  • Anandkumar et al. (2013) Anandkumar, A., Ge, R., Hsu, D., Kakade, S. M., and Telgarsky, M. Tensor decompositions for learning latent variable models. arXiv, 2013.
  • Babenko et al. (2009) Babenko, B., Yang, M., and Belongie, S. Visual tracking with online multiple instance learning. In Computer Vision and Pattern Recognition (CVPR), pp. 983–990, 2009.
  • Brown et al. (1993) Brown, P. F., Della Pietra, S. A., Della Pietra, V. J., and Mercer, R. L. The mathematics of statistical machine translation: Parameter estimation. Computational Linguistics, 19:263–311, 1993.
  • Chaganty & Liang (2013) Chaganty, A. and Liang, P. Spectral experts for estimating mixtures of linear regressions. In International Conference on Machine Learning (ICML), 2013.
  • Chaganty & Liang (2014) Chaganty, A. and Liang, P. Estimating latent-variable graphical models using moments and likelihoods. In International Conference on Machine Learning (ICML), 2014.
  • Chang et al. (2007) Chang, M., Ratinov, L., and Roth, D. Guiding semi-supervision with constraint-driven learning. In Association for Computational Linguistics (ACL), pp. 280–287, 2007.
  • Dempster et al. (1977) Dempster, A. P., Laird, N. M., and Rubin, D. B. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society: Series B, 39(1):1–38, 1977.
  • Duchi et al. (2013) Duchi, J. C., Jordan, M. I., and Wainwright, M. J. Local privacy and statistical minimax rates. In Foundations of Computer Science (FOCS), 2013.
  • Dwork (2006) Dwork, C. Differential privacy. In Automata, languages and programming, pp. 1–12, 2006.
  • Dwork et al. (2006) Dwork, C., McSherry, F., Nissim, K., and Smith, A. Calibrating noise to sensitivity in private data analysis. In Proceedings of the 3rd Theory of Cryptography Conference, pp. 265–284, 2006.
  • Evfimievski et al. (2004) Evfimievski, A., Srikant, R., Agrawal, R., and Gehrke, J. Privacy preserving mining of association rules. Information Systems, 29(4):343–364, 2004.
  • Graça et al. (2008) Graça, J., Ganchev, K., and Taskar, B. Expectation maximization and posterior constraints. In Advances in Neural Information Processing Systems (NIPS), pp. 569–576, 2008.
  • Haghighi & Klein (2006) Haghighi, A. and Klein, D. Prototype-driven learning for sequence models. In North American Association for Computational Linguistics (NAACL), pp. 320–327, 2006.
  • Halpern & Sontag (2013) Halpern, Y. and Sontag, D. Unsupervised learning of noisy-or Bayesian networks. In Uncertainty in Artificial Intelligence (UAI), 2013.
  • Hsu & Kakade (2013) Hsu, D. and Kakade, S. M. Learning mixtures of spherical Gaussians: Moment methods and spectral decompositions. In Innovations in Theoretical Computer Science (ITCS), 2013.
  • Hsu et al. (2009) Hsu, D., Kakade, S. M., and Zhang, T. A spectral algorithm for learning hidden Markov models. In Conference on Learning Theory (COLT), 2009.
  • Hsu et al. (2012) Hsu, D., Kakade, S. M., and Liang, P. Identifiability and unmixing of latent parse trees. In Advances in Neural Information Processing Systems (NIPS), 2012.
  • Kasiviswanathan et al. (2011) Kasiviswanathan, S. P., Lee, H. K., Nissim, K., Raskhodnikova, S., and Smith, A. What can we learn privately? SIAM Journal on Computing, 40(3):793–826, 2011.
  • Lafferty et al. (2001) Lafferty, J., McCallum, A., and Pereira, F. Conditional random fields: Probabilistic models for segmenting and labeling data. In International Conference on Machine Learning (ICML), pp. 282–289, 2001.
  • Liang et al. (2009) Liang, P., Jordan, M. I., and Klein, D. Learning from measurements in exponential families. In International Conference on Machine Learning (ICML), 2009.
  • Liang et al. (2011) Liang, P., Jordan, M. I., and Klein, D. Learning dependency-based compositional semantics. In Association for Computational Linguistics (ACL), pp. 590–599, 2011.
  • M & Naisyin (2000) Robins, J. M. and Wang, N. Inference for imputation estimators. Biometrika, 87(1):113–124, 2000.
  • Mann & McCallum (2008) Mann, G. and McCallum, A. Generalized expectation criteria for semi-supervised learning of conditional random fields. In Human Language Technology and Association for Computational Linguistics (HLT/ACL), pp. 870–878, 2008.
  • Matloff (1984) Matloff, N. S. Use of covariates in randomized response settings. Statistics & Probability Letters, 2(1):31–34, 1984.
  • Oded & Tomás (1998) Maron, O. and Lozano-Pérez, T. A framework for multiple-instance learning. In Advances in Neural Information Processing Systems (NIPS), pp. 570–576, 1998.
  • Pearson (1894) Pearson, K. Contributions to the mathematical theory of evolution. Philosophical Transactions of the Royal Society of London. A, 185:71–110, 1894.
  • Quadrianto et al. (2008) Quadrianto, N., Smola, A. J., Caetano, T. S., and Le, Q. V. Estimating labels from label proportions. In International Conference on Machine Learning (ICML), pp. 776–783, 2008.
  • Quattoni et al. (2004) Quattoni, A., Collins, M., and Darrell, T. Conditional random fields for object recognition. In Advances in Neural Information Processing Systems (NIPS), 2004.
  • Steinhardt & Liang (2015) Steinhardt, J. and Liang, P. Learning with relaxed supervision. In Advances in Neural Information Processing Systems (NIPS), 2015.
  • Tamhane (1981) Tamhane, A. C. Randomized response techniques for multiple sensitive attributes. Journal of the American Statistical Association (JASA), 76(376):916–923, 1981.
  • Vaish et al. (2014) Vaish, R., Wyngarden, K., Chen, J., Cheung, B., and Bernstein, M. S. Twitch crowdsourcing: crowd contributions in short bursts of time. In Conference on Human Factors in Computing Systems (CHI), pp. 3645–3654, 2014.
  • van der Vaart (1998) van der Vaart, A. W. Asymptotic statistics. Cambridge University Press, 1998.
  • Wang et al. (2015) Wang, S., Chaganty, A., and Liang, P. Estimating mixture models via mixture of polynomials. In Advances in Neural Information Processing Systems (NIPS), 2015.
  • Warner (1965) Warner, S. L. Randomized response: A survey technique for eliminating evasive answer bias. Journal of the American Statistical Association (JASA), 60(309):63–69, 1965.

Appendix A Details of privacy schemes

A.1 Local privacy using sufficient statistics

Proof of Proposition 1.

Because is a sufficient statistic, by definition there exists some channel and a distribution such that . If we define

(18)

then (8) follows by substitution and algebra: ∎

A.2 Privacy guarantees of proposed schemes

In order to show differential privacy of the two schemes proposed in Section 3, we first note that it suffices to have differential privacy of the observations with respect to any (possibly random) data processed given the private variable such that forms a Markov chain.

To see this, suppose is an -differentially private channel taking the intermediate variable to and fix any . Let be the distribution of given . Now, for the end-to-end channel ,

(19)
(20)
(21)
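The post-processing argument can be written out as follows. This is a sketch in notation assumed here rather than taken from the paper: $x$ and $x'$ are two values of the private variable, $z$ is the intermediate variable with conditional distribution $P(z \mid x)$, $o$ is the observation, $Q(o \mid z)$ is the $\epsilon$-differentially private channel (so $Q(o \mid z) \le e^{\epsilon} Q(o \mid z')$ for all $z, z'$), and $S(o \mid x)$ is the end-to-end channel. For simplicity $z$ is assumed discrete, so the mixture is a sum.

```latex
\begin{align*}
S(o \mid x) &= \sum_{z} Q(o \mid z)\, P(z \mid x) \\
  &\le \sum_{z} \Bigl( e^{\epsilon} \min_{z'} Q(o \mid z') \Bigr) P(z \mid x)
   = e^{\epsilon} \min_{z'} Q(o \mid z') \\
  &\le e^{\epsilon} \sum_{z} Q(o \mid z)\, P(z \mid x')
   = e^{\epsilon}\, S(o \mid x').
\end{align*}
```

The second line uses the privacy of $Q$ pointwise in $z$, and the third uses the fact that a minimum is no larger than any weighted average.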

Coordinate release.

Recall that in the coordinate release mechanism, we first pick a coordinate and release observation after flipping with probability .

(22)
(23)

where the final step is by the triangle inequality applied twice.
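A small simulation may make the mechanism concrete. The sketch below makes two assumptions that are not taken from the text: the statistics are binary, and the flip probability is Warner's classical choice $1/(1+e^{\epsilon})$, which may differ from the exact parametrization used in Section 3. The function names are hypothetical.

```python
import math
import random


def flip_probability(epsilon):
    """Assumed flip probability: Warner's randomized response, 1 / (1 + e^eps)."""
    return 1.0 / (1.0 + math.exp(epsilon))


def coordinate_release(phi, epsilon, rng):
    """Pick one coordinate uniformly and release it after a random flip.

    phi is assumed to be a list of 0/1 statistics for one individual.
    Returns the chosen index and the (possibly flipped) bit.
    """
    j = rng.randrange(len(phi))
    bit = phi[j]
    if rng.random() < flip_probability(epsilon):
        bit = 1 - bit
    return j, bit


def estimate_moments(releases, dim, epsilon):
    """Debias the released bits to estimate the mean of each coordinate.

    For a true bit b, E[released bit] = p + b * (1 - 2p), so
    (bit - p) / (1 - 2p) is an unbiased estimate of b (requires epsilon > 0).
    """
    p = flip_probability(epsilon)
    sums = [0.0] * dim
    counts = [0] * dim
    for j, bit in releases:
        sums[j] += (bit - p) / (1.0 - 2.0 * p)
        counts[j] += 1
    return [s / c if c else float("nan") for s, c in zip(sums, counts)]
```

Under these assumptions the likelihood ratio of any released bit is at most $(1-p)/p = e^{\epsilon}$, and averaging the debiased bits per coordinate recovers the moments, at the cost of the extra variance from flipping and from observing only one coordinate per individual.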

Per-value -RR.

Privacy of per-value -RR follows similarly.

Each coordinate of is flipped with probability , where is chosen such that (see Section 3.2)

(24)
(25)
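The following sketch checks a per-coordinate flipping scheme numerically. It rests on assumptions that are not confirmed by the text: the private value is one of $k$ categories encoded as a one-hot vector, and each coordinate is flipped independently with probability $1/(1+e^{\epsilon/2})$. Because two one-hot vectors differ in at most two coordinates, the worst-case likelihood ratio under these assumptions is $e^{\epsilon}$. All function names are hypothetical.

```python
import itertools
import math
import random


def flip_probability(epsilon):
    """Assumed per-coordinate flip probability: 1 / (1 + e^(eps / 2))."""
    return 1.0 / (1.0 + math.exp(epsilon / 2.0))


def one_hot(value, k):
    """Indicator vector of `value` among k categories."""
    return [1 if i == value else 0 for i in range(k)]


def per_value_rr(value, k, epsilon, rng):
    """Release the one-hot vector with each coordinate flipped independently."""
    p = flip_probability(epsilon)
    return [1 - b if rng.random() < p else b for b in one_hot(value, k)]


def output_probability(output, value, k, epsilon):
    """Probability of observing `output` when the private value is `value`."""
    p = flip_probability(epsilon)
    prob = 1.0
    for o, b in zip(output, one_hot(value, k)):
        prob *= p if o != b else 1.0 - p
    return prob


def worst_case_ratio(k, epsilon):
    """Largest likelihood ratio over all outputs and pairs of private values."""
    worst = 0.0
    for output in itertools.product([0, 1], repeat=k):
        for x in range(k):
            for x_alt in range(k):
                ratio = (output_probability(output, x, k, epsilon)
                         / output_probability(output, x_alt, k, epsilon))
                worst = max(worst, ratio)
    return worst
```

The exhaustive enumeration in `worst_case_ratio` is only feasible for small $k$, but it gives a direct check that the assumed flip probability spends the privacy budget exactly.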

A.3 Variance of moments-based estimator for different privacy schemes

For simplicity, we once again consider the unconditional case (where is empty) and assume .

Theorem 1 (Asymptotic variance (coordinate release)).

The asymptotic variance of for the -differentially private coordinate release scheme under a uniform coordinate sampling distribution is

where

(26)

As in Lemma 1, the matrix governs the loss in efficiency under the coordinate release mechanism, which arises from two sources: (i) variance due to the stochastic flipping process and (ii) variance due to choosing a random coordinate for release.

Proof.

When is uniform, the observation function takes the following form.

From (14), we have that .

We decompose as