Faster Algorithms for Privately Releasing Marginals

An extended abstract of this work appears in ICALP ’12 [22].


Justin Thaler (http://seas.harvard.edu/~jthaler). Supported by the Department of Defense (DoD) through the National Defense Science & Engineering Graduate Fellowship (NDSEG) Program, and in part by NSF grants CCF-0915922 and IIS-0964473.
Jonathan Ullman (http://seas.harvard.edu/~jullman). Supported by NSF grant CNS-0831289 and a gift from Google, Inc.
Salil Vadhan (http://seas.harvard.edu/~salil). Supported by NSF grant CNS-0831289 and a gift from Google, Inc.

School of Engineering and Applied Sciences &
Center for Research on Computation and Society
Harvard University, Cambridge, MA
{jthaler,jullman,salil}@seas.harvard.edu

Abstract

We study the problem of releasing $k$-way marginals of a database $D \in (\{0,1\}^d)^n$, while preserving differential privacy. The answer to a $k$-way marginal query is the fraction of $D$’s records $x \in \{0,1\}^d$ with a given value in each of a given set of up to $k$ columns. Marginal queries enable a rich class of statistical analyses of a dataset, and designing efficient algorithms for privately releasing marginal queries has been identified as an important open problem in private data analysis (cf. Barak et al., PODS ’07).

We give an algorithm that runs in time $d^{O(\sqrt{k})}$ and releases a private summary capable of answering any $k$-way marginal query with at most $\pm .01$ error on every query as long as $n \geq d^{O(\sqrt{k})}$. To our knowledge, ours is the first algorithm capable of privately releasing marginal queries with non-trivial worst-case accuracy guarantees in time substantially smaller than the number of $k$-way marginal queries, which is $d^{\Theta(k)}$ (for $k \ll d$).

1 Introduction

Consider a database $D \in (\{0,1\}^d)^n$ in which each of the $n$ rows corresponds to an individual’s record, and each record consists of $d$ binary attributes. The goal of privacy-preserving data analysis is to enable rich statistical analyses on the database while protecting the privacy of the individuals. In this work, we seek to achieve differential privacy [6], which guarantees that no individual’s data has a significant influence on the information released about the database.

One of the most important classes of statistics on a dataset is its marginals. A marginal query is specified by a set $S \subseteq [d]$ and a pattern $t \in \{0,1\}^{|S|}$. The query asks, “What fraction of the individual records in $D$ have each of the attributes $j \in S$ set to $t_j$?” A major open problem in privacy-preserving data analysis is to efficiently create a differentially private summary of the database that enables analysts to answer each of the $3^d$ marginal queries. A natural subclass of marginals is the $k$-way marginals, the subset of marginals specified by sets $S \subseteq [d]$ such that $|S| \leq k$.
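As a concrete illustration of the definition above, the following minimal Python sketch evaluates a (non-private) marginal query on a small binary database. The function name `marginal_query`, the 0-based column indexing, and the dictionary encoding of the attribute set and pattern are assumptions made for this example and are not notation from the paper.

```python
from fractions import Fraction

def marginal_query(database, pattern):
    """Fraction of rows that match `pattern` on every specified column.

    `database` is a list of equal-length 0/1 tuples (one per record).
    `pattern` maps a column index (0-based, an assumption of this sketch)
    to the required bit; its key set plays the role of the attribute set.
    """
    matches = sum(
        all(row[j] == bit for j, bit in pattern.items())
        for row in database
    )
    return Fraction(matches, len(database))

# Hypothetical 4-row database with 3 binary attributes.
D = [(1, 0, 1), (1, 1, 0), (0, 0, 1), (1, 0, 0)]

# A 2-way marginal: rows with attribute 0 set to 1 and attribute 1 set to 0.
two_way = marginal_query(D, {0: 1, 1: 0})
```

The exact answer is returned as a `Fraction` so that it can be compared against noisy releases without floating-point ambiguity.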

Privately answering marginal queries is a special case of the more general problem of privately answering counting queries on the database, which are queries of the form, “What fraction of individual records in $D$ satisfy some property $q$?” Early work in differential privacy [5, 2, 6] showed how to approximately answer any set $\mathcal{Q}$ of counting queries by perturbing the answers with appropriately calibrated noise, providing good accuracy (say, within $\pm .01$ of the true answer) as long as $|D| \gtrsim |\mathcal{Q}|^{1/2}$.

In a setting where the queries arrive online, or are known in advance, it may be reasonable to assume that $|D| \gtrsim |\mathcal{Q}|^{1/2}$. However, many situations necessitate a non-interactive data release, where the data owner computes and publishes a single differentially private summary of the database that enables analysts to answer a large class of queries, say all $k$-way marginals for a suitable choice of $k$. In this case $|\mathcal{Q}| = d^{\Theta(k)}$, and it may be impractical to collect enough data to ensure $|D| \gtrsim |\mathcal{Q}|^{1/2}$. Fortunately, the remarkable work of Blum et al. [3] and subsequent refinements [7, 9, 19, 14, 13, 12] have shown how to privately release approximate answers to any set of counting queries, even when $|\mathcal{Q}|$ is exponentially larger than $|D|$. For example, these algorithms can release all $k$-way marginals as long as $|D| \geq \tilde{\Theta}(k\sqrt{d})$. Unfortunately, all of these algorithms have running time at least $2^d$, even when $\mathcal{Q}$ is the set of $2$-way marginals (and this is inherent for algorithms that produce “synthetic data” [23], as discussed below).

Given this state of affairs, it is natural to seek efficient algorithms capable of privately releasing approximate answers to marginal queries even when $|D| \ll d^k$. A recent series of works [11, 4, 15] has shown how to privately release answers to $k$-way marginal queries with small average error (over various distributions on the queries) with both running time and minimum database size much smaller than $d^k$ (e.g.  for product distributions [11, 4] and for arbitrary distributions [15]). Hardt et al. [15] also gave an algorithm for privately releasing $k$-way marginal queries with small worst-case error and minimum database size much smaller than $d^k$. However, the running time of their algorithm is still $d^{\Theta(k)}$, which is polynomial in the number of queries.

In this paper, we give the first algorithms capable of releasing $k$-way marginals up to small worst-case error, with both running time and minimum database size substantially smaller than $d^k$. Specifically, we show how to create a private summary in time $d^{O(\sqrt{k})}$ that gives approximate answers to all $k$-way marginals as long as $|D|$ is at least $d^{O(\sqrt{k})}$. When $k = d$, our algorithm runs in time $2^{\tilde{O}(\sqrt{d})}$, and is the first algorithm for releasing all marginals in time $2^{o(d)}$.

1.1 Our Results and Techniques

In this paper, we give faster algorithms for releasing marginals and other classes of counting queries.

Theorem 1.1 (Releasing Marginals).

There exists a constant such that for every with , every , and every , there is an -differentially private sanitizer that, on input a database , runs in time and releases a summary that enables computing each of the -way marginal queries on up to an additive error of at most , provided that .

For notational convenience, we focus on monotone -way disjunction queries. However, our results extend straightforwardly to general non-monotone -way disjunction queries (see Section 4.1), which are equivalent to -way marginals. A monotone -way disjunction is specified by a set of size and asks what fraction of records in have at least one of the attributes in set to .

Our algorithm is inspired by a series of works reducing the problem of private query release to various problems in learning theory. One ingredient in this line of work is a shift in perspective introduced by Gupta, Hardt, Roth, and Ullman [11]. Instead of viewing disjunction queries as a set of functions on the database, they view the database $D$ as a function $f_D : \{0,1\}^d \to [0,1]$, in which each vector $s \in \{0,1\}^d$ is interpreted as the indicator vector of a set $S \subseteq [d]$, and $f_D(s)$ equals the evaluation of the disjunction specified by $S$ on the database $D$. They use the structure of the functions $f_D$ to privately learn an approximation to $f_D$ that has small average error over any product distribution on disjunctions. (Footnote 1: In their learning algorithm, privacy is defined with respect to the rows of the database $D$ that defines $f_D$, not with respect to the examples given to the learning algorithm, unlike earlier works on “private learning” [16].)

Cheraghchi, Klivans, Kothari, and Lee [4] observed that the functions $f_D$ can be approximated by a low-degree polynomial with small average error over the uniform distribution on disjunctions. They then use a private learning algorithm for low-degree polynomials to release an approximation to $f_D$, and thereby obtain an improved dependence on the accuracy parameter, as compared to [11].

Hardt, Rothblum, and Servedio [15] observe that is itself an average of disjunctions (each row of specifies a disjunction of bits in the indicator vector of the query), and thus develop private learning algorithms for threshold of sums of disjunctions. These learning algorithms are also based on low-degree approximations of sums of disjunctions. They show how to use their private learning algorithms to obtain a sanitizer with small average error over arbitrary distributions with running time and minimum database size . They then are able to apply the private boosting technique of Dwork, Rothblum, and Vadhan [9] to obtain worst-case accuracy guarantees. Unfortunately, the boosting step incurs a blowup of in the running time.

We improve the above results by showing how to directly compute (a noisy version of) a polynomial $p_D$ that is privacy-preserving and still approximates $f_D$ on all $k$-way disjunctions, as long as $|D|$ is sufficiently large. Specifically, the running time and the database size requirement of our algorithm are both polynomial in the number of monomials in $p_D$, which is $d^{O(\sqrt{k})}$. By “directly”, we mean that we compute $p_D$ from the database $D$ itself and perturb its coefficients, rather than using a learning algorithm. Our construction of the polynomial $p_D$ uses the same low-degree approximations exploited by Hardt et al. in the development of their private learning algorithms.

In summary, the main difference between prior work and ours is that prior work used learning algorithms that have restricted access to the database, and released the hypothesis output by the learning algorithm. In contrast, we do not make use of any learning algorithms, and give our release algorithm direct access to the database. This enables our algorithm to achieve a worst-case error guarantee while maintaining a minimal database size and running time much smaller than the size of the query set. Our algorithm is also substantially simpler than that of Hardt et al.

We also consider other families of counting queries. We define the class of $r$-of-$k$ queries. Like a monotone $k$-way disjunction, an $r$-of-$k$ query is defined by a set $S \subseteq [d]$ such that $|S| \leq k$. The query asks what fraction of the rows of $D$ have at least $r$ of the attributes in $S$ set to $1$. For $r = 1$, these queries are exactly monotone $k$-way disjunctions, and $r$-of-$k$ queries are a strict generalization.
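To make the relationship between these query classes concrete, here is a small Python sketch that evaluates a monotone threshold query exactly (without privacy) and recovers the monotone disjunction as the threshold-one special case. The function names and the list-of-tuples database encoding are assumptions of this example only.

```python
from fractions import Fraction

def r_of_k_query(database, attrs, r):
    """Fraction of rows with at least `r` of the attributes in `attrs` set to 1.

    `database` is a list of 0/1 tuples; `attrs` is an iterable of
    0-based column indices (the query's attribute set).
    """
    attrs = list(attrs)
    hits = sum(sum(row[j] for j in attrs) >= r for row in database)
    return Fraction(hits, len(database))

def monotone_disjunction(database, attrs):
    """A monotone disjunction is the threshold-1 special case."""
    return r_of_k_query(database, attrs, 1)

# Hypothetical database with 3 binary attributes.
D = [(1, 0, 1), (1, 1, 0), (0, 0, 1), (0, 0, 0)]
at_least_one = monotone_disjunction(D, [0, 1])
at_least_two = r_of_k_query(D, [0, 1], 2)
```

Raising the threshold can only shrink the set of satisfying rows, so the answer is non-increasing in the threshold for a fixed attribute set.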

Theorem 1.2 (Releasing -of- Queries).

For every with , every , and every there is an -differentially private sanitizer that, on input a database , runs in time and releases a summary that enables computing each of the -of- queries on up to an additive error of at most , provided that .

Note that monotone $k$-way disjunctions are just $r$-of-$k$ queries where $r = 1$; thus Theorem 1.2 implies a release algorithm for disjunctions with quadratically better dependence on $\log(1/\alpha)$, at the cost of slightly worse dependence on $k$ (implicit in the switch from $O(\cdot)$ to $\tilde{O}(\cdot)$).

Finally, we present a sanitizer for privately releasing databases in which the rows of the database are interpreted as decision lists, and the queries are inputs to the decision lists. That is, instead of each record in being a string of attributes, each record is an element of the set , which consists of all length- decision lists over input variables. (See Section 4.3 for a precise definition.) A query is specified by a string and asks “What fraction of database participants would make a certain decision based on the input ?”

As an example application, consider a database that allows high school students to express their preferences for colleges in the form of a decision list. For example, a student may say, “If the school is ranked in the top ten nationwide, I am willing to apply to it. Otherwise, if the school is rural, I am unwilling to apply. Otherwise, if the school has a good basketball team then I am willing to apply to it.” And so on. Each student is allowed to use up to attributes out of a set of binary attributes. Our sanitizer allows any college (represented by its binary attributes) to determine the fraction of students willing to apply.

Theorem 1.3 (Releasing Decision Lists).

For any s.t. , any , and any , there is an -differentially private sanitizer with running time that, on input a database , releases a summary that enables computing any length- decision list query up to an additive error of at most on every query, provided that .

For comparison, we note that all the results on releasing -way disjunctions (including ours) also apply to a dual setting where the database records specify a -way disjunction over bits and the queries are -bit strings (in this setting plays the role of ). Theorem 1.3 generalizes this dual version of Theorem 1.1, as length- decision lists are a strict generalization of -way disjunctions.

We prove the latter two results (Theorems 1.2 and 1.3) using the same approach outlined for marginals (Theorem 1.1), but with different low-degree polynomial approximations appropriate for the different types of queries.

Paper | Running Time | Database Size | Error Type (footnote 2) | Synthetic Data?
[5, 8, 2, 6] | | | Worst case | N
[1] | | | Worst case | Y
[3, 7, 9, 13] | | | Worst case | Y
[11] | | | Product Dists. | N
[4] | | | Uniform Dist. (footnote 3) | N
[15] | | | Any Dist. | N
[15] | | | Worst case | N
[15] | | | Any Dist. | N
[15] | | | Worst case | N
This paper | | | Worst case | N

Footnote 2: Worst case error indicates that the accuracy guarantee holds for every marginal. The other types of error indicate that accuracy holds for random marginals over a given distribution from a particular class of distributions (e.g. product distributions).
Footnote 3: The results of [4] apply only to the uniform distribution over all marginals.

Table 1: Summary of prior results on differentially private release of -way marginals. The database size column indicates the minimum database size required to release answers to -way marginals up to an additive error of . For clarity, we ignore the dependence on the privacy parameters and the failure probability of the algorithms. Notice that this paper contains the first algorithm capable of releasing -way marginals with running time and worst-case error substantially smaller than the number of queries.

On Synthetic Data.

An attractive type of summary is a synthetic database. A synthetic database is a new database $\hat{D}$ whose rows are “fake”, but such that $\hat{D}$ approximately preserves many of the statistical properties of the database $D$ (e.g. all the marginals). Some of the previous work on counting query release has provided synthetic data, starting with Barak et al. [1] and including [3, 7, 9, 13].

Unfortunately, Ullman and Vadhan [23] (building on [7]) have shown that no differentially private sanitizer with running time can take a database and output a private synthetic database , all of whose -way marginals are approximately equal to those of (assuming the existence of one-way functions). They also showed that there is a constant such that no differentially private sanitizer with running time can output a private synthetic database, all of whose -way marginals are approximately equal to those of (under stronger cryptographic assumptions).

When , our sanitizer runs in time and releases a private summary that enables an analyst to approximately answer any marginal query on . Prior to our work it was not known how to release any summary enabling approximate answers to all marginals in time . Thus, our results show that releasing a private summary for all marginal queries can be done considerably more efficiently if we do not require the summary to be a synthetic database (under the hardness assumptions made in [23]).

2 Preliminaries

2.1 Differentially Private Sanitizers

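Let a database $D \in \mathcal{X}^n$ be a collection of $n$ rows $(x_1, \dots, x_n)$ from a data universe $\mathcal{X}$. We say that two databases $D, D' \in \mathcal{X}^n$ are adjacent if they differ only on a single row, and we denote this by $D \sim D'$.

A sanitizer $\mathcal{A} : \mathcal{X}^n \to \mathcal{R}$ takes a database as input and outputs some data structure in $\mathcal{R}$. We are interested in sanitizers that satisfy differential privacy.

Definition 2.1 (Differential Privacy [6]).

A sanitizer $\mathcal{A} : \mathcal{X}^n \to \mathcal{R}$ is $(\varepsilon, \delta)$-differentially private if for every two adjacent databases $D \sim D' \in \mathcal{X}^n$ and every subset $S \subseteq \mathcal{R}$,
$$\Pr[\mathcal{A}(D) \in S] \leq e^{\varepsilon} \Pr[\mathcal{A}(D') \in S] + \delta.$$
In the case where $\delta = 0$ we say that $\mathcal{A}$ is $\varepsilon$-differentially private.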

Since a sanitizer that always outputs $\perp$ satisfies Definition 2.1, we also need to define what it means for a sanitizer to be accurate. In particular, we are interested in sanitizers that give accurate answers to counting queries. A counting query is defined by a boolean predicate $q : \mathcal{X} \to \{0,1\}$. We define the evaluation of the query $q$ on a database $D = (x_1, \dots, x_n) \in \mathcal{X}^n$ to be $q(D) = \frac{1}{n} \sum_{i=1}^{n} q(x_i)$. We use $\mathcal{Q}$ to denote a set of counting queries.

Since may output an arbitrary data structure, we must specify how to answer queries in from the output . Hence, we require that there is an evaluator that estimates from the output of . For example, if outputs a vector of “noisy answers” , where is a random variable for each , then and is the -th component of . Abusing notation, we write and as shorthand for and , respectively. Since we are interested in the efficiency of the sanitization process as a whole, when we refer to the running time of , we also include the running time of the evaluator . We say that is “accurate” for the query set if the values are close to the answers . Formally,

Definition 2.2 (Accuracy).

An output of a sanitizer is -accurate for the query set if for every . A sanitizer is -accurate for the query set if for every database ,

where the probability is taken over the coins of .

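We will make use of the Laplace mechanism. Let $\mathrm{Lap}(\sigma)^k$ denote a draw from the random variable over $\mathbb{R}^k$ in which each coordinate is chosen independently according to the density function $\mathrm{Lap}_{\sigma}(x) \propto e^{-|x|/\sigma}$. Let $D \in \mathcal{X}^n$ be a database and $g : \mathcal{X}^n \to \mathbb{R}^k$ be a function such that for every pair of adjacent databases $D \sim D'$, $\|g(D) - g(D')\|_1 \leq \Delta$. Then we have the following two lemmas: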

Lemma 2.3 (Laplace Mechanism, -Differential Privacy [6]).

For as above, the mechanism satisfies -differential privacy. Furthermore, for any , for

The choice of the norm in the accuracy guarantee of the lemma is for convenience, and doesn’t matter for the parameters of Theorems 1.1-1.3 (except for the hidden constants).
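The Laplace mechanism can be sketched in a few lines of Python. The sketch below assumes the standard calibration, in which each coordinate receives independent Laplace noise of scale (sensitivity)/(privacy parameter); the function names, the use of the standard-library `random` module, and the example values are assumptions of this illustration, and the code is not hardened against floating-point attacks on real deployments.

```python
import random

def laplace_sample(scale, rng=random):
    """Draw from the Laplace distribution with density proportional to exp(-|x|/scale).

    The difference of two independent exponential variables with mean
    `scale` is Laplace-distributed with that scale.
    """
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def laplace_mechanism(values, l1_sensitivity, epsilon, rng=random):
    """Add independent Laplace(l1_sensitivity / epsilon) noise to each coordinate.

    `l1_sensitivity` is assumed to bound the L1 distance between the
    vectors computed on any two adjacent databases.
    """
    scale = l1_sensitivity / epsilon
    return [v + laplace_sample(scale, rng) for v in values]

# Hypothetical vector of three query answers with L1 sensitivity 0.01.
rng = random.Random(0)
noisy = laplace_mechanism([0.25, 0.5, 0.75], l1_sensitivity=0.01,
                          epsilon=1.0, rng=rng)
```

Smaller values of the privacy parameter give a larger noise scale, which is the source of the database-size requirements in the theorems.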

If the privacy requirement is relaxed to $(\varepsilon, \delta)$-differential privacy (for $\delta > 0$), then it is sufficient to perturb each coordinate of $g(D)$ with noise from a Laplace distribution of smaller magnitude, leading to smaller error.

Lemma 2.4 (Laplace Mechanism, -Differential Privacy [5, 8, 2, 9]).

For as above, and for every , the mechanism satisfies -differential privacy. Furthermore, for any , for

2.2 Query Function Families

We take the approach of Gupta et al. [11] and think of the database $D$ as specifying a function $f_D$ mapping queries $q$ to their answers $q(D)$, which we call the $\mathcal{Q}$-representation of $D$. We now describe this transformation more formally:

Definition 2.5 (-Function Family).

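Let $\mathcal{Q} = \{q_y\}_{y \in \mathcal{Y}_{\mathcal{Q}} \subseteq \{0,1\}^m}$ be a set of counting queries on a data universe $\mathcal{X}$, where each query is indexed by an $m$-bit string. We define the index set of $\mathcal{Q}$ to be the set $\mathcal{Y}_{\mathcal{Q}} = \{y \in \{0,1\}^m \mid q_y \in \mathcal{Q}\}$.

We define the $\mathcal{Q}$-function family $\mathcal{F}_{\mathcal{Q}} = \{f_{\mathcal{Q},x} : \{0,1\}^m \to \{0,1\}\}_{x \in \mathcal{X}}$ as follows: For every possible database row $x \in \mathcal{X}$, the function $f_{\mathcal{Q},x}$ is defined as $f_{\mathcal{Q},x}(y) = q_y(x)$. Given a database $D \in \mathcal{X}^n$ we define the function $f_{\mathcal{Q},D} : \{0,1\}^m \to [0,1]$ where $f_{\mathcal{Q},D}(y) = \frac{1}{n} \sum_{i=1}^{n} f_{\mathcal{Q},x_i}(y)$. When $\mathcal{Q}$ is clear from context we will drop the subscript $\mathcal{Q}$ and simply write $f_x$, $f_D$, and $\mathcal{F}$.

For some intuition about this transformation, when the queries are monotone $k$-way disjunctions on a database $D \in (\{0,1\}^d)^n$, the queries are defined by sets $S \subseteq [d]$, $|S| \leq k$. In this case each query can be represented by the $d$-bit indicator vector of the set $S$, with at most $k$ non-zero entries. Thus we can take $m = d$ and $\mathcal{Y}_{\mathcal{Q}} = \{y \in \{0,1\}^d \mid \sum_{j=1}^{d} y_j \leq k\}$.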
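The change of perspective described above, in which the database becomes a function of the query index, can be sketched as follows for monotone disjunctions. The names `f_row` and `f_database` are hypothetical, chosen for this example; the encoding of the query as an indicator vector follows the discussion in the preceding paragraph.

```python
from fractions import Fraction

def f_row(x, y):
    """Row function for monotone disjunctions.

    Returns 1 if the row x has a 1 in some coordinate selected by the
    indicator vector y, and 0 otherwise.
    """
    return int(any(xj and yj for xj, yj in zip(x, y)))

def f_database(database, y):
    """Database function: the average of the row functions at index y."""
    return Fraction(sum(f_row(x, y) for x in database), len(database))

# Hypothetical database; the query index (1, 1, 0) selects attributes 0 and 1.
D = [(1, 0, 1), (1, 1, 0), (0, 0, 1), (0, 0, 0)]
answer = f_database(D, (1, 1, 0))
```

Evaluating the database function at an indicator vector gives exactly the answer to the disjunction query that the vector encodes, which is why approximating this one function on all low-weight inputs suffices to answer every query in the class.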

2.3 Polynomial Approximations

An -variate real polynomial of degree and () norm can be written as where for every . Recall that there are at most coefficients in an -variate polynomial of total degree . Often we will want to associate a polynomial of degree and norm with its coefficient vector . Specifically, Given a vector and a point we use to indicate the evaluation of the polynomial described by the vector at the point . Observe this is equivalent to computing where is defined as for every , .
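The inner-product view of polynomial evaluation described above can be illustrated with a short Python sketch. Here monomials are indexed by multisets of variable indices of bounded size, which is one possible convention and an assumption of this example (as are the function names); under it, the number of monomials of total degree at most t in m variables is the binomial coefficient C(m + t, t).

```python
from itertools import combinations_with_replacement
from math import prod

def monomial_indices(m, t):
    """All multisets of at most t variable indices from range(m), as sorted tuples."""
    return [idx
            for deg in range(t + 1)
            for idx in combinations_with_replacement(range(m), deg)]

def expand_point(y, t):
    """Vector holding the value of every monomial of degree at most t at the point y."""
    return [prod(y[j] for j in idx) for idx in monomial_indices(len(y), t)]

def evaluate(coeffs, y, t):
    """Evaluate the polynomial with coefficient vector `coeffs` as an inner product."""
    return sum(c * v for c, v in zip(coeffs, expand_point(y, t)))

# Hypothetical polynomial 1 + 2*y0 + 3*y0*y1 in m = 2 variables, degree t = 2.
# Monomial order: (), (0,), (1,), (0,0), (0,1), (1,1).
coeffs = [1, 2, 0, 0, 3, 0]
value = evaluate(coeffs, (1, 1), 2)
```

Because evaluation is linear in the coefficient vector, adding noise to the coefficients changes the value at any 0/1 point by at most the L1 norm of the noise, which is the fact the accuracy analysis relies on.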

Let be the family of all -variate real polynomials of degree and norm . In many cases, the functions can be approximated well on all the indices in by a family of polynomials with low degree and small norm. Formally:

Definition 2.6 (Uniform Approximation by Polynomials).

Given a family of -variate functions and a set , we say that the family uniformly -approximates on if for every , there exists such that .

We say that efficiently and uniformly -approximates if there is an algorithm that takes as input, runs in time , and outputs a coefficient vector such that .

3 From Polynomial Approximations to Data Release Algorithms

In this section we present an algorithm for privately releasing any family of counting queries $\mathcal{Q}$ such that $\mathcal{F}_{\mathcal{Q}}$ can be efficiently and uniformly approximated by polynomials. The algorithm takes an $n$-row database $D$ and, for each row $x \in D$, constructs a polynomial $p_x$ that uniformly approximates the function $f_x$ (recall that $f_x(y) = q_y(x)$, for each $y \in \mathcal{Y}_{\mathcal{Q}}$). From these, it constructs a polynomial $p_D = \frac{1}{n} \sum_{x \in D} p_x$ that uniformly approximates $f_D$. The final step is to perturb each of the coefficients of $p_D$ using noise from a Laplace distribution (Lemma 2.3) and bound the error introduced by the perturbation.

Theorem 3.1 (Releasing Polynomials).

Let be a set of counting queries over , and be the function family (Definition 2.5). Assume that efficiently and uniformly -approximates on (Definition 2.6). Then there is a sanitizer that

  1. is -differentially private,

  2. runs in time , and

  3. is -accurate for for

Proof.

First we construct the sanitizer . See the relevant codebox below.

The Sanitizer

  Input: A database , an explicit family of polynomials , and a parameter .
  For
     Using efficient approximation of by , compute a polynomial that -approximates on .
  Let , where the sum denotes standard entry-wise vector addition.
  Let , where is drawn from an -variate Laplace distribution with parameter (Section 2.1).
  Output: .

Privacy.

We establish that is -differentially private. This follows from the observation that for any two adjacent that differ only on row ,

The last inequality is from the fact that for every $x \in \mathcal{X}$, $p_x$ is a vector of $L_1$ norm at most $T$. Part 1 of the Theorem now follows directly from the properties of the Laplace Mechanism (Lemma 2.3). Now we construct the evaluator.

The Evaluator for the Sanitizer

  Input: A vector and the description of a query .
  Output: . Recall that we view as an -variate polynomial, , and is the evaluation of on the point .
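The sanitizer and evaluator above can be sketched end to end in Python. This is a minimal illustration under stated assumptions, not the paper's implementation: polynomials are stored as dictionaries from monomials (tuples of variable indices) to coefficients, the noise scale is taken to be 2T/(εn) based on the sensitivity argument in the privacy proof, and the toy instance uses 1-way disjunctions, for which the linear polynomial below is exact. All function names are hypothetical.

```python
import random

def laplace_sample(scale, rng):
    """Laplace draw as the difference of two exponentials with mean `scale`."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def sanitize(database, row_poly, monomials, norm_bound, epsilon, rng):
    """Average the per-row coefficient vectors, then perturb every coefficient.

    `row_poly(x)` returns a dict {monomial: coefficient} whose L1 norm is
    at most `norm_bound`; `monomials` lists every monomial that may appear.
    Replacing one row moves the average by at most 2 * norm_bound / n in
    L1 norm, which this sketch uses as the sensitivity.
    """
    n = len(database)
    avg = {mono: 0.0 for mono in monomials}
    for x in database:
        for mono, c in row_poly(x).items():
            avg[mono] += c / n
    scale = 2.0 * norm_bound / (epsilon * n)
    return {mono: c + laplace_sample(scale, rng) for mono, c in avg.items()}

def evaluate(summary, y):
    """Evaluate the released (noisy) polynomial at the query index y."""
    total = 0.0
    for mono, c in summary.items():
        term = c
        for j in mono:
            term *= y[j]
        total += term
    return total

def linear_row_poly(x):
    """Toy case: for 1-way disjunctions, f_x(y) = sum_j x_j * y_j exactly."""
    return {(j,): float(xj) for j, xj in enumerate(x)}

d = 3
D = [(1, 0, 1), (1, 1, 0), (0, 0, 1), (1, 0, 0)] * 2500   # n = 10000 rows
summary = sanitize(D, linear_row_poly, [(j,) for j in range(d)],
                   norm_bound=d, epsilon=1.0, rng=random.Random(1))
estimate = evaluate(summary, (1, 0, 0))   # the true answer is 0.75
```

Note that the raw database is touched only inside `sanitize`; `evaluate` works from the released coefficient dictionary alone, mirroring the separation between sanitizer and evaluator in the text.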

Efficiency.

Next, we show that runs in time . Recall that we assumed the polynomial construction algorithm runs in time . The algorithm needs to run on each of the rows, and then it needs to generate samples from a univariate Laplace distribution with magnitude , which can also be done in time . We also establish that runs in time , observe that needs to expand the input into an appropriate vector of dimension and take the inner product with the vector , whose entries have magnitude . These observations establish Part 2 of the Theorem.

Accuracy. Finally, we analyze the accuracy of the sanitizer . First, by the assumption that uniformly -approximates on , we have

Now we want to establish that for

where the probability is taken over the coins of . Part (3) of the Theorem will then follow by the triangle inequality.

To see that the above statement is true, observe that by the properties of the Laplace mechanism (Theorem 2.3), we have where the probability is taken over the coins of . Given that , it holds that for every ,

The first inequality follows from the fact that every monomial evaluates to or at the point . This completes the proof of the theorem.

Using Lemma 2.4, we can improve the bound on the error at the expense of relaxing the privacy guarantee to $(\varepsilon, \delta)$-differential privacy. This improved error only affects the hidden constants in Theorems 1.1-1.3, so we only state those theorems for $\varepsilon$-differential privacy.

Theorem 3.2 (Releasing Polynomials, -Differential Privacy).

Let be a set of counting queries over , and be the function family (Definition 2.5). Assume that efficiently and uniformly -approximates on (Definition 2.6). Then there is a sanitizer that

  1. is -differentially private,

  2. runs in time ,

  3. is -accurate for for

The proof of this theorem is identical to that of Theorem 3.1, but uses the analysis of the Laplace mechanism from Lemma 2.4 in place of that of Lemma 2.3.

4 Applications

In this section we establish the existence of explicit families of low-degree polynomials approximating the families for some interesting query sets.

4.1 Releasing Monotone Disjunctions

We define the class of monotone -way disjunctions as follows:

Definition 4.1 (Monotone -Way Disjunctions).

Let . The query set of monotone -way disjunctions over contains a query for every . Each query is defined as . The function family contains a function for every .

Thus the family consists of all disjunctions, and the image of , which we denote , consists of all vectors with at most non-zero entries. We can approximate disjunctions over the set using a well-known transformation of the Chebyshev polynomials (see, e.g., [17, Theorem 8] and [15, Claim 5.4]). First we recall the useful properties of the univariate Chebyshev polynomials.

Fact 4.2 (Chebyshev Polynomials).

For every and , there exists a univariate real polynomial of degree such that

  1. ,

  2. for every ,

  3. , and

  4. for every , .

Moreover, such a polynomial can be constructed in time (e.g. using linear programming, though more efficient algorithms are known).

We can use Fact 4.2 to approximate $k$-way monotone disjunctions. Note that our result easily extends to monotone $k$-way conjunctions via the identity
$\bigwedge_{j \in S} x_j = 1 - \bigvee_{j \in S} (1 - x_j)$. Moreover, it extends to non-monotone conjunctions and disjunctions: we may extend the data universe as in [15, Theorem 1.2] to $\{0,1\}^{2d}$, and include the negation of each item in the original domain. Non-monotone conjunctions over domain $\{0,1\}^d$ correspond to monotone conjunctions over the expanded domain $\{0,1\}^{2d}$.

The next lemma shows that can be efficiently and uniformly approximated by polynomials of low degree and low norm. The statement is a well-known application of Chebyshev polynomials, and a similar statement appears in [15] but without bounding the running time of the construction or a bound on the norm of the polynomials. We include the statement and a proof for completeness, and to verify the additional properties we need.

Lemma 4.3 (Approximating by polynomials, similar to [15]).

For every such that and every , the family of -variate real polynomials of degree and norm efficiently and uniformly -approximates the family on the set .

Proof.

The algorithm for constructing the polynomials appears in the codebox below.

  Input: a vector .
  Let: be the polynomial described in Lemma 4.2.
  Let: be the expansion of
  Output: .

Since is a degree- polynomial applied to a degree-1 polynomial (in the variables ), its degree is at most . To see the stated norm bound, note that every monomial of total degree in comes from the expansion of , and every coefficient in this expansion is a non-negative integer less than or equal to . In , each of these terms is multiplied by (the -th coefficient of ). Thus the norm of is at most . To see that is efficient, note that we can find every coefficient of of total degree by expanding into all of its terms and multiplying by , which can be done in time , as is required.

To see that -approximates , observe that for every , and for every , This completes the proof. ∎
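The expansion step in the proof above can be sketched in Python. The sketch takes the coefficients of an arbitrary univariate polynomial and expands its composition with the inner sum of products of row bits and query bits into monomials in the query variables. The Chebyshev-based polynomial of Fact 4.2 is not reproduced here; as a labeled stand-in, the example uses the quadratic (3z - z^2)/2, which equals the OR function exactly on inputs 0, 1 and 2. The function names and the dictionary representation are assumptions of this example.

```python
from itertools import product
from fractions import Fraction

def expand_row_polynomial(x, g_coeffs):
    """Expand g(sum_j x_j * y_j) into monomials in the variables y.

    `g_coeffs[i]` is the coefficient of z**i in the univariate polynomial g.
    The result maps a sorted tuple of variable indices (with repetition)
    to its coefficient.
    """
    support = [j for j, xj in enumerate(x) if xj]
    poly = {}
    for i, c in enumerate(g_coeffs):
        # (sum_{j in support} y_j)**i has one term per ordered choice of
        # i indices, so each monomial is hit at most i! times.
        for choice in product(support, repeat=i):
            mono = tuple(sorted(choice))
            poly[mono] = poly.get(mono, 0) + c
    return poly

def evaluate(poly, y):
    """Evaluate a {monomial: coefficient} polynomial at the point y."""
    total = 0
    for mono, c in poly.items():
        term = c
        for j in mono:
            term *= y[j]
        total += term
    return total

# Stand-in for the polynomial of Fact 4.2 (exact for at most 2 selected attributes).
g = [Fraction(0), Fraction(3, 2), Fraction(-1, 2)]
p_x = expand_row_polynomial((1, 0, 1, 1), g)
```

The enumeration over ordered index choices makes the running time grow like the support size raised to the degree, in line with the bound discussed in the proof, and the same loop shows why each degree-i coefficient is the i-th coefficient of the univariate polynomial times an integer of size at most i!.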

Theorem 1.1 in the introduction follows by combining Theorem 3.1 and Lemma 4.3.

4.2 Releasing Monotone -of- Queries

We define the class of monotone -of- queries as follows:

Definition 4.4 (Monotone -of- Queries).

Let and such that . The query set of monotone -of- queries over contains a query for every . Each query is defined as . The function family contains a function for every .

Sherstov [20, Lemma 3.11] gives an explicit construction of polynomials that can be used to approximate the family over with low degree. It can be verified by inspecting the construction that the coefficients of the resulting polynomial are not too large.

Lemma 4.5 ([20]).

For every such that and , there exists a univariate polynomial of degree such that and

  1. ,

  2. for every ,

  3. for every , , and

  4. for every , .

Moreover, can be constructed in time (e.g. using linear programming).

For completeness we include a proof of Lemma 4.5 in the appendix. We can use these polynomials to approximate monotone -of- queries.

Lemma 4.6 (Approximating on ).

For every such that and every , the family of -variate real polynomials of degree and norm efficiently and uniformly -approximates the family on the set .

Proof.

The construction and proof are identical to those of Lemma 4.3, with the polynomials of Lemma 4.5 in place of the polynomials described in Fact 4.2. ∎

Theorem 1.2 in the introduction now follows by combining Theorem 3.1 and Lemma 4.6. Note that our result also extends easily to non-monotone $r$-of-$k$ queries in the same manner as Theorem 1.1.

Remark 4.7.

Using the principle of inclusion-exclusion, the answer to a monotone -of- query can be written as a linear combination of the answers to monotone -way disjunctions. Thus, a sanitizer that is -accurate for monotone -way disjunctions implies a sanitizer that is -accurate for monotone -of- queries. However, combining this implication with Theorem 1.1 yields a sanitizer with running time , which has a worse dependence on than what we achieve in Theorem 1.2.

4.3 Releasing Decision Lists

A length-$k$ decision list over $m$ variables is a function which can be written in the form “if $\ell_1$ then output $b_1$ else $\cdots$ else if $\ell_k$ then output $b_k$ else output $b_{k+1}$,” where each $\ell_i$ is a boolean literal over the input variables, and each $b_i$ is an output bit in $\{0,1\}$. Note that length-$k$ decision lists strictly generalize $k$-way disjunctions and conjunctions. We use $DL_{k,m}$ to denote the set of all length-$k$ decision lists over $m$ binary input variables.
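A decision list and the corresponding counting query can be sketched as follows. The encoding of a literal as an (index, wanted value) pair, the function names, and the three-attribute college example (loosely following the illustration in the introduction, with a default output of 0 chosen for this sketch) are all assumptions made for the example.

```python
from fractions import Fraction

def evaluate_decision_list(rules, default, y):
    """Evaluate "if l_1 then b_1 else if l_2 then b_2 ... else default" on input y.

    Each rule is ((index, wanted_value), output_bit): the literal holds
    when y[index] == wanted_value, so wanted_value = 0 encodes a negated
    variable.
    """
    for (index, wanted), output in rules:
        if y[index] == wanted:
            return output
    return default

def decision_list_query(database, y):
    """Fraction of decision lists in the database that output 1 on input y."""
    ones = sum(evaluate_decision_list(rules, default, y)
               for rules, default in database)
    return Fraction(ones, len(database))

# Attributes of a college: (top_ten, rural, good_basketball).
student_a = ([((0, 1), 1), ((1, 1), 0), ((2, 1), 1)], 0)
student_b = ([((2, 1), 1)], 0)

# A monotone disjunction y0 OR y1 written as a decision list.
disjunction = ([((0, 1), 1), ((1, 1), 1)], 0)

students = [student_a, student_b]
willing = decision_list_query(students, (0, 1, 1))
```

In this setting the roles are reversed relative to marginals: each database row is a decision list, and each query is a single bit string describing one input on which all the lists are evaluated.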

Definition 4.8 (Evaluations of Length- Decision Lists).

Let such that and . The query set of evaluations of length- decision lists contains a query