
Differentially Private Database Release via Kernel Mean Embeddings

Matej Balog, Ilya Tolstikhin (Empirical Inference Department, Max Planck Institute for Intelligent Systems, Tübingen), Bernhard Schölkopf (Empirical Inference Department, Max Planck Institute for Intelligent Systems, Tübingen)
Abstract

We lay theoretical foundations for new database release mechanisms that allow third-parties to construct consistent estimators of population statistics, while ensuring that the privacy of each individual contributing to the database is protected. The proposed framework rests on two main ideas. First, releasing (an estimate of) the kernel mean embedding of the data generating random variable instead of the database itself still allows third-parties to construct consistent estimators of a wide class of population statistics. Second, the algorithm can satisfy the definition of differential privacy by basing the released kernel mean embedding on entirely synthetic data points, while controlling accuracy through the metric available in a Reproducing Kernel Hilbert Space. We describe two instantiations of the proposed framework, suitable under different scenarios, and prove theoretical results guaranteeing differential privacy of the resulting algorithms and the consistency of estimators constructed from their outputs.

1 Introduction

We aim to contribute to the body of research on the trade-off between releasing datasets from which publicly beneficial statistical inferences can be drawn, and protecting the privacy of individuals who contribute to such datasets. Currently the most successful formalization of protecting user privacy is differential privacy [Dwork and Roth, 2014], a definition that any algorithm operating on a database may or may not satisfy. An algorithm that does satisfy the definition ensures that a particular individual does not lose too much privacy by deciding to contribute to the database on which the algorithm operates.

While differentially private algorithms for releasing entire databases have been studied previously [Blum et al., 2008, Wasserman and Zhou, 2010, Zhou et al., 2009], most algorithms focus on releasing a privacy-protected version of a particular summary statistic, or of a statistical model trained on the private dataset. In this work we revisit the more difficult non-interactive, or offline, setting, where the database owner aims to release a privacy-protected version of the entire database without knowing what statistics third-parties may wish to compute in the future.

In our new framework we propose to use the kernel mean embedding [Smola et al., 2007] as an intermediate representation of a database. It is (1) sufficiently rich in the sense that it captures a wide class of statistical properties of the data, while at the same time (2) it lives in a Reproducing Kernel Hilbert Space (RKHS), where it can be handled mathematically in a principled way and privacy-protected in a unified manner, independently of the type of data appearing in the database.

Although kernel mean embeddings are functions in an abstract Hilbert space, in practice they can be (at least approximately) represented using a possibly weighted set of data points in input space (i.e., a set of database rows). The privacy-protected kernel mean embedding is released to the public in this representation, but using synthetic data points instead of the private ones. As a result, our framework can be seen as leading to synthetic database algorithms.

We validate our approach by instantiating two concrete algorithms and proving that they output consistent estimators of the true kernel mean embedding of the data generating process, while satisfying the definition of differential privacy. The consistency results ensure that third-parties can carry out a wide variety of statistically founded computations on the released data, such as constructing consistent estimators of population statistics, estimating the Maximum Mean Discrepancy (MMD) between distributions and performing two-sample testing [Gretton et al., 2012], or using the data in the kernel probabilistic programming framework for random variable arithmetics [Schölkopf et al., 2015, Simon-Gabriel et al., 2016, Section 3], repeatedly and without limit, and without being able to, or having to worry about, violating user privacy.

One of our algorithms is especially suited to the interesting scenario where a (small) subset of a database has already been published. This situation can arise in a wide variety of settings, for example, due to weaker privacy protections in the past, due to a leak, or due to the presence of an incentive, financial or otherwise, for users to publish their data. In such a situation our algorithm can provide a principled approach for reweighting the published dataset in such a way that the accuracy of statistical inferences on this dataset benefits from the larger sample size (including the private data), while maintaining differential privacy for the undisclosed data.

In summary, the contributions of this paper are:

  1. A framework for designing database release algorithms with the guarantee of differential privacy, using kernel mean embeddings as intermediate database representations, so that the RKHS metric can be used to control accuracy of the released synthetic database in a principled manner (Section 3).

  2. Two concrete instantiations of our framework in the form of two synthetic database release algorithms, with proofs of their consistency and differential privacy (Sections 4 and 5).

2 Background

2.1 Differential privacy

Definition 1 (Differential privacy [Dwork, 2006]).

An algorithm $\mathcal{A}$ is said to be $(\varepsilon, \delta)$-differentially private if for all neighbouring databases $\mathcal{D}, \mathcal{D}'$ (differing in at most one element) and all measurable subsets $S$ of the co-domain of $\mathcal{A}$,

$\Pr[\mathcal{A}(\mathcal{D}) \in S] \leq e^{\varepsilon} \Pr[\mathcal{A}(\mathcal{D}') \in S] + \delta. \qquad (1)$

The parameter $\varepsilon$ controls the amount of information the algorithm can leak about an individual, while a positive value of $\delta$ allows the algorithm to produce an unlikely output that leaks more information, but only with probability up to $\delta$. This notion is sometimes called approximate differential privacy; an algorithm that is $(\varepsilon, 0)$-differentially private is simply said to be $\varepsilon$-differentially private. Note that any non-trivial differentially private algorithm must be randomized; the definition asserts that the distribution of algorithm outputs is not too sensitive to changing one row in the database.

When the output of an algorithm is a finite vector $f(\mathcal{D}) \in \mathbb{R}^d$, two standard random perturbation mechanisms are available to make the output differentially private: the Laplace mechanism and the Gaussian mechanism. As the perturbation needs to mask the contribution of a particular entry of the database $\mathcal{D}$, the scale of the added noise is closely linked to the notion of sensitivity, measuring how much the algorithm output can change due to changing a single data point:

$\Delta_p := \sup_{\mathcal{D} \sim \mathcal{D}'} \| f(\mathcal{D}) - f(\mathcal{D}') \|_p \quad \text{for } p \in \{1, 2\}, \qquad (2)$

where the supremum ranges over pairs of neighbouring databases.

The Laplace mechanism adds i.i.d. $\mathrm{Lap}(\Delta_1 / \varepsilon)$ noise to each of the $d$ coordinates of the output vector and ensures pure $\varepsilon$-differential privacy, while the Gaussian mechanism adds i.i.d. $\mathcal{N}(0, \sigma^2)$ noise to each coordinate, where $\sigma \geq \Delta_2 \sqrt{2 \ln(1.25/\delta)} / \varepsilon$, and ensures $(\varepsilon, \delta)$-differential privacy. Applying these mechanisms thus requires computing (a provable upper bound on) the relevant sensitivity.
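To make the two mechanisms concrete, the following is a minimal NumPy sketch; the function names are our own illustration, and the calibration of $\sigma$ assumes the standard bound valid for $\varepsilon < 1$ [Dwork and Roth, 2014].

```python
import numpy as np

def laplace_mechanism(v, l1_sensitivity, eps, rng=None):
    """Pure eps-DP: add i.i.d. Lap(Delta_1 / eps) noise to each coordinate."""
    if rng is None:
        rng = np.random.default_rng()
    return v + rng.laplace(scale=l1_sensitivity / eps, size=v.shape)

def gaussian_mechanism(v, l2_sensitivity, eps, delta, rng=None):
    """(eps, delta)-DP for eps < 1: add i.i.d. N(0, sigma^2) noise with
    sigma = Delta_2 * sqrt(2 ln(1.25 / delta)) / eps."""
    if rng is None:
        rng = np.random.default_rng()
    sigma = l2_sensitivity * np.sqrt(2.0 * np.log(1.25 / delta)) / eps
    return v + rng.normal(scale=sigma, size=v.shape)
```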

Differential privacy is preserved under post-processing: if an algorithm is -differentially private, then so is its sequential composition with any other algorithm that does not have direct or indirect access to the private database  [Dwork and Roth, 2014].

2.2 Kernels, RKHS, and kernel mean embeddings

A kernel on a non-empty set (data type) $\mathcal{X}$ is a binary positive-definite function $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$. Intuitively it can be thought of as expressing the similarity between any two elements of $\mathcal{X}$. The literature on kernels is vast and their properties are well studied [Schölkopf and Smola, 2002]; many kernels are known for a large variety of data types such as vectors, strings, time series, graphs, etc., and kernels can be composed to yield valid kernels for composite data types (e.g., the type of a database row containing both numerical and string data).

The kernel mean embedding (KME) of an $\mathcal{X}$-valued random variable $X$ in the RKHS $\mathcal{H}_k$ is the function $\mu^k_X := \mathbb{E}_X[k(\cdot, X)]$, which is defined whenever $\mathbb{E}_X[\sqrt{k(X, X)}] < \infty$ [Smola et al., 2007]. Several popular kernels have been proven to be characteristic [Fukumizu et al., 2008], in which case the map $P_X \mapsto \mu^k_X$, where $P_X$ is the distribution of $X$, is injective. This means that no information about the distribution of $X$ is lost when passing to its kernel mean embedding $\mu^k_X$.

In practice, the kernel mean embedding of a random variable $X$ is approximated using a sample $x_1, \ldots, x_N$ drawn from $X$, which can be used to construct an empirical kernel mean embedding of $X$ in the RKHS: a function given by $\hat\mu^k_X := \frac{1}{N} \sum_{n=1}^{N} k(\cdot, x_n)$. When the observations are i.i.d., under a boundedness condition $\hat\mu^k_X$ converges to the true kernel mean embedding at rate $O_P(N^{-1/2})$, independently of the dimension of $\mathcal{X}$ [Lopez-Paz et al., 2015]. While $\hat\mu^k_X$ formally coincides with the kernel density estimate for $X$, our approach relies on the metric of the RKHS $\mathcal{H}_k$ in which these kernel mean embeddings live. The RKHS is a space of functions $\mathcal{X} \to \mathbb{R}$, endowed with an inner product that satisfies the reproducing property $\langle f, k(\cdot, x) \rangle_{\mathcal{H}_k} = f(x)$ for all $f \in \mathcal{H}_k$ and $x \in \mathcal{X}$. The inner product induces a norm $\| \cdot \|_{\mathcal{H}_k}$, which can be used to measure the distance $\| \mu^k_X - \mu^k_Z \|_{\mathcal{H}_k}$ between distributions of random variables $X$ and $Z$. This can be exploited for various purposes such as two-sample tests [Gretton et al., 2012], independence testing [Gretton et al., 2005], or one can attempt to minimize this distance in order to match one distribution to another.
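As a running concrete example for the sketches in this document (our illustration, with the Gaussian RBF kernel as an assumed choice), both the empirical embeddings and the RKHS distance between them reduce to kernel matrix evaluations:

```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    """k(a, b) = exp(-gamma * ||a - b||^2); a normalized kernel, k(x, x) = 1."""
    sq = (A**2).sum(1)[:, None] + (B**2).sum(1)[None, :] - 2.0 * A @ B.T
    return np.exp(-gamma * sq)

def rkhs_dist2(X, Y, gamma=1.0):
    """Squared RKHS distance ||mu_hat_X - mu_hat_Y||^2 between the empirical
    kernel mean embeddings of two samples (the squared empirical MMD)."""
    return (rbf_kernel(X, X, gamma).mean() + rbf_kernel(Y, Y, gamma).mean()
            - 2.0 * rbf_kernel(X, Y, gamma).mean())
```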

An example of such minimization are reduced set methods [Burges, 1996, Schölkopf and Smola, 2002, Chapter 18], which replace a collection of points $x_1, \ldots, x_N$ with a (potentially smaller) weighted set $\{(w_m, z_m)\}_{m=1}^{M}$, where the points $z_m$ can, but need not equal any of the $x_n$'s, such that the kernel mean embedding computed using the reduced set is close to the kernel mean embedding computed using the original set, as measured by the RKHS norm:

$\Big\| \sum_{m=1}^{M} w_m k(\cdot, z_m) - \frac{1}{N} \sum_{n=1}^{N} k(\cdot, x_n) \Big\|_{\mathcal{H}_k}. \qquad (3)$

Reduced set methods are usually motivated by the computational savings arising when $M \ll N$; we will invoke them mainly to replace a collection of private data points with a (possibly weighted) set of synthetic data points.
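For fixed expansion points $z_1, \ldots, z_M$, minimizing (3) over the weights alone is a quadratic problem with a closed-form solution; a sketch, reusing rbf_kernel from the block above and adding a small ridge term of our own for numerical stability:

```python
def reduced_set_weights(X, Z, gamma=1.0, ridge=1e-8):
    """argmin_w || sum_m w_m k(., z_m) - (1/N) sum_n k(., x_n) ||_H
    solves the linear system K_zz w = K_zx (1/N, ..., 1/N)."""
    K_zz = rbf_kernel(Z, Z, gamma)
    b = rbf_kernel(Z, X, gamma).mean(axis=1)   # K_zx averaged over columns
    return np.linalg.solve(K_zz + ridge * np.eye(len(Z)), b)
```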

3 Framework

3.1 Problem formulation

Throughout this work, we assume the following setup. A database curator wishes to publicly release a database $\mathcal{D} = (x_1, \ldots, x_N)$ containing private data about $N$ individuals, with each data point (database row) taking values in a non-empty set $\mathcal{X}$. The set $\mathcal{X}$ can be arbitrarily rich, for example, it could be a product of Euclidean spaces, integer spaces, sets of strings, etc.; we only require availability of a kernel function $k$ on $\mathcal{X}$. We assume that the rows in the database can be thought of as i.i.d. observations from some $\mathcal{X}$-valued data-generating random variable $X$ (but see Section 7 for a discussion about relaxing this assumption). The database curator, wishing to protect the privacy of individuals in the database, seeks a database release mechanism that satisfies the definition of $(\varepsilon, \delta)$-differential privacy, with $\varepsilon > 0$ and $\delta \geq 0$ given. The main purpose of releasing the database is to allow third-parties to construct estimators of population statistics (i.e., properties of the distribution of $X$), but it is not known at the time of release what statistics the third-parties will be interested in.

To lighten notation, henceforth we drop the superscript $k$ from kernel mean embeddings (such as $\mu^k_X$) and the subscript $k$ from the RKHS $\mathcal{H}_k$, when $k$ is the kernel on $\mathcal{X}$ chosen by the database curator.

3.2 Algorithm template

We propose the following general algorithm template for differentially private database release:

  1. Construct a consistent estimator $\hat\mu_X$ of the KME $\mu_X$ of $X$ using the private database.

  2. Obtain a perturbed version $\tilde\mu_X$ of the constructed estimator to ensure differential privacy.

  3. Release a (potentially approximate) representation of $\tilde\mu_X$ in terms of a (possibly weighted) dataset $\{(w_m, z_m)\}_{m=1}^{M}$.

The released representation should be such that $\sum_{m=1}^{M} w_m k(\cdot, z_m)$ is a consistent estimator of the true KME $\mu_X$, i.e., such that the RKHS distance between the two converges to $0$ in probability as the private database size $N$, and together with it the synthetic database size $M$, go to infinity.

Each step of this template admits several possibilities. For the first step we have discussed the standard empirical kernel mean embedding $\hat\mu_X$ with i.i.d. observations of $X$, but the framework remains valid with improved estimators such as kernel-based quadrature [Chen et al., 2010] or the shrinkage kernel mean estimators of Muandet et al. [2016].
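As one illustration of such an improved estimator, here is a minimal sketch of greedy kernel herding in the spirit of Chen et al. [2010], selecting points from a finite candidate pool (a NumPy array, in our notation) so that their uniform-weight embedding tracks $\hat\mu_X$; note this step alone is not differentially private, it only instantiates step 1 of the template. It reuses rbf_kernel from the Section 2.2 sketch.

```python
def kernel_herding(X, candidates, T, gamma=1.0):
    """Greedily pick T points whose uniform-weight empirical embedding
    approximates mu_hat_X, evaluated over a finite candidate pool."""
    mu = rbf_kernel(candidates, X, gamma).mean(axis=1)  # mu_hat_X at candidates
    K = rbf_kernel(candidates, candidates, gamma)
    chosen, running = [], np.zeros(len(candidates))
    for t in range(T):
        j = int(np.argmax(mu - running / (t + 1)))      # herding criterion
        chosen.append(j)
        running += K[:, j]   # accumulates sum_s k(., z_s) at the candidates
    return candidates[chosen]
```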

As the kernel mean embeddings $\mu_X$ and $\hat\mu_X$ live in the RKHS $\mathcal{H}$ of the kernel $k$, a natural mechanism for privatising $\hat\mu_X$ in the second step would be to follow Hall et al. [2013] and add pointwise a suitably scaled sample path $g$ of a Gaussian process with covariance function $k$ to $\hat\mu_X$. This does ensure $(\varepsilon, \delta)$-differential privacy of the resulting function $\tilde\mu_X = \hat\mu_X + g$, but unfortunately $\tilde\mu_X \notin \mathcal{H}$, because the RKHS norm of a Gaussian process sample path with the same kernel is infinite almost surely [Rasmussen and Williams, 2005]. While our framework allows pursuing this direction by, for example, moving to a larger function space that does contain the Gaussian process sample path, in this work we will present algorithms that achieve differential privacy by mapping into a finite-dimensional Hilbert space and then employing the standard Laplace or Gaussian mechanisms to the finite coordinate vector.

While differential privacy is preserved under post-processing, the third step requires some care to ensure that private data is not leaked. Specifically, when several possible (approximate) representations of $\tilde\mu_X$ in terms of a weighted dataset are possible, committing to a particular one reveals more information than just the function $\tilde\mu_X$ (consider, for example, the extreme case where the representation would be in terms of the private points $x_1, \ldots, x_N$). One thus needs to either control the privacy leak due to choosing a representation in a way that depends on the private data, or, as we do in our concrete algorithms below, choose a representation independently of the private data (but still minimizing its RKHS distance to the privacy-protected $\tilde\mu_X$).

3.3 Versatility

Algorithms in our framework release a possibly weighted synthetic dataset $\{(w_m, z_m)\}_{m=1}^{M}$ such that $\sum_{m=1}^{M} w_m k(\cdot, z_m)$ is a consistent estimator of the true kernel mean embedding $\mu_X$ of the data generating random variable $X$. This allows third-parties to perform a wide spectrum of statistical computation, all without having to worry about violating differential privacy:

  1. Kernel probabilistic programming [Schölkopf et al., 2015]: the versatility of our approach is greatly expanded thanks to the result of Simon-Gabriel et al. [2016], who showed that under technical conditions, applying a continuous function $f$ to all points in the synthetic dataset yields a consistent estimator of the kernel mean embedding of the transformed random variable $f(X)$, even when the points $z_m$ are not i.i.d. (as they may not be, depending on the particular synthetic database release algorithm).

  2. Consistent estimation of population statistics: for any RKHS function $f \in \mathcal{H}$, we have $\mathbb{E}[f(X)] = \langle f, \mu_X \rangle_{\mathcal{H}}$, so a consistent estimator of $\mu_X$ yields a consistent estimator of the expectation of $f(X)$. It can be evaluated using the reproducing kernel property (see the code sketch after this list):

    $\Big\langle f, \sum_{m=1}^{M} w_m k(\cdot, z_m) \Big\rangle_{\mathcal{H}} = \sum_{m=1}^{M} w_m f(z_m). \qquad (4)$

    For example, approximating the indicator function $\mathbf{1}_S$ of a set $S \subseteq \mathcal{X}$ with functions in the RKHS allows estimating probabilities $\Pr[X \in S] = \mathbb{E}[\mathbf{1}_S(X)]$ (note that $\mathbf{1}_S$ itself may not be an element of the RKHS).

  3. MMD estimation and two-sample testing [Gretton et al., 2012]: Given another random variable $Y$ on $\mathcal{X}$, one can consistently estimate the Maximum Mean Discrepancy (MMD) distance $\| \mu_X - \mu_Y \|_{\mathcal{H}}$ between the distributions of $X$ and $Y$, and in particular construct a two-sample test based on this distance. Given a sample $y_1, \ldots, y_L$ of $Y$, the estimator is

    $\Big\| \sum_{m=1}^{M} w_m k(\cdot, z_m) - \frac{1}{L} \sum_{l=1}^{L} k(\cdot, y_l) \Big\|_{\mathcal{H}},$

    which can again be evaluated using the reproducing property.

  4. Subsequent use of synthetic data: Since the output of the algorithm is a (possibly weighted) database, third-parties are free to use this data for arbitrary purposes, such as training any machine learning model on this data. Models trained purely on this data can be released with differential privacy guaranteed; however, the accuracy of such models on real data remains an empirical question that is beyond the scope of this work.
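The first three uses above reduce to finite sums of kernel evaluations. A short sketch in our notation ($w$ and $Z$ are the released weights and synthetic points; rbf_kernel is the Section 2.2 sketch; $f$ is assumed vectorized over rows):

```python
def estimate_expectation(f, w, Z):
    """E[f(X)] ~ <f, sum_m w_m k(., z_m)> = sum_m w_m f(z_m), which is exact
    for f in the RKHS by the reproducing property (4)."""
    return np.dot(w, f(Z))

def mmd2_weighted(w, Z, Y, gamma=1.0):
    """|| sum_m w_m k(., z_m) - (1/L) sum_l k(., y_l) ||^2 for a fresh sample Y."""
    return (w @ rbf_kernel(Z, Z, gamma) @ w
            + rbf_kernel(Y, Y, gamma).mean()
            - 2.0 * w @ rbf_kernel(Z, Y, gamma).mean(axis=1))
```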

An orthogonal spectrum of versatility arises from the fact that the third step in the algorithm template can constrain the released dataset to be more convenient or more computationally efficient for further processing. For example, one could fix the weights $w_m$ to be uniform to obtain an unweighted dataset, or replace an expensive data type with a cheaper subset, such as requesting floats instead of doubles in the $z_m$'s. All this can be performed while the RKHS distance is available to control the accuracy between $\tilde\mu_X$ and its released representation.

3.4 Concrete algorithms

As a first illustrative example, we describe how a particular case of an existing, but inefficient synthetic database algorithm already fits into our framework. The exponential mechanism [McSherry and Talwar, 2007] is a general mechanism for ensuring $\varepsilon$-differential privacy of an algorithm, and in our setting it operates as follows: given a similarity measure $u$ between (private) databases of size $N$ and (synthetic) databases of size $M$, output a random (synthetic) database $\mathcal{D}'$ with probability proportional to $\exp(\varepsilon\, u(\mathcal{D}, \mathcal{D}') / (2 \Delta_u))$, where $\mathcal{D}$ is the actual private database and $\Delta_u$ is the sensitivity of $u$ w.r.t. $\mathcal{D}$. This ensures $\varepsilon$-differential privacy [McSherry and Talwar, 2007].

To fit into our framework, we can take $u(\mathcal{D}, \mathcal{D}')$ to be the negative RKHS distance between the kernel mean embeddings computed using $\mathcal{D}$ and $\mathcal{D}'$, and achieve differential privacy by releasing $\mathcal{D}'$ with probability proportional to $\exp(-\varepsilon \| \hat\mu_{\mathcal{D}} - \hat\mu_{\mathcal{D}'} \|_{\mathcal{H}} / (2 \Delta_u))$. This solves steps 2 and 3 of our general algorithm template simultaneously, as it directly samples a concrete representation of a "perturbed" kernel mean embedding. The algorithm essentially corresponds to the SmallDB algorithm of Blum et al. [2008], except for choosing the RKHS distance as a well-studied similarity measure between two databases.
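For a finite candidate set of synthetic databases, this instance of the exponential mechanism is a few lines; the following sketch uses hypothetical names, where `u` would be the negative empirical RKHS distance (computable with rkhs_dist2 from Section 2.2) and `sensitivity` its sensitivity:

```python
def exponential_mechanism(candidates, D_private, u, sensitivity, eps, rng):
    """Sample one synthetic database with probability proportional to
    exp(eps * u(D, D') / (2 * sensitivity)). Exact sampling requires
    scoring the entire candidate space."""
    scores = np.array([u(D_private, C) for C in candidates])
    logits = eps * scores / (2.0 * sensitivity)
    logits -= logits.max()              # numerical stability
    p = np.exp(logits)
    p /= p.sum()
    return candidates[rng.choice(len(candidates), p=p)]
```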

The principal issue with this algorithm is its computational infeasibility except in trivial cases, as it requires sampling from a probability distribution supported on all potential synthetic databases, while employing an approximate sampling scheme runs the risk of breaking the differential privacy guarantee of the exponential mechanism. In Sections 4 and 5, respectively, we describe two concrete synthetic database release algorithms that may possess failure modes where they become inefficient, but employing approximations in those cases can only affect their statistical accuracy, not the promise of differential privacy.

4 Perturbation in synthetic-data subspace

In this section we describe an instantiation of the framework proposed in Section 3 that achieves differential privacy of the kernel mean embedding by projecting it onto a finite-dimensional subspace of the RKHS spanned by feature maps $k(\cdot, z_1), \ldots, k(\cdot, z_M)$ of synthetic data points $z_1, \ldots, z_M$, and perturbing the resulting finite coordinate vector. To ensure differential privacy, the synthetic data points are chosen independently of the private database. As a result, statistical efficiency of this approach will depend on the choice of synthetic data points, with efficiency going up if there are enough synthetic data points to capture the patterns in the private data. Therefore this algorithm is especially suited to the scenario discussed in Section 1, where part of the database (or a similar one) has already been published, as this can serve as a good starting set for the synthetic data points.

The setting where some observations from have already been released highlights the fact that differential privacy only protects against additional privacy violation due to an individual deciding to contribute to the private database; if a particular user’s data has already been published, differential privacy does not protect against privacy violations based on exploiting this previously published data.

The algorithm is formalized as Algorithm 1 in the box below. Lines 1-2 choose the synthetic data points independently of the private data (only using the size $N$ of the database $\mathcal{D}$). Lines 3-4 construct the linear subspace $F$ of $\mathcal{H}$ spanned by feature maps of the chosen synthetic data points, and compute a finite basis for it. Only then is the private data accessed: the empirical kernel mean embedding $\hat\mu_X$ is computed (line 5), then projected onto the subspace $F$ and expanded in terms of the precomputed basis (line 6). The basis coefficients of the projection are then perturbed to achieve differential privacy (line 7), and the perturbed element is then re-expressed in terms of the spanning set containing feature maps of synthetic data points (line 8). This expansion is finally released to the public (line 9).

0:  database $\mathcal{D} = (x_1, \ldots, x_N)$, kernel $k$, privacy parameters $\varepsilon > 0$ and $\delta > 0$
0:  $(\varepsilon, \delta)$-differentially private, approximate version of the RKHS embedding $\hat\mu_X$ of $\mathcal{D}$
1:  $M := M(N)$, number of synthetic data points to use
2:  $z_1, \ldots, z_M \in \mathcal{X}$ initialized deterministically or randomly from a distribution $\pi$ on $\mathcal{X}$
3:  $F := \mathrm{span}\{ k(\cdot, z_1), \ldots, k(\cdot, z_M) \} \subseteq \mathcal{H}$
4:  $e_1, \ldots, e_{M'} \leftarrow$ orthonormal basis of $F$ (obtained using, e.g., Gram-Schmidt)
5:  $\hat\mu_X := \frac{1}{N} \sum_{n=1}^{N} k(\cdot, x_n)$
6:  $\sum_{j=1}^{M'} \alpha_j e_j \leftarrow$ projection of $\hat\mu_X$ onto $F$
7:  $\beta := \alpha + \mathcal{N}(0, \sigma^2 I_{M'})$ with $\sigma = \frac{2}{N} \sqrt{2 \ln(1.25/\delta)} / \varepsilon$, an $(\varepsilon, \delta)$-differentially private version of $\alpha$
8:  $\sum_{m=1}^{M} w_m k(\cdot, z_m) := \sum_{j=1}^{M'} \beta_j e_j$, re-expressed in terms of the $k(\cdot, z_m)$'s
9:  return $\{(w_m, z_m)\}_{m=1}^{M}$
Algorithm 1 Differentially private database release via synthetic data subspace of the RKHS
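A compact sketch of Algorithm 1 for vector-valued data with the RBF kernel (so $k(x, x) = 1 \leq 1$), constructing the orthonormal basis via an eigendecomposition of the synthetic Gram matrix instead of Gram-Schmidt; the helper names are ours and rbf_kernel is the sketch from Section 2.2:

```python
def algorithm1_release(X, Z, eps, delta, gamma=1.0, tol=1e-10, rng=None):
    """Project the empirical KME onto span{k(., z_m)}, privatize the basis
    coefficients with the Gaussian mechanism (sensitivity 2/N, Lemma 9),
    and return weights w so that sum_m w_m k(., z_m) is the released KME."""
    if rng is None:
        rng = np.random.default_rng()
    N = X.shape[0]
    K_zz = rbf_kernel(Z, Z, gamma)
    c = rbf_kernel(Z, X, gamma).mean(axis=1)   # c_m = <k(., z_m), mu_hat_X>
    lam, V = np.linalg.eigh(K_zz)
    keep = lam > tol                           # effective basis size M' <= M
    U = V[:, keep] / np.sqrt(lam[keep])        # e_j = sum_m U[m, j] k(., z_m)
    alpha = U.T @ c                            # coefficients of the projection
    sigma = (2.0 / N) * np.sqrt(2.0 * np.log(1.25 / delta)) / eps
    beta = alpha + rng.normal(scale=sigma, size=alpha.shape)  # Gaussian mechanism
    return U @ beta                            # weights w; release (w, Z)
```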

Line 1 stipulates that the number of synthetic data points satisfies $M \to \infty$ as $N \to \infty$, but asymptotically slower than $N^2$. This is to ensure that the privatization noise added in the subspace to each coordinate is small enough overall to preserve consistency, as stated below.
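To see where this condition comes from, here is our unpacking of the relevant step in the proof of Theorem 2, assuming the noise scale $\sigma_N = \frac{2}{N}\sqrt{2\ln(1.25/\delta)}/\varepsilon$ from line 7, the orthonormality of $e_1, \ldots, e_{M'}$, and $\varepsilon, \delta$ held fixed:

```latex
\mathbb{E}\,\bigl\|\tilde{\mu}_X - \Pi_F\,\hat{\mu}_X\bigr\|_{\mathcal{H}}^2
  = \sum_{j=1}^{M'} \mathbb{E}\bigl[(\beta_j - \alpha_j)^2\bigr]
  = M'\,\sigma_N^2
  \le M \cdot \frac{8\ln(1.25/\delta)}{N^2 \varepsilon^2}
  \xrightarrow[N \to \infty]{} 0
  \quad\text{whenever } M = o(N^2).
```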

The following Theorem 2 assures us that the output of Algorithm 1 produces a consistent estimator of the true kernel mean embedding $\mu_X$, if the synthetic data points are sampled from a distribution with sufficiently large support. Due to space constraints, please see Appendix A.1 for a proof.

Theorem 2.

Let $\mathcal{X}$ be a compact metric space and $k$ a continuous kernel on $\mathcal{X}$. Suppose that the synthetic data points $z_1, z_2, \ldots$ are sampled i.i.d. from a probability distribution $\pi$ on $\mathcal{X}$. If the support of $P_X$ is included in the support of $\pi$, then Algorithm 1 outputs a consistent estimator of the kernel mean embedding $\mu_X$ in the sense that

$\Big\| \sum_{m=1}^{M} w_m k(\cdot, z_m) - \mu_X \Big\|_{\mathcal{H}} \to 0 \text{ in probability as } N \to \infty. \qquad (5)$

As discussed by Simon-Gabriel et al. [2016], the assumptions of Theorem 2 are usually satisfied: $\mathcal{X}$ can be taken to be compact whenever the data comes from measurements with any bounded range, and many kernels are continuous, including all kernels on discrete spaces (w.r.t. the discrete topology). In order to use the output of Algorithm 1 in the very general kernel probabilistic programming framework and obtain a consistent estimator of the kernel mean embedding of $f(X)$ for any continuous function $f$, the $\ell_1$-norm $\sum_{m} |w_m|$ of the released weights may need to remain bounded by a constant as $N \to \infty$ [Simon-Gabriel et al., 2016]. This is not enforced by Algorithm 1, but Theorem 8 in Appendix A.1 shows how a simple regularization in the final stage of the algorithm achieves this without breaking consistency (or privacy).

As the next result (proved in Appendix A.2) shows, Algorithm 1 is differentially private whenever $k(x, x) \leq 1$ for all $x \in \mathcal{X}$. This is a weak assumption that holds for all normalized kernels, and can be achieved by simple rescaling for any bounded kernel (one with $\sup_{x \in \mathcal{X}} k(x, x) < \infty$). When $\mathcal{X}$ is a compact domain, this boundedness holds true for all continuous kernels.

Proposition 3.

If $k(x, x) \leq 1$ for all $x \in \mathcal{X}$, then Algorithm 1 is $(\varepsilon, \delta)$-differentially private.

Remark 4.

One usually requires that $\delta$ decreases faster than polynomially with the database size $N$ [Dwork and Roth, 2014]. The proof of Theorem 2 remains valid whenever the total noise magnitude $\sqrt{M}\, \sigma_N \to 0$, where $\sigma_N = \frac{2}{N} \sqrt{2 \ln(1.25/\delta_N)} / \varepsilon_N$ is the noise scale from line 7 with privacy parameters now allowed to depend on $N$; so for example we could have $\varepsilon_N = N^{-1/4}$ and $\delta_N = e^{-\sqrt{N}}$ together with $M = o(N)$. Also, if one desires pure $\varepsilon$-differential privacy (i.e., $\delta = 0$), this can be achieved by employing the Laplace mechanism instead of the Gaussian mechanism on line 7 of Algorithm 1:

$\beta := \alpha + Y, \quad Y_j \overset{\text{i.i.d.}}{\sim} \mathrm{Lap}\big( 2\sqrt{M'} / (N \varepsilon) \big), \ j = 1, \ldots, M'. \qquad (6)$

Due to a looser bound on the $\ell_1$-sensitivity of $\alpha$ (at most $\sqrt{M'}$ times its $\ell_2$-sensitivity), the scale of the Laplace noise grows faster with $M$, and as a result we require $M = o(N)$ for the proof of Theorem 2 to go through. This can be easily achieved, e.g., by choosing $M = \Theta(\sqrt{N})$, but note that the hidden multiplicative constant is still allowed to depend on the size of the already-published dataset (if any), which is constant in $N$. ∎

5 Perturbation in random-features RKHS

Another approach to ensuring differential privacy is to map the potentially infinite-dimensional RKHS $\mathcal{H}$ of $k$ into a different, finite-dimensional RKHS using random features [Rahimi and Recht, 2007], and to privacy-protect the finite coordinate vector in this space. We then employ a reduced set method to find an expansion of this RKHS element in terms of synthetic data points. In contrast to Algorithm 1, both the weights $w_m$ and the locations $z_m$ of synthetic data points can be optimized here.

The algorithm is formalized as Algorithm 2 below. Lines 1-2 pick the number $J$ of random features to use (dependent on $N$), and construct a random feature map $\hat\varphi$ with that many features. Lines 3-4 compute the empirical kernel mean embedding of $\mathcal{D}$ in the RKHS $\hat{\mathcal{H}}$ corresponding to the kernel $\hat{k}$ induced by the random features, and then privacy-protect the resulting finite, real-valued vector. Lines 5-6 run a blindly initialized Reduced set method to find a weighted synthetic dataset whose kernel mean embedding in $\hat{\mathcal{H}}$ is close to the privacy-protected mean embedding of the private database. Line 7 releases this weighted dataset to the public.

0:  database $\mathcal{D} = (x_1, \ldots, x_N)$, kernel $k$, privacy parameters $\varepsilon > 0$ and $\delta > 0$
0:  $(\varepsilon, \delta)$-differentially private, approximate version of the RKHS embedding $\hat\mu_X$ of $\mathcal{D}$
1:  $J := J(N)$, number of random features to use
2:  $\hat\varphi : \mathcal{X} \to \mathbb{R}^J$, random feature map for kernel $k$, inducing $\hat{k}(x, y) := \hat\varphi(x)^\top \hat\varphi(y)$
3:  $\hat\mu := \frac{1}{N} \sum_{n=1}^{N} \hat\varphi(x_n)$ {empirical KME in RKHS $\hat{\mathcal{H}}$ of $\hat{k}$}
4:  $\tilde\mu := \hat\mu + \mathcal{N}(0, \sigma^2 I_J)$ with $\sigma = \frac{2}{N} \sqrt{2 \ln(1.25/\delta)} / \varepsilon$, an $(\varepsilon, \delta)$-differentially private version of the vector $\hat\mu \in \mathbb{R}^J$
5:  $M := M(N)$, number of synthetic expansion points to use for representing $\tilde\mu$
6:  approximate $\tilde\mu$ in the RKHS $\hat{\mathcal{H}}$ using a Reduced set method:
$\min_{w_1, \ldots, w_M,\ z_1, \ldots, z_M} \Big\| \sum_{m=1}^{M} w_m \hat\varphi(z_m) - \tilde\mu \Big\|_2 \qquad (7)$
7:  return $\{(w_m, z_m)\}_{m=1}^{M}$
Algorithm 2 Differentially private release of RKHS embedding via random features
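A compact sketch of Algorithm 2 for vector data with the RBF kernel, using paired cosine/sine random Fourier features so that $\|\hat\varphi(x)\|_2 = 1$ exactly, as Proposition 6 requires. For simplicity, the reduced set step below only fits the weights for fixed synthetic points $Z$ by least squares, whereas the algorithm also allows optimizing the locations; the helper names are ours.

```python
def algorithm2_release(X, Z, eps, delta, gamma=1.0, n_freq=256, rng=None):
    """Random-feature KME + Gaussian mechanism + (weights-only) reduced set."""
    if rng is None:
        rng = np.random.default_rng()
    # Random Fourier features for k(x, y) = exp(-gamma ||x - y||^2):
    # frequencies drawn from the kernel's spectral density N(0, 2*gamma*I).
    W = rng.normal(scale=np.sqrt(2.0 * gamma), size=(n_freq, X.shape[1]))
    phi = lambda A: np.hstack([np.cos(A @ W.T), np.sin(A @ W.T)]) / np.sqrt(n_freq)
    mu = phi(X).mean(axis=0)                   # empirical KME in R^{2 n_freq}
    sigma = (2.0 / X.shape[0]) * np.sqrt(2.0 * np.log(1.25 / delta)) / eps
    mu_priv = mu + rng.normal(scale=sigma, size=mu.shape)   # Gaussian mechanism
    w, *_ = np.linalg.lstsq(phi(Z).T, mu_priv, rcond=None)  # reduced set step
    return w                                   # release (w, Z)
```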

The following Theorem 5, proved in Appendix A.3, confirms that Algorithm 2 outputs a consistent estimator of the true kernel mean embedding $\mu_X$, provided that the random features converge to the kernel $k$ uniformly in $(x, y) \in \mathcal{X} \times \mathcal{X}$. This requirement is satisfied by general schemes such as the random Fourier features and random binning of Rahimi and Recht [2007] for shift-invariant kernels, or by random feature maps for dot product kernels [Kar and Karnick, 2012].

Theorem 5.

Suppose that $\hat{k}(x, y) = \hat\varphi(x)^\top \hat\varphi(y)$ converges to $k(x, y)$ uniformly in $(x, y)$ as the number of random features $J \to \infty$. Then Algorithm 2 outputs a consistent estimator of the kernel mean embedding $\mu_X$ in the sense that $\| \sum_{m=1}^{M} w_m k(\cdot, z_m) - \mu_X \|_{\mathcal{H}} \to 0$ in probability as $N \to \infty$.

Proposition 6.

If $\| \hat\varphi(x) \|_2 \leq 1$ for all $x \in \mathcal{X}$, then Algorithm 2 is $(\varepsilon, \delta)$-differentially private.

See Appendix A.4 for a proof. The $\ell_2$-boundedness requirement on the random feature vectors is reasonable under the weak assumption $k(x, x) \leq 1$ for all $x \in \mathcal{X}$ discussed in Section 4, as in that case $\| \hat\varphi(x) \|_2^2 = \hat{k}(x, x) \approx k(x, x) \leq 1$.

Remark 4 holds here as well, with the number of synthetic points $M$ replaced by the number of random features $J$. So for example, one can achieve pure $\varepsilon$-differential privacy by choosing $J = \Theta(\sqrt{N})$.

6 Related work

Synthetic database release algorithms with a differential privacy guarantee have been studied in the literature before. Machanavajjhala et al. [2008] analyzed such a procedure for count data, ensuring privacy by sampling a distribution and then synthetic counts from a Dirichlet-Multinomial posterior. Blum et al. [2008] studied the exponential mechanism applied to synthetic database generation, which leads to a very general, but unfortunately inefficient algorithm (see also Section 3.4). Wasserman and Zhou [2010] provided a theoretical comparison of this algorithm to sampling synthetic databases from deterministically smoothed, or randomly perturbed histograms. Unlike our approach, these algorithms achieve differential privacy by sampling synthetic data points from a specific distribution, where resorting to approximate sampling can break the privacy guarantee. In our framework we propose to arrive at the synthetic database using a reduced set method, where poor performance could affect statistical usefulness of the synthetic database, but cannot break its differential privacy.

Zhou et al. [2009] and Kenthapadi et al. [2012] proposed randomized database compression schemes that yield synthetic databases useful for particular types of algorithms, while guaranteeing differential privacy. The former compresses the number of data points using a random linear or affine transformation of the entire database, and the result can be used by procedures that rely on the empirical covariance of the original data. The latter compresses the number of data point dimensions while approximately preserving distances between original, private data points.

Differentially private learning in an RKHS has also been studied, with Chaudhuri et al. [2011] and Rubinstein et al. [2012] having independently presented release mechanisms for the result of an empirical risk minimization procedure (such as an SVM). Similarly to our Algorithm 2, they map data points into a finite-dimensional space defined by random features and carry out the privacy-protecting perturbation there. However, they do not require the final stage of invoking a Reduced set method to construct a synthetic database, because the output (such as a trained SVM) is only used for evaluation on test points, for which it suffices to additionally release the used random feature map $\hat\varphi$.

As our framework stipulates ensuring differential privacy of an empirical kernel mean embedding, which is a function on $\mathcal{X}$, the work on differential privacy for functional data is of relevance here. Hall et al. [2013] showed how an RKHS element can be made differentially private via pointwise addition of a Gaussian process sample path, but as discussed in Section 3.2, the resulting function is no longer an element of the RKHS. Recently, Aldà and Rubinstein [2017] proposed a general Bernstein mechanism for $\varepsilon$-differentially private function release. The released function can be evaluated pointwise arbitrarily many times, but again, the geometry of the RKHS to which the unperturbed function belonged cannot be easily exploited anymore.

7 Discussion

We proposed a framework for constructing differentially private synthetic database release algorithms, based on the idea of using kernel mean embeddings in RKHS as intermediate database representations. To justify our framework, we presented two concrete algorithms and proved theoretical results guaranteeing their consistency and differential privacy. We believe that exploring other instantiations of this framework, and comparing them theoretically and empirically, can be a fruitful direction for future research. Theoretical comparisons in terms of convergence rates can proceed using similar ideas as in [Simon-Gabriel et al., 2016].

The i.i.d. assumption on database rows can be relaxed. For example, if they are identically distributed (as the data-generating random variable $X$), but not necessarily independent, the framework remains valid as long as a consistent estimator of the kernel mean embedding $\mu_X$ can be constructed from the database rows. A common situation where this arises is, for example, duplication of database rows due to user error.

References

  • Aldà and Rubinstein [2017] F. Aldà and B. I. P. Rubinstein. The Bernstein mechanism: Function release under differential privacy. In 31st Conference on Artificial Intelligence (AAAI), 2017.
  • Blum et al. [2008] A. Blum, K. Ligett, and A. Roth. A learning theory approach to non-interactive database privacy. In 40th ACM Symposium on Theory of Computing (STOC), 2008.
  • Burges [1996] C. J. C. Burges. Simplified support vector decision rules. In 13th International Conference on Machine Learning (ICML), 1996.
  • Chaudhuri et al. [2011] K. Chaudhuri, C. Monteleoni, and A. D. Sarwate. Differentially private empirical risk minimization. Journal of Machine Learning Research, 12, 2011.
  • Chen et al. [2010] Y. Chen, M. Welling, and A. Smola. Super-samples from kernel herding. In 26th Conference on Uncertainty in Artificial Intelligence (UAI), 2010.
  • Dwork [2006] C. Dwork. Differential privacy. In 33rd International Conference on Automata, Languages and Programming (ICALP), 2006.
  • Dwork and Roth [2014] C. Dwork and A. Roth. The Algorithmic Foundations of Differential Privacy. Foundations and Trends in Theoretical Computer Science, 9, 2014.
  • Fukumizu et al. [2008] K. Fukumizu, A. Gretton, X. Sun, and B. Schölkopf. Kernel measures of conditional dependence. In 20th Conference on Neural Information Processing Systems (NIPS), 2008.
  • Gretton et al. [2005] A. Gretton, O. Bousquet, A. Smola, and B. Schölkopf. Measuring statistical dependence with Hilbert-Schmidt norms. In 16th International Conference on Algorithmic Learning Theory (ALT), 2005.
  • Gretton et al. [2012] A. Gretton, K. M. Borgwardt, M. J. Rasch, B. Schölkopf, and A. Smola. A kernel two-sample test. Journal of Machine Learning Research, 13, 2012.
  • Hall et al. [2013] R. Hall, A. Rinaldo, and L. Wasserman. Differential Privacy for Functions and Functional Data. Journal of Machine Learning Research, 14, 2013.
  • Kar and Karnick [2012] P. Kar and H. Karnick. Random feature maps for dot product kernels. In 15th International Conference on Artificial Intelligence and Statistics (AISTATS), 2012.
  • Kenthapadi et al. [2012] K. Kenthapadi, A. Korolova, I. Mironov, and N. Mishra. Privacy via the Johnson-Lindenstrauss transform. arXiv:1204.2606 [cs], 2012.
  • Lopez-Paz et al. [2015] D. Lopez-Paz, K. Muandet, B. Schölkopf, and I. Tolstikhin. Towards a learning theory of cause-effect inference. In 32nd International Conference on Machine Learning (ICML), 2015.
  • Machanavajjhala et al. [2008] A. Machanavajjhala, D. Kifer, J. Abowd, J. Gehrke, and L. Vilhuber. Privacy: Theory meets practice on the map. In IEEE 24th International Conference on Data Engineering, 2008.
  • McSherry and Talwar [2007] F. McSherry and K. Talwar. Mechanism design via differential privacy. In 48th Annual IEEE Symposium on Foundations of Computer Science (FOCS), 2007.
  • Muandet et al. [2016] K. Muandet, B. Sriperumbudur, K. Fukumizu, A. Gretton, and B. Schölkopf. Kernel mean shrinkage estimators. Journal of Machine Learning Research, 17, 2016.
  • Rahimi and Recht [2007] A. Rahimi and B. Recht. Random features for large scale kernel machines. In 21st Conference on Neural Information Processing Systems (NIPS), 2007.
  • Rasmussen and Williams [2005] C. E. Rasmussen and C. K. I. Williams. Gaussian Processes for Machine Learning. The MIT Press, 2005.
  • Rubinstein et al. [2012] B. I. P. Rubinstein, P. L. Bartlett, L. Huang, and N. Taft. Learning in a large function space: Privacy-preserving mechanisms for SVM learning. The Journal of Privacy and Confidentiality, 4(1), 2012.
  • Schölkopf and Smola [2002] B. Schölkopf and A. J. Smola. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press, 2002.
  • Schölkopf et al. [2015] B. Schölkopf, K. Muandet, K. Fukumizu, S. Harmeling, and J. Peters. Computing functions of random variables via Reproducing Kernel Hilbert Space representations. Statistics and Computing, 25, 2015.
  • Simon-Gabriel et al. [2016] C.-J. Simon-Gabriel, A. Ścibior, I. Tolstikhin, and B. Schölkopf. Consistent Kernel Mean Estimation for Functions of Random Variables. In 29th Conference on Neural Information Processing Systems (NIPS), 2016.
  • Smola et al. [2007] A. Smola, A. Gretton, L. Song, and B. Schölkopf. A Hilbert space embedding for distributions. In 18th International Conference on Algorithmic Learning Theory (ALT), 2007.
  • Sriperumbudur et al. [2011] B. K. Sriperumbudur, K. Fukumizu, and G. R. G. Lanckriet. Universality, characteristic kernels and RKHS embedding of measures. Journal of Machine Learning Research, 2011.
  • Wasserman and Zhou [2010] L. Wasserman and S. Zhou. A statistical framework for differential privacy. Journal of the American Statistical Association, 105, 2010.
  • Zhou et al. [2009] S. Zhou, K. Ligett, and L. Wasserman. Differential privacy with compression. In IEEE International Conference on Symposium on Information Theory (ISIT), 2009.

Appendix A Proofs

A.1 Synthetic data subspace: consistency

We start with a lemma showing that if the feature map corresponding to the chosen kernel is uniformly continuous and the synthetic data points are sampled i.i.d. from a probability distribution $\pi$ whose support is larger than that of $P_X$, then the projection $\Pi_F \hat\mu_X$ of $\hat\mu_X$ onto the subspace $F$ spanned by the synthetic data points converges in probability to $\hat\mu_X$ as $M \to \infty$.

Lemma 7.

Let $\mathcal{X}$ be a compact metric space and $k$ a continuous kernel on $\mathcal{X}$. Suppose that the synthetic data points $z_1, z_2, \ldots$ are sampled i.i.d. from a probability distribution $\pi$ on $\mathcal{X}$. If the support of $P_X$ is included in the support of $\pi$, then

$\| \Pi_F \hat\mu_X - \hat\mu_X \|_{\mathcal{H}} \to 0 \text{ in probability as } M \to \infty. \qquad (8)$
Proof.

Let $\eta > 0$. As $k$ is continuous on $\mathcal{X} \times \mathcal{X}$, which as a product of compact spaces is itself compact by Tychonoff's theorem, the kernel is uniformly continuous, and in particular there exists $\rho > 0$ such that $|k(x_1, x_2) - k(x_1', x_2')| < \eta^2 / 2$ whenever $d(x_1, x_1') < \rho$ and $d(x_2, x_2') < \rho$. As $\mathcal{X}$ is compact, it is totally bounded, and thus so is its subset $\mathrm{supp}(P_X)$. Therefore $\mathrm{supp}(P_X)$ can be covered with finitely many open balls $B_1, \ldots, B_T$ of radius $\rho/2$ centred at points of $\mathrm{supp}(P_X)$. Let the sequence $z_1, z_2, \ldots$ be sampled i.i.d. from $\pi$, and let $E_M$ be the event that some of these balls contains no element of $\{z_1, \ldots, z_M\}$. Since $\mathrm{supp}(P_X) \subseteq \mathrm{supp}(\pi)$ by assumption, we have $\pi(B_t) > 0$ for all $t$ and therefore $\Pr[E_M] \leq \sum_{t=1}^{T} (1 - \pi(B_t))^M \to 0$ as $M \to \infty$.

Note that if all balls contain an element of $\{z_1, \ldots, z_M\}$ (i.e., $E_M$ does not occur), then for each $x_n$ one can find $z_{m(n)} \in \{z_1, \ldots, z_M\}$ such that $d(x_n, z_{m(n)}) < \rho$. In that case

$\| \Pi_F \hat\mu_X - \hat\mu_X \|_{\mathcal{H}} \leq \big\| \frac{1}{N} \sum_{n=1}^{N} k(\cdot, z_{m(n)}) - \hat\mu_X \big\|_{\mathcal{H}}$ [property of projection] (9)
$= \big\| \frac{1}{N} \sum_{n=1}^{N} \big( k(\cdot, z_{m(n)}) - k(\cdot, x_n) \big) \big\|_{\mathcal{H}}$ [as $\hat\mu_X = \frac{1}{N} \sum_{n} k(\cdot, x_n)$] (10)
$\leq \frac{1}{N} \sum_{n=1}^{N} \| k(\cdot, z_{m(n)}) - k(\cdot, x_n) \|_{\mathcal{H}}$ [Triangle inequality] (11)
$\leq \frac{1}{N} \sum_{n=1}^{N} \eta$ [see below] (12)
$= \eta,$ (13)

where we have used the reproducing property, the Triangle inequality and our choices of $\rho$ and $z_{m(n)}$ to see that

$\| k(\cdot, z_{m(n)}) - k(\cdot, x_n) \|_{\mathcal{H}}^2 = \langle k(\cdot, z_{m(n)}) - k(\cdot, x_n),\ k(\cdot, z_{m(n)}) - k(\cdot, x_n) \rangle_{\mathcal{H}}$ (14)
$= k(z_{m(n)}, z_{m(n)}) - k(z_{m(n)}, x_n) - k(x_n, z_{m(n)}) + k(x_n, x_n)$ (15)
$\leq | k(z_{m(n)}, z_{m(n)}) - k(z_{m(n)}, x_n) | + | k(x_n, x_n) - k(x_n, z_{m(n)}) |$ (16)
$< \eta^2 / 2 + \eta^2 / 2$ (17)
$= \eta^2.$ (18)

Hence we have that $\| \Pi_F \hat\mu_X - \hat\mu_X \|_{\mathcal{H}} \leq \eta$ on the complement of $E_M$. But since $\eta > 0$ was arbitrary and $\Pr[E_M] \to 0$ as $M \to \infty$ by construction, the claimed convergence in probability result follows. ∎

Theorem 2 (restated).

Let $\mathcal{X}$ be a compact metric space and $k$ a continuous kernel on $\mathcal{X}$. Suppose that the synthetic data points $z_1, z_2, \ldots$ are sampled i.i.d. from a probability distribution $\pi$ on $\mathcal{X}$. If the support of $P_X$ is included in the support of $\pi$, then Algorithm 1 outputs a consistent estimator of the kernel mean embedding $\mu_X$ in the sense that

$\Big\| \sum_{m=1}^{M} w_m k(\cdot, z_m) - \mu_X \Big\|_{\mathcal{H}} \to 0 \text{ in probability as } N \to \infty. \qquad (19)$
Proof.

Using the Triangle inequality, we can upper bound the RKHS distance between the output of Algorithm 1 and the true kernel mean embedding as follows:

$\Big\| \sum_{m=1}^{M} w_m k(\cdot, z_m) - \mu_X \Big\|_{\mathcal{H}} \leq \underbrace{\| \hat\mu_X - \mu_X \|_{\mathcal{H}}}_{\text{finite sample error}} + \underbrace{\| \Pi_F \hat\mu_X - \hat\mu_X \|_{\mathcal{H}}}_{\text{projection error}} + \underbrace{\Big\| \sum_{j=1}^{M'} \beta_j e_j - \Pi_F \hat\mu_X \Big\|_{\mathcal{H}}}_{\text{privacy error}}. \qquad (20)$

The finite sample error tends to $0$ as $N \to \infty$ by the law of large numbers, while the projection error tends to $0$ as $M \to \infty$ by Lemma 7. For the privacy error, using orthonormality of the basis $e_1, \ldots, e_{M'}$ we have

$\Big\| \sum_{j=1}^{M'} \beta_j e_j - \Pi_F \hat\mu_X \Big\|_{\mathcal{H}}^2 = \sum_{j=1}^{M'} (\beta_j - \alpha_j)^2 = \sum_{j=1}^{M'} Y_j^2, \quad \text{where } Y_j \overset{\text{i.i.d.}}{\sim} \mathcal{N}(0, \sigma_N^2). \qquad (21)$

As a function of $M$, the size $M'$ of the basis is a non-decreasing function, so it either converges to some finite value, in which case the obtained expression clearly tends to $0$ in probability as $N \to \infty$, or $M' \to \infty$. In this latter case $\frac{1}{M'} \sum_{j=1}^{M'} Y_j^2 / \sigma_N^2 \to 1$ a.s. by the law of large numbers, and $M' \sigma_N^2 \to 0$ as $N \to \infty$ since $M = o(N^2)$. Hence the privacy error goes to $0$ as $N \to \infty$ either way, as required to complete the proof. ∎

Theorem 8.

Suppose that the kernel $k$ is $c_0$-universal [Sriperumbudur et al., 2011] and $f$ is any continuous function mapping from $\mathcal{X}$ to some space $\mathcal{Z}$ equipped with a kernel $k_{\mathcal{Z}}$ satisfying the conditions of Simon-Gabriel et al. [2016]. Let $R \geq 1$ be any finite constant. If line 8 of Algorithm 1 is replaced with a regularized reduced set method solving the constrained minimization problem

$\min_{w \,:\, \sum_{m=1}^{M} |w_m| \leq R} \Big\| \sum_{m=1}^{M} w_m k(\cdot, z_m) - \sum_{j=1}^{M'} \beta_j e_j \Big\|_{\mathcal{H}}, \qquad (22)$

then the points output by Algorithm 1 yield a consistent estimator of the kernel mean embedding of $f(X)$ in the sense that

$\Big\| \sum_{m=1}^{M} w_m k_{\mathcal{Z}}(\cdot, f(z_m)) - \mu_{f(X)} \Big\|_{\mathcal{H}_{k_{\mathcal{Z}}}} \to 0 \text{ in probability as } N \to \infty. \qquad (23)$
Proof.

Let $\nu := \sum_{m=1}^{M} w_m k(\cdot, z_m)$ be the RKHS element output by Algorithm 1 after adding the stated regularization. First we show that despite the regularization, $\nu$ remains a consistent estimator of the true kernel mean embedding $\mu_X$ as $N \to \infty$.

The modification introduces an additional regularization error term $\| \nu - \sum_{j=1}^{M'} \beta_j e_j \|_{\mathcal{H}}$ into the upper bound on $\| \nu - \mu_X \|_{\mathcal{H}}$, compared to the corresponding bound (20) in the proof of Theorem 2. So to show the first desired consistency result, it remains to show that this extra regularization error term converges to $0$ in probability as $N \to \infty$. To this end, let $\eta > 0$ be arbitrary. Define $\rho$ and $z_{m(n)}$ for $n = 1, \ldots, N$ as in the proof of Lemma 7. Note that the RKHS element $\frac{1}{N} \sum_{n=1}^{N} k(\cdot, z_{m(n)})$ is in the feasible set of the regularized minimization problem (22), because the sum of absolute values of its expansion coefficients is

$\sum_{m=1}^{M} \frac{ |\{ n : m(n) = m \}| }{N} = 1 \leq R. \qquad (24)$

Therefore the regularization error can be upper bounded as

$\Big\| \nu - \sum_{j=1}^{M'} \beta_j e_j \Big\|_{\mathcal{H}} \leq \Big\| \frac{1}{N} \sum_{n=1}^{N} k(\cdot, z_{m(n)}) - \sum_{j=1}^{M'} \beta_j e_j \Big\|_{\mathcal{H}}$ [property of min] (25)
$\leq \Big\| \sum_{j=1}^{M'} \beta_j e_j - \Pi_F \hat\mu_X \Big\|_{\mathcal{H}} + \Big\| \Pi_F \hat\mu_X - \frac{1}{N} \sum_{n=1}^{N} k(\cdot, z_{m(n)}) \Big\|_{\mathcal{H}}.$ [Triangle inequality] (26)

We recognize the first term as the privacy error from the proof of Theorem 2, which goes to $0$ as $N \to \infty$. The probability that the second term is larger than $2\eta$ converges to $0$ as $N \to \infty$ using the argument given in the proof of Lemma 7. Hence we have the desired convergence of the modified Algorithm 1's output to the true kernel mean embedding $\mu_X$ as $N \to \infty$, in probability.

This means that the modified algorithm still outputs a consistent estimator of the kernel mean embedding of $X$. Moreover, the weights $w_m$ in the released finite expansion now have their $\ell_1$-norm bounded by the constant $R$ by construction, so Theorem 1 of Simon-Gabriel et al. [2016] applies and gives the desired conclusion regarding consistency of the estimator for the kernel mean embedding of $f(X)$. ∎

A.2 Synthetic data subspace: privacy

Lemma 9.

If $k(x, x) \leq 1$ for all $x \in \mathcal{X}$, then the RKHS norm sensitivity of the empirical kernel mean embedding with respect to changing one data point is at most $2/N$.

Proof.

Let $\mathcal{D} = (x_1, \ldots, x_N)$ and $\mathcal{D}' = (x_1', \ldots, x_N')$ be two databases of the same cardinality $N$, differing in a single row. Without loss of generality $x_n = x_n'$ for $n < N$. Let $\hat\mu$ and $\hat\mu'$ be the empirical kernel mean embeddings computed using $\mathcal{D}$ and $\mathcal{D}'$, respectively. Then

$\| \hat\mu - \hat\mu' \|_{\mathcal{H}} = \frac{1}{N} \| k(\cdot, x_N) - k(\cdot, x_N') \|_{\mathcal{H}}$ (27)
$\leq \frac{1}{N} \big( \| k(\cdot, x_N) \|_{\mathcal{H}} + \| k(\cdot, x_N') \|_{\mathcal{H}} \big)$ (28)
$= \frac{1}{N} \big( \sqrt{k(x_N, x_N)} + \sqrt{k(x_N', x_N')} \big) \leq \frac{2}{N}.$ (29)

As $\mathcal{D}$ and $\mathcal{D}'$ were arbitrary neighbouring databases, the claimed result follows. ∎

Proposition 3 (restated).

If $k(x, x) \leq 1$ for all $x \in \mathcal{X}$, then Algorithm 1 is $(\varepsilon, \delta)$-differentially private.

Proof.

As the synthetic data points $z_1, \ldots, z_M$ do not depend on the private data, it suffices to show that the weights $w_1, \ldots, w_M$ are $(\varepsilon, \delta)$-differentially private. However, these weights result from data-independent post-processing of the coefficients $\beta$, which are a perturbed version of the coefficients $\alpha$, with the perturbation provided by the privacy-protecting Gaussian mechanism [Dwork and Roth, 2014]. It remains to verify that the Gaussian mechanism employs sufficiently scaled noise; in particular we need to verify that the $\ell_2$-sensitivity of the coefficient vector $\alpha$ is at most $2/N$.

But indeed, since $e_1, \ldots, e_{M'}$ are orthonormal, for any $\alpha$ and $\alpha'$ computed using neighbouring databases,

$\| \alpha - \alpha' \|_2 = \Big\| \sum_{j=1}^{M'} (\alpha_j - \alpha_j') e_j \Big\|_{\mathcal{H}} = \| \Pi_F \hat\mu - \Pi_F \hat\mu' \|_{\mathcal{H}} \leq \| \hat\mu - \hat\mu' \|_{\mathcal{H}} \leq \frac{2}{N} \qquad (30)$

(the last inequality is Lemma 9, and the preceding one holds because orthogonal projections are $1$-Lipschitz), as required to verify the Gaussian mechanism. Then $(\varepsilon, \delta)$-differential privacy for the entire algorithm follows. ∎

A.3 Random features algorithm: consistency

Theorem 5 (restated).

Suppose that $\hat{k}(x, y) = \hat\varphi(x)^\top \hat\varphi(y)$ converges to $k(x, y)$ uniformly in $(x, y)$ as the number of random features $J \to \infty$. Assume also availability of an approximate Reduced set construction method that solves the minimization (7) either up to a constant multiplicative error, or with an absolute error that can be made arbitrarily small. Then Algorithm 2 outputs a consistent estimator of the kernel mean embedding $\mu_X$ in the sense that

$\Big\| \sum_{m=1}^{M} w_m k(\cdot, z_m) - \mu_X \Big\|_{\mathcal{H}} \to 0 \text{ in probability as } N \to \infty. \qquad (31)$
Proof.

The output $\{(w_m, z_m)\}_{m=1}^{M}$ of Algorithm 2 specifies an element $\nu := \sum_{m=1}^{M} w_m k(\cdot, z_m)$ in the RKHS $\mathcal{H}$ of $k$. Its RKHS distance to the true kernel mean embedding $\mu_X$ of $X$ can be upper bounded by a decomposition using the Triangle inequality, where we write $\hat\nu := \sum_{m=1}^{M} w_m \hat\varphi(z_m)$ for the element of $\hat{\mathcal{H}}$ that the Reduced set method constructs to approximate the privacy-protected $\tilde\mu$:

$\| \nu - \mu_X \|_{\mathcal{H}} \leq \underbrace{\| \hat\mu_X - \mu_X \|_{\mathcal{H}}}_{\text{finite sample error}}$ (32)
$\quad + \underbrace{\big| \| \nu - \hat\mu_X \|_{\mathcal{H}} - \| \hat\nu - \hat\mu \|_2 \big|}_{\text{random features error}}$ (33)
$\quad + \underbrace{\| \hat\nu - \tilde\mu \|_2}_{\text{reduced set error}}$ (34)
$\quad + \underbrace{\| \tilde\mu - \hat\mu \|_2}_{\text{privacy error}}.$ (35)

The finite sample error tends to $0$ as $N \to \infty$ in probability by consistency of the empirical kernel mean estimate. The random features error goes to $0$ as $N \to \infty$ since $J \to \infty$ as $N \to \infty$ and $\hat{k} \to k$ uniformly as $J \to \infty$, implying convergence of the norms and