Differentially Private Database Release
via Kernel Mean Embeddings
Abstract
We lay theoretical foundations for new database release mechanisms that allow third parties to construct consistent estimators of population statistics, while ensuring that the privacy of each individual contributing to the database is protected. The proposed framework rests on two main ideas. First, releasing (an estimate of) the kernel mean embedding of the data generating random variable instead of the database itself still allows third parties to construct consistent estimators of a wide class of population statistics. Second, the algorithm can satisfy the definition of differential privacy by basing the released kernel mean embedding on entirely synthetic data points, while controlling accuracy through the metric available in a Reproducing Kernel Hilbert Space. We describe two instantiations of the proposed framework, suitable under different scenarios, and prove theoretical results guaranteeing differential privacy of the resulting algorithms and the consistency of estimators constructed from their outputs.
1 Introduction
We aim to contribute to the body of research on the tradeoff between releasing datasets from which publicly beneficial statistical inferences can be drawn, and protecting the privacy of individuals who contribute to such datasets. Currently the most successful formalization of protecting user privacy is provided by differential privacy [Dwork and Roth, 2014], which is a definition that any algorithm operating on a database may or may not satisfy. An algorithm that does satisfy the definition ensures that a particular individual does not lose too much privacy by deciding to contribute to the database on which the algorithm operates.
While differentially private algorithms for releasing entire databases have been studied previously [Blum et al., 2008, Wasserman and Zhou, 2010, Zhou et al., 2009], most algorithms focus on releasing a privacy-protected version of a particular summary statistic, or of a statistical model trained on the private dataset. In this work we revisit the more difficult non-interactive, or offline, setting, where the database owner aims to release a privacy-protected version of the entire database without knowing what statistics third parties may wish to compute in the future.
In our new framework we propose to use the kernel mean embedding [Smola et al., 2007] as an intermediate representation of a database. It is (1) sufficiently rich in the sense that it captures a wide class of statistical properties of the data, while at the same time (2) it lives in a Reproducing Kernel Hilbert Space (RKHS), where it can be handled mathematically in a principled way and privacy-protected in a unified manner, independently of the type of data appearing in the database.
Although kernel mean embeddings are functions in an abstract Hilbert space, in practice they can be (at least approximately) represented using a possibly weighted set of data points in input space (i.e., a set of database rows). The privacy-protected kernel mean embedding is released to the public in this representation, but using synthetic data points instead of the private ones. As a result, our framework can be seen as leading to synthetic database algorithms.
We validate our approach by instantiating two concrete algorithms and proving that they output consistent estimators of the true kernel mean embedding of the data generating process, while satisfying the definition of differential privacy. The consistency results ensure that third parties can carry out a wide variety of statistically founded computations on the released data, such as constructing consistent estimators of population statistics, estimating the Maximum Mean Discrepancy (MMD) between distributions and performing two-sample testing [Gretton et al., 2012], or using the data in the kernel probabilistic programming framework for random variable arithmetic [Schölkopf et al., 2015, Simon-Gabriel et al., 2016, Section 3]. All of this can be done repeatedly and without limit, with no possibility of, and hence no need to worry about, violating user privacy.
One of our algorithms is especially suited to the interesting scenario where a (small) subset of a database has already been published. This situation can arise in a wide variety of settings, for example, due to weaker privacy protections in the past, due to a leak, or due to the presence of an incentive, financial or otherwise, for users to publish their data. In such a situation our algorithm can provide a principled approach for reweighting the published dataset in such a way that the accuracy of statistical inferences on this dataset benefits from the larger sample size (including the private data), while maintaining differential privacy for the undisclosed data.
In summary, the contributions of this paper are:

A framework for designing database release algorithms with the guarantee of differential privacy, using kernel mean embeddings as intermediate database representations, so that the RKHS metric can be used to control accuracy of the released synthetic database in a principled manner (Section 3).
2 Background
2.1 Differential privacy
Definition 1 (Differential privacy [Dwork, 2006]).
An algorithm $\mathcal{A}$ is said to be $(\varepsilon, \delta)$-differentially private if for all neighbouring databases $D \sim D'$ (differing in at most one element) and all measurable subsets $S$ of the codomain of $\mathcal{A}$,
(1) $P(\mathcal{A}(D) \in S) \le e^{\varepsilon} \, P(\mathcal{A}(D') \in S) + \delta.$
The parameter $\varepsilon$ controls the amount of information the algorithm can leak about an individual, while a positive value of $\delta$ allows the algorithm to produce an unlikely output that leaks more information, but only with probability up to $\delta$. This notion is sometimes called approximate differential privacy; an algorithm that is $(\varepsilon, 0)$-differentially private is simply said to be $\varepsilon$-differentially private. Note that any nontrivial differentially private algorithm must be randomized; the definition asserts that the distribution of algorithm outputs is not too sensitive to changing one row in the database.
When the output of an algorithm is a finite vector $\mathcal{A}(D) \in \mathbb{R}^J$, two standard random perturbation mechanisms are available to make the output differentially private: the Laplace mechanism and the Gaussian mechanism. As the perturbation needs to mask the contribution of a particular entry of the database $D$, the scale of the added noise is closely linked to the notion of sensitivity, measuring how much the algorithm output can change due to changing a single data point:
(2) $\Delta_p := \sup_{D \sim D'} \| \mathcal{A}(D) - \mathcal{A}(D') \|_p, \quad p \in \{1, 2\}.$
The Laplace mechanism adds i.i.d. $\mathrm{Lap}(\Delta_1 / \varepsilon)$ noise to each of the $J$ coordinates of the output vector and ensures pure $\varepsilon$-differential privacy, while the Gaussian mechanism adds i.i.d. $\mathcal{N}(0, \sigma^2)$ noise to each coordinate, where $\sigma^2 > 2 \Delta_2^2 \ln(1.25/\delta) / \varepsilon^2$, and ensures $(\varepsilon, \delta)$-differential privacy. Applying these mechanisms thus requires computing (a provable upper bound on) the relevant sensitivity.
Differential privacy is preserved under postprocessing: if an algorithm is differentially private, then so is its sequential composition with any other algorithm that does not have direct or indirect access to the private database [Dwork and Roth, 2014].
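The two perturbation mechanisms described above can be sketched in a few lines. This is a minimal illustration, not code from the paper: the function names, the toy "release the mean of values in [0, 1]" task, and the use of the classical Gaussian calibration $\sigma = \Delta_2 \sqrt{2 \ln(1.25/\delta)} / \varepsilon$ (usually stated for $\varepsilon < 1$) are assumptions made for the example.

```python
import numpy as np

def laplace_mechanism(value, l1_sensitivity, eps, rng):
    """Add i.i.d. Lap(l1_sensitivity / eps) noise to each coordinate (pure eps-DP)."""
    value = np.asarray(value, dtype=float)
    return value + rng.laplace(loc=0.0, scale=l1_sensitivity / eps, size=value.shape)

def gaussian_sigma(l2_sensitivity, eps, delta):
    """Classical calibration: sigma = Delta_2 * sqrt(2 ln(1.25 / delta)) / eps."""
    return l2_sensitivity * np.sqrt(2.0 * np.log(1.25 / delta)) / eps

def gaussian_mechanism(value, l2_sensitivity, eps, delta, rng):
    """Add i.i.d. N(0, sigma^2) noise to each coordinate ((eps, delta)-DP)."""
    value = np.asarray(value, dtype=float)
    sigma = gaussian_sigma(l2_sensitivity, eps, delta)
    return value + rng.normal(0.0, sigma, size=value.shape)

# Toy example (hypothetical): privately release the mean of N values in [0, 1].
rng = np.random.default_rng(0)
data = rng.uniform(0.0, 1.0, size=1000)
sensitivity = 1.0 / len(data)  # replacing one row moves the mean by at most 1/N
private_mean_lap = laplace_mechanism([data.mean()], sensitivity, eps=0.5, rng=rng)
private_mean_gau = gaussian_mechanism([data.mean()], sensitivity, eps=0.5, delta=1e-5, rng=rng)
```

In one dimension the $\ell_1$ and $\ell_2$ sensitivities coincide, which is why the same bound is passed to both mechanisms here; for vector outputs they generally differ.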
2.2 Kernels, RKHS, and kernel mean embeddings
A kernel on a nonempty set (data type) $\mathcal{X}$ is a binary positive-definite function $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$. Intuitively it can be thought of as expressing the similarity between any two elements in $\mathcal{X}$. The literature on kernels is vast and their properties are well studied [Schölkopf and Smola, 2002]; many kernels are known for a large variety of data types such as vectors, strings, time series, graphs, etc., and kernels can be composed to yield valid kernels for composite data types (e.g., the type of a database row containing both numerical and string data).
The kernel mean embedding (KME) of an $\mathcal{X}$-valued random variable $X$ in the RKHS $\mathcal{H}_k$ is the function $\mu_X^k : \mathcal{X} \to \mathbb{R}$ given by $\mu_X^k(y) := \mathbb{E}_X[k(X, y)]$, which is defined whenever $\mathbb{E}_X[\sqrt{k(X, X)}] < \infty$ [Smola et al., 2007]. Several popular kernels have been proven to be characteristic [Fukumizu et al., 2008], in which case the map $p_X \mapsto \mu_X^k$, where $p_X$ is the distribution of $X$, is injective. This means that no information about the distribution of $X$ is lost when passing to its kernel mean embedding $\mu_X^k$.
In practice, the kernel mean embedding of a random variable $X$ is approximated using a sample $x_1, \ldots, x_N$ drawn from $X$, which can be used to construct an empirical kernel mean embedding $\hat{\mu}_X^k$ of $X$ in the RKHS: a function given by $\hat{\mu}_X^k(y) = \frac{1}{N} \sum_{n=1}^{N} k(x_n, y)$. When the observations are i.i.d., under a boundedness condition $\hat{\mu}_X^k$ converges to the true kernel mean embedding $\mu_X^k$ at rate $O_p(N^{-1/2})$, independently of the dimension of $\mathcal{X}$ [Lopez-Paz et al., 2015]. While $\hat{\mu}_X^k$ formally coincides with the kernel density estimate for $X$, our approach relies on the metric of the RKHS in which these kernel mean embeddings live. The RKHS $\mathcal{H}_k$ is a space of functions $h : \mathcal{X} \to \mathbb{R}$, endowed with an inner product $\langle \cdot, \cdot \rangle_{\mathcal{H}_k}$ that satisfies the reproducing property $\langle h, k(x, \cdot) \rangle_{\mathcal{H}_k} = h(x)$ for all $h \in \mathcal{H}_k$ and $x \in \mathcal{X}$. The inner product induces a norm $\| \cdot \|_{\mathcal{H}_k}$, which can be used to measure distances $\| \mu_X^k - \mu_Y^k \|_{\mathcal{H}_k}$ between distributions of $X$ and $Y$. This can be exploited for various purposes such as two-sample tests [Gretton et al., 2012] and independence testing [Gretton et al., 2005]; one can also attempt to minimize this distance in order to match one distribution to another.
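To make the empirical kernel mean embedding and the RKHS distance concrete, the sketch below evaluates both for a Gaussian kernel. The kernel choice, the lengthscale, and the function names are assumptions for illustration. The distance is expanded with the reproducing property, so only Gram matrices are needed.

```python
import numpy as np

def gaussian_kernel(A, B, lengthscale=1.0):
    """Gram matrix with entries exp(-||a - b||^2 / (2 * lengthscale^2))."""
    sq = (A**2).sum(axis=1)[:, None] + (B**2).sum(axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * lengthscale**2))

def empirical_kme(X, kernel=gaussian_kernel):
    """Return the function y -> (1/N) * sum_n k(x_n, y), evaluated row-wise on Y."""
    return lambda Y: kernel(X, Y).mean(axis=0)

def rkhs_distance(X, Y, wx=None, wy=None, kernel=gaussian_kernel):
    """|| sum_i wx_i k(x_i, .) - sum_j wy_j k(y_j, .) ||_H; uniform weights by default."""
    wx = np.full(len(X), 1.0 / len(X)) if wx is None else np.asarray(wx, dtype=float)
    wy = np.full(len(Y), 1.0 / len(Y)) if wy is None else np.asarray(wy, dtype=float)
    sq = wx @ kernel(X, X) @ wx - 2.0 * wx @ kernel(X, Y) @ wy + wy @ kernel(Y, Y) @ wy
    return float(np.sqrt(max(sq, 0.0)))  # clip tiny negative values from rounding

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
mu_hat = empirical_kme(X)
values = mu_hat(np.zeros((1, 2)))         # embedding evaluated at the origin
shifted = rkhs_distance(X, X + 3.0)       # distance to a clearly different sample
```

The optional weights let the same routine compare a weighted synthetic dataset against an unweighted sample.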
An example of such minimization are reduced set methods [Burges, 1996, Schölkopf and Smola, 2002, Chapter 18], which replace a collection of points $x_1, \ldots, x_N$ with a (potentially smaller) weighted set $(z_1, w_1), \ldots, (z_M, w_M)$, where the points $z_m$ can, but need not, equal any of the $x_n$, such that the kernel mean embedding computed using the reduced set is close to the kernel mean embedding computed using the original set, as measured by the RKHS norm:
(3) $\left\| \frac{1}{N} \sum_{n=1}^{N} k(x_n, \cdot) - \sum_{m=1}^{M} w_m k(z_m, \cdot) \right\|_{\mathcal{H}_k} \approx 0.$
Reduced set methods are usually motivated by the computational savings arising when $M \ll N$; we will invoke them mainly to replace a collection of private data points with a (possibly weighted) set of synthetic data points.
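When the locations of the reduced set are held fixed, minimizing the squared RKHS error over the weights is a linear least-squares problem with normal equations $K_{zz} w = \frac{1}{N} K_{zx} \mathbf{1}$. The sketch below solves this special case; it is an illustration only, since general reduced set methods also optimize the locations. The Gaussian kernel, the small ridge term added for numerical stability, and all names are assumptions.

```python
import numpy as np

def gaussian_kernel(A, B, lengthscale=1.0):
    """Gram matrix with entries exp(-||a - b||^2 / (2 * lengthscale^2))."""
    sq = (A**2).sum(axis=1)[:, None] + (B**2).sum(axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * lengthscale**2))

def reduced_set_weights(X, Z, kernel=gaussian_kernel, ridge=1e-8):
    """Weights w minimizing ||(1/N) sum_n k(x_n,.) - sum_m w_m k(z_m,.)||^2 for fixed Z."""
    K_zz = kernel(Z, Z)
    c = kernel(Z, X).mean(axis=1)  # c_m = (1/N) sum_n k(z_m, x_n)
    # Small ridge term (an assumption of this sketch) guards against ill-conditioning.
    return np.linalg.solve(K_zz + ridge * np.eye(len(Z)), c)

def squared_rkhs_error(X, Z, w, kernel=gaussian_kernel):
    """Squared RKHS distance between the empirical KME of X and the weighted set (Z, w)."""
    u = np.full(len(X), 1.0 / len(X))
    return float(u @ kernel(X, X) @ u - 2.0 * w @ kernel(Z, X) @ u + w @ kernel(Z, Z) @ w)

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 1))           # original points
Z = np.linspace(-3.0, 3.0, 10)[:, None]  # fixed reduced-set locations
w = reduced_set_weights(X, Z)
err_opt = squared_rkhs_error(X, Z, w)
err_unif = squared_rkhs_error(X, Z, np.full(len(Z), 1.0 / len(Z)))
```

Comparing `err_opt` with `err_unif` shows how much the fitted weights improve on a uniform weighting of the same ten points.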
3 Framework
3.1 Problem formulation
Throughout this work, we assume the following setup. A database curator wishes to publicly release a database $D = \{x_1, \ldots, x_N\}$ containing private data about $N$ individuals, with each data point (database row) $x_n$ taking values in a nonempty set $\mathcal{X}$. The set $\mathcal{X}$ can be arbitrarily rich, for example, it could be a product of Euclidean spaces, integer spaces, sets of strings, etc.; we only require availability of a kernel function $k$ on $\mathcal{X}$. We assume that the rows in the database can be thought of as i.i.d. observations from some $\mathcal{X}$-valued data-generating random variable $X$ (but see Section 7 for a discussion about relaxing this assumption). The database curator, wishing to protect the privacy of individuals in the database, seeks a database release mechanism that satisfies the definition of $(\varepsilon, \delta)$-differential privacy, with $\varepsilon$ and $\delta$ given. The main purpose of releasing the database is to allow third parties to construct estimators of population statistics (i.e., properties of the distribution of $X$), but it is not known at the time of release what statistics the third parties will be interested in.
To lighten notation, henceforth we drop the superscript $k$ from kernel mean embeddings (such as $\mu_X^k$) and the subscript $k$ from the RKHS $\mathcal{H}_k$, when $k$ is the kernel on $\mathcal{X}$ chosen by the database curator.
3.2 Algorithm template
We propose the following general algorithm template for differentially private database release:

1. Construct a consistent estimator $\hat{\mu}_X$ of the KME $\mu_X$ of $X$ using the private database.

2. Obtain a perturbed version $\tilde{\mu}_X$ of the constructed estimator to ensure differential privacy.

3. Release a (potentially approximate) representation of $\tilde{\mu}_X$ in terms of a (possibly weighted) dataset $(z_1, w_1), \ldots, (z_M, w_M)$.
The released representation should be such that $\sum_{m=1}^{M} w_m k(z_m, \cdot)$ is a consistent estimator of the true KME $\mu_X$, i.e., such that the RKHS distance between the two converges to $0$ in probability as the private database size $N$, and together with it the synthetic database size $M$, go to infinity.
Each step of this template admits several possibilities. For the first step we have discussed the standard empirical kernel mean embedding with i.i.d. observations of $X$, but the framework remains valid with improved estimators such as kernel-based quadrature [Chen et al., 2010] or the shrinkage kernel mean estimators of Muandet et al. [2016].
As the kernel mean embeddings $\mu_X$ and $\hat{\mu}_X$ live in the RKHS $\mathcal{H}$ of the kernel $k$, a natural mechanism for privatising $\hat{\mu}_X$ in the second step would be to follow Hall et al. [2013] and add pointwise a suitably scaled sample path $g$ of a Gaussian process with covariance function $k$ to $\hat{\mu}_X$. This does ensure $(\varepsilon, \delta)$-differential privacy of the resulting function $\tilde{\mu}_X = \hat{\mu}_X + g$, but unfortunately $\tilde{\mu}_X \notin \mathcal{H}$, because the RKHS norm of a Gaussian process sample path with the same kernel is infinite almost surely [Rasmussen and Williams, 2005]. While our framework allows pursuing this direction by, for example, moving to a larger function space that does contain the Gaussian process sample path, in this work we will present algorithms that achieve differential privacy by mapping $\hat{\mu}_X$ into a finite-dimensional Hilbert space and then applying the standard Laplace or Gaussian mechanisms to the finite coordinate vector.
While differential privacy is preserved under postprocessing, the third step requires some care to ensure that private data is not leaked. Specifically, when several (approximate) representations in terms of a weighted dataset are possible, committing to a particular one reveals more information than just the function $\tilde{\mu}_X$ (consider, for example, the extreme case where the representation would be in terms of the private points $x_1, \ldots, x_N$). One thus needs to either control the privacy leak due to choosing a representation in a way that depends on the private data, or, as we do in our concrete algorithms below, choose a representation independently of the private data (but still minimizing its RKHS distance to the privacy-protected $\tilde{\mu}_X$).
3.3 Versatility
Algorithms in our framework release a possibly weighted synthetic dataset $(z_1, w_1), \ldots, (z_M, w_M)$ such that $\sum_{m=1}^{M} w_m k(z_m, \cdot)$ is a consistent estimator of the true kernel mean embedding $\mu_X$ of the data generating random variable $X$. This allows third parties to perform a wide spectrum of statistical computation, all without having to worry about violating differential privacy:

Kernel probabilistic programming [Schölkopf et al., 2015]: the versatility of our approach is greatly expanded thanks to the result of Simon-Gabriel et al. [2016], who showed that under technical conditions, applying a continuous function $f$ to all points $z_m$ in the synthetic dataset yields a consistent estimator of the kernel mean embedding of the transformed random variable $f(X)$, even when the points $z_m$ are not i.i.d. (as they may not be, depending on the particular synthetic database release algorithm).

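Consistent estimation of population statistics: for any RKHS function $h \in \mathcal{H}$, we have $\langle \mu_X, h \rangle_{\mathcal{H}} = \mathbb{E}[h(X)]$, so a consistent estimator of $\mu_X$ yields a consistent estimator of the expectation of $h(X)$. It can be evaluated using the reproducing kernel property:
(4) $\mathbb{E}[h(X)] = \langle \mu_X, h \rangle_{\mathcal{H}} \approx \left\langle \sum_{m=1}^{M} w_m k(z_m, \cdot), h \right\rangle_{\mathcal{H}} = \sum_{m=1}^{M} w_m h(z_m).$
For example, approximating the indicator function $\mathbb{1}_S$ of a set $S \subseteq \mathcal{X}$ with functions in the RKHS allows estimating probabilities: $\mathbb{E}[\mathbb{1}_S(X)] = P(X \in S)$ (note that $\mathbb{1}_S$ itself may not be an element of the RKHS).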

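MMD estimation and two-sample testing [Gretton et al., 2012]: Given another random variable $Y$ on $\mathcal{X}$, one can consistently estimate the Maximum Mean Discrepancy (MMD) distance $\| \mu_X - \mu_Y \|_{\mathcal{H}}$ between the distributions of $X$ and $Y$, and in particular construct a two-sample test based on this distance. Given a sample $y_1, \ldots, y_L \sim Y$:
$\widehat{\mathrm{MMD}}(X, Y) = \left\| \sum_{m=1}^{M} w_m k(z_m, \cdot) - \frac{1}{L} \sum_{l=1}^{L} k(y_l, \cdot) \right\|_{\mathcal{H}},$
which can again be evaluated using the reproducing property.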

Subsequent use of synthetic data: Since the output of the algorithm is a (possibly weighted) database, third parties are free to use this data for arbitrary purposes, such as training any machine learning model on this data. Models trained purely on this data can be released with differential privacy guaranteed; however, the accuracy of such models on real data remains an empirical question that is beyond the scope of this work.
An orthogonal spectrum of versatility arises from the fact that the third step in the algorithm template can constrain the released dataset to be more convenient or more computationally efficient for further processing. For example, one could fix the weights to uniform $w_m = 1/M$ to obtain an unweighted dataset, or replace an expensive data type with a cheaper subset, such as requesting floats instead of doubles in the $z_m$'s. All this can be performed while an RKHS distance is available to control accuracy between $\tilde{\mu}_X$ and its released representation.
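As a small illustration of how a third party might use a released weighted dataset, the sketch below estimates an expectation $\mathbb{E}[h(X)]$ by the weighted sum $\sum_m w_m h(z_m)$. The "released" dataset here is a stand-in (an ordinary uniform-weight sample from a standard normal), not the output of any algorithm in this paper, and the test function $h = k(0, \cdot)$ for a unit-lengthscale Gaussian kernel is an assumption chosen because its expectation has the closed form $1/\sqrt{2}$.

```python
import numpy as np

def estimate_expectation(Z, w, h):
    """Estimate E[h(X)] by sum_m w_m h(z_m) from a weighted dataset (Z, w)."""
    return float(np.dot(w, h(Z)))

# Stand-in for a released dataset (hypothetical): uniform weights on N(0, 1) draws.
rng = np.random.default_rng(0)
Z = rng.normal(size=(5000, 1))
w = np.full(len(Z), 1.0 / len(Z))

# h = k(0, .) for a Gaussian kernel with unit lengthscale, an element of the RKHS.
h = lambda Y: np.exp(-0.5 * (Y**2).sum(axis=1))

estimate = estimate_expectation(Z, w, h)
exact = 1.0 / np.sqrt(2.0)  # E[exp(-X^2 / 2)] for X ~ N(0, 1)
```

The same pattern applies to any function in the RKHS; only `h` changes, while the released pair `(Z, w)` is reused without touching the private data again.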
3.4 Concrete algorithms
As a first illustrative example, we describe how a particular case of an existing, but inefficient synthetic database algorithm already fits into our framework. The exponential mechanism [McSherry and Talwar, 2007] is a general mechanism for ensuring differential privacy of an algorithm, and in our setting it operates as follows: given a similarity measure $s : \mathcal{X}^N \times \mathcal{X}^M \to \mathbb{R}$ between (private) databases of size $N$ and (synthetic) databases of size $M$, output a random (synthetic) database $R$ with probability proportional to $\exp\left( \frac{\varepsilon}{2 \Delta_s} s(D, R) \right)$, where $D$ is the actual private database and $\Delta_s$ is the sensitivity of $s$ w.r.t. $D$. This ensures $\varepsilon$-differential privacy [McSherry and Talwar, 2007].
To fit into our framework, we can take $s$ to be the negative RKHS distance between the kernel mean embeddings computed using $D$ and $R$, and achieve $\varepsilon$-differential privacy by releasing $R$ with probability proportional to $\exp\left( -\frac{\varepsilon}{2 \Delta_s} \| \hat{\mu}_D - \hat{\mu}_R \|_{\mathcal{H}} \right)$, where $\hat{\mu}_D$ and $\hat{\mu}_R$ denote the empirical kernel mean embeddings of $D$ and $R$. This solves steps 2 and 3 of our general algorithm template simultaneously, as it directly samples a concrete representation of a "perturbed" kernel mean embedding $\tilde{\mu}_X$. The algorithm essentially corresponds to the SmallDB algorithm of Blum et al. [2008], except for choosing the RKHS distance as a well-studied similarity measure between two databases.
The principal issue with this algorithm is its computational infeasibility except in trivial cases, as it requires sampling from a probability distribution supported on all potential synthetic databases, while employing an approximate sampling scheme runs the risk of breaking the differential privacy guarantee of the exponential mechanism. In Sections 4 and 5, respectively, we describe two concrete synthetic database release algorithms that may possess failure modes where they become inefficient, but employing approximations in those cases can only affect their statistical accuracy, not the promise of differential privacy.
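The exponential-mechanism release can be written out exactly only for toy candidate spaces, which is what the following sketch does: it enumerates every size-$M$ multiset drawn from a small public grid and samples one with probability proportional to $\exp(-\varepsilon \, \| \hat{\mu}_D - \hat{\mu}_R \| / (2 \Delta_s))$. The grid, the Gaussian kernel, the function names, and the sensitivity bound $\Delta_s \le 2/N$ (which follows from the reverse triangle inequality when $k(x, x) \le 1$) are assumptions of this example rather than details taken from the paper.

```python
import itertools
import numpy as np

def gaussian_kernel(A, B, lengthscale=1.0):
    """Gram matrix with entries exp(-||a - b||^2 / (2 * lengthscale^2))."""
    sq = (A**2).sum(axis=1)[:, None] + (B**2).sum(axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * lengthscale**2))

def kme_distance(D, R, kernel=gaussian_kernel):
    """RKHS distance between the empirical KMEs of two unweighted databases."""
    sq = kernel(D, D).mean() - 2.0 * kernel(D, R).mean() + kernel(R, R).mean()
    return float(np.sqrt(max(sq, 0.0)))

def exponential_mechanism_release(D, grid, M, eps, rng, kernel=gaussian_kernel):
    """Sample a size-M synthetic database from all multisets of grid points."""
    sensitivity = 2.0 / len(D)  # assumed bound, valid when k(x, x) <= 1
    candidates = [np.array(c) for c in itertools.combinations_with_replacement(grid, M)]
    scores = np.array([-kme_distance(D, R, kernel) for R in candidates])
    logits = eps * scores / (2.0 * sensitivity)
    logits -= logits.max()  # numerically stable normalization
    probs = np.exp(logits)
    probs /= probs.sum()
    return candidates[rng.choice(len(candidates), p=probs)], probs

rng = np.random.default_rng(0)
D = rng.normal(size=(50, 1))                # toy private database
grid = np.linspace(-2.0, 2.0, 5)[:, None]   # public candidate locations
R, probs = exponential_mechanism_release(D, grid, M=2, eps=1.0, rng=rng)
```

Even here there are already 15 candidates for 5 grid points and $M = 2$; the count grows combinatorially with the grid size and $M$, which illustrates why exact sampling is infeasible beyond trivial cases.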
4 Perturbation in syntheticdata subspace
In this section we describe an instantiation of the framework proposed in Section 3 that achieves differential privacy of the kernel mean embedding by projecting it onto a finite-dimensional subspace of the RKHS spanned by feature maps $k(z_m, \cdot)$ of synthetic data points $z_1, \ldots, z_M$, and perturbing the resulting finite coordinate vector. To ensure differential privacy, the synthetic data points are chosen independently of the private database. As a result, statistical efficiency of this approach will depend on the choice of synthetic data points, with efficiency going up if there are enough synthetic data points to capture the patterns in the private data. Therefore this algorithm is especially suited to the scenario discussed in Section 1, where part of the database (or a similar one) has already been published, as this can serve as a good starting set for the synthetic data points.
The setting where some observations from $X$ have already been released highlights the fact that differential privacy only protects against additional privacy violation due to an individual deciding to contribute to the private database; if a particular user's data has already been published, differential privacy does not protect against privacy violations based on exploiting this previously published data.
The algorithm is formalized as Algorithm 1 in the box below. Lines 1-2 choose the synthetic data points independently of the private data (only using the database size $N$). Lines 3-4 construct the linear subspace $\mathcal{H}_M$ of $\mathcal{H}$ spanned by feature maps of the chosen synthetic data points, and compute a finite basis for it. Only then is the private data accessed: the empirical kernel mean embedding $\hat{\mu}_X$ is computed (line 5), then projected onto the subspace $\mathcal{H}_M$ and expanded in terms of the precomputed basis (line 6). The basis coefficients of the projection are then perturbed to achieve differential privacy (line 7), and the perturbed element is then re-expressed in terms of the spanning set containing feature maps of synthetic data points (line 8). This expansion is finally released to the public (line 9).
Line 1 stipulates that the number of synthetic data points $M \to \infty$ as $N \to \infty$, but asymptotically slower than $N^2$. This is to ensure that the privatization noise added in the subspace to each coordinate is small enough overall to preserve consistency, as stated below.
The following Theorem 2 assures us that Algorithm 1 outputs a consistent estimator of the true kernel mean embedding $\mu_X$, if the synthetic data points are sampled from a distribution with sufficiently large support. Due to space constraints, the proof is deferred to Appendix A.1.
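The steps described above (fix synthetic points, build an orthonormal basis of their span, project the empirical embedding, perturb the coordinates, re-express as weights) can be sketched as follows. This is a hedged reconstruction from the prose, not the paper's Algorithm 1 verbatim: the orthonormal basis is obtained from an eigendecomposition of the Gram matrix, the noise scale uses the classical Gaussian calibration with an assumed $\ell_2$-sensitivity of $2/N$ (valid when $k(x, x) \le 1$), and all names are hypothetical.

```python
import numpy as np

def gaussian_kernel(A, B, lengthscale=1.0):
    """Gram matrix with entries exp(-||a - b||^2 / (2 * lengthscale^2))."""
    sq = (A**2).sum(axis=1)[:, None] + (B**2).sum(axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * lengthscale**2))

def release_in_synthetic_subspace(X, Z, eps, delta, rng, kernel=gaussian_kernel, tol=1e-10):
    """Return weights w for the synthetic points Z (Z must not depend on X)."""
    N = len(X)
    # Orthonormal basis of span{k(z_m, .)}, computed from Z only:
    # b_f = sum_m A[m, f] k(z_m, .), so that A.T @ K_zz @ A is the identity.
    evals, evecs = np.linalg.eigh(kernel(Z, Z))
    keep = evals > tol * evals.max()
    A = evecs[:, keep] / np.sqrt(evals[keep])
    # The private data is accessed only from here on.
    c = kernel(Z, X).mean(axis=1)   # c_m = mu_hat(z_m)
    alpha = A.T @ c                 # coordinates of the projected embedding
    # Gaussian mechanism with assumed l2-sensitivity 2 / N.
    sigma = (2.0 / N) * np.sqrt(2.0 * np.log(1.25 / delta)) / eps
    beta = alpha + rng.normal(0.0, sigma, size=alpha.shape)
    return A @ beta                 # weights w_m of k(z_m, .)

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 1))            # toy private database
Z = np.linspace(-3.0, 3.0, 8)[:, None]   # synthetic points chosen without looking at X
w = release_in_synthetic_subspace(X, Z, eps=1.0, delta=1e-5, rng=rng)
```

Because the basis depends only on `Z`, the only data-dependent quantity that is perturbed is the coordinate vector `alpha`, and everything after the noise addition is postprocessing.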
Theorem 2.
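Let $\mathcal{X}$ be a compact metric space and $k$ a continuous kernel on $\mathcal{X}$. Suppose that the synthetic data points $z_1, \ldots, z_M$ are sampled i.i.d. from a probability distribution $q$ on $\mathcal{X}$. If the support of $X$ is included in the support of $q$, then Algorithm 1 outputs a consistent estimator of the kernel mean embedding $\mu_X$ in the sense that
(5) $\left\| \mu_X - \sum_{m=1}^{M} w_m k(z_m, \cdot) \right\|_{\mathcal{H}} \to 0$ in probability as $N \to \infty$.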
As discussed by Simon-Gabriel et al. [2016], the assumptions of Theorem 2 are usually satisfied: $\mathcal{X}$ can be taken to be compact whenever the data comes from measurements with any bounded range, and many kernels are continuous, including all kernels on discrete spaces (w.r.t. the discrete topology). In order to use the output of Algorithm 1 in the very general kernel probabilistic programming framework and obtain a consistent estimator of the kernel mean embedding of $f(X)$ for any continuous function $f$, the $\ell_1$ norm of the released weights may need to remain bounded by a constant as $N \to \infty$ [Simon-Gabriel et al., 2016]. This is not enforced by Algorithm 1, but Theorem 8 in Appendix A.1 shows how a simple regularization in the final stage of the algorithm achieves this without breaking consistency (or privacy).
As the next result (proved in Appendix A.2) shows, Algorithm 1 is differentially private whenever $k(x, x) \le 1$ for all $x \in \mathcal{X}$. This is a weak assumption that holds for all normalized kernels, and can be achieved by simple rescaling for any bounded kernel (such that $\sup_{x \in \mathcal{X}} k(x, x) < \infty$). When $\mathcal{X}$ is a compact domain, every continuous kernel is bounded, so the condition can always be met after rescaling.
Proposition 3.
If $k(x, x) \le 1$ for all $x \in \mathcal{X}$, then Algorithm 1 is $(\varepsilon, \delta)$-differentially private.
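The role of the condition $k(x, x) \le 1$ can be seen from a short sensitivity calculation. The following is a sketch under stated assumptions, not the proof from Appendix A.2: $D$ and $D'$ are neighbouring databases of size $N$ differing in one row ($x$ versus $x'$), $\hat{\mu}_D$ and $\hat{\mu}_{D'}$ are their empirical embeddings, $\Pi$ denotes orthogonal projection onto the span of the synthetic feature maps, and $\alpha(D)$ is the coordinate vector of $\Pi \hat{\mu}_D$ in an orthonormal basis (notation introduced here for illustration).

```latex
\begin{align*}
\|\hat{\mu}_D - \hat{\mu}_{D'}\|_{\mathcal{H}}
  &= \frac{1}{N}\,\|k(x, \cdot) - k(x', \cdot)\|_{\mathcal{H}}
   \le \frac{1}{N}\left(\sqrt{k(x, x)} + \sqrt{k(x', x')}\right)
   \le \frac{2}{N}, \\
\|\alpha(D) - \alpha(D')\|_2
  &= \|\Pi(\hat{\mu}_D - \hat{\mu}_{D'})\|_{\mathcal{H}}
   \le \|\hat{\mu}_D - \hat{\mu}_{D'}\|_{\mathcal{H}}
   \le \frac{2}{N}, \\
\sigma^2
  &= \frac{2 \ln(1.25/\delta)}{\varepsilon^2}\left(\frac{2}{N}\right)^{2}
   = \frac{8 \ln(1.25/\delta)}{N^2 \varepsilon^2}.
\end{align*}
```

The first line uses $\|k(x, \cdot)\|_{\mathcal{H}} = \sqrt{k(x, x)}$, the second uses that orthogonal projection is non-expansive and that orthonormal coordinates preserve norms, and the third plugs the bound $\Delta_2 \le 2/N$ into the classical Gaussian-mechanism calibration. With at most $M$ perturbed coordinates, the expected squared norm of the added noise is then at most $M \sigma^2$, which suggests why $M$ has to grow slowly relative to $N$ for the noise to vanish.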
Remark 4.
One usually requires that $\delta$ decreases faster than polynomially with the database size $N$ [Dwork and Roth, 2014]. The proof of Theorem 2 remains valid whenever , so for example we could have and . Also, if one desires pure differential privacy (i.e., $\delta = 0$), this can be achieved by employing the Laplace mechanism instead of the Gaussian mechanism on line 7 of Algorithm 1:
(6) 
Due to a looser bound on the sensitivity of , the scale of the Laplace noise grows faster with , and as a result we require for the proof of Theorem 2 to go through. This can be easily achieved, e.g., by choosing , but note that the hidden multiplicative constant is still allowed to depend on the size of , which is constant in . ∎
5 Perturbation in randomfeatures RKHS
Another approach to ensuring differential privacy is to map the potentially infinite-dimensional RKHS of $k$ into a different, finite-dimensional RKHS using random features [Rahimi and Recht, 2007], and privacy-protect the finite coordinate vector in this space. We then employ a reduced set method to find an expansion of this RKHS element in terms of synthetic data points. In contrast to Algorithm 1, both the weights and locations of synthetic data points can be optimized here.
The algorithm is formalized as Algorithm 2 below. Lines 1-2 pick the number $J$ of random features to use (dependent on $N$), and construct a random feature map $\phi$ with that many features. Lines 3-4 compute the empirical kernel mean embedding of $D$ in the RKHS corresponding to the kernel induced by the random features, and then privacy-protect the resulting finite, real-valued vector. Lines 5-6 run a blindly initialized reduced set method to find a weighted synthetic dataset whose kernel mean embedding in this random-features RKHS is close to the privacy-protected mean embedding of the private database. Line 7 releases this weighted dataset to the public.
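A simplified version of this pipeline can be sketched as below. Several choices are assumptions made for the example and are not taken from the paper: the kernel is Gaussian, the random features are paired cosine/sine random Fourier features (so that $\|\phi(x)\|_2 = 1$ exactly), the noise uses the classical Gaussian calibration with an assumed $\ell_2$-sensitivity of $2/N$, and the reduced set step is replaced by a least-squares fit of the weights for synthetic locations that are fixed in advance, whereas the method described above also optimizes the locations.

```python
import numpy as np

def make_random_fourier_features(dim, num_features, lengthscale, rng):
    """Paired cos/sin random Fourier features for a Gaussian kernel; ||phi(x)||_2 = 1."""
    assert num_features % 2 == 0
    half = num_features // 2
    W = rng.normal(0.0, 1.0 / lengthscale, size=(half, dim))
    def phi(X):
        proj = X @ W.T
        return np.hstack([np.cos(proj), np.sin(proj)]) / np.sqrt(half)
    return phi

def release_via_random_features(X, Z, num_features, eps, delta, rng, lengthscale=1.0):
    """Return (Z, w, phi); Z must be chosen independently of the private data X."""
    N, dim = X.shape
    phi = make_random_fourier_features(dim, num_features, lengthscale, rng)
    # Empirical mean embedding in the finite-dimensional feature space.
    mean_embedding = phi(X).mean(axis=0)
    # Gaussian mechanism; sensitivity 2 / N is assumed from ||phi(x)||_2 <= 1.
    sigma = (2.0 / N) * np.sqrt(2.0 * np.log(1.25 / delta)) / eps
    private_embedding = mean_embedding + rng.normal(0.0, sigma, size=mean_embedding.shape)
    # Simplified reduced set step (postprocessing): fit weights only, locations fixed.
    w, *_ = np.linalg.lstsq(phi(Z).T, private_embedding, rcond=None)
    return Z, w, phi

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2))             # toy private database
Z = rng.uniform(-3.0, 3.0, size=(40, 2))  # synthetic locations drawn without looking at X
Z_out, w, phi = release_via_random_features(X, Z, num_features=20, eps=1.0, delta=1e-5, rng=rng)
```

Since the weight fit sees only the already-perturbed vector, a poor fit can reduce accuracy but does not affect the privacy guarantee of the noisy embedding. Releasing `phi` alongside the dataset is optional in this sketch.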
(7) 
The following Theorem 5, proved in Appendix A.3, confirms that Algorithm 2 outputs a consistent estimator of the true kernel mean embedding $\mu_X$, provided that the random features converge to the kernel uniformly on $\mathcal{X}$. This requirement is satisfied by general schemes such as random Fourier features and random binning of Rahimi and Recht [2007] for shift-invariant kernels, or by random feature maps for dot product kernels [Kar and Karnick, 2012].
Theorem 5.
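Suppose that $\phi(x)^\top \phi(y) \to k(x, y)$ converges uniformly in $x, y \in \mathcal{X}$ as the number of random features $J \to \infty$. Then Algorithm 2 outputs a consistent estimator of the kernel mean embedding $\mu_X$ in the sense that $\left\| \mu_X - \sum_{m=1}^{M} w_m k(z_m, \cdot) \right\|_{\mathcal{H}} \to 0$ in probability as $N \to \infty$.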
Proposition 6.
If $\| \phi(x) \|_2 \le 1$ for all $x \in \mathcal{X}$, then Algorithm 2 is $(\varepsilon, \delta)$-differentially private.
See Appendix A.4 for a proof. The boundedness requirement on the random feature vectors is reasonable under the weak assumption $k(x, x) \le 1$ for all $x \in \mathcal{X}$ discussed in Section 4, as in that case $\| \phi(x) \|_2^2 = \phi(x)^\top \phi(x) \approx k(x, x) \le 1$.
Remark 4 holds here as well, with the number of synthetic points $M$ replaced by the number of random features $J$. So for example, one can achieve differential privacy by choosing .
6 Related work
Synthetic database release algorithms with a differential privacy guarantee have been studied in the literature before. Machanavajjhala et al. [2008] analyzed such a procedure for count data, ensuring privacy by sampling a distribution and then synthetic counts from a Dirichlet-Multinomial posterior. Blum et al. [2008] studied the exponential mechanism applied to synthetic database generation, which leads to a very general, but unfortunately inefficient algorithm (see also Section 3.4). Wasserman and Zhou [2010] provided a theoretical comparison of this algorithm to sampling synthetic databases from deterministically smoothed, or randomly perturbed histograms. Unlike our approach, these algorithms achieve differential privacy by sampling synthetic data points from a specific distribution, where resorting to approximate sampling can break the privacy guarantee. In our framework we propose to arrive at the synthetic database using a reduced set method, where poor performance could affect statistical usefulness of the synthetic database, but cannot break its differential privacy.
Zhou et al. [2009] and Kenthapadi et al. [2012] proposed randomized database compression schemes that yield synthetic databases useful for particular types of algorithms, while guaranteeing differential privacy. The former compresses the number of data points using a random linear or affine transformation of the entire database, and the result can be used by procedures that rely on the empirical covariance of the original data. The latter compresses the number of data point dimensions while approximately preserving distances between original, private data points.
Differentially private learning in an RKHS has also been studied, with Chaudhuri et al. [2011] and Rubinstein et al. [2012] having independently presented release mechanisms for the result of an empirical risk minimization procedure (such as an SVM). Similarly to our Algorithm 2, they map data points into a finite-dimensional space defined by random features and carry out the privacy-protecting perturbation there. However, they do not require the final stage of invoking a reduced set method to construct a synthetic database, because the output (such as a trained SVM) is only used for evaluation on test points, for which it suffices to additionally release the used random feature map $\phi$.
As our framework stipulates ensuring differential privacy of an empirical kernel mean embedding, which is a function $\mathcal{X} \to \mathbb{R}$, the work on differential privacy for functional data is of relevance here. Hall et al. [2013] showed how an RKHS element can be made differentially private via pointwise addition of a Gaussian process sample path, but as discussed in Section 3.2, the resulting function is no longer an element of the RKHS. Recently, Aldà and Rubinstein [2017] proposed a general Bernstein mechanism for differentially private function release. The released function can be evaluated pointwise arbitrarily many times, but again, the geometry of the RKHS to which the unperturbed function belonged cannot be easily exploited anymore.
7 Discussion
We proposed a framework for constructing differentially private synthetic database release algorithms, based on the idea of using kernel mean embeddings in an RKHS as intermediate database representations. To justify our framework, we presented two concrete algorithms and proved theoretical results guaranteeing their consistency and differential privacy. We believe that exploring other instantiations of this framework, and comparing them theoretically and empirically, can be a fruitful direction for future research. Theoretical comparisons in terms of convergence rates can proceed using similar ideas as in [Simon-Gabriel et al., 2016].
The i.i.d. assumption on database rows can be relaxed. For example, if they are identically distributed (as a random variable $X$), but not necessarily independent, the framework remains valid as long as a consistent estimator of the kernel mean embedding $\mu_X$ can be constructed from the database rows. A common situation where this arises is, for example, duplication of database rows due to user error.
References
 Aldà and Rubinstein [2017] F. Aldà and B. I. P. Rubinstein. The Bernstein mechanism: Function release under differential privacy. In 31st Conference on Artificial Intelligence (AAAI), 2017.
 Blum et al. [2008] A. Blum, K. Ligett, and A. Roth. A learning theory approach to non-interactive database privacy. In 40th ACM Symposium on Theory of Computing (STOC), 2008.
 Burges [1996] C. J. C. Burges. Simplified support vector decision rules. In 13th International Conference on Machine Learning (ICML), 1996.
 Chaudhuri et al. [2011] K. Chaudhuri, C. Monteleoni, and A. D. Sarwate. Differentially private empirical risk minimization. Journal of Machine Learning Research, 12, 2011.
 Chen et al. [2010] Y. Chen, M. Welling, and A. Smola. Supersamples from kernel herding. In 26th Conference on Uncertainty in Artificial Intelligence (UAI), 2010.
 Dwork [2006] C. Dwork. Differential privacy. In 33rd International Conference on Automata, Languages and Programming (ICALP), 2006.
 Dwork and Roth [2014] C. Dwork and A. Roth. The Algorithmic Foundations of Differential Privacy. Foundations and Trends in Theoretical Computer Science, 9, 2014.
 Fukumizu et al. [2008] K. Fukumizu, A. Gretton, X. Sun, and B. Schölkopf. Kernel measures of conditional dependence. In 20th Conference on Neural Information Processing Systems (NIPS), 2008.
 Gretton et al. [2005] A. Gretton, O. Bousquet, A. Smola, and B. Schölkopf. Measuring statistical dependence with Hilbert-Schmidt norms. In 16th International Conference on Algorithmic Learning Theory (ALT), 2005.
 Gretton et al. [2012] A. Gretton, K. M. Borgwardt, M. J. Rasch, B. Schölkopf, and A. Smola. A kernel two-sample test. Journal of Machine Learning Research, 13, 2012.
 Hall et al. [2013] R. Hall, A. Rinaldo, and L. Wasserman. Differential Privacy for Functions and Functional Data. Journal of Machine Learning Research, 14, 2013.
 Kar and Karnick [2012] P. Kar and H. Karnick. Random feature maps for dot product kernels. In 15th International Conference on Artificial Intelligence and Statistics (AISTATS), 2012.
 Kenthapadi et al. [2012] K. Kenthapadi, A. Korolova, I. Mironov, and N. Mishra. Privacy via the Johnson-Lindenstrauss transform. arXiv:1204.2606 [cs], 2012.
 Lopez-Paz et al. [2015] D. Lopez-Paz, K. Muandet, B. Schölkopf, and I. Tolstikhin. Towards a learning theory of cause-effect inference. In 32nd International Conference on Machine Learning (ICML), 2015.
 Machanavajjhala et al. [2008] A. Machanavajjhala, D. Kifer, J. Abowd, J. Gehrke, and L. Vilhuber. Privacy: Theory meets practice on the map. In IEEE 24th International Conference on Data Engineering, 2008.
 McSherry and Talwar [2007] F. McSherry and K. Talwar. Mechanism design via differential privacy. In 48th Annual IEEE Symposium on Foundations of Computer Science (FOCS), 2007.
 Muandet et al. [2016] K. Muandet, B. Sriperumbudur, K. Fukumizu, A. Gretton, and B. Schölkopf. Kernel mean shrinkage estimators. Journal of Machine Learning Research, 17, 2016.
 Rahimi and Recht [2007] A. Rahimi and B. Recht. Random features for large scale kernel machines. In 21st Conference on Neural Information Processing Systems (NIPS), 2007.
 Rasmussen and Williams [2005] C. E. Rasmussen and C. K. I. Williams. Gaussian Processes for Machine Learning. The MIT Press, 2005.
 Rubinstein et al. [2012] B. I. P. Rubinstein, P. L. Bartlett, L. Huang, and N. Taft. Learning in a large function space: Privacy-preserving mechanisms for SVM learning. The Journal of Privacy and Confidentiality, 4(1), 2012.
 Schölkopf and Smola [2002] B. Schölkopf and A. J. Smola. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press, 2002.
 Schölkopf et al. [2015] B. Schölkopf, K. Muandet, K. Fukumizu, S. Harmeling, and J. Peters. Computing functions of random variables via Reproducing Kernel Hilbert Space representations. Statistics and Computing, 25, 2015.
 Simon-Gabriel et al. [2016] C.-J. Simon-Gabriel, A. Ścibior, I. Tolstikhin, and B. Schölkopf. Consistent Kernel Mean Estimation for Functions of Random Variables. In 29th Conference on Neural Information Processing Systems (NIPS), 2016.
 Smola et al. [2007] A. Smola, A. Gretton, L. Song, and B. Schölkopf. A Hilbert space embedding for distributions. In 18th International Conference on Algorithmic Learning Theory (ALT), 2007.
 Sriperumbudur et al. [2011] B. K. Sriperumbudur, K. Fukumizu, and G. R. G. Lanckriet. Universality, characteristic kernels and RKHS embedding of measures. Journal of Machine Learning Research, 2011.
 Wasserman and Zhou [2010] L. Wasserman and S. Zhou. A statistical framework for differential privacy. Journal of the American Statistical Association, 105, 2010.
 Zhou et al. [2009] S. Zhou, K. Ligett, and L. Wasserman. Differential privacy with compression. In IEEE International Symposium on Information Theory (ISIT), 2009.
Appendix A Proofs
A.1 Synthetic data subspace: consistency
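We start with a lemma showing that if the feature map corresponding to the chosen kernel is uniformly continuous and the synthetic data points are sampled i.i.d. from a probability distribution $q$ whose support contains that of $X$, then the projection $\overline{\mu}_X$ of the empirical kernel mean embedding $\hat{\mu}_X$ onto the subspace $\mathcal{H}_M$ spanned by the synthetic data points converges in probability to $\hat{\mu}_X$ as $M \to \infty$.
Lemma 7.
Let $\mathcal{X}$ be a compact metric space and $k$ a continuous kernel on $\mathcal{X}$. Suppose that the synthetic data points $z_1, z_2, \ldots$ are sampled i.i.d. from a probability distribution $q$ on $\mathcal{X}$. If the support of $X$ is included in the support of $q$, then
$$\left\| \hat{\mu}_X - \overline{\mu}_X \right\|_{\mathcal{H}_k} \xrightarrow{P} 0 \quad \text{as } M \to \infty. \tag{8}$$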
Proof.
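Let $\epsilon > 0$. As $k$ is continuous on $\mathcal{X} \times \mathcal{X}$, which as a product of compact spaces is itself compact by Tychonoff’s theorem, the kernel $k$ is uniformly continuous and in particular there exists $\delta > 0$ such that for all $x, x', y \in \mathcal{X}$ we have $|k(x, y) - k(x', y)| < \epsilon^2 / 2$ whenever $d(x, x') < \delta$. As $\mathcal{X}$ is compact, it is totally bounded, and thus so is its subset $\operatorname{supp}(X)$. Therefore $\operatorname{supp}(X)$ can be covered with finitely many open balls $B_1, \ldots, B_L$ of radius $\delta / 2$, centred at points of $\operatorname{supp}(X)$. Let the sequence $z_1, z_2, \ldots$ be sampled i.i.d. from $q$, and let $E_M$ be the event that some of these balls contains no element of $\{z_1, \ldots, z_M\}$. Since $\operatorname{supp}(X) \subseteq \operatorname{supp}(q)$ by assumption, we have $q(B_l) > 0$ for all $l$ and therefore $P(E_M) \to 0$ as $M \to \infty$.
Note that if all balls contain an element of $\{z_1, \ldots, z_M\}$ (i.e., the complement $E_M^c$ holds), then for each $n$ one can find $m(n)$ such that $d(x_n, z_{m(n)}) < \delta$. In that case
\begin{align}
\left\| \hat{\mu}_X - \overline{\mu}_X \right\|_{\mathcal{H}_k}
&\le \left\| \hat{\mu}_X - \frac{1}{N} \sum_{n=1}^{N} k(z_{m(n)}, \cdot) \right\|_{\mathcal{H}_k} && \text{[property of projection]} \tag{9} \\
&= \left\| \frac{1}{N} \sum_{n=1}^{N} \left( k(x_n, \cdot) - k(z_{m(n)}, \cdot) \right) \right\|_{\mathcal{H}_k} && \text{[as } \hat{\mu}_X = \tfrac{1}{N} \textstyle\sum_{n} k(x_n, \cdot) \text{]} \tag{10} \\
&\le \frac{1}{N} \sum_{n=1}^{N} \left\| k(x_n, \cdot) - k(z_{m(n)}, \cdot) \right\|_{\mathcal{H}_k} && \text{[triangle inequality]} \tag{11} \\
&< \frac{1}{N} \sum_{n=1}^{N} \epsilon && \text{[see below]} \tag{12} \\
&= \epsilon, \tag{13}
\end{align}
where we have used the reproducing property, the triangle inequality and our choices of $\delta$ and $m(n)$ to see that
\begin{align}
\left\| k(x_n, \cdot) - k(z_{m(n)}, \cdot) \right\|_{\mathcal{H}_k}^2
&= \left\langle k(x_n, \cdot) - k(z_{m(n)}, \cdot),\; k(x_n, \cdot) - k(z_{m(n)}, \cdot) \right\rangle_{\mathcal{H}_k} \tag{14} \\
&= k(x_n, x_n) - 2 k(x_n, z_{m(n)}) + k(z_{m(n)}, z_{m(n)}) \tag{15} \\
&\le \left| k(x_n, x_n) - k(x_n, z_{m(n)}) \right| + \left| k(z_{m(n)}, z_{m(n)}) - k(x_n, z_{m(n)}) \right| \tag{16} \\
&< \frac{\epsilon^2}{2} + \frac{\epsilon^2}{2} \tag{17} \\
&= \epsilon^2. \tag{18}
\end{align}
Hence we have that $P\left( \| \hat{\mu}_X - \overline{\mu}_X \|_{\mathcal{H}_k} \ge \epsilon \right) \le P(E_M)$. But since $\epsilon > 0$ was arbitrary and $P(E_M) \to 0$ as $M \to \infty$ by construction, the claimed convergence in probability result follows. ∎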
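The convergence in Lemma 7 can be checked numerically. The following is a minimal sketch, not part of the original analysis: it assumes a one-dimensional Gaussian kernel, private data uniform on $[0, 1]$ and synthetic points uniform on $[-0.5, 1.5]$, so that the support condition holds. The helper names `gaussian_kernel` and `projection_error` are hypothetical. The squared projection error is computed via Pythagoras as the squared norm of the empirical embedding minus the squared norm of its projection, with numerically null directions of the synthetic kernel matrix discarded.

```python
import numpy as np


def gaussian_kernel(a, b, lengthscale=0.2):
    """Gaussian kernel matrix between two 1-D point sets (illustrative choice)."""
    diff = a[:, None] - b[None, :]
    return np.exp(-diff**2 / (2.0 * lengthscale**2))


def projection_error(x, z, tol=1e-10):
    """RKHS distance between the empirical mean embedding of x and its
    orthogonal projection onto span{k(z_m, .)}."""
    c = gaussian_kernel(z, x).mean(axis=1)          # <mu_hat, k(z_m, .)> for each m
    evals, evecs = np.linalg.eigh(gaussian_kernel(z, z))
    keep = evals > tol * evals.max()                # drop numerically null directions
    # Coordinates of the projection in an orthonormal basis of the span.
    coords = (evecs[:, keep].T @ c) / np.sqrt(evals[keep])
    sq_norm_mu_hat = gaussian_kernel(x, x).mean()   # ||mu_hat||^2
    sq_error = sq_norm_mu_hat - np.sum(coords**2)   # Pythagoras
    return float(np.sqrt(max(sq_error, 0.0)))


rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, size=200)       # private data, support [0, 1]
z_all = rng.uniform(-0.5, 1.5, size=64)   # synthetic points, support [-0.5, 1.5]
errors = {M: projection_error(x, z_all[:M]) for M in (1, 4, 16, 64)}
for M, err in errors.items():
    print(f"M = {M:3d}: projection error = {err:.4f}")
```

Because the synthetic sets are nested prefixes of one sequence, the spanned subspaces are nested as well, so the printed errors should shrink as $M$ grows, up to the numerical truncation controlled by `tol`.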
Theorem 2.
Let $\mathcal{X}$ be a compact metric space and $k$ a continuous kernel on $\mathcal{X}$. Suppose that the synthetic data points $z_1, z_2, \ldots$ are sampled i.i.d. from a probability distribution $q$ on $\mathcal{X}$. If the support of $X$ is included in the support of $q$, then Algorithm 1 outputs a consistent estimator of the kernel mean embedding $\mu_X$ in the sense that
(19) 
Proof.
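Using the triangle inequality, we can upper bound the RKHS distance between the output $\tilde{\mu}_X$ of Algorithm 1 and the true kernel mean embedding $\mu_X$ as follows:
$$\left\| \tilde{\mu}_X - \mu_X \right\|_{\mathcal{H}_k} \le \underbrace{\left\| \tilde{\mu}_X - \overline{\mu}_X \right\|_{\mathcal{H}_k}}_{\text{privacy error}} + \underbrace{\left\| \overline{\mu}_X - \hat{\mu}_X \right\|_{\mathcal{H}_k}}_{\text{projection error}} + \underbrace{\left\| \hat{\mu}_X - \mu_X \right\|_{\mathcal{H}_k}}_{\text{finite sample error}}. \tag{20}$$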
The finite sample error tends to as by the law of large numbers, while the projection error tends to as by Lemma 7. For the privacy error, using orthonormality of the basis we have
(21) 
As a function of , the size of the basis is a nondecreasing function, so it either converges to some , in which case the obtained expression clearly tends to as with probability , or as . In this latter case as a.s. by the law of large numbers, and as since . Hence the privacy error goes to as either way, as required to complete the proof. ∎
Theorem 8.
Suppose that the kernel is universal [Sriperumbudur et al., 2011] and is any continuous function mapping from to some space . Let be any finite constant. If line 8 of Algorithm 2 is replaced with a regularized reduced set method solving the constrained minimization problem
(22) 
then the points output by Algorithm 2 yield a consistent estimator of the kernel mean embedding of in the sense that
(23) 
Proof.
Let be the RKHS element output by Algorithm 1 after adding the stated regularization. First we show that despite the regularization, remains a consistent estimator of the true kernel mean embedding as .
The modification introduces an additional regularization error term into the upper bound on , compared to the corresponding bound (20) in the proof of Theorem 2. So to show the first desired consistency result, it remains to show that this extra regularization error term converges to in probability as . To this end, let be arbitrary. Define and for as in the proof of Lemma 7. Note that the RKHS element is in the feasible set of the regularized minimization problem (22), because the sum of absolute values of expansion coefficients is
(24) 
Therefore the regularization error can be upper bounded as
[property of min]  (25)  
[Triangle inequality]  (26) 
We recognize the first term as the privacy error from the proof of Theorem 2, which goes to as . The probability that the second term is larger than converges to as using the argument given in the proof of Lemma 7. Hence we have the desired convergence of the modified Algorithm 1’s output to the true kernel mean embedding as , in probability.
This means that the modified algorithm still outputs a consistent estimator of the kernel mean embedding of . Moreover, the weights in the released finite expansion now have their norm bounded by the constant by construction, so Theorem 1 of Simon-Gabriel et al. [2016] applies and gives the desired conclusion regarding consistency of the estimator for the kernel mean embedding of . ∎
A.2 Synthetic data subspace: privacy
Lemma 9.
If $k(x, x) \le 1$ for all $x \in \mathcal{X}$, then the RKHS norm sensitivity of the empirical kernel mean embedding with respect to changing one data point is at most $2/N$.
Proof.
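Let $D = \{x_1, \ldots, x_N\}$ and $D' = \{x'_1, \ldots, x'_N\}$ be two databases of the same cardinality $N$, differing in a single row. Without loss of generality $x_n = x'_n$ for $n = 1, \ldots, N - 1$. Let $\hat{\mu}_X$ and $\hat{\mu}'_X$ be the empirical kernel mean embeddings computed using $D$ and $D'$, respectively. Then
\begin{align}
\left\| \hat{\mu}_X - \hat{\mu}'_X \right\|_{\mathcal{H}_k}
&= \left\| \frac{1}{N} \sum_{n=1}^{N} k(x_n, \cdot) - \frac{1}{N} \sum_{n=1}^{N} k(x'_n, \cdot) \right\|_{\mathcal{H}_k} \tag{27} \\
&= \frac{1}{N} \left\| k(x_N, \cdot) - k(x'_N, \cdot) \right\|_{\mathcal{H}_k} \tag{28} \\
&\le \frac{1}{N} \left( \sqrt{k(x_N, x_N)} + \sqrt{k(x'_N, x'_N)} \right) \le \frac{2}{N}, \tag{29}
\end{align}
where the last step uses the triangle inequality, the reproducing property and the assumption $k(x, x) \le 1$. As $D$ and $D'$ were arbitrary neighbouring databases, the claimed result follows. ∎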
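A quick numerical sanity check of this sensitivity bound is sketched below. It is an illustration under stated assumptions rather than part of the proof: it uses a Gaussian kernel (for which $k(x, x) = 1$), one-dimensional data, and the hypothetical helpers `gaussian_kernel` and `kme_distance`. Under the assumption $k(x, x) \le 1$, the triangle inequality bounds the distance between embeddings of neighbouring databases by $2/N$. For the Gaussian kernel the distance can approach but not exceed $\sqrt{2}/N$, which is reached when the replaced row is moved far away.

```python
import numpy as np


def gaussian_kernel(a, b, lengthscale=1.0):
    """Gaussian kernel matrix between two 1-D point sets; k(x, x) = 1."""
    diff = a[:, None] - b[None, :]
    return np.exp(-diff**2 / (2.0 * lengthscale**2))


def kme_distance(x, y):
    """RKHS distance between the empirical mean embeddings of samples x and y."""
    sq = (gaussian_kernel(x, x).mean()
          - 2.0 * gaussian_kernel(x, y).mean()
          + gaussian_kernel(y, y).mean())
    return float(np.sqrt(max(sq, 0.0)))


rng = np.random.default_rng(0)
N = 50
x = rng.normal(size=N)            # database D
x_neighbour = x.copy()            # database D', differing in the last row only
x_neighbour[-1] = 100.0           # move that row far away from the rest
dist = kme_distance(x, x_neighbour)
print(f"distance = {dist:.6f}, bound 2/N = {2.0 / N:.6f}")
```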
Proposition 3.
If $k(x, x) \le 1$ for all $x \in \mathcal{X}$, then Algorithm 1 is $(\varepsilon, \delta)$-differentially private.
Proof.
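As the synthetic data points $z_1, \ldots, z_M$ do not depend on the private data, it suffices to show that the weights $w_1, \ldots, w_M$ are differentially private. However, these weights result from data-independent post-processing of the coefficients $\beta$, which are a perturbed version of the coefficients $\alpha$, with the perturbation provided by the privacy-protecting Gaussian mechanism [Dwork and Roth, 2014]. It remains to verify that the Gaussian mechanism employs sufficiently scaled noise; in particular we need to verify that $\|\alpha - \alpha'\|_2 \le 2/N$.
But indeed, since $b_1, \ldots, b_F$ are orthonormal, for any $\alpha$ and $\alpha'$ computed using neighbouring databases,
$$\left\| \alpha - \alpha' \right\|_2 = \left\| \sum_{f=1}^{F} (\alpha_f - \alpha'_f)\, b_f \right\|_{\mathcal{H}_k} = \left\| \overline{\mu}_X - \overline{\mu}'_X \right\|_{\mathcal{H}_k} \le \left\| \hat{\mu}_X - \hat{\mu}'_X \right\|_{\mathcal{H}_k} \le \frac{2}{N} \tag{30}$$
(the first inequality holds because orthogonal projection is linear and does not increase norms; the last inequality is Lemma 9) as required to verify the Gaussian mechanism. Then differential privacy for the entire algorithm follows. ∎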
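The steps verified in this proof can be sketched in code. The following is an illustrative sketch only, not the authors' implementation: it assumes a one-dimensional Gaussian kernel, builds the orthonormal basis from an eigendecomposition of the synthetic kernel matrix (one of several possible constructions), and uses the classical Gaussian mechanism noise scale $\sigma = \Delta \sqrt{2 \ln(1.25/\delta)} / \varepsilon$ with $\ell_2$ sensitivity $\Delta = 2/N$, which is the standard calibration for $0 < \varepsilon < 1$. All function names are hypothetical.

```python
import numpy as np


def gaussian_kernel(a, b, lengthscale=0.5):
    """Gaussian kernel matrix between two 1-D point sets; k(x, x) = 1."""
    diff = a[:, None] - b[None, :]
    return np.exp(-diff**2 / (2.0 * lengthscale**2))


def orthonormal_basis(z, tol=1e-8):
    """Columns of B give an orthonormal basis of span{k(z_m, .)}:
    b_f = sum_m B[m, f] k(z_m, .)."""
    evals, evecs = np.linalg.eigh(gaussian_kernel(z, z))
    keep = evals > tol * evals.max()      # drop numerically null directions
    return evecs[:, keep] / np.sqrt(evals[keep])


def gaussian_mechanism_sigma(N, epsilon, delta):
    """Noise scale for L2 sensitivity 2/N (assumed calibration, 0 < epsilon < 1)."""
    return (2.0 / N) * np.sqrt(2.0 * np.log(1.25 / delta)) / epsilon


def private_weights(x, z, epsilon, delta, rng):
    """Weights w on k(z_m, .) representing the noisy projected embedding."""
    B = orthonormal_basis(z)
    c = gaussian_kernel(z, x).mean(axis=1)   # <mu_hat, k(z_m, .)>
    alpha = B.T @ c                          # coefficients of the projection
    sigma = gaussian_mechanism_sigma(len(x), epsilon, delta)
    beta = alpha + rng.normal(scale=sigma, size=alpha.shape)  # perturbed coefficients
    return B @ beta                          # data-independent post-processing


rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, size=1000)     # private data
z = rng.uniform(-0.5, 1.5, size=20)      # synthetic points, independent of x
w = private_weights(x, z, epsilon=0.5, delta=1e-5, rng=rng)
```

Only `alpha` touches the private data, so noise is added there; re-expressing the noisy element in terms of the synthetic points is post-processing and consumes no additional privacy budget.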
A.3 Random features algorithm: consistency
Theorem 5.
Suppose that the random features converge uniformly in as the number of random features . Assume also availability of an approximate Reduced set construction method that solves the minimization (7) either up to a constant multiplicative error, or with an absolute error that can be made arbitrarily small. Then Algorithm 2 outputs a consistent estimator of the kernel mean embedding in the sense that
(31) 
Proof.
The output of Algorithm 2 specifies an element in the RKHS of . Its RKHS distance to the true kernel mean embedding of can be upper bounded by a decomposition using the triangle inequality, where we write for the element of that the reduced set method constructs to approximate the privacy-protected :
(32)  
(33)  
(34)  
(35) 
The finite sample error tends to as in probability by consistency of the empirical kernel mean estimate. The random features error goes to as since as and uniformly in as , implying convergence of the norms and