Characteristic and Universal Tensor Product Kernels
Abstract
Maximum mean discrepancy (MMD), also called energy distance or N-distance in statistics, and Hilbert-Schmidt independence criterion (HSIC), specifically distance covariance in statistics, are among the most popular and successful approaches to quantify the difference and independence of random variables, respectively. Thanks to their kernel-based foundations, MMD and HSIC are applicable on a wide variety of domains. Despite their tremendous success, quite little is known about when HSIC characterizes independence and when MMD with tensor product kernel can discriminate probability distributions. In this paper, we answer these questions by studying various notions of the characteristic property of the tensor product kernel.
Zoltán Szabó and Bharath K. Sriperumbudur
Editor: Le Song
Keywords: tensor product kernel, kernel mean embedding, characteristic kernel, universality, maximum mean discrepancy, Hilbert-Schmidt independence criterion
1 Introduction
Kernel methods (Schölkopf and Smola, 2002) are among the most flexible and influential tools in machine learning and statistics, with superior performance demonstrated in a large number of areas and applications. The key idea in these methods is to map the data samples into a possibly infinite-dimensional feature space—precisely, a reproducing kernel Hilbert space (RKHS) (Aronszajn, 1950)—and apply linear methods in the feature space, without the explicit need to compute the map. A generalization of this idea to probability measures, i.e., mapping probability measures into an RKHS (Berlinet and Thomas-Agnan, 2004, Chapter 4; Smola et al., 2007), has found novel applications in nonparametric statistics and machine learning. Formally, given a probability measure $\mathbb{P}$ defined on a measurable space $\mathcal{X}$ and an RKHS $\mathcal{H}_k$ with $k$ as the reproducing kernel (which is symmetric and positive definite), $\mathbb{P}$ is embedded into $\mathcal{H}_k$ as
$\mu_{\mathbb{P}} = \int_{\mathcal{X}} k(\cdot, x)\, d\mathbb{P}(x),$ (1)
where $\mu_{\mathbb{P}}$ is called the mean element or kernel mean embedding of $\mathbb{P}$. The mean embedding of $\mathbb{P}$ has led to a new generation of solutions in two-sample testing (Baringhaus and Franz, 2004; Székely and Rizzo, 2004, 2005; Borgwardt et al., 2006; Harchaoui et al., 2007; Gretton et al., 2012), domain adaptation (Zhang et al., 2013) and generalization (Blanchard et al., 2017), kernel belief propagation (Song et al., 2011), kernel Bayes' rule (Fukumizu et al., 2013), model criticism (Lloyd et al., 2014; Kim et al., 2016), approximate Bayesian computation (Park et al., 2016), probabilistic programming (Schölkopf et al., 2015), distribution classification (Muandet et al., 2011; Zaheer et al., 2017), distribution regression (Szabó et al., 2016; Law et al., 2018) and topological data analysis (Kusano et al., 2016). For a recent survey on the topic, the reader is referred to (Muandet et al., 2017).
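For intuition, the embedding in (1) is estimated in practice from a sample $x_1, \ldots, x_n \sim \mathbb{P}$ by $\hat{\mu}_{\mathbb{P}} = \frac{1}{n}\sum_{i} k(\cdot, x_i)$, and RKHS distances between two empirical embeddings reduce to averages of Gram-matrix entries. The following minimal sketch assumes a Gaussian kernel; the function names (`gaussian_gram`, `mmd2`) are ours, not from any of the cited works.

```python
import numpy as np

def gaussian_gram(X, Y, sigma=1.0):
    """Gram matrix K[i, j] = exp(-||X[i] - Y[j]||^2 / (2 sigma^2))."""
    d2 = (np.sum(X**2, axis=1)[:, None]
          + np.sum(Y**2, axis=1)[None, :] - 2.0 * X @ Y.T)
    return np.exp(-np.maximum(d2, 0.0) / (2.0 * sigma**2))

def mmd2(X, Y, sigma=1.0):
    """Biased (V-statistic) estimate of ||mu_P - mu_Q||^2 in the RKHS."""
    return (gaussian_gram(X, X, sigma).mean()
            + gaussian_gram(Y, Y, sigma).mean()
            - 2.0 * gaussian_gram(X, Y, sigma).mean())

rng = np.random.default_rng(0)
X = rng.normal(0.0, 1.0, size=(200, 1))  # sample from P = N(0, 1)
Y = rng.normal(3.0, 1.0, size=(200, 1))  # sample from Q = N(3, 1)

print(mmd2(X, X) < 1e-8)  # True: identical samples give zero discrepancy
print(mmd2(X, Y) > 0.1)   # True: well-separated distributions
```

The quantity computed by `mmd2` is an empirical version of the RKHS distance $\|\mu_{\mathbb{P}} - \mu_{\mathbb{Q}}\|_{\mathcal{H}_k}^2$ between the two embedded measures.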
Crucial to the success of the mean embedding based representation is whether it encodes all the information about the distribution, in other words, whether the map in (1) is injective, in which case the kernel is referred to as characteristic (Fukumizu et al., 2008; Sriperumbudur et al., 2010). Various characterizations of the characteristic property of $k$ are known in the literature (e.g., see Fukumizu et al., 2008, 2009; Gretton et al., 2012; Sriperumbudur et al., 2010), using which the popular kernels on $\mathbb{R}^d$ such as the Gaussian, Laplacian, B-spline, inverse multiquadrics, and the Matérn class are shown to be characteristic. The characteristic property is closely related to the notion of universality (Steinwart, 2001; Micchelli et al., 2006; Carmeli et al., 2010; Sriperumbudur et al., 2011)—$k$ is said to be universal if the corresponding RKHS $\mathcal{H}_k$ is dense in a certain target function class, for example, the class of continuous functions on compact domains—and the relation between these notions has recently been explored in (Sriperumbudur et al., 2011; Simon-Gabriel and Schölkopf, 2016).
Based on the mean embedding in (1), Smola et al. (2007) and Gretton et al. (2012) defined a semi-metric, called the maximum mean discrepancy (MMD), on the space of probability measures:

$\mathrm{MMD}_k(\mathbb{P}, \mathbb{Q}) = \|\mu_{\mathbb{P}} - \mu_{\mathbb{Q}}\|_{\mathcal{H}_k},$
which is a metric iff $k$ is characteristic. A fundamental application of MMD is in nonparametric hypothesis testing that includes two-sample (Gretton et al., 2012) and independence tests (Gretton et al., 2008). Particularly in independence testing, as a measure of independence, MMD measures the distance between the joint distribution $\mathbb{P}_{XY}$ and the product of marginals $\mathbb{P}_X \otimes \mathbb{P}_Y$ of two random variables $X$ and $Y$ which are respectively defined on measurable spaces $\mathcal{X}$ and $\mathcal{Y}$, with the kernel $k$ being defined on $\mathcal{X} \times \mathcal{Y}$. As aforementioned, if $k$ is characteristic, then $\mathrm{MMD}_k(\mathbb{P}_{XY}, \mathbb{P}_X \otimes \mathbb{P}_Y) = 0$ implies $\mathbb{P}_{XY} = \mathbb{P}_X \otimes \mathbb{P}_Y$, i.e., $X$ and $Y$ are independent. A simple way to define a kernel on $\mathcal{X} \times \mathcal{Y}$ is through the tensor product of kernels $k_1$ and $k_2$ defined on $\mathcal{X}$ and $\mathcal{Y}$ respectively: $k = k_1 \otimes k_2$, i.e., $k((x, y), (x', y')) = k_1(x, x')\, k_2(y, y')$, with the corresponding RKHS being the tensor product space $\mathcal{H}_{k_1} \otimes \mathcal{H}_{k_2}$ generated by $\mathcal{H}_{k_1}$ and $\mathcal{H}_{k_2}$. This means, when $k = k_1 \otimes k_2$,
$\mathrm{MMD}^2_{k_1 \otimes k_2}(\mathbb{P}_{XY}, \mathbb{P}_X \otimes \mathbb{P}_Y) = \big\|\mu_{\mathbb{P}_{XY}} - \mu_{\mathbb{P}_X \otimes \mathbb{P}_Y}\big\|^2_{\mathcal{H}_{k_1} \otimes \mathcal{H}_{k_2}}.$ (2)
In addition to the simplicity of defining a joint kernel on $\mathcal{X} \times \mathcal{Y}$, the tensor product kernel offers a principled way of combining inner products ($k_1$ and $k_2$) on domains that can correspond to different modalities (say images, texts, audio). By exploiting the isomorphism between tensor product Hilbert spaces and the space of Hilbert-Schmidt operators¹ (¹In this equivalence one assumes that $\mathcal{H}_{k_1}$, $\mathcal{H}_{k_2}$ are separable; this holds under mild conditions, for example if $\mathcal{X}$ and $\mathcal{Y}$ are separable topological domains and $k_1$, $k_2$ are continuous (Steinwart and Christmann, 2008, Lemma 4.33).), it follows from (2) that
$\mathrm{MMD}_{k_1 \otimes k_2}(\mathbb{P}_{XY}, \mathbb{P}_X \otimes \mathbb{P}_Y) = \|C_{XY}\|_{\mathrm{HS}},$ (3)
which is the Hilbert-Schmidt norm of the cross-covariance operator $C_{XY}$ and is known as the Hilbert-Schmidt independence criterion (HSIC) (Gretton et al., 2005). HSIC has enjoyed tremendous success in a variety of applications such as independent component analysis (Gretton et al., 2005), feature selection (Song et al., 2012), independence testing (Gretton et al., 2008; Jitkrittum et al., 2017), post selection inference (Yamada et al., 2018) and causal detection (Mooij et al., 2016; Pfister et al., 2017; Strobl et al., 2017). Recently, MMD and HSIC (as defined in (3) for two components) have been shown by Sejdinovic et al. (2013b) to be equivalent to other popular statistical measures such as the energy distance (Baringhaus and Franz, 2004; Székely and Rizzo, 2004, 2005)—also known as N-distance (Zinger et al., 1992; Klebanov, 2005)—and distance covariance (Székely et al., 2007; Székely and Rizzo, 2009; Lyons, 2013), respectively. HSIC has been generalized to $M \ge 2$ components (Quadrianto et al., 2009; Sejdinovic et al., 2013a) to measure the joint independence of $M$ random variables:

$\mathrm{HSIC}(\mathbb{P}) = \big\|\mu_{\mathbb{P}} - \mu_{\otimes_{m=1}^M \mathbb{P}_m}\big\|_{\otimes_{m=1}^M \mathcal{H}_{k_m}},$
where $\mathbb{P}$ is a joint measure on the product space $\times_{m=1}^M \mathcal{X}_m$ and $\mathbb{P}_1, \ldots, \mathbb{P}_M$ are the marginal measures of $\mathbb{P}$ defined on $\mathcal{X}_1, \ldots, \mathcal{X}_M$, respectively. The extended HSIC measure has recently been analyzed in the context of independence testing (Pfister et al., 2017). In addition to testing, the extended HSIC measure is also useful in the problem of independent subspace analysis (ISA) (Cardoso, 1998), wherein the latent sources are separated by maximizing the degree of independence among them. In all the applications of HSIC, the key requirement is that HSIC captures the joint independence of the random variables (with joint distribution $\mathbb{P}$)—we call this property $\mathcal{I}$-characteristic—which is guaranteed if $k := \otimes_{m=1}^M k_m$ is characteristic. Since $k$ is defined in terms of $k_1, \ldots, k_M$, it is of fundamental importance to understand the characteristic and $\mathcal{I}$-characteristic properties of $k$ in terms of the characteristic property of the $k_m$'s, which is one of the main goals of this work.
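To make the discussion concrete, here is a hedged sketch of the widely used biased empirical HSIC for $M = 2$: $\widehat{\mathrm{HSIC}} = n^{-2}\,\mathrm{tr}(KHLH)$ with centering matrix $H = I - \frac{1}{n}\mathbf{1}\mathbf{1}^\top$, as in Gretton et al. (2005) up to the choice of normalization. The Gaussian kernel and all function names are illustrative, not taken from the cited works.

```python
import numpy as np

def gaussian_gram(X, sigma=1.0):
    """Gram matrix K[i, j] = exp(-||X[i] - X[j]||^2 / (2 sigma^2))."""
    d2 = np.sum((X[:, None, :] - X[None, :, :])**2, axis=-1)
    return np.exp(-d2 / (2.0 * sigma**2))

def hsic(X, Y, sigma=1.0):
    """Biased empirical HSIC: (1/n^2) * trace(K H L H)."""
    n = X.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n  # centering matrix
    K = gaussian_gram(X, sigma)
    L = gaussian_gram(Y, sigma)
    return np.trace(K @ H @ L @ H) / n**2

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 1))
Y_dep = X + 0.1 * rng.normal(size=(300, 1))  # strongly dependent on X
Y_ind = rng.normal(size=(300, 1))            # independent of X

print(hsic(X, Y_dep) > hsic(X, Y_ind))  # True: dependence yields a larger value
```

Note that the estimator is nonnegative by construction, since $\mathrm{tr}(KHLH)$ equals the squared Hilbert-Schmidt norm of an empirical (centered) cross-covariance operator.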
For $M = 2$, the characterization of independence, i.e., the $\mathcal{I}$-characteristic property of $k_1 \otimes k_2$, is studied by Blanchard et al. (2011) and Gretton (2015) where it has been shown that if $k_1$ and $k_2$ are universal, then $k_1 \otimes k_2$ is universal² (²Blanchard et al. (2011) deals with $c$-universal kernels while Gretton (2015) deals with $c_0$-universal kernels. A brief description of these notions is provided in Section 3. We refer the reader to (Carmeli et al., 2010; Sriperumbudur et al., 2010) for more details on these notions of universality.) and therefore HSIC captures independence. A stronger version of this result can be obtained by combining (Lyons, 2013, Theorem 3.11) and (Sejdinovic et al., 2013b, Proposition 29): if $k_1$ and $k_2$ are characteristic, then the HSIC associated with $k_1 \otimes k_2$ characterizes independence. Apart from these results, not much is known about the characteristic/$\mathcal{I}$-characteristic/universality properties of $\otimes_{m=1}^M k_m$ in terms of the individual kernels. Our goal is to resolve this question and understand the characteristic, $\mathcal{I}$-characteristic and universal properties of the product kernel $\otimes_{m=1}^M k_m$ in terms of the kernel components $(k_m)_{m=1}^M$ for any $M \ge 2$. Because of the relatedness of MMD and HSIC to energy distance and distance covariance, our results also contribute to the better understanding of these other measures that are popular in the statistical literature.
Specifically, our results shed light on the following surprising phenomena of the $\mathcal{I}$-characteristic property of $\otimes_{m=1}^M k_m$ for $M \ge 3$:

the characteristic property of the $k_m$'s is necessary but not sufficient for $\otimes_{m=1}^M k_m$ to be $\mathcal{I}$-characteristic;

the universality of the $k_m$'s is sufficient for $\otimes_{m=1}^M k_m$ to be $\mathcal{I}$-characteristic; and

if at least one of the $k_m$'s is only characteristic and not universal, then $\otimes_{m=1}^M k_m$ need not be $\mathcal{I}$-characteristic.
The paper is organized as follows. We conduct a comprehensive analysis of the above mentioned properties of $\otimes_{m=1}^M k_m$ and $(k_m)_{m=1}^M$ for any positive integer $M$. To this end, we define various notions of the characteristic property on the product space (see Definition 3 and Figure 2(a) in Section 3) and explore the relations between them. In order to keep the presentation non-technical, we relegate the problem formulation to Section 3, with the main results of the paper being presented in Section 4. A summary of the results is captured in Figure 1, while the proofs are provided in Section 5. Various definitions and notation that are used throughout the paper are collected in Section 2.
2 Definitions & Notation
$\mathbb{N}$ and $\mathbb{R}$ denote the set of natural numbers and real numbers, respectively. For $M \in \mathbb{N}$, $[M] := \{1, \ldots, M\}$. $\mathbf{0}$ denotes the matrix of zeros. For $x, y \in \mathbb{R}^d$, $\langle x, y \rangle$ is the Euclidean inner product. For sets $A$ and $B$, $A \backslash B$ is their difference, $|A|$ is the cardinality of $A$, and $\times_{m=1}^M A_m$ is the Cartesian product of sets $A_1, \ldots, A_M$. $2^A$ denotes the power set of a set $A$, i.e., all subsets of $A$ (including the empty set and $A$). The Kronecker delta is defined as $\delta_{x,y} = 1$ if $x = y$, and zero otherwise. $\chi_A$ is the indicator function of set $A$: $\chi_A(x) = 1$ if $x \in A$ and $0$ otherwise. $\mathbb{R}^{d_1 \times \cdots \times d_M}$ is the set of $d_1 \times \cdots \times d_M$-sized tensors.
For a topological space $(\mathcal{X}, \tau)$, $\mathcal{B}(\mathcal{X})$ is the Borel sigma-algebra on $\mathcal{X}$ induced by the topology $\tau$. Probability and finite signed measures in the paper are meant w.r.t. the measurable space $(\mathcal{X}, \mathcal{B}(\mathcal{X}))$. Given topological spaces $\mathcal{X}_1, \ldots, \mathcal{X}_M$, their product $\times_{m=1}^M \mathcal{X}_m$ is enriched with the product topology; it is the coarsest topology for which the canonical projections $\times_{m=1}^M \mathcal{X}_m \to \mathcal{X}_m$ are continuous for all $m \in [M]$. A topological space is called second-countable if its topology has a countable basis.³ (³Second-countability implies separability; in metric spaces the two notions coincide (Dudley, 2004, Proposition 2.1.4). By Urysohn's theorem, a topological space is separable and metrizable if and only if it is regular, Hausdorff and second-countable. Any uncountable discrete space is not second-countable.) $C(\mathcal{X})$ denotes the space of continuous functions on $\mathcal{X}$. $C_0(\mathcal{X})$ denotes the class of real-valued continuous functions vanishing at infinity on a locally compact Hausdorff (LCH) space⁴ (⁴LCH spaces include $\mathbb{R}^d$, discrete spaces, and topological manifolds. Open or closed subsets, and finite products of LCH spaces are LCH. Infinite-dimensional Hilbert spaces are not LCH.) $\mathcal{X}$, i.e., for any $f \in C_0(\mathcal{X})$ and $\epsilon > 0$, the set $\{x \in \mathcal{X} : |f(x)| \ge \epsilon\}$ is compact. $C_0(\mathcal{X})$ is endowed with the uniform norm $\|f\|_\infty = \sup_{x \in \mathcal{X}} |f(x)|$. $M(\mathcal{X})$ and $M_1^+(\mathcal{X})$ are the spaces of finite signed measures and probability measures on $\mathcal{X}$, respectively. For $\mathbb{P}_m \in M_1^+(\mathcal{X}_m)$ ($m \in [M]$), $\otimes_{m=1}^M \mathbb{P}_m$ denotes the product probability measure on the product space $\times_{m=1}^M \mathcal{X}_m$, i.e., $(\otimes_{m=1}^M \mathbb{P}_m)(A_1 \times \cdots \times A_M) = \prod_{m=1}^M \mathbb{P}_m(A_m)$. $\delta_x$ is the Dirac measure supported on $x$. For $\mu \in M(\times_{m=1}^M \mathcal{X}_m)$, the finite signed measure $\mu_m$ denotes its marginal on $\mathcal{X}_m$. $\mathcal{H}_k$ is the reproducing kernel Hilbert space (RKHS) associated with the reproducing kernel $k$, which in this paper is assumed to be measurable and bounded. The tensor product of kernels $k_1, \ldots, k_M$ is a kernel, defined as
$\Big(\otimes_{m=1}^M k_m\Big)\big((x_1, \ldots, x_M), (x_1', \ldots, x_M')\big) = \prod_{m=1}^M k_m(x_m, x_m'),$

whose associated RKHS is $\mathcal{H}_{\otimes_{m=1}^M k_m} = \otimes_{m=1}^M \mathcal{H}_{k_m}$ (Berlinet and Thomas-Agnan, 2004, Theorem 13), where the r.h.s. is the tensor product of the RKHSs $\mathcal{H}_{k_1}, \ldots, \mathcal{H}_{k_M}$. For $f_m \in \mathcal{H}_{k_m}$ ($m \in [M]$), the multilinear operator $\otimes_{m=1}^M f_m$ is defined as

$\Big(\otimes_{m=1}^M f_m\Big)(g_1, \ldots, g_M) = \prod_{m=1}^M \langle f_m, g_m \rangle_{\mathcal{H}_{k_m}}, \quad g_m \in \mathcal{H}_{k_m}.$
A kernel $k$ defined on an LCH space $\mathcal{X}$ is called a $c_0$-kernel if $k(\cdot, x) \in C_0(\mathcal{X})$ for all $x \in \mathcal{X}$. $k$ is said to be a translation invariant kernel on $\mathbb{R}^d$ if $k(x, y) = \psi(x - y)$ for a positive definite function $\psi$. $\mu_{\mathbb{P}}$ denotes the kernel mean embedding of $\mathbb{P} \in M_1^+(\mathcal{X})$ to $\mathcal{H}_k$, which is defined as $\mu_{\mathbb{P}} = \int_{\mathcal{X}} k(\cdot, x)\, d\mathbb{P}(x)$, where the integral is meant in the Bochner sense.
3 Problem Formulation
In this section, we formally introduce the goal of the paper. To this end, we start with a definition. For simplicity, throughout the paper, we assume that all kernels are bounded. The definition is based on the observation (Sriperumbudur et al., 2010, Lemma 8) that a bounded kernel $k$ on a topological space $\mathcal{X}$ is characteristic if and only if

$\int_{\mathcal{X}} \int_{\mathcal{X}} k(x, y)\, d\mu(x)\, d\mu(y) > 0 \quad \text{for all } \mu \in M(\mathcal{X}) \backslash \{0\} \text{ with } \mu(\mathcal{X}) = 0.$
In other words, characteristic kernels are integrally strictly positive definite (ispd) (see Sriperumbudur et al., 2010, p. 1523) w.r.t. the class of finite signed measures that assign zero measure to $\mathcal{X}$. The following definition extends this observation to tensor product kernels on product spaces. {definition}[ispd tensor product kernel] Suppose $\otimes_{m=1}^M k_m$ is a bounded kernel on the topological space $\mathcal{X} := \times_{m=1}^M \mathcal{X}_m$. Let $\mathcal{F} \subseteq M(\mathcal{X})$. $\otimes_{m=1}^M k_m$ is said to be $\mathcal{F}$-ispd if
$\int_{\mathcal{X}} \int_{\mathcal{X}} \Big(\otimes_{m=1}^M k_m\Big)(x, y)\, d\mu(x)\, d\mu(y) > 0 \quad \text{for all } \mu \in \mathcal{F} \backslash \{0\}.$ (4)
Specifically,

if the $k_m$'s are $c_0$-kernels on locally compact Polish (LCP)⁵ (⁵A topological space is called Polish if it is complete, separable and metrizable. For example, $\mathbb{R}^d$ and countable discrete spaces are Polish. Open and closed subsets, products and disjoint unions of countably many Polish spaces are Polish. Every second-countable LCH space is Polish.) spaces $\mathcal{X}_m$ and $\mathcal{F} = M(\mathcal{X})$, then $\otimes_{m=1}^M k_m$ is called $c_0$-universal.

if

$\mathcal{F} = \big\{\mathbb{P} - \mathbb{Q} : \mathbb{P}, \mathbb{Q} \in M_1^+(\mathcal{X})\big\}, \quad \mathcal{F} = \big\{\mathbb{P} - \otimes_{m=1}^M \mathbb{P}_m : \mathbb{P} \in M_1^+(\mathcal{X})\big\} \quad \text{or} \quad \mathcal{F} = \big\{\otimes_{m=1}^M \mu_m : \mu_m \in M(\mathcal{X}_m),\ \mu_m(\mathcal{X}_m) = 0\ (m \in [M])\big\},$

where the $\mathbb{P}_m$'s denote the marginals of $\mathbb{P}$, then $\otimes_{m=1}^M k_m$ is called characteristic, $\mathcal{I}$-characteristic and $\otimes$-characteristic, respectively.
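For intuition on condition (4): when $\mu$ is a discrete signed measure $\mu = \sum_i a_i \delta_{x_i}$, the double integral in (4) reduces to the quadratic form $a^\top G a$ with Gram matrix $G_{ij} = k(x_i, x_j)$. The following sketch checks this positivity for one zero-mass $\mu$; the Gaussian kernel, points and coefficients are illustrative choices of ours.

```python
import numpy as np

# Discrete signed measure mu = sum_i a_i * delta_{x_i} with total mass zero;
# the double integral in (4) then equals the quadratic form a^T G a,
# where G_ij = k(x_i, x_j). Points and coefficients are illustrative.
x = np.array([0.0, 1.0, 2.5])
a = np.array([1.0, -0.3, -0.7])            # coefficients summing to zero
assert abs(a.sum()) < 1e-12                # mu assigns zero mass to the space

G = np.exp(-(x[:, None] - x[None, :])**2)  # Gaussian kernel Gram matrix
quad = float(a @ G @ a)
print(quad > 0)  # True, consistent with the Gaussian kernel being ispd
```

Strict positivity for every such nonzero $\mu$ is exactly what the ispd property demands over the corresponding measure class.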
In Definition 3, $\otimes_{m=1}^M k_m$ being characteristic matches the usual notion of characteristic kernels on a product space, i.e., there are no two distinct probability measures on $\times_{m=1}^M \mathcal{X}_m$ such that the MMD between them is zero. The other notions, $\mathcal{I}$-characteristic and $\otimes$-characteristic, are typically weaker than the usual characteristic property since
$\big\{\mathbb{P} - \otimes_{m=1}^M \mathbb{P}_m : \mathbb{P} \in M_1^+(\mathcal{X})\big\} \subseteq \big\{\mathbb{P} - \mathbb{Q} : \mathbb{P}, \mathbb{Q} \in M_1^+(\mathcal{X})\big\},$ (5)

and every nonzero $\otimes_{m=1}^M \mu_m$ with $\mu_m(\mathcal{X}_m) = 0$ is, by the Jordan decomposition, a positive multiple of a difference of two probability measures.
Below we provide further intuition on the measure classes listed in Definition 3.

If the $k_m$'s are $c_0$-kernels on LCH spaces $\mathcal{X}_m$ for all $m \in [M]$, then $\otimes_{m=1}^M k_m$ is also a $c_0$-kernel on the LCH space $\times_{m=1}^M \mathcal{X}_m$, implying that if $\otimes_{m=1}^M k_m$ satisfies (4) with $\mathcal{F} = M(\mathcal{X})$, then it is $c_0$-universal (see Sriperumbudur et al., 2010, Proposition 2). It is well known that $c_0$-universality reduces to $c$-universality (i.e., the notion of universality proposed by Steinwart, 2001) if $\mathcal{X}$ is compact (see Sriperumbudur et al., 2010 for details), which is guaranteed if and only if each $\mathcal{X}_m$ is compact.

This family is useful to describe the joint independence of random variables—hence the name $\mathcal{I}$-characteristic—defined on the kernel-endowed domains $(\mathcal{X}_m, k_m)$: if $\mathbb{P}$ denotes the joint distribution of the random variables $(X_1, \ldots, X_M)$ and $\mathbb{P}_1, \ldots, \mathbb{P}_M$ are the associated marginals on $\mathcal{X}_1, \ldots, \mathcal{X}_M$, then by definition $\otimes_{m=1}^M k_m$ is $\mathcal{I}$-characteristic iff

$\mathrm{HSIC}(\mathbb{P}) = 0 \iff \mathbb{P} = \otimes_{m=1}^M \mathbb{P}_m.$
In other words, HSIC captures joint independence exactly with $\mathcal{I}$-characteristic kernels.

In this case $\mathcal{F}$ is chosen to be the class of products of finite signed measures on the $\mathcal{X}_m$'s such that each factor $\mu_m$ assigns zero measure to the corresponding space $\mathcal{X}_m$. This choice is relevant since the characteristic property of the individual kernels need not imply the $\mathcal{I}$-characteristic property of $\otimes_{m=1}^M k_m$, but it is equivalent to the $\otimes$-characteristic property of $\otimes_{m=1}^M k_m$. The equivalence holds for bounded kernels on topological spaces $\mathcal{X}_m$ ($m \in [M]$) since for any $\mu = \otimes_{m=1}^M \mu_m$,

$\int_{\mathcal{X}} \int_{\mathcal{X}} \Big(\otimes_{m=1}^M k_m\Big)\, d\mu\, d\mu = \prod_{m=1}^M \int_{\mathcal{X}_m} \int_{\mathcal{X}_m} k_m(x_m, y_m)\, d\mu_m(x_m)\, d\mu_m(y_m),$

and the l.h.s. is positive iff each term on the r.h.s. is positive.
Having defined the ispd property, our goal is to investigate whether the characteristic or universal property of the $k_m$'s ($m \in [M]$) implies the different ispd properties of $\otimes_{m=1}^M k_m$, and vice versa.
4 Main Results
In this section, we present our main results related to the ispd property of tensor product kernels, which are summarized in Figure 1. The results in this section involve various assumptions on the $\mathcal{X}_m$'s, such as second-countability, Hausdorff, locally compact Hausdorff (LCH) and locally compact Polish (LCP), so that they are presented in more generality. However, for simplicity, all these assumptions can be unified by simply assuming the stronger condition that the $\mathcal{X}_m$'s are LCP.
Our first example illustrates that the characteristic property of the $k_m$'s does not imply the characteristic property of the tensor product kernel. In light of Remark 3(iv) of Section 3, it follows that the class of $\mathcal{I}$-characteristic tensor product kernels forms a strictly larger class than the characteristic tensor product kernels; see also Figure 2. {example} Let $k_1$ and $k_2$ be the kernels constructed in Section 5.1. It is easy to verify that $k_1$ and $k_2$ are characteristic. However, it can be proved that $k_1 \otimes k_2$ is not characteristic. On the other hand, interestingly, $k_1 \otimes k_2$ is $\mathcal{I}$-characteristic. We refer the reader to Section 5.1 for details.
In the above example, we showed that the tensor product of $k_1$ and $k_2$ (which are characteristic kernels) is $\mathcal{I}$-characteristic. The following result generalizes this behavior to any bounded characteristic kernels with $M = 2$. In addition, under a mild assumption, it shows the converse to be true for any $M \ge 2$.
Let $k_m$ be bounded kernels on topological spaces $\mathcal{X}_m$ for all $m \in [M]$, $M \ge 2$. Then the following hold.

Suppose $\mathcal{X}_m$ is second-countable for all $m \in [M]$ with $M = 2$. If $k_1$ and $k_2$ are characteristic, then $k_1 \otimes k_2$ is $\mathcal{I}$-characteristic.

Suppose $\mathcal{X}_m$ is Hausdorff for all $m \in [M]$. If $\otimes_{m=1}^M k_m$ is $\mathcal{I}$-characteristic, then $k_1, \ldots, k_M$ are characteristic.
Lyons (2013) has shown an analogous result to Theorem 4(i) for distance covariances ($M = 2$) on metric spaces of negative type (see Theorem 3.11 in Lyons, 2013), which via Sejdinovic et al. (2013b, Proposition 29) holds for HSIC, yielding the $\mathcal{I}$-characteristic property of $k_1 \otimes k_2$. Recently, Gretton (2015) presented a direct proof showing that HSIC corresponding to $k_1 \otimes k_2$ captures independence if $k_1$ and $k_2$ are translation invariant characteristic kernels on $\mathbb{R}^d$ (which is equivalent to $c_0$-universality). Blanchard et al. (2011) proved a result similar to Theorem 4(i) assuming that the $\mathcal{X}_m$'s are compact and $k_1$, $k_2$ are $c$-universal. In contrast, Theorem 4(i) establishes the result for bounded kernels on general second-countable topological spaces. In fact, the results of (Gretton, 2015; Blanchard et al., 2011) are special cases of the theorems below. Theorem 4(i) raises a pertinent question: whether $\otimes_{m=1}^M k_m$ is $\mathcal{I}$-characteristic if the $k_m$'s are characteristic for all $m \in [M]$, where $M \ge 3$. The following example provides a negative answer to this question. On the positive side, however, we will see below that the $\mathcal{I}$-characteristic property of $\otimes_{m=1}^M k_m$ can be guaranteed for any $M$ if a stronger condition is imposed on the $k_m$'s (and $\mathcal{X}_m$'s). Theorem 4(ii) generalizes Proposition 3.15 of (Lyons, 2013) to any $M \ge 2$, which states that every kernel $k_m$ being characteristic is necessary for the tensor kernel to be $\mathcal{I}$-characteristic.
Let $M = 3$ and let the kernels $k_1$, $k_2$, $k_3$ be as specified in Section 5.3. As mentioned in Example 4, these kernels are characteristic. However, it can be shown that $k_1 \otimes k_2 \otimes k_3$ is not $\mathcal{I}$-characteristic. See Section 5.3 for details.
In Remark 3(iii) and Example 4, we showed that in general only the $\otimes$-characteristic property of $\otimes_{m=1}^M k_m$ is equivalent to the characteristic property of the $k_m$'s. Our next result shows that all the various notions of the characteristic property of $\otimes_{m=1}^M k_m$ coincide if the $k_m$'s are translation-invariant, continuous bounded kernels on $\mathbb{R}^{d_m}$.
Suppose $k_m$ ($m \in [M]$) are continuous, bounded and translation-invariant kernels on $\mathbb{R}^{d_m}$. Then the following statements are equivalent:

the $k_m$'s are characteristic for all $m \in [M]$;

$\otimes_{m=1}^M k_m$ is characteristic;

$\otimes_{m=1}^M k_m$ is $\mathcal{I}$-characteristic;

$\otimes_{m=1}^M k_m$ is $\otimes$-characteristic.
The following result shows that on LCP spaces, the tensor product of $c_0$-universal kernels is also $c_0$-universal, and vice versa.
Suppose $k_m$ are $c_0$-kernels on LCP spaces $\mathcal{X}_m$ ($m \in [M]$). Then $\otimes_{m=1}^M k_m$ is $c_0$-universal iff the $k_m$'s are $c_0$-universal for all $m \in [M]$. {remark}

A special case of Theorem 4 for $M = 2$ is proved by Lyons (2013, Lemma 3.8) in the context of distance covariance, which reduces to Theorem 4 through the equivalence established by Sejdinovic et al. (2013b). Another special case of Theorem 4 is proved by Blanchard et al. (2011, Lemma 5.2) for $c$-universality with $M = 2$ using the Stone-Weierstrass theorem: if $k_1$ and $k_2$ are $c$-universal then $k_1 \otimes k_2$ is $c$-universal.

Since the notions of $c_0$-universality and the characteristic property are equivalent for translation invariant kernels on $\mathbb{R}^d$ (see Carmeli et al., 2010, Proposition 5.16 and Sriperumbudur et al., 2010, Theorem 9), Theorem 4 can be considered as a special case of Theorem 4. In other words, requiring the $k_m$'s to also be $c_0$-kernels in Theorem 4(i)-(iv) is equivalent to

the $k_m$'s are $c_0$-universal for all $m \in [M]$;

$\otimes_{m=1}^M k_m$ is $c_0$-universal.

In Example 4 and Theorem 4, we showed that for $M \ge 3$ components, while the characteristic property of the $k_m$'s is not sufficient, their universality is enough to guarantee the $\mathcal{I}$-characteristic property of $\otimes_{m=1}^M k_m$. The next example demonstrates that these results are tight: if at least one $k_m$ is not universal but only characteristic, then $\otimes_{m=1}^M k_m$ might not be $\mathcal{I}$-characteristic. {example} Let $M = 3$, let $k_1$ be the characteristic kernel of Example 4, and let $k_2$ and $k_3$ be as specified in Section 5.6. $k_1$ is characteristic (Example 4), and $k_2$ and $k_3$ are universal since the associated Gram matrix is an identity matrix, which is strictly positive definite. However, $k_1 \otimes k_2 \otimes k_3$ is not $\mathcal{I}$-characteristic. See Section 5.6 for details.
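As a quick sanity check of the tensor product construction used throughout: the product of two one-dimensional Gaussian kernels is exactly the two-dimensional Gaussian kernel on the product space, consistent with the result above that the tensor product of universal kernels is universal (the Gaussian kernel on $\mathbb{R}^d$ being a standard universal example). The snippet below verifies this algebraic identity at randomly drawn points; all names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
x, xp = rng.normal(size=2)   # a point pair on the first domain
y, yp = rng.normal(size=2)   # a point pair on the second domain

k1 = np.exp(-(x - xp)**2 / 2.0)            # Gaussian kernel on R
k2 = np.exp(-(y - yp)**2 / 2.0)            # Gaussian kernel on R
z, zp = np.array([x, y]), np.array([xp, yp])
k12 = np.exp(-np.sum((z - zp)**2) / 2.0)   # Gaussian kernel on R^2

print(bool(np.isclose(k1 * k2, k12)))  # True: k1 ⊗ k2 is the 2-D Gaussian kernel
```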
5 Proofs
In this section, we provide the proofs of our results presented in Section 4.
5.1 Proof of Example 4
The proof is structured as follows.

First we show that the base kernel is indeed a kernel and that it is characteristic.

Next it is proved that the tensor product kernel is not characteristic.

Finally, the $\mathcal{I}$-characteristic property of the tensor product kernel is established.
The individual steps are as follows:
The base kernel is a kernel. It is easy to verify that its Gram matrix $G$ can be written as $G = AA^\top$, where $A^\top$ denotes the transpose of $A$. Clearly, $G$ is positive semidefinite and so the base kernel is a kernel.
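The factorization argument above (a Gram matrix of the form $G = AA^\top$ is positive semidefinite) is easy to check numerically; the matrix `A` below is a hypothetical stand-in for illustration, not the one from the proof.

```python
import numpy as np

# Hypothetical factor A (illustrative only): any Gram matrix of the
# form G = A A^T is positive semidefinite, hence a valid kernel matrix.
A = np.array([[1.0, 0.0],
              [1.0, 1.0]])
G = A @ A.T

eigvals = np.linalg.eigvalsh(G)  # eigenvalues of the symmetric matrix G
print(bool(np.all(eigvals >= -1e-12)))  # True: G is positive semidefinite
```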
The base kernel is characteristic. We will show that it satisfies (4).
On the underlying domain, a finite signed measure $\mu$ takes the form $\mu = \sum_i a_i \delta_{x_i}$ for some coefficients $(a_i)$. Thus,
(7) 
Consider
(8) 
where we used (7) together with the preceding facts.
The tensor product kernel is not characteristic. We show a slightly stronger statement: it is not even ispd with respect to a restricted measure class; in other words, we construct a witness $\mu \neq 0$ such that
(9) 
and
(10) 
Finite signed measures on the product space take the form $\mu = \sum_{i,j} c_{ij}\, \delta_{(x_i, y_j)}$, where $c_{ij} \in \mathbb{R}$. With these notations, (9) and (10) can be rewritten as
Keeping the solutions where neither nor is the zero vector, there are 2 (symmetric) possibilities: (i) , and (ii) , . In other words, for any , the possibilities are (i) , and (ii) , .
This establishes the nonispd property of .
The tensor product kernel is $\mathcal{I}$-characteristic.
Our goal is to show that for any joint probability measure $\mathbb{P}$ on the product space, $\mathrm{HSIC}(\mathbb{P}) = 0$ implies $\mathbb{P} = \mathbb{P}_1 \otimes \mathbb{P}_2$,
where $\mathbb{P}_1$ and $\mathbb{P}_2$ denote the marginals of $\mathbb{P}$. We divide the proof into two parts:

First we derive the equations of
(11) for general finite signed measures on the product space.

Then, we apply the parametrization and solve for the measures satisfying (11) to conclude that $\mathbb{P} = \mathbb{P}_1 \otimes \mathbb{P}_2$. Note that in the chosen parametrization for $\mathbb{P}$, the normalization constraint holds automatically.
The details are as follows.
Step 1.
(12)  
(13) 
(14) 
Step 2. Any can be parametrized as
(15) 
Let ; for illustration see Table 1.