# Characteristic and Universal Tensor Product Kernels

Zoltán Szabó (zoltan.szabo@polytechnique.edu)
Route de Saclay, 91128 Palaiseau, France

Bharath K. Sriperumbudur (bks18@psu.edu)
Pennsylvania State University
314 Thomas Building
University Park, PA 16802

###### Abstract

Maximum mean discrepancy (MMD), also called energy distance or N-distance in statistics, and the Hilbert-Schmidt independence criterion (HSIC), specifically distance covariance in statistics, are among the most popular and successful approaches to quantify the difference and independence of random variables, respectively. Thanks to their kernel-based foundations, MMD and HSIC are applicable to a wide variety of domains. Despite their tremendous success, little is known about when HSIC characterizes independence and when MMD with a tensor product kernel can discriminate probability distributions. In this paper, we answer these questions by studying various notions of the characteristic property of the tensor product kernel.


Editor: Le Song

Keywords: tensor product kernel, kernel mean embedding, characteristic kernel, $\mathcal{I}$-characteristic kernel, universality, maximum mean discrepancy, Hilbert-Schmidt independence criterion

## 1 Introduction

Kernel methods (Schölkopf and Smola, 2002) are among the most flexible and influential tools in machine learning and statistics, with superior performance demonstrated in a large number of areas and applications. The key idea in these methods is to map the data samples into a possibly infinite-dimensional feature space, precisely a reproducing kernel Hilbert space (RKHS) (Aronszajn, 1950), and to apply linear methods in the feature space without the explicit need to compute the map. A generalization of this idea to probability measures, i.e., mapping probability measures into an RKHS (Berlinet and Thomas-Agnan, 2004, Chapter 4; Smola et al., 2007), has found novel applications in nonparametric statistics and machine learning. Formally, given a probability measure $\mathbb{P}$ defined on a measurable space $\mathcal{X}$ and an RKHS $\mathcal{H}_k$ with $k$ as the reproducing kernel (which is symmetric and positive definite), $\mathbb{P}$ is embedded into $\mathcal{H}_k$ as

$$\mathbb{P} \mapsto \int_{\mathcal{X}} k(\cdot, x)\,\mathrm{d}\mathbb{P}(x) =: \mu_k(\mathbb{P}), \tag{1}$$

where $\mu_k(\mathbb{P})$ is called the mean element or kernel mean embedding of $\mathbb{P}$. The mean embedding of $\mathbb{P}$ has led to a new generation of solutions in two-sample testing (Baringhaus and Franz, 2004; Székely and Rizzo, 2004, 2005; Borgwardt et al., 2006; Harchaoui et al., 2007; Gretton et al., 2012), domain adaptation (Zhang et al., 2013) and generalization (Blanchard et al., 2017), kernel belief propagation (Song et al., 2011), kernel Bayes’ rule (Fukumizu et al., 2013), model criticism (Lloyd et al., 2014; Kim et al., 2016), approximate Bayesian computation (Park et al., 2016), probabilistic programming (Schölkopf et al., 2015), distribution classification (Muandet et al., 2011; Zaheer et al., 2017), distribution regression (Szabó et al., 2016; Law et al., 2018) and topological data analysis (Kusano et al., 2016). For a recent survey on the topic, the reader is referred to (Muandet et al., 2017).

Crucial to the success of the mean embedding based representation is whether it encodes all the information about the distribution, in other words whether the map in (1) is injective, in which case the kernel is referred to as characteristic (Fukumizu et al., 2008; Sriperumbudur et al., 2010). Various characterizations of the characteristic property of $k$ are known in the literature (e.g., see Fukumizu et al., 2008, 2009; Gretton et al., 2012; Sriperumbudur et al., 2010), using which the popular kernels on $\mathbb{R}^d$ such as the Gaussian, Laplacian, B-spline, inverse multiquadrics, and the Matérn class are shown to be characteristic. The characteristic property is closely related to the notion of universality (Steinwart, 2001; Micchelli et al., 2006; Carmeli et al., 2010; Sriperumbudur et al., 2011): $k$ is said to be universal if the corresponding RKHS $\mathcal{H}_k$ is dense in a certain target function class, for example, the class of continuous functions on compact domains. The relation between these notions has recently been explored in (Sriperumbudur et al., 2011; Simon-Gabriel and Schölkopf, 2016).

Based on the mean embedding in (1), Smola et al. (2007) and Gretton et al. (2012) defined a semi-metric, called the maximum mean discrepancy (MMD) on the space of probability measures:

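$$\mathrm{MMD}_k(\mathbb{P}, \mathbb{Q}) := \left\| \mu_k(\mathbb{P}) - \mu_k(\mathbb{Q}) \right\|_{\mathcal{H}_k},$$

which is a metric iff $k$ is characteristic. A fundamental application of MMD is in non-parametric hypothesis testing that includes two-sample (Gretton et al., 2012) and independence tests (Gretton et al., 2008). Particularly in independence testing, as a measure of independence, MMD measures the distance between the joint distribution $\mathbb{P}_{XY}$ and the product of marginals $\mathbb{P}_X \otimes \mathbb{P}_Y$ of two random variables $X$ and $Y$ which are respectively defined on measurable spaces $\mathcal{X}$ and $\mathcal{Y}$, with the kernel $k$ being defined on $\mathcal{X} \times \mathcal{Y}$. As aforementioned, if $k$ is characteristic, then $\mathrm{MMD}_k(\mathbb{P}_{XY}, \mathbb{P}_X \otimes \mathbb{P}_Y) = 0$ implies $\mathbb{P}_{XY} = \mathbb{P}_X \otimes \mathbb{P}_Y$, i.e., $X$ and $Y$ are independent. A simple way to define a kernel on $\mathcal{X} \times \mathcal{Y}$ is through the tensor product of kernels $k_X$ and $k_Y$ defined on $\mathcal{X}$ and $\mathcal{Y}$ respectively: $k = k_X \otimes k_Y$, i.e., $k\big((x,y),(x',y')\big) = k_X(x,x')\,k_Y(y,y')$, with the corresponding RKHS $\mathcal{H}_k = \mathcal{H}_{k_X} \otimes \mathcal{H}_{k_Y}$ being the tensor product space generated by $\mathcal{H}_{k_X}$ and $\mathcal{H}_{k_Y}$. This means, when $k = k_X \otimes k_Y$,

$$\mathrm{MMD}_k(\mathbb{P}_{XY}, \mathbb{P}_X \otimes \mathbb{P}_Y) = \left\| \mu_{k_X \otimes k_Y}(\mathbb{P}_{XY}) - \mu_{k_X \otimes k_Y}(\mathbb{P}_X \otimes \mathbb{P}_Y) \right\|_{\mathcal{H}_{k_X} \otimes \mathcal{H}_{k_Y}}. \tag{2}$$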

In addition to the simplicity of defining a joint kernel $k$ on $\mathcal{X} \times \mathcal{Y}$, the tensor product kernel offers a principled way of combining inner products ($k_X$ and $k_Y$) on domains that can correspond to different modalities (say images, texts, audio). By exploiting the isomorphism between tensor product Hilbert spaces and the space of Hilbert-Schmidt operators (footnote 1: in the equivalence one assumes that $\mathcal{H}_{k_X}$, $\mathcal{H}_{k_Y}$ are separable; this holds under mild conditions, for example if $\mathcal{X}$ and $\mathcal{Y}$ are separable topological domains and $k_X$, $k_Y$ are continuous (Steinwart and Christmann, 2008, Lemma 4.33)), it follows from (2) that

$$\mathrm{MMD}_k(\mathbb{P}_{XY}, \mathbb{P}_X \otimes \mathbb{P}_Y) = \left\| C_{XY} \right\|_{\mathrm{HS}} =: \mathrm{HSIC}_k(\mathbb{P}_{XY}), \tag{3}$$

which is the Hilbert-Schmidt norm of the cross-covariance operator $C_{XY}$ and is known as the Hilbert-Schmidt independence criterion (HSIC) (Gretton et al., 2005). HSIC has enjoyed tremendous success in a variety of applications such as independent component analysis (Gretton et al., 2005), feature selection (Song et al., 2012), independence testing (Gretton et al., 2008; Jitkrittum et al., 2017), post selection inference (Yamada et al., 2018) and causal detection (Mooij et al., 2016; Pfister et al., 2017; Strobl et al., 2017). Recently, MMD and HSIC (as defined in (3) for two components) have been shown by Sejdinovic et al. (2013b) to be equivalent to other popular statistical measures such as the energy distance (Baringhaus and Franz, 2004; Székely and Rizzo, 2004, 2005), also known as N-distance (Zinger et al., 1992; Klebanov, 2005), and distance covariance (Székely et al., 2007; Székely and Rizzo, 2009; Lyons, 2013) respectively. HSIC has been generalized to $M \geq 2$ components (Quadrianto et al., 2009; Sejdinovic et al., 2013a) to measure the joint independence of $M$ random variables

$$\mathrm{HSIC}_k(\mathbb{P}) = \left\| \mu_{\otimes_{m=1}^M k_m}(\mathbb{P}) - \otimes_{m=1}^M \mu_{k_m}(\mathbb{P}_m) \right\|_{\otimes_{m=1}^M \mathcal{H}_{k_m}},$$

where $\mathbb{P}$ is a joint measure on the product space $\times_{m=1}^M \mathcal{X}_m$ and $(\mathbb{P}_m)_{m=1}^M$ are the marginal measures of $\mathbb{P}$ defined on $(\mathcal{X}_m)_{m=1}^M$ respectively. The extended HSIC measure has recently been analyzed in the context of independence testing (Pfister et al., 2017). In addition to testing, the extended HSIC measure is also useful in the problem of independent subspace analysis (ISA) (Cardoso, 1998), wherein the latent sources are separated by maximizing the degree of independence among them. In all the applications of HSIC, the key requirement is that $k = \otimes_{m=1}^M k_m$ captures the joint independence of the $M$ random variables (with joint distribution $\mathbb{P}$). We call this property $\mathcal{I}$-characteristic; it is guaranteed if $k$ is characteristic. Since $k$ is defined in terms of $(k_m)_{m=1}^M$, it is of fundamental importance to understand the characteristic and $\mathcal{I}$-characteristic properties of $k$ in terms of the characteristic property of $(k_m)_{m=1}^M$, which is one of the main goals of this work.
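To make the relation between (2) and (3) concrete for two components, the following sketch compares two plug-in quantities on a sample: the squared MMD between the empirical joint measure and the product of the empirical marginals under the tensor product kernel, and the familiar Gram-matrix form $\frac{1}{n^2}\operatorname{tr}(KHLH)$ of the squared HSIC. The Gaussian kernels, the bandwidth, the toy data and the function names are assumptions made purely for illustration; they are not taken from the paper.

```python
import numpy as np

def gaussian_gram(x, sigma=1.0):
    """Gram matrix of a Gaussian kernel (an illustrative kernel choice)."""
    x = np.asarray(x, dtype=float).reshape(len(x), -1)
    sq_dists = np.sum((x[:, None, :] - x[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq_dists / (2.0 * sigma ** 2))

def hsic2_gram_form(K, L):
    """Squared HSIC plug-in estimate: trace(K H L H) / n^2, H = centering matrix."""
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    return np.trace(K @ H @ L @ H) / n ** 2

def mmd2_joint_vs_product(K, L):
    """Squared MMD between the empirical joint and the product of the empirical
    marginals, using the tensor product kernel k((x,y),(x',y')) = kX(x,x') kY(y,y')."""
    n = K.shape[0]
    joint_joint = np.sum(K * L) / n ** 2
    joint_prod = np.sum(K.sum(axis=1) * L.sum(axis=1)) / n ** 3
    prod_prod = K.sum() * L.sum() / n ** 4
    return joint_joint - 2.0 * joint_prod + prod_prod

rng = np.random.default_rng(0)
x = rng.normal(size=50)
y = x + 0.1 * rng.normal(size=50)   # strongly dependent toy data
K, L = gaussian_gram(x), gaussian_gram(y)
hsic2 = hsic2_gram_form(K, L)
mmd2 = mmd2_joint_vs_product(K, L)
```

On any sample the two numbers agree up to floating-point error, which is the finite-sample counterpart of the identity above; both estimate the square of the quantity in (3).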

For $M = 2$, the characterization of independence, i.e., the $\mathcal{I}$-characteristic property of $k_1 \otimes k_2$, is studied by Blanchard et al. (2011) and Gretton (2015), where it has been shown that if $k_1$ and $k_2$ are universal, then $k_1 \otimes k_2$ is universal and therefore HSIC captures independence. (Footnote 2: Blanchard et al. (2011) deal with $c$-universal kernels while Gretton (2015) deals with $c_0$-universal kernels. A brief description of these notions is provided in Section 3. We refer the reader to (Carmeli et al., 2010; Sriperumbudur et al., 2010) for more details on these notions of universality.) A stronger version of this result can be obtained by combining (Lyons, 2013, Theorem 3.11) and (Sejdinovic et al., 2013b, Proposition 29): if $k_1$ and $k_2$ are characteristic, then the HSIC associated with $k_1 \otimes k_2$ characterizes independence. Apart from these results, not much is known about the characteristic/$\mathcal{I}$-characteristic/universality properties of $\otimes_{m=1}^M k_m$ in terms of the individual kernels. Our goal is to resolve this question and understand the characteristic, $\mathcal{I}$-characteristic and universal properties of the product kernel ($\otimes_{m=1}^M k_m$) in terms of the kernel components ($(k_m)_{m=1}^M$) for $M \geq 2$. Because of the relatedness of MMD and HSIC to energy distance and distance covariance, our results also contribute to a better understanding of these other measures that are popular in the statistical literature.

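Specifically, our results shed light on the following surprising phenomena of the $\mathcal{I}$-characteristic property of $\otimes_{m=1}^M k_m$ for $M \geq 3$:

1. the characteristic property of $(k_m)_{m=1}^M$ is necessary but not sufficient for $\otimes_{m=1}^M k_m$ to be $\mathcal{I}$-characteristic;

2. universality of $(k_m)_{m=1}^M$ is sufficient for $\otimes_{m=1}^M k_m$ to be $\mathcal{I}$-characteristic, and

3. if at least one of $(k_m)_{m=1}^M$ is only characteristic and not universal, then $\otimes_{m=1}^M k_m$ need not be $\mathcal{I}$-characteristic.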

The paper is organized as follows. In Section 3, we conduct a comprehensive analysis of the above mentioned properties of $k$ and $(k_m)_{m=1}^M$ for any positive integer $M$. To this end, we define various notions of characteristic property on the product space (see Definition 3 and Figure 2(a) in Section 3) and explore the relation between them. To keep the presentation in this section non-technical, we relegate the problem formulation to Section 3, with the main results of the paper being presented in Section 4. A summary of the results is captured in Figure 1, while the proofs are provided in Section 5. Various definitions and notation that are used throughout the paper are collected in Section 2.

## 2 Definitions & Notation

and denotes the set of natural numbers and real numbers respectively. For , . and denotes the matrix of zeros. For and , is the Euclidean inner product. For sets and , is their difference, is the cardinality of and is the Descartes product of sets . denotes the power set of a set , i.e., all subsets of (including the empty set and ). The Kronecker delta is defined as if , and zero otherwise. is the indicator function of set : if and otherwise. is the set of -sized tensors.

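For a topological space $(\mathcal{X}, \tau_{\mathcal{X}})$, $\mathcal{B}(\mathcal{X})$ is the Borel sigma-algebra on $\mathcal{X}$ induced by the topology $\tau_{\mathcal{X}}$. Probability and finite signed measures in the paper are meant w.r.t. the measurable space $(\mathcal{X}, \mathcal{B}(\mathcal{X}))$. Given topological spaces $(\mathcal{X}_m, \tau_{\mathcal{X}_m})_{m=1}^M$, their product $\times_{m=1}^M \mathcal{X}_m$ is enriched with the product topology; it is the coarsest topology for which the canonical projections onto $\mathcal{X}_m$ are continuous for all $m$. A topological space $(\mathcal{X}, \tau_{\mathcal{X}})$ is called second-countable if $\tau_{\mathcal{X}}$ has a countable basis. (Footnote 3: Second-countability implies separability; in metric spaces the two notions coincide (Dudley, 2004, Proposition 2.1.4). By Urysohn’s theorem, a topological space is separable and metrizable if and only if it is regular, Hausdorff and second-countable. Any uncountable discrete space is not second-countable.) $C(\mathcal{X})$ denotes the space of continuous functions on $\mathcal{X}$. $C_0(\mathcal{X})$ denotes the class of real-valued continuous functions vanishing at infinity on a locally compact Hausdorff (LCH) space $\mathcal{X}$, i.e., for any $\epsilon > 0$, the set $\{x \in \mathcal{X} : |f(x)| \geq \epsilon\}$ is compact. (Footnote 4: LCH spaces include $\mathbb{R}^d$, discrete spaces, and topological manifolds. Open or closed subsets and finite products of LCH spaces are LCH. Infinite-dimensional Hilbert spaces are not LCH.) $C_0(\mathcal{X})$ is endowed with the uniform norm $\|f\|_\infty = \sup_{x \in \mathcal{X}} |f(x)|$. $\mathcal{M}_b(\mathcal{X})$ and $\mathcal{M}_1^+(\mathcal{X})$ are the spaces of finite signed measures and probability measures on $\mathcal{X}$, respectively. For $\mathbb{P}_m \in \mathcal{M}_1^+(\mathcal{X}_m)$, $\otimes_{m=1}^M \mathbb{P}_m$ denotes the product probability measure on the product space $\times_{m=1}^M \mathcal{X}_m$, i.e., $(\otimes_{m=1}^M \mathbb{P}_m)(\times_{m=1}^M A_m) = \prod_{m=1}^M \mathbb{P}_m(A_m)$ for $A_m \in \mathcal{B}(\mathcal{X}_m)$. $\delta_x$ is the Dirac measure supported on $x \in \mathcal{X}$. For $F \in \mathcal{M}_b(\times_{m=1}^M \mathcal{X}_m)$, the finite signed measure $F_m$ denotes its marginal on $\mathcal{X}_m$. $\mathcal{H}_k$ is the reproducing kernel Hilbert space (RKHS) associated with the reproducing kernel $k$, which in this paper is assumed to be measurable and bounded. The tensor product of $(k_m)_{m=1}^M$ is a kernel, defined as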

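$$\left(\otimes_{m=1}^M k_m\right)\big((x_1, \ldots, x_M), (x_1', \ldots, x_M')\big) = \prod_{m=1}^M k_m(x_m, x_m'), \quad x_m, x_m' \in \mathcal{X}_m,$$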

whose associated RKHS is denoted as (Berlinet and Thomas-Agnan, 2004, Theorem 13), where the r.h.s. is the tensor product of RKHSs . For , , the multi-linear operator is defined as

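A kernel $k$ defined on an LCH space $\mathcal{X}$ is called a $c_0$-kernel if $k(\cdot, x) \in C_0(\mathcal{X})$ for all $x \in \mathcal{X}$. $k$ is said to be a translation invariant kernel on $\mathbb{R}^d$ if $k(x, x') = \psi(x - x')$ for a positive definite function $\psi$. $\mu_k(F)$ denotes the kernel mean embedding of $F \in \mathcal{M}_b(\mathcal{X})$ to $\mathcal{H}_k$, which is defined as $\mu_k(F) = \int_{\mathcal{X}} k(\cdot, x)\,\mathrm{d}F(x)$, where the integral is meant in the Bochner sense.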

## 3 Problem Formulation

In this section, we formally introduce the goal of the paper. To this end, we start with a definition. For simplicity, throughout the paper, we assume that all kernels are bounded. The definition is based on the observation (Sriperumbudur et al., 2010, Lemma 8) that a bounded kernel $k$ on a topological space $\mathcal{X}$ is characteristic if and only if

$$\int_{\mathcal{X}} \int_{\mathcal{X}} k(x, x')\,\mathrm{d}F(x)\,\mathrm{d}F(x') > 0, \quad \forall F \in \mathcal{M}_b(\mathcal{X}) \setminus \{0\} \text{ such that } F(\mathcal{X}) = 0.$$

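In other words, characteristic kernels are integrally strictly positive definite (ispd) (see Sriperumbudur et al., 2010, p. 1523) w.r.t. the class of finite signed measures that assign zero measure to $\mathcal{X}$. The following definition extends this observation to tensor product kernels on product spaces.

Definition 3 ($\mathcal{F}$-ispd tensor product kernel). Suppose $k_m$ is a bounded kernel on a topological space $\mathcal{X}_m$ for $m = 1, \ldots, M$. Let $\mathcal{F} \subseteq \mathcal{M}_b(\mathcal{X})$ be such that $0 \in \mathcal{F}$, where $\mathcal{X} := \times_{m=1}^M \mathcal{X}_m$. $k := \otimes_{m=1}^M k_m$ is said to be $\mathcal{F}$-ispd if

$$\mu_k(F) = 0 \Rightarrow F = 0 \ (F \in \mathcal{F}), \text{ or equivalently } \left\| \mu_k(F) \right\|^2_{\mathcal{H}_k} = \int_{\times_{m=1}^M \mathcal{X}_m} \int_{\times_{m=1}^M \mathcal{X}_m} \left(\otimes_{m=1}^M k_m\right)(x, x')\,\mathrm{d}F(x)\,\mathrm{d}F(x') > 0, \quad \forall F \in \mathcal{F} \setminus \{0\}. \tag{4}$$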

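Specifically,

• if $k_m$-s are $c_0$-kernels on locally compact Polish (LCP) spaces $\mathcal{X}_m$-s and $\mathcal{F} = \mathcal{M}_b(\mathcal{X})$, then $k$ is called $c_0$-universal. (Footnote 5: A topological space is called Polish if it is complete, separable and metrizable. For example, $\mathbb{R}^d$ and countable discrete spaces are Polish. Open and closed subsets, products and disjoint unions of countably many Polish spaces are Polish. Every second-countable LCH space is Polish.)

• if

$$\mathcal{F} = [\mathcal{M}_b(\mathcal{X})]^0 := \{F \in \mathcal{M}_b(\mathcal{X}) : F(\mathcal{X}) = 0\},$$
$$\mathcal{F} = \mathcal{I} := \left\{\mathbb{P} - \otimes_{m=1}^M \mathbb{P}_m : \mathbb{P} \in \mathcal{M}_1^+\left(\times_{m=1}^M \mathcal{X}_m\right)\right\} \quad (M \geq 2),$$
$$\mathcal{F} = \otimes_{m=1}^M \mathcal{M}_b^0(\mathcal{X}_m) := \left\{\otimes_{m=1}^M F_m : F_m \in \mathcal{M}_b(\mathcal{X}_m),\ F_m(\mathcal{X}_m) = 0,\ \forall m\right\},$$

then $k$ is called characteristic, $\mathcal{I}$-characteristic and $\otimes_0$-characteristic, respectively.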

In Definition 3, $k$ being characteristic matches the usual notion of characteristic kernels on a product space, i.e., there are no two distinct probability measures on $\mathcal{X} = \times_{m=1}^M \mathcal{X}_m$ such that the MMD between them is zero. The other notions such as $\mathcal{I}$-characteristic and $\otimes_0$-characteristic are typically weaker than the usual characteristic property since

 (5)

Below we provide further intuition on the measure classes listed in Definition 3.

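Remark 3.
• (i) $\mathcal{F} = \mathcal{M}_b(\mathcal{X})$: If $k_m$-s are $c_0$-kernels on LCH spaces $\mathcal{X}_m$ for all $m$, then $\otimes_{m=1}^M k_m$ is also a $c_0$-kernel on the LCH space $\times_{m=1}^M \mathcal{X}_m$, implying that if $\otimes_{m=1}^M k_m$ satisfies (4), then $\otimes_{m=1}^M k_m$ is $c_0$-universal (see Sriperumbudur et al., 2010, Proposition 2). It is well known that $c_0$-universality reduces to $c$-universality (i.e., the notion of universality proposed by Steinwart, 2001) if $\mathcal{X}$ is compact (see Sriperumbudur et al., 2010 for details), which is guaranteed if and only if each $\mathcal{X}_m$ is compact.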

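• (ii) $\mathcal{F} = \mathcal{I}$: This family is useful to describe the joint independence of $M$ random variables (hence the name $\mathcal{I}$-characteristic) defined on kernel-endowed domains $(\mathcal{X}_m)_{m=1}^M$: If $\mathbb{P}$ denotes the joint distribution of the random variables and $(\mathbb{P}_m)_{m=1}^M$ are the associated marginals on $(\mathcal{X}_m)_{m=1}^M$, then by definition $k = \otimes_{m=1}^M k_m$ is $\mathcal{I}$-characteristic iff

$$\mathrm{HSIC}_k(\mathbb{P}) = 0 \iff \mathbb{P} = \otimes_{m=1}^M \mathbb{P}_m.$$

In other words, HSIC captures joint independence exactly with $\mathcal{I}$-characteristic kernels.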

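• (iii) $\mathcal{F} = \otimes_{m=1}^M \mathcal{M}_b^0(\mathcal{X}_m)$: In this case $\mathcal{F}$ is chosen to be the product of finite signed measures on $(\mathcal{X}_m)_{m=1}^M$ such that each marginal measure $F_m$ assigns zero to the corresponding space $\mathcal{X}_m$. This choice is relevant as the characteristic property of the individual kernels $(k_m)_{m=1}^M$ need not imply the characteristic property of $\otimes_{m=1}^M k_m$, but is equivalent to the $\otimes_0$-characteristic property of $\otimes_{m=1}^M k_m$. The equivalence holds for bounded kernels $k_m$ on topological spaces $\mathcal{X}_m$ ($m = 1, \ldots, M$) since for any $F = \otimes_{m=1}^M F_m$ with $F_m \in \mathcal{M}_b(\mathcal{X}_m)$,

$$\left\| \mu_k(F) \right\|^2_{\mathcal{H}_{\otimes_{m=1}^M k_m}} = \prod_{m=1}^M \left\| \mu_{k_m}(F_m) \right\|^2_{\mathcal{H}_{k_m}},$$

and the l.h.s. is positive iff each term on the r.h.s. is positive.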

• (iv) $\mathcal{F}$-ispd relations: Given the relations in (5), it immediately follows that $k = \otimes_{m=1}^M k_m$ satisfies

 (6)

when $\mathcal{X}_m$ for all $m$ are LCP. A visual illustration of (5) and (6) is provided in Figure 2.
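The factorization of $\|\mu_k(F)\|^2$ over product measures, used in the remark above, can be checked numerically on finite domains, where a bounded kernel is simply a positive semidefinite Gram matrix and a finite signed measure is a weight vector. In the sketch below the random Gram matrices, the domain sizes (3 and 4 points) and the helper name `random_psd` are arbitrary illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_psd(n):
    """A random positive semidefinite matrix, standing in for a kernel on n points."""
    A = rng.normal(size=(n, n))
    return A @ A.T

K1, K2 = random_psd(3), random_psd(4)

# Weights of signed measures F1, F2 with zero total mass: F_m(X_m) = 0.
f1 = rng.normal(size=3)
f1 -= f1.mean()
f2 = rng.normal(size=4)
f2 -= f2.mean()

# The product measure F1 (x) F2 and the tensor product kernel are Kronecker products.
f = np.kron(f1, f2)
lhs = f @ np.kron(K1, K2) @ f               # squared norm of the embedding of F1 (x) F2
rhs = (f1 @ K1 @ f1) * (f2 @ K2 @ f2)       # product of the marginal squared norms
```

Here `lhs` and `rhs` coincide up to rounding, which is the mixed-product property of the Kronecker product in disguise.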

Having defined the $\mathcal{F}$-ispd property, our goal is to investigate whether the characteristic or $c_0$-universal property of $k_m$-s ($m = 1, \ldots, M$) implies different $\mathcal{F}$-ispd properties of $\otimes_{m=1}^M k_m$, and vice versa.

## 4 Main Results

In this section, we present our main results related to the $\mathcal{F}$-ispd property of tensor product kernels, which are summarized in Figure 1. The results in this section deal with various assumptions on $\mathcal{X}_m$, such as second-countability, Hausdorff, locally compact Hausdorff (LCH) and locally compact Polish (LCP), so that they are presented in more generality. However, for simplicity, all these assumptions can be unified by assuming the stronger condition that the $\mathcal{X}_m$’s are LCP.

Our first example illustrates that the characteristic property of $k_m$-s does not imply the characteristic property of the tensor product kernel. In light of Remark 3(iv) of Section 3, it follows that the class of $\otimes_0$-characteristic tensor product kernels forms a strictly larger class than characteristic tensor product kernels; see also Figure 2.

Example. Let $\mathcal{X}_1 = \mathcal{X}_2 = \{1, 2\}$, $\tau_{\mathcal{X}_1} = \tau_{\mathcal{X}_2} = \mathcal{P}(\{1, 2\})$, $k_1(x, x') = k_2(x, x') = 2\delta_{x,x'} - 1$. It is easy to verify that $k_1$ and $k_2$ are characteristic. However, it can be proved that $k_1 \otimes k_2$ is not characteristic. On the other hand, interestingly, $k_1 \otimes k_2$ is $\mathcal{I}$-characteristic. We refer the reader to Section 5.1 for details.

In the above example, we showed that the tensor product of $k_1$ and $k_2$ (which are characteristic kernels) is $\mathcal{I}$-characteristic. The following result generalizes this behavior to any bounded characteristic kernels. In addition, under a mild assumption, it shows the converse to be true for any $M$.

{theorem}

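Let $k_m$ be bounded kernels on topological spaces $\mathcal{X}_m$ for all $m = 1, \ldots, M$, $M \geq 2$. Then the following holds.

• (i) Suppose $\mathcal{X}_m$ is second-countable for all $m$ with $M = 2$. If $k_1$ and $k_2$ are characteristic, then $k_1 \otimes k_2$ is $\mathcal{I}$-characteristic.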

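• (ii) Suppose $\mathcal{X}_m$ is Hausdorff and $|\mathcal{X}_m| \geq 2$ for all $m$. If $\otimes_{m=1}^M k_m$ is $\mathcal{I}$-characteristic, then $k_1, \ldots, k_M$ are characteristic.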

Lyons (2013) has shown an analogous result to Theorem 4(i) for distance covariances ($M = 2$) on metric spaces of negative type (see Theorem 3.11 in Lyons, 2013), which via Sejdinovic et al. (2013b, Proposition 29) holds for HSIC, yielding the $\mathcal{I}$-characteristic property of $k_1 \otimes k_2$. Recently, Gretton (2015) presented a direct proof showing that HSIC corresponding to $k_1 \otimes k_2$ captures independence if $k_1$ and $k_2$ are translation invariant characteristic kernels on $\mathbb{R}^d$ (which is equivalent to $c_0$-universality). Blanchard et al. (2011) proved a result similar to Theorem 4(i) assuming that the $\mathcal{X}_m$’s are compact and $k_1$, $k_2$ are $c$-universal. In contrast, Theorem 4(i) establishes the result for bounded kernels on general second-countable topological spaces. In fact, the results of (Gretton, 2015; Blanchard et al., 2011) are special cases of Theorems 4 and 4 below. Theorem 4(i) raises a pertinent question: is $\otimes_{m=1}^M k_m$ $\mathcal{I}$-characteristic if $k_m$-s are characteristic for all $m$, where $M > 2$? The following example provides a negative answer to this question. On the positive side, however, we will see in Theorem 4 that the $\mathcal{I}$-characteristic property of $\otimes_{m=1}^M k_m$ can be guaranteed for any $M \geq 2$ if a stronger condition is imposed on $k_m$-s (and $\mathcal{X}_m$-s). Theorem 4(ii) generalizes Proposition 3.15 of (Lyons, 2013) to any $M \geq 2$; it states that every kernel $k_m$ being characteristic is necessary for the tensor kernel $\otimes_{m=1}^M k_m$ to be $\mathcal{I}$-characteristic.

{example}

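Let $M = 3$ and $\mathcal{X}_m = \{1, 2\}$, $\tau_{\mathcal{X}_m} = \mathcal{P}(\mathcal{X}_m)$, $k_m(x, x') = 2\delta_{x,x'} - 1$ ($m = 1, 2, 3$). As mentioned in Example 4, $(k_m)_{m=1}^3$ are characteristic. However, it can be shown that $\otimes_{m=1}^3 k_m$ is not $\mathcal{I}$-characteristic. See Section 5.3 for details.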
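The failure for three components can be observed numerically. The sketch below assumes the kernel $k(x, x') = 2\delta_{x,x'} - 1$ on $\{1, 2\}$, whose values $k(1,1) = k(2,2) = 1$ and $k(1,2) = -1$ are the ones used in (8). The joint distribution is a counterexample chosen here for illustration (first coordinate uniform, second coordinate equal to the first, third coordinate independent and uniform); it is not necessarily the construction used in Section 5.3, and the helper name `marginal` is hypothetical.

```python
import itertools
import numpy as np

s = np.array([1.0, -1.0])
K = np.outer(s, s)          # Gram matrix of k(x, x') = 2*delta_{x,x'} - 1 on {1, 2}

# Joint law on {1,2}^3 (indices 0/1): X1 uniform, X2 = X1, X3 independent uniform.
P = np.zeros((2, 2, 2))
for i, j, l in itertools.product(range(2), repeat=3):
    if i == j:
        P[i, j, l] = 0.25

def marginal(P, axis):
    """Marginal of the joint table P on the given coordinate."""
    other = tuple(a for a in range(P.ndim) if a != axis)
    return P.sum(axis=other)

p1, p2, p3 = (marginal(P, m) for m in range(3))
F = (P - np.einsum("i,j,l->ijl", p1, p2, p3)).ravel()   # P minus product of marginals

# Squared HSIC with three components: quadratic form of F under k1 (x) k2 (x) k3.
hsic2_three = F @ np.kron(np.kron(K, K), K) @ F

# Squared HSIC of the pair (X1, X2) alone, under k1 (x) k2.
F12 = (P.sum(axis=2) - np.outer(p1, p2)).ravel()
hsic2_pair = F12 @ np.kron(K, K) @ F12
```

Although `F` is nonzero (the coordinates are not jointly independent), `hsic2_three` vanishes, whereas `hsic2_pair` is strictly positive, in line with Theorem 4(i) for two components.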

In Remark 3(iii) and Example 4, we showed that in general, only the $\otimes_0$-characteristic property of $\otimes_{m=1}^M k_m$ is equivalent to the characteristic property of $k_m$-s. Our next result shows that all the various notions of characteristic property of $\otimes_{m=1}^M k_m$ coincide if $k_m$-s are translation-invariant, continuous bounded kernels on $\mathbb{R}^{d_m}$.

{theorem}

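Suppose $k_m$ are continuous, bounded and translation-invariant kernels on $\mathbb{R}^{d_m}$ for all $m = 1, \ldots, M$. Then the following statements are equivalent:

1. $k_m$-s are characteristic for all $m$;

2. $\otimes_{m=1}^M k_m$ is $\otimes_0$-characteristic;

3. $\otimes_{m=1}^M k_m$ is $\mathcal{I}$-characteristic;

4. $\otimes_{m=1}^M k_m$ is characteristic.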

The following result shows that on LCP spaces, the tensor product of $c_0$-universal kernels is also $c_0$-universal, and vice versa.

{theorem}

Suppose $k_m$ are $c_0$-kernels on LCP spaces $\mathcal{X}_m$ ($m = 1, \ldots, M$). Then $\otimes_{m=1}^M k_m$ is $c_0$-universal iff $k_m$-s are $c_0$-universal for all $m$.

Remark.

• A special case of Theorem 4 for $M = 2$ is proved by Lyons (2013, Lemma 3.8) in the context of distance covariance, which reduces to Theorem 4 through the equivalence established by Sejdinovic et al. (2013b). Another special case of Theorem 4 is proved by Blanchard et al. (2011, Lemma 5.2) for $c$-universality with $M = 2$ using the Stone-Weierstrass theorem: if $k_1$ and $k_2$ are $c$-universal then $k_1 \otimes k_2$ is $c$-universal.

• Since the notions of $c_0$-universality and characteristic property are equivalent for translation invariant $c_0$-kernels on $\mathbb{R}^d$ (see Carmeli et al., 2010, Proposition 5.16 and Sriperumbudur et al., 2010, Theorem 9), Theorem 4 can be considered as a special case of Theorem 4. In other words, requiring $(k_m)_{m=1}^M$ to be also $c_0$-kernels in Theorem 4(i)-(iv) is equivalent to

1. $k_m$-s are $c_0$-universal for all $m$;

2. $\otimes_{m=1}^M k_m$ is $c_0$-universal.

• Since the $c_0$-universality of $\otimes_{m=1}^M k_m$ implies its $\mathcal{I}$-characteristic property (see (6)), Theorem 4 also provides a generalization of Theorem 4(i) to $M \geq 2$ under additional assumptions on $k_m$-s, while constraining $\mathcal{X}_m$-s to LCP-s instead of second-countable topological spaces.

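In Example 4 and Theorem 4, we showed that for $M \geq 3$ components, while the characteristic property of $(k_m)_{m=1}^M$ is not sufficient, their universality is enough to guarantee the $\mathcal{I}$-characteristic property of $\otimes_{m=1}^M k_m$. The next example demonstrates that these results are tight: if at least one $k_m$ is not universal but only characteristic, then $\otimes_{m=1}^M k_m$ might not be $\mathcal{I}$-characteristic.

Example. Let $M = 3$ and $\mathcal{X}_m = \{1, 2\}$, $\tau_{\mathcal{X}_m} = \mathcal{P}(\mathcal{X}_m)$ for all $m$, $k_1(x, x') = 2\delta_{x,x'} - 1$, and $k_m(x, x') = \delta_{x,x'}$ ($m = 2, 3$). $k_1$ is characteristic (Example 4), and $k_2$ and $k_3$ are universal since the associated Gram matrix is an identity matrix, which is strictly positive definite. However, $\otimes_{m=1}^3 k_m$ is not $\mathcal{I}$-characteristic. See Section 5.6 for details.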

## 5 Proofs

In this section, we provide the proofs of our results presented in Section 4.

### 5.1 Proof of Example 4

The proof is structured as follows.

1. First we show that $k := k_1 = k_2$ is a kernel and it is characteristic.

2. Next it is proved that $k_1 \otimes k_2$ is not characteristic.

3. Finally, the $\mathcal{I}$-characteristic property of $k_1 \otimes k_2$ is established.

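The individual steps are as follows:

$k$ is a kernel. Assume w.l.o.g. that $x_1 = \ldots = x_N = 1$, $x_{N+1} = \ldots = x_n = 2$. Then it is easy to verify that the Gram matrix $G = [k(x_i, x_j)]_{i,j=1}^n = a a^\top$, where $a := [\mathbf{1}_N; -\mathbf{1}_{n-N}]$ and $a^\top$ is the transpose of $a$. Clearly $G$ is positive semidefinite and so $k$ is a kernel.

$k$ is characteristic. We will show that $k$ satisfies (4). On $\mathcal{X} = \{1, 2\}$ a finite signed measure $F$ takes the form $F = a_1 \delta_1 + a_2 \delta_2$ for some $a_1, a_2 \in \mathbb{R}$. Thus,

$$F \in \mathcal{M}_b(\mathcal{X}) \setminus \{0\} \Leftrightarrow (a_1, a_2) \neq 0 \quad \text{and} \quad F(\mathcal{X}) = 0 \Leftrightarrow a_1 + a_2 = 0. \tag{7}$$

Consider

$$\int_{\mathcal{X}} \int_{\mathcal{X}} k(x, x')\,\mathrm{d}F(x)\,\mathrm{d}F(x') = a_1^2 k(1,1) + a_2^2 k(2,2) + 2 a_1 a_2 k(1,2) = a_1^2 + a_2^2 - 2 a_1 a_2 = (a_1 - a_2)^2 = 4 a_1^2 > 0, \tag{8}$$

where we used (7) and the facts that $k(1,1) = k(2,2) = 1$, $k(1,2) = -1$.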
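The two claims above (the Gram matrix is a rank-one outer product, and the quadratic form in (8) equals $4a_1^2$ on zero-mass measures) are easy to confirm numerically. The sketch below uses the kernel values $k(1,1) = k(2,2) = 1$, $k(1,2) = -1$ from (8); the test weights and the helper name `quad_form` are illustrative assumptions.

```python
import numpy as np

# Gram matrix of k on X = {1, 2}: k(1,1) = k(2,2) = 1, k(1,2) = k(2,1) = -1.
K = np.array([[1.0, -1.0],
              [-1.0, 1.0]])

a = np.array([1.0, -1.0])
is_rank_one = np.allclose(K, np.outer(a, a))   # K = a a^T, hence positive semidefinite
eigvals = np.linalg.eigvalsh(K)

def quad_form(a1, a2):
    """Double integral of k against F = a1*delta_1 + a2*delta_2."""
    f = np.array([a1, a2])
    return f @ K @ f

# Zero-mass measures have a2 = -a1; the quadratic form should equal 4 * a1^2.
weights = (-2.0, -0.5, 0.3, 1.0)
values = [quad_form(a1, -a1) for a1 in weights]
expected = [4.0 * a1 ** 2 for a1 in weights]
```

Every entry of `values` is strictly positive, so no nonzero zero-mass measure is mapped to the zero element, which is exactly condition (4) for this kernel.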
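$k_1 \otimes k_2$ is not characteristic. We show a slightly stronger statement: $k_1 \otimes k_2$ is not even $\mathcal{F}$-ispd with $\mathcal{F}$ restricted to product measures of zero total mass; in other words we construct a witness $F = F_1 \otimes F_2 \neq 0$, $F_m \in \mathcal{M}_b(\mathcal{X}_m)$, such that

$$F(\mathcal{X}_1 \times \mathcal{X}_2) = F_1(\mathcal{X}_1) F_2(\mathcal{X}_2) = 0, \tag{9}$$

and

$$0 = \int_{\mathcal{X}_1 \times \mathcal{X}_2} \int_{\mathcal{X}_1 \times \mathcal{X}_2} \underbrace{(k_1 \otimes k_2)\big((i_1, i_2), (i_1', i_2')\big)}_{k_1(i_1, i_1')\, k_2(i_2, i_2')}\,\mathrm{d}F(i_1, i_2)\,\mathrm{d}F(i_1', i_2') = \prod_{m=1}^2 \int_{\mathcal{X}_m} \int_{\mathcal{X}_m} k_m(i_m, i_m')\,\mathrm{d}F_m(i_m)\,\mathrm{d}F_m(i_m'). \tag{10}$$

Finite signed measures on $\{1, 2\}$ take the form $F_1 = a_1 \delta_1 + a_2 \delta_2$, $F_2 = b_1 \delta_1 + b_2 \delta_2$, where $a = (a_1, a_2),\ b = (b_1, b_2) \in \mathbb{R}^2$. With these notations, (9) and (10) can be rewritten as

$$0 = (a_1 + a_2)(b_1 + b_2),$$
$$0 = \left[\sum_{i,i'=1}^2 k_1(i, i')\, a_i a_{i'}\right] \left[\sum_{j,j'=1}^2 k_2(j, j')\, b_j b_{j'}\right] = (a_1 - a_2)^2 (b_1 - b_2)^2.$$

Keeping the solutions where neither $a$ nor $b$ is the zero vector, there are 2 (symmetric) possibilities: (i) $a_1 + a_2 = 0$, $b_1 = b_2$ and (ii) $a_1 = a_2$, $b_1 + b_2 = 0$. In other words, for any $a, b \neq 0$, the possibilities are (i) $F_1 = a(\delta_1 - \delta_2)$, $F_2 = b(\delta_1 + \delta_2)$ and (ii) $F_1 = a(\delta_1 + \delta_2)$, $F_2 = b(\delta_1 - \delta_2)$. This establishes the non-$\mathcal{F}$-ispd property of $k_1 \otimes k_2$.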
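A concrete witness from case (i) can be checked in a few lines. The particular weights $a = (1, -1)$ and $b = (1, 1)$ below are one illustrative choice satisfying $a_1 + a_2 = 0$ and $b_1 = b_2$; any nonzero rescaling works equally well.

```python
import numpy as np

K = np.array([[1.0, -1.0],
              [-1.0, 1.0]])          # Gram matrix of k1 = k2 on {1, 2}

f1 = np.array([1.0, -1.0])           # F1 = delta_1 - delta_2, so a1 + a2 = 0
f2 = np.array([1.0, 1.0])            # F2 = delta_1 + delta_2, so b1 = b2

F = np.kron(f1, f2)                  # weights of F1 (x) F2 on {1,2} x {1,2}
total_mass = F.sum()                 # F(X1 x X2), the quantity in (9)
quad = F @ np.kron(K, K) @ F         # the double integral in (10)
```

Both `total_mass` and `quad` are zero although `F` is not the zero measure, so the tensor product kernel cannot separate this signed measure from zero.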
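$k_1 \otimes k_2$ is $\mathcal{I}$-characteristic. Our goal is to show that $k_1 \otimes k_2$ is $\mathcal{I}$-characteristic, i.e., for any $\mathbb{P} \in \mathcal{M}_1^+(\mathcal{X}_1 \times \mathcal{X}_2)$, $\mu_{k_1 \otimes k_2}(F) = 0$ implies $F = 0$, where $F = \mathbb{P} - \mathbb{P}_1 \otimes \mathbb{P}_2$. We divide the proof into two parts:

1. First we derive the equations of

$$F(\mathcal{X}_1 \times \mathcal{X}_2) = 0 \quad \text{and} \quad \iint_{(\mathcal{X}_1 \times \mathcal{X}_2)^2} (k_1 \otimes k_2)\big((i, j), (r, s)\big)\,\mathrm{d}F(i, j)\,\mathrm{d}F(r, s) = 0 \tag{11}$$

for general finite signed measures $F = \sum_{i,j=1}^2 a_{ij} \delta_{(i,j)}$ on $\mathcal{X}_1 \times \mathcal{X}_2$.

2. Then, we apply the parameterization $F = \mathbb{P} - \mathbb{P}_1 \otimes \mathbb{P}_2$ and solve for $\mathbb{P}$ that satisfies (11) to conclude that $\mathbb{P} = \mathbb{P}_1 \otimes \mathbb{P}_2$, i.e., $F = 0$. Note that in the chosen parametrization for $F$, $F(\mathcal{X}_1 \times \mathcal{X}_2) = 0$ holds automatically.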

The details are as follows.
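Step 1.

$$0 = F(\mathcal{X}_1 \times \mathcal{X}_2) \Leftrightarrow 0 = a_{11} + a_{12} + a_{21} + a_{22}, \tag{12}$$

$$\begin{aligned}
0 &= \sum_{i,j=1}^2 \sum_{r,s=1}^2 k_1(i, r)\, k_2(j, s)\, a_{ij} a_{rs} = \sum_{i,r=1}^2 k_1(i, r) \sum_{j,s=1}^2 k_2(j, s)\, a_{ij} a_{rs} \\
&= k_1(1,1)\left[k_2(1,1) a_{11} a_{11} + k_2(1,2) a_{11} a_{12} + k_2(2,1) a_{12} a_{11} + k_2(2,2) a_{12} a_{12}\right] \\
&\quad + k_1(1,2)\left[k_2(1,1) a_{11} a_{21} + k_2(1,2) a_{11} a_{22} + k_2(2,1) a_{12} a_{21} + k_2(2,2) a_{12} a_{22}\right] \\
&\quad + k_1(2,1)\left[k_2(1,1) a_{21} a_{11} + k_2(1,2) a_{21} a_{12} + k_2(2,1) a_{22} a_{11} + k_2(2,2) a_{22} a_{12}\right] \\
&\quad + k_1(2,2)\left[k_2(1,1) a_{21} a_{21} + k_2(1,2) a_{21} a_{22} + k_2(2,1) a_{22} a_{21} + k_2(2,2) a_{22} a_{22}\right] \\
&= \underbrace{\left(a_{11}^2 - 2 a_{11} a_{12} + a_{12}^2\right)}_{(a_{11} - a_{12})^2} + \underbrace{\left(a_{21}^2 - 2 a_{21} a_{22} + a_{22}^2\right)}_{(a_{21} - a_{22})^2} - 2 \underbrace{\left(a_{11} a_{21} - a_{11} a_{22} - a_{12} a_{21} + a_{12} a_{22}\right)}_{(a_{11} - a_{12})(a_{21} - a_{22})} \\
&= (a_{11} - a_{12} - a_{21} + a_{22})^2.
\end{aligned} \tag{13}$$

Solving (12) and (13) yields

$$a_{11} + a_{22} = 0 \quad \text{and} \quad a_{12} + a_{21} = 0. \tag{14}$$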
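The closed form in (13) and the solution (14) can be double-checked numerically. The sketch below evaluates the quadratic form for a randomly drawn coefficient table $(a_{ij})$ and then computes the null space of the two linear constraints obtained from (12) and from setting the square in (13) to zero; the random seed and the variable names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
s = np.array([1.0, -1.0])
K = np.outer(s, s)                       # Gram matrix of k1 = k2 on {1, 2}

# Random signed measure F = sum_ij a_ij * delta_(i,j); A[i, j] holds a_{i+1, j+1}.
A = rng.normal(size=(2, 2))
quad = A.ravel() @ np.kron(K, K) @ A.ravel()
closed_form = (A[0, 0] - A[0, 1] - A[1, 0] + A[1, 1]) ** 2

# Constraints (12) and (13) as a linear system in (a11, a12, a21, a22).
C = np.array([[1.0, 1.0, 1.0, 1.0],      # a11 + a12 + a21 + a22 = 0
              [1.0, -1.0, -1.0, 1.0]])   # a11 - a12 - a21 + a22 = 0
_, _, Vt = np.linalg.svd(C)
null_basis = Vt[2:]                      # C has rank 2, so the last two rows span its null space
```

Every vector in `null_basis` satisfies $a_{11} + a_{22} = 0$ and $a_{12} + a_{21} = 0$, matching (14).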

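Step 2. Any $\mathbb{P} \in \mathcal{M}_1^+(\mathcal{X}_1 \times \mathcal{X}_2)$ can be parametrized as

$$\mathbb{P} = \sum_{i,j=1}^2 p_{ij}\, \delta_{(i,j)}, \quad p_{ij} \geq 0,\ \forall (i, j) \quad \text{and} \quad \sum_{i,j=1}^2 p_{ij} = 1. \tag{15}$$

Let $F = \mathbb{P} - \mathbb{P}_1 \otimes \mathbb{P}_2$; for illustration see Table 1.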