
# Direct Estimation of the Derivative of Quadratic Mutual Information with Application in Supervised Dimension Reduction

Voot Tangkaratt (The University of Tokyo), Hiroaki Sasaki (The University of Tokyo), and Masashi Sugiyama (The University of Tokyo)
###### Abstract

A typical goal of supervised dimension reduction is to find a low-dimensional subspace of the input space such that the projected input variables preserve maximal information about the output variables. The dependence maximization approach solves the supervised dimension reduction problem through maximizing a statistical dependence between projected input variables and output variables. A well-known statistical dependence measure is mutual information (MI), which is based on the Kullback-Leibler (KL) divergence. However, it is known that the KL divergence is sensitive to outliers. On the other hand, quadratic MI (QMI) is a variant of MI based on the $L_2$ distance, which is more robust against outliers than the KL divergence, and a computationally efficient method to estimate QMI from data, called least-squares QMI (LSQMI), has been proposed recently. For these reasons, developing a supervised dimension reduction method based on LSQMI seems promising. However, it is the derivative of QMI, not QMI itself, that is needed for subspace search in supervised dimension reduction, and the derivative of an accurate QMI estimator is not necessarily a good estimator of the derivative of QMI. In this paper, we propose to directly estimate the derivative of QMI without estimating QMI itself. We show that the direct estimation of the derivative of QMI is more accurate than the derivative of the estimated QMI. Finally, we develop a supervised dimension reduction algorithm which efficiently uses the proposed derivative estimator, and demonstrate through experiments that the proposed method is more robust against outliers than existing methods.

## 1 Introduction

Supervised learning is one of the central problems in machine learning, aiming at learning an input-output relationship from given input-output paired data samples. Although many methods have been proposed to perform supervised learning, they often work poorly when the input variables are high-dimensional. Such a situation is commonly referred to as the curse of dimensionality (Bishop, 2006), and a common approach to mitigate the curse of dimensionality is to preprocess the input variables by dimension reduction (Burges, 2010).

A typical goal of dimension reduction in supervised learning is to find a low-dimensional subspace of the input space such that the projected input variables preserve maximal information about the output variables. Thus, a successive supervised learning method can use the low-dimensional projection of the input variables to learn the input-output relationship with a minimal loss of information. The purpose of this paper is to develop a novel supervised dimension reduction method.

The dependence maximization approach solves the supervised dimension reduction problem through maximizing a statistical dependence measure between projected input variables and output variables. Mutual information (MI) is a well-known tool for measuring the statistical dependence between random variables (Cover and Thomas, 1991). MI is well studied, and many methods have been proposed to estimate MI from data. A notable method is the maximum likelihood MI (MLMI) (Suzuki et al., 2008), which does not require any assumption on the data distribution and can perform model selection via cross-validation. For these reasons, MLMI seems to be an appealing tool for supervised dimension reduction. However, MI is defined based on the Kullback-Leibler divergence (Kullback and Leibler, 1951), which is known to be sensitive to outliers (Basu et al., 1998). Hence, MI is not an appropriate tool when it is applied to a dataset containing outliers.

Quadratic MI (QMI) is a variant of MI (Principe et al., 2000). Unlike MI, QMI is defined based on the $L_2$ distance. A notable advantage of the $L_2$ distance over the KL divergence is that the $L_2$ distance is more robust against outliers (Basu et al., 1998). Moreover, a computationally efficient method to estimate QMI from data, called least-squares QMI (LSQMI) (Sainui and Sugiyama, 2013), was proposed recently. LSQMI does not require any assumption on the data distribution, and it can perform model selection via cross-validation. For these reasons, developing a supervised dimension reduction method based on LSQMI seems more promising.

An approach to use LSQMI for supervised dimension reduction is to firstly estimate QMI between projected input variables and output variables by LSQMI, and then search for a subspace which maximizes the estimated QMI by a nonlinear optimization method such as gradient ascent. However, the essential quantity of the subspace search is the derivative of QMI w.r.t. the subspace, not QMI itself. Thus, LSQMI may not be an appropriate tool for developing supervised dimension reduction methods since the derivative of an accurate QMI estimator is not necessarily an accurate estimator of the derivative of QMI.

To cope with the above problem, in this paper, we propose a novel method to directly estimate the derivative of QMI without estimating QMI itself. The proposed method has the following advantageous properties: it does not require any assumption on the data distribution, the estimator can be computed analytically, and the tuning parameters can be objectively chosen by cross-validation. We show through experiments that the proposed direct estimator of the derivative of QMI is more accurate than the derivative of the estimated QMI. Then we develop a fixed-point iteration which efficiently uses the proposed estimator of the derivative of QMI to perform supervised dimension reduction. Finally, we demonstrate the usefulness of the proposed supervised dimension reduction method through experiments and show that the proposed method is more robust against outliers than existing methods.

The organization of this paper is as follows. We firstly formulate the supervised dimension reduction problem and review some existing methods in Section 2. Then we give an overview of QMI and review existing QMI estimators in Section 3. The details of the proposed derivative estimator are given in Section 4. Then in Section 5 we develop a supervised dimension reduction algorithm based on the proposed derivative estimator. Experimental results are given in Section 6. The conclusion is given in Section 7.

## 2 Supervised Dimension Reduction

In this section, we firstly formulate the supervised dimension reduction problem. Then we briefly review existing supervised dimension reduction methods and discuss their problems.

### 2.1 Problem Formulation

Let $\mathcal{X} \subseteq \mathbb{R}^{d_x}$ and $\mathcal{Y} \subseteq \mathbb{R}^{d_y}$ be the input domain and output domain with dimensionality $d_x$ and $d_y$, respectively, and let $p(x, y)$ be a joint probability density on $\mathcal{X} \times \mathcal{Y}$. Firstly, assume that we are given an input-output paired data set $\{(x_i, y_i)\}_{i=1}^{n}$, where each data sample is drawn independently from the joint density:

$$\{(x_i, y_i)\}_{i=1}^{n} \overset{\mathrm{i.i.d.}}{\sim} p(x, y).$$

Next, let $W \in \mathbb{R}^{d_z \times d_x}$ be an orthonormal matrix, i.e., $WW^\top = I_{d_z}$, with a known constant $d_z < d_x$, where $I_{d_z}$ denotes the $d_z$-by-$d_z$ identity matrix and $^\top$ denotes the matrix transpose. Then assume that there exists a $d_z$-dimensional subspace in $\mathbb{R}^{d_x}$ spanned by the rows of $W$ such that the projection of $x$ onto this subspace, denoted by $z = Wx$, preserves the maximal information about $y$. That is, we can substitute $x$ by $z$ with a minimal loss of information about $y$. We refer to the problem of estimating $W$ from the given data as supervised dimension reduction. Below, we review some of the existing supervised dimension reduction methods.
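In code, the setup looks as follows. This is a toy sketch with simulated data; the only points being illustrated are the matrix shapes and the QR-based construction of an orthonormal $W$ (the data-generating process is an assumption of this example):

```python
import numpy as np

rng = np.random.default_rng(0)
d_x, d_z, n = 5, 2, 100

# Toy paired data {(x_i, y_i)}; the joint density p(x, y) is unknown in
# practice and is only simulated here for illustration.
X = rng.normal(size=(n, d_x))
y = X[:, 0] - X[:, 1] + 0.1 * rng.normal(size=n)

# An orthonormal matrix W (W W^T = I_{d_z}) built via QR decomposition;
# its rows span the candidate d_z-dimensional subspace.
W, _ = np.linalg.qr(rng.normal(size=(d_x, d_z)))
W = W.T

Z = X @ W.T            # projected inputs z_i = W x_i
print(Z.shape)         # (100, 2)
```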

### 2.2 Sliced Inverse Regression

Sliced inverse regression (SIR) (Li, 1991) is a well-known supervised dimension reduction method. SIR formulates supervised dimension reduction as a problem of finding $W$ which makes $x$ and $y$ conditionally independent given $z = Wx$:

$$x \perp\!\!\!\perp y \mid z. \qquad (1)$$

The key principle of SIR lies in the following equality (for simplicity, we assume that $x$ is standardized so that $\mathbb{E}[x] = 0$ and $\mathbb{E}[xx^\top] = I_{d_x}$):

$$\mathbb{E}[c^\top x \mid Wx] = a_0 + \sum_{i=1}^{d_z} a_i w_i^\top x, \qquad (2)$$

where $\mathbb{E}[\,\cdot \mid \cdot\,]$ denotes the conditional expectation and $w_i$ denotes the $i$-th row of $W$. The importance of this equality is that if it holds for any $c \in \mathbb{R}^{d_x}$ and some constants $a_0, a_1, \dots, a_{d_z}$, then the inverse regression curve $\mathbb{E}[x \mid y]$ lies in the space spanned by $w_1, \dots, w_{d_z}$, which satisfies Eq.(1). Based on this fact, SIR estimates $W$ as follows. First, the range of $y$ is sliced into multiple slices. Then $\mathbb{E}[x \mid y]$ is estimated as the mean of $x$ for each slice of $y$. Finally, $W$ is obtained as the $d_z$ largest principal components of the covariance matrix of the slice means.
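The slicing procedure above can be sketched in a few lines of numpy. This is a minimal illustration, not the reference implementation of Li (1991); the slice count, the quantile-based slicing, and the synthetic data are all choices made here:

```python
import numpy as np

def sir(X, y, d_z, n_slices=10):
    """Sliced inverse regression sketch: slice y, average x per slice,
    and take the top principal components of the slice-mean covariance."""
    n, d_x = X.shape
    # Standardize x (SIR assumes zero mean and identity covariance).
    Xc = X - X.mean(axis=0)
    cov = Xc.T @ Xc / n
    L = np.linalg.cholesky(np.linalg.inv(cov))
    Xs = Xc @ L                               # whitened inputs
    # Slice the range of y by quantiles, average x within each slice.
    edges = np.quantile(y, np.linspace(0, 1, n_slices + 1))
    M = np.zeros((d_x, d_x))
    for a, b in zip(edges[:-1], edges[1:]):
        idx = (y >= a) & (y <= b)
        if idx.sum() == 0:
            continue
        m = Xs[idx].mean(axis=0)
        M += idx.mean() * np.outer(m, m)      # weighted covariance of means
    # Top d_z eigenvectors span the estimated subspace; map back from
    # the whitened scale to the original coordinates.
    _, vecs = np.linalg.eigh(M)
    return vecs[:, ::-1][:, :d_z].T @ L.T

rng = np.random.default_rng(1)
X = rng.normal(size=(2000, 5))
w_true = np.array([1.0, -1.0, 0.0, 0.0, 0.0]) / np.sqrt(2)
y = np.tanh(X @ w_true) + 0.1 * rng.normal(size=2000)
W_hat = sir(X, y, d_z=1)
w_hat = W_hat[0] / np.linalg.norm(W_hat[0])
print(abs(w_hat @ w_true))    # close to 1 when the direction is recovered
```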

The significant advantages of SIR are its simplicity and scalability to large datasets. However, SIR relies on the equality in Eq.(2), which typically requires that $p(x)$ is an elliptically symmetric distribution such as the Gaussian. This requirement is restrictive, and thus the practical usefulness of SIR is limited.

### 2.3 Minimum Average Variance Estimation based on the Conditional Density Functions

The minimum average variance estimation based on the conditional density functions (dMAVE) (Xia, 2007) is a supervised dimension reduction method which does not require any assumption on the data distribution and is more practical than SIR. Briefly speaking, dMAVE aims to find a matrix $W$ which yields an accurate non-parametric estimation of the conditional density $p(y \mid z)$.

The essential part of dMAVE is the following model:

$$H_b(\tilde{y} - y) = m_b(z, y) + \varepsilon_b(y \mid z),$$

where $H_b(\cdot)$ denotes a symmetric kernel function with bandwidth $b$, $m_b(z, y) = \mathbb{E}[H_b(\tilde{y} - y) \mid z]$ denotes the conditional expectation of $H_b(\tilde{y} - y)$ given $z$, and $\varepsilon_b(y \mid z)$ is the residual with $\mathbb{E}[\varepsilon_b(y \mid z) \mid z] = 0$. An important property of this model is that $m_b(z, y) \rightarrow p(y \mid z)$ as $b \rightarrow 0$. Then, dMAVE estimates $m_b(z, y)$ by a local linear smoother (Fan et al., 1996). More specifically, a local linear smoother of $m_b(z_i, y_k)$ is given by

$$m_b(z_i, y_k) \approx m_b(z_j, y_k) + \frac{\partial m_b(z_j, y_k)}{\partial z}^\top (z_i - z_j) = a_{jk} + b_{jk}^\top W (x_i - x_j), \qquad (3)$$

where $z_j$ is an arbitrary point close to $z_i$, and $a_{jk}$ and $b_{jk}$ are parameters. Based on this local linear smoother, dMAVE solves the following minimization problem:

$$\min_{W, \{a_{jk}, b_{jk}\}} \; \frac{1}{n^3} \sum_{j,k=1}^{n} \rho(x_j, y_k) \sum_{i=1}^{n} \left[ H_b(y_i - y_k) - a_{jk} - b_{jk}^\top W (x_i - x_j) \right]^2 K_h(x_i, x_j), \qquad (4)$$

where $K_h(\cdot, \cdot)$ is a symmetric kernel function with bandwidth $h$. The function $\rho(x, y)$ is a trimming function which is evaluated as zero when the density of $x$ or $y$ is lower than some threshold. A solution to this minimization problem is obtained by alternately solving quadratic programming problems for $\{a_{jk}\}$, $\{b_{jk}\}$, and $W$ until convergence.

The main advantage of dMAVE is that it does not require any assumption on the data distribution. However, a significant disadvantage of dMAVE is that there is no systematic method to choose the kernel bandwidths and the trimming threshold. In practice, dMAVE uses a bandwidth selection method based on the normal-reference rule of non-parametric conditional density estimation (Silverman, 1986; Fan et al., 1996), and a fixed trimming threshold. Although this model selection strategy works reasonably well in general, it does not always guarantee good performance on all kinds of datasets.

Another disadvantage of dMAVE is that the optimization problem in Eq.(4) may have many local solutions. To cope with this problem, the authors of dMAVE proposed using a supervised dimension reduction method called the outer product of gradients based on conditional density functions (dOPG) (Xia, 2007) to obtain a good initial solution. Thus, dMAVE may not perform well if dOPG fails to provide a good initial solution.

### 2.4 Kernel Dimension Reduction

Another supervised dimension reduction method which does not require any assumption on the data distribution is kernel dimension reduction (KDR) (Fukumizu et al., 2009). Unlike dMAVE which focuses on the conditional density, KDR aims to find a matrix which satisfies the conditional independence in Eq.(1). The key idea of KDR is to evaluate the conditional independence through a conditional covariance operator over reproducing kernel Hilbert spaces (RKHSs) (Aronszajn, 1950).

Throughout this subsection, we use $\mathcal{H}_z$ to denote an RKHS of functions on the domain $\mathcal{Z}$ equipped with reproducing kernel $k_z(\cdot, \cdot)$:

$$\langle f, k_z(\cdot, z) \rangle_{\mathcal{H}_z} = f(z),$$

for $f \in \mathcal{H}_z$ and $z \in \mathcal{Z}$. The RKHSs of functions on the domains $\mathcal{X}$ and $\mathcal{Y}$ are defined similarly as $\mathcal{H}_x$ and $\mathcal{H}_y$, respectively. The cross-covariance operator $\Sigma_{ZY} : \mathcal{H}_y \rightarrow \mathcal{H}_z$ satisfies the following equality for all $f \in \mathcal{H}_z$ and $g \in \mathcal{H}_y$:

$$\langle f, \Sigma_{ZY} g \rangle_{\mathcal{H}_z} = \mathbb{E}_{zy}[f(z) g(y)] - \mathbb{E}_{z}[f(z)] \, \mathbb{E}_{y}[g(y)],$$

where $\mathbb{E}_{zy}$, $\mathbb{E}_{z}$, and $\mathbb{E}_{y}$ denote expectations over the densities $p(z, y)$, $p(z)$, and $p(y)$, respectively. Then, the conditional covariance operator $\Sigma_{YY|Z}$ can be defined using cross-covariance operators as

$$\Sigma_{YY|Z} = \Sigma_{YY} - \Sigma_{YZ} \Sigma_{ZZ}^{-1} \Sigma_{ZY}, \qquad (5)$$

where it is assumed that $\Sigma_{ZZ}^{-1}$ always exists. The importance of the conditional covariance operator in supervised dimension reduction lies in the following relations:

$$\Sigma_{YY|Z} \geq \Sigma_{YY|X}, \qquad (6)$$

where the inequality refers to the partial order of self-adjoint operators, and

$$\Sigma_{YY|Z} = \Sigma_{YY|X} \iff x \perp\!\!\!\perp y \mid z. \qquad (7)$$

These relations mean that the conditional independence in Eq.(1) can be achieved by finding a matrix $W$ which minimizes $\Sigma_{YY|Z}$ in the partial order of self-adjoint operators. Based on this fact, KDR solves the following minimization problem:

$$\min_{W \in \{W \mid WW^\top = I_{d_z}\}} \mathrm{Tr}\left[ G_Y (G_Z + \lambda n I_n)^{-1} \right], \qquad (8)$$

where $\lambda > 0$ denotes a regularization parameter, $G_Y$ and $G_Z$ denote the centered Gram matrices computed with the kernels $k_y$ and $k_z$, respectively, $I_n$ denotes the $n$-by-$n$ identity matrix, and $\mathrm{Tr}[\cdot]$ denotes the trace. A solution to this minimization problem is obtained by a gradient descent method.
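The KDR criterion in Eq.(8) is straightforward to evaluate given the data. Below is a minimal numpy sketch; the Gaussian kernels, their widths, and the regularization value are assumptions of this illustration, and KDR proper also needs the gradient of this objective for optimization:

```python
import numpy as np

def kdr_objective(Z, Y, sigma_z=1.0, sigma_y=1.0, lam=1e-3):
    """Evaluate Tr[G_Y (G_Z + lam*n*I_n)^(-1)] of Eq.(8) with Gaussian
    kernels and centered Gram matrices (stand-in hyperparameters)."""
    n = Z.shape[0]
    C = np.eye(n) - np.ones((n, n)) / n          # centering matrix

    def gram(A, sigma):
        sq = ((A[:, None, :] - A[None, :, :]) ** 2).sum(axis=2)
        return np.exp(-sq / (2 * sigma ** 2))

    G_Z = C @ gram(Z, sigma_z) @ C               # centered Gram matrices
    G_Y = C @ gram(Y, sigma_y) @ C
    return float(np.trace(G_Y @ np.linalg.inv(G_Z + lam * n * np.eye(n))))

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 4))
Y = X[:, :1] + 0.1 * rng.normal(size=(100, 1))
# A projection that explains y should leave less residual dependence
# (a smaller criterion value) than an uninformative one.
obj_good = kdr_objective(X @ np.array([[1.0, 0, 0, 0]]).T, Y)
obj_bad = kdr_objective(X @ np.array([[0.0, 0, 0, 1.0]]).T, Y)
print(obj_good, obj_bad)
```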

KDR does not require any assumption on the data distribution and was shown to work well on various regression and classification tasks (Fukumizu et al., 2009). However, KDR has two disadvantages in practice. The first disadvantage of KDR is that, even though the kernel parameters and the regularization parameter can heavily affect the performance, there seems to be no justifiable model selection method to choose these parameters so far. Although it is always possible to choose these tuning parameters based on a criterion of a successive supervised learning method with cross-validation, this approach results in a nested loop of model selection for both KDR itself and the successive supervised learning method. Moreover, this approach makes supervised dimension reduction depend on the successive supervised learning method, which is unfavorable in practice.

The second disadvantage is that the optimization problem in Eq.(8) is non-convex and may have many local solutions. Thus, if the initial solution is not properly chosen, the performance of KDR may be unreliable. A simple approach to cope with this problem is to choose the best solution with cross-validation based on the successive supervised learning method, but this again makes supervised dimension reduction depend on the successive supervised learning method and is unfavorable. A more sophisticated approach was considered in Fukumizu and Leng (2014), which proposed to use a solution of a supervised dimension reduction method called gradient-based kernel dimension reduction (gKDR) as an initial solution for KDR. However, there is no guarantee that gKDR always provides a good initial solution for KDR.

### 2.5 Least-Squares Dimension Reduction

The least-squares dimension reduction (LSDR) (Suzuki and Sugiyama, 2013) is another supervised dimension reduction method which does not require any assumption on the data distribution. Similarly to KDR, LSDR aims to find a matrix which satisfies the conditional independence in Eq.(1). However, instead of the conditional covariance operators, LSDR evaluates the conditional independence through a statistical dependence measure.

LSDR utilizes a statistical dependence measure called squared-loss mutual information (SMI). SMI between random variables $z$ and $y$ is defined as

$$\mathrm{SMI}(Z, Y) = \frac{1}{2} \iint p(z) p(y) \left( \frac{p(z, y)}{p(z) p(y)} - 1 \right)^2 \mathrm{d}z \mathrm{d}y. \qquad (9)$$

$\mathrm{SMI}(Z, Y)$ is always non-negative and equals zero if and only if $z$ and $y$ are statistically independent, i.e., $p(z, y) = p(z) p(y)$. The important properties of SMI in supervised dimension reduction are the following relations:

$$\mathrm{SMI}(Z, Y) \leq \mathrm{SMI}(X, Y),$$

and

$$\mathrm{SMI}(Z, Y) = \mathrm{SMI}(X, Y) \iff x \perp\!\!\!\perp y \mid z.$$

Thus, the conditional independence can be achieved by finding a matrix $W$ which maximizes $\mathrm{SMI}(Z, Y)$. Since $\mathrm{SMI}(Z, Y)$ is typically unknown, it is estimated by the least-squares mutual information (Suzuki et al., 2009) method, which directly estimates the density ratio $\frac{p(z, y)}{p(z) p(y)}$ without performing any density estimation. Then, LSDR solves the following maximization problem:

$$\max_{W \in \{W \mid WW^\top = I_{d_z}\}} \widehat{\mathrm{SMI}}(Z, Y), \qquad (10)$$

where $\widehat{\mathrm{SMI}}(Z, Y)$ denotes the estimated SMI. The solution to this maximization problem is obtained by a gradient ascent method. Note that this maximization problem is non-convex and may have many local solutions.

LSDR does not require any assumption on the data distribution, similarly to dMAVE and KDR. However, the significant advantage of LSDR over dMAVE and KDR is that LSDR can perform model selection via cross-validation and avoid a poor local solution without requiring any successive supervised learning method. This is a favorable property as a supervised dimension reduction method.

However, a disadvantage of LSDR is that the density ratio function can fluctuate wildly, especially when the data contains outliers. Since it is typically difficult to accurately estimate a wildly fluctuating function, LSDR could be unreliable in the presence of outliers.

Next, we consider a supervised dimension reduction approach based on quadratic mutual information which can overcome the disadvantages of the existing methods.

## 3 Quadratic Mutual Information

In this section, we briefly introduce quadratic mutual information (QMI) and discuss how it can be used to perform robust supervised dimension reduction.

### 3.1 Quadratic Mutual Information and Mutual Information

Quadratic mutual information (QMI) is a measure of the statistical dependence between random variables $z$ and $y$ (Principe et al., 2000), and is defined as

$$\mathrm{QMI}(Z, Y) = \frac{1}{2} \iint \left( p(z, y) - p(z) p(y) \right)^2 \mathrm{d}z \mathrm{d}y. \qquad (11)$$

$\mathrm{QMI}(Z, Y)$ is always non-negative and equals zero if and only if $z$ and $y$ are statistically independent, i.e., $p(z, y) = p(z) p(y)$. Such a property of QMI is similar to that of the ordinary mutual information (MI), which is defined as

$$\mathrm{MI}(Z, Y) = \iint p(z, y) \log \left( \frac{p(z, y)}{p(z) p(y)} \right) \mathrm{d}z \mathrm{d}y. \qquad (12)$$

The essential difference between QMI and MI is the discrepancy measure: $\mathrm{QMI}(Z, Y)$ is the $L_2$ distance between $p(z, y)$ and $p(z) p(y)$, while $\mathrm{MI}(Z, Y)$ is the Kullback-Leibler (KL) divergence (Kullback and Leibler, 1951) from $p(z, y)$ to $p(z) p(y)$.

MI has been studied and applied to many data analysis tasks (Cover and Thomas, 1991). Moreover, an efficient method to estimate MI from data is also available (Suzuki et al., 2008). However, MI is not always the optimal choice for measuring statistical dependence because it is not robust against outliers. An intuitive explanation is that MI contains the log function and the density ratio: the logarithm changes sharply near zero, and the density ratio can fluctuate wildly and even diverge to infinity. Thus, the value of MI tends to be unstable and unreliable in the presence of outliers. In contrast, QMI contains neither the log function nor the density ratio, and thus QMI should be more robust against outliers than MI.

Another explanation of the robustness of QMI and MI can be understood based on their discrepancy measures. Both the $L_2$ distance (QMI) and the KL divergence (MI) can be regarded as members of a more general divergence class called the density power divergence (Basu et al., 1998):

$$\mathrm{DP}_\alpha(p \| q) = \int \left( q(x)^{1+\alpha} - \left( 1 + \frac{1}{\alpha} \right) p(x) q(x)^{\alpha} + \frac{1}{\alpha} p(x)^{1+\alpha} \right) \mathrm{d}x, \qquad (13)$$

where $\alpha > 0$. Based on this divergence class, the $L_2$ distance and the KL divergence are obtained by setting $\alpha = 1$ and by taking the limit $\alpha \rightarrow 0$, respectively. As discussed in Basu et al. (1998), the parameter $\alpha$ controls the robustness of the divergence against outliers, where a larger value of $\alpha$ indicates higher robustness. This means that the $L_2$ distance ($\alpha = 1$) is more robust against outliers than the KL divergence ($\alpha \rightarrow 0$).
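As a quick sanity check of these two special cases (a short derivation, not taken from the paper itself), write the density power divergence in the form of Basu et al. (1998), $\mathrm{DP}_\alpha(p \| q) = \int \big( q(x)^{1+\alpha} - (1 + \tfrac{1}{\alpha}) p(x) q(x)^{\alpha} + \tfrac{1}{\alpha} p(x)^{1+\alpha} \big) \mathrm{d}x$. Then:

```latex
% alpha = 1: the squared L2 distance
\mathrm{DP}_{1}(p \| q)
  = \int \left( q(x)^{2} - 2\,p(x)\,q(x) + p(x)^{2} \right) \mathrm{d}x
  = \int \left( p(x) - q(x) \right)^{2} \mathrm{d}x.

% alpha -> 0: the KL divergence, using the expansion
% t^{\alpha} = 1 + \alpha \log t + O(\alpha^{2}) and the fact that
% \int \left( q(x) - p(x) \right) \mathrm{d}x = 0 for densities:
\lim_{\alpha \rightarrow 0} \mathrm{DP}_{\alpha}(p \| q)
  = \int p(x) \log \frac{p(x)}{q(x)} \,\mathrm{d}x
  = \mathrm{KL}(p \,\|\, q).
```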

In supervised dimension reduction, robustness against outliers is an important requirement because outliers often make supervised dimension reduction methods work poorly. Thus, developing a supervised dimension reduction method based on QMI is an attractive approach since QMI is robust against outliers. This QMI-based supervised dimension reduction is performed by finding a matrix $W^*$ which maximizes $\mathrm{QMI}(Z, Y)$:

$$W^* = \mathop{\mathrm{argmax}}_{W \in \{W \mid WW^\top = I_{d_z}\}} \mathrm{QMI}(Z, Y).$$

The motivation is that, if $\mathrm{QMI}(Z, Y)$ is maximized, then $z$ and $y$ are maximally dependent on each other, and thus we may disregard $x$ with a minimal loss of information about $y$.

Since $\mathrm{QMI}(Z, Y)$ is typically unknown, it needs to be estimated from data. Below, we review existing QMI estimation methods and then discuss a weakness of performing supervised dimension reduction using these QMI estimation methods.

### 3.2 Existing QMI Estimation Methods

We review two QMI estimation methods which estimate $\mathrm{QMI}(Z, Y)$ from the given data. The first method estimates QMI through density estimation, and the second method estimates QMI through density difference estimation.

#### 3.2.1 QMI Estimator based on Density Estimation

Expanding Eq.(11) allows us to express $\mathrm{QMI}(Z, Y)$ as

$$\mathrm{QMI}(Z, Y) = \frac{1}{2} \iint \left( p(z, y)^2 - 2 p(z, y) p(z) p(y) + p(z)^2 p(y)^2 \right) \mathrm{d}z \mathrm{d}y. \qquad (14)$$

A naive approach to estimate $\mathrm{QMI}(Z, Y)$ is to separately estimate the unknown densities $p(z, y)$, $p(z)$, and $p(y)$ by density estimation methods such as kernel density estimation (KDE) (Silverman, 1986), and then plug the estimates into Eq.(14).

Following this approach, the KDE-based QMI estimator has been studied and applied to many problems such as feature extraction for classification (Torkkola, 2003; Principe et al., 2000), blind source separation (Principe et al., 2000), and image registration (Atif et al., 2003). Although this density estimation based approach was shown to work well, accurately estimating densities for high-dimensional data is known to be one of the most challenging tasks (Vapnik, 1998). Moreover, the densities contained in Eq.(14) are estimated independently without regard to the accuracy of the resulting QMI estimator. Thus, even if each density is accurately estimated, the QMI estimator obtained from these density estimates is not necessarily accurate. An approach to mitigate this problem is to consider density estimators whose combination minimizes the estimation error of QMI. Although this approach shows better performance than the independent density estimation approach, it still performs poorly in high-dimensional problems (Sugiyama et al., 2013).

#### 3.2.2 Least-Squares QMI

To avoid the separate density estimation, an alternative method called least-squares QMI (LSQMI) (Sainui and Sugiyama, 2013) was proposed. Below, we briefly review the LSQMI method.

First, notice that $\mathrm{QMI}(Z, Y)$ can be expressed in terms of the density difference as

$$\mathrm{QMI}(Z, Y) = \frac{1}{2} \iint f(z, y)^2 \,\mathrm{d}z \mathrm{d}y, \qquad (15)$$

where

$$f(z, y) = p(z, y) - p(z) p(y).$$

The key idea of LSQMI is to directly estimate the density difference $f(z, y)$, without going through any density estimation, by the procedure of the least-squares density difference (Sugiyama et al., 2013). Letting $d(z, y)$ be a model of the density difference, LSQMI learns $d$ so that it is fitted to the true density difference under the squared loss:

$$\frac{1}{2} \iint \left( d(z, y) - f(z, y) \right)^2 \mathrm{d}z \mathrm{d}y.$$

By expanding the integrand, we obtain

$$\frac{1}{2} \iint d(z, y)^2 \,\mathrm{d}z \mathrm{d}y - \iint d(z, y) f(z, y) \,\mathrm{d}z \mathrm{d}y + \frac{1}{2} \iint f(z, y)^2 \,\mathrm{d}z \mathrm{d}y.$$

Since the last term is a constant w.r.t. the model $d$, we omit it and obtain the following criterion:

$$\frac{1}{2} \iint d(z, y)^2 \,\mathrm{d}z \mathrm{d}y - \iint d(z, y) f(z, y) \,\mathrm{d}z \mathrm{d}y. \qquad (16)$$

Then, the density difference estimator is obtained as the solution of the following minimization problem:

$$\hat{d} = \mathop{\mathrm{argmin}}_{d} \left[ \frac{1}{2} \iint d(z, y)^2 \,\mathrm{d}z \mathrm{d}y - \iint d(z, y) f(z, y) \,\mathrm{d}z \mathrm{d}y \right]. \qquad (17)$$

The solution of the minimization problem in Eq.(17) depends on the choice of the model $d$. LSQMI employs the following linear-in-parameter model:

$$d(z, y) = \alpha^\top \psi(z, y),$$

where $\alpha$ is a parameter vector and $\psi(z, y)$ is a basis function vector. For this model, finding the solution of Eq.(17) is equivalent to solving

$$\min_{\alpha} \left[ \frac{1}{2} \alpha^\top D \alpha - \alpha^\top q \right],$$

where

$$D = \iint \psi(z, y) \psi(z, y)^\top \,\mathrm{d}z \mathrm{d}y, \qquad (18)$$

$$q = \iint \psi(z, y) f(z, y) \,\mathrm{d}z \mathrm{d}y = \iint \psi(z, y) p(z, y) \,\mathrm{d}z \mathrm{d}y - \iint \psi(z, y) p(z) p(y) \,\mathrm{d}z \mathrm{d}y. \qquad (19)$$

By approximating the expectations over the densities $p(z, y)$, $p(z)$, and $p(y)$ with sample averages, we obtain the following empirical minimization problem:

$$\min_{\alpha} \left[ \frac{1}{2} \alpha^\top D \alpha - \alpha^\top \hat{q} \right],$$

where $\hat{q}$ is the sample approximation of Eq.(19):

$$\hat{q} = \frac{1}{n} \sum_{i=1}^{n} \psi(z_i, y_i) - \frac{1}{n^2} \sum_{i,j=1}^{n} \psi(z_i, y_j).$$

By including the regularization term, we obtain

$$\hat{\alpha} = \mathop{\mathrm{argmin}}_{\alpha} \left[ \frac{1}{2} \alpha^\top D \alpha - \alpha^\top \hat{q} + \frac{\lambda}{2} \alpha^\top \alpha \right],$$

where $\lambda > 0$ is the regularization parameter. Then, the solution is obtained analytically as

$$\hat{\alpha} = (D + \lambda I)^{-1} \hat{q}. \qquad (20)$$

Therefore, the density difference estimator is obtained as

$$\hat{d}(z, y) = \hat{\alpha}^\top \psi(z, y).$$

Finally, a QMI estimator is obtained by substituting the density difference estimator $\hat{d}$ into Eq.(15). A direct substitution yields two possible QMI estimators:

$$\widehat{\mathrm{QMI}}(Z, Y) = \frac{1}{2} \hat{\alpha}^\top \hat{q}, \qquad (21)$$

$$\widehat{\mathrm{QMI}}(Z, Y) = \frac{1}{2} \hat{\alpha}^\top D \hat{\alpha}. \qquad (22)$$

However, it was shown in Sugiyama et al. (2013) that a linear combination of the two estimators defined as

$$\widehat{\mathrm{QMI}}(Z, Y) = \hat{\alpha}^\top \hat{q} - \frac{1}{2} \hat{\alpha}^\top D \hat{\alpha}, \qquad (23)$$

provides smaller bias and is a more appropriate QMI estimator.

As shown above, LSQMI avoids multiple-step density estimation by directly estimating the density difference contained in QMI. It was shown that such direct estimation procedure tends to be more accurate than multiple-step estimation (Sugiyama et al., 2013). Moreover, LSQMI is able to objectively choose the tuning parameter contained in the basis function and the regularization parameter based on cross-validation. This property allows LSQMI to solve challenging tasks such as clustering (Sainui and Sugiyama, 2013) and unsupervised dimension reduction (Sainui and Sugiyama, 2014) in an objective way.
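The whole LSQMI procedure fits in a short numpy sketch. With a Gaussian basis, the matrix $D$ in Eq.(18) is analytic (a product of two Gaussians integrates to a Gaussian in the centers); the width, regularization, and number of basis functions below are stand-in values, whereas LSQMI proper chooses them by cross-validation:

```python
import numpy as np

def lsqmi(Z, Y, sigma=1.0, lam=0.1, b=50, seed=0):
    """Sketch of LSQMI: fit the density difference f(z,y) with Gaussian
    basis functions and evaluate QMI via Eq.(23)."""
    n, d_z = Z.shape
    d_y = Y.shape[1]
    rng = np.random.default_rng(seed)
    c = rng.choice(n, size=min(b, n), replace=False)   # Gaussian centers
    U, V = Z[c], Y[c]

    def sq(A, B):                      # pairwise squared distances
        return ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)

    Az = np.exp(-sq(Z, U) / (2 * sigma ** 2))          # z-part of psi
    Ay = np.exp(-sq(Y, V) / (2 * sigma ** 2))          # y-part of psi
    Psi = Az * Ay                                      # psi(z_i, y_i)

    # D (Eq.(18)) in closed form for the Gaussian basis.
    D = (np.pi * sigma ** 2) ** ((d_z + d_y) / 2) * np.exp(
        -(sq(U, U) + sq(V, V)) / (4 * sigma ** 2))

    # q hat: average over paired samples minus the all-pairs average.
    q_hat = Psi.mean(axis=0) - Az.mean(axis=0) * Ay.mean(axis=0)
    alpha = np.linalg.solve(D + lam * np.eye(len(c)), q_hat)   # Eq.(20)
    return float(alpha @ q_hat - 0.5 * alpha @ D @ alpha)      # Eq.(23)

rng = np.random.default_rng(3)
Z = rng.normal(size=(300, 1))
qmi_dep = lsqmi(Z, Z + 0.1 * rng.normal(size=(300, 1)))
qmi_ind = lsqmi(Z, rng.normal(size=(300, 1)))
print(qmi_dep, qmi_ind)   # the dependent pair should score higher
```

Note that the estimate in Eq.(23) is non-negative by construction here, since $\hat{\alpha}^\top \hat{q} - \frac{1}{2}\hat{\alpha}^\top D \hat{\alpha} = \frac{1}{2}\hat{\alpha}^\top D \hat{\alpha} + \lambda \hat{\alpha}^\top \hat{\alpha}$ when $\hat{\alpha} = (D + \lambda I)^{-1}\hat{q}$.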

### 3.3 Supervised Dimension Reduction via LSQMI

Given an efficient QMI estimation method such as LSQMI, supervised dimension reduction can be performed by finding a matrix $W^*$ defined as

$$W^* = \mathop{\mathrm{argmax}}_{W \in \{W \mid WW^\top = I_{d_z}\}} \widehat{\mathrm{QMI}}(Z, Y). \qquad (24)$$

A straightforward approach to solving Eq.(24) is to perform the gradient ascent:

$$W \leftarrow W + t \, \frac{\partial \widehat{\mathrm{QMI}}(Z, Y)}{\partial W},$$

where $t > 0$ denotes the step size. This update formula means that the essential quantity in QMI-based supervised dimension reduction is not the accuracy of the QMI estimator, but the accuracy of the estimator of the derivative of QMI. Thus, the existing LSQMI-based approach, which first estimates QMI and then computes the derivative of the QMI estimator, is not necessarily appropriate, since an accurate estimator of QMI does not necessarily yield an accurate estimator of the derivative of QMI. Next, we describe our proposed method, which overcomes this problem.
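Since $W$ must remain orthonormal, each gradient step is usually followed by a retraction onto the constraint set. The sketch below uses an SVD-based projection with a placeholder gradient; this is one common choice, not necessarily the scheme used by the methods cited above:

```python
import numpy as np

def orthonormalize(W):
    """Project W back onto {W | W W^T = I} via SVD: replace the singular
    values of W by ones, keeping the singular directions."""
    U, _, Vt = np.linalg.svd(W, full_matrices=False)
    return U @ Vt

def ascent_step(W, grad, t=0.1):
    """One gradient-ascent step followed by re-orthonormalization; `grad`
    stands for an estimate of dQMI/dW, however it is obtained."""
    return orthonormalize(W + t * grad)

rng = np.random.default_rng(4)
W = orthonormalize(rng.normal(size=(2, 5)))
grad = rng.normal(size=(2, 5))       # placeholder gradient for illustration
W_new = ascent_step(W, grad)
print(np.allclose(W_new @ W_new.T, np.eye(2)))   # True
```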

## 4 Derivative of Quadratic Mutual Information

To cope with the weakness of the QMI estimation methods when performing supervised dimension reduction, we propose to directly estimate the derivative of QMI without estimating QMI itself.

### 4.1 Direct Estimation of the Derivative of Quadratic Mutual Information

From Eq.(15), the derivative of $\mathrm{QMI}(W)$ w.r.t. the $(\ell, \ell')$-th element of $W$ can be expressed as (throughout this section, we use $\mathrm{QMI}(W)$ instead of $\mathrm{QMI}(Z, Y)$ for notational convenience when we consider its derivative; it still represents the QMI between the random variables $z = Wx$ and $y$)

$$\begin{aligned}
\frac{\partial \mathrm{QMI}(W)}{\partial W_{\ell,\ell'}}
&= \frac{\partial}{\partial W_{\ell,\ell'}} \left( \frac{1}{2} \iint f(z, y)^2 \,\mathrm{d}z \mathrm{d}y \right) \\
&= \iint f(z, y) \frac{\partial f(z, y)}{\partial W_{\ell,\ell'}} \,\mathrm{d}z \mathrm{d}y \\
&= \iint f(z, y) \frac{\partial f(z, y)}{\partial z}^\top \frac{\partial z}{\partial W_{\ell,\ell'}} \,\mathrm{d}z \mathrm{d}y \\
&= \iint p(z, y) \frac{\partial f(z, y)}{\partial z}^\top \frac{\partial z}{\partial W_{\ell,\ell'}} \,\mathrm{d}z \mathrm{d}y - \iint p(z) p(y) \frac{\partial f(z, y)}{\partial z}^\top \frac{\partial z}{\partial W_{\ell,\ell'}} \,\mathrm{d}z \mathrm{d}y, \qquad (25)
\end{aligned}$$

where in the second line we assume that the order of the derivative and the integration is interchangeable. By approximating the expectations over the densities $p(z, y)$, $p(z)$, and $p(y)$ with sample averages, we obtain an approximation of the derivative of QMI as

$$\widehat{\frac{\partial \mathrm{QMI}(W)}{\partial W_{\ell,\ell'}}} = \frac{1}{n} \sum_{i=1}^{n} \frac{\partial f(z_i, y_i)}{\partial z}^\top \frac{\partial z_i}{\partial W_{\ell,\ell'}} - \frac{1}{n^2} \sum_{i,j=1}^{n} \frac{\partial f(z_i, y_j)}{\partial z}^\top \frac{\partial z_i}{\partial W_{\ell,\ell'}}. \qquad (26)$$

Note that since $z = Wx$, the vector $\frac{\partial z}{\partial W_{\ell,\ell'}}$ is the $d_z$-dimensional vector with zeros everywhere except at the $\ell$-th dimension, which has value $x^{(\ell')}$. Hence, Eq.(26) can be simplified as

$$\widehat{\frac{\partial \mathrm{QMI}(W)}{\partial W_{\ell,\ell'}}} = \frac{1}{n} \sum_{i=1}^{n} \frac{\partial f(z_i, y_i)}{\partial z^{(\ell)}} x_i^{(\ell')} - \frac{1}{n^2} \sum_{i,j=1}^{n} \frac{\partial f(z_i, y_j)}{\partial z^{(\ell)}} x_i^{(\ell')}. \qquad (27)$$

This means that the derivative of $\mathrm{QMI}(W)$ w.r.t. $W_{\ell,\ell'}$ can be obtained once we know the derivatives of the density difference w.r.t. $z^{(\ell)}$ for all $\ell \in \{1, \dots, d_z\}$. However, these derivatives are often unknown and need to be estimated from data. Below, we first discuss existing approaches and their drawbacks. Then we propose our approach, which can overcome the drawbacks.

### 4.2 Existing Approaches to Estimate the Derivative of the Density Difference

Our current goal is to obtain the derivative of the density difference w.r.t. $z^{(\ell)}$, which can be rewritten as

$$\frac{\partial f(z, y)}{\partial z^{(\ell)}} = \frac{\partial p(z, y)}{\partial z^{(\ell)}} - \frac{\partial p(z)}{\partial z^{(\ell)}} p(y). \qquad (28)$$

All terms in Eq.(28) are unknown in practice and need to be estimated from data. There are three existing approaches to estimate them.

(A) Density estimation

Separately estimate the densities $p(z, y)$, $p(z)$, and $p(y)$ by, e.g., kernel density estimation. Then estimate the right-hand side of Eq.(28) as

$$\frac{\partial \hat{p}(z, y)}{\partial z^{(\ell)}} - \frac{\partial \hat{p}(z)}{\partial z^{(\ell)}} \hat{p}(y),$$

where $\hat{p}(z, y)$, $\hat{p}(z)$, and $\hat{p}(y)$ denote the estimated densities.

(B) Density derivative estimation

Estimate the density $p(y)$ by, e.g., kernel density estimation. Next, separately estimate the density derivatives $\frac{\partial p(z, y)}{\partial z^{(\ell)}}$ and $\frac{\partial p(z)}{\partial z^{(\ell)}}$ by, e.g., the method of mean integrated square error for derivatives (Sasaki et al., 2015), which can estimate a density derivative without estimating the density itself. Then estimate the right-hand side of Eq.(28) as

$$\widehat{\frac{\partial p(z, y)}{\partial z^{(\ell)}}} - \widehat{\frac{\partial p(z)}{\partial z^{(\ell)}}} \hat{p}(y),$$

where $\hat{p}(y)$ denotes the estimated density, and $\widehat{\frac{\partial p(z, y)}{\partial z^{(\ell)}}}$ and $\widehat{\frac{\partial p(z)}{\partial z^{(\ell)}}}$ denote the (directly) estimated density derivatives.

(C) Density difference estimation

Estimate the density difference $f(z, y)$ by, e.g., the least-squares density difference (Sugiyama et al., 2013), which can estimate the density difference without estimating the densities themselves. Then estimate the left-hand side of Eq.(28) as

$$\frac{\partial \hat{f}(z, y)}{\partial z^{(\ell)}},$$

where $\hat{f}(z, y)$ denotes the (directly) estimated density difference.

The problem of approaches (A) and (B) is that they involve multiple estimation steps where some quantities are estimated first and then they are plugged into Eq.(28). Such multiple-step methods are not appropriate since each estimated quantity is obtained without regarding the others and the succeeding plug-in step using these estimates can magnify the estimation error contained in each estimated quantity.

On the other hand, approach (C) seems more promising than the previous two approaches since there is only one estimated quantity, $\hat{f}(z, y)$. However, it is still not the optimal approach, due to the fact that an accurate estimator of the density difference does not necessarily mean that its derivative is an accurate estimator of the derivative of the density difference.

To avoid the above problems, we propose a new approach which directly estimates the derivative of the density difference.

### 4.3 Direct Estimation of the Derivative of the Density Difference

We propose to estimate the derivative of the density difference w.r.t. $z^{(\ell)}$ using a model $g_\ell(z, y)$:

$$\frac{\partial f(z, y)}{\partial z^{(\ell)}} \approx g_\ell(z, y).$$

The model $g_\ell$ is learned so that it is fitted to its corresponding derivative under the squared loss:

$$\frac{1}{2} \iint \left( g_\ell(z, y) - \frac{\partial f(z, y)}{\partial z^{(\ell)}} \right)^2 \mathrm{d}z \mathrm{d}y. \qquad (29)$$

By expanding the square, we obtain

$$\frac{1}{2} \iint g_\ell(z, y)^2 \,\mathrm{d}z \mathrm{d}y - \iint g_\ell(z, y) \frac{\partial f(z, y)}{\partial z^{(\ell)}} \,\mathrm{d}z \mathrm{d}y + \frac{1}{2} \iint \left( \frac{\partial f(z, y)}{\partial z^{(\ell)}} \right)^2 \mathrm{d}z \mathrm{d}y.$$

Since the last term is a constant w.r.t. the model $g_\ell$, we omit it and obtain the following criterion:

$$\frac{1}{2} \iint g_\ell(z, y)^2 \,\mathrm{d}z \mathrm{d}y - \iint g_\ell(z, y) \frac{\partial f(z, y)}{\partial z^{(\ell)}} \,\mathrm{d}z \mathrm{d}y. \qquad (30)$$

The second term is intractable due to the unknown derivative of the density difference. To make this term tractable, we use integration by parts (Kasube, 1983) to obtain the following:

$$\iint \Big[ g_\ell(z, y) f(z, y) \Big]_{z^{(\ell)} = -\infty}^{z^{(\ell)} = \infty} \,\mathrm{d}(z \setminus z^{(\ell)}) \,\mathrm{d}y = \iint f(z, y) \frac{\partial g_\ell(z, y)}{\partial z^{(\ell)}} \,\mathrm{d}z \mathrm{d}y + \iint g_\ell(z, y) \frac{\partial f(z, y)}{\partial z^{(\ell)}} \,\mathrm{d}z \mathrm{d}y, \qquad (31)$$

where $\mathrm{d}(z \setminus z^{(\ell)})$ denotes integration over $z$ except for the $\ell$-th element. Here, we require

$$\Big[ g_\ell(z, y) f(z, y) \Big]_{z^{(\ell)} = -\infty}^{z^{(\ell)} = \infty} = 0, \qquad (32)$$

which is a mild assumption since the tails of the density difference often vanish as $|z^{(\ell)}|$ approaches infinity. Applying the assumption to the left-hand side of Eq.(31) allows us to express Eq.(30) as

$$\frac{1}{2} \iint g_\ell(z, y)^2 \,\mathrm{d}z \mathrm{d}y + \iint f(z, y) \frac{\partial g_\ell(z, y)}{\partial z^{(\ell)}} \,\mathrm{d}z \mathrm{d}y.$$

Then, the estimator $\hat{g}_\ell$ is obtained as a solution of the following minimization problem:

$$\hat{g}_\ell = \mathop{\mathrm{argmin}}_{g_\ell} \left[ \frac{1}{2} \iint g_\ell(z, y)^2 \,\mathrm{d}z \mathrm{d}y + \iint f(z, y) \frac{\partial g_\ell(z, y)}{\partial z^{(\ell)}} \,\mathrm{d}z \mathrm{d}y \right]. \qquad (33)$$

The solution of Eq.(33) depends on the choice of the model. Let us employ the following linear-in-parameter model as $g_\ell$:

$$g_\ell(z, y) = \theta_\ell^\top \varphi_\ell(z, y), \qquad (34)$$

where $\theta_\ell$ is a parameter vector and $\varphi_\ell(z, y)$ is a basis function vector whose practical choice will be discussed later in detail. For this model, finding the solution of Eq.(33) is equivalent to solving

$$\min_{\theta_\ell} \left[ \frac{1}{2} \theta_\ell^\top H_\ell \theta_\ell + \theta_\ell^\top h_\ell \right], \qquad (35)$$

where we define

$$H_\ell = \iint \varphi_\ell(z, y) \varphi_\ell(z, y)^\top \,\mathrm{d}z \mathrm{d}y, \qquad (36)$$

$$h_\ell = \iint f(z, y) \frac{\partial \varphi_\ell(z, y)}{\partial z^{(\ell)}} \,\mathrm{d}z \mathrm{d}y = \iint p(z, y) \frac{\partial \varphi_\ell(z, y)}{\partial z^{(\ell)}} \,\mathrm{d}z \mathrm{d}y - \iint p(z) p(y) \frac{\partial \varphi_\ell(z, y)}{\partial z^{(\ell)}} \,\mathrm{d}z \mathrm{d}y. \qquad (37)$$

By approximating the expectations over the densities $p(z, y)$, $p(z)$, and $p(y)$ with sample averages, we obtain the following empirical minimization problem:

 minθℓ[12θ⊤ℓHℓθℓ+θ⊤ℓˆhℓ], (38)

where $\hat{h}_\ell$ is the sample approximation of Eq.(37):

$$\hat{h}_\ell=\frac{1}{n}\sum_{i=1}^n\frac{\partial\varphi_\ell(z_i,y_i)}{\partial z^{(\ell)}}-\frac{1}{n^2}\sum_{i,j=1}^n\frac{\partial\varphi_\ell(z_i,y_j)}{\partial z^{(\ell)}}.\tag{39}$$
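Eq.(39) is just a pair of sample averages and is straightforward to compute. The following is a minimal NumPy sketch (our own illustration, not the reference implementation; `dphi` is a hypothetical stand-in for the basis-derivative vector $\partial\varphi_\ell/\partial z^{(\ell)}$, whose concrete Gaussian form is given in Section 4.4):

```python
import numpy as np

def h_hat(Z, Y, dphi):
    """Sample approximation of h_ell (Eq. 39).

    Z : (n, d_z) projected inputs, Y : (n, d_y) outputs.
    dphi(z, y) -> (b,) vector of basis derivatives w.r.t. z^(ell).
    """
    n = len(Z)
    # First term: average over paired samples (z_i, y_i), which
    # approximates the expectation under the joint density p(z, y).
    joint = np.mean([dphi(Z[i], Y[i]) for i in range(n)], axis=0)
    # Second term: average over all (z_i, y_j) pairs, which
    # approximates the expectation under the product p(z)p(y).
    prod = np.mean([dphi(Z[i], Y[j])
                    for i in range(n) for j in range(n)], axis=0)
    return joint - prod
```

Note that the second term averages over all $n^2$ combinations of $z_i$ and $y_j$; for large $n$, vectorizing this double loop is advisable.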

By including the regularization term $\frac{\lambda_\ell}{2}\theta_\ell^\top\theta_\ell$ to control the model complexity, we obtain

$$\hat{\theta}_\ell=\operatorname*{argmin}_{\theta_\ell}\left[\frac{1}{2}\theta_\ell^\top H_\ell\theta_\ell+\theta_\ell^\top\hat{h}_\ell+\frac{\lambda_\ell}{2}\theta_\ell^\top\theta_\ell\right],\tag{40}$$

where $\lambda_\ell$ denotes the regularization parameter. This minimization problem is convex w.r.t. the parameter $\theta_\ell$, and the solution can be obtained analytically as

$$\hat{\theta}_\ell=-(H_\ell+\lambda_\ell I)^{-1}\hat{h}_\ell,\tag{41}$$

where $I$ denotes the identity matrix. Finally, the estimator of the derivative of the density difference is obtained by substituting the solution into the model Eq.(34) as

$$\hat{g}_\ell(z,y)=\hat{\theta}_\ell^\top\varphi_\ell(z,y).\tag{42}$$

Using this solution, an estimator of the derivative of QMI can be directly obtained by substituting Eq.(42) into Eq.(27) as

$$\widehat{\frac{\partial\mathrm{QMI}(W)}{\partial W_{\ell,\ell'}}}=\frac{1}{n}\sum_{i=1}^n\hat{\theta}_\ell^\top\varphi_\ell(z_i,y_i)x_i^{(\ell')}-\frac{1}{n^2}\sum_{i,j=1}^n\hat{\theta}_\ell^\top\varphi_\ell(z_i,y_j)x_i^{(\ell')}.\tag{43}$$

We call this method the least-squares QMI derivative (LSQMID).
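To make the two computational steps above concrete — the ridge solution of Eq.(41) and the plug-in gradient estimate of Eq.(43) — here is a minimal NumPy sketch. This is our own illustration, not the reference implementation: the function names are hypothetical, `H` is assumed to be precomputed (e.g., analytically as in Section 4.4), and `phi(z, y)` stands in for the basis vector $\varphi_\ell$.

```python
import numpy as np

def lsqmid_theta(H, h_hat, lam):
    """Analytic ridge solution of Eq.(41): theta = -(H + lam*I)^{-1} h_hat."""
    return -np.linalg.solve(H + lam * np.eye(H.shape[0]), h_hat)

def qmi_derivative(theta, phi, Z, Y, X, lprime):
    """Direct estimate of dQMI/dW_{l,l'} (Eq. 43) from the fitted theta.

    phi(z, y) -> (b,) basis vector; Z = X @ W.T are projected inputs,
    and lprime indexes the column of the original inputs X.
    """
    n = len(Z)
    x = X[:, lprime]
    # joint-sample term of Eq.(43)
    joint = np.mean([theta @ phi(Z[i], Y[i]) * x[i] for i in range(n)])
    # product-of-marginals term of Eq.(43)
    prod = np.mean([theta @ phi(Z[i], Y[j]) * x[i]
                    for i in range(n) for j in range(n)])
    return joint - prod
```

Solving the linear system directly (rather than forming the inverse) is the numerically preferable way to evaluate Eq.(41).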

### 4.4 Basis Function Design

As the basis function vector $\varphi_\ell(z,y)$, we propose to use

$$\varphi_\ell(z,y)=\left[\varphi_\ell^{(1)}(z,y),\dots,\varphi_\ell^{(b)}(z,y)\right]^\top,$$

where $b$ denotes the number of basis functions. First, let us define the $k$-th Gaussian function as

$$\phi_\ell^{(k)}(z,y)=\exp\left(-\frac{\|z-u_k\|^2+\|y-v_k\|^2}{2\sigma_\ell^2}\right),\tag{44}$$

where $u_k$ and $v_k$ denote Gaussian centers chosen randomly from the data samples $\{(z_i,y_i)\}_{i=1}^n$, and $\sigma_\ell$ denotes the Gaussian width. We could use different Gaussian widths for $z$ and $y$, but this approach significantly increases the computation time for model selection, which will be discussed in Section 4.5. In our implementation, we standardize each dimension of $z$ and $y$ to have unit variance and zero mean, and then use the common Gaussian width $\sigma_\ell$ for both $z$ and $y$. The number of basis functions $b$ is fixed in the experiments.

Based on the above Gaussian function, we propose to use the following function as the $k$-th basis for the $\ell$-th model of the derivative of the density difference:

$$\varphi_\ell^{(k)}(z,y)=\frac{\partial\phi_\ell^{(k)}(z,y)}{\partial z^{(\ell)}}=-\frac{1}{\sigma_\ell^2}\left(z^{(\ell)}-u_k^{(\ell)}\right)\phi_\ell^{(k)}(z,y).\tag{45}$$

This function is the derivative of the $k$-th Gaussian basis function w.r.t. $z^{(\ell)}$. A benefit of this basis function design is that the integral appearing in $H_\ell$ can be computed analytically. Through some simple calculation, we obtain the $(k,k')$-th element of $H_\ell$ as follows:

$$H_\ell^{(k,k')}=\frac{1}{\sigma_\ell^4}(\sqrt{\pi}\sigma_\ell)^{d_z+d_y}\exp\left(-\frac{\|u_k-u_{k'}\|^2+\|v_k-v_{k'}\|^2}{4\sigma_\ell^2}\right)\times\left(u_k^{(\ell)}u_{k'}^{(\ell)}-\frac{\bigl(u_k^{(\ell)}+u_{k'}^{(\ell)}\bigr)^2}{2}+\left(\frac{u_k^{(\ell)}+u_{k'}^{(\ell)}}{2}\right)^2+\frac{\sigma_\ell^2}{2}\right).$$
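As a sanity check on this closed-form expression, the following sketch (our own illustration, for the scalar case $d_z=d_y=1$ and $\ell=1$) compares it with brute-force numerical integration of $\iint\varphi_\ell^{(k)}\varphi_\ell^{(k')}\,\mathrm{d}z\,\mathrm{d}y$; since the integrand factorizes into a $z$-part and a $y$-part, two one-dimensional sums suffice.

```python
import numpy as np

def H_entry(u, up, v, vp, sigma):
    """Closed-form (k, k')-th element of H for scalar z and y."""
    m = 0.5 * (u + up)  # midpoint of the two z-centers
    factor = (np.sqrt(np.pi) * sigma) ** 2 / sigma ** 4
    gauss = np.exp(-((u - up) ** 2 + (v - vp) ** 2) / (4 * sigma ** 2))
    poly = u * up - (u + up) ** 2 / 2 + m ** 2 + sigma ** 2 / 2
    return factor * gauss * poly

def H_entry_numeric(u, up, v, vp, sigma, lo=-10.0, hi=10.0, num=4001):
    """Brute-force integral of varphi_k * varphi_k' on a fine grid."""
    t = np.linspace(lo, hi, num)
    dt = t[1] - t[0]
    # z-part: product of the two Gaussian derivatives w.r.t. z (Eq. 45)
    gz = ((t - u) * (t - up) / sigma ** 4
          * np.exp(-((t - u) ** 2 + (t - up) ** 2) / (2 * sigma ** 2)))
    # y-part: product of the two plain Gaussian factors
    gy = np.exp(-((t - v) ** 2 + (t - vp) ** 2) / (2 * sigma ** 2))
    return gz.sum() * dt * gy.sum() * dt
```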

As discussed in Section 5, this basis function choice has further benefits when we develop a supervised dimension reduction method.

### 4.5 Model Selection by Cross-Validation

The practical performance of the LSQMID method depends on the choice of the Gaussian width $\sigma_\ell$ and the regularization parameter $\lambda_\ell$ included in the estimator $\hat{g}_\ell$. These tuning parameters can be objectively chosen by the $K$-fold cross-validation (CV) procedure described below.

1. Divide the training data $\mathcal{D}=\{(z_i,y_i)\}_{i=1}^n$ into $K$ disjoint subsets $\{\mathcal{D}_j\}_{j=1}^K$ of approximately the same size. The value of $K$ is fixed in the experiments.

2. For each candidate pair $M=(\sigma_\ell,\lambda_\ell)$ and each subset $\mathcal{D}_j$, compute the solution $\hat{\theta}_{\ell,M,\setminus j}$ by Eq.(41) with the candidate $M$ and the samples from $\mathcal{D}\setminus\mathcal{D}_j$ (i.e., all data samples except those in $\mathcal{D}_j$).

3. Compute the CV score of each candidate pair $M$ by

$$\mathrm{CV}_\ell(M)=\frac{1}{K}\sum_{j=1}^K\left[\frac{1}{2}\hat{\theta}_{\ell,M,\setminus j}^\top H_{\ell,M}\hat{\theta}_{\ell,M,\setminus j}+\hat{\theta}_{\ell,M,\setminus j}^\top\hat{h}_{\ell,M,j}\right],$$

where $\hat{h}_{\ell,M,j}$ denotes $\hat{h}_\ell$ computed from the candidate $M$ and the samples in $\mathcal{D}_j$.

4. Choose the tuning parameter pair that minimizes the CV score as

$$(\hat{\sigma}_\ell,\hat{\lambda}_\ell)=\operatorname*{argmin}_M\,\mathrm{CV}_\ell(M).$$
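The four steps above form a standard $K$-fold CV loop, which can be sketched generically as follows (our own illustration; `fit` and `score` are hypothetical callbacks corresponding to Eq.(41) and the bracketed CV term, respectively):

```python
import numpy as np

def select_by_cv(n, K, candidates, fit, score):
    """Choose the candidate M = (sigma, lambda) minimizing the CV score.

    fit(train_idx, M)         -> theta fitted on the training folds.
    score(theta, test_idx, M) -> the bracketed term for one held-out fold.
    """
    folds = np.array_split(np.random.permutation(n), K)
    best_M, best_cv = None, np.inf
    for M in candidates:
        cv = 0.0
        for j in range(K):
            train_idx = np.concatenate(
                [folds[i] for i in range(K) if i != j])
            theta = fit(train_idx, M)        # Eq.(41) on D \ D_j
            cv += score(theta, folds[j], M)  # evaluated on D_j
        cv /= K
        if cv < best_cv:
            best_M, best_cv = M, cv
    return best_M
```

Because the CV score is computed separately for each $\ell$, the tuning parameters can be adapted per output dimension of the derivative model.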

## 5 Supervised Dimension Reduction via LSQMID

In this section, we propose a supervised dimension reduction method based on the proposed LSQMID estimator.

### 5.1 Gradient ascent via LSQMID

Recall that our goal in supervised dimension reduction is to find the matrix $W^*$ such that

$$W^*=\operatorname*{argmax}_{W\in\{W\,\mid\,WW^\top=I_{d_z}\}}\mathrm{QMI}(Z,Y).\tag{46}$$

A straightforward approach to find a solution of Eq.(46) using the proposed method is to perform gradient ascent as

$$W\leftarrow W+t\,\widehat{\frac{\partial\mathrm{QMI}(W)}{\partial W}},\tag{47}$$

where $t$ denotes the step size. It is known that choosing a good step size is a difficult task in practice (Nocedal and Wright, 2006). Line search is an algorithm that chooses a good step size by finding one which satisfies certain conditions such as the Armijo rule (Armijo, 1966). However, these conditions often require access to the objective value, which is unavailable in our current setup since the derivative of QMI is estimated directly without estimating QMI itself. Thus, if we want to perform line search, QMI needs to be estimated separately. This is problematic since the estimation of the derivative of QMI and the estimation of QMI would then be performed independently without regard to each other, and thus they may not be consistent. For example, the estimated gradient, which is supposed to be an ascent direction, may be regarded as a descent direction on the surface of the separately estimated QMI. In such a case, the step size chosen by any line search algorithm is unreliable, and the resulting $W$ may not be a good solution.
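For concreteness, one plain update of Eq.(47) with a fixed step size can be sketched as below. The QR-based re-orthonormalization is our own addition, one common way to restore the constraint $WW^\top=I_{d_z}$ after an unconstrained step; it is not part of the derivation above.

```python
import numpy as np

def ascent_step(W, grad, t):
    """One fixed-step gradient-ascent update of Eq.(47),
    followed by re-orthonormalization of the rows of W."""
    W_new = W + t * grad
    # QR of W_new^T yields an orthonormal basis of its row space,
    # so the returned matrix satisfies W W^T = I (up to sign flips).
    Q, _ = np.linalg.qr(W_new.T)
    return Q.T
```

A fixed step size avoids line search entirely, at the cost of the convergence issues discussed above.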

Below, we consider two approaches which can cope with this problem.

### 5.2 QMI Approximation via LSQMID

To avoid separate QMI estimation, we consider an approximated QMI which is obtained as a by-product of the proposed method. Recall that the proposed method models the derivative of the density difference as

$$\frac{\partial f(z,y)}{\partial z^{(\ell)}}\approx g_\ell(z,y)=\theta_\ell^\top\varphi_\ell(z,y)=\theta_\ell^\top\frac{\partial\phi_\ell(z,y)}{\partial z^{(\ell)}}=\frac{\partial\bigl(\theta_\ell^\top\phi_\ell(z,y)\bigr)}{\partial z^{(\ell)}}.$$

This means that the density difference can be approximated by

$$\tilde{f}_\ell(z,y)=\hat{\theta}_\ell^\top\phi_\ell(z,y)+c_\ell,\tag{48}$$

where $c_\ell$ is an unknown quantity which is constant w.r.t. $z^{(\ell)}$.

In a special case where