Feature Robust Optimal Transport for High-dimensional Data

Abstract

Optimal transport is a machine learning problem with applications including distribution comparison, feature selection, and generative adversarial networks. In this paper, we propose feature-robust optimal transport (FROT) for high-dimensional data, which solves high-dimensional OT problems using feature selection to avoid the curse of dimensionality. Specifically, we find a transport plan with discriminative features. To this end, we formulate the FROT problem as a min–max optimization problem. We then propose a convex formulation of the FROT problem and solve it using a Frank–Wolfe-based optimization algorithm, whereby the subproblem can be efficiently solved using the Sinkhorn algorithm. Since FROT finds the transport plan from selected features, it is robust to noise features. To show the effectiveness of FROT, we propose using the FROT algorithm for the layer selection problem in deep neural networks for semantic correspondence. By conducting synthetic and benchmark experiments, we demonstrate that the proposed method can find a strong correspondence by determining important layers. We show that the FROT algorithm achieves state-of-the-art performance in real-world semantic correspondence datasets.

1 Introduction

Optimal transport (OT) is a machine learning problem with several applications in the computer vision and natural language processing communities. The applications include Wasserstein distance estimation [29], domain adaptation [39], multitask learning [18], barycenter estimation [7], semantic correspondence [23], feature matching [34], and photo album summarization [22]. The OT problem has been extensively studied in the computer vision community as the earth mover's distance (EMD) [33]. However, the computational cost of EMD is cubic in the number of samples and thus highly expensive. Recently, the entropic regularized EMD problem was proposed; this problem can be solved using the Sinkhorn algorithm at a quadratic cost [6]. Owing to the development of the Sinkhorn algorithm, researchers have replaced the EMD computation with its regularized counterparts. However, the optimal transport problem for high-dimensional data has remained unsolved for many years.

Recently, a robust variant of OT was proposed for high-dimensional OT problems and used for divergence estimation [27, 28]. In the robust OT framework, the transport plan is computed with the discriminative subspace of the two data matrices $X \in \mathbb{R}^{d \times n}$ and $Y \in \mathbb{R}^{d \times m}$. The subspace can be obtained using dimensionality reduction. An advantage of the subspace robust approach is that it does not require prior information about the subspace. However, given prior information such as feature groups, we can consider a computationally efficient formulation, because the computation of the subspace can be expensive if the dimensionality of the data is high.

One of the most common forms of prior information is a feature group. The use of group features is popular in feature selection problems in the biomedical domain and has been extensively studied in Group Lasso [40]. The key idea of Group Lasso is to prespecify the group variables and select the set of group variables using the group norm (i.e., the sum of $\ell_2$ norms). For example, if we use a pretrained neural network as a feature extractor and compute OT using the features, then we require careful selection of important layers to compute OT. Specifically, each layer output is regarded as a grouped input. Therefore, using a feature group as prior information is a natural setup and is important for considering OT for deep neural networks (DNNs).

(a) OT on clean data.
(b) OT on noisy data.
(c) FROT on noisy data.
Figure 1: Transport plans between two synthetic distributions of $d$-dimensional vectors $\{x_i\}_{i=1}^n$ and $\{y_j\}_{j=1}^m$, where two dimensions are true features and the remaining dimensions are noisy features. (a) OT between the clean two-dimensional distributions, shown as a reference. (b) OT between the noisy $d$-dimensional distributions. (c) FROT transport plan between the noisy distributions, where the true features and the noisy features are grouped, respectively.

In this paper, we propose a high-dimensional optimal transport method that utilizes prior information in the form of grouped features. Specifically, we propose a feature-robust optimal transport (FROT) problem, in which we select discriminative feature groups to estimate a transport plan instead of determining a discriminative subspace, as proposed in [27, 28]. To this end, we formulate the FROT problem as a min–max optimization problem and transform it into a convex optimization problem, which can be accurately solved using the Frank–Wolfe algorithm [12, 17]; the FROT subproblem can be efficiently solved using the Sinkhorn algorithm [6]. An advantage of FROT is that it can yield a transport plan from high-dimensional data using feature selection, and the importance of each feature group is obtained as a by-product, without additional cost. Therefore, the FROT formulation is highly suited to high-dimensional OT problems. Through synthetic experiments, we first demonstrate that the proposed FROT is robust to noise dimensions (see Figure 1). Furthermore, we apply FROT to a semantic correspondence problem [23] and show that the proposed algorithm achieves state-of-the-art (SOTA) performance.

2 Background

In this section, we briefly introduce the OT problem.

Optimal transport (OT): The following are given: independent and identically distributed (i.i.d.) samples $X = \{x_i\}_{i=1}^n$ from a $d$-dimensional distribution $p$, and i.i.d. samples $Y = \{y_j\}_{j=1}^m$ from the $d$-dimensional distribution $q$. In the Kantorovich relaxation of OT, admissible couplings are defined by the set of transport plans:
$$U(\mu, \nu) = \left\{ \Pi \in \mathbb{R}_+^{n \times m} : \Pi 1_m = a, \ \Pi^\top 1_n = b \right\},$$

where $\Pi$ is called the transport plan, $1_n$ is the $n$-dimensional vector whose elements are ones, and $a = (a_1, \ldots, a_n)^\top$ and $b = (b_1, \ldots, b_m)^\top$ are the weights. The OT problem between two discrete measures $\mu = \sum_{i=1}^n a_i \delta_{x_i}$ and $\nu = \sum_{j=1}^m b_j \delta_{y_j}$ determines the optimal transport plan of the following problem:

$$\min_{\Pi \in U(\mu,\nu)} \ \sum_{i=1}^n \sum_{j=1}^m \pi_{ij}\, c(x_i, y_j), \tag{1}$$

where $c(x, y)$ is a cost function; for example, the squared Euclidean distance $c(x, y) = \|x - y\|_2^2$ is used. Solving the OT problem, Eq. (1) (also known as the earth mover's distance), using linear programming requires $O(n^3 \log n)$ (for $n = m$) computation, which is expensive. To address this, an entropic-regularized optimal transport was proposed [6]:
$$\min_{\Pi \in U(\mu,\nu)} \ \sum_{i=1}^n \sum_{j=1}^m \pi_{ij}\, c(x_i, y_j) + \epsilon H(\Pi),$$
where $\epsilon \ge 0$ is the regularization parameter, and $H(\Pi) = \sum_{i,j} \pi_{ij} (\log \pi_{ij} - 1)$ is the entropic regularization. If $\epsilon = 0$, the regularized OT problem reduces to the EMD problem. Owing to the entropic regularization, the entropic regularized OT problem can be accurately solved using Sinkhorn iteration [6] with an $O(nm)$ computational cost per iteration (see Algorithm 1).
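For concreteness, the Sinkhorn iteration described above can be sketched in a few lines of NumPy. This is our own minimal illustration (not the authors' code), with convergence checked on the row-scaling vector:

```python
import numpy as np

def sinkhorn(a, b, C, eps, max_iters=1000, tol=1e-9):
    """Entropic-regularized OT via Sinkhorn iterations.

    a: (n,) source weights, b: (m,) target weights,
    C: (n, m) cost matrix, eps: regularization parameter.
    Returns the (n, m) transport plan diag(u) K diag(v).
    """
    K = np.exp(-C / eps)                  # Gibbs kernel
    u = np.ones_like(a)
    for _ in range(max_iters):
        u_prev = u
        v = b / (K.T @ u)                 # column scaling
        u = a / (K @ v)                   # row scaling
        if np.max(np.abs(u - u_prev)) < tol:
            break
    return u[:, None] * K * v[None, :]
```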

Wasserstein distance: If the cost function is defined as $c(x, y) = d(x, y)^p$ with $d(\cdot, \cdot)$ a distance function and $p \ge 1$, then we define the $p$-Wasserstein distance of two discrete measures $\mu$ and $\nu$ as
$$W_p(\mu, \nu) = \left( \min_{\Pi \in U(\mu,\nu)} \ \sum_{i=1}^n \sum_{j=1}^m \pi_{ij}\, d(x_i, y_j)^p \right)^{1/p}.$$

Recently, a robust variant of the Wasserstein distance, called the subspace robust Wasserstein distance (SRW), was proposed [27]. SRW computes the OT problem in a discriminative subspace, which is determined by solving a dimensionality-reduction problem. Owing to this robustness, SRW can compute the Wasserstein distance from noisy data. The SRW is given as
$$\mathrm{SRW}(\mu, \nu) = \min_{\Pi \in U(\mu,\nu)} \ \max_{U \in \mathbb{R}^{d \times k},\, U^\top U = I_k} \ \sum_{i=1}^n \sum_{j=1}^m \pi_{ij}\, \|U^\top x_i - U^\top y_j\|_2^2, \tag{2}$$
where $U$ is the projection matrix with $k \le d$, and $I_k$ is the $k \times k$ identity matrix. The SRW or its relaxed problem can be efficiently estimated using either eigenvalue decomposition or the Frank–Wolfe algorithm.

3 Proposed Method

This paper proposes FROT. We assume that the feature vectors are grouped as $x = (x^{(1)\top}, \ldots, x^{(L)\top})^\top$ and $y = (y^{(1)\top}, \ldots, y^{(L)\top})^\top$. Here, $x^{(l)} \in \mathbb{R}^{d_l}$ and $y^{(l)} \in \mathbb{R}^{d_l}$ are the $d_l$-dimensional vectors, where $\sum_{l=1}^L d_l = d$. This setting is useful if we know the explicit group structure of the feature vectors a priori. In an application to $L$-layer neural networks, we consider $x^{(l)}$ and $y^{(l)}$ as outputs of the $l$th layer of the network. If we do not have a priori information, we can consider each feature independently (i.e., $d_1 = \cdots = d_L = 1$ and $L = d$). All proofs in this section are provided in the Supplementary Material.

3.1 Feature-Robust Optimal Transport (FROT)

The FROT formulation is given by
$$\mathrm{FROT}(\mu, \nu) = \min_{\Pi \in U(\mu,\nu)} \ \max_{\alpha \in \Sigma_L} \ \sum_{l=1}^L \alpha_l \langle \Pi, C_l \rangle, \tag{3}$$
where $\Sigma_L = \{\alpha \in \mathbb{R}_+^L : \alpha^\top 1_L = 1\}$ is the probability simplex, and $[C_l]_{ij} = c(x_i^{(l)}, y_j^{(l)})$ is the cost matrix of the $l$th feature group. The underlying concept of FROT is to estimate the transport plan using distinct groups, that is, groups with large distances between $\{x_i^{(l)}\}_{i=1}^n$ and $\{y_j^{(l)}\}_{j=1}^m$. We note that determining the transport plan from nondistinct groups is difficult because the data samples in $\{x_i^{(l)}\}_{i=1}^n$ and $\{y_j^{(l)}\}_{j=1}^m$ overlap. By contrast, in distinct groups, $\{x_i^{(l)}\}_{i=1}^n$ and $\{y_j^{(l)}\}_{j=1}^m$ are well separated, and this aids in determining an optimal transport plan. This idea is intrinsically similar to the subspace robust Wasserstein distance [27], which estimates the transport plan in a discriminative subspace, while our approach selects important groups. Therefore, FROT can be regarded as a feature-selection variant of the vanilla OT problem in Eq. (1), whereas the subspace robust version is its dimensionality-reduction counterpart.

1:  Input: $a$, $b$, $C$, $\epsilon$, $t_{\max}$
2:  Initialize $K = e^{-C/\epsilon}$, $u = 1_n$, $v = 1_m$, $t = 0$
3:  while $t \le t_{\max}$ and not converge do
4:     $u = a / (K v)$
5:     $v = b / (K^\top u)$
6:     $t = t + 1$
7:  end while
8:  return $\Pi = \mathrm{diag}(u)\, K\, \mathrm{diag}(v)$
Algorithm 1 Sinkhorn algorithm.
1:  Input: $\{x_i\}_{i=1}^n$, $\{y_j\}_{j=1}^m$, $\eta$, and $\epsilon$.
2:  Initialize $\Pi^{(0)} = a b^\top$, compute $\{C_l\}_{l=1}^L$.
3:  for $t = 0, 1, \ldots, T$ do
4:     $\alpha_l^{(t)} = \exp\!\big(\tfrac{1}{\eta}\langle \Pi^{(t)}, C_l \rangle\big) \big/ \sum_{l'=1}^L \exp\!\big(\tfrac{1}{\eta}\langle \Pi^{(t)}, C_{l'} \rangle\big)$, and $M_{\Pi^{(t)}} = \sum_{l=1}^L \alpha_l^{(t)} C_l$
5:     $\widehat{\Pi}^{(t)} = \operatorname{argmin}_{\Pi \in U(\mu,\nu)} \ \langle \Pi, M_{\Pi^{(t)}} \rangle$
6:     $\Pi^{(t+1)} = (1 - \gamma)\, \Pi^{(t)} + \gamma\, \widehat{\Pi}^{(t)}$ with $\gamma = \frac{2}{2 + t}$.
7:  end for
8:  return $\Pi^{(T+1)}$
Algorithm 2 FROT with the Frank–Wolfe.

Using FROT, we can define a $p$-feature robust Wasserstein distance ($p$-FRWD).

Proposition 1

For the distance function $d(x, y)$,
$$\mathrm{FRWD}_p(\mu, \nu) = \left( \min_{\Pi \in U(\mu,\nu)} \ \max_{\alpha \in \Sigma_L} \ \sum_{i=1}^n \sum_{j=1}^m \pi_{ij} \sum_{l=1}^L \alpha_l\, d(x_i^{(l)}, y_j^{(l)})^p \right)^{1/p} \tag{4}$$
is a distance for $p \ge 1$.

Note that we can show that 2-FRWD is a special case of SRW with $d(x, y) = \|x - y\|_2$ (see the Supplementary Material). The key difference between SRW and FRWD is that FRWD can use any distance, while SRW can use only the $\ell_2$ distance.

3.2 FROT Optimization

Here, we propose two FROT algorithms based on the Frank–Wolfe algorithm and linear programming.

Frank–Wolfe: We propose a continuous variant of the FROT algorithm using the Frank–Wolfe algorithm, which is fully differentiable. To this end, we introduce entropic regularization for $\alpha$ and rewrite FROT as a function of $\Pi$. Therefore, we solve the following problem for $\Pi$:
$$\min_{\Pi \in U(\mu,\nu)} \ \max_{\alpha \in \Sigma_L} \ J_\eta(\Pi, \alpha), \quad \text{with} \quad J_\eta(\Pi, \alpha) = \sum_{l=1}^L \alpha_l \langle \Pi, C_l \rangle + \eta H(\alpha),$$
where $\eta \ge 0$ is the regularization parameter, and $H(\alpha) = -\sum_{l=1}^L \alpha_l \log \alpha_l$ is the entropic regularization for $\alpha$. An advantage of the entropic regularization is that the nonnegativity constraint on $\alpha$ is naturally satisfied, and the negative entropy $-H(\alpha)$ is a strongly convex function.

Lemma 2

The optimal solution of the optimization problem
$$\max_{\alpha \in \Sigma_L} \ \sum_{l=1}^L \alpha_l \phi_l + \eta H(\alpha),$$
with a fixed admissible transport plan (i.e., fixed $\phi_l$), is given by
$$\alpha_l^* = \frac{\exp\!\left(\frac{1}{\eta} \phi_l\right)}{\sum_{l'=1}^L \exp\!\left(\frac{1}{\eta} \phi_{l'}\right)}.$$

Using Lemma 2 (or Lemma 4 in Nesterov [26]) together with the setting $\phi_l = \langle \Pi, C_l \rangle$, $l = 1, \ldots, L$, the global problem is equivalent to

$$\min_{\Pi \in U(\mu,\nu)} \ G_\eta(\Pi), \quad \text{with} \quad G_\eta(\Pi) = \eta \log \left( \sum_{l=1}^L \exp\!\left( \frac{1}{\eta} \langle \Pi, C_l \rangle \right) \right). \tag{5}$$

Note that $G_\eta(\Pi)$ is known as a smoothed max-operator [26, 4]. Specifically, the regularization parameter $\eta$ controls the "smoothness" of the maximum: as $\eta \to 0$, $G_\eta(\Pi)$ approaches $\max_l \langle \Pi, C_l \rangle$.
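For intuition, the smoothed max-operator in Eq. (5) is a log-sum-exp, and the optimal $\alpha^*$ of Lemma 2 is its softmax gradient. A toy sketch (the values are arbitrary illustrations of ours):

```python
import numpy as np
from scipy.special import logsumexp, softmax

phi = np.array([0.3, 1.2, 0.9])           # phi_l = <Pi, C_l> for three groups

for eta in [10.0, 1.0, 0.1, 0.01]:
    g = eta * logsumexp(phi / eta)        # smoothed max G_eta
    alpha = softmax(phi / eta)            # optimal alpha* (Lemma 2)
    print(f"eta={eta:5.2f}  G_eta={g:.4f}  alpha*={np.round(alpha, 3)}")
# As eta -> 0, G_eta approaches max(phi) and alpha* concentrates on the argmax.
```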

Proposition 3

$G_\eta(\Pi)$ is a convex function with respect to $\Pi$.

The derived optimization problem of FROT is convex; therefore, we can determine a globally optimal solution. Note that the SRW optimization problem is not jointly convex [27] in the projection matrix and the transport plan. In this study, we employ the Frank–Wolfe algorithm [12, 17], in which we approximate $G_\eta(\Pi)$ with a linear function at $\Pi^{(t)}$ and move toward the optimal solution within the convex set $U(\mu, \nu)$ (see Algorithm 2).

The derivative of the loss function $G_\eta(\Pi)$ at $\Pi^{(t)}$ is given by
$$\nabla G_\eta(\Pi^{(t)}) = \sum_{l=1}^L \alpha_l^{(t)} C_l =: M_{\Pi^{(t)}}, \quad \text{with} \quad \alpha_l^{(t)} = \frac{\exp\!\left(\frac{1}{\eta} \langle \Pi^{(t)}, C_l \rangle\right)}{\sum_{l'=1}^L \exp\!\left(\frac{1}{\eta} \langle \Pi^{(t)}, C_{l'} \rangle\right)}.$$
Then, we update the transport plan by solving the EMD problem:
$$\widehat{\Pi}^{(t)} = \operatorname*{argmin}_{\Pi \in U(\mu,\nu)} \ \langle \Pi, M_{\Pi^{(t)}} \rangle,$$
where $M_{\Pi^{(t)}}$ is given by the weighted sum of the cost matrices. Thus, we can utilize multiple features to estimate the transport plan for the relaxed problem in Eq. (5).

Using the Frank–Wolfe algorithm, we can obtain the optimal solution. However, solving the EMD problem requires a cubic computational cost, which can be expensive if $n$ and $m$ are large. To address this, we can instead solve the entropic regularized OT subproblem, which requires $O(nm)$ computation per iteration. We denote the Frank–Wolfe algorithm with EMD as FW-EMD and the Frank–Wolfe algorithm with Sinkhorn as FW-Sinkhorn.
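Putting the pieces together, below is a compact sketch of FW-Sinkhorn as we read Algorithm 2, using the POT library (`pip install pot`) for the regularized OT subproblem; the function and variable names are ours. Replacing `ot.sinkhorn` with `ot.emd` gives FW-EMD:

```python
import numpy as np
import ot  # POT: Python Optimal Transport
from scipy.special import softmax

def frot_fw_sinkhorn(a, b, cost_mats, eta=0.3, eps=0.1, n_iters=10):
    """FROT via Frank-Wolfe with Sinkhorn subproblems (FW-Sinkhorn).

    a: (n,), b: (m,) marginal weights; cost_mats: list of L per-group
    (n, m) cost matrices C_l; eta: smoothing of the max over groups;
    eps: entropic regularization of the OT subproblem.
    Returns the transport plan and the group weights alpha.
    """
    C = np.stack(cost_mats)                                 # (L, n, m)
    P = np.outer(a, b)                                      # feasible start
    for t in range(n_iters):
        scores = np.tensordot(C, P, axes=([1, 2], [0, 1]))  # <P, C_l>
        alpha = softmax(scores / eta)                       # Lemma 2
        M = np.tensordot(alpha, C, axes=1)                  # weighted cost
        P_hat = ot.sinkhorn(a, b, M, reg=eps)               # linear oracle
        gamma = 2.0 / (2.0 + t)                             # FW step size
        P = (1.0 - gamma) * P + gamma * P_hat
    scores = np.tensordot(C, P, axes=([1, 2], [0, 1]))
    return P, softmax(scores / eta)
```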

Computational complexity: The proposed method depends on the Sinkhorn algorithm, which requires an $O(nm)$ operation per iteration. The computation of the cost matrix in each subproblem needs an $O(Lnm)$ operation, where $L$ is the number of groups. Therefore, the entire complexity is $O(TLnm)$, where $T$ is the number of Frank–Wolfe iterations (in general, $T = 10$ is sufficient).

Proposition 4

For each $t \ge 1$, the iterate $\Pi^{(t)}$ of Algorithm 2 satisfies
$$G_\eta(\Pi^{(t)}) - G_\eta(\Pi^*) \le \frac{2 \sigma_{\max}(\Phi^\top \Phi)}{\eta\,(t + 2)} (1 + \delta),$$
where $\sigma_{\max}(\Phi^\top \Phi)$ is the largest eigenvalue of the matrix $\Phi^\top \Phi$ with $\Phi = (\mathrm{vec}(C_1), \mathrm{vec}(C_2), \ldots, \mathrm{vec}(C_L)) \in \mathbb{R}^{nm \times L}$; and $\delta$ is the accuracy to which the internal linear subproblems are solved.

Based on Proposition 4, the number of iterations depends on $\eta$, $\delta$, and the number of groups. If we set a small $\eta$, convergence requires more iterations. In addition, if we use entropic regularization with a large $\epsilon$ in the subproblem, the $\delta$ in Proposition 4 can be large. Finally, if we use more groups, the largest eigenvalue $\sigma_{\max}(\Phi^\top \Phi)$ can be larger. Note that the constant term of the upper bound can be large; however, the Frank–Wolfe algorithm converges quickly in practice.

Linear Programming: Because $\max_{\alpha \in \Sigma_L} \sum_{l=1}^L \alpha_l \langle \Pi, C_l \rangle = \max_{l} \langle \Pi, C_l \rangle$, the FROT problem can also be written as
$$\min_{\Pi \in U(\mu,\nu)} \ \max_{l \in \{1, 2, \ldots, L\}} \ \langle \Pi, C_l \rangle. \tag{6}$$
Because the objective is the max of linear functions, it is convex with respect to $\Pi$. We can solve the problem via linear programming:
$$\min_{\Pi \in U(\mu,\nu),\, t} \ t, \quad \text{s.t.} \quad \langle \Pi, C_l \rangle \le t, \ l = 1, 2, \ldots, L. \tag{7}$$
This optimization can be easily solved using an off-the-shelf LP package. However, the computational cost of this LP problem is high in general (i.e., $O(n^3 \log n)$ for $n = m$).
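For small $n$, $m$, and $L$, the epigraph LP of Eq. (7) can be assembled directly. The following sketch with `scipy.optimize.linprog` is our own construction for illustration, not the authors' implementation:

```python
import numpy as np
from scipy.optimize import linprog

def frot_lp(a, b, cost_mats):
    """Solve min_t t s.t. <Pi, C_l> <= t and Pi in U(a, b) as an LP."""
    n, m, L = len(a), len(b), len(cost_mats)
    # Decision variable z = (t, vec(Pi)); minimize t.
    c = np.zeros(1 + n * m)
    c[0] = 1.0
    # Group constraints: <Pi, C_l> - t <= 0 for every l.
    A_ub = np.hstack([-np.ones((L, 1)),
                      np.stack([C.ravel() for C in cost_mats])])
    b_ub = np.zeros(L)
    # Marginal constraints: Pi @ 1_m = a and Pi.T @ 1_n = b.
    A_eq = np.zeros((n + m, 1 + n * m))
    for i in range(n):
        A_eq[i, 1 + i * m: 1 + (i + 1) * m] = 1.0   # i-th row sum
    for j in range(m):
        A_eq[n + j, 1 + j::m] = 1.0                 # j-th column sum
    b_eq = np.concatenate([a, b])
    bounds = [(None, None)] + [(0, None)] * (n * m) # t free, Pi >= 0
    res = linprog(c, A_ub, b_ub, A_eq, b_eq, bounds=bounds, method="highs")
    return res.x[1:].reshape(n, m), res.x[0]        # plan, objective value
```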

3.3 Application: Semantic Correspondence

We applied our proposed FROT algorithm to semantic correspondence, the problem of determining the matching of objects in two images. That is, given an input image pair $(A, B)$ with common objects, we formulate the semantic correspondence problem as estimating the transport plan from the key points in $A$ to those in $B$; this framework was proposed in [23]. Figure 2 shows an overview of our proposed framework.

Cost matrix computation $\{C_l\}_{l=1}^L$: In our framework, we employ a pretrained convolutional neural network to extract dense feature maps for each convolutional layer. The dense feature map of the $l$th layer output of image $s$ is given by
$$f^{(l,s)} \in \mathbb{R}^{h_s \times w_s \times d_l},$$
where $w_s$ and $h_s$ are the width and height of image $s$, respectively, and $d_l$ is the dimension of the $l$th layer's feature map. Note that because the spatial size of the dense feature map differs across layers, we resample every feature map to the size of the first layer's feature map (i.e., $h_s \times w_s$).

The $l$th layer's cost matrix for images $A$ and $B$ is given by the pairwise distances between the per-position feature vectors,
$$[C_l]_{ij} = \left\| f_i^{(l,A)} - f_j^{(l,B)} \right\|_2^2,$$
where $f_i^{(l,s)}$ denotes the $l$th-layer feature vector at spatial position $i$ of image $s$.

Figure 2: Semantic correspondence framework based on FROT.

A potential problem with FROT is that the estimation depends significantly on the magnitude of the cost of each layer (i.e., each group). Hence, normalizing each cost matrix is important. Therefore, we normalize each feature vector to unit length, $\bar{f} = f / \|f\|_2$. Consequently, the cost matrix is given by $[C_l]_{ij} = \|\bar{f}_i^{(l,A)} - \bar{f}_j^{(l,B)}\|_2^2$. Other distances, such as the $\ell_1$ distance, can also be used.
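As a sketch of this step (the shapes and names are our assumptions), the per-layer cost matrix with unit-normalized features can be computed as:

```python
import numpy as np

def layer_cost(feat_a, feat_b, delta=1e-8):
    """Squared Euclidean cost between L2-normalized dense features.

    feat_a: (n, d_l) per-position features of image A, feat_b: (m, d_l)
    features of image B for one layer; returns the (n, m) cost matrix C_l.
    """
    fa = feat_a / (np.linalg.norm(feat_a, axis=1, keepdims=True) + delta)
    fb = feat_b / (np.linalg.norm(feat_b, axis=1, keepdims=True) + delta)
    # For unit vectors, ||fa_i - fb_j||^2 = 2 - 2 <fa_i, fb_j>.
    return 2.0 - 2.0 * fa @ fb.T
```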

Computation of $a$ and $b$ with staircase re-weighting: For semantic correspondence, setting the weights $a$ and $b$ is important because semantic correspondence can be affected by background clutter. Therefore, we generate class activation maps (CAMs) [42] for the source and target images and use them as $a$ and $b$, respectively. For each CAM, we choose the class with the highest classification probability and normalize the map to the range $[0, 1]$.
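A simplified sketch of turning a CAM into marginal weights follows; note that it only rescales and renormalizes, whereas the staircase re-weighting described above additionally quantizes the activation levels (the helper name is ours):

```python
import numpy as np

def cam_to_weights(cam):
    """Map a (h, w) class activation map to weights on the simplex.

    Normalizes the CAM of the most probable class to [0, 1] and then
    renormalizes so the weights sum to one (simplified: no staircase
    quantization step).
    """
    v = cam.ravel().astype(float)
    v = (v - v.min()) / (v.max() - v.min() + 1e-8)
    return v / v.sum()
```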

4 Related Work

OT algorithms: The Wasserstein distance can be determined by solving the OT problem. An advantage of the Wasserstein distance is its robustness to noise; moreover, we can obtain the transport plan, which is useful for many machine learning applications. To reduce the computational cost of the Wasserstein distance, the sliced Wasserstein distance is useful [19]. Recently, a tree variant of the Wasserstein distance was proposed [11, 21, 35]; the sliced Wasserstein distance is a special case of this algorithm.

In addition to accelerating the computation, structured optimal transport incorporates structural information directly into OT problems [1]. Specifically, it formulates a submodular optimal transport problem and solves it using a saddle-point mirror prox algorithm. Recently, more complex structured information was introduced into the OT problem, including hierarchical structure [2, 41]. These approaches successfully incorporate structured information with respect to data samples. By contrast, FROT incorporates structured information on the feature side.

The approach most closely related to FROT is a robust variant of the Wasserstein distance, called the subspace robust Wasserstein distance (SRW) [27]. SRW computes the OT problem in a discriminative subspace, which is found by solving a dimensionality-reduction problem. Owing to this robustness, SRW can successfully compute the Wasserstein distance from noisy data. The max-sliced Wasserstein distance [9] and its generalized counterpart [20] can also be regarded as subspace robust Wasserstein methods. Note that SRW [27] is a min–max based approach, while the max-sliced Wasserstein distances [9, 20] are max–min approaches. FROT is a feature-selection variant of the Wasserstein distance, whereas the subspace approaches are its dimensionality-reduction counterparts.

As parallel work, a general min–max optimal transport problem called the robust Kantorovich problem (RKP) was recently proposed [10]. RKP uses a cutting-set method for a general min–max optimal transport problem that includes the FROT problem as a special case. The approaches are technically similar; however, our problem and that of Dhouib et al. [10] are intrinsically different. Specifically, we aim to solve a high-dimensional OT problem using feature selection and apply it to semantic correspondence problems, while the RKP approach focuses on providing a general framework and uses it for color transformation problems. As a technical difference, the cutting-set method may not converge to an optimal solution if we use regularized OT [10]. By contrast, because we use a Frank–Wolfe algorithm, our algorithm converges to the true objective value even with regularized OT solvers. Multiobjective optimal transport (MOT) [36] is another parallel approach. The key difference between FROT and MOT is that MOT uses a weighted sum of cost functions, while FROT considers the worst case. Moreover, as applications, we focus on cost matrices computed from subsets of features, while MOT considers cost matrices with different distance functions.

OT applications: OT has received significant attention for use in several computer vision tasks. Applications include Wasserstein distance estimation [29], domain adaptation [39], multitask learning [18], barycenter estimation [7], semantic correspondence [23], feature matching [34], photo album summarization [22], generative models [3, 5], and graph matching [37, 38].

5 Experiments

(a) Objective score.
(b) MSE with different $T$.
(c) MSE with different $\epsilon$.
Figure 3: (a) Objective scores for LP, FW-EMD, and FW-Sinkhorn. (b) MSE between the transport plan of LP and that of FW-EMD, and between the transport plan of LP and that of FW-Sinkhorn, for different numbers of Frank–Wolfe iterations $T$. (c) MSE between the transport plan of LP and that of FW-Sinkhorn for different $\epsilon$.

5.1 Synthetic Data

We compare FROT with standard OT using synthetic datasets. In these experiments, we first generate two-dimensional vectors $x \sim N(\mu_x, \Sigma_x)$ and $y \sim N(\mu_y, \Sigma_y)$. Then, we concatenate noise dimensions $x_{\mathrm{noise}}$ and $y_{\mathrm{noise}}$, drawn from a common distribution, to $x$ and $y$, respectively, to give $\tilde{x} = (x^\top, x_{\mathrm{noise}}^\top)^\top$ and $\tilde{y} = (y^\top, y_{\mathrm{noise}}^\top)^\top$.

For FROT, we fixed the regularization parameter $\eta$ and the number of iterations $T$ of the Frank–Wolfe algorithm; the entropic regularization parameter $\epsilon$ was set identically for all methods. As a proof of concept, we set the true features as one group and the remaining noise features as another group.

Figure 1(a) shows the correspondence between $x$ and $y$ obtained with the vanilla OT algorithm on the clean data. Figures 1(b) and 1(c) show the correspondences obtained by OT and FROT, respectively, between the noisy vectors $\tilde{x}$ and $\tilde{y}$. Although FROT identifies a suitable matching, OT fails to obtain a meaningful correspondence. We observed that the estimated weight $\alpha_l$ corresponding to the true group is close to one. Moreover, we compared the objective scores of FROT with LP, FW-EMD, and FW-Sinkhorn. Figure 3(a) shows the objective scores of FROT with the different solvers: both FW-EMD and FW-Sinkhorn achieve almost the same objective score as LP with a relatively small number of iterations $T$. Moreover, Figure 3(b) shows the mean squared error (MSE) between the transport plan of the LP method and those of the FW counterparts. Similar to the objective scores, the FW methods yield a transport plan close to that of LP with a relatively small $T$. Finally, we evaluated FW-Sinkhorn while changing the regularization parameter $\epsilon$ (with $T$ fixed). The result in Figure 3(c) shows that we can obtain an accurate transport plan with a relatively small $\epsilon$.
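To make the synthetic setup concrete, here is a toy usage of the `frot_fw_sinkhorn` sketch from Section 3.2; all distribution parameters are arbitrary choices of ours, not the paper's settings (eps is chosen large enough for stable Sinkhorn scaling at these cost magnitudes):

```python
import numpy as np

rng = np.random.default_rng(0)
n = m = 100
x = rng.normal([0.0, 0.0], 0.5, size=(n, 2))    # true 2-d features
y = rng.normal([5.0, 5.0], 0.5, size=(m, 2))
x_noise = rng.normal(0.0, 1.0, size=(n, 8))     # shared-distribution noise
y_noise = rng.normal(0.0, 1.0, size=(m, 8))

a = np.full(n, 1.0 / n)
b = np.full(m, 1.0 / m)
# Group 1: true features; group 2: noise features.
C1 = ((x[:, None, :] - y[None, :, :]) ** 2).sum(-1)
C2 = ((x_noise[:, None, :] - y_noise[None, :, :]) ** 2).sum(-1)

P, alpha = frot_fw_sinkhorn(a, b, [C1, C2], eta=0.3, eps=1.0)
print(alpha)  # expected to concentrate on the true-feature group
```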

Methods aero bike bird boat bottle bus car cat chair cow dog horse moto person plant sheep train tv all
SPair-71k finetuned models:
CNNGeo [30] 23.4 16.7 40.2 14.3 36.4 27.7 26.0 32.7 12.7 27.4 22.8 13.7 20.9 21.0 17.5 10.2 30.8 34.1 20.6
A2Net [16] 22.6 18.5 42.0 16.4 37.9 30.8 26.5 35.6 13.3 29.6 24.3 16.0 21.6 22.8 20.5 13.5 31.4 36.5 22.3
WeakAlign [31] 22.2 17.6 41.9 15.1 38.1 27.4 27.2 31.8 12.8 26.8 22.6 14.2 20.0 22.2 17.9 10.4 32.2 35.1 20.9
NC-Net [32] 17.9 12.2 32.1 11.7 29.0 19.9 16.1 39.2 9.9 23.9 18.8 15.7 17.4 15.9 14.8 9.6 24.2 31.1 20.1
SPair-71k validation:
HPF [24] 25.2 18.9 52.1 15.7 38.0 22.8 19.1 52.9 17.9 33.0 32.8 20.6 24.4 27.9 21.1 15.9 31.5 35.6 28.2
OT-HPF [23] 32.6 18.9 62.5 20.7 42.0 26.1 20.4 61.4 19.7 41.3 41.7 29.8 29.6 31.8 25.0 23.5 44.7 37.0 33.9
Without SPair-71k validation:
OT 30.1 16.5 50.4 17.3 38.0 22.9 19.7 54.3 17.0 28.4 31.3 22.1 28.0 19.5 21.0 17.8 42.6 28.8 28.3
FROT () 35.0 20.9 56.3 23.4 40.7 27.2 21.9 62.0 17.5 38.8 36.2 27.9 28.0 30.4 26.9 23.1 49.7 38.4 33.7
FROT () 34.1 18.8 56.9 19.9 40.0 25.6 19.2 61.9 17.4 38.7 36.5 25.6 26.9 27.2 26.3 22.1 50.3 38.6 32.8
FROT () 33.4 19.4 56.6 20.0 39.6 26.1 19.1 62.4 17.9 38.0 36.5 26.0 27.5 26.5 25.5 21.6 49.7 38.9 32.7
SRW (layers = {1, 32–34}) 29.4 14.0 43.7 15.6 33.8 21.0 17.6 48.0 12.9 23.3 26.5 19.8 25.5 17.6 16.7 15.2 37.1 20.5 24.5
SRW (layers = {1, 31–34}) 29.7 14.3 44.3 15.7 34.2 21.3 17.8 48.5 13.1 23.6 27.1 20.0 25.8 18.1 16.9 15.2 37.3 21.0 24.8
SRW (layers = {1, 30–34}) 29.8 14.7 45.6 15.9 34.8 21.5 18.0 49.3 13.3 24.0 27.7 20.6 25.7 18.7 17.2 15.3 37.7 21.5 25.2
Table 1: Per-class PCK results on SPair-71k. All models use ResNet101. The numbers in brackets for SRW are the input layer indices.

5.2 Semantic Correspondence

We evaluated our FROT algorithm on semantic correspondence. In this study, we used SPair-71k [25], which consists of 70,958 image pairs with variations in viewpoint and scale. For evaluation, we employed the percentage of correct keypoints (PCK), which counts the number of accurately predicted keypoints given a fixed threshold [25]. All semantic correspondence experiments were run on a Linux server with an NVIDIA P100 GPU.

For the optimal transport based frameworks, we employed ResNet101 [15], pretrained on ImageNet [8], for feature and activation map extraction; the network contains 34 convolutional layers, and concatenating all layer outputs yields a very high-dimensional feature vector. Note that we did not fine-tune the network. We compared the proposed method with several baselines [25] and with SRW. Owing to the computational cost and memory requirements of SRW, we used only the first and the last few convolutional layers of ResNet101 as the input to SRW. In our experiments, we empirically fixed the regularization parameters of FROT and SRW; for SRW, we also fixed the number of latent dimensions across all experiments. HPF [24] and OT-HPF [23] are state-of-the-art methods for semantic correspondence. HPF and OT-HPF require a validation dataset to select important layers, whereas SRW and FROT do not. OT is a simple optimal transport based method that does not select layers.

Table 1 lists the per-class PCK results obtained on the SPair-71k dataset. FROT outperforms most existing baselines, including HPF and OT. Moreover, FROT is comparable with OT-HPF [23], which requires a validation dataset to select important layers. In this experiment, an appropriately chosen $\eta$ results in favorable performance (see Table 3 in the Supplementary Material). The computational cost of FROT is 0.29, while the costs of the three SRW variants are 8.73, 11.73, and 15.76, respectively. Surprisingly, FROT also outperformed the SRW variants; however, this is mainly because SRW could use only a restricted set of input layers. Therefore, scaling up SRW would be an interesting direction for future work.

6 Conclusion

In this paper, we proposed FROT for high-dimensional data, which jointly solves feature selection and OT problems. An advantage of FROT is that it yields a convex optimization problem, for which an accurate globally optimal solution can be determined using the Frank–Wolfe algorithm. We applied FROT to high-dimensional feature selection and semantic correspondence problems. Through extensive experiments, we demonstrated that the proposed algorithm performs comparably with state-of-the-art algorithms in both feature selection and semantic correspondence.

Proof of Proposition 1

For the distance function $d(x, y)$, we prove that
$$\mathrm{FRWD}_p(\mu, \nu) = \left( \min_{\Pi \in U(\mu,\nu)} \ \max_{\alpha \in \Sigma_L} \ \sum_{i=1}^n \sum_{j=1}^m \pi_{ij} \sum_{l=1}^L \alpha_l\, d(x_i^{(l)}, y_j^{(l)})^p \right)^{1/p}$$
is a distance for $p \ge 1$.

The symmetry can be read directly from the definition, as the distances we use are symmetric. For the identity of indiscernibles, when $\mathrm{FRWD}_p(\mu, \nu) = 0$ with the optimal $\Pi^*$ and $\alpha^*$, there exists $l$ such that $\alpha_l^* > 0$ (as $\alpha^*$ is in the simplex set). As there is a max in the definition and $d(\cdot, \cdot) \ge 0$, this means that $\sum_{i,j} \pi_{ij}^*\, d(x_i^{(l)}, y_j^{(l)})^p = 0$ for every $l$, and hence $x_i = y_j$ whenever $\pi_{ij}^* > 0$. Therefore, we have $\mu = \nu$ when $\mathrm{FRWD}_p(\mu, \nu) = 0$.

Conversely, when $\mu = \nu$, this means that $n = m$, $a = b$, and $x_i = y_i$ for all $i$, and we have $d(x_i^{(l)}, y_i^{(l)}) = 0$ for all $l$. Thus, for any $\alpha \in \Sigma_L$, the optimal transport plan is $\pi_{ii}^* = a_i$ for all $i$ and $\pi_{ij}^* = 0$ for $i \neq j$. Therefore, when $\mu = \nu$, we have $\mathrm{FRWD}_p(\mu, \nu) = 0$.

Triangle Inequality

Let $\mu = \sum_{i=1}^n a_i \delta_{x_i}$, $\nu = \sum_{j=1}^m b_j \delta_{y_j}$, and $\xi = \sum_{k=1}^K c_k \delta_{z_k}$; we prove that
$$\mathrm{FRWD}_p(\mu, \nu) \le \mathrm{FRWD}_p(\mu, \xi) + \mathrm{FRWD}_p(\xi, \nu).$$

To simplify the notation in this proof, for a fixed $\alpha \in \Sigma_L$ we define $D_\alpha$ as the distance "matrix" such that $[D_\alpha]_{ij} = \big( \sum_{l=1}^L \alpha_l\, d(x_i^{(l)}, y_j^{(l)})^p \big)^{1/p}$ is the $i$th-row and $j$th-column element of the matrix. Moreover, note that $D_\alpha^{(p)}$ is the "matrix" in which each element is the corresponding element of $D_\alpha$ raised to the power $p$.

Consider that $P$ is the optimal transport plan of $\mathrm{FRWD}_p(\mu, \xi)$, and $Q$ is the optimal transport plan of $\mathrm{FRWD}_p(\xi, \nu)$, where $\xi$ is a discrete measure. Similar to the proof for the Wasserstein distance in [29], let $\hat{c} \in \mathbb{R}^K$ be a vector such that $\hat{c}_k = 1 / c_k$ if $c_k > 0$, and $\hat{c}_k = 0$ otherwise. We can show that $S = P\, \mathrm{diag}(\hat{c})\, Q \in U(\mu, \nu)$, so that $S$ is an admissible transport plan and $\mathrm{FRWD}_p(\mu, \nu) \le \max_{\alpha \in \Sigma_L} \langle S, D_\alpha^{(p)} \rangle^{1/p}$.

By letting $D_\alpha^{\mu\xi}$ and $D_\alpha^{\xi\nu}$ denote the corresponding distance matrices for the pairs $(\mu, \xi)$ and $(\xi, \nu)$, and noting that $[D_\alpha]_{ij} \le [D_\alpha^{\mu\xi}]_{ik} + [D_\alpha^{\xi\nu}]_{kj}$ for any $k$, the right-hand side of this inequality can be rewritten and bounded as
$$\langle S, D_\alpha^{(p)} \rangle^{1/p} \le \langle P, (D_\alpha^{\mu\xi})^{(p)} \rangle^{1/p} + \langle Q, (D_\alpha^{\xi\nu})^{(p)} \rangle^{1/p}$$
by the Minkowski inequality.

This inequality is valid for all $\alpha \in \Sigma_L$; taking the maximum over $\alpha$ on both sides, we have
$$\mathrm{FRWD}_p(\mu, \nu) \le \mathrm{FRWD}_p(\mu, \xi) + \mathrm{FRWD}_p(\xi, \nu).$$

FROT with Linear Programming

Linear Programming: The FROT problem is a convex piecewise-linear minimization because the objective is the max of linear functions. Thus, we can solve the FROT problem via linear programming:
$$\min_{\Pi \in U(\mu,\nu),\, t} \ t, \quad \text{s.t.} \quad \langle \Pi, C_l \rangle \le t, \ l = 1, 2, \ldots, L.$$
This optimization can be easily solved using an off-the-shelf LP package. However, the computational cost of this LP problem is high in general (i.e., $O(n^3 \log n)$ for $n = m$).

The FROT problem can be written as
$$\min_{\Pi \ge 0} \ \max_{l \in \{1, \ldots, L\}} \ \langle \Pi, C_l \rangle$$

s.t. $\Pi 1_m = a, \ \Pi^\top 1_n = b.$

This problem can be transformed into an equivalent linear program by first forming the epigraph problem:
$$\min_{\Pi \ge 0,\, t} \ t$$

s.t. $\langle \Pi, C_l \rangle \le t, \ l = 1, \ldots, L, \quad \Pi 1_m = a, \ \Pi^\top 1_n = b.$

Thus, the linear programming for FROT is given as
$$\min_{\Pi,\, t} \ t$$

s.t. $\langle \Pi, C_l \rangle \le t \ (l = 1, \ldots, L), \quad \Pi 1_m = a, \quad \Pi^\top 1_n = b, \quad \Pi \ge 0.$

Next, we transform this linear programming problem into the canonical form. For a matrix $\Pi \in \mathbb{R}^{n \times m}$, we can vectorize the matrix using the following row-wise operator:
$$\mathrm{vec}(\Pi) = (\pi_{11}, \pi_{12}, \ldots, \pi_{1m}, \pi_{21}, \ldots, \pi_{nm})^\top \in \mathbb{R}^{nm}.$$

Using this vectorization operator, we can write the objective as $\min_z w^\top z$ with
$$z = (t, \mathrm{vec}(\Pi)^\top)^\top \in \mathbb{R}^{nm + 1}, \quad w = (1, 0_{nm}^\top)^\top,$$

where $0_{nm}$ is a vector whose elements are zero.

For the constraints