Feature-Robust Optimal Transport for High-Dimensional Data
Abstract
Optimal transport is a machine learning problem with applications including distribution comparison, feature selection, and generative adversarial networks. In this paper, we propose feature-robust optimal transport (FROT) for high-dimensional data, which solves high-dimensional OT problems using feature selection to avoid the curse of dimensionality. Specifically, we find a transport plan with discriminative features. To this end, we formulate the FROT problem as a min–max optimization problem. We then propose a convex formulation of the FROT problem and solve it using a Frank–Wolfe-based optimization algorithm, whereby the subproblem can be efficiently solved using the Sinkhorn algorithm. Since FROT finds the transport plan from selected features, it is robust to noise features. To show the effectiveness of FROT, we propose using the FROT algorithm for the layer selection problem in deep neural networks for semantic correspondence. By conducting synthetic and benchmark experiments, we demonstrate that the proposed method can find a strong correspondence by determining important layers. We show that the FROT algorithm achieves state-of-the-art performance in real-world semantic correspondence datasets.
1 Introduction
Optimal transport (OT) is a machine learning problem with several applications in the computer vision and natural language processing communities. The applications include Wasserstein distance estimation [29], domain adaptation [39], multi-task learning [18], barycenter estimation [7], semantic correspondence [23], feature matching [34], and photo album summarization [22]. The OT problem is extensively studied in the computer vision community as the earth mover's distance (EMD) [33]. However, the computational cost of EMD is cubic and highly expensive. Recently, the entropic-regularized EMD problem was proposed; this problem can be solved using the Sinkhorn algorithm with a quadratic cost [6]. Owing to the development of the Sinkhorn algorithm, researchers have replaced the EMD computation with its regularized counterparts. However, the optimal transport problem for high-dimensional data has remained unsolved for many years.
Recently, a robust variant of OT was proposed for high-dimensional OT problems and used for divergence estimation [27, 28]. In the robust OT framework, the transport plan is computed with the discriminative subspace of the two data matrices $X$ and $Y$. The subspace can be obtained using dimensionality reduction. An advantage of the subspace robust approach is that it does not require prior information about the subspace. However, given prior information such as feature groups, we can consider a computationally efficient formulation, since the computation of the subspace can be expensive if the dimensionality of the data is high.
One of the most common types of prior information is a feature group. The use of group features is popular in feature selection problems in the biomedical domain and has been extensively studied in Group Lasso [40]. The key idea of Group Lasso is to pre-specify the group variables and select the set of group variables using the group norm (i.e., the sum of $\ell_2$ norms). For example, if we use a pre-trained neural network as a feature extractor and compute OT using the features, then we require careful selection of important layers to compute OT. Specifically, each layer output is regarded as a grouped input. Therefore, using a feature group as prior information is a natural setup and is important for considering OT for deep neural networks (DNNs).
In this paper, we propose a high-dimensional optimal transport method that utilizes prior information in the form of grouped features. Specifically, we propose a feature-robust optimal transport (FROT) problem, in which we select distinct group feature sets to estimate a transport plan, instead of determining discriminative subspaces as proposed in [27, 28]. To this end, we formulate the FROT problem as a min–max optimization problem and transform it into a convex optimization problem, which can be accurately solved using the Frank–Wolfe algorithm [12, 17]. The subproblem of FROT can be efficiently solved using the Sinkhorn algorithm [6]. An advantage of FROT is that it can yield a transport plan from high-dimensional data using feature selection, through which the significance of the features is obtained without any additional cost. Therefore, the FROT formulation is highly suited for high-dimensional OT problems. Through synthetic experiments, we initially demonstrate that the proposed FROT is robust to noise dimensions (See Figure 1). Furthermore, we apply FROT to a semantic correspondence problem [23] and show that the proposed algorithm achieves state-of-the-art performance.
2 Background
In this section, we briefly introduce the OT problem.
Optimal transport (OT): The following are given: independent and identically distributed (i.i.d.) samples $X = \{x_i\}_{i=1}^{n}$ from a $d$-dimensional distribution $p$, and i.i.d. samples $Y = \{y_j\}_{j=1}^{m}$ from the $d$-dimensional distribution $q$. In the Kantorovich relaxation of OT, admissible couplings are defined by the set of transport plans
$$U(a, b) = \{T \in \mathbb{R}_+^{n \times m} : T \mathbf{1}_m = a,\; T^\top \mathbf{1}_n = b\},$$
where $T$ is called the transport plan, $\mathbf{1}_n$ is the $n$-dimensional vector whose elements are ones, and $a = (a_1, \ldots, a_n)^\top \in \mathbb{R}_+^{n}$ and $b = (b_1, \ldots, b_m)^\top \in \mathbb{R}_+^{m}$ are the weights. The OT problem between two discrete measures $\mu = \sum_{i=1}^{n} a_i \delta_{x_i}$ and $\nu = \sum_{j=1}^{m} b_j \delta_{y_j}$ determines the optimal transport plan of the following problem:
$$\min_{T \in U(a, b)} \sum_{i=1}^{n} \sum_{j=1}^{m} T_{ij}\, c(x_i, y_j), \quad (1)$$
where $c(x, y)$ is a cost function. For example, the squared Euclidean distance is used, that is, $c(x, y) = \|x - y\|_2^2$. Solving the OT problem of Eq. (1) (also known as the earth mover's distance) using linear programming requires cubic computation with respect to the number of samples, which is computationally expensive. To address this, an entropic-regularized optimal transport is used [6]:
$$\min_{T \in U(a, b)} \sum_{i=1}^{n} \sum_{j=1}^{m} T_{ij}\, c(x_i, y_j) + \epsilon H(T),$$
where $\epsilon \ge 0$ is the regularization parameter, and $H(T) = \sum_{i=1}^{n} \sum_{j=1}^{m} T_{ij} (\log T_{ij} - 1)$ is the entropic regularization. If $\epsilon = 0$, then the regularized OT problem reduces to the EMD problem. Owing to the entropic regularization, the regularized OT problem can be accurately solved using Sinkhorn iteration [6] with an $O(nm)$ computational cost (See Algorithm 1).
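As a concrete sketch, the Sinkhorn iteration alternately rescales the rows and columns of the Gibbs kernel. The NumPy version below is a minimal illustration (variable names and defaults are our own; Algorithm 1 may include refinements such as log-domain stabilization):

```python
import numpy as np

def sinkhorn(a, b, C, eps=0.1, n_iter=200):
    """Entropic-regularized OT via Sinkhorn iterations (a sketch).

    a: (n,) source weights, b: (m,) target weights, C: (n, m) cost matrix.
    Returns the transport plan T = diag(u) K diag(v), where
    K = exp(-C / eps) is the Gibbs kernel.
    """
    K = np.exp(-C / eps)
    u = np.ones_like(a)
    v = np.ones_like(b)
    for _ in range(n_iter):
        u = a / (K @ v)        # rescale rows toward marginal a
        v = b / (K.T @ u)      # rescale columns toward marginal b
    return u[:, None] * K * v[None, :]
```

The returned plan satisfies the marginal constraints up to the tolerance of the fixed number of iterations.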
Wasserstein distance: If the cost function is defined as $c(x, y) = d(x, y)^p$, with $d(x, y)$ a distance function and $p \ge 1$, then we define the $p$-Wasserstein distance of two discrete measures $\mu$ and $\nu$ as
$$W_p(\mu, \nu) = \left( \min_{T \in U(a, b)} \sum_{i=1}^{n} \sum_{j=1}^{m} T_{ij}\, d(x_i, y_j)^p \right)^{1/p}.$$
Recently, a robust variant of the Wasserstein distance, called the subspace robust Wasserstein distance (SRW), was proposed [27]. SRW computes the OT problem in a discriminative subspace, which is determined by solving a dimensionality-reduction problem. Owing to this robustness, SRW can compute the Wasserstein distance from noisy data. The SRW is given as
$$\mathrm{SRW}(\mu, \nu) = \min_{T \in U(a, b)} \; \max_{U \in \mathbb{R}^{d \times k},\, U^\top U = I_k} \; \sum_{i=1}^{n} \sum_{j=1}^{m} T_{ij}\, \| U^\top x_i - U^\top y_j \|_2^2, \quad (2)$$
where $U$ is the projection matrix with $k \le d$, and $I_k$ is the $k \times k$ identity matrix. The SRW or its relaxed problem can be efficiently estimated using either eigenvalue decomposition or the Frank–Wolfe algorithm.
3 Proposed Method
This paper proposes FROT. We assume that the vectors are grouped as $x = (x^{(1)\top}, x^{(2)\top}, \ldots, x^{(L)\top})^\top$ and $y = (y^{(1)\top}, y^{(2)\top}, \ldots, y^{(L)\top})^\top$. Here, $x^{(l)} \in \mathbb{R}^{d_l}$ and $y^{(l)} \in \mathbb{R}^{d_l}$ are the $d_l$-dimensional vectors, where $\sum_{l=1}^{L} d_l = d$. This setting is useful if we know the explicit group structure for the feature vectors a priori. In an application to $L$-layer neural networks, we consider $x^{(l)}$ and $y^{(l)}$ as outputs of the $l$th layer of the network. If we do not have a priori information, we can consider each feature independently (i.e., $L = d$ and $d_1 = d_2 = \cdots = d_L = 1$). All proofs in this section are provided in the Supplementary Material.
3.1 FeatureRobust Optimal Transport (FROT)
The FROT formulation is given by
$$\min_{T \in U(a, b)} \; \max_{\alpha \in \Sigma_L} \; \sum_{l=1}^{L} \alpha_l \sum_{i=1}^{n} \sum_{j=1}^{m} T_{ij}\, c(x_i^{(l)}, y_j^{(l)}), \quad (3)$$
where $\Sigma_L = \{\alpha \in \mathbb{R}_+^{L} : \alpha^\top \mathbf{1}_L = 1\}$ is the probability simplex. The underlying concept of FROT is to estimate the transport plan using distinct groups with large distances between $\{x_i^{(l)}\}_{i=1}^{n}$ and $\{y_j^{(l)}\}_{j=1}^{m}$. We note that determining the transport plan from non-distinct groups is difficult because the data samples in $\{x_i^{(l)}\}_{i=1}^{n}$ and $\{y_j^{(l)}\}_{j=1}^{m}$ overlap. By contrast, in distinct groups, $\{x_i^{(l)}\}_{i=1}^{n}$ and $\{y_j^{(l)}\}_{j=1}^{m}$ are different, and this aids in determining an optimal transport plan. This idea is intrinsically similar to the subspace robust Wasserstein distance [27], which estimates the transport plan in the discriminative subspace, while our approach selects important groups. Therefore, FROT can be regarded as a feature selection variant of the vanilla OT problem in Eq. (1), whereas the subspace robust version is its dimensionality-reduction counterpart.
Using FROT, we can define a feature robust Wasserstein distance (FRWD).
Proposition 1
For the distance function $d(x, y)$,
$$\mathrm{FRWD}_p(\mu, \nu) = \left( \min_{T \in U(a, b)} \; \max_{\alpha \in \Sigma_L} \; \sum_{l=1}^{L} \alpha_l \sum_{i=1}^{n} \sum_{j=1}^{m} T_{ij}\, d(x_i^{(l)}, y_j^{(l)})^p \right)^{1/p} \quad (4)$$
is a distance for $p \ge 1$.
Note that we can show that 2-FRWD (i.e., FRWD with $p = 2$) is a special case of SRW (See the Supplementary Material). The key difference between SRW and FRWD is that FRWD can use any distance $d(x, y)$, while SRW can only use the Euclidean distance $\|x - y\|_2$.
3.2 FROT Optimization
Here, we propose two FROT algorithms based on the Frank–Wolfe algorithm and linear programming.
Frank–Wolfe: We propose a continuous variant of the FROT algorithm using the Frank–Wolfe algorithm, which can be fully differentiable. To this end, we introduce entropic regularization for $\alpha$ and rewrite FROT as a function of $T$ only. Therefore, we solve the following problem for $T$:
$$\min_{T \in U(a, b)} \; \max_{\alpha \in \Sigma_L} \; \sum_{l=1}^{L} \alpha_l\, \phi_l(T) + \eta H(\alpha), \quad \phi_l(T) = \sum_{i=1}^{n} \sum_{j=1}^{m} T_{ij}\, c(x_i^{(l)}, y_j^{(l)}),$$
where $\eta \ge 0$ is the regularization parameter, and $H(\alpha) = -\sum_{l=1}^{L} \alpha_l \log \alpha_l$ is the entropic regularization for $\alpha$. An advantage of entropic regularization is that the non-negative constraint on $\alpha$ is naturally satisfied, and the negative entropy is a strongly convex function.
Lemma 2
The optimal solution of the optimization problem
$$\max_{\alpha \in \Sigma_L} \; \sum_{l=1}^{L} \alpha_l\, \phi_l(T) + \eta H(\alpha),$$
with a fixed admissible transport plan $T \in U(a, b)$, is given by
$$\alpha_l^* = \frac{\exp\!\big(\phi_l(T)/\eta\big)}{\sum_{l'=1}^{L} \exp\!\big(\phi_{l'}(T)/\eta\big)}.$$
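The closed form in Lemma 2 can be checked numerically. The sketch below assumes the entropic regularizer $H(\alpha) = -\sum_l \alpha_l \log \alpha_l$; the softmax of $\phi/\eta$ should attain a value at least as large as any other point of the simplex:

```python
import numpy as np

def inner_objective(alpha, phi, eta):
    # sum_l alpha_l * phi_l + eta * H(alpha), with H(alpha) = -sum alpha log alpha
    return float(alpha @ phi - eta * np.sum(alpha * np.log(alpha)))

def closed_form_alpha(phi, eta):
    # Lemma 2: softmax of phi / eta (shifted by the max for numerical stability)
    z = np.exp((phi - phi.max()) / eta)
    return z / z.sum()
```

At the softmax solution, the attained value equals the smoothed max $\eta \log \sum_l \exp(\phi_l / \eta)$, which is the form used later in Eq. (5).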
Using Lemma 2 (or Lemma 4 in Nesterov [26]) and substituting the optimal $\alpha^*$ into the objective, the global problem is equivalent to
$$\min_{T \in U(a, b)} J_\eta(T), \quad J_\eta(T) = \eta \log \sum_{l=1}^{L} \exp\!\left( \frac{\phi_l(T)}{\eta} \right). \quad (5)$$
Note that $J_\eta$ is known as a smoothed max-operator [26, 4]. Specifically, the regularization parameter $\eta$ controls the "smoothness" of the maximum.
Proposition 3
The objective function $J_\eta(T)$ of Eq. (5) is a convex function with respect to $T$.
The derived optimization problem of FROT is convex. Therefore, we can determine a globally optimal solution. Note that the SRW optimization problem is not jointly convex [27] in the projection matrix and the transport plan. In this study, we employ the Frank–Wolfe algorithm [12, 17], in which we approximate the objective by its linearization at the current iterate $T^{(k)}$ and move toward the minimizer of the linearized problem over the convex set $U(a, b)$ (See Algorithm 2).
The derivative of the loss function $J_\eta(T)$ at $T = T^{(k)}$ is given by
$$\left. \frac{\partial J_\eta(T)}{\partial T_{ij}} \right|_{T = T^{(k)}} = \sum_{l=1}^{L} \alpha_l^{(k)}\, c(x_i^{(l)}, y_j^{(l)}), \quad \alpha_l^{(k)} = \frac{\exp\!\big(\phi_l(T^{(k)})/\eta\big)}{\sum_{l'=1}^{L} \exp\!\big(\phi_{l'}(T^{(k)})/\eta\big)}.$$
Then, we update the transport plan by solving the EMD problem
$$\widehat{T}^{(k+1)} = \mathop{\mathrm{argmin}}_{T \in U(a, b)} \; \sum_{i=1}^{n} \sum_{j=1}^{m} T_{ij}\, M_{ij}^{(k)}, \quad \text{and set} \quad T^{(k+1)} = (1 - \gamma_k)\, T^{(k)} + \gamma_k\, \widehat{T}^{(k+1)},$$
where $M^{(k)} = \sum_{l=1}^{L} \alpha_l^{(k)} C_l$ with $[C_l]_{ij} = c(x_i^{(l)}, y_j^{(l)})$, and $\gamma_k$ is the step size. Note that $M^{(k)}$ is given by the weighted sum of the cost matrices. Thus, we can utilize multiple features to estimate the transport plan for the relaxed problem in Eq. (5).
Using the Frank–Wolfe algorithm, we can obtain the optimal solution. However, solving the EMD problem requires a cubic computational cost, which can be expensive if $n$ and $m$ are large. To address this, we can instead solve the entropic-regularized OT subproblem, which requires $O(nm)$ computation. We denote the Frank–Wolfe algorithm with EMD as FW-EMD and the Frank–Wolfe algorithm with Sinkhorn as FW-Sinkhorn.
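Putting the pieces together, one Frank–Wolfe iteration computes the softmax weights $\alpha$, mixes the per-group cost matrices, and calls Sinkhorn as the linear-minimization oracle. The following is a self-contained sketch of FW-Sinkhorn (step-size schedule, defaults, and helper names are our own, not the paper's Algorithm 2):

```python
import numpy as np

def frot_fw_sinkhorn(a, b, costs, eta=0.3, eps=0.05, n_fw=30, n_sink=200):
    """FW-Sinkhorn sketch for FROT.

    a: (n,) weights, b: (m,) weights, costs: list of L per-group (n, m)
    cost matrices. Each Frank-Wolfe step solves the linearized problem
    (an OT problem with the alpha-weighted mixed cost) with Sinkhorn.
    """
    def sinkhorn(C):
        # minimal entropic-regularized OT solver for the subproblem
        K = np.exp(-C / eps)
        u = np.ones_like(a)
        v = np.ones_like(b)
        for _ in range(n_sink):
            u = a / (K @ v)
            v = b / (K.T @ u)
        return u[:, None] * K * v[None, :]

    T = np.outer(a, b)                          # feasible initial plan
    alpha = np.full(len(costs), 1.0 / len(costs))
    for k in range(n_fw):
        phi = np.array([(C * T).sum() for C in costs])
        z = np.exp((phi - phi.max()) / eta)
        alpha = z / z.sum()                     # softmax weights (Lemma 2)
        M = sum(w * C for w, C in zip(alpha, costs))
        T_hat = sinkhorn(M)                     # linear-minimization oracle
        gamma = 2.0 / (k + 2.0)                 # standard FW step size
        T = (1.0 - gamma) * T + gamma * T_hat
    return T, alpha
```

The returned `alpha` exposes the learned group importances at no extra cost, which is the feature-selection byproduct discussed above.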
Computational complexity: The proposed method depends on the Sinkhorn algorithm, which requires an $O(nm)$ operation. The computation of the cost matrix in each subproblem needs an $O(Lnm)$ operation, where $L$ is the number of groups. Therefore, the entire complexity is $O(KLnm)$, where $K$ is the number of Frank–Wolfe iterations (in general, a small $K$ is sufficient).
Proposition 4
For each $k \ge 1$, the iterate $T^{(k)}$ of Algorithm 2 satisfies
$$J_\eta(T^{(k)}) - J_\eta(T^{*}) \le \frac{2 \sigma_{\max}}{\eta\, (k + 2)} (1 + \delta),$$
where $\sigma_{\max}$ is the largest eigenvalue of the matrix $\Phi^\top \Phi$ with $\Phi = (\mathrm{vec}(C_1), \mathrm{vec}(C_2), \ldots, \mathrm{vec}(C_L))$, and $\delta \ge 0$ is the accuracy to which the internal linear subproblems are solved.
Based on Proposition 4, the number of iterations depends on $\eta$, the subproblem accuracy $\delta$, and the number of groups. If we set a small $\eta$, convergence requires more iterations. In addition, if we use entropic regularization with a large $\varepsilon$ in the subproblem solver, the accuracy $\delta$ in Proposition 4 can be large. Finally, if we use more groups, the largest eigenvalue of the matrix can be larger. Note that the constant term of the upper bound can be large; however, the Frank–Wolfe algorithm converges quickly in practice.
Linear Programming: Because $\max_{\alpha \in \Sigma_L} \sum_{l=1}^{L} \alpha_l\, \phi_l(T) = \max_{l} \phi_l(T)$, the FROT problem can also be written as
$$\min_{T \in U(a, b)} \; \max_{l \in \{1, 2, \ldots, L\}} \; \sum_{i=1}^{n} \sum_{j=1}^{m} T_{ij}\, c(x_i^{(l)}, y_j^{(l)}). \quad (6)$$
Because the objective is the max of linear functions, it is convex with respect to $T$. We can solve the problem via linear programming:
$$\min_{t,\, T \in U(a, b)} \; t, \quad \text{s.t.} \quad \sum_{i=1}^{n} \sum_{j=1}^{m} T_{ij}\, c(x_i^{(l)}, y_j^{(l)}) \le t, \quad l = 1, 2, \ldots, L. \quad (7)$$
This optimization can be easily solved using an off-the-shelf LP package. However, the computational cost of this LP problem is high in general.
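For small problems, the epigraph LP of Eq. (7) can be handed to an off-the-shelf solver directly. The sketch below uses SciPy's `linprog` (HiGHS backend); the variable ordering `[t, vec(T)]` and the function name are our own conventions:

```python
import numpy as np
from scipy.optimize import linprog

def frot_lp(a, b, costs):
    """Solve min_{t, T} t  s.t.  <C_l, T> <= t for all l, T in U(a, b).

    a: (n,) weights, b: (m,) weights, costs: list of L (n, m) cost matrices.
    Decision vector is x = [t, vec(T)] with T vectorized row-wise.
    """
    n, m = len(a), len(b)
    L = len(costs)
    c_obj = np.zeros(1 + n * m)
    c_obj[0] = 1.0                         # minimize t only
    # inequality constraints: <C_l, T> - t <= 0
    A_ub = np.zeros((L, 1 + n * m))
    A_ub[:, 0] = -1.0
    for l, C in enumerate(costs):
        A_ub[l, 1:] = C.ravel()
    b_ub = np.zeros(L)
    # equality constraints: row sums equal a, column sums equal b
    A_eq = np.zeros((n + m, 1 + n * m))
    for i in range(n):
        A_eq[i, 1 + i * m : 1 + (i + 1) * m] = 1.0
    for j in range(m):
        A_eq[n + j, 1 + j : 1 + n * m : m] = 1.0
    b_eq = np.concatenate([a, b])
    bounds = [(None, None)] + [(0, None)] * (n * m)   # t free, T >= 0
    res = linprog(c_obj, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
                  bounds=bounds)
    return res.x[0], res.x[1:].reshape(n, m)
```

At the optimum, $t$ equals the worst-group transport cost $\max_l \langle C_l, T \rangle$, which makes the solution easy to sanity-check.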
3.3 Application: Semantic Correspondence
We applied our proposed FROT algorithm to semantic correspondence. Semantic correspondence is the problem of determining the matching of objects in two images. That is, given input image pairs with common objects, we formulate the semantic correspondence problem as estimating the transport plan from the key points in the source image to those in the target image; this framework was proposed in [23]. In Figure 2, we show an overview of our proposed framework.
Cost matrix computation: In our framework, we employed a pre-trained convolutional neural network to extract dense feature maps for each convolutional layer. The dense feature map of the $l$th layer output of an input image has $w \times h$ spatial locations, where $w$ and $h$ are the width and height of the feature map, respectively, and $d_l$ is the dimension of the $l$th layer's feature map. Note that because the spatial size of the dense feature map differs across layers, we resample the feature maps to the size of the 1st layer's feature map.
The $l$th layer's cost matrix for a source image and a target image is given by the pairwise distances between their dense features at that layer. A potential problem with FROT is that the estimation depends significantly on the magnitude of the cost of each layer (also known as a group). Hence, normalizing each cost matrix is important. Therefore, we normalized each feature vector to unit length, $\widehat{f} = f / \|f\|_2$. Consequently, the cost matrix is given by $[C_l]_{ij} = \| \widehat{f}_i^{(l)} - \widehat{g}_j^{(l)} \|_2^2$, where $\widehat{f}_i^{(l)}$ and $\widehat{g}_j^{(l)}$ are the normalized $l$th-layer features of key points $i$ and $j$. We can also use other distances, such as the $L_1$ distance.
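This normalization step can be sketched as follows; the function name and array layout (rows are key points, one array per layer) are our own conventions:

```python
import numpy as np

def layer_cost_matrices(feats_s, feats_t):
    """Per-layer cost matrices from L2-normalized dense features (a sketch).

    feats_s: list of (n, d_l) arrays (source key points, one per layer),
    feats_t: list of (m, d_l) arrays (target key points, one per layer).
    Each feature vector is normalized to unit length so that layer costs
    are on a comparable scale, then C_l[i, j] = ||f_i - g_j||_2^2.
    """
    costs = []
    for F, G in zip(feats_s, feats_t):
        F = F / np.linalg.norm(F, axis=1, keepdims=True)
        G = G / np.linalg.norm(G, axis=1, keepdims=True)
        # squared Euclidean distance via ||f||^2 + ||g||^2 - 2 f.g;
        # with unit norms this simplifies to 2 - 2 f.g
        C = 2.0 - 2.0 * (F @ G.T)
        costs.append(np.maximum(C, 0.0))  # clip tiny negative round-off
    return costs
```

Because the features are unit-normalized, every entry of every layer's cost matrix lies in $[0, 4]$, so no single layer can dominate the min–max objective purely through its scale.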
Computation of $a$ and $b$ with staircase re-weighting: For semantic correspondence, setting the weights $a$ and $b$ is important because semantic correspondence can be affected by background clutter. Therefore, we generated the class activation maps (CAMs) [42] for the source and target images and used them as $a$ and $b$, respectively. For CAM, we chose the class with the highest classification probability and normalized the map to the range $[0, 1]$.
4 Related Work
OT algorithms: The Wasserstein distance can be determined by solving the OT problem. An advantage of the Wasserstein distance is its robustness to noise; moreover, we can obtain the transport plan, which is useful for many machine learning applications. To reduce the computational cost of the Wasserstein distance, the sliced Wasserstein distance is useful [19]. Recently, a tree variant of the Wasserstein distance was proposed [11, 21, 35]; the sliced Wasserstein distance is a special case of this algorithm.
In addition to accelerating the computation, structured optimal transport incorporates structural information directly into OT problems [1]. Specifically, they formulate a submodular optimal transport problem and solve it using a saddle-point mirror prox algorithm. Recently, more complex structural information was introduced into the OT problem, including hierarchical structure [2, 41]. These approaches successfully incorporate structural information into OT problems with respect to data samples. By contrast, FROT incorporates structural information into the features.
The approach most closely related to FROT is a robust variant of the Wasserstein distance, called the subspace robust Wasserstein distance (SRW) [27]. SRW computes the OT problem in a discriminative subspace; this is possible by solving dimensionalityreduction problems. Owing to the robustness, SRW can successfully compute the Wasserstein distance from noisy data. The max–sliced Wasserstein distance [9] and its generalized counterpart [20] can also be regarded as subspacerobust Wasserstein methods. Note that SRW [27] is a min–max based approach, while the max–sliced Wasserstein distances [9, 20] are max–min approaches. The FROT is a feature selection variant of the Wasserstein distance, whereas the subspace approaches are used for dimensionality reduction.
As a parallel work, a general min–max optimal transport problem called the robust Kantorovich problem (RKP) was recently proposed [10]. RKP uses a cutting-set method for a general min–max optimal transport problem that includes the FROT problem as a special case. The approaches are technically similar; however, our problem and that of Dhouib et al. [10] are intrinsically different. Specifically, we aim to solve a high-dimensional OT problem using feature selection and apply it to semantic correspondence problems, while the RKP approach focuses on providing a general framework and uses it for color transformation problems. As a technical difference, the cutting-set method may not converge to an optimal solution if we use the regularized OT [10]. By contrast, because we use a Frank–Wolfe algorithm, our algorithm converges to the true objective with regularized OT solvers. The multi-objective optimal transport (MOT) [36] is another approach parallel to ours. The key difference between FROT and MOT is that MOT uses the weighted sum of cost functions, while FROT considers the worst case. Moreover, as applications, we focus on cost matrices computed from subsets of features, while MOT considers cost matrices with different distance functions.
OT applications: OT has received significant attention for use in several computer vision tasks. Applications include Wasserstein distance estimation [29], domain adaptation [39], multi-task learning [18], barycenter estimation [7], semantic correspondence [23], feature matching [34], photo album summarization [22], generative models [3, 5], and graph matching [37, 38].
5 Experiments
5.1 Synthetic Data
We compare FROT with standard OT using synthetic datasets. In these experiments, we first generate two-dimensional vectors $u$ and $v$ from two different distributions. Then, we concatenate noise dimensions to $u$ and $v$ to obtain the high-dimensional vectors $x$ and $y$, respectively.
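A minimal sketch of this data-generation setup is given below; the exact dimensions, means, and variances are illustrative assumptions, not the paper's values:

```python
import numpy as np

rng = np.random.default_rng(0)
n = m = 30
# Two-dimensional "signal" vectors from two shifted Gaussians
# (means and scales here are illustrative assumptions).
U = rng.normal(loc=0.0, scale=1.0, size=(n, 2))
V = rng.normal(loc=3.0, scale=1.0, size=(m, 2))
# High-variance noise features appended to both datasets.
d_noise = 8
X = np.hstack([U, rng.normal(scale=5.0, size=(n, d_noise))])
Y = np.hstack([V, rng.normal(scale=5.0, size=(m, d_noise))])
# Group structure given to FROT: signal dims vs. noise dims.
groups = [np.arange(2), np.arange(2, 2 + d_noise)]
```

With this construction, vanilla OT sees a 10-dimensional cost dominated by noise, while FROT can down-weight the noise group via $\alpha$.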
For FROT, we fix the smoothing parameter $\eta$ and the number of iterations of the Frank–Wolfe algorithm. The entropic regularization parameter is set to the same value for all methods. As a proof of concept, we set the true features as one group and the remaining noise features as another group.
Figure 1(a) shows the correspondence between $x$ and $y$ estimated by the vanilla OT algorithm, and Figures 1(b) and 1(c) show the correspondences estimated by FROT and OT, respectively. Although FROT can identify a suitable matching, OT fails to obtain a significant correspondence. We observed that the parameter $\alpha$ corresponding to the true group is close to one. Moreover, we compared the objective scores of FROT with the LP, FW-EMD, and FW-Sinkhorn solvers. Figure 2(a) shows the objective scores of FROT with the different solvers; both FW-EMD and FW-Sinkhorn achieve almost the same objective score as LP with a relatively small $\eta$. Moreover, Figure 2(b) shows the mean squared error between the transport plan of the LP method and those of the FW counterparts. Similar to the objective scores, the FW solvers yield a similar transport plan with a relatively small $\eta$. Finally, we evaluated FW-Sinkhorn by changing the regularization parameter $\varepsilon$ with $\eta$ fixed. The result shows that we can obtain an accurate transport plan with a relatively small $\varepsilon$.
Methods  aero  bike  bird  boat  bottle  bus  car  cat  chair  cow  dog  horse  moto  person  plant  sheep  train  tv  all  

CNNGeo [30]  23.4  16.7  40.2  14.3  36.4  27.7  26.0  32.7  12.7  27.4  22.8  13.7  20.9  21.0  17.5  10.2  30.8  34.1  20.6  
A2Net [16]  22.6  18.5  42.0  16.4  37.9  30.8  26.5  35.6  13.3  29.6  24.3  16.0  21.6  22.8  20.5  13.5  31.4  36.5  22.3  
WeakAlign [31]  22.2  17.6  41.9  15.1  38.1  27.4  27.2  31.8  12.8  26.8  22.6  14.2  20.0  22.2  17.9  10.4  32.2  35.1  20.9  
NCNet [32]  17.9  12.2  32.1  11.7  29.0  19.9  16.1  39.2  9.9  23.9  18.8  15.7  17.4  15.9  14.8  9.6  24.2  31.1  20.1  

HPF [24]  25.2  18.9  52.1  15.7  38.0  22.8  19.1  52.9  17.9  33.0  32.8  20.6  24.4  27.9  21.1  15.9  31.5  35.6  28.2  
OTHPF [23]  32.6  18.9  62.5  20.7  42.0  26.1  20.4  61.4  19.7  41.3  41.7  29.8  29.6  31.8  25.0  23.5  44.7  37.0  33.9  

OT  30.1  16.5  50.4  17.3  38.0  22.9  19.7  54.3  17.0  28.4  31.3  22.1  28.0  19.5  21.0  17.8  42.6  28.8  28.3  
FROT ()  35.0  20.9  56.3  23.4  40.7  27.2  21.9  62.0  17.5  38.8  36.2  27.9  28.0  30.4  26.9  23.1  49.7  38.4  33.7  
FROT ()  34.1  18.8  56.9  19.9  40.0  25.6  19.2  61.9  17.4  38.7  36.5  25.6  26.9  27.2  26.3  22.1  50.3  38.6  32.8  
FROT ()  33.4  19.4  56.6  20.0  39.6  26.1  19.1  62.4  17.9  38.0  36.5  26.0  27.5  26.5  25.5  21.6  49.7  38.9  32.7  
SRW (layers = {1, 32–34})  29.4  14.0  43.7  15.6  33.8  21.0  17.6  48.0  12.9  23.3  26.5  19.8  25.5  17.6  16.7  15.2  37.1  20.5  24.5  
SRW (layers = {1, 31–34})  29.7  14.3  44.3  15.7  34.2  21.3  17.8  48.5  13.1  23.6  27.1  20.0  25.8  18.1  16.9  15.2  37.3  21.0  24.8  
SRW (layers = {1, 30–34})  29.8  14.7  45.6  15.9  34.8  21.5  18.0  49.3  13.3  24.0  27.7  20.6  25.7  18.7  17.2  15.3  37.7  21.5  25.2 
5.2 Semantic Correspondence
We evaluated our FROT algorithm on semantic correspondence tasks. In this study, we used the SPair-71k dataset [25], which consists of image pairs with variations in viewpoint and scale. For evaluation, we employed the percentage of correct key points (PCK), which counts the number of accurately predicted key points given a fixed threshold [25]. All semantic correspondence experiments were run on a Linux server with an NVIDIA P100 GPU.
For the optimal transport based frameworks, we employed ResNet-101 [15] pre-trained on ImageNet [8] for feature and activation map extraction. In our setup, the ResNet-101 provides 34 convolutional layer outputs, which we used as feature groups. Note that we did not fine-tune the network. We compared the proposed method with several baselines [25] and with SRW.
Table 1 lists the per-class PCK results obtained on the SPair-71k dataset. FROT outperforms most existing baselines, including HPF and OT. Moreover, FROT is comparable to OT-HPF [23], which requires a validation dataset to select important layers. In this experiment, an appropriate choice of $\eta$ results in favorable performance (See Table 3 in the Supplementary Material). The computational cost of FROT is 0.29, while the costs of the three SRW variants are 8.73, 11.73, and 15.76, respectively. Surprisingly, FROT outperformed the SRWs; however, this is mainly because SRW was restricted to a subset of the input layers. Therefore, scaling up SRW would be interesting future work.
6 Conclusion
In this paper, we proposed FROT for high-dimensional data. This approach jointly solves feature selection and OT problems. An advantage of FROT is that it is a convex optimization problem and can determine an accurate globally optimal solution using the Frank–Wolfe algorithm. We used FROT for high-dimensional feature selection and semantic correspondence problems. Through extensive experiments, we demonstrated that the proposed algorithm is comparable to state-of-the-art algorithms in both feature selection and semantic correspondence.
Proof of Proposition 1
For the distance function $d(x, y)$, we prove that
$$\mathrm{FRWD}_p(\mu, \nu) = \left( \min_{T \in U(a, b)} \; \max_{\alpha \in \Sigma_L} \; \sum_{l=1}^{L} \alpha_l \sum_{i=1}^{n} \sum_{j=1}^{m} T_{ij}\, d(x_i^{(l)}, y_j^{(l)})^p \right)^{1/p}$$
is a distance for $p \ge 1$.
The symmetry can be read directly from the definition, as the functions $d$ are symmetric distances. For the identity of indiscernibles, when $\mathrm{FRWD}_p(\mu, \nu) = 0$ with the optimal $T^*$ and $\alpha^*$, there exists an index $l$ such that $\alpha^*_l > 0$ (as $\alpha^*$ is in the simplex set). As there is a max over $\alpha$ in the definition and the optimal value is zero, this means that $\sum_{i,j} T^*_{ij}\, d(x_i^{(l)}, y_j^{(l)})^p = 0$ for every $l$, and hence $x_i = y_j$ whenever $T^*_{ij} > 0$. Therefore, we have $\mu = \nu$ when $\mathrm{FRWD}_p(\mu, \nu) = 0$.
When $\mu = \nu$, this means that $n = m$, $a = b$, and $x_i = y_i$ for all $i$, and we have $d(x_i^{(l)}, y_i^{(l)}) = 0$ for all $l$. Thus, for any $\alpha \in \Sigma_L$, the optimal transport plan is $T_{ii} = a_i$ and $T_{ij} = 0$ for $i \ne j$. Therefore, when $\mu = \nu$, we have $\mathrm{FRWD}_p(\mu, \nu) = 0$.
Triangle Inequality
Let $\mu_1 = \sum_{i} a_i \delta_{x_i}$, $\mu_2 = \sum_{j} b_j \delta_{y_j}$, and $\mu_3 = \sum_{k} c_k \delta_{z_k}$; we prove that
$$\mathrm{FRWD}_p(\mu_1, \mu_3) \le \mathrm{FRWD}_p(\mu_1, \mu_2) + \mathrm{FRWD}_p(\mu_2, \mu_3).$$
To simplify the notation in this proof, we define $D$ as the distance "matrix" such that $D_{ij}$ is the $i$th-row and $j$th-column element of the matrix, with one such matrix for each pair of supports. Moreover, note that $D^{\odot p}$ is the "matrix" in which each element of $D$ is raised to the power $p$.
Consider that $P$ is the optimal transport plan of $\mathrm{FRWD}_p(\mu_1, \mu_2)$, and $Q$ is the optimal transport plan of $\mathrm{FRWD}_p(\mu_2, \mu_3)$, where $\mu_2$ is a discrete measure. Similar to the proof for the Wasserstein distance in [29], let $\tilde{b}$ be a vector such that $\tilde{b}_j = 1/b_j$ if $b_j > 0$, and $\tilde{b}_j = 0$ otherwise. We can show that $S = P \operatorname{diag}(\tilde{b})\, Q \in U(a, c)$.
By letting $d_\alpha(x, z) = \big( \sum_{l=1}^{L} \alpha_l\, d(x^{(l)}, z^{(l)})^p \big)^{1/p}$ and using the triangle inequality $d_\alpha(x, z) \le d_\alpha(x, y) + d_\alpha(y, z)$, the right-hand side of this inequality can be rewritten and bounded by
$$\Big( \sum_{i,j} P_{ij}\, d_\alpha(x_i, y_j)^p \Big)^{1/p} + \Big( \sum_{j,k} Q_{jk}\, d_\alpha(y_j, z_k)^p \Big)^{1/p}$$
by the Minkowski inequality.
This inequality is valid for all $\alpha \in \Sigma_L$. Therefore, we have $\mathrm{FRWD}_p(\mu_1, \mu_3) \le \mathrm{FRWD}_p(\mu_1, \mu_2) + \mathrm{FRWD}_p(\mu_2, \mu_3)$.
FROT with Linear Programming
Linear Programming: The FROT problem is a convex piecewise-linear minimization because the objective is the max of linear functions. Thus, we can solve the FROT problem via linear programming. This optimization can be easily solved using an off-the-shelf LP package; however, the computational cost of this LP problem is high in general.
The FROT problem can be written as
$$\min_{T} \; \max_{l \in \{1, \ldots, L\}} \; \langle C_l, T \rangle \quad \text{s.t.} \quad T \mathbf{1}_m = a, \; T^\top \mathbf{1}_n = b, \; T \ge 0,$$
where $[C_l]_{ij} = c(x_i^{(l)}, y_j^{(l)})$.
This problem can be transformed into an equivalent linear program by first forming the epigraph problem:
$$\min_{t,\, T} \; t \quad \text{s.t.} \quad \langle C_l, T \rangle \le t \; (l = 1, \ldots, L), \quad T \mathbf{1}_m = a, \; T^\top \mathbf{1}_n = b, \; T \ge 0.$$
Thus, the linear program for FROT is given as
$$\min_{t,\, T} \; t \quad \text{s.t.} \quad \langle C_l, T \rangle - t \le 0 \; (l = 1, \ldots, L), \quad T \mathbf{1}_m = a, \; T^\top \mathbf{1}_n = b, \; T \ge 0.$$
Next, we transform this linear programming problem into the canonical form. For a matrix $T \in \mathbb{R}^{n \times m}$, we can vectorize the matrix using the following row-wise operator:
$$\mathrm{vec}(T) = (T_{11}, \ldots, T_{1m}, T_{21}, \ldots, T_{nm})^\top \in \mathbb{R}^{nm}.$$
Using this vectorization operator, we can write the objective as $\min_{x} w^\top x$ with $x = (t, \mathrm{vec}(T)^\top)^\top$ and $w = (1, 0, \ldots, 0)^\top$,
where $(0, \ldots, 0)$ is a vector whose elements are zero.
For the constraints