# Learning to Recommend via Inverse Optimal Matching

## Abstract

We consider recommendation in the context of optimal matching, i.e., we need to pair or match users with items in an optimal way. The framework is particularly relevant when the supply of an individual item is limited, so that it can satisfy only a small number of users even though it may be preferred by many. We leverage the methodology of optimal transport of discrete distributions and formulate an inverse optimal transport problem in order to learn the cost that gives rise to the observed matching. This leads to a non-convex optimization problem, which we solve by alternating optimization. A key novel aspect of our formulation is the incorporation of marginal relaxation via regularized Wasserstein distance, which significantly improves the robustness of the method in the face of noisy observed matchings. Our model has wide applicability, including labor markets, online dating, and college admissions. We back up our claims with experiments on both synthetic data and real-world datasets.

## 1 Introduction

The last two decades have witnessed rapid research progress (Adomavicius and Tuzhilin, 2005; Ricci et al., 2011) and abundant practical applications (e.g., Amazon, Netflix, Google News) of recommendation systems. However, several challenges remain. For example, to whom should we recommend an item that is in limited supply but preferred by many? Recommendation for used cars and online auctions faces such challenges: dealers want to advertise their inventory to users who are potentially interested and likely to offer a high bid. Another interesting yet challenging question is how to make recommendations in a market with bilateral preferences, as in labor markets, online dating, and college admissions, where it makes more sense to recommend the most 'fit' candidates rather than the 'best' ones. All of these issues are beyond the reach of conventional collaborative filtering (Sarwar et al., 2001), content-based (Pazzani and Billsus, 2007), and hybrid (Burke, 2002) recommendation systems.

To address the above issues, we consider a general optimal-transport-based framework in which the amounts of supply and demand are modeled by probability distributions, the preferences of both the user and item sides as well as the internal matching mechanism of the market are captured by optimal transport theory, and recommendations are made to users so as to optimize a social surplus (Carlier and Ekeland, 2010). Concretely, we view user-item matchings as a flow between the item side and the user side; an optimal flow gives the most cost-effective way to distribute the items to the users. We assume the data for our learning problem are given in the form of an observed matching matrix (for example, in the college admission case, we know who was admitted to which college), and the goal is to understand the mechanism that gives rise to the observed matching using the optimal transport framework. A key ingredient in our model is the cost matrix that gives rise to the observed matching matrix; it is parameterized by an interaction matrix describing the interaction between users and items. We would like to learn the cost from the observed matching matrix so that we can make recommendations for new groups of users and items in the future. To learn the cost, we formulate an inverse optimal transport problem with marginal relaxation via regularized Wasserstein distance. Marginal relaxation is of great importance in our proposed model, as it robustifies the solution in the face of a noisy observed matching matrix. Our formulation leads to a non-convex problem that can be solved efficiently by an alternating minimization algorithm. We investigate the robustness and learning performance of our model on both synthetic data and real-world datasets. Experimental results show that our approach significantly improves robustness and outperforms its competitors in terms of learning performance.
The most significant distinction between our proposed model and conventional recommendation systems is that our model addresses user-item matching; it is conceptually different from score-based preference learning methods.

The main novelty of our work is twofold. First, from a modeling point of view, we build a connection between recommendation systems and optimal transport. To the best of our knowledge, this is the first attempt to incorporate optimal transport into recommendation systems. Our optimal-transport-based recommendation system is well suited to settings where the market has an internal matching mechanism and supply and demand are limited. Second, from a methodological perspective, the inverse optimal transport problem we formulate to learn the cost incorporates marginal relaxation through regularized Wasserstein distance, which significantly improves the robustness of our method compared with state-of-the-art competitors. We also provide an efficient alternating algorithm to solve it, so that the additional robustness does not come with a high price tag.

This paper is organized as follows: we briefly summarize related work in Section 2, and review discrete optimal transport, its regularized version, and their close connections in Section 3. Section 4 describes the setup of our model and introduces our robust formulation via regularized Wasserstein distance, which models the matching mechanism by leveraging regularized optimal transport. The optimization algorithm is detailed in Section 5. We evaluate our model in Section 6 on both synthetic data and real-world datasets. The last section concludes the paper and outlines several directions for potential future research.

## 2 Related Work

**Recommendation Systems meet Optimal Transport** Several recent works relate optimal transport to recommendation systems: (Qian et al., 2017) uses Wasserstein distance to perform image retrieval and recommendation, and (Huang, 2017) adopts Wasserstein distance for multi-label prediction in the context of social friend recommendation. Our work differs from theirs in that, instead of merely using Wasserstein distance as a dissimilarity measure, we also leverage optimal transport to model the internal matching mechanism of the market. Another related work studies total surplus maximization in recommending online services (Zhang et al., 2016).

**Distance Metric Learning** Our model essentially aims to learn a distance-like cost matrix and is thus closely related to distance metric learning. Prior research on learning distance metrics in various contexts is fruitful, such as learning cosine similarity for face verification (Nguyen and Bai, 2010) and learning Mahalanobis distances for clustering (Xing et al., 2003) and face identification (Guillaumin et al., 2009). However, metric learning for the optimal transport distance is largely unexplored. Cuturi (2014) proposed to learn the ground metric by minimizing the difference of two convex polyhedral functions. Wang and Guibas (2012) formulated an SVM-like minimization problem to learn the Earth Mover's distance. Both approaches work with the exact Wasserstein distance, which involves solving a linear program as a subroutine and hence may be computationally too expensive. The work closest to ours is (Dupuy et al., 2016), which works with the regularized optimal transport distance and proposes to learn a linear cost under a fixed-marginal formulation. The key difference between their work and ours is that we propose a formulation with marginal relaxation, which significantly improves robustness and learning performance.

**Reciprocal Recommendation** Another line of related research is reciprocal recommendation (Brozovsky and Petricek, 2007; Pizzato et al., 2013), which also tries to model two-sided preferences, by computing a reciprocal score via a hand-crafted score function. In contrast, our model learns how the two sides interact from the observed matching matrix.

## 3 Background and Preliminaries

In this section, we present Kantorovich's formulation of the optimal transportation problem and its regularized version.

### 3.1 Notation

In what follows, $\odot$ and $\oslash$ denote component-wise multiplication and division, $\mathbf{1}$ is the vector of ones with appropriate dimension, $\Delta_n = \{x \in \mathbb{R}_+^n : x^\top\mathbf{1} = 1\}$ is the probability simplex, $\langle A, B\rangle = \sum_{i,j} A_{ij}B_{ij}$ is the Frobenius inner product, $C$ is the cost matrix, and $\Pi(\mu, \nu)$ is the transport polytope of $\mu \in \Delta_m$ and $\nu \in \Delta_n$, namely the set of all $m$-by-$n$ matrices satisfying the marginal constraints specified by $\mu$ and $\nu$:

$$\Pi(\mu, \nu) = \{T \in \mathbb{R}_+^{m\times n} : T\mathbf{1} = \mu,\; T^\top\mathbf{1} = \nu\}.$$

$\Pi(\mu, \nu)$ is a convex, closed and bounded set. Moreover, $\Pi(\mu, \nu)$ has a nice probabilistic interpretation: let $X$ and $Y$ be two multinomial random variables with distributions $\mu$ and $\nu$, respectively; then $\Pi(\mu, \nu)$ is the set of all couplings of $X$ and $Y$.

### 3.2 Optimal Transport

Given a cost matrix $C \in \mathbb{R}_+^{m\times n}$ and two probability vectors $\mu \in \Delta_m$, $\nu \in \Delta_n$, define

$$W(\mu, \nu) = \min_{T \in \Pi(\mu,\nu)} \langle C, T\rangle.$$

This quantity describes how costly it is to redistribute $\mu$ to $\nu$, and hence provides a natural measure of how far apart the two distributions are. In particular, when $m = n$ and $C \in \mathcal{C}_n$, where $\mathcal{C}_n$ is the cone of distance matrices (Brickell et al., 2008) defined as

$$\mathcal{C}_n = \{C \in \mathbb{R}_+^{n\times n} : C_{ii} = 0,\; C_{ij} = C_{ji},\; C_{ij} \le C_{ik} + C_{kj} \text{ for all } i, j, k\},$$

$W(\cdot, \cdot)$ is in fact a distance metric on $\Delta_n$ (Villani, 2008), and is called the optimal transport (OT) distance (also known as the Wasserstein distance or the Earth Mover's Distance). The minimizer $T^*$ is called the optimal transport plan.

In the discrete case, computing the OT distance amounts to solving a linear program, for which dedicated algorithms exist with time complexity $O(n^3 \log n)$ (Pele and Werman, 2009), assuming $m = n$. Nevertheless, this is still computationally too expensive in many large-scale settings. In addition, the OT plan typically admits a sparse (pure) form, which is not robust in data-driven applications.
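As a minimal illustration of this linear-programming view (the function and solver choice below are ours, not the paper's), the discrete OT distance can be computed directly with `scipy.optimize.linprog`:

```python
import numpy as np
from scipy.optimize import linprog


def ot_distance(C, mu, nu):
    """Discrete OT distance min_{T in Pi(mu, nu)} <C, T> as a linear program.

    The plan T is flattened row-major; the equality constraints encode the
    row marginals (T @ 1 = mu) and the column marginals (T.T @ 1 = nu).
    """
    m, n = C.shape
    row_con = np.kron(np.eye(m), np.ones(n))  # sum_j T[i, j] = mu[i]
    col_con = np.kron(np.ones(m), np.eye(n))  # sum_i T[i, j] = nu[j]
    res = linprog(
        C.ravel(),
        A_eq=np.vstack([row_con, col_con]),
        b_eq=np.concatenate([mu, nu]),
        bounds=(0, None),  # T >= 0
        method="highs",
    )
    return res.fun, res.x.reshape(m, n)
```

The number of variables is $mn$, which is why generic LP solvers become impractical at scale and motivates the entropic regularization discussed next.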

### 3.3 Regularized Optimal Transport

One way to mitigate the sparsity and improve the smoothness of the OT plan is to add an entropy regularization. This idea has been explored in (Wilson, 1969; Galichon and Salanié, 2010; Cuturi, 2013; Benamou et al., 2015). Consider

$$W_\lambda(\mu, \nu) = \min_{T \in \Pi(\mu,\nu)} \langle C, T\rangle - \frac{1}{\lambda} H(T), \tag{1}$$

where $H(T) = -\sum_{i,j} T_{ij}\log T_{ij}$ is the entropy of $T$ and $\lambda > 0$ is the regularization parameter controlling the trade-off between sparsity and uniformity of $T$. We call the above quantity the regularized optimal transport (ROT) distance (or regularized Wasserstein distance), though it may not be an actual distance measure. Due to the strict convexity of the introduced entropy term, (1) admits a unique minimizer $T^\lambda$ with full support, which we call the regularized optimal transport plan in the sequel. The ROT plan has a semi-closed-form solution

$$T^\lambda = \mathrm{diag}(u)\, K\, \mathrm{diag}(v),$$

where $u$ and $v$ are positive vectors, uniquely determined up to a multiplicative constant, and $K = e^{-\lambda C}$ is the component-wise exponential of $-\lambda C$. We can compute $u$ and $v$ by the Sinkhorn-Knopp matrix scaling algorithm (Sinkhorn and Knopp, 1967), also known as the iterative proportional fitting procedure (IPFP) in the statistics community. The algorithm alternately scales the rows and columns of $K$ to fit the specified marginals; Appendix A describes the procedure in detail.
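The scaling procedure can be sketched in a few lines of NumPy (a simplified version with a fixed iteration count rather than a convergence test; names are ours, not the paper's Appendix A pseudocode):

```python
import numpy as np


def sinkhorn(C, mu, nu, lam=10.0, n_iter=500):
    """Compute the ROT plan T = diag(u) K diag(v) with K = exp(-lam * C)
    by alternately rescaling rows and columns to fit the marginals."""
    K = np.exp(-lam * C)
    u = np.ones_like(mu)
    for _ in range(n_iter):
        v = nu / (K.T @ u)  # scale columns to fit nu
        u = mu / (K @ v)    # scale rows to fit mu
    return u[:, None] * K * v[None, :]
```

Each iteration costs only two matrix-vector products, in contrast with the linear-programming approach to the unregularized problem.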

Not surprisingly, the ROT distance recovers the OT distance as $\lambda$ tends to infinity. Moreover, let

$$\Pi^*(\mu, \nu) = \operatorname{argmin}_{T \in \Pi(\mu,\nu)} \langle C, T\rangle$$

be the set of all OT plans and

$$T^* = \operatorname{argmax}_{T \in \Pi^*(\mu,\nu)} H(T)$$

be the joint distribution with the highest entropy among them; then $T^\lambda \to T^*$ as $\lambda \to \infty$. In other words, the ROT plan converges to the most uniform OT plan, and the rate of convergence is exponential, as shown in (Cominetti and San Martín, 1994). Given such a nice convergence result, it is theoretically appealing to approximate the OT distance by the ROT distance. However, this approximation suffers from round-off errors when $\lambda$ is large.

ROT has more favorable computational properties than OT, as it involves only component-wise operations and matrix-vector multiplications, all of which are of quadratic complexity and can be easily parallelized (Cuturi, 2013). This fact makes ROT popular for measuring dissimilarity between potentially unnormalized distributions in many research areas: deep learning (Genevay et al., 2017), computer vision (Cuturi and Doucet, 2014), machine learning (Rolet et al., 2016; Laclau et al., 2017), and image processing (Papadakis, 2015).

Besides computational efficiency, we argue in the next section why it is also more appropriate to use ROT in our setting from a modeling perspective.

## 4 Learning to Recommend

For ease of exposition, we refer to the two sides of the market as users and items, as in the e-commerce context. The methodology is applicable in various settings where optimal matching is considered under stock limitations, such as marriage, cab hailing, and college admission. Suppose we have user profiles $\{u_i\}_{i=1}^m$, item profiles $\{v_j\}_{j=1}^n$, and counts $X_{ij}$ of how many times user $i$ matches with item $j$. Let $N = \sum_{i,j} X_{ij}$ be the total number of matchings, $\hat\pi = X/N$ be the observed matching matrix, and $\hat\mu = \hat\pi\mathbf{1}$, $\hat\nu = \hat\pi^\top\mathbf{1}$ be the sample marginals. Suppose we are also given two cost matrices $C^u$ and $C^v$, measuring user-user dissimilarity and item-item dissimilarity respectively. We can then pick two appropriate constants $\lambda_1$ and $\lambda_2$ and use the ROT distances $W_{\lambda_1}$ (with ground cost $C^u$) and $W_{\lambda_2}$ (with ground cost $C^v$) to measure the dissimilarity of probability distributions over the user profile space and over the item profile space, respectively.
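As a small concrete example of these empirical quantities (the toy counts and variable names below are ours):

```python
import numpy as np

# X[i, j] = number of times user i matched with item j (toy data)
X = np.array([[3, 1, 0],
              [0, 2, 2],
              [1, 0, 1]], dtype=float)

N = X.sum()                  # total number of matchings
pi_hat = X / N               # observed matching matrix (a joint distribution)
mu_hat = pi_hat.sum(axis=1)  # empirical user-side marginal
nu_hat = pi_hat.sum(axis=0)  # empirical item-side marginal
```

Both marginals sum to one, and they are exactly the row and column sums of the observed matching matrix.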

### 4.1 Modeling Observed Matching Matrix

The observed matching matrix (and hence the empirical marginals) often contains noisy, corrupted, and/or missing entries, so when learning the cost function it is more robust to employ a regularized optimal transport plan than to enforce an exact match to the empirical data. To that end, we propose to use ROT in our learning task. This choice also has several important benefits, which take the following aspects into consideration:

**Unobserved Features.** Matched (user, item) pairs usually observe a complete set of each other's features, but some of those features are unobservable to researchers. The entropy term therefore accounts for the possibility that important hidden features are unavailable and that the model may not be able to explain all matchings (Galichon and Salanié, 2010).

**Enforced Diversity.** In certain matching settings, diversity is enforced. For example, when a college admission committee makes decisions on applications, diversity is usually an important criterion, and the committee may prefer underrepresented minorities. The entropy term captures the uncertainty introduced by diversity.

**Aggregated Data.** Sometimes, due to privacy issues or an insufficient number of matched (user, item) pairs, only grouped or aggregated data, rather than individual data, are available. Aggregated matchings are denser than individual-level matchings and are less likely to exhibit sparsity.

### 4.2 Cost Function via Kernel Representation

The cost function is critically important, as it determines the utility loss when user $i$ is matched with item $j$: the lower the cost, the more likely user $i$ will match with item $j$, subject to the stock limits of the items. A main contribution of this work is to learn an adaptive, nonlinear representation of the cost function from empirical matching data. To that end, we present several properties of the cost function in optimal matching that support the feasibility of learning it.

First of all, we show in the following proposition that the cost function can be uniquely determined by the ROT plan under certain conditions (e.g., when the cost is a distance matrix).

###### Proposition 1

If $C_1$ and $C_2$ are symmetric matrices with the same diagonal, both in the domain of the mapping $C \mapsto T^\lambda(C)$, then $T^\lambda(C_1) = T^\lambda(C_2)$ if and only if $C_1 = C_2$. In other words, the mapping $C \mapsto T^\lambda(C)$ is injective on such matrices.

Proof:

"$\Leftarrow$": If $C_1 = C_2$, the two ROT plans coincide; the uniqueness of the plan is proved in Sinkhorn and Knopp (1967).

"$\Rightarrow$": There exist positive vectors $u_1, v_1$ and $u_2, v_2$ such that

$$T^\lambda(C_k) = \mathrm{diag}(u_k)\, K_k\, \mathrm{diag}(v_k), \qquad k = 1, 2.$$

If $T^\lambda(C_1) = T^\lambda(C_2)$, we have

$$K_2 = \mathrm{diag}(a)\, K_1\, \mathrm{diag}(b),$$

where $K_k = e^{-\lambda C_k}$ is the component-wise exponential of $-\lambda C_k$, $a = u_1 \oslash u_2$, and $b = v_1 \oslash v_2$.

Since $C_1$ and $C_2$ are symmetric matrices, it follows that $K_2 = \mathrm{diag}(b)\, K_1\, \mathrm{diag}(a)$ as well. By appropriately rescaling $a$ and $b$ to make them equal, we have

$$K_2 = \mathrm{diag}(a)\, K_1\, \mathrm{diag}(a),$$

where $a = b$. Inspecting the $(i,i)$ entries of both sides and using the fact that $C_1$ and $C_2$ have the same diagonal, we get $a_i^2 = 1$, so $a = \mathbf{1}$ by positivity, and we immediately conclude that $K_1 = K_2$ and $C_1 = C_2$. $\blacksquare$

Secondly, viewing costs as generalized distance functions (Schölkopf, 2001), we consider a kernel representation of the cost function in our model:

$$c(u_i, v_j) = \mathcal{K}(P u_i,\, Q v_j),$$

where $\mathcal{K}$ is a specific (possibly nonlinear) kernel, and $P$ and $Q$ are two unknown linear transformations to be learned. The transformed profiles $P u_i$ and $Q v_j$ can be interpreted as the latent profiles associated with users and items, as studied in (Agarwal and Chen, 2009).

A wide class of commonly used kernels, including the linear, polynomial, and sigmoid kernels as well as the Gaussian kernel restricted to the sphere, depend only on the inner product of their two arguments through a link function $g$:

$$\mathcal{K}(x, y) = g(\langle x, y\rangle).$$

For such kernels, we have $\mathcal{K}(P u_i, Q v_j) = g(u_i^\top P^\top Q\, v_j)$, and it suffices to learn $A = P^\top Q$. In this case, the cost matrix is parametrized as $C(A) = g(U A V^\top)$, where $U$ and $V$ stack the user and item profiles as rows and $g$ is applied component-wise; we refer to $A$ as the interaction matrix. For ease of presentation, we will work with kernels of this form.
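With profiles stacked as rows, such a parametrized cost matrix is one line of NumPy; the exponential link used here is only an illustrative choice of link function, not one prescribed by the paper:

```python
import numpy as np


def cost_matrix(U, V, A, link=np.exp):
    """C[i, j] = link(u_i^T A v_j) for user profiles U (m x d_u),
    item profiles V (n x d_v), and interaction matrix A (d_u x d_v).
    The link function is applied component-wise."""
    return link(U @ A @ V.T)
```

Learning the cost then amounts to learning the entries of `A`, with the link fixed in advance.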

### 4.3 Kernel Inference with Wasserstein Marginal Regularization

A straightforward approach to learning the cost matrix in kernel representation is to solve for the parameter $A$ from

$$\min_A \; \mathrm{KL}\big(\hat\pi \,\|\, T^\lambda(A)\big), \tag{2}$$

where $T^\lambda(A)$ is the ROT plan for the cost $C(A)$ with the empirical marginals $\hat\mu$ and $\hat\nu$ imposed as hard constraints (Dupuy et al., 2016). In practice, however, the number of available samples is usually small compared with the size of the population, so the empirical marginals inferred from the samples can be incomplete and noisy, which causes a *systematic error* no smaller than the bound given in Proposition 3.

###### Lemma 2

Suppose $\mu \in \Delta_m$, $\nu \in \Delta_n$, and $\hat\pi \in \mathbb{R}^{m\times n}$ with marginals $\hat\mu = \hat\pi\mathbf{1}$ and $\hat\nu = \hat\pi^\top\mathbf{1}$. We have

$$\min_{T:\, T\mathbf{1} = \mu,\; T^\top\mathbf{1} = \nu} \|T - \hat\pi\|_F^2 \;=\; \frac{\|\mu - \hat\mu\|_2^2}{n} + \frac{\|\nu - \hat\nu\|_2^2}{m},$$

where $\|\cdot\|_F$ is the Frobenius norm.

Proof: Denote $\Delta = T - \hat\pi$. It is easily seen that $\Delta\mathbf{1} = \mu - \hat\mu$ and $\Delta^\top\mathbf{1} = \nu - \hat\nu$. Consider the Lagrangian function of the right-hand-side minimization problem:

$$L(\Delta, \alpha, \beta) = \|\Delta\|_F^2 - 2\alpha^\top\big(\Delta\mathbf{1} - (\mu - \hat\mu)\big) - 2\beta^\top\big(\Delta^\top\mathbf{1} - (\nu - \hat\nu)\big).$$

The KKT condition is

$$\Delta = \alpha\mathbf{1}^\top + \mathbf{1}\beta^\top, \qquad \Delta\mathbf{1} = \mu - \hat\mu, \qquad \Delta^\top\mathbf{1} = \nu - \hat\nu.$$

By solving the KKT condition, we have $\alpha = (\mu - \hat\mu)/n$ and $\beta = (\nu - \hat\nu)/m$, and hence

$$\min \|\Delta\|_F^2 = n\|\alpha\|_2^2 + m\|\beta\|_2^2 = \frac{\|\mu - \hat\mu\|_2^2}{n} + \frac{\|\nu - \hat\nu\|_2^2}{m}. \qquad\blacksquare$$

###### Proposition 3

Let $T$ be any matrix satisfying the hard marginal constraints in (2), i.e., $T\mathbf{1} = \hat\mu$ and $T^\top\mathbf{1} = \hat\nu$, and let $\pi$ be the underlying true matching with marginals $\mu$ and $\nu$. Then

$$\|T - \pi\|_F^2 \;\ge\; \frac{\|\mu - \hat\mu\|_1^2 + \|\nu - \hat\nu\|_1^2}{mn}.$$

Proof: We know $\pi\mathbf{1} = \mu$ and $\pi^\top\mathbf{1} = \nu$. By Lemma 2 (applied with $\pi$ in place of $\hat\pi$) and the inequality $\|x\|_2^2 \ge \|x\|_1^2/d$ between the 1-norm and 2-norm of a vector $x \in \mathbb{R}^d$, we have

$$\|T - \pi\|_F^2 \;\ge\; \frac{\|\hat\mu - \mu\|_2^2}{n} + \frac{\|\hat\nu - \nu\|_2^2}{m} \;\ge\; \frac{\|\mu - \hat\mu\|_1^2 + \|\nu - \hat\nu\|_1^2}{mn}. \qquad\blacksquare$$

To address this issue, we propose a robust formulation with marginal relaxation. Concretely, we consider the following optimization problem:

$$\min_{A,\,\mu,\,\nu} \; \mathrm{KL}\big(\hat\pi \,\|\, T^\lambda(A, \mu, \nu)\big) \;+\; \eta\,\big(W_{\lambda_1}(\mu, \hat\mu) + W_{\lambda_2}(\nu, \hat\nu)\big), \tag{3}$$

where $T^\lambda(A, \mu, \nu)$ is the ROT plan of the cost $C(A)$ with marginals $\mu$ and $\nu$, $\eta$ is the relaxation parameter controlling the fit between the marginals of the plan and the empirical marginals, and $\lambda_1, \lambda_2$ are hyper-parameters controlling the regularity of the ROT distances used in the formulation.

The intuition behind this formulation is that instead of enforcing the noisy empirical marginals as hard constraints, we incorporate them as soft penalties in the objective function. We use ROT as the marginal regularization for the following reasons. First, the optimal transport map tends to send neighboring probability mass to neighboring destinations, and hence tolerates some noise caused by misclassified samples. Second, as an approximation of the Wasserstein metric, it drives the marginals of the plan toward the empirical marginals while still allowing some uncertainty. Further, the computation-friendly properties of ROT (Cuturi, 2013) make it attractive and can be exploited to derive an efficient algorithm for solving (3).

We assume access to the user-user and item-item cost matrices in our model because learning user-user/item-item similarity is relatively easier than our task, and many existing works are dedicated to that end (Cheung and Tian, 2004; Agarwal and Bharadwaj, 2013); we want to single out and highlight our main contribution: learning the cost matrix that models the matching mechanism. In fact, our framework can easily be generalized to learn the two auxiliary cost matrices jointly if needed.

We postpone the detailed algorithmic derivation of the solution of (3) to the next section.

### 4.4 Recommendation with Learned Interaction Matrix

Once the interaction matrix $A$ is learned, we may use it to make recommendations. Concretely, for a new group of users and items, we compute the distribution $\mu$ of the users, the distribution $\nu$ of the items, the cost matrix $C(A)$, and the corresponding ROT plan $T$. For each user $i$, we scan the $i$-th row of $T$ and find the $k$ largest entries; the corresponding items have the highest probability of matching with user $i$, so we recommend them to user $i$. Similarly, we may recommend the top-$k$ users to each item.
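The recommendation step itself then reduces to a top-$k$ selection on the rows of the computed plan (a sketch with names of our own choosing):

```python
import numpy as np


def recommend_top_k(T, k):
    """Given a matching plan T (users x items), return for each user the
    indices of the k items with the highest matching probability."""
    return np.argsort(-T, axis=1)[:, :k]
```

In a full pipeline, one would first build the cost from the learned interaction matrix, run the Sinkhorn scaling to obtain the plan, and then apply this selection row-wise (or column-wise, for item-side recommendations).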

## 5 Derivation of Optimization Algorithm

In this section, we propose an efficient alternating minimization algorithm to solve problem (3).

Since the constraint set of the ROT problem satisfies Slater's condition (Boyd and Vandenberghe, 2004), strong duality holds, and each ROT distance appearing in (3) equals the optimal value of a concave maximization over a pair of dual variables; these dual variables are essentially the Lagrange multipliers corresponding to the two marginal constraints of the associated transport polytope (see also Genevay et al., 2016). Given the sample marginals, once the primal marginals are fixed, the corresponding optimal dual variables are fixed as well.

Then we can convert (3) into a min-max problem

(4)

where constants are omitted. The optimal solution is a saddle-point of the objective in (4).

To solve this min-max problem, we alternately update the primal variables and the dual variables, each time with the other group fixed.

### 5.1 Update of the Primal Variables for Fixed Dual Variables

Now the dual variables are all fixed, and we need to solve the remaining minimization over the primal variables.

Note that the ROT plan can be written as $T = \mathrm{diag}(u)\, K\, \mathrm{diag}(v)$ for some positive vectors $u$ and $v$, where $K$ is the component-wise exponential of the scaled negative cost. Therefore we can rewrite the minimization in an equivalent form:

(5)

For fixed $A$, denote by $F(A)$ the optimal value of the objective function in equation (5) subject to its constraints, as defined in equation (6).

Recall that the ultimate goal in this step is to find the interaction matrix $A$, on which the cost depends, such that the minimum above is attained. Writing the cost in its kernel representation parameterized by the interaction matrix $A$, the minimization above is equivalent to

$$\min_A F(A). \tag{7}$$

To solve (7), the critical step is to evaluate the gradient of $F$, for which we have

###### Proposition 4

The gradient of $F(A)$ can be written, via the chain rule, in terms of the Lagrange multiplier of the constrained minimization problem in equation (6).

Proof: By the chain rule, the gradient of $F$ factors into the derivative of the cost matrix with respect to $A$, which is readily available from the kernel representation, and the derivative of the optimal value of (6) with respect to the cost. For the latter, by the envelope theorem (Milgrom and Segal, 2002), the derivative of the optimal value with respect to the cost is given by the corresponding Lagrange multiplier term.

Therefore, for each evaluation of the gradient of $F$, we need to solve the minimization problem (6) once. After a change of variables, the constrained minimization in equation (6) can be written in an equivalent form in which the logarithm is applied component-wise and constants are omitted.

Note that this is a non-convex optimization problem: both the objective function and the constraints are non-convex, which makes it difficult to solve in general. However, once we fix one block of variables, the problem with respect to the other block alone is convex, and vice versa. We can therefore solve this problem efficiently by alternately updating the two blocks.
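As a generic illustration of this block-coordinate strategy on a toy biconvex problem (not the paper's actual objective), each block below has a closed-form minimizer, and the two minimizations are applied alternately:

```python
def alternating_minimize(argmin_x, argmin_y, x0, y0, n_iter=100):
    """Block-coordinate descent: alternately minimize over each block
    of variables while the other block is held fixed."""
    x, y = x0, y0
    for _ in range(n_iter):
        x = argmin_x(y)  # exact minimization over x with y fixed
        y = argmin_y(x)  # exact minimization over y with x fixed
    return x, y


# Toy biconvex objective f(x, y) = (x*y - 1)**2 + 0.1*(x**2 + y**2):
# convex in x for fixed y (and vice versa), but non-convex jointly.
argmin_x = lambda y: y / (y * y + 0.1)  # solves df/dx = 0 in closed form
argmin_y = lambda x: x / (x * x + 0.1)  # solves df/dy = 0 in closed form
x, y = alternating_minimize(argmin_x, argmin_y, 1.0, 1.0)
```

Each block update can only decrease the objective, so the objective values are monotone and converge; this is the same structure as the alternating scaling-variable updates used here.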

###### Proposition 5

Under the alternating updates, the objective values are non-increasing and converge, and any limit point satisfies the joint KKT conditions, hence is a local minimizer.

Proof: From the definition of the objective, it is easily seen that it is bounded below. Moreover, since each alternating update minimizes the objective over one block with the other block fixed, the objective values are non-increasing and hence converge.

The KKT conditions for the two block subproblems are

(10)

(11)

Let the iteration index tend to infinity, take inner products with the appropriate vectors in equations (10) and (11), compare the two resulting equations, and use the fact that both marginals are probability vectors; we find that the limit point solves the joint KKT conditions. Therefore, the limit point is a local minimizer and the limit value is a local minimum.

Once we obtain the limit point, we can recover the remaining variables by direct substitution, and then plug them in to evaluate $F$ and its gradient for the current $A$.

A careful analysis of the KKT conditions of equations (8) and (9) shows that each scaling update reduces to a univariate root-finding problem, which we can solve efficiently by leveraging off-the-shelf implementations. After obtaining the root, the corresponding block of variables can be updated directly; the other block is updated in the same fashion from its own root-finding problem. Computationally, this approach to solving problems (8) and (9) is much cheaper than gradient-type iterative methods when the numbers of users and/or items are large.

### 5.2 Update of the Dual Variables for Fixed Primal Variables

When the primal variables are fixed, the cost matrix is also fixed, and we then only need to solve the two dual maximization problems; this is equivalent to applying the Sinkhorn-Knopp algorithm to compute the two pairs of dual scaling variables.

To summarize, in each iteration we perform a gradient-type update for the interaction matrix $A$, followed by two calls of the Sinkhorn-Knopp algorithm for the dual variables. Algorithm 1 describes the procedure in detail.

Per iteration, the time complexity of our algorithm is dominated by the root-finding subroutine, the Sinkhorn-Knopp calls, and the evaluation of the gradient of $F$, the last of which costs $O(mn)$. The Sinkhorn-Knopp algorithm is known to converge linearly (Franklin and Lorenz, 1989), and empirically it usually converges within a few steps, so we may safely cap its iteration count at a relatively small number in practice. The space complexity is $O(mn)$ (we store intermediate matrices to save run time).