# Inductive Matrix Completion Based on Graph Neural Networks

###### Abstract

We propose an inductive matrix completion model that does not use side information. By factorizing the (rating) matrix into the product of low-dimensional latent embeddings of rows (users) and columns (items), a majority of existing matrix completion methods are transductive, since the learned embeddings cannot generalize to unseen rows/columns or to new matrices. To make matrix completion inductive, previous methods have had to rely on content (side information), such as a user’s age or a movie’s genre. However, high-quality content is not always available, and can be hard to extract. Under the extreme setting where no side information is available other than the matrix to complete, can we still learn an inductive matrix completion model? In this paper, we investigate this seemingly impossible problem and propose an Inductive Graph-based Matrix Completion (IGMC) model that does not use any side information. It trains a graph neural network (GNN) based purely on local subgraphs around (user, item) pairs generated from the rating matrix and maps these subgraphs to their corresponding ratings. Our model achieves highly competitive performance with state-of-the-art transductive baselines. In addition, since our model is inductive, it can generalize to users/items unseen during training (given that their ratings exist), and can even transfer to new tasks. Our transfer learning experiments show that a model trained on the MovieLens dataset can be directly used to predict Douban movie ratings and works surprisingly well. Our work demonstrates that: 1) it is possible to train inductive matrix completion models without using any side information while achieving state-of-the-art performance; 2) local graph patterns around a (user, item) pair are effective predictors of the rating this user gives to the item; and 3) we can transfer models trained on existing recommendation tasks to new tasks without any retraining.

## 1 Introduction

Matrix completion (candes2009exact) is one common formulation of recommender systems, where rows and columns of a matrix represent users and items, respectively, and predicting users’ interest in items corresponds to filling in the missing entries of the rating matrix. By assuming a low-rank rating matrix, many of the most popular matrix completion algorithms use factorization techniques that decompose a rating r_{ij} into \bm{\mathrm{w}}^{\top}_{i}\bm{\mathrm{h}}_{j}, the inner product of user i’s and item j’s latent feature vectors \bm{\mathrm{w}}_{i} and \bm{\mathrm{h}}_{j}, respectively. These methods have achieved great success (adomavicius2005toward; schafer2007collaborative; koren2009matrix; bobadilla2013recommender).

However, matrix factorization is intrinsically transductive, meaning that the learned latent features (embeddings) for users/items are not generalizable to users/items unseen during training. When the rating matrix has changed values or has new rows/columns added, a complete retraining is often required to get the new embeddings. To make matrix completion inductive, Inductive Matrix Completion (IMC) has been proposed, which leverages content (side information) of users and items (jain2013provable; xu2013speedup). In IMC, a rating is decomposed as r_{ij}=\bm{\mathrm{x}}^{\top}_{i}\bm{\mathrm{Q}}\bm{\mathrm{y}}_{j}, where \bm{\mathrm{x}}_{i} and \bm{\mathrm{y}}_{j} are content feature vectors of user i and item j, respectively, and \bm{\mathrm{Q}} is a learnable matrix modeling the feature interactions. To accurately predict missing entries, IMC methods place strong constraints on the content quality, which often leads to inferior performance when high-quality content is lacking. Other content-based recommender systems (lops2011content) face similar problems. In some extreme settings, there is no content available at all, such as a website where users are completely anonymous. In these cases, inductive matrix completion seems impossible.
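The two decompositions above can be contrasted in a few lines. The following is a minimal sketch in plain Python, not code from the paper; the function names `mf_predict` and `imc_predict` and the toy numbers are illustrative assumptions.

```python
def dot(a, b):
    """Inner product of two equal-length lists."""
    return sum(x * y for x, y in zip(a, b))


def mf_predict(w_i, h_j):
    """Matrix factorization: r_ij = w_i^T h_j.

    w_i and h_j are latent vectors learned per user ID and per item ID,
    so nothing is available for a user or item unseen during training.
    """
    return dot(w_i, h_j)


def imc_predict(x_i, Q, y_j):
    """Inductive matrix completion: r_ij = x_i^T Q y_j.

    x_i and y_j are content feature vectors, so any user or item with
    content features can be scored using the shared matrix Q.
    """
    Qy = [dot(row, y_j) for row in Q]
    return dot(x_i, Qy)


# Toy values (hypothetical, for illustration only).
print(mf_predict([1.0, 2.0], [3.0, 0.5]))
print(imc_predict([1.0, 0.0], [[2.0, 1.0], [0.0, 3.0]], [1.0, 1.0]))
```

The sketch makes the dependency explicit: factorization needs a learned vector per ID, whereas IMC only needs content features plus the shared matrix Q.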

In this paper, we propose a novel inductive matrix completion method that does not use any content, while achieving highly competitive performance with state-of-the-art transductive methods. The key that frees us from using content is graph patterns. If for each observed rating we add an edge between the corresponding user and item, we can build a bipartite graph from the rating matrix. Predicting unknown ratings is then equivalent to predicting labeled links in this bipartite graph. This transforms matrix completion into a link prediction problem (liben2007link), where graph patterns play a major role in determining link existence.

A major class of link prediction methods are heuristic methods, which predict links based on some heuristic scores. For example, the common neighbors heuristic counts the common neighbors between two nodes to predict links, while the Katz index (katz1953new) uses a weighted sum of all the walks between two nodes. See (liben2007link) for an overview. These heuristics can be seen as predefined graph structure features calculated from the local or global graph patterns around links, and they have achieved great success due to their simplicity and effectiveness.

However, these traditional link prediction heuristics only work for simple graphs where nodes and edges both have only a single type. Can we find some heuristics for labeled link prediction in bipartite graphs? Intuitively, such heuristics should exist. For example, if a user u_{0} likes an item v_{0}, we may expect to see very often that v_{0} is also liked by some other user u_{1} who shares a similar taste with u_{0}. By similar taste, we mean that u_{1} and u_{0} have both liked some other item v_{1}. In the bipartite graph, such a pattern is realized as a “like” path (u_{0}\rightarrow_{\text{like}}v_{1}\rightarrow_{\text{liked by}}u_{1}\rightarrow_{\text{like}}v_{0}). If there are many such paths between u_{0} and v_{0}, we may infer that u_{0} is highly likely to like v_{0}. Thus, we may count the number of such paths as an indicator of how likely u_{0} is to like v_{0}. In fact, many neighborhood-based recommender systems (desrosiers2011comprehensive) rely on similar heuristics.
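The path-counting heuristic described above can be sketched as follows. This is an illustrative example only: the text does not define "like" numerically, so the threshold `like_threshold=4` and the function name `count_like_paths` are assumptions.

```python
def count_like_paths(ratings, u0, v0, like_threshold=4):
    """Count paths u0 -like-> v1 -liked by-> u1 -like-> v0.

    ratings: dict mapping (user, item) -> rating.
    A rating >= like_threshold counts as "like" (an assumption made here).
    """
    items_liked_by = {}   # user -> set of items the user likes
    users_who_like = {}   # item -> set of users who like the item
    for (u, v), r in ratings.items():
        if r >= like_threshold:
            items_liked_by.setdefault(u, set()).add(v)
            users_who_like.setdefault(v, set()).add(u)

    count = 0
    for v1 in items_liked_by.get(u0, set()):
        if v1 == v0:
            continue  # the intermediate item must differ from the target item
        for u1 in users_who_like.get(v1, set()):
            # u1 must be another user who also likes the target item v0
            if u1 != u0 and v0 in items_liked_by.get(u1, set()):
                count += 1
    return count


toy_ratings = {
    ("u0", "v1"): 5,
    ("u1", "v1"): 4,
    ("u1", "v0"): 5,
    ("u2", "v1"): 5,
    ("u2", "v0"): 2,   # u2 shares u0's taste on v1 but dislikes v0
}
print(count_like_paths(toy_ratings, "u0", "v0"))
```

A larger count would be read as stronger evidence that u0 likes v0. IGMC does not hard-code this heuristic; it is shown only to make the intuition concrete.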

Of course we can try to manually define many such intuitive heuristics and test their effectiveness. In this work, however, we take a different approach that automatically learns suitable heuristics from the given bipartite graph. To do so, we first extract an h-hop enclosing subgraph for each training user-item pair (u,v), which is defined to be the subgraph induced from the bipartite graph by nodes u,v and the neighbors of u and v within h hops. Such local subgraphs contain rich information about the rating that u may give to v. For example, the number of (u_{0}\rightarrow_{\text{like}}v_{1}\rightarrow_{\text{liked by}}u_{1}\rightarrow_{\text{like}}v_{0}) paths can just be computed from the 1-hop enclosing subgraph around (u_{0},v_{0}). By feeding these enclosing subgraphs to a graph neural network (GNN), we train a graph regression model that maps each subgraph to the rating that its center user gives to its center item.

Figure 1 illustrates the overall framework. Owing to its superior graph learning ability, a GNN can learn highly expressive graph structure features useful for inferring the ratings without restricting the features to predefined heuristics. Our Inductive Graph-based Matrix Completion (IGMC) model is inductive, meaning that it can generalize to unseen users/items without retraining. Note that IGMC does not address the extreme cold-start problem, as it still requires an unseen user-item pair’s enclosing subgraph (i.e., the user and item should at least have some interactions with neighbors so that the enclosing subgraph is not empty). We compare IGMC with state-of-the-art transductive baselines on five benchmark matrix completion datasets. Without using any content, IGMC achieves the smallest RMSEs on four out of five datasets, beating those baselines that use side information. Our model also has excellent transfer learning ability. We show that an IGMC model trained on the MovieLens-100K dataset can be directly used to predict Douban movie ratings and even outperforms baselines trained specifically on Douban. We also analyze our model’s behavior on sparse rating matrices. Under extremely sparse cases (0.001 of the original ratings are kept), our model outperforms state-of-the-art transductive baselines by a large margin. Further, our visualization experiments confirm that local enclosing subgraphs are indeed strong predictors of ratings.

## 2 Related Work

Graph neural networks. Graph neural networks (GNNs) are a new type of neural network for learning over graphs (scarselli2009graph; bruna2013spectral; duvenaud2015convolutional; li2015gated; kipf2016semi; niepert2016learning; dai2016discriminative; hamilton2017inductive; zhang2018end). There are two types of GNNs. Node-level GNNs use message passing layers to iteratively pass messages between each node and its neighbors in order to extract a feature vector for each node encoding its local substructure. Graph-level GNNs additionally use a pooling layer, such as summing, which aggregates node feature vectors into a graph representation so that graph-level tasks such as graph classification/regression become feasible. Owing to their superior representation learning ability for graphs, GNNs have achieved state-of-the-art performance on semi-supervised node classification (kipf2016semi), network embedding (hamilton2017inductive), graph classification (zhang2018end), and link prediction (zhang2018link).

GNNs for matrix completion. The matrix completion problem has been studied using GNNs. monti2017geometric develop a multi-graph CNN (MGCNN) model to extract user and item latent features from their respective nearest-neighbor networks. Later, berg2017graph propose graph convolutional matrix completion (GC-MC) to directly operate on the user-item bipartite graph to extract user and item latent features using a GNN. The SpectralCF model of (zheng2018spectral) uses a spectral-GNN on the bipartite graph to learn node embeddings. Although they use GNNs for matrix completion, all these models are still transductive – MGCNN and SpectralCF require graph Laplacians which do not generalize to new tasks, while GC-MC uses one-hot encodings of node IDs as the initial features input to the GNN, and thus cannot generalize to unseen users/items. A recent inductive graph-based recommender system, PinSage (ying2018graph), replaces the node ID initial features in GC-MC with node content features, and is successfully used in recommending related pins in Pinterest. Although inductive, PinSage relies heavily on the rich visual and text content associated with the pins, which is often not accessible in other recommendation tasks. In comparison, our IGMC model is inductive and does not rely on any content. All previous approaches use node-level GNNs to extract features for nodes, while our IGMC uses a graph-level GNN to learn representations for subgraphs. We will discuss this crucial difference in more detail in Section 4.

Link prediction based on graph patterns. Learning supervised heuristics (graph patterns) has been studied for link prediction in simple graphs. zhang2017weisfeiler propose Weisfeiler-Lehman Neural Machine (WLNM), which learns graph structure features using a fully-connected neural network on the subgraphs’ adjacency matrices. Later, they improve this work by replacing the fully-connected neural network with a GNN and achieve state-of-the-art link prediction results (zhang2018link). Our work generalizes this line of research from predicting link existence in simple graphs to predicting values of links in bipartite graphs (i.e., matrix completion). In (chen2005link; zhou2007bipartite), traditional link prediction heuristics are adapted to bipartite graphs and show promising performance for recommender systems. Our work differs in that we do not use any predefined heuristics, but learn general graph structure features using a GNN. Another work similar to ours is (li2013recommendation), where graph kernels are used to learn graph structure features. However, graph kernels require quadratic time and space complexity to compute and store the kernel matrices, and are thus unsuitable for modern recommender systems.

## 3 Inductive Graph-based Matrix Completion (IGMC)

We now present our Inductive Graph-based Matrix Completion (IGMC) framework. We use G to denote the undirected bipartite graph constructed from the given rating matrix \bm{\mathrm{R}}. In G, a node is either a user (denoted by u, corresponding to a row in \bm{\mathrm{R}}) or an item (denoted by v, corresponding to a column in \bm{\mathrm{R}}). Edges can exist between user and item, but cannot exist between two users or two items. Each edge (u,v) has a value r=\bm{\mathrm{R}}_{u,v}, corresponding to the rating that u gives to v. We use \mathcal{R} to denote the set of all possible ratings, and use \mathcal{N}_{r}(u) to denote the set of u’s neighbors that connect to u with edge type r.

### 3.1 Enclosing subgraph extraction

The first part of the IGMC framework is enclosing subgraph extraction. For each observed rating \bm{\mathrm{R}}_{u,v}, we extract an h-hop enclosing subgraph around (u,v) from G. Algorithm 1 describes how we extract h-hop enclosing subgraphs. We will feed these enclosing subgraphs to a GNN and regress on their ratings. Then, for each testing (u,v) pair, we again extract its h-hop enclosing subgraph from G, and use the trained GNN model to predict its rating. Note that after extracting a training enclosing subgraph for (u,v), we should remove the edge (u,v) because it is the target to predict.
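Algorithm 1 is not reproduced in this text, so the following is only a minimal sketch of one way to extract an h-hop enclosing subgraph, assuming a breadth-first expansion from the target user and item. The function name `extract_enclosing_subgraph` and the dictionary-based graph representation are illustrative assumptions, not the authors' implementation.

```python
def extract_enclosing_subgraph(ratings, u, v, h=1):
    """Return (user_hop, item_hop, edges) for the h-hop enclosing subgraph of (u, v).

    ratings: dict mapping (user, item) -> rating (the observed entries of R).
    user_hop / item_hop map each included node to the hop at which it was added.
    edges holds the ratings induced on the included nodes, minus the target edge.
    """
    items_of, users_of = {}, {}
    for (a, b) in ratings:
        items_of.setdefault(a, set()).add(b)
        users_of.setdefault(b, set()).add(a)

    user_hop, item_hop = {u: 0}, {v: 0}
    u_fringe, v_fringe = {u}, {v}
    for hop in range(1, h + 1):
        # Users reachable from the item fringe, items reachable from the user fringe.
        cand_users = {a for b in v_fringe for a in users_of.get(b, ())}
        cand_items = {b for a in u_fringe for b in items_of.get(a, ())}
        u_fringe = {a for a in cand_users if a not in user_hop}
        v_fringe = {b for b in cand_items if b not in item_hop}
        for a in u_fringe:
            user_hop[a] = hop
        for b in v_fringe:
            item_hop[b] = hop

    # Induced edges; the target edge (u, v) is removed because it is the label.
    edges = {
        (a, b): r
        for (a, b), r in ratings.items()
        if a in user_hop and b in item_hop and (a, b) != (u, v)
    }
    return user_hop, item_hop, edges


toy_ratings = {
    ("u0", "v0"): 5, ("u0", "v1"): 4,
    ("u1", "v0"): 3, ("u1", "v1"): 2,
    ("u2", "v2"): 1,
}
print(extract_enclosing_subgraph(toy_ratings, "u0", "v0", h=1))
```

The same function can be applied to a test pair, whose edge is simply absent from `ratings`.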

### 3.2 Node labeling

The second part of IGMC is node labeling. Before we feed an enclosing subgraph to the GNN, we first apply a node labeling to it, which gives an integer label to every node in the subgraph. The purpose is to use different labels to mark nodes’ different roles in a subgraph. Ideally, our node labeling should be able to: 1) distinguish the target user and target item between which the target rating is located, and 2) differentiate user-type nodes from item-type nodes. Otherwise, the GNN cannot tell for which user and item it should predict the rating, and might lose node-type information. To satisfy these conditions, we propose the following node labeling. We first give labels 0 and 1 to the target user and target item, respectively. Then, we determine other nodes’ labels according to the hop at which they are included in the subgraph in Algorithm 1. If a user-type node is included at the i^{\text{th}} hop, we give it label 2i. If an item-type node is included at the i^{\text{th}} hop, we give it label 2i+1. Such a node labeling can sufficiently discriminate: 1) target nodes from “context” nodes, 2) users from items (users always have even labels), and 3) nodes of different distances to the target rating.

Note that this is not the only possible way of node labeling, but we empirically verified its excellent performance. The one-hot encoding of these node labels will be treated as the initial node features of the subgraph when fed to the GNN. Note that our node labels are determined completely inside each enclosing subgraph, thus are independent of the global bipartite graph. Given a new enclosing subgraph, we can as well predict its rating even if all of its nodes are from a different bipartite graph. This is because IGMC uses pure graph patterns within local enclosing subgraphs to predict ratings without leveraging any global information specific to the bipartite graph. Our node labeling is also different from using the global node IDs as in GC-MC (berg2017graph). Using one-hot encoding of global IDs is essentially transforming the first message passing layer’s parameters into latent node embedding associated with each particular ID (equivalent to an embedding layer). Such a model cannot generalize to nodes whose IDs are out of range, thus is transductive.
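The labeling rule (2i for users and 2i+1 for items included at hop i, with hop 0 for the two target nodes) and its one-hot encoding can be sketched as below. The function names and the `("user", id)` / `("item", id)` node keys are illustrative assumptions.

```python
def node_labels(user_hop, item_hop):
    """Assign labels from hop indices: users get 2*i, items get 2*i + 1.

    user_hop / item_hop: dicts mapping node -> hop at which it was included
    (hop 0 for the target user and target item, giving labels 0 and 1).
    """
    labels = {("user", a): 2 * i for a, i in user_hop.items()}
    labels.update({("item", b): 2 * i + 1 for b, i in item_hop.items()})
    return labels


def one_hot(label, num_labels):
    """One-hot encode an integer label as the initial node feature vector."""
    vec = [0.0] * num_labels
    vec[label] = 1.0
    return vec


h = 1                      # number of hops, so labels range over 0 .. 2h + 1
labels = node_labels({"u0": 0, "u1": 1}, {"v0": 0, "v1": 1})
features = {node: one_hot(lab, 2 * h + 2) for node, lab in labels.items()}
print(labels)
print(features[("item", "v1")])
```

Because the labels depend only on positions inside the subgraph, the feature dimension is 2h+2 regardless of how many users or items the global graph contains.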

### 3.3 Graph neural network architecture

The third part of IGMC is to train a graph neural network (GNN) model predicting ratings from the enclosing subgraphs. In previous node-based approaches such as GC-MC, a node-level GNN is applied to the entire bipartite graph to extract node embeddings. Then, the node embeddings of u and v are input to an inner-product or bilinear operator to reconstruct the rating on (u,v). In contrast, here we apply a graph-level GNN to the enclosing subgraph around (u,v) to directly map the subgraph to the rating. There are thus two components in our GNN: 1) message passing layers that extract a feature vector for each node in the subgraph, and 2) a pooling layer to summarize a subgraph representation from node features.

To handle different edge types, we adopt the relational graph convolutional operator (R-GCN) (schlichtkrull2018modeling) as our GNN’s message passing layers, which has the following form:

$$\bm{\mathrm{x}}^{l+1}_{i}=\bm{\mathrm{W}}^{l}_{0}\bm{\mathrm{x}}^{l}_{i}+\sum_{r\in\mathcal{R}}\sum_{j\in\mathcal{N}_{r}(i)}\frac{1}{|\mathcal{N}_{r}(i)|}\bm{\mathrm{W}}^{l}_{r}\bm{\mathrm{x}}^{l}_{j}, \tag{1}$$

where \bm{\mathrm{x}}^{l}_{i} denotes node i’s feature vector at layer l, \bm{\mathrm{W}}^{l}_{0} and \{\bm{\mathrm{W}}^{l}_{r}|r\in\mathcal{R}\} are learnable parameter matrices. Since neighbors j connected to i with different edge types r are processed by different parameter matrices \bm{\mathrm{W}}^{l}_{r}, we are able to learn from the rich graph patterns inside the edge types, such as the average rating the target user gives to items, the average rating the target item receives, and by which paths the two target nodes are connected, etc. We stack L message passing layers with tanh activations between two layers. Node i’s feature vectors from different layers are concatenated as its final representation \bm{\mathrm{h}}_{i}:

$$\bm{\mathrm{h}}_{i}=\text{concat}(\bm{\mathrm{x}}^{1}_{i},\bm{\mathrm{x}}^{2}_{i},\ldots,\bm{\mathrm{x}}^{L}_{i}). \tag{2}$$
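The message passing rule (1) and the layer-wise concatenation (2) can be illustrated with a small dependency-free sketch. It is not the pytorch_geometric implementation; the data layout is hypothetical, and applying tanh to the output of every layer (including the last) before concatenation is an assumption made here.

```python
import math


def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def rgcn_layer(x, neighbors, W0, W_r):
    """One R-GCN layer following Eq. (1).

    x: dict node -> feature list.
    neighbors: dict node -> {rating r: list of neighbors connected by edge type r}.
    W0: self-connection matrix; W_r: dict rating -> matrix for that edge type.
    """
    out = {}
    for i, xi in x.items():
        h = matvec(W0, xi)
        for r, nbrs in neighbors.get(i, {}).items():
            for j in nbrs:
                msg = matvec(W_r[r], x[j])
                # Mean over the neighbors of i that share edge type r.
                h = [a + m / len(nbrs) for a, m in zip(h, msg)]
        out[i] = h
    return out


def rgcn_forward(x, neighbors, layers):
    """Stack layers and concatenate each node's per-layer states, as in Eq. (2).

    layers: list of (W0, W_r) pairs, one per message passing layer.
    """
    per_layer = []
    for W0, W_r in layers:
        x = {i: [math.tanh(a) for a in h]
             for i, h in rgcn_layer(x, neighbors, W0, W_r).items()}
        per_layer.append(x)
    return {i: [a for layer in per_layer for a in layer[i]] for i in x}


identity = [[1.0, 0.0], [0.0, 1.0]]
double = [[2.0, 0.0], [0.0, 2.0]]
x0 = {"u": [1.0, 0.0], "v": [0.0, 1.0]}
nbrs = {"u": {5: ["v"]}, "v": {5: ["u"]}}
print(rgcn_layer(x0, nbrs, identity, {5: double}))
print(rgcn_forward(x0, nbrs, [(identity, {5: double})] * 2))
```

Keeping one matrix per rating value is what lets the layer treat a 1-star edge differently from a 5-star edge.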

Next, we pool the node representations into a graph-level feature vector. There are many choices such as summing, averaging, SortPooling (zhang2018end), DiffPooling (ying2018hierarchical), etc. In this work, however, we use a different pooling layer which concatenates the final representations of only the target user and item as the graph representation:

$$\bm{\mathrm{g}}=\text{concat}(\bm{\mathrm{h}}_{u},\bm{\mathrm{h}}_{v}), \tag{3}$$

where we use \bm{\mathrm{h}}_{u} and \bm{\mathrm{h}}_{v} to denote the final representations of the target user and target item, respectively. Our particular choice is due to the extra importance that these two target nodes carry compared to other context nodes. Although this pooling layer is very simple, we empirically verified that it performs better than summing and other advanced pooling layers on our matrix completion tasks.

After getting the final graph representation, we use an MLP to output the predicted rating:

$$\hat{r}=\bm{\mathrm{w}}^{\top}\sigma(\bm{\mathrm{W}}\bm{\mathrm{g}}), \tag{4}$$

where \bm{\mathrm{W}} and \bm{\mathrm{w}} are parameters of the MLP which map the graph representation \bm{\mathrm{g}} to a scalar rating \hat{r}, and \sigma is an activation function (we take ReLU in this paper).
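Equations (3) and (4) amount to concatenating the two target nodes' representations and passing the result through a one-hidden-layer MLP with ReLU. The sketch below illustrates this in plain Python; the function name `predict_rating` and the toy weights are assumptions, and bias terms are omitted because Eq. (4) does not show any.

```python
def predict_rating(h, target_user, target_item, W, w):
    """Pool the two target nodes (Eq. 3) and apply the MLP (Eq. 4).

    h: dict node -> final representation (list of floats).
    W: hidden-layer weight matrix (list of rows); w: output weight vector.
    """
    g = h[target_user] + h[target_item]           # list concatenation, Eq. (3)
    hidden = [max(0.0, sum(a * b for a, b in zip(row, g))) for row in W]  # ReLU
    return sum(a * b for a, b in zip(w, hidden))  # scalar rating, Eq. (4)


h = {"u": [1.0, -1.0], "v": [2.0, 0.0], "context": [9.0, 9.0]}
W = [[1.0, 0.0, 1.0, 0.0],
     [0.0, 1.0, 0.0, 0.0]]
w = [2.0, 5.0]
print(predict_rating(h, "u", "v", W, w))
```

Note that the context node's representation is not read directly; its influence reaches the prediction only through message passing into the two target nodes.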

### 3.4 Model training

Loss function. We minimize the mean squared error (MSE) between the predictions and the ground truth ratings:

$$\mathcal{L}=\frac{1}{|\{(u,v)|\bm{\mathrm{\Omega}}_{u,v}=1\}|}\sum_{(u,v):\bm{\mathrm{\Omega}}_{u,v}=1}(r_{u,v}-\hat{r}_{u,v})^{2}, \tag{5}$$

where we use r_{u,v} and \hat{r}_{u,v} to denote the true rating and predicted rating of (u,v), respectively, and \bm{\mathrm{\Omega}} is a 0/1 mask matrix indicating the observed entries of the rating matrix \bm{\mathrm{R}}.
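As a small worked instance of the masked MSE in Eq. (5), the sketch below averages squared errors over observed entries only. The dense list-of-lists layout and the name `masked_mse` are illustrative assumptions.

```python
def masked_mse(R, R_hat, Omega):
    """Mean squared error over entries where the mask Omega equals 1 (Eq. 5)."""
    total, count = 0.0, 0
    for row_r, row_p, row_m in zip(R, R_hat, Omega):
        for r, p, m in zip(row_r, row_p, row_m):
            if m == 1:
                total += (r - p) ** 2
                count += 1
    if count == 0:
        raise ValueError("Omega marks no observed entries")
    return total / count


R = [[5.0, 0.0], [0.0, 3.0]]       # unobserved entries hold placeholder zeros
R_hat = [[4.0, 2.0], [1.0, 3.0]]   # predictions for every entry
Omega = [[1, 0], [0, 1]]           # only the diagonal entries are observed
print(masked_mse(R, R_hat, Omega))
```

Predictions at unobserved positions do not affect the loss, however far they are from the placeholder values.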

Adjacent rating regularization. The R-GCN layer (1) used in our GNN has different parameters \bm{\mathrm{W}}_{r} for different rating types. One drawback here is that it fails to take the magnitude of ratings into consideration. For instance, a rating of 4 and a rating of 5 in MovieLens both indicate that the user likes the movie, while a rating of 1 indicates that the user does not like the movie. Ideally, we expect our model to be aware of the fact that a rating of 4 is more similar to a rating of 5 than a rating of 1 is. In R-GCN, however, ratings 1, 4 and 5 are only treated as three independent edge types – the magnitude and order information of the ratings is completely lost. To fix that, we propose an adjacent rating regularization (ARR) technique, which encourages ratings adjacent to each other to have similar parameter matrices. Assume the ratings in \mathcal{R} exhibit an ordering r_{1},r_{2},\ldots,r_{|\mathcal{R}|} which indicates increasingly higher preference that users have for items. Then, the ARR regularizer is:

$$\mathcal{L}_{\text{ARR}}=\sum_{i=1,2,\ldots,|\mathcal{R}|-1}\lVert\bm{\mathrm{W}}_{r_{i+1}}-\bm{\mathrm{W}}_{r_{i}}\rVert^{2}_{F}, \tag{6}$$

where \lVert\cdot\rVert_{F} denotes the Frobenius norm of a matrix. The above regularizer restrains the parameter matrices of adjacent ratings from differing too much, which not only regularizes the model parameters, but also helps the optimization of infrequent ratings by transferring knowledge from their adjacent ratings. The final loss function is given by:

$$\mathcal{L}_{\text{final}}=\mathcal{L}+\lambda\mathcal{L}_{\text{ARR}}, \tag{7}$$

where \lambda trades off the importance of the MSE loss against the ARR regularizer.
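The ARR regularizer (6) and the final loss (7) can be sketched as follows for a single layer's rating-specific matrices. Ordering the ratings by ascending numeric value is an assumption that matches the stated preference ordering for numeric ratings; the function names are hypothetical.

```python
def frobenius_sq(A, B):
    """Squared Frobenius norm of A - B for two same-shaped matrices."""
    return sum((a - b) ** 2 for ra, rb in zip(A, B) for a, b in zip(ra, rb))


def arr_regularizer(W_by_rating):
    """Eq. (6): sum of squared differences between adjacent ratings' matrices.

    W_by_rating: dict rating -> parameter matrix (list of rows).
    Ratings are ordered by ascending value (assumed preference ordering).
    """
    ordered = [W_by_rating[r] for r in sorted(W_by_rating)]
    return sum(frobenius_sq(ordered[i + 1], ordered[i])
               for i in range(len(ordered) - 1))


def final_loss(mse, W_by_rating, lam=0.001):
    """Eq. (7): MSE loss plus lambda times the ARR regularizer."""
    return mse + lam * arr_regularizer(W_by_rating)


W = {1: [[0.0, 0.0]], 2: [[1.0, 0.0]], 3: [[1.0, 2.0]]}
print(arr_regularizer(W))
print(final_loss(0.5, W, lam=0.1))
```

Only consecutive ratings are tied together, so the matrices for ratings 1 and 3 are coupled indirectly through rating 2.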

## 4 Graph-level GNN vs. node-level GNN

Compared to previous graph matrix completion approaches such as PinSage and GC-MC, one important difference of IGMC is that it uses a graph-level GNN to map the enclosing subgraph around the target user and item to their rating (left figure (a)), instead of using a node-level GNN on the bipartite graph G to learn target user’s and item’s embeddings and use the node embeddings to get the rating (left figure (b)). One drawback of the latter node-based approach is that the learned node embeddings are essentially encoding the rooted subtrees around the two nodes independently, which loses the interactions and correspondences between the nodes of the two trees. For example, from the two subtrees of the left figure (b) we do not really know whether the two target nodes are just isolated from each other like in (b) or actually densely connected like in (a); these two cases both look identical to a node-based approach.

In comparison, a graph-level GNN can discriminate the two cases through a sufficient number of message passing rounds. Since the learning is confined to the subgraph, stacking multiple graph convolution layers will learn more and more refined local structural features which can adequately discriminate up to all subgraphs that the Weisfeiler-Lehman algorithm can discriminate (xu2018powerful). However, for node-based approaches, since there is no subgraph boundary, stacking multiple graph convolutions will only extend the convolution range to unrelated distant nodes and push all nodes in the bipartite graph to have similar embeddings. This is why previous node-based approaches typically only use one or two message passing layers (berg2017graph; ying2018graph).

## 5 Experiments

We conduct experiments on five common matrix completion datasets: Flixster (jamali2010matrix), Douban (ma2011recommender), YahooMusic (dror2011yahoo), MovieLens-100K and MovieLens-1M (miller2003movielens). For ML-100K, we train and evaluate on the canonical u1.base/u1.test train/test split. For ML-1M, we randomly split it into 90% and 10% train/test sets. For Flixster, Douban and YahooMusic we use the preprocessed subsets and splits provided by (monti2017geometric). Dataset statistics are summarized in Table 1. We implemented IGMC using pytorch_geometric (Fey and Lenssen, 2019). We tuned model hyperparameters based on cross validation results on ML-100K, and used them across all datasets. The final architecture uses 4 R-GCN layers with 32, 32, 32, 32 hidden dimensions. Basis decomposition with 4 bases is used to reduce the number of parameters in \bm{\mathrm{W}}_{r} (schlichtkrull2018modeling). The final MLP has 128 hidden units and a dropout rate of 0.5. We use 1-hop enclosing subgraphs for all datasets, and find them sufficiently good. We find that using 2 or more hops can slightly improve performance but takes much longer to train. For each enclosing subgraph, we randomly drop out its adjacency matrix entries with a probability of 0.2 during training. We set the \lambda in (7) to 0.001. We train our model using the Adam optimizer (kingma2014adam) with a batch size of 50 and an initial learning rate of 0.001, and multiply the learning rate by 0.1 every 20 epochs for ML-1M, and every 50 epochs for all other datasets. Our code is publicly available at https://github.com/muhanzhang/IGMC.
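For reference, the hyperparameters stated above can be collected into a configuration, together with the step learning-rate schedule and the edge dropout. This is a sketch only: the key names, the zero-based epoch indexing, and the helper functions are assumptions and are not taken from the released code.

```python
import random

# Values are those stated in the text; the key names are hypothetical.
CONFIG = {
    "hidden_dims": [32, 32, 32, 32],   # 4 R-GCN layers
    "num_bases": 4,                    # basis decomposition for W_r
    "mlp_hidden_units": 128,
    "mlp_dropout": 0.5,
    "num_hops": 1,
    "edge_dropout": 0.2,
    "arr_lambda": 0.001,
    "batch_size": 50,
    "initial_lr": 0.001,
    "lr_decay_factor": 0.1,
    "lr_decay_every": {"ML-1M": 20, "default": 50},
}


def learning_rate(epoch, dataset="default"):
    """Step schedule: multiply the rate by 0.1 every `step` epochs (epoch starts at 0)."""
    steps = CONFIG["lr_decay_every"]
    step = steps.get(dataset, steps["default"])
    return CONFIG["initial_lr"] * CONFIG["lr_decay_factor"] ** (epoch // step)


def drop_edges(edges, p, rng):
    """Randomly drop each subgraph edge with probability p (training only)."""
    return [e for e in edges if rng.random() >= p]


rng = random.Random(0)
print(learning_rate(0), learning_rate(50), learning_rate(20, "ML-1M"))
print(len(drop_edges(list(range(1000)), CONFIG["edge_dropout"], rng)))
```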

Table 1: Dataset statistics.

Dataset | Users | Items | Ratings | Density | Rating types |
---|---|---|---|---|---|
Flixster | 3,000 | 3,000 | 26,173 | 0.0029 | 0.5, 1, 1.5, …, 5 |
Douban | 3,000 | 3,000 | 136,891 | 0.0152 | 1, 2, 3, 4, 5 |
YahooMusic | 3,000 | 3,000 | 5,335 | 0.0006 | 1, 2, 3, …, 100 |
ML-100K | 943 | 1,682 | 100,000 | 0.0630 | 1, 2, 3, 4, 5 |
ML-1M | 6,040 | 3,706 | 1,000,209 | 0.0447 | 1, 2, 3, 4, 5 |

Table 2: RMSE results on Flixster, Douban and YahooMusic.

Model | Inductive | Content | Flixster | Douban | YahooMusic |
---|---|---|---|---|---|
GRALS | no | yes | 1.245 | 0.833 | 38.0 |
sRGCNN | no | yes | 0.926 | 0.801 | 22.4 |
GC-MC | no | yes | 0.917 | 0.734 | 20.5 |
IGC-MC | yes | yes | 0.999\pm0.062 | 0.990\pm0.082 | 21.3\pm0.989 |
F-EAE | yes | no | 0.908 | 0.738 | 20.0 |
PinSage | yes | yes | 0.954\pm0.005 | 0.739\pm0.002 | 22.9\pm0.629 |
IGMC (ours) | yes | no | 0.872\pm0.001 | 0.721\pm0.001 | 19.1\pm0.138 |

### 5.1 Flixster, Douban and YahooMusic

For these three datasets, we compare our IGMC with GRALS (rao2015collaborative), sRGCNN (monti2017geometric), GC-MC (berg2017graph), F-EAE (hartford2018deep), and PinSage (ying2018graph). Among them, GRALS is a graph regularized matrix completion algorithm. GC-MC and sRGCNN are transductive node-level-GNN-based matrix completion methods. F-EAE uses a factorized exchangeable autoencoder to perform permutation-equivariant operations on the matrix to reconstruct the ratings, which is an inductive model without using content, similar to our IGMC. PinSage is an inductive node-level-GNN-based model using content, which was originally used to predict related pins and is adapted to predicting ratings here. We further implemented an inductive GC-MC model (IGC-MC) which replaces the one-hot encoding of node IDs with the content features to make it inductive. The content in these datasets is presented in the form of user and item graphs. We summarize whether each algorithm is inductive and whether it uses content in Table 2.

We train our model for 40 epochs, and save the model parameters every 10 epochs. The final predictions are given by averaging the predictions from epochs 10, 20, 30 and 40. We repeat the experiment five times and report the average RMSEs. The baseline results are taken from (hartford2018deep). Table 2 shows the results. Our model achieves the smallest RMSEs on all three datasets without using any content, significantly outperforming all the compared baselines, regardless of whether they are transductive or inductive. Further, except F-EAE, all the baselines have used content information to assist the matrix completion. This further highlights IGMC’s great performance advantages without relying on content.
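The checkpoint-averaging step described above can be sketched as follows: predictions from the saved epochs are averaged per test pair, and the RMSE is computed on the averaged predictions. The function names and toy numbers are illustrative assumptions.

```python
import math


def ensemble_predictions(preds_by_epoch):
    """Average per-example predictions across saved checkpoints.

    preds_by_epoch: dict epoch -> list of predictions in a fixed test order.
    """
    runs = list(preds_by_epoch.values())
    n = len(runs[0])
    return [sum(run[i] for run in runs) / len(runs) for i in range(n)]


def rmse(y_true, y_pred):
    """Root mean squared error between true and predicted ratings."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


# Toy predictions for two test pairs at epochs 10, 20, 30 and 40.
preds = {10: [4.0, 2.0], 20: [5.0, 2.0], 30: [3.0, 2.0], 40: [4.0, 2.0]}
averaged = ensemble_predictions(preds)
print(averaged, rmse([5.0, 2.0], averaged))
```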

Table 3 (ML-100K): RMSE results.

Model | Inductive | Content | ML-100K |
---|---|---|---|
MC | no | no | 0.973 |
IMC | no | yes | 1.653 |
GMC | no | yes | 0.996 |
GRALS | no | yes | 0.945 |
sRGCNN | no | yes | 0.929 |
GC-MC | no | yes | 0.905 |
IGC-MC | yes | yes | 1.142 |
F-EAE | yes | no | 0.920 |
PinSage | yes | yes | 0.951 |
IGMC | yes | no | 0.905 |

Table 3 (ML-1M): RMSE results.

Model | Inductive | Content | ML-1M |
---|---|---|---|
PMF | no | no | 0.883 |
I-RBM | no | no | 0.854 |
NNMF | no | no | 0.843 |
I-AutoRec | no | no | 0.831 |
CF-NADE | no | no | 0.829 |
GC-MC | no | no | 0.832 |
IGC-MC | yes | yes | 1.259 |
F-EAE | yes | no | 0.860 |
PinSage | yes | yes | 0.906 |
IGMC | yes | no | 0.857 |

### 5.2 ML-100K and ML-1M

We further conduct experiments on the MovieLens datasets. Side information is present for both users (age, gender, occupation, etc.) and movies (genres). For ML-100K, we compare against matrix completion (MC) (candes2009exact), inductive matrix completion (IMC) (jain2013provable), geometric matrix completion (GMC) (kalofolias2014matrix), as well as GRALS, sRGCNN, GC-MC, F-EAE and PinSage. We train IGMC for 80 epochs and report the ensemble performance of epochs 50, 60, 70 and 80. For ML-1M, besides the baselines GC-MC, F-EAE and PinSage, we further include state-of-the-art algorithms including PMF (mnih2008probabilistic), I-RBM (salakhutdinov2007restricted), NNMF (dziugaite2015neural), I-AutoRec (sedhain2015autorec) and CF-NADE (zheng2016neural). We train IGMC for 40 epochs and report the ensemble performance of epochs 25, 30, 35 and 40. The experiments are repeated five times and the average results are reported in Table 3 (standard deviations are less than 0.001). As we can see, IGMC achieves the best performance on ML-100K, on par with GC-MC, even though IGMC is an inductive model while GC-MC is transductive and additionally uses content information. For ML-1M, IGMC cannot catch up with state-of-the-art transductive models such as CF-NADE and GC-MC, but outperforms the other inductive models. We will analyze this dataset further in Section 5.3.

### 5.3 Sparse rating matrix analysis

To gain insight into when inductive graph-based matrix completion is more suitable than transductive methods, we compare IGMC with GC-MC on ML-1M under different sparsity levels of the rating matrix. We sequentially decrease the sparsity ratio by randomly keeping only 0.2, 0.1, 0.05, 0.01, and 0.001 of the original training ratings. Then, we train both models on the sparsified rating matrices, and evaluate on the original test set. Figure 2 shows the results. As we can see, although IGMC falls behind GC-MC initially with full ratings, it starts to perform better after the sparsity ratio is less than 20%. The advantage becomes even greater under extremely sparse cases. This seems to indicate that IGMC is a better choice than transductive methods when there is not a large amount of training data, which is particularly suitable for the initial rating collection phase of a recommender system. It also suggests that transductive matrix completion relies more on the dense user-item interactions than inductive graph-based matrix completion does.
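The sparsification protocol (randomly keeping a fixed fraction of the training ratings while leaving the test set untouched) can be sketched as below. The function name `sparsify`, the seeding, and the rounding rule are assumptions; the text only specifies the keep ratios.

```python
import random


def sparsify(train_ratings, keep_ratio, seed=0):
    """Randomly keep a fraction of the training ratings.

    train_ratings: dict mapping (user, item) -> rating.
    keep_ratio: fraction to keep, e.g. 0.2, 0.1, 0.05, 0.01 or 0.001.
    At least one rating is always kept (an assumption for tiny ratios).
    """
    rng = random.Random(seed)
    keys = sorted(train_ratings)                  # fixed order for reproducibility
    k = max(1, round(keep_ratio * len(keys)))
    kept = rng.sample(keys, k)
    return {key: train_ratings[key] for key in kept}


# Toy training set with 1,000 ratings.
train = {(u, v): 3 for u in range(50) for v in range(20)}
for ratio in (0.2, 0.1, 0.05, 0.01, 0.001):
    print(ratio, len(sparsify(train, ratio)))
```

Both models would then be trained on each sparsified dictionary and evaluated on the original test split.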

### 5.4 Transfer learning

A great advantage of an inductive model is its potential for transferring to other tasks. We conduct a transfer learning experiment by applying the IGMC model trained on ML-100K to Flixster, Douban and YahooMusic. Among the three datasets, only Douban has exactly the same rating types as ML-100K (1,2,3,4,5). Thus for Flixster and YahooMusic, we bin their edge types into groups 1 to 5 before feeding them into the ML-100K model, and multiply the YahooMusic predictions by 20 to account for the different scales. Despite all the compromises, the transferred IGMC model achieves excellent performance (Table 4). We also show the transfer learning results of two other inductive models, IGC-MC and F-EAE. Note that an inductive model using content features (such as PinSage) is not transferable, due to the different feature spaces between MovieLens and the target datasets. Thus for IGC-MC, we replace its content features with node degrees. As we can see, IGMC outperforms the other two models by large margins in terms of transfer learning ability. Furthermore, the transferred IGMC even outperforms a wide range of baselines trained specifically on each dataset (Table 2).
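The text does not specify how the Flixster and YahooMusic rating types are binned into groups 1 to 5, so the following sketch assumes equal-width bins with ceiling rounding. Only the factor of 20 for YahooMusic predictions comes from the text; the function names and the binning rule are hypothetical.

```python
import math


def bin_flixster(rating):
    """Map Flixster ratings 0.5, 1, ..., 5 to edge types 1..5 (assumed rule: ceiling)."""
    return int(math.ceil(rating))


def bin_yahoomusic(rating):
    """Map YahooMusic ratings 1..100 to edge types 1..5 (assumed rule: width-20 bins)."""
    return int(math.ceil(rating / 20))


def rescale_yahoomusic_prediction(prediction):
    """Scale a prediction on the 1..5 range back to the YahooMusic range."""
    return prediction * 20


print([bin_flixster(r) for r in (0.5, 1, 3.5, 5)])
print([bin_yahoomusic(r) for r in (1, 20, 21, 100)])
print(rescale_yahoomusic_prediction(4.5))
```

Binning only changes the edge types seen by the transferred model; the evaluation targets remain on each dataset's original scale.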

Table 4: RMSE results of transferring models trained on ML-100K to Flixster, Douban and YahooMusic.

Model | Inductive | Content | Flixster | Douban | YahooMusic |
---|---|---|---|---|---|
IGC-MC | yes | no | 1.290 | 1.144 | 25.7 |
F-EAE | yes | no | 0.987 | 0.766 | 23.3 |
IGMC (ours) | yes | no | 0.906 | 0.759 | 20.1 |

### 5.5 Visualization

Finally, we visualize 10 testing enclosing subgraphs with the highest and lowest predicted ratings for Flixster, Douban, YahooMusic, and ML-100K, respectively, in Figure 3. As we can see, there are substantially different patterns between high-score and low-score subgraphs, which is why IGMC can predict ratings merely from these subgraphs. For example, high-score subgraphs typically show both high user average rating and high item average rating, while low-score subgraphs often have mixed ratings from non-target users and have low user average rating.

## 6 Conclusion

In this paper, we have proposed Inductive Graph-based Matrix Completion (IGMC). Instead of learning transductive latent features, IGMC learns local graph patterns related to ratings inductively based on graph neural networks. We show that IGMC has highly competitive performance compared to transductive matrix completion baselines. In addition, IGMC is transferable to new tasks without any retraining, a property much desired in recommendation tasks that have little training data. We believe IGMC will open a new direction for matrix completion and recommender systems.