Fusing Similarity Models with Markov Chains for Sparse Sequential Recommendation
Abstract
Predicting personalized sequential behavior is a key task for recommender systems. In order to predict user actions such as the next product to purchase, movie to watch, or place to visit, it is essential to take into account both long-term user preferences and sequential patterns (i.e., short-term dynamics). Matrix Factorization and Markov Chain methods have emerged as two separate but powerful paradigms for modeling the two respectively. Combining these ideas has led to unified methods that accommodate long- and short-term dynamics simultaneously by modeling pairwise user-item and item-item interactions.
In spite of the success of such methods for tackling dense data, they are challenged by sparsity issues, which are prevalent in real-world datasets. In recent years, similarity-based methods have been proposed for (sequentially-unaware) item recommendation with promising results on sparse datasets. In this paper, we propose to fuse such methods with Markov Chains to make personalized sequential recommendations. We evaluate our method, Fossil, on a variety of large, real-world datasets. We show quantitatively that Fossil outperforms alternative algorithms, especially on sparse datasets, and qualitatively that it captures personalized dynamics and is able to make meaningful recommendations.
1 Introduction
Modeling and understanding the interactions between users and items, as well as the relationships amongst the items themselves, are the core tasks of a recommender system. The former helps answer questions like ‘What kind of item does this specific user like?’ (item-to-user recommendation), and the latter ‘Which type of shirts match the pants just purchased?’ (item-to-item recommendation). In other words, (long-term) user preferences and (short-term) sequential patterns are captured by the above two forms of interactions respectively.
In this paper, we are interested in predicting personalized sequential behavior from collaborative data (e.g. purchase histories of users), which is challenging as long- and short-term dynamics need to be combined carefully to account for both personalization and sequential transitions. This challenge is further complicated by sparsity issues in many real-world datasets, which makes it hard to estimate parameters accurately from limited training sequences. In particular, this challenge is not addressed by models concerned with historical temporal dynamics (e.g. the popularity fluctuation of Harry Potter from 2002 to 2006), where user-level sequential patterns are typically ignored (e.g. ‘What will Tom watch next after watching Harry Potter?’).
To model user preferences, there have been two relevant streams of work. Traditional item recommendation algorithms are typically based on a low-rank factorization of the user-item interaction matrix, referred to as Matrix Factorization [1]. Each user or item is represented with a numerical vector of the same dimension such that the compatibility between them is estimated by the inner product of their respective representations. Recently, an item similarity-based algorithm, Factored Item Similarity Models (FISM), has been developed which makes recommendations to a user $u$ exclusively based on how similar items are to those already consumed/liked by $u$. In spite of not explicitly parameterizing each user, FISM surprisingly outperforms various competing baselines, including Matrix Factorization, especially on sparse datasets [2].
The above methods are unaware of sequential dynamics. In order to tackle sequential prediction tasks, we need to resort to alternate methods, such as Markov Chains, that are able to capture sequential patterns. To this end, Rendle et al. proposed Factorized Personalized Markov Chains (FPMC), which combines Matrix Factorization and a first-order Markov Chain [3] to model personalized sequential behavior.
Despite the success achieved by FPMC, it suffers from sparsity issues and the long-tailed distribution of many datasets, so that the sequential prediction task is only partially solved. In this paper, we propose to fill this gap by fusing similarity-based methods (like FISM) with Markov Chain methods (like FPMC) to tackle sparse real-world datasets with sequential dynamics. The resulting method, FactOrized Sequential Prediction with Item SImilarity ModeLs (or Fossil for short), naturally combines the two by learning a personalized weighting scheme over the sequence of items to characterize users in terms of both preferences and the strength of sequential behavior. Figure 1 demonstrates an example of how Fossil makes recommendations.
Fossil brings the following benefits for tackling sparsity issues: (1) It parameterizes each user with only the historical items so that cold-user (or ‘cool-user’) issues can be alleviated so long as the representations of items can be estimated accurately. (2) For cold users, Fossil can shift more weight to short-term dynamics to capitalize on ‘global’ sequential patterns. This flexibility enables Fossil to make reasonable predictions even though only a few actions may have been observed for a given user.
Our contributions are summarized as follows: First, we develop a new method, Fossil, that integrates similarity-based methods with Markov Chains smoothly to make personalized sequential predictions on sparse and long-tailed datasets. Second, we demonstrate quantitatively that Fossil is able to outperform a spectrum of state-of-the-art algorithms on a variety of large, real-world datasets with around five million user actions in total. Finally, we visualize the learned model and analyze the sequential and personalized dynamics captured.
Data and code are available at https://sites.google.com/a/eng.ucsd.edu/ruininghe/.
2 Related Work
The most closely related works to ours are (1) Item recommendation methods that model user preferences but are unaware of sequential dynamics; (2) Works that deal with temporal dynamics but rely on explicit time stamps; and (3) Those that address the sequential prediction task we are interested in.
Item recommendation usually relies on Collaborative Filtering (CF) to learn from explicit feedback like star-ratings [1]. CF predicts based only on the user-item rating matrix and mainly follows two paradigms: neighborhood- and model-based. Neighborhood-based methods recommend items that either have been enjoyed by like-minded users (user-oriented, e.g. [4]) or are similar to those already consumed (item-oriented, e.g. [7]). Such methods have to rely on some type of predefined similarity metric such as Pearson Correlation or Cosine Similarity. In contrast, model-based methods directly explain the interactions between users and items. There have been a variety of such algorithms including Bayesian methods [9], Restricted Boltzmann Machines [11], Matrix Factorization (MF) methods (the basis of many state-of-the-art recommendation approaches such as [12], [13], [14]), and so on.
In order to tackle implicit feedback data where only positive signals (e.g. purchases, clicks, thumbs-up) are observed, both neighborhood- and model-based methods have been extended. Recently, Ning et al. proposed SLIM to learn an item-item similarity matrix, which has been shown to outperform a series of state-of-the-art recommendation approaches [15]. Kabbur et al. further explored the low-rank property of the similarity matrix to handle sparse datasets [2]. Since similarity (or neighborhood) relationships are learned from the data, these methods overcome the rigidity of using a predefined similarity metric. On the other hand, MF has also been extended in several ways including point-wise methods that inherently assume non-observed feedback to be negative [16], and pairwise methods like BPRMF [18] that are based on a more realistic assumption that positive feedback should only be ‘more preferable’ than non-observed feedback.
Several works take temporal dynamics into account, mostly based on MF techniques [19]. This includes the seminal work of Koren [20], which showed state-of-the-art results on Netflix data by modeling the evolution of users and items over time. However, such works are ultimately building models to understand past actions (e.g. ‘What did Tom like in 2008?’, ‘What does Grace like to do on weekends?’), by making use of the explicit time stamps. The sequential prediction task differs from theirs in that it does not use time stamps directly, but rather models sequential relationships between actions.
Markov Chains have demonstrated their strength at modeling stochastic transitions, from uncovering sequential patterns (e.g. [22]) to directly modeling decision processes [24]. For the sequential prediction/recommendation task, Rendle et al. proposed FPMC, which combines the power of MF at modeling personal preferences and the strength of Markov Chains at modeling sequential patterns [3]. Our work follows this thread but contributes in that (1) we make use of a similarity-based method for modeling user preferences so that sparsity issues are mitigated; and (2) we further consider Markov Chains with higher orders to model sequential smoothness across multiple time steps.
3 Sequential Prediction
3.1 Problem Formulation and Notation
Objects to be recommended in the system are referred to as items. The most common recommendation approaches focus on modeling types of items of interest to each user, without accounting for any sequential information, e.g., the last item purchased/reviewed, or place visited by the user in question.
We are tackling sequential prediction tasks which are formulated as follows. Let $\mathcal{U}$ denote the set of users and $\mathcal{I}$ the set of items. Each user $u$ is associated with a sequence of actions (or ‘events’) $\mathcal{S}^u$ (e.g. items purchased by $u$, or places $u$ has checked in): $\mathcal{S}^u = (\mathcal{S}^u_1, \mathcal{S}^u_2, \ldots, \mathcal{S}^u_{|\mathcal{S}^u|})$, where $\mathcal{S}^u_t \in \mathcal{I}$. We use $\mathcal{I}^+_u$ to denote the set of items in $\mathcal{S}^u$ where the sequential signal is ignored. Using the above data exclusively, our objective is to predict the next action of each user and thus make recommendations accordingly. Notation used throughout this paper is summarized in the table below. Boldfaced symbols are used to denote matrices.
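Notation  Explanation
$\mathcal{U}$, $\mathcal{I}$  user set, item set
$u$, $i$, $t$  a specific user, item, time step
$\mathcal{S}^u_t$  the item user $u$ interacted with at time step $t$
$\mathcal{S}^u$  action sequence of user $u$, $\mathcal{S}^u = (\mathcal{S}^u_1, \ldots, \mathcal{S}^u_{|\mathcal{S}^u|})$
$\mathcal{I}^+_u$  the set of items in $\mathcal{S}^u$, $\mathcal{I}^+_u = \{\mathcal{S}^u_1, \ldots, \mathcal{S}^u_{|\mathcal{S}^u|}\}$
$\beta_i$  bias term associated with item $i$, $\beta_i \in \mathbb{R}$
$P_i$  latent vector associated with item $i$, $P_i \in \mathbb{R}^K$
$Q_i$  latent vector associated with item $i$, $Q_i \in \mathbb{R}^K$
$K$  dimensionality of the vector representing each user/item
$L$  order of Markov Chains
$\eta$  global weighting vector, $\eta = (\eta_1, \ldots, \eta_L) \in \mathbb{R}^L$
$\eta^u$  personalized weighting vector, $\eta^u = (\eta^u_1, \ldots, \eta^u_L) \in \mathbb{R}^L$
$p_u(j \mid i)$  probability that user $u$ chooses item $j$ after item $i$
$\hat{p}_{u,t,i}$  prediction that user $u$ chooses item $i$ at time step $t$
$>_{u,t}$  personalized total order of user $u$ at time step $t$
$\alpha$  weighting factor, $\alpha \in \mathbb{R}$
$\epsilon$  learning rate, $\epsilon \in \mathbb{R}$
$\sigma(\cdot)$  the logistic (sigmoid) function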
3.2 Modeling User Preferences
Modeling and understanding long-term preferences of users is key to any recommender system. Traditional methods such as Matrix Factorization (e.g. [25]) are usually based on a low-rank assumption. They project users and items to a low-rank latent space ($K$-dimensional) such that the coordinates of each user within the space capture the preferences towards these latent dimensions. The affinity $\hat{p}_{u,i}$ between user $u$ and item $i$ is then estimated by the inner product of the vector representations of $u$ and $i$, written $X_u$ and $Y_i$:
$$\hat{p}_{u,i} \propto \langle X_u, Y_i \rangle. \tag{1}$$
Recently, a novel similarity-based method, called Sparse Linear Methods (SLIM), has been developed and shown to outperform a series of state-of-the-art approaches including Matrix Factorization based methods [15]. By learning an item-to-item similarity matrix $\mathbf{W}$ from the user action history (e.g. purchase logs), it predicts the user-item affinity as follows:
$$\hat{p}_{u,i} \propto \sum_{i' \in \mathcal{I}^+_u \setminus \{i\}} W_{i'i},$$
where $\mathcal{I}^+_u$ is the set of items $u$ has interacted with. $W_{i'i}$ is the element at the $i'$-th row and $i$-th column of matrix $\mathbf{W}$, denoting the similarity of item $i'$ to item $i$. The underlying rationale is that the more similar $i$ is to those items already consumed/liked by $u$, the more likely $i$ will be a preferable choice for $u$.
Without parameterizing each user explicitly, SLIM relaxes the low-rank assumption enforced on user representations and has achieved higher recommendation accuracy (see [15] for details).
The major challenge faced by SLIM comes from the large number of parameters ($|\mathcal{I}| \times |\mathcal{I}|$) to be estimated from the sparse user-item interactions. SLIM approaches this issue by exploiting the sparsity of $\mathbf{W}$ using $\ell_1$-norm regularization when inferring the parameters. Another direction is to capitalize on the low-rank potential of the similarity matrix by decomposing $\mathbf{W}$ into the product of two independent low-rank matrices [2]:
$$\mathbf{W} = \mathbf{P} \mathbf{Q}^T, \tag{2}$$
where $\mathbf{P}$ and $\mathbf{Q}$ are both $|\mathcal{I}| \times K$ matrices and $K \ll |\mathcal{I}|$. This method is called Factored Item Similarity Models (FISM) and brings two benefits: (1) It significantly reduces the number of parameters and has been shown to generate state-of-the-art performance on a series of sparse datasets; and (2) Compared to SLIM, it is stronger at capturing the transitive property of item similarities.
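The three predictors discussed in this subsection can be contrasted in a few lines of code. The following is a minimal sketch, not the authors' implementation: the function names, the toy parameter values, and the use of plain Python lists are all assumptions made for illustration. It shows that a FISM-style score equals a SLIM-style score computed on the dense matrix implied by the factorization, while storing $2 \cdot |\mathcal{I}| \cdot K$ numbers instead of $|\mathcal{I}|^2$.

```python
from typing import List, Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def mf_score(user_vec: Vector, item_vec: Vector) -> float:
    """Matrix Factorization: affinity is the user-item inner product."""
    return dot(user_vec, item_vec)


def slim_score(W: List[List[float]], history: Sequence[int], i: int) -> float:
    """SLIM-style score: sum of similarities W[h][i] over consumed items h != i."""
    return sum(W[h][i] for h in history if h != i)


def fism_score(P: List[Vector], Q: List[Vector], history: Sequence[int], i: int) -> float:
    """FISM-style score: the same sum with W[h][i] factorized as <P[h], Q[i]>."""
    return sum(dot(P[h], Q[i]) for h in history if h != i)


# Toy parameters (made up): 3 items, K = 2 latent dimensions.
P = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
Q = [[0.5, 0.5], [1.0, 0.0], [0.0, 2.0]]
W = [[dot(p, q) for q in Q] for p in P]  # dense matrix implied by W = P Q^T

history = [0, 1]  # items the user has already interacted with
fism = fism_score(P, Q, history, 2)
slim = slim_score(W, history, 2)
print(fism, slim)  # both scores agree because W was built from P and Q
```

Note that neither similarity-based score needs a per-user parameter vector: the user enters only through the list of consumed items.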
3.3 Modeling Sequential Patterns
Sequential (or short-term) dynamics are typically modeled by Markov Chains. Given the last item $i$ that has been interacted with, the first-order Markov Chain predicts the probability of item $j$ being chosen at the next step by maximum likelihood estimation of the item-to-item transition matrix. A further improvement can be made by factorizing the transition matrix into two low-rank matrices, similar to the idea in Section 3.2 (i.e., Eq. (2)); thus the transition probability of item $i$ to item $j$ is estimated by the following inner product:
$$p(j \mid i) \propto \langle M_i, N_j \rangle, \tag{3}$$
where $M_i$ and $N_j$ are the latent vector representations of item $i$ and $j$ respectively. Note that each item $i$ is associated with two vectors: $M_i$ and $N_i$.
Markov Chains are strong at capturing short-term dynamics. For instance, if one purchased a laptop recently, it is reasonable to recommend relevant items such as peripherals or laptop backpacks. Nevertheless, such methods are limited in their ability to capture user preferences that are both personal and long-term (e.g. what type of laptop backpack this particular user likes). Thus there is a need to combine the two lines of models carefully in order to benefit from modeling long- and short-term dynamics simultaneously.
3.4 The Unified Sequential Prediction Model
Recently, Rendle et al. introduced a seminal method that combines Matrix Factorization (i.e., Eq. (1)) and the first-order Markov Chain (i.e., Eq. (3)) to form a unified prediction model, which is dubbed Factorized Personalized Markov Chains (FPMC) [3]. The probability that an arbitrary user $u$ transitions from the last item $i$ to the next item $j$ is estimated by
$$\hat{p}_u(j \mid i) \propto \langle X_u, Y_j \rangle + \langle M_i, N_j \rangle, \tag{4}$$
where $X_u$ and $Y_j$ are the Matrix Factorization vectors of user $u$ and item $j$, and $M_i$ and $N_j$ are the Markov Chain vectors of items $i$ and $j$. The first inner product computes how much $u$ likes item $j$ and the second calculates the extent to which $j$ is ‘similar’ to the last item $i$.
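To make the difference between a purely sequential predictor and the unified one concrete, the sketch below scores candidates with a factorized first-order Markov Chain and with an FPMC-style sum of a user term and a transition term. It is an illustration under assumptions: the function and variable names (`fmc_score`, `fpmc_score`, `x_u`, `Y`, `M`, `N`) and the toy numbers are invented here and do not come from the FPMC implementation.

```python
from typing import List, Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def fmc_score(M: List[Vector], N: List[Vector], last_item: int, j: int) -> float:
    """Factorized first-order Markov Chain: <M[last_item], N[j]>."""
    return dot(M[last_item], N[j])


def fpmc_score(x_u: Vector, Y: List[Vector], M: List[Vector], N: List[Vector],
               last_item: int, j: int) -> float:
    """FPMC-style score: user preference term plus transition term."""
    return dot(x_u, Y[j]) + fmc_score(M, N, last_item, j)


def rank(scores: List[float], exclude: int) -> List[int]:
    """Item ids sorted by descending score, skipping the excluded item."""
    order = sorted(range(len(scores)), key=lambda j: -scores[j])
    return [j for j in order if j != exclude]


# Toy parameters (made up): 3 items, K = 2 latent dimensions.
M = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
N = [[0.0, 1.0], [2.0, 0.0], [1.0, 1.0]]
Y = [[1.0, 0.0], [0.0, 0.0], [0.0, 1.0]]
x_u = [0.0, 3.0]  # this user strongly prefers the second latent dimension
last_item = 0

fmc = [fmc_score(M, N, last_item, j) for j in range(3)]
fpmc = [fpmc_score(x_u, Y, M, N, last_item, j) for j in range(3)]
print(rank(fmc, last_item), rank(fpmc, last_item))
```

In this toy setting the transition term alone prefers item 1, whereas adding the user term moves item 2 to the top, which is the kind of personalization the unified model is meant to capture.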
4 The Proposed Fossil Model
4.1 The Basic Model
In contrast to FPMC, here we take another direction and investigate combining similarity-based methods and Markov Chains to approach the sequential prediction task (see Figure 2). In particular, we take FISM (see Section 3.2) as our starting point, in light of its ability to handle the sparsity issues in real-world datasets.
The basic form of our model is as follows:
$$\hat{p}_u(j \mid i) \propto \sum_{j' \in \mathcal{I}^+_u \setminus \{j\}} \langle P_{j'}, Q_j \rangle + (\eta + \eta_u) \cdot \langle M_i, N_j \rangle,$$
where each user $u$ is parameterized with only a single scalar $\eta_u$ that controls the relative weights of the long- and short-term dynamics. $\eta$ is a global parameter shared by all users and helps center $\eta_u$ at 0.
The above formulation parameterizes each item with four vectors, i.e., $P_i$, $Q_i$, $M_i$, and $N_i$. Considering the limited number of parameters we can afford in the sparse datasets we are interested in, we reduce the four matrices to two by enforcing $\mathbf{P} = \mathbf{M}$ and $\mathbf{Q} = \mathbf{N}$. This makes sense since ultimately sequentially-related items are also ‘similar’ to one another. Adding a bias term and normalizing the long-term dynamics component, we arrive at a new formulation as follows:
$$\hat{p}_u(j \mid i) \propto \beta_j + \Big\langle \frac{1}{|\mathcal{I}^+_u \setminus \{j\}|^{\alpha}} \sum_{j' \in \mathcal{I}^+_u \setminus \{j\}} P_{j'} + (\eta + \eta_u) \cdot P_i, \; Q_j \Big\rangle.$$
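4.2 Modeling Higher-order Markov Chains
Up to now we have used first-order Markov Chains to model short-term temporal dynamics. Next, we extend our formulation to consider high-order Markov Chains to capture smoothness across multiple time steps. Given the most recent $L$ items user $u$ has consumed ($\mathcal{S}^u_{t-1}, \mathcal{S}^u_{t-2}, \ldots, \mathcal{S}^u_{t-L}$), the new formulation predicts the probability of item $j$ being the next item (at time step $t$) with an $L$-order Markov Chain as shown in Eq. (6). In this new formulation, each user $u$ is associated with a vector $\eta^u = (\eta^u_1, \eta^u_2, \ldots, \eta^u_L)$. Likewise, the global bias $\eta$ becomes the vector $\eta = (\eta_1, \eta_2, \ldots, \eta_L)$. The rationale behind this idea is that each of the previous $L$ actions should contribute with different weights to the high-order smoothness.
$$\hat{p}_u(j \mid \mathcal{S}^u_{t-1}, \ldots, \mathcal{S}^u_{t-L}) \propto \beta_j + \Big\langle \frac{1}{|\mathcal{I}^+_u \setminus \{j\}|^{\alpha}} \sum_{j' \in \mathcal{I}^+_u \setminus \{j\}} P_{j'} + \sum_{k=1}^{L} (\eta_k + \eta^u_k) \cdot P_{\mathcal{S}^u_{t-k}}, \; Q_j \Big\rangle. \tag{6}$$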
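The high-order prediction can be read as three pieces: an item bias, a normalized long-term term summed over the user's item set, and a short-term term in which each of the $L$ most recent items gets a global plus a personalized weight. The sketch below spells this out; it is a hedged illustration, not the released code. The function name `fossil_score`, the argument layout, the choice to treat the items seen so far as the user's item set, and the toy numbers are all assumptions; the default `alpha=0.2` follows the value reported in Section 5.5.

```python
from typing import List, Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def fossil_score(P: List[Vector], Q: List[Vector], beta: Sequence[float],
                 eta: Sequence[float], eta_u: Sequence[float],
                 sequence: Sequence[int], j: int, alpha: float = 0.2) -> float:
    """Score item j as the next action after `sequence` (oldest action first)."""
    K = len(Q[j])
    # Long-term part: normalized sum of P over the user's items, excluding j.
    history = set(sequence) - {j}
    long_term = [0.0] * K
    if history:
        norm = len(history) ** (-alpha)
        for h in history:
            for d in range(K):
                long_term[d] += norm * P[h][d]
    # Short-term part: the L most recent items, most recent first,
    # each weighted by a global plus a personalized coefficient.
    recent = list(reversed(sequence))[:len(eta)]
    short_term = [0.0] * K
    for k, item in enumerate(recent):
        weight = eta[k] + eta_u[k]
        for d in range(K):
            short_term[d] += weight * P[item][d]
    combined = [a + b for a, b in zip(long_term, short_term)]
    return beta[j] + dot(combined, Q[j])


# Toy parameters (made up): 3 items, K = 2, first-order chain (L = 1).
P = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
beta = [0.0, 0.0, 0.5]
score = fossil_score(P, Q, beta, eta=[1.0], eta_u=[0.5],
                     sequence=[0, 1], j=2, alpha=0.0)
print(score)
```

Setting `eta_u` to all zeros recovers a non-personalized weighting, and passing longer `eta` and `eta_u` vectors raises the order of the chain without adding any per-item parameters.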
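4.3 Inferring Model Parameters
The ultimate goal of the sequential prediction task is to rank observed (or ground-truth) items as high as possible so that the recommender system can make plausible recommendations. This means it is natural to derive a personalized total order $>_{u,t}$ (at each step $t$) to minimize a ranking loss such as sequential Bayesian Personalized Ranking (SBPR) [3]. Here $i >_{u,t} j$ means that item $i$ is ranked higher than item $j$ for user $u$ at step $t$ given the action sequence before $t$.
For each user $u$ and for each time step $t$, SBPR employs a sigmoid function $\sigma(\hat{p}_{u,t,\mathcal{S}^u_t} - \hat{p}_{u,t,j})$ ($\hat{p}_{u,t,j}$ is a shorthand for the prediction in Eq. (6)) to characterize the probability that ground-truth item $\mathcal{S}^u_t$ is ranked higher than a ‘negative’ item $j$ given the model parameters $\Theta$, i.e., $p(\mathcal{S}^u_t >_{u,t} j \mid \Theta)$. Assuming independence of users and time steps, model parameters $\Theta$ are inferred by optimizing the following maximum a posteriori (MAP) estimation:
$$\hat{\Theta} = \arg\max_{\Theta} \sum_{u \in \mathcal{U}} \sum_{t=2}^{|\mathcal{S}^u|} \sum_{j \notin \mathcal{I}^+_u} \ln \sigma(\hat{p}_{u,t,\mathcal{S}^u_t} - \hat{p}_{u,t,j}) + \ln p(\Theta), \tag{7}$$
where the pairwise ranking between the ground-truth and all negative items goes through all users and all time steps, and the prior term $\ln p(\Theta)$ acts as a regularizer.
The large number of positive-negative pairs in the objective makes conventional batch gradient descent unaffordable. As such, we adopt Stochastic Gradient Descent (SGD), which has seen wide success for learning models in BPR-like optimization frameworks (e.g. [18]). The SGD training procedure works as follows. First, it uniformly samples a user $u$ from $\mathcal{U}$ as well as a time step $t$ from $\{2, 3, \ldots, |\mathcal{S}^u|\}$. Next, a negative item $j \in \mathcal{I}$ and $j \notin \mathcal{I}^+_u$ is uniformly sampled, which forms a training triple $(u, t, j)$. Finally, the optimization procedure updates parameters in the following fashion:
$$\Theta \leftarrow \Theta + \epsilon \cdot \Big( \sigma(\hat{p}_{u,t,j} - \hat{p}_{u,t,\mathcal{S}^u_t}) \frac{\partial (\hat{p}_{u,t,\mathcal{S}^u_t} - \hat{p}_{u,t,j})}{\partial \Theta} - \lambda_\Theta \Theta \Big), \tag{8}$$
where $\epsilon$ is the learning rate and $\lambda_\Theta$ is a regularization hyperparameter.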
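The sampling and update steps described above can be sketched as follows. This is a simplified illustration rather than the authors' training loop: it updates a single scalar parameter, uses rejection sampling to draw an unseen item, and assumes an L2 penalty of the form $(\lambda/2)\theta^2$. The names `sample_triple` and `sgd_update` are hypothetical; the default learning rate and regularization weight follow the values reported in Section 5.5.

```python
import math
import random
from typing import Dict, List, Tuple


def sigmoid(x: float) -> float:
    """Logistic function."""
    return 1.0 / (1.0 + math.exp(-x))


def sample_triple(sequences: Dict[str, List[int]], num_items: int,
                  rng: random.Random) -> Tuple[str, int, int]:
    """Uniformly draw a user, a time step, and an unseen (negative) item.

    Assumes every sequence has at least two actions and every user has at
    least one unseen item. `t` is a 0-based index, so index 0 is skipped
    because it has no preceding action.
    """
    u = rng.choice(sorted(sequences))
    t = rng.randrange(1, len(sequences[u]))
    seen = set(sequences[u])
    j = rng.randrange(num_items)
    while j in seen:  # rejection sampling keeps the draw uniform over unseen items
        j = rng.randrange(num_items)
    return u, t, j


def sgd_update(theta: float, dx_dtheta: float, x: float,
               lr: float = 0.01, lam: float = 0.1) -> float:
    """One ascent step on ln sigmoid(x) - (lam / 2) * theta**2.

    `x` is the ground-truth item's score minus the negative item's score, and
    `dx_dtheta` is its derivative with respect to the scalar parameter theta.
    """
    return theta + lr * (sigmoid(-x) * dx_dtheta - lam * theta)


rng = random.Random(0)
sequences = {"u1": [0, 3, 1], "u2": [2, 0]}
u, t, j = sample_triple(sequences, num_items=5, rng=rng)
# Bias of the ground-truth item: d(x)/d(beta_pos) = 1, so the step raises it.
beta_pos = sgd_update(theta=0.0, dx_dtheta=1.0, x=0.0)
print(u, t, j, beta_pos)
```

For vector-valued parameters such as the item embeddings, the same rule would be applied coordinate by coordinate with the corresponding partial derivatives of the score difference.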
5 Experiments
5.1 Datasets and Statistics
To evaluate the ability and applicability of our method to handle different real-world scenarios, we include a spectrum of large datasets from different domains in order to predict actions ranging from the next product to purchase, next movie to watch, to next review to write, and next place to check in. Note that these datasets also vary significantly in terms of user and item density (i.e., number of actions per user/item).
The first group of large datasets is from Amazon.com.
The next dataset is collected from a popular online consumer review website, Epinions.com.
We also include another popular dataset which is often used to evaluate next Point-Of-Interest prediction algorithms. It is from Foursquare.com.
Table 1: Dataset statistics (after preprocessing).
Dataset  #users ($|\mathcal{U}|$)  #items ($|\mathcal{I}|$)  #actions  avg. #actions/user  avg. #actions/item
AmazonOffice  16,716  22,357  128,070  7.66  5.73
AmazonAuto  34,316  40,287  183,573  5.35  4.56
AmazonGame  31,013  23,715  287,107  9.26  12.11
AmazonToy  57,617  69,147  410,920  7.13  5.94
AmazonCell  68,330  60,083  429,231  6.28  7.14
AmazonClothes  184,050  174,484  1,068,972  5.81  6.13
AmazonElec  253,996  145,199  2,109,879  8.31  14.53
Epinions  5,015  8,335  26,932  5.37  3.23
Foursquare  43,110  13,335  306,553  7.11  22.99
Total  694,163  556,942  4,951,237  N/A  N/A
For each of the above datasets, we filter out inactive users and items with fewer than five associated actions. Star-ratings are converted to implicit feedback (i.e., ‘binary’ actions) by setting the corresponding entries to 1; that is, we care about the purchase/review/check-in actions regardless of the specific rating values. Statistics of each dataset after the above processing are shown in Table 1.
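A minimal sketch of this preprocessing is given below. It rests on assumptions not stated in the text: actions arrive as `(user, item, rating, timestamp)` tuples, the activity counts are taken once on the raw data (rather than re-filtering until every remaining user and item meets the threshold), and the helper name `build_sequences` is made up for illustration.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Tuple

Action = Tuple[str, str, float, int]  # (user, item, star rating, timestamp)


def build_sequences(actions: List[Action], min_count: int = 5) -> Dict[str, List[str]]:
    """Filter inactive users/items, drop rating values, and order actions by time."""
    user_counts = Counter(u for u, _, _, _ in actions)
    item_counts = Counter(i for _, i, _, _ in actions)
    events: Dict[str, List[Tuple[int, str]]] = defaultdict(list)
    for u, i, _rating, ts in actions:
        # The rating value is ignored: every kept action is a binary signal.
        if user_counts[u] >= min_count and item_counts[i] >= min_count:
            events[u].append((ts, i))
    return {u: [i for _, i in sorted(ev)] for u, ev in events.items()}


# Toy log (made up); a threshold of 2 is used so the example stays small.
raw = [
    ("a", "x", 5.0, 2), ("a", "y", 3.0, 1),
    ("b", "x", 4.0, 1), ("b", "y", 1.0, 3),
    ("c", "z", 2.0, 1),
]
sequences = build_sequences(raw, min_count=2)
print(sequences)
```

The output maps each remaining user to a chronologically ordered item sequence, which is the only input the sequential models in this paper require.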
5.2 Evaluation Methodology
All methods are evaluated with the AUC (Area Under the ROC curve) metric, not only because it is widely used (e.g. [28]), but also because it is a natural choice in our case as all comparison methods directly optimize this metric on the training set.
For each dataset, we use the two most recent actions of each user to create a validation set and a test set: one action for validation and the other for testing. All other actions are used as the training set. The training set is used to train all comparison methods, and hyperparameters are tuned with the validation set. Finally, all trained models are evaluated on the test set:
$$AUC = \frac{1}{|\mathcal{U}|} \sum_{u \in \mathcal{U}} \frac{1}{|\mathcal{I} \setminus \mathcal{I}^+_u|} \sum_{j \in \mathcal{I} \setminus \mathcal{I}^+_u} \mathbf{1}(\hat{p}_{u,t,g_u} > \hat{p}_{u,t,j}),$$
where $g_u$ is the ground-truth item of user $u$ at the most recent time step $t$. The indicator function $\mathbf{1}(b)$ returns 1 if the argument $b$ is true, 0 otherwise. The goal here is, for the held-out action, to calculate how highly the ground-truth item has been ranked for each user $u$ according to the learned personalized total order $>_{u,t}$.
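The split and the metric can be sketched as below. This is an illustration under assumptions: items are integer ids from `0` to `num_items - 1`, negatives are all items the user never interacted with, ties count as misses, and the helper names (`split_sequence`, `user_auc`, `mean_auc`) are invented here.

```python
from typing import Callable, List, Sequence, Set, Tuple


def split_sequence(seq: Sequence[int]) -> Tuple[List[int], int, int]:
    """Return (training prefix, validation item, test item) for one user."""
    return list(seq[:-2]), seq[-2], seq[-1]


def user_auc(score: Callable[[int], float], truth: int,
             consumed: Set[int], num_items: int) -> float:
    """Fraction of unconsumed items scored strictly below the ground-truth item."""
    negatives = [j for j in range(num_items) if j not in consumed]
    wins = sum(1 for j in negatives if score(truth) > score(j))
    return wins / len(negatives)


def mean_auc(per_user: List[float]) -> float:
    """Average the per-user values, giving every user equal weight."""
    return sum(per_user) / len(per_user)


# Toy example (made up): 5 items, one user with three actions.
train, valid_item, test_item = split_sequence([0, 4, 2])
scores = [0.1, 0.9, 0.5, 0.3, 0.2]  # hypothetical model scores per item
auc = user_auc(lambda j: scores[j], truth=test_item,
               consumed={0, 4, 2}, num_items=5)
print(train, valid_item, test_item, auc)
```

Here the held-out item outranks one of the two unseen items, so the user contributes 0.5 to the average.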
5.3 Comparison Methods
We include a series of state-of-the-art methods in the field of both item recommendation and sequential prediction.
Popularity (POP): always recommends items based on the rank of their popularity in the system.
Bayesian Personalized Ranking (BPRMF) [18]: a state-of-the-art method for personalized item recommendation. It only considers long-term preferences and uses Matrix Factorization [25] as the underlying predictor.
Factored Item Similarity Models (FISM) [2]: a recently-proposed similarity-based algorithm for personalized item recommendation. We build our model on top of it to tackle the sequential prediction task.
Factorized Markov Chains (FMC): factorizes the item-to-item transition matrix ($|\mathcal{I}| \times |\mathcal{I}|$) to capture the likelihood that an arbitrary user transitions from one item to another. Here we use a first-order Markov Chain as higher orders incur a state-space explosion.
Factorized Personalized Markov Chain (FPMC) [3]: a method that uses a personalized Markov Chain (see Eq. (4)) for the sequential prediction task we are interested in. Recall that FPMC is ultimately a combination of Matrix Factorization and first-order Markov Chains.
Factorized Sequential Prediction with Item Similarity Models (Fossil): the algorithm proposed in this paper (see Eq. (6)). Markov Chains of different orders will be experimented with and compared against other methods.
For clarity, the above methods are collated in Table 2 in terms of whether they are ‘personalized,’ ‘sequentially-aware,’ ‘similarity-based,’ ‘explicitly model users,’ and ‘consider high-order Markov Chains.’ Note that all methods (except POP) directly optimize the pairwise ranking of ground-truth actions versus negative actions in the training set (i.e., the AUC metric), so that the fairness of the comparison is maximized.
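Table 2: Properties of the compared methods (POP, BPRMF, FISM, FMC, FPMC, Fossil). P: personalized; Q: sequentially-aware; S: similarity-based; E: explicitly models users; H: considers high-order Markov Chains.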
Table 3: AUC on the test set for each method with $K = 10$; the last row reports averages for $K = 20$. The four rightmost columns give percentage improvements.
Dataset  (a) POP  (b) BPRMF  (c) FISM  (d) FMC  (e) FPMC  (f1) Fossil ($L = 1$)  (f2) Fossil ($L = 2$)  (f3) Fossil ($L = 3$)  e vs. b  f vs. c  f vs. e  f vs. best
AmazonOffice  0.6427  0.6736  0.7113  0.6874  0.6891  0.7211  0.7224  0.7221  2.30%  1.56%  4.83%  1.56%
AmazonAuto  0.5870  0.6379  0.6736  0.6452  0.6446  0.6910  0.6904  0.6901  1.05%  2.58%  7.20%  2.58%
AmazonGame  0.7495  0.8483  0.8639  0.8401  0.8502  0.8793  0.8813  0.8817  0.22%  2.06%  3.71%  2.06%
AmazonToy  0.6240  0.7020  0.7499  0.6665  0.7061  0.7625  0.7645  0.7652  0.58%  2.04%  8.37%  2.04%
AmazonCell  0.6959  0.7212  0.7755  0.7359  0.7396  0.7982  0.8009  0.8006  2.55%  3.27%  8.29%  3.27%
AmazonClothes  0.6189  0.6513  0.7085  0.6673  0.6672  0.7255  0.7256  0.7259  2.44%  2.46%  8.80%  2.46%
AmazonElec  0.7837  0.7927  0.8210  0.7992  0.7985  0.8411  0.8438  0.8444  0.73%  2.85%  5.75%  2.85%
Epinions  0.4576  0.5520  0.5818  0.5532  0.5477  0.6014  0.6050  0.6048  -0.78%  3.99%  10.46%  3.99%
Foursquare  0.9168  0.9506  0.9230  0.9441  0.9485  0.9626  0.9621  0.9618  -0.22%  4.29%  1.49%  1.26%
Avg. ($K = 10$)  0.6751  0.7255  0.7565  0.7265  0.7324  0.7759  0.7773  0.7774  0.99%  2.79%  6.54%  2.45%
Avg. ($K = 20$)  0.6751  0.7285  0.7580  0.7293  0.7344  0.7780  0.7795  0.7788  0.88%  2.83%  6.44%  2.54%
These baselines are designed to demonstrate (1) the performance achieved by state-of-the-art sequentially-unaware recommendation methods (BPRMF and FISM) and purely sequential methods (FMC); (2) the effectiveness of the state-of-the-art sequential prediction method by combining BPRMF and FMC (FPMC); and (3) the strength of our proposed combination of a similarity-based algorithm and (high-order) Markov Chains (Fossil).
5.4 Performance and Quantitative Analysis
For simplicity and fair comparison, the number of dimensions of user/item representations (or the rank of the matrices) in all methods is fixed to the same number $K$. We experimented with different values of $K$ and demonstrate our results in Table 3 ($K = 10$). Due to the sparsity of these datasets, no algorithm observed significant performance improvements when increasing $K$ beyond 10 (see the last row of Table 3 for the average accuracy of all methods when setting $K$ to 20).
For clarity, on the right of the table we show the percentage improvement of a variety of methods: FPMC vs. BPRMF, Fossil vs. FISM, Fossil vs. FPMC, and Fossil vs. the best baselines. We make a few comparisons and summarize our findings as follows.
BPRMF and FISM are two powerful methods to model users’ personalized preferences, i.e., long-term dynamics. Although ultimately they both factorize a matrix at their core, they differ significantly both in terms of the rationales they follow and the performance they achieve. BPRMF relies on factorization of the user-item interaction matrix and parameterizes each user with a $K$-dimensional vector. In contrast, FISM is based on the factorization of the item-item similarity matrix and relaxes the need to explicitly parameterize users, who may only have a few actions in the training set. According to our experimental results, FISM exhibits significant improvements over BPRMF on every dataset except Foursquare (over 4 percent on average), which makes it a strong building block for our sequential prediction task.
Compared to BPRMF and FISM, FMC focuses on capturing sequential patterns among items, i.e., short-term dynamics. Notably, FMC achieved comparable prediction accuracy with BPRMF. This suggests that sequential patterns are important dynamics and that it would be limiting to only consider long-term user preferences.
BPRMF and FMC are each limited by missing an important type of dynamic prevalent in our datasets. FPMC combines them and emerges as a comprehensive model that is both personalized and sequentially-aware. Quantitatively, FPMC is the strongest among the three: 0.99 percent better than BPRMF and 0.80 percent better than FMC on average (when $K = 10$).
By fusing FISM, which is strong at modeling long-term dynamics on sparse data, and Markov Chains, Fossil enhances the performance of FISM by as much as 2.79 percent on average, compared to 0.99 percent achieved by FPMC over BPRMF. Comparing Fossil with FPMC, we found that (1) Fossil beats FPMC significantly on all datasets, with a large improvement of 6.54 percent on average, and (2) the improvement is even larger on sparse datasets like AmazonAuto, AmazonClothes, and Epinions. The superior performance of Fossil on various datasets demonstrates its efficacy in handling real-world datasets.
Generally, the performance of Fossil gets better on most datasets when increasing the order of the Markov Chains involved (from 1 to 3 in our experiments), indicating that earlier actions are also useful for prediction. On the other hand, small orders seem to be enough to achieve good performance, presumably because sequential patterns do not involve actions from a long time ago.
The rightmost column of Table 3 demonstrates the performance improvement of Fossil versus the best baseline method in each case. We found that Fossil outperforms all baselines in all cases with an enhancement of 2.5 percent on average.
5.5 Reproducibility
The hyperparameter $\alpha$ in Eq. (6) is set to 0.2 on all datasets. Regularization hyperparameters are always tuned with grid search using the validation set. $\lambda_\Theta$ (in Eq. (8)) yielded the best performance when set to 0.1 in most cases. The learning rate $\epsilon$ is set to 0.01.
5.6 Training Efficiency
All experiments were performed on a single machine with 8 cores and 64GB main memory. The largest dataset, Amazon Electronics, takes around 4 hours to train Fossil with third-order Markov Chains (i.e., $L = 3$) and 20 latent dimensions (i.e., $K = 20$). It is easy to verify that Eq. (6) takes $O((|\mathcal{I}^+_u| + L) \cdot K)$ time for prediction. Since $|\mathcal{I}^+_u|$ and $L$ are usually small numbers for sparse datasets, the computational cost is manageable.
6 A Study on the Effect of Data Sparsity
We proceed by further studying the effect of dataset sparsity on different methods. To this end, we perform experiments on a popular dataset, MovieLens-1M.
We construct a sequence of datasets each with a different threshold on the number of recent user actions; that is, a dataset is constructed by taking only the most recent $N$ actions of each user. Actions beyond this point are dropped. Note that sampling is not used as it would break the sequential characteristic exploited by the models. We decrease the threshold from 50 to 5 (leading to a series of increasingly sparse datasets) and observe the performance variation of all methods. Statistics of the datasets are summarized in Table 4. Experimental results are collected in the table that follows Table 4. As we can see from that table, the accuracy of all methods drops if we decrease the threshold from 50 to 5. This makes sense since we have less information regarding both users and transitions among items. In this section, we compare Fossil with FPMC to answer a series of questions on their effectiveness based on these results.
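The construction of the thresholded datasets amounts to keeping a suffix of every user's sequence. A minimal sketch, assuming each sequence is stored oldest-first and using hypothetical helper names, is:

```python
from typing import Dict, List


def truncate_recent(sequences: Dict[str, List[int]], threshold: int) -> Dict[str, List[int]]:
    """Keep only the `threshold` most recent actions of every user."""
    if threshold < 1:
        raise ValueError("threshold must be at least 1")
    return {u: seq[-threshold:] for u, seq in sequences.items()}


def avg_actions_per_user(sequences: Dict[str, List[int]]) -> float:
    """Average sequence length, one of the density statistics reported per dataset."""
    return sum(len(seq) for seq in sequences.values()) / len(sequences)


# Toy sequences (made up), oldest action first.
full = {"u1": [1, 2, 3, 4, 5, 6], "u2": [7, 8]}
sparse = truncate_recent(full, threshold=3)
print(sparse, avg_actions_per_user(sparse))
```

Because only the oldest actions are removed, the order of the remaining actions is untouched, which is why truncation rather than random sampling preserves the sequential signal.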
Table 4: Statistics of the datasets constructed from MovieLens-1M.
Dataset  Threshold  #users ($|\mathcal{U}|$)  #items ($|\mathcal{I}|$)  #actions  avg. #actions/user  avg. #actions/item
ML50  50  6,040  3,467  215,676  35.71  62.21
ML30  30  6,040  3,391  152,160  25.19  44.87
ML20  20  6,040  3,324  111,059  18.39  33.41
ML10  10  6,040  3,114  59,610  9.87  19.14
ML5  5  6,040  2,848  30,175  5.00  10.60
Dataset  (a) POP  (b) BPRMF  (c) FISM  (d) FMC  (e) FPMC  (f1) Fossil ($L = 1$)  (f2) Fossil ($L = 2$)  (f3) Fossil ($L = 3$)
ML50  0.8032  0.8587  0.8564  0.8566  0.8825  0.8802  0.8837  0.8865
ML30  0.7980  0.8523  0.8515  0.8463  0.8674  0.8748  0.8794  0.8797
ML20  0.7919  0.8447  0.8476  0.8357  0.8503  0.8704  0.8735  0.8728
ML10  0.7722  0.8728  0.8276  0.8026  0.8301  0.8540  0.8546  0.8551
ML5  0.7352  0.7551  0.7711  0.7275  0.7458  0.7945  0.7940  0.7931
6.1 How Much do Short-term Dynamics Help?
We begin by investigating the improvements due to modeling sequential patterns on top of BPRMF and FISM. In Figure 3 we demonstrate the improvement of FPMC over BPRMF, in contrast to that achieved by Fossil (with 1st-, 2nd-, and 3rd-order Markov Chains) over FISM. From the figure we observe that the improvement of FPMC over BPRMF decreases as data become sparser, due to the number of additional parameters introduced to model short-term dynamics. In contrast, Fossil introduces only a small number of parameters (i.e., the weighting vectors $\eta$ and $\eta^u$), and thus outperforms FISM more consistently. Additionally, high orders of Markov Chains appear to help more when the dataset is dense, e.g., ML50 vs. others.
6.2 How Much do Long-term Dynamics Help?
One may consider FPMC and Fossil as two different methods to enhance Markov Chains by incorporating long-term user preferences. To demonstrate the benefits of modeling long-term dynamics, in Figure 4 we show the amount of improvement achieved by FPMC and Fossil over FMC. As datasets become sparser, the performance gap between Fossil and FMC increases to as much as 9 percent, in contrast to the relatively ‘flat’ improvement of around 3 percent from FPMC. This demonstrates the strength of Fossil as well as the compatibility of FISM and Markov Chains since they are both ultimately modeling relationships among items.
6.3Fossil and FPMC
In Figure 5 we demonstrate the performance gain of our proposed method over FPMC. Fossil outperforms FPMC increasingly as sparsity grows. Note that FPMC requires as many as 50 actions per user in order to achieve comparable performance with Fossil.
For further investigation, we also performed additional experiments in which we reduced the number of parameters of FPMC by constraining a matrix in Eq. (4), in the hope that this would favor sparse datasets. However, this yielded no significant improvement, and sometimes even worse prediction accuracy.
To sum up, the two key components of Fossil, i.e., the similarity-based method and (high-order) Markov Chains, both contribute considerably to its performance, especially on sparse datasets. It is thus the combination of the two that yields strong results for the sparse sequential recommendation task.
7 Visualization and Qualitative Analysis
In this section, we visualize the learned Fossil model and qualitatively analyze our findings. We choose to visualize the results achieved on the Clothing, Shoes and Jewelry dataset from Amazon.com (see Section 5.1) due to its large size, significant variability, and the ease with which user actions can be illustrated. The model we use for visualization is the first-order Fossil model trained on this dataset.
7.1 Visualizing Sequential Dynamics
First, we visualize the transitions among items to answer questions like ‘What kind of outfits are compatible with this outdoor cap?’. Fossil encodes this dynamic by the inner product of two item vectors: that of the item already consumed and that of the item considered for recommendation. Quantitatively, given a ‘query’ item at the current time step, the items most likely to appear at the next step are computed according to Eq. (9).
For demonstration, we take a few samples from the above dataset, shown on the left of the separator in Figure 6. Next, we use them as queries to get the corresponding recommendations according to Eq. (9). The items retrieved for each query are shown on the right. We make two observations from this figure. On the one hand, although the model is unaware of the identity of items, it learns the underlying homogeneity correctly, as we see from the first row (i.e., the Star Wars theme) and the last three rows (i.e., watches, shirts, and jewelry respectively). On the other hand, items from different subcategories are surfaced to generate compatible outfits, e.g., rows 2 and 5.
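The retrieval step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes each item has a "source" vector (used when the item is the one already consumed) and a "target" vector (used when the item is a candidate), and it ranks candidates by their inner product alone, omitting any bias or weighting terms that Eq. (9) may include. The function name `next_item_candidates`, the dictionary layout, and the toy vectors are all hypothetical.

```python
from typing import Dict, List, Sequence


def inner(a: Sequence[float], b: Sequence[float]) -> float:
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def next_item_candidates(
    query: str,
    source_vecs: Dict[str, Sequence[float]],
    target_vecs: Dict[str, Sequence[float]],
    k: int = 3,
) -> List[str]:
    """Return the k items whose target vectors score highest
    against the source vector of the query item."""
    q = source_vecs[query]
    scores = {
        item: inner(q, vec)
        for item, vec in target_vecs.items()
        if item != query  # do not recommend the query itself
    }
    # Sort by descending score; break ties by item id for determinism.
    return sorted(scores, key=lambda item: (-scores[item], item))[:k]


# Hypothetical 2-dimensional item vectors, made up for illustration only.
source_vecs = {"cap": [1.0, 0.0], "watch": [0.0, 1.0]}
target_vecs = {
    "cap": [0.9, 0.1],
    "jacket": [0.8, 0.2],
    "watch": [0.1, 0.8],
    "strap": [0.0, 0.9],
}

print(next_item_candidates("cap", source_vecs, target_vecs, k=2))
```

Because the ranking uses only item vectors, the result is the same for every user, which matches the non-personalized, item-to-item nature of the queries in Figure 6.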
7.2 Visualizing Personal Dynamics
Recall that each user is parameterized with a weighting vector that models the personalized importance of short-term dynamics relative to long-term dynamics. The scatter plot in Figure 7 shows these weights as learned for users in the dataset.
We make two observations from the scatter plot. (1) Fossil gives ‘cool’ users (users with few actions) higher weights, which makes sense since sequential patterns have to carry more weight when we do not have enough observations to infer users’ preferences. This also confirms that it is necessary to model short-term dynamics on sparse datasets in order to enhance the performance of item recommendation methods like FISM. (2) As the number of actions increases, Fossil relies more on long-term preferences as they become increasingly accurate.
In Figure 8 we show recommendations made for a few users as well as the corresponding ground-truth items (from the test set). Here the users are sampled from those with the largest weights and at least 5 actions in the training set. The threshold is used so that the weight is forced to capture the ‘sequential consistency’ of a user to some degree, rather than merely the sparsity of that user's history. From Figure 8 we can observe a certain amount of such consistency within each sequence, e.g., jewelry (row 1), wearables for boys (row 4), and businessmen (row 5).
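The user selection described above amounts to a filter followed by a sort. The sketch below is illustrative only. It assumes that, for the first-order model, the learned personalized weight can be treated as a single number per user. The names `select_users`, `weights`, and `train_sequences`, as well as the toy values, are hypothetical.

```python
from typing import Dict, List


def select_users(
    weights: Dict[str, float],
    train_sequences: Dict[str, List[str]],
    min_actions: int = 5,
    n: int = 3,
) -> List[str]:
    """Pick the n users with the largest weights among those
    that have at least min_actions training actions."""
    eligible = [
        user
        for user, seq in train_sequences.items()
        if len(seq) >= min_actions and user in weights
    ]
    # Largest weight first; ties broken by user id for determinism.
    return sorted(eligible, key=lambda user: (-weights[user], user))[:n]


# Hypothetical learned weights and training sequences.
weights = {"u1": 0.90, "u2": 0.70, "u3": 0.95, "u4": 0.20}
train_sequences = {
    "u1": ["a", "b", "c", "d", "e", "f"],
    "u2": ["a", "b", "c", "d", "e"],
    "u3": ["a", "b"],  # largest weight, but too few actions
    "u4": ["a", "b", "c", "d", "e", "f", "g"],
}

print(select_users(weights, train_sequences, min_actions=5, n=2))
```

Without the threshold, `u3` would be selected first even though its large weight may reflect little more than a short history.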
In conclusion, it is precisely the ability to carefully accommodate multiple types of dynamics as well as personalization that makes Fossil a successful method to handle the sequential recommendation task.
8 Conclusion
In this paper, we proposed a new method, Fossil, that fuses similarity-based models with Markov Chains to predict personalized sequential behavior. We performed extensive experiments on multiple large, real-world datasets, and found that Fossil outperforms existing methods considerably. We studied the effect of sparsity on different methods and found that Fossil is especially strong when the prediction task is challenged by sparsity issues. We visualized the learned Fossil model on a large dataset from Amazon.com and observed that it captures sequential and personalized dynamics in a reasonable way, complementing the favorable quantitative results.
This work is supported by NSF-IIS-1636879, and donations from Adobe, Symantec, and NVIDIA.
Footnotes
 For instance, if items and , and are two copurchase pairs in the training data but is not, SLIM will erroneously estimate the similarity of and to be .
 In [3], the authors took a tensor-factorization perspective of the predictor, which brings an additional term modeling the interactions between the user and the last item. However, this term is not required as it always cancels out when making predictions.
 i.e., optimize the AUC metric (see Section 5.2).
 https://www.amazon.com/
 http://epinions.com/
 https://foursquare.com/
 http://grouplens.org/datasets/movielens/1m/
References
 F. Ricci, L. Rokach, B. Shapira, and P. Kantor, Recommender systems handbook. Springer US, 2011.
 S. Kabbur, X. Ning, and G. Karypis, “FISM: factored item similarity models for top-n recommender systems,” in SIGKDD, 2013.
 S. Rendle, C. Freudenthaler, and L. Schmidt-Thieme, “Factorizing personalized markov chains for next-basket recommendation,” in WWW, 2010.
 J. A. Konstan, B. N. Miller, D. Maltz, J. L. Herlocker, L. R. Gordon, and J. Riedl, “Grouplens: applying collaborative filtering to usenet news,” Communications of the ACM, 1997.
 U. Shardanand and P. Maes, “Social information filtering: algorithms for automating “word of mouth”,” in SIGCHI, 1995.
 W. Hill, L. Stead, M. Rosenstein, and G. Furnas, “Recommending and evaluating choices in a virtual community of use,” in SIGCHI, 1995.
 G. Linden, B. Smith, and J. York, “Amazon.com recommendations: item-to-item collaborative filtering,” Internet Computing, 2003.
 M. Deshpande and G. Karypis, “Item-based top-n recommendation algorithms,” TOIS, 2004.
 K. Miyahara and M. J. Pazzani, “Collaborative filtering with the simple bayesian classifier,” in PRICAI, 2000.
 J. S. Breese, D. Heckerman, and C. Kadie, “Empirical analysis of predictive algorithms for collaborative filtering,” in UAI, 1998.
 R. Salakhutdinov, A. Mnih, and G. Hinton, “Restricted boltzmann machines for collaborative filtering,” in ICML, 2007.
 R. M. Bell, Y. Koren, and C. Volinsky, “The bellkor solution to the netflix prize,” 2007.
 J. Bennett and S. Lanning, “The netflix prize,” in KDDCup, 2007.
 A. Paterek, “Improving regularized singular value decomposition for collaborative filtering,” in KDDCup, 2007.
 X. Ning and G. Karypis, “Slim: Sparse linear methods for top-n recommender systems,” in ICDM, 2011.
 Y. Hu, Y. Koren, and C. Volinsky, “Collaborative filtering for implicit feedback datasets,” in ICDM, 2008.
 R. Pan, Y. Zhou, B. Cao, N. N. Liu, R. Lukose, M. Scholz, and Q. Yang, “One-class collaborative filtering,” in ICDM, 2008.
 S. Rendle, C. Freudenthaler, Z. Gantner, and L. Schmidt-Thieme, “BPR: bayesian personalized ranking from implicit feedback,” in UAI, 2009.
 Y. Koren, “Factorization meets the neighborhood: a multifaceted collaborative filtering model,” in SIGKDD, 2008.
 ——, “Collaborative filtering with temporal dynamics,” Communications of the ACM, 2010.
 Y. Koren, R. Bell, and C. Volinsky, “Matrix factorization techniques for recommender systems,” 2009.
 A. Zimdars, D. M. Chickering, and C. Meek, “Using temporal data for making recommendations,” in UAI, 2001.
 B. Mobasher, H. Dai, T. Luo, and M. Nakagawa, “Using sequential and non-sequential patterns in predictive web usage mining tasks,” in ICDM, 2002.
 G. Shani, R. I. Brafman, and D. Heckerman, “An MDP-based recommender system,” in UAI, 2002.
 Y. Koren and R. Bell, “Advances in collaborative filtering,” in Recommender Systems Handbook. Springer, 2011.
 J. J. McAuley, C. Targett, Q. Shi, and A. van den Hengel, “Image-based recommendations on styles and substitutes,” in SIGIR, 2015.
 T. Zhao, J. McAuley, and I. King, “Leveraging social connections to improve personalized ranking for collaborative filtering,” in CIKM, 2014.
 R. He and J. McAuley, “VBPR: visual bayesian personalized ranking from implicit feedback,” in AAAI, 2016.