PRINCE: Provider-side Interpretability with Counterfactual Explanations in Recommender Systems
Abstract.
Interpretable explanations for recommender systems and other machine learning models are crucial to gain user trust. Prior works that have focused on paths connecting users and items in a heterogeneous network have several limitations, such as discovering relationships rather than true explanations, or disregarding other users’ privacy. In this work, we take a fresh perspective, and present Prince: a provider-side mechanism to produce tangible explanations for end-users, where an explanation is defined to be a minimal set of a user’s actions that, if removed, changes the recommendation to a different item. Given a recommendation, Prince uses a polynomial-time optimal algorithm for finding this minimal set of a user’s actions from an exponential search space, based on random walks over dynamic graphs. Experiments on two real-world datasets show that Prince provides more compact explanations than intuitive baselines, and insights from a crowdsourced user study demonstrate the viability of such action-based explanations. We thus posit that Prince produces scrutable, actionable, and concise explanations, owing to its use of counterfactual evidence, a user’s own actions, and minimal sets, respectively.
1. Introduction
Motivation. Providing user-comprehensible explanations for machine learning models has gained prominence in multiple communities (Zhang et al., 2019c; Miller et al., 2019; Rimchala et al., 2019; Zhang et al., 2019b). Several studies have shown that explanations increase users’ trust in systems that generate personalized recommendations or other rankings (in news, entertainment, etc.) (Ribeiro et al., 2016; Kunkel et al., 2019; Kouki et al., 2019). Recommenders have become very sophisticated, exploiting signals from a complex interplay of factors like users’ activities, interests and social links (Zhang and Chen, 2018). Hence the pressing need for explanations.
Explanations for recommenders can take several forms, depending on the generator (explanations by whom?) and the consumer (explanations for whom?). As generators, only service providers can produce true explanations for how systems compute the recommended items (Balog et al., 2019; Zhang et al., 2014; Wang et al., 2018b); third parties can merely discover relationships and create post-hoc rationalizations for black-box models that may look convincing to users (Ghazimatin et al., 2019; Peake and Wang, 2018; Wang et al., 2018c). On the consumer side, end-users can grasp tangible aspects like activities, likes/dislikes/ratings or demographic factors. Unlike system developers or accountability engineers, end-users would obtain hardly any insight from transparency of internal system workings. In this work, we deal with explanations by the provider and for the end-user.
Limitations of state-of-the-art. At the core of most recommender systems is some variant of matrix or tensor decomposition (e.g., (Koren et al., 2009)) or spectral graph analysis (e.g., (Jamali and Ester, 2009)), with various forms of regularization and often involving gradient-descent methods for parameter learning. One of the recent and popular paradigms is based on heterogeneous information networks (HIN) (Yu et al., 2014, 2013; Shi et al., 2017; Zhang et al., 2019a), a powerful model that represents relevant entities and actions as a directed and weighted graph with multiple node and edge types. Prior efforts towards explanations for HIN-based recommendations have mostly focused on paths that connect the user with the recommended item (Shi et al., 2015; Ghazimatin et al., 2019; Ai et al., 2018; Wang et al., 2018a, 2019; Xian et al., 2019; Yang et al., 2018). An application of path-based explanations, for an online shop, would be of the form:

User u received item rec because u follows user v, who bought item i, which has the same category as rec.
However, such methods come with critical privacy concerns arising from nodes in paths that disclose other users’ actions or interests to user u, like the purchase of user v above. Even if user v’s id was anonymized, user u would know whom she is following and could often guess who user v actually is, that bought item i, assuming that u has a relatively small set of followees (Machanavajjhala et al., 2011). If entire paths containing other users are suppressed instead, then such explanations would no longer be faithful to the true cause. Another family of path-based methods (Ghazimatin et al., 2019; Peake and Wang, 2018; Wang et al., 2018c) presents plausible connections between users and items as justifications. However, this is merely post-hoc rationalization, and not actual causality.
Approach. This paper presents Prince, a method for Provider-side Interpretability with Counterfactual Evidence, that overcomes the outlined limitations. Prince is a provider-side solution aimed at detecting the actual cause responsible for the recommendation, in a heterogeneous information network with users, items, reviews, and categories. Prince’s explanations are grounded in the user’s own actions, and thus preclude privacy concerns of path-based models. Fig. 1 shows an illustrative example. Here, Alice’s actions like bought shoes, reviewed a camera, and rated a power bank are deemed as explanations for her backpack recommendation. One way of identifying a user’s actions for an explanation would be to compute scores of actions with regard to the recommended item. However, this would be an unwieldy distribution over potentially hundreds of actions, hardly comprehensible to an end-user. Instead, we operate in a counterfactual setup (Martens and Provost, 2014). Prince identifies a small (and actually minimal) set of a user’s actions such that removing these actions would result in replacing the recommended item with a different item. In Fig. 1, the item rec = “Jack Wolfskin backpack” would be replaced, as the system’s top recommendation, by “iPad Air” (the remaining items in the figure represent candidate replacement items). Note that there may be multiple such minimal sets, but uniqueness is not a concern here.
Another perspective here is that the goal of an explanation is often to show users what they can do in order to receive more relevant recommendations. Under this view, the end-user has no control over the network beyond her immediate neighborhood, i.e., the rest of the network is not actionable (shaded zone in Fig. 1), motivating Prince’s choice of grounding explanations in users’ own actions.
For true explanations, we need to commit ourselves to a specific family of recommender models. In this work, we choose a general framework based on Personalized PageRank (PPR), as used in the state-of-the-art RecWalk system (Nikolakopoulos and Karypis, 2019), and adapt it to the HIN setup. The heart of Prince is a polynomial-time algorithm for exploring the (potentially exponential) search space of subsets of user actions, the candidates for causing the recommendation. The algorithm efficiently computes PPR contributions for groups of actions with regard to an item, by adapting the reverse local push algorithm of (Andersen et al., 2007) to a dynamic graph setting (Zhang et al., 2016). In summary, the desiderata for the explanations from Prince (in bold) connect to the technical approaches adopted (in italics) in the following ways. Our explanations are:

Scrutable, as they are derived in a counterfactual setup;

Actionable, as they are grounded in the user’s own actions;

Concise, as they are minimal sets changing a recommendation.
Extensive experiments with Amazon and Goodreads datasets show that Prince’s minimal explanations, achieving the desired item-replacement effect, cannot be easily obtained by heuristic methods based on contribution scores and shortest paths. A crowdsourced user study on Amazon Mechanical Turk (AMT) provides additional evidence that Prince’s explanations are more useful than ones based on paths (Yang et al., 2018). Our code is public at https://github.com/azinmatin/prince/.
Contributions. Our salient contributions in this work are:

Prince is the first work that explores counterfactual evidence for discovering causal explanations in a heterogeneous information network;

Prince is the first work that defines explanations for recommenders in terms of users’ own actions;

We present an optimal algorithm that explores the search space of action subsets in polynomial time, for efficient computation of a minimal subset of user actions;

Experiments with two large datasets and a user study show that Prince can effectively aid a service provider in generating user-comprehensible causal explanations for recommended items.
2. Computational Model
Heterogeneous Information Networks (HIN). A heterogeneous graph G = (V, E) consists of a set of nodes V, a set of edges E, and a mapping that assigns each node and each edge a type, with more than one node or edge type overall. In our work, a heterogeneous graph contains at least two node types, users U and items I. For simplicity, we use the notations U and I to refer both to the type of a node and the set of all nodes of that type. A graph is weighted if there is a weight W(e) assigned to each edge e ∈ E, and a graph is directed if E is a set of ordered pairs of nodes. We denote with N_out(v) and N_in(v) the sets of out-neighbors and in-neighbors of node v, respectively. A directed and weighted heterogeneous graph where each node and each edge belong to exactly one type, is called a heterogeneous information network (HIN) (Shi et al., 2017).
Personalized PageRank (PPR) for recommenders. We use Personalized PageRank (PPR) for recommendation in HINs (Haveliwala, 2003; Nikolakopoulos and Karypis, 2019). PPR is the stationary distribution of a random walk in G in which, at a given step, with probability α, a surfer teleports to a set of seed nodes, and with probability 1 − α, continues the walk along a randomly chosen outgoing edge from the current node. More precisely, given G, teleportation probability α, a single seed u, the one-hot vector e_u, and the transition matrix W, the Personalized PageRank vector π_u is defined recursively as:
(1) \pi_u = \alpha \, e_u + (1 - \alpha) \, W^{T} \pi_u
Let PPR(u, v) be the PPR score of node v personalized for u, i.e., the entry of π_u at v. We define the PPR recommendation for user u, or the top recommendation rec, as:
(2) rec = \arg\max_{v \in I} PPR(u, v)
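To make the recursive PPR definition and the arg-max recommendation rule concrete, here is a minimal power-iteration sketch. The toy graph, node names, and the teleportation value 0.15 are illustrative assumptions, not the paper’s setup:

```python
# Minimal sketch of PPR-based recommendation (Eqs. 1-2) via power iteration.
# The graph is a toy adjacency map with uniform transition probabilities.
ALPHA = 0.15  # teleportation probability (a common default; an assumption here)

def ppr(graph, seed, iters=100):
    """Personalized PageRank: pi = alpha * e_seed + (1 - alpha) * W^T pi."""
    nodes = list(graph)
    pi = {n: 1.0 if n == seed else 0.0 for n in nodes}
    for _ in range(iters):
        nxt = {n: (ALPHA if n == seed else 0.0) for n in nodes}
        for n in nodes:
            out = graph[n]
            if not out:
                continue
            share = (1 - ALPHA) * pi[n] / len(out)  # uniform out-edge weights
            for m in out:
                nxt[m] += share
        pi = nxt
    return pi

def recommend(graph, user, items):
    """Top recommendation: the item with the highest PPR personalized for user."""
    scores = ppr(graph, user)
    return max(items, key=lambda v: scores[v])

graph = {
    "u": ["i1", "i2"],   # user interacts with two items
    "i1": ["u", "c"],    # i1 also links to a category node
    "i2": ["u"],
    "c": ["i1"],
}
print(recommend(graph, "u", ["i1", "i2"]))
```

Since "i1" receives probability mass both from the user and from the category node, it outranks "i2" as the top recommendation.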
Given a set of edges S ⊆ E, we use the notation PPR_{E\S}(u, v) to denote the PPR of an item v personalized for a user u in the graph G = (V, E \ S). We refer to this graph as G_{E\S}. To improve top recommendations, Nikolakopoulos et al. (Nikolakopoulos and Karypis, 2019) define a random walk in an HIN as follows:

With probability α, the surfer teleports to u;

With probability 1 − α, the surfer continues the walk in the following manner:

With probability β, the random surfer moves to a node of the same type, using a similarity-based stochastic transition matrix S;

With probability 1 − β, the surfer chooses any outgoing edge at random.

For each node type N in G, there is an associated stochastic similarity matrix S_N, which encodes the relationship between the nodes of type N. When nodes of the same type are not comparable, the similarity matrix is the identity matrix, i.e., S_N = I. Otherwise, an entry in S_N corresponds to the similarity between two nodes of type N. The stochastic process described by this walk is a nearly uncoupled Markov chain (Nikolakopoulos and Karypis, 2019). The stationary distribution of the random walk is the PPR with teleportation probability α in a modified graph G′ (cf. (Nikolakopoulos and Karypis, 2019)), where the transition probability matrix W′ of G′ is:
(3) W' = (1 - \beta) \, W + \beta \, S
The matrix W is the transition probability matrix of the original graph G. The matrix S is a block-diagonal matrix of order |V|, whose blocks are the per-type similarity matrices S_N.
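The mixture in Eq. 3 can be sketched on a toy graph. All matrices and the β value below are illustrative assumptions; the only property the sketch checks is that mixing two row-stochastic matrices again yields a row-stochastic transition matrix:

```python
# Sketch of the RecWalk-style transition matrix W' = (1 - beta) * W + beta * S
# on a 3-node toy graph (node 0 = user, nodes 1-2 = items). Values are
# illustrative assumptions, not the paper's settings.
BETA = 0.1

def combine(W, S, beta=BETA):
    """Entry-wise mixture (1 - beta) * W + beta * S of two square matrices."""
    n = len(W)
    return [[(1 - beta) * W[i][j] + beta * S[i][j] for j in range(n)]
            for i in range(n)]

# W: row-stochastic walk on the original graph.
W = [
    [0.0, 0.5, 0.5],  # user points to both items
    [1.0, 0.0, 0.0],  # each item points back to the user
    [1.0, 0.0, 0.0],
]
# S: block-diagonal similarity matrix; the user block is the identity,
# the item block is a row-stochastic item-item similarity matrix.
S = [
    [1.0, 0.0, 0.0],
    [0.0, 0.4, 0.6],
    [0.0, 0.6, 0.4],
]
W_prime = combine(W, S)
# Each row of W' remains stochastic, since W and S are both row-stochastic.
print([round(sum(row), 6) for row in W_prime])
```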
Counterfactual Explanations. A user u interacts with items via different types of actions, such as clicks, purchases, ratings or reviews, which are captured as interaction edges in the graph G. Our goal is to present user u with a set of interaction edges A = {(u, n_i)} (where n_i is a neighbor of u) responsible for an item recommendation rec; we refer to this as a counterfactual explanation. An explanation A is counterfactual if, after removing the edges A from the graph, the user receives a different top-ranked recommendation rec* ≠ rec. A counterfactual explanation A is minimal if there is no smaller set A′ such that |A′| < |A| and A′ is also a counterfactual explanation for rec.
Formal problem statement. Given a heterogeneous information network G and the top-ranked recommendation rec for user u, find a minimum counterfactual explanation A* for rec.
3. The Prince Algorithm
In this section, we develop an algorithm for computing a minimum counterfactual explanation for user u receiving recommended item rec, given the PPR-based recommender framework (Nikolakopoulos and Karypis, 2019). A naïve optimal algorithm enumerates all subsets of the user’s actions E_u, checks whether the removal of each of these subsets replaces rec with a different item as the top recommendation, and finally selects the subset with the minimum size. This approach is exponential in the number of actions of the user.
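The naïve enumeration just described can be sketched directly; it is useful as a correctness reference despite its exponential cost. The toy scoring function `toy_top` is an assumption that stands in for re-running the recommender on the modified graph:

```python
# Brute-force version of the naive optimal algorithm: enumerate subsets of
# the user's actions by increasing size and return the smallest subset whose
# removal changes the top recommendation. `ppr_top` is a stand-in (an
# assumption) for the recommender's top-item function on the modified graph.
from itertools import combinations

def naive_minimal_explanation(actions, ppr_top):
    original = ppr_top(frozenset())        # top item with no actions removed
    for size in range(1, len(actions) + 1):
        for subset in combinations(actions, size):
            if ppr_top(frozenset(subset)) != original:
                return set(subset)         # smallest counterfactual set found
    return None                            # no counterfactual explanation exists

# Toy scoring: removing both a1 and a2 flips the top item from "rec" to "alt".
def toy_top(removed):
    return "alt" if {"a1", "a2"} <= removed else "rec"

explanation = naive_minimal_explanation(["a1", "a2", "a3"], toy_top)
```

Because subsets are enumerated by increasing size, the first counterfactual subset found is guaranteed minimal, at the price of up to 2^|E_u| recommender evaluations.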
To devise a more efficient and practically viable algorithm, we express the PPR(u, v) scores as follows (Jeh and Widom, 2003), with PPR(n, v) denoting the PPR of v personalized for n (i.e., jumping back to n):
(4) PPR(u, v) = \alpha \, \delta(u, v) + (1 - \alpha) \sum_{n \in N_{out}(u)} W(u, n) \, PPR(n, v)
where α denotes the teleportation probability (probability of jumping back to u) and δ is the Kronecker delta function. The only required modification, with regard to (Nikolakopoulos and Karypis, 2019), is the transformation of the transition probability matrix from W to W′ (Eq. 3). For simplicity, we will refer to the adjusted probability matrix as W.
Eq. 4 shows that the PPR of rec personalized for user u, PPR(u, rec), is a function of the PPR values of rec personalized for the neighbors of u. Hence, in order to decrease PPR(u, rec), we can remove edges (u, n) ∈ E_u. To replace the recommendation rec with a different item, a simple heuristic would remove edges in non-increasing order of their contributions W(u, n) · PPR(n, rec). However, although this would reduce the PPR of rec, it also affects and possibly reduces the PPR of other items, too, due to the recursive nature of PPR, where all paths matter.
Let E_u be the set of outgoing edges of a user u and let S be a subset of E_u. The main intuition behind our algorithm is that we can express PPR(u, v) after the removal of S, denoted by PPR_{E\S}(u, v), as a function of two components: PPR_{E\S}(u, u) and the values PPR_{E\E_u}(n, v), where (u, n) ∈ E_u \ S. The score PPR_{E\E_u}(n, v) does not depend on S, and the score PPR_{E\S}(u, u) is independent of v.
Based on these considerations, we present Algorithm 1, proving its correctness in Sec. 4. Algorithm 1 takes as input a graph G, a user u, a recommendation rec, and a set of candidate items. It iterates through the candidate items v* and finds the minimum counterfactual explanation A*, where A* refers to the actions whose removal swaps the order of items rec and v*. In addition, we ensure that after removing A*, we return the item with the highest PPR score as the replacement item. Note that in the next section, we propose an equivalent formulation for the swap condition, eliminating the need for recomputing PPR scores in the modified graph.
The core of our algorithm is the function SwapOrder, which receives as input two items, rec and v*, and a user u. It first sorts the user’s interaction edges in non-increasing order of their contributions. It then removes, at each step, the outgoing interaction edge with the highest contribution, and updates the running scores correspondingly. A bookkeeping variable is strictly positive if, in the current graph configuration, v* outranks rec. This constitutes the main building block of our approach. Fig. 2 illustrates the execution of Algorithm 1 on a toy example.
The time complexity of the algorithm is polynomial in the number of user actions and candidate items, plus the cost of computing PPR for these nodes. The key to avoiding the exponential cost of considering all subsets of E_u is the insight that we need only to compute PPR values for alternative items with personalization based on a graph where all user actions are removed. This is feasible because the action deletions affect only outgoing edges of the teleportation target u, as elaborated in Sec. 4.
The PPR computation could simply re-run a power-iteration algorithm for the entire graph, or compute the principal eigenvector for the underlying matrix. This could be cubic in the graph size (e.g., if we use full-fledged SVD), but it keeps us in the regime of polynomial runtimes. In our experiments, we use the much more efficient reverse local push algorithm (Andersen et al., 2007) for PPR calculations.
4. Correctness Proof
We prove two main results:

PPR_{E\S}(u, v) can be computed as a product of two components, where one depends on the modified graph with the edge set E \ E_u (i.e., removing all user actions) and the other depends on the choice of S but not on the choice of v.

To determine if some S replaces the top node rec with a different node v* which is not an out-neighbor of u, we need to compute only the first of the two components in (i).
Theorem 4.1.
Given a graph G = (V, E), a node u with outgoing edges E_u ⊆ E, a set of edges S ⊆ E_u, and a node v, the PPR of v personalized for u in the modified graph G_{E\S} can be expressed as follows:
PPR_{E\setminus S}(u, v) = f\big(PPR_{E\setminus S}(u, u), \{PPR_{E\setminus E_u}(n, v) : (u, n) \in E_u \setminus S\}\big)
where f is an aggregation function.
Proof.
Assuming that each node has at least one outgoing edge, the PPR can be expressed as the sum over the probabilities of walks of length l starting at node u (Andersen et al., 2006):
(5) PPR(u, v) = \alpha \sum_{l=0}^{\infty} (1 - \alpha)^{l} \, (e_u^{T} W^{l})_v
where e_u is the one-hot vector for u. To analyze the effect of deleting S, we split the walks from u to v into two parts: (i) the part representing the sum over probabilities of walks that start at u and pass again by u, which is equivalent to PPR(u, u)/α (division by α is required as the walk does not stop at u), and (ii) the part representing the sum over probabilities of walks starting at node u and ending at v without revisiting u again, denoted by P_{¬u}(u, v). Combining these constituent parts, PPR can be stated as follows:
(6) PPR(u, v) = \frac{PPR(u, u)}{\alpha} \cdot P_{\neg u}(u, v)
As stated previously, P_{¬u}(u, v) represents the sum over the probabilities of the walks from u to v without revisiting u. We can express these walks using the remaining neighbors of u after removing S:
(7) P_{\neg u}(u, v) = (1 - \alpha) \sum_{(u, n) \in E_u \setminus S} W(u, n) \, P_{\neg u}(n, v)
where P_{¬u}(n, v) refers to the walks starting at n ((u, n) ∈ E_u \ S) and ending at v that do not visit u. We replace P_{¬u}(n, v) with its equivalent formulation PPR_{E\E_u}(n, v), since PPR_{E\E_u}(n, v) in graph G_{E\E_u} is computed as the sum over the probabilities of walks that never pass by u. Eq. 6 can be rewritten as follows:
(8) PPR_{E\setminus S}(u, v) = \frac{PPR_{E\setminus S}(u, u)}{\alpha} \cdot (1 - \alpha) \sum_{(u, n) \in E_u \setminus S} W(u, n) \, PPR_{E\setminus E_u}(n, v)
This equation directly implies:
(9) PPR_{E\setminus S}(u, v) = f\big(PPR_{E\setminus S}(u, u), \{PPR_{E\setminus E_u}(n, v) : (u, n) \in E_u \setminus S\}\big)
∎
Theorem 4.2.
The minimum counterfactual explanation for rec can be computed in polynomial time.
Proof.
We show that there exists a polynomial-time algorithm for finding the minimum set S ⊆ E_u such that the top recommendation in G_{E\S} differs from rec, if such a set exists. Using Theorem 4.1, we show that one can compute whether some v* can replace the original rec as the top recommendation, solely based on PPR scores from a single graph G_{E\E_u} where all user actions are removed:
(10) PPR_{E\setminus S}(u, v^*) > PPR_{E\setminus S}(u, rec) \iff \sum_{(u, n) \in E_u \setminus S} W(u, n) \big[ PPR_{E\setminus E_u}(n, v^*) - PPR_{E\setminus E_u}(n, rec) \big] > 0
The last equivalence is derived from:
(11) \frac{PPR_{E\setminus S}(u, u)}{\alpha} \cdot (1 - \alpha) > 0
For a fixed choice of v*, the summands in expression 10 do not depend on S, and so they are constants for all possible choices of S. Therefore, by sorting the summands and greedily expanding S, removing first the actions whose summands most favor rec, we can grow S from a single action until some v* outranks rec. This approach is then guaranteed to arrive at a minimum subset.
∎
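The greedy selection in the proof above can be sketched as follows. Each remaining action (u, n) contributes a fixed summand W(u, n) · (PPR′(n, v*) − PPR′(n, rec)), where PPR′ is computed once on the graph with all user actions removed; the numbers below are illustrative assumptions:

```python
# Sketch of the greedy step from the proof: remove actions in ascending order
# of their (constant) summands until the remaining sum becomes positive, i.e.,
# until v_star outranks rec. The contribution differences are assumed toy values.

def swap_order(contrib_diffs):
    """contrib_diffs: {action: w_un * (ppr_n_vstar - ppr_n_rec)}.
    Returns the smallest set of actions whose removal makes the remaining sum
    positive (v_star outranks rec), or None if that is impossible."""
    removed, total = [], sum(contrib_diffs.values())
    if total > 0:
        return set()  # v_star already outranks rec; nothing to remove
    # Delete the actions that favor rec the most (most negative difference first).
    for action, diff in sorted(contrib_diffs.items(), key=lambda kv: kv[1]):
        removed.append(action)
        total -= diff
        if total > 0:
            return set(removed)
    return None

diffs = {"a1": -0.30, "a2": -0.05, "a3": 0.20}
minimal_set = swap_order(diffs)
```

Removing the single most rec-favoring action "a1" flips the comparison here, so the returned set has size one, matching the minimality argument.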
5. Graph Experiments
We now describe experiments performed with graphbased recommenders built from real datasets to evaluate Prince.
5.1. Setup
Datasets. We used two real datasets:

The Amazon Customer Review dataset (released by Amazon: s3.amazonaws.com/amazon-reviews-pds/readme.html), and,

The Goodreads review dataset (crawled by the authors of (Wan and McAuley, 2018): sites.google.com/eng.ucsd.edu/ucsdbookgraph/home).
Each record in both datasets consists of a user, an item, its categories, a review, and a rating value (on a 1–5 scale). In addition, a Goodreads data record has the book author(s) and the book description. We augmented the Goodreads collection with social links (users following users) that we crawled from the Goodreads website.
The high diversity of categories in the Amazon data, ranging from household equipment to food and toys, allows scope to examine the interplay of cross-category information within explanations. The key reason for additionally choosing Goodreads is to include the effect of social connections (absent in the Amazon data). The datasets were converted to graphs with “users”, “items”, “categories”, and “reviews” as nodes, and “rated” (user-item), “reviewed” (user-item), “has-review” (item-review), “belongs-to” (item-category) and “follows” (user-user) as edges. In Goodreads, there is an additional node type “author” and an edge type “has-author” (item-author). All the edges, except the ones with type “follows”, are bidirectional. Only ratings with value higher than three were considered, as low-rated items should not influence further recommendations.
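The construction just described can be sketched for a simplified schema (ratings, categories, and social links only; review nodes omitted). The records and node names below are illustrative assumptions:

```python
# Sketch of HIN construction from rating records, under the schema described
# above: all edge types are bidirectional except "follows", and only ratings
# above three add a "rated" edge. Records below are illustrative assumptions.
def build_hin(records, follows):
    edges = set()

    def add(a, b, etype, directed=False):
        edges.add((a, b, etype))
        if not directed:
            edges.add((b, a, etype))  # bidirectional edge types

    for user, item, category, rating in records:
        if rating > 3:                       # low-rated items are dropped
            add(user, item, "rated")
        add(item, category, "belongs_to")
    for follower, followee in follows:
        add(follower, followee, "follows", directed=True)
    return edges

records = [("u1", "i1", "Baby", 5), ("u1", "i2", "Beauty", 2)]
edges = build_hin(records, follows=[("u1", "u2")])
```

Note that the low-rated item "i2" still enters the graph through its category edge, but contributes no "rated" interaction edge for the user.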
Table 1. Statistics of the sampled Amazon and Goodreads datasets (#Users, #Items, #Reviews, #Categories, #Actions).
Sampling. For our experiments, we sampled seed users from both the Amazon and Goodreads datasets whose number of actions fell within a fixed lower and upper bound. The filters served to prune out under-active and power users (potentially bots). Activity graphs were constructed for the sampled users by taking their four-hop neighborhood from the sampled data (Table 1). Four is a reasonably small radius to keep the items relevant and personalized to the seed users. On average, this resulted in a large pool of candidate items in each user’s HIN, for both Amazon and Goodreads.
The graphs were augmented with weighted edges for node similarity. For Amazon, we added review-review edges whose weights were computed using the cosine similarity of the review embeddings, generated with Google’s Universal Sentence Encoder (Cer et al., 2018), with a cutoff threshold to retain only confident pairs. For Goodreads, we added three types of similarity edges: category-category, book-book and review-review, with the same similarity measure and corresponding cutoff thresholds. We crawled category descriptions from the Goodreads website and used book descriptions and review texts from the raw data. Table 1 gives some statistics about the sampled datasets.
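The thresholding of similarity edges can be sketched as follows. The embedding vectors and the threshold value are illustrative assumptions (the paper uses Universal Sentence Encoder embeddings and dataset-specific cutoffs):

```python
# Sketch of thresholded similarity edges: cosine similarity between embedding
# vectors, keeping only confident pairs. Vectors and threshold are assumptions.
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def similarity_edges(embeddings, threshold):
    """Return weighted edges (i, j, sim) for all pairs above the threshold."""
    ids = sorted(embeddings)
    edges = []
    for idx, i in enumerate(ids):
        for j in ids[idx + 1:]:
            sim = cosine(embeddings[i], embeddings[j])
            if sim >= threshold:
                edges.append((i, j, sim))
    return edges

emb = {"r1": [1.0, 0.0], "r2": [0.9, 0.1], "r3": [0.0, 1.0]}
confident_pairs = similarity_edges(emb, threshold=0.9)
```

Only the near-parallel pair ("r1", "r2") survives the cutoff, mirroring how a high threshold keeps the similarity graph sparse and confident.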
Initialization. The replacement item for rec is always chosen from the original top-k recommendations generated by the system; we systematically investigate the effect of k on the size of explanations in our experiments (with a fixed default for k). Prince does not need to be restricted to an explicitly specified candidate set, and can actually operate over the full space of items I. In practice, however, replacement items need to be guided by some measure of relevance to the user, or item-item similarity, so as not to produce degenerate or trivial explanations if rec is replaced by some arbitrary item from a pool of thousands.
We use the standard teleportation probability α (Brin and Page, 1998). The similarity-walk parameter β is fixed across experiments. To compute PPR scores, we used the reverse local push method (Zhang et al., 2016) with dataset-specific error tolerances for Amazon and Goodreads. With these settings, Prince and the baselines were executed on all user-specific HINs to compute an alternative recommendation (i.e., replacement item) rec* and a counterfactual explanation set A*.
Baselines. Since Prince is an optimal algorithm with correctness guarantees, it always finds minimal sets of actions that replace rec (if they exist). We wanted to investigate to what extent other, more heuristic, methods approximate the same effects. To this end, we compared Prince against two natural baselines:

Highest Contributions (HC): This is analogous to counterfactual evidence in feature-based classifiers for structured data (Chen et al., 2017; Moeyersoms et al., 2016). It defines the contribution score of a user action (u, n) to the recommendation score PPR(u, rec) as W(u, n) · PPR(n, rec) (Eq. 4), and iteratively deletes edges with highest contributions until the top-ranked item changes to a different item.

Shortest Paths (SP): SP computes the shortest path from u to rec and deletes the first edge on this path. This step is repeated on the modified graph, until the top-ranked item changes to a different item.
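One step of the SP baseline can be sketched as follows: find a shortest path from the user to the recommended item via BFS and return the path’s first edge, which the baseline would delete before re-checking the top recommendation. The toy graph is an illustrative assumption:

```python
# Sketch of one SP-baseline step: BFS shortest path from user to rec, then
# return the path's first edge (the one the baseline deletes next).
from collections import deque

def first_edge_on_shortest_path(graph, user, rec):
    if user == rec:
        return None
    parent = {user: None}
    queue = deque([user])
    while queue:
        node = queue.popleft()
        if node == rec:
            # Walk back to the first hop after `user`.
            while parent[node] != user:
                node = parent[node]
            return (user, node)
        for nxt in graph.get(node, []):
            if nxt not in parent:
                parent[nxt] = node
                queue.append(nxt)
    return None  # rec unreachable from user

graph = {"u": ["i1", "i2"], "i1": ["c"], "i2": ["c"], "c": ["rec"]}
edge_to_delete = first_edge_on_shortest_path(graph, "u", "rec")
```

The full baseline would remove this edge from the graph, recompute the top recommendation, and repeat until the recommendation changes.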
Evaluation Metric. The metric for assessing the quality of an explanation is its size, that is, the number of actions in A* for Prince, and the number of edges deleted by HC and SP.
5.2. Results and Insights
Table 2. Average explanation sizes for Prince, HC, and SP on Amazon and Goodreads, for varying values of k (asterisks denote statistically significant differences).
Table 3. Runtimes of Prince on Amazon and Goodreads, with precomputed vs. dynamically computed PPR scores, for varying parameter settings.
Method  Explanation for “Baby stroller” with category “Baby” [Amazon] 

Prince  Action 1: You rated highly “Badger Basket Storage Cubby” with category “Baby” 
Replacement Item: “Google Chromecast HDMI Streaming Media Player” with categories “Home Entertainment”  
HC  Action 1: You rated highly “Men’s hair paste” with category “Beauty”  
Action 2: You reviewed “Men’s hair paste” with category “Beauty” with text “Good product. Great price.”  
Action 3: You rated highly “Badger Basket Storage Cubby” with category “Baby”  
Action 4: You rated highly “Straw bottle” with category “Baby”  
Action 5: You rated highly “3 Sprouts Storage Caddy” with category “Baby”  
Replacement Item: “Bathtub Waste And Overflow Plate” with categories “Home Improvement”  
SP  Action 1: You rated highly “Men’s hair paste” with category “Beauty”  
Action 2: You rated highly “Badger Basket Storage Cubby” with category “Baby”  
Action 3: You rated highly “Straw bottle” with category “Baby”  
Action 4: You rated highly “3 Sprouts Storage Caddy” with category “Baby”  
Replacement Item: “Google Chromecast HDMI Streaming Media Player” with categories “Home Entertainment”  
Method  Explanation for “The Multiversity” with categories “Comics, Historicalfiction, Biography, Mystery” [Goodreads] 
Prince  Action 1: You rated highly “Blackest Night” with categories “Comics, Fantasy, Mystery, Thriller” 
Action 2: You rated highly “Green Lantern” with categories “Comics, Fantasy, Children”  
Replacement item: “True Patriot: Heroes of the Great White North” with categories “Comics, Fiction”  
HC  Action 1: You follow User ID  
Action 2: You rated highly “Blackest Night” with categories “Comics, Fantasy, Mystery, Thriller”  
Action 3: You rated highly “Green Lantern” with categories “Comics, Fantasy, Children”  
Replacement item: “The Lovecraft Anthology: Volume 2” with categories “Comics, Crime, Fiction”  
SP  Action 1: You follow User ID  
Action 2: You rated highly “Fahrenheit 451” with categories “Fantasy, Youngadult, Fiction”  
Action 3: You rated highly “Darkly Dreaming Dexter (Dexter, #1)” with categories “Mystery, Crime, Fantasy”  
And 6 more actions  
Replacement item: “The Lovecraft Anthology: Volume 2” with categories “Comics, Crime, Fiction” 
We present our main results in Table 2 and discuss insights below. These comparisons were performed for different values of the parameter k. Wherever applicable, statistical significance was tested under the two-tailed paired t-test. Anecdotal examples of explanations by Prince and the baselines are given in Table 4. In the Amazon example, we observe that our method produces a topically coherent explanation, with both the recommendation and the explanation items in the same category. The HC and SP methods give larger explanations, but with poorer quality, as the first action in both methods seems unrelated to the recommendation. In the Goodreads example, both HC and SP yield the same replacement item, which is different from that of Prince.
Approximating Prince is difficult. Explanations generated by Prince are more concise and hence more user-comprehensible than those by the baselines. This advantage is quite pronounced; for example, in Amazon, all the baselines yield at least one more action in the explanation set on average. Note that this translates into unnecessary effort for users who want to act upon the explanations.
Explanations shrink with increasing k. The size of explanations shrinks as the top-k candidate set for choosing the replacement item is expanded. For example, the average explanation size for Prince on Amazon drops noticeably as k grows. This is due to the fact that with a growing candidate set, it becomes easier to find an item that can outrank rec.
Prince is efficient. To generate a counterfactual explanation, Prince only relies on the PPR_{E\E_u}(n, v) scores in the graph configuration G_{E\E_u} (where all the outgoing edges of u are deleted). With these scores precomputed for all candidate items, Prince could find the explanation for each (u, rec) pair within milliseconds on average. Table 3 shows runtimes of Prince for different parameters. As we can see, the runtime grows linearly with k in both datasets. This is justified by the candidate-item loop in Algorithm 1. Computing the PPR scores on the fly slows down the algorithm. The second and the fourth columns in Table 3 present the runtimes of Prince when the scores are computed using the reverse push algorithm for dynamic graphs (Zhang et al., 2016); increasing the precision of this computation makes it slower. All experiments were performed on an Intel Xeon server.
6. User Study
Qualitative survey on usefulness. To evaluate the usefulness of counterfactual (action-oriented) explanations, we conducted a survey with Amazon Mechanical Turk (AMT) Master workers (www.mturk.com/help#what_are_masters). In this survey, we showed workers three recommendation items (“Series Camelot”, “Pregnancy guide book”, “Nike backpack”) and two different explanations for each. One explanation was limited to only the user’s own actions (action-oriented), and the other was a path connecting the user to the item (connection-oriented).
We asked the workers three questions: (i) Which method do you find more useful?, where a majority chose the action-oriented method; (ii) How do you feel about being exposed through explanations to others?, where many workers expressed a privacy concern, either through complete disapproval or through a demand for anonymization; (iii) Personally, which type of explanation matters to you more: “action-oriented” or “connection-oriented”?, where a majority of the workers chose the action-oriented explanations. We described action-oriented explanations as those allowing users to control their recommendations, while connection-oriented ones reveal connections between the user and the item via other users and items.
Quantitative measurement of usefulness. In a separate study (conducted only on Amazon data due to resource constraints), we compared Prince to a path-based explanation method (Yang et al., 2018) (referred to as CredPaths). We used the credibility measure from (Yang et al., 2018), scoring paths in descending order of the product of their edge weights. We computed the best path for all user-item pairs (Sec. 5.1). This resulted in paths of a maximum length of three edges (four nodes including user u and rec). For a fair comparison in terms of cognitive load, we eliminated all data points where Prince computed larger counterfactual sets, and sampled a fixed number of the remaining user-item pairs. As explanations generated by Prince and CredPaths have a different format of presentation (a list of actions vs. a path), we evaluated each method separately to avoid presentation bias. For the sake of readability, we broke the paths into edges and showed each edge on a new line. Having three AMT Masters for each task, we collected an equal number of annotations for Prince and for CredPaths.
A typical data point looks like a row in Table 6, which shows representative examples (Goodreads shown only for completeness). We divided the samples into ten HITs (Human Intelligence Tasks, a unit of work on AMT) with an equal number of data points in each HIT. For each data point, we showed a recommendation item and its explanation, and asked users about the usefulness of the explanation on a three-point scale (“Not useful at all”, “Partially useful”, and “Completely useful”). For this, workers had to imagine that they were a user of an e-commerce platform who received the recommendations as a result of performing some actions on the platform. Only AMT Master workers were allowed to provide assessments.
To detect spammers, we planted one honeypot in each of the ten HITs: a completely impertinent explanation. Subsequently, all annotations of detected spammers (workers who rated such irrelevant explanations as “completely useful”) were removed (about 25% of all annotations).
Table 5. Mean usefulness ratings with standard deviations and sample sizes for Prince and CredPaths (Yang et al., 2018), overall and for slices of Prince explanations by size (asterisks denote statistical significance).
Table 6. Representative explanation examples (Goodreads shown only for completeness).

[Amazon] Explanation for “Baby stroller” with category “Baby”

Prince:
Action 1: You rated highly “Badger Basket Storage Cubby” with category “Baby”
Replacement Item: “Google Chromecast HDMI Streaming Media Player” with category “Home Entertainment”

CredPaths:
You rated highly “Men’s hair paste” with category “Beauty”
that was rated by “Some user”
who also rated highly “Baby stroller” with category “Baby”

[Goodreads] Explanation for “The Multiversity” with categories “Comics, Historical-fiction, Biography, Mystery”

Prince:
Action 1: You rated highly “Blackest Night” with categories “Comics, Fantasy, Mystery, Thriller”
Action 2: You rated highly “Green Lantern” with categories “Comics, Fantasy, Children”
Replacement Item: “True Patriot: Heroes of the Great White North” with categories “Comics, Fiction, Crime, Fiction”

CredPaths:
You follow “Some user”
who has rated highly “The Multiversity” with categories “Comics, Historical-fiction, Biography, Mystery”
Table 7. Typical comments from AMT workers, with the explanation method in brackets.

“Based on multiple actions explained simply and clearly.” [Prince]
“The recommendation is for a home plumbing item, but the action rated a glue.” [Prince]
“The explanation is complete as it goes into full details of how to use the product, which is in alignment of my review and useful to me.” [CredPaths]
“It’s weird to be given recommendations based on other people.” [CredPaths]
Table 5 shows the results of our user study. It gives average scores and standard deviations, and indicates the statistical significance of pairwise comparisons with an asterisk. Prince clearly obtains higher usefulness ratings from the AMT judges on average. Krippendorff’s alpha (Krippendorff, 2018) for Prince and CredPaths indicated moderate to fair inter-annotator agreement, respectively. The superiority of Prince also holds for slices of samples grouped by the size of the generated explanations. We also asked Turkers to provide succinct justifications for their scores on each data point. Table 7 shows some typical comments, with the method that generated each explanation in brackets.
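For reference, Krippendorff's alpha can be computed from a coincidence matrix as sketched below. This is a minimal illustrative implementation for nominal data (the study's three-point scale is arguably ordinal, for which libraries such as the `krippendorff` Python package provide the appropriate distance metric); it is not the study's actual code.

```python
from collections import Counter
from itertools import permutations

def krippendorff_alpha_nominal(units):
    """Nominal-scale Krippendorff's alpha.
    `units` is a list of rating lists, one per annotated item
    (missing ratings are simply omitted). Returns alpha in [-1, 1]."""
    coincidences = Counter()
    for ratings in units:
        m = len(ratings)
        if m < 2:
            continue  # units with a single rating are unpairable
        # Each ordered pair of ratings within a unit adds 1/(m-1).
        for i, j in permutations(range(m), 2):
            coincidences[(ratings[i], ratings[j])] += 1.0 / (m - 1)
    marginals = Counter()
    for (c, k), v in coincidences.items():
        marginals[c] += v
    n = sum(marginals.values())
    d_obs = sum(v for (c, k), v in coincidences.items() if c != k) / n
    d_exp = sum(marginals[c] * marginals[k]
                for c in marginals for k in marginals if c != k) / (n * (n - 1))
    return 1.0 - d_obs / d_exp if d_exp > 0 else 1.0

# Three workers rate two explanations on a three-point scale (0, 1, 2).
print(krippendorff_alpha_nominal([[2, 2, 2], [0, 0, 1]]))
```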
7. Related Work
Foundational work on explainability for collaborative-filtering-based recommenders was done by Herlocker et al. (Herlocker et al., 2000). Over time, generating explanations (as in (Yang et al., 2018)) has become tightly coupled with building systems geared towards producing more transparent recommendations (as in (Balog et al., 2019)). For broad surveys, see (Zhang and Chen, 2018; Tintarev and Masthoff, 2007). In methods using matrix or tensor factorization (Zhang et al., 2014; Chen et al., 2016; Wang et al., 2018b), the goal has been to make latent factors more tangible. Recently, interpretable neural models have become popular, especially for text (Seo et al., 2017; Chen et al., 2018a, b) and images (Chen et al., 2019), where attention mechanisms over words, reviews, items, or zones in images have been vital for interpretability. Efforts have also been made to generate readable explanations using models like LSTMs (Li et al., 2017) or GANs (Lu et al., 2018).
Representing users, items, categories, and reviews as a knowledge graph or a heterogeneous information network (HIN) has become popular, where explanations take the form of paths between the user and an item. This paradigm comprises a variety of mechanisms: learning path embeddings (Ai et al., 2018; Wang et al., 2018b), propagating user preferences (Wang et al., 2018a), learning and reasoning with explainable rules (Ma et al., 2019; Xian et al., 2019), and ranking user-item connections (Yang et al., 2018; Ghazimatin et al., 2019). In this work, we choose the recent approach in (Yang et al., 2018) as a representative of the family of path-based recommenders against which to compare Prince. Finally, post-hoc or model-agnostic rationalizations for black-box models have attracted interest. Approaches include association rule mining (Peake and Wang, 2018), supervised ranking of user-item relationships (Ghazimatin et al., 2019), and reinforcement learning (Wang et al., 2018c).
Random walks over HINs have been pursued by a suite of works, including (Desrosiers and Karypis, 2011; Cooper et al., 2014; Christoffel et al., 2015; Eksombatchai et al., 2018; Jiang et al., 2018). In a nutshell, the Personalized PageRank (PPR) of an item node in the HIN is used as a ranking criterion for recommendations. (Nikolakopoulos and Karypis, 2019) introduced the RecWalk method, proposing a random walk with a nearly uncoupled Markov chain; our work uses this framework. To the best of our knowledge, we are the first to study the problem of computing minimum subsets of edge removals (user actions) that change the top-ranked node in a counterfactual setup. Prior research on dynamic graphs, such as (Csáji et al., 2014; Kang et al., 2018), has addressed related issues, but not this exact problem. A separate line of research focuses on the efficient computation of PPR. Approximate algorithms include power iteration (Page et al., 1999), local push (Andersen et al., 2006, 2007; Zhang et al., 2016), and Monte Carlo methods (Avrachenkov et al., 2007; Bahmani et al., 2010).
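As background, Personalized PageRank with restart to a source node can be approximated by power iteration (Page et al., 1999). The toy sketch below illustrates the idea with numpy; it is not Prince's implementation, and the adjacency convention (A[i, j] = 1 for an edge j → i) is an assumption of this example.

```python
import numpy as np

def personalized_pagerank(A, source, alpha=0.15, iters=100):
    """Approximate PPR by power iteration.
    A is an n x n adjacency matrix with A[i, j] = 1 for an edge j -> i;
    `alpha` is the restart probability to the source node."""
    n = A.shape[0]
    out = A.sum(axis=0)
    # Column-stochastic transition matrix; dangling columns stay all-zero.
    P = A / np.where(out == 0, 1, out)
    e = np.zeros(n)
    e[source] = 1.0
    pi = e.copy()
    for _ in range(iters):
        pi = (1 - alpha) * (P @ pi) + alpha * e
        pi += (1 - pi.sum()) * e  # reassign any dangling mass to the source
    return pi

# Toy 3-node graph: 0 -> 1, 0 -> 2, 1 -> 0, 1 -> 2, 2 -> 1.
A = np.array([[0., 1., 0.],
              [1., 0., 1.],
              [1., 1., 0.]])
print(personalized_pagerank(A, source=0))
```

The item node with the highest PPR score (excluding the user's own nodes) would be the recommendation.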
8. Conclusions and Future Work
This work explored a new paradigm of action-based explanations in graph recommenders, with the goal of identifying minimal sets of user actions with the counterfactual property that their absence would change the top-ranked recommendation to a different item. In contrast to prior (largely path-based) work on recommender explanations, this approach offers two advantages: (i) explanations are concise, scrutable, and actionable, as they are minimal sets derived in a counterfactual setup over a user’s own purchases, ratings, and reviews; and (ii) explanations do not expose any information about other users, thus avoiding privacy breaches by design.
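To see why the search space is exponential, consider the naive brute-force baseline: enumerate action subsets by increasing size until removing one changes the top-ranked item. The sketch below uses a hypothetical toy ranker purely for illustration; Prince's contribution is achieving the same result in polynomial time.

```python
from itertools import combinations
from collections import Counter

def minimal_counterfactual(actions, rec_item, rank_fn):
    """Brute-force baseline: find a smallest subset of user actions whose
    removal changes the top recommendation away from `rec_item`.
    Exponential in len(actions) -- exactly what Prince avoids."""
    for size in range(1, len(actions) + 1):
        for subset in combinations(actions, size):
            remaining = [a for a in actions if a not in subset]
            if rank_fn(remaining) != rec_item:
                return set(subset)
    return None  # no counterfactual exists within the user's own actions

def toy_rank(actions):
    """Hypothetical ranker: recommend the category interacted with most."""
    counts = Counter(cat for _, cat in actions)
    return counts.most_common(1)[0][0] if counts else "none"

acts = [("rated stroller", "Baby"), ("rated cubby", "Baby"),
        ("rated paste", "Beauty")]
print(minimal_counterfactual(acts, "Baby", toy_rank))
# Both "Baby" actions must be removed before the recommendation changes.
```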
The proposed Prince method implements these principles using random walks for Personalized PageRank scores as the recommender model. We presented an efficient algorithm, with a proof of correctness, for computing counterfactual explanations despite the potentially exponential search space of user-action subsets. Extensive experiments on large real-life data from Amazon and Goodreads showed that simpler heuristics fail to find the best explanations, whereas Prince guarantees optimality. Studies with AMT Masters demonstrated the superiority of Prince over baselines in terms of explanation usefulness.
Acknowledgements
This work was partly supported by the ERC Synergy Grant 610150 (imPACT) and the DFG Collaborative Research Center 1223. We would like to thank Simon Razniewski from the MPI for Informatics for his insightful comments on the manuscript.
Footnotes
 copyright: acmcopyright
 journalyear: 2020
 conference: The Thirteenth ACM International Conference on Web Search and Data Mining; February 3–7, 2020; Houston, TX, USA
 booktitle: The Thirteenth ACM International Conference on Web Search and Data Mining (WSDM ’20), February 3–7, 2020, Houston, TX, USA
 price: 15.00
 ccs: Information systems Recommender systems
References
 Learning heterogeneous knowledge base embeddings for explainable recommendation. Algorithms 11 (9).
 Local computation of PageRank contributions. In WAW.
 Local graph partitioning using PageRank vectors. In FOCS.
 Monte Carlo methods in PageRank computation: When one iteration is sufficient. SIAM Journal on Numerical Analysis 45 (2).
 Fast incremental and personalized PageRank. In VLDB.
 Transparent, scrutable and explainable user models for personalized recommendation. In SIGIR.
 The anatomy of a large-scale hypertextual Web search engine. Computer Networks and ISDN Systems 30 (1–7).
 Universal sentence encoder for English. In EMNLP.
 Neural attentional rating regression with review-level explanations. In WWW.
 Enhancing transparency and control when drawing data-driven inferences about individuals. Big Data 5 (3).
 Personalized fashion recommendation with visual explanations based on multimodal attention network: Towards visually explainable recommendation. In SIGIR.
 Learning to rank features for recommendation over multiple categories. In SIGIR.
 Sequential recommendation with user memory networks. In WSDM.
 Blockbusters and wallflowers: Accurate, diverse, and scalable recommendations with random walks. In RecSys.
 Random walks in recommender systems: Exact computation and simulations. In WWW.
 PageRank optimization by edge selection. Discrete Applied Mathematics 169.
 A comprehensive survey of neighborhood-based recommendation methods. In Recommender Systems Handbook.
 Pixie: A system for recommending 3+ billion items to 200+ million users in real-time. In WWW.
 FAIRY: A framework for understanding relationships between users’ actions and their social feeds. In WSDM.
 Topic-sensitive PageRank: A context-sensitive ranking algorithm for Web search. TKDE 15 (4).
 Explaining collaborative filtering recommendations. In CSCW.
 TrustWalker: A random walk model for combining trust-based and item-based recommendation. In KDD.
 Scaling personalized Web search. In WWW.
 Recommendation in heterogeneous information networks based on generalized random walk model and Bayesian personalized ranking. In WSDM.
 AURORA: Auditing PageRank on large graphs. In Big Data.
 Matrix factorization techniques for recommender systems. Computer (8).
 Personalized explanations for hybrid recommender systems. In IUI.
 Content analysis: An introduction to its methodology. Sage.
 Let me explain: Impact of personal and impersonal explanations on trust in recommender systems. In CHI.
 Neural rating regression with abstractive tips generation for recommendation. In SIGIR.
 Why I like it: Multi-task learning for recommendation and explanation. In RecSys.
 Jointly learning explainable rules for recommendation with knowledge graph. In WWW.
 Personalized social recommendations: Accurate or private? In VLDB.
 Explaining data-driven document classifications. MIS Quarterly 38 (1).
 IJCAI 2019 Workshop on Explainable AI (XAI).
 Explaining classification models built on high-dimensional sparse data. arXiv preprint arXiv:1607.06280.
 RecWalk: Nearly uncoupled random walks for top-N recommendation. In WSDM.
 The PageRank citation ranking: Bringing order to the Web. Technical report, Stanford InfoLab.
 Explanation mining: Post hoc interpretability of latent factor models for recommendation systems. In KDD.
 Why should I trust you?: Explaining the predictions of any classifier. In KDD.
 KDD Workshop on Explainable AI for Fairness, Accountability, and Transparency.
 Interpretable convolutional neural networks with dual local and global attention for review rating prediction. In RecSys.
 A survey of heterogeneous information network analysis. TKDE 29 (1).
 Semantic path based personalized recommendation on weighted heterogeneous information networks. In CIKM.
 A survey of explanations in recommender systems. In Workshop on Ambient Intelligence, Media and Sensing.
 Item recommendation on monotonic behavior chains. In RecSys.
 RippleNet: Propagating user preferences on the knowledge graph for recommender systems. In CIKM.
 Explainable recommendation via multi-task learning in opinionated text data. In SIGIR.
 Explainable reasoning over knowledge graphs for recommendation. In AAAI.
 A reinforcement learning framework for explainable recommendation. In ICDM.
 Reinforcement knowledge graph reasoning for explainable recommendation. In SIGIR.
 Towards interpretation of recommender systems with sorted explanation paths. In ICDM.
 Personalized entity recommendation: A heterogeneous information network approach. In WSDM.
 Recommendation in heterogeneous information networks with implicit user feedback. In RecSys.
 SHNE: Representation learning for semantic-associated heterogeneous networks. In WSDM.
 Approximate personalized PageRank on dynamic graphs. In KDD.
 CVPR 2019 Workshop on Explainable AI.
 Explainable recommendation: A survey and new perspectives. arXiv preprint arXiv:1804.11192.
 Explicit factor models for explainable recommendation based on phrase-level sentiment analysis. In SIGIR.
 EARS 2019: The 2nd International Workshop on ExplainAble Recommendation and Search. In SIGIR.