A Text-based Deep Reinforcement Learning Framework for Interactive Recommendation
Abstract
Due to its nature of learning from dynamic interactions and planning for long-run performance, reinforcement learning (RL) has recently received much attention in interactive recommender systems (IRSs). IRSs usually face the large discrete action space problem, which makes most existing RL-based recommendation methods inefficient. Moreover, data sparsity is another challenging problem that most IRSs are confronted with. While textual information like reviews and descriptions is less sensitive to sparsity, existing RL-based recommendation methods either neglect it or are not suitable for incorporating it. To address these two problems, in this paper we propose TDDPG-Rec, a Text-based Deep Deterministic Policy Gradient framework for interactive recommendation. Specifically, we leverage textual information to map items and users into a feature space, which greatly alleviates the sparsity problem. Moreover, we design an effective method to construct an action candidate set. Using the policy vector dynamically learned by TDDPG-Rec, which expresses the user's preference, we can select actions from the candidate set effectively. Through extensive experiments on three public datasets, we demonstrate that TDDPG-Rec achieves state-of-the-art performance over several baselines in a time-efficient manner.
1 Introduction
In the era of information explosion, recommender systems play a critical role in alleviating the information overload problem. Recently, the interactive recommender system (IRS) [31], which continuously recommends items to individual users and receives their feedback to refine its recommendation policy, has received much attention and plays an important role in personalized services such as TikTok, Pandora, and YouTube.
In the past few years, there have been some attempts to address the interactive recommendation problem by modeling the recommendation process as a multi-armed bandit (MAB) problem [16, 31, 24], but these methods are not explicitly designed for long-term planning, which makes their performance unsatisfactory [4]. It is well recognized that reinforcement learning (RL) performs excellently in finding policies for interactive long-running tasks, such as playing computer games [19] and solving simulated physics problems [17]. Therefore, it is natural to introduce RL to model the interactive recommendation process. In fact, there have recently been some works applying RL to the interactive recommendation problem [32, 28, 29, 10]. However, most existing RL-based methods, including all Deep Q-learning Network (DQN) based methods [32, 29, 5] and most Deep Deterministic Policy Gradient (DDPG) based methods [10, 28], suffer from making each decision in time linear in the size of the action space, i.e., the number of available items, which makes them inefficient (or unscalable) when the action space is large.
To improve efficiency, Dulac-Arnold et al. [9] proposed, based on DDPG, to first learn an action representation (vector) in a continuous hidden space, and then find the valid item by a nearest-neighbor search. However, such a method ignores the importance of each dimension in the action vector. Moreover, it still needs to find the nearest neighbors over the whole action space, which is time-consuming. Recently, Chen et al. [4] proposed a tree-structured policy gradient recommendation (TPGR) framework, within which a balanced hierarchical clustering tree is built over the items. Picking an item is then formulated as seeking a path from the root to a certain leaf of the tree, which dramatically reduces the time complexity. But this method introduces the burden of building a clustering tree; in particular, when new items appear frequently, the tree needs to be reconstructed, which can be costly.
On the other hand, most existing RL-based recommendation methods use past interaction data, such as ratings, purchase logs, or viewing history, to model user preferences and item features [9, 30, 4]. A major limitation of such methods is that they may suffer serious performance degradation under the data sparsity problem, which is very common in real-world recommendation systems. As is well known, textual information like reviews written by users and item descriptions provided by suppliers contains more knowledge than interaction data and is less sensitive to data sparsity. Nowadays, textual information is readily available on many e-commerce and review websites, such as Amazon and Yelp. Thanks to the invention of word embedding, applying textual information to recommendation is feasible, and there have been some successful attempts in conventional recommender systems [33, 3, 7]. But for IRSs, existing RL-based methods either neglect to leverage textual information, or are not suitable for incorporating it due to their unique structures for processing rating sequences.
In this paper, we propose a Text-based Deep Deterministic Policy Gradient framework for IRSs (TDDPG-Rec). Specifically, we utilize textual information and pre-trained word vectors [20] to embed items and users into a continuous feature space, which, to a great extent, alleviates the data sparsity problem. Then we classify users into several clusters by the K-means algorithm [1]. Next, based on the idea of collaborative filtering, we construct an action candidate set, which consists of positive, negative and ordinary items selected according to the user's historical logs and the classification results. Afterwards, we use a policy vector, dynamically learned by the actor part of TDDPG-Rec, to express the user's preference in the feature space. Finally, we use the policy vector to select items from the candidate set to form the action for recommendation.
Figure 1 gives an example to help understand the policy vector. Suppose a user selects a movie according to preferences that can be represented as explicit policies such as Prefer Detective Comics, Insensitive to genres and Like Superman. With our method, a policy vector in the feature space can be learned, where the value of each dimension represents how much emphasis this user places on that dimension of the feature space. By taking the dot product between the policy vector and the item vectors, we can finally choose the movie Superman Returns, which obtains the highest score, for recommendation (assuming Top-1 recommendation here).
Moreover, since it is too expensive to train and test our model in an online manner, we build an environment simulator to mimic online environments with principles derived from real-world data. Through extensive experiments on several real-world datasets with different settings, we demonstrate that TDDPG-Rec achieves high efficiency and remarkable performance improvement over several state-of-the-art baselines, especially on large-scale, high-sparsity datasets. To sum up, the main contributions of this work are as follows:

By utilizing textual information and pre-trained word vectors, we embed items and users into a continuous feature space to reduce the negative influence of rating sparsity.

We express the preferences of users by implicit policy vectors and propose a DDPG-based method to learn the policy vectors dynamically. Moreover, based on the idea of collaborative filtering, we classify users into several clusters and build the candidate set. The policy vector, combined with the candidate set, is used to select the items that form an action, which effectively reduces the scale of the action space.

Extensive experiments are conducted on three benchmark datasets and the results verify the high efficiency and superior performance of TDDPG-Rec over state-of-the-art methods.
The remainder of this paper is organized as follows: Section 2 discusses related work; Section 3 formally defines the research problem and details the proposed TDDPG-Rec model, as well as the corresponding learning algorithm; Section 4 presents and analyzes the experimental results; finally, Section 5 concludes the paper with some remarks.
2 Related Work
2.1 RL-based Recommendation Methods
RL-based recommendation methods usually formulate the recommendation procedure as a Markov Decision Process (MDP). They explicitly model the user's dynamic status and plan for long-run performance [22, 32, 10, 28, 29, 5, 27]. As mentioned earlier, most existing RL-based methods suffer from the large-scale discrete action space problem.
There have been some good attempts to address the large-scale discrete action space problem in IRSs. Dulac-Arnold et al. [9] proposed to leverage prior information about the actions to embed them in a continuous space and generate a proto-action, and then find a set of discrete actions closest to the proto-action as candidates in logarithmic time via a nearest-neighbor search. This method ignores the negative influence of the dimensions that users do not care about, which sometimes makes it fail to find proper actions. Moreover, the nearest-neighbor search must still be conducted over the whole action space, which incurs a high runtime overhead. Zhao et al. [30] used the actor part of an Actor-Critic network to obtain weight vectors, each of which picks the maximum-score item from the remaining items. But the relationship among these vectors is blurry, so the order of the items cannot be explained. Based on DPG, Chen et al. [4] proposed a tree-structured policy gradient recommendation (TPGR) framework. In TPGR, a balanced hierarchical clustering tree is built over all the items; making a decision is then formulated as seeking a path from the root to a certain leaf of the clustering tree, which reduces the time complexity significantly. But this method only supports Top-1 recommendation. Moreover, when new items appear frequently, the clustering tree needs to be reconstructed, which incurs extra cost.
2.2 Textual Information for Recommendation
Most recommendation models (including RL-based ones) that merely exploit the interaction matrix face the data sparsity problem. The large amount of knowledge contained in textual information can potentially alleviate this problem [33]. The development of deep learning in natural language processing makes it possible to use human-readable textual information to enhance recommendation performance [33, 3, 7, 6]. Reviews and descriptions are the most important textual information in recommender systems. The reviews, which contain users' attitudes, and the descriptions, which contain items' advantages, can, along with the ratings, reveal the preferences of users. There are works that use sentiment analysis [3], convolutional neural networks [33, 8] and word vectors pre-trained on large corpora [7] to obtain vectors from textual information. These vectors are then incorporated into the proposed models to improve recommendation performance.
IRSs also suffer from the rating sparsity problem, but so far, we are not aware of any recommendation method for IRSs that utilizes textual information. Most existing RL-based methods for IRSs either neglect to incorporate textual information, or have difficulty utilizing it, since they take the rating sequence, which has time-related structure, as the input of their models [32, 4]. Note that in the domain of conversational recommender systems (CRSs), Basile et al. [2] proposed a framework that combines deep learning and reinforcement learning and uses text-based features to provide relevant recommendations and produce meaningful dialogues. But different from CRSs, in our RL-based method for IRSs, the textual information is used to learn the implicit long-term preferences of users, not their proactive immediate needs.
3 Proposed Method
3.1 Problem Formulation
We consider an interactive recommendation system with a user set $\mathcal{U}$ and an item set $\mathcal{I}$, and use $R$ to denote the rating matrix, where $r_{u,i}$ is the rating of user $u$ on item $i$. This kind of interactive Top-$N$ recommendation process can be modeled as a special Markov Decision Process (MDP), whose key components are defined as follows.

State. We use $\mathcal{S}$ to denote the state space. A state $s_t \in \mathcal{S}$ is defined as the interaction between a user and the recommender system, which can be represented by item vectors.

Action. We use $\mathcal{A}$ to denote the action space. An action $a_t \in \mathcal{A}$ contains $N$ ordered items for recommendation, each of which is represented by a vector.

Reward function. After receiving an action $a_t$ at state $s_t$, our environment simulator returns a reward $r_t$, which reflects the user's feedback on the recommended items. We use $\rho(s_t, a_t)$ to denote the reward function.

Transition. In our model, since the state is a set of item vectors, once the action is determined and the user’s feedback is given, the state transition is also determined.
Consider an agent that interacts with the environment in discrete timesteps. At each timestep $t$, the agent receives a state $s_t$ by observing the current environment, then takes an action $a_t$ and gets a reward $r_t$. The agent's behavior is defined by a policy $\pi$, which maps states to a probability distribution over actions, i.e., $\pi: \mathcal{S} \rightarrow P(\mathcal{A})$. Based on the above notations, we can define the instantiated MDP for our recommendation problem, $(\mathcal{S}, \mathcal{A}, \rho, T, \gamma)$, where $T$ is the maximal decision step and $\gamma$ is the discount factor. Our objective in this work is to learn a policy that maximizes the expected discounted cumulative reward.
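The objective above can be sketched as a small helper; the function name and the default discount factor are illustrative, not the paper's:

```python
def discounted_return(rewards, gamma=0.9):
    """Learning objective sketch: the discounted cumulative reward
    sum_t gamma^t * r_t collected over one interaction episode."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))
```

For example, with `gamma = 0.5`, a three-step episode with unit rewards yields `1 + 0.5 + 0.25 = 1.75`.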
3.2 Framework Overview
Figure 2 gives an overview of our framework, which contains two major steps: data preparation and training. In data preparation, we first embed items to get item vectors by leveraging textual information. Based on the derived item vectors and users' historical logs, we embed users into the same feature space. Next, we classify the users into several clusters by K-means. In the training phase, we train a unique model for each cluster, with the objective of delivering a more personalized recommendation. Taking one cluster as an example, we randomly select a user $u$ from it. Based on the historical log of $u$ and the user classification results, we sample positive, negative and ordinary items for $u$ to construct a candidate set, which is later used by the reinforcement model for action selection. Our reinforcement model is based on DDPG, which interacts with the simulator built from historical logs to learn the inner relationship among all possible states and actions. The training phase stops once the model loss stabilizes.
3.3 Embedding with Textual Information
Textual information like descriptions and reviews is important for decision making, so we build vectors based on it. Item vectors are calculated from the word vectors of GloVe.6B:
$v_i = \dfrac{\sum_{j=1}^{n_d} w_d^{(j)} + \sum_{k=1}^{n_r} w_r^{(k)}}{n_d + n_r}$  (1)
where $w_d^{(j)}$ and $w_r^{(k)}$ are the vectors of the words from descriptions and reviews, respectively, and $n_d$ and $n_r$ denote the corresponding numbers of such words. Word vectors with similar semantics have closer Euclidean distance than word vectors with large semantic differences [20], which ensures that items with similar reviews and descriptions are closer to each other.
Given a user $u$ and one of its historical logs, if the corresponding rating is greater than a given bound $b$, the log is regarded as positive; otherwise, it is negative. We use $\mathcal{I}_u^+$ and $\mathcal{I}_u^-$ to denote the sets of items that appear in $u$'s positive and negative historical logs, respectively. After obtaining all the item vectors, we calculate the user vector $v_u$ by normalizing the summation of the vectors of the items in $\mathcal{I}_u^+$, i.e.,
$v_u = \dfrac{1}{|\mathcal{I}_u^+|} \sum_{i \in \mathcal{I}_u^+} v_i$  (2)
where $|\mathcal{I}_u^+|$ denotes the number of items in $\mathcal{I}_u^+$. In this way, we embed users and items in the same feature space.
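As a concrete sketch of Eqs. (1) and (2) in pure Python (the function names are illustrative; a real implementation would operate on pre-trained GloVe vectors):

```python
def item_vector(desc_word_vecs, review_word_vecs):
    """Eq. (1) sketch: an item vector is the mean of all word vectors
    drawn from the item's description and its reviews."""
    words = desc_word_vecs + review_word_vecs
    dim = len(words[0])
    return [sum(w[d] for w in words) / len(words) for d in range(dim)]

def user_vector(pos_item_vecs):
    """Eq. (2) sketch: a user vector is the normalized (averaged) sum
    of the vectors of the items in the user's positive logs."""
    dim = len(pos_item_vecs[0])
    return [sum(v[d] for v in pos_item_vecs) / len(pos_item_vecs)
            for d in range(dim)]
```

Averaging rather than plain summation keeps users and items on a comparable scale in the shared feature space.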
3.4 Construction of the Candidate Set
In Top-$N$ recommendation, an action is defined as an ordered set of $N$ items, so there are a total of $A_{|\mathcal{I}|}^{N}$ possible actions (note that $A$ here denotes the number of permutations). As the number of items $|\mathcal{I}|$ increases, the scale of the action space grows rapidly. Based on the assumption that a user's preferences can be captured by a set of items the user likes and dislikes, we pick positive and negative items to build a candidate set $C$. Additionally, to maintain generalization, we add some ordinary items to the candidate set.
For user $u$, we sample positive items from $\mathcal{I}_u^+$, negative items from $\mathcal{I}_u^-$, and ordinary items at random. Since users usually skip the items that they do not like, the negative items in $\mathcal{I}_u^-$ are rare [18]. Based on the idea of collaborative filtering, i.e., the more two users differ, the more likely one user's likes are the other's dislikes, we classify users into several clusters by K-means [1] to supplement negative items. Specifically, we denote the set of items that appear in the positive historical logs of users in cluster $c$ (the cluster that user $u$ belongs to) as $\mathcal{I}_c^+$, and use $\bar{c}$ to denote the cluster farthest from the current cluster $c$. If the negative items in $\mathcal{I}_u^-$ are not enough, the remaining negative items are selected from $\mathcal{I}_{\bar{c}}^+$. In this way, we reduce the scale of the action space from $A_{|\mathcal{I}|}^{N}$ to $A_{|C|}^{N}$, where $|C|$ is the number of items in the candidate set $C$.
Algorithm 1 shows the details of the construction of the candidate set, in which the positive items account for no more than a given percentage of $C$, and the negative and ordinary items each take half of the remaining part of $C$. In the training phase, since constructing a candidate set only involves simple operations, such as random selection and merging, and the candidate set size of our model is fixed, it is not difficult to see that the construction has constant time complexity.
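The sampling logic described above can be sketched as follows; all names, the default candidate size `m`, and the positive ratio `alpha` are illustrative assumptions, not the paper's exact Algorithm 1:

```python
import random

def build_candidate_set(pos_items, neg_items, far_cluster_pos, all_items,
                        m=20, alpha=0.5, rng=None):
    """Sample a candidate set of m items for one user: at most alpha*m
    positives, with the remainder split evenly between negative and
    ordinary items. Scarce negatives are supplemented from the positive
    logs of the farthest cluster (the collaborative-filtering idea)."""
    rng = rng or random.Random(0)
    n_pos = min(len(pos_items), int(alpha * m))
    n_neg = (m - n_pos) // 2
    n_ord = m - n_pos - n_neg
    cand = rng.sample(pos_items, n_pos)
    neg_pool = list(neg_items)
    if len(neg_pool) < n_neg:  # supplement from the farthest cluster
        neg_pool += rng.sample(far_cluster_pos, n_neg - len(neg_pool))
    cand += rng.sample(neg_pool, n_neg)
    cand += rng.sample([i for i in all_items if i not in cand], n_ord)
    return cand
```

Because the candidate size `m` is fixed and every step is a bounded random sample or merge, the construction runs in constant time per call, matching the complexity claim above.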
3.5 Architecture of TDDPGRec
The goal of a typical reinforcement learning model is to learn a policy that maximizes the discounted future reward, i.e., the Q-value, which is usually estimated by the state-action value function $Q(s, a)$. Combined with deep neural networks, many algorithms try to approximate $Q(s, a)$. Among them, DDPG, a model-free, off-policy actor-critic algorithm that combines the advantages of DQN [19] and DPG [21], can concurrently learn a policy and $Q(s, a)$ in high-dimensional, continuous action spaces by using neural network function approximation [17]. We use DDPG in our model, and Figure 3 shows its architecture.
At each timestep $t$, the actor network takes a state $s_t$ as input. Through a multiple-layer perceptron (MLP) network, we learn a continuous vector, which we term the policy vector, denoted by $p_t$. The critic network takes state $s_t$ and policy vector $p_t$ as input; through an MLP, it learns the current Q-value to evaluate $p_t$. As illustrated in Figure 1, $p_t$ represents a user's preferences in the feature space; it is a continuous weight vector that measures the importance of each dimension. Combining $p_t$ with the candidate set $C$, we get the $N$ items with the highest scores, each of which is denoted by $i_t^k$ and
$i_t^k = \arg\max_{i \in C \setminus \{i_t^1, \ldots, i_t^{k-1}\}} \; p_t \cdot v_i$  (3)
Moreover, to cover the action space to a large extent, the candidate set is randomly generated at each time step.
Note that the actions in IRSs are discrete. In our method, when embedding the items, we have mapped the discrete actions into a continuous feature space, where each item is represented by a feature vector. Then, by taking the dot product between $p_t$ and the item vectors in $C$, we can select actions from a discrete space. In this way, our method bridges the gap between the discrete actions in IRSs and the continuous actions in DDPG.
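The dot-product selection of Eq. (3) can be sketched as follows (the function name and dictionary layout are illustrative):

```python
def select_action(policy_vec, candidate_vecs, n):
    """Eq. (3) sketch: score each candidate item by the dot product
    between the policy vector and the item vector, then return the n
    highest-scoring items, ordered by score."""
    dot = lambda p, v: sum(pi * vi for pi, vi in zip(p, v))
    ranked = sorted(candidate_vecs,
                    key=lambda i: dot(policy_vec, candidate_vecs[i]),
                    reverse=True)
    return ranked[:n]
```

Because scoring is restricted to the candidate set rather than the whole item catalog, each decision costs O(|C|) dot products instead of O(|I|).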
3.6 Environment Simulator
Following several previous works [25, 4], we build an environment simulator to mimic online environments. It receives the present state $s_t$ and action $a_t$, then returns the reward $r_t$ and the next state $s_{t+1}$. In our model, the reward function guides the model to capture users' preferences and evaluates the rank quality of the recommended items. For user $u$ at timestep $t$, we give a reward $r_t$ on the action $a_t$ obtained by $p_t$ from the candidate set $C$. The reward is determined by two values, the ranking weight $w_k$ and the item reward $r_{u,i}$; specifically,
$r_t = \sum_{k=1}^{|a_t|} w_k \, r_{u, i_t^k}$  (4)
where $|a_t|$ is the number of items in $a_t$, $w_k$ is the ranking weight of the $k$-th item in $a_t$, and $r_{u,i}$ is the reward of item $i$ for user $u$. Inspired by DCG [11, 25], the ranking weight is calculated by
$w_k = \dfrac{1}{\log_2 (k+1)}$  (5)
To give proper rewards for different types of items, is designed as follows,
$r_{u,i} = \begin{cases} r_{u,i} - b, & \text{if } i \in \mathcal{I}_u^+ \cup \mathcal{I}_u^- \\ (r_{\min} - b)/2, & \text{if } i \text{ is a supplemented negative item} \\ 0, & \text{otherwise} \end{cases}$  (6)
Recall that $r_{u,i}$ is the rating of user $u$ on item $i$, and $b$ is the rating bound that determines whether the corresponding log is regarded as positive or negative. By this formula, positive items in $a_t$ get positive feedback, and negative items get negative feedback. Moreover, the supplemented negative items get half of the minimum negative feedback, i.e., $(r_{\min} - b)/2$, where $r_{\min}$ is the minimum rating, while the other items get a feedback of $0$.
As shown in Figure 3, our method generates $s_{t+1}$ in a sliding-window manner. Specifically, among the $N$ ordered items in $a_t$, we keep their order and select the items that are not in $s_t$. Then we put them at the head of $s_t$ and take the first $|s_t|$ items as $s_{t+1}$.
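A minimal sketch of the reward computation in Eqs. (4)-(6); the rating bound, minimum rating, and function names are illustrative assumptions for a 1-to-5 rating scale:

```python
import math

def ranking_weight(k):
    """Eq. (5) sketch: DCG-style discount for rank position k (1-based)."""
    return 1.0 / math.log2(k + 1)

def item_reward(rating, bound=3, supplemented=False, r_min=1):
    """Eq. (6) sketch: rated items yield rating minus the bound;
    supplemented negatives yield half the minimum negative feedback;
    all other items yield zero."""
    if rating is not None:
        return rating - bound
    if supplemented:
        return (r_min - bound) / 2.0
    return 0.0

def step_reward(feedback):
    """Eq. (4) sketch: rank-weighted sum of per-item rewards over the
    ordered action; feedback is a list of (rating-or-None, supplemented)
    pairs, one per recommended item."""
    return sum(ranking_weight(k + 1) * item_reward(r, supplemented=s)
               for k, (r, s) in enumerate(feedback))
```

The DCG-style weights make rewards at the top of the recommended list count more, so the model is pushed to rank liked items first, not merely to include them.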
3.7 Learning TDDPGRec
The training phase (shown in Algorithm 2) learns the model parameters $\theta^{\pi}$, $\theta^{Q}$, $\theta^{\pi'}$, and $\theta^{Q'}$ by maximizing the cumulative discounted reward of all decisions. Based on the assumption that similar users have similar preferences, our method classifies users and trains a model for each cluster. At the beginning of the training phase, we randomly initialize the network parameters and the replay buffer $B$. For action exploration, we initialize a random process $\mathcal{N}$, which adds some uncertainty when generating $p$. The critic network focuses on minimizing the gap between the current Q-value $Q(s_t, p_t)$ and the expected Q-value $y_t$, which is achieved by minimizing the following loss,
$L(\theta^{Q}) = \mathbb{E}\left[ \left( y_t - Q(s_t, p_t \mid \theta^{Q}) \right)^2 \right]$  (7)
where $y_t$ can be expressed in a recursive manner by using the Bellman equation,
$y_t = r_t + \gamma \, Q'\!\left(s_{t+1}, \pi'(s_{t+1} \mid \theta^{\pi'}) \mid \theta^{Q'}\right)$  (8)
The objective of the actor network is to optimize the policy vector $p$ by maximizing the Q-value. The actor network is trained by the sampled policy gradient:
$\nabla_{\theta^{\pi}} J \approx \mathbb{E}\left[ \nabla_{p} Q(s, p \mid \theta^{Q}) \big|_{p = \pi(s)} \, \nabla_{\theta^{\pi}} \pi(s \mid \theta^{\pi}) \right]$  (9)
Note that in our implementation, we set minimum and maximum training-step thresholds based on the size of buffer $B$. The training phase stops when the number of steps exceeds the minimum threshold and the loss remains stable, or when the number of steps exceeds the maximum threshold.
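The critic-side computations of Eqs. (7) and (8) can be sketched numerically; the function names and the flat minibatch layout are illustrative, and a real implementation would backpropagate this loss through the critic's network parameters:

```python
def critic_target(reward, next_q, gamma=0.9, terminal=False):
    """Eq. (8) sketch: Bellman target y_t = r_t + gamma * Q'(...),
    where next_q stands in for the target critic's estimate of the
    next state under the target actor's policy vector."""
    return reward if terminal else reward + gamma * next_q

def critic_loss(batch, gamma=0.9):
    """Eq. (7) sketch: mean squared error between current Q-values and
    Bellman targets over a minibatch sampled from the replay buffer;
    each entry is (q_current, reward, next_q, terminal)."""
    errs = [(critic_target(r, nq, gamma, t) - q) ** 2
            for q, r, nq, t in batch]
    return sum(errs) / len(errs)
```

Using separate target networks ($\theta^{\pi'}$, $\theta^{Q'}$) for the `next_q` term, as DDPG prescribes, keeps the regression target from chasing the network being updated.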
4 Experiments and Results
In this section, we conduct experiments to demonstrate the effectiveness of the proposed TDDPG-Rec model against several state-of-the-art models. We first introduce the experimental setup, and then present and discuss the experimental results from the perspectives of both recommendation performance and time efficiency. Finally, we conduct a hyperparameter sensitivity analysis in the last part of this section. We have implemented our models based on TensorFlow; the code is available on GitHub.
4.1 Experimental Settings
Datasets
Leskovec et al. [15] collected and categorized a variety of Amazon products and built several datasets. The statistics of the three datasets used in our experiments are summarized below.
Table 1: Statistics of the datasets.

Dataset  #Users  #Items  #Ratings (Pos / Neg)  Sparsity  Size of Textual Info
Music  5,541  3,568  58,905 / 5,801  0.9967  68,096 KB
Beauty  22,363  12,101  176,520 / 21,982  0.9993  88,986 KB
Clothing  39,387  23,033  252,022 / 26,655  0.9997  84,168 KB
Table 2: Recommendation performance on the three datasets.

Music
Method  HR@10  F1@10  nDCG@10  HR@20  F1@20  nDCG@20
ItemPop  0.2447  0.0454  0.1101  0.4889  0.0525  0.1716
LinearUCB  0.3318  0.0621  0.1569  0.5885  0.0626  0.2210
DMF  0.3201  0.0631  0.1462  0.5747  0.0638  0.2095
ANR  0.4980  0.1128  0.2756  0.7097  0.1084  0.3252
Caser  0.8097  0.1676  0.5351  0.9090  0.1048  0.5542
SASRec  0.8897  0.1910  0.6212  0.9635  0.1151  0.6325
DDPG-KNN (k = 0.1M)  0.3274  0.0648  0.1527  0.5838  0.0647  0.2171
DDPG-KNN (k = M)  0.3436  0.0692  0.1617  0.6001  0.0676  0.2258
TDQN-Rec  0.8150  0.1712  0.5048  0.8977  0.1053  0.5251
MDDPG-Rec  0.8293  0.1803  0.5293  0.9074  0.1087  0.5477
TDDPG-Rec  0.9164  0.2032  0.6630  0.9426  0.1142  0.6687

Beauty
Method  HR@10  F1@10  nDCG@10  HR@20  F1@20  nDCG@20
ItemPop  0.2551  0.0482  0.1134  0.5278  0.0543  0.1817
LinearUCB  0.2734  0.0502  0.1249  0.5273  0.0529  0.1885
DMF  0.3219  0.0614  0.1447  0.5911  0.0613  0.2122
ANR  0.4550  0.0990  0.2252  0.6993  0.1006  0.2850
Caser  0.6125  0.1218  0.3939  0.7817  0.0826  0.4344
SASRec  0.6823  0.1386  0.4569  0.8330  0.0907  0.4942
DDPG-KNN (k = 0.1M)  0.2585  0.0489  0.1170  0.5142  0.0529  0.1809
DDPG-KNN (k = M)  0.2734  0.0522  0.1274  0.5197  0.0539  0.1889
TDQN-Rec  0.5466  0.1053  0.3033  0.7662  0.0796  0.3573
MDDPG-Rec  0.5985  0.1202  0.3391  0.7740  0.0825  0.3830
TDDPG-Rec  0.7648  0.1517  0.4942  0.8952  0.0948  0.5261

Clothing
Method  HR@10  F1@10  nDCG@10  HR@20  F1@20  nDCG@20
ItemPop  0.2265  0.0413  0.1033  0.4964  0.0482  0.1706
LinearUCB  0.2393  0.0437  0.1041  0.5044  0.0488  0.1704
DMF  0.2500  0.0458  0.1130  0.5041  0.0489  0.1756
ANR  0.3421  0.0663  0.1622  0.6008  0.0659  0.2264
Caser  0.5060  0.0934  0.2900  0.7196  0.0702  0.3427
SASRec  0.5817  0.1084  0.3525  0.7655  0.0758  0.3968
DDPG-KNN (k = 0.1M)  0.2541  0.0467  0.1131  0.5043  0.0490  0.1757
DDPG-KNN (k = M)  0.2768  0.0510  0.1242  0.5293  0.0517  0.1874
TDQN-Rec  0.3739  0.0685  0.1873  0.6252  0.0607  0.2501
MDDPG-Rec  0.3094  0.0567  0.1487  0.5595  0.0542  0.2111
TDDPG-Rec  0.6237  0.1203  0.3553  0.8210  0.0818  0.4055
Baseline methods
We compare TDDPG-Rec with ten baseline methods: ItemPop is a conventional recommendation method; LinearUCB is a MAB-based method; DMF is an MF-based method with neural networks; ANR is a neural recommendation method that leverages textual information; Caser and SASRec are time-related deep learning based methods; and DDPG-KNN, TPGR, TDQN-Rec and MDDPG-Rec are all RL-based methods.

ItemPop recommends the most popular items (i.e., the items with the highest average ratings) among the currently available items to the user at each timestep. This method is non-personalized and is often used as a benchmark for recommendation tasks.

LinearUCB [16] is a contextualbandit recommendation approach that adopts a linear model to estimate the upper confidence bound for each arm.

DMF [26] is a state-of-the-art matrix factorization model using deep neural networks. Specifically, it utilizes two distinct MLPs to map the users and items into a common low-dimensional space with nonlinear projections.

ANR [7] uses an attention mechanism to focus on the relevant parts of reviews and estimates aspectlevel user and item importance in a joint manner.

Caser [23] embeds a sequence of recent items into an image and learns sequential patterns as local features of the image by using convolutional filters.

SASRec [12] is a selfattention based sequential model for next item recommendation. It models the entire user sequence and adaptively considers consumed items for prediction.

DDPG-KNN [9] addresses the large discrete action space problem by combining DDPG with an approximate k-nearest-neighbor method.

TPGR [4] builds a balanced hierarchical clustering tree and formulates picking an item as seeking a path from the root to a certain leaf of the tree.

TDQN-Rec is a variant that replaces DDPG in TDDPG-Rec with DQN, while keeping the other components the same as in TDDPG-Rec.

MDDPG-Rec is a variant that uses the same framework as TDDPG-Rec, but with vectors derived by matrix factorization [13] rather than from textual information.
Note that for DDPG-KNN, a larger $k$ (i.e., the number of nearest neighbors) results in better performance but poorer efficiency. For a fair comparison, we consider setting $k$ to $0.1M$ and $M$, respectively (recall that $M$ is the number of items).
Evaluation Metrics and Methodology
Methods targeting Top-$N$ recommendation are commonly evaluated with metrics such as Hit Ratio (HR) [26], Precision [32, 23], Recall [23], F1 [4] and normalized Discounted Cumulative Gain (nDCG) [25, 30, 32, 12]. To cover as many aspects of Top-$N$ recommendation as possible, we chose HR@$N$, F1@$N$, and nDCG@$N$ as the evaluation metrics.
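For reference, the two rank-sensitive metrics can be sketched over a single ranked list as follows (function names and the set-based protocol are illustrative, not the exact evaluation code):

```python
import math

def hit_ratio(recommended, positives):
    """HR@N sketch: 1 if any of the top-N recommended items is a
    held-out positive item, else 0."""
    return 1.0 if any(i in positives for i in recommended) else 0.0

def ndcg(recommended, positives):
    """nDCG@N sketch: DCG over the hit positions (1-based ranks),
    divided by the ideal DCG where all positives rank first."""
    dcg = sum(1.0 / math.log2(k + 2)
              for k, i in enumerate(recommended) if i in positives)
    idcg = sum(1.0 / math.log2(k + 2)
               for k in range(min(len(recommended), len(positives))))
    return dcg / idcg if idcg > 0 else 0.0
```

HR@N only checks whether a positive item appears in the list, while nDCG@N additionally rewards ranking it near the top, which is why both are reported in Table 2.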
The test data was constructed during data preparation, and all the evaluated methods were tested on this data. We now describe the test method in detail: for each user, we first classify the user's history logs into positive and negative ones, and sort the items in the positive history logs by timestamp. Then, we choose the last portion of the ordered items in the positive logs as positive test items. Finally, the negative test items are randomly selected from the cluster that is farthest from the one the current user belongs to. Based on such a strategy, the recommendation methods (except TPGR, which only recommends one item in each episode) can generate a ranked Top-$N$ list to compute the metrics mentioned above.
4.2 Results and Analysis
Table 2 shows the summarized results of our experiments on the three datasets in terms of six metrics: HR@10, F1@10, nDCG@10, HR@20, F1@20 and nDCG@20. Note that since TPGR is not suitable for Top-$N$ recommendation, we did not include it as a competitor when evaluating recommendation performance. From the results, we have the following key observations:

The proposed model TDDPG-Rec achieves the best (or the second-best, with only a small gap to the best) performance and obtains remarkable improvements over the state-of-the-art methods. Moreover, the performance improvement grows with data scale and data sparsity (the datasets are arranged in increasing order of scale and sparsity). This justifies the effectiveness of leveraging textual information in RL-based recommendation, especially for large-scale, high-sparsity datasets.

The structure of ANR is similar to that of DMF, while the structure of TDDPG-Rec is the same as that of MDDPG-Rec. The text-based methods ANR and TDDPG-Rec consistently outperform their counterparts DMF and MDDPG-Rec, which only use interaction information for embedding. This demonstrates the importance of utilizing textual information to alleviate the negative effects of data sparsity.
Moreover, based on the results in Table 2, we conduct several statistical significance tests [14]. For all metrics, the p-values between SASRec and TDDPG-Rec, between TDQN-Rec and TDDPG-Rec, and between MDDPG-Rec and TDDPG-Rec are all below the significance level, indicating significant differences between each evaluated pair of methods.
Table 3: Time comparison on the Beauty dataset.

Method  Per training step (ms)  Per decision (ms)
DDPG-KNN (k = 0.1M)  5.26  7.86
DDPG-KNN (k = M)  5.51  56.42
TPGR  4.98  5.06
SASRec  205.12  14.32
TDQN-Rec  1.86  1.08
TDDPG-Rec  3.13  0.90
4.3 Time Comparison
In this section, we compare the efficiency of the RL-based models on the Beauty dataset from two aspects: the time consumed by training (updating the model) and by decision making (selecting an action), measured in milliseconds. Note that since SASRec provides competitive results, we also include it as a competitor. To make a fair comparison, the shared parameters of all methods are set to the same values, and the experiments are conducted on the same machine with a 6-core CPU (i7-6850K, 3.6GHz) and 64GB RAM.
As shown in Table 3, DDPG-KNN runs much slower than the other models because of its high time complexity. TDQN-Rec consumes the least training time due to its simple structure, but as shown in Table 2, it has the worst recommendation performance among all the RL-based methods. TPGR reduces decision-making time significantly by constructing a clustering tree, but as mentioned before, it only supports Top-1 recommendation. Compared to the other methods, by using the policy vector and the action candidate set, our model TDDPG-Rec achieves significant improvement in terms of both recommendation performance and efficiency.
4.4 Parameter Sensitivity
We select several important parameters to analyze their effects on the performance of TDDPG-Rec in terms of HR and nDCG. Note that we have conducted such experiments on all the datasets, and the results show that our approach exhibits similar performance trends on all of them; for simplicity, we only present the results on the Beauty dataset. When testing one parameter, we keep the others fixed at their default values.
The Dimension of the Feature Space. The number of feature dimensions reflects the richness of the information. As shown in Figure 6 (a), as the dimension increases, TDDPG-Rec performs better, as expected.
The Number of Clusters. As shown in Figure 6 (b), increasing the number of clusters improves performance. This is mainly because the more clusters there are, the larger the difference between the current cluster and the one farthest from it, which yields more high-quality negative items and eventually better performance.
The Size of the Candidate Set. Figure 9 (a) shows that the performance decreases as the candidate set grows. This is mainly because the items a user has interacted with are far fewer than the items in the whole item set. In other words, enlarging the candidate set causes imbalanced sampling, which in turn leads to worse performance.
The Ratio of Positive Items. As shown in Figure 9 (b), as the ratio increases, the performance first grows and then remains stable. This is because a larger ratio introduces more positive items, which helps perceive the user's interests better. But since the number of positive items placed in the candidate set is bounded (see Algorithm 1), once the ratio is large enough, its further growth no longer affects the candidate set.
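The capping effect described above (the ratio stops mattering once all available positives are included) can be sketched as follows, with hypothetical sizes and item pools rather than the paper's actual parameters:

```python
import numpy as np

rng = np.random.default_rng(2)
cand_size, pos_ratio = 100, 0.3           # hypothetical settings
positives = list(range(0, 40))            # items with positive feedback
negatives = list(range(1000, 2000))       # items from the farthest cluster

# The number of positives is capped by both the ratio and how many exist.
n_pos = min(int(cand_size * pos_ratio), len(positives))
cand = (rng.choice(positives, n_pos, replace=False).tolist()
        + rng.choice(negatives, cand_size - n_pos, replace=False).tolist())
```

Once `cand_size * pos_ratio` exceeds `len(positives)`, raising the ratio leaves `n_pos`, and hence the candidate set, unchanged.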
The Size of the State. Figure 12 (a) shows that the performance stays stable as the state size increases, which means the size of the state has little impact on TDDPG-Rec.
The Size of the Action. Figure 12 (b) shows that the performance increases with the action size. This is because the larger the action is, the more frequently the user state changes, which gives the positive items more opportunities to be selected.
5 Conclusion
In this paper, we propose TDDPG-Rec, a Text-based Deep Deterministic Policy Gradient framework for Top-N interactive recommendation. By leveraging textual information and pre-trained word vectors, we embed items and users into the same feature space, which greatly alleviates the data sparsity problem. Moreover, based on the idea of collaborative filtering, we classify users into several clusters and construct an action candidate set. Combined with the policy vector dynamically learned by DDPG, which expresses the user's preferences in the feature space, we select items from the candidate set to form the action for recommendation, which greatly improves the efficiency of decision making. Experimental results over a carefully designed simulator on three public datasets demonstrate that, compared with state-of-the-art methods, TDDPG-Rec achieves remarkable performance improvement in a time-efficient manner.
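The text-based embedding summarized here, averaging pre-trained word vectors over an item's textual content, can be sketched as follows. The tiny `glove` dictionary is a toy stand-in for real GloVe vectors, which would be loaded from the pre-trained files:

```python
import numpy as np

# Toy stand-in for pre-trained GloVe vectors (real ones are loaded from file).
glove = {"great": np.array([0.9, 0.1]),
         "lipstick": np.array([0.2, 0.8]),
         "smooth": np.array([0.7, 0.3])}

def embed(text, dim=2):
    """Average the pre-trained vectors of the in-vocabulary words of a text."""
    vecs = [glove[w] for w in text.lower().split() if w in glove]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

item_vec = embed("Great smooth lipstick")
```

Because reviews and descriptions exist even for rarely rated items, such text-derived vectors are less sensitive to interaction sparsity than purely collaborative embeddings.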
For future work, we would like to investigate whether other techniques, such as the attention mechanism, can further improve recommendation accuracy. Moreover, we intend to study whether our proposed model can be combined with transfer learning.
Acknowledgements
We would like to thank the referees for their valuable comments, which helped improve this paper considerably. The work was partially supported by the National Natural Science Foundation of China under Grant No. 61672252, and the Fundamental Research Funds for the Central Universities under Grant No. 2019kfyXKJC021.
Footnotes
 http://nlp.stanford.edu/data/glove.6B.zip
 https://www.ranks.nl/stopwords
 https://github.com/SunwardTree/TDDPGRec
 http://snap.stanford.edu/data/amazon/productGraph/categoryFiles
References
 Rakesh Agrawal, Johannes Gehrke, Dimitrios Gunopulos, and Prabhakar Raghavan, ‘Automatic subspace clustering of high dimensional data for data mining applications’, in Proceedings ACM SIGMOD International Conference on Management of Data (SIGMOD), eds., Laura M. Haas and Ashutosh Tiwary, pp. 94–105, (1998).
 Pierpaolo Basile, Claudio Greco, Alessandro Suglia, and Giovanni Semeraro, ‘Deep learning and hierarchical reinforcement learning for modeling a conversational recommender system’, Intelligenza Artificiale, 12(2), 125–141, (2018).
 Konstantin Bauman, Bing Liu, and Alexander Tuzhilin, ‘Aspect based recommendations: Recommending items with the most valuable aspects based on user reviews’, in Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), Halifax, NS, Canada, August 13–17, pp. 717–725. ACM, (2017).
 Haokun Chen, Xinyi Dai, Han Cai, Weinan Zhang, Xuejian Wang, Ruiming Tang, Yuzhou Zhang, and Yong Yu, ‘Large-scale interactive recommendation with tree-structured policy gradient’, in Proceedings of the Thirty-Third AAAI Conference on Artificial Intelligence, Honolulu, Hawaii, USA, January 27 – February 1, pp. 3312–3320. AAAI Press, (2019).
 Xinshi Chen, Shuang Li, Hui Li, Shaohua Jiang, Yuan Qi, and Le Song, ‘Generative adversarial user model for reinforcement learning based recommendation system’, in Proceedings of the 36th International Conference on Machine Learning (ICML), 9–15 June, Long Beach, California, USA, eds., Kamalika Chaudhuri and Ruslan Salakhutdinov, volume 97, pp. 1052–1061. PMLR, (2019).
 Germán Cheuque, José Guzmán, and Denis Parra, ‘Recommender systems for online video game platforms: the case of STEAM’, in Proceedings of International Conference on World Wide Web (WWW), San Francisco, CA, USA, May 13–17, eds., Sihem Amer-Yahia, Mohammad Mahdian, Ashish Goel, Geert-Jan Houben, Kristina Lerman, Julian J. McAuley, Ricardo Baeza-Yates, and Leila Zia, pp. 763–771. ACM, (2019).
 Jin Yao Chin, Kaiqi Zhao, Shafiq R. Joty, and Gao Cong, ‘ANR: aspect-based neural recommender’, in Proceedings of the 27th ACM International Conference on Information and Knowledge Management (CIKM), Torino, Italy, October 22–26, eds., Alfredo Cuzzocrea, James Allan, Norman W. Paton, Divesh Srivastava, Rakesh Agrawal, Andrei Z. Broder, Mohammed J. Zaki, K. Selçuk Candan, Alexandros Labrinidis, Assaf Schuster, and Haixun Wang, pp. 147–156. ACM, (2018).
 Dong Deng, Liping Jing, Jian Yu, Shaolong Sun, and Haofei Zhou, ‘Neural gaussian mixture model for review-based rating prediction’, in Proceedings of the 12th ACM Conference on Recommender Systems (RecSys), Vancouver, BC, Canada, October 2–7, eds., Sole Pera, Michael D. Ekstrand, Xavier Amatriain, and John O’Donovan, pp. 113–121. ACM, (2018).
 Gabriel DulacArnold, Richard Evans, Hado van Hasselt, Peter Sunehag, Timothy Lillicrap, Jonathan Hunt, Timothy Mann, Theophane Weber, Thomas Degris, and Ben Coppin, ‘Deep reinforcement learning in large discrete action spaces’, arXiv preprint arXiv:1512.07679, (2015).
 Yujing Hu, Qing Da, Anxiang Zeng, Yang Yu, and Yinghui Xu, ‘Reinforcement learning to rank in ecommerce search engine: Formalization, analysis, and application’, in Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery and Data mining (SIGKDD), pp. 368–377, (2018).
 Kalervo Järvelin and Jaana Kekäläinen, ‘Cumulated gainbased evaluation of ir techniques’, ACM Transactions on Information Systems, 20(4), 422–446, (October 2002).
 Wang-Cheng Kang and Julian J. McAuley, ‘Self-attentive sequential recommendation’, in Proceedings of IEEE International Conference on Data Mining (ICDM), Singapore, November 17–20, pp. 197–206. IEEE Computer Society, (2018).
 Yehuda Koren, Robert Bell, and Chris Volinsky, ‘Matrix factorization techniques for recommender systems’, Computer, 42(8), 30–37, (2009).
 Elena Kulinskaya, Stephan Morgenthaler, and Robert G. Staudte, Significance Testing: An Overview, 1318–1321, Springer Berlin Heidelberg, Berlin, Heidelberg, 2011.
 Jure Leskovec and Andrej Krevl. SNAP Datasets: Stanford large network dataset collection. http://snap.stanford.edu/data, June 2014.
 Lihong Li, Wei Chu, John Langford, and Robert E Schapire, ‘A contextualbandit approach to personalized news article recommendation’, in Proceedings of International Conference on World Wide Web (WWW), pp. 661–670, (2010).
 Timothy P. Lillicrap, Jonathan J. Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra, ‘Continuous control with deep reinforcement learning’, in Proceedings of the 4th International Conference on Learning Representations (ICLR Poster), (2016).
 Benjamin M. Marlin and Richard S. Zemel, ‘Collaborative prediction and ranking with non-random missing data’, in Proceedings of the 3rd ACM Conference on Recommender Systems (RecSys), pp. 5–12, (2009).
 Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, et al., ‘Humanlevel control through deep reinforcement learning’, Nature, 518(7540), 529, (2015).
 Jeffrey Pennington, Richard Socher, and Christopher D. Manning, ‘Glove: Global vectors for word representation’, in Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), October 25–29, Doha, Qatar, pp. 1532–1543. ACL, (2014).
 David Silver, Guy Lever, Nicolas Heess, Thomas Degris, Daan Wierstra, and Martin Riedmiller, ‘Deterministic policy gradient algorithms’, in Proceedings of the 31th International Conference on Machine Learning (ICML), pp. 387–395, (2014).
 Haihui Tan, Ziyu Lu, and Wenjie Li, ‘Neural network based reinforcement learning for realtime pushing on text stream’, in Proceedings of the 40th International ACM Conference on Research and Development in Information Retrieval (SIGIR), pp. 913–916, (2017).
 Jiaxi Tang and Ke Wang, ‘Personalized top-n sequential recommendation via convolutional sequence embedding’, in Proceedings of the 11th ACM International Conference on Web Search and Data Mining (WSDM), Marina Del Rey, CA, USA, February 5–9, eds., Yi Chang, Chengxiang Zhai, Yan Liu, and Yoelle Maarek, pp. 565–573. ACM, (2018).
 Huazheng Wang, Qingyun Wu, and Hongning Wang, ‘Factorization bandits for interactive recommendation’, in Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, February 4–9, San Francisco, California, USA, pp. 2695–2702. AAAI Press, (2017).
 Zeng Wei, Jun Xu, Yanyan Lan, Jiafeng Guo, and Xueqi Cheng, ‘Reinforcement learning to rank with markov decision process’, in Proceedings of the 40th International ACM Conference on Research and Development in Information Retrieval (SIGIR), pp. 945–948, (2017).
 Hong-Jian Xue, Xinyu Dai, Jianbing Zhang, Shujian Huang, and Jiajun Chen, ‘Deep matrix factorization models for recommender systems’, in Proceedings of International Joint Conference on Artificial Intelligence (IJCAI), Melbourne, Australia, August 19–25, pp. 3203–3209, (2017).
 Xiangyu Zhao, Long Xia, Jiliang Tang, and Dawei Yin, ‘Deep reinforcement learning for search, recommendation, and online advertising: A survey’, SIGWEB Newsl., (Spring), 4:1–4:15, (July 2019).
 Xiangyu Zhao, Long Xia, Liang Zhang, Zhuoye Ding, Dawei Yin, and Jiliang Tang, ‘Deep reinforcement learning for pagewise recommendations’, in Proceedings of the 12th ACM Conference on Recommender Systems (RecSys), pp. 95–103, (2018).
 Xiangyu Zhao, Liang Zhang, Zhuoye Ding, Long Xia, Jiliang Tang, and Dawei Yin, ‘Recommendations with negative feedback via pairwise deep reinforcement learning’, in Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery and Data mining (SIGKDD), pp. 1040–1048, (2018).
 Xiangyu Zhao, Liang Zhang, Zhuoye Ding, Dawei Yin, Yihong Zhao, and Jiliang Tang, ‘Deep reinforcement learning for listwise recommendations’, arXiv preprint arXiv:1801.00209, (2018).
 Xiaoxue Zhao, Weinan Zhang, and Jun Wang, ‘Interactive collaborative filtering’, in Proceedings of 22nd ACM International Conference on Information and Knowledge Management (CIKM), San Francisco, CA, USA, October 27 – November 1, 2013, pp. 1411–1420. ACM, (2013).
 Guanjie Zheng, Fuzheng Zhang, Zihan Zheng, Yang Xiang, Nicholas Jing Yuan, Xing Xie, and Zhenhui Li, ‘Drn: A deep reinforcement learning framework for news recommendation’, in Proceedings of International Conference on World Wide Web (WWW), pp. 167–176, (2018).
 Lei Zheng, Vahid Noroozi, and Philip S. Yu, ‘Joint deep modeling of users and items using reviews for recommendation’, in Proceedings of the 10th ACM International Conference on Web Search and Data Mining (WSDM), Cambridge, United Kingdom, February 6–10, pp. 425–434. ACM, (2017).