A Scalable Algorithm for Privacy-Preserving Item-based Top-N Recommendation
Abstract
Recommender systems have become an indispensable component of online services in recent years. Effective recommendation is essential for improving the services of various online business applications. However, serious privacy concerns have been raised about recommender systems that require the collection of users’ private information. At the same time, the success of e-commerce has generated massive amounts of information, making scalability a key challenge in the design of recommender systems. As such, it is desirable for recommender systems to protect users’ privacy while achieving high-quality recommendations with low-complexity computations.
This paper proposes a scalable privacy-preserving item-based top-N recommendation solution, which can achieve high-quality recommendations with reduced computation complexity while ensuring that users’ private information is protected. Furthermore, the computation complexity of the proposed method increases slowly as the number of users increases, thus providing high scalability for privacy-preserving recommender systems. More specifically, the proposed approach consists of two key components: (1) MinHash-based similarity estimation and (2) client-side privacy-preserving prediction generation. Our theoretical and experimental analysis using real-world data demonstrates the efficiency and effectiveness of the proposed approach.
Index terms: item-based, top-N recommendation, privacy, scalability
I Introduction
With the explosive growth of information on the Internet, recommender systems have become indispensable in many online services [1]. As one of the most important techniques in recommender systems, collaborative filtering (CF) [2, 3] has been broadly adopted, for example by Amazon [4, 5], MovieLens [6], and YouTube [7]. Such broad application of CF raises serious concerns about the leakage of users’ private information.
The privacy concern originates from the basic idea behind CF techniques. For instance, item-based top-N recommendation, one of the widely used CF techniques, assumes that a user may be interested in items that are similar to the items that he/she liked before. More specifically, top-N recommendation aims to provide an ordered list of items to a user. Item-based recommendation works by first collecting users’ “ratings” of items. Then, items are profiled by analyzing users’ ratings for them. Finally, a user is recommended items that are highly similar to the items he/she has rated highly in the past. Through the recommendation, users can find items of interest, such as movies or music. However, to profile items and perform recommendations for users, sensitive information, such as user demographics and user ratings, is collected by recommender systems, giving rise to serious privacy concerns for individual users [4, 1, 8, 9, 10]. In addition, profiling and recommending items based on all users’ information leads to scalability issues because of the explosive growth of online information.
Recent research has aimed to tackle privacy preservation for individual users in CF [11]. Scalability is the primary limiting factor of existing cryptography-based methods [4, 6, 12, 13]. These methods adopt cryptography, such as homomorphic encryption, to hide true user ratings during computation, which requires computation-intensive encryption and decryption operations. Hence, they do not scale well to large numbers of users and items [6, 12, 13]. In contrast, random perturbation based methods avoid intensive computation and protect users’ privacy by perturbing users’ ratings, for example by adding random noise, before sending them to the server [14, 15, 16]. As such, random perturbation based methods are efficient and easy to implement. However, these methods trade accuracy for privacy, and even so the protection of user privacy is not guaranteed: as shown in [17], the server can partially recover valid user data from perturbed data using learning techniques.
An ideal privacy-preserving recommender system should guarantee user privacy protection without compromising recommendation accuracy or efficiency. However, existing privacy-preserving CF methods trade either efficiency (cryptography-based methods) or accuracy (random perturbation based methods) for privacy. Therefore, the focus of this work is to address the above challenges and develop a scalable solution capable of preserving privacy while achieving high-quality recommendations with low-complexity computations.
Our work is motivated by one key observation: the overwhelming amount of information generated on the Internet has a degree of redundancy for recommendation purposes, as some users have similar preferences. As such, it is possible to develop a scalable approach that collects information from only some users, anonymously, while still achieving high-quality recommendations. Therefore, this work proposes an approach for item-based top-N recommendation that can cope with large numbers of users and items while achieving efficient privacy-preserving recommendation. Unlike the methods above, the proposed approach neither performs additional cryptographic calculation nor changes the original users’ information to preserve users’ privacy. Specifically, our proposed approach works as follows: (1) to protect individual users’ privacy, we first use anonymous random walks to collect users’ information, thus eliminating the correspondence between individual users and their information; (2) item similarities are estimated online from the received information using the MinHash-based method, and the recommendations are generated locally; and (3) by restricting the number of random walks, the computation cost is reduced significantly, especially for scenarios with a large number of users and/or items.
The contributions of this work are summarized as follows:

This work proposes a scalable algorithm for privacy-preserving item-based top-N recommendation, which can achieve high-quality recommendations with low-complexity computations. Compared with existing methods, the computation time of the proposed method increases slowly as the number of users increases, thus providing high scalability for privacy-preserving recommender systems.

The evaluation results on three real-world data sets demonstrate that the proposed method can be more efficient than non-privacy-preserving item-based top-N recommendation methods. Specifically, when the accuracy loss is less than 1.0%, the corresponding computation time is reduced by 20.99%, 57.35%, and 62.81% on the Last.fm, Jester, and MovieLens 20M data sets, respectively.
The rest of this paper is organized as follows. Section II surveys related work. Section III describes the problem formulation. Section IV presents the proposed scalable algorithm for privacy-preserving item-based top-N recommendation. Section V discusses experimental results on the real-world data sets. Finally, Section VI concludes this work.
II Related Work
Existing privacy-preserving recommender systems can be classified into two main categories: cryptography based methods [4, 6, 12, 13] and random perturbation based methods [14, 15, 16].
II-A Cryptography Based Methods
Cryptography based methods adopt cryptography, such as homomorphic encryptions, to hide true user ratings during computation.
Kikuchi et al. proposed using homomorphic encryption to calculate user similarities and item recommendation scores, with the scores decrypted by a set of trusted authorities [18]. Since all users’ data are involved in the calculation only in protected form, users can obtain recommendations without privacy violation. A similar method was introduced in [19], in which Zhan et al. constructed a privacy-preserving collaborative recommender system based on the scalar product protocol that is more efficient than major cryptographic approaches. Canny proposed an algorithm whereby a community of users can compute a public “aggregate” of their data that does not expose individual users’ data [6]. The method uses homomorphic encryption to allow sums of encrypted vectors to be computed and decrypted without exposing individual users’ data. These homomorphic encryption based approaches are computationally inefficient because all computations are performed on encrypted data. Additionally, cryptography based methods suffer from scalability issues, since encryption and decryption operations are computation intensive and do not scale well to large numbers of users and items.
II-B Random Perturbation Based Methods
Random perturbation based methods perturb users’ ratings to prevent the server from obtaining true users’ ratings [14, 15, 16].
Polat et al. proposed a privacy-preserving CF method that perturbs each individual user’s original data by adding a random number, while still accurately estimating aggregate data from a large number of users [14]. Casino et al. proposed an anonymous approach to protect users’ privacy, in which they clustered similar users and profiled the clusters. The profile of a cluster can then represent the users in that cluster, which perturbs individual users’ data and preserves users’ privacy. Storing users’ profiles in a distributed manner while still producing recommendations is another option for perturbation based privacy-preserving recommender systems. Shokri et al. proposed a distributed mechanism to improve privacy while achieving accurate recommendations [20]. In their method, users first store their profiles on their own side (called offline profiles). Then, the offline profiles are partly merged with the profiles of similar users. After that, the offline profiles are uploaded to a central server periodically to participate in recommendations. A different approach is presented in [21], which relies on a two-fold mechanism implemented on a decentralized user-based CF. In that work, users’ exact profiles are never exchanged while the interest-based topology is constructed; obfuscated profiles are extracted instead, which preserves the quality of recommendations. Random perturbation based methods trade accuracy for privacy, yet the protection of user privacy is not guaranteed. As shown in [17], the server can partially recover true user data from perturbed data using learning techniques.
III Problem Formulation
III-A Privacy Issue in Item-based Top-N Recommendation
The main purpose of recommender systems is to estimate user ratings of items that have not been seen by the user [1]. Generally, recommender systems estimate users’ ratings for a specific item based on users’ previous ratings on other items. Thus, recommender systems need to collect users’ item ratings before estimating users’ ratings of unrated items. This is a violation of user privacy, since most users do not wish to expose their item ratings [22]. Sensitive user information (e.g., user preferences) may be collected, analyzed, and even sold when a company declares bankruptcy [6]. Ideally, users of recommender systems should be able to obtain high-quality recommendations efficiently, without exposing their item ratings to the recommendation server or any other third parties. The two key steps required in item-based top-N recommendation [23] are described in the following two subsections.
III-B Item Similarity Computation
A variety of methods can be adopted to calculate the similarities between items, such as cosine similarity [24, 25], correlation-based similarity [24], and Jaccard similarity [26]. For binary ratings (i.e., 0 means dislike and 1 means like), Jaccard similarity is the most widely adopted similarity measure. Given two items $i$ and $j$, their Jaccard similarity is defined as follows [26]:
$$ s_{i,j} = \frac{|U_i \cap U_j|}{|U_i \cup U_j|} \tag{1} $$
where $U_i$ ($U_j$) is the set of users who like item $i$ ($j$). Computing the Jaccard similarity for all item pairs is time-consuming, especially when the number of items is large, and performing the same computation on encrypted data is costlier still. Thus, cryptography-based privacy-preserving recommendation is not practical for large data sets.
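As a small illustration of the Jaccard similarity defined above, the following Python sketch computes it directly from the sets of users who like each item. The function name `jaccard`, the toy `likes` dictionary, and the convention of returning 0 for two empty sets are assumptions made for this example only.

```python
def jaccard(users_i: set, users_j: set) -> float:
    """Jaccard similarity between the sets of users who like two items."""
    union = users_i | users_j
    if not union:
        return 0.0  # convention chosen here for two empty sets
    return len(users_i & users_j) / len(union)


# Toy data (hypothetical): item -> set of ids of users who like it.
likes = {
    "a": {1, 2, 3, 4},
    "b": {3, 4, 5},
    "c": {6},
}

sim_ab = jaccard(likes["a"], likes["b"])  # 2 shared users out of 5 in total
sim_ac = jaccard(likes["a"], likes["c"])  # no shared users
```

Computing this exactly for every item pair touches every user set for every pair, which is the cost that the MinHash-based estimate is meant to avoid.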
III-C Predicting Item Ratings and Top-N Recommendation
After obtaining item similarities, predictions for unrated items are computed by taking a weighted average of a target user’s past item ratings. Given a target user $u$ and an unrated item $j$, the predicted rating of $u$ on $j$ can be computed as follows [23]:
$$ \hat{r}_{u,j} = \frac{\sum_{i \in I_u} s_{i,j}\, r_{u,i}}{\sum_{i \in I_u} s_{i,j}} \tag{2} $$
where $I_u$ is the set of items rated by $u$, $s_{i,j}$ is the similarity between items $i$ and $j$, and $r_{u,i}$ is $u$’s rating on item $i$.
For binary ratings, however, $r_{u,i}$ is 1 if user $u$ likes item $i$ and 0 otherwise, so the predicted rating calculated by Equation 2 is always a constant for item $j$. Therefore, for binary ratings, the predicted rating of $u$ on $j$ can be rewritten as Equation 3 [26]:
$$ \hat{r}_{u,j} = \sum_{i \in I_u} s_{i,j} \tag{3} $$
After computing $\hat{r}_{u,j}$ for all unrated items of $u$, the ranking of the unrated items of user $u$ can be obtained, and the $N$ items with the highest predicted ratings will be recommended to $u$.
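The ranking step can be sketched as follows. This is a minimal example, assuming that for binary ratings the score of an unrated item is the sum of its similarities to the items the user liked (our reading of Equation 3). The function `recommend_top_n`, the pair-keyed `sim` dictionary, and the toy values are hypothetical.

```python
from typing import Dict, List, Set, Tuple


def recommend_top_n(
    liked: Set[str],
    sim: Dict[Tuple[str, str], float],
    items: List[str],
    n: int,
) -> List[str]:
    """Rank the items the user has not rated by summed similarity to liked items."""

    def s(a: str, b: str) -> float:
        # Similarities are symmetric; missing pairs are treated as 0.
        return sim.get((a, b), sim.get((b, a), 0.0))

    scores = {j: sum(s(i, j) for i in liked) for j in items if j not in liked}
    # Highest score first; ties broken by item name for determinism.
    return sorted(scores, key=lambda j: (-scores[j], j))[:n]


items = ["a", "b", "c", "d"]
sim = {("a", "c"): 0.5, ("b", "c"): 0.2, ("a", "d"): 0.1, ("b", "d"): 0.3}
top1 = recommend_top_n({"a", "b"}, sim, items, 1)
top2 = recommend_top_n({"a", "b"}, sim, items, 2)
```

Here item `c` scores 0.5 + 0.2 and item `d` scores 0.1 + 0.3, so `c` is ranked first.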
IV Scalable Privacy-Preserving Item-based Top-N Recommendation
This section first gives an overview of the proposed scalable privacy-preserving item-based top-N recommendation approach, then details each step of the approach. After that, a theoretical analysis of the efficiency and effectiveness of the proposed approach is presented.
IV-A Solution Overview
In this work, our goal is to protect user privacy and reduce computation complexity with minimum loss in recommendation accuracy. We achieve this by proposing the following techniques: 1) a MinHash-based privacy-preserving similarity estimation method, which can estimate the Jaccard similarity of items with high efficiency and protect user privacy during the computation process; and 2) a client-side privacy-preserving prediction generation method, which can predict item ratings for users with privacy protection. Unlike existing works, which trade either efficiency (e.g., cryptography-based methods) or accuracy (e.g., perturbation-based methods) for privacy, our solution can guarantee privacy protection while supporting flexible balancing between accuracy and efficiency. The flowchart of the proposed approach is shown in Figure 1.
IV-B MinHash-based Privacy-Preserving Jaccard Similarity Estimation
MinHash is an efficient method for estimating the Jaccard similarity between two sets [27]. Let $h_1, \dots, h_r$ denote $r$ independent random permutations on elements (hash functions). For any two sets $A$ and $B$, each $h_k$ has the following property [27]:
$$ \Pr\left[\min h_k(A) = \min h_k(B)\right] = \frac{|A \cap B|}{|A \cup B|} \tag{4} $$
The Jaccard similarity can therefore be estimated as the fraction of the $r$ hash functions for which the two minima coincide. The number of hash functions $r$ is the key factor in determining estimation accuracy and efficiency: higher values of $r$ give higher estimation accuracy but lower computation efficiency. Theoretically, if $r$ is large enough, the estimate can be as accurate as the exact computation using Equation 1. Tradeoffs between accuracy and efficiency of the method are further discussed in Section IV-D2.
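The MinHash property above can be checked numerically. The sketch below is illustrative only: it uses explicit random permutations of a small universe in place of hash functions, and the names `minhash_signature` and `estimate_jaccard`, the seed, and the set sizes are all assumptions of this example rather than part of the proposed protocol. Both sets are assumed to be non-empty.

```python
import random


def minhash_signature(elements, perms):
    """Minimum permuted value of a non-empty set under each permutation."""
    return [min(p[x] for x in elements) for p in perms]


def estimate_jaccard(sig_a, sig_b):
    """Fraction of permutations on which the two minima coincide."""
    matches = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    return matches / len(sig_a)


rng = random.Random(0)
universe_size = 1000   # e.g. number of user ids
num_hashes = 400       # more permutations give a tighter estimate

perms = []
for _ in range(num_hashes):
    p = list(range(universe_size))
    rng.shuffle(p)     # one random permutation of the universe
    perms.append(p)

A = set(range(0, 600))
B = set(range(300, 900))
true_j = len(A & B) / len(A | B)  # 300 shared out of 900 in the union
est_j = estimate_jaccard(minhash_signature(A, perms), minhash_signature(B, perms))
```

With a few hundred permutations the estimate typically lands within a few hundredths of the exact value, while each set is summarized by only `num_hashes` numbers.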
Based on MinHash, Jaccard similarities among item pairs can be efficiently estimated, but user privacy should be strictly protected during this process. To this end, a privacy-preserving MinHash protocol is proposed, in which users choose whether to add their information during anonymous random walks. Thus, neither the server nor any other user can know which pieces of data come from a target user or whether a target user has added his/her data at all, so that user privacy can be protected. The detailed procedure for estimating Jaccard similarity with the privacy-preserving MinHash method is presented in Algorithm 1.
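Algorithm 1 is not reproduced in this text, so the following is only an illustrative simulation of the data flow described above, under our own assumptions. A message carries, for each hash function, the smallest hash value seen so far per item. Each visited user may fold in his/her likes by updating those minima, and the final message carries no contributor identities. The hash family, the names (`make_hashes`, `contribute`, `random_walk`, `estimate_similarity`), and the opt-in probability `join_prob` are hypothetical. The sketch makes no claim about the security argument of the actual protocol.

```python
import random

P = 2_147_483_647  # a prime larger than any user id used below


def make_hashes(r, rng):
    """Draw r hash functions h(x) = (a*x + b) mod P (assumed family)."""
    return [(rng.randrange(1, P), rng.randrange(0, P)) for _ in range(r)]


def contribute(message, user_id, liked_items, hashes):
    """Fold one user's likes into the message by keeping per-item minima."""
    for row, (a, b) in zip(message, hashes):
        h = (a * user_id + b) % P
        for item in liked_items:
            if item not in row or h < row[item]:
                row[item] = h


def random_walk(likes, hashes, steps, join_prob, rng):
    """Pass one message along randomly chosen users; each may opt in."""
    message = [dict() for _ in hashes]  # one {item: min hash} row per hash
    users = list(likes)
    for _ in range(steps):
        user = rng.choice(users)
        if rng.random() < join_prob:
            contribute(message, user, likes[user], hashes)
    return message  # what the server receives: minima only, no user ids


def estimate_similarity(message, i, j):
    """Fraction of hash functions whose minima agree for items i and j."""
    rows = [row for row in message if i in row and j in row]
    if not rows:
        return 0.0
    return sum(row[i] == row[j] for row in rows) / len(rows)


rng = random.Random(7)
likes = {u: {"x", "y"} for u in range(50)}        # users 0..49 like x and y
likes.update({u: {"z"} for u in range(50, 100)})  # users 50..99 like only z
hashes = make_hashes(64, rng)
msg = random_walk(likes, hashes, steps=300, join_prob=0.5, rng=rng)
sim_xy = estimate_similarity(msg, "x", "y")
sim_xz = estimate_similarity(msg, "x", "z")
```

Because taking a minimum is order-independent and idempotent, a user who is visited twice or not at all changes nothing structurally in the message. Limiting `steps` bounds the work regardless of the total number of users.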
IV-C Client-Side Privacy-Preserving Prediction Generation
Based on Algorithm 1, the recommender system can obtain the Jaccard similarities among all item pairs. The server then sends the item similarities to all users. After obtaining the item similarities, each client computes its own prediction scores based on Equation 3 and recommends the items with the highest scores to its user. Since this step is carried out entirely on the client side, the user’s ratings never leave the client, so user privacy can be strictly protected.
IV-D Discussion
IV-D1 Complexity Analysis
The complexity of Jaccard similarity computation is , where is the number of items and is the number of users. But based on Algorithm 1, the computation complexity is reduced to , where is the number of hash functions. This is because the server can go through the hash results, and find out all the cases that . Then, the server can calculate the probability of for all item pairs. For prediction generation, the server side computation is zero. And the computation complexity for each user is , where is the number of items rated by the user.
IV-D2 Accuracy of Similarity Estimation
The similarity estimation accuracy of Algorithm 1 is determined by $r$, the number of hash functions. Here, we analyze the relationship between estimation accuracy and $r$ theoretically; detailed statistical results are presented in the evaluation section. We first introduce the Chernoff Bound before analyzing the accuracy of the method.
Theorem (Chernoff Bound [28])
Given a set of r independent identically distributed (iid) random variables , satisfying that and . Let . Then for any , .
Theorem
Given a set of hash functions and items , , let and be the estimated and true Jaccard similarity between and , then for any and , .
Proof
For any , we have if . Let . Then, is a set of independent identically distributed random variables and . Let , then (based on Equation 4). Then, according to Chernoff Bound, let and , we have , i.e., .
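To give a feel for how the number of hash functions controls the estimation error, the sketch below uses Hoeffding's inequality, a standard concentration bound that is not necessarily the exact form used in the theorem above. If each hash function yields an independent 0/1 indicator whose mean is the true similarity, then the average of $r$ indicators deviates from that mean by at least $\epsilon$ with probability at most $2e^{-2r\epsilon^2}$. The function names and the symbols `eps` and `delta` are assumptions of this example.

```python
import math


def hoeffding_failure_bound(r: int, eps: float) -> float:
    """Upper bound on Pr[|estimate - true| >= eps] with r hash functions."""
    return min(1.0, 2.0 * math.exp(-2.0 * r * eps * eps))


def hashes_needed(eps: float, delta: float) -> int:
    """Smallest r for which the bound above is at most delta."""
    return math.ceil(math.log(2.0 / delta) / (2.0 * eps * eps))


r_needed = hashes_needed(eps=0.05, delta=0.05)
bound_at_r = hoeffding_failure_bound(r_needed, 0.05)
bound_below_r = hoeffding_failure_bound(r_needed - 1, 0.05)
```

The required number of hash functions grows with $1/\epsilon^2$ but does not depend on the number of users, which is consistent with the scalability argument of this section.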
IV-D3 Privacy Analysis
In the proposed method, only the item similarity computation, which is performed among users, may reveal private user information. Here, we prove that Algorithm 1 can strictly protect user privacy under the semi-honest model [29], in which users follow the computation protocol honestly except that they may try to infer information from intermediate data. The “privacy” definition is adopted from Goldreich [29], which states that a computation protocol is privacy-preserving if the view of each party during the execution of the protocol can be simulated by a polynomial-time algorithm knowing only the input and the output of that party.
Theorem
Algorithm 1 is privacy-preserving for users under the semi-honest model.
Proof
The simulator for Algorithm 1 can be constructed as follows:

Stage 1: In this stage, the server randomly chooses to start a random walk or receives an empty . If decides not to add its data, the output of is an empty which can be easily simulated. Otherwise, the simulator for can simulate with , and then send to another user . From the view of , ( is the number of users), i.e., can be from any user in . Thus, cannot learn any information from , no matter added his data or not. This indicates that the output of is indistinguishable from what the next user views in real random walk.

Stage 2: In this stage, a user receives a nonempty . The simulator for can simulate his output with . No matter to whom chooses to send (the server or another user), does not contain any private information of and the output of is also indistinguishable from what the next party views in real random walk.
The above simulator is linear in the size of , i.e., a polynomial-time simulator is successfully constructed for users. Thus, Algorithm 1 is privacy-preserving for users.
V Experiments
This section evaluates the efficiency and effectiveness of the proposed approach on three real-world data sets. The first study evaluates how the recommendation accuracy of the proposed method is affected by the key factor in the proposed algorithm. Then, the absolute error of similarity estimation is further analyzed to help understand the proposed method.
V-A Experimental Setup
The evaluation uses three real-world data sets that have been widely used for evaluating recommendation algorithms. Table I shows the three data sets in detail.
Data Set  # of Users  # of Items  Density (%) 
Last.fm  17,976  8,007  0.65 
Jester  24,983  100  41.97 
MovieLens 20M  7,120  131,262  0.11 
Note that this study converts the ratings from real numbers to binary values in the Jester and MovieLens 20M data sets. The original ratings range from -10 to 10 in the Jester data set and from 0 to 5 in the MovieLens 20M data set. We change the rating to 1 if a user has rated an item, and to 0 otherwise.
We split each data set randomly into a train set and a test set with a ratio of 4:1. The reported results are averages over ten different random train-test splits. Since the privacy property of the proposed method has been proved theoretically, the evaluation focuses on comparing the efficiency and the accuracy of the proposed method (PPIBTN) with those of an item-based top-N recommendation algorithm [23, 30] (IBTN).
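The preprocessing described above can be sketched as follows. This is a minimal illustration, assuming ratings arrive as `(user, item, value)` triples; the helper names `binarize` and `split_train_test`, the toy ratings, and the use of seeds 0 to 9 for the ten splits are assumptions of this example.

```python
import random


def binarize(ratings):
    """Keep only the fact that a user rated an item; the value is discarded."""
    return {(user, item) for user, item, _value in ratings}


def split_train_test(pairs, rng, train_parts=4, test_parts=1):
    """Randomly split (user, item) pairs with a train:test ratio of 4:1."""
    shuffled = sorted(pairs)  # fixed order first, so the seed decides the split
    rng.shuffle(shuffled)
    cut = len(shuffled) * train_parts // (train_parts + test_parts)
    return shuffled[:cut], shuffled[cut:]


# 20 toy ratings on an arbitrary real-valued scale (hypothetical data).
ratings = [(u, i, -3.5 + u + i) for u in range(5) for i in range(4)]
pairs = binarize(ratings)

# Ten different random splits, one per seed; metrics would be averaged over them.
splits = [split_train_test(pairs, random.Random(seed)) for seed in range(10)]
train, test = splits[0]
```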
V-B Evaluation Metrics
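This study adopts the precision metric to evaluate the accuracy of the proposed approach, which is defined as follows:
$$ \text{precision} = \frac{|I_u \cap I_r|}{|I_r|} \tag{5} $$
where $I_u$ is the set of items that a user rated and $I_r$ is the set of items that are recommended.
Also, Equation 6 defines the absolute error of similarity estimation:
$$ e_{i,j} = \left| \hat{s}_{i,j} - s_{i,j} \right| \tag{6} $$
where $\hat{s}_{i,j}$ is the similarity between items $i$ and $j$ calculated by the proposed Algorithm 1, and $s_{i,j}$ is their exact Jaccard similarity from Equation 1.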
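Both metrics are straightforward to compute. The sketch below assumes that precision is the fraction of recommended items that appear among the items the user rated, and that the absolute error is the gap between an estimated and an exact similarity. The function names and toy inputs are hypothetical.

```python
def precision(recommended, rated):
    """Fraction of recommended items that the user actually rated."""
    recommended = list(dict.fromkeys(recommended))  # drop duplicates, keep order
    if not recommended:
        return 0.0
    hits = sum(1 for item in recommended if item in rated)
    return hits / len(recommended)


def absolute_error(estimated, exact):
    """Absolute error of one similarity estimate."""
    return abs(estimated - exact)


p = precision(["a", "b", "c", "d"], {"a", "c", "x"})  # 2 hits out of 4
e = absolute_error(0.30, 0.25)
```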
V-C Recommendation Efficiency Comparison
Panel (a) of each data set's figure shows the recommendation efficiency of the proposed PPIBTN method and the IBTN method on Last.fm, Jester, and MovieLens 20M, respectively, where is the number of users and is the number of hash functions. We can see that for all values, the computation time of the PPIBTN method is less than that of the IBTN method. For instance, compared with the IBTN method, when , the proposed PPIBTN method can reduce the computation time by approximately 20.99%, 49.50%, and 47.74% on Last.fm, Jester, and MovieLens 20M, respectively. And when , the computation time is reduced by approximately 8.85%, 6.25%, and 12.32% on Last.fm, Jester, and MovieLens 20M, respectively. In the proposed PPIBTN method, the hash procedure can greatly reduce the density of the data set, so the similarity computation step is much more efficient than that of the IBTN method.
V-D Recommendation Accuracy Comparison
Panel (b) of each data set's figure shows the recommendation precision loss of the proposed PPIBTN method relative to the IBTN method on Last.fm, Jester, and MovieLens 20M, respectively. More specifically, when , precision loss is approximately 2.60%, 0.13%, and 1.07% on Last.fm, Jester, and MovieLens 20M, respectively. Recommendation precision loss decreases to 0 when increases to in the three data sets. In particular, the recommendation precision losses are less than 1% when is larger than , which indicates that the proposed method can achieve decent accuracy.
V-E Accuracy Analysis of Similarity Estimation
Panel (c) of each data set's figure shows the probability that the absolute error of similarity estimation is less than on Last.fm, Jester, and MovieLens 20M, respectively. We can see from the results that experimental probabilities are higher than theoretical bounds for values of 0.03, 0.04, and 0.05, and the probabilities are closer to 1 when increases. This confirms that the similarity estimation accuracy is bounded, as stated by the theorem in Section IV-D2. Meanwhile, the results also indicate that recommender systems can choose different values to balance between similarity estimation accuracy and efficiency.
VI Conclusion
Recommender systems have played an essential role in e-commerce in recent years. However, existing solutions for recommendation have limited capabilities when it comes to protecting user privacy while still achieving high scalability. In this work, we have proposed a scalable algorithm for privacy-preserving item-based top-N recommendation. The proposed algorithm can guarantee the protection of user privacy while significantly enhancing recommendation efficiency with decent recommendation quality. Comprehensive theoretical and experimental analysis demonstrates the efficiency and effectiveness of the proposed approach.
Acknowledgment
This work was supported in part by the National Natural Science Foundation of China under Grant No. 61233016, and the National Science Foundation (NSF) of United States under grant No. 1334351 and 1442971.
References
 [1] G. Adomavicius and A. Tuzhilin, “Toward the next generation of recommender systems: a survey of the state-of-the-art and possible extensions,” IEEE Transactions on Knowledge and Data Engineering, vol. 17, no. 6, pp. 734–749, June 2005.
 [2] Y. He, C. Wang, and C. Jiang, “Modeling data correlations in recommendation,” IEEE Access, vol. 5, pp. 11030–11042, 2017.
 [3] D. Li, Q. Lv, X. Xie, L. Shang, H. Xia, T. Lu, and N. Gu, “Interest-based real-time content recommendation in online social communities,” Knowledge-Based Systems, vol. 28, pp. 1–12, 2012.
 [4] D. Li et al., “An algorithm for efficient privacy-preserving item-based collaborative filtering,” Future Generation Computer Systems, vol. 55, pp. 311–320, 2016.
 [5] G. Linden, B. Smith, and J. York, “Amazon.com recommendations: Item-to-item collaborative filtering,” IEEE Internet Computing, vol. 7, no. 1, pp. 76–80, 2003.
 [6] J. Canny, “Collaborative filtering with privacy,” in Proceedings of IEEE Symposium on Security and Privacy. IEEE, 2002, pp. 45–57.
 [7] J. Davidson et al., “The YouTube video recommendation system,” in Proceedings of the fourth ACM conference on Recommender systems. ACM, 2010, pp. 293–296.
 [8] D. Li et al., “Efficient privacy-preserving content recommendation for online social communities,” Neurocomputing, vol. 219, pp. 440–454, 2017.
 [9] D. Li, Q. Lv, L. Shang, and N. Gu, “Yana: an efficient privacy-preserving recommender system for online social communities,” in Proceedings of the 20th ACM international conference on Information and knowledge management. ACM, 2011, pp. 2269–2272.
 [10] D. Li, Q. Lv, H. Xia, L. Shang, T. Lu, and N. Gu, “Pistis: a privacy-preserving content recommender system for online social communities,” in Web Intelligence and Intelligent Agent Technology (WI-IAT), 2011 IEEE/WIC/ACM International Conference on, vol. 1. IEEE, 2011, pp. 79–86.
 [11] A. Ozturk and H. Polat, “From existing trends to future trends in privacy-preserving collaborative filtering,” Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, vol. 5, no. 6, pp. 276–291, 2015.
 [12] E. Aïmeur et al., “Alambic: a privacy-preserving recommender system for electronic commerce,” International Journal of Information Security, vol. 7, no. 5, pp. 307–334, 2008.
 [13] H. Kikuchi, H. Kizawa, and M. Tada, “Privacy-preserving collaborative filtering schemes,” in International Conference on Availability, Reliability and Security. IEEE, 2009, pp. 911–916.
 [14] H. Polat and W. Du, “Privacy-preserving collaborative filtering using randomized perturbation techniques,” in The Third IEEE International Conference on Data Mining. IEEE, 2003, pp. 625–628.
 [15] S. Zhang, J. Ford, and F. Makedon, “A privacy-preserving collaborative filtering scheme with two-way communication,” in Proceedings of the 7th ACM conference on Electronic commerce. ACM, 2006, pp. 316–323.
 [16] F. McSherry and I. Mironov, “Differentially private recommender systems: building privacy into the net,” in Proceedings of the 15th ACM international conference on Knowledge discovery and data mining. ACM, 2009, pp. 627–636.
 [17] Z. Huang, W. Du, and B. Chen, “Deriving private information from randomized data,” in Proceedings of the 2005 ACM international conference on Management of data. ACM, 2005, pp. 37–48.
 [18] H. Kikuchi, H. Kizawa, and M. Tada, “Privacy-preserving collaborative filtering schemes,” in International Conference on Availability, Reliability and Security. IEEE, 2009, pp. 911–916.
 [19] J. Zhan et al., “Privacy-preserving collaborative recommender systems,” IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), vol. 40, no. 4, pp. 472–476, 2010.
 [20] R. Shokri et al., “Preserving privacy in collaborative filtering through distributed aggregation of offline profiles,” in Proceedings of the third ACM conference on Recommender systems. ACM, 2009, pp. 157–164.
 [21] A. Boutet et al., “Privacy-preserving distributed collaborative filtering,” Computing, vol. 98, no. 8, pp. 827–846, 2016.
 [22] S. Berkovsky et al., “Examining users’ attitude towards privacy preserving collaborative filtering,” in Data Mining for User Modeling Online Proceedings of Workshop held at the, 2007, p. 28.
 [23] M. Deshpande and G. Karypis, “Item-based top-N recommendation algorithms,” ACM Transactions on Information Systems, vol. 22, no. 1, pp. 143–177, 2004.
 [24] B. Sarwar et al., “Item-based collaborative filtering recommendation algorithms,” in Proceedings of the 10th international conference on World Wide Web. ACM, 2001, pp. 285–295.
 [25] Z. Tan and L. He, “An efficient similarity measure for user-based collaborative filtering recommender systems inspired by the physical resonance principle,” IEEE Access, 2017.
 [26] A. S. Das et al., “Google news personalization: scalable online collaborative filtering,” in Proceedings of the 16th international conference on World Wide Web. ACM, 2007, pp. 271–280.
 [27] A. Z. Broder et al., “Min-wise independent permutations,” in Proceedings of the thirtieth annual ACM symposium on Theory of computing. ACM, 1998, pp. 327–336.
 [28] H. Chernoff, “A measure of asymptotic efficiency for tests of a hypothesis based on the sum of observations,” The Annals of Mathematical Statistics, pp. 493–507, 1952.
 [29] O. Goldreich, “Secure multiparty computation,” Manuscript. Preliminary version, pp. 86–97, 1998.
 [30] D. Li et al., “Item-based top-N recommendation resilient to aggregated information revelation,” Knowledge-Based Systems, vol. 67, pp. 290–304, 2014.