A Scalable Algorithm for Privacy-Preserving Item-based Top-N Recommendation
Recommender systems have become an indispensable component in online services during recent years. Effective recommendation is essential for improving the services of various online business applications. However, serious privacy concerns have been raised on recommender systems requiring the collection of users’ private information for recommendation. At the same time, the success of e-commerce has generated massive amounts of information, making scalability a key challenge in the design of recommender systems. As such, it is desirable for recommender systems to protect users’ privacy while achieving high-quality recommendations with low-complexity computations.
This paper proposes a scalable privacy-preserving item-based top-N recommendation solution, which can achieve high-quality recommendations with reduced computation complexity while ensuring that users’ private information is protected. Furthermore, the computation complexity of the proposed method increases slowly as the number of users increases, thus providing high scalability for privacy-preserving recommender systems. More specifically, the proposed approach consists of two key components: (1) MinHash-based similarity estimation and (2) client-side privacy-preserving prediction generation. Our theoretical and experimental analysis using real-world data demonstrates the efficiency and effectiveness of the proposed approach.
Index terms— item-based, top-N recommendation, privacy, scalability
With the explosive growth of information on the Internet, recommender systems have become indispensable in many online services. As one of the most important techniques in recommender systems, collaborative filtering (CF) [2, 3] has been broadly adopted, for example by Amazon [4, 5], MovieLens, and YouTube. Such broad application of CF raises serious concerns about the leakage of users’ privacy.
The privacy concern originates from the basic idea behind CF techniques. For instance, item-based top-N recommendation, one of the widely used CF techniques, assumes that a user may be interested in items that are similar to the items that he/she liked before. More specifically, the problem of top-N recommendation aims to provide an ordered list of items to a user. Item-based recommendation works by first collecting users’ “ratings” of items. Then, items are profiled by analyzing users’ ratings for them. Finally, a user is recommended items that have a high similarity with his/her historically highly-rated items. Through the recommendation, users can find items of interest, such as movies or music. However, to profile items and perform recommendations for users, sensitive information, such as user demographics and user ratings, is collected by recommender systems, giving rise to serious privacy concerns for individual users [4, 1, 8, 9, 10]. In addition, profiling and recommending items based on all users’ information leads to scalability issues because of the explosive growth of online information.
Recent research has aimed to tackle privacy preservation for individual users in CF. Scalability is the primary limiting factor of existing cryptography-based methods [4, 6, 12, 13]. These methods adopt cryptography, such as homomorphic encryption, to hide true user ratings during computation. They require computation-intensive encryption and decryption operations and hence do not scale well to large numbers of users and items [6, 12, 13]. Compared with cryptography-based solutions, random perturbation based methods do not conduct intensive computations but protect users’ privacy by perturbing users’ ratings, for example by adding random noise, before sending the users’ information to the server [14, 15, 16]. As such, random perturbation based methods are efficient and easy to implement. However, these methods trade accuracy for privacy, and even then the protection of user privacy is not guaranteed: as prior work has shown, the server can partially recover valid user data from perturbed data using learning techniques.
An ideal privacy-preserving recommender system should guarantee user privacy protection without compromising recommendation accuracy or efficiency. However, existing privacy-preserving CF methods trade either efficiency (cryptography-based methods) or accuracy (random perturbation based methods) for privacy. Therefore, the focus of this work is to address the above challenges and develop a scalable solution capable of preserving privacy while achieving high-quality recommendations with low-complexity computations.
Our work is motivated by one key observation: the overwhelming volume of information generated on the Internet has a degree of redundancy for recommendations, as some users have similar preferences. As such, it is possible to develop a scalable approach that collects only some anonymous users’ information while still achieving high-quality recommendations. Therefore, this work proposes an approach for item-based top-N recommendation that can cope with scenarios involving large numbers of users and items while achieving efficient privacy-preserving recommendation. Different from the methods above, the proposed approach neither performs additional calculation nor changes the original users’ information to preserve users’ privacy. Specifically, our proposed approach works as follows: (1) to protect individual users’ privacy, we first use anonymous random walks to collect users’ information, thus eliminating the correspondence between individual users and their information; (2) item similarities are estimated online from the received information using the MinHash-based method, and the recommendations are generated locally; and (3) by restricting the number of random walks, the computation cost is reduced significantly, especially for scenarios with a large number of users and/or items.
The contributions of this work are summarized as follows:
This work proposes a scalable algorithm for privacy-preserving item-based top-N recommendations, which can achieve high-quality recommendations with low-complexity computations. Compared with existing methods, the computation time of the proposed method increases slowly as the number of users increases, thus providing high scalability for privacy-preserving recommender systems.
The evaluation results on three real-world data sets demonstrate that the proposed method can be more efficient than non-privacy-preserving item-based top-N recommendation methods. Specifically, when the accuracy loss is less than 1.0%, the corresponding computation time is reduced by 20.99%, 57.35%, and 62.81% on the Last.fm, Jester, and MovieLens 20M data sets, respectively.
The rest of this paper is organized as follows. Section II surveys the related works. Section III describes the problem formulation. Section IV presents the proposed scalable algorithm for privacy-preserving item-based top-N recommendation. Section V discusses experimental results on the real-world data sets. Finally, Section VI concludes this work.
II Related Work
II-A Cryptography Based Methods
Cryptography based methods adopt cryptography, such as homomorphic encryptions, to hide true user ratings during computation.
Kikuchi et al. proposed using homomorphic encryption to calculate user similarities and item recommendation scores, with the scores decrypted by a set of trusted authorities. Since their method involves all users’ original data in the calculation, users can obtain recommendations without privacy violation. Zhan et al. introduced a similar method, constructing a privacy-preserving collaborative recommender system based on the scalar product protocol that is more efficient than major cryptographic approaches. Canny proposed an algorithm whereby a community of users can compute a public “aggregate” of their data that does not expose individual users’ data. The algorithm uses homomorphic encryption to allow sums of encrypted vectors to be computed and decrypted without exposing individual users’ data. These homomorphic encryption based approaches are computationally inefficient because all computations are performed on encrypted data. Additionally, cryptography based methods suffer from scalability issues, since encryption and decryption operations are computation intensive and do not scale well to large numbers of users and items.
II-B Random Perturbation Based Methods
Polat et al. proposed a privacy-preserving CF method that perturbs each individual user’s original data by adding a random number, while still accurately estimating aggregate data from a large number of users. Casino et al. proposed a $k$-anonymous approach to protect users’ privacy, in which they clustered similar users and profiled the clusters. The profile of a cluster can then represent the users in that cluster, which perturbs individual users’ data and preserves users’ privacy. Storing users’ profiles in a distributed manner is another option for perturbation based privacy-preserving recommender systems. Shokri et al. proposed a distributed mechanism to improve privacy preservation while achieving accurate recommendations. In their method, users first store their profiles on the client side (called offline profiles). Then, the offline profiles are partly merged with the profiles of similar users. After that, the offline profiles are uploaded to a central server periodically to participate in recommendations. A different approach relies on a two-fold mechanism implemented on decentralized user-based CF. In that work, users’ exact profiles are never exchanged while the interest-based topology is constructed; obfuscated profiles are extracted instead, which preserves the quality of recommendations. Random perturbation based methods trade accuracy for privacy, yet the protection of user privacy is not guaranteed: as prior work has shown, the server can partially recover true user data from perturbed data using learning techniques.
III Problem Formulation
III-A Privacy Issue in Item-based Top-N Recommendation
The main purpose of recommender systems is to estimate user ratings of items that have not been seen by the user. Generally, recommender systems estimate users’ ratings for a specific item based on users’ previous ratings on other items. Thus, recommender systems need to collect users’ item ratings before estimating users’ ratings of unrated items. This is a violation of user privacy, since most users do not wish to expose their item ratings. Sensitive user information (e.g., user preferences) may be collected, analyzed, and even sold when a company declares bankruptcy. Ideally, users of recommender systems should be able to obtain high-quality recommendations efficiently, without exposing their item ratings to the recommendation server or any other third parties. The two key steps required in item-based top-N recommendation are described in the following two subsections.
III-B Item Similarity Computation
A variety of methods can be adopted to calculate the similarities between items, such as cosine similarity [24, 25], correlation-based similarity, and Jaccard similarity. For binary ratings (i.e., 0 means dislike and 1 means like), Jaccard similarity is the most widely adopted similarity measure. Given two items $i$ and $j$, their Jaccard similarity is defined as follows:
$$J(i, j) = \frac{|U_i \cap U_j|}{|U_i \cup U_j|} \tag{1}$$
where $U_i$ ($U_j$) is the set of users who like item $i$ ($j$). Computing the Jaccard similarity is time-consuming, especially when the number of items is large. Thus, cryptography based privacy-preserving recommendation is not practical for large datasets.
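As a minimal sketch of Equation 1, the following Python snippet computes the exact Jaccard similarity for every item pair from per-item sets of user ids. The function names, the toy data, and the convention of returning 0 for two empty sets are illustrative assumptions, not part of the paper.

```python
from itertools import combinations


def jaccard(users_a: set, users_b: set) -> float:
    """Exact Jaccard similarity of two sets of user ids (Equation 1)."""
    union = users_a | users_b
    if not union:
        return 0.0  # assumed convention for two items that nobody liked
    return len(users_a & users_b) / len(union)


def all_pair_similarities(item_users: dict) -> dict:
    """Map each unordered item pair (i, j) to its exact Jaccard similarity."""
    return {
        (i, j): jaccard(item_users[i], item_users[j])
        for i, j in combinations(sorted(item_users), 2)
    }


# Toy data (hypothetical): item id -> set of ids of users who like the item.
item_users = {
    "a": {1, 2, 3},
    "b": {2, 3, 4},
    "c": {5},
}
sims = all_pair_similarities(item_users)
```

Every pair requires a set intersection and a set union over user ids, which is the cost that the MinHash-based estimation is meant to avoid.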
III-C Predicting Item Ratings and Top-N Recommendation
After obtaining item similarities, predictions for unrated items are computed by taking a weighted average of a target user’s past item ratings. Given a target user $u$ and an unrated item $i$, the predicted rating $p_{u,i}$ of $u$ on $i$ can be computed as follows:
$$p_{u,i} = \frac{\sum_{j \in I_u} J(i, j)\, R_{u,j}}{\sum_{j \in I_u} J(i, j)} \tag{2}$$
where $J(i, j)$ is the item similarity from Equation 1, $I_u$ is the set of items rated by $u$, and $R_{u,j}$ is $u$’s rating on item $j$.
However, for binary ratings, the rating $R_{u,j}$ is 1 if user $u$ likes the $j$th item and 0 otherwise. Since every item in $I_u$ then has rating 1, the numerator and denominator of Equation 2 coincide, and the predicted rating calculated by Equation 2 is always a constant for the $i$th item. Therefore, for binary ratings, the predicted rating of $u$ on $i$ can be rewritten as Equation 3:
$$p_{u,i} = \sum_{j \in I_u} J(i, j) \tag{3}$$
After computing $p_{u,i}$ for all unrated items of $u$, the similarity ranking of the unrated items of user $u$ can be obtained, and the $N$ items with the highest predicted ratings will be recommended to $u$.
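The prediction and ranking step can be sketched as follows, assuming the binary-rating form in which an unrated item's score is the sum of its similarities to the items the user already liked. The function names, the pair-keyed similarity dictionary, and the tie-breaking rule are assumptions made for illustration.

```python
def predict_scores(liked: set, sim: dict, all_items: set) -> dict:
    """Score each item the user has not liked by summing its similarities
    to the user's liked items (binary-rating prediction)."""

    def pair_sim(i, j):
        # Similarities are stored once per unordered pair; missing pairs count as 0.
        return sim.get((i, j), sim.get((j, i), 0.0))

    return {i: sum(pair_sim(i, j) for j in liked) for i in all_items - liked}


def top_n(scores: dict, n: int) -> list:
    """Return the n items with the highest scores (ties broken by item id)."""
    return sorted(scores, key=lambda i: (-scores[i], i))[:n]


# Toy similarities (hypothetical values).
sim = {
    ("a", "b"): 0.5, ("a", "c"): 0.2, ("a", "d"): 0.0,
    ("b", "c"): 0.1, ("b", "d"): 0.4, ("c", "d"): 0.3,
}
scores = predict_scores({"a", "b"}, sim, {"a", "b", "c", "d"})
recommended = top_n(scores, 2)
```

Because the computation needs only the item similarities and the user's own liked items, it can run entirely on the client.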
IV Scalable Privacy-Preserving Item-based Top-N Recommendation
This section first gives the overview of the proposed scalable privacy-preserving item-based Top-N recommendation approach, then details the process of the approach. After that, the theoretical analysis of the efficiency and effectiveness of the proposed approach is presented.
IV-A Solution Overview
In this work, our goal is to protect user privacy and reduce computation complexity with minimum loss in recommendation accuracy. We achieve this by proposing the following techniques: 1) a MinHash-based privacy-preserving similarity estimation method, which can estimate the Jaccard similarity of items with high efficiency and protect user privacy during the computation process; and 2) a client-side privacy-preserving prediction generation method, which can predict item ratings for users with privacy protection. Unlike existing works, which trade either efficiency (e.g., cryptography-based methods) or accuracy (e.g., perturbation-based methods) for privacy, our solution can guarantee privacy protection while supporting flexible balancing between accuracy and efficiency. The flowchart of the proposed approach is shown in Figure 1.
IV-B MinHash-based Privacy-Preserving Jaccard Similarity Estimation
MinHash is an efficient method for estimating the Jaccard similarity between two sets. Let $h_1, \dots, h_r$ denote $r$ independent random permutations on elements (hash functions). For any two sets $A$ and $B$, they have the following property:
$$\Pr[\min h_k(A) = \min h_k(B)] = \frac{|A \cap B|}{|A \cup B|} = J(A, B), \quad k = 1, \dots, r \tag{4}$$
so the fraction of the $r$ hash functions whose minima agree on $A$ and $B$ estimates $J(A, B)$.
In Equation 4, the number of hash functions $r$ is the key factor for determining estimation accuracy and efficiency, i.e., higher $r$ values indicate higher estimation accuracy but lower computation efficiency. Theoretically, if $r$ is large enough, then the estimation can be as accurate as computation using Equation 1. Tradeoffs between accuracy and efficiency of the method are further discussed in Section IV-D2.
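The MinHash property can be illustrated with a short, non-private Python sketch. It does not implement the anonymous random-walk protocol of Algorithm 1; it only shows how signatures yield a similarity estimate. Using a keyed BLAKE2b digest as a stand-in for a random permutation, as well as all function and variable names, are assumptions of this sketch.

```python
import hashlib


def hash_value(k: int, user_id: int) -> int:
    """k-th hash function: a keyed 64-bit digest standing in for a random permutation."""
    data = f"{k}:{user_id}".encode()
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def signature(users: set, r: int) -> list:
    """MinHash signature: the smallest hash value of the set under each of r functions."""
    return [min(hash_value(k, u) for u in users) for k in range(r)]


def estimate_jaccard(sig_a: list, sig_b: list) -> float:
    """Fraction of hash functions whose minima agree on the two sets."""
    matches = sum(1 for x, y in zip(sig_a, sig_b) if x == y)
    return matches / len(sig_a)


users_i = set(range(0, 60))    # users who like item i (toy data)
users_j = set(range(30, 90))   # users who like item j; exact Jaccard is 30/90
r = 400                        # number of hash functions
estimate = estimate_jaccard(signature(users_i, r), signature(users_j, r))
```

Each item is reduced to a signature of length `r`, so comparing two items costs `r` equality checks regardless of how many users liked them; increasing `r` tightens the estimate at the price of longer signatures.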
Based on MinHash, Jaccard similarities among item pairs can be efficiently estimated, but user privacy should be strictly protected during this process. To this end, a privacy-preserving MinHash protocol is proposed, in which users choose to add their information in anonymous random walks. Thus, neither the server nor any other user can know which piece of data is from a target user or whether a target user has added his/her data, so that user privacy can be protected. The detailed procedure for estimating Jaccard similarity based on the privacy-preserving MinHash method is presented in Algorithm 1.
IV-C Client-Side Privacy-Preserving Prediction Generation
Based on Algorithm 1, the recommender system can obtain the Jaccard similarities among all item pairs. Then, the server can send the item similarities to all users. After obtaining item similarities, each client can compute its own prediction scores based on Equation 3 and recommend items with high values to its user. Since the computation of this step is fully accomplished on the client side, user privacy can be strictly protected.
IV-D1 Complexity Analysis
The complexity of Jaccard similarity computation is , where is the number of items and is the number of users. But based on Algorithm 1, the computation complexity is reduced to , where is the number of hash functions. This is because the server can go through the hash results, and find out all the cases that . Then, the server can calculate the probability of for all item pairs. For prediction generation, the server side computation is zero. And the computation complexity for each user is , where is the number of items rated by the user.
IV-D2 Accuracy of Similarity Estimation
The similarity estimation accuracy of Algorithm 1 is determined by $r$, the number of hash functions. Here, we analyze the relationship between estimation accuracy and $r$ theoretically; detailed statistical results are presented in the evaluation section. We first introduce the Chernoff Bound before analyzing the accuracy of the method.
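Theorem (Chernoff Bound)
Let $X_1, \dots, X_r$ be a set of $r$ independent identically distributed (iid) random variables satisfying $X_k \in [0, 1]$ and $E[X_k] = \mu$. Let $\bar{X} = \frac{1}{r} \sum_{k=1}^{r} X_k$. Then for any $\epsilon > 0$, $\Pr[|\bar{X} - \mu| \ge \epsilon] \le 2e^{-2r\epsilon^2}$.
Theorem 1. Given a set of hash functions $h_1, \dots, h_r$ and items $i$, $j$, let $\hat{J}(i, j)$ and $J(i, j)$ be the estimated and true Jaccard similarity between $i$ and $j$. Then for any $\epsilon > 0$ and $r \ge 1$, $\Pr[|\hat{J}(i, j) - J(i, j)| < \epsilon] \ge 1 - 2e^{-2r\epsilon^2}$.
Proof. For any $k \in \{1, \dots, r\}$, let $X_k = 1$ if $\min h_k(U_i) = \min h_k(U_j)$ and $X_k = 0$ otherwise, where $U_i$ ($U_j$) is the set of users who like item $i$ ($j$). Let $\hat{J}(i, j) = \frac{1}{r} \sum_{k=1}^{r} X_k$. Then $X_1, \dots, X_r$ is a set of independent identically distributed random variables and $X_k \in [0, 1]$. Let $\mu = E[X_k]$; then $\mu = J(i, j)$ (based on Equation 4). Then, according to the Chernoff Bound with $\bar{X} = \hat{J}(i, j)$ and $\mu = J(i, j)$, we have $\Pr[|\hat{J}(i, j) - J(i, j)| \ge \epsilon] \le 2e^{-2r\epsilon^2}$, i.e., $\Pr[|\hat{J}(i, j) - J(i, j)| < \epsilon] \ge 1 - 2e^{-2r\epsilon^2}$.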
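To make the accuracy and efficiency tradeoff concrete, the sketch below evaluates a Hoeffding-style form of the Chernoff bound, $2e^{-2r\epsilon^2}$, for a given number of hash functions and error threshold. That this is the exact form and constant used in the paper is an assumption; the function names and the parameters `eps` and `delta` are hypothetical.

```python
import math


def error_probability_bound(r: int, eps: float) -> float:
    """Upper bound 2*exp(-2*r*eps^2) on the probability that the estimate
    deviates from the true similarity by at least eps (capped at 1)."""
    return min(1.0, 2.0 * math.exp(-2.0 * r * eps * eps))


def hash_functions_needed(eps: float, delta: float) -> int:
    """Smallest r for which the bound above is at most delta."""
    return math.ceil(math.log(2.0 / delta) / (2.0 * eps * eps))


# Number of hash functions for an absolute error below 0.05
# with probability at least 0.95, under the assumed bound.
r_needed = hash_functions_needed(eps=0.05, delta=0.05)
```

Under this assumed bound, the required number of hash functions depends only on the error threshold and the failure probability, not on the number of users, which matches the scalability argument of the section. Halving `eps` roughly quadruples the required `r`.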
IV-D3 Privacy Analysis
In the proposed method, only item similarity computation, which is performed among users, may reveal user privacy. Here, we prove that Algorithm 1 can strictly protect user privacy under the semi-honest model, in which users follow the computation protocols honestly except that they can infer information based on intermediate data. The “privacy” definition is adopted from Goldreich, which states that a computation protocol is privacy-preserving if the view of each party during the execution of the protocol can be simulated by a polynomial-time algorithm knowing only the input and the output of the party.
Algorithm 1 is privacy-preserving for users under the semi-honest model.
The simulator for Algorithm 1 can be constructed as follows:
Stage 1: In this stage, the server randomly chooses to start a random walk or receives an empty . If decides not to add its data, the output of is an empty which can be easily simulated. Otherwise, the simulator for can simulate with , and then send to another user . From the view of , ( is the number of users), i.e., can be from any user in . Thus, cannot learn any information from , no matter added his data or not. This indicates that the output of is indistinguishable from what the next user views in real random walk.
Stage 2: In this stage, a user receives a non-empty . The simulator for can simulate his output with . No matter to whom chooses to send (the server or another user), does not contain any private information of and the output of is also indistinguishable from what the next party views in real random walk.
The above simulator is linear in the size of , i.e., a polynomial-time simulator is successfully constructed for users. Thus, Algorithm 1 is privacy-preserving for users.
This section evaluates the efficiency and effectiveness of the proposed approach on three real-world data sets. The first study evaluates how the recommendation accuracy of the proposed method is affected by the key factor in the proposed algorithm. Then, the absolute error of similarity estimation is further analyzed to help understand the proposed method.
V-A Experimental Setup
The evaluation data sets are collected from three real-world data sets that have been widely used for evaluating recommendation algorithms. Table I shows the three data sets in detail.
| Data Set | # of Users | # of Items | Density (%) |
It is necessary to mention that this study changes the ratings from a real number to a binary number in the Jester data set and the MovieLens 20M data set. For instance, the ratings are from -10 to 10 in Jester data set and from 0 to 5 in the MovieLens 20M data set. We change the ratings to 1 if a user has rated an item, and to 0 otherwise.
For each data set, we split it into train and test sets randomly by setting the ratio between the train set and the test set as 4:1. The results are presented by averaging the results of ten different random train-test splits. Since the privacy property of the proposed method has been proved theoretically, the evaluation focuses on comparing the efficiency and the accuracy of the proposed method (PP-IBTN) and an item-based top-N recommendation algorithm [23, 30] (IBTN).
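The evaluation protocol above (random 4:1 train-test splits, repeated ten times) can be sketched as follows. The representation of ratings as `(user, item)` pairs, the seeding scheme, and the function name are assumptions for illustration; the paper does not specify them.

```python
import random


def split_ratings(ratings: list, train_fraction: float = 0.8, seed: int = 0):
    """Shuffle (user, item) pairs and split them into train and test lists."""
    rng = random.Random(seed)
    shuffled = ratings[:]          # copy so the caller's list is left untouched
    rng.shuffle(shuffled)
    cut = int(round(train_fraction * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]


# Toy data (hypothetical): 50 binary "like" events.
ratings = [(u, i) for u in range(10) for i in range(5)]

# Ten different random splits, one per seed; reported results would be
# the average of the metric over these splits.
splits = [split_ratings(ratings, seed=s) for s in range(10)]
```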
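V-B Evaluation Metrics
This study adopts the precision metric to evaluate the accuracy of the proposed approach, which is defined as follows:
$$\mathrm{Precision} = \frac{|I_r \cap I_p|}{|I_p|} \tag{5}$$
where $I_r$ is the set of items that a user rated and $I_p$ is the set of items that are recommended.
Also, Equation 6 defines the absolute error of similarity estimation:
$$AE(i, j) = |\hat{J}(i, j) - J(i, j)| \tag{6}$$
where $\hat{J}(i, j)$ is the similarity between items $i$ and $j$ calculated by the proposed Algorithm 1, and $J(i, j)$ is the exact Jaccard similarity from Equation 1.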
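Both metrics are simple to compute; the following sketch assumes precision is the fraction of recommended items that appear among the items the user rated, and that the absolute error compares an estimated similarity with its exact value. Function and argument names are hypothetical.

```python
def precision(recommended: list, rated: set) -> float:
    """Fraction of recommended items that the user actually rated."""
    if not recommended:
        return 0.0  # assumed convention for an empty recommendation list
    hits = sum(1 for item in recommended if item in rated)
    return hits / len(recommended)


def absolute_error(estimated: float, exact: float) -> float:
    """Absolute difference between an estimated and an exact similarity."""
    return abs(estimated - exact)


# Toy example (hypothetical values).
p = precision(["a", "b", "c", "d"], {"b", "d", "x"})
err = absolute_error(0.30, 0.25)
```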
V-C Recommendation Efficiency Comparison
Panel (a) of each data set’s result figure shows the recommendation efficiency of the proposed PP-IBTN method and the IBTN method on Last.fm, Jester, and MovieLens 20M, respectively, where is the number of users and is the number of hash functions. We can see that for all values, the computation time of the PP-IBTN method is less than that of the IBTN method. For instance, compared with the IBTN method, when , the proposed PP-IBTN method can reduce the computation time by approximately 20.99%, 49.50%, and 47.74% on Last.fm, Jester, and MovieLens 20M, respectively. And when , the computation time is reduced by approximately 8.85%, 6.25%, and 12.32% on Last.fm, Jester, and MovieLens 20M, respectively. In the proposed PP-IBTN method, the hash procedure can greatly reduce the density of the data set, thus the similarity computation step is much more efficient than that of the IBTN method.
V-D Recommendation Accuracy Comparison
Panel (b) of each data set’s result figure shows the recommendation precision loss of the proposed PP-IBTN method relative to the IBTN method on Last.fm, Jester, and MovieLens 20M, respectively. More specifically, when , precision loss is approximately 2.60%, 0.13%, and 1.07% on Last.fm, Jester, and MovieLens 20M, respectively. Recommendation precision loss decreases to 0 when increases to in the three data sets. In particular, the recommendation precision losses are less than 1% when is larger than , which indicates that the proposed method can achieve decent accuracy.
V-E Accuracy Analysis of Similarity Estimation
Panel (c) of each data set’s result figure shows the probability that the absolute error of similarity estimation is less than a given threshold on Last.fm, Jester, and MovieLens 20M, respectively. We can see from the results that experimental probabilities are higher than theoretical bounds for threshold values of 0.03, 0.04, and 0.05, and the probabilities are closer to 1 when the number of hash functions increases. This confirms that the similarity estimation accuracy is bounded, as stated by the theorem in Section IV-D2. Meanwhile, the results also indicate that recommender systems can choose different numbers of hash functions to balance between similarity estimation accuracy and efficiency.
Recommender systems have played an essential role in e-commerce in recent years. However, existing solutions for recommendation have limited capabilities when it comes to protecting user privacy while still achieving high scalability. In this work, we have proposed a scalable algorithm for privacy-preserving, item-based top-N recommendations. The proposed algorithm can guarantee the protection of user privacy while significantly enhancing recommendation efficiency with decent recommendation quality. Comprehensive theoretical and experimental analysis demonstrates the efficiency and effectiveness of the proposed approach.
This work was supported in part by the National Natural Science Foundation of China under Grant No. 61233016, and the National Science Foundation (NSF) of United States under grant No. 1334351 and 1442971.
[1] G. Adomavicius and A. Tuzhilin, “Toward the next generation of recommender systems: a survey of the state-of-the-art and possible extensions,” IEEE Transactions on Knowledge and Data Engineering, vol. 17, no. 6, pp. 734–749, June 2005.
[2] Y. He, C. Wang, and C. Jiang, “Modeling data correlations in recommendation,” IEEE Access, vol. 5, pp. 11030–11042, 2017.
[3] D. Li, Q. Lv, X. Xie, L. Shang, H. Xia, T. Lu, and N. Gu, “Interest-based real-time content recommendation in online social communities,” Knowledge-Based Systems, vol. 28, pp. 1–12, 2012.
[4] D. Li et al., “An algorithm for efficient privacy-preserving item-based collaborative filtering,” Future Generation Computer Systems, vol. 55, pp. 311–320, 2016.
[5] G. Linden, B. Smith, and J. York, “Amazon.com recommendations: Item-to-item collaborative filtering,” IEEE Internet Computing, vol. 7, no. 1, pp. 76–80, 2003.
[6] J. Canny, “Collaborative filtering with privacy,” in Proceedings of IEEE Symposium on Security and Privacy. IEEE, 2002, pp. 45–57.
[7] J. Davidson et al., “The youtube video recommendation system,” in Proceedings of the fourth ACM conference on Recommender systems. ACM, 2010, pp. 293–296.
[8] D. Li et al., “Efficient privacy-preserving content recommendation for online social communities,” Neurocomputing, vol. 219, pp. 440–454, 2017.
[9] D. Li, Q. Lv, L. Shang, and N. Gu, “Yana: an efficient privacy-preserving recommender system for online social communities,” in Proceedings of the 20th ACM international conference on Information and knowledge management. ACM, 2011, pp. 2269–2272.
[10] D. Li, Q. Lv, H. Xia, L. Shang, T. Lu, and N. Gu, “Pistis: a privacy-preserving content recommender system for online social communities,” in Web Intelligence and Intelligent Agent Technology (WI-IAT), 2011 IEEE/WIC/ACM International Conference on, vol. 1. IEEE, 2011, pp. 79–86.
[11] A. Ozturk and H. Polat, “From existing trends to future trends in privacy-preserving collaborative filtering,” Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, vol. 5, no. 6, pp. 276–291, 2015.
[12] E. Aïmeur et al., “Alambic: a privacy-preserving recommender system for electronic commerce,” International Journal of Information Security, vol. 7, no. 5, pp. 307–334, 2008.
[13] H. Kikuchi, H. Kizawa, and M. Tada, “Privacy-preserving collaborative filtering schemes,” in International Conference on Availability, Reliability and Security. IEEE, 2009, pp. 911–916.
[14] H. Polat and W. Du, “Privacy-preserving collaborative filtering using randomized perturbation techniques,” in The Third IEEE International Conference on Data Mining. IEEE, 2003, pp. 625–628.
[15] S. Zhang, J. Ford, and F. Makedon, “A privacy-preserving collaborative filtering scheme with two-way communication,” in Proceedings of the 7th ACM conference on Electronic commerce. ACM, 2006, pp. 316–323.
[16] F. McSherry and I. Mironov, “Differentially private recommender systems: building privacy into the net,” in Proceedings of the 15th ACM international conference on Knowledge discovery and data mining. ACM, 2009, pp. 627–636.
[17] Z. Huang, W. Du, and B. Chen, “Deriving private information from randomized data,” in Proceedings of the 2005 ACM international conference on Management of data. ACM, 2005, pp. 37–48.
[18] H. Kikuchi, H. Kizawa, and M. Tada, “Privacy-preserving collaborative filtering schemes,” in International Conference on Availability, Reliability and Security. IEEE, 2009, pp. 911–916.
[19] J. Zhan et al., “Privacy-preserving collaborative recommender systems,” IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), vol. 40, no. 4, pp. 472–476, 2010.
[20] R. Shokri et al., “Preserving privacy in collaborative filtering through distributed aggregation of offline profiles,” in Proceedings of the third ACM conference on Recommender systems. ACM, 2009, pp. 157–164.
[21] A. Boutet et al., “Privacy-preserving distributed collaborative filtering,” Computing, vol. 98, no. 8, pp. 827–846, 2016.
[22] S. Berkovsky et al., “Examining users’ attitude towards privacy preserving collaborative filtering,” in Data Mining for User Modeling On-line Proceedings of Workshop held at the, 2007, p. 28.
[23] M. Deshpande and G. Karypis, “Item-based top-n recommendation algorithms,” ACM Transactions on Information Systems, vol. 22, no. 1, pp. 143–177, 2004.
[24] B. Sarwar et al., “Item-based collaborative filtering recommendation algorithms,” in Proceedings of the 10th international conference on World Wide Web. ACM, 2001, pp. 285–295.
[25] Z. Tan and L. He, “An efficient similarity measure for user-based collaborative filtering recommender systems inspired by the physical resonance principle,” IEEE Access, 2017.
[26] A. S. Das et al., “Google news personalization: scalable online collaborative filtering,” in Proceedings of the 16th international conference on World Wide Web. ACM, 2007, pp. 271–280.
[27] A. Z. Broder et al., “Min-wise independent permutations,” in Proceedings of the thirtieth annual ACM symposium on Theory of computing. ACM, 1998, pp. 327–336.
[28] H. Chernoff, “A measure of asymptotic efficiency for tests of a hypothesis based on the sum of observations,” The Annals of Mathematical Statistics, pp. 493–507, 1952.
[29] O. Goldreich, “Secure multi-party computation,” Manuscript. Preliminary version, pp. 86–97, 1998.
[30] D. Li et al., “Item-based top-n recommendation resilient to aggregated information revelation,” Knowledge-Based Systems, vol. 67, pp. 290–304, 2014.