Privacy-preserving collaborative filtering on the cloud – practical implementation experiences

Anirban Basu
Graduate School of Engineering, Tokai University, 2-3-23 Takanawa, Minato-ku, Tokyo 108-8619, Japan

Jaideep Vaidya
MSIS Department, Rutgers, The State University of New Jersey, 1, Washington Park, Newark, New Jersey 07102-1897, USA

Hiroaki Kikuchi
Department of Frontier Media Science, School of Interdisciplinary Mathematical Sciences, Meiji University, 4-21-1 Nakano, Nakano-ku, Tokyo 164-8525, Japan

Theo Dimitrakos
Security Futures Practice, Research & Technology, BT, Adastral Park, Martlesham Heath, IP5 3RE, UK

Abstract: Recommender systems typically use collaborative filtering to make sense of huge and growing volumes of data. An emerging trend in industry has been to use public clouds to deal with the computing and storage requirements of such systems. This, however, comes at a price: data privacy. Simply ensuring communication privacy does not protect against insider threats or even attacks against the cloud infrastructure itself. To deal with this, several privacy-preserving collaborative filtering algorithms have been developed in prior research. However, these have, for the most part, only been analyzed theoretically. In this paper, we analyze an existing privacy-preserving collaborative filtering algorithm from an engineering perspective, and discuss our practical experiences with implementing and deploying privacy-preserving collaborative filtering on real world Software-as-a-Service enabling Platform-as-a-Service clouds.

I. INTRODUCTION

Recommender systems are immensely useful in many electronic commerce applications, and are used to target personalized items of interest to users. Typically, recommender systems use either content-based filtering or collaborative filtering approaches. While content-based filtering is useful given a sufficiently diverse behavioral history, collaborative filtering is used more often when there is a large base of users, since it can provide accurate recommendations even in the case of limited personal history. Given this, many large enterprises and services such as Netflix, iTunes, and Amazon have turned to collaborative filtering.

With ever increasing data volumes, an emerging trend is to outsource either the computation or the data or both to public cloud providers (see http://techblog.netflix.com/2013/01/janitor-monkey-keeping-cloud-tidy-and.html). In doing so, the privacy of users' data is put at risk. In most real world examples, the privacy guarantees are offline legal agreements between the cloud providers and the recommendation providers, which do not protect the users' data from insider threats or unforeseen attacks on the cloud infrastructure. In an attempt to ensure privacy from a computational perspective, a number of privacy preserving collaborative filtering (PPCF) schemes have been proposed. However, a majority of those schemes have been analysed either from theoretical perspectives or within limited experimental setups.

In this paper we analyse our earlier work [1], [2], a PPCF scheme, from an engineering perspective. We have significantly improved the implementation to process PPCF queries in parallel, and performed a comprehensive evaluation of this scheme from a cloud engineering perspective. Our work does not assume trusted third party providers. The real world implementation shows the feasibility of what could become a recommendation-as-a-service atop a cloud platform. We discuss our practical experiences from the implementation, deployment and testing (with two public datasets) of our PPCF proposal on two real world Software-as-a-Service (SaaS) enabling Platform-as-a-Service (PaaS) clouds: the Google App Engine for Java and the Amazon Elastic Beanstalk. Thus, this paper provides an applications and experiences report for a real-life deployment of PPCF on SaaS enabling PaaS clouds.

The rest of the paper is organised as follows. In § II, we briefly describe the PPCF scheme that we have implemented, followed by an overview of the system architecture from a cloud engineering perspective in § III. A performance evaluation with two public datasets is presented in § IV, which is followed by discussions in § V. We present a brief overview of related work in § VI, and finally conclude the paper in § VII.

II. PRELIMINARIES: CF AND PPCF

A. Collaborative filtering with Slope One

The Slope One predictors due to Lemire and Maclachlan [3] are item-based collaborative filtering schemes that predict the rating of an item for a user from pair-wise deviations of item ratings. Slope One CF can be evaluated in two stages: the pre-computation and the prediction of ratings. In the pre-computation stage, the average deviation of

ratings from item a to item b is given as:

\[ \delta_{a,b} = \frac{\Delta_{a,b}}{\phi_{a,b}} = \frac{\sum_i \delta_{i,a,b}}{\phi_{a,b}} = \frac{\sum_i (r_{i,a} - r_{i,b})}{\phi_{a,b}} \tag{II.1} \]

where $\phi_{a,b}$ is the number of users who have rated both items, while $\delta_{i,a,b} = r_{i,a} - r_{i,b}$ is the deviation of the rating of item a from that of item b, both given by user i. In the prediction stage, the rating for user u and item x using the weighted Slope One is given as:

\[ r_{u,x} = \frac{\sum_{a|a \neq x} (\delta_{x,a} + r_{u,a}) \phi_{x,a}}{\sum_{a|a \neq x} \phi_{x,a}} = \frac{\sum_{a|a \neq x} (\Delta_{x,a} + r_{u,a} \phi_{x,a})}{\sum_{a|a \neq x} \phi_{x,a}}. \tag{II.2} \]

B. Homomorphic encryption

Fig. II.1: User-interaction diagram of our scheme. The diagram shows Alice and Bob submitting (item) pair-wise plaintext deviations of ratings through an identity anonymiser (e.g., NAT) to the SaaS PPCF application on the PaaS cloud. Carol sends a query vector encrypted with her public key and receives, for item k, an encrypted numerator (deviations) and a plaintext denominator (cardinalities), e.g., 405 and 100. Carol decrypts the numerator and divides to obtain the prediction value of 4.05.

The Paillier public-key cryptosystem [4] exhibits additively homomorphic properties. Denoting the encryption and decryption functions as E() and D() respectively, the encryption of the sum of two plaintext messages $m_1$ and $m_2$ is the modular product of their individual ciphertexts:

\[ E(m_1 + m_2) = E(m_1) \cdot E(m_2) \tag{II.3} \]

and the encryption of the product of a plaintext message $m_1$ and a plaintext integer multiplicand $\pi$ is the modular exponentiation of the ciphertext of $m_1$ with $\pi$ as the exponent:

\[ E(m_1 \cdot \pi) = E(m_1)^{\pi}. \tag{II.4} \]
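The two properties in equations II.3 and II.4 can be checked numerically with a toy Paillier implementation. The following is a minimal sketch, not the implementation evaluated in this paper: it assumes the common simplification of using n + 1 as the generator, uses tiny hard-coded primes that offer no security, and all function names (`keygen`, `encrypt`, `decrypt`) are hypothetical.

```python
import math
import secrets

def keygen(p, q):
    """Build a toy Paillier key pair from two distinct primes p and q."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p - 1, q - 1)
    mu = pow(lam, -1, n)  # valid because the generator is assumed to be n + 1
    return n, (lam, mu)   # public key n, private key (lam, mu)

def encrypt(n, m):
    """E(m) = (1 + n)^m * r^n mod n^2 for a random r coprime to n."""
    n2 = n * n
    while True:
        r = secrets.randbelow(n - 1) + 1
        if math.gcd(r, n) == 1:
            break
    # (1 + n)^m mod n^2 equals 1 + m*n mod n^2, which avoids one exponentiation.
    return (1 + (m % n) * n) * pow(r, n, n2) % n2

def decrypt(n, private_key, c):
    """D(c) = L(c^lam mod n^2) * mu mod n, with L(u) = (u - 1) // n."""
    lam, mu = private_key
    return (pow(c, lam, n * n) - 1) // n * mu % n

n, private_key = keygen(1009, 1013)  # toy primes, far too small for real use
n2 = n * n
m1, m2, multiplicand = 17, 25, 3

c1, c2 = encrypt(n, m1), encrypt(n, m2)
sum_plain = decrypt(n, private_key, c1 * c2 % n2)                    # eq. II.3
product_plain = decrypt(n, private_key, pow(c1, multiplicand, n2))   # eq. II.4
print(sum_plain, product_plain)  # prints "42 51"
```

Multiplying ciphertexts adds the plaintexts (17 + 25), and raising a ciphertext to a plaintext power multiplies the plaintext (17 · 3), without the party doing the arithmetic ever holding the private key.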

These two homomorphic properties can be applied to the Slope One CF predictors for encrypted prediction.

C. Privacy preserving collaborative filtering

In the pre-computation (i.e., submission of ratings) stage, we can anonymise the ratings (or deviations) submitted to the cloud, as shown in figure II.1. In the diagram, Alice and Bob, over time, submit the ratings (or deviations) of items in pairwise fashion, which are accumulated by the cloud application. Ratings (or deviations) submitted in this way are used to pre-compute the deviation matrix, $\Delta$, and the cardinality matrix, $\phi$. Note that while an identity anonymiser is shown in the diagram, ratings (or deviations) can alternatively be submitted with additive random noise. We have proposed a scheme using random noise and tested that mechanism on the cloud too, but it is beyond the scope of this paper. The use of random noise impacts accuracy, while the use of an anonymiser could affect privacy. Neither of these has a significant effect on performance, apart from cloud computing costs. Therefore, we limit our discussion to the performance of encrypted prediction for the rest of the paper.

In the prediction phase, the user (Carol) queries the cloud with a complete rating vector encrypted with her public key. The cloud uses the public key to encrypt the necessary elements from the deviation matrix and to apply homomorphic multiplication according to the prediction equation defined in equation II.5, where $\Delta_{x,a}$ is the deviation of ratings between

item x and item a; $\phi_{x,a}$ is their relative cardinality and $E(r_{u,a})$ is an encrypted rating on item a sent by user u, although the identity of the user is irrelevant in this process. The final decryption is performed at the user's end with the user's private key, thereby eliminating the need for any trusted third party or threshold decryption.

\[ r_{u,x} = \frac{D\left( \prod_{a|a \neq x} \left( E(\Delta_{x,a}) \, E(r_{u,a})^{\phi_{x,a}} \right) \right)}{\sum_{a|a \neq x} \phi_{x,a}}. \tag{II.5} \]

This equation is optimised by reducing the number of encryptions as follows:

\[ r_{u,x} = \frac{D\left( E\left( \sum_{a|a \neq x} \Delta_{x,a} \right) \prod_{a|a \neq x} E(r_{u,a})^{\phi_{x,a}} \right)}{\sum_{a|a \neq x} \phi_{x,a}}. \tag{II.6} \]

III. SYSTEM OVERVIEW AND CLOUD ARCHITECTURE

Figure III.1 presents an overview of the system architecture. The portion within the blue dotted lines, i.e., within the Platform-as-a-Service, is a high-level view of the components on the cloud side, especially from the perspective of the user and that of an application developer. Users can connect to the service either to submit new pairwise ratings or deviations, or to query, over the encrypted domain, the recommendation for a particular item given their encrypted rating vectors.

Fig. III.1: System architecture of privacy preserving collaborative filtering on a PaaS cloud. The diagram shows users sending requests to a query load balancer inside the PaaS cloud, which forwards them to PPCF cloud app instances, each connected to the cloud datastore.

The PPCF application has been deployed on both the Google App Engine (GAE/J) and the Amazon Elastic Beanstalk (EBS). In both infrastructures, the entry point to the PPCF application is a user request sent over HTTP or HTTPS. Each user request is intercepted by the load balancer, which channels it to an application instance that responds to the request. On the GAE/J, an application instance is a web application deployment on a Java servlet environment, all of which run inside a Java Virtual Machine. On the EBS, an application instance constitutes a web application deployment within the Tomcat application server, inside a Java Virtual Machine on a guest operating system running on a Xen hypervisor kernel. The specifics of the OS-level virtualisation are not transparent in the GAE/J.

In terms of isolation, the EBS is stronger than the GAE/J because every application runs within its own isolated virtual machine, providing operating system level isolation. The GAE/J provides isolation at the Java Virtual Machine level. Thus, in terms of resources, the EBS is not affected by other applications running on the same EC2 instance, because what runs on the EC2 instance (beyond the basic OS and essential services) can be configured by the user. On the GAE/J, however, applications share the common OS environment with other applications, leaving the user with no control over how a particular application may get (or be deprived of) a resource (e.g., processor time).

In the GAE/J, the backend datastore is a massively distributed Google BigTable implementation. From a programming perspective, one can view it as a massive hash table with near constant-time lookup (barring network delays). The GAE/J datastore has very limited expressiveness in queries. It only supports the Google Query Language (GQL), which implements a small subset of SQL. Lookups on the GAE/J are made more efficient with indexes, but those indexes incur storage costs. The EBS lets the user choose the type of data storage. We have used the Amazon Relational Database Service (RDS), where a traditional RDBMS (MySQL, in our case) is utilised. Thus, the datastore is by design not as scalable as that of the GAE/J. However, redundancy can be added by introducing more than one database server EC2 instance, which comes at a price. The lookup performance of the RDS is good except when subject to excess load. The performance also degrades if the database holds huge quantities of data, which may need to be read from the disk instead of being served from memory.

IV. EVALUATION

We first discuss the datasets used and then the results obtained through the evaluation.

TABLE IV.1: Jester and MovieLens 100K datasets.

| | Jester | MovieLens 100K |
|---|---|---|
| Users | 73,421 | 943 |
| Items | 100 | 1,682 |
| Range | [−10.00, 10.00] | {1, 2, 3, 4, 5} |
| Ratings | 4,100,000 | 100,000 |
| Data points (a) | 4,950 | 983,206 |
| Density (a) | 100% | 69.5% |

(a) "Data points" and "Density" refer to Slope One data points and their density.
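The "Data points" and "Density" rows of table IV.1 follow from the Slope One pre-computation of § II: one deviation-cardinality entry exists per unordered item pair that at least one user has co-rated, out of at most n(n−1)/2 pairs. The sketch below illustrates this under stated assumptions: the three-user rating set and the function names (`precompute`, `max_data_points`, `density`) are made up for illustration and are not part of our implementation.

```python
from itertools import combinations

def precompute(ratings):
    """Build Slope One deviation sums (Delta) and cardinalities (phi).

    `ratings` maps a user to {item: rating}; unrated items are absent.
    One entry is kept per unordered pair (a, b) with a < b, because
    Delta[b, a] = -Delta[a, b] and phi[b, a] = phi[a, b].
    """
    delta, phi = {}, {}
    for user_ratings in ratings.values():
        for a, b in combinations(sorted(user_ratings), 2):
            delta[a, b] = delta.get((a, b), 0) + user_ratings[a] - user_ratings[b]
            phi[a, b] = phi.get((a, b), 0) + 1
    return delta, phi

def max_data_points(n_items):
    """Number of unordered item pairs, n(n - 1) / 2."""
    return n_items * (n_items - 1) // 2

def density(data_points, n_items):
    """Fraction of possible item pairs that actually hold a data point."""
    return data_points / max_data_points(n_items)

# Hypothetical ratings, for illustration only.
ratings = {
    "alice": {1: 5, 2: 3, 3: 2},
    "bob": {1: 3, 2: 4},
    "carol": {2: 2, 3: 5},
}
delta, phi = precompute(ratings)
print(delta, phi)

# Figures from table IV.1.
print(max_data_points(100))                  # Jester: 4950 possible pairs
print(round(100 * density(983_206, 1_682), 1))  # MovieLens 100K: 69.5
```

With 100 items, Jester can have at most 4,950 data points, all of which exist (100% density); MovieLens 100K fills 983,206 of its 1,413,721 possible pairs.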

A. Datasets used

For our experiments, we have used two public datasets: Jester and MovieLens 100K. Jester is a dataset of ratings on jokes. It contains 4.1 million ratings from 73,421 users on 100 jokes. The rating range is continuous over [−10.00, +10.00]. The value 99 corresponds to "not rated". MovieLens 100K is a sparse dataset of ratings on movies. It contains 100,000 ratings from 943 users on 1682 movies. The rating range is the discrete integer set {1, 2, 3, 4, 5}, with 0 signifying "not rated". A comparison showing the descriptive statistics and other properties of the two datasets is given in table IV.1.

a) Combined SlopeOne matrix storage: In a practical situation, the SlopeOne matrix is expected to be large, considerably larger than the 4,950 data points of the Jester dataset. Keeping this in mind, we ran a number of our experiments with the SlopeOne matrices from both the Jester and the MovieLens datasets persisted in the data storage, in order to check the impact on performance.

B. Results

We have carried out a practical deployment evaluation by varying the query load, the encryption key sizes and the configuration settings on the cloud platforms. For the sake of brevity, we include only the results from queries using 1024-bit Paillier encryption (though other key sizes were also tested and gave similar results) on both datasets and both cloud providers, with varying lengths of the query vector as well as varying numbers of concurrent queries. In § V, we show performance estimations with 2048-bit encryption.

The objective of the experiments was to see how encrypted prediction performed under varying loads. The encrypted query vector can be processed in parallel by multiple threads (i.e., operating on each item in the query vector in equation II.6). On top of that, each cloud application instance could receive concurrent queries at the same time. Therefore, the results presented have four performance categories:

1) P: parallel processing of the query vector,
2) C, P: parallel processing under concurrent load,
3) S: single-threaded processing of the query vector, and
4) C, S: single-threaded processing under concurrent load.

In all the figures in this section, the ordinate depicts the average time, in milliseconds, taken for each individual prediction.

The GAE/J front-end instance class used was F4 (2400MHz, 512MB RAM), with maximum idle instances set to automatic and minimum pending latency set to 10ms. The EBS was set to start with a minimum of one t1.micro instance and a maximum of 8. The load balancer trigger was set to add one instance when the virtual machine CPU usage breached 70%, and to remove one when the CPU usage fell below 40%, over a measurement duration and breach duration of one minute (fastest) each. We have used only one Amazon RDBMS standard instance for our experiments.
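The EBS scaling trigger described above can be modelled with a few lines of code to see how quickly capacity follows a burst. This is a deliberately simplified sketch, not the Elastic Beanstalk API: it assumes one CPU reading per one-minute measurement period, ignores the fact that adding instances would lower the per-instance CPU usage, and the names `next_instance_count` and `simulate` are hypothetical.

```python
def next_instance_count(current, cpu_percent, minimum=1, maximum=8,
                        scale_up_above=70.0, scale_down_below=40.0):
    """Apply the threshold rule once: add or remove at most one instance."""
    if cpu_percent > scale_up_above:
        return min(current + 1, maximum)
    if cpu_percent < scale_down_below:
        return max(current - 1, minimum)
    return current

def simulate(cpu_readings, start=1):
    """Return the instance count after each one-minute measurement period."""
    counts, current = [], start
    for cpu in cpu_readings:
        current = next_instance_count(current, cpu)
        counts.append(current)
    return counts

# Hypothetical burst: nine minutes of 95% CPU, then three quiet minutes.
burst = [95.0] * 9 + [30.0] * 3
trace = simulate(burst)
print(trace)  # prints [2, 3, 4, 5, 6, 7, 8, 8, 8, 7, 6, 5]
```

Under these assumptions, a sustained burst needs seven measurement periods to grow from one instance to the maximum of eight, because each breach adds only a single instance.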

Fig. IV.1: Encrypted prediction performance with the 2048-bit Paillier cryptosystem.

1) Encrypted prediction performance: In figure IV.1, the time required to perform a PPCF computation is plotted against the size of the encrypted query vector. It is evident that with a 2048-bit Paillier cryptosystem, the prediction performance of the GAE/J is worse than that of Amazon EBS. Both exhibit a steady increase in time as the size of the query vector increases. The situation is, however, different when both cloud platforms are subject to varying degrees of load, as shown below.

2) Comparison with concurrent load: Figure IV.2 shows the comparative performances of the GAE/J and Amazon EBS with varying lengths of the query vector for the Jester dataset. The grouped bar graphs represent the performance of a particular platform given a certain number of concurrent requests. For example, 'A (100)' represents the performance of Amazon EBS with a load of 100 concurrent queries. Figure IV.3 illustrates the comparative performances of the GAE/J and Amazon EBS with varying lengths of the query vector for the MovieLens 100K dataset.

The EBS sometimes exhibited failures with concurrent and parallel handling of requests using the MovieLens 100K dataset, because the concurrent load was too high for the EBS to cope with. This can be avoided by pre-allocating more resources, at a significant increase in cost. The EBS load balancer starts one or more virtual machine instances depending on the CPU load. If the load is large and bursty, this allocation mechanism is sometimes too slow. In contrast, the GAE/J starts up Java virtual machines in response to incoming load with a quick (10ms in our case) response time, which is a lot faster.

The application spawned multiple threads (up to a maximum of 10) per request in order to process the query vector in parallel. However, the time spent in thread scheduling often offset the benefit of processing the vector in parallel for small vector sizes. This is particularly true of the combination of concurrent connections and parallel processing.

Despite these general trends, there are also some observations where anomalies can be attributed to resource usage and allocation in the cloud infrastructure, which are non-transparent and beyond the scope of our application. At the PaaS level, it is not possible to entirely isolate our experiments from the effects of other applications and/or virtual machines running on the same cloud infrastructure.

V. LESSONS LEARNED

Given the data from various experiments with two datasets, we draw some conclusions about prediction performance in table V.1. Google App Engine, although it generally offers slower performance, is the platform of choice for the future of such a PPCF application.

In table IV.1, note that the SlopeOne deviation matrix density for the Jester dataset is 100%, which means that deviations exist for all possible item pairs. This implies that each item in the query vector will contribute to the time required for computing a prediction (e.g., datastore/cache lookup, encryption). This is not the case for the MovieLens dataset. Therefore, prediction timing results with Jester are better performance indicators than those with MovieLens 100K.

Fig. IV.2: Comparison of performances between the GAE/J and the EBS with the Jester dataset. Panels: (a) vector size 10; (b) vector size 20; (c) vector size 35; (d) vector size 70.

A. The impact of scale

1) The number of users and items in the dataset: The SlopeOne CF algorithm makes the number of users in the dataset irrelevant. The stored Slope One matrix is an item-item deviation-cardinality matrix. Therefore, for all practical purposes, the number of users that can be supported by our prototype is unlimited, and the prediction performance is not affected by the number of users. However, the more users there are, the more inputs there are to the data points in the SlopeOne matrix. For every pair of items, the deviation-cardinality tuple stored in the cloud needs to be updated every time there is a new update (from a user). This affects the bandwidth and incurs a datastore write cost. It may also require index write operations, depending on the configuration of the datastore.

With m users and n items, the upload would amount to $16 \frac{n(n-1)}{2}$ bytes per user if each user was to rate each item. Similarly, it would be $8 \frac{n(n-1)}{2}$ bytes per user if each user was sending only deviations. For all users, the maximum number of bytes necessary to send the entire dataset to the cloud would be either $16m \frac{n(n-1)}{2}$ using ratings or $8m \frac{n(n-1)}{2}$ using deviations. If the data was pre-processed and sent in bulk then it would only require $8 \frac{n(n-1)}{2}$ bytes, thus making it independent of the number of users. Thus, the bandwidth required for adding the data as ratings or deviations is in the order of $O(mn^2)$, while for a pre-processed bulk upload, it is in the order of $O(n^2)$.

The primary key of the Slope One tuple is a Java String, typically represented as a 256-bit (32-byte) value in the datastore. For each entry, we will have a 32-byte primary key (it can be shorter), an 8-byte double storing the deviation and an 8-byte long integer storing the cardinality, adding up to a total of 48 bytes per entry excluding indexes. Thus, the total size of the stored data can be a maximum of $48 \frac{n(n-1)}{2} = 24n(n-1)$ bytes, i.e., the storage cost is in $O(n^2)$.

Theoretical estimations of the maximum storage bandwidth and storage space (ignoring overheads and varying index sizes except the primary key) are shown in table V.2. In the table, we have also shown the actual values in bytes for a hypothetical dataset of 10,000 users and 10,000 items with each user rating each item. The bandwidth required for transferring the data to the datastore runs to several terabytes, unless a pre-processed bulk upload is used. However, such a terabyte-scale transfer happens over a long period of time (typically, several months) while 10,000 users go about rating 10,000 items.

2) The query vector: Depending on the backend datastore implementation, the prediction performance is affected only by the number of items in the prediction query vector. If the datastore is similar to the one on Google App Engine (i.e., fully distributed Google BigTable) then its average lookup performance is efficient and independent of the number of stored data points (ignoring network latencies and similar external factors). If a traditional single-server RDBMS is used (as in Amazon Elastic Beanstalk), the larger the stored dataset, the worse the performance, unless the data storage is distributed.

Assuming that the stored data does not affect the performance (as is the case with the public datasets, Jester and MovieLens 100K, due to their relatively small sizes), figure IV.1 shows the performance of encrypted predictions over the 10-, 20-, 35- and 70-item query vectors on the Jester dataset with a 2048-bit Paillier cryptosystem. Note that the query vector has been processed using a single thread and only one request was sent to the server at a time. When using parallel processing on the query vector, the linear estimation does not hold, because the prediction timing becomes dependent on the speed of thread execution, which in turn depends on a number of factors related to resource availability on the cloud server instances. Similarly, with concurrent connections, performance depends on the speed of application instance allocation and on individual server loads.
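To see why single-threaded prediction time grows with the number of items in the query vector, the encrypted prediction of equation II.6 can be sketched end to end. The code below is an illustrative toy, not our cloud implementation: it assumes integer ratings, a query vector that contains only the items the user has rated, a simplified Paillier variant with generator n + 1 and insecure toy primes, and made-up values for $\Delta_{x,a}$ and $\phi_{x,a}$. The names `cloud_predict`, `delta_x` and `phi_x` are hypothetical.

```python
import math
import secrets

def keygen(p, q):
    """Toy Paillier keys: public n, private (lam, mu), generator n + 1."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)
    return n, (lam, pow(lam, -1, n))

def encrypt(n, m):
    """Encrypt an integer m; negative values are represented modulo n."""
    n2 = n * n
    while True:
        r = secrets.randbelow(n - 1) + 1
        if math.gcd(r, n) == 1:
            break
    return (1 + (m % n) * n) * pow(r, n, n2) % n2

def decrypt(n, private_key, c):
    """Decrypt and map the result back to a signed integer."""
    lam, mu = private_key
    m = (pow(c, lam, n * n) - 1) // n * mu % n
    return m - n if m > n // 2 else m

def cloud_predict(n, encrypted_query, delta_x, phi_x):
    """Cloud side of eq. II.6: one modular exponentiation per query item."""
    n2 = n * n
    delta_sum, product, denominator = 0, 1, 0
    for a, ciphertext in encrypted_query.items():
        if a not in phi_x:  # no stored data point for the pair (x, a)
            continue
        delta_sum += delta_x[a]
        product = product * pow(ciphertext, phi_x[a], n2) % n2
        denominator += phi_x[a]
    # A single encryption of the summed deviations, as in eq. II.6.
    return encrypt(n, delta_sum) * product % n2, denominator

# Hypothetical stored values for target item x: Delta_{x,a} and phi_{x,a}.
delta_x = {1: -3, 2: 2}
phi_x = {1: 1, 2: 2}

# User side: encrypt the ratings, query the cloud, decrypt the numerator.
n, private_key = keygen(1009, 1013)
query = {1: 3, 2: 4}
encrypted_query = {a: encrypt(n, r) for a, r in query.items()}
encrypted_numerator, denominator = cloud_predict(n, encrypted_query, delta_x, phi_x)
numerator = decrypt(n, private_key, encrypted_numerator)

# Plaintext check against the right-hand form of eq. II.2.
plain_numerator = sum(delta_x[a] + r * phi_x[a] for a, r in query.items())
assert numerator == plain_numerator
print(numerator, denominator, numerator / denominator)
```

The loop in `cloud_predict` performs one datastore-style lookup and one modular exponentiation per item in the query vector, which is the per-item cost that a linear estimate of single-threaded prediction time relies on.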

Fig. IV.3: Comparison of performances between the GAE/J and the EBS with the MovieLens 100K dataset. Panels: (a) vector size 5; (b) vector size 10; (c) vector size 20; (d) vector size 50; (e) vector size 100.

Using linear estimation, we can determine (given in table V.3) the approximate maximum number of items in the query vector that can be processed within a deadline of 30 seconds. Traditionally, GAE/J had a maximum execution deadline of 30s for any user request, which has since been increased. However, for an HTTP request, it is unlikely that a user will wait beyond a few seconds for a turn-around. Thus, a 30-second deadline can be treated as an absolute maximum. Some cloud providers, such as the GAE/J, limit the size of the data that can be sent via POST, and some web servers apply a similar limit too. For a query vector of n items, 4096n bits are required for the encrypted ratings when using a

TABLE V.1: Conclusions about prediction performances.

| | GAE/J (a) | EBS (b) |
|---|---|---|
| Response to short bursty load | Very fast | Slow |
| Response to continuing steady load | Steady | Steady |
| Effect of parallel query vector processing | Good | Not necessarily good |
| Configurability | Limited | High |
| Ease of deployment | High | Moderately difficult |
| Running cost | Low, can be capped | High |

(a) Google App Engine.
(b) Amazon Elastic Beanstalk.

TABLE V.2: Estimation of absolute theoretical bandwidth and storage cost for our PPCF application for m users and n items.

| | Size in bytes (a) | Example: m = 10000, n = 10000 (b) |
|---|---|---|
| SlopeOne tuple storage | 24n(n − 1) | 2.235GB |
| Pairwise rating addition | 8mn(n − 1) | 7.2752TB |
| Pairwise deviation addition | 4mn(n − 1) | 3.6376TB |
| Bulk pre-processed upload | 8n(n − 1) | 762.86MB |

(a) This excludes additional variable index costs beyond the primary key. It also excludes additional overheads of data transfer that are not directly related to the number of users and items.
(b) In this example, we assume that we are using a fully dense dataset where all m users have rated all the n items.
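The example column of table V.2 can be reproduced directly from the formulas in its second column. The short sketch below does this; the function names are hypothetical, and it assumes that the GB, TB and MB figures in the table use binary prefixes (powers of 1024), which is the interpretation under which the numbers match.

```python
def slope_one_storage_bytes(n):
    """48 bytes per entry for n(n - 1)/2 item pairs, i.e. 24n(n - 1)."""
    return 24 * n * (n - 1)

def pairwise_rating_bytes(m, n):
    """Upload cost when all m users submit pairwise ratings: 8mn(n - 1)."""
    return 8 * m * n * (n - 1)

def pairwise_deviation_bytes(m, n):
    """Upload cost when all m users submit pairwise deviations: 4mn(n - 1)."""
    return 4 * m * n * (n - 1)

def bulk_upload_bytes(n):
    """Upload cost of a pre-processed bulk upload: 8n(n - 1)."""
    return 8 * n * (n - 1)

MB, GB, TB = 2 ** 20, 2 ** 30, 2 ** 40  # binary prefixes (assumption)
m = n = 10_000

storage_gb = slope_one_storage_bytes(n) / GB
rating_tb = pairwise_rating_bytes(m, n) / TB
deviation_tb = pairwise_deviation_bytes(m, n) / TB
bulk_mb = bulk_upload_bytes(n) / MB

print(f"SlopeOne tuple storage:      {storage_gb:.3f} GB")
print(f"Pairwise rating addition:    {rating_tb:.4f} TB")
print(f"Pairwise deviation addition: {deviation_tb:.4f} TB")
print(f"Bulk pre-processed upload:   {bulk_mb:.2f} MB")
```

The gap of roughly four orders of magnitude between the per-user uploads and the bulk upload comes entirely from the factor m, which the bulk upload avoids.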

TABLE V.3: Estimation of absolute theoretical maximum number of items in the query vector for a 30 seconds prediction turn-around deadline.

| | GAE/J | EBS | Theoretical sizes (a) |
|---|---|---|---|
| Encrypted query size | 1376 items | 3274 items | – |
| Encrypted POST approx. size using numeric IDs (b) | 698KB | 1.624MB | 520n bytes |
| Encrypted POST approx. size using string IDs | 731KB | 1.698MB | 544n bytes |

(a) Estimations are shown for Paillier 2048-bit encryption and n items in the query vector. This ignores the other POST overheads that are not directly related to the query vector.
(b) The size estimate shown here excludes some additional overhead of JSON packaging and other parameters unrelated to the main PPCF query.
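The size figures in table V.3 follow from simple per-item arithmetic: a ciphertext under a 2048-bit key occupies 4096 bits, to which the item ID (64 bits if numeric, 256 bits if a string) is added. The sketch below works this through for the two maximum query lengths in the table. The function names are hypothetical, KB and MB are assumed to be binary units, and the POST limits used in the check are the 32MB (GAE/J) and 2MB (default Tomcat on EBS) values stated in this section.

```python
def bytes_per_item(key_bits, id_bits):
    """Bytes per query item: a ciphertext of 2 * key_bits plus the item ID."""
    ciphertext_bits = 2 * key_bits  # Paillier ciphertexts are taken modulo n^2
    return (ciphertext_bits + id_bits) // 8

def post_size_bytes(items, key_bits=2048, id_bits=64):
    """Approximate POST payload for a query vector of `items` entries."""
    return items * bytes_per_item(key_bits, id_bits)

KB, MB = 2 ** 10, 2 ** 20  # binary units (assumption)
limits = {"GAE/J": 32 * MB, "EBS": 2 * MB}
max_items = {"GAE/J": 1376, "EBS": 3274}

sizes = {}
for platform, items in max_items.items():
    numeric = post_size_bytes(items, id_bits=64)    # 520 bytes per item
    string = post_size_bytes(items, id_bits=256)    # 544 bytes per item
    sizes[platform] = (numeric, string)
    fits = string <= limits[platform]
    print(platform, round(numeric / KB, 2), round(string / KB, 2), fits)
```

Under these assumptions, both platforms stay below their POST limits even with string IDs, although the EBS case (about 1.7MB against a 2MB limit) leaves much less headroom than the GAE/J case.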

2048-bit encryption. Added to that are the 64n bits for the numeric item IDs. This results in 520n bytes for the encrypted query. If using string IDs limited to 256 bits (32 bytes), the total encrypted query size is approximately 544n bytes (i.e., 4096n + 256n bits). The byte sizes corresponding to the theoretical maximum value of n are also given in table V.3. The Google App Engine POST limit is 32MB, and Apache Tomcat on Amazon Elastic Beanstalk limits it to 2MB (although this can be changed through the maxPostSize parameter). Nonetheless, the total POST size limits are still higher than the absolute theoretical maximum values of the query vector size shown in our estimations. This shows that privacy-preserving collaborative filtering is indeed feasible, when carefully implemented and deployed on the cloud.

VI. RELATED WORK

There has been significant work on privacy-preserving collaborative filtering (PPCF). Canny [5] uses a partial Singular Value Decomposition (SVD) model and homomorphic encryption to devise a multi-party PPCF scheme. Canny [6] also proposes a solution based on a probabilistic factor analysis model that can handle missing data, and provides privacy through a peer-to-peer protocol. Polat and Du [7]–[10] present several algorithms based on randomised perturbation and randomised response techniques to provide recommendations over distributed data. Berkovsky et al. [11] present a decentralised distributed storage of user data combined with data modification techniques to mitigate privacy issues. Tada et al. [12] present a privacy-preserving item-based collaborative filtering scheme. However, none of these techniques have been built for the cloud, nor are they efficient enough for large-scale general purpose use.

Various research efforts have also gone into evaluating the costs and benefits of cloud infrastructures, e.g., Khajeh-Hosseini et al. [13] present a case study of the migration of data from in-house to the Amazon IaaS infrastructure for the oil and gas industry. Cai et al. [14] present a case study of e-commerce functions as services over a cloud infrastructure. The work on CloudFTP [15] discusses the benefits and

challenges of porting a traditional application (i.e., FTP) to a cloud infrastructure. An article by Suciu et al. [16] presents a case study, based on SlapOS, of open-source cloud computing platforms. Shaikh and Haider [17] provide an overview of security threats in cloud computing, while Lu et al. [18] utilise trusted computing to counter threats in a malicious cloud environment and validate their proposal with three forms of outsourced collaborative computing scenarios: k-nearest neighbour, k-means and support vector machines. Tan et al. [19] propose a modification to a locking mechanism in public clouds to allow users to verify correctness as part of the strong consistency requirements associated with migrating data onto public clouds. However, none of these deal with privacy on the cloud.

In [20], we have proposed a privacy preserving (using encryption) CF scheme based on the well known weighted Slope One predictor [3]. Our prior scheme is applicable to pure horizontal and pure vertical dataset partitions. In [1], [2] we have proposed a PPCF scheme based on homomorphic encryption and performed a preliminary evaluation on real cloud platforms. Having significantly improved the implementation to process PPCF queries in parallel, and having performed a comprehensive evaluation of this scheme from a cloud engineering perspective, in this paper we provide an applications and experiences report for a real-life deployment of PPCF on SaaS enabling PaaS clouds.

VII. CONCLUSIONS

Privacy-preserving collaborative filtering has gained importance over the last few years. Its need is felt when the users' data and the PPCF computation are outsourced to the cloud. While many theoretical schemes have been proposed, the actual deployment on a cloud platform has not been studied. In this applications and experiences report, we have presented our experiences of deploying and evaluating an existing PPCF scheme on two well-known real world public SaaS enabling PaaS cloud architectures. This highlights the practical considerations of deploying similar solutions within the cloud.

ACKNOWLEDGMENTS

The work of Anirban Basu and Hiroaki Kikuchi while at Tokai University has been supported by the Japanese Ministry of Internal Affairs and Communications funded project "Research and Development of Security Technologies for Cloud Computing". Jaideep Vaidya's work is supported in part by the US National Science Foundation under Grant No. CNS-0746943. The work of Theo Dimitrakos is related to research in British Telecommunications under the IST Framework Programme 7 integrated project OPTIMIS that is partly funded by the EU under contract number 257115.

REFERENCES

[1] A. Basu, J. Vaidya, H. Kikuchi, T. Dimitrakos, and S. K. Nair, "Privacy preserving collaborative filtering for SaaS enabling PaaS clouds," Journal of Cloud Computing: Advances, Systems and Applications, vol. 1(8), 2012.

[2] A. Basu, J. Vaidya, H. Kikuchi, and T. Dimitrakos, “Privacy-preserving collaborative filtering for the cloud,” in Proceedings of the 3rd IEEE International Conference on Cloud Computing Technology and Science (Cloudcom), Athens, Greece, 2011.
[3] D. Lemire and A. Maclachlan, “Slope one predictors for online rating-based collaborative filtering,” in Proceedings of SIAM Data Mining (SDM’05), 2005.
[4] P. Paillier, “Public-key cryptosystems based on composite degree residuosity classes,” in Advances in Cryptology - EUROCRYPT ’99, vol. 1592. Springer, 1999, pp. 223–238.
[5] J. Canny, “Collaborative filtering with privacy,” Proceedings 2002 IEEE Symposium on Security and Privacy, pp. 45–57, 2002. [Online]. Available: http://ieeexplore.ieee.org/lpdocs/epic03/wrapper.htm?arnumber=1004361
[6] J. Canny, “Collaborative filtering with privacy via factor analysis,” in Proceedings of the 25th annual international ACM SIGIR conference on Research and development in information retrieval, ser. SIGIR ’02. New York, NY, USA: ACM, 2002, pp. 238–245. [Online]. Available: http://doi.acm.org/10.1145/564376.564419
[7] H. Polat and W. Du, “Privacy-preserving collaborative filtering using randomized perturbation techniques,” in Data Mining, 2003. ICDM 2003. Third IEEE International Conference on. IEEE, 2003, pp. 625–628.
[8] H. Polat and W. Du, “Privacy-preserving collaborative filtering on vertically partitioned data,” Knowledge Discovery in Databases: PKDD 2005, pp. 651–658, 2005.
[9] H. Polat and W. Du, “SVD-based collaborative filtering with privacy,” Proceedings of the 2005 ACM symposium on Applied computing - SAC ’05, p. 791, 2005. [Online]. Available: http://portal.acm.org/citation.cfm?doid=1066677.1066860
[10] H. Polat and W. Du, “Achieving private recommendations using randomized response techniques,” Advances in Knowledge Discovery and Data Mining, pp. 637–646, 2006.
[11] S. Berkovsky, Y. Eytani, T. Kuflik, and F. Ricci, “Enhancing privacy and preserving accuracy of a distributed collaborative filtering,” in Proceedings of the 2007 ACM conference on Recommender systems. ACM, 2007, pp. 9–16.
[12] M. Tada, H. Kikuchi, and S. Puntheeranurak, “Privacy-preserving collaborative filtering protocol based on similarity between items,” in 2010 24th IEEE International Conference on Advanced Information Networking and Applications. IEEE, 2010, pp. 573–578.
[13] A. Khajeh-Hosseini, D. Greenwood, and I. Sommerville, “Cloud migration: A case study of migrating an enterprise IT system to IaaS,” in Cloud Computing (CLOUD), 2010 IEEE 3rd International Conference on. IEEE, 2010, pp. 450–457.
[14] H. Cai, K. Zhang, M. Wang, J. Li, L. Sun, and X. Mao, “Customer centric cloud service model and a case study on commerce as a service,” in Cloud Computing, 2009. CLOUD’09. IEEE International Conference on. IEEE, 2009, pp. 57–64.
[15] L. Zhou, “CloudFTP: A case study of migrating traditional applications to the cloud,” in Intelligent System Design and Engineering Applications (ISDEA), 2013 Third International Conference on. IEEE, 2013, pp. 436–440.
[16] G. Suciu, E. G. Ularu, and R. Craciunescu, “Public versus private cloud adoption – a case study based on open source cloud platforms,” in Telecommunications Forum (TELFOR), 2012 20th. IEEE, 2012, pp. 494–497.
[17] F. B. Shaikh and S. Haider, “Security threats in cloud computing,” in Internet Technology and Secured Transactions (ICITST), 2011 International Conference for. IEEE, 2011, pp. 214–219.
[18] Q. Lu, Y. Xiong, X. Gong, and W. Huang, “Secure collaborative outsourced data mining with multi-owner in cloud computing,” in Trust, Security and Privacy in Computing and Communications (TrustCom), 2012 IEEE 11th International Conference on. IEEE, 2012, pp. 100–108.
[19] C. C. Tan, Q. Liu, and J. Wu, “Secure locking for untrusted clouds,” in Cloud Computing (CLOUD), 2011 IEEE International Conference on. IEEE, 2011, pp. 131–138.
[20] A. Basu, H. Kikuchi, and J. Vaidya, “Privacy-preserving weighted Slope One predictor for Item-based Collaborative Filtering,” in Proceedings of the international workshop on Trust and Privacy in Distributed Information Processing (workshop at the IFIPTM 2011), Copenhagen, Denmark, 2011.
