
Privacy Preserving Stream Analytics
The Marriage of Randomized Response and Approximate Computing

Do Le Quoc, Martin Beck, Pramod Bhatotia
Ruichuan Chen, Christof Fetzer, and Thorsten Strufe
TU Dresden · The University of Edinburgh · Nokia Bell Labs
Technical Report, Jan 2017
Abstract

How can we preserve users’ privacy while supporting high-utility analytics for low-latency stream processing?

To answer this question, we describe the design, implementation, and evaluation of PrivApprox, a data analytics system for privacy-preserving stream processing. PrivApprox provides three properties: (i) Privacy: zero-knowledge privacy guarantee for users, a privacy bound tighter than the state-of-the-art differential privacy; (ii) Utility: an interface for data analysts to systematically explore the trade-offs between the output accuracy (with error estimation) and the query execution budget; (iii) Latency: near real-time stream processing based on a scalable “synchronization-free” distributed architecture.

The key idea behind our approach is to marry two techniques together, namely, sampling (used in the context of approximate computing) and randomized response (used in the context of privacy-preserving analytics). The resulting marriage is complementary — it achieves stronger privacy guarantees and also improves the performance for low-latency stream analytics.


1 Introduction

Many online services continuously collect users’ private data for real-time analytics. Much of this data arrives as a data stream and in huge volumes, requiring real-time stream processing based on distributed systems [1, 2, 4, 3].

In the current ecosystem of data analytics, the analysts usually have direct access to the users’ private data, and must be trusted not to abuse it. However, this trust has been violated in the past [28, 79, 57, 87].

A pragmatic ecosystem has two desirable, but contradictory, design requirements: (i) stronger privacy guarantees for the users, and (ii) high-utility stream analytics in real time for the analysts.

To meet these two design requirements, novel computing paradigms have emerged that address the two concerns, albeit separately: privacy-preserving analytics to protect user privacy, and approximate computation for real-time analytics.

Privacy-preserving analytics. Recent privacy-preserving analytics systems favor a distributed architecture to avoid central trust (see 7 for details), where users’ private data is stored locally on their respective client devices. Data analysts use a publish-subscribe mechanism to run aggregate queries over the distributed private dataset of a large number of clients. Thereafter, such systems add noise to the aggregate output to provide useful privacy guarantees, such as differential privacy [33]. Unfortunately, these state-of-the-art systems normally deal with single-shot batch queries only, and therefore cannot be used for real-time stream analytics.

Approximate computation. Approximate computation is based on the observation that many data analytics jobs are amenable to an approximate, rather than an exact, output (see 7 for details). For such workflows, it is possible to trade accuracy for latency by computing over a partial subset (usually selected via a sampling mechanism) instead of the entire input dataset. Thereby, data analytics systems based on approximate computation can achieve low latency and efficient utilization of resources. However, the existing systems for approximate computation assume a centralized dataset, where the desired sampling mechanism can be employed. Thus, existing systems are not compatible with distributed privacy-preserving analytics systems.

The marriage. In this paper, we make the observation that the two computing paradigms, privacy-preserving analytics and approximate computation, are complementary. Both paradigms strive for an approximate instead of the exact output, but they differ in their means and goals for approximation. Privacy-preserving analytics adds explicit noise to the aggregate query result to protect users’ privacy, whereas approximate computation relies on a representative sampling of the entire dataset to compute over only a subset of data items and thereby enable low-latency, resource-efficient analytics. Therefore, we marry these two existing paradigms together in order to leverage the benefits of both. The high-level idea is to achieve privacy (via approximation) by directly computing over a subset of sampled data items (instead of computing over the entire dataset) and then adding explicit noise for privacy preservation.

To realize this marriage, we designed an approximation mechanism that also achieves privacy-preserving goals for stream analytics. Our design (see Figure 1) targets a distributed setting, as described above, where users’ private data is stored locally on their respective personal devices, and an analyst issues a streaming query for analytics over the distributed private dataset of users. The analyst’s streaming query is executed on the users’ data periodically (a configurable epoch) and the query results are transmitted to a centralized aggregator via a set of proxies. The analyst interfaces with the aggregator to get the aggregate query output periodically.

We employ two core techniques to achieve our goal. Firstly, we employ sampling [74] directly at the user’s site for approximate computation, where each user randomly decides whether to participate in answering the query in the current epoch. Since we employ sampling at the data source, instead of sampling at a centralized infrastructure, we are able to squeeze out the desired data size (by controlling the sampling parameter) from the very “first stage” in the analytics pipeline, which is essential in low-latency environments.

Secondly, if the user participates in the query answering process, we employ a randomized response [40] mechanism to add noise to the query output at the user’s site, again locally at the source of the data in a decentralized fashion. In particular, each user locally randomizes its truthful answer to the query to achieve local differential privacy guarantees (3.2.2). Since we employ noise addition at the source of the data, instead of adding explicit noise to the aggregate output at a trusted aggregator or proxies, we enable a truly “synchronization-free” distributed architecture, which requires no coordination among proxies and the aggregator for the mandated noise addition.

Last, but not least, the silver bullet of our design: it turns out that the combination of the two aforementioned techniques (i.e., sampling and randomized response) leads us to achieve zero-knowledge privacy [45], a privacy bound tighter than the state-of-the-art differential privacy [33]. (We prove this claim in Appendix C.)

To summarize, we present the design and implementation of a practical system for privacy-preserving stream analytics in real time. In particular, our system is a novel combination of the sampling and randomized response techniques, as well as a scalable “synchronization-free” routing scheme employing a light-weight XOR encryption scheme [26]. The resulting system ensures zero-knowledge privacy, anonymity, and unlinkability for users (2.2.4). Altogether, we make the following contributions:

  • We present a marriage of sampling and randomized response to achieve improved performance and stronger privacy guarantees.

  • We present an adaptive query execution interface for analysts to systematically make a trade-off between the output accuracy and the query execution budget.

  • We present a confidence metric on the output accuracy using a confidence interval to interpret the approximation due to sampling and randomization.

To empirically evaluate our approach, we implemented our design as a fully-functional prototype in a system called PrivApprox (the source code of PrivApprox, along with the experimental evaluation setup, is publicly available at https://PrivApprox.github.io), based on Apache Flink [1] and Apache Kafka [8]. In addition to stream analytics, we further extended our system to support privacy-preserving “historical” batch analytics over users’ private datasets. The evaluation based on micro-benchmarks and real-world case-studies shows that this marriage is, in fact, made in heaven!

2 Overview

Figure 1: System overview

In this section, we present an overview of our system called PrivApprox.

2.1 System Architecture

PrivApprox is designed for privacy-preserving stream analytics on distributed users’ private dataset. Figure 1 depicts the high-level architecture of PrivApprox. Our system consists of four main components: clients, proxies, aggregator, and analysts.

Clients locally store users’ private data on their respective personal devices, and subscribe to queries from the system. Analysts publish streaming queries to the system, and also specify a query execution budget. The query execution budget can either be in the form of latency guarantees/SLAs, output quality/accuracy, or the available computing resources for query processing. Our system ensures that the computation remains within the specified budget.

At a high-level, the system works as follows: a query published by an analyst is distributed to clients via the aggregator and proxies. Clients answer the analyst’s query locally over the users’ private data using a privacy-preserving mechanism. Client answers are transmitted to the aggregator via anonymizing proxies. The aggregator aggregates received answers from the clients to provide privacy-preserving stream analytics to the analyst.

2.2 System Model

Before we explain the design of PrivApprox, we present the system model assumed in this work.

2.2.1 Query Model

PrivApprox supports the SQL query language for analysts to formulate streaming queries. While queries can be complex, the results of a query are expressed as counts within histogram buckets, i.e., each bucket represents a small range of the query’s answer values. Specifically, each query answer is represented in the form of binary buckets, where each bucket stores ‘1’ or ‘0’ depending on whether or not the answer falls into the value range represented by that bucket. For example, an analyst can learn the driving speed distribution across all vehicles in San Francisco by formulating an SQL query “SELECT speed FROM vehicle WHERE location=‘San Francisco’”. The analyst can then define 12 answer buckets on speed: ‘0’, ‘1–10’, ‘11–20’, …, ‘81–90’, ‘91–100’, and ‘>100’. If a vehicle is moving at 15 mph in San Francisco, it answers ‘1’ for the third bucket and ‘0’ for all others.
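For illustration, the following minimal Java sketch (ours, not PrivApprox’s code) shows how a client could map a numeric answer onto the binary bucket vector of the speed example above; the bucket boundaries mirror the ones listed in the example.

/** Sketch: encode a numeric answer into binary histogram buckets. */
public class BucketEncoder {
    // Upper bounds of the 12 speed buckets: '0', '1-10', '11-20', ..., '91-100', '>100'.
    private static final int[] UPPER = {0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100};

    /** Returns a 12-bit vector with exactly one '1' in the matching bucket. */
    static int[] encode(int speed) {
        int[] bits = new int[UPPER.length + 1]; // the last bucket is '>100'
        for (int i = 0; i < UPPER.length; i++) {
            if (speed <= UPPER[i]) { bits[i] = 1; return bits; }
        }
        bits[UPPER.length] = 1; // the answer falls into '>100'
        return bits;
    }

    public static void main(String[] args) {
        // A vehicle moving at 15 mph answers '1' for the third bucket ('11-20').
        System.out.println(java.util.Arrays.toString(encode(15)));
    }
}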

Our query model supports not only numeric queries as described above, but also non-numeric queries. For non-numeric queries, each bucket is specified by a matching rule or a regular expression. Note that, while our query model may at first glance appear simple, it supports a range of queries such as histogram queries and frequency queries. In addition, it has been shown to be effective for a wide range of analytics algorithms [18, 19].

2.2.2 Computation Model

PrivApprox adopts a batched stream programming model [1, 3] in which the online data stream is split into small batches, and each small batch is processed by launching a distributed data-parallel job. The batched streaming model is widely adopted, compared to trigger-based systems [2, 4], for the following advantages: exactly-once semantics, efficient fault tolerance, and a common data-parallel programming model for both stream and batch analytics.

In particular, PrivApprox employs sliding window computations over batched stream processing [16, 17]. For sliding windows, the computation window slides over the input data stream: new incoming data items are added to the window, and old data items are dropped from it as they become less relevant. Note that these systems [1, 3] expose a time-based window length; based on the arrival rate, the number of data items within a window may vary accordingly.
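To illustrate the sliding-window mechanics, here is a minimal Java sketch (ours, not the actual implementation of these systems) of a time-based window that admits new items and evicts items older than the window length:

import java.util.ArrayDeque;
import java.util.Deque;

/** Sketch of a time-based sliding window over a data stream. */
public class SlidingWindow<T> {
    private record Entry<E>(long timeMs, E item) {}
    private final Deque<Entry<T>> window = new ArrayDeque<>();
    private final long windowLengthMs;

    SlidingWindow(long windowLengthMs) { this.windowLengthMs = windowLengthMs; }

    /** Add a new item, then drop items that slid out of the window. */
    void add(long nowMs, T item) {
        window.addLast(new Entry<>(nowMs, item));
        while (!window.isEmpty() && window.peekFirst().timeMs() <= nowMs - windowLengthMs) {
            window.pollFirst(); // old items become irrelevant and are dropped
        }
    }

    /** Varies with the arrival rate, as noted above. */
    int size() { return window.size(); }
}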

2.2.3 Threat Model

Analysts are potentially malicious. They may try to violate PrivApprox’s privacy model, i.e., to de-anonymize clients, build profiles through the linkage of requests and answers, or de-randomize (remove added noise from) the answers.

Clients are potentially malicious. They could generate false or invalid responses to distort the query result for the analyst. However, we do not defend against the Sybil attack [32], which is beyond the scope of this work [93].

Proxies are also potentially malicious. They may transmit messages between clients and the aggregator in contravention of the system protocols. PrivApprox requires at least two proxies, and we assume that at least two of them do not collude with each other.

The aggregator is assumed to be Honest-but-Curious (HbC): the aggregator faithfully conforms to the system protocol, but may try to exploit the information about clients. The aggregator does not collude with any proxy, nor the analyst.

Finally, we assume that all end-to-end communications use authenticated and confidential connections (e.g., protected by long-lived TLS connections), and that no system component can monitor all network traffic.

2.2.4 Privacy Properties

Our privacy properties include: (i) zero-knowledge privacy, (ii) anonymity, and (iii) unlinkability.

All aggregate query results in the system are independently produced under zero-knowledge privacy guarantees. Our chosen privacy metric, zero-knowledge privacy [45], builds upon differential privacy [33] and provides a tighter bound on privacy guarantees compared to differential privacy. Informally, zero-knowledge privacy states that essentially everything an adversary can learn from the output of a zero-knowledge private mechanism could also be learned using aggregate information alone. Anonymity means that no system component can associate query answers or query requests with a specific client. Finally, unlinkability means that no system component can join any pair of query requests or answers to the same client, even to the same anonymous client.

For the formal definitions, analysis, and proofs, refer to Appendix C.

2.2.5 Assumptions

We make the following assumptions.

  • We assume that the input stream is stratified based on the source of events, i.e., the data items within each stratum follow the same distribution and are mutually independent. Here a stratum refers to one sub-stream. If multiple sub-streams have the same distribution, they are combined to form a stratum.

  • We assume the existence of a virtual function that takes the query budget as the input and outputs the sample size for each window based on the budget.

  • We assume that the aggregator faithfully follows the system protocol. We could use trusted computing such as remote attestation [86] based on Trusted Platform Modules (TPMs) to relax the HbC assumption.

We discuss different possible means to meet the first two assumptions in Appendix B.

3 Design

PrivApprox consists of two main phases (see Figure 1): submitting queries and answering queries. In the first phase, an analyst submits a query (along with the execution budget) to clients via the aggregator and proxies. In the second phase, the query is answered by the clients in the reverse direction.

3.1 Submitting Queries

To perform statistical analysis over users’ private data streams, an analyst creates a query using the query model described in 2.2.1. In particular, each query consists of the following fields, and is signed by the analyst for non-repudiation:

Query := \langle Q_{ID}, SQL, A[n], f, w, \delta \rangle    (1)

  • Q_ID denotes a unique identifier of the query. This can be generated by concatenating the identifier of the analyst with a serial number unique to the analyst.

  • SQL denotes the actual SQL query, which is passed on to clients and executed on their respective personal data.

  • A[n] denotes the format of a client’s answer to the query. The answer is an n-bit vector where each bit associates with a possible answer value in the form of a “0” or “1” per index (or answer value range).

  • f denotes the answer frequency, i.e., how often the query needs to be executed at clients.

  • w denotes the window length for sliding window computations [16]. For example, an analyst may only want to aggregate query results for the last ten minutes, which means the window length is ten minutes.

  • δ denotes the sliding interval for sliding window computations. For example, an analyst may want to update the query results every one minute, and so the sliding interval is set to one minute.

After forming the query, the analyst sends the query, along with the query execution budget, to the aggregator. Upon receiving the pair of the query and the query budget from the analyst, the aggregator first converts the query budget into system parameters for sampling (s) and randomization (p and q). We explain these system parameters in the next section (3.2). Hereafter, the aggregator forwards the query and the converted system parameters to clients via proxies.
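As an illustration, the query tuple of Equation 1 could be represented as a simple Java value type; the field names below are ours:

/** Sketch of the analyst's query tuple (Equation 1); field names are illustrative. */
public record Query(
    String queryId,    // Q_ID: analyst identifier concatenated with a serial number
    String sql,        // the actual SQL query executed at clients
    int answerBits,    // n: length of the answer bit-vector A[n]
    long frequencySec, // f: how often clients execute the query
    long windowSec,    // w: window length for sliding-window computations
    long slideSec      // delta: sliding interval
) {}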

3.2 Answering Queries

Having described how a query and its system parameters reach the clients, we next explain how the query is answered by clients and processed by the system to produce the result for the analyst. The query answering process involves several steps: (i) sampling at clients for low-latency approximation; (ii) randomizing answers for privacy preservation; (iii) transmitting answers for anonymization and unlinkability; and finally, (iv) aggregating answers with error estimation to give a confidence level on the approximate output. We next explain the entire workflow using these four steps. (The algorithms are detailed in Appendix A.)

3.2.1 Step I: Sampling at Clients

We make use of approximate computation to achieve low-latency execution by computing over a subset of data items instead of the entire input dataset. Specifically, our work builds on sampling-based techniques [12, 10, 46, 63] in the context of “Big Data” analytics. Since we aim to keep the private data stored at individual clients, PrivApprox applies an input data sampling mechanism locally at the clients. In particular, we use Simple Random Sampling (SRS) [74].

Simple Random Sampling (SRS). SRS is considered a fair way of selecting a sample from a given population since each individual in the population has the same chance of being included in the sample. We make use of SRS at the clients to select the clients that will participate in the query answering process. In particular, the aggregator passes the sampling parameter (s) on to clients as the probability of participating in the query answering process. Thereafter, each client flips a coin with the probability based on the sampling parameter (s), and decides whether to participate in answering a query. Suppose that we have a population of N clients, and each client i has an answer a_i. We want to calculate the sum of these answers across the population, i.e., \tau = \sum_{i=1}^{N} a_i. To compute an approximate sum, we apply the SRS at clients to get a sample of n clients. The estimated sum \hat{\tau} is then calculated as follows:

\hat{\tau} = \frac{N}{n} \sum_{i=1}^{n} a_i \pm \epsilon    (2)

where the error bound \epsilon is defined as:

\epsilon = t_{f, 1-\alpha/2} \sqrt{\widehat{Var}(\hat{\tau})}    (3)

Here, t_{f, 1-\alpha/2} is the value of the t-distribution with f = n - 1 degrees of freedom at the 1 - \alpha/2 level of significance, and the estimated variance of the sum is:

\widehat{Var}(\hat{\tau}) = N^2 \cdot \left(1 - \frac{n}{N}\right) \cdot \frac{S^2}{n}    (4)

where S^2 is the sample variance of the answers in the sample.
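The following Java sketch implements the estimator of Equations 2–4. It is for illustration only; we obtain the t-score via the Apache Commons Math library, which our prototype also uses for error estimation (see 4).

import org.apache.commons.math3.distribution.TDistribution;

/** Sketch of the SRS sum estimator (Equations 2-4). */
public class SrsEstimator {
    /** Estimate the population sum and its error bound from an SRS sample.
     *  N is the population size; alpha the significance level (e.g., 0.05). */
    static double[] estimateSum(double[] sample, long N, double alpha) {
        int n = sample.length;
        double sum = 0;
        for (double a : sample) sum += a;
        double mean = sum / n;

        double s2 = 0; // sample variance S^2
        for (double a : sample) s2 += (a - mean) * (a - mean);
        s2 /= (n - 1);

        double tauHat = (double) N / n * sum;                             // Equation 2
        double varHat = (double) N * N * (1.0 - (double) n / N) * s2 / n; // Equation 4
        double t = new TDistribution(n - 1)                               // f = n - 1
                .inverseCumulativeProbability(1 - alpha / 2);
        double eps = t * Math.sqrt(varHat);                               // Equation 3
        return new double[]{tauHat, eps};
    }
}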

Note that we currently assume that all clients produce the input stream with data items following the same distribution, i.e., all clients’ data streams belong to the same stratum. We further extend it for stratified sampling in 3.3.

3.2.2 Step II: Answering Queries at Clients

Clients that participate in the query answering process make use of the randomized response technique [40] to preserve answer privacy, with no synchronization among clients.

Randomized response. Randomized response protects a user’s privacy by allowing individuals to answer sensitive queries without providing truthful answers all the time, while still allowing analysts to collect statistical results. Randomized response works as follows: suppose an analyst sends a query to individuals to obtain the statistical result about a sensitive property. To answer the query, a client locally randomizes its answer to the query [40]. Specifically, the client flips a coin; if it comes up heads, the client responds with its truthful answer; otherwise, the client flips a second coin and responds “Yes” if it comes up heads or “No” if it comes up tails. Privacy is preserved via the client’s ability to plausibly deny having responded truthfully.

Suppose that the probabilities of the first coin and the second coin coming up heads are p and q, respectively. The analyst receives n randomized answers from individuals, among which n_y answers are “Yes”. Then, the number of original truthful “Yes” answers before the randomization process can be estimated as:

\hat{E}_Y = \frac{n_y - (1-p) \cdot q \cdot n}{p}    (5)

Suppose E_Y and \hat{E}_Y are the actual and the estimated numbers of the original truthful “Yes” answers, respectively. The accuracy loss \eta is then defined as:

\eta = \frac{|\hat{E}_Y - E_Y|}{E_Y}    (6)

It has been proven in [37] that the randomized response mechanism achieves \epsilon-differential privacy [33], where:

\epsilon = \ln\left(\frac{\Pr[\textrm{Response} = \textrm{“Yes”} \mid \textrm{Truth} = \textrm{“Yes”}]}{\Pr[\textrm{Response} = \textrm{“Yes”} \mid \textrm{Truth} = \textrm{“No”}]}\right)    (7)

More specifically, the randomized response mechanism achieves \epsilon-differential privacy, where:

\epsilon = \ln\left(\frac{p + (1-p) \cdot q}{(1-p) \cdot q}\right)    (8)

The reason is: if a truthful answer is “Yes”, then with probability p + (1-p) \cdot q the randomized answer will still remain “Yes”; otherwise, if a truthful answer is “No”, then with probability (1-p) \cdot q the randomized answer will become “Yes”.
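A minimal Java sketch of the mechanism and of the estimators above (illustrative; method names are ours):

import java.security.SecureRandom;

/** Sketch of randomized response (Equations 5, 6, and 8). */
public class RandomizedResponse {
    private static final SecureRandom RNG = new SecureRandom();

    /** Randomize one truthful bit with coin probabilities p and q. */
    static boolean randomize(boolean truthful, double p, double q) {
        if (RNG.nextDouble() < p) return truthful; // first coin: answer truthfully
        return RNG.nextDouble() < q;               // second coin: "Yes" with probability q
    }

    /** Estimate the truthful "Yes" count from n randomized answers,
     *  ny of which are "Yes" (Equation 5). */
    static double estimateYes(long ny, long n, double p, double q) {
        return (ny - (1 - p) * q * n) / p;
    }

    /** Differential privacy level of the mechanism (Equation 8). */
    static double epsilon(double p, double q) {
        return Math.log((p + (1 - p) * q) / ((1 - p) * q));
    }
}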

It is worth mentioning that, combining randomized response with the sampling technique used in Step I, we achieve not only differential privacy but also zero-knowledge privacy [45] which is a privacy bound tighter than differential privacy. We prove our claim in Appendix C.

Figure 2: XOR-based encryption with two proxies.

3.2.3 Step III: Transmitting Answers via Proxies

After producing randomized responses, clients transmit them to the aggregator via the proxies. To achieve anonymity and unlinkability of the clients against the aggregator and analysts, we utilize the XOR-based encryption together with source rewriting, which has been used for anonymous communications [27, 26, 84, 31]. Under the assumptions that:

  • at least two proxies are not colluding

  • the proxies do not collude with the aggregator, nor with the analyst

  • the aggregator and analyst have only a local view of the network

neither the aggregator nor the analyst will learn any (pseudo-)identifier to deanonymize or link different answers to the same client. This property is achieved by source rewriting, which is a typical building block for anonymization schemes [84, 31]. At the same time, the content of the answers is hidden from the proxies using the XOR-based encryption.

XOR-based encryption. At a high-level, the XOR-based encryption employs extremely efficient bit-wise XOR operations as its cryptographic primitive compared to expensive public-key cryptography. This allows us to support resource-constrained clients, e.g., smartphones and sensors. The underlying idea of this encryption is simple: if Alice wants to send a message M of length l to Bob, then Alice and Bob share a secret K (in the form of a random bit-string of length l). To transmit the message privately, Alice sends an encrypted message M_E = M \oplus K to Bob, where ‘\oplus’ denotes the bit-wise XOR operation. To decrypt the message, Bob again uses the bit-wise XOR operation: M = M_E \oplus K.

Specifically, we apply the XOR-based encryption to transmit clients’ randomized answers as follows. At first, each randomized answer A is concatenated with its associated query identifier Q_ID to build a message M:

M = \langle Q_{ID}, A \rangle    (9)

Thereafter, the client generates (n-1) random l-bit key strings K_i with 2 \le i \le n using a cryptographic pseudo-random number generator (PRNG) seeded with a cryptographically strong random number. The XOR of all key strings together forms the secret K:

K = K_2 \oplus K_3 \oplus \dots \oplus K_n    (10)

Next, the client performs an XOR operation with M and K to produce an encrypted message M_E:

M_E = M \oplus K    (11)

As a result, the message M is split into n messages \langle M_E, K_2, \dots, K_n \rangle. Afterwards, a unique message identifier M_ID is generated, and sent along with the split messages to the n proxies via anonymous channels enabled by source rewriting [84, 31]:

\langle M_{ID}, M_E \rangle;\ \langle M_{ID}, K_2 \rangle;\ \dots;\ \langle M_{ID}, K_n \rangle    (12)

Upon receiving the messages (either M_E or K_i) from clients, the proxies transmit these messages to the aggregator.

The message identifier M_ID ensures that M_E and all associated K_i can be joined later to decrypt the original message M at the aggregator. Note that M_E and all K_i are computationally indistinguishable, which hides from the proxies whether the received data contains the encrypted answer or just a pseudo-random bit string.
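The following Java sketch illustrates the split-and-recombine scheme of Equations 9–12. It is simplified: it omits Q_ID and M_ID handling and uses SecureRandom directly in place of the seeded PRNG.

import java.security.SecureRandom;

/** Sketch of the XOR-based split encryption (Equations 9-12). */
public class XorSplit {
    private static final SecureRandom RNG = new SecureRandom();

    /** Split message M into n shares: M_E = M ^ K_2 ^ ... ^ K_n, plus the keys. */
    static byte[][] split(byte[] message, int n) {
        byte[][] shares = new byte[n][];
        byte[] enc = message.clone();
        for (int i = 1; i < n; i++) {
            byte[] key = new byte[message.length]; // random l-bit key string K_i
            RNG.nextBytes(key);
            shares[i] = key;
            for (int b = 0; b < enc.length; b++) enc[b] ^= key[b];
        }
        shares[0] = enc; // M_E: indistinguishable from the random key strings
        return shares;
    }

    /** The aggregator simply XORs all shares joined by M_ID to recover M. */
    static byte[] combine(byte[][] shares) {
        byte[] out = new byte[shares[0].length];
        for (byte[] share : shares)
            for (int b = 0; b < out.length; b++) out[b] ^= share[b];
        return out;
    }
}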

3.2.4 Step IV: Generating Result at the Aggregator

At the aggregator, all data streams (M_E and K_i) are received and can be joined together to obtain a unified data stream. Specifically, the associated M_E and K_i are paired by using the message identifier M_ID. To decrypt the original randomized message M from the client, the XOR operation is performed over M_E and K: M = M_E \oplus K, with K being the XOR of all K_i: K = K_2 \oplus K_3 \oplus \dots \oplus K_n. As the aggregator cannot identify which of the received messages is M_E, it simply XORs all the received messages to decrypt M.

The joined answer stream is processed to produce the query results as a sliding window. For each window, the aggregator first adapts the computation window to the current start time by removing all data items that are older than the window’s start time. Next, the aggregator adds the newly incoming data items into the window. Then, the answers in the window are decoded and aggregated to produce the query results for the analyst. Each query result is an estimated result which is bound to a range of error due to the approximation. The aggregator estimates this error bound using Equation 3 and produces a confidence interval for the result: \hat{\tau} \pm \epsilon. The entire process is repeated for every window.

Note that an adversarial client might answer a query many times in an attempt to distort the query result. However, we can handle this problem, for example, by applying the triple splitting technique [26].

Error bound estimation. We provide an error bound estimation for the aggregate query results. The accuracy loss in PrivApprox is caused by two processes: (i) sampling and (ii) randomized response. Since the accuracy losses of these two processes are statistically independent (see 5), we estimate the accuracy loss of each process separately. Furthermore, Equation 2 indicates that the error induced by sampling can be described as an additive component of the estimated sum, while the error induced by randomized response is contained in the a_i values in Equation 2. Therefore, independent of the error induced by randomized response, the error coming from sampling is simply added on top. Following this, we sum up both independently estimated errors to provide the total error bound of the query results.

To estimate the accuracy loss of the randomized response process, we make use of an experimental method. We run several micro-benchmarks at the beginning of the query answering process without performing the sampling process, to estimate the accuracy loss caused by randomized response. We measure the accuracy loss using Equation 6.

On the other hand, to estimate the accuracy loss of the sampling process, we apply the statistical theory of the sampling techniques. In particular, we first identify a desired confidence level, e.g., 95%. Then, we compute the margin of error using Equation 3. Note that, to use this equation, the sampling distribution must be nearly normal. According to the Central Limit Theorem (CLT), when the sample size is large enough (e.g., n ≥ 30), the sampling distribution of a statistic becomes close to the normal distribution, regardless of the underlying distribution of values in the dataset [90].

3.3 Practical Considerations

Next, we present three design enhancements to improve the practicality of PrivApprox.

3.3.1 Stratified Sampling

As described in 3.2.1, we employ Simple Random Sampling (SRS) at clients for approximate computation. The assumption behind using SRS is that all clients produce data streams following the same distribution, i.e., all clients’ data streams belong to the same stratum. However, in a distributed environment, it may happen that different clients produce data streams with disparate distributions.

Accommodating such cases requires that all strata are considered fairly, so that we have a representative sample from each stratum. To achieve this, we use the stratified sampling technique [63, 12]. Stratified sampling ensures that data from every stratum is proportionally selected (based on the arrival rate) and that no minority stratum is excluded.

To perform stratified sampling, instead of just one sampling parameter s, we use a set of sampling parameters \langle s_1, \dots, s_m \rangle, where m is the number of disparate-distribution sub-streams in the input stream. All clients within a given stratum i flip a sampling coin with the probability s_i to decide on their participation in the answering process. The value s_i is determined based on the proportional arrival rate of the sub-stream (or stratum) i. The rest of the answering process remains unchanged (as in 3.2.2).

Accordingly, we adapt the error estimation for stratified sampling to provide a confidence interval for the query result. Suppose the clients come from m sources (disjoint strata) C_1, C_2, \dots, C_m, i.e., C = \bigcup_{i=1}^{m} C_i, and stratum C_i has N_i clients, each of which has an associated answer in binary format.

To compute an approximate sum of the “Yes” answers, we first select a sample from all clients based on the stratified sampling, i.e., we sample n_i items from each stratum C_i. Then we estimate the sum from this sample as \hat{\tau} = \sum_{i=1}^{m} N_i \bar{y}_i \pm \epsilon, where \bar{y}_i denotes the sample mean of stratum C_i and the error bound \epsilon is defined as:

\epsilon = t_{f, 1-\alpha/2} \sqrt{\widehat{Var}(\hat{\tau})}    (13)

Here, t_{f, 1-\alpha/2} is the value of the t-distribution (i.e., t-score) with f degrees of freedom and 1 - \alpha/2 confidence level. The degrees of freedom f is calculated (via the Satterthwaite approximation) as:

f = \frac{\left(\sum_{i=1}^{m} g_i S_i^2\right)^2}{\sum_{i=1}^{m} (g_i S_i^2)^2 / (n_i - 1)}, \quad \textrm{with } g_i = \frac{N_i (N_i - n_i)}{n_i}    (14)

The estimated variance for the sum, \widehat{Var}(\hat{\tau}), can be expressed as:

\widehat{Var}(\hat{\tau}) = \sum_{i=1}^{m} N_i (N_i - n_i) \frac{S_i^2}{n_i}    (15)

Here, S_i^2 is the sample variance of the answers in stratum C_i, which estimates the population variance in the stratum. Similar to the SRS described in 3.2.1, we use standard statistical theory [90] for stratified sampling to calculate the error bound.
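A Java sketch of the stratified estimator (Equations 13–15), again for illustration only, with the degrees of freedom computed via the Satterthwaite approximation of Equation 14:

import org.apache.commons.math3.distribution.TDistribution;

/** Sketch of the stratified-sampling sum estimator (Equations 13-15). */
public class StratifiedEstimator {
    /** N[i]: size of stratum i; samples[i]: SRS sample drawn from stratum i. */
    static double[] estimateSum(long[] N, double[][] samples, double alpha) {
        double tauHat = 0, varHat = 0, dfDen = 0;
        for (int i = 0; i < N.length; i++) {
            int ni = samples[i].length;
            double mean = 0, s2 = 0;
            for (double a : samples[i]) mean += a;
            mean /= ni;
            for (double a : samples[i]) s2 += (a - mean) * (a - mean);
            s2 /= (ni - 1);                          // per-stratum variance S_i^2

            tauHat += N[i] * mean;                   // sum of per-stratum estimates
            double g = (double) N[i] * (N[i] - ni) / ni;
            varHat += g * s2;                        // Equation 15
            dfDen += (g * s2) * (g * s2) / (ni - 1); // denominator of Equation 14
        }
        double df = (varHat * varHat) / dfDen;       // Equation 14
        double t = new TDistribution(df).inverseCumulativeProbability(1 - alpha / 2);
        return new double[]{tauHat, t * Math.sqrt(varHat)}; // Equation 13
    }
}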

3.3.2 Historical Analytics

In addition to providing real-time data analytics, we further extended PrivApprox to support historical analytics. The historical analytics workflow is essential for the data warehousing setting, where analysts wish to analyze user behaviors over a longer time period. To facilitate historical analytics, we support “batch analytics” over the users’ data at the aggregator. The analyst can analyze users’ responses stored in fault-tolerant distributed storage (HDFS) at the aggregator to get the aggregate query result over the desired time period.

We further extend the adaptive execution interface for historical analytics, where the analyst can specify a query execution budget, for example, to suit dynamic pricing in spot markets in a cloud deployment. Based on the query budget, we perform an additional round of sampling at the aggregator to ensure that the batch analytics computation remains within the query budget. We omit the sampling details at the aggregator due to space constraints.

3.3.3 Query Inversion

In the current setting, some queries may result in very few truthful “Yes” answers in users’ responses. For such cases, PrivApprox can only achieve lower utility, as the fraction of truthful “Yes” answers gets far from the second randomization parameter q (see experimental results in 5). For instance, if q is set to a high value (e.g., 0.9), having few “Yes” answers in the user responses will affect the overall utility of the query result.

To address this issue, we propose a query inversion mechanism. If the fraction of truthful “Yes” answers is too small or too large compared to the q value, then the analysts can invert the query to calculate the truthful “No” answers instead of the truthful “Yes” answers. In this way, the fraction of truthful “No” answers gets closer to q, resulting in a higher utility of the query result.
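Concretely, inversion estimates the truthful “No” count with the mirror image of Equation 5 and derives the “Yes” count from it; a sketch under our naming:

/** Sketch of query inversion (3.3.3). */
public class QueryInversion {
    /** nNo of n randomized answers are "No"; returns the estimated truthful "Yes" count. */
    static double estimateYesViaInversion(long nNo, long n, double p, double q) {
        // Mirror of Equation 5: a randomized "No" arises from a truthful "No"
        // with probability p + (1-p)(1-q), and from a truthful "Yes" with (1-p)(1-q).
        double noHat = (nNo - (1 - p) * (1 - q) * n) / p;
        return n - noHat; // truthful "Yes" = total - truthful "No"
    }
}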

4 Implementation

We implemented PrivApprox as an end-to-end stream analytics system. Figure 3 presents the architecture of our prototype. Our system implementation consists of three main components: (i) clients, (ii) proxies, and (iii) the aggregator.

First, the query and the execution budget specified by the analyst are processed by the initializer module to decide on the sampling parameter (s) and the randomization parameters (p and q). These parameters, along with the query, are then sent to the clients.

Clients. We implemented Java-based clients for mobile devices as well as for personal computers. A client makes use of the sampling parameter (based on the sampling module) to decide whether to participate in the query answering process (3.2.1). If the client decides to participate then the query answer module is used to execute the input query on the local user’s private data stored in SQLite [6]. The client makes use of the randomized response to execute the query (3.2.2). Finally, the randomized answer is encrypted using the XOR-based encryption module; thereafter, the encrypted message and the key messages are sent to the aggregator via proxies (3.2.3).

Proxies. We implemented proxies based on Apache Kafka (which internally uses Apache Zookeeper [5] for fault tolerance). In Kafka, a topic is used to define a stream of data items. A stream producer can publish data items to a topic, and these data items are stored in Kafka servers called brokers. Thereafter, a consumer can subscribe to the topic and consume the data items by pulling them from the brokers. In particular, we make use of Kafka APIs to create two main topics: key and answer for transmitting the key message stream and the encrypted answer stream in the XOR-based encryption protocol, respectively (3.2.3).
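As an illustration of the two-topic scheme, the following sketch publishes an encrypted answer and its key share under the same message identifier. The broker address and payloads are our assumptions, XorSplit refers to the sketch in 3.2.3, and, unlike the real deployment, both shares here go to a single Kafka cluster rather than to two non-colluding proxies.

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

/** Sketch: a client-side publisher for the "answer" and "key" topics. */
public class SharePublisher {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        props.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                "org.apache.kafka.common.serialization.ByteArraySerializer");

        try (KafkaProducer<String, byte[]> producer = new KafkaProducer<>(props)) {
            String mId = "message-42"; // message identifier M_ID (illustrative)
            byte[][] shares = XorSplit.split("randomized-answer".getBytes(), 2);
            // The encrypted answer M_E goes to "answer"; the key share to "key".
            producer.send(new ProducerRecord<>("answer", mId, shares[0]));
            producer.send(new ProducerRecord<>("key", mId, shares[1]));
        }
    }
}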

Figure 3: Architecture of PrivApprox prototype
Figure 4: (a) Accuracy loss with varying sampling and randomization parameters. (b) Error estimation during the randomized response process and sampling process, combined and individually. (c) Accuracy loss with varying # of clients.

Aggregator. We implemented the aggregator using Apache Flink for real-time stream analytics and also for historical batch analytics. At the aggregator, we first make use of the join method (using the aggregation module) to combine the two data streams: (i) encrypted answer stream and (ii) key stream. Thereafter, the combined message stream is decoded (using the XOR-based decryption module) to reproduce the randomized query answers. These answers are then forwarded to the analytics module. The analytics module processes the answers to provide the query result to the analyst. Moreover, the error estimation module is used to estimate the error (3.2.4), which we implemented using the Apache Common Math library. If the error exceeds the error bound target, a feedback mechanism is activated to re-tune the sampling and randomization parameters to provide higher utility in the subsequent epochs.

For the historical analytics, we asynchronously store the data (randomized responses) in HDFS [20] at the aggregator (as a separate pipeline, which is not shown in Figure 3 for simplicity). To support historical analytics on the stored data, we also implemented a sampling method sample() in Flink to support our sampling mechanism (3.3.2).

5 Evaluation: Microbenchmarks

In this section, we evaluate PrivApprox using a series of microbenchmarks. For all microbenchmark measurements, we report the average over multiple runs.

#I: Effect of sampling and randomization parameters.

We first measure the effect of the randomization parameters on the utility and the privacy guarantee of the query results. In particular, the utility is measured by the query results’ accuracy loss η (Equation 6), and privacy is measured by the achieved zero-knowledge privacy level ε (Equation 19). For the experiment, we generated original answers randomly, 60% of which are “Yes” answers, and kept the sampling parameter s fixed.

Table 1 shows that different settings of the two randomization parameters, p and q, do affect the utility and the privacy guarantee of the query results. A higher p means a higher probability that a client responds with its truthful answer. As expected, this leads to higher utility (i.e., smaller accuracy loss η) but a weaker privacy guarantee (i.e., a higher privacy level ε). In addition, Table 1 also shows that the closer we set the probability q to the fraction of truthful “Yes” answers (i.e., 60% in this microbenchmark), the higher the utility of the query result. Nevertheless, to meet the utility and privacy requirements of various scenarios, we should carefully choose appropriate p and q. In practice, the selection of the q value depends on the real-world application [64].

We also measured the effect of the sampling parameter s on the accuracy loss. Figure 4 (a) shows that the accuracy loss decreases as the sampling fraction increases, regardless of the settings of the randomization parameters p and q. The benefits reach diminishing returns after a sampling fraction of 80%. The system operator can set the sampling fraction using a resource prediction model [98, 97, 99] for any given SLA.

#II: Error estimation. To analyze the accuracy loss, we first measured the accuracy loss caused by sampling and randomized response separately. For comparison, we also computed the total accuracy loss after running the two processes in succession, as in PrivApprox. In this experiment, we fixed the number of original answers, 60% of which are “Yes” answers. We measure the accuracy loss of the randomized response process in isolation by setting the sampling parameter s to 1 (i.e., no sampling) with fixed randomization parameters p and q. Meanwhile, we measure the accuracy loss of the sampling process without the randomized response process by setting p to 1 (i.e., clients always respond truthfully).

p      q      Accuracy loss (η)    Privacy level (ε)
0.3    0.3    0.0278               1.7047
0.3    0.6    0.0262               1.3862
0.3    0.9    0.0268               1.2527
0.6    0.3    0.0141               2.5649
0.6    0.6    0.0128               2.0476
0.6    0.9    0.0136               1.7917
0.9    0.3    0.0098               4.1820
0.9    0.6    0.0079               3.5263
0.9    0.9    0.0102               3.1570

Table 1: Utility and privacy of query results with different randomization parameters p and q.

Figure 4 (b) shows that the accuracy losses in the two experiments are statistically independent of each other. In addition, the accuracy losses of the two processes can effectively be added together to calculate the total accuracy loss.

Figure 5: (a) Accuracy loss for the native and inverse query results with different fractions of truthful “Yes” answers. (b) Throughput of proxies with different bit-vector sizes for the query answer. (c) Average number of sampled data items after stratified sampling with different sampling fractions.
                   Encryption                      Decryption
                   Phone    Laptop    Server       Phone       Laptop       Server
RSA [13]           937      2,770     4,909        126         698          859
Goldwasser [27]    2,106    17,064    22,902       127         6,329        7,068
Paillier [83]      116      489       579          72          250          309
PrivApprox         15,026   943,902   1,351,937    3,262,186   16,519,076   22,678,285

Table 2: Comparison of crypto overheads (# operations/sec). The public-key crypto schemes use a 1024-bit key.

#III: Effect of the number of clients. We next analyzed how the number of participating clients affects the utility of the results. In this experiment, we fix the sampling and randomization parameters (s, p, and q), and set the fraction of truthful “Yes” answers to 60%.

Figure 4 (c) shows that the utility of query results improves as the number of participating clients increases, and that a small number of participating clients may lead to low-utility query results.

Note that increasing the number of participating clients leads to higher network overheads. However, we can tune the number of clients using the sampling parameter and thus decrease the network overhead (see 6.2.2).

#IV: Effect of the fraction of truthful answers. We also measured the utility of both the native and the inverse query results with different fractions of truthful “Yes” answers. For the experiment, we again keep the sampling and randomization parameters (s, p, and q) fixed, and fix the total number of answers.

Figure 5 (a) shows that PrivApprox achieves higher utility as the fraction of truthful “Yes” answers gets closer to the q value. In addition, when the fraction of truthful “Yes” answers is too small compared to the q value, the accuracy loss is quite high. However, by using the query inversion mechanism (3.3.3), we can significantly reduce the accuracy loss.

#V: Effect of answer’s bit-vector sizes. We measured the throughput at proxies with various bit-vector sizes of client answers (i.e., n in 3.1). We conducted this experiment on the cluster described in 6.1. Figure 5 (b) shows that the throughput, as expected, is inversely proportional to the answer’s bit-vector size.

#VI: Effect of stratified sampling. To illustrate the use of stratified sampling, we generated a synthetic data stream with three different stream sources S_1, S_2, S_3. Each stream source is created with an independent Poisson distribution, and the three stream sources have different arrival rates of data items per time unit. The computation window size is fixed to a constant number of data items.

Figure 5 (c) shows the average number of selected items of each stream source with varying sample fractions using the stratified sampling mechanism. As expected, the average number of sampled data items from each stream source is proportional to its arrival rate and the sample fractions.

#VII: Computational overhead of crypto operations.

We compared the computational overhead of crypto operations used in PrivApprox and prior systems. In particular, these crypto operations are XOR in PrivApprox, RSA in [13], Goldwasser-Micali in [27], and Paillier in [83]. For the experiment, we measured the number of crypto operations that can be executed on: (i) Android Galaxy mini III smartphone running Android 4.1.2 with a 1.5 GHz CPU; (ii) MacBook Air laptop with a 2.2 GHz Intel Core i7 CPU running OS X Yosemite 10.10.2; and (iii) Linux server running Linux 3.15.0 equipped with a 2.2 GHz CPU with 32 cores.

Table 2 shows that the XOR operation is extremely efficient compared with the other crypto mechanisms. This highlights the importance of XOR encryption in our design.

Operation               Phone      Laptop     Server
SQLite read             1,162      19,646     23,418
Randomized response     168,938    418,668    1,809,662
XOR encryption          15,026     943,902    1,351,937
Total                   1,116      17,236     22,026

Table 3: Throughput (# operations/sec) at clients.

#VIII: Throughput at clients. We measured the throughput at clients. In particular, we measured the number of operations per second that can be executed at clients for the query answering process. In this experiment, we used the same set of devices as in the previous experiment.

Table 3 presents the throughput at clients. To further investigate the overheads, we measured the individual throughput of three sub-processes in the query answering process: (i) database read, (ii) randomized response, and (iii) XOR encryption. The result indicates that the performance bottleneck in the answering process is actually the database read operation.

#IX: Comparison with related work. First, we compared PrivApprox with SplitX [26], a high-performance privacy-preserving analytics system. SplitX is geared towards batch analytics, but can be adapted to enable privacy-preserving data analytics over data streams. Since PrivApprox and SplitX share the same architecture, we compare the latency incurred at proxies in both systems.

Figure 6 shows that, with different numbers of clients, the latency incurred at proxies in PrivApprox is consistently nearly one order of magnitude lower than that in SplitX. The reason is simple: unlike PrivApprox, SplitX requires synchronization among its proxies to process query answers in a privacy-preserving fashion. This synchronization creates a significant delay in processing query answers, making SplitX unsuitable for dealing with large-scale stream analytics. More specifically, in SplitX, the processing at proxies consists of several sub-processes, including adding noise to answers, answer transmission, answer intersection, and answer shuffling; whereas in PrivApprox, the processing at proxies consists only of answer transmission. This difference translates into a substantial speedup for PrivApprox over SplitX.

Figure 6: Latency comparison b/w SplitX and PrivApprox.

Next, we compared PrivApprox with a recent privacy-preserving analytics system called RAPPOR [91]. Similar to PrivApprox, RAPPOR applies a randomized response mechanism to achieve differential privacy. However, RAPPOR is not designed for stream analytics; therefore, we compared PrivApprox with RAPPOR for privacy only. To make an “apples-to-apples” comparison between PrivApprox and RAPPOR in terms of privacy, we map the system parameters of the two systems onto each other. We set the sampling parameter s = 1, and the randomization parameters p = 1 − f and q = 1/2 in PrivApprox, where f is the parameter used in the randomized response process of RAPPOR [91]. In addition, we set the number of hash functions used in RAPPOR to 1 for a fair comparison. In doing so, the two systems have the same randomized response process. However, since PrivApprox makes use of the sampling mechanism before performing the randomized response process, PrivApprox achieves stronger privacy. Figure 7 shows the differential privacy level of RAPPOR and PrivApprox with different sampling fractions s.

It is worth mentioning that, by applying the sampling mechanism, PrivApprox achieves stronger privacy (i.e., zero-knowledge privacy) for clients. The comparison between differential privacy and zero-knowledge privacy is presented in the Appendix C.

Figure 7: Differential privacy level comparison b/w RAPPOR and PrivApprox.

Recently, several privacy-preserving stream analytics systems have been proposed [94, 51, 96]. These systems make use of the Laplace mechanism [35, 33] to achieve differential privacy. In particular, they add Laplace noise to the truthful answers at the aggregator to protect the users’ privacy. However, their approach relies on strong trust assumptions about the aggregator as well as the connection between clients and the aggregator. On the contrary, PrivApprox applies the randomized response mechanism to process users’ private data locally at clients, under the control of users. Combined with the sampling mechanism, PrivApprox achieves stronger privacy guarantees (a tighter bound for ε-differential privacy, as well as zero-knowledge privacy).

6 Evaluation: Case-studies

We next present our experience using PrivApprox in the following two case studies: (i) New York City (NYC) taxi ride, and (ii) household electricity consumption.

6.1 Experimental Setup

Cluster setup. We used a cluster of nodes connected via a Gigabit Ethernet. Each node contains 2 Intel Xeon quad-core CPUs and 8 GB of RAM, and runs Debian 5.0 with Linux kernel 2.6.26. We deployed two proxies with Apache Kafka, each consisting of several Kafka broker nodes and Zookeeper nodes. We used a subset of the nodes to deploy Apache Flink as the aggregator, and employed the remaining nodes to replay the datasets to generate data streams for evaluating our PrivApprox system.

Datasets. For the first case study, we used the NYC Taxi Ride dataset from the DEBS 2015 Grand Challenge [60]. The dataset consists of the itinerary information of taxi rides in New York City in 2013. For the second case study, we used the Household Electricity Consumption dataset [7]. This dataset contains electricity usage (kWh) measured every 30 minutes for one year by smart meters.

Queries. For the NYC taxi ride case-study, we created a query: “What is the distance distribution of taxi trips in New York?”. We defined the query answer with 11 buckets as follows: [0, 1) mile, [1, 2) miles, [2, 3) miles, [3, 4) miles, [4, 5) miles, [5, 6) miles, [6, 7) miles, [7, 8) miles, [8, 9) miles, [9, 10) miles, and [10, ∞) miles.

For the second case-study, we defined a query to analyze the electricity usage distribution of households over the past 30 minutes. The query answer format is as follows: [0, 0.5] kWh, (0.5, 1] kWh, (1, 1.5] kWh, (1.5, 2] kWh, (2, 2.5] kWh, and (2.5, 3] kWh.

Evaluation metrics. We evaluated our system using four key metrics: throughput, latency, utility, and privacy level. Throughput is defined as the number of data items processed per second, and latency is defined as the total amount of time required to process a certain dataset. Utility is measured by the accuracy loss, defined as η = |\hat{\tau} - \tau| / \tau, where \hat{\tau} and \tau are the query results produced by applying PrivApprox and the native computation, respectively. Finally, the privacy level (ε) is calculated using Equation 19. For all measurements, we report the average over multiple runs.

Figure 8: Throughput at proxies and the aggregator with different numbers of CPU cores and nodes.

6.2 Results from Case-studies

Figure 9: Results from the NYC taxi case-study with varying sampling and randomization parameters: (a) Utility, (b) Privacy level, (c) Comparison between utility and privacy.

6.2.1 Scalability

We measured the scalability of the two main system components: proxies and the aggregator. We first measured the throughput of proxies with various numbers of CPU cores (scale-up) and different numbers of nodes (scale-out). Figure 8 (a) shows that, as expected, the throughput at proxies scales quite well with the number of CPU cores and nodes: in the NYC Taxi case-study, the per-proxy throughput grows steadily as we add CPU cores within a node, and grows further as we scale out across nodes. In the household electricity case-study, the proxies achieve relatively higher throughput because the message size is smaller than in the NYC Taxi case-study.

We next measured the throughput at the aggregator. Figure 8 (b) depicts that the aggregator also scales quite well when the number of aggregator nodes increases. The throughput of the aggregator, however, is much lower than the throughput of proxies due to the relatively expensive join operation and the analytical computation at the aggregator. We notice that the throughput of the aggregator in the household electricity case study does not significantly improve in comparison to the first case study. This is because the difference in message size between the two case studies does not much affect the performance of the join operation and the analytical computation.

6.2.2 Network Bandwidth and Latency

Next, we conducted an experiment to measure the network bandwidth usage. By leveraging the sampling mechanism at clients, our system reduces network traffic significantly. Figure 10 (a) shows the total network traffic transferred from clients to proxies with different sampling fractions: in both case studies, PrivApprox substantially reduces the network traffic as the sampling fraction decreases.

Besides the benefit of saving network bandwidth, PrivApprox also achieves lower latency in processing query answers by leveraging approximate computation. To evaluate this advantage, we measured the effect of sampling fractions on the latency of processing query answers. Figure 10 (b) depicts the latency with different sampling fractions at clients: in both case-studies, lower sampling fractions lead to proportionally lower latency than the execution without sampling.

Figure 10: Total network traffic and latency at proxies with different sampling fractions at clients.

6.2.3 Utility and Privacy

Figure 9 (a)(b)(c) shows the utility, the privacy level, and the trade-off between them, respectively, with different sampling and randomization parameters. The randomization parameters p and q vary in the range of (0, 1), and the sampling parameter s is calculated using Equation 19. Here, we show results only for the NYC Taxi dataset. As the sampling parameter s and the first randomization parameter p increase, the utility of query results improves (i.e., the accuracy loss gets smaller), whereas the privacy guarantee gets weaker (i.e., the privacy level gets higher). Since the New York taxi dataset is diverse, the accuracy loss and the privacy level change in a non-linear fashion with different sampling fractions and randomization parameters. Interestingly, the accuracy loss does not always decrease as the second randomization parameter q increases: the accuracy loss gets smaller as q approaches the fraction of truthful “Yes” answers in the dataset.

6.2.4 Historical Analytics

To analyze the performance of PrivApprox for historical analytics, we executed the queries on the datasets stored at the aggregator. Figure 11 (a) (b) present the latency and throughput, respectively, of processing historical datasets with different sampling fractions. By lowering the sampling fraction, we achieve a significant speedup over native execution in historical analytics.

We also measured the accuracy loss when the approximate computation was applied (for the NYC Taxi case-study). Figure 11 (c) shows the accuracy loss in processing historical data with different sampling fractions; even with low sampling fractions, the accuracy loss remains small.

Figure 11: Historical analytics results with varying sampling fractions: (a) Latency, (b) Throughput, and (c) Utility.

7 Related Work

Privacy-preserving analytics. Since the introduction of differential privacy [33, 35], a plethora of systems have been proposed to provide differential privacy over centralized trusted databases, supporting linear queries [66], graph queries [61], histogram queries [56], MapReduce jobs (Airavat [85]), SQL-type PINQ queries [71, 72, 80], and even general programs, such as GUPT [73] and Fuzz [54]. In practice, however, such central trust can be abused, leaked, or subpoenaed [28, 79, 57, 87].

To overcome the limitations of the centralized database schemes, recently a flurry of systems have been proposed with a focus on achieving users’ privacy (mostly, differential privacy) in a distributed setting where the private data is kept locally. Examples include Privad [49], PDDP [27], DJoin [75], SplitX [26], Box [65], KISS [92], Koi [50], xBook [89], Popcorn [52], and many other systems [34, 55, 13]. However, these systems are designed to deal with the “one-shot” batch queries only, whereby the data is assumed to be static during the query execution.

To overcome the limitations of the aforementioned systems, several differentially private stream analytics systems have been proposed recently [36, 22, 21, 88, 83, 41, 53]. Unfortunately, these systems still contain several technical shortcomings that limit their practicality. One of the first systems [36] updates the query result only if the user’s private data changes significantly, and does not support stream analytics over an unlimited time period. Subsequent systems [22, 53] remove the limit on the time period, but introduce extra system overheads. Some systems [88, 83] leverage expensive secret sharing cryptographic operations to produce noisy aggregate query results. These protocols, however, cannot work at large scale under churn; moreover, in these systems, even a single malicious user can substantially distort the aggregate results without detection. Recently, some other privacy-preserving distributed stream monitoring systems have been proposed [41, 21]. However, they all require some form of synchronization, and are tailored for heavy-hitter monitoring only. Streaming data publishing systems like [94] use a stream-privacy metric at the cost of relying on a trusted party to add noise. In contrast, PrivApprox does not require a trusted proxy or aggregator to add noise. Furthermore, PrivApprox provides stronger privacy properties (zero-knowledge privacy).

Sampling and randomized response. Sampling and randomized response, also known as input perturbation techniques, have been studied in the context of privacy-preserving analytics, albeit separately. For instance, the relationship between sampling and privacy has been investigated to provide k-anonymity [24], differential privacy [73], and crowd-blending privacy [44]. In contrast, we show that sampling combined with randomized response achieves zero-knowledge privacy, a privacy bound strictly stronger than the state-of-the-art differential privacy. Furthermore, PrivApprox achieves these guarantees for stream processing over a distributed private dataset.

Randomized response [40, 95] is a surveying technique that has been used in statistics since the 1960s to collect sensitive information via input perturbation. Recently, Google's RAPPOR system [91] made use of randomized response for privacy-preserving analytics in the Chrome browser. RAPPOR provides differential privacy for clients while enabling analysts to collect various types of statistics. Like RAPPOR, PrivApprox utilizes randomized response. However, RAPPOR is designed for heavy-hitter collection and does not deal with the situation where clients' answers to the same query change over time; therefore, RAPPOR does not fit well with stream analytics. Furthermore, since we combine randomized response with sampling, PrivApprox provides a privacy bound tighter than RAPPOR's.

Secure multi-party computation. In theory, secure multi-party computation (SMC) [47, 101] could be used for privacy-preserving analytics. It is, however, expensive for real-world deployment, especially for stream analytics, even though there have been several proposals reducing SMC’s computational overhead [48, 59, 67, 68, 77, 100]. Furthermore, SMC guarantees input-privacy during computation, but is orthogonal to output-privacy as provided by differential privacy.

Approximate computing. Approximation techniques such as sampling [14, 43, 25], sketches [30], and online aggregation [58] have been well-studied over the decades in the databases community. Recently, sampling-based systems (such as ApproxHadoop [46], BlinkDB [12, 11], IncApprox [63], Quickr [10], StreamApprox [82]) and online aggregation-based systems (such as MapReduce Online [29, 76], G-OLA [102]) have also been shown effective for “Big Data” analytics.

We build on the advancements of sampling-based techniques. However, we differ in two crucial aspects. First, we perform sampling in a distributed way as opposed to sampling in a centralized dataset. Second, we extend sampling with randomized response for privacy-preserving analytics.

8 Conclusion

In this technical report, we presented PrivApprox, a privacy-preserving stream analytics system. Our approach builds on the observation that both computing paradigms — privacy-preserving data analytics and approximate computation — strive for approximation, and can be combined together to leverage the benefits of both. Our evaluation shows that PrivApprox not only improves the performance to support real-time stream analytics, but also achieves provably stronger privacy guarantees than the state-of-the-art differential privacy. This technical report is the complete version of our conference publication [81]. PrivApprox’s source code is publicly available: https://PrivApprox.github.io.

Appendices

Appendix A Algorithms

Input: Query and query budget
(s, p, q) ← costFunction(budget); // s is the sampling parameter,
                                  // p and q are the randomizing parameters
answer ← ∅; // answer bit-vector
executeAtClient() // execute the method every t seconds
begin
  if coinFlip(s) = Heads then // flip the sampling coin
    answer ← localDataProcess(Query); // process the local data
    if coinFlip(p) = Tails then // first randomizing coin
      if coinFlip(q) = Heads then // second randomizing coin
        answer[i] ← 1 for all buckets i; // report "Yes"
      else
        answer[i] ← 0 for all buckets i; // report "No"
      end
    end
    sendAnswer(answer); // send the answer to the aggregator
  end
end
Algorithm 1 Answering a query at clients

In this section, we describe the algorithmic details of PrivApprox’s system protocol. We present two algorithms: (i) the workflow at a client carrying out sampling and randomization; and (ii) the workflow at the aggregator.

#I: Workflow at a client. Algorithm 1 summarizes how a client processes a query. Each client maintains its personal data in a local database. Upon receiving a query, the client first flips a sampling coin to decide whether to answer the query at all. If the coin comes up heads, the client executes the query on its local database to create a truthful answer. The truthful answer takes the form of bit buckets, with a "1" or "0" per bucket depending on whether the "Yes" answer falls within that bucket; depending on the query, more than one bucket may contain a "1". Next, the client randomizes the answer using the randomized response mechanism: it flips the first randomization coin and, if it comes up heads, responds with its truthful answer. If it comes up tails, the client flips the second randomization coin and reports the result of this coin flip instead. After randomization, the answer is still a bit-vector of the same format, as illustrated in the sketch below.
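For concreteness, the following Python sketch shows one way to implement this client-side workflow; the helper local_query and the parameters s, p, and q are illustrative assumptions, not PrivApprox's actual API.

import random

def client_answer(local_query, s, p, q):
    # Sampling coin: with probability 1 - s the client does not answer.
    if random.random() >= s:
        return None
    truthful = local_query()          # truthful answer as a bit-vector
    # Two-coin randomized response on the answer bit-vector.
    if random.random() < p:           # first coin heads: report the truth
        return list(truthful)
    if random.random() < q:           # second coin heads: report "Yes"
        return [1] * len(truthful)
    return [0] * len(truthful)        # second coin tails: report "No"

# Example: the truthful answer falls into bucket 1 of 4 buckets.
print(client_answer(lambda: [0, 1, 0, 0], s=0.6, p=0.5, q=0.5))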

#II: Workflow at the aggregator. The aggregator receives clients' data streams from the proxies and joins them to obtain a combined data stream. Thereafter, the aggregator processes the joined stream to produce the output for the analyst. Algorithm 2 describes the overall process at the aggregator. The algorithm computes the query results as a sliding-window computation over the incoming answer stream. For each window, the aggregator first adapts the computation window to the current start time by removing all data items that are older than the window's start time. Next, the aggregator adds the newly arrived data items to the window and decrypts the answers in the data stream. Thereafter, the data items in the window are aggregated to produce the query output for the analyst. We also estimate the error in the output due to approximation and randomization: the aggregator estimates this error bound and reports a confidence interval of the form output ± error bound. The entire process is repeated for the next window, with the updated windowing parameters and query budget (for the adaptive execution).

Input: window size w; slide interval δ; query budget
t ← start time of the window
executeAtAggregator() // execute the method every δ seconds
begin
  items ← ∅; // list of items in the window
  foreach window in the incoming stream do
    forall item in items do
      if item.timestamp < t then
        items.remove(item); // remove all old items
      end
    end
    items.insert(new items); // add new items
    result ← ∅; // query result
    forall item in items do
      answer ← decryptAnswer(item);
      // get query results associated with analyst IDs
      result ← aggregateAnswer(result, answer);
    end
    error ← estimateError(result);
    t ← t + δ; // update the start time for the next window
  end
end
Algorithm 2 Generating query result at the aggregator
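The following Python sketch makes the per-window processing concrete. The decrypt helper, the window bookkeeping, and the normal-approximation error estimate are illustrative assumptions; a simple binomial confidence interval stands in here for PrivApprox's actual estimator, which accounts for both sampling and randomization.

import math

def process_window(items, start_time, window_size, decrypt, z=1.96):
    # Keep only items whose timestamps fall into the current window.
    live = [decrypt(enc) for ts, enc in items
            if start_time <= ts < start_time + window_size]
    n = len(live)
    if n == 0:
        return None
    # Per-bucket "Yes" counts across all randomized answer bit-vectors.
    counts = [sum(bits) for bits in zip(*live)]
    fractions = [c / n for c in counts]
    # Placeholder confidence interval: output +/- error bound per bucket.
    errors = [z * math.sqrt(f * (1 - f) / n) for f in fractions]
    return list(zip(fractions, errors))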

Appendix B Discussion

In this section, we discuss some approaches that could be used to meet our assumptions listed in 2.2.5.

Stratified sampling. In our design in 3, we currently assume that the input stream is already stratified based on the source of events, i.e., that the data items within each stratum follow the same distribution. This may not always hold in practice. We next discuss two proposals for the stratification of evolving data streams, namely bootstrap [38, 39, 78] and semi-supervised learning [70].

Bootstrap [38, 39, 78] is a well-studied non-parametric sampling technique in statistics for estimating the distribution of a given population. In particular, the bootstrap method repeatedly draws "bootstrap samples" with replacement to estimate unknown parameters of the population, for instance by averaging over the bootstrap samples. We can employ a bootstrap-based estimator for the stratification of incoming sub-streams, as sketched below. Alternatively, we could make use of a semi-supervised algorithm [70] to stratify a data stream; its advantage is that it can train a classification model on both labeled and unlabeled data streams.
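A minimal Python sketch of such a bootstrap-based stratification check, assuming a placeholder criterion that two sub-streams blend into one stratum when their bootstrap confidence intervals for the mean overlap:

import random
import statistics

def bootstrap_mean_ci(sample, iterations=1000, alpha=0.05):
    # Resample with replacement to estimate the mean and a percentile CI.
    means = sorted(
        statistics.mean(random.choices(sample, k=len(sample)))
        for _ in range(iterations))
    lo = means[int((alpha / 2) * iterations)]
    hi = means[int((1 - alpha / 2) * iterations) - 1]
    return statistics.mean(means), (lo, hi)

def same_stratum(substream_a, substream_b):
    # Placeholder criterion: overlapping CIs suggest the same distribution.
    _, (lo_a, hi_a) = bootstrap_mean_ci(substream_a)
    _, (lo_b, hi_b) = bootstrap_mean_ci(substream_b)
    return lo_a <= hi_b and lo_b <= hi_a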

Virtual cost function. Currently, in our implementation described in 4, for a user-specified privacy budget, the sampling and randomizing parameters can be computed by inverting Equation (19). However, for query budgets expressed in terms of available computing resources or latency requirements (SLAs), we currently assume that there exists a virtual function that determines the sampling parameter from the query budget. We recommend two existing approaches, Pulsar [15] and resource prediction models [42, 69], to design and implement such a virtual function for given computing resources and latency requirements, respectively.

Pulsar [15] is a "virtual datacenter (VDC)" system that allocates resources based on tenants' demand. It proposes a multi-resource token-bucket algorithm that uses a pre-advertised cost model to support workload-independent guarantees. We could apply a similar cost model as follows: a data item to be processed is treated as a request, and the amount of resources needed to process it is its cost in tokens. Since the resource budget fixes the total number of tokens to be spent, we can derive the number of clients, i.e., the sampling fraction at clients, that can be processed within this budget, as in the sketch below.
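A minimal sketch of this idea in Python, assuming a fixed token cost per client answer (the constants are illustrative):

def sampling_fraction(budget_tokens, expected_clients, tokens_per_answer=10):
    # The budget affords at most this many client answers per window.
    affordable = budget_tokens // tokens_per_answer
    # The sampling fraction is the share of clients we can afford to process.
    return min(1.0, affordable / expected_clients)

# Example: 50,000 tokens, 100,000 active clients, 10 tokens per answer
# => at most 5,000 answers, i.e., a sampling fraction of 0.05.
print(sampling_fraction(50_000, 100_000))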

To find the sampling parameter for a given latency budget, we could use a resource prediction model [99, 97, 98]. Such a model could be built by analyzing the diurnal patterns of resource usage [23], and could predict the resources required to meet SLAs by leveraging statistical machine learning [42, 69]. Once the resource requirement for a given SLA is known, we can find the appropriate sampling parameter with the Pulsar-style method suggested above. A rough sketch follows.
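As a rough illustration, and assuming hour-of-day is the dominant usage feature (an assumption for this sketch, not a claim about the cited models), such a predictor could be fit with ordinary least squares:

import numpy as np

def fit_usage_model(hours, observed_usage, degree=3):
    # Fit a polynomial to historical resource usage keyed by hour of day.
    return np.polyfit(hours, observed_usage, degree)

def spare_capacity(model, hour, capacity):
    # Predicted background usage translates into spare capacity (tokens),
    # which the token-bucket sketch above converts into a sampling fraction.
    predicted = float(np.polyval(model, hour))
    return max(0.0, capacity - predicted)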

Appendix C Privacy Analysis and Proofs

PrivApprox achieves three privacy properties: (i) zero-knowledge privacy, (ii) anonymity, and (iii) unlinkability, as introduced in 2.2.4.

Property # I: Zero-knowledge privacy. We show that the system designed in Section 3 achieves ε_zk-zero-knowledge privacy, and we prove a tighter bound for ε_dp-differential privacy than what generally follows from zero-knowledge privacy [45]. The basic idea is that all data from the clients is already differentially private due to the use of randomized response, and the combination with pre-sampling at the clients makes it zero-knowledge private as well. By the composition properties of these privacy definitions, any computation upon the results of differentially private, as well as zero-knowledge private, algorithms is guaranteed to remain private.

In the following paragraphs we show that:

  • Independent and identically distributed (i.i.d.) sampling decomposes easily and is self-commutative. See Lemma C.1.

  • Sampling and randomized response mechanisms commute. See Lemma C.2.

  • Pre-sampling and post-sampling can be traded arbitrarily around a randomized response mechanism. See Corollary C.3.

  • An ε_zk-zero-knowledge privacy bound for our system. See Theorem C.4.

  • An ε_dp-differential privacy bound for our system. See Theorem C.5.

  • Our differential privacy bound is tighter than the general differential privacy bound derived from a zero-knowledge private algorithm. See Proposition C.6.

Intuitively, differential privacy limits the information that can be learned about any individual i by the difference arising from either including i's sensitive data in a differentially private computation or not. Zero-knowledge privacy, on the other hand, additionally gives the adversary access to aggregate information about the remaining individuals, denoted as D_{-i}. Essentially, everything that can be learned about individual i can also be learned by having access to some aggregate information upon D_{-i}.

Let San be a sanitizing algorithm, which takes a database D of sensitive attributes of individuals from a population as input and outputs a differentially private or zero-knowledge private result San(D). For brevity, we write A(San(D)) for the output of the adversary A with arbitrary external input and access to San(D). Similarly, we omit the explicit usage of the external information as input to the simulator S, as well as the total size of the database. See [44], Definitions 1 and 2, for the extended notation. Let Y be any set of possible outputs. (ε, δ)-differential privacy can be defined as

(16)  Pr[A(San(D)) ∈ Y] ≤ e^ε · Pr[A(San(D_{-i})) ∈ Y] + δ,

while (ε, δ)-zero-knowledge privacy is defined as

(17)  Pr[A(San(D)) ∈ Y] ≤ e^ε · Pr[S(agg(D_{-i})) ∈ Y] + δ,

where agg denotes the aggregate information about the remaining individuals made available to the simulator.

Before proving the desired properties, we need to introduce some notation. Let D be a database of sensitive attributes of individuals i ∈ D. For ease of presentation and without loss of generality, we restrict each individual's sensitive attribute to a boolean value d_i ∈ {0, 1} for all i. Furthermore, let 𝔻 be the super-set of all possible databases and S_s be a randomized algorithm that i.i.d. samples rows, i.e., individuals with their sensitive attributes, from database D with probability s without replacement. Let RR_{p,q} be a two-coin randomized response algorithm that decides for any individual in database D with probability p whether its true value should be part of the output. If the true value is not included in the output, the result of tossing a biased coin (coming up heads with probability q) is added to the output instead.
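Under this notation, and assuming the coin semantics above, the per-bit reporting probabilities of RR_{p,q} can be written as:

\Pr[\text{output } 1 \mid d_i = 1] = p + (1 - p)\,q, \qquad
\Pr[\text{output } 1 \mid d_i = 0] = (1 - p)\,q,

\Pr[\text{output } 0 \mid d_i = 0] = p + (1 - p)(1 - q), \qquad
\Pr[\text{output } 0 \mid d_i = 1] = (1 - p)(1 - q).

The differential privacy bound ε_rr of the mechanism is governed by the worst-case ratio of these conditional probabilities.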

Lemma C.1.

(Decompose and commute sampling) Let s_1, s_2 ∈ [0, 1] be sampling probabilities for a sampling function S. Then S can be composed and decomposed easily and is self-commutative.

Proof.

Let S_{s_1} and S_{s_2} be sampling algorithms that sample rows i.i.d. from a database D with probability s_1 and s_2, respectively. By applying S_{s_1}, any row in D has probability s_1 of being sampled. The probability of any row in D being sampled by S_{s_2} is equivalently s_2. Using function composition, the probability of any row in D being sampled by S_{s_2} ∘ S_{s_1} is

(18)  s_1 · s_2

From multiplication being commutative (s_1 · s_2 = s_2 · s_1) it follows that S_{s_1} and S_{s_2} commute, that is, S_{s_2} ∘ S_{s_1} = S_{s_1} ∘ S_{s_2}. This is true for deterministic functions and can easily be extended to randomized functions described as random variables, as random variables commute under addition and multiplication. For ease of presentation and without loss of generality we keep the notation of functions instead of random variables. Let S_s be a sampling function that samples rows i.i.d. from a given database with probability s. Decomposing the sampling function S_s into two functions with probabilities s_1 and s_2 such that s = s_1 · s_2 follows from Equation (18). It also follows that two sampling functions with probabilities s_1 and s_2 can be composed into a single sampling function with sampling probability s_1 · s_2. ∎
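A quick numeric check of the lemma in Python, using i.i.d. Bernoulli sampling (illustrative only):

import random

def sample(rows, prob):
    # Keep each row independently with the given probability.
    return [r for r in rows if random.random() < prob]

rows = list(range(100_000))
s1, s2 = 0.5, 0.6
composed = sample(sample(rows, s1), s2)   # S_{s2} composed with S_{s1}
single = sample(rows, s1 * s2)            # single sampler with s1 * s2
# Both retain about s1 * s2 = 30% of the rows, matching Equation (18).
print(len(composed) / len(rows), len(single) / len(rows))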

Lemma C.2.

(Commutativity of sampling and randomized response) Given a sampling algorithm S_s and a randomized response algorithm RR_{p,q}, the result of the pre-sampling composition RR_{p,q} ∘ S_s is statistically indistinguishable from the result of the post-sampling composition S_s ∘ RR_{p,q}. It follows that sampling and randomized response commute under function composition: RR_{p,q} ∘ S_s = S_s ∘ RR_{p,q}.

Proof.

For any individual i with d_i ∈ {0, 1}, we have to consider eight different possible cases. In case the sampling algorithm S_s decides not to sample i, it obviously does not matter whether i gets removed before the randomized response algorithm RR_{p,q} is run or afterwards. We thus condition on S_s including i in the output.

  1. Let us first consider the case that RR_{p,q} outputs the real value d_i for individual i. As S_s is fixed to include i in the output independent of its value, there is no difference between RR_{p,q} ∘ S_s and S_s ∘ RR_{p,q}.

  2. In case RR_{p,q} outputs a randomized answer, S_s again is not influenced by the outcome of any of the coin tosses and passes i along to the output. This is of course also independent of the actual randomized result.

This concludes the proof that sampling and randomized response are independent regarding their order of execution and thus commute. ∎

Corollary C.3.

(Arbitrary sampling around randomized response) Let s_1, s_2 ∈ [0, 1] be sampling probabilities for a sampling function S, and let RR_{p,q} be a two-coin randomized response mechanism with probabilities p and q. Sampling can be arbitrarily traded between pre-sampling and post-sampling around the randomized response mechanism RR_{p,q}, as long as the overall sampling probability s_1 · s_2 stays fixed.

Proof.

This follows directly from applying Lemma C.1 and Lemma C.2. ∎

We will now give a bound on ε_zk for the privacy of our system in the zero-knowledge privacy setting, as well as derive a tighter bound for (ε_dp, δ)-differential privacy than the bound that generally follows from zero-knowledge privacy.

Theorem C.4.

(ε_zk-zero-knowledge privacy) Let A be an algorithm that applies sampling with probability s together with a two-coin randomized response algorithm using probabilities p and q. A is (ε_zk, δ)-zero-knowledge private, with ε_zk given by

(19)

The system design is described in Section 3.

Proof.

From [44], Theorem 1, it follows that a (k, ε)-crowd-blending private mechanism combined with pre-sampling using probability s achieves (ε_zk, δ)-zero-knowledge privacy, with ε_zk as stated in Equation (19).

We omit the description of the additive error δ, which can be derived equivalently from [44], Theorem 1. Following Proposition 1 from [44], every ε-differentially private mechanism is also (k, ε)-crowd-blending private; thus randomized response, being an ε_rr-differentially private mechanism, also satisfies (k, ε_rr)-crowd-blending privacy. Combining both results with Equation (8) gives a

zero-knowledge private mechanism for randomized response combined with pre-sampling. Using Corollary C.3, we can replace pre-sampling with a combination of pre- and post-sampling (with probabilities s_1 and s_2, respectively) while keeping the product s = s_1 · s_2 fixed. We thus obtain the bound of Equation (19). ∎

If we do not aim at achieving zero-knowledge privacy, we can fall back to differential privacy using the result from [45], Proposition 3, which states that any ε-zero-knowledge private algorithm is also 2ε-differentially private. Using the results on the secrecy of the sample [62], which provide a privacy boost when pre-sampling is applied before a differentially private algorithm, we derive a tighter bound for differential privacy than what follows generally from zero-knowledge privacy.
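As a worked instance of this privacy boost, assume the secrecy-of-the-sample form ε' = ln(1 + s(e^ε − 1)) from [62, 9], a two-coin mechanism with p = q = 1/2 (so that ε_rr = ln 3, using the reporting probabilities above), and a sampling parameter s = 0.6:

\varepsilon' = \ln\bigl(1 + s\,(e^{\varepsilon_{rr}} - 1)\bigr)
             = \ln\bigl(1 + 0.6 \cdot (3 - 1)\bigr)
             = \ln 2.2 \approx 0.79 < \varepsilon_{rr} \approx 1.10.

Sampling thus tightens the client-side randomized response bound before any further composition.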

Theorem C.5.

(ε_dp-differential privacy) Let A be an algorithm that applies sampling with probability s, followed by a two-coin randomized response algorithm using probabilities p and q. A is ε_dp-differentially private with

(20)  ε_dp = ln(1 + s · (e^{ε_rr} − 1))

Proof.

We use the result from [9], Proof of Lemma 3, which bounds an ε-differentially private algorithm combined with pre-sampling using probability s by ln(1 + s · (e^ε − 1)). Letting ε_rr be the bound derived for randomized response, we get ε_dp = ln(1 + s · (e^{ε_rr} − 1)) for randomized response with pre-sampling.

Applying Corollary C.3, we obtain the same ε_dp bound for the combination of pre-sampling, randomized response, and post-sampling with overall sampling probability s. ∎

Proposition C.6.

(Tighter ε_dp-differential privacy bound) The bound ε_dp for differential privacy of a sampled randomized response system derived in Theorem C.5 is tighter than ε_zk-differential privacy, which is again tighter than the general 2ε_zk-differential privacy bound that follows from ε_zk-zero-knowledge privacy [45].

We directly prove Proposition C.6 by comparing ε_dp from Theorem C.5 with ε_zk from Theorem C.4. As we want to prove a bound that is tighter than 2ε_zk, we drop the factor of 2. This is possible because an ε-differentially private algorithm is also 2ε-differentially private: if we succeed in proving a bound tighter than ε_zk, then 2ε_zk-differential privacy is trivially fulfilled.

Proof.

Proposition 3 from [45] states that every ε-zero-knowledge private algorithm is also 2ε-differentially private. Using Theorem C.4, we thus get a 2ε_zk-differentially private system. Theorem C.5 proves a bound of ε_dp = ln(1 + s · (e^{ε_rr} − 1)). Putting together Theorem C.5, Theorem C.4, Proposition C.6 and Proposition 3 of [45], it remains to show that ε_dp ≤ ε_zk.

As s is the sampling parameter with 0 < s ≤ 1, the right side of this inequality is minimal at the boundary of the admissible sampling parameters, where the inequality still holds, which concludes the proof. ∎

Figure 12: Ratio of ε_zk to ε_dp depending on the sampling parameter s, for different values of p and q.

Relation of differential privacy and zero-knowledge privacy. Zero-knowledge privacy and differential privacy describe the advantage of an adversary in learning information about an individual i from the output of an algorithm running over a database D containing i's sensitive information, compared to using the result of a second (possibly different) algorithm running over D_{-i}. Zero-knowledge privacy is a strictly stronger privacy metric than differential privacy through the additional access to aggregate information about the remaining database [45]. Intuitively, as differential privacy is a special case of zero-knowledge privacy and the adversary aims at maximizing its advantage, the advantage of an adversary in the zero-knowledge model is at least as high as, and possibly higher than, the advantage of an adversary in the differential privacy model: ε_zk ≥ ε_dp. Figure 12 plots the ratio between the zero-knowledge privacy level ε_zk and the differential privacy level ε_dp given identical parameters s, p and q. Put differently, as the adversary is allowed to do more in the zero-knowledge model, the privacy level is lower, which is reflected by a higher ε value compared to the differential privacy level, given identical system parameters.

Property # II: Anonymity. We make the following assumptions to achieve the remaining two privacy properties:

  • (A1) At least two of the proxies are not colluding.

  • (A2) The aggregator does not collude with any of the proxies.

  • (A3) The aggregator and analysts cannot, at the same time, observe the communication around the proxies.

  • (A4) The adversary, seen as an algorithm, lies within the polynomial time complexity class.

To provide anonymity, we require that no system component (proxy, aggregator, analyst) can relate a query request or answer to any of the clients. To show the fulfillment of that requirement we take the view of all three parties.

a) A proxy can of course link the received data stream to a client, as it is directly connected. However, as the data stream is encrypted, the proxy would need the plaintext query request or response for the received data stream. To get the plaintext, the proxy would either need to break symmetric cryptography, which violates assumption (A4), collude with all other proxies for decryption, which violates assumption (A1), or collude with the aggregator to learn the plaintext, which violates assumption (A2).

b) Anonymity against the aggregator is achieved by source-rewriting, a standard anonymization technique typically used by proxies that also forms the basis of anonymization schemes such as onion routing [84, 31]. To break anonymity, the aggregator must be a global, passive attacker, i.e., able to simultaneously listen to the incoming and outgoing traffic of every proxy, which would violate assumption (A3). The other possibility to bridge the proxies is to collude with one of them, breaking assumption (A2).

c) The analyst knows the query request, but does not get to learn individual query answers. To see individual responses, the analyst would need to collude with the aggregator, which reduces the problem to breaking anonymity from the view of the aggregator. Collusion with the aggregator and any proxy would break assumption (A2), while collusion with a subset of the proxies reduces to breaking anonymity from the proxy view.

Property # III: Unlinkability. Unlinkability is provided by the same source-rewriting scheme as anonymity. Breaking unlinkability at any proxy is similar to breaking anonymity, as the proxy would need to obtain the plaintext query. The aggregator only receives query results, but no source information, as this is hidden by the anonymization scheme. The query results sent by the clients also do not contain linkable information, just identically structured answers without quasi-identifiers. The analyst does not receive individual responses at all, so it must collude with either a proxy or the aggregator, effectively reducing to the same problems as described above.

Acknowledgements. We would like to thank Amazon for providing us an Amazon Web Services (AWS) Education Grant.

References

  • [1] Apache Flink. https://flink.apache.org/. Accessed: Jan, 2017.
  • [2] Apache S4. http://incubator.apache.org/s4. Accessed: Jan, 2017.
  • [3] Apache Spark Streaming. http://spark.apache.org/streaming. Accessed: Jan, 2017.
  • [4] Apache Storm. http://storm-project.net/. Accessed: Jan, 2017.
  • [5] Apache Zookeeper. https://zookeeper.apache.org/. Accessed: Jan, 2017.
  • [6] Kafka - A high-throughput distributed messaging system. http://kafka.apache.org. Accessed: Jan, 2017.
  • [7] Sample household electricity time of use data. https://goo.gl/0p2QGB. Accessed: Jan, 2017.
  • [8] SQLite. https://www.sqlite.org/. Accessed: Jan, 2017.
  • [9] Differential privacy and the secrecy of the sample, Sept. 2009.
  • [10] Quickr: Lazily Approximating Complex Ad-Hoc Queries in Big Data Clusters. In Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD), 2016.
  • [11] S. Agarwal, H. Milner, A. Kleiner, A. Talwalkar, M. Jordan, S. Madden, B. Mozafari, and I. Stoica. Knowing when You’Re Wrong: Building Fast and Reliable Approximate Query Processing Systems. In Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD), 2014.
  • [12] S. Agarwal, B. Mozafari, A. Panda, H. Milner, S. Madden, and I. Stoica. BlinkDB: Queries with Bounded Errors and Bounded Response Times on Very Large Data. In Proceedings of the ACM European Conference on Computer Systems (EuroSys), 2013.
  • [13] I. E. Akkus, R. Chen, M. Hardt, P. Francis, and J. Gehrke. Non-tracking web analytics. In Proceedings of the ACM Conference on Computer and Communications Security (CCS), 2012.
  • [14] M. Al-Kateb and B. S. Lee. Stratified Reservoir Sampling over Heterogeneous Data Streams. In Proceedings of the 22nd International Conference on Scientific and Statistical Database Management (SSDBM), 2010.
  • [15] S. Angel, H. Ballani, T. Karagiannis, G. O’Shea, and E. Thereska. End-to-end performance isolation through virtual datacenters. In Proceedings of the USENIX Conference on Operating Systems Design and Implementation (OSDI), 2014.
  • [16] P. Bhatotia, U. A. Acar, F. P. Junqueira, and R. Rodrigues. Slider: Incremental Sliding Window Analytics. In Proceedings of the 15th International Middleware Conference (Middleware), 2014.
  • [17] P. Bhatotia, M. Dischinger, R. Rodrigues, and U. A. Acar. Slider: Incremental Sliding-Window Computations for Large-Scale Data Analysis. In Technical Report: MPI-SWS-2012-004, 2012.
  • [18] A. Blum, C. Dwork, F. McSherry, and K. Nissim. Practical privacy: the SuLQ framework. In Proceedings of the ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems (PODS), 2005.
  • [19] A. Blum, K. Ligett, and A. Roth. A Learning Theory Approach to Non-interactive Database Privacy. In Proceedings of the ACM Symposium on Theory of Computing (STOC), 2008.
  • [20] D. Borthakur. The hadoop distributed file system: Architecture and design. Hadoop Project Website, 2007.
  • [21] T.-H. H. Chan, M. Li, E. Shi, and W. Xu. Differentially Private Continual Monitoring of Heavy Hitters from Distributed Streams. In Proceedings of the 12th International Conference on Privacy Enhancing Technologies (PETS), 2012.
  • [22] T.-H. H. Chan, E. Shi, and D. Song. Private and Continual Release of Statistics. ACM Trans. Inf. Syst. Secur., 2011.
  • [23] R. Charles, T. Alexey, G. Gregory, H. K. Randy, and K. Michael. Towards understanding heterogeneous clouds at scale: Google trace analysis. Technical report, 2012.
  • [24] K. Chaudhuri and N. Mishra. When Random Sampling Preserves Privacy. In Proceedings of the 26th Annual International Conference on Advances in Cryptology (CRYPTO), 2006.
  • [25] S. Chaudhuri, G. Das, and V. Narasayya. Optimized Stratified Sampling for Approximate Query Processing. Proceedings of ACM Transaction of Database Systems (TODS), 2007.
  • [26] R. Chen, I. E. Akkus, and P. Francis. SplitX: High-performance Private Analytics. In Proceedings of the Conference on Applications, Technologies, Architectures, and Protocols for Computer Communications (SIGCOMM), 2013.
  • [27] R. Chen, A. Reznichenko, P. Francis, and J. Gehrke. Towards Statistical Queries over Distributed Private User Data. In Presented as part of the 9th USENIX Symposium on Networked Systems Design and Implementation (NSDI), 2012.
  • [28] ComScore Reaches $14 Million Settlement in Electronic Privacy Class Action. http://www.alstonprivacy.com/comscore-reaches-14-million-settlement-in-electronic-privacy-class-action. Accessed: Jan, 2017.
  • [29] T. Condie, N. Conway, P. Alvaro, J. M. Hellerstein, K. Elmeleegy, and R. Sears. MapReduce Online. In Proceedings of the 7th USENIX Conference on Networked Systems Design and Implementation (NSDI), 2010.
  • [30] G. Cormode, M. Garofalakis, P. J. Haas, and C. Jermaine. Synopses for Massive Data: Samples, Histograms, Wavelets, Sketches. Found. Trends databases, 2012.
  • [31] R. Dingledine, N. Mathewson, and P. Syverson. Tor: The second-generation onion router. Technical report, DTIC Document, 2004.
  • [32] J. R. Douceur. The Sybil Attack. In Proceedings of 1st International Workshop on Peer-to-Peer Systems (IPTPS), 2002.
  • [33] C. Dwork. Differential privacy. In Proceedings of the 33rd International Colloquium on Automata, Languages and Programming, part II (ICALP), 2006.
  • [34] C. Dwork, K. Kenthapadi, F. McSherry, I. Mironov, and M. Naor. Our Data, Ourselves: Privacy Via Distributed Noise Generation. In Proceedings of the 24th Annual International Conference on The Theory and Applications of Cryptographic Techniques (EUROCRYPT), 2006.
  • [35] C. Dwork, F. McSherry, K. Nissim, and A. Smith. Calibrating Noise to Sensitivity in Private Data Analysis. In Proceedings of the Third conference on Theory of Cryptography (TCC), 2006.
  • [36] C. Dwork, M. Naor, T. Pitassi, and G. N. Rothblum. Differential privacy under continual observation. In Proceedings of the ACM Symposium on Theory of Computing (STOC), 2010.
  • [37] C. Dwork and A. Roth. The algorithmic foundations of differential privacy. Foundations and Trends in Theoretical Computer Science, 9(3-4):211–407, 2014.
  • [38] D. M. Dziuda. Data mining for genomics and proteomics: analysis of gene and protein expression data. John Wiley & Sons, 2010.
  • [39] B. Efron and R. Tibshirani. Bootstrap Methods for Standard Errors, Confidence Intervals, and Other Measures of Statistical Accuracy. Statistical Science, 1986.
  • [40] J. A. Fox and P. E. Tracy. Randomized response: a method for sensitive surveys. Beverly Hills California Sage Publications, 1986.
  • [41] A. Friedman, I. Sharfman, D. Keren, and A. Schuster. Privacy-Preserving Distributed Stream Monitoring. In Proceedings of the Symposium on Network and Distributed System Security (NDSS), 2014.
  • [42] A. S. Ganapathi. Predicting and optimizing system utilization and performance via statistical machine learning. In Technical Report No. UCB/EECS-2009-181, 2009.
  • [43] M. N. Garofalakis and P. B. Gibbon. Approximate Query Processing: Taming the TeraBytes. In Proceedings of the International Conference on Very Large Data Bases (VLDB), 2001.
  • [44] J. Gehrke, M. Hay, E. Lui, and R. Pass. Crowd-blending privacy. In Proceedings of the 32nd Annual International Conference on Advances in Cryptology (CRYPTO), 2012.
  • [45] J. Gehrke, E. Lui, and R. Pass. Towards Privacy for Social Networks: A Zero-Knowledge Based Definition of Privacy. In Theory of Cryptography, 2011.
  • [46] I. Goiri, R. Bianchini, S. Nagarakatte, and T. D. Nguyen. ApproxHadoop: Bringing Approximations to MapReduce Frameworks. In Proceedings of the Twentieth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2015.
  • [47] O. Goldreich, S. Micali, and A. Wigderson. How to Play any Mental Game or A Completeness Theorem for Protocols with Honest Majority. In STOC, 1987.
  • [48] S. D. Gordon, T. Malkin, M. Rosulek, and H. Wee. Multi-party Computation of Polynomials and Branching Programs without Simultaneous Interaction. In Proceedings of the Annual International Conference on Advances in Cryptology (EUROCRYPT), 2013.
  • [49] S. Guha, B. Cheng, and P. Francis. Privad: Practical Privacy in Online Advertising. In Proceedings of the 8th Symposium on Networked Systems Design and Implementation (NSDI), 2011.
  • [50] S. Guha, M. Jain, and V. N. Padmanabhan. Koi: A location-privacy platform for smartphone apps. In Presented as part of the 9th USENIX Symposium on Networked Systems Design and Implementation (NSDI), 2012.
  • [51] V. Gulisano, V. Tudor, M. Almgren, and M. Papatriantafilou. BES: Differentially Private and Distributed Event Aggregation in Advanced Metering Infrastructures. In Proceedings of the 2nd ACM International Workshop on Cyber-Physical System Security (CPSS), 2016.
  • [52] T. Gupta, N. Crooks, W. Mulhern, S. Setty, L. Alvisi, and M. Walfish. Scalable and Private Media Consumption with Popcorn. In Proceedings of 13th USENIX Symposium on Networked Systems Design and Implementation (NSDI), 2016.
  • [53] T. h. Hubert Chan, E. Shi, and D. Song. Privacy-preserving stream aggregation with fault tolerance. In Proceedings of 16th International Conference on Financial Cryptography and Data Security (FC), 2012.
  • [54] A. Haeberlen, B. C. Pierce, and A. Narayan. Differential Privacy Under Fire. In Proceedings of the 20th USENIX Security Symposium (USENIX Security), 2011.
  • [55] M. Hardt and S. Nath. Privacy-aware personalization for mobile advertising. In Proceedings of the 2012 ACM Conference on Computer and Communications Security (CCS), 2012.
  • [56] M. Hay, V. Rastogi, G. Miklau, and D. Suciu. Boosting the Accuracy of Differentially Private Histograms Through Consistency. Proceedings of the International Conference on Very Large Data Bases (VLDB), 2010.
  • [57] HealthCare.gov Sends Personal Data to Dozens of Tracking Websites. https://www.eff.org/deeplinks/2015/01/healthcare.gov-sends-personal-data. Accessed: Jan, 2017.
  • [58] J. M. Hellerstein, P. J. Haas, and H. J. Wang. Online Aggregation. In Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD), 1997.
  • [59] S. Jarecki and V. Shmatikov. Efficient Two-Party Secure Computation on Committed Inputs. In Proceedings of the Annual International Conference on Advances in Cryptology (EUROCRYPT), 2007.
  • [60] Z. Jerzak and H. Ziekow. The debs 2015 grand challenge. In Proceedings of the 9th ACM International Conference on Distributed Event-Based Systems (DEBS), 2015.
  • [61] V. Karwa, S. Raskhodnikova, A. Smith, and G. Yaroslavtsev. Private analysis of graph structure. In Proceedings of the International Conference on Very Large Data Bases (VLDB), 2011.
  • [62] S. P. Kasiviswanathan, H. K. Lee, K. Nissim, S. Raskhodnikova, and A. Smith. What Can We Learn Privately? SIAM J. Comput.
  • [63] D. R. Krishnan, D. L. Quoc, P. Bhatotia, C. Fetzer, and R. Rodrigues. IncApprox: A Data Analytics System for Incremental Approximate Computing. In Proceedings of International Conference on World Wide Web (WWW), 2016.
  • [64] J. Lee and C. Clifton. How Much is Enough? Choosing for Differential Privacy. In Proceedings of the 14th International Conference on Information Security (ISC), 2011.
  • [65] S. Lee, E. L. Wong, D. Goel, M. Dahlin, and V. Shmatikov. Box: A Platform for Privacy-Preserving Apps. In Presented as part of the 10th USENIX Symposium on Networked Systems Design and Implementation (NSDI), 2013.
  • [66] C. Li, M. Hay, V. Rastogi, G. Miklau, and A. McGregor. Optimizing Linear Counting Queries Under Differential Privacy. In Proceedings of the ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems (PODS), 2010.
  • [67] Y. Lindell and B. Pinkas. An Efficient Protocol for Secure Two-Party Computation in the Presence of Malicious Adversaries. In Proceedings of the Annual International Conference on Advances in Cryptology (EUROCRYPT), 2007.
  • [68] Y. Lindell and B. Pinkas. An Efficient Protocol for Secure Two-Party Computation in the Presence of Malicious Adversaries. J. Cryptology, 2015.
  • [69] S. Mallick, G. Hains, and C. S. Deme. A resource prediction model for virtualization servers. In Proceedings of International Conference on High Performance Computing and Simulation (HPCS), 2012.
  • [70] M. M. Masud, C. Woolam, J. Gao, L. Khan, J. Han, K. W. Hamlen, and N. C. Oza. Facing the reality of data stream classification: coping with scarcity of labeled data. Knowledge and information systems, 2012.
  • [71] F. McSherry. Privacy Integrated Queries. In Proceedings of ACM SIGMOD International Conference on Management of Data (SIGMOD), 2009.
  • [72] F. McSherry and R. Mahajan. Differentially-private Network Trace Analysis. In Proceedings of the Conference on Applications, Technologies, Architectures, and Protocols for Computer Communications (SIGCOMM), 2010.
  • [73] P. Mohan, A. Thakurta, E. Shi, D. Song, and D. Culler. GUPT: Privacy Preserving Data Analysis Made Easy. In Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data (SIGMOD), 2012.
  • [74] D. S. Moore. The Basic Practice of Statistics. W. H. Freeman & Co., 2nd edition, 1999.
  • [75] A. Narayan and A. Haeberlen. DJoin: Differentially Private Join Queries over Distributed Databases. In Proceedings of the 10th USENIX Conference on Operating Systems Design and Implementation (OSDI), 2012.
  • [76] N. Pansare, V. R. Borkar, C. Jermaine, and T. Condie. Online Aggregation for Large MapReduce Jobs. Proceedings of the International Conference on Very Large Data Bases (VLDB), 2011.
  • [77] B. Pinkas, T. Schneider, N. P. Smart, and S. C. Williams. Secure Two-Party Computation Is Practical. In Proceedings of the 15th International Conference on the Theory and Application of Cryptology and Information Security: Advances in Cryptology (ASIACRYPT), 2009.
  • [78] O. Pons. Bootstrap of means under stratified sampling. Electronic Journal of Statistics, 2007.
  • [79] Privacy Lawsuit Targets Net Giants Over ‘Zombie’ Cookies. http://www.wired.com/2010/07/zombie-cookies-lawsuit. Accessed: Jan, 2017.
  • [80] D. Proserpio, S. Goldberg, and F. McSherry. Calibrating Data to Sensitivity in Private Data Analysis: A Platform for Differentially-private Analysis of Weighted Datasets. Proceedings of the International Conference on Very Large Data Bases (VLDB), 2014.
  • [81] D. L. Quoc, M. Beck, P. Bhatotia, R. Chen, C. Fetzer, and T. Strufe. PrivApprox: Privacy-Preserving Stream Analytics. In Proceedings of the 2017 USENIX Conference on USENIX Annual Technical Conference (USENIX ATC), 2017.
  • [82] D. L. Quoc, R. Chen, P. Bhatotia, C. Fetzer, V. Hilt, and T. Strufe. StreamApprox: Approximate Computing for Stream Analytics. 2017.
  • [83] V. Rastogi and S. Nath. Differentially private aggregation of distributed time-series with transformation and encryption. In Proceedings of the International Conference on Management of Data (SIGMOD), 2010.
  • [84] M. G. Reed, P. F. Syverson, and D. M. Goldschlag. Anonymous connections and onion routing. IEEE Journal on Selected Areas in Communications, 1998.
  • [85] I. Roy, S. T. V. Setty, A. Kilzer, V. Shmatikov, and E. Witchel. Airavat: Security and Privacy for MapReduce. In Proceedings of the 7th USENIX Conference on Networked Systems Design and Implementation (NSDI), 2010.
  • [86] N. Santos, R. Rodrigues, K. P. Gummadi, and S. Saroiu. Policy-Sealed Data: A New Abstraction for Building Trusted Cloud Services . In Proceedings of the USENIX Security Symposium (USENIX Security), 2012.
  • [87] SEC Charges Two Employees of a Credit Card Company with Insider Trading. http://www.sec.gov/litigation/litreleases/2015/lr23179.htm. Accessed: Jan, 2017.
  • [88] E. Shi, T. H. Chan, E. G. Rieffel, R. Chow, and D. Song. Privacy-Preserving Aggregation of Time-Series Data. In Proceedings of the Symposium on Network and Distributed System Security (NDSS), 2011.
  • [89] K. Singh, S. Bhola, and W. Lee. xbook: Redesigning privacy control in social networking platforms. In Proceedings of the 18th Conference on USENIX Security Symposium (USENIX Security), 2009.
  • [90] S. K. Thompson. Sampling. Wiley Series in Probability and Statistics, 2012.
  • [91] Ú. Erlingsson, V. Pihur, and A. Korolova. RAPPOR: Randomized Aggregatable Privacy-Preserving Ordinal Response. In Proceedings of the ACM SIGSAC Conference on Computer and Communications Security (CCS), 2014.
  • [92] B. Viswanath, E. Kiciman, and S. Saroiu. Keeping Information Safe from Social Networking Apps. In Proceedings of the ACM SIGCOMM Workshop on Social Networks (WOSN’12), 2012.
  • [93] G. Wang, B. Wang, T. Wang, A. Nika, H. Zheng, and B. Y. Zhao. Defending against sybil devices in crowdsourced mapping services. In Proceedings of the 14th Annual International Conference on Mobile Systems, Applications, and Services (MobiSys), 2016.
  • [94] Q. Wang, Y. Zhang, X. Lu, Z. Wang, Z. Qin, and K. Ren. RescueDP: Real-time Spatio-temporal Crowd-sourced Data Publishing with Differential Privacy. In Proceedings of the 35th Annual IEEE International Conference on Computer Communications (INFOCOM), 2016.
  • [95] S. L. Warner. Randomized response: A survey technique for eliminating evasive answer bias. In Journal of the American Statistical Association, 1965.
  • [96] L. Waye. Privacy integrated data stream queries. In Proceedings of the 2014 International Workshop on Privacy & Security in Programming (PSP), 2014.
  • [97] A. Wieder, P. Bhatotia, A. Post, and R. Rodrigues. Brief Announcement: Modelling MapReduce for Optimal Execution in the Cloud. In Proceedings of the 29th ACM SIGACT-SIGOPS symposium on Principles of Distributed Computing (PODC), 2010.
  • [98] A. Wieder, P. Bhatotia, A. Post, and R. Rodrigues. Conductor: Orchestrating the Clouds. In Proceedings of the 4th international workshop on Large Scale Distributed Systems and Middleware (LADIS), 2010.
  • [99] A. Wieder, P. Bhatotia, A. Post, and R. Rodrigues. Orchestrating the Deployment of Computations in the Cloud with Conductor. In Proceedings of the 9th USENIX symposium on Networked Systems Design and Implementation (NSDI), 2012.
  • [100] D. P. Woodruff. Revisiting the Efficiency of Malicious Two-Party Computation. In Proceedings of the 26th Annual International Conference on Advances in Cryptology (EUROCRYPT), 2007.
  • [101] A. C. Yao. Protocols for Secure Computations. In Proceedings of the 23rd Annual Symposium on Foundations of Computer Science (SFCS), 1982.
  • [102] K. Zeng, S. Agarwal, A. Dave, M. Armbrust, and I. Stoica. G-OLA: Generalized On-Line Aggregation for Interactive Analysis on Big Data. In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data (SIGMOD), 2015.