TimeCrypt: A Scalable Private Time Series Data Store
We present TimeCrypt, an efficient and scalable system that augments time series data stores with encrypted data processing capabilities and features a new encryption-based access control scheme that is additively homomorphic. TimeCrypt is tailored for time series workloads, supports fast statistical queries over large volumes of encrypted time series data, and enables users to cryptographically restrict the scope of queries and data access based on pre-defined access policies. Our evaluation of TimeCrypt shows that its memory overhead and performance are competitive and close to operating on data in the clear (insecure). E.g., with data ingest at 2.4 million data points per second, we experience a slowdown of around 1.8% for both encrypted data ingest and statistical queries.
Recent years have seen an explosive growth in devices and services that automatically collect, aggregate, and analyze time series data, and this trend is only expected to accelerate with the proliferation of low-cost sensors and the adoption of IoT. However, with this growth has come mounting concern over protecting this data and the privacy of users.
A key challenge in ensuring data privacy is that the need for privacy co-exists with the desire to extract value from the data and extracting value often implies granting users and third-party services access to the data. For example, many wearable devices now collect detailed personal health indicators which might be advantageously shared with doctors, insurance companies, and family.
The most common approach deployed today has data owners storing their data at trusted third-party providers who can observe the data in the clear [fitbit-privacy, strava]. However, the recurrence of incidents [techcrunch, healthapp, uber, bbc, NYT, fb-scandal] of unauthorized sharing/selling of user data and the privacy issues that arise from storing sensitive data remotely has shown the need for solutions that conceal data from cloud/storage providers while still benefitting from their services. To address this need, many research efforts are exploring an alternative approach in which data is always stored encrypted and query processing is done directly over encrypted data [cryptdb, monomi, pilatus, seabed, blindseer]. These efforts, however, focus primarily on transactional [cryptdb] or batch analytic workloads [seabed] rather than the time series data we consider here. As a result, they do not provide the necessary abstractions for effective stream analytics.
Additionally, supporting query processing over encrypted data is not in itself sufficient to protect data privacy. In many situations, it is unnecessary and/or undesirable to give even authorized third parties the ability to query the data in an unfettered manner. Instead, users may want to (i) share only aggregate statistics about the data (e.g., avg/min/max vs. all values), (ii) limit the resolution at which such statistics are reported (e.g., hourly vs. per-minute), (iii) limit the time interval over which queries are issued (e.g., only June 2018), or (iv) a combination of the above. Moreover, the desired granularity and scope of sharing can vary greatly across users and applications. For example, a user might choose to simultaneously share hourly averages of their measured heart rate with their doctor and per-minute averages with their trainer but only for the duration of their workout session. Similarly, a datacenter operator might share resource utilization levels with a tenant but only for the duration of her job. Hence support for encrypted query processing must go hand-in-hand with access control that limits the scope of data that users might query. Such access control must be secure, flexible, and fine-grained: i.e., providing crypto-enforced control over not just who can access a data stream, but also at what resolution, and over what time interval. Equally important, the overhead of the crypto primitives underlying access control and encrypted data processing needs to be negligible to meet the scaling, latency, and performance requirements associated with time series data stores (§2).
There are two aspects to ensuring data confidentiality and privacy while data is stored remotely: the first oversees the hosting providers (e.g., the cloud), which store and process the data but cannot access data in the clear. The second deals with authorized services (i.e., data consumers), which are allowed to access the data but only within a defined scope and granularity (in time and resolution), to retain control over what can be inferred from the data. Ensuring privacy, in this case, is a matter of supporting secure access control. These two aspects have traditionally been addressed independently: the former through work on encrypted data processing [cryptdb, monomi, talos, seabed, agrawal2003information, perrig_enc_search, hacigumucs2002executing] and the latter through crypto-enforced access control schemes [abe5, abe2, goyal2006attribute, abe4, Sieve]. The challenge with time series data is that we must simultaneously support both: processing over encrypted data streams and powerful access control, so that authorized parties can issue queries on encrypted data but can only decrypt results within the data scope they are authorized to access. While one could achieve this by simply combining solutions for each issue [blindseer, pilatus, enki], we show that current approaches scale poorly for time series data (§6.2). Hence, we propose an alternative approach based on a new unified encryption scheme that simultaneously supports encrypted query processing and crypto-based access control.
In this paper, we present TimeCrypt, the first system that augments time series data stores with efficient and privacy-preserving processing of time series data, while providing cryptographic means to restrict the query scope based on data owners' policies. In particular, we introduce a new encryption-based access control scheme that is additively homomorphic. With our encryption, data owners can cryptographically restrict user A to query encrypted data at a defined temporal range and granularity, while simultaneously allowing user B to execute queries at a different granularity of the same data without (i) introducing ciphertext expansion, (ii) introducing any noticeable delays, or (iii) requiring a trusted entity to facilitate this. Simultaneously achieving all three has never been realized before.
The crux of our Homomorphic Encryption-based Access Control (HEAC) is time-encoded keys that are derived from a novel construction (based on a key-derivation tree and dual key-regression) which allows selective sharing of keys and expressing powerful access policies (i.e., temporal range and granularity). With HEAC the server can efficiently process aggregation-based statistical queries over encrypted data and access control is cryptographically enforced at the client-side; a client can only decrypt query results over data they are authorized to access, i.e., true end-to-end encryption. TimeCrypt also supports queries that span multiple streams, e.g., from multiple users. Similarly, a principal can only decrypt the query result if she is granted access to all streams involved in the inter-streams processing.
Contributions. In summary, our contributions are:
Design and development of TimeCrypt, the first scalable private time series data store. TimeCrypt supports a rich set of analytics of relevance to time series data: statistical analysis (e.g., sum, mean, variance, min/max, frequency count), data compaction (e.g., archiving at lower resolutions), and range queries over large volumes of encrypted time series data.
A novel encryption-based access control scheme for stream data that is additively homomorphic. It is the first homomorphic scheme that cryptographically enforces access to desired data granularities.
An open-source prototype and evaluation of TimeCrypt showing its feasibility and competitive performance. We prototype TimeCrypt on top of Cassandra as the storage engine. We evaluate TimeCrypt with two real-world applications, i.e., health and data center monitoring. Considering a workload with 2.4 million data point ingest per second on a single machine, TimeCrypt’s throughput is only reduced by 1.8% for both data ingest and statistical queries.
2 Time Series Data
Massive collection of time series data is increasingly prevalent in a wide variety of scenarios [gorilla, farmbeats-MSR, car-data]. To cope with the challenges that arise when processing large volumes of time series data, numerous efforts have focused on optimizing and designing dedicated databases which offer significantly higher throughput and scalability for their target workloads [gorilla, BTrDB, influxdata, netflix, kairosdb, timescale, chronix]. Such optimizations include employing simplified data models [gorilla] and building specialized indices [BTrDB, Faloutsos] or precomputed aggregates [BTrDB, summarystore, influxdata] to accelerate queries over high volumes of data. Time series databases have been the fastest growing category of databases in the last two years [db-rank, TSList].
Time series data is a sequence of time-ordered data points. Each point is associated with a value and a unique timestamp capturing when the record was reported. Time series data generally consists of consecutive measurements reported by the same source, forming a stream. A stream also includes metadata such as the data metric describing the data, e.g., heart rate, and the data source describing the origin of the data, e.g., a unique device. Data points are typically generated at an extremely high rate (high velocity) and are initially stored at a high resolution (large volume) [gorilla, BTrDB]. It is not unusual for applications in this space to report hundreds of millions of data points per day [BTrDB, summarystore].
Primitives distinct to time series analytics are: (i) access pattern: analytics is primarily carried out over ranges of streams, where data is accessed by specifying an arbitrary time interval, i.e., a set of temporally co-located data points rather than targeted point queries. (ii) aggregation and indexing: keeping computation tractable over massive amounts of data necessitates fast stream processing routines which are leveraged to identify fractions of the data (less than 6% [bailis2017macrobase]) that are of interest for more complex analytics. Hence, specialized indices to accelerate statistical queries are essential in time series databases [BTrDB, influxdata, prometheus]. (iii) compression and data decay: time series data is often machine generated, continuous, and massive. Hence, compression [gorilla] and summarization techniques [summarystore] are crucial for these systems.
Emerging time series databases primarily focus on scaling performance and do not adequately address data privacy. With TimeCrypt, we aim to fill this gap without impairing their performance.
We discuss the challenges our system must address, give an overview of TimeCrypt, and describe our security model.
To realize an encrypted time series data store that is efficient, scalable, and practical, we have to overcome important challenges with regards to the overhead of the underlying cryptographic operations. This is of particular significance in our context due to the high velocity and volume of time series data. The crypto primitives employed by existing private databases [cryptdb, talos, monomi, blindseer] fall short in meeting the scalability and latency requirements for time series data workloads, or lack in the cryptographic means to restrict user’s access and queries [seabed, cryptdb, talos, monomi]. Most of these systems employ cryptographic primitives that incur considerable overhead with regards to both computation (expensive operations) and/or memory (ciphertext expansion). In the context of time series databases, the former impacts the system’s responsiveness negatively (i.e., making it hard to meet low-latency requirements) and the latter highly inflates the memory consumption of an encrypted index which negates the performance gains of in-memory processing as less index data can fit into memory (§6.1).
Most private databases rely on a trusted third party who manages and enforces access control. While simple, it implies that the trusted third party can access sensitive data. Alternatively, a user can encrypt the target data for each principal. This approach is not scalable when considering fine-grained policies and leads to memory redundancies, e.g., (re)encrypting the same data repeatedly for different users. Systems that do employ crypto-enforced access control [blindseer, enki, pilatus], however, similarly induce high memory and computation overheads, which renders them impractical for our context (§6.2). In this work we focus on addressing these key challenges.
TimeCrypt builds on a conventional time series database architecture [BTrDB, chronix, influxdata], where a standard distributed key-value store is extended with additional logic for time series workloads. TimeCrypt includes a client-side engine to realize the end-to-end encryption module paired with access control. In TimeCrypt, the server can only see encrypted data, yet TimeCrypt incorporates a fast server-side query engine for aggregation-based statistical queries and raw data access similar to other plaintext time series databases [BTrDB].
TimeCrypt in a nutshell. TimeCrypt consists of two components (i.e., the client and server engines) and involves four parties (i.e., data owner, data producer, data consumer, and the database server), as illustrated in Fig. 1. Data producers are devices that generate and upload time series data, such as wearables and IoT devices. A data producer runs TimeCrypt’s client engine which handles stream preprocessing (§4.1) and securely storing (§4.2.2) the data on untrusted servers. Data consumers are entities (e.g., services) that are authorized to access users' data to provide added value, such as visualizations, monitoring, and predictions. Data consumers (i.e., principals) leverage TimeCrypt’s client engine to send queries (§4.5). Data owners are entities who own the data generated by data producing devices and can express flexible access permissions to their data (§4.2.3). TimeCrypt executes statistical queries directly on compressed and encrypted data. The server engine builds and maintains private aggregate indices over encrypted data streams for fast processing of statistical queries (§4.5). Our encryption scheme enables fast data processing and flexible and powerful access control, while maintaining low query response times (§4.2.2). TimeCrypt instances are stateless and therefore horizontally scalable.
TimeCrypt crypto primitives. To enable encrypted data processing that natively supports access control, TimeCrypt introduces a stream-cipher-based encryption scheme with additive homomorphic properties. An additive homomorphic encryption scheme supports additions on ciphertexts, such that $Enc(m_1) + Enc(m_2) = Enc(m_1 + m_2)$. With this, we can support statistical queries that are inherently aggregation-based (e.g., sum, mean, frequency count) or can be transformed to be aggregation-based (e.g., min/max) [prio]. Since the encryption scheme is based on a stream cipher, it relies on a keystream. A key aspect of our scheme is tied to the observation that time series data streams are continuous. Hence, we introduce a time-encoded keystream that maps keys to temporal segments of the data stream, such that a user can restrict access to the data stream by only sharing the corresponding range in the keystream with the principal (§4.3). Based on the access policy, a principal is provided with the necessary decryption keys by means of access tokens. Access tokens are encrypted with the principal’s public key (hybrid encryption) and stored at the server’s key-store. To enable sharing without enumerating all the keys, our encryption scheme integrates a tree-based key derivation function (§4.2.3). To grant a user access to only a defined temporal resolution, e.g., aggregated values at the full hour, we introduce an encoding scheme that allows for efficient decryption of aggregated values (§4.2.2). In §4.4, we discuss how we restrict access to a particular resolution level and introduce an efficient key distribution mechanism based on dual key regression.
3.3 Security Model
Threat model. We assume the cloud storage to be honest-but-curious, such that it follows the protocol correctly while trying to learn as much as possible from the underlying data. This is a common trust model [seabed, cryptdb, monomi], as the storage provider would face economic repercussions in case of protocol violations. We assume the existence of an identity provider, e.g., Keybase [keybase], for the public-key to identity mappings. We assume data producers are honest and report correct data and all secret keys are stored securely.
Security properties. TimeCrypt provides provable data confidentiality and prevents unauthorized data access. TimeCrypt embodies a crypto-enforced access control that cryptographically restricts the temporal resolution and/or scope at which a principal can access the data within one or multiple streams (i.e., issue statistical queries). We formalize and prove the security guarantees of TimeCrypt in §A.1. TimeCrypt supports access revocation with forward secrecy, such that the revoked user can no longer access new data. The revoked user can, however, still access old data. Revoking access to old data is, in general, a hard problem, as the revoked user might have already cached the data locally. TimeCrypt does not guarantee freshness, completeness, nor correctness of the retrieved results. TimeCrypt can be extended with frameworks, such as Verena [Verena], that provide these guarantees. Finally, TimeCrypt does not protect against access pattern-based inferences. TimeCrypt can be complemented with Oblivious RAM approaches [Elaine:RAM] to hide the access patterns.
4 TimeCrypt Design
In the following, we discuss TimeCrypt’s system components and how they interplay to realize TimeCrypt.
4.1 Writing Stream Data
We now explain the data serialization process.
Chunking. We serialize and store time series data in time-ordered blocks, referred to as chunks, which contain batches of consecutive data points. Batching/chunking is a common technique in time series databases [chronix, bolt, minicrypt]. It improves performance because time series analytics predominantly involves processing sequences of temporally co-located data points, rather than individual points [bolt].
Each data chunk includes an encrypted digest that contains aggregate statistics over its corresponding chunk data (e.g., sum, count, min, max, frequency count, etc.).
The content of a digest is pre-configured based on the statistical queries to be supported per stream.
TimeCrypt supports common statistical queries by default.
The digest is encrypted with our encryption scheme HEAC that is additively homomorphic.
The data points per chunk are compressed
4.2 TimeCrypt’s Encryption
We start this section by giving an intuition about the underlying symmetric-key homomorphic encryption we employ in TimeCrypt.
Symmetric-key Homomorphic Encryption
Our scheme builds on Castelluccia’s encryption [castelluccia2005]. Castelluccia’s encryption in its essence is similar to a stream cipher, where one-time keys are combined with the plaintext data block for encryption. As in traditional stream ciphers, the scheme makes the assumption that the client is in possession of a pseudorandom keystream for encryption and decryption. To encrypt an integer $m \in [0, M-1]$, we select a random key $k \in [0, M-1]$ and compute the ciphertext: $c = Enc_k(m) = m + k \bmod M$. For decryption, access to the random key is required: $m = Dec_k(c) = c - k \bmod M$.
This scheme is information-theoretically secure, as long as the keys are pseudorandom and no key is reused [castelluccia2009]. Since we use modular addition as the encryption function, the scheme is also additively homomorphic and requires access to the aggregated secret keys to decrypt an aggregated ciphertext:
$Dec_{k_1 + k_2}(c_1 + c_2) = (c_1 + c_2) - (k_1 + k_2) \bmod M = m_1 + m_2 \bmod M$
After decrypting the aggregated ciphertext, we gain access to the aggregated plaintext value. Note that similar to operating on plaintext, there will be an overflow (in this case modulo $M$), if the aggregated values grow larger than $M - 1$. Hence, the ciphertext size in the scheme is limited by $M$ and the homomorphic additions are extremely efficient, as they correspond to modular additions. In TimeCrypt, we set $M$ to $2^{64}$, such that we can support all integer sizes, without leaking any information about their original size.
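The additive property described above can be illustrated with a minimal sketch. The names `encrypt`, `decrypt`, and the sample values are hypothetical, the modulus is assumed to be $2^{64}$ purely for illustration, and keys are drawn with Python's `secrets` module rather than from a real keystream.

```python
import secrets

M = 2**64  # modulus, assumed for illustration


def encrypt(m, k):
    """Mask plaintext m with one-time key k using modular addition."""
    return (m + k) % M


def decrypt(c, k):
    """Remove the key (or the sum of keys, for an aggregated ciphertext)."""
    return (c - k) % M


values = [72, 75, 71, 80]                       # e.g., heart rate samples
keys = [secrets.randbelow(M) for _ in values]   # one fresh key per value
ciphertexts = [encrypt(m, k) for m, k in zip(values, keys)]

# Server side: add ciphertexts without seeing any plaintext.
agg_cipher = sum(ciphertexts) % M

# Client side: the same number of keys has to be added for decryption.
agg_key = sum(keys) % M
total = decrypt(agg_cipher, agg_key)
```

Note that the client adds one key per aggregated ciphertext, so the decryption cost grows linearly with the size of the aggregate.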
The drawback of this approach is that the client needs to aggregate all the keys to decrypt each aggregated ciphertext. If the server aggregates one million different ciphertexts, the client has to add one million keys for decryption, which is highly inefficient and not scalable.
In TimeCrypt, we utilize the characteristics of time series data to overcome this problem. Our Homomorphic Encryption-based Access Control scheme (HEAC) employs a key canceling technique for reducing the number of key additions during decryption (§4.2.2) and comprises an efficient key generation construction that embodies the access control capabilities (§4.2.3).
To reduce the number of key aggregations for decryption of aggregated ciphertexts, we leverage the fact that time series data is aggregated in-range (i.e., over a contiguous range in time). The idea is that instead of just adding a key $k_i$ to message $m_i$, we construct the key as $k_i - k_{i+1}$. In other words, we select the individual keys, such that during aggregation the inner keys cancel each other out, and only the outer keys at the beginning and end are needed for decryption. We employ the following encoding scheme for encryption:
$c_i = Enc(m_i) = m_i + k_i - k_{i+1} \bmod M$
For decryption, access to $k_i$ and $k_{i+1}$ is required:
$m_i = Dec(c_i) = c_i - k_i + k_{i+1} \bmod M$
The advantage of this key encoding becomes relevant while decrypting an in-range aggregated ciphertext. The resulting aggregated key consists of only two keys as the other keys cancel out:
$\sum_{l=i}^{j-1} c_l = \sum_{l=i}^{j-1} m_l + k_i - k_j \bmod M$
The encoding renders the decryption in TimeCrypt independent of the number of in-range aggregated ciphertexts. An encoding that cancels out keys during aggregation was introduced by Castelluccia et al. [castelluccia2011CancelOut] and was adopted in other systems [seabed, bonawitz2017practical]. However, Castelluccia’s encryption does not lend itself to access control as we discuss next.
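The key-canceling idea can be sketched as follows, assuming each digest $m_i$ is masked with the difference of two consecutive keystream keys, $k_i - k_{i+1}$, modulo an illustrative $M = 2^{64}$. The helper names and the sample digests are hypothetical.

```python
import secrets

M = 2**64  # modulus, assumed for illustration


def heac_encrypt(m, keys, i):
    """Encrypt the i-th digest with the key difference k_i - k_{i+1}."""
    return (m + keys[i] - keys[i + 1]) % M


def heac_decrypt_range(agg_cipher, keys, i, j):
    """Decrypt an aggregate over chunks i..j-1 using only k_i and k_j."""
    return (agg_cipher - keys[i] + keys[j]) % M


digests = [10, 20, 30, 40, 50, 60, 70, 80]
keys = [secrets.randbelow(M) for _ in range(len(digests) + 1)]
ciphertexts = [heac_encrypt(m, keys, i) for i, m in enumerate(digests)]

# Server side: in-range aggregation over chunks 2..6.
agg = sum(ciphertexts[2:7]) % M

# Client side: two key operations, independent of the range length.
result = heac_decrypt_range(agg, keys, 2, 7)
```

In this sketch the decryption touches two keys whether the range covers five chunks or five million.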
Key Generation with Access Control
Prior work. Castelluccia et al. [castelluccia2009] propose leveraging a pseudorandom function (PRF) (§A.1.1) with an initially exchanged secret key, to generate the keystream $k_0, k_1, k_2, \dots$. Other systems [castelluccia2011CancelOut, seabed, bonawitz2017practical] adopt this approach and use a PRF (e.g., a hash function or a block cipher) to generate the $i$-th encryption key $k_i$ based on a secret key $k$. While this addresses the challenge of handling a large number of keys, it exhibits the all-or-nothing sharing principle, as with access to the secret key one can compute all keys. Hence, it lacks support for proper access control needed in the context of our target applications.
Ours. To enable a fine-grained access control that allows data owners to cryptographically enforce the scope of access to their data, we design a novel key construction scheme based on key derivation trees. A key derivation tree is a balanced binary tree where each node contains a unique pseudorandom string. The key derivation tree is built top-down from a secret random seed as the root. The child nodes are generated with a pseudorandom generator (PRG), formally defined in §A.1.1, that takes the parent as the input. Our PRG consists of $G_0(k)$ for the left child and $G_1(k)$ for the right child, where $k$ is the parent node. This procedure is applied recursively until the desired depth $h$ in the tree is reached. The leaf nodes represent the keystream as depicted in Fig. 2. We select a large $h$ such that the keystream is virtually infinite.
In TimeCrypt, the pseudorandom generator can be realized with hash functions or block ciphers with the parent node as the key. In §6.2, we discuss the trade-offs of different PRGs and why AES-NI is the best candidate in terms of performance. For a formal treatment and security proofs of our key derivation tree, we refer to §A.1.3.
Sharing. The key derivation tree allows us to share segments of the keystream efficiently. Instead of sharing the segment key-by-key, the client shares a few inner nodes, which we refer to as access tokens. A principal, who is in possession of these tokens, can derive the keys in the segment by computing the subtrees of each token, as illustrated in Fig. 2. Therefore, for sharing a segment in the keystream the client has to send at most $2\log_2(n)$ access tokens instead of the $n$ individual keys, e.g., in Fig. 2’s toy example, we share eight keys with a single access token. In practice a single token could be used to share thousands of keys. Note that given a token it is computationally infeasible (due to the one-way property of PRGs) to compute the parent, sibling, or any of the ancestor nodes. Hence, a principal cannot compute any keys outside the segment they are granted access to.
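A toy version of the key derivation tree may help make the sharing step concrete. This is a sketch under stated assumptions: the PRG is instantiated with SHA-256 over a one-byte prefix and the parent node (the paper only states that hash functions or block ciphers can be used), the tree height is 4, and the names `prg`, `derive_node`, and `subtree_leaves` are hypothetical.

```python
import hashlib


def prg(node):
    """Assumed PRG: expand a parent node into (left child, right child)."""
    left = hashlib.sha256(b"\x00" + node).digest()
    right = hashlib.sha256(b"\x01" + node).digest()
    return left, right


def derive_node(root, depth, index):
    """Walk from the root to the node at `depth` (0 = root), position `index`."""
    node = root
    for bit_pos in reversed(range(depth)):
        left, right = prg(node)
        node = right if (index >> bit_pos) & 1 else left
    return node


def subtree_leaves(token, levels):
    """Derive all leaves below an access token, left to right."""
    nodes = [token]
    for _ in range(levels):
        nodes = [child for n in nodes for child in prg(n)]
    return nodes


HEIGHT = 4                # toy tree with 2**4 = 16 leaf keys
seed = bytes(32)          # fixed seed for reproducibility; use a random secret in practice

# One access token (the right child of the root) covers leaf keys 8..15.
token = derive_node(seed, 1, 1)
shared = subtree_leaves(token, HEIGHT - 1)

# The owner derives the same keys directly from the seed.
expected = [derive_node(seed, HEIGHT, i) for i in range(8, 16)]
```

A holder of `token` can recompute leaves 8 to 15 but, given the one-way PRG, has no path to leaves 0 to 7.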
4.3 Time-Encoded Keystream
The tree key derivation allows us to efficiently restrict access to segments of keys. However, in TimeCrypt, we need to restrict access based on temporal ranges, e.g., Fri-Sep-14-14:00 till Mo-17-06:00 2018. To enable this in HEAC, we associate each key to a fixed time window. By mapping keys to temporal ranges, a time range implicitly determines the position of the used key in the keystream. Hence, we do not need to store identifiers of the keys along with the ciphertexts and can avoid ciphertext expansion which other systems suffer from [seabed].
TimeCrypt’s data chunking is carried out at fixed time intervals of size $\Delta$ per stream (e.g., 10 s intervals) and each chunk is encrypted with a fresh key from the keystream. Keep in mind that each chunk contains in addition to raw data points a digest that is relevant for the statistical processing (§4.1). Assuming the data stream starts at timestamp $t_0$, the chunk digest for the interval $[t_i, t_{i+1})$ ($t_i = t_0 + i \cdot \Delta$) is encrypted with HEAC as $m_i + k_i - k_{i+1} \bmod M$, while the corresponding chunk is simply encrypted with AES_GCM using a key computed from the same keystream keys. For example, when the data owner grants access to the stream from $t_a$ to $t_b$ (i.e., $[t_a, t_b)$) the data owner shares the access token from the key derivation tree that corresponds to the key segment $k_a, \dots, k_b$, as illustrated in Fig. 2. With these keys the principal can decrypt all queries associated with data within the time range $[t_a, t_b)$.
The fixed time interval size $\Delta$ is an application parameter in TimeCrypt that can vary between streams and defines the smallest interval for server-side data processing. Due to the additively homomorphic property of HEAC, TimeCrypt computes data aggregates over intervals that are multiples of $\Delta$. TimeCrypt additionally supports inter-stream queries, regardless of the individual stream chunk sizes. Only the principal who is granted access to the corresponding streams can decrypt the result of inter-stream queries.
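The mapping from time ranges to keystream positions can be sketched as below. The function names, the stream start `t0`, and the chunk interval `delta` are illustrative names, and the sketch assumes the key-canceling encoding from §4.2.2, under which an aggregate over chunks $i$ to $j-1$ needs the keys at positions $i$ through $j$.

```python
def chunk_index(t, t0, delta):
    """Index i of the chunk [t_i, t_{i+1}) that contains timestamp t."""
    if t < t0:
        raise ValueError("timestamp precedes the start of the stream")
    return (t - t0) // delta


def key_indices_for_range(start, end, t0, delta):
    """Keystream positions needed to decrypt aggregates within [start, end)."""
    if (start - t0) % delta or (end - t0) % delta:
        raise ValueError("range must be aligned to chunk boundaries")
    first = chunk_index(start, t0, delta)
    last = chunk_index(end, t0, delta)
    return range(first, last + 1)


T0 = 0        # stream start in seconds (illustrative)
DELTA = 10    # 10 s chunks

# Access to [60 s, 120 s) corresponds to keystream positions 6..12.
needed = key_indices_for_range(60, 120, T0, DELTA)
```

Because the position follows from the timestamp alone, no key identifier has to be stored next to a ciphertext.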
4.4 Resolution-based Access Restriction
We now discuss how TimeCrypt provides crypto-enforced access control over the resolution at which data can be queried; i.e., where the data owner not only restricts access to a time range per principal but also defines the temporal granularity (e.g., per minute, hour, etc.) at which the principals can retrieve/query data. Multi-resolution data support is crucial for approximate queries, visualization, and time series analytics that operate on higher representation of the data to keep the analysis computationally tractable [summarystore]. Aside from being important for time series analytics, serving data at different granularities is of ample relevance for privacy. Restricting access to only the necessary data resolutions provides a higher degree of privacy protection, preventing inference of sensitive information from high resolution data [prio].
Outer Key Sharing
To enable crypto-enforced resolution access in TimeCrypt, we leverage the fact that inner keys cancel out during in-range aggregations, as described in §4.2.2. In general, an in-range aggregated ciphertext over the time period $[t_i, t_j)$ has the form $\sum_{l=i}^{j-1} m_l + k_i - k_j \bmod M$, where the inner keys are canceled out. Hence, with only access to the keys $k_i$ and $k_j$, one can only decrypt this aggregation $\sum_{l=i}^{j-1} m_l$, i.e., restricted access. As an example, if the owner wants to restrict access to 6-fold aggregations, the owner only shares the outer keys $k_0, k_6, k_{12}, \dots$ with the principal. The principal can decrypt the aggregated ciphertexts only at the 6-fold granularity (e.g., $6 \times \Delta$) or lower resolutions, but not higher resolutions since the inner keys are missing. Note that the chunk interval $\Delta$ defines the highest resolution, and lower resolutions are multiples of $\Delta$. For instance, we set $\Delta$ to 10 s in the health application and hence can define access resolutions as multiples of 10 s (e.g., 1-min aggregates in the example above). Resolutions are aligned at timestamps and cannot be shifted arbitrarily. For instance, per-minute resolution means only aggregated data at the full minute. Otherwise one could compute the difference of two aggregates shifted by $\Delta$ and disclose data at the chunk resolution.
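The effect of sharing only outer keys can be demonstrated with a small sketch. It assumes the encoding $m_i + k_i - k_{i+1} \bmod M$ with an illustrative $M = 2^{64}$ and a 6-fold resolution; the function `decrypt_aggregate` and the sample digests are hypothetical, and for brevity the server-side summation and the client-side decryption are folded into one function.

```python
import secrets

M = 2**64   # modulus, assumed for illustration
FOLD = 6    # e.g., 1-min aggregates over 10 s chunks

digests = list(range(100, 118))   # 18 chunk digests = 3 minutes
keys = [secrets.randbelow(M) for _ in range(len(digests) + 1)]
ciphertexts = [(m + keys[i] - keys[i + 1]) % M for i, m in enumerate(digests)]

# The owner shares only every 6th key: positions 0, 6, 12, 18.
outer_keys = {i: keys[i] for i in range(0, len(keys), FOLD)}


def decrypt_aggregate(ciphertexts, shared_keys, i, j):
    """Decrypt the aggregate over chunks i..j-1 if both outer keys are held."""
    if i not in shared_keys or j not in shared_keys:
        raise KeyError("outer key not shared at this resolution")
    agg = sum(ciphertexts[i:j]) % M   # computed by the server in practice
    return (agg - shared_keys[i] + shared_keys[j]) % M


second_minute = decrypt_aggregate(ciphertexts, outer_keys, 6, 12)
first_two_minutes = decrypt_aggregate(ciphertexts, outer_keys, 0, 12)
```

Requests at a finer or misaligned granularity, such as chunks 6 to 7 or 3 to 9, fail in this sketch because the principal does not hold the boundary keys.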
Efficient Key Distribution
The Outer Key Sharing technique (§4.4.1) requires sharing keys from the keystream that are not contiguous (e.g., only every 6th key instead of every key in the time segment) which cannot be realized efficiently via our tree structure. To overcome the hurdle of key distribution in the case of resolution-based access, TimeCrypt generates a new keystream per access resolution, for efficient segment-wise access to the outer keys (i.e., resolution keys). For example, if a client wants to support a per-minute resolution for 10 s data chunks, the client encrypts the outer keys with the resolution keystream and stores the resulting key envelopes on the server. A principal with access to the resolution keystream can download the envelopes and gain access to the respective outer keys.
To enable efficient and flexible sharing of the resolution keystream, we employ dual key regression, as introduced in our prior work [droplet]. Dual key regression is a special construction based on hash chains. It supports efficient enumeration of keys within a bounded interval, such that given two tokens, one for index $i$ and one for index $j$ with $i \le j$, one can only compute the keys $k_l$ with $i \le l \le j$. Hash chains have the property that they can be computed efficiently in one direction, but it is computationally infeasible to compute the reverse, due to the preimage resistance property of cryptographic hash functions. Dual key regression employs two hash chains (primary and secondary) where the key derivation function takes two tokens, one from each chain. The secondary chain is consumed in the reverse order of the primary chain, which makes it possible to set start and end boundaries on the shared interval, as depicted in Fig. 3. We refer to §A.2 for a formal description of dual key regression.
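The two-chain idea can be sketched as follows. This is an illustration only, not the formal construction from §A.2: SHA-256 stands in for the hash function, the key derivation is assumed to be a hash over the concatenation of the two chain values, and the names `build_chains`, `derive_key`, and `keys_from_tokens` are hypothetical.

```python
import hashlib


def H(x):
    return hashlib.sha256(x).digest()


def build_chains(primary_seed, secondary_seed, n):
    """Primary chain runs forward; secondary chain is consumed in reverse."""
    primary = [primary_seed]
    secondary = [secondary_seed]
    for _ in range(n - 1):
        primary.append(H(primary[-1]))
        secondary.append(H(secondary[-1]))
    secondary.reverse()   # now secondary[i] == H(secondary[i + 1])
    return primary, secondary


def derive_key(p, s):
    """Assumed key derivation: combine one value from each chain."""
    return H(p + s)


def keys_from_tokens(p_a, s_b, a, b):
    """Derive keys a..b from the primary token at a and secondary token at b."""
    ps, ss = [p_a], [s_b]
    for _ in range(b - a):
        ps.append(H(ps[-1]))   # p_a, p_{a+1}, ..., p_b
        ss.append(H(ss[-1]))   # s_b, s_{b-1}, ..., s_a
    ss.reverse()
    return [derive_key(p, s) for p, s in zip(ps, ss)]


N = 16
primary, secondary = build_chains(b"primary-seed", b"secondary-seed", N)
all_keys = [derive_key(p, s) for p, s in zip(primary, secondary)]

# Sharing (primary[3], secondary[9]) bounds access to keys 3..9.
window = keys_from_tokens(primary[3], secondary[9], 3, 9)
```

Keys before index 3 would require inverting the primary chain, and keys after index 9 would require inverting the secondary chain.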
| API | Description |
|---|---|
| (1) CreateStream(uuid, [config]) | Create a new stream; config defines parameters, e.g., compression, chunk interval, operators. |
| (2) DeleteStream(uuid) | Delete specified stream with all associated data. |
| (3) RollupStream(uuid, res, [$T_s$, $T_e$]) | Roll up an existing stream or a segment of it to the specified temporal resolution. |
| (4) InsertRecord(uuid, [t, val]) | Serialize data points in a chunk and append to the end of the stream. |
| (5) GetRange(uuid, $T_s$, $T_e$) | Retrieve all data records within the specified time interval. |
| (6) GetStatRange([uuid], $T_s$, $T_e$, [operators]) | Retrieve statistics for the given time interval, default [sum, count, mean, var, freq, min/max]. |
| (7) DeleteRange(uuid, start, end) | Delete specified segment of the stream, while maintaining per-chunk digest. |
| (8) GrantAccess(uuid, principal-id, start, end, res) | Grant access to a principal at the specified resolution for the specified time interval. |
| (9) GrantOpenAccess(uuid, principal-id, start, res) | Grant open-ended subscription, i.e., access granted until revoked. |
| (10) RevokeAccess(uuid, principal-id, end) | Revoke access of a principal starting from the specified end time. |
In TimeCrypt, a user does not need to decide a priori on a fixed resolution for data consumers and can define a new resolution dynamically at any point in time by creating a new dual key regression, as illustrated in Fig. 3. For example, Alice can share her health data with a physician at minute-level (high) resolution during physiotherapy from January to February, and reduce the resolution to hourly (low) from March. The physician sees high-resolution data only for January and February, and only hourly data from March onwards. In §6.2, we discuss the performance of resolution-based access control.
4.5 Encrypted Index
We now discuss the indexing structure we employ to process statistical queries over billions of encrypted data records and describe the statistical queries we support.
Time series indexing. TimeCrypt builds a time-partitioned aggregation tree [BTrDB] over encrypted data at the server side. This forms the base for an index structure that makes it possible to execute fast queries over a large sequence of encrypted data and enables fast and efficient data retention. The index structure is a $k$-ary tree, where each node contains statistical summaries referred to as digests. Each digest represents a statistical summary over a time interval in the time series. The server builds the tree bottom-up, where the leaf nodes contain the digest of each chunk. A digest in the parent node is computed as the aggregation of all digests of the child nodes. Therefore, the parent node represents the statistical summary spanning the whole time interval of its subtree, as depicted in Fig. 4. The encrypted index enables the server to significantly decrease the computation time of statistical range queries, as we avoid expensive serial scans and minimize on-the-fly computations. The server builds the index from the digests that are encrypted with HEAC. Since time series workloads are in-order and append-only, updating the tree is straightforward for the server. Note that only the relevant segments of the tree are loaded into memory, which is important in case the tree grows too large.
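The following Python sketch illustrates how such a time-partitioned aggregation tree can answer range queries without a serial scan. It is a simplified, in-memory sketch under assumptions: plain integers stand in for HEAC-encrypted digests (the server would perform the same additions on ciphertexts), each digest holds a single sum, and the class and method names are hypothetical.

```python
class AggregationTree:
    """k-ary aggregation tree over an append-only sequence of chunk digests."""

    def __init__(self, k: int = 4):
        if k < 2:
            raise ValueError("arity must be at least 2")
        self.k = k
        self.levels = [[]]  # levels[0] holds one digest per chunk

    def append(self, digest: int) -> None:
        """Append a chunk digest and refresh the ancestors on its path."""
        self.levels[0].append(digest)
        idx = len(self.levels[0]) - 1
        level = 0
        while len(self.levels[level]) > 1:
            children = self.levels[level]
            idx //= self.k
            level += 1
            if level == len(self.levels):
                self.levels.append([])
            parents = self.levels[level]
            # A parent digest is the aggregation of its (up to k) children.
            agg = sum(children[idx * self.k:(idx + 1) * self.k])
            if idx == len(parents):
                parents.append(agg)
            else:
                parents[idx] = agg

    def query(self, lo: int, hi: int):
        """Aggregate chunks lo..hi-1; returns (total, number of nodes read)."""
        if not 0 <= lo <= hi <= len(self.levels[0]):
            raise IndexError("range outside the stream")
        total = touched = level = 0
        while lo < hi:
            nodes = self.levels[level]
            while lo < hi and lo % self.k:   # unaligned nodes on the left edge
                total += nodes[lo]
                lo += 1
                touched += 1
            while lo < hi and hi % self.k:   # unaligned nodes on the right edge
                hi -= 1
                total += nodes[hi]
                touched += 1
            # The remaining range is aligned, so continue one level up.
            lo //= self.k
            hi //= self.k
            level += 1
        return total, touched
```

At each level the query reads at most $k-1$ nodes on either edge of the range, so the number of nodes read grows with the depth of the tree rather than with the number of chunks in the range.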
Data decay. As time series data ages, it is often aggregated into lower resolutions for long-term retention of historical data. Typical strategies for collapsing multiple data points for roll-up to a lower precision are based on aggregation-based summarization. Depending on the application, high-resolution data can be aged-out after a defined retention period. TimeCrypt natively supports downsampling and archiving at lower resolutions, e.g., strategies for pruning of the index at a fixed resolution.
Statistical queries. TimeCrypt supports a set of common statistical queries in time series databases which are essential for identifying and searching for data segments that are relevant for further complex analytics. The per-chunk digest holds a vector of values that are encrypted with HEAC. TimeCrypt computes an aggregate function to serve the statistical queries.
To evaluate linear computations such as SUM, COUNT, and MEAN, the vector in the digest contains the encrypted sum and count of the data points in the chunk, and MEAN is computed from SUM and COUNT. For quadratic functions, e.g., VAR and STDEV, the vector additionally stores the sum of squares per chunk. For HISTOGRAM, the digest tracks an encrypted count per bin for each chunk, and the final result delivers the aggregate count for each bin.
We compute MIN/MAX values via the HISTOGRAM function. Hence, in addition to providing the MIN/MAX values, we also gain information about their frequency count. Note that since the MIN/MAX function does not rely on order-revealing encryption it does not suffer from leakage [Naveed]. The set of supported statistical queries can be extended with further aggregation-based functions, e.g., aggregation-based encodings that allow private training of linear machine learning models [prio].
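As an illustration of how these queries reduce to element-wise additions, the following Python sketch computes the statistics from aggregated digest vectors. It is a sketch under assumptions: values are shown in the clear (in TimeCrypt each vector element would be encrypted with HEAC and decrypted by the client after aggregation), the digest is assumed to hold the sum, count, and sum of squares followed by per-bin counts, and the bin edges and function names are hypothetical.

```python
BIN_EDGES = [0, 50, 100, 150, 200]  # hypothetical lower bin edges


def chunk_digest(values):
    """Digest vector: [sum, count, sum of squares, count per bin...]."""
    hist = [0] * len(BIN_EDGES)
    for v in values:
        # Index of the last bin whose lower edge is <= v (assumes v >= 0).
        hist[max(i for i, edge in enumerate(BIN_EDGES) if edge <= v)] += 1
    return [sum(values), len(values), sum(v * v for v in values)] + hist


def aggregate(digests):
    """Element-wise addition, the only operation the server needs."""
    return [sum(column) for column in zip(*digests)]


def statistics(digest):
    """Client-side evaluation on the (decrypted) aggregate digest."""
    total, count, squares = digest[:3]
    hist = digest[3:]
    mean = total / count
    occupied = [i for i, c in enumerate(hist) if c > 0]
    lo, hi = occupied[0], occupied[-1]
    return {
        "sum": total,
        "count": count,
        "mean": mean,
        "var": squares / count - mean ** 2,  # population variance
        "min_bin": (BIN_EDGES[lo], hist[lo]),  # lowest occupied bin, frequency
        "max_bin": (BIN_EDGES[hi], hist[hi]),  # highest occupied bin, frequency
    }
```

In this sketch MIN and MAX are resolved only up to the bin width, which reflects the trade-off of deriving them from the histogram instead of from order-revealing ciphertexts.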
| System | Micro | Index Size | Average Ingest Time | | | Average Query Time (worst-case) | | |
|---|---|---|---|---|---|---|---|---|
| Paillier | 2.1 µs | 780 MB (96x) | 37 ms (6167x) | 42 ms (3500x) | N/A | 22 ms (1692x) | 37 ms (1028x) | N/A |
| EC-ElGamal | 0.7 ms | 168 MB (21x) | 27 ms (4500x) | 43 ms (3583x) | N/A | 66 ms (5077x) | 185 ms (5139x) | N/A |
| TimeCrypt | 1 ns | 8.1 MB (1x) | 16 µs (1.8x) | 18 µs (1.3x) | 22 µs (1.3x) | 21 µs (1.6x) | 46 µs (1.3x) | 50 µs (1.1x) |
| Plaintext | 1 ns | 8.1 MB (1x) | 6 µs (1x) | 12 µs (1x) | 17 µs (1x) | 13 µs (1x) | 36 µs (1x) | 45 µs (1x) |
4.6 TimeCrypt Integration
In this section, we briefly describe TimeCrypt’s API as well as its storage and integration aspects.
API. TimeCrypt is realized as a service which exposes an interface similar to conventional time series stores; applications can insert encrypted data, retrieve encrypted data by specifying an arbitrary time range, and process statistical queries over arbitrary ranges of encrypted data, as summarized in Table 1. In TimeCrypt, each stream is identified by a unique UUID and associated stream metadata, e.g., hostname, data type, sensor ID, location. Each stream has one writer (i.e., data producer) and one or multiple readers (i.e., principals). A data owner can grant and revoke read access to a principal at the desired temporal resolution.
Storage model. TimeCrypt can be plugged-in with any scalable key-value store for persisting data chunks and statistical indices. For each incoming chunk, the server engine loads the relevant index nodes and updates the index accordingly. The index nodes and chunks are stored under a unique identifier, which consists of the stream identifier and the encoding of the time interval that they represent. We compute the identifier of a node/chunk on-the-fly without storing any references, based on the temporal range boundaries.
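A minimal Python sketch of such reference-free identifiers is given below. The exact encoding used by TimeCrypt is not specified here, so the key layout (`uuid/start-end`), the default chunk interval and arity, and the function names are assumptions made for illustration only.

```python
def node_interval(t: int, level: int, chunk_interval: int, k: int,
                  origin: int = 0):
    """Time interval [start, end) covered by the level-`level` node holding t.

    Level 0 corresponds to a chunk; each level up spans k times more time.
    """
    span = chunk_interval * k ** level
    start = origin + ((t - origin) // span) * span
    return start, start + span


def storage_key(stream_uuid: str, t: int, level: int,
                chunk_interval: int = 10, k: int = 4, origin: int = 0) -> str:
    """Key-value store identifier computed purely from the range boundaries."""
    start, end = node_interval(t, level, chunk_interval, k, origin)
    return f"{stream_uuid}/{start}-{end}"
```

Because the identifier is a pure function of the stream identifier and the interval boundaries, the server can locate the parent or child of any node with simple arithmetic and a single key-value lookup.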
Client-side batching. To meet the performance and end-to-end encryption requirements, the client takes an active role in TimeCrypt. The client feeds encrypted data chunks and digests to the server. Chunking at the client side induces a latency that is bounded by the chunk interval. This latency can be eliminated without breaking the encryption by uploading encrypted data records to the datastore in real time and dropping the encrypted records once the corresponding chunk is stored. The effort required to integrate TimeCrypt into existing databases depends largely on the internals of the target database. However, integration is straightforward for databases that leverage key-value storage. Our prototype is interfaced with Cassandra, a distributed key-value store. This resembles a common design of current time series databases [gorilla, influxdata, netflix] and allows for modular development and independent performance quantification.
We built a prototype of TimeCrypt in Java. The server library consists of a network interface implemented with Netty [netty], which provides TimeCrypt’s API. The server and client communicate over Google’s protobufs [protobuf] protocol. The server engine includes the implementation of the $k$-ary tree logic and an interface to the underlying distributed storage, i.e., Cassandra [cassandra]. We employ an LRU cache for the index nodes using caffeine [coffein]. The client engine contains the data serialization module, which compresses (zlib by default) and encrypts chunks with AES-GCM. For the crypto operations, we use the Java security provider and a native C implementation of AES-NI. The query module implements the statistical query interface and a standard data retrieval interface. A user can query and access data only if she is granted the corresponding access permissions. The client and server consist of 850 and 5600 sloc, respectively.
To compare TimeCrypt to the strawman, our implementation integrates the crypto primitives for index computation based on Paillier [javalier] (based on Java BigIntegers) and EC-ElGamal (based on OpenSSL [openssl]).
We now present the evaluation results of TimeCrypt. The evaluation is designed to answer two questions: (i) can TimeCrypt meet the performance requirements of time series workloads, which we assess by showing to what extent TimeCrypt narrows the performance gap to operating on plaintext data, and (ii) what are the performance gains of TimeCrypt relative to a strawman construction (representing encrypted databases [cryptdb, monomi, talos]). After elaborating on our evaluation setup, we discuss the results of the microbenchmark of the system components (§6.1 and §6.2). Then we present the results of the end-to-end (E2E) system evaluation, considering the workloads of a health monitoring (mhealth) and a data center monitoring (DevOps) application (§6.3).
Setup and metrics. Our experiments are conducted with Amazon Web Services (AWS) M5 instances, each equipped with a 2.5 GHz Intel Xeon processor, running Ubuntu (16.04 LTS). We run TimeCrypt’s server nodes on m5.2xlarge instances with 8 virtual processor cores and 32 GB RAM. The clients are simulated on several m5.xlarge instances each with 4 virtual processor cores and 16 GB RAM. The client and server are connected to the same datacenter network, with up to 10 Gbps bandwidth. For the microbenchmark, we include IoT OpenMotes which are equipped with 32-bit ARM M3 SoC 32 MHz and a 250 MHz crypto accelerator (Fitbit trackers utilize a similar class of microcontrollers) and a MacBook Pro 2.8 GHz Intel Core i7, with 16 GB RAM.
For all our experiments, we use the same system architecture. We quantify the overhead of TimeCrypt and compare it to the plaintext setting, which operates over data in the clear (i.e., 64-bit unencrypted values), and to the strawman, for which we consider two alternative encryption schemes for the digest, i.e., Paillier and EC-ElGamal. For fairness, in all settings, the system operates with compressed data chunks. For TimeCrypt and strawman, we use 128-bit security [nistrecommendation]. This corresponds to 3072-bit keys for Paillier and a 256-bit elliptic curve for EC-ElGamal (i.e., prime256v1 in OpenSSL).
In the microbenchmarks, we consider the latency of index updates and query processing without accounting for network delay. In all experiments, we instantiate $k$-ary index trees and a keystream with one billion keys via the key derivation tree.
For the E2E system benchmark, we use workloads of an mhealth and a DevOps application. For mhealth, we consider a health monitoring wearable [medicalhealthtracker], which collects 12 different metrics at 50 Hz, where we set the chunk length to 10 s holding up to 500 data points. In the DevOps scenario, we utilize a synthesized CPU monitoring workload generated by the time series benchmark suite [timescalebenchmarkframework] with 10 metrics, 100 hosts, a 10 s data rate, and a one-minute chunk size. For both, we generate the load on a machine with 100 threads, where each thread constantly performs four statistical queries after each chunk ingest for the experiment duration of half an hour, excluding warm-up and cool-down phases. On the server side, we run a Cassandra and a TimeCrypt instance on the same machine for the mhealth app and separate them in the DevOps scenario. The network latency between the client and server is about 0.6 ms.
6.1 Index Performance
We discuss the evaluation results of different aspects of the index, as summarized in Table 2. In the micro-benchmark, the index supports one statistical operation (i.e., sum) for isolated overhead quantification, whereas in the E2E benchmark the index supports more queries.
Size. The in-memory index is essential for fast statistical queries. As system memory is limited and generally smaller than the available data, it is crucial for system performance to keep the index size small. The encryption schemes in the strawman exhibit large ciphertext expansion; for instance, for one million chunks we experience a 96x index size expansion with Paillier. TimeCrypt has no ciphertext expansion for 64-bit values.
Ingest time. On each ingest, the index nodes are updated by computing aggregates of the child nodes. The costs for updates are relatively high for both EC-ElGamal and Paillier, i.e., more than 3500x slower than plaintext. With Paillier this is due to the high encryption cost, while in EC-ElGamal the cost of elliptic curve additions dominates the average ingest time. In TimeCrypt, additions are as efficient as in plaintext. Hence, the average ingest time increases only slightly due to the encryption cost (1.3x for a large index), outperforming the strawman constructions by three orders of magnitude.
Query performance. Fig. 5 shows the performance of the index for statistical range queries of different lengths. As the length of queries increases, fewer tree levels are traversed, which results in fewer cache fetches and lower computation time, e.g., the index depth of five is observable in Fig. 5. For plaintext and TimeCrypt the resulting pattern is similar due to the low cost of additions, while for TimeCrypt the decryption overhead is visible. The strawman encryptions have higher addition costs, which results in the distinct sawtooth pattern due to on-the-fly aggregations within index nodes.
Queries with non-power-of-k ranges require an index drill-down on either end of the range. For a worst-case alignment, this increases the computation time logarithmically, rather than linearly, in the number of stored chunks. Similar to ingest, TimeCrypt performs statistical queries with a latency close to plaintext (i.e., only 1.1x) and outperforms the strawman significantly.
6.2 Client Performance
Crypto primitives. Table 3 summarizes the encryption and decryption costs for TimeCrypt. TimeCrypt’s encryption scheme requires two key derivations. By leveraging hardware AES instructions (AES-NI), the derivation cost for our hash tree is reduced to 2.5 µs, as shown in Fig. 6. Hence, encryption and decryption in TimeCrypt amount to 5.08 µs, which outperforms the strawman by several orders of magnitude.
Access control. To compare TimeCrypt’s crypto-based data access to related approaches, such as Attribute-based Encryption (ABE), we consider it in isolation, as ABE-based systems, e.g., Sieve [Sieve], do not support encrypted data processing. To enable granular access at the level of chunks, the ABE scheme takes the chunk counter as an attribute. To grant access to a given range, the attributes of the principal’s key are set to this range. To realize resolution access, the client or a trusted proxy can download and compute the aggregates, which are protected with the corresponding attribute. This results in an overhead of 53 ms per chunk (80-bit security), considering only one attribute. The overhead is expected to increase linearly with more attributes. TimeCrypt computes a keystream via the tree-based key derivation, and a resolution keystream via dual key regression. Considering zero caching for the worst-case computation time, the former has an upper bound of $\log_2(n)$ hash computations for $n$ keys, i.e., 2.5 µs for our tree. The latter amounts to an iteration over the dual key regression, whose cost is bounded in the number of nodes in the key regression. This amounts to 2.7 ms for a key regression with the highest resolution matching the number of keys. To decrypt, ABE requires 13 ms per chunk, whereas TimeCrypt only requires one addition and one subtraction (i.e., 2 ns). Consequently, TimeCrypt outperforms ABE significantly. Note that for fairness, we assume both TimeCrypt and the ABE-based system to have the same key distribution mechanisms in place.
6.3 End-to-End System Performance
In the following, we quantify the E2E overhead for the health and DevOps applications.
mHealth Performance. The plaintext setting reaches a throughput of 2.47M records/s for ingest and 19.4k ops/s for statistical queries, as shown in Fig. 7a-b. TimeCrypt achieves a throughput close to plaintext for both ingest and statistical queries, with only 1.8% slowdown. TimeCrypt is 20x and 52x faster than the strawman with EC-ElGamal and Paillier, respectively.
With regard to latency, TimeCrypt outperforms the strawman by two orders of magnitude and approaches the latency of plaintext (Fig. 7c-d). The impact of a small index cache (1 MB) is distinct, but similar for both plaintext and TimeCrypt, due to a higher number of cache misses.
mHealth Views. Our mhealth app shows different aggregation plots of last month’s data (121 M data points). We also consider the extreme case of plotting one month of data at minute granularity (403 MB plot), which induces an overhead of 1.51x in latency compared to plaintext, as shown in Fig. 8. This is due to the high number of decryptions of the individually retrieved aggregates (i.e., 40320). With coarser granularities, the overhead sharply decreases and reaches 1.01x for one month.
DevOps Performance. In the DevOps app, we consider a data center CPU monitoring workload, where the clients query the server for the average CPU utilization and the percentage of machines with higher than 50% utilization, within up to 16 h intervals. The load is similar to mhealth but with smaller chunks, i.e., 6 records per chunk. With plaintext, we observe an ingest throughput of 60.6k records/s and a query throughput of 40.4k ops/s. TimeCrypt matches the plaintext performance, with only 0.75% slowdown.
7 Discussion and Limitations
We highlight some research questions that remain open.
Richer statistics. TimeCrypt is optimized for statistical range queries on encrypted time series data, the most common queries over such data (§2). Analyzing raw streams in their entirety is generally impractical if not infeasible [bailis2017macrobase]. TimeCrypt provides powerful tools to search and retrieve relevant segments of the stream that can then be subject to advanced client-side processing after decryption. Data mining and advanced queries, e.g., series similarity, distance measures, predictions, and pattern detection are currently not supported on encrypted data. These are naturally more complex to process, and require specialized data structures and algorithms to be performed efficiently over big data [Faloutsos].
Performance. TimeCrypt’s encryption scheme relies on key management that delivers one-time keys. Our encryption is optimized for aggregations over contiguous segments and performs worse on other patterns, such as aggregating every second data chunk. In such cases, the decryption overhead grows linearly with the number of aggregations and is not bounded by a constant factor.
8 Related Work
TimeCrypt’s encryption scheme HEAC builds on Castelluccia’s encryption [castelluccia2009, castelluccia2005], which has been adopted in several works [seabed, bonawitz2017practical, privacy-preserving-aggregation]. Seabed [seabed] is a secure analytics system that similarly builds on Castelluccia for secure aggregation. However, Seabed is designed for Spark-like batch processing workloads without the tight latency requirements of time series data.
Encrypted databases are designed mainly for transactional [cryptdb, pilatus, monomi, enki] and analytics workloads [monomi, seabed]. Hence, they neither provide the necessary primitives for stream analytics nor support crypto-enforced fine-grained access. E.g., CryptDB [cryptdb] has high computation load and large memory expansion. ENKI [enki] and Pilatus [pilatus] support sharing and encrypted computations, but lack fine-grained access control, and suffer similarly from high overheads. Bolt [bolt] is an encrypted data storage system for time series data that supports retrieval of encrypted chunks but has no server-side query support. BlindSeer [blindseer] enables private boolean search queries over an encrypted database by building an index with Yao’s garbled circuits, but does not support statistical queries. It integrates access control for the search queries, but requires two non-colluding parties.
Crypto-based access is explored by crypto-systems [abe5] such as identity-based encryption, ABE, predicate encryption, and functional encryption. ABE [abe1, abe2, abe3, abe4, goyal2006attribute] is the most expressive among them, though it comes with limitations with respect to revocation, fine-grained access, and dynamic updates [abe5]. Sieve [Sieve] combines ABE with key-homomorphic encryption, to enable revocation on a non-colluding cloud. ABE-based systems do not support private aggregation and lack scalability for time series data workloads.
Trusted Execution Environments (TEE) provide an isolated execution environment without exposing the data to the server. TEE-based systems have been introduced for analytics and database workloads [enclavedb, oblix, vc3, Kossmann, M2R, Haven]. However, they require dedicated trusted hardware which may be vulnerable to attacks [sgxcacheattack, sgxpectre].
TimeCrypt is a new system that augments time series data stores with support for encrypted data processing. TimeCrypt provides a new set of primitives, notably a new encryption scheme tailored for stream data that enables fast statistical range queries over large volumes of encrypted data and empowers data owners to cryptographically restrict the scope of stream queries based on their privacy preferences and access policies. Our evaluation on various large-scale workloads shows that the overhead of TimeCrypt is close to that of operating on plaintext data, demonstrating the feasibility of providing high performance and strong confidentiality guarantees when operating on large-scale sensitive time series data.
Appendix A
A.1 TimeCrypt Encryption
In this section, we analyze and prove the security of TimeCrypt’s cryptographic construction for encrypting the chunk digest. We first outline the basic cryptographic building blocks of our construction and give a detailed definition of our encryption scheme. Then, after analyzing the tree construction of TimeCrypt, we provide a proof of security of the proposed scheme.
Used Cryptographic Building Blocks
In our construction, we make use of the following cryptographic primitives.
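Pseudorandom Function (PRF). A function $F: \{0,1\}^{\lambda} \times \{0,1\}^{n} \to \{0,1\}^{m}$ is a PRF if there is no probabilistic polynomial-time (PPT) distinguisher that can distinguish $F_k$ from a random function drawn from the set of all functions $\{0,1\}^{n} \to \{0,1\}^{m}$ with non-negligible probability in $\lambda$, where $k$ is drawn uniformly at random from $\{0,1\}^{\lambda}$ [goldreichconstuction].
Pseudorandom Generator (PRG). $G: \{0,1\}^{s} \to \{0,1\}^{n}$ is a pseudorandom generator if $n > s$ and no probabilistic polynomial-time (PPT) distinguisher can distinguish the output $G(x)$, for uniformly random $x \in \{0,1\}^{s}$, from a uniform choice $r \in \{0,1\}^{n}$ with non-negligible probability [goldreichconstuction].
Goldreich-Goldwasser-Micali Construction. The Goldreich-Goldwasser-Micali construction shows how to construct a PRF from pseudorandom generators [goldreichconstuction]. Given a PRG $G: \{0,1\}^{s} \to \{0,1\}^{2s}$ with $G(k) = G_0(k) \,\|\, G_1(k)$, where $G_0(k)$ and $G_1(k)$ are both of length $s$, a PRF $F_k: \{0,1\}^{n} \to \{0,1\}^{s}$ can be constructed as follows.
$$F_k(x_1 x_2 \dots x_n) = G_{x_n}(\dots G_{x_2}(G_{x_1}(k)) \dots)$$
Given that $G$ is a pseudorandom generator, the above construction is a pseudorandom function.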
TimeCrypt introduces a new variant of the Castelluccia encryption scheme [castelluccia2005, castelluccia2009], which in addition to additive homomorphic computations also allows for access control. The basic idea of the Castelluccia encryption scheme is to replace the exclusive-OR operation in a standard stream cipher with modular addition. We leverage the same principle, but extend it with a key derivation function based on a binary tree (i.e., for access control) and an encoding for reducing the number of keys required for decryption on in-range aggregated ciphertexts. Let $F_k(i)$ be our key derivation function with master secret $k$ computing the $i$-th key. We define our symmetric private-key encryption scheme as:
Gen: on input $1^{\lambda}$, randomly pick $k \in \{0,1\}^{\lambda}$ for the key derivation function and set the plaintext space to $[0, M-1]$ where $M = 2^{\lambda}$.
Enc: on input $k$ and $m \in [0, M-1]$, samples $i$ uniformly at random and encrypts message $m$ as $c = (i,\; m + F_k(i) - F_k(i+1) \bmod M)$.
Dec: on input $k$ and $c = (i, c')$, decrypts ciphertext $c$ as $m = c' - F_k(i) + F_k(i+1) \bmod M$.
We observe that the above scheme is additively homomorphic if we attach to the aggregated ciphertext the parameters that were used during encryption. Note that for the analysis, we select the key identifier uniformly at random and attach it to the ciphertext. However, in our practical system, we do not attach each parameter to the ciphertext and can select the identifier based on the time counter without compromising the security. As long as the selected identifiers are unique (i.e., do not repeat), the security guarantees remain intact. Furthermore, we can reduce the number of key derivations for decryption on in-range aggregated ciphertexts by canceling out the keys in between.
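The homomorphism and the key cancellation can be illustrated with the following Python sketch. It is a sketch under stated assumptions, not our implementation: HMAC-SHA256 truncated to 64 bits stands in for the tree-based key derivation function, the modulus is fixed to $2^{64}$, the identifier is a consecutive counter, and the function names are hypothetical.

```python
import hashlib
import hmac

M = 2 ** 64  # plaintext/ciphertext modulus assumed for this sketch


def prf(key: bytes, i: int) -> int:
    """Stand-in key derivation: HMAC-SHA256 truncated to 64 bits."""
    mac = hmac.new(key, i.to_bytes(8, "big"), hashlib.sha256).digest()
    return int.from_bytes(mac[:8], "big")


def enc(key: bytes, m: int, i: int) -> int:
    """Encrypt m under counter i with the key pair (k_i, k_{i+1})."""
    return (m + prf(key, i) - prf(key, i + 1)) % M


def add(ciphertexts) -> int:
    """Server-side aggregation: plain modular addition of ciphertexts."""
    return sum(ciphertexts) % M


def dec_range(key: bytes, c: int, first: int, last: int) -> int:
    """Decrypt an aggregate over counters first..last (inclusive).

    The inner keys cancel pairwise, so only k_first and k_{last+1} are needed.
    """
    return (c - prf(key, first) + prf(key, last + 1)) % M
```

For a contiguous range, the sum of the ciphertexts telescopes to the sum of the messages plus the first key minus the key after the last one. Decryption therefore costs two key derivations, one subtraction, and one addition, independent of the range length.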
To prove the CPA-security of our scheme, we first analyze our key derivation function based on a tree data structure and show that the function is similar to a pseudorandom function. In a second step, we show that our encoding in the encryption step, which requires two evaluations of a pseudorandom function, can be reduced to a single pseudorandom function. Finally, we prove the CPA-security of the simplified scheme, which is similar to the scheme analyzed in [castelluccia2009].
In our system, we use a tree-based key derivation function $F$, which allows for access control. We define the function $F: \{0,1\}^{\lambda} \times \{0,1\}^{h} \to \{0,1\}^{\lambda}$, which derives keys based on a binary tree with height $h$. Keys are derived from the leaf nodes of the tree. Each node in the tree has a unique label in $\{0,1\}^{\le h}$ and an associated tree-key in $\{0,1\}^{\lambda}$. We define the label of each node in the following manner. The root node of the tree has the label $\epsilon$, the empty string, whereas the left and right children of a node with label $l$ have the labels $l \,\|\, 0$ and $l \,\|\, 1$, respectively. Hence, the leaf nodes are indexed by their label $x = x_1 x_2 \dots x_h$ where $x_j \in \{0,1\}$. We refer to a leaf node by its label $x$.
Each node in the tree with label $l$ has a tree-key $k_l$. The root node of the tree has a randomly chosen tree-key $k_\epsilon \in \{0,1\}^{\lambda}$, which corresponds to the input key of the key derivation function. To derive the keys for the children of a node, a pseudorandom generator $G: \{0,1\}^{\lambda} \to \{0,1\}^{2\lambda}$ is used. Let $G_0$ and $G_1$ be defined as $G(k) = G_0(k) \,\|\, G_1(k)$, where $|G_0(k)| = |G_1(k)| = \lambda$. The left and right child of a node with tree-key $k_l$ are computed as $k_{l \| 0} = G_0(k_l)$ and $k_{l \| 1} = G_1(k_l)$. Hence, the tree-key of a leaf node $x$ is constructed as follows.
$$k_x = G_{x_h}(\dots G_{x_2}(G_{x_1}(k_\epsilon)) \dots)$$
Note that if a tree-key is revealed, it is easy to compute the tree-keys of its children, but two children tree-keys do not reveal any information about the parent tree-key. This property allows for access control.
To derive the key for input value $x$, the function $F$ computes the tree-key of the leaf node $x$ with root-node key $k$ and outputs $k_x$. Hence, the function is defined as
$$F_k(x) = G_{x_h}(\dots G_{x_2}(G_{x_1}(k)) \dots)$$
Lemma 1. $F$ is a pseudorandom function.
Proof. This directly follows from the definition of the Goldreich-Goldwasser-Micali construction because $F$ has an identical definition.
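The tree-based derivation and its use for access control can be sketched in Python as follows. This is an illustrative sketch under assumptions: SHA-256 split into two 128-bit halves stands in for the length-doubling pseudorandom generator (our prototype relies on AES-NI instead), and the function names are hypothetical.

```python
import hashlib


def G(tree_key: bytes):
    """Length-doubling step: 16-byte key -> (left child, right child)."""
    digest = hashlib.sha256(tree_key).digest()
    return digest[:16], digest[16:]


def derive(tree_key: bytes, path_bits):
    """Walk down the tree: bit 0 selects the left child, bit 1 the right."""
    for bit in path_bits:
        tree_key = G(tree_key)[bit]
    return tree_key


def leaf_bits(index: int, height: int):
    """Label of leaf `index` as a list of bits, most significant bit first."""
    return [(index >> (height - 1 - depth)) & 1 for depth in range(height)]


def leaf_key(root_key: bytes, index: int, height: int) -> bytes:
    """Key for leaf `index`, costing `height` generator evaluations."""
    return derive(root_key, leaf_bits(index, height))
```

Sharing the tree-key of an inner node grants access to exactly the leaves below it. For example, with height 4 the node with label 01 covers leaves 4 to 7, and a principal holding only that tree-key can derive the key of leaf 6 with `derive(token, [1, 0])`, but cannot move up towards the root.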
Proof of Security
Lemma 2. Given any sequence of distinct -bit values where , then there are at least functions such that for all , where .
Proof. We use a similar proof construction as presented in [prfaddproof].
Let and be the selected sets and non-empty.
We show how to enumerate all functions .
For all we can set the function result of to any -bit string independently,
whereas the remaining values can be computed as where is the smallest -bit string such that .
Note that the addition and the comparison are all modulo and (i.e., s is the smallest -bit string, which is greater than ).
With this construction of , we can verify that for all since each intermediate result cancels out.
Since , we first have choices for the argument of the function and for each choice possible results, which accumulates to total of possible functions.
Corollary 1. Let and be uniform random functions on and define function as . For any oracle algorithm that is bound to queries to the oracle, the distribution of the algorithm for or as the oracle is the same.
Proof. This follows directly from Lemma 1, since the number of queries of is bound to .
Theorem 1. If function is a pseudorandom function and the corresponding distinguisher is bound to queries then the function defined as is a pseudorandom function with a distinguisher bound to queries.
Proof. Let be a PPT adversary, which can distinguish from a random function with non-negligible probability with queries to the oracle. We show with a proof by reduction that given , we can construct a polynomial-time distinguisher emulating , which can distinguish the PRF from a random function with non-negligible probability. has access to an oracle function and emulates to decide if is pseudorandom or random by outputting a bit . We construct as follows.
Whenever performs query with value , queries the oracle for , and responds with .
If outputs the bit , the distinguisher outputs the bit .
If runs in polynomial time also runs in polynomial time. We can observe that given attacker has a non-negligible advantage in distinguishing from random, the distinguisher can distinguish with the same probability, which follows from Lemma 1 and Corollary 1. The intuition is that a distinguisher distinguishing from a function has the same distribution as a distinguisher distinguishing from as long as the distinguisher is bound to queries. The distinguisher requires queries to the oracle if is bounded by queries. Hence, if has an advantage , has an advantage in distinguishing from .
With Lemma 1 and Theorem 1, we can simplify the encryption and decryption functions in our scheme to the following.
Enc’: on input , samples uniformly at random and encrypts message as .
Dec’: on input , decrypts ciphertext as .
Theorem 2. If F is a pseudorandom function and then the encryption scheme is CPA-secure.
To prove the security of this construction, we first look at a hypothetical encryption scheme, which replaces the pseudorandom function with a randomly chosen function from the same domain. We argue with a proof by reduction that an attacker has only a negligibly higher success probability in breaking the scheme with a pseudorandom function compared to a truly random function . In a final step, we analyze an attacker for the scheme with a completely random function R.
Proof. Given scheme we consider a second scheme , which replaces the pseudorandom function with a random function .
Gen*: on input , choose a uniform random function and set the plaintext space to where .
Enc*: on input , samples uniformly at random and encrypts message as .
Dec*: on input , decrypts ciphertext as .
We first prove that a PPT adversary has only a negligible advantage in breaking scheme compared to . We prove this by reduction assuming there exists a PPT adversary that has a non-negligible advantage in the CPA game with scheme in comparison to . With the attacker , we can construct a distinguisher , which distinguishes a pseudorandom function from a random function with non-negligible probability. In the reduction, has access to an oracle function , and determines if this function is random or pseudorandom by emulating the attacker A. We construct the distinguisher as follows:
Whenever performs a query to the encryption oracle with message , samples uniformly at random, queries with response and responds to the attacker with .
When gives two messages as an output, the distinguisher chooses a bit and samples uniformly at random, queries with response and responds with .
When outputs the bit , the distinguisher D outputs 1 if or 0 otherwise.
We can make two observations given the distinguisher . If the oracle function is a pseudorandom function, the distinguisher has the same distribution as the attacker in the CPA experiment with the scheme . Similarly, if the oracle is a random function, the distinguisher has the same distribution as the attacker in the CPA experiment with the scheme . Hence, given is a pseudorandom function, the probability advantage of adversary in succeeding in the CPA game with scheme over the scheme is the same success probability a distinguisher has in distinguishing a pseudorandom function from a random function.
In the second part of the proof, we analyze the scheme assuming a random function . In the CPA game, the attacker can first query the encryption oracle times before the challenge. Assuming the challenge is computed as , where denotes the chosen random string, there are two possible outcomes. In the first case, never occurred as a choice in the querying phase. Since is truly random, has not learned anything in the querying phase. As a result, the probability that outputs during the challenge is because the output is uniformly distributed and independent. The resulting ciphertext is the addition of and modulo . Since and is uniformly distributed, we can directly see that the probability is for distinguishing two encrypted messages.
If the attacker observes in the querying phase, the attacker can determine which message was encrypted. Due to the observation of the ciphertext , the attacker learns that , which may be used to distinguish the encrypted message. The probability that the attacker observes is smaller than , if is uniformly drawn from and the number of queries of the attacker is bounded by the polynomial function .
By combining the results from both cases and assuming , we can bound the success probability of the attacker in the CPA game with scheme by the probability .
Using the findings from the first part of the proof, we can bound the probability of the attacker in the CPA game with scheme by the probability , where is a negligible function. Because is a negligible function and the addition of two negligible functions is again negligible, we complete the proof.
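For reference, the two-part argument above follows the standard shape of a CPA proof for PRF-based encryption. The summary below is a sketch under stated assumptions, not a restatement of the exact formulas: the symbols $\Pi$ (scheme with the pseudorandom function), $\tilde{\Pi}$ (scheme with a truly random function), $q(n)$ (polynomial bound on the number of encryption queries), and an identifier space of size $2^n$ are our notation.

\[
\Pr\big[\mathcal{A} \text{ succeeds against } \tilde{\Pi}\big]
\;\le\; \underbrace{\frac{q(n)}{2^{n}}}_{\text{challenge identifier seen before}}
\;+\; \underbrace{\frac{1}{2}}_{\text{identifier is fresh}}
\]
\[
\Pr\big[\mathcal{A} \text{ succeeds against } \Pi\big]
\;\le\; \frac{1}{2} + \frac{q(n)}{2^{n}} + \mathrm{negl}(n)
\]

The first line is a union bound over the at most $q(n)$ queries, each of which matches the challenge identifier with probability $2^{-n}$. The second line adds the negligible distinguishing advantage from the first part of the proof.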
Length-Matching Hash Function. In our proof, we assume that the plaintext space matches the security parameter (i.e., ). However, if we only want to encrypt 64-bit integers with 128-bit security, we would have an overhead in the ciphertext size of 64 bits per ciphertext. To match the output of a PRF to the desired bits of , one could use a length-matching hash function, which is also used and analyzed in the Castelluccia encryption scheme [castelluccia2009]. A length-matching hash function must have the property that if is uniformly distributed over then should be uniformly distributed over . Note that is not a cryptographic hash function (i.e., no collision resistance is required). One possible construction of is to split the output of the PRF into substrings of the desired length and XOR them together [castelluccia2009]. Since again outputs uniformly distributed strings with smaller length, the security proof only needs a few modifications if is applied to each output of the PRF.
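The split-and-XOR construction described above can be sketched as follows. This is a minimal illustration, not the TimeCrypt implementation: the function names `length_matching_hash` and `prf` are hypothetical, and HMAC-SHA256 truncated to 128 bits is only an assumed stand-in for the PRF.

```python
import hashlib
import hmac


def length_matching_hash(prf_output: bytes, out_bytes: int) -> bytes:
    """Fold prf_output down to out_bytes by XORing equal-length substrings."""
    if out_bytes <= 0 or len(prf_output) % out_bytes != 0:
        raise ValueError("PRF output length must be a positive multiple of out_bytes")
    folded = bytearray(out_bytes)
    for start in range(0, len(prf_output), out_bytes):
        chunk = prf_output[start:start + out_bytes]
        for i, byte in enumerate(chunk):
            folded[i] ^= byte
    return bytes(folded)


def prf(key: bytes, identifier: int) -> bytes:
    """Assumed stand-in PRF with 128-bit output (HMAC-SHA256, truncated)."""
    mac = hmac.new(key, identifier.to_bytes(8, "big"), hashlib.sha256)
    return mac.digest()[:16]


# Fold a 128-bit PRF output into a 64-bit pad for 64-bit plaintexts.
pad = length_matching_hash(prf(b"k" * 16, 7), 8)
```

If the input is uniform over 128-bit strings, each output bit is the XOR of two independent uniform bits, so the 64-bit result is uniform as well, which is the only property required here.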
Selection of the Key Identifier. Our formal construction samples the identifier , which serves as an input for the PRF, uniformly at random from , to prove the CPA-security. In our system, is the identifier for the key being derived and represents a time counter. As long as each identifier is only used once in the encryption process per stream (we keep the state on the client side), the scheme remains secure. Furthermore, the total number of keys can be selected according to the upper bound of chunks to be encrypted for a stream.
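To illustrate the role of single-use identifiers, the following sketch encrypts by adding a PRF output to the message modulo the plaintext-space size and uses a client-side counter as the identifier. It rests on assumptions: the modulus `M = 2**64`, the helper names, and HMAC-SHA256 as the PRF are ours, and the ciphertext form (message plus pad, modulo `M`) is our reading of the construction analyzed above rather than the system's actual encoding.

```python
import hashlib
import hmac

M = 2 ** 64  # assumed plaintext modulus (64-bit integers)


def pad(key: bytes, identifier: int) -> int:
    """Assumed PRF: HMAC-SHA256 truncated to 64 bits, read as an integer."""
    digest = hmac.new(key, identifier.to_bytes(8, "big"), hashlib.sha256).digest()
    return int.from_bytes(digest[:8], "big")


def encrypt(key: bytes, identifier: int, message: int) -> int:
    return (message + pad(key, identifier)) % M


def decrypt(key: bytes, identifiers: list, ciphertext: int) -> int:
    """Remove the pads of every identifier that contributed to the ciphertext."""
    total_pad = sum(pad(key, i) for i in identifiers)
    return (ciphertext - total_pad) % M


class StreamEncryptor:
    """Keeps the counter on the client so each identifier is used only once."""

    def __init__(self, key: bytes):
        self.key = key
        self.counter = 0

    def encrypt_next(self, message: int):
        identifier = self.counter
        self.counter += 1
        return identifier, encrypt(self.key, identifier, message)


key = b"\x01" * 16
enc = StreamEncryptor(key)
id1, c1 = enc.encrypt_next(20)
id2, c2 = enc.encrypt_next(22)
# Ciphertexts can be added without the key; decryption removes both pads.
total = decrypt(key, [id1, id2], (c1 + c2) % M)
```

Reusing an identifier would apply the same pad twice, so the difference of the two ciphertexts would reveal the difference of the two messages. This is why the counter state must never be rewound.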
A.2 Key Regression
A key regression scheme [fu:keyregression] enables efficient sharing of past keys. If an entity is in possession of the key regression state , the entity can derive all keys with for . However, the entity is not able to infer any information about the keys with . In the following, we describe how to construct a key regression scheme and then define how this can be used to create the dual-key regression scheme. Let denote a function that returns the most significant bits of and a function that returns the least significant bits.
Single-Key Regression Construction
Using a pseudorandom generator , a client can construct a key regression scheme as follows. In the first step, the client generates all the possible states in reverse order from an initial, randomly chosen seed . The seed is computed as . To derive key from the corresponding state , the client computes (i.e., applies the key derivation function). For sharing the keys up to the -th key, the client shares state with the other entity. With state , the entity can compute all previous states with by applying the pseudorandom generator function . Because of the one-way property of , the entity is not able to compute or infer any information about or any with . Since the entity is in possession of states , the entity can derive the keys with the key derivation function.
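The single-key regression chain can be sketched as below. Several choices are assumptions made for illustration: SHA-512 stands in for the length-doubling pseudorandom generator, the previous state is taken from the most significant half of its output and the key from the least significant half, and all function names are hypothetical.

```python
import hashlib

STATE_BYTES = 32


def prg(state: bytes) -> bytes:
    """Assumed stand-in for a length-doubling PRG (32 bytes in, 64 bytes out)."""
    return hashlib.sha512(state).digest()


def previous_state(state: bytes) -> bytes:
    """Most significant half of the PRG output (assumed state transition)."""
    return prg(state)[:STATE_BYTES]


def derive_key(state: bytes) -> bytes:
    """Least significant half of the PRG output (assumed key derivation)."""
    return prg(state)[STATE_BYTES:]


def generate_states(seed: bytes, n: int) -> list:
    """Generate n states in reverse order; the seed becomes the last state."""
    states = [seed]
    for _ in range(n - 1):
        states.append(previous_state(states[-1]))
    states.reverse()
    return states


def keys_up_to(shared_state: bytes, index: int) -> list:
    """Given state `index`, recompute keys 0..index by walking backwards."""
    keys = []
    state = shared_state
    for _ in range(index + 1):
        keys.append(derive_key(state))
        state = previous_state(state)
    keys.reverse()
    return keys


states = generate_states(b"\x00" * STATE_BYTES, 5)
shared_keys = keys_up_to(states[3], 3)  # recipient learns keys 0..3, not key 4
```

The recipient of `states[3]` can only move towards earlier states. Reaching `states[4]` would require inverting the generator.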
Dual-Key Regression Construction
The key regression scheme based on a single series of states has the drawback that given the current state an entity can compute all the previous states and keys. Hence, a client is not able to define a lower bound to restrict access on past keys (e.g., , ). To overcome this problem, the idea is to combine two sequences of states to derive the keys, as introduced in our prior work [droplet]. We denote the -th state of the first sequence as and the second sequence as for where is the length of each sequence.
In the bootstrapping phase, the client generates the states as previously from a randomly chosen seed and computes the other states . To create the possibility for a lower restriction level, the second sequence is generated from the opposite direction. The second sequence starts with the random seed and the corresponding next state is computed as . To derive the key where , the states and serve as an input to the key derivation function which is defined as . If an entity is in possession of state and where , the entity can compute the states and with . Since pairs of states are required for deriving the keys, the entity can only compute the keys for which it possesses the corresponding state pairs. Considering the states computed above, the entity knows the state pairs and can compute but no other keys. Therefore, the dual key regression scheme enables access restriction based on ranges of keys by sharing the corresponding state of each state sequence.
Note that it is not possible to share two distinct intervals of keys, since all states can be computed in one direction of each state sequence. To share two distinct intervals of keys new sequences must be generated.
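The dual-key regression construction can be sketched in the same style. This is an illustration under assumptions, not the scheme's exact definition: SHA-256 stands in for the one-way state transition of both sequences, `kdf` is a hypothetical key derivation function over a state pair, and indices start at 0.

```python
import hashlib


def step(state: bytes) -> bytes:
    """Assumed one-way state transition (stand-in for the PRG step)."""
    return hashlib.sha256(state).digest()


def kdf(first_state: bytes, second_state: bytes) -> bytes:
    """Hypothetical key derivation from one state of each sequence."""
    return hashlib.sha256(b"key" + first_state + second_state).digest()


def build_sequences(seed_first: bytes, seed_second: bytes, n: int):
    """First sequence is generated backwards from its seed, second forwards."""
    first = [seed_first]
    for _ in range(n - 1):
        first.append(step(first[-1]))
    first.reverse()  # first[n - 1] is the seed; stepping moves to lower indices
    second = [seed_second]
    for _ in range(n - 1):
        second.append(step(second[-1]))  # stepping moves to higher indices
    return first, second


def keys_from_token(token, low: int, high: int) -> dict:
    """Derive keys low..high from the pair (first[high], second[low])."""
    first_high, second_low = token
    first_states = [first_high]
    second_states = [second_low]
    for _ in range(high - low):
        first_states.append(step(first_states[-1]))
        second_states.append(step(second_states[-1]))
    first_states.reverse()  # now ordered low..high
    return {low + j: kdf(first_states[j], second_states[j])
            for j in range(high - low + 1)}


first, second = build_sequences(b"A" * 32, b"B" * 32, 10)
token = (first[6], second[3])          # grants access to keys 3..6 only
granted = keys_from_token(token, 3, 6)
```

The holder of the token can compute first-sequence states only for indices at most 6 and second-sequence states only for indices at least 3, so complete state pairs exist exactly for the interval 3..6.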
- TimeCrypt code is available at https://timecrypt.io.
- TimeCrypt runs the compression algorithm that yields the best results for the underlying data. For instance, delta encoding might be highly effective for low precision data with many identical values, but less effective for high precision data with large deltas. TimeCrypt supports various lossless compression techniques, with zlib as default.
- E.g., Seabed requires seconds to process an aggregate query over one billion data records with 100 cores [seabed], whereas TimeCrypt can process such a query within a few milliseconds on a single machine.