The Design and Implementation of a Rekeying-aware Encrypted Deduplication Storage System
Rekeying refers to an operation of replacing an existing key with a new key for encryption. It renews security protection, so as to protect against key compromise and enable dynamic access control in cryptographic storage. However, it is non-trivial to realize efficient rekeying in encrypted deduplication storage systems, which use deterministic content-derived encryption keys to allow deduplication on ciphertexts. We design and implement REED, a rekeying-aware encrypted deduplication storage system. REED builds on a deterministic version of all-or-nothing transform (AONT), such that it enables secure and lightweight rekeying, while preserving the deduplication capability. We propose two REED encryption schemes that trade between performance and security, and extend REED for dynamic access control. We implement a REED prototype with various performance optimization techniques and demonstrate how we can exploit similarity to mitigate key generation overhead. Our trace-driven testbed evaluation shows that our REED prototype maintains high performance and storage efficiency.
Data explosion has raised a scalability challenge for cloud storage management. For example, Aberdeen Research reports that the average size of backup data for a medium-size enterprise is 285TB, with an annual growth rate of about 24-27%. Deduplication is one plausible solution for making storage management scalable. Its idea is to eliminate the storage of redundant messages that have identical content, by keeping only one message copy and referring all other redundant messages to that copy through small-size pointers. Deduplication has been shown to effectively reduce storage space for some workloads, such as backup data. It has also been deployed in today’s commercial cloud storage services (e.g., Dropbox, Google Drive, Bitcasa, Mozy, and Memopal) to save maintenance costs.
To protect against content leakage of outsourced data, cloud users often want to store encrypted data in the cloud. Traditional symmetric encryption is incompatible with deduplication: it assumes that users encrypt messages with their own distinct keys, and hence identical messages of different users will lead to distinct ciphertexts and prohibit deduplication. Bellare et al.  define a cryptographic primitive called message-locked encryption (MLE), which derives the encryption key from the message itself through a uniform derivation function, so that the same message deterministically returns the same ciphertext through symmetric encryption. One well-known instantiation of MLE is convergent encryption (CE) , which uses the cryptographic hash of the message content as the derivation function. Storage systems that realize CE or MLE have been extensively studied and evaluated in the literature (e.g., [4, 23, 61, 67, 6, 13]). We collectively refer to them as encrypted deduplication storage systems, which encrypt the stored data while preserving the deduplication capability.
However, existing encrypted deduplication storage systems do not address rekeying, an operation that replaces an existing key with a new key so as to renew security protection. Rekeying is critical not only for protecting against key compromise that has been witnessed in real-life accidents [25, 40, 63], but also for enabling dynamic access control to revoke unauthorized users from accessing data in cryptographic storage [39, 30, 10, 51]. However, realizing efficient rekeying in encrypted deduplication storage is challenging. Since the encryption key of each message in MLE is obtained from a deterministic derivation function (e.g., a hash function), if we renew the key by renewing the derivation function, any newly stored message encrypted by the new key can no longer be deduplicated with the existing identical message; if we re-encrypt all existing messages with the new key obtained from the renewed derivation function, there will be tremendous performance overheads for processing large quantities of messages.
This paper presents REED, a rekeying-aware encrypted deduplication storage system that aims for secure and lightweight rekeying, while preserving identical content for deduplication. REED augments MLE with the idea of all-or-nothing transform (AONT) , which transforms a secret into a package, such that the secret cannot be recovered without knowing the entire package. REED constructs a package based on a deterministic variant of AONT  and encrypts a small part of the package with a key that is subject to rekeying, while the remaining large part of the package still preserves identical content for deduplication. We show that this approach enables secure and lightweight rekeying, and simultaneously maintains deduplication effectiveness. The contributions of this paper are summarized as follows.
We propose two encryption schemes for REED, namely basic and enhanced, that trade between performance and security. Both schemes enable lightweight rekeying, while the enhanced scheme is resilient against key leakage through a more expensive encryption than the basic scheme.
We exploit the similarity property that is commonly found in backup workloads  to mitigate the overhead of MLE key generation, while preserving deduplication effectiveness.
We implement a proof-of-concept REED prototype. Our REED prototype leverages various performance optimization techniques to mitigate both computational and I/O overheads.
We conduct extensive trace-driven evaluation on our REED prototype in a LAN testbed. REED achieves lightweight rekeying: it takes only 3.4s to re-encrypt an 8GB file with a new key (in active revocation), and it maintains high storage saving (e.g., above 97%) on real-world datasets. We also demonstrate the effectiveness of exploiting similarity in mitigating key generation overhead.
The source code of our REED prototype is now available for download at the following website: http://ansrlab.cse.cuhk.edu.hk/software/reed.
The remainder of the paper proceeds as follows. Section 2 motivates the need for rekeying in encrypted deduplication storage. Section 3 defines our threat model and security goals. Section 4 presents the design of REED. Section 5 explains how we exploit similarity in REED to mitigate MLE key generation overhead. Section 6 presents the implementation details of REED. Section 7 presents our evaluation results. Section 8 reviews related work, and finally, Section 9 concludes the paper.
2 Background and Motivation
2.1 Encrypted Deduplication Storage
Deduplication exploits content similarity to achieve storage efficiency. Each message is identified by a fingerprint, computed as a cryptographic hash of the content of the message. We assume that two messages are identical (distinct) if their fingerprints are identical (distinct), and that the fingerprint collision of two distinct messages has a negligible probability in practice . Deduplication stores only one copy of identical messages, and refers any other identical message to the copy using a small-size pointer. In this paper, we focus on chunk-level deduplication, which divides file data into fixed-size or variable-size chunks, and removes duplicates at the granularity of chunks. We use the terms “messages” and “chunks” interchangeably to refer to the data units operated by deduplication.
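For illustration, the fingerprint-based deduplication described above can be sketched as a toy in-memory store. This is a minimal sketch, not REED's implementation: SHA-256 serves as the fingerprinting hash, and the class and method names are our own.

```python
import hashlib

class DedupStore:
    """Toy chunk store: keeps exactly one copy per unique fingerprint."""
    def __init__(self):
        self.chunks = {}                         # fingerprint -> stored chunk

    def put(self, chunk: bytes) -> str:
        fp = hashlib.sha256(chunk).hexdigest()   # fingerprint of the content
        if fp not in self.chunks:                # store only the first copy
            self.chunks[fp] = chunk
        return fp                                # small-size pointer to the copy

    def get(self, fp: str) -> bytes:
        return self.chunks[fp]

store = DedupStore()
p1 = store.put(b"identical content")
p2 = store.put(b"identical content")             # duplicate: not stored again
p3 = store.put(b"different content")
assert p1 == p2 and len(store.chunks) == 2
assert store.get(p1) == b"identical content"
```

The pointer returned by `put` plays the role of the small-size reference that replaces each redundant message copy.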
Message-locked encryption (MLE)  is a cryptographic primitive that provides confidentiality guarantees for deduplication storage. It applies symmetric encryption to encrypt a message with a key called the MLE key that is derived from the message itself, so as to produce a deterministic ciphertext. Two identical (distinct) messages will lead to identical (distinct) ciphertexts, so deduplication remains plausible. A special case of MLE is convergent encryption (CE) , which directly uses the message’s fingerprint as the MLE key.
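The determinism of CE can be sketched as follows. This is an illustrative stand-in, not the paper's implementation: a SHA-256 counter-mode keystream replaces a real symmetric cipher such as AES-256, and all function names are ours.

```python
from hashlib import sha256

def _keystream(key: bytes, length: int) -> bytes:
    # Counter-mode keystream from SHA-256; a stand-in for AES-256 here.
    out = b""
    ctr = 0
    while len(out) < length:
        out += sha256(key + ctr.to_bytes(8, "big")).digest()
        ctr += 1
    return out[:length]

def ce_encrypt(message: bytes) -> bytes:
    key = sha256(message).digest()        # MLE key = fingerprint of the message
    return bytes(a ^ b for a, b in zip(message, _keystream(key, len(message))))

def ce_decrypt(ciphertext: bytes, key: bytes) -> bytes:
    return bytes(a ^ b for a, b in
                 zip(ciphertext, _keystream(key, len(ciphertext))))

# Identical plaintexts yield identical ciphertexts, so dedup still applies.
c1 = ce_encrypt(b"same chunk")
c2 = ce_encrypt(b"same chunk")
assert c1 == c2
assert ce_decrypt(c1, sha256(b"same chunk").digest()) == b"same chunk"
```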
However, MLE (including CE) is inherently vulnerable to brute-force attacks. Suppose that a target message is known to be drawn from a finite space. Then an adversary can sample all messages, derive the MLE key of each message, and compute the corresponding ciphertexts. If one of the computed ciphertexts equals the ciphertext of the target message, then the adversary can deduce the target message. Thus, MLE achieves security only for unpredictable messages , meaning that the number of candidate messages is so large that the adversary cannot feasibly check all messages against the ciphertexts.
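The offline brute-force attack above can be demonstrated concretely. The sketch below models only the relevant property of MLE (equal plaintexts map to equal ciphertexts under a message-derived key); the candidate "salary" messages are a hypothetical predictable space of our own choosing.

```python
from hashlib import sha256

def ce_encrypt(message: bytes) -> bytes:
    # Deterministic MLE-style encryption (key = H(message)); a SHA-256
    # keystream stands in for a real cipher.
    key = sha256(message).digest()
    ks = b"".join(sha256(key + i.to_bytes(8, "big")).digest()
                  for i in range(len(message) // 32 + 1))
    return bytes(a ^ b for a, b in zip(message, ks))

# The target is drawn from a small, predictable space, so the adversary can
# simply try every candidate offline and compare ciphertexts.
candidates = [b"salary: %d" % s for s in range(0, 100000, 5000)]
target_ciphertext = ce_encrypt(b"salary: 35000")  # observed by the adversary

recovered = next(m for m in candidates if ce_encrypt(m) == target_ciphertext)
assert recovered == b"salary: 35000"
```

For unpredictable messages the candidate space is infeasibly large, which is exactly the security boundary of plain MLE.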
To address the unpredictability assumption, DupLESS  implements server-aided MLE. It uses a dedicated key manager to generate an MLE key for a message based on two inputs: the message’s fingerprint and a system-wide secret that is independent of the message content. If the key manager is secure, then the ciphertexts appear to be encrypted with the keys that are derived from a random key space. This provides confidentiality guarantees even for predictable messages. Even if the key manager is compromised, DupLESS still achieves confidentiality for unpredictable messages. To make MLE key generation robust, DupLESS introduces two mechanisms. First, it uses the oblivious pseudo-random function (OPRF)  to “blind” a fingerprint to be processed by the key manager, such that the key manager can return the MLE key without knowing the original fingerprint. Second, the key manager rate-limits the key generation requests to protect against online brute-force attacks.
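The server-aided derivation can be sketched with an HMAC-based PRF. This is a simplification under a stated assumption: DupLESS uses an RSA-based OPRF so that the key manager never sees the fingerprint, whereas here the blinding step is omitted and the secret value is illustrative.

```python
import hashlib
import hmac

# Illustrative only: the system-wide secret lives solely at the key manager.
KEY_MANAGER_SECRET = b"system-wide secret, never leaves the key manager"

def key_manager_derive(fingerprint: bytes) -> bytes:
    # Server-aided MLE key: a PRF over the fingerprint, keyed with a secret
    # that is independent of the message content. (DupLESS additionally
    # blinds the fingerprint with an OPRF; that step is omitted here.)
    return hmac.new(KEY_MANAGER_SECRET, fingerprint, hashlib.sha256).digest()

fp = hashlib.sha256(b"some chunk").digest()
assert key_manager_derive(fp) == key_manager_derive(fp)   # deterministic
# Without the secret, the key cannot be derived offline from the message:
assert key_manager_derive(fp) != hashlib.sha256(fp).digest()
```

Because the derivation stays deterministic per message, identical chunks still receive identical keys and ciphertexts, preserving deduplication.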
In this work, we focus on encrypted deduplication storage based on server-aided MLE. Like DupLESS, we deploy a dedicated key manager that is responsible for MLE key generation, so as to be secure against brute-force attacks.
We define rekeying as the generic process of updating an old key to a new key in encrypted storage, such that the old key will be revoked, and all subsequently stored files will be encrypted by the new key. We argue that rekeying is critical for renewing security protection for encrypted deduplication storage in two aspects: key protection and access revocation.
There have been real-life cases that indicate how adversaries make key compromise plausible through various system vulnerabilities, such as design flaws [25, 59, 40] and programming errors . These threats also apply to storage systems, since adversaries can compromise file encryption keys and recover all encrypted files. In addition to key compromise, every cryptographic key in use is associated with a lifetime, and needs to be replaced once the key reaches the end of its lifetime . Rekeying is thus critical for key protection. By immediately updating the compromised or expired keys, we ensure that the stored files remain protected by the new keys.
Since deduplication implies the sharing of data across multiple files and users, rekeying in encrypted deduplication storage is more critical than traditional encrypted storage without deduplication. In particular, the security of a message depends on its MLE key. The leakage of the MLE key may imply the compromise of multiple files that share the corresponding message.
Organizations increasingly outsource large-scale projects to cloud storage providers for efficient management. We consider a special case in genome research. Genome researchers increasingly leverage cloud services for genome data storage due to the huge volume of genome datasets . Some cloud services, such as Google Genomics  and Amazon , have also set up specific platforms for organizing and analyzing genome information. With deduplication, the storage of genome data can be significantly reduced, for example, by 83% in real deployment . However, some genome datasets, such as those produced by disease sequencing projects, are potentially identifiable and must be protected. Thus, dataset owners must properly protect the deduplicated genome data with encryption and multiple dimensions of access control . When a researcher leaves a genome project, it is necessary to revoke the researcher’s access privilege to the genome data.
Rekeying can be used to revoke users’ access rights by re-encrypting ciphertexts (e.g., the genome data in the previous example) with new keys and making old keys inactive. There are two revocation approaches for existing stored data : (i) lazy revocation, in which re-encryption of a stored file is deferred until the next update to the file, and (ii) active revocation, in which the stored files are immediately re-encrypted with the new key for up-to-date protection, at the expense of incurring additional performance overheads.
Enabling rekeying in encrypted deduplication storage is a non-trivial issue. MLE keys are often derived from messages via a global key derivation function, such as a hash function in CE  or a keyed pseudo-random function in DupLESS . A straightforward rekeying approach is to update the key derivation function directly. However, this approach compromises deduplication. Specifically, a new message cannot be deduplicated with the existing identical message, because the messages are now encrypted with different MLE keys that are derived from different derivation functions. If we re-encrypt all existing messages with new MLE keys, the re-encryption overhead will be significant due to the high volume of stored data.
There are other possible rekeying approaches, but we argue that they have limitations. One approach is based on layered encryption [6, 53]. Each deduplicated message is first encrypted with its MLE key, and the MLE key is further encrypted with a master key associated with each user. The security now builds on the master key. Rekeying can simply be done by updating the master key, and re-encrypting the MLE key with the new master key. This approach does not change the MLE key, so any new message can be deduplicated with the existing identical message. Its drawback is that every ciphertext remains encrypted by the same MLE key. If an MLE key is leaked, then the corresponding message can be identified. Another approach is proxy re-encryption , which transforms a ciphertext encrypted with an old key into another ciphertext encrypted with a new key. However, proxy re-encryption is a public-key primitive and is inefficient when encrypting large-size messages.
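The layered-encryption alternative can be illustrated with a toy key wrap. This sketch uses XOR with a hash of the master key purely for illustration (a real system would use an authenticated key-wrapping mode); all names are ours.

```python
import os
from hashlib import sha256

def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

# Layered encryption (the alternative discussed above): the MLE key of each
# chunk is wrapped with a per-user master key, so rekeying only re-wraps keys.
def wrap(mle_key: bytes, master_key: bytes) -> bytes:
    return xor(mle_key, sha256(master_key).digest())      # toy key wrap

def rekey(wrapped: bytes, old_master: bytes, new_master: bytes) -> bytes:
    return wrap(xor(wrapped, sha256(old_master).digest()), new_master)

mle_key = os.urandom(32)
m1, m2 = os.urandom(32), os.urandom(32)
w1 = wrap(mle_key, m1)
w2 = rekey(w1, m1, m2)
# The MLE key itself never changes, so ciphertexts still deduplicate -- but a
# leaked MLE key still exposes the chunk, which motivates REED's design.
assert xor(w2, sha256(m2).digest()) == mle_key
```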
REED is a rekeying-aware encrypted deduplication storage system designed for a single enterprise or organization in which multiple users want to outsource storage to a remote third-party cloud provider. It deploys a remote server to run deduplication on the storage workloads, and stores the unique data after deduplication in the cloud provider. We target the workloads that have high content similarity, such as backup or genome data (Section 2.2), so that deduplication can effectively remove duplicates and improve storage efficiency.
REED aims to achieve secure and lightweight rekeying, while preserving deduplication capability. In particular, it enables dynamic access control by controlling which group of users can access a file. It supports both lazy and active revocations (Section 2.2); for the latter, the stored files can be re-encrypted with low overhead.
Figure 1 presents an overview of the architecture of REED. REED follows a client-server architecture. It is composed of different entities, as described below.
In each user machine, we deploy a REED client (or client for short) as a software layer that provides a secure interface for a user to access and manage files in remote storage. To perform an upload operation, the client takes a file (e.g., a snapshot of a file system folder) as an input from its co-located user machine. It divides the file data into chunks, encrypts them, and uploads the encrypted chunks to the cloud. We assume that the file has a sufficiently large size (e.g., GB scale), and can be divided into a large number of chunks of small sizes (e.g., KB scale).
We support both fixed-size and variable-size chunking schemes. We implement variable-size chunking using Rabin fingerprinting , which takes the minimum, maximum, and average chunk sizes as inputs. We fix the minimum and maximum chunk sizes at 2KB and 16KB, respectively, and vary the average chunk size in our evaluation. In file downloads, the client reassembles collected chunks into the original file.
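The variable-size chunking step can be sketched as follows. This is a simplified content-defined chunker: a toy byte-wise rolling hash stands in for Rabin fingerprinting over a sliding window, with the same minimum/average/maximum size parameters described above.

```python
def cdc_chunks(data: bytes, min_size=2048, avg_size=8192, max_size=16384):
    """Simplified content-defined chunking. A real implementation uses Rabin
    fingerprinting over a sliding window; a toy rolling hash stands in here."""
    mask = avg_size - 1            # boundary when low bits of the hash are zero
    chunks, start, h = [], 0, 0
    for i, b in enumerate(data):
        h = ((h << 1) + b) & 0xFFFFFFFF     # toy rolling hash
        length = i - start + 1
        if length >= max_size or (length >= min_size and (h & mask) == 0):
            chunks.append(data[start:i + 1])
            start, h = i + 1, 0
    if start < len(data):
        chunks.append(data[start:])         # final (possibly short) chunk
    return chunks

data = bytes(range(256)) * 200              # 51,200 bytes of sample data
chunks = cdc_chunks(data)
assert b"".join(chunks) == data
assert all(len(c) <= 16384 for c in chunks)
```

Because boundaries depend on content rather than offsets, an insertion near the start of a file shifts only nearby chunk boundaries, so most chunks still deduplicate.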
As in DupLESS , REED deploys a key manager to provide an interface for a client to access MLE keys for encrypted storage. Each client communicates with the key manager to perform necessary cryptographic operations. We implement server-aided MLE as in DupLESS to protect all chunks, including predictable and unpredictable ones, as well as the OPRF-based MLE key generation protocol (Section 2.1). We elaborate the key generation details in Section 5.1. Other approaches, such as blinded BLS signatures , can be used to implement MLE key generation. This work considers a single key manager, while our design can be generalized to multiple key managers for improved availability .
REED performs server-side deduplication. In the cloud, we deploy a REED server (or server for short) for storage management. The server maintains a fingerprint index that keeps track of all chunks that have been uploaded to the cloud. For each received chunk, the server checks by fingerprint whether the chunk has already been uploaded by the same or a different client. If the chunk is new, the server stores the chunk and inserts its fingerprint into the index. We can deploy multiple servers for scalability.
Finally, the server stores the encrypted chunks and metadata in the storage backend of the cloud. For example, if we choose Amazon’s cloud services, we can rent an EC2 virtual machine for a REED server, and use S3 as the storage backend.
3.2 Threat Model
We consider an honest-but-curious adversary that aims to learn the content of the files in outsourced storage. The adversary can take the following actions. First, it can compromise the cloud (including any hosted server and the storage backend) to gain full access to all stored chunks and keys. Also, it can collude with a subset of unauthorized or revoked clients, and attempt to learn the files that are beyond the access scope of the colluding clients. Furthermore, it can monitor the activities of the clients, identify the MLE keys returned by the key manager, and attempt to extract the files owned by the monitored clients.
Our threat model makes the following assumptions. We assume the communication between a client and the key manager is encrypted and authenticated (e.g., using SSL/TLS), so as to defend against any eavesdropping activity in the network. Each client and the key manager adopt oblivious key generation , so that the key manager cannot infer the fingerprint information and learn the message content. We also assume that the key manager is deployed in a fully protected zone, and an adversary cannot compromise or gain access to the key manager.
We do not consider the threat in which an adversary launches online brute-force attacks from a compromised client against the key manager, since the key manager can rate-limit the query rate of each client . REED can be deployed in conjunction with remote data checking [8, 38] to efficiently check the integrity of outsourced files against malicious corruptions. REED performs server-side deduplication to protect against the side-channel attacks mentioned in [36, 35].
3.3 Design Goals
Given the threat model, REED focuses on the following design goals.
Confidentiality: REED protects outsourced chunks, such that the chunk contents are kept secret against any honest-but-curious adversary (e.g., any unauthorized user or cloud). In addition, REED prevents revoked users from accessing any new file or update.
Integrity: REED ensures chunk-level integrity of outsourced files. When a client downloads a chunk, it can check if the chunk is intact or corrupted.
Practical rekeying: REED enables rekeying and dynamic access control, such that it can control which group of users can access a file. It supports both lazy and active revocations with low overhead; for the latter, the stored files can be efficiently re-encrypted. REED also allows an unlimited number of rekeying operations.
High storage efficiency: REED achieves storage efficiency by deduplication. In addition, it introduces small storage overhead due to keys or metadata.
High encryption performance: REED introduces limited encryption overhead compared to the cost of network transmission to the cloud.
4 REED Design
We now present the design details of REED. We propose two encryption schemes that trade between performance and security. We also demonstrate how REED realizes dynamic access control using existing primitives. Finally, we analyze the security of REED. In this section, we focus on the essential features of REED that support secure and lightweight rekeying, without considering workload characteristics; in Section 5, we exploit similarity for performance improvements.
4.1 Main Idea
REED builds security simultaneously on two types of symmetric keys: a file-level secret key per file (or file key for short) and a chunk-level MLE key for each chunk (or MLE key for short). During rekeying, REED only needs to renew the file key, while the MLE keys of all chunks remain unchanged.
REED uses all-or-nothing transform (AONT) as the underlying cryptographic primitive. AONT is an unkeyed, randomized encryption mode that transforms a message into a ciphertext called the package, with the property that it is computationally infeasible to revert the package to the original message without knowing the entire package. The original AONT design prohibits deduplication, since its transformation takes a random key as an input to construct a package. Thus, REED uses convergent AONT (CAONT), which replaces the random key with a deterministic message-derived key to construct a package. This ensures that identical messages always lead to the same package.
REED augments CAONT to enable rekeying. Our insight is to achieve security at the expense of a slight degradation in storage efficiency. The idea is based on AONT-based secure deletion, which makes an entire package unrecoverable by securely removing a small part of it. REED extends this idea to make it applicable for rekeying. Specifically, REED generates a CAONT package with the MLE key as an input, and encrypts a small part of the package, called the stub, with the file key. Thus, the entire package is protected by both the file key and the MLE key. The stub size is small; for example, our implementation sets it to 64 bytes, which is only 0.78% of an 8KB chunk. In addition, we can still apply deduplication to the remaining large part of the package, called the trimmed package, so as to maintain storage efficiency.
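The stub/trimmed-package split itself is simple to state in code. The 64-byte stub size matches the prototype described above; the function name is ours.

```python
STUB_SIZE = 64                       # bytes, as in the REED prototype

def split_package(package: bytes):
    # Last STUB_SIZE bytes -> stub (encrypted under the file key, rekeyable);
    # the rest -> trimmed package (unchanged by rekeying, so it deduplicates).
    return package[:-STUB_SIZE], package[-STUB_SIZE:]

package = bytes(8192 + 32)           # e.g., an 8KB chunk plus a 32-byte tail
trimmed, stub = split_package(package)
assert trimmed + stub == package and len(stub) == STUB_SIZE
# Stub overhead for an 8KB chunk: 64 / 8192, i.e., roughly 0.78%.
```

Rekeying thus touches only 64 bytes per chunk, which is why re-encryption under a new file key stays lightweight.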
In the following, we first design two rekeying-aware encryption schemes on a per-chunk basis (Section 4.2), followed by enabling REED with dynamic access control on a per-file basis.
4.2 Encryption Schemes
We propose the basic and enhanced encryption schemes for REED. The basic scheme is more efficient, but is vulnerable to the leakage of an MLE key. On the other hand, the enhanced scheme protects against the leakage of an MLE key, while introducing an additional encryption step. In the following, we first explain the basics of AONT  and its variant CAONT , followed by how the basic and enhanced encryption schemes build on CAONT.
All-or-nothing transform (AONT):
AONT works as follows. It transforms a message M into a package P = [C, t], where C and t are called the head and tail, respectively. Specifically, it first selects a random encryption key K and generates a pseudo-random mask G(K) = E(K, Z), where E denotes a symmetric-key encryption function (e.g., AES-256) and Z is a publicly known block with the same size as M. It then computes the head C = M ⊕ G(K), where '⊕' is the XOR operator, and the tail t = K ⊕ H(C), where H is a cryptographic hash function (e.g., SHA-256). Note that the resulting package P is larger than the original message M by the size of t. To recover the original message M, suppose that the whole package P = [C, t] is known. We first compute K = t ⊕ H(C), followed by computing M = C ⊕ G(K).
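The transform can be sketched directly from these equations. This is a minimal illustration, not the prototype's code: a SHA-256 counter-mode keystream stands in for the AES-256 mask E(K, Z), and all function names are ours.

```python
import os
from hashlib import sha256

BLK = 32   # work in 32-byte units so SHA-256 outputs line up

def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

def G(key: bytes, length: int) -> bytes:
    # Pseudo-random mask E(key, Z): a counter-mode SHA-256 keystream stands
    # in for AES-256 over a publicly known block Z.
    out = b""
    for ctr in range((length + BLK - 1) // BLK):
        out += sha256(key + ctr.to_bytes(8, "big")).digest()
    return out[:length]

def aont_package(message: bytes) -> bytes:
    key = os.urandom(32)                         # random key -> randomized pkg
    head = xor(message, G(key, len(message)))    # C = M xor G(K)
    tail = xor(key, sha256(head).digest())       # t = K xor H(C)
    return head + tail

def aont_recover(package: bytes) -> bytes:
    head, tail = package[:-32], package[-32:]
    key = xor(tail, sha256(head).digest())       # K = t xor H(C)
    return xor(head, G(key, len(head)))          # M = C xor G(K)

msg = b"a secret chunk of data"
assert aont_recover(aont_package(msg)) == msg
```

Note that losing any part of the package (head or tail) leaves the key, and hence the message, unrecoverable, which is the all-or-nothing property REED builds on.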
CAONT follows the same paradigm as AONT, but replaces the random encryption key K with a deterministic cryptographic hash key h = H(M) derived from the message M. This ensures that packages generated from identical messages remain identical, and hence the packages can still be deduplicated. Another feature of CAONT is that it allows integrity checking without padding. Specifically, after the package is reverted, the integrity can be verified by computing the hash value of the recovered message M and checking if it equals the recovered hash key h.
The basic encryption scheme leverages CAONT to generate both the trimmed package and the stub, as shown in Figure 2. In particular, we make two modifications to CAONT. The first modification is to replace the cryptographic hash key in CAONT with the corresponding MLE key K_M generated by the key manager. The rationale is that we use the MLE key to achieve security even for predictable chunks through server-aided MLE (Section 2.1). However, we now cannot use the hash key for integrity checking as in CAONT. Thus, the second modification is to append a publicly known, fixed-size canary to the chunk before applying CAONT, so that the integrity of the recovered chunk can be checked against the canary. In our implementation, we set the fixed-size canary to be 32 bytes of zeroes.
The basic encryption scheme is detailed as follows. We first concatenate an input chunk M with the canary to form M', and compute the pseudo-random mask G(K_M) = E(K_M, Z), where K_M is the MLE key obtained from the key manager and Z is the publicly known block with the same size as M'. We then compute the package head C = M' ⊕ G(K_M) and the package tail t = K_M ⊕ H(C). We generate the stub by trimming the last few bytes (e.g., 64 bytes) from the package [C, t], and leave the remaining part as the trimmed package. Finally, we encrypt the stub with the file key. Reconstruction of a chunk works in reverse, and we omit the details here.
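The basic scheme can be sketched end to end. This is an illustrative sketch, not the prototype: a SHA-256 counter-mode keystream stands in for E(K_M, Z), the canary and 64-byte stub match the sizes stated above, and the function names are ours.

```python
from hashlib import sha256

CANARY = bytes(32)       # publicly known 32-byte canary for integrity checking
STUB_SIZE = 64

def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

def G(key: bytes, length: int) -> bytes:
    # SHA-256 counter-mode keystream standing in for E(key, Z) with AES-256.
    out = b""
    for ctr in range((length + 31) // 32):
        out += sha256(key + ctr.to_bytes(8, "big")).digest()
    return out[:length]

def basic_encrypt(chunk: bytes, mle_key: bytes):
    m = chunk + CANARY                             # M' = [M, canary]
    head = xor(m, G(mle_key, len(m)))              # C = M' xor G(K_M)
    tail = xor(mle_key, sha256(head).digest())     # t = K_M xor H(C)
    package = head + tail
    # Trimmed package is deduplicable; the stub goes under the file key.
    return package[:-STUB_SIZE], package[-STUB_SIZE:]

def basic_decrypt(trimmed: bytes, stub: bytes) -> bytes:
    package = trimmed + stub
    head, tail = package[:-32], package[-32:]
    mle_key = xor(tail, sha256(head).digest())     # K_M = t xor H(C)
    m = xor(head, G(mle_key, len(head)))           # M' = C xor G(K_M)
    assert m.endswith(CANARY), "integrity check failed"
    return m[:-len(CANARY)]

k = sha256(b"server-aided MLE key for this chunk").digest()
t, s = basic_encrypt(b"example chunk payload", k)
assert basic_decrypt(t, s) == b"example chunk payload"
```

Since the transform is deterministic in (chunk, MLE key), identical chunks produce identical trimmed packages and still deduplicate.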
We briefly comment on the security guarantees of the basic encryption scheme. The security of each chunk builds on both the file key and the MLE key. If both the file key and the MLE key are secure, then given both the trimmed package and the encrypted stub of a chunk, it is computationally infeasible to revert them to the original chunk. In addition, if the file key is renewed, it is also computationally infeasible to restore the stub (which is now protected by the new file key) and hence the original chunk using the old file key.
One limitation of the basic encryption scheme is that it is vulnerable to the compromise of the MLE key. Specifically, an adversary can monitor the MLE keys generated by the key manager at a compromised client (Section 3.2). If an MLE key is revealed, the adversary can recompute the pseudo-random mask and XOR it with the trimmed package to extract most of the chunk content.
We propose the enhanced encryption scheme, which protects against the compromise of the MLE key. Figure 3 shows the workflow of the enhanced encryption scheme, which first applies MLE to form a ciphertext, and then applies CAONT to the MLE ciphertext. The rationale is that even if an adversary obtains the MLE key, it still cannot recover the original chunk, because the MLE ciphertext is now protected by CAONT.
The enhanced encryption scheme is detailed as follows. First, we encrypt an input chunk M with the MLE key K_M as in traditional MLE, and obtain the MLE ciphertext E(K_M, M). We then transform the concatenation M' = [E(K_M, M), K_M] based on the original CAONT. We can now use the hash key h = H(M'), instead of the MLE key used in the basic encryption scheme, to transform the package. This eliminates the security dependence on the MLE key. Formally, we compute the hash key h = H(M') and the pseudo-random mask G(h) = E(h, Z), where Z is a publicly known block with the same size as M', and compute the package head C = M' ⊕ G(h).
Since the hash key h allows integrity checking, we can generate the tail with a self-XOR operation for efficiency, instead of using the cryptographic hash H(C) as in the basic encryption scheme (Figure 2). Specifically, we evenly divide C into a set of fixed-size pieces C_1, C_2, …, C_n, each with the same size as h. We then XOR all the pieces together with h to compute the tail t = h ⊕ C_1 ⊕ C_2 ⊕ … ⊕ C_n. Note that the self-XOR result cannot be predicted without knowing the entire content of C. Finally, we obtain the trimmed package and the stub from [C, t].
To reconstruct M, we first reconstruct [C, t] from the trimmed package and the stub. We evenly divide C into fixed-size pieces, each with the same size as t, and compute h by XOR-ing all the pieces and t. We then recover M' = C ⊕ G(h), and check the integrity by comparing h with H(M'). We finally extract K_M and the MLE ciphertext from M', and compute M = D(K_M, E(K_M, M)), where D is the symmetric-key decryption function.
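The enhanced scheme can likewise be sketched under stated assumptions: the package is built from the concatenation of the MLE ciphertext and the 32-byte MLE key, a SHA-256 counter-mode keystream stands in for both the MLE cipher and the CAONT mask E(h, Z), and the self-XOR tail works over 32-byte pieces (the last piece zero-padded). All names are ours.

```python
from hashlib import sha256

STUB_SIZE = 64

def H(data: bytes) -> bytes:
    return sha256(data).digest()

def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

def G(key: bytes, length: int) -> bytes:
    # SHA-256 counter-mode keystream standing in for E(key, Z).
    out = b""
    for ctr in range((length + 31) // 32):
        out += sha256(key + ctr.to_bytes(8, "big")).digest()
    return out[:length]

def self_xor(data: bytes) -> bytes:
    # XOR of the 32-byte pieces of data (last piece zero-padded).
    acc = bytes(32)
    for i in range(0, len(data), 32):
        acc = xor(acc, data[i:i + 32].ljust(32, b"\0"))
    return acc

def enhanced_encrypt(chunk: bytes, mle_key: bytes):
    ct = xor(chunk, G(mle_key, len(chunk)))   # MLE ciphertext E(K_M, M)
    m = ct + mle_key                          # M' = [E(K_M, M), K_M]
    h = H(m)                                  # hash key h = H(M')
    head = xor(m, G(h, len(m)))               # C = M' xor G(h)
    tail = xor(h, self_xor(head))             # t = h xor (self-XOR of C)
    package = head + tail
    return package[:-STUB_SIZE], package[-STUB_SIZE:]

def enhanced_decrypt(trimmed: bytes, stub: bytes) -> bytes:
    package = trimmed + stub
    head, tail = package[:-32], package[-32:]
    h = xor(tail, self_xor(head))             # recover the hash key
    m = xor(head, G(h, len(head)))            # M' = C xor G(h)
    assert H(m) == h, "integrity check failed"
    ct, mle_key = m[:-32], m[-32:]
    return xor(ct, G(mle_key, len(ct)))       # M = D(K_M, E(K_M, M))

key = sha256(b"mle key").digest()
t, s = enhanced_encrypt(b"chunk to protect", key)
assert enhanced_decrypt(t, s) == b"chunk to protect"
```

Because the mask now depends on h = H(M') rather than on K_M, a leaked MLE key alone no longer lets an adversary unmask the trimmed package.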
We now briefly comment on the security guarantees of the enhanced encryption scheme. As in basic encryption, the enhanced encryption scheme ensures that each chunk remains secure if both the file key and the MLE key are secure. If the MLE key is leaked, the adversary can recover the original chunk from the MLE ciphertext (i.e., the input to CAONT), yet the original chunk remains secure if the unpredictability assumption still holds (see Section 2.1). We present a more detailed security analysis in Section 4.5.
4.3 Dynamic Access Control
REED supports dynamic access control by associating each file with a policy, which provides a specification of which users are authorized or revoked to access the file. Our policy-based design builds on two well-known cryptographic primitives: ciphertext policy attribute-based encryption (CP-ABE)  and key regression . REED integrates both primitives to generate the corresponding file key, as shown in Figure 4. Note that our goal here is not to propose new designs for CP-ABE and key regression; instead, we demonstrate how REED can work seamlessly with them to provide advanced security functionalities for rekeying. In the following, we elaborate how REED integrates the two primitives.
REED defines policies based on CP-ABE . In CP-ABE, a message is encrypted based on a specific policy that describes which users can decrypt the message. Each policy is represented in the form of an access tree, in which each non-leaf node represents a Boolean gate (e.g., AND or OR), while each leaf node represents an attribute that defines or classifies some user property (e.g., the department that a user belongs to, the employee rank, the contract duration, etc.). Each user is given a private key that corresponds to a set of attributes. If a user’s attributes satisfy the access tree, his private key can decrypt the ciphertext.
Our current design of REED treats each attribute as a unique identifier for each user. We issue each user a CP-ABE private key, called the private access key, that is tied to the user’s identifier. We define the policy of each file as an access tree that connects the identifiers of all authorized users with an OR gate. Thus, any authorized user can decrypt the ciphertext, which we use to protect the file key (see the rekeying discussion below). Note that we can define more attributes and a more sophisticated access tree structure for finer-grained access control.
REED supports both lazy and active revocations for rekeying. In lazy revocation, REED builds on key regression , which is a serial key derivation scheme for generating different versions of keys. Specifically, key regression introduces a sequence of key states, such that the current key state can derive the previous key states, but it cannot derive any future key state. Thus, an authorized user can access all previous key states, and the corresponding files, by using only the current key state; meanwhile, a user revoked from the current key state cannot access any new file that is protected by a future key state. REED implements lazy revocation using the RSA-based key regression scheme . We assign each user with a unique pair of public-private keys called the derivation keys, such that the private derivation key is used to generate new key states for the files owned by the user, while the public derivation key is used to derive the previous key states. The file key will be obtained by generating a cryptographic hash of the current key state. Each key state refers to a policy, and it will be encrypted by CP-ABE associated with the authorized users. In other words, any authorized user can retrieve the current key state, and hence the file key, with his private access key.
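The one-way "derive backward only" property of key regression can be illustrated with a hash chain. This is a sketch under a stated substitution: REED uses the RSA-based key regression scheme, whereas a precomputed hash chain shown here gives the same property for a bounded number of key states; all names are ours.

```python
from hashlib import sha256

MAX_STATES = 1000

def make_chain(seed: bytes, n: int = MAX_STATES):
    # The owner precomputes states backward: state[i] = H(state[i+1]).
    # Handing out states in forward order then means a holder of state[i]
    # can hash downward to earlier states but cannot invert to later ones.
    chain = [seed]
    for _ in range(n - 1):
        chain.append(sha256(chain[-1]).digest())
    chain.reverse()                  # chain[0] is the first state released
    return chain

def derive_previous(state: bytes, steps: int) -> bytes:
    # Any holder of the current state derives earlier states by hashing;
    # going forward would require a preimage of SHA-256.
    for _ in range(steps):
        state = sha256(state).digest()
    return state

states = make_chain(sha256(b"owner secret").digest(), 10)
# A user holding the epoch-5 state can derive the epoch-2 state...
assert derive_previous(states[5], 3) == states[2]
# ...and the file key of any epoch is a hash of that epoch's state:
file_key = sha256(b"filekey" + states[5]).digest()
```

A user revoked at epoch 2 holds only `states[2]` and cannot feasibly compute `states[5]`, matching the lazy-revocation semantics described above.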
REED implements active revocation following the same paradigm as in lazy revocation, except that the files affected by active revocation are immediately re-encrypted with the new file key.
We now summarize the interactions among a client, a server, the key manager, and the storage backend in REED operations. We focus on three basic operations, including upload, download, and rekeying.
To upload a file, the client first picks a random key state and hashes it into a symmetric file key. It splits the file into a set of chunks, computes their fingerprints, and runs the OPRF protocol  with the key manager to obtain the MLE keys of these chunks (Section 3.1). For each chunk, it applies either the basic or enhanced encryption scheme (Section 4.2) to transform the chunk into a trimmed package and a stub. The client writes the stubs of all the chunks of the same file into a separate stub file for storage, and encrypts the stub file with the file key. In addition, the client generates a file recipe, which includes file information such as the file pathname, the file size, and the total number of chunks. Furthermore, the client encrypts the key state using CP-ABE based on the policy of the file. Finally, the client uploads the following information to the REED server: (i) the trimmed packages and encrypted stubs of all chunks, (ii) the file recipe, and (iii) the encrypted key state and the metadata that includes the policy information. Note that we do not need to upload MLE keys, as they are not used in decryption (Section 4.2). The server performs deduplication on the received trimmed packages. All information is stored at the storage backend.
To download a file, the client first retrieves the encrypted key state and decrypts it with the private access key. It then hashes the key state to recover the file key. In addition, it downloads all trimmed packages and encrypted stubs from the storage backend, with the help of the REED server and the file recipe. It decrypts the stubs with the file key, and finally reconstructs all chunks of the file. Note that if the client detects any tampered chunk, the reconstruction operation aborts.
To rekey a file with new access privileges, the client (on behalf of the file owner) retrieves the encrypted key state and its metadata, and decrypts the key state with the private access key. It then generates a new key state based on key regression (Section 4.3), and encrypts the new key state via CP-ABE based on a new policy (e.g., with a new group of users). It finally uploads the encrypted new key state as well as the metadata that describes the new policy information. For active revocation, the client also downloads the stubs of the file, re-encrypts them with a new file key obtained by hashing the new key state, and finally uploads the re-encrypted stubs.
4.5 Security Analysis
We now analyze the security of REED based on our security goals.
We show how REED achieves confidentiality at three levels. First, an adversary can access all trimmed packages, encrypted stubs, and encrypted key states from a compromised server. Since the adversary cannot compromise any private access key or private derivation key, the trimmed packages and encrypted stubs cannot be reverted. Thus, REED achieves the same level of confidentiality as DupLESS  (Section 2.1).
Second, an adversary can collude with revoked or unauthorized clients, through which the adversary can learn a set of private derivation keys and private access keys. Due to the protection of CP-ABE and key regression, these compromised private keys cannot be used to decrypt the file key ciphertexts beyond their access scopes. Without proper file keys, the adversary cannot infer anything about the underlying chunks. One special note is that a client may keep the MLE key (in basic encryption) or the hash key (in enhanced encryption) of a chunk in CAONT (Figures 2 and 3, respectively) to make the chunk accessible even after being revoked. However, if the chunk is updated, the revoked client cannot learn any information from the updated chunk because CAONT will use a new MLE key or hash key to transform the updated chunk, making the old one useless.
Finally, an adversary can monitor a subset of clients and identify the MLE keys requested by them. The enhanced encryption scheme of REED ensures confidentiality for unpredictable chunks, even though the victim clients are authorized to access these chunks. Specifically, the enhanced encryption scheme builds an additional security layer with the file key. As long as the file key is secure, it is computationally infeasible to restore the MLE ciphertext (i.e., the input to CAONT) due to the protection of CAONT. Note that the adversary may restore the MLE ciphertext by launching a brute-force attack to check if the MLE ciphertext is transformed into the trimmed package through CAONT, but it is computationally infeasible if chunks are unpredictable (see Section 4.2). Thus, identifying an MLE key does not help recover the original chunk, and hence the original chunk remains secure.
Both the basic and enhanced encryption schemes of REED ensure chunk-level integrity, such that any modification of the trimmed package or the stub of a chunk can be detected. In the basic encryption scheme, the MLE key can be reverted from the package (Section 4.2). Since the reverted key depends on every bit of the package, the modification of any part of the package will lead to an incorrect key. Thus, the client can easily detect the modification by checking the canary padded with the reverted chunk.
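To make this integrity mechanism concrete, here is a stdlib-only sketch of a CAONT-style transform for the basic scheme. It is a simplification under stated assumptions: we substitute a SHA-256 counter keystream for REED's AES-256-based pseudo-random mask, and the helper names are our own; the 64-byte stub size and 32-byte zero canary follow our prototype parameters.

```python
import hashlib

CANARY = b"\x00" * 32   # fixed canary appended before the transform
STUB = 64               # stub size in bytes (as in the REED prototype)

def _xor(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

def _mask(key, n):
    # Pseudo-random mask; REED uses AES-256, we use a SHA-256 counter
    # keystream to keep this sketch dependency-free.
    out, ctr = b"", 0
    while len(out) < n:
        out += hashlib.sha256(key + ctr.to_bytes(8, "big")).digest()
        ctr += 1
    return out[:n]

def caont_encode(chunk, mle_key):
    x = chunk + CANARY                             # append canary
    y = _xor(x, _mask(mle_key, len(x)))            # mask the padded chunk
    t = _xor(mle_key, hashlib.sha256(y).digest())  # hide the key in the tail
    pkg = y + t
    return pkg[:-STUB], pkg[-STUB:]                # (trimmed package, stub)

def caont_decode(trimmed, stub):
    pkg = trimmed + stub
    y, t = pkg[:-32], pkg[-32:]
    key = _xor(t, hashlib.sha256(y).digest())      # revert the key from the package
    x = _xor(y, _mask(key, len(y)))
    chunk, canary = x[:-32], x[-32:]
    if canary != CANARY:                           # tampering corrupts the canary
        raise ValueError("integrity violation")
    return chunk
```

Flipping any bit of the trimmed package or stub changes the reverted key (since it depends on every bit of the package), so the unmasked canary no longer matches and decoding aborts.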
Using similar reasoning, the enhanced encryption scheme also ensures the integrity of a chunk: a client performs integrity checking by comparing the reverted hash key against the hash of the reverted input (Section 4.2). One special note regarding the enhanced scheme is that its use of the self-XOR operation may return a correct hash key even if the package is tampered with. For example, an intelligent adversary can divide the package into fixed-size pieces and flip the same bit position in an even number of the pieces. On the other hand, a tampered package will be reverted to a wrong input even with the correct hash key, and the integrity violation can be caught by comparing the hash of the reverted input with the hash key.
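The self-XOR caveat can be demonstrated directly: folding a package into fixed-size pieces by XOR yields the same value when the same bit position is flipped in an even number of pieces. The piece size and fold function below are illustrative, not REED's exact construction.

```python
def xor_fold(data, piece=32):
    """Fold data into one piece-sized value by XORing its fixed-size pieces."""
    acc = bytearray(piece)
    for i in range(0, len(data), piece):
        for j, b in enumerate(data[i:i + piece]):
            acc[j] ^= b
    return bytes(acc)

pkg = bytes(range(256)) * 4            # a 1KB "package"
tampered = bytearray(pkg)
tampered[0] ^= 0x80                    # flip the same bit position...
tampered[64] ^= 0x80                   # ...in an even number (two) of pieces

assert xor_fold(pkg) == xor_fold(bytes(tampered))   # fold value unchanged
assert pkg != bytes(tampered)                       # yet the package is tampered
```

This is why the scheme must also check the reverted input itself, which the tampering does corrupt, rather than rely on the self-XOR value alone.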
We present some open issues of our current REED design.
Fault tolerance:
In this work, we do not explicitly address fault tolerance. To improve the fault tolerance of stored data, we can distribute both trimmed packages and stubs across multiple cloud providers via deduplication-aware secret sharing .
Metadata protection:
We currently focus on the encryption and rekeying of file chunks, and do not address those for file metadata (e.g., file recipes). We can obfuscate sensitive metadata information, such as the file pathname, by encoding it via a salted hash function.
Group-based file management:
We currently perform rekeying on a per-file basis. We can generalize rekeying to a file group with multiple files, which makes file management more flexible. On the other hand, we need to define new metadata to describe the file group information.
5 Exploiting Similarity
REED builds on server-aided MLE key generation , in which the key manager generates an MLE key for each message (or chunk in our case). In this section, we argue that MLE key generation is expensive and significantly degrades the overall performance of REED. In view of this, we propose to exploit the similarity feature that is commonly found in backup workloads, so as to mitigate the performance overhead of REED.
5.1 Overhead of MLE Key Generation
Recall that REED realizes the OPRF protocol to “blind” MLE key generation as in DupLESS . In our design, we configure the key manager with a system-wide public/private key pair, based on 1024-bit RSA in our case. Let e and d be the public and private keys, respectively, and N be the RSA modulus. For each chunk to be uploaded, a client performs MLE key generation in the following steps (note that all arithmetic is performed modulo N).
Blind: the client selects a random number r, raises it to the power e, and multiplies the result with the chunk fingerprint. It sends the blinded fingerprint to the key manager.
Sign: the key manager computes an RSA signature by raising the blinded fingerprint to the power d. It returns the result to the client. Note that the key manager does not learn the original fingerprint, which is “blinded” by the random number r.
Unblind: the client multiplies the received result with the inverse of r, and hashes the unblinded result to form the MLE key.
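The three steps above can be sketched with textbook RSA blind signatures. The 12-bit modulus below is a toy for illustration (REED uses 1024-bit RSA): the key manager never sees the fingerprint, yet the unblinded result equals the signature on the fingerprint itself.

```python
import hashlib

N, E, D = 3233, 17, 2753     # toy RSA key pair held by the key manager

fp = int.from_bytes(hashlib.sha256(b"chunk data").digest(), "big") % N
r = 7                        # client's random blinding factor, gcd(r, N) = 1

# Blind (client): multiply the fingerprint by r^e.
blinded = (fp * pow(r, E, N)) % N
# Sign (key manager): raise the blinded fingerprint to the power d.
signed = pow(blinded, D, N)
# Unblind (client): multiply by r^-1 mod N, leaving fp^d mod N.
unblinded = (signed * pow(r, -1, N)) % N

assert unblinded == pow(fp, D, N)   # same signature as signing fp directly
mle_key = hashlib.sha256(unblinded.to_bytes(8, "big")).digest()
```

Because r^(ed) = r (mod N), the signature on the blinded value factors as fp^d · r, and multiplying by r's inverse strips the blinding without the key manager ever learning fp.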
OPRF-based MLE key generation is expensive, especially when it operates on small-size chunks. Its overhead comes from two aspects. First, if a client sends individual per-chunk MLE key generation requests to the key manager, there will be substantial transmission overhead. Also, since the OPRF protocol for key generation is based on public key cryptography, there will be substantial computational overhead due to modular exponentiation.
Table 1 provides a performance breakdown (in terms of latency) of MLE key generation for an 8KB chunk. We obtain the average results over 10 runs from our experimental testbed (Section 7). If we implement all the steps serially, the total latency for MLE key generation is 1125.3μs, or equivalently the throughput is only 6.9MB/s. If we deploy REED in a Gigabit LAN (our experimental testbed), MLE key generation easily becomes a performance bottleneck of REED. In particular, the sign operation occupies 48% of the total latency, and it cannot be trivially parallelized for performance improvement.
|Step|Latency (μs)|
|Blind (performed by the client)|46.3|
|Sign (performed by the key manager)|537.2|
|Unblind (performed by the client)|246.9|
5.2 Limitations of Simple Optimizations
To mitigate MLE key generation overhead, our conference paper  uses two optimization approaches: (i) batching per-chunk MLE key generation requests and (ii) caching the most recently generated MLE keys in the client’s local key cache. While both approaches can mitigate key generation overhead based on evaluation results, they still have the following limitations.
Batching per-chunk MLE key generation requests aims to reduce round-trip transmission overhead, but it does not reduce computational overhead (see Table 1). As shown in our conference paper , batching 256 per-chunk key generation requests for 8KB chunks can only achieve a key generation speed of 17.64MB/s, which is still much smaller than the network speed in a Gigabit LAN.
Caching the MLE keys is effective in mitigating key generation overhead, based on the observation that adjacent uploads of a client often share high content similarity; for example, backup snapshots of a file system are highly similar if there are only small changes to the file system. As shown in our conference paper , we can eliminate most key generation requests for the uploads of subsequent backups after the first one, so the upload speeds for subsequent backups are almost network-bound (around 100MB/s). However, the caching approach has a few limitations. First, it is only effective for uploads that are largely duplicated with the previous one (e.g., it is ineffective for the first backup). Second, its required local cache space is not scalable; for example, it needs 4GB of cache space per 1TB of storage, assuming that we configure 8KB chunks and 256-bit MLE keys. Finally, it is unreliable due to the volatile nature of the cache.
5.3 Similarity-based Approach
We propose a similarity-based approach for MLE key generation, such that we can mitigate MLE key generation overhead, while preserving deduplication effectiveness. First, we adopt coarse-grained MLE key generation on a larger data unit called segment, which comprises multiple adjacent chunks and has a size on the order of megabytes (e.g., 1MB by default in our case). To form a segment, we implement the variable-size segmentation scheme in  that operates directly on chunk fingerprints and is configured by the minimum, average, and maximum segment sizes. Specifically, we traverse the stream of chunks, and place a segment boundary after the chunk if the chunk fingerprint modulo a pre-defined divisor is equal to a fixed constant (which we set to -1 as in ). Here, the divisor is configured by the average segment size to specify the expected number of chunks between adjacent segment boundaries. We ensure that the segment size is at least the minimum segment size, and we always place a boundary after the chunk whose inclusion makes the segment size larger than the maximum segment size. In our implementation, we vary the average segment size, and fix the minimum segment size and maximum segment size as half and double of the average segment size, respectively.
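The segmentation logic above can be sketched as follows, under our stated configuration (8KB average chunks, 1MB average segments, with the minimum and maximum segment sizes at half and double the average); the function and parameter names are ours.

```python
def segment(chunks, avg_chunk=8 << 10, avg_seg=1 << 20):
    """Group (fingerprint, size) chunks into variable-size segments.

    A boundary is placed after a chunk whose fingerprint mod the divisor
    equals divisor - 1, subject to the minimum and maximum segment sizes
    (half and double of the average segment size, respectively).
    """
    divisor = avg_seg // avg_chunk            # expected chunks per segment
    min_seg, max_seg = avg_seg // 2, avg_seg * 2
    segments, cur, cur_size = [], [], 0
    for fp, size in chunks:
        cur.append((fp, size))
        cur_size += size
        at_boundary = fp % divisor == divisor - 1
        if (cur_size >= min_seg and at_boundary) or cur_size > max_seg:
            segments.append(cur)              # close the current segment
            cur, cur_size = [], 0
    if cur:
        segments.append(cur)                  # trailing partial segment
    return segments
```

Since boundaries depend only on chunk fingerprints, two clients chunking similar data streams independently produce aligned segments, which is what the similarity-based key assignment below relies on.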
Clearly, per-segment MLE key generation incurs much fewer key generation requests than the per-chunk one, thereby significantly mitigating the overall performance overhead. On the other hand, segment-level MLE key generation can introduce different segment-level MLE keys for different segments (and hence ciphertexts), even though the segments share a large portion of identical chunks. This compromises deduplication effectiveness.
Thus, our similarity-based approach aims to maximize deduplication effectiveness by carefully generating segment-level MLE keys. Our insight is to assign “similar” segments with the same MLE key and encrypt every chunk of a segment with the corresponding segment-level MLE key. If two “similar” segments share a large number of identical chunks, the identical chunks are still encrypted with the same key and hence deduplicated.
In this work, we borrow the Extreme Binning approach  to identify similar segments. Specifically, for each segment that contains multiple chunks, a client selects the chunk (called the representative chunk) whose fingerprint value is the minimum. It uses the minimum fingerprint to request the segment-level MLE key from the key manager. It then encrypts each chunk (via either basic or enhanced encryption of REED) with the received MLE key. The rationale is that if two segments share a large number of identical chunks, there is a high probability that both segments share the same representative chunk (due to Broder's theorem ).
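The key assignment can be sketched as follows: each segment's key request is keyed on its minimum fingerprint, so similar segments sharing the representative chunk receive the same segment-level MLE key. Here a local hash stands in for the actual OPRF exchange with the key manager, and the example fingerprints are invented for illustration.

```python
import hashlib

def segment_key(fingerprints, request_key):
    """Derive one MLE key per segment from its representative (minimum) chunk."""
    rep = min(fingerprints)            # representative chunk fingerprint
    return request_key(rep)

# Stand-in for the OPRF protocol with the key manager.
fake_oprf = lambda fp: hashlib.sha256(fp.to_bytes(8, "big")).digest()

seg_a = [42, 900, 17, 350]             # two similar segments: 3 of 4 chunks shared,
seg_b = [42, 900, 17, 501]             # including the minimum fingerprint 17
seg_c = [640, 88, 205, 777]            # a dissimilar segment

key_a = segment_key(seg_a, fake_oprf)
key_b = segment_key(seg_b, fake_oprf)
assert key_a == key_b                  # shared chunks encrypt identically -> dedup
assert key_a != segment_key(seg_c, fake_oprf)
```

Only one key request is issued per segment rather than per chunk, which is the source of the performance gain quantified in Section 7.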
Figure 5 shows an example of our similarity-based approach. Consider three segments with four chunks each, where two of the segments share the same representative chunk (and hence the same segment-level MLE key) while the third has a different one. The identical chunks in the two similar segments can then be deduplicated. Note that the approach cannot achieve exact deduplication; for example, a chunk that also appears in the third segment cannot be deduplicated with its other copies, due to the different segment-level MLE keys. Nevertheless, since similarity is common in backup workloads , we expect that our similarity-based approach achieves high deduplication effectiveness, as also validated in our evaluation (Section 7).
5.4 Security Analysis
We now analyze the security impact of similarity-based MLE key generation on both the basic and enhanced encryption schemes. Unfortunately, our similarity-based key generation cannot preserve the confidentiality of chunks in the basic scheme. The reason is that it uses a segment-level MLE key (derived from the minimum fingerprint of a segment) as an input to CAONT to transform all chunks in a segment. This creates the same pseudo-random mask for all chunks in the same segment, which allows an adversary to XOR any two of the resulting trimmed packages to remove the mask and learn most of the XOR of the original chunks.
Nevertheless, we emphasize that the similarity-based MLE key generation does not introduce new security risks in the enhanced scheme. The reason is that the pseudo-random mask is generated from both the MLE key and the MLE ciphertext (see Figure 3). As a result, different chunks lead to different pseudo-random masks, which are infeasible to remove without the knowledge of the file key. Although the similarity-based MLE key generation allows an adversary to narrow down the attack space of the online brute-force attack by requesting the MLE keys for potential minimum fingerprints, the key manager can lower the rate limit for key generation requests . Since segment-level key generation has already reduced the number of key generation requests, lowering the rate limit has no impact on normal users. Thus, the enhanced encryption scheme can benefit from similarity-based MLE key generation for performance gains, and achieve performance similar to the basic encryption scheme based on our evaluation (see Section 7.1).
We summarize the benefits of our similarity-based MLE key generation over the simple optimizations in Section 5.2. First, since it operates on a per-segment basis, it inherently reduces the number of MLE key generation requests, independent of the amount of duplicates in the workloads. Also, it does not need to locally cache MLE keys, and hence eliminates the concerns of maintaining a large cache space. Finally, it exploits similarity to remove duplicate chunks and thus maintains deduplication effectiveness.
We implement a REED prototype in C++ based on our previously built system CDStore . We follow the modular approach as in CDStore to implement REED, and Figure 6 shows how the modules of REED are organized. We mainly extend CDStore to support rekeying, with the addition of a key manager, the basic and enhanced encryption schemes (Section 4.2), dynamic access control (Section 4.3), and the similarity-based key generation approach (Section 5.3). We also use OpenSSL 1.0.2a  and CP-ABE toolkit 0.11  to implement the cryptographic operations in REED. The current REED prototype, including the original CDStore modules, contains around 11,000 LOC.
Like CDStore, a client divides an input file into fixed-size or variable-size (via Rabin fingerprinting ) chunks in the chunk module. It can also reassemble collected chunks into the original file during file download. We currently use SHA-256 to compute chunk fingerprints.
In the key module, the client runs the OPRF protocol with the key manager for generating either chunk-based or segment-level MLE keys (see Section 5.3). In addition, it implements RSA-based key regression  for generating new key states during rekeying, and protects each key state using CP-ABE  (via the CP-ABE toolkit ).
In the encryption module, the client implements both the basic and enhanced encryption schemes (and the corresponding decryption schemes). In both encryption schemes, the client transforms a chunk into a trimmed package and a stub through CAONT, in which we implement the encryption via AES-256 and the hash function via SHA-256 (see Section 4.2). To resist brute-force attacks on the stub while preserving storage efficiency, we configure the stub size as 64 bytes for each chunk. To enable integrity checking on reconstructed chunks, we set the fixed-size canary in both schemes to be 32 zero bytes. The client encrypts each stub file (consisting of the stubs of the same file) with a file key hashed from the corresponding key state via SHA-256.
The communication module is similar to that in CDStore. In this module, the client uploads (resp. downloads) all stored data to (resp. from) the server, including the trimmed packages, the encrypted stub file, the file metadata, the encrypted key state, and the public derivation key.
A key manager authenticates clients’ connections via SSL/TLS. It implements the OPRF protocol based on 1024-bit RSA, and computes an RSA signature on each incoming blinded fingerprint.
A server can receive file data from multiple clients via the communication module. It performs deduplication on the trimmed packages via the dedup module, and only stores unique trimmed packages in the storage backend. Since a file may have a large number of trimmed packages, the server packs them in units of containers to make storage and retrieval efficient via the container module. Like CDStore, we cap the container size at 4MB by default.
In the index module, the server keeps track of indexing information, including the fingerprints of all trimmed packages for deduplication, and the references to all trimmed packages and file recipes in the storage backend for file retrieval.
We separate the storage into file data and key information for better management. Specifically, we create two stores at the storage backend: (i) the data store, which stores the file data such as file recipes, trimmed packages, stub files, and all related file metadata, and (ii) the key store, which stores the key information such as encrypted key states. Separating the storage management of key information and file data gives flexibility, for example, by leveraging a more robust platform for encryption key management .
To achieve reasonable performance, REED batches I/O requests, and also parallelizes the encryption (resp. decryption) operations of uploaded (resp. downloaded) chunks via multi-threading. Here, we only configure two threads for encryption/decryption, as our evaluation results indicate that two threads are sufficient for achieving the required performance.
We evaluate REED on a LAN testbed composed of multiple machines, each of which is equipped with a quad-core 3.4GHz Intel Core i5-3570, 7200RPM SATA hard disk, and 8GB RAM, and installed with 64-bit Ubuntu 12.04.2 LTS. All machines are connected via a 1Gb/s switch.
Our default setting of REED is as follows. We run one REED client, one key manager, and five REED servers in different machines. We use multiple REED servers for improved scalability. In particular, four of the five servers manage the data store, and the remaining one server manages the key store. In practice, both the data store and the key store should be deployed in a shared storage backend (e.g., cloud storage); however, to remove the I/O overhead of accessing the shared storage backend in our evaluation, we simply have each server store information in its local hard disk. In addition to the default setting, we describe additional specific settings in each experiment, and also consider the case where multiple clients are involved. We compile our programs with g++ 4.8.1 with the -O3 option. For performance tests, we present the average results over 10 runs. We do not include the variance results in our plots, as they are generally very small in our evaluation. In the following, we use a synthetic dataset and two real-world datasets for our evaluation.
7.1 Synthetic Data
We evaluate different REED operations through synthetic data. In particular, we evaluate how segment-level MLE key generation mitigates overhead (Section 5). Specifically, we generate a 2GB file of synthetic data with globally unique chunks (i.e., the chunks have no duplicate content). Before each experiment, we load the synthetic data into memory to avoid generating any disk I/O overhead.
Experiment A.1 (MLE key generation performance):
We first measure the performance of MLE key generation between a client and the key manager. The client creates chunks of the input 2GB file using variable-size chunking based on Rabin fingerprinting with a specified average chunk size. We also group the chunks into variable-size segments with a specified average segment size (Section 5). The client computes the minimum fingerprint of each segment and requests for segment-level MLE keys from the key manager. We measure the MLE key generation speed, defined as the ratio of the file size (i.e., 2GB) to the total time starting from when the client creates the input file until it obtains all segment-level MLE keys from the key manager.
Figure 7(a) shows the MLE key generation speed versus the average chunk size, in which we fix the average segment size as 1MB. We observe that the speed increases with the average chunk size, mainly because we process fewer chunks to find the minimum fingerprint for each segment. When the average chunk size is at least 8KB, the key generation speed becomes steady at around 168MB/s, since the key manager is now saturated by segment-level key generation requests and the speed is bounded by the computation of the key manager. For comparison, our conference paper  shows that the per-chunk key generation speed is below 20MB/s.
Figure 7(b) shows the MLE key generation speed versus the average segment size, in which we fix the average chunk size as 8KB. The speed increases with the average segment size, as a larger segment size implies fewer MLE keys to be generated. When the segment size is at least 512KB, the key generation speed is above 130MB/s, which is higher than the network speed in our LAN testbed (i.e., 1Gb/s).
Experiment A.2 (Encryption performance):
We measure the performance of both basic and enhanced encryption schemes. Suppose that the client has created chunks with variable-size chunking and obtained MLE keys from the key manager. Here, we measure the encryption speed, defined as the ratio of the file size (i.e., 2GB) to the total time of encrypting all chunks into trimmed packages and stubs.
Figure 8 shows the speeds of both basic and enhanced encryption schemes versus the average chunk size (note that the average segment size has no impact on the encryption speed). The throughput of both encryption schemes increases with the average chunk size, mainly because fewer chunks need to be processed. The basic scheme is faster than the enhanced scheme, as the enhanced scheme introduces an additional encryption (see Section 4.2). For example, for an average chunk size of 8KB, the basic scheme achieves 203MB/s, while the enhanced scheme achieves 155MB/s (about 24% slower). We observe that the encryption speeds of both schemes are higher than the network speed (i.e., 1Gb/s), and hence encryption is not the performance bottleneck in REED. We further justify this claim in Experiment A.3.
Experiment A.3 (Upload and download performance):
We now measure the upload and download performance of REED. For performance comparisons, we also include the basic encryption scheme, although it is shown to be insecure in similarity-based MLE key generation (see Section 5.4). We first consider the case of a single client. The client first uploads a 2GB file of unique data, followed by downloading the 2GB file. We measure the upload speed as the ratio of the file size to the total time of sending all file data to the servers (including the chunking, key generation, encryption, and data transfer), and the download speed as the ratio of the file size to the total time starting from when the client issues a download request until all original data is recovered.
Figure 9(a) shows the upload speeds under both encryption schemes versus the average chunk size, in which we fix the average segment size as 1MB. We see that the upload speeds increase with the average chunk size, and become close to the effective network speed in our LAN testbed. For example, when the average chunk size is 16KB, the upload speeds for the basic and enhanced schemes are 107.6MB/s and 106.9MB/s, respectively. Both encryption schemes have only minor performance differences.
Figure 9(b) shows the upload speeds under both encryption schemes versus the average segment size, in which we fix the average chunk size as 8KB. Similar to Figure 9(a), the upload speeds grow with the average segment size and are finally bounded by the network speed. For example, when the average segment size is 1MB, the upload speeds for the basic and enhanced schemes are 106.9MB/s and 106.4MB/s respectively.
Figure 9(c) shows the download speeds under both encryption schemes versus the average chunk size. When the average chunk size goes beyond 8KB, the download speeds of both encryption schemes (e.g., 108.0MB/s for basic encryption and 106.6MB/s for enhanced encryption) approximate the effective network speed.
We also consider the case with multiple REED clients. We vary the number of clients from one to eight, and each client runs on a different machine. Here, we focus on the aggregate upload performance under the enhanced encryption scheme, such that each client uploads a 2GB file of unique data simultaneously. We measure the aggregate upload speed, defined as the ratio of the total amount of file data (i.e., 2GB times the number of clients) to the total time when all uploads are finished. Figure 9(d) shows the aggregate upload speed versus the number of clients, in which we fix the average chunk size as 8KB and average segment size as 1MB. We see that the speed increases with the number of clients, and is finally bounded by the network bandwidth. When there are eight clients, the aggregate upload speed reaches 373.3MB/s.
Experiment A.4 (Rekeying performance):
We measure the rekeying performance of both the lazy and active revocation schemes. Recall that the rekeying operation of REED requires a CP-ABE decryption with the original policy and another CP-ABE encryption with a new policy. REED treats each policy as an access tree with an OR gate connecting all the authorized user identifiers (see Section 4.3). This implies that the CP-ABE decryption time is constant, while the encryption time grows with the number of authorized users in the new policy. Thus, we focus on evaluating the impact of three parameters on the rekeying operation: (i) the total number of users, i.e., the number of authorized users in the original policy; (ii) the revocation ratio, i.e., the percentage of users to be revoked and removed from the access tree; and (iii) the file size, i.e., the size of the rekeyed file. We measure the rekeying delay, defined as the total time of performing all rekeying steps, including: downloading and decrypting a key state, deriving a new key state, encrypting and uploading the new key state, and re-encrypting the stub file (for active revocation only).
Figure 10(a) shows the rekeying delay versus the total number of users, while we fix the rekeyed file size at 2GB and the revocation ratio at 20%. The rekeying delays of both revocation schemes increase with the total number of users, mainly because the CP-ABE encryption overhead increases with a larger access tree. Nevertheless, the rekeying delays are within three seconds in both revocation schemes. In particular, lazy revocation is faster than active revocation by about 0.6s, as it defers the re-encryption process to the next file update.
Figure 10(b) shows the rekeying delay versus the revocation ratio, while we fix the rekeyed file size at 2GB and the total number of users at 500. With a larger revocation ratio, the new policy has fewer authorized users, thereby reducing the revocation time. When the revocation ratio is 50%, the rekeying delays of the lazy and active revocation schemes are 1.44s and 2s, respectively.
Figure 10(c) shows the rekeying delay versus the size of the rekeyed file, while we fix the total number of users at 500 and the revocation ratio at 20%. The rekeyed file size has no impact on lazy revocation, whose rekeying delay stays at 2.25s. For active revocation, as the file size increases, more time is spent on transferring and re-encrypting the stub file. Thus, the rekeying delay increases, for example, to 3.4s for an 8GB file. Nevertheless, compared with the time of transferring a whole file over the network (e.g., at least 64s for an 8GB file in a 1Gb/s network), the rekeying delay is insignificant. Thus, the rekeying operation in REED is lightweight in general.
7.2 Real-world Data
We now consider two real-world datasets to drive our evaluations.
FSL: This dataset is collected by the File systems and Storage Lab (FSL) at Stony Brook University [1, 62]. The original FSL dataset contains daily backups of the home directories of various users in a shared file system. We focus on the Fslhomes dataset in 2013, which comprises 147 daily snapshots from January 22 to June 17, 2013. Each snapshot represents a daily backup, represented by a collection of 48-bit fingerprints of variable-size chunks with an average 8KB chunk size. The dataset we consider accounts for a total of 56.20TB of pre-deduplicated data.
VM: This dataset consists of virtual machine (VM) image snapshots and is collected by ourselves. We have 156 VMs for students enrolled in a university programming course in Spring 2014. We take 26 full daily image snapshots for each VM, spanning three months. Each image snapshot is of size 10GB, and the complete dataset contains 39.61TB of data. Each daily snapshot is represented by SHA-1 fingerprints on 4KB fixed-size chunks. We remove all zero-filled chunks, which are known to dominate in VM images , and the size reduces to 18.24TB. A subset of the same dataset is also used in prior work .
In our evaluation, we construct variable-size segments with an average segment size of 1MB by grouping the chunks specified in the datasets (i.e., the variable-size chunks in FSL and the fixed-size chunks in VM), based on the variable-size segmentation described in Section 5.
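The segmentation idea can be sketched in Python: a segment boundary is declared at a chunk whose fingerprint matches a fixed pattern, subject to minimum and maximum segment sizes, so that boundaries are content-defined and robust to chunk insertions. The divisor, pattern, and size bounds below are illustrative assumptions rather than REED's exact parameters.

```python
import hashlib

def segment_chunks(chunks, min_size=512 * 1024, max_size=2 * 1024 * 1024):
    """Group a chunk stream into variable-size segments (illustrative sketch).

    A boundary is placed at a chunk whose fingerprint, reduced modulo a
    divisor, matches a fixed pattern; with ~8KB chunks and divisor 128,
    the expected segment size is about 1MB, bounded by min/max sizes.
    """
    divisor, pattern = 128, 127  # assumed parameters for a ~1MB average
    segments, current, current_size = [], [], 0
    for data in chunks:
        fp = hashlib.sha1(data).digest()
        current.append(data)
        current_size += len(data)
        at_boundary = int.from_bytes(fp[:4], 'big') % divisor == pattern
        if current_size >= max_size or (current_size >= min_size and at_boundary):
            segments.append(current)
            current, current_size = [], 0
    if current:  # flush the trailing partial segment
        segments.append(current)
    return segments
```

Because boundaries depend only on chunk content, two users storing similar files derive mostly identical segments, which is what the similarity-based MLE key generation exploits.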
Experiment B.1 (Storage overhead):
We first measure the storage overhead of REED. Our goal is to show that REED maintains storage efficiency via deduplication, even though it can deduplicate only part of each chunk (i.e., the trimmed package). We define three types of data: (i) logical data, the original data before any encryption or deduplication; (ii) stub data, the encrypted stub files being stored; and (iii) physical data, the trimmed packages being stored after deduplication. We aggregate the data from all users and measure the total size of each data type.
Figure 11(a) shows the cumulative data sizes over the number of days of storing FSL daily backups of all users. Each FSL daily backup contains 290-680GB of logical data for all users, yet the physical and stub data that REED actually stores after deduplication account for only 6.56GB per day on average. After 147 days, there is a total of 57,548GB of logical data, and REED generates only 964.4GB of physical and stub data after deduplication. This achieves a total storage saving of 98.3%, showing that REED still maintains high storage efficiency through deduplication.
Figure 11(b) compares the cumulative sizes of physical and stub data after deduplication. The cumulative size of stub data increases over days. After 147 days, there are 584.3GB of physical data due to the unique trimmed packages. There is also 380.1GB of stub data. Note that the stub data cannot be deduplicated as it is encrypted by a renewable file key. Nevertheless, deduplication effectively reduces the overall storage space according to Figure 11(a).
We now switch to the VM dataset. Figure 11(c) compares the size of logical data with the sizes of physical and stub data. After 26 daily backups, we have a total of 18,681GB of logical data. Deduplication reduces the space to 539.8GB of physical and stub data combined, a storage saving of 97.1%. Figure 11(d) presents a breakdown. We observe that the size of stub data grows linearly with the number of daily backups. The reason is that the stub data size depends on the number of logical chunks, yet each VM daily backup has a similar number of logical chunks (excluding the zero-filled chunks). After 26 days, REED stores 247.9GB of physical data and 291.9GB of stub data. The findings are similar to those for the FSL dataset.
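The reported savings follow directly from the definitions above: a saving is the fraction of logical data that REED does not need to store, where the stored data comprises both physical (trimmed package) and stub data. A one-line helper (hypothetical, for illustration) reproduces the percentages for both datasets:

```python
def storage_saving(logical_gb: float, stored_gb: float) -> float:
    """Fraction of logical data saved, where stored = physical + stub."""
    return 1.0 - stored_gb / logical_gb

# FSL after 147 days: 57,548GB logical vs 964.4GB stored -> 98.3%
fsl_saving = round(storage_saving(57548, 964.4) * 100, 1)
# VM after 26 days: 18,681GB logical vs 539.8GB stored -> 97.1%
vm_saving = round(storage_saving(18681, 539.8) * 100, 1)
```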
We further compare the storage overhead of our similarity-based approach with that of the original chunk-based approach, which performs deduplication at the granularity of chunks (8KB for FSL and 4KB for VM). Table 2 shows the sizes of the physical and stub data, as well as the storage savings over the original size of logical data. Our similarity-based approach mitigates the MLE key generation overhead of the chunk-based approach, while incurring 35.3% and 45.8% more physical data for FSL and VM, respectively. Note that deduplication does not change the total number of logical chunks, so both approaches have the same size of stub data for each dataset. Nevertheless, the similarity-based approach still achieves almost identical storage savings to the chunk-based approach for both datasets.
REED focuses on maintaining high storage savings for logical data via deduplication, yet we observe that stub data becomes dominant in physical storage as more backups are stored (or more generally, for workloads with high deduplication savings). To mitigate the storage overhead of the stub data, one option is to increase the chunk size; in fact, it has been shown that a larger chunk size may achieve higher effective storage savings by reducing metadata overhead . We pose this issue as future work.
Experiment B.2 (Trace-driven upload and download performance):
We evaluate the upload and download speeds of a single REED client using both real-world datasets, as opposed to the synthetic dataset in Experiment A.3. Since both the FSL and VM datasets include only chunk fingerprints and chunk sizes, we reconstruct a chunk by repeatedly writing its fingerprint to a spare chunk until reaching the specified chunk size; this ensures that the same (resp. distinct) fingerprint returns the same (resp. distinct) chunk. The reconstructed chunk is treated as the output of the chunking module of the REED client. Thus, we do not include the chunking time in this experiment.
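The reconstruction rule can be sketched as follows; the function name is ours, but the logic matches the description: repeat the fingerprint until the recorded chunk size is reached, so reconstruction is deterministic in the fingerprint.

```python
def reconstruct_chunk(fingerprint: bytes, chunk_size: int) -> bytes:
    """Rebuild deterministic chunk content from a trace entry.

    The fingerprint is written repeatedly and truncated to the recorded
    chunk size, so identical fingerprints always yield identical chunks
    and distinct fingerprints yield distinct chunks (preserving the
    trace's deduplication pattern).
    """
    reps = -(-chunk_size // len(fingerprint))  # ceiling division
    return (fingerprint * reps)[:chunk_size]
```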
The client uploads all daily backups (on behalf of all users), followed by downloading them. Due to the large dataset sizes, we run only part of each dataset to reduce the evaluation time. Specifically, for the FSL dataset, we choose seven consecutive daily backups for nine users, totaling 3.64TB of data before deduplication; for the VM dataset, we choose four daily backups for all users, totaling 2.78TB of data before deduplication. We use the same setting as in Experiment A.3, and use the enhanced encryption scheme.
Figure 12 shows the upload and download speeds of REED over days. Both the upload and download speeds of all days are almost network-bound (around 105MB/s for both datasets) due to our segment-level MLE key generation. We highlight that, for our original implementation in the conference paper , the upload speed of the first day is as low as 13.1MB/s, since it lacks cached MLE keys and has to request MLE keys for each chunk from the key manager. Our similarity-based MLE key generation does not have this limitation.
8 Related Work
Encrypted deduplication storage:
Section 2 reviews MLE  and DupLESS , which address the theoretical and applied aspects of encrypted deduplication storage, respectively. Bellare et al.  propose a theoretical framework of MLE, and provide formal definitions of privacy and tag consistency. Follow-up studies [12, 2] further examine message correlation and parameter dependency of MLE.
On the applied side, convergent encryption (CE)  has been implemented and evaluated in various storage systems (e.g., [6, 67, 61, 4, 23, 57]). DupLESS  implements server-aided MLE. Duan  improves the robustness of key management in DupLESS via threshold signatures . Zheng et al.  propose a layer-level strategy specifically for video deduplication. Liu et al.  propose a password-authenticated key exchange protocol for MLE key generation. ClearBox  enables clients to verify the effective storage space that their data occupies after deduplication. SecDep  leverages cross-user file-level deduplication on the client side to mitigate the key generation overhead, but it is susceptible to side-channel attacks, in which a malicious user can infer the existence of files through the deduplication pattern [36, 35, 43]. CDStore  realizes CE in existing secret sharing algorithms by replacing the embedded random seed with a message-derived hash to construct shares. REED focuses on the applied aspect, and complements the above designs by enabling rekeying in encrypted deduplication storage.
Abdalla et al.  rigorously analyze key-derivation methods, in which a sequence of subkeys is derived from a shared master key so as to extend the lifetime of the master key for secure communication. Follow-up studies examine key derivation (in either key rotation or key regression) in content distribution networks [30, 10, 39] and cloud storage . A recent work  examines ciphertext re-encryption using an approach similar to REED's, in that it performs AONT on files and updates a small piece of the AONT package, yet it does not consider deduplication and has no prototype demonstrating its applicability. REED differs from the above approaches by addressing the rekeying problem in encrypted deduplication storage. REED also uses the key regression scheme  in key derivation to enable lazy revocation.
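Key regression can be illustrated with a hash-chain construction similar in spirit to the KR-SHA1 scheme of Fu et al. ; the Python sketch below (using SHA-256, with function names of our own choosing) is illustrative rather than REED's exact instantiation. The point is the asymmetry that enables lazy revocation: a member holding the state for version i can unwind the chain to derive all keys for versions at most i, but cannot compute any future state.

```python
import hashlib

def make_key_chain(master_secret: bytes, max_versions: int):
    """Precompute member states s_0..s_n with s_{i-1} = H(s_i).

    Only the owner holds master_secret; distributing states[i] lets a
    member derive keys for versions <= i but not beyond (illustrative
    hash-chain key regression, not REED's exact scheme).
    """
    states = [master_secret]
    for _ in range(max_versions):
        states.append(hashlib.sha256(states[-1]).digest())
    states.reverse()  # states[i] is the member state for version i
    return states

def key_for_version(state: bytes, state_version: int, target: int) -> bytes:
    """Unwind the chain from state_version down to target, then derive a key."""
    assert target <= state_version, "cannot derive future keys"
    s = state
    for _ in range(state_version - target):
        s = hashlib.sha256(s).digest()
    return hashlib.sha256(b'key|' + s).digest()
```

After a rekeying that advances the version, a newly authorized user receives only the latest state and can still read old data, while a revoked user's stale state cannot produce the new key.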
REED is related to secure deletion (see detailed surveys [26, 54]), which ensures that securely deleted data is permanently inaccessible by anyone. Secure deletion can be achieved through cryptographic deletion (e.g., [20, 50]), which securely erases keys in order to make encrypted data unrecoverable. REED builds on the AONT-based cryptographic deletion  and preserves deduplication effectiveness. It further allows efficient dynamic access control.
Exploiting workload characteristics:
Some studies address deduplication performance by exploiting workload characteristics, including chunk locality [71, 41], similarity [44, 17, 27, 31] and a combination of both . REED is motivated from a security perspective, and uses the similarity-based approach in Extreme Binning  to mitigate MLE key generation overhead.
9 Conclusion
We present REED, an encrypted deduplication storage system that aims for secure and lightweight rekeying. The core rekeying design of REED is to renew the key of a deterministic all-or-nothing transform (AONT) package. We propose two encryption schemes for REED: the basic scheme has higher encryption performance, while the enhanced scheme is resilient against key leakage. We extend REED with dynamic access control by integrating both CP-ABE and key regression primitives. We show the confidentiality and integrity properties of REED under our security definitions. Furthermore, we propose a similarity-based approach to mitigate the MLE key generation overhead of REED. Finally, we implement a REED prototype and conduct trace-driven evaluation in a LAN testbed to demonstrate its performance and storage efficiency. In future work, we plan to address the open issues of our current REED design (see Section 4.6), investigate how to mitigate the storage overhead of stub data (see Section 7.2), and evaluate how REED performs for other storage workloads.
References
- FSL traces and snapshots public archive. http://tracer.filesystems.org/, 2014.
- Martín Abadi, Dan Boneh, Ilya Mironov, Ananth Raghunathan, and Gil Segev. Message-locked encryption for lock-dependent messages. In Proc. of CRYPTO, 2013.
- Michel Abdalla and Mihir Bellare. Increasing the lifetime of a key: A comparative analysis of the security of re-keying techniques. In Proc. of ASIACRYPT, 2000.
- Atul Adya, William J. Bolosky, Miguel Castro, Gerald Cermak, Ronnie Chaiken, John R. Douceur, Jon Howell, Jacob R. Lorch, Marvin Theimer, and Roger P. Wattenhofer. Farsite: Federated, available, and reliable storage for an incompletely trusted environment. In Proc. of USENIX OSDI, 2002.
- Amazon. Architecting for genomic data security and compliance in AWS, 2014.
- Paul Anderson and Le Zhang. Fast and secure laptop backups with encrypted de-duplication. In Proc. of USENIX LISA, 2010.
- Frederik Armknecht, Jens-Matthias Bohli, Ghassan O. Karame, and Franck Youssef. Transparent data deduplication in the cloud. In Proc. of ACM CCS, 2015.
- Giuseppe Ateniese, Randal Burns, Reza Curtmola, Joseph Herring, Lea Kissner, Zachary Peterson, and Dawn Song. Provable data possession at untrusted stores. In Proc. of ACM CCS, 2007.
- Giuseppe Ateniese, Kevin Fu, Matthew Green, and Susan Hohenberger. Improved proxy re-encryption schemes with applications to secure distributed storage. ACM Trans. Inf. Syst. Secur., 9(1):1–30, February 2006.
- Michael Backes, Christian Cachin, and Alina Oprea. Secure key-updating for lazy revocation. In Proc. of ESORICS, 2006.
- Elaine Barker, William Barker, William Burr, William Polk, and Miles Smid. NIST Special Publication 800-57 recommendation for key management. Technical report, National Institute of Standards & Technology, July 2012.
- Mihir Bellare and Sriram Keelveedhi. Interactive message-locked encryption and secure deduplication. In Proc. of PKC, 2015.
- Mihir Bellare, Sriram Keelveedhi, and Thomas Ristenpart. DupLESS: Server-aided encryption for deduplicated storage. In Proc. of USENIX Security, 2013.
- Mihir Bellare, Sriram Keelveedhi, and Thomas Ristenpart. Message-locked encryption and secure deduplication. In Proc. of EUROCRYPT, 2013.
- John Bethencourt, Amit Sahai, and Brent Waters. Ciphertext-policy attribute-based encryption. In IEEE S&P, 2007.
- John Bethencourt, Amit Sahai, and Brent Waters. CP-ABE toolkit. http://acsc.cs.utexas.edu/cpabe/, 2011.
- Deepavali Bhagwat, Kave Eshghi, Darrell D.E. Long, and Mark Lillibridge. Extreme binning: Scalable, parallel deduplication for chunk-based file backup. In Proc. of IEEE MASCOTS, 2009.
- John Black. Compare-by-hash: a reasoned analysis. In Proc. of USENIX ATC, 2006.
- Dan Boneh, Craig Gentry, and Brent Waters. Collusion resistant broadcast encryption with short ciphertexts and private keys. In Proc. of CRYPTO, 2005.
- Dan Boneh and Richard Lipton. A revocable backup system. In Proc. of USENIX Security, 1996.
- Dan Boneh, Ben Lynn, and Hovav Shacham. Short signatures from the Weil pairing. In Proc. of ASIACRYPT, 2001.
- Andrei Z. Broder. On the resemblance and containment of documents. In Proc. of IEEE Compression and Complexity of Sequences, 1997.
- Landon P. Cox, Christopher D. Murray, and Brian D. Noble. Pastiche: Making backup cheap and easy. In Proc. of USENIX OSDI, 2002.
- Dick Csaplar. Building business resilience through active archiving, 2011.
- Debian Security Advisory. DSA-1571-1 openssl – predictable random number generator. https://www.debian.org/security/2008/dsa-1571, May 2008.
- Sarah M. Diesburg and An-I Andy Wang. A survey of confidential data storage and deletion methods. ACM Comput. Surv., 43(1):2:1–2:37, December 2010.
- Wei Dong, Fred Douglis, Kai Li, and Hugo Patterson. Tradeoffs in scalable data routing for deduplication clusters. In Proc. of USENIX FAST, 2011.
- John R. Douceur, Atul Adya, William J. Bolosky, Dan Simon, and Marvin Theimer. Reclaiming space from duplicate files in a serverless distributed file system. In Proc. of IEEE ICDCS, 2002.
- Yitao Duan. Distributed key generation for encrypted deduplication: Achieving the strongest privacy. In Proc. of ACM CCSW, 2014.
- Kevin Fu, Seny Kamara, and Tadayoshi Kohno. Key regression: Enabling efficient key distribution for secure distributed storage. In Proc. of NDSS, 2006.
- Yinjin Fu, Hong Jiang, and Nong Xiao. A scalable inline cluster deduplication framework for big data protection. In Proc. of Middleware, 2012.
- Shafi Goldwasser and Mihir Bellare. Lecture notes on cryptography. https://cseweb.ucsd.edu/~mihir/papers/gb.html, July 2008.
- Google. Google genomics. https://cloud.google.com/genomics/, 2016.
- Vipul Goyal, Omkant Pandey, Amit Sahai, and Brent Waters. Attribute-based encryption for fine-grained access control of encrypted data. In Proc. of ACM CCS, 2006.
- Shai Halevi, Danny Harnik, Benny Pinkas, and Alexandra Shulman-Peleg. Proofs of ownership in remote storage systems. In Proc. of ACM CCS, 2011.
- Danny Harnik, Benny Pinkas, and Alexandra Shulman-Peleg. Side channels in cloud services: Deduplication in cloud storage. IEEE Security & Privacy, 8(6):40–47, 2010.
- Keren Jin and Ethan L. Miller. The effectiveness of deduplication on virtual machine disk images. In Proc. of ACM SYSTOR, 2009.
- Ari Juels and Burton S. Kaliski, Jr. PORs: Proofs of retrievability for large files. In Proc. of ACM CCS, 2007.
- Mahesh Kallahalla, Erik Riedel, Ram Swaminathan, Qian Wang, and Kevin Fu. Plutus: Scalable secure file sharing on untrusted storage. In Proc. of USENIX FAST, 2003.
- Dan Kaminsky. These are not the certs you’re looking for. http://dankaminsky.com/2011/08/31/notnotar/, Aug 2011.
- Erik Kruus, Cristian Ungureanu, and Cezary Dubnicki. Bimodal content defined chunking for backup streams. In Proc. of USENIX FAST, 2010.
- Jingwei Li, Chuan Qin, Patrick P. C. Lee, and Jin Li. Rekeying for encrypted deduplication storage. In IEEE/IFIP DSN, 2016.
- Mingqiang Li, Chuan Qin, and Patrick P. C. Lee. CDStore: Toward reliable, secure, and cost-efficient cloud storage via convergent dispersal. In Proc. of USENIX ATC, 2015.
- Mark Lillibridge, Kave Eshghi, Deepavali Bhagwat, Vinay Deolalikar, Greg Trezise, and Peter Camble. Sparse indexing: Large scale, inline deduplication using sampling and locality. In Proc. of USENIX FAST, 2009.
- Jian Liu, N. Asokan, and Benny Pinkas. Secure deduplication of encrypted data without additional independent servers. Cryptology ePrint Archive: Report 2015/455, August 2015.
- Linda Musthaler. Cloud encryption: Control your own keys in a separate storage vault. http://www.networkworld.com/article/2170564/cloud-computing/cloud-encryption-control-your-own-keys-in-a-separate-storage-vault.html, 2013.
- National Institutes of Health. NIH security best practices for controlled-access data subject to the NIH genomic data sharing policy, 2015.
- NetApp. NetApp deduplication helps Duke Institute for Genome Sciences and Policy reduce storage requirements for genomic information by 83 percent. http://www.netapp.com/us/company/news/press-releases/news-rel-20081008.aspx, 2008.
- OpenSSL. https://www.openssl.org, 2015.
- Zachary N. J. Peterson, Randal Burns, Joe Herring, Adam Stubblefield, and Aviel D. Rubin. Secure deletion for a versioning file system. In Proc. of USENIX FAST, 2005.
- Krishna PN Puttaswamy, Christopher Kruegel, and Ben Y Zhao. Silverline: toward data confidentiality in storage-intensive cloud applications. In Proc. of ACM SoCC, 2011.
- Michael O. Rabin. Fingerprinting by random polynomials. Center for Research in Computing Technology, Harvard University. Tech. Report TR-CSE-03-01, 1981.
- A. Rahumed, H.C.H. Chen, Yang Tang, P. P. C. Lee, and J.C.S. Lui. A secure cloud backup system with assured deletion and version control. In Proc. of IEEE ICPPW, Sept 2011.
- Joel Reardon, David Basin, and Srdjan Capkun. SoK: Secure data deletion. In Proc. of IEEE S&P, 2013.
- Jason K. Resch and James S. Plank. AONT-RS: Blending security and performance in dispersed storage systems. In Proc. of USENIX FAST, 2011.
- Ronald L. Rivest. All-or-nothing encryption and the package transform. In Proc. of FSE, 1997.
- Peter Shah and Won So. Lamassu: Storage-efficient host-side encryption. In Proc. of USENIX ATC, 2015.
- Victor Shoup. Practical threshold signatures. In Proc. of EUROCRYPT, 2000.
- Alexander Sotirov, Marc Stevens, Jacob Appelbaum, Arjen Lenstra, David Molnar, Dag Arne Osvik, and Benne de Weger. MD5 considered harmful today. http://www.win.tue.nl/hashclash/rogue-ca/, Dec 2008.
- Lincoln D Stein. The case for cloud computing in genome informatics. Genome Biology, 2010.
- Mark W. Storer, Kevin Greenan, Darrell D.E. Long, and Ethan L. Miller. Secure data deduplication. In Proc. of ACM StorageSS, 2008.
- Zhu Sun, Geoff Kuenning, Sonam Mandal, Philip Shilane, Vasily Tarasov, Nong Xiao, and Erez Zadok. A long-term user-centric analysis of deduplication patterns. In Proc. of IEEE MSST, 2016.
- U.S. Computer Emergency Readiness Team. OpenSSL ‘heartbleed’ vulnerability (CVE-2014-0160). https://www.us-cert.gov/ncas/alerts/TA14-098A, April 2014.
- Grant Wallace, Fred Douglis, Hangwei Qian, Philip Shilane, Stephen Smaldone, Mark Chamness, and Windsor Hsu. Characteristics of backup workloads in production systems. In Proc. of USENIX FAST, 2012.
- Dai Watanabe and Masayuki Yoshino. Key update mechanism for network storage of encrypted data. In Proc. of IEEE CloudCom, 2013.
- A. F. Webster and S. E. Tavares. On the design of S-boxes. In Proc. of CRYPTO, 1985.
- Zooko Wilcox-O’Hearn and Brian Warner. Tahoe: The least-authority filesystem. In Proc. of ACM StorageSS, 2008.
- Wen Xia, Hong Jiang, Dan Feng, and Yu Hua. SiLo: A similarity locality based near exact deduplication scheme with low ram overhead and high throughput. In Proc. of USENIX ATC, 2011.
- Yifeng Zheng, Xingliang Yuan, Xinyu Wang, Jinghua Jiang, Cong Wang, and Xiaolin Gui. Enabling encrypted cloud media center with secure deduplication. In Proc. of ACM ASIACCS, 2015.
- Yukun Zhou, Dan Feng, Wen Xia, Min Fu, Fangting Huang, Yucheng Zhang, and Chunguang Li. SecDep: A user-aware efficient fine-grained secure deduplication scheme with multi-level key management. In Proc. of IEEE MSST, 2015.
- Benjamin Zhu, Kai Li, and R Hugo Patterson. Avoiding the disk bottleneck in the data domain deduplication file system. In Proc. of USENIX FAST, 2008.