Bandwidth-efficient Storage Services for Mitigating Side Channel Attack
Data deduplication is able to effectively identify and eliminate redundant data and only maintain a single copy of files and chunks. Hence, it is widely used in cloud storage systems to save storage space and network bandwidth. However, the occurrence of deduplication can be easily identified by monitoring and analyzing network traffic, which leads to the risk of user privacy leakage. The attacker can carry out a very dangerous side channel attack, i.e., learn-the-remaining-information (LRI) attack, to reveal users’ privacy information by exploiting the side channel of network traffic in deduplication. Existing work addresses the LRI attack at the cost of the high bandwidth efficiency of deduplication. In order to address this problem, we propose a simple yet effective scheme, called randomized redundant chunk scheme (RRCS), to significantly mitigate the risk of the LRI attack while maintaining the high bandwidth efficiency of deduplication. The basic idea behind RRCS is to add randomized redundant chunks to mix up the real deduplication states of files used for the LRI attack, which effectively obfuscates the view of the attacker, who attempts to exploit the side channel of network traffic for the LRI attack. Our security analysis shows that RRCS could significantly mitigate the risk of the LRI attack. We implement the RRCS prototype and evaluate it by using three large-scale real-world datasets. Experimental results demonstrate the efficiency and efficacy of RRCS.
According to the International Data Corporation (IDC) report, the amount of worldwide digital data created and replicated reached 4.4 zettabytes in 2013 and is expected to exceed 44 zettabytes in 2020. IDC analysis also shows that nearly data has a copy, which indicates a large amount of data redundancy in our digital world. Moreover, Microsoft Research collected 162 TB of file data from 857 desktop computers and found nearly duplicate data in personal data and nearly duplicate data in the data shared among users. This data redundancy causes large and inefficient consumption of storage capacity and network bandwidth in distributed file and storage systems.
In order to save network bandwidth and storage space, data deduplication [4, 5, 6] identifies data redundancy and maintains a single copy of files or chunks. It has been widely used in various fields, such as cloud storage services [7, 8, 9] and Redundancy Elimination (RE) in networks [10, 11]. In general, deduplication may occur either at the source (client) or the target (server). In source-based deduplication, the fingerprints of files (or chunks) are uploaded to the server before the files (or chunks) themselves. If the fingerprints exist in the index of the server, the corresponding files will not be uploaded. In target-based deduplication, on the other hand, the files are uploaded directly to the server and then deduplicated. The former obtains both bandwidth and storage savings, while the latter only saves storage space. Moreover, duplicates can be detected among the files owned by a single user or across users. Single-user deduplication only identifies redundant data within a single user’s files. Cross-user deduplication, applied on top of single-user deduplication, identifies more redundant data among users and thus obtains significant space savings, which is why it has been widely used in current cloud storage systems [12, 13].
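The difference between the two upload modes can be sketched in a few lines. This is a minimal illustration, not the system's actual protocol: the `Server` class, the in-memory index, and the use of SHA-256 as the fingerprint function are all assumptions made for the example.

```python
import hashlib


def fingerprint(chunk: bytes) -> str:
    """Content fingerprint of a chunk (SHA-256 is an assumption of this sketch)."""
    return hashlib.sha256(chunk).hexdigest()


class Server:
    """Hypothetical in-memory deduplicating server."""

    def __init__(self):
        self.index = {}  # fingerprint -> chunk data

    def missing(self, fingerprints):
        """Return the fingerprints that are not yet in the index."""
        return [fp for fp in fingerprints if fp not in self.index]

    def store(self, chunk: bytes):
        # Storing an already known chunk keeps a single copy.
        self.index[fingerprint(chunk)] = chunk


def source_based_upload(server: Server, chunks) -> int:
    """Send fingerprints first, then only the chunks the server lacks.

    Returns the number of bytes of chunk data sent over the network.
    """
    fps = [fingerprint(c) for c in chunks]
    needed = set(server.missing(fps))
    sent = 0
    for chunk, fp in zip(chunks, fps):
        if fp in needed:
            server.store(chunk)
            sent += len(chunk)
            needed.discard(fp)  # do not resend duplicates within the same file
    return sent


def target_based_upload(server: Server, chunks) -> int:
    """Send every chunk; the server deduplicates after receiving it."""
    sent = 0
    for chunk in chunks:
        server.store(chunk)
        sent += len(chunk)
    return sent
```

In this sketch a repeated source-based upload of the same chunks sends no chunk data at all, which is exactly the observable difference in traffic that the side channel exploits; the target-based variant always sends everything and saves storage only.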
Although cross-user source-based deduplication significantly improves storage and bandwidth utilization, the occurrence of deduplication can be easily identified by monitoring and analyzing network traffic, which leads to the risk of user privacy leakage. The attacker can carry out a very dangerous side channel attack, i.e., the learn-the-remaining-information (LRI) attack, to obtain users’ private information by exploiting the side channel of network traffic in deduplication, which is detailed in Section II-A. Harnik et al. performed tests and found that the LRI attack can occur in popular cloud storage services such as Dropbox and Mozy. Unfortunately, the LRI attack in deduplication is difficult to address due to the following challenges.
The Limitations of Using CE or MLE. To protect data confidentiality in deduplication, convergent encryption (CE) is used to encrypt data. CE, proposed by Douceur et al., uses the hash of a file as the key to encrypt the file, so that identical files always generate identical ciphertexts. Thus deduplication can be performed over the encrypted data. Bellare et al. formalize CE and its variants as a cryptographic primitive, called message-locked encryption (MLE). However, even if data are encrypted by CE/MLE in cryptographic deduplication systems, the risk of the LRI attack remains, because the attacker can always exploit the side channel of network traffic to perceive whether deduplication occurs, without probing the data themselves.
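The determinism that makes CE compatible with deduplication can be shown with a toy example. The sketch below is illustrative only: the key is the SHA-256 hash of the plaintext, and the "cipher" is a simple XOR with a SHAKE-256 keystream chosen to keep the example dependency-free. It is an assumption of this example, not a secure construction and not the cipher used by any cited scheme.

```python
import hashlib


def convergent_key(plaintext: bytes) -> bytes:
    """Derive the key from the content itself, as CE does."""
    return hashlib.sha256(plaintext).digest()


def toy_cipher(data: bytes, key: bytes) -> bytes:
    """XOR with a key-derived keystream (toy stand-in for a real cipher).

    Applying it twice with the same key returns the original data.
    """
    keystream = hashlib.shake_256(key).digest(len(data))
    return bytes(a ^ b for a, b in zip(data, keystream))


def ce_encrypt(plaintext: bytes):
    """Return (key, ciphertext); equal plaintexts give equal ciphertexts."""
    key = convergent_key(plaintext)
    return key, toy_cipher(plaintext, key)
```

Because two users holding the same file derive the same key and therefore the same ciphertext, the server can still deduplicate, and the amount of uploaded traffic still reveals whether deduplication occurred.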
Deduplication Inefficiency. There are two baseline solutions to defend against the LRI attack. The first solution is to use encryption to avoid cross-user deduplication. Before uploading files to the cloud server, a client encrypts the files using the user’s personal key, so that duplicate files across users produce different ciphertexts under different keys. This solution prevents cross-user deduplication in the server, but substantially increases bandwidth and storage overheads. The second solution is to perform target-based deduplication. Files are uploaded directly to the server and then deduplicated. Compared with source-based deduplication, this solution saves no bandwidth and only reduces the storage overhead. Both solutions substantially decrease the deduplication efficiency. Hence, it is nontrivial to defend against the LRI attack while ensuring the deduplication efficiency.
Several schemes have been proposed to defend against the LRI attack. Harnik et al.  propose the randomized threshold solution (RTS). However, RTS causes huge bandwidth overhead due to uploading redundant data, and has the risk of leaking privacy with a certain probability. Heen et al.  propose a gateway-based deduplication model that has to use a gateway (i.e., home router) as the third entity in deduplication systems to improve the resistance to the LRI attack. However, the solution needs an extra gateway provided by the Network Service Provider , which is not always possible in practical settings.
To address the challenges, this paper proposes a bandwidth-efficient scheme, i.e., RRCS, for mitigating the risk of the LRI attack in cloud storage services while maintaining the high bandwidth efficiency of deduplication. By carefully adding randomized chunk-level redundancy for each uploaded file, RRCS can mix up the real deduplication states of files used for the LRI attack, and effectively obfuscate the view of the attacker, who attempts to exploit the side channel of network traffic for the LRI attack. Moreover, a flag-based implementation scheme is introduced to allow the server to quickly identify the redundant chunks added by RRCS at low cost. In summary, the main contributions of this paper include:
Substantially Mitigating the Risk of the LRI Attack. In RRCS, when a client uploads the non-duplicate chunks of a file to the server, a small number of redundant chunks are also uploaded, which obfuscates the attacker’s view of the network traffic. The number of redundant chunks is chosen at random. The randomness of redundant chunks in RRCS mixes up the real deduplication states of files to defend against the LRI attack. Our security analysis demonstrates that RRCS can significantly reduce the risk of the LRI attack.
Ensuring the High Efficiency of Deduplication. RRCS uploads only a small number of redundant chunks to defend against the LRI attack, which preserves the high efficiency of deduplication. To further improve the efficiency, we deduplicate the single-user duplicate files (which carry no security risk) in the client and propose a flag-based scheme to help the server quickly identify the redundant chunks added by RRCS at a low cost. Our experimental results based on three large-scale real-world datasets show that RRCS incurs much less bandwidth overhead than RTS.
Prototype Implementation and Real-world Evaluation. We have implemented the RRCS prototype in a deduplication system. We examine the performance of RRCS using multiple real-world datasets, including Fslhomes, MacOS, and Onefull. Extensive experimental results demonstrate the efficiency of RRCS.
II Background and Motivation
II-A Learn-the-Remaining-Information Attack in Deduplication
The occurrence of deduplication can be easily identified by monitoring and analyzing network traffic, which leads to the risk of user privacy leakage. The attacker can carry out a very dangerous side channel attack, i.e., learn-the-remaining-information (LRI) attack, to reveal users’ privacy information by exploiting the side channel of network traffic in deduplication.
The LRI Attack: The attacker knows a large part of the target file in the cloud and tries to learn the remaining unknown parts of the target file via uploading all possible versions of the file’s content, i.e, files. As shown in Figure 1, the attacker knows all the contents of the target file except the sensitive information . To learn the sensitive information, the attacker needs to upload files () with all possible values of (), respectively. If a file with the value is deduplicated and other files are not, the attacker knows that the information .
Note that the attacker knows that, for the files, only one file is the same as the file and the remaining files are similar to the file since only a small part of their contents differs from file . The differing parts of their contents are the sensitive information, such as the PIN, the password of a bank account, and the salary number, which can usually be represented as a small number of bits and easily fits within one chunk (about 8 KB) at the chunk level.
We use an example to show how the LRI attack is used to obtain the private information of other users in practice. Alice and Bob belong to the same company. Alice knows Bob’s employee number and other information about Bob. Salaries at the company range from 5,000 to 15,000 and are multiples of 1,000. If Alice wants to know Bob’s salary, she can back up 11 versions of the payroll (one for each multiple of 1,000 from 5,000 to 15,000), each with Bob’s name and Bob’s employee number, to the same server in which Bob has backed up his payroll. Alice then learns that Bob’s salary is the one in the payroll version for which deduplication occurs.
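The payroll example can be simulated directly. The sketch below is a simplified illustration that assumes file-level, cross-user, source-based deduplication; the payroll format, the employee number `1234`, and Bob's salary of 9,000 are hypothetical values invented for the example.

```python
import hashlib


def fp(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def payroll(salary: int) -> bytes:
    # Hypothetical payroll format; only the salary is unknown to the attacker.
    return f"name=Bob;employee_no=1234;salary={salary}".encode()


def upload_traffic(server_index: set, file_bytes: bytes) -> int:
    """Bytes sent for one upload under cross-user source-based deduplication."""
    f = fp(file_bytes)
    if f in server_index:
        return 0  # duplicate detected: nothing is transmitted
    server_index.add(f)
    return len(file_bytes)


def lri_attack(server_index: set, candidates):
    """Upload every candidate and return the one that causes no traffic."""
    for salary in candidates:
        if upload_traffic(server_index, payroll(salary)) == 0:
            return salary
    return None


# Bob has already backed up his payroll (the salary value is made up).
server_index = {fp(payroll(9000))}
candidates = range(5000, 15001, 1000)  # the 11 possible salaries
guess = lri_attack(server_index, candidates)
```

Here `guess` ends up equal to Bob's salary, because the matching candidate is the only upload with zero traffic.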
II-B System and Threat Models
We consider a general cloud storage service model that includes two entities, i.e., the user and the cloud storage server. In the threat model of the side channel attack, the attack is launched by users who aim to steal the private information of other users [14, 18, 21]. The attacker can act as a user via its own account or use multiple accounts to pose as multiple users. The cloud storage server communicates with the users through the Internet. The connections from the clients to the cloud storage server are encrypted by the Secure Socket Layer (SSL) or Transport Layer Security (TLS) protocol. Hence, the attacker can monitor and measure the amount of network traffic between the client and server but cannot intercept and analyze the contents of the transmitted data. The attacker can then perform sophisticated traffic analysis with sufficient computing resources. As shown in Figure 2, the user is the victim who has uploaded his/her file with private information to the cloud storage server. The user is the attacker who can upload any number of files to the same cloud storage server. During the file uploads, the user monitors the amount of network traffic to determine the deduplication states of files and then infers the private information in the file uploaded by the user , following the method described in Section II-A.
In summary, like existing work [14, 18, 21] on side channel attacks, this paper mainly focuses on the side channel of traffic information. Thus the attacker can only infer/probe private information by observing the amount of network traffic between the client and server. The variants of the deduplication detection method are discussed in detail in Section IV-A.
Footnote 1: Note that if the attacker has the ability to control the SSL encryption or perform memory sniffing, etc., a new kind of attack can be formed, whereby the attacker could potentially obtain the deduplication state of a file. However, such an attack is much harder than exploiting the side channel of traffic information, and is beyond the scope of the threat models we consider.
II-C The Related Work Addressing the LRI Attack
The security issues of cross-user deduplication in cloud storage services have been widely studied, including data confidentiality [17, 15, 24], side channel attacks [14, 18], and the proofs of ownership . Convergent encryption  is proposed to ensure the data confidentiality in deduplication systems. However, even with data encryption, deduplication still leaks the sensitive information of users via the LRI attack [14, 18]. Existing work addressing the LRI attack can be divided into two categories.
The first category is based on a special deduplication system model, i.e., the gateway-based system model. The model consists of three entities, i.e., the user, the gateway provided by the Network Service Provider, and the storage server. Heen et al. assume that the gateway is installed in the attacker’s home network, and propose to use the gateway to mix up the traffic of the cloud storage service with that of other services. Shin et al. assume that the gateway is shared by multiple users, and propose to leverage the gateway to mix up the traffic among the multiple users. These solutions prevent the attacker from learning the occurrence of deduplication by monitoring the network traffic of clients, thus improving the resistance to the LRI attack. However, an extra gateway provided by the Network Service Provider is needed, which is not always possible in practical settings.
The second category addresses the LRI attack in the general deduplication system model including two entities, i.e., the user and the storage server. The general system model is widely used in current cloud storage systems [7, 8, 9]. Harnik et al.  propose the randomized threshold solution (RTS). For each file , the server sets a threshold which is chosen uniformly from the range at random ( might be a public parameter). The server keeps a counter to count the number of previously uploaded copies of file . When a new copy of file is uploaded, RTS checks the counter . If is smaller than , the file is uploaded and deduplicated in the server. Otherwise it is deduplicated in the client. Harnik et al. show that RTS has a risk of privacy leakage with probability . Because is chosen uniformly at random, when , the attacker uploads one copy of file and can learn that deduplication occurs. Moreover, RTS assigns thresholds to all files which consumes high bandwidth overhead in the practical deduplication (detailed in Section V).
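The threshold mechanism of RTS described above can be sketched as follows. This is a hedged illustration rather than Harnik et al.'s implementation: the bounds of the threshold range are not given in the text above, so the sketch takes them as hypothetical parameters `low` and `high`, and the class and method names are likewise invented.

```python
import random


class RTSServer:
    """Toy randomized-threshold server (illustrative names and parameters)."""

    def __init__(self, low: int, high: int, rng=None):
        self.low, self.high = low, high      # assumed threshold range
        self.rng = rng or random.Random()
        self.threshold = {}                  # file id -> secret threshold
        self.copies = {}                     # file id -> copies uploaded so far

    def upload(self, file_id: str, size: int) -> int:
        """Return the number of bytes the client must transmit."""
        if file_id not in self.threshold:
            # The threshold is drawn once per file, uniformly at random.
            self.threshold[file_id] = self.rng.randint(self.low, self.high)
        count = self.copies.get(file_id, 0)
        self.copies[file_id] = count + 1
        if count < self.threshold[file_id]:
            return size   # full upload; deduplicated at the server
        return 0          # deduplicated at the client
```

The sketch makes the bandwidth cost visible: every file, duplicate or not, is transmitted in full until its counter reaches the secret threshold.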
From the identification granularity of the duplicate data, the deduplication is divided into two categories, i.e., file-level and chunk-level deduplication. Specifically, file-level deduplication considers the whole file as a unit to eliminate redundant data. Chunk-level deduplication divides the entire file into chunks (fixed-sized  or variable-sized [5, 26]), and then considers the chunk as a unit to eliminate redundant data. Compared with file-level deduplication, the chunk-level deduplication not only identifies the identical files, but also eliminates the identical chunks among the similar files. Consequently, chunk-level deduplication can obtain higher deduplication ratio, and thus has been widely used in backup systems [5, 20, 4] and cloud storage systems [12, 27, 13].
For file-level deduplication, there are two deduplication states for a file in a given storage system, i.e., duplicate and non-duplicate. In the former case, the client does not upload the file detected as a duplicate. In the latter case, the client needs to upload the non-duplicate file. In the LRI attack, for files, only the file with correct sensitive information is the same as the target file and thus not uploaded. Other files with incorrect information are uploaded. If we want to mix up the deduplication states of the file and other files to defend against the LRI attack, we need to upload the whole file regardless of whether deduplication occurs, like RTS, which incurs high bandwidth overhead.
This paper focuses on defending against the LRI attack in chunk-level deduplication. Chunk-level deduplication deals with duplicate files based on their redundancy level. Specifically, there are three deduplication states for a file: (1) Full deduplication (). A client uploads a file to the server. If an existing file is completely identical to the file , will be deduplicated without needing to be uploaded. (2) Partial deduplication (). A file in the server is similar (partially identical) to the file to be uploaded, meaning that they share some duplicate chunks. The client only uploads the non-duplicate chunks. (3) No deduplication (). If no identical/similar files exist in the server, the whole file needs to be uploaded.
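The three states can be derived from a file's chunk fingerprints and the server's index. The following is a minimal sketch under the assumption that a file is given as a non-empty list of chunk fingerprints; the enum and function names are hypothetical.

```python
from enum import Enum


class DedupState(Enum):
    FULL = "full deduplication"
    PARTIAL = "partial deduplication"
    NONE = "no deduplication"


def classify(file_fingerprints, server_index):
    """Return (state, non-duplicate fingerprints) for a non-empty file."""
    non_dup = [fp for fp in file_fingerprints if fp not in server_index]
    if not non_dup:
        state = DedupState.FULL       # every chunk already exists
    elif len(non_dup) < len(file_fingerprints):
        state = DedupState.PARTIAL    # some chunks are shared
    else:
        state = DedupState.NONE       # nothing is shared
    return state, non_dup
```

Without any countermeasure, the upload traffic equals the total size of the `non_dup` chunks, so the state is directly visible to anyone who measures that traffic.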
As shown in Figure 3, in the LRI attack, for the files, the file with correct sensitive information is completely identical to the target file , i.e, , whose uploading traffic is zero. Other files have duplicate chunks and one non-duplicate chunk with the value (as described in Section II-A), belonging to , whose uploading traffics are equal to one-chunk size. To defend against the LRI attack, we can explore leveraging chunk-level redundancy rather than the whole-file redundancy, to mix up the deduplication states of the file and other files via uploading some redundant chunks in each file. If the number of the redundant chunks is set at random, the attacker using the side channel, i.e., traffic information, would be effectively prevented from accurately distinguishing the file with correct sensitive information from the files used for the LRI attack.
III Design and Implementation
In this section, we first demonstrate that using deterministic chunk-level redundancy fails to mitigate the risk of the LRI attack. We then present the Randomized Redundant Chunk Scheme (RRCS) which explores and exploits random chunk-level redundancy to mitigate the risk of the LRI attack.
III-A Deterministic Chunk-level Redundancy
As described in Section II-D, for the files used for the LRI attack, the uploading traffic of the file with correct sensitive information is zero and the uploading traffics of the other files are the size of one chunk. To mix up the files in terms of the uploading traffic, a simple solution is to add a fixed number of redundant chunks to ensure that the traffic of each file is always at least one-chunk size. Specifically, for a file with non-duplicate chunks, we upload its non-duplicate chunks. For a file without non-duplicate chunks, i.e., the whole file is duplicate, we randomly choose one chunk of the file to upload. Thus one chunk is uploaded for in the solution. Hence, the files are indistinguishable in terms of uploading traffic, since the traffic of each file is equal to the size of one chunk.
However, this solution is easily broken. The attacker can append one non-duplicate chunk to each file to break the solution, as shown in Figure 4. The non-duplicate chunk can be randomly generated. Since the average chunk size is about 8 KB, a randomly generated chunk is unlikely to exist in the server because an 8 KB chunk can take 2^65536 possible values. By doing so, the traffic of is the size of one chunk and the traffics of other files are the total size of two chunks. Thus with correct sensitive information is easily identified from the files according to the traffic.
To enhance the simple solution, we can add more redundant chunks to ensure that the traffic of each file is always more than the size of chunks (), e.g., . However, the attacker can also append more than non-duplicate chunks in each file. The traffic of is the size of one chunk less than the traffics of other files, which breaks the enhanced solution. We name the method that appends one or multiple non-duplicate chunks in each file to assist the LRI attack as Appending Chunks Attack (ACA). In summary, using deterministic chunk-level redundancy fails to mitigate the risk of the LRI attack.
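The failure of deterministic padding can be checked with a small model. The sketch below measures traffic in chunks and assumes the padding rule "send at least `k` chunks"; the parameter name `k` and both function names are hypothetical.

```python
def deterministic_upload_count(non_dup: int, k: int = 1) -> int:
    """Chunks sent when every upload is padded up to at least k chunks."""
    return max(non_dup, k)


def attacker_view(appended: int, k: int = 1):
    """Traffic seen for the correct guess and for a wrong guess.

    The correct guess has 0 non-duplicate chunks and a wrong guess has 1;
    the attacker appends `appended` fresh random chunks to every candidate.
    """
    correct = deterministic_upload_count(0 + appended, k)
    wrong = deterministic_upload_count(1 + appended, k)
    return correct, wrong
```

With no appended chunks and `k = 1`, the two values coincide, but once the attacker appends `k` chunks the correct guess always sends exactly one chunk less than a wrong guess, so a fixed amount of padding cannot hide the difference.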
III-B The Randomized Redundant Chunk Scheme
In this subsection, we present the randomized redundant chunk scheme (RRCS). The idea behind RRCS is to explore and exploit randomized chunk-level redundancy to obfuscate the view of the attacker, who attempts to measure the uploading traffics of files for executing the LRI attack.
In RRCS, the basic idea of adding redundant chunks is to choose the number of the redundant chunks from a range uniformly at random. The redundant chunks can be randomly chosen from all the duplicate chunks of the file. The redundant chunks can also be generated by padding random/null characters. However, the size of each redundant chunk should be the chunk size of the file when using fixed-size chunking, and the average chunk size of the file when using variable-size chunking. No matter how the redundant chunks are generated, they can be easily eliminated by the server using the implementation scheme described in Section III-C.
Footnote 2: In current network protocol implementations (e.g., TLS and SSL), a sequence of null characters will be encrypted into a sequence of pseudo-random bits. Hence, the chunks padded with null characters cannot be distinguished in the ciphertext.
III-B1 The Overview of RRCS
RRCS determines the uploaded chunks based on the real deduplication states of files and mixes in the redundant chunks. Figure 5 shows the framework of RRCS. RRCS includes three key function modules: range generation (RG), secure bounds setting (SBS), and security-irrelevant redundancy elimination (SRE). To upload a random number of redundant chunks, RRCS first uses RG to generate a fixed range from which the random number is chosen. However, the fixed range may cause a security issue. SBS adjusts the bounds of the fixed range to avoid the security issue. There may exist security-irrelevant redundant chunks in RRCS. SRE removes the security-irrelevant redundant chunks to improve the deduplication efficiency. The modules are detailed as follows.
III-B2 Range Generation
For each file, RRCS first assigns a range , in which the number of redundant chunks is chosen uniformly at random. is the total number of chunks in a file, which the attacker can obtain by chunking the file using the chunking algorithm. is a parameter assigned by the deduplication system, which might be public. How to set the parameter for the system is a tradeoff between security and bandwidth efficiency, which we discuss in Sections IV and V. If is not an integer, .
Security Analysis for the Range. As described in Section II-A, files () are used for executing the LRI attack, in which the file has the correct sensitive information. We add redundant chunks for file , and is randomly chosen from the range . Thus the number of actually uploaded chunks in is in the range , due to no non-duplicate chunks. The numbers of actually uploaded chunks in other files are in the range , due to one non-duplicate chunk. Hence, the file and other files have different ranges in terms of the uploading traffic, which is not secure enough against the LRI attack. There are two events causing privacy leakage.
If for the file happens to be with probability , the uploading traffic of is zero. Thus the attacker can easily distinguish from the files since the uploading traffic of the other files is always more than zero.
If for all the files with incorrect sensitive information happen to be with probability , the uploading traffics of all the files are equal to the size of chunks. Thus the attacker can determine the remaining one file is since the uploading traffic of is always no more than the size of chunks.
In summary, assigning the same range of the number of the redundant chunks to the files results in the risk of privacy leakage with probability .
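The two leakage events can be quantified under explicit assumptions. The sketch below is not taken from the analysis above: it assumes the number of redundant chunks is drawn independently and uniformly from the integers `0..m` for each of `x` candidate files, where `m` (the upper bound of the shared range) and `x` are hypothetical parameter names, and it computes the probability of each event and of at least one of them occurring.

```python
from fractions import Fraction


def leak_probabilities(m: int, x: int):
    """Return (P[event 1], P[event 2], P[event 1 or event 2]).

    Event 1: the correct file draws 0 redundant chunks (zero traffic).
    Event 2: all x - 1 wrong files draw the maximum m, so each of them
             sends m + 1 chunks, more than the correct file can ever send.
    """
    p_zero = Fraction(1, m + 1)
    p_all_max = Fraction(1, m + 1) ** (x - 1)
    # The draws are independent, so combine via the complement rule.
    p_either = 1 - (1 - p_zero) * (1 - p_all_max)
    return p_zero, p_all_max, p_either
```

For example, `leak_probabilities(4, 3)` evaluates to `(1/5, 1/25, 29/125)` under these assumptions, which shows that the first event dominates unless the range is large.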
III-B3 Secure Bounds Setting
|The total number of chunks in the file|
|The number of non-duplicate chunks after deduplication|
|The number of redundant chunks added by RRCS|
|The number of redundant chunks after eliminating the security-irrelevant redundant chunks|
|The number of actually uploaded chunks|
|The set from which the number of redundant chunks is randomly chosen (one set for full deduplication, one for partial deduplication)|
When happens to be at a bound of the fixed range , the attacker can identify the file with correct sensitive information, resulting in privacy leakage with a certain probability. In the following, we aim to set secure bounds to avoid the privacy leakage.
From the above discussion, we argue that the problem of the bounds can be avoided only when the numbers of actually uploaded chunks in all the files are in the same range. We show how to avoid the problem below. Since the server clearly knows whether each uploaded file is completely identical () or partially identical () to the files in the server, different ranges can be set for different cases. For example, for the file belonging to , is randomly chosen from . For the file belonging to , is randomly chosen from . Thus the number of actually uploaded chunks in which belongs to is in the range . The numbers of actually uploaded chunks in other files which belong to are also in the range .
Overall, we denote that is randomly chosen from the set in the case of , and randomly chosen from the set in the case of . In order to mix up the two deduplication states in the files used for the LRI attack, it is easy to get the equation:
Note that the equation means adding 1 to each element in the set used for partial deduplication to form the set used for full deduplication.
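The effect of shifting one set by 1 can be verified with a short sketch. The concrete bounds are an assumption of this example: the set for partial deduplication is taken to be the integers `0..upper-1`, where `upper` is a hypothetical stand-in for the upper bound assigned by the server, and the set for full deduplication is obtained by adding 1 to each element.

```python
def secure_sets(upper: int):
    """Return (set for full deduplication, set for partial deduplication)."""
    s_partial = set(range(0, upper))        # 0 .. upper - 1 (assumed bounds)
    s_full = {r + 1 for r in s_partial}     # shifted by one: 1 .. upper
    return s_full, s_partial


def uploaded_counts(redundant_set, non_dup: int):
    """Possible numbers of uploaded chunks for a given redundant-chunk set."""
    return {r + non_dup for r in redundant_set}
```

A fully duplicate file (0 non-duplicate chunks) and a similar file with 1 non-duplicate chunk then share exactly the same set of possible upload sizes, and neither can produce zero traffic.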
III-B4 Security-irrelevant Redundancy Elimination
For a file with chunks, due to adding the redundant chunks, the number of uploaded chunks, , is possibly larger than . It is not necessary to upload more than chunks, since the redundant chunks become the security-irrelevant redundant chunks without contributions to the security. We hence upload chunks by reducing the number of redundant chunks, , when is larger than .
III-B5 The RRCS Algorithm
We summarize the RRCS algorithm in Algorithm 1. First, the server assigns the range as the set for a file. RRCS algorithm generates set by the Equation 1: . The two sets are used for two real deduplication states of files, i.e., and , respectively. RRCS algorithm then judges which deduplication state the file belongs to by checking the number of its non-duplicate chunks . means the file is completely identical to a file in the server. RRCS algorithm further configures the set = . Moreover, means the file will be partially identical (similar) to files in the server, and we have the set = . Otherwise, means the file has no duplicate chunks in the server, and we have the set . The number of redundant chunks is randomly chosen from the set . If , RRCS algorithm sets . Otherwise, . Finally, RRCS algorithm generates redundant chunks by padding random/null characters or choosing from the duplicate chunks.
From the RRCS algorithm, we can see that the number of the chunks which need to be uploaded meets . For a special case that a file only has one chunk, i.e., , the file is directly uploaded in RRCS algorithm.
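The steps summarized above can be sketched as a single function. This is an illustrative reading of the algorithm, not the authors' Algorithm 1: the parameter `upper` is a hypothetical integer standing in for the upper bound of the range assigned by the server, the two sets are assumed to be `1..upper` for full deduplication and `0..upper-1` for partial deduplication, and files without any duplicate chunk are assumed to receive no redundant chunks.

```python
import random


def rrcs_redundant_count(n_chunks: int, non_dup: int, upper: int, rng=None) -> int:
    """Number of redundant chunks to add for one file (illustrative sketch).

    n_chunks: total number of chunks in the file
    non_dup:  number of non-duplicate chunks after deduplication
    upper:    assumed upper bound of the range set by the server (>= 1)
    """
    if n_chunks < 1 or not 0 <= non_dup <= n_chunks or upper < 1:
        raise ValueError("invalid arguments")
    rng = rng or random.Random()
    if n_chunks == 1:
        return 1 - non_dup           # a one-chunk file is uploaded directly
    if non_dup == n_chunks:
        return 0                     # no deduplication: upload the whole file
    if non_dup == 0:
        r = rng.randint(1, upper)    # full deduplication
    else:
        r = rng.randint(0, upper - 1)  # partial deduplication
    # Security-irrelevant redundancy elimination: never send more than
    # n_chunks chunks in total.
    return min(r, n_chunks - non_dup)
```

With these choices the total number of uploaded chunks, `non_dup + rrcs_redundant_count(...)`, always lies between 1 and `n_chunks`.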
In this subsection, we present how to implement RRCS in the chunk-level deduplication system.
As shown in Figure 6, in chunk-level deduplication, the real deduplication states of files include full deduplication (), partial deduplication (), and no deduplication () (described in Section II-D). consists of two cases, i.e., single-user duplicate files and cross-user duplicate files. The single-user duplicate file means that a file uploaded by a user is identical to the file previously uploaded by the user, and thus observing the occurrence of for the single-user duplicate file does not cause privacy leakage, as demonstrated in . Hence, RRCS directly deduplicates the single-user duplicate files in the client to obtain the bandwidth savings. The cross-user duplicate file means that a file uploaded by a user is identical to the file previously uploaded by other users. Observing the occurrence of for the cross-user duplicate file can be used to reveal other users’ privacy. Hence, RRCS mixes up the case with using the RRCS algorithm. We directly upload the files occurring in .
RRCS is implemented in the client. The server receives both the non-duplicate chunks and the redundant chunks uploaded by the client. How to efficiently distinguish the redundant chunks and the non-duplicate chunks is a challenge.
To address the problem, we present a flag-based implementation scheme. We modify the deduplication communication protocol by adding one flag bit to the encrypted data packet. The flag bit of a redundant chunk is different from that of a non-duplicate chunk. As shown in Figure 7, the first part of the data packet in the communication protocol is the fingerprint of the uploaded chunk. The second part is the content of the data chunk. We add one flag bit (i.e., the red zone) between the fingerprint and data parts. The flag bits of the redundant and non-duplicate chunks are “1” and “0” respectively. Hence, when receiving the data packets, the server can identify the redundant chunks according to the flag bits.
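The packet layout described above can be sketched as follows. Several details are assumptions of this example rather than the prototype's wire format: the fingerprint is taken to be a 32-byte SHA-256 digest, and the flag occupies a whole byte instead of a single bit to keep the parsing simple.

```python
import hashlib

FP_LEN = 32  # assumed fingerprint length (SHA-256)


def pack(chunk: bytes, redundant: bool) -> bytes:
    """Build a packet: fingerprint | flag | chunk data."""
    flag = b"\x01" if redundant else b"\x00"
    return hashlib.sha256(chunk).digest() + flag + chunk


def unpack(packet: bytes):
    """Split a packet into (fingerprint, is_redundant, data)."""
    fp = packet[:FP_LEN]
    is_redundant = packet[FP_LEN] == 1
    data = packet[FP_LEN + 1:]
    return fp, is_redundant, data


def server_receive(index: dict, packets) -> int:
    """Store non-duplicate chunks and drop flagged ones.

    Returns the number of redundant chunks that were discarded.
    """
    dropped = 0
    for packet in packets:
        fp, is_redundant, data = unpack(packet)
        if is_redundant:
            dropped += 1      # discarded without any index lookup
            continue
        index[fp] = data
    return dropped
```

Since the packet is sent over the encrypted connection, the flag is visible to the server but not to an observer of the traffic, and the server can discard redundant chunks without consulting its fingerprint index.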
IV Security Analysis
In this section, we first discuss all variants of the deduplication detection method and analyze whether the variants are effective in RRCS. We then analyze the security properties of RRCS for the LRI attack.
IV-A The Variants of the Deduplication Detection Method
In order to comprehensively evaluate the solutions in resisting the LRI attack, we first elaborate below the baseline deduplication detection method from the attacker and its possible variants. As shown in Section II-B, the attacker’s goal is to exploit/identify the occurrence of deduplication to launch the LRI attack. To launch the attack, the attacker can pass the file to the client to upload to the deduplication server. By measuring the uploading traffic, i.e., the side channel, the attacker could attempt to infer/probe the occurrence of deduplication. There are several variants of the above detection method, but we show below that those variants can all be reduced to the above baseline detection method. Thus, later in the next subsection we will only focus on the defense of above baseline case.
The variants include: 1) The attacker might upload the same file multiple times. However, only the first upload is useful to the attacker. This is because the file is stored in the server after its first upload (regardless of whether an old copy of the file already existed in the server), and thus all subsequent uploads of the same file are always identical to the attacker’s own file. The same reasoning extends to the case where the attacker uses multiple accounts to pose as multiple users uploading the same file. 2) The attacker can also try to upload a file to the server and then immediately delete it. By repeating the upload and delete operations, the attacker can in theory perform multiple uploads. However, this is not feasible in practice. As pointed out by Harnik et al., many online storage services, such as Dropbox, Memopal and Mozy, keep copies of deleted files for a period of at least 30 days, either for storage resilience or version recovery, so that users can restore past versions. Therefore, each iteration of the attack has to last at least 30 days. The long execution time, together with the fact that the status of the target file in the cloud could easily change during this period due to normal application requests, renders such an attack practically useless to the attacker. Again, only the first uploaded file is useful to the attacker in RRCS.
IV-B Security Strength of RRCS
We first analyze the security of RRCS against the LRI attack in the general case, and then against the LRI attack assisted by the Appending Chunks Attack (presented in Section III-A).
IV-B1 The LRI Attack in the General Case
For the LRI attack, the attacker knows a large part of the targeted file and tries to determine the remaining unknown part by uploading all possible versions of the file's content. Among these candidate files, exactly one is identical to the target file, and the remaining ones are similar to it, since only a small part of their content differs, as described in Section II-A. The differing content is smaller than one data chunk. The attacker observes the client's upload of the candidate files under chunk-level deduplication and measures the upload traffic.
In general, by measuring the upload traffic, the attacker finds that the traffic of one file is zero, while the traffic of each of the other files equals the size of one chunk. The attacker hence confirms that the content of the zero-traffic file is the same as that of the target file.
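The baseline detection described above can be illustrated with a small simulation. The following is a minimal sketch, not the RRCS prototype: it assumes fixed-size chunking, SHA-256 fingerprints, and an in-memory fingerprint set standing in for the server index. The names `PREFIX`, `upload`, and `lri_attack`, as well as the example file content, are hypothetical.

```python
import hashlib

CHUNK_SIZE = 8  # bytes per chunk; tiny on purpose so the example stays readable
PREFIX = b"employee=bob;salary_usd="  # known part of the file, exactly 3 chunks


def chunks(data: bytes, size: int = CHUNK_SIZE) -> list:
    """Split data into fixed-size chunks (the last one may be shorter)."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def upload(data: bytes, stored: set) -> int:
    """Upload with plain source-based chunk-level deduplication.

    Only chunks whose fingerprint is missing from `stored` are sent; the
    return value is the number of chunks sent, i.e., the observable traffic.
    """
    sent = 0
    for chunk in chunks(data):
        fingerprint = hashlib.sha256(chunk).hexdigest()
        if fingerprint not in stored:
            stored.add(fingerprint)
            sent += 1
    return sent


def lri_attack(prefix: bytes, candidates: list, stored: set) -> list:
    """Return the candidate secrets whose upload generated zero traffic."""
    return [secret for secret in candidates
            if upload(prefix + secret, stored) == 0]


if __name__ == "__main__":
    server_index = set()
    upload(PREFIX + b"0042", server_index)  # the victim stores the target file
    guesses = [f"{v:04d}".encode() for v in range(100)]
    print(lri_attack(PREFIX, guesses, server_index))
```

Without any defense, every wrong candidate costs exactly one chunk of traffic and the correct one costs none, which is the signal the attacker relies on.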
To prove that RRCS addresses the LRI attack in the general case, we demonstrate in Theorem 1 that the files used for the attack are indistinguishable in RRCS.
Theorem 1. In the general case, the files used for the LRI attack are indistinguishable from the attacker's view in RRCS.
Proof. Initially, the target file exists in the server. The candidate files () are uploaded for the LRI attack, exactly one of which is identical to the target file. Because RRCS adds randomized redundant chunks, the upload traffic of the identical file equals the size of chunks, and the upload traffic of each of the other files () equals the size of () chunks. Since belongs to and the other files belong to , is randomly chosen from the set , and () are randomly chosen from , as shown in Section III-B3. We thus obtain and (). (Footnote 3: means adding 1 to each element in set .) According to Equation 1, we have . Hence, from the attacker's view, the identical file and the other similar files have the same range of upload traffic, and the attacker cannot distinguish the identical file from the other files ().
In summary, RRCS defends against the LRI attack by making the files used for executing the LRI attack indistinguishable from the attacker’s view in the general case.
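The range argument behind this result can be checked mechanically. The sketch below does not reproduce Equation 1 or the exact sets from Section III-B3; it assumes a simplified rule in which the number of redundant chunks is drawn uniformly so that the total number of uploaded chunks always falls in a common interval `[low, high]` with `low >= 1`. The function names and the values of `low`, `high`, and `m` are illustrative assumptions.

```python
import random


def uploaded_chunks(non_duplicate: int, low: int, high: int,
                    rng: random.Random) -> int:
    """Chunks sent for a file that has `non_duplicate` genuinely new chunks.

    Assumed rule (illustrative only): the redundant-chunk count is uniform on
    [low - non_duplicate, high - non_duplicate], so the total is uniform on
    [low, high] regardless of the real deduplication state.
    """
    redundant = rng.randint(low - non_duplicate, high - non_duplicate)
    return non_duplicate + redundant


def reachable_totals(non_duplicate: int, low: int, high: int) -> set:
    """All upload totals that the assumed rule can produce."""
    return {non_duplicate + r
            for r in range(low - non_duplicate, high - non_duplicate + 1)}


def attacker_success_rate(m: int, low: int, high: int, trials: int,
                          seed: int = 0) -> float:
    """Fraction of trials in which a lowest-traffic guess hits the identical file.

    File 0 is identical to the target (0 new chunks); the other m - 1 files
    each contain exactly one new chunk. Ties are broken at random.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        traffic = [uploaded_chunks(0 if i == 0 else 1, low, high, rng)
                   for i in range(m)]
        lowest = min(traffic)
        guess = rng.choice([i for i, t in enumerate(traffic) if t == lowest])
        hits += guess == 0
    return hits / trials
```

Under this assumed rule, `reachable_totals(0, 1, 10)` and `reachable_totals(1, 1, 10)` are the same set, and an attacker who guesses the lowest-traffic file does no better than picking one of the `m` files at random.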
IV-B2 The LRI Attack Assisted by Appending Chunks Attack (ACA)
To execute the ACA, the attacker appends one or more non-duplicate chunks to each of the files used for the LRI attack. In the following, we analyze the security of RRCS against the LRI attack assisted by the ACA.
Initially, the target file exists in the server. The candidate files () are uploaded for the LRI attack; each file has chunks, and exactly one is identical to the target file. In the ACA, non-duplicate chunks are appended to each file. We denote the files with the appended non-duplicate chunks as , each of which has chunks. Because of the appended non-duplicate chunks, all the new files belong to , in which are randomly chosen from the range in RRCS. Thus the range of the number of actually uploaded chunks for the file is , and the range for each of the other files is , as shown in Figure 8. The appended version of the identical file and the other files therefore have different ranges of upload traffic.
To analyze the security, we demonstrate in Theorem 2 that RRCS leaks no information with high probability under the ACA.
Theorem 2. For the LRI attack assisted by the Appending Chunks Attack, RRCS leaks no information, i.e., the attacker cannot accurately identify the file with the correct sensitive information among the files, with the probability of .
Proof. We consider all four events in RRCS in which the attacker wants to identify the file with the correct sensitive information among the files with appended non-duplicate chunks.
Event 1. The attacker uploads the files. If the upload traffic of one file equals the size of chunks, the attacker can determine that this file is , since only belongs to the range of the number of actually uploaded chunks in .
Event 2. If the traffic of files equals the size of chunks (Footnote 4: If is not an integer, .), the attacker can determine that the one remaining file is , since only belongs to the ranges of the number of actually uploaded chunks in the files with incorrect sensitive information.
Event 3. If the traffic of every file lies between the sizes of and chunks, the attacker fails to determine which file is , because the traffic of every file can cover this whole range of chunk counts. The files are indistinguishable from the attacker's view in RRCS, based on the proof in Section IV-B1.
Event 4. If the traffic of files equals the size of chunks and the traffic of the remaining files lies between the sizes of chunks and chunks, the attacker can determine that is not among the files but still cannot identify among the remaining files. Thus the files containing are indistinguishable from the attacker's view in RRCS, based on the proof in Section IV-B1.
The first event, which leaks information, occurs with probability . (Footnote 5: Since the average size of personal files is over 600kB in the real-world datasets, as shown in Table II, the average number of chunks is large enough (), and the probability of leaking information is therefore very small.) The second event, which also leaks information, occurs with probability . The third and fourth events, which do not leak information, occur with probability .
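The four events can be explored in a toy model. The sketch below does not reproduce the probability expressions of Theorem 2. It assumes that each file's number of uploaded chunks is drawn independently and uniformly from `width` consecutive values, and that the range of the file with the correct information starts one chunk lower than the range of the other files. Under these assumptions, information leaks exactly when that file lands on its exclusive lowest value (Event 1) or when all other files land on their exclusive highest value (Event 2). The names `width` and `m` are hypothetical parameters of this model.

```python
from fractions import Fraction
from itertools import product


def leak_probability(width: int, m: int) -> Fraction:
    """P(Event 1 or Event 2) in the toy model, in closed form."""
    p_low = Fraction(1, width)                   # correct file hits its lowest value
    p_all_high = Fraction(1, width) ** (m - 1)   # all other files hit their highest value
    return 1 - (1 - p_low) * (1 - p_all_high)


def leak_probability_bruteforce(width: int, m: int) -> Fraction:
    """Same probability, obtained by enumerating every traffic combination.

    Offsets are relative to the start of each file's own range, so offset 0
    of the correct file and offset width - 1 of the other files are the two
    values that only one kind of file can produce.
    """
    leaks = total = 0
    for target, *others in product(range(width), repeat=m):
        total += 1
        if target == 0 or all(o == width - 1 for o in others):
            leaks += 1
    return Fraction(leaks, total)
```

In this model the leak probability shrinks as `width` grows, which matches the qualitative claim that a wider range of redundant chunks leaks information less often.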
Remark. How the server sets is a tradeoff between security and bandwidth efficiency. The larger is, the higher the probability of leaking no information. However, a larger also leads to a larger range of , which naturally results in more potential bandwidth overhead. Nevertheless, even when , RRCS provides the best security guarantee while still obtaining good bandwidth efficiency, as demonstrated in Section V-D.
V Performance Evaluation
V-A Setup and Datasets
To evaluate the performance of RRCS, we implement a prototype of a cross-user source-based deduplication system with RRCS. The client runs Ubuntu 12.04 on a quad-core Intel Core i5-4460 CPU at 3.20 GHz, with 16GB of RAM and a 2TB hard disk. The server has a 16-core CPU, 32GB of RAM and a 10TB hard disk. The RRCS prototype is written in C in a Linux environment.
We examine the performance of RRCS using three real-world trace-based datasets, i.e., Fslhomes, MacOS, and Onefull. We explore the characteristics of the datasets in Section V-B and summarize them in Table II.
Fslhomes was collected in the File system and Storage Lab (FSL) at Stony Brook University, which contains the snapshots of students’ home directories from a shared network file system. The files contain virtual machine images, office documents, source code, binaries and other miscellaneous files.
MacOS was collected from a Mac OS X Enterprise Server that hosts 247 users and provides multiple services: email, web servers, a calendar server, Mailman for mailing lists, a wiki server, MySQL, and a trouble-ticketing server.
Onefull is a subset of the trace reported by Xia et al., which was collected from the personal computers of 15 graduate students in our research group.
As described in Section III-C, single-user duplicate files do not cause privacy leakage. Both RRCS and RTS therefore eliminate single-user duplicate files at the source (client), which yields significant bandwidth savings, and the two schemes exhibit the same bandwidth efficiency, i.e., no bandwidth overhead, in eliminating single-user redundancy. For cross-user deduplication, in contrast, RRCS and RTS add redundancy at different granularities (chunk and file, respectively) to defend against side channel attacks. In the performance evaluation, we therefore eliminate single-user duplicate files in the client and evaluate the bandwidth efficiency of cross-user deduplication, as shown in Section V-D.
V-B The Characteristics of the Datasets
|Dataset||Fslhomes||MacOS||Onefull|
|Avg. chunk size||8kB||8kB||10kB|
|Avg. file size||1530kB||683kB||622kB|
|Cross-user redundancy ratio|
|The total number of files||3.663M||3.058M||378K|
|The number of unique files||2.238M||1.600M||283K|
|The number of copies unique files||0.316M ()||0.281M ()||7.8K ()|
|The number of copies unique files||0.068M ()||0.011M ()||2.0K ()|
|The number of copies unique files||0.017M ()||0.003M ()||0 (0)|
Before evaluating the performance of RRCS, we analyze the characteristics of cross-user file redundancy in the three real-world datasets, each of which contains many users. We count the number of files that have copies (), where is the number of users sharing the file.
The relationship between the number of files and the number of their copies is shown in Figure 9. The number of files decreases exponentially with the number of file copies. We observe that most files have only a few copies (i.e., they are shared by only a few users). We summarize the results in Table II (in the table, M denotes 10^6 and K denotes 10^3). For the Fslhomes dataset, the unique files with more than 5 copies account for only of the total number of unique files. For the MacOS dataset, the unique files with more than 5 copies account for only of the total number of unique files. We also investigate the redundancy characteristics at the chunk level, which show similar results.
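The counting step behind this analysis can be sketched as follows. This is an illustrative sketch, not the tooling used for the datasets: it assumes that a trace is available as `(user_id, file_fingerprint)` records, and the names `copies_per_file`, `copy_histogram`, and `share_with_more_than` are hypothetical.

```python
from collections import Counter


def copies_per_file(records) -> dict:
    """Map each file fingerprint to the number of distinct users holding it.

    Repeated records from the same user count once, because single-user
    duplicates are removed at the client before cross-user analysis.
    """
    users_by_file = {}
    for user, fingerprint in records:
        users_by_file.setdefault(fingerprint, set()).add(user)
    return {fp: len(users) for fp, users in users_by_file.items()}


def copy_histogram(records) -> Counter:
    """Number of unique files for each copy count (the data behind a plot
    of file count versus number of copies)."""
    return Counter(copies_per_file(records).values())


def share_with_more_than(records, k: int) -> float:
    """Fraction of unique files that have more than k copies."""
    counts = copies_per_file(records)
    return sum(1 for c in counts.values() if c > k) / len(counts)
```

For example, `share_with_more_than(records, 5)` gives the proportion of unique files that are shared by more than five users.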
As a result, most files have only a few copies (i.e., they are shared by only a few users) in the real-world datasets. RTS performs target-based deduplication when the number of file copies is smaller than a pre-defined threshold (detailed in Section II-C). However, since files with only a few copies account for a significant proportion, as shown in Figure 9, most files undergo target-based deduplication in RTS. Therefore, RTS becomes bandwidth-inefficient on the real-world datasets.
V-C Uploading a Single File Multiple Times
We mainly consider five deduplication schemes, including source-based deduplication, target-based deduplication, file-level RTS, chunk-level RTS, and RRCS. Based on the file-level RTS described in Section II-C, we develop the chunk-level RTS for comparisons, in which a random threshold is set for each chunk. The five deduplication schemes have the same space savings in the storage server, but different bandwidth savings (i.e., the reduced amount of the transmitted data by the above five deduplication schemes).
To compare the bandwidth overheads of the five deduplication schemes intuitively, we first consider a simple situation in which the same file is uploaded multiple times by different users. We use an 800kB file, which is divided into 100 chunks with an average chunk size of 8kB. We upload the file times and observe how the total amount of transmitted data changes across the schemes. Specifically, file-level (chunk-level) RTS uses target-based deduplication when the number of uploaded copies of the file (chunk) is smaller than the threshold , which is chosen uniformly from the range . We use the parameter setting in their paper, i.e., . RRCS uploads randomized redundant chunks to defend against the LRI attack.
Figure 10 shows how the total amount of transmitted data (i.e., the total traffic) changes as the number of file uploads increases. For target-based deduplication, the total traffic of uploading the file times equals times the size of the file. For file-level RTS, the total traffic equals times the size of the file when is smaller than the threshold , and the file is deduplicated in the client when is larger than . In Figure 10, , which is the average value of the range . Cases with other threshold values behave analogously. For chunk-level RTS, the total traffic increases more slowly than that of file-level RTS. When the number of file uploads is high (i.e., 17), file-level and chunk-level RTS have nearly the same total traffic, since setting a threshold for a file yields the same expected total traffic as setting a threshold for each chunk in the file. For RRCS, the total traffic grows slowly because redundancy is added at the chunk level, and the curve fluctuates because the number of redundant chunks is random. When the number of uploads is quite large (more than 42 in Figure 10), the total traffic of RRCS may exceed that of RTS. However, files with many copies are very few in the real-world datasets, as shown in Section V-B. RRCS therefore obtains significant improvements in bandwidth savings compared with RTS.
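The traffic of the deterministic schemes in this experiment follows directly from their definitions and can be sketched as below. This sketch covers target-based deduplication, source-based deduplication, and the two RTS variants only; RRCS is omitted because its redundancy rule is defined in Section III-B. The threshold values used are arbitrary placeholders rather than the setting of Figure 10, and the handling of the case where the number of uploads equals the threshold is an assumption.

```python
def target_based_traffic(n: int, file_kb: int) -> int:
    """Every one of the n uploads sends the whole file."""
    return n * file_kb


def source_based_traffic(n: int, file_kb: int) -> int:
    """Only the first upload sends data; later uploads are deduplicated."""
    return file_kb if n > 0 else 0


def file_level_rts_traffic(n: int, file_kb: int, threshold: int) -> int:
    """The whole file is sent until `threshold` copies have been uploaded.

    Assumption: the upload that reaches the threshold is still sent in full.
    """
    return min(n, threshold) * file_kb


def chunk_level_rts_traffic(n: int, chunk_kb: int, thresholds: list) -> int:
    """Like file-level RTS, but each chunk carries its own threshold."""
    return sum(min(n, t) * chunk_kb for t in thresholds)
```

With the 800kB file of 100 chunks used above and a placeholder threshold of 3, five uploads cost 4000kB under target-based deduplication, 800kB under source-based deduplication, and 2400kB under file-level RTS; chunk-level RTS gives the same 2400kB when every chunk uses that same threshold.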
V-D Bandwidth Overhead
We compare these deduplication schemes in terms of bandwidth overhead in cross-user deduplication, using the three real-world datasets mentioned above. Specifically, in file-level (chunk-level) RTS, the threshold of each file (chunk) is again chosen uniformly at random from the range , following the parameter setting in their paper. In RRCS, we set the system parameter to and , respectively, to show how different values of affect the bandwidth efficiency.
Figure 11 shows the normalized bandwidth overheads of the five schemes. The bandwidth overhead of target-based deduplication equals the total file size. Compared with target-based deduplication, source-based deduplication reduces bandwidth overheads in the three datasets by eliminating all redundancy in the client. File-level (chunk-level) RTS reduces () bandwidth overheads, because it saves bandwidth only for the files (chunks) that have many copies. In fact, such files (chunks) are quite few, as discussed in Section V-B. RRCS with reduces bandwidth overheads, and RRCS with reduces bandwidth overheads. We observe that the bandwidth overhead of RRCS increases with , since a larger provides a better security guarantee while consuming more bandwidth, as discussed in Section IV-B2. Even in the worst case in terms of bandwidth overhead, where , RRCS still consumes much less bandwidth than RTS.
Figure 12 shows the redundancy elimination ratios of the five schemes. Source-based deduplication eliminates data redundancy but offers no security guarantee. File-level (chunk-level) RTS eliminates only () redundancy, because it eliminates the redundancy only of the files (chunks) that have many copies. RRCS with eliminates redundancy, and RRCS with eliminates redundancy. RRCS eliminates 2 to 10 times as much data redundancy as RTS.
This paper proposes a simple yet effective scheme called RRCS to address an important security issue, in which deduplication can be exploited to carry out the LRI attack and steal user privacy in cloud storage services. RRCS mixes up the real deduplication states of the files used for the LRI attack by adding randomized redundant chunks, which prevents the attacker from accurately identifying the file with the correct sensitive information and thus significantly mitigates the risk of the LRI attack. RRCS also allows the system to control the tradeoff between security and bandwidth efficiency through a configurable parameter . A larger results in higher security but lower deduplication efficiency. When , RRCS provides the best security guarantee while still obtaining a relatively high redundancy elimination ratio, i.e., about . Based on the real RRCS prototype, experimental results on three real-world datasets demonstrate that RRCS incurs much less bandwidth overhead than RTS.
-  V. Turner, J. F. Gantz, R. David, and M. Stephen, “The digital universe of opportunities: Rich data and the increasing value of the internet of things,” IDC iView: IDC Analyze the Future, 2014.
-  J. Gantz and D. Reinsel, “The digital universe in 2020: Big data, bigger digital shadows, and biggest growth in the far east,” IDC iView: IDC Analyze the Future, 2012.
-  D. T. Meyer and W. J. Bolosky, “A study of practical deduplication,” Proc. USENIX FAST, 2011.
-  S. Quinlan and S. Dorward, “Venti: A new approach to archival storage,” Proc. USENIX FAST, 2002.
-  B. Zhu, K. Li, and R. H. Patterson, “Avoiding the disk bottleneck in the Data Domain deduplication file system,” Proc. USENIX FAST, 2008.
-  M. Fu, D. Feng, Y. Hua, X. He, Z. Chen, W. Xia, Y. Zhang, and Y. Tan, “Design tradeoffs for data deduplication performance in backup workloads,” in Proc. USENIX FAST, 2015.
-  “Dropbox: Dropbox-simplify your life,” https://www.dropbox.com/.
-  “Mozy: Award-winning cloud backup, sync, and mobile access for computers and servers,” http://mozy.com/.
-  “Spideroak: Zero-knowledge data backup, sync, access, storage and share from any device,” https://spideroak.com/.
-  A. Anand, V. Sekar, and A. Akella, “SmartRE: an architecture for coordinated network-wide redundancy elimination,” in SIGCOMM, 2009.
-  S.-H. Shen, A. Gember, A. Anand, and A. Akella, “REfactor-ing content overhearing to improve wireless performance,” in MobiCom, 2011.
-  M. Mulazzani, S. Schrittwieser, M. Leithner, M. Huber, and E. Weippl, “Dark clouds on the horizon: Using cloud storage as attack vector and online slack space,” in USENIX Security Symposium, 2011.
-  P. Puzio, R. Molva, M. Onen, and S. Loureiro, “ClouDedup: Secure deduplication with encrypted data for cloud storage,” CloudCom, 2013.
-  D. Harnik, B. Pinkas, and A. Shulman-Peleg, “Side channels in cloud services: Deduplication in cloud storage,” Security & Privacy, IEEE, vol. 8, no. 6, 2010.
-  M. W. Storer, K. Greenan, D. D. Long, and E. L. Miller, “Secure data deduplication,” Proc. ACM StorageSS, 2008.
-  J. R. Douceur, A. Adya, W. J. Bolosky, P. Simon, and M. Theimer, “Reclaiming space from duplicate files in a serverless distributed file system,” Proc. IEEE ICDCS, 2002.
-  M. Bellare, S. Keelveedhi, and T. Ristenpart, “Message-locked encryption and secure deduplication,” Proc. Springer EUROCRYPT, 2013.
-  O. Heen, C. Neumann, L. Montalvo, and S. Defrance, “Improving the resistance to side-channel attacks on cloud storage services,” Proc. IEEE International Conference on New Technologies, Mobility and Security (NTMS), 2012.
-  V. Tarasov, A. Mudrankit, W. Buik, P. Shilane, G. Kuenning, and E. Zadok, “Generating realistic datasets for deduplication analysis,” Proc. USENIX ATC, 2012.
-  W. Xia, H. Jiang, D. Feng, and Y. Hua, “SiLo: A similarity-locality based near-exact deduplication scheme with low RAM overhead and high throughput,” Proc. USENIX ATC, 2011.
-  Y. Shin and K. Kim, “Differentially private client-side data deduplication protocol for cloud storage services,” Security and Communication Networks, vol. 8, no. 12, pp. 2114–2123, 2015.
-  “The openssl program,” http://www.openssl.org/.
-  T. Dierks, “The transport layer security (tls) protocol version 1.2,” 2008.
-  J. Li, X. Chen, M. Li, J. Li, P. P. Lee, and W. Lou, “Secure deduplication with efficient and reliable convergent key management,” IEEE Transactions on Parallel and Distributed Systems, vol. 25, no. 6, pp. 1615–1625, 2014.
-  S. Halevi, D. Harnik, B. Pinkas, and A. Shulman-Peleg, “Proofs of ownership in remote storage systems,” Proc. ACM CCS, 2011.
-  W. Xia, Y. Zhou, H. Jiang, D. Feng, Y. Hua, Y. Hu, Q. Liu, and Y. Zhang, “FastCDC: A fast and efficient content-defined chunking approach for data deduplication,” Proc. USENIX ATC, 2016.
-  Y. Fu, H. Jiang, N. Xiao, L. Tian, and F. Liu, “AA-Dedupe: An application-aware source deduplication approach for cloud backup services in the personal computing environment,” Proc. CLUSTER, 2011.