Fundamental Limits on Communication for Oblivious Updates in Storage Networks
In distributed storage systems, storage nodes intermittently go offline for numerous reasons. On coming back online, nodes need to update their contents to reflect any modifications to the data in the interim. In this paper, we consider a setting where no information regarding modified data needs to be logged in the system. In such a setting, a ‘stale’ node needs to update its contents by downloading data from already updated nodes, while neither the stale node nor the updated nodes have any knowledge as to which data symbols are modified and what their value is. We investigate the fundamental limits on the amount of communication necessary for such an oblivious update process.
We first present a generic lower bound on the amount of communication that is necessary under any storage code with a linear encoding (while allowing non-linear update protocols). This lower bound is derived under a set of extremely weak conditions, giving all updated nodes access to the entire modified data and the stale node access to the entire stale data as side information. We then present codes and update algorithms that are optimal in that they meet this lower bound. Next, we present a lower bound for an important subclass of codes, that of linear Maximum-Distance-Separable (MDS) codes. We then present an MDS code construction and an associated update algorithm that meets this lower bound. These results thus establish the capacity of oblivious updates in terms of the communication requirements under these settings.
In recent years, there has been a tremendous increase in the amount of digital data stored. This has led to the popular paradigm of distributed storage wherein the data to be stored is partitioned into fragments and stored across multiple storage nodes connected through a network. This includes peer-to-peer storage systems [1, 2, 3, 4], globally distributed storage systems [5, 6], data-center based storage systems [7, 8], and caching networks [9]. These distributed storage systems store data in a redundant fashion, using either replication or erasure coding, in order to ensure reliability and availability in the face of frequent unavailability events. Under replication, multiple copies of the fragments are stored on different nodes; for example, the Google File System and the Hadoop Distributed File System use 3-replication as the default strategy for introducing redundancy. Under erasure coding, the data fragments are encoded using erasure codes such as Reed-Solomon codes and the encoded fragments are stored on different nodes.
The storage nodes in the system can go offline for certain intervals of time for various reasons. For instance, there is frequent node churn in peer-to-peer networks as nodes join and leave the network at their will; software issues, maintenance shutdowns, reboots, etc. cause nodes to go offline in distributed storage systems [11, 12]; machines are switched off for certain intervals of time for power savings in some data centers [13].
We consider the setting of mutable data where the data may be modified during its lifetime (as opposed to immutable data which is read-only). When data gets modified, all stored fragments pertaining to this data (either the replicas or the encoded fragments) need to be updated to reflect this modification. When a node comes back online, its contents need to be updated to reflect any modifications to the data that occurred when the node was offline. We term such a node as a stale node.
One approach towards enabling stale nodes to update their contents is to centrally track all modifications to the data. Under such a complete-information approach [14, 15, 16, 17], the data in a stale node is updated via communication with a central node which provides the precise value of the updated fragments to the stale node. However, this approach has the drawback of requiring the system to centrally keep a log of every modification of the data. This paper, on the other hand, considers an entirely distributed approach in which the system does not store any information regarding the modified data. Here, a stale node needs to update its contents by communicating and downloading data from other updated storage nodes present in the system. Neither the stale node nor the updated nodes are aware of what data was modified and what its updated value is. We term such an update process as an oblivious update. In this paper, we seek to establish the fundamental limits on the amount of communication required to perform oblivious updates.
In a distributed system, one could constantly store and maintain, in every storage node, a log of all updates. When required to update a stale node, one could use these logs to identify and transmit the updated data. Oblivious updates, on the other hand, do not necessitate any such additional storage, and also help avoid logistical issues in maintaining any logs. As we will show later in the paper, the amount of communication required to perform an oblivious update is, in fact, not much larger than the amount of communication required for updates in the complete-information setting.
A related line of work is that on maintaining consistency in databases [18, 19] in the presence of modifications to the data. The primary problems here are ensuring that read requests are served from up-to-date data, and maintaining availability of the data. The problem of set reconciliation [20] also has similarities with the problem of oblivious updates. The set reconciliation problem involves two entities, each of whom has some set of values, and the goal is to enable these two entities to learn the difference between their sets with the minimum amount of communication.
Following the literature on classical complete-information updates [14, 15, 16, 17], in this paper we study the case when at most a single symbol is modified. Here, a ‘symbol’ refers to the smallest granularity of data that can be modified. The case of a single-symbol update is a stepping stone to the more general case of multiple symbol-updates. Further, motivated by practical considerations, we restrict our attention to linear codes, i.e., where the encoding process for storage is linear. Although the storage codes are linear, the update protocol is allowed to involve non-linear computations as well, thereby leading to more general bounds.
In this paper we investigate the fundamental limits on the amount of data that needs to be communicated to perform oblivious update of a stale node when a single message symbol is modified. We show that under any code that has a linear encoding (over a finite field of size ), including the special case of ‘replication’, a stale node needs to download at least bits when any one of the message symbols is modified (Section III). This lower bound is obtained via a genie-based argument under a set of extremely weak conditions allowing infinite connectivity for the stale node and giving the entire modified data to all the updated nodes and the entire stale data to the stale node as side information. We then present codes and update algorithms that, perhaps surprisingly, meet these lower bounds on communication (Section IV). Here, oblivious updates are performed by having a stale node download only bits, while the amount of data stored in the node may be arbitrarily large. These codes are also optimal with respect to the ‘storage-bandwidth tradeoff’ for distributed storage [21]. We then investigate the class of codes that are ‘Maximum-Distance-Separable’ (MDS). MDS codes are a popular choice for distributed storage since they provide maximum reliability with minimum storage overheads. When the linear code is restricted to be MDS, we establish a lower bound on the amount of communication required for oblivious update (Section V), and additionally, present an MDS code and an update algorithm that meets this lower bound (Section VI). These results thus establish the capacity of the communication requirements for oblivious updates under linear codes.
The next section formalizes the problem setting and presents an illustrative example.
II Problem Description
II-A Problem Setting
Consider symbols of data, termed the message, that are to be stored across storage nodes. Each symbol of data is assumed to belong to some finite field of size . Each node has a capacity of storing symbols over . The data is stored across the nodes using a code that is linear over . Now suppose some storage node, say node , was busy or offline for some period of time. In this period, suppose one of the message symbols was updated. The remaining nodes now store (encodings of) the updated data. However, node still contains stale data, and we will call this node the stale node. Now, when node comes back up, its contents must be updated to reflect the updated message. To this end, the stale node connects to one or more of the other nodes, and downloads some functions of the data stored in them. The goal is to minimize this amount of download.
In the setup we consider, none of the nodes are required to store any information about the identity or the value of the symbol that was updated. The update of the stale node’s data is thus oblivious of the update in the message. We do assume, however, that the stale node knows that at most a single symbol was updated. We also assume that the code is linear, i.e., each node stores linear combinations of the message symbols. Note that we only assume that the underlying encoding of the stored data is linear, and the data passed during an update operation may comprise arbitrary (linear or non-linear) functions of the data stored in the updated nodes.
The second half of the paper considers a very popular subclass of codes known as Maximum-Distance-Separable (MDS) codes. MDS codes satisfy the two following properties: (a) The entire message of symbols can be recovered from the data stored in any of the nodes, for some pre-defined parameter . This ensures that the storage system can tolerate the failure of any arbitrary of the nodes, and furthermore, ensures high availability of the data since it can be recovered from any nodes. (b) The storage requirement at each node is , which is the minimum possible when satisfying the first property. Again, the goal is to minimize the amount of download required to perform an oblivious update.
Notational conventions: For vectors and of equal lengths, will denote the Hamming distance between them. For any positive integer , will denote the set . Vectors will be column vectors by default.
We illustrate the problem setting with an example of a storage code and an update algorithm that are optimal. The code, shown in Fig. 1, operates in the finite field of size . The message comprises symbols , each drawn from . The message is encoded and stored across storage nodes as shown in the figure. One can verify that the entire message is recoverable from any two of the four nodes, thus making the storage system tolerant to the failure of any two of the four nodes.
Now suppose node was unavailable for some period of time, during which message symbol was updated to . The three other nodes store the updated data, and (‘stale’) node must update its own data by downloading data from the three other nodes. The nodes do not keep any record of what is updated and by how much, i.e., do not know that symbol was updated and that its new value is . The update protocol is thus required to be oblivious of the update.
The lower bounds derived subsequently in Section III dictate the necessity of downloading at least bits for the update. The following update protocol meets this lower bound, with the stale node downloading one symbol each from two arbitrary updated nodes. The stale node contacts any two other nodes, say nodes and , and asks for the inner product of their respective data with . The two nodes return the values of and respectively. Of course, the stale node does not know that this received data is computed with and not . Next, the stale node computes an inner product of its own data with to get the value of , and an inner product of its own data with to get the value of . Subtracting these from the data received from the two other nodes, the stale node obtains the values of . If both these values are zero, then no symbol was updated, and the algorithm terminates. If not, then the algorithm continues in the following manner. Since the identity of the updated symbol is not known, from the perspective of the stale node, these two values could correspond to either or or or or . The stale node now takes the ratio of the two values; this ratio uniquely identifies that symbol was updated. Multiplying the first value by gives the value of the update . This amount is added to the first symbol of the stale node, and the result is stored as the updated first symbol of the stale node.
This storage code and update protocol are generalized to arbitrary system parameters in Section IV-A.
III Lower Bounds for Arbitrary Linear Codes
This section derives lower bounds on the amount of download for oblivious update under any arbitrary code with linear encoding. Note that although we consider the encoding to be linear, we allow the update operation to be executed via any arbitrary (linear or non-linear) functions.
Theorem 1: Consider a scenario where the stale node is allowed to connect to any arbitrary number of updated nodes. Furthermore, suppose a genie provides all updated nodes with all the updated message symbols and the stale node with all the stale message symbols as side information (the nodes still do not know the identity or value of the symbol that was updated). In order to update the data stored in the stale node, it must download a total of at least bits.
Let denote the vector of the stale symbols, and let denote the vector of the updated symbols, with . Let denote the generator matrix of the stale node, i.e., the stale node stores , and wants . Assume without loss of generality that the rows of are linearly independent. Note that our genie has also provided the entire stale message to the stale node.
Since the genie has provided each updated node with all the updated message symbols , one can assume without loss of generality that the stale node connects to only one updated node. On being contacted by the stale node, the updated node must return some function of the data: let denote this function, i.e., the updated node returns to the stale node. We will now show that the cardinality of the range of cannot be less than , thus necessitating a download of at least bits. We employ a contradiction-based argument, for which we assume that the cardinality of the range of is strictly smaller than .
The linear independence of the rows of implies existence of coordinates of such that for every fixed value of , the map is a bijection. Without loss of generality, let . Consider the set of messages of the form . Since the range of contains strictly fewer than values, there must exist some two distinct messages, say and , in the aforementioned set of messages, for which .
Now we know that , , , and . The last property implies existence of some such that and . Finally, suppose is the stale message. Now, and are two possible candidates for the updated message. The stale node has access to the same data in both cases: as its own stale data, and downloaded from the updated node. This prevents the stale node from distinguishing between and as the updated message. However, the updated data at the stale node must be different (since ), making it necessary to distinguish between the two cases. This causes a contradiction. ∎
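The indistinguishability argument in the proof can be checked by brute force on a toy instance. The following is a minimal sketch, not the paper's construction: the field size `q`, message length `B`, generator matrix `G`, and candidate download functions `f` are all hypothetical choices made for illustration. The function searches for a stale message and two candidate updated messages (each within Hamming distance one of the stale message) that produce the same download but require different contents at the stale node.

```python
from itertools import product

def hamming(a, b):
    """Hamming distance between two equal-length tuples."""
    return sum(x != y for x, y in zip(a, b))

def encode(G, m, q):
    """Content of the stale node for message m: the product G m over GF(q), q prime."""
    return tuple(sum(g * x for g, x in zip(row, m)) % q for row in G)

def find_confusion(G, f, q, B):
    """Return (stale, m1, m2) that the stale node cannot tell apart, or None.

    f maps the updated message (known to the genie-aided updated node)
    to the value that the stale node downloads.
    """
    messages = list(product(range(q), repeat=B))
    for stale in messages:
        # Candidate updated messages: at most one symbol differs from the stale one.
        candidates = [m for m in messages if hamming(stale, m) <= 1]
        seen = {}  # download value -> (first message seen, required node content)
        for m in candidates:
            key, target = f(m), encode(G, m, q)
            if key in seen and seen[key][1] != target:
                return stale, seen[key][0], m
            seen.setdefault(key, (m, target))
    return None

G = [(1, 1, 0)]                      # hypothetical stale node storing m0 + m1
send_everything = lambda m: m        # download the entire updated message
send_nothing = lambda m: 0           # download a constant (zero bits)
ok = find_confusion(G, send_everything, q=3, B=3)
bad = find_confusion(G, send_nothing, q=3, B=3)
```

In this sketch a protocol is valid exactly when `find_confusion` returns `None`; a download function with too small a range necessarily maps two such candidates to the same value, which is the contradiction used in the proof.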
IV Codes Achieving Lower Bounds
The lower bound derived in Theorem 1 is in the presence of a very helpful genie. This section presents codes and update algorithms that meet this bound in the absence of this genie. These upper bounds are obtained by proving the existence of codes meeting these bounds, and towards this, we employ the product-matrix framework of [22]. Interestingly, the proposed codes are also optimal with respect to the storage-bandwidth tradeoff derived in [21]. The update algorithms presented here require the stale node to connect to any two updated nodes.
The code is associated with an additional parameter , and has the property that the entire message can be recovered from any of the nodes. Assume that is divisible by . (If not, append an appropriate number of zeros to the message; since the amount of data is typically much larger than and , this operation is relatively inexpensive.) Let .
Under the proposed code, each node is required to store symbols over . The value of will be specified later.
Construct symmetric matrices , each of size , in the following manner. In each matrix , set the bottom-right submatrix to zero. Each of these (symmetric) matrices now has free elements remaining. Partition the message symbols into sets of symbols each. For each , populate the remaining free elements of matrix with the message symbols of the set.
Construct vectors , each of length , and scalars that satisfy:
(a) every submatrix of is of full rank
(b) for every such that , and every and such that ,
Each of these requirements is equivalent to showing that a product of polynomials is non-zero. One can see that each of these polynomials individually is a non-zero polynomial. The Schwartz-Zippel lemma ensures that there exist values of and satisfying all the desired conditions when the size of the underlying finite field is large enough. Finally, for every , node stores the data
Condition (a) will help in recovery of the entire message from any of the nodes, and condition (b) will help in performing the oblivious updates.
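The existence argument via the Schwartz-Zippel lemma suggests a simple randomized search: draw the encoding vectors at random and re-draw until the required conditions hold. The sketch below is an illustration under assumptions, not the paper's procedure: it works over a prime field GF(p), it interprets condition (a) as requiring every square block formed from `d` of the `n` encoding vectors to be invertible, and it does not check condition (b). The names `n`, `d`, `p`, and all function names are hypothetical.

```python
import random
from itertools import combinations

def det_mod_p(rows, p):
    """Determinant of a square matrix over GF(p), p prime, by Gaussian elimination."""
    a = [list(r) for r in rows]
    n = len(a)
    det = 1
    for c in range(n):
        pivot = next((r for r in range(c, n) if a[r][c] % p), None)
        if pivot is None:
            return 0                      # no pivot in this column: singular
        if pivot != c:
            a[c], a[pivot] = a[pivot], a[c]
            det = -det                    # a row swap flips the sign
        det = det * a[c][c] % p
        inv = pow(a[c][c], -1, p)         # modular inverse (Python 3.8+)
        for r in range(c + 1, n):
            factor = a[r][c] * inv % p
            for cc in range(c, n):
                a[r][cc] = (a[r][cc] - factor * a[c][cc]) % p
    return det % p

def all_square_blocks_full_rank(vectors, p):
    """True if every choice of d of the length-d vectors is linearly independent."""
    d = len(vectors[0])
    return all(det_mod_p(sub, p) != 0 for sub in combinations(vectors, d))

def sample_vectors(n, d, p, rng, max_tries=1000):
    """Draw n random length-d vectors over GF(p) until the rank condition holds."""
    for _ in range(max_tries):
        vectors = [[rng.randrange(p) for _ in range(d)] for _ in range(n)]
        if all_square_blocks_full_rank(vectors, p):
            return vectors
    raise RuntimeError("no valid draw found; try a larger field")

vectors = sample_vectors(n=5, d=3, p=101, rng=random.Random(0))
```

The larger the field, the more likely a single draw succeeds, which mirrors the requirement in the text that the field size be large enough.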
IV-B Oblivious Update Algorithm and Performance
In the code constructed in Section IV-A, any stale node can perform an oblivious update by downloading one symbol each from any two updated nodes when at most one symbol has changed.
Let be the matrices comprising the stale message, as constructed in Section IV-A. The construction is such that no two matrices in have any element in common. As a result, the update of a single element causes a change in only one of these matrices. Let be the matrices comprising the updated message. Algorithm 1 updates the data of a stale node by connecting to any two updated nodes and downloading only one symbol from each. Recall that the notation used from Step 2 onwards is defined in (2). Step 6 employs condition (b) of the encoding, which guarantees . ∎
Stale node contacts any two updated nodes and
Node , which stores updated data , returns the single symbol
Stale node , which stores stale data , receives the two symbols
It performs the following operations.
From its stale data, compute:
Subtract these from the received symbols to get and . If the changed symbol is at location of matrix , and its value has been changed by , then and .
If then the stale node already has the updated data; exit
Compute the ratio
Condition (b) ensures that this ratio is different for different , so use the ratio to identify changed location and .
Construct an matrix with value at locations and and zeros elsewhere; in the stale node, update data to as
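The steps of Algorithm 1 can be sketched on a deliberately simplified instance. The code below is an assumption-laden illustration rather than the construction of Section IV-A: it uses a single symmetric 2×2 message matrix M = [[m0, m1], [m1, m2]] (no zero block, no extra scalars), the prime field GF(101), four nodes, and encoding vectors psi_i = [1, x_i] with x_i = 1, 2, 3, 4. For these particular values the ratios are distinct for the three possible locations, which plays the role of condition (b). All names are hypothetical.

```python
P = 101              # prime field size, assumed for illustration
X = [1, 2, 3, 4]     # node i uses the encoding vector psi_i = [1, X[i]]

def psi(i):
    return [1, X[i]]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v)) % P

def store(i, msg):
    """Node i stores psi_i^T M for the symmetric M = [[m0, m1], [m1, m2]]."""
    m0, m1, m2 = msg
    return [(m0 + X[i] * m1) % P, (m1 + X[i] * m2) % P]

def helper_reply(i, data, j):
    """Updated node i returns one symbol: psi_i^T M' psi_j."""
    return dot(data, psi(j))

def coeff(loc, i, j):
    """Coefficient of delta in psi_i^T (M' - M) psi_j when symbol `loc` changes."""
    return [1, (X[i] + X[j]) % P, (X[i] * X[j]) % P][loc]

def oblivious_update(j, stale_data, helper1, helper2):
    """Each helper is a pair (node index, symbol returned by that node)."""
    (i1, r1), (i2, r2) = helper1, helper2
    # psi_i^T M psi_j equals (psi_j^T M) psi_i because M is symmetric,
    # so the stale node can compute the stale counterpart from its own data.
    d1 = (r1 - dot(stale_data, psi(i1))) % P
    d2 = (r2 - dot(stale_data, psi(i2))) % P
    if d1 == 0 and d2 == 0:
        return list(stale_data)          # nothing changed
    for loc in range(3):
        c1, c2 = coeff(loc, i1, j), coeff(loc, i2, j)
        if d1 * c2 % P == d2 * c1 % P:   # ratio test via cross-multiplication
            delta = d1 * pow(c1, -1, P) % P
            # psi_j^T applied to the symmetric matrix holding delta at `loc`
            patch = [[delta, 0], [X[j] * delta, delta], [0, X[j] * delta]][loc]
            return [(s + t) % P for s, t in zip(stale_data, patch)]
    raise ValueError("observations are inconsistent with a single-symbol update")

old, new = (5, 7, 11), (5, 20, 11)       # symbol m1 was modified
replies = [(i, helper_reply(i, store(i, new), 0)) for i in (1, 2)]
updated = oblivious_update(0, store(0, old), *replies)
```

The stale node downloads one symbol from each of two helpers regardless of how many symbols it stores; the ratio test is written as a cross-multiplication so that no division is needed until the location is known.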
The code falls under the ‘product-matrix MBR’ framework of [22, Section IV] from which it derives these properties. ∎
V Lower Bounds for Linear MDS Codes
In this section, we consider the class of codes that are ‘Maximum-Distance-Separable (MDS)’ (recall the definition from the last paragraph of Section II-A). We provide lower bounds on the amount of download for arbitrary MDS codes with linear encoding. Although we consider the encoding to be linear, we allow the update operation to be executed via any arbitrary (linear or non-linear) functions.
Theorem 4: Under any MDS code with linear encoding, a stale node must contact at least updated nodes. Upon contacting nodes, the stale node must download at least bits from each of them.
We will first show that an oblivious update cannot be performed by contacting just nodes. The proof is by contradiction for which we will assume existence of some nodes from which some stale node can be updated. Suppose the entire data stored in these nodes is made available to the stale node. Since the code is MDS, there exists exactly one message whose encoding equals the data currently stored in these updated nodes and the stale node. The stale node will thus be unable to distinguish between the two cases: (a) this message as the stale message and no update, and (b) the actual stale and updated messages. The updated data at the stale node must be different in the two cases, thus necessitating it to distinguish the two cases. This yields a contradiction.
Now assume the stale node connects to some nodes. We now show that it must download at least bits from each of these nodes. It suffices to show that the last of these updated nodes must pass bits, since any of these nodes may be defined as the last node. To this end, consider a genie who provides the entire data stored in the first updated nodes to the stale node, and furthermore, provides the entire updated message to the last updated node.
Let be the stale message, and be the modified message (with ). Let denote the generator matrix of the stale node, i.e., the stale node stores under message . Assume without loss of generality that the rows of are linearly independent. Upon being contacted by the stale node, the last updated node (to whom the genie has provided all the updated data) must send some function of the data: let denote this function, i.e., the updated node returns to the stale node. We will now show that the range of must contain at least elements, thus necessitating a download of at least bits.
The linear independence of the rows of implies existence of coordinates of such that for every fixed value of , the map is a bijection. Without loss of generality, let .
Let denote the set of all messages of the form . Construct a second set from in the following manner. For each , find the unique vector such that and the encoding of in the first updated nodes is zero. Since the code is MDS, for each , there exists exactly one such . Set as the collection of these vectors .
Partition the set , of size , into sets that map onto identical values in the range of . Since the range of has a cardinality strictly smaller than , at least one of these sets must have a cardinality strictly greater than . Let us call this set .
Now consider the original elements which were transformed into . In this set , of size greater than , there must exist some two messages and which match on the first coordinates. It follows that there exists such that and . Next, let and respectively be the (distinct) constituents of that are derived from and respectively.
Finally, consider the following scenario. Suppose the original message was , and this was updated to . This constitutes the update of at most one symbol since . We claim that this scenario is indistinguishable from the scenario of the original message being and the updated message being . To this end, first observe that the latter situation also constitutes the update of at most one symbol since . Furthermore, since and , it must be that the encoding of at the stale node is identical to the encoding of in the stale node. The data stored in the stale node thus provides no information pertaining to distinguishing these two scenarios. As discussed above, the encoding of and both result in zeros at the first helper nodes. Furthermore, which makes the data downloaded from the last updated node identical in the two cases. An accurate update is thus impossible in this situation, thus proving our claim. ∎
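The first step of the proof (contacting too few updated nodes cannot suffice) can be illustrated by enumeration on a toy code. The sketch below rests on assumptions that are not taken from the paper: a hypothetical three-node code over GF(5) in which node i stores the single symbol m0 + x_i m1, so that any two nodes determine the two-symbol message, and a stale node that contacts just one updated node and receives its entire content. For every single-symbol update, the enumeration finds a "twin" message whose encoding matches both symbols seen by the stale node, so "twin stored, nothing updated" is an equally consistent explanation.

```python
from itertools import product

Q = 5                # prime field size, assumed for illustration
X = [1, 2, 3]        # node i stores m0 + X[i] * m1

def enc(i, m):
    return (m[0] + X[i] * m[1]) % Q

def ambiguous_scenarios(stale_node=0, helper=1):
    """List (old, new, twin) triples where the stale node's view is ambiguous."""
    msgs = list(product(range(Q), repeat=2))
    found = []
    for old, new in product(msgs, repeat=2):
        one_change = sum(a != b for a, b in zip(old, new)) == 1
        if not one_change or enc(stale_node, old) == enc(stale_node, new):
            continue
        # Everything the stale node sees: its own stale symbol and the helper's symbol.
        view = (enc(stale_node, old), enc(helper, new))
        # Any two nodes determine the message, so exactly one message matches the view.
        twins = [m for m in msgs if (enc(stale_node, m), enc(helper, m)) == view]
        found.append((old, new, twins[0]))
    return found

scenarios = ambiguous_scenarios()
```

In each triple, the correct new content of the stale node is `enc(0, new)` under the true scenario but `enc(0, twin)` under the no-update scenario, and the two differ, so no rule applied to the stale node's view can be right in both cases.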
VI MDS Codes Achieving Lower Bounds
In this section, we present upper bounds on the amount of download required for oblivious updates under MDS codes, that meet the lower bounds established in Theorem 4.
Each node has a storage capacity of symbols. Let be a -length vector consisting of the message symbols. Let be an arbitrary matrix with the property that every submatrix of is of full rank. For instance, one can choose as a Cauchy matrix. Construct matrices , each of size , by partitioning into blocks of rows each. Finally, for every , node stores the data
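The Cauchy-matrix choice mentioned above can be sketched as follows. This is a minimal illustration under assumptions: a prime field GF(p) with p = 101, a Cauchy matrix with entries 1/(x_i - y_j) built from consecutive integers, and hypothetical parameters `n` (nodes), `alpha` (symbols per node), and `B` (message length). Every square submatrix of a Cauchy matrix with distinct evaluation points is invertible; the helper at the end checks only the 2×2 case.

```python
from itertools import combinations

P = 101  # prime field size, assumed for illustration

def cauchy(xs, ys, p=P):
    """Cauchy matrix over GF(p): entry (i, j) is 1 / (xs[i] - ys[j])."""
    return [[pow((x - y) % p, -1, p) for y in ys] for x in xs]

def build_code(n, alpha, B, p=P):
    """Split an (n*alpha) x B Cauchy matrix into n blocks of alpha rows each."""
    assert n * alpha + B <= p, "evaluation points must be distinct modulo p"
    xs = list(range(n * alpha))
    ys = list(range(n * alpha, n * alpha + B))
    psi = cauchy(xs, ys, p)
    return [psi[i * alpha:(i + 1) * alpha] for i in range(n)]

def encode(blocks, msg, p=P):
    """Node i stores the product of its block with the message vector."""
    return [[sum(a * m for a, m in zip(row, msg)) % p for row in block]
            for block in blocks]

def two_by_two_minors_nonzero(rows, p=P):
    """Check that every 2 x 2 submatrix of the given rows is invertible."""
    width = len(rows[0])
    return all((r1[a] * r2[b] - r1[b] * r2[a]) % p
               for r1, r2 in combinations(rows, 2)
               for a, b in combinations(range(width), 2))

blocks = build_code(n=4, alpha=2, B=4)
stored = encode(blocks, [3, 1, 4, 1])
```

With `B = k * alpha`, the rows held by any `k` nodes form a square submatrix of the Cauchy matrix, which is what makes the message recoverable from those nodes.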
VI-B Oblivious Update Algorithm and Performance
In the code constructed in Section VI-A, any stale node can perform an oblivious update by downloading bits each from any updated nodes when at most one symbol has changed.
Let be the stale message and let be the updated message, with . For every , let and be the first and second rows of , respectively. Further, for any and any , let denote the element of .
Stale node contacts any updated nodes .
For , define -length vectors as
Updated node (), which stores the updated data , returns the two symbols:
Stale node , which stores stale data , performs the following operations.
From the set of received symbols, compute and
Given the stale stored data, containing and , take differences to obtain and
If the changed symbol is at location in the message vector, and its value has been changed by , then and
If then the stale node already has the updated data; exit
Compute the ratio
By construction, this ratio is unique for different values of , so use the ratio to identify the location of the change.
Construct a -length vector with value at location and zeros elsewhere; update the stale data by computing
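The update steps above can be sketched end to end on a small Cauchy-based code. The following is an illustration under assumptions, not a definitive implementation: the field GF(101), the parameters `N`, `K`, `ALPHA`, the choice of evaluation points, and the way the per-helper coefficient vectors are obtained (by solving a linear system so that the helpers' rows combine into the first two rows of the stale node's block) are all choices made here for concreteness.

```python
P = 101                      # prime field size, assumed for illustration
N, K, ALPHA = 4, 2, 3        # hypothetical: nodes, helpers contacted, symbols per node
B = K * ALPHA                # message length
XS = list(range(N * ALPHA))
YS = list(range(N * ALPHA, N * ALPHA + B))
PSI = [[pow((x - y) % P, -1, P) for y in YS] for x in XS]    # Cauchy matrix
BLOCKS = [PSI[i * ALPHA:(i + 1) * ALPHA] for i in range(N)]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v)) % P

def encode(i, msg):
    return [dot(row, msg) for row in BLOCKS[i]]

def solve_mod_p(A, b):
    """Solve A x = b over GF(P) for an invertible square A (Gauss-Jordan)."""
    n = len(A)
    aug = [list(row) + [b[r]] for r, row in enumerate(A)]
    for c in range(n):
        pivot = next(r for r in range(c, n) if aug[r][c] % P)
        aug[c], aug[pivot] = aug[pivot], aug[c]
        inv = pow(aug[c][c], -1, P)
        aug[c] = [v * inv % P for v in aug[c]]
        for r in range(n):
            if r != c and aug[r][c]:
                f = aug[r][c]
                aug[r] = [(v - f * w) % P for v, w in zip(aug[r], aug[c])]
    return [aug[r][n] for r in range(n)]

def coefficients(j, helpers):
    """Data-independent vectors u_i, v_i: the helpers' rows, weighted by u (resp. v),
    sum to the first (resp. second) row of the stale node's block."""
    H = [row for i in helpers for row in BLOCKS[i]]
    Ht = [list(col) for col in zip(*H)]
    u, v = solve_mod_p(Ht, BLOCKS[j][0]), solve_mod_p(Ht, BLOCKS[j][1])
    split = lambda w: {i: w[t * ALPHA:(t + 1) * ALPHA] for t, i in enumerate(helpers)}
    return split(u), split(v)

def helper_reply(i, data, u, v):
    """Updated node i returns two symbols computed from its stored data."""
    return dot(u[i], data), dot(v[i], data)

def oblivious_update(j, stale_data, replies):
    s1 = sum(r[0] for r in replies) % P      # first symbol of node j under the new message
    s2 = sum(r[1] for r in replies) % P      # second symbol of node j under the new message
    e1, e2 = (s1 - stale_data[0]) % P, (s2 - stale_data[1]) % P
    if e1 == 0 and e2 == 0:
        return list(stale_data)
    row1, row2 = BLOCKS[j][0], BLOCKS[j][1]
    for b in range(B):
        if e1 * row2[b] % P == e2 * row1[b] % P:     # ratio test by cross-multiplication
            delta = e1 * pow(row1[b], -1, P) % P
            return [(s + row[b] * delta) % P for s, row in zip(stale_data, BLOCKS[j])]
    raise ValueError("observations are inconsistent with a single-symbol update")

old = [3, 1, 4, 1, 5, 9]
new = [3, 1, 4, 8, 5, 9]                     # one symbol modified
u, v = coefficients(0, [1, 3])
replies = [helper_reply(i, encode(i, new), u, v) for i in (1, 3)]
updated = oblivious_update(0, encode(0, old), replies)
```

Each helper sends two symbols even though every node stores three in this instance; the ratio test identifies the location uniquely because every 2×2 submatrix of a Cauchy matrix is invertible.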
The code constructed in Section VI-A is maximum-distance-separable (MDS).
Each node stores only symbols, and since every submatrix of is of full rank, the entire message is recoverable from any of the nodes.
VII Summary and Open Problems
This paper considered the problem of oblivious updates wherein the data stored in a storage node needs to be updated by downloading data from already updated nodes in the storage network, but with none of the nodes knowing the identity or the value of the modified data symbols. Oblivious updates allow the system to ensure that all nodes have the updated data (even after being offline/unavailable) without having to keep a log of modifications. We established the fundamental limits on the communication required for performing such oblivious updates, when a single message symbol is modified, by deriving genie-aided lower bounds and designing storage codes and update algorithms meeting these bounds. Our goal for the future is to extend the characterization of the fundamental limits in multiple directions, such as considering oblivious updates for multiple symbol modifications, non-linear codes, and interactive update protocols. In addition, to complement the theoretical standpoint of this paper, we also plan to investigate the questions that arise in practical implementations of oblivious update protocols, such as the design of explicit codes, and quantification of the minimal state that needs to be maintained for realizing the update algorithms.
[1] R. Bhagwan et al., “Total recall: System support for automated availability management,” in NSDI, 2004.
[2] A. Rowstron and P. Druschel, “Storage management and caching in PAST, a large-scale, persistent peer-to-peer storage utility,” in ACM SIGOPS, 2001.
[3] “Crashplan,” 2014. [Online]. Available: code42.com/crashplan
[4] “Space monkey,” 2014. [Online]. Available: spacemonkey.com
[5] J. Kubiatowicz et al., “Oceanstore: An architecture for global-scale persistent storage,” ACM Sigplan Notices, 2000.
[6] “Cleversafe,” 2014. [Online]. Available: cleversafe.com
[7] S. Ghemawat, H. Gobioff, and S. Leung, “The Google file system,” in ACM SOSP, 2003.
[8] K. Shvachko, H. Kuang, S. Radia, and R. Chansler, “The Hadoop distributed file system,” in Proc. IEEE MSST, 2010.
[9] B. Tang, H. Gupta, and S. Das, “Benefit-based data caching in ad hoc networks,” IEEE Trans. Mob. Computing, 2008.
[10] D. Borthakur, “HDFS architecture guide,” 2008. [Online]. Available: http://hadoop.apache.org/common/docs/current/hdfsdesign.pdf
[11] D. Ford et al., “Availability in globally distributed storage systems,” in USENIX OSDI, Oct. 2010.
[12] K. V. Rashmi et al., “A solution to the network challenges of data recovery in erasure-coded distributed storage systems: A study on the Facebook warehouse cluster,” in Proc. USENIX HotStorage, Jun. 2013.
[13] M. Lin, A. Wierman, L. Andrew, and E. Thereska, “Dynamic right-sizing for power-proportional data centers,” IEEE/ACM Trans. Nw., 2013.
[14] M. Blaum and R. Roth, “On lowest density MDS codes,” IEEE Trans. Inf. Th., 1999.
[15] L. Xu, V. Bohossian, J. Bruck, and D. Wagner, “Low-density MDS codes and factors of complete graphs,” IEEE Trans. Inf. Th., 1999.
[16] J. S. Plank, “The RAID-6 liber8tion code,” International Journal of High Performance Computing Applications, vol. 23, no. 3, pp. 242–251, 2009.
[17] I. Tamo, Z. Wang, and J. Bruck, “Access vs. bandwidth in codes for storage,” in ISIT, Jul. 2012.
[18] W. Vogels, “Eventually consistent,” Comm. of the ACM, 2009.
[19] A. Demers et al., “The Bayou architecture: Support for data sharing among mobile users,” in Proc. IEEE MCSA Workshop, 1994.
[20] Y. Minsky, A. Trachtenberg, and R. Zippel, “Set reconciliation with nearly optimal communication complexity,” IEEE Trans. Inf. Th., 2003.
[21] A. Dimakis et al., “Network coding for distributed storage systems,” IEEE Trans. Inf. Th., Sep. 2010.
[22] K. V. Rashmi, N. B. Shah, and P. V. Kumar, “Optimal exact-regenerating codes for the MSR and MBR points via a product-matrix construction,” IEEE Trans. Inf. Th., Aug. 2011.