Practical Verified Computation with Streaming Interactive Proofs


Graham Cormode AT&T Labs—Research, graham@research.att.com    Michael Mitzenmacher Harvard University, School of Engineering and Applied Sciences, michaelm@eecs.harvard.edu. This work was supported by NSF grants CCF-0915922, CNS-0721491, and IIS-0964473, and in part by grants from Yahoo! Research, Google, and Cisco, Inc.    Justin Thaler Harvard University, School of Engineering and Applied Sciences, jthaler@seas.harvard.edu. Supported by the Department of Defense (DoD) through the National Defense Science & Engineering Graduate Fellowship (NDSEG) Program, and in part by NSF grants CCF-0915922 and CNS-0721491.
Abstract

When delegating computation to a service provider, as in the cloud computing paradigm, we seek some reassurance that the output is correct and complete. Yet recomputing the output as a check is inefficient and expensive, and it may not even be feasible to store all the data locally. We are therefore interested in what can be validated by a streaming (sublinear space) user, who cannot store the full input, or perform the full computation herself. Our aim in this work is to advance a recent line of work on “proof systems” in which the service provider proves the correctness of its output to a user. The goal is to minimize the time and space costs of both parties in generating and checking the proof. Only very recently have there been attempts to implement such proof systems, and thus far these have been quite limited in functionality.

Here, our approach is two-fold. First, we describe a carefully chosen instantiation of one of the most efficient general-purpose constructions for arbitrary computations (streaming or otherwise), due to Goldwasser, Kalai, and Rothblum [19]. This requires several new insights and enhancements to move the methodology from a theoretical result to a practical possibility. Our main contribution is in achieving a prover that runs in time $O(S(n) \log S(n))$, where $S(n)$ is the size of an arithmetic circuit computing the function of interest; this compares favorably to the $\mathrm{poly}(S(n))$ runtime for the prover promised in [19]. Our experimental results demonstrate that a practical general-purpose protocol for verifiable computation may be significantly closer to reality than previously realized.

Second, we describe a set of techniques that achieve genuine scalability for protocols fine-tuned for specific important problems in streaming and database processing. Focusing in particular on non-interactive protocols for problems ranging from matrix-vector multiplication to bipartite perfect matching, we build on prior work [8, 15] to achieve a prover that runs in nearly linear time, while obtaining optimal tradeoffs between communication cost and the user’s working memory. Existing techniques required (substantially) superlinear time for the prover. Finally, we develop improved interactive protocols for specific problems based on a linearization technique originally due to Shen [34]. We argue that even if general-purpose methods improve, fine-tuned protocols will remain valuable in real-world settings for key problems, and hence special attention to specific problems is warranted.

1 Introduction

One obvious impediment to larger-scale adoption of cloud computing solutions is the matter of trust. In this paper, we are specifically concerned with trust regarding the integrity of outsourced computation. If we store a large data set with a service provider, and ask them to perform a computation on that data set, how can the provider convince us the computation was performed correctly? Even assuming a non-malicious service provider, errors due to faulty algorithm implementation, disk failures, or memory read errors are not uncommon, especially when operating on massive data.

A natural approach, which has received significant attention particularly within the theory community, is to require the service provider to provide a proof along with the answer to the query. Adopting the terminology of proof systems [2], we treat the user as a verifier $\mathcal{V}$, who wants to solve a problem with the help of the service provider who acts as a prover $\mathcal{P}$. After $\mathcal{P}$ returns the answer, the two parties conduct a conversation following an established protocol that satisfies the following property: an honest prover will always convince the verifier to accept its results, while any dishonest or mistaken prover will almost certainly be rejected by the verifier. This model has led to many interesting theoretical techniques in the extensive literature on interactive proofs. However, the bulk of the foundational work in this area assumed that the verifier can afford to spend polynomial time and resources in verifying a prover’s claim to have solved a hard problem (e.g. an NP-complete problem). In our setting, this is too much: rather, the prover should be efficient, ideally with effort close to linear in the input size, and the verifier should be lightweight, with effort that is sublinear in the size of the data.

To this end, we additionally focus on results where the verifier operates in a streaming model, taking a single pass over the input and using a small amount of space. This naturally fits the cloud setting, as the verifier can perform this streaming pass while uploading the data to the cloud. For example, consider a retailer who forwards each transaction incrementally as it occurs. We model the data as too large for the user to even store in memory, hence the need to use the cloud to store the data as it is collected. Later, the user may ask the cloud to perform some computation on the data. The cloud then acts as a prover, sending both an answer and a proof of integrity to the user, keeping in mind the user’s space restrictions.

We believe that such mechanisms are vital to expand the commercial viability of cloud computing services by allowing a trust-but-verify relationship between the user and the service provider. Indeed, even if every computation is not explicitly checked, the mere ability to check the computation could stimulate users to adopt cloud computing solutions. Hence, in this paper, we focus on the issue of the practicality of streaming verification protocols.

There are many relevant costs for such protocols. In the streaming setting, the main concern is the space used by the verifier and the amount of communication between $\mathcal{P}$ and $\mathcal{V}$. Other important costs include the space and time cost to the prover, the runtime of the verifier, and the total number of messages exchanged between the two parties. If any one of these costs is too high, the protocol may not be useful in real-world outsourcing scenarios.

In this work, we take a two-pronged approach. Ideally, we would like to have a general-purpose methodology that allows us to construct an efficient protocol for an arbitrary computation. We therefore examine the costs of one of the most efficient general-purpose protocols known in the literature on interactive proofs, due to Goldwasser, Kalai, and Rothblum [19]. We describe an efficient instantiation of this protocol, in which the prover is significantly faster than in prior work, and present several modifications which we needed to make our implementation scalable. We believe our success in implementing this protocol demonstrates that a fully practical method for reliable delegation of arbitrary computation is much closer to reality than previously realized.

Although encouraging, our general-purpose implementation is not yet practical for everyday use. Hence, our second line of attack is to improve upon the general construction via specialized protocols for a large subset of important problems. Here, we describe two techniques in particular that yield significantly more scalable protocols than previously known. First, we show how to use certain Fast Fourier Transforms to obtain highly scalable non-interactive protocols that are suitable for practice today; these protocols require just one message from $\mathcal{P}$ to $\mathcal{V}$, and no communication in the reverse direction. Second, we describe how to use a ‘linearization’ method applied to polynomials to obtain improved interactive protocols for certain problems. All of our work is backed by empirical evaluation based on our implementations.

Depending on the technique and the problem in question, we see empirical results that vary in speed by five orders of magnitude in terms of the cost to the prover. Hence, we argue that even if general-purpose methods improve, fine-tuned protocols for key problems will remain valuable in real-world settings, especially as these protocols can be used as primitives in more general constructions. Therefore, special attention to specific problems is warranted. The other costs of providing proofs are acceptably low. For many problems our methods require at most a few megabytes of space and communication even when the input consists of terabytes of data, and some use much less; moreover, the time costs of $\mathcal{P}$ and $\mathcal{V}$ scale linearly or almost linearly with the size of the input. Most of our protocols require a polylogarithmic number of messages between $\mathcal{P}$ and $\mathcal{V}$, but a few are non-interactive, and send just one message.

To summarize, we view the contributions of this paper as:

• A carefully engineered general-purpose implementation of the circuit checking construction of [19], along with some extensions to this protocol. We believe our results show that a practical delegation protocol for arbitrary computations is significantly closer to reality than previously realized.

• The development of powerful and broadly applicable methods for obtaining practical specialized protocols for large classes of problems. We demonstrate empirically that these techniques easily scale to streams with billions of updates.

1.1 Previous Work

The concept of an interactive proof was introduced in a burst of activity around twenty years ago [3, 20, 25, 33, 34]. This culminated in a celebrated result of Shamir [33], which showed that the set of problems with efficient interactive proofs is exactly the set of problems that can be computed in polynomial space. However, these results were primarily seen as theoretical statements about computational complexity, and did not lead to implementations. More recently, motivated by real-world applications involving the delegation of computation, there has been considerable interest in proving that the cloud is operating correctly. For example, one line of work considers methods for proving that data is being stored without errors by an external source such as the cloud, e.g., [22].

In our setting, we model the verifier as capable of accessing the data only via a single, streaming pass. Under this constraint, there has been work in the database community on ensuring that simple functions based on grouping and counting are performed correctly; see [37] and the references therein. Other similar examples include work on verifying queries on a data stream with sliding windows using Merkle trees [24] and verifying continuous queries over streaming data [29].

Most relevant to us is work which verifies more complex and more general functions of the input. The notion of a streaming verifier, who must read first the input and then the proof under space constraints, was formalized by Chakrabarti et al. [8] and extended by the present authors in [15]. These works allowed the prover to send only a single message to the verifier, with no communication in the reverse direction. With similar motivations, Goldwasser et al. [19] give a powerful protocol that achieves a polynomial time prover and highly-efficient verifier for a large class of problems, although they do not explicitly present their protocols in a streaming setting. Subsequently, it has been noted that the information required by the verifier can be collected with a single initial streaming pass, and so for a large class of uniform computations, the verifier operates with only polylogarithmic space and time. Finally, Cormode et al. [17] introduce the notion of streaming interactive proofs, extending the model of [8] by allowing multiple rounds of interaction between prover and verifier. They present exponentially cheaper protocols than those possible in the single-message model of [8, 15], for a variety of problems of central importance in database and stream processing.

A different line of work has used fully homomorphic encryption to ensure integrity, privacy, and reusability in delegated computation [18, 12, 11]. The work of Chung, Kalai, Liu, and Raz [11] is particularly related, as they focus on delegation of streaming computation. Their results are stronger than ours, in that they achieve reusable general-purpose protocols (even if $\mathcal{P}$ learns whether $\mathcal{V}$ accepts or rejects each proof), but their soundness guarantees rely on computational assumptions, and the substantial overhead due to the use of fully homomorphic encryption means that these protocols remain far from practical at this time.

Only very recently have there been sustained efforts to use techniques derived from the complexity and cryptography worlds to actually verify computations. Bhattacharyya implements certain PCP constructions and indicates they may be close to practical [4]. In parallel to this work, Setty et al.  [31, 32] survey results on probabilistically checkable proofs (PCPs), and implement a construction originally due to Ishai et al.  [21]. While their work represents a clear advance in the implementation of PCPs, our approach has several advantages over [31, 32]. For example, our protocols save space and time for the verifier even when outsourcing a single computation, while [31, 32] saves time for the verifier only when batching together several dozen computations at once and amortizing the verifier’s cost over the batch. Moreover, our protocols are unconditionally secure even against computationally unbounded adversaries, while the construction of Ishai et al. relies on cryptographic assumptions to obtain security guarantees. Another practically-motivated approach is due to Canetti et al. [7]. Their implementation delegates the computation to two independent provers, and “plays them off” against each other: if they disagree on the output, the protocol identifies where their executions diverge, and favors the one which follows the program correctly at the next step. This approach requires at least one of the provers to be honest for any security guarantee to hold.

1.2 Preliminaries

Definitions. We first formally define a valid protocol. Here we closely follow previous work, such as [17] and [8].

Definition 1.1

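Consider a prover $\mathcal{P}$ and verifier $\mathcal{V}$ who both observe a stream $A$ and wish to compute a function $f(A)$. After the stream is observed, $\mathcal{P}$ and $\mathcal{V}$ exchange a sequence of messages. Denote the output of $\mathcal{V}$ on input $A$, given prover $\mathcal{P}$ and $\mathcal{V}$’s random bits $R$, by $\mathrm{out}(\mathcal{V}, A, R, \mathcal{P})$. $\mathcal{V}$ can output $\perp$ if $\mathcal{V}$ is not convinced that $\mathcal{P}$’s claim is valid.

$\mathcal{P}$ is a valid prover with respect to $\mathcal{V}$ if for all streams $A$,

$$\Pr_R[\mathrm{out}(\mathcal{V}, A, R, \mathcal{P}) = f(A)] = 1.$$

We call $\mathcal{V}$ a valid verifier for $f$ if there is at least one valid prover $\mathcal{P}$ with respect to $\mathcal{V}$, and for all provers $\mathcal{P}'$ and all streams $A$,

$$\Pr_R[\mathrm{out}(\mathcal{V}, A, R, \mathcal{P}') \notin \{f(A), \perp\}] \le 1/3.$$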

Essentially, this definition states that a prover who follows the protocol correctly will always convince $\mathcal{V}$, while if $\mathcal{P}$ makes any mistakes or false claims, then this will be detected with at least constant probability. In fact, for our protocols, the probability that $\mathcal{V}$ wrongly accepts (a ‘false positive’) can easily be made arbitrarily small.

As our first concern in a streaming setting is the space requirements of the verifier as well as the communication cost for the protocol, we make the following definition.

Definition 1.2

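We say $f$ possesses an $r$-message $(h, v)$ protocol, if there exists a valid verifier $\mathcal{V}$ for $f$ such that:

1. $\mathcal{V}$ has access to only $O(v)$ words of working memory.

2. There is a valid prover $\mathcal{P}$ for $\mathcal{V}$ such that $\mathcal{P}$ and $\mathcal{V}$ exchange at most $r$ messages in total, and the sum of the lengths of all messages is $O(h)$ words.

We refer to one-message protocols as non-interactive. We say an $r$-message protocol has $\lceil r/2 \rceil$ rounds.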

A key step in many proof systems is the evaluation of the low-degree extension of some data at multiple points. That is, the data is interpreted as implicitly defining a polynomial function which agrees with the data over the range $[n]$ of input indices, and which can also be evaluated at points outside this range as a check. The existence of streaming verifiers relies on the fact that such low-degree extensions can be evaluated at any given location incrementally as the data is presented [17].

Input Representation. All protocols presented in this paper can handle inputs specified in a very general data stream form. Each element of the stream is a tuple $(i, \delta)$, where $i \in [n]$ and $\delta$ is an integer (which may be negative, thereby modeling deletions). The data stream implicitly defines a frequency vector $a = (a_1, \dots, a_n)$, where $a_i$ is the sum of all $\delta$ values associated with $i$ in the stream, and the goal is to compute a function of $a$. Notice the function of $a$ to be computed may interpret $a$ as an object other than a vector, such as a matrix or a string. For example, in the MVMult problem described below, $a$ defines a matrix and a vector to be multiplied, and in some of the graph problems considered as extensions in Section 2, $a$ defines the adjacency matrix of a graph.

In Sections 2 and 3, the manner in which we describe protocols may appear to assume that the data stream has been pre-aggregated into the frequency vector $a$ (for example, in Section 3, we apply the protocol of Goldwasser et al. [19] to arithmetic circuits whose $i$’th input wire has value $a_i$). It is therefore important to emphasize that in fact all of the protocols in this paper can be executed in the input model of the previous paragraph, where $\mathcal{V}$ only sees the raw (unaggregated) stream and not the aggregated frequency vector $a$, and there is no explicit conversion between the raw stream and the aggregated vector $a$. This follows from observations in [8, 17], which we describe here for completeness.

The critical observation is that in all of our protocols, the only information $\mathcal{V}$ must extract from the data stream is the evaluation of a low-degree extension of $a$ at a random point $r$, which we denote by $\mathrm{LDE}_a(r)$, and this value can be computed incrementally by $\mathcal{V}$ using $O(1)$ words of memory as the raw stream is presented to $\mathcal{V}$. Crucially this is possible because, for fixed $r$, the function $a \mapsto \mathrm{LDE}_a(r)$ is linear, and thus it is straightforward for $\mathcal{V}$ to compute the contribution of each update $(i, \delta)$ to $\mathrm{LDE}_a(r)$.

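More precisely, we can write $\mathrm{LDE}_a(r) = \sum_{i \in [n]} a_i \chi_i(r)$, where $\chi_i$ is a (Lagrange) polynomial that depends only on $i$. Thus, $\mathcal{V}$ can compute $\mathrm{LDE}_a(r)$ incrementally from the raw stream by initializing $\mathrm{LDE}_a(r) \leftarrow 0$, and processing each update $(i, \delta)$ via:

$$\mathrm{LDE}_a(r) \leftarrow \mathrm{LDE}_a(r) + \delta \, \chi_i(r).$$

$\mathcal{V}$ only needs to store $\mathrm{LDE}_a(r)$ and $r$, which requires $O(1)$ words of memory. Moreover, for any $i$, $\chi_i(r)$ can be computed in $O(\log n)$ field operations, and thus $\mathcal{V}$ can compute $\mathrm{LDE}_a(r)$ with one pass over the raw stream, using $O(1)$ words of space and $O(\log n)$ field operations per update.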
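The incremental update can be illustrated with a small sketch. It assumes a univariate Lagrange basis over the points $0, \dots, n-1$ and the prime $2^{61}-1$; both are illustrative choices, and the names `chi` and `stream_lde` are hypothetical. The basis polynomial is evaluated naively here (linear in $n$ per update), so the sketch shows the linearity argument rather than the fast per-update evaluation described above.

```python
PRIME = 2**61 - 1  # illustrative field size (assumption)

def chi(i, r, n, p=PRIME):
    """Lagrange basis polynomial for index i over points 0..n-1, evaluated at r mod p.
    Naive evaluation: O(n) field operations."""
    num, den = 1, 1
    for x in range(n):
        if x != i:
            num = num * (r - x) % p
            den = den * (i - x) % p
    return num * pow(den, -1, p) % p

def stream_lde(stream, r, n, p=PRIME):
    """Single pass over raw (i, delta) updates; only the running value is stored."""
    acc = 0
    for i, delta in stream:
        acc = (acc + delta * chi(i, r, n, p)) % p
    return acc

def aggregate(stream, n):
    """Frequency vector a, used here only to cross-check the streaming result."""
    a = [0] * n
    for i, delta in stream:
        a[i] += delta
    return a
```

Because each update contributes $\delta\,\chi_i(r)$ independently, the streaming value agrees with the one computed from the aggregated vector, and evaluating at a point $r \in [n]$ simply returns $a_r$.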

Problems. To focus our discussion and experimental study, we describe four key problems that capture different aspects of computation: data grouping and aggregation, linear algebra, and pattern matching. We will study how to build valid protocols for each of these problems. Throughout, let $[n] = \{0, \dots, n-1\}$ denote the universe from which data elements are drawn.

• $F_2$: Given a stream of $m$ elements from $[n]$, compute $\sum_{i \in [n]} a_i^2$ where $a_i$ is the number of occurrences of $i$ in the stream. This is also known as the second frequency moment, a special case of the $k$th frequency moment $F_k = \sum_{i \in [n]} a_i^k$.

• $F_0$: Given a stream of $m$ elements from $[n]$, compute the number of distinct elements, i.e. the number of $i$ with $a_i \neq 0$, where again $a_i$ is the number of occurrences of $i$ in the stream.

• MVMult: Given a stream defining an integer matrix $A$, and vectors $x$ and $b$, determine whether $Ax = b$. More generally, we are interested in the case where $\mathcal{P}$ provides a vector $b$ which is claimed to be $Ax$. This is easily handled by our protocols, since $\mathcal{V}$ can treat the provided $b$ as part of the input, even though it may arrive after the rest of the input.

• PMwW: Given a stream representing text $T$ and pattern $P$, the pattern $P$ is said to occur at location $i$ in $T$ if, for every position $j$ in $P$, either $p_j = t_{i+j}$ or at least one of $p_j$ and $t_{i+j}$ is the wildcard symbol $*$. The pattern-matching with wildcards problem is to determine the number of locations at which $P$ occurs in $T$.

For simplicity, we will assume the stream length $m$ and the universe size $n$ are on the same order of magnitude, i.e. $m = \Theta(n)$.

All four problems require linear space in the streaming model to solve exactly (although there are space-efficient approximation algorithms for the first three [28]).
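To make the four problem statements concrete, the following sketch computes each quantity directly from fully stored input. It is a plain linear-space reference, not a verification protocol, and the function names and the choice of `"*"` as the wildcard are assumptions made for illustration.

```python
from collections import Counter

WILDCARD = "*"  # assumed wildcard symbol

def frequencies(stream):
    """Aggregate raw (i, delta) updates into frequency counts."""
    freq = Counter()
    for i, delta in stream:
        freq[i] += delta
    return freq

def f2(stream):
    """Second frequency moment: sum of squared frequencies."""
    return sum(v * v for v in frequencies(stream).values())

def f0(stream):
    """Number of items with nonzero frequency."""
    return sum(1 for v in frequencies(stream).values() if v != 0)

def mvmult_ok(A, x, b):
    """Check whether A x == b over the integers."""
    if len(A) != len(b):
        return False
    return all(sum(aij * xj for aij, xj in zip(row, x)) == bi
               for row, bi in zip(A, b))

def pmww(text, pattern):
    """Count locations where pattern occurs in text, allowing wildcards on either side."""
    q = len(pattern)
    return sum(
        all(p == t or WILDCARD in (p, t) for p, t in zip(pattern, text[i:i + q]))
        for i in range(len(text) - q + 1)
    )
```

For instance, `pmww("abcabd", "ab*")` counts the occurrences at locations 0 and 3.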

Non-interactive versus Multi-round Protocols. Protocols for reliable delegation fall into two classes: non-interactive, in which a single message is sent from prover to verifier and no communication occurs in the reverse direction; and multi-round, where the two parties have a sustained conversation, possibly spanning hundreds of rounds or more. There are merits and drawbacks to each.

• Non-interactive Advantages: The non-interactive model has the desirable property that the prover can compute the proof and send it to the verifier (in an email, or posted on a website) for $\mathcal{V}$ to retrieve and validate at her leisure. In contrast, the multi-round case requires $\mathcal{P}$ and $\mathcal{V}$ to interact online. Due to round-trip delays, the time cost of multi-round protocols can become high; moreover, $\mathcal{P}$ may have to do substantial computation after each message. This can involve maintaining state between messages, and performing many passes over the data. A less obvious advantage is that non-interactive protocols can be repeated for different instances (e.g. searching for different patterns in PMwW) without requiring $\mathcal{V}$ to use fresh randomness. This allows the verifier to amortize much of its time cost over many queries, potentially achieving sublinear time cost per query. The reason this is possible is that in the course of a non-interactive protocol, $\mathcal{P}$ learns nothing about $\mathcal{V}$’s private randomness (assuming $\mathcal{P}$ does not learn whether $\mathcal{V}$ accepts or rejects the proof) and so we can use a union bound to bound the probability of error over multiple instances. In contrast, in the multi-round case, $\mathcal{V}$ must divulge most of its private random bits to $\mathcal{P}$ over the course of the protocol.

• Multi-round Advantages: The overall cost in a multi-round protocol can be lower, as most non-interactive protocols require $\mathcal{V}$ to use substantial space and read a large proof. Indeed, prior work [8, 15] has shown that space or communication must be $\Omega(\sqrt{n})$ for most non-interactive protocols. Nonetheless, even for terabyte streams of data, these costs typically translate to only a few megabytes of space and communication, which is tolerable in many applications. Of more concern is that the time cost to the prover in known non-interactive protocols is typically much higher than in the interactive case, though this gap is not known to be inherent. We make substantial progress in closing this gap in prover runtime in Section 2, but this still leaves an order of magnitude difference in practice (Section 5).

1.3 Outline and Contributions

We consider non-interactive protocols first, and interactive protocols second. To begin, we describe in Section 2 how to use Fast Fourier Transform methods to engineer $\mathcal{P}$’s runtime in the $F_2$ protocol of [8] down from $O(n^{3/2})$ to nearly-linear time. The $F_2$ protocol is a key target, because (as we describe) several protocols build directly upon it. We show in Section 5 that this results in a speedup of hundreds of thousands of updates per second, bringing this protocol, as well as those that build upon it, from theory to practice.

Turning to interactive protocols, in Section 3 we describe an efficient instantiation of the general-purpose construction of [19]. Here, we also describe efficient protocols for specific problems of high interest including $F_0$ and PMwW based on an application of our implementation to carefully chosen circuits. The latter protocol enables verifiable searching (even with wildcards) in the cloud, and complements work on searching in encrypted data within the cloud (e.g. [5]). Our final contribution in this section is to demonstrate that the use of more general arithmetic gates to enhance the basic protocol of [19] allows us to significantly decrease prover time, communication cost, and message cost of these two protocols in practice.

In Section 4 we provide alternative interactive protocols for important specific problems based on a technique known as linearization; we demonstrate in Section 5 that linearization yields a protocol for $F_0$ in which $\mathcal{P}$ runs nearly two orders of magnitude faster than in all other known protocols for this problem. Finally, we describe our observations on implementing these different methods, including our carefully engineered implementation of the powerful general-purpose construction of [19].

2 Fast Non-interactive Proofs via Fast Fourier Transforms

In this section, we describe how to drastically speed up $\mathcal{P}$’s computation for a large class of specialized, non-interactive protocols. In non-interactive proofs, $\mathcal{P}$ often needs to evaluate a low-degree extension at a large number of locations, which can be the bottleneck. Here, we show how to reduce the cost of this step to near linear, via Fast Fourier Transform (FFT) methods.

For concreteness, we describe the approach in the context of a non-interactive protocol for $F_2$ given in [8]. Initial experiments on this protocol identified the prover’s runtime as the principal bottleneck in the protocol [17]. In this implementation, $\mathcal{P}$ required $\Theta(n^{3/2})$ time, and consequently the implementation fails to scale to large $n$. Here, we show that FFT techniques can dramatically speed up the prover, leading to a protocol that easily scales to streams consisting of billions of items.

We point out that $F_2$ is a problem of significant interest, beyond being a canonical streaming problem. Many existing protocols in the non-interactive model are built on top of $F_2$ protocols, including finding the inner product and Hamming distance between two vectors [8], the MVMult problem, solving a large class of linear programs, and graph problems such as testing connectivity and identifying bipartite perfect matchings [9, 15]. These protocols are particularly important because they all achieve provably optimal tradeoffs between space and communication costs [8]. Thus, by developing a scalable, practical protocol for $F_2$, we also achieve big improvements in protocols for a host of important (and seemingly unrelated) problems.

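Non-interactive $F_2$ and MVMult Protocols. We first outline the protocol from [8, Theorem 4] for $F_2$ on an $n$-dimensional vector. This construction yields an $(n^{\alpha}, n^{1-\alpha})$ protocol for any $0 \le \alpha \le 1$, i.e. it allows a tradeoff between the amount of communication and space used by $\mathcal{V}$; for brevity we describe the protocol when $\alpha = 1/2$.

Assume for simplicity that $n$ is a perfect square. We treat the $n$-dimensional vector as a $\sqrt{n} \times \sqrt{n}$ array $a$. This implies a two-variate polynomial $f$ over a suitably large finite field $\mathbb{F}_p$, such that

$$\forall (x,y) \in [\sqrt{n}] \times [\sqrt{n}] : f(x,y) = a_{x,y}.$$

To compute $F_2$, we wish to compute

$$\sum_{x \in [\sqrt{n}],\, y \in [\sqrt{n}]} a_{x,y}^2 = \sum_{x \in [\sqrt{n}],\, y \in [\sqrt{n}]} f^2(x,y).$$

The low-degree extension $f$ can also be evaluated at locations outside $[\sqrt{n}] \times [\sqrt{n}]$. In the protocol, the verifier $\mathcal{V}$ picks a random position $r \in \mathbb{F}_p$, and evaluates $f(r,y)$ for every $y \in [\sqrt{n}]$ ([8] shows how $\mathcal{V}$ can compute any $f(r,y)$ incrementally in constant space). The proof given by $\mathcal{P}$ is in the form of a degree $2(\sqrt{n}-1)$ polynomial $s(X)$ which is claimed to be $\sum_{y \in [\sqrt{n}]} f(X,y)^2$. $\mathcal{V}$ uses the values of $f(r,y)$ to check that $s(r) = \sum_{y \in [\sqrt{n}]} f(r,y)^2$, and if so accepts $\sum_{x \in [\sqrt{n}]} s(x)$ as the correct answer. Clearly $\mathcal{V}$’s check will pass if $s$ is as claimed. The proof of validity follows from the Schwartz-Zippel lemma: if $s(X)$ is not the polynomial $\sum_{y \in [\sqrt{n}]} f(X,y)^2$ that $\mathcal{P}$ claims it to be, then

$$\Pr\Big[s(r) = \sum_{y \in [\sqrt{n}]} f(r,y)^2\Big] \le \frac{\mathrm{degree}(s)}{|\mathbb{F}_p|} = \frac{2(\sqrt{n}-1)}{p},$$

where $p$ is the size of the finite field $\mathbb{F}_p$. Thus, if $\mathcal{P}$ deviates at all from the prescribed protocol, the verifier’s check will fail with high probability.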
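The message flow of this protocol can be simulated end to end on a toy input. The sketch below assumes the field $\mathbb{F}_p$ with $p = 2^{61}-1$, evaluation points $0, 1, 2, \dots$, and a proof sent as the values of $s$ at $2\sqrt{n}-1$ points; the function names are hypothetical, and the prover is the naive one, not the fast prover developed in this section.

```python
PRIME = 2**61 - 1  # illustrative field size (assumption)

def chi(i, r, k, p=PRIME):
    """Lagrange basis polynomial for node i over nodes 0..k-1, evaluated at r."""
    num, den = 1, 1
    for x in range(k):
        if x != i:
            num = num * (r - x) % p
            den = den * (i - x) % p
    return num * pow(den, -1, p) % p

def interpolate(values, r, p=PRIME):
    """Evaluate at r the polynomial of degree < len(values) taking these values at 0..len-1."""
    k = len(values)
    return sum(v * chi(i, r, k, p) for i, v in enumerate(values)) % p

def verifier_pass(stream, side, r, p=PRIME):
    """Streaming pass: maintain f(r, y) for every column y (side words of state)."""
    f_r = [0] * side
    for idx, delta in stream:
        x, y = divmod(idx, side)  # index idx is laid out as row x, column y
        f_r[y] = (f_r[y] + delta * chi(x, r, side, p)) % p
    return f_r

def honest_proof(a, side, p=PRIME):
    """Naive prover: s(X) = sum_y f(X, y)^2 at X = 0..2*side-2."""
    cols = [[a[x][y] % p for x in range(side)] for y in range(side)]
    return [sum(interpolate(col, X, p) ** 2 for col in cols) % p
            for X in range(2 * side - 1)]

def verify(proof, f_r, r, side, p=PRIME):
    """Return the claimed F2 (mod p) if s(r) matches the verifier's own value, else None."""
    if len(proof) != 2 * side - 1:
        return None
    if interpolate(proof, r, p) != sum(v * v for v in f_r) % p:
        return None
    return sum(proof[:side]) % p
```

In use, the verifier would draw `r` uniformly at random from the field before seeing the stream, run `verifier_pass`, and then call `verify` on whatever proof the prover supplies; a proof that differs from the honest one in even a single value is rejected unless `r` happens to hit a root of the difference polynomial.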

A non-interactive protocol for MVMult uses similar ideas. Each entry in the output is the result of an inner product between two vectors: a row of matrix $A$ and vector $x$. Each of the entries in the output can be checked independently with a variation of the above protocol, where the squared values are replaced by products of vector entries; this naive approach already yields a non-interactive protocol for MVMult. [15] observes that, because $x$ is held constant throughout all inner product computations, $\mathcal{V}$’s space requirements can be reduced by having $\mathcal{V}$ keep track of hashed information, rather than full vectors. The messages from $\mathcal{P}$ do not change, however, and computing low-degree extensions of the input data is the chief scalability bottleneck. This construction yields a 1-message protocol (as in Definition 1.2) with a tunable tradeoff between communication and space, and this tradeoff can be shown to be optimal.

2.1 Breaking the bottleneck

Since $s$ has degree at most $2(\sqrt{n}-1)$, it is uniquely specified by its values at any $2\sqrt{n}$ locations. We show how $\mathcal{P}$ can quickly evaluate all values in the set

$$S := \{(x, s(x)) : x \in [2\sqrt{n}]\}.$$

Since $s(x) = \sum_{y \in [\sqrt{n}]} f(x,y)^2$, given all values in set

$$T := \{(x, y, f(x,y)) : x \in [2\sqrt{n}],\, y \in [\sqrt{n}]\},$$

all values in $S$ can be computed in time linear in $n$. The implementation of [17] calculated each value in $T$ independently, requiring $O(n^{3/2})$ time overall. We show how FFT techniques allow us to calculate $T$ much faster.

The task of computing $T$ boils down to multi-point evaluation of the polynomial $f$. It is known how to perform fast multi-point evaluation of univariate polynomials in time nearly linear in their degree, and of bivariate polynomials in subquadratic time, if the polynomial is specified by its coefficients [27]. However, there is substantial overhead in converting to a coefficient representation. It is more efficient for us to directly work with and exchange polynomials in an implicit representation, by specifying their values at sufficiently many points.

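Representing $f$ as a convolution. We are given the values of $f$ at all points located on the $[\sqrt{n}] \times [\sqrt{n}]$ “grid”. We leverage this fact to compute $T$ efficiently in nearly linear time by a direct application of the Fast Fourier Transform. For $(x,y) \in [\sqrt{n}] \times [\sqrt{n}]$, $f(x,y)$ is just $a_{x,y}$, which $\mathcal{P}$ can store explicitly while processing the stream. It remains to calculate $f(x,y)$ for $x \in [2\sqrt{n}] \setminus [\sqrt{n}]$. For fixed $y \in [\sqrt{n}]$, we may write $f(X,y)$ explicitly as

$$f(X,y) = \sum_{i \in [\sqrt{n}]} a_{i,y}\, \chi_i(X),$$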

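where $\chi_i$ is the Lagrange polynomial, that is, the unique polynomial of degree $\sqrt{n}-1$ such that $\chi_i(i) = 1$, while $\chi_i(j) = 0$ for every $j \in [\sqrt{n}]$ with $j \neq i$:

$$\chi_i(j) = \prod_{x \in [\sqrt{n}] \setminus \{i\}} (j - x)(i - x)^{-1}.$$

Here, the inverse is the multiplicative inverse within the field.

If $j \in [2\sqrt{n}] \setminus [\sqrt{n}]$, then $j - i \neq 0$ for every $i \in [\sqrt{n}]$, and we may write

$$f(j,y) = \sum_{i \in [\sqrt{n}]} h(j)\, b_y(i)\, g(j-i), \qquad (1)$$

where

$$b_y(i) = a_{i,y} \prod_{x \in [\sqrt{n}] \setminus \{i\}} (i - x)^{-1}, \qquad h(j) = \prod_{k = j+1-\sqrt{n}}^{j} k, \qquad g(j-i) = (j-i)^{-1}.$$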

As a result, $f(j,y)$ can be computed as a circular convolution of $b_y$ and $g$, scaled by $h(j)$. That is, for a fixed $y$, all values in the set $\{f(j,y) : j \in [2\sqrt{n}] \setminus [\sqrt{n}]\}$ can be found by computing the convolution in Equation 1, then scaling each entry by the appropriate value of $h(j)$.
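The decomposition in Equation 1 can be checked numerically on a single column. The sketch below assumes $p = 2^{61}-1$, nodes $0, \dots, k-1$ with $k = \sqrt{n}$, and the standard Lagrange denominators $\prod_{x \neq i}(i-x)^{-1}$ in $b_y$; the convolution is evaluated naively, so it demonstrates only the algebraic identity, not the speedup. The names `extend_column` and `direct_eval` are hypothetical.

```python
PRIME = 2**61 - 1  # illustrative field size (assumption)

def inv(v, p=PRIME):
    """Multiplicative inverse in F_p."""
    return pow(v % p, -1, p)

def extend_column(col, p=PRIME):
    """Given f(i, y) = col[i] for i in 0..k-1, return f(j, y) for j in k..2k-1
    using f(j, y) = h(j) * sum_i b(i) * g(j - i)."""
    k = len(col)
    b = []
    for i in range(k):
        den = 1
        for x in range(k):
            if x != i:
                den = den * (i - x) % p
        b.append(col[i] * inv(den, p) % p)      # b(i) = a_i * prod_{x != i} (i - x)^{-1}
    g = {d: inv(d, p) for d in range(1, 2 * k)}  # g(d) = d^{-1} for every gap d = j - i
    out = []
    for j in range(k, 2 * k):
        h = 1
        for t in range(j + 1 - k, j + 1):        # h(j) = prod_{t = j+1-k}^{j} t
            h = h * t % p
        conv = sum(b[i] * g[j - i] for i in range(k)) % p
        out.append(h * conv % p)
    return out

def direct_eval(col, j, p=PRIME):
    """Reference: plain Lagrange interpolation through nodes 0..k-1, evaluated at j."""
    k = len(col)
    total = 0
    for i in range(k):
        num, den = 1, 1
        for x in range(k):
            if x != i:
                num = num * (j - x) % p
                den = den * (i - x) % p
        total = (total + col[i] * num * inv(den, p)) % p
    return total
```

As a sanity check, a column holding the values of $x^2 + 1$ at $0, 1, 2, 3$ extends to the values of the same polynomial at $4, 5, 6, 7$.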

Computing the Convolution. We represent $b_y$ and $g$ by vectors of length $O(\sqrt{n})$ over a suitable field, and take the Discrete Fourier Transform (DFT) of each. The convolution is the inverse transform of the pointwise product of the two transforms [23, Chapter 5]. There is some freedom to choose the field over which to perform the transform. We can compute the DFT of $b_y$ and $g$ over the complex field $\mathbb{C}$ using $O(\sqrt{n} \log n)$ arithmetic operations via standard techniques such as the Cooley-Tukey algorithm [14], and simply round the final result to the nearest integer and reduce it modulo $p$. Logarithmically many bits of precision past the decimal point suffice to obtain a sufficiently accurate result. Since we compute $\sqrt{n}$ such convolutions, we obtain the following result:

Theorem 2.1

The honest prover in the $F_2$ protocol of [8, Theorem 4] requires $O(n \log n)$ arithmetic operations on numbers of bit-complexity $O(\log n + \log p)$.

In practice, however, working over $\mathbb{C}$ can be slow, and requires us to deal with precision issues. Since the original data resides in some finite field $\mathbb{F}_p$, and can be represented as fixed-precision integers, it is preferable to also compute the DFT over the same field. Here, we exploit the fact that in designing our protocol, we can choose to work over any sufficiently large finite field $\mathbb{F}_p$.

There are two issues to address: we need that there exists a DFT for sequences of length on the order of $\sqrt{n}$ in $\mathbb{F}_p$, and further that this DFT has a corresponding (fast) Fourier Transform algorithm. We can resolve both issues with the Prime Factor Algorithm (PFA) for the DFT in $\mathbb{F}_p$ [6]. The “textbook” Cooley-Tukey FFT algorithm operates on sequences whose length is a power of two. Instead, the PFA works on sequences of length $N = \prod_i N_i$, where the $N_i$’s are pairwise coprime. The time cost of the transform is $O(N \sum_i N_i)$. The algorithm is typically applied over the complex numbers, but also applies over $\mathbb{F}_p$: it works by breaking the large DFT up into a sequence of smaller DFTs, each of size $N_i$ for some $i$. These base DFTs for sequences of length $N_i$ exist for $\mathbb{F}_p$ whenever there exists a primitive $N_i$’th root of unity in $\mathbb{F}_p$. This is the case whenever $N_i$ is a divisor of $p-1$. So we are in good shape so long as $p-1$ has many distinct prime factors.
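The existence condition can be made concrete with a short sketch: for any length $N$ dividing $p-1$, an element of order exactly $N$ can be found and used as the root of unity for a DFT over $\mathbb{F}_p$, and the convolution theorem then holds exactly, with no rounding. The sketch assumes $p = 2^{61}-1$ and uses a quadratic-time DFT for clarity; it is not the Prime Factor Algorithm, and the helper names are hypothetical.

```python
PRIME = 2**61 - 1  # illustrative field size (assumption)

def prime_factors(n):
    """Set of distinct prime factors of n (trial division; fine for small n)."""
    fs, d = set(), 2
    while d * d <= n:
        while n % d == 0:
            fs.add(d)
            n //= d
        d += 1
    if n > 1:
        fs.add(n)
    return fs

def root_of_unity(N, p=PRIME):
    """Return an element of multiplicative order exactly N in F_p; requires N | p - 1."""
    if (p - 1) % N != 0:
        raise ValueError("N must divide p - 1")
    for c in range(2, 1000):
        w = pow(c, (p - 1) // N, p)              # w^N = 1 by Fermat's little theorem
        if all(pow(w, N // q, p) != 1 for q in prime_factors(N)):
            return w                              # order is N, not a proper divisor
    raise ValueError("no root found among small candidates")

def dft(v, w, p=PRIME):
    """Naive O(N^2) DFT over F_p with root of unity w."""
    N = len(v)
    return [sum(v[i] * pow(w, i * k, p) for i in range(N)) % p for k in range(N)]

def cyclic_convolution(u, v, p=PRIME):
    """Cyclic convolution via forward DFTs, pointwise product, and inverse DFT."""
    N = len(u)
    w = root_of_unity(N, p)
    prod = [x * y % p for x, y in zip(dft(u, w, p), dft(v, w, p))]
    inv_N = pow(N, -1, p)
    return [c * inv_N % p for c in dft(prod, pow(w, -1, p), p)]
```

Lengths such as 6 or 7 work because they divide $2^{61}-2$; a length of 4 does not, which is why padding to a suitable divisor is needed in general.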

Here, we use our freedom to fix , and choose . (Footnote 2: Arithmetic in this field can also be done quickly; see Section 5.1.) Notice that

$$2^{61}-2 = 2 \times 3^2 \times 5^2 \times 7 \times 11 \times 13 \times 31 \times 41 \times 61 \times 151 \times 331 \times 1321,$$

and so there are many such divisors to choose from when working over . If is not equal to a factor of , we can simply pad the vectors and such that their lengths are factors of . Since has many small factors, we never have to use too much padding: we calculated that we never need to pad any sequence of length (good for up to ) by more than of its length. This is better than the Cooley-Tukey method, where padding can double the length of the sequence.

As an example, we can work with the length , sufficient for inputs of size , which is over . The cost scales as . Therefore, the PFA approach offers a substantial improvement over naive convolution in , which takes time .
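The finite-field transform described above can be sketched in a few lines of Python. This is a minimal illustration, assuming the field has p = 2^61 − 1 elements (the prime whose p − 1 is factored above); the helper names (`find_generator`, `root_of_unity`, `dft`, `pfa_dft`, `cyclic_convolution`) are hypothetical and are not taken from the implementation in [16]. Only a two-factor split is shown, and the small base DFTs are computed naively.

```python
from math import gcd

P = 2**61 - 1  # assumed field size for this sketch (a Mersenne prime)
# Distinct prime factors of P - 1 = 2^61 - 2.
PRIME_FACTORS = [2, 3, 5, 7, 11, 13, 31, 41, 61, 151, 331, 1321]

def find_generator():
    """Search for a generator of the multiplicative group of F_P."""
    for g in range(2, 10**4):
        if all(pow(g, (P - 1) // q, P) != 1 for q in PRIME_FACTORS):
            return g
    raise ValueError("no small generator found")

def root_of_unity(n):
    """Return a primitive n-th root of unity; requires n to divide P - 1."""
    if (P - 1) % n != 0:
        raise ValueError("n must divide P - 1")
    return pow(find_generator(), (P - 1) // n, P)

def dft(a, w):
    """Naive O(n^2) DFT over F_P with respect to the root w."""
    n = len(a)
    return [sum(a[j] * pow(w, i * j, P) for j in range(n)) % P
            for i in range(n)]

def pfa_dft(a, w, n1, n2):
    """Two-factor prime factor (Good-Thomas) DFT for coprime n1, n2."""
    n = n1 * n2
    assert len(a) == n and gcd(n1, n2) == 1
    w1, w2 = pow(w, n2, P), pow(w, n1, P)  # roots of order n1 and n2
    # Re-index the input onto an n2 x n1 grid; no twiddle factors are needed.
    grid = [[a[(j1 * n2 + j2 * n1) % n] for j1 in range(n1)]
            for j2 in range(n2)]
    rows = [dft(row, w1) for row in grid]  # rows[j2][k1]
    # Output index k is determined by (k mod n1, k mod n2) via the CRT.
    index = {(k % n1, k % n2): k for k in range(n)}
    out = [0] * n
    for k1 in range(n1):
        col = dft([rows[j2][k1] for j2 in range(n2)], w2)
        for k2 in range(n2):
            out[index[(k1, k2)]] = col[k2]
    return out

def cyclic_convolution(a, b, n1, n2):
    """Cyclic convolution of a and b (length n1 * n2) via the transform."""
    n = n1 * n2
    w = root_of_unity(n)
    fa, fb = pfa_dft(a, w, n1, n2), pfa_dft(b, w, n1, n2)
    prod = [x * y % P for x, y in zip(fa, fb)]
    w_inv, n_inv = pow(w, P - 2, P), pow(n, P - 2, P)  # inverses via FLT
    return [c * n_inv % P for c in pfa_dft(prod, w_inv, n1, n2)]
```

For instance, `cyclic_convolution(a, b, 3, 5)` handles vectors of length 15, which divides p − 1. A practical implementation would recurse over more than two coprime factors and use specialised small-length transforms.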

Parallelization. This protocol is highly amenable to parallelization. Observe that performs independent convolutions of each of length (one for each column of the matrix ), followed by computing for each row of the result. The convolutions can be done in parallel, and once complete, the sum of squares of each row can also be parallelized. This protocol also possesses a simple two-round MapReduce protocol. In the first round, we assign each column of the matrix a unique key, and have each reducer perform the convolution for the corresponding column. In the second round, we assign each row a unique key, and have each reducer compute for its row .

2.2 Implications

As we experimentally demonstrate in Section 5, the results of this section make practical the fundamental building block for the majority of known non-interactive protocols. Indeed, by combining Theorem 2.1 with protocols from [8, 15], we obtain the following immediate corollaries. For all graph problems considered, is the number of nodes in the graph, and is the number of edges.

Corollary 2.2
1. (Extending [8, Theorem 4.3]) For any , there is an protocol for computing the inner product and Hamming distance of two -dimensional vectors, where runs in time and runs in time . The previous best runtime known for was .

2. (Extending [15, Theorem 4]) For any , there is an protocol for integer matrix-vector multiplication (MVMult), where runs in time and runs in time . The best runtime known for previously was .

3. (Extending [15, Corollary 3]) For any , there is an protocol for solving a linear program over variables with (integer) constraints and subdeterminants of polynomial magnitude, where runs in time and runs in time , where is the time required to solve the linear program and its dual. The best runtime known for previously was .

4. (Extending [8, Theorem 5.4]) For any , there is an protocol for counting the number of triangles in a graph, where runs in time and runs in time . The best runtime known for previously was .

5. (Extending [9, Theorem 6.6]) For any , , there is an protocol for graph connectivity, where runs in time and runs in time . The best runtime known for previously was .

6. (Extending [9, Theorem 6.5]) For any , , there is an protocol for bipartite perfect matching, where runs in time and runs in time , where is the time required to find a perfect matching if one exists, or to find a counter-example (via Hall’s Theorem) otherwise. The best runtime known for previously was .

In the common case where we choose , this represents a polynomial speedup in ’s runtime. For example, for the MVMult problem, the prover’s cost is reduced from in prior work to .

In most cases of Corollary 2.2, runs in linear time, and runs in nearly linear time for dense inputs, plus the time required to solve the problem in the first place, which may be superlinear. Thus, pays at most a logarithmic factor overhead in solving the problem “verifiably”, compared to solving the problem in a non-verifiable manner.

3 A General Approach: Multi-round Protocols Via Circuit Checking

In this section, we study interactive protocols, and describe how to efficiently instantiate the powerful framework due to Goldwasser, Kalai, and Rothblum for verifying arbitrary computations. (Footnote 3: We are indebted to these authors for sharing their working draft of the full version of [19], which provides much greater detail than is possible in the conference presentation.)

A standard approach to verified computation developed in the theoretical literature is to verify properties of circuits that compute the desired function [18, 19, 31]. One of the most promising of these is due to Goldwasser et al., which proves the following result:

Theorem 3.1

[19] Let be a function over an arbitrary field that can be computed by a family of -space uniform arithmetic circuits (over ) of fan-in 2, size , and depth . Then, assuming unit cost for transmitting or storing a value in , possesses a -protocol requiring rounds. runs in time and runs in time .

Here, an arithmetic circuit over a field is analogous to a boolean circuit, except that the inputs are elements of rather than boolean values, and the gates of the circuit compute addition and multiplication over . We address how to realize the protocol of Theorem 3.1 efficiently. Specifically, we show three technical results. The first two results, Theorems 3.2 and 3.3, state that for any log-space uniform circuit, the honest prover in the protocol of Theorem 3.1 can be made to run in time nearly linear in the size of the circuit, with a streaming verifier who uses only words of memory. Thus, these results guarantee a highly efficient prover and a space-efficient verifier. In streaming contexts, where is more space-constrained than time-constrained, this may be acceptable. Moreover, Theorem 3.3 states that can perform the time-consuming part of its computation in a data-independent non-interactive preprocessing phase, which can occur offline before the stream is observed.

Our third result, Theorem 3.4, makes a slightly stronger assumption but yields a stronger result: it states that under very mild conditions on the circuit, we can achieve a prover who runs in time nearly linear in the size of the circuit, and a verifier who is both space- and time-efficient.

Before stating our theorems, we sketch the main techniques needed to achieve the efficient implementation, with full details in Appendix A. We also direct the interested reader to the source code of our implementations [16]. The remainder of this section is intended to be reasonably accessible to readers who are familiar with the sum-check protocol [33, 25], but not necessarily with the protocol of [19].

3.1 Engineering an Efficient Prover

In the protocol of [19], and first agree on a depth circuit of gates with fan-in 2 that computes the function of interest; is assumed to be in layered form (this assumption blows up the size of the circuit by at most a factor of , and we argue that it is unrestrictive in practice, as the natural circuits for all four of our motivating problems are layered, as well as for a variety of other problems described in Appendix A). begins by claiming a value for the output gate of the circuit. The protocol then proceeds iteratively from the output layer of to the input layer, with one iteration for each layer. For presentation purposes, assume that all layers of the circuit have gates, and let .

At a high level, in iteration , reduces verifying the claimed value of the output gate to computing the value of a certain -variate polynomial at a random point . The iterations then proceed inductively over each layer of gates: in iteration , reduces computing for a certain -variate polynomial to computing for a random point . Finally, in iteration , must compute . This happens to be a function of the input alone (specifically, it is an evaluation of a low-degree extension of the input), and can compute this value in a streaming fashion, without assistance, even if only given access to the raw (unaggregated) data stream, as described in Section 1.2. If the values agree, then is convinced of the correctness of the output.

We abstract the notion of a “wiring predicate”, which encodes which pairs of wires from layer are connected to a given gate at layer . Each iteration consists of an application of the standard sum-check protocol [25, 33] to a -variate polynomial based on the wiring predicate. There is some flexibility in choosing the specific polynomial to use. This is because the definition of involves a low-degree extension of the circuit’s wiring predicate, and there are many such low-degree extensions to choose from.

A polynomial is said to be multilinear if it has degree at most one in each variable. The results in this section rely critically on the observation that the honest prover’s computation in the protocol of [19] can be greatly simplified if we use the multilinear extension of the circuit’s wiring predicate. (Footnote 4: There are other reasons why using the multilinear extension is desirable. For example, the communication cost of the protocol is proportional to the degree of the extension used.) Details of this observation follow.

As already mentioned, at iteration of the protocol of [19], the sum-check protocol is applied to the -variate polynomial . In the ’th round of this sum-check protocol, is required to send the univariate polynomial

$$g_j(X_j) = \sum_{(x_{j+1},\ldots,x_{3v}) \in \{0,1\}^{3v-j}} f_i\bigl(r^{(i)}_1,\ldots,r^{(i)}_{j-1},X_j,x_{j+1},\ldots,x_{3v}\bigr).$$

The sum defining involves as many as terms, and thus a naive implementation of would require time per iteration of the protocol. However, we observe that if the multilinear extension of the circuit’s wiring predicate is used in the definition of , then each gate at layer contributes to exactly one term in the sum defining , as does each gate at layer . Thus, the polynomial can be computed with a single pass over the gates at layer , and a single pass over the gates at layer . As the sum-check protocol requires messages for each layer of the circuit, requires logarithmically many passes over each layer of the circuit in total.
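To make the "one pass per round" idea concrete, here is a simplified sketch of the sum-check protocol for a multilinear polynomial given by its table of values on the Boolean hypercube. It is not the protocol of [19] itself: the polynomial there is a product of several extensions, whereas this sketch assumes a single multilinear extension, an honest prover, and a field of size 2^61 − 1. The function names are hypothetical. Each round costs the prover one pass over a table that halves in size.

```python
import random

P = 2**61 - 1  # assumed field size for this sketch

def fold(table, r):
    """Fix the first variable of the multilinear extension to r.

    table[i] holds the value with first variable 0, table[half + i] with 1.
    """
    half = len(table) // 2
    return [((1 - r) * table[i] + r * table[half + i]) % P
            for i in range(half)]

def mle_eval(table, point):
    """Evaluate the multilinear extension of table at a point in F_P^v."""
    for r in point:
        table = fold(table, r)
    return table[0]

def sum_check(table, claimed_sum, rng=random):
    """Run honest prover and verifier together; True iff the verifier accepts.

    len(table) must be a power of two.
    """
    v = len(table).bit_length() - 1
    work = list(table)            # prover's working copy
    claim = claimed_sum % P       # verifier's current claim
    point = []
    for _ in range(v):
        half = len(work) // 2
        # Prover: one pass gives g_j(0) and g_j(1); g_j has degree 1.
        g0 = sum(work[:half]) % P
        g1 = sum(work[half:]) % P
        # Verifier: consistency check against the previous claim.
        if (g0 + g1) % P != claim:
            return False
        r = rng.randrange(P)
        claim = ((1 - r) * g0 + r * g1) % P   # new claim: g_j(r)
        work = fold(work, r)
        point.append(r)
    # Verifier's final check: one evaluation of the extension on its own.
    return claim == mle_eval(table, point)
```

With `table = [1, 2, 3, 4, 5, 6, 7, 8]`, the call `sum_check(table, 36)` accepts and `sum_check(table, 37)` is rejected in the first round. A dishonest prover would have to send inconsistent polynomials, which the random challenges catch with high probability.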

A complication in applying the above observation is that must process the circuit in order to pull out information about its structure necessary to check the validity of ’s messages. Specifically, each application of the sum-check protocol requires to evaluate the multilinear extension of the wiring predicate of the circuit at a random point. Theorem 3.2 follows from the fact that for any log-space uniform circuit, can evaluate the multilinear extension of the wiring predicate at any point using space . We present detailed proofs and discussions of the following theorems in Appendix A.

Theorem 3.2

For any log-space uniform circuit of size , requires time to implement the protocol of Theorem 3.1 over the entire execution, and requires space .

Moreover, because the circuit’s wiring predicate is independent of the input, we can separate ’s computation into an offline non-interactive preprocessing phase, which occurs before the data stream is seen, and an online interactive phase which occurs after both and have seen the input. This is similar to [19, Theorem 4], and ensures that is space-efficient (but may require time ) during the offline phase, and that is both time- and space-efficient in the online interactive phase. In order to determine which circuit to use, does need to know (an upper bound on) the length of the input during the preprocessing phase.

Theorem 3.3

For any log-space uniform circuit of size , requires time to implement the protocol of Theorem 3.1 over the entire execution. requires space and time in a non-interactive, data-independent preprocessing phase, and only requires space and time in an online interactive phase, where the term is due to the time required to evaluate the low-degree extension of the input at a point.

Finally, Theorem 3.4 follows by assuming can evaluate the multilinear extension of the wiring predicate quickly. A formal statement of Theorem 3.4 is in Appendix A. We believe that the hypothesis of Theorem 3.4 is extremely mild, and we discuss this point at length in Appendix A, identifying a diverse array of circuits to which Theorem 3.4 applies. Moreover, the solutions we adopt in our circuit-checking experiments for , , and PMwW correspond to Theorem 3.4, and are both space- and time-efficient for the verifier.

Theorem 3.4

(informal) Let be any log-space uniform circuit of size and depth , and assume there exists a -space, -time algorithm for evaluating the multilinear extension of ’s wiring predicate at a point. Then in order to implement the protocol of Theorem 3.1 applied to , requires time, and requires space and time , where the term is due to the time required to evaluate the low-degree extension of the input at a point.

3.2 Circuit Design Issues

The protocol of [19] is described for arithmetic circuits with addition () and multiplication gates (). This is sufficient to prove the power of this system, since any efficiently computable boolean function on boolean inputs can be computed by an (asymptotically) small arithmetic circuit. Typically such arithmetic circuits are obtained by constructing a boolean circuit (with AND, OR, and NOT gates) for the function, and then “arithmetizing” the circuit [2, Chapter 8]. However, we strive not just for asymptotic efficiency, but genuine practicality, and the factors involved can grow quite quickly: every layer of (arithmetic) gates in the circuit adds rounds of interaction to the protocol. Hence, we further explore optimizations and implementation issues.

Extended Gates. The circuit checking protocol of [19] can be extended with any gates that compute low-degree polynomial functions of their inputs. If is a polynomial of degree , we can use gates computing ; this increases the communication complexity in each round of the protocol by at most words, as must send a degree- polynomial, rather than a degree-2 polynomial.

The low-depth circuits we use to compute functions of interest (specifically,  and PMwW) make use of the function . Using only and gates, they require depth about . If we also use gates computing for a small , we can reduce the depth of the circuits to about ; as the number of rounds in the protocol of [19] depends linearly on the depth of the circuit, this reduces the number of rounds by a factor of about . At the same time this increases the communication cost of each round by a factor of (at most) . We can optimize the choice of . In our experiments, we use (so is ) and () to simultaneously reduce the number of messages by a factor of 3, and the communication cost and prover runtime by significant factors as well.

Another optimization is possible. All four specific problems we consider, , , PMwW, and MVMult, eventually compute the sum of a large number of values. Let be the low-degree extension of the values being summed. For functions of this form, can use a single sum-check protocol [2, Chapter 8] to reduce the computation of the sum to computing for a random point . can then use the protocol of [19] to delegate computation of to . Conceptually, this optimization corresponds to replacing a binary tree of addition gates in an arithmetic circuit with a single gate with large fan-in, which sums all its inputs. This optimization can reduce the communication cost and the number of messages required by the protocol.

General Circuit Design. The circuit checking approach can be combined with existing compilers, such as that in the Fairplay system [26], that take as input a program in a high-level programming language and output a corresponding boolean circuit. This boolean circuit can then be arithmetized and “verified” by our implementation; this yields a full-fledged system implementing statistically-secure verifiable computation. However, this system is likely to remain impractical even though the prover can be made to run in time linear in the size of the arithmetic circuit. For example, in most hardware, one can compute the sum of two 32-bit integers and with a single instruction. However, when encoding this operation into a boolean circuit, it is unclear how to do this with depth less than . At rounds per circuit layer, for reasonable parameters, single additions can turn into thousands of rounds.

The protocols in Section 3.3 avoid this by avoiding boolean circuits, and instead view the input directly as elements over . For example, if the input is an array of 32-bit integers, then we view each element of the array as a value of , and calculating the sum of two integers requires a single depth-1 addition gate, rather than a depth-32 boolean circuit. However, this approach seems to severely limit the functionality that can be implemented. For instance, we know of no compact arithmetic circuit to test whether when viewing and as elements of . Indeed, if such a circuit for this function existed, we would obtain substantially improved protocols for  and PMwW.

This polylogarithmic blowup in circuit depth compared to input size appears inherent in any construction that encodes computations as arithmetic circuits. Therefore, the development of general purpose protocols that avoid this representation remains an important direction for future work.

3.3 Efficient Protocols For Specific Problems

We obtain interactive protocols for our problems of interest by applying Theorem 3.1 to carefully chosen arithmetic circuits. These are circuits where each gate executes a simple arithmetic operation on its inputs, such as addition, subtraction, or multiplication. For the first three problems, there exist specialized protocols; our purpose in describing these protocols here is to explore how the general construction performs when applied to specific functions of high interest. However, for PMwW, the protocol we describe here is the first of its kind.

For each problem, we describe a circuit which exploits the arithmetic structure of the finite field over which they are defined. For the latter three problems, this involves an interesting use of Fermat’s Little Theorem. These circuits lend themselves to extensions of the basic protocol of [19] that achieve quantitative improvements in all costs; we demonstrate the extent of these improvements in Section 5.

Protocol for : The arithmetic circuit for  is quite straightforward: the first level computes the square of input values, then subsequent levels sum these up pairwise to obtain the sum of all squared values. The total depth is . This implies a message protocol (as per Definition 1.2).

Protocol for : We describe a succinct arithmetic circuit over that computes . When is a prime larger than , Fermat’s Little Theorem (FLT) implies that for , if and only if . Consider the circuit that, for each coordinate of the input vector , computes each via multiplications, and then sums the results. This circuit has total size and depth . Applying the protocol of [19] to this circuit, we obtain a protocol where runs in time .
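The FLT-based circuit can be mirrored directly in code. The sketch below assumes p = 2^61 − 1 and a frequency vector whose entries are smaller than p; the function names are hypothetical. `power_by_squaring` also counts multiplications, to illustrate that each coordinate needs only a logarithmic number of multiplication gates.

```python
P = 2**61 - 1  # assumed prime field size, larger than any frequency

def power_by_squaring(a, e, p=P):
    """Return (a^e mod p, number of field multiplications used)."""
    result, base, mults = 1, a % p, 0
    while e:
        if e & 1:
            result = result * base % p
            mults += 1
        e >>= 1
        if e:
            base = base * base % p
            mults += 1
    return result, mults

def distinct_count_via_flt(freq, p=P):
    """Sum of a_i^(p-1) mod p: each nonzero frequency contributes exactly 1."""
    total = 0
    for a in freq:
        value, _ = power_by_squaring(a, p - 1, p)
        total = (total + value) % p
    return total
```

For the frequency vector `[3, 0, 0, 7, 1]` the result is 3, the number of nonzero entries.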

Protocol for MVMult: The first level of the circuit computes for all , and subsequent levels sum these to obtain . Then we use FLT to ensure that for all , via

$$\sum_i \Bigl(\bigl(\textstyle\sum_j A_{ij}x_j\bigr) - b_i\Bigr)^{p-1}.$$

The input is as claimed if this final output of the circuit is (i.e. it counts the number of entries of that are incorrect). This circuit has depth and size , and we therefore obtain an protocol requiring -rounds, where runs in time .
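As a small illustration of this check, the following sketch evaluates the same quantity directly rather than through a circuit. It assumes p = 2^61 − 1 and integer inputs small enough that no true difference is a nonzero multiple of p; the function name `count_incorrect_entries` is hypothetical.

```python
P = 2**61 - 1  # assumed prime field size

def count_incorrect_entries(A, x, b, p=P):
    """Count rows i with (A x)_i != b_i, using z^(p-1) as a nonzero indicator."""
    total = 0
    for row, b_i in zip(A, b):
        diff = (sum(a_ij * x_j for a_ij, x_j in zip(row, x)) - b_i) % p
        total += pow(diff, p - 1, p)  # 0 if diff == 0, else 1 (by FLT)
    return total % p
```

With `A = [[1, 2], [3, 4]]` and `x = [5, 6]`, the claimed product `[17, 39]` gives 0, while `[17, 40]` gives 1.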

Protocol for PMwW: To handle wildcards in both (of length ) and (of length ), we replace each occurrence of the wildcard symbol with ; [13] notes that the pattern occurs at location of if and only if

$$I_i := \sum_{j=0}^{q-1} t_{i+j}\,p_j\,(t_{i+j}-p_j)^2 = 0.$$

Thus, by FLT, it suffices to compute , which can be done naively by an arithmetic circuit of size and depth . We obtain a protocol where runs in time .

For large , a further optimization is possible: the vector can be written as the sum of a constant number of circular convolutions. Such convolutions can be computed efficiently using Fourier techniques in time and, importantly, appropriate FFT and inverse FFT operations can be implemented via arithmetic circuits. Thus, for larger than , we can reduce the circuit size (and hence ’s runtime) in this way, rather than by naively computing each entry of independently.

4 Multi-Round Protocols via Linearization

In this section, we show how the technique of linearization can improve upon the general approach of Section 3 for some important functions. Specifically, this technique can be applied to multi-round protocols which would otherwise require polynomials of very high degree to be communicated. We show this in the context of new multi-round protocols for  and PMwW, and we later empirically observe that our new protocol achieves a speedup of two orders of magnitude over existing protocols for , as well as an order of magnitude improvement in communication cost.

Existing approaches for  in the multi-round setting are based on generalizations of the multi-round protocol for  [17]. As described in [17], directly applying this approach is problematic: the central function in  maps non-zero frequencies to 1 while keeping zero frequencies as zero. Expressed as a polynomial, this function has degree (an upper bound on the frequency of any item), which translates into a factor of in the communication required and the time cost of . However, this cost can be reduced to , where denotes the maximum number of times any item appears in the stream. Further, if both and keep a buffer of input items, they can eliminate duplicate items within the buffer, and so ensure that . This leads to an message, multi-round protocol with ’s runtime being [17]. This protocol trades off increased communication for a quadratic improvement in the number of rounds of communication required compared to the protocol outlined in Section 3.3 above.

4.1 Linearization Set-up

In this section we describe a new multi-round protocol for , and later explain how it can be modified for PMwW. This protocol has asymptotic costs similar to those obtained in Section 3.3, but in practice achieves close to two orders of magnitude improvement in ’s runtime. The core idea is to represent the data as a large binary vector indicating when each item occurs in the stream. The protocol simulates progressively merging time ranges together to indicate which items occurred within the ranges. Directly verifying this computation would hit the same roadblock indicated above: using polynomials to check this would result in polynomials of high degree, dominating the cost. So we use a “linearization” technique, which ensures that the degree of the polynomials required stays low, at the cost of more rounds of interaction. This uses ideas of Shen [34] as presented in [2, Chapter 8].

As usual, we work over a finite field with elements, . The input implicitly defines an matrix such that if the ’th item of the stream equals , and otherwise.

Working over the Boolean Hypercube. A key first step is to define an indexing structure based on the -dimensional Boolean hypercube, so every input point is indexed by a bit binary string, which is the (binary) concatenation of a bit string and a bit string . We view as a function from to via . Let be the unique multilinear polynomial in variables such that for all , i.e. is the multilinear extension of the function on implied by .

The only information that the verifier needs to keep track of is the value of at a random point. That is, chooses a random vector . It is efficient for to compute as observes the stream which defines (and hence ): when the ’th update is item , this translates to the vector . The necessary update is of the form , where is the unique polynomial that is 1 at and 0 everywhere else in . For this, only needs to store and the current value of .
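The verifier's streaming update can be sketched as follows. This is a minimal illustration under stated assumptions: the field has 2^61 − 1 elements, the index of each matrix entry is formed by concatenating the time bits and then the item bits (the ordering is a choice made for this sketch), and the class and function names are hypothetical.

```python
import random

P = 2**61 - 1  # assumed field size

def to_bits(x, width):
    """Big-endian bit list of x using exactly `width` bits."""
    return [(x >> (width - 1 - k)) & 1 for k in range(width)]

def chi(bits, r):
    """Multilinear polynomial that is 1 at `bits` and 0 elsewhere on the cube."""
    out = 1
    for b, r_i in zip(bits, r):
        out = out * (r_i if b else (1 - r_i)) % P
    return out

class StreamingVerifier:
    """Keeps only the random point r and the running extension value at r."""

    def __init__(self, time_bits, item_bits, r=None, rng=random):
        self.time_bits, self.item_bits = time_bits, item_bits
        width = time_bits + item_bits
        self.r = list(r) if r is not None else [rng.randrange(P)
                                                for _ in range(width)]
        self.value = 0   # current value of the extension at r
        self.t = 0       # number of updates seen so far

    def update(self, item):
        """Process the next stream item: add chi_v(r) for its index v."""
        v = to_bits(self.t, self.time_bits) + to_bits(item, self.item_bits)
        self.value = (self.value + chi(v, self.r)) % P
        self.t += 1
```

If `r` is set to a Boolean point, the running value is simply the matrix entry at that index, which gives an easy sanity check; in the protocol `r` is drawn at random before the stream begins.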

Linearization and Arithmetized Boolean Operators. We use three operators , and on polynomials , defined as follows:

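$$\begin{aligned}
\amalg_k\, g(X_1,\ldots,X_k) &= g(X_1,\ldots,X_{k-1},0) + g(X_1,\ldots,X_{k-1},1) - g(X_1,\ldots,X_{k-1},0)\cdot g(X_1,\ldots,X_{k-1},1),\\
\Pi_k\, g(X_1,\ldots,X_k) &= g(X_1,\ldots,X_{k-1},0)\cdot g(X_1,\ldots,X_{k-1},1),\\
L_i\, g(X_1,\ldots,X_k) &= X_i\cdot g(X_1,\ldots,X_{i-1},1,X_{i+1},\ldots,X_k) + (1-X_i)\cdot g(X_1,\ldots,X_{i-1},0,X_{i+1},\ldots,X_k).
\end{aligned}$$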

and generalize the familiar “OR” and “AND” operators, respectively. Thus, if is a -variate polynomial of degree at most in each variable, and are -variate polynomials of degree at most in each variable. They generalize Boolean operators in the sense that if and , and are both 0 or 1, then

$$(\amalg_k\, g)(X_1,\ldots,X_k) = 1 \text{ iff } x=1 \text{ or } y=1, \quad\text{and}\quad (\Pi_k\, g)(X_1,\ldots,X_k) = 1 \text{ iff } x=1 \text{ and } y=1.$$

is a linearization operator. If is a -variate polynomial, is a -variate polynomial that is linear in variable . operations are used to control the degree of the polynomials that arise throughout the execution of our protocol. Since for all , agrees with on all values in .
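The three operators can be written as higher-order functions on polynomials represented as Python callables. This is only an illustrative sketch: the field size 2^61 − 1 and the names `or_last`, `and_last`, and `linearize` are assumptions made here, and variables are 1-indexed to match the notation above.

```python
P = 2**61 - 1  # assumed field size

def or_last(g):
    """Arithmetized OR over the last variable of g."""
    return lambda *xs: (g(*xs, 0) + g(*xs, 1) - g(*xs, 0) * g(*xs, 1)) % P

def and_last(g):
    """Arithmetized AND over the last variable of g."""
    return lambda *xs: g(*xs, 0) * g(*xs, 1) % P

def linearize(i, g):
    """Return L_i g: linear in X_i, agreeing with g whenever X_i is 0 or 1."""
    def h(*xs):
        at_zero = xs[:i - 1] + (0,) + xs[i:]
        at_one = xs[:i - 1] + (1,) + xs[i:]
        return (xs[i - 1] * g(*at_one) + (1 - xs[i - 1]) * g(*at_zero)) % P
    return h
```

For example, with `g = lambda x1, x2: (x1**3 + x1*x2) % P`, the polynomial `linearize(1, g)` matches `g` on all Boolean inputs but evaluates to 12 rather than 18 at `(2, 5)`, since the cubic term has been replaced by a linear one.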

Throughout, when applying a sequence of operations to a polynomial to obtain a new one, the operations are applied “right-to-left”. For example, we write the variate polynomial

$$(L_1(L_2 \ldots (L_{k-1}(\amalg_k\, g)))) \quad\text{as}\quad L_1 L_2 \ldots L_{k-1}\,\amalg_k\, g.$$

Rewriting  and PMwW. For  and MVMult there is little need for linearization: the polynomials generated remain of low-degree, so the multi-round protocols described in [17, 15] already suffice. But linearization can help with  and PMwW.

Thinking of the input as a matrix as defined above, we can compute  by repeatedly taking the columnwise-OR of adjacent column pairs to end up with a vector which indicates whether item appeared in the stream, then repeatedly summing adjacent entries to get the number of distinct elements. When representing these operations as polynomials, we make additional use of linearization operations to control the degree of the polynomials that arise. Using the properties of the operations and described above and rewriting in terms of the hypercube, it can be seen that

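$$F_0(a) = \sum_{x_1=0}^{1}\cdots\sum_{x_k=0}^{1} L_{k_1}L_{k_1-1}\cdots L_1\,\amalg_{k_1+1}\; L_{k_1+1}L_{k_1}\cdots L_1\,\amalg_{k_1+2}\cdots L_{d-1}L_{d-2}\cdots L_1\,\amalg_d\, f \tag{2}$$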

because this expression only involves variables and values in . The size of this expression is .
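Ignoring the linearization steps, which do not change values on Boolean inputs, the underlying computation is just a sequence of column merges followed by a sum. The sketch below evaluates it directly on the 0/1 matrix. It assumes rows are indexed by items and columns by stream positions, padded to a power of two; the function name is hypothetical.

```python
def distinct_count_by_merging(stream, universe_size):
    """Count distinct items by repeatedly OR-ing adjacent columns."""
    n = 1
    while n < len(stream):
        n *= 2
    # A[i][j] = 1 iff the j-th stream item equals i; padding columns are 0.
    A = [[1 if j < len(stream) and stream[j] == i else 0 for j in range(n)]
         for i in range(universe_size)]
    while n > 1:
        # Arithmetized OR of adjacent column pairs: a + b - a*b.
        A = [[row[2 * j] + row[2 * j + 1] - row[2 * j] * row[2 * j + 1]
              for j in range(n // 2)]
             for row in A]
        n //= 2
    # One column remains: entry i says whether item i appeared at all.
    return sum(row[0] for row in A)
```

For the stream `[3, 1, 3, 0, 1]` over a universe of size 4, three merge rounds leave the indicator vector `[1, 1, 0, 1]`, whose sum is 3.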

The case for PMwW is similar. Assume for now that the pattern length is a power of two (if not, it can be padded with trailing wildcards). We now consider the input to define a matrix of size , such that if the ’th item of the stream equals , for all , and if the ’th character of the pattern equals , for all . Wildcards in the pattern or the text are treated as occurrences of all characters in the alphabet at that location. The problem is solved over this matrix by first taking the column-wise “AND” of adjacent columns: this leaves 1 where a text character matches a pattern for a certain offset. We then take column-wise “OR”s of adjacent columns times: this collapses the alphabet. Taking row-wise “AND”s of adjacent rows times leaves an indicator vector whose th entry is 1 iff the pattern occurs at location in the text. Summing the entries in this vector provides the required answer. Using linearization to bound the degree of and operators, we again obtain an expression of size .

4.2 Protocols Using Linearization

Given an expression in the form of (2), we now give an inductive description of the protocol. Conceptually, each round we ask the prover to “strip off” the left-most remaining operation in the expression. In the process, we reduce a claim by about the old expression to a claim about the new, shorter expression. Eventually, is left with a claim about the value of at a random point (specifically, at ), which can check against her independent evaluation of .

More specifically, suppose for some polynomial , the prover can convince the verifier that with probability 1 for any where this is true, and probability less than when it is false. Let be any polynomial on variables obtained as

$$U(X_1,X_2,\ldots,X_l) = O\,g(X_1,\ldots,X_j),$$

where is one of , , or for some variable . (Thus is in the first three cases and in the last). Let be an upper bound (known to the verifier) on the degree of with respect to . In our case, because of the inclusion of operations in between every and operation. We show how can convince that with probability 1 for any for which it is true and with probability at most when it is false. By renaming variables if necessary, assume . The verifier’s check is as follows.

Case 1: . provides a degree-1 polynomial that is supposed to be . checks if . If not, rejects. If so, picks a random value and asks to prove . If it is one of the final rounds, chooses to be the corresponding entry of .

Case 2: or . We do the same as in Case 1, but replace with in the case of , or in the case of .

Case 3: . wishes to prove that . provides a degree-2 polynomial that is supposed to be . We refer to this as “unbinding the variable” because previously was “bound” to value , but now is free. checks that . If not, rejects. If so, picks random and asks to prove (or if it is the final round, simply checks that ).

The proof of correctness follows by using the observation that if is not the right polynomial, then with probability , must prove an incorrect statement at the next round (this is an instance of the Schwartz-Zippel polynomial equality testing procedure [30]). The total probability of error is given by a union bound on the probabilities in each round, .

Analysis of protocol costs. Recall that both  and PMwW can be written as an expression of size operators, where linearization bounds the degree in any variable. Under the above procedure, the verifier need only store , , the current values of any “bound” variables, and the most recent value of . In total, this requires space . There are rounds, and in each round a polynomial of degree at most two is sent from to . Such a polynomial can be represented with at most words, so the total communication is . Hence we obtain (-protocols for  and PMwW.

As the stream is being processed the verifier has to update . The updates are very simple, and processing each update requires time. There is a slight overhead in PMwW, where each update in the stream requires the verifier to propagate updates to (assuming an upper bound on is fixed in advance), taking time. However, it seems plausible that these costs could be optimized further.

The prover has to store a description of the stream, which can be done in space . The prover can be implemented to require time: essentially, each round of the proof requires at most one pass over the stream data to compute the required functions. For brevity, we omit a detailed description of the implementation, the source code of which is available at [16].

Theorem 4.1

For any function which can be written as a concatenation of (binary) operators drawn from and over inputs of size , there is a round protocol, where takes time , and takes time to run the protocol, having computed the LDE of the input.

Thus we can invoke this theorem for both  and PMwW, obtaining round