Verifiable Computation with Massively Parallel Interactive Proofs
Abstract
As the cloud computing paradigm has gained prominence, the need for verifiable computation has grown increasingly urgent. The concept of verifiable computation enables a weak client to outsource difficult computations to a powerful, but untrusted, server. Protocols for verifiable computation aim to provide the client with a guarantee that the server performed the requested computations correctly, without requiring the client to perform the requested computations herself. By design, these protocols impose a minimal computational burden on the client. However, existing protocols require the server to perform a very large amount of extra bookkeeping, on top of the requested computations, in order to enable a client to easily verify the results. Verifiable computation has thus remained a theoretical curiosity, and protocols for it have not been implemented in real cloud computing systems.
In this paper, our goal is to leverage GPUs to reduce the server-side slowdown for verifiable computation. To this end, we identify abundant data parallelism in a state-of-the-art general-purpose protocol for verifiable computation, originally due to Goldwasser, Kalai, and Rothblum [muggles], and recently extended by Cormode, Mitzenmacher, and Thaler [itcs]. We implement this protocol on the GPU, and we obtain 40-120× server-side speedups relative to a state-of-the-art sequential implementation. For benchmark problems, our implementation thereby reduces the slowdown of the server to within factors of 100-500× relative to the original computations requested by the client. Furthermore, we reduce the already small runtime of the client by 100×. Similarly, we obtain 20-50× server-side and client-side speedups for related protocols targeted at specific streaming problems. We believe our results demonstrate the immediate practicality of using GPUs for verifiable computation, and more generally, that protocols for verifiable computation have become sufficiently mature to deploy in real cloud computing systems.
1 Introduction
A potential problem in outsourcing work to commercial cloud computing services is trust. If we store a large dataset with a server, and ask the server to perform a computation on that dataset – for example, to compute the eigenvalues of a large graph, or to compute a linear program on a large matrix derived from a database – how can we know the computation was performed correctly? Obviously we don’t want to compute the result ourselves, and we might not even be able to store all the data locally. Despite these constraints, we would like the server to not only provide us with the answer, but to convince us the answer is correct.
Protocols for verifiable computation offer a possible solution to this problem. The ultimate goal of any such protocol is to enable the client to obtain results with a guarantee of correctness from the server much more efficiently than performing the computations herself. Another important goal of any such protocol is to enable the server to provide results with guarantees of correctness almost as efficiently as providing results without guarantees of correctness.
Interactive proofs are a powerful family of protocols for establishing guarantees of correctness between a client and server. Although they have been studied in the theory community for decades, there had been no significant efforts to implement or deploy such proof systems until very recently. A recent line of work (e.g., [riva, icalp09, esa, itcs, vldb, muggles, ndss]) has made substantial progress in advancing the practicality of these techniques. In particular, prior work of Cormode, Mitzenmacher, and Thaler [itcs] demonstrates that: (1) a powerful general-purpose methodology due to Goldwasser, Kalai, and Rothblum [muggles] approaches practicality; and (2) special-purpose protocols for a large class of streaming problems are already practical.
In this paper, we clearly articulate this line of work to researchers outside the theory community. We also take things one step further, leveraging the parallelism offered by GPUs to obtain significant speedups relative to state-of-the-art implementations of [itcs]. Our goal is to invest the parallelism of the GPU to obtain correctness guarantees with minimal slowdown, rather than to obtain raw speedups, as is the case with more traditional GPU applications. We believe the insights of our GPU implementation could also apply to a multicore CPU implementation. However, GPUs are increasingly widespread, cost-effective, and power-efficient, and they offer the potential for speedups in excess of those possible with commodity multicore CPUs [owens, debunk].
We obtain server-side speedups ranging from 40-120× for the general-purpose protocol due to Goldwasser et al. [muggles], and 20-50× speedups for related protocols targeted at specific streaming problems. Our general-purpose implementation reduces the server-side cost of providing results with a guarantee of correctness to within factors of 100-500× relative to a sequential algorithm without guarantees of correctness. Similarly, our implementation of the special-purpose protocols reduces the server-side slowdown to within 10-100× relative to a sequential algorithm without guarantees of correctness.
We believe the additional costs of obtaining correctness guarantees demonstrated in this paper would already be considered modest in many correctness-critical applications. For example, at one end of the application spectrum is Assured Cloud Computing for military contexts: a military user may need integrity guarantees when computing in the presence of cyber attacks, or may need such guarantees when coordinating critical computations across a mixture of secure military networks and insecure networks owned by civilians or other nations [airforce]. At the other end of the spectrum, a hospital that outsources the processing of patients' electronic medical records to the cloud may require guarantees that the server is not dropping or corrupting any of the records. Even if not every computation is explicitly checked, the mere ability to check the computation could mitigate trust issues and stimulate users to adopt cloud computing solutions.
Our source code is available at [code2].
2 Background
2.1 What are interactive proofs?
Interactive proofs (IPs) were introduced within the computer science theory community more than a quarter century ago, in seminal papers by Babai [ip1] and Goldwasser, Micali, and Rackoff [ip2]. In any IP, there are two parties: a prover P, and a verifier V. P is typically considered to be computationally powerful, while V is considered to be computationally weak.
In an IP, P solves a problem using her (possibly vast) computational resources, and tells V the answer. P and V then have a conversation, which is to say, they engage in a randomized protocol involving the exchange of one or more messages between the two parties. The term interactive proof derives from the back-and-forth nature of this conversation. During this conversation, P's goal is to convince V that her answer is correct.
IPs naturally model the problem of a client (whom we model as V) outsourcing computation to an untrusted server (whom we model as P). That is, IPs provide a way for a client to hire a cloud computing service to store and process data, and to efficiently check the integrity of the results returned by the server. This is useful whenever the server is not a trusted entity, either because the server is deliberately deceptive, or is simply buggy or inept. We therefore interchange the terms server and prover where appropriate. Similarly, we interchange the terms client and verifier where appropriate.
Any IP must satisfy two properties. Roughly speaking, the first is that if P answers correctly and follows the prescribed protocol, then P will convince V to accept the provided answer. The second property is a security guarantee, which says that if P is lying, then V must catch P in the lie and reject the provided answer with high probability. A trivial way to satisfy this property is to have V compute the answer to the problem herself, and accept only if her answer matches P's. But this defeats the purpose of having a prover. The goal of an interactive proof system is to allow V to check P's answer using resources considerably smaller than those required to solve the problem from scratch.
At first blush, this may appear difficult or even impossible to achieve. However, IPs have turned out to be surprisingly powerful. We direct the interested reader to [arorabarak, Chapter 8] for an excellent overview of this area.
2.2 How do interactive proofs work?
At the highest level, many interactive proof methods (including the ones in this paper) work as follows. Suppose the goal is to compute a function f of the input x.
First, the verifier makes a single streaming pass over the input x, during which she extracts a short secret r. This secret is actually a single (randomly chosen) symbol of an error-corrected encoding of the input. To be clear, the secret does not depend on the problem being solved; in fact, for many interactive proofs, it is not necessary that the problem be determined until after the secret is extracted.
Next, V and P engage in an extended conversation, during which V sends P various challenges, and P responds to the challenges (see Figure 1 for an illustration). The challenges are all related to each other, and the verifier checks that the prover's responses to all challenges are internally consistent.
The challenges are chosen so that the prover's response to the first challenge must include a (claimed) value for the function of interest. Similarly, the prover's response to the last challenge must include a claim about what the value of the verifier's secret r should be. If all of P's responses are internally consistent, and the claimed value of r matches the true value of r, then the verifier is convinced that the prover followed the prescribed protocol and accepts. Otherwise, the verifier knows that the prover deviated at some point, and rejects. From this point of view, the purpose of all the intermediate challenges is to guide the prover from a claim about f(x) to a claim about the secret r, while maintaining V's control over P.
Intuitively, what gives the verifier this surprising power to detect deviations is the error-correcting property of the encoding used to define the secret. Any good error-correcting code E satisfies the property that if two strings x and x' differ in even one location, then the encodings E(x) and E(x') differ in almost every location. In the same way, interactive proofs ensure that if P flips even a single bit of a single message in the protocol, then P either has to make an inconsistent claim at some later point, or else has to lie almost everywhere in her final claim about the value of the secret r. Thus, if the prover deviates from the prescribed protocol even once, the verifier will detect this with high probability and reject.
2.3 Previous work
Unfortunately, despite their power, IPs have had very little influence on real systems where integrity guarantees on outsourced computation would be useful. There appears to have been a folklore belief that these methods are impractical [ndss]. As previously mentioned, a recent line of work (e.g., [riva, icalp09, esa, itcs, vldb, muggles, ndss]) has made substantial progress in advancing the practicality of these techniques. In particular, Goldwasser et al. [muggles] described a powerful general-purpose protocol (henceforth referred to as the GKR protocol) that achieves a polynomial-time prover and nearly linear-time verifier for a large class of computations. Very recently, Cormode, Mitzenmacher, and Thaler [itcs] showed how to significantly speed up the prover in the GKR protocol [muggles]. They also implemented this protocol, and demonstrated experimentally that their implementation approaches practicality. Even with their optimizations, the bottleneck in the implementation of [itcs] is the prover's runtime, with all other costs (such as verifier space and runtime) being extremely low.
A related line of work has looked at protocols for specific streaming problems. Here, the goal is not just to save the verifier time (compared to doing the computation without a prover), but also to save the verifier space. This is motivated by cloud computing settings where the client does not even have space to store a local copy of the input, and thus uses the cloud to both store and process the data. The protocols developed in this line of work do not require the client to store the input, but rather allow the client to make a single streaming pass over the input (which can occur, for example, while the client is uploading data to the cloud). Throughout this paper, whenever we mention a streaming verifier, we mean the verifier makes a single pass over the input, and uses space significantly sublinear in the size of the data.
The notion of a non-interactive streaming verifier was first put forth by Chakrabarti et al. [icalp09] and studied further by Cormode et al. [esa]. These works allow the prover to send only a single message to the verifier (e.g., as an attachment to an email, or posted on a website), with no communication in the reverse direction. Moreover, these works present protocols achieving provably optimal tradeoffs between the size of the proof and the space used by the verifier for a variety of problems, ranging from matrix-vector multiplication to graph problems like bipartite perfect matching.
Later, Cormode, Thaler, and Yi [vldb] extended the streaming model of [icalp09] to allow an interactive prover and verifier, who actually have a conversation. They demonstrated that interaction allows for much more efficient protocols in terms of client space, communication, and server running time than are possible in the one-message model of [icalp09, esa]. It was also observed in this work that the general-purpose GKR protocol works with just a streaming verifier. Finally, the aforementioned work of Cormode, Mitzenmacher, and Thaler [itcs] also showed how to use sophisticated Fast Fourier Transform (FFT) techniques to drastically speed up the prover's computation in the protocols of [icalp09, esa].
Also relevant is work by Setty et al. [ndss], who implemented a protocol for verifiable computation due to Ishai et al. [ishai]. To set the stage for our results using parallelization, in Section 6 we compare our approach with [ndss] and [itcs] in detail. In summary, the implementation of the GKR protocol described both in this paper and in [itcs] has several advantages over [ndss]. For example, the GKR implementation saves space and time for the verifier even when outsourcing a single computation, while [ndss] saves time for the verifier only when batching together several dozen computations at once and amortizing the verifier's cost over the batch. Moreover, the GKR protocol is unconditionally secure against computationally unbounded adversaries who deviate from the prescribed protocol, while the Ishai et al. protocol relies on cryptographic assumptions to obtain security guarantees. We present experimental results demonstrating that the prover in the sequential implementation of [itcs], based on the GKR protocol, runs significantly faster than the prover in the implementation of [ndss], based on the Ishai et al. protocol [ishai].
Based on this comparison, we use the sequential implementation of [itcs] as our baseline. We then present results showing that our new GPU-based implementation runs 40-120× faster than the sequential implementation of [itcs].
3 Our interactive proof protocols
In this section, we give an overview of the methods implemented in this paper. Due to their highly technical nature, we seek only to convey a highlevel description of the protocols relevant to this paper, and deliberately avoid rigorous definitions or theorems. We direct the interested reader to prior work for further details [icalp09, esa, itcs, muggles].
3.1 GKR protocol
The prover and verifier first agree on a layered arithmetic circuit of fan-in two over a finite field F computing the function of interest. An arithmetic circuit is just like a boolean circuit, except that the inputs are elements of F rather than boolean values, and the gates perform addition and multiplication over F, rather than computing AND, OR, and NOT operations. See Figure 2 for an example circuit. In fact, any boolean circuit can be transformed into an arithmetic circuit computing an equivalent function over a suitable finite field, although this approach may not yield the most succinct arithmetic circuit for the function.
Suppose the output layer of the circuit is layer 0, and the input layer is layer d. The protocol of [muggles] proceeds in d iterations, with one iteration for each layer of the circuit. The first iteration follows the general outline described in Section 2.2, with V guiding P from a claim about the output of the circuit to a claim about a secret, via a sequence of challenges and responses. The challenges sent by V to P are simply random coins, which are interpreted as random points in the finite field F. The prescribed responses of P are polynomials, where each prescribed polynomial depends on the preceding challenge. Such a polynomial can be specified either by listing its coefficients, or by listing its evaluations at several points.
However, unlike in Section 2.2, this secret is not a symbol in an error-corrected encoding of the input, but rather a symbol in an error-corrected encoding of the gate values at layer 1. Unfortunately, V cannot compute this secret on her own. Doing so would require evaluating all previous layers of the circuit, and the whole point of outsourcing is to avoid this. So V has P tell her what the secret should be. But now V has to make sure that P is not lying about its value.
This is what the second iteration accomplishes, with V guiding P from a claim about the first secret to a claim about a new secret, which is a symbol in an encoding of the gate values at layer 2. This continues until we get to the input layer. At this point, the secret is actually a symbol in an error-corrected encoding of the input, and V can easily compute this secret in advance from the input on her own. Figure 1 illustrates the entirety of the GKR protocol at a very high level.
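The challenge-and-response pattern underlying each iteration can be illustrated with a toy two-variable sum-check interaction, the consistency mechanism at the heart of the GKR protocol. The polynomial g and the field size 101 below are illustrative choices of ours, not taken from the paper; the real protocol uses a polynomial derived from the circuit's wiring and a far larger field.

```cpp
#include <cstdint>

// Toy field modulus, for illustration only.
const uint64_t MOD = 101;

// The polynomial whose sum over {0,1}^2 the prover wants to prove:
// g(x1, x2) = 2*x1*x2 + x1 + 3 (mod 101).
uint64_t g(uint64_t x1, uint64_t x2) {
    return (2 * x1 % MOD * x2 + x1 + 3) % MOD;
}

// Prover's prescribed round-1 polynomial: s1(X) = g(X,0) + g(X,1).
uint64_t s1(uint64_t x) { return (g(x, 0) + g(x, 1)) % MOD; }

// Prover's prescribed round-2 polynomial, after challenge r1: s2(X) = g(r1,X).
uint64_t s2(uint64_t r1, uint64_t x) { return g(r1, x); }

// Verifier's side: check a claimed value of the sum of g over {0,1}^2,
// using the prover's two polynomials and random challenges r1, r2.
// (Here the final spot-check calls g directly; in the real protocol the
// verifier instead compares against her precomputed secret.)
bool verify(uint64_t claim, uint64_t r1, uint64_t r2) {
    if ((s1(0) + s1(1)) % MOD != claim) return false;          // round 1
    if ((s2(r1, 0) + s2(r1, 1)) % MOD != s1(r1)) return false; // round 2
    return s2(r1, r2) == g(r1, r2);                            // final check
}
```

An honest prover passes for any challenges, while a wrong claim (here, anything other than 16) is rejected immediately; in the full protocol, a cheating prover who adjusts later messages to stay consistent is caught at the final check with high probability.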
We take this opportunity to point out an important property of the protocol of [muggles], which was critical in allowing our GPU-based implementation to scale to large inputs. Namely, any iteration of the protocol involves only two layers of the circuit at a time. In the ith iteration, the verifier guides the prover from a claim about gate values at layer i-1 to a claim about gate values at layer i. Gates at higher or lower layers do not affect the prescribed responses within iteration i.
3.2 Special-purpose protocols
As mentioned in Section 2.3, efficient problem-specific non-interactive verifiable protocols have been developed for a variety of problems of central importance in streaming and database processing, ranging from linear programming to graph problems like shortest path. The central primitive in many of these protocols is itself a protocol, originally due to Chakrabarti et al. [icalp09], for a problem known as the second frequency moment, or F2. In this problem, the input is a sequence of items from a universe of size n, and the goal is to compute the sum of the squared frequencies f_i^2 over all items i, where f_i is the number of times item i appears in the sequence. As explained in [itcs], speeding up this primitive immediately speeds up protocols for all of the problems that use the F2 protocol as a subroutine.
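As a baseline for what is being outsourced, the unverified F2 computation itself is straightforward. The sketch below assumes items are integers in [0, n); it is the computation the client would have to perform herself without a prover.

```cpp
#include <cstdint>
#include <vector>

// Compute the second frequency moment F2 = sum_i f_i^2, where f_i is the
// number of times item i appears in the stream. One streaming pass builds
// the frequency vector; a second pass over the universe sums the squares.
uint64_t second_moment(const std::vector<uint32_t>& stream, uint32_t n) {
    std::vector<uint64_t> freq(n, 0);
    for (uint32_t item : stream) freq[item]++;
    uint64_t f2 = 0;
    for (uint64_t f : freq) f2 += f * f;
    return f2;
}
```

For the stream (0, 1, 1, 2, 2, 2) over a universe of size 4, the frequencies are (1, 2, 3, 0), so F2 = 1 + 4 + 9 = 14.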
The aforementioned protocol of Chakrabarti et al. [icalp09] achieves provably optimal tradeoffs between the length of the proof and the space used by the verifier. Specifically, for any positive integers h and v with h·v ≥ n, the protocol can achieve a proof length of just h machine words, as long as the verifier uses v words of space. For example, we may set both h and v to be roughly √n, which is substantially sublinear in the input size n.
Very roughly speaking, this protocol follows the same outline as in Section 2.2, except that in order to remove the interaction from the protocol, the verifier needs to compute a more complicated secret. Specifically, the verifier's secret consists of multiple symbols in an error-corrected encoding of the input, rather than a single symbol. To compute the prescribed proof, the prover has to evaluate a large number of symbols in the error-corrected encoding of the input. The key insight of [itcs] is that these symbols need not be computed independently (which would require substantially superlinear time), but instead can be computed in near-linear time using FFT techniques. More specifically, the protocol of [itcs] partitions the universe into an h × v grid, and it performs a sophisticated FFT variant known as the Prime Factor Algorithm [pfa] on each row of the grid. The final step of P's computation is to compute the sum of the squared entries for each column of the (transformed) grid; these values form the actual content of P's prescribed message.
4 Parallelizing our protocols
In this section, we explain the insights necessary to parallelize the computation of both the prover and the verifier for the protocols we implemented.
4.1 GKR protocol
Parallelizing P's computation
In every one of P's responses in the GKR protocol, the prescribed message from P is defined via a large sum over roughly S terms, where S is the size of the circuit, and so computing each such sum naively would take time linear in the circuit size. Roughly speaking, Cormode et al. [itcs] observe that each gate of the circuit contributes to only a single term of this sum, and thus the sum can be computed via a single pass over the relevant gates. The contribution of each gate to the sum can be computed in constant time, and each gate contributes to logarithmically many messages over the course of the protocol. Using these observations carefully reduces P's runtime from Ω(S^3) to O(S log S), where again S is the circuit size.
The same observation reveals that P's computation can be parallelized: each gate contributes independently to the sum in P's prescribed response. Therefore, P can compute the contribution of many gates in parallel, save the results in a temporary array, and use a parallel reduction to sum the results. We stress that all arithmetic is done within the finite field F, rather than over the integers. Figure 3 illustrates this process.
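The reduction pattern might look as follows in sequential C++ code. Here contribution() is a hypothetical stand-in for the real per-gate formula (which depends on the gate's index, its in-neighbors, and the verifier's challenges), and std::transform_reduce plays the role of the GPU-side parallel reduction, e.g., a thrust::transform_reduce call in the actual CUDA implementation.

```cpp
#include <cstdint>
#include <numeric>
#include <vector>

// The field modulus used throughout: p = 2^61 - 1.
const uint64_t FP = 2305843009213693951ULL;

// Per-gate state: the values of the gate's two in-neighbors.
struct Gate { uint64_t in1, in2; };

// Hypothetical per-gate term, standing in for the real formula:
// the product of the in-neighbor values mod p.
uint64_t contribution(const Gate& gate) {
    return (unsigned __int128)gate.in1 * gate.in2 % FP;
}

// The prover's message is a reduction over independent per-gate terms,
// which is exactly the shape a GPU parallel reduction exploits.
uint64_t message(const std::vector<Gate>& layer) {
    return std::transform_reduce(
        layer.begin(), layer.end(), (uint64_t)0,
        [](uint64_t a, uint64_t b) { return (a + b) % FP; },  // reduce mod p
        contribution);                                        // map per gate
}
```

Because the map step is independent per gate and the reduce step is associative, the same code maps directly onto one thread per gate followed by a tree reduction on the device.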
Parallelizing V's computation
The bulk of V's computation (by far) consists of computing her secret, which consists of a single symbol in a particular error-corrected encoding of the input. As observed in prior work [vldb], each symbol of the input contributes independently to this secret. Thus, V can compute the contribution of many input symbols in parallel, and sum the results via a parallel reduction, just as in the parallel implementation of P's computation. This speedup is perhaps of secondary importance, as V runs extremely quickly even in the sequential implementation of [itcs]. However, parallelizing V's computation is still an appealing goal, especially as GPUs are becoming more common on personal computers and mobile devices.
4.2 Special-purpose protocols
Parallelizing P's computation
Recall that the prover in the special-purpose protocols can compute the prescribed message by interpreting the input as an h × v grid, where h is roughly the proof length and v is the amount of space used by the verifier. The prover then performs a sophisticated FFT on each row of the grid independently. This can be parallelized by transforming multiple rows of the grid in parallel. Indeed, Cormode et al. [itcs] achieved roughly a 7× speedup for this problem by using all eight cores of a multicore processor. Here, we obtain a much larger 20-50× speedup using the GPU. (Note that [itcs] did not develop a parallel implementation of the GKR protocol, only of the special-purpose protocols.)
Parallelizing V's computation
Recall that in the special-purpose protocols, the verifier's secret consists of multiple symbols in an error-corrected encoding of the input, rather than a single symbol. Just as in Section 4.1, this computation can be parallelized by noting that each input symbol contributes independently to each entry of the encoded input. This requires V to store a large buffer of input symbols to work on in parallel. In some streaming contexts, V may not have the memory to accomplish this. Still, there are many settings in which this is feasible. For example, V may have several hundred megabytes of memory available, and seek to outsource processing of a stream that is many gigabytes or terabytes in length. Thus, parallel computation combined with buffering can help a streaming verifier keep up with a live stream of data: V splits her memory into two buffers, and at all times one buffer will be collecting arriving items. As long as V can process the full buffer (aided by parallelism) before her other buffer overflows, V will be able to keep up with the live data stream. Notice this discussion applies to the client in the GKR protocol as well, as the GKR protocol also enables a streaming verifier.
5 Architectural considerations
5.1 GKR protocol
The primary issue with any GPU-based implementation of the prover in the GKR protocol is that the computation is extremely memory-intensive: for a circuit of size S (which corresponds to roughly S arithmetic operations in an unverifiable algorithm), the prover in the GKR protocol has to store all S gates explicitly, because she needs the values of these gates to compute her prescribed messages. We investigate three alternative strategies for managing the memory overhead of the GKR protocol, which we refer to as the no-copying approach, the copy-once-per-layer approach, and the copy-every-message approach.
The no-copying approach
The simplest approach is to store the entire circuit explicitly on the GPU. We call this the no-copying approach. However, this means that the entire circuit must fit in device memory, a requirement that is violated even for relatively small circuits consisting of roughly tens of millions of gates.
The copy-once-per-layer approach
Another approach is to keep the circuit in host memory, and only copy information to the device when it is needed. This is possible because, as mentioned in Section 3.1, at any point in the protocol the prover only operates on two layers of the circuit at a time, so only two layers of the circuit need to reside in device memory. We refer to this as the copy-once-per-layer approach. This is the approach we used in the experiments in Section 6.
Care must be taken with this approach to prevent host-to-device copying from becoming a bottleneck. Fortunately, for each layer of the circuit there are several dozen messages to be computed before the prover moves on to the next layer, which ensures that copying from host to device makes up a very small portion of the runtime.
This method is sufficient to scale to very large circuits for all of the problems considered in the experimental section of [itcs], since no single layer of the circuits is significantly larger than the problem input itself. However, this method remains problematic for circuits that have (one or several) layers which are particularly wide, as an explicit representation of all the gates within a single wide layer may still be too large to fit in device memory.
The copy-every-message approach
In the event that there are individual layers which are too large to reside in device memory, a third approach is to copy part of a layer at a time from the host to the device, and compute the contribution of each gate in the part to the prover's message before swapping the part back to host memory and bringing in the next part. We call this the copy-every-message approach. This approach is viable, but it raises a significant issue, alluded to in its name. Namely, this approach requires host-to-device copying for every message, rather than just once per layer of the circuit. That is, in any iteration of the protocol, P cannot compute her jth message until after the jth challenge from V is received. Thus, for each message, the entirety of the current layer must be loaded piece-by-piece into device memory, swapping each piece back to host memory after the piece has been processed. In contrast, the copy-once-per-layer approach allows P to copy an entire layer to the device and leave it in device memory for the entirety of the corresponding iteration (which consists of several dozen messages). Thus, the slowdown inherent in the copy-every-message approach is not just that P has to break each layer into parts, but that P has to do host-to-device and device-to-host copying for each message, instead of copying an entire layer once and computing several messages from it.
We leave implementing the copy-every-message approach in full for future work, but preliminary experiments suggest that this approach is viable in practice, resulting in less than a 3× slowdown compared to the copy-once-per-layer approach. Notice that even after paying this slowdown, our GPU-based implementation would still achieve a 10-40× speedup compared to the sequential implementation of [itcs].
Memory access
Recall that for each message in the ith iteration of the GKR protocol, we assign a thread to each gate at the ith layer of the circuit, as each gate contributes independently to the prescribed message of the prover. The contribution of a gate g depends only on the index of g, the indices of the two gates feeding into g, and the values of those two gates.
Given this data, the contribution of gate g to the prescribed message can be computed using roughly 10-20 additions and multiplications within the finite field (the precise number of arithmetic operations required varies over the course of the iteration). As described in Section 6, we choose to work over a field that allows for extremely efficient arithmetic; for example, multiplying two field elements requires three machine multiplications of 64-bit data types, and a handful of additions and bit shifts.
In all of the circuits we consider, the indices of g's in-neighbors can be determined with very little arithmetic and no global memory accesses. For example, if the wiring pattern of the circuit forms a binary tree, then the first in-neighbor of gate g has index 2g, and the second in-neighbor of g has index 2g + 1. For each message, the thread assigned to g can compute this information from scratch without incurring any memory accesses.
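As an illustration of such arithmetic-only wiring, the sketch below evaluates one layer of a binary-tree addition circuit, deriving each gate's in-neighbor indices from its own index alone. The addition-gate circuit is an illustrative choice of ours; the paper's circuits mix addition and multiplication gates.

```cpp
#include <cstdint>
#include <vector>

// Field modulus p = 2^61 - 1.
const uint64_t PRIME = 2305843009213693951ULL;

// Binary-tree wiring: gate g at one layer reads gates 2g and 2g+1
// at the layer below. No lookup table is needed for the indices.
uint64_t first_in_neighbor(uint64_t g)  { return 2 * g; }
uint64_t second_in_neighbor(uint64_t g) { return 2 * g + 1; }

// Evaluate one layer of a binary-tree addition circuit: the value of
// gate g is the sum (mod p) of its two in-neighbors at the layer below.
std::vector<uint64_t> eval_layer(const std::vector<uint64_t>& below) {
    std::vector<uint64_t> layer(below.size() / 2);
    for (uint64_t g = 0; g < layer.size(); ++g)
        layer[g] = (below[first_in_neighbor(g)] +
                    below[second_in_neighbor(g)]) % PRIME;
    return layer;
}
```

In the GPU version, the loop body becomes the per-thread work, and because the index computation is pure arithmetic, it costs no memory traffic.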
In contrast, obtaining the values of g's in-neighbors requires fetching 8 bytes per in-neighbor from global memory. These memory accesses are necessary because it is infeasible to recompute the value of each gate's in-neighbors from scratch for every message, and so we store these values explicitly. As these global memory accesses can be a bottleneck in the protocol, we strive to arrange the data in memory to ensure that adjacent threads access adjacent memory locations. To this end, for each layer we maintain two separate arrays, with the jth entry of the first (respectively, second) array storing the value of the first (respectively, second) in-neighbor of the jth gate at that layer. During the corresponding iteration, the thread assigned to the jth gate accesses location j of the first and second arrays to retrieve the values of its first and second in-neighbors respectively. This ensures that adjacent threads access adjacent memory locations.
For all layers, the corresponding arrays are populated with in-neighbor values when we evaluate the circuit at the start of the protocol (we store each layer's arrays on the host until that layer's iteration of the protocol, at which point we transfer the arrays from host memory to device memory as described in Section 5.1.2). Notice this methodology sometimes requires data duplication: if many gates at a layer share the same in-neighbor, then that gate's value will appear many times in the layer's arrays. We feel that slightly increased space usage is a reasonable price to pay to ensure memory coalescing.
5.2 Special-purpose protocols
Memory access
Recall that the prover in our special-purpose protocols views the input as an h × v grid, and performs a sophisticated FFT on each row of the grid independently. Although the independence of the calculations in each row offers abundant opportunities for task-parallelism, extracting the data-parallelism required for high performance on GPUs requires care, due to the irregular memory access pattern of the specific FFT algorithm used.
We observe that although each FFT has a highly irregular memory access pattern, this memory access pattern is data-independent. Thus, we can convert abundant task-parallelism into abundant data-parallelism by transposing the data grid into column-major rather than row-major order. This simple transformation ensures perfect memory coalescing despite the irregular memory access pattern of each FFT, and improves the performance of our special-purpose prover by more than 10×.
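The effect of the transposition can be sketched as follows (a Python sketch of the layout, not of the FFT itself):

```python
def to_column_major(grid):
    """Flatten an r-by-c grid in column-major order.  If thread t owns
    row t and all threads touch the same within-row offset j at the same
    step (the access pattern is data-independent), the touched elements
    occupy consecutive addresses j*r, j*r+1, ..., j*r+(r-1)."""
    r, c = len(grid), len(grid[0])
    return [grid[i][j] for j in range(c) for i in range(r)]

flat = to_column_major([[1, 2], [3, 4], [5, 6]])
# flat == [1, 3, 5, 2, 4, 6]: for a fixed within-row offset, the three
# rows' elements are adjacent in memory, giving coalesced accesses.
```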
6 Evaluation
6.1 Implementation details
Except where noted, we performed our experiments on an Intel Xeon 3 GHz workstation with 16 GB of host memory. Our workstation also has an NVIDIA GeForce GTX 480 GPU with 1.5 GB of device memory. We implemented all our GPU code in CUDA and Thrust [thrust] with all compiler optimizations turned on.
Similar to the sequential implementations of [itcs], both our implementation of the GKR protocol and the special-purpose protocol due to [icalp09, itcs] work over the finite field of size p = 2^61 − 1. We chose this field for a number of reasons. First, the integers embed naturally within it. Second, the field is large enough that the probability the verifier fails to detect a cheating prover is tiny (roughly proportional to the reciprocal of the field size). Third, arithmetic within the field can be performed efficiently with simple shifts and bitwise operations [thorup]. We remark that no floating-point operations were necessary in any of our implementations, because all arithmetic is done over finite fields.
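Assuming the field is the one of size p = 2^61 − 1 (a Mersenne prime, for which modular reduction needs only shifts, masks, and adds), multiplication in the field can be sketched as follows; the production code is CUDA/C, but the algebra is identical:

```python
P = (1 << 61) - 1  # the Mersenne prime 2^61 - 1

def mod_mul(a, b):
    """Multiply modulo p = 2^61 - 1 using only shifts, masks, and adds:
    since 2^61 is congruent to 1 mod p, the high bits of the 122-bit
    product can simply be folded back onto the low bits."""
    x = a * b
    x = (x >> 61) + (x & P)  # first fold: result fits in 62 bits
    x = (x >> 61) + (x & P)  # second fold: result is at most p + 1
    if x >= P:
        x -= P
    return x
```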
Finally, we stress that in all reported costs below, we do count the time taken to copy data between the host and the device, and all reported speedups relative to sequential processing take this cost into account. We do not count the time to allocate memory for scratch space because this can be done in advance.
6.2 Experimental methodology for the GKR protocol
We ran our GPU-based implementation of the GKR protocol on four separate circuits, which together capture several different aspects of computation, from data aggregation to search, to linear algebra. The first three circuits were described and evaluated in [itcs] using the sequential implementation of the GKR protocol. The fourth problem was described and evaluated in [ndss] based on the Ishai et al. protocol [ishai]. Below, [n] denotes the set of integers {0, 1, …, n − 1}.

F2: Given a stream of elements from [n], compute the second frequency moment F2 = Σ_i a_i^2, where a_i is the number of occurrences of i in the stream.

F0: Given a stream of elements from [n], compute the number of distinct elements (i.e., the number of i with a_i ≠ 0, where again a_i is the number of occurrences of i in the stream).

PM: Given a stream representing text T = (t_0, …, t_{n−1}) and pattern P = (p_0, …, p_{q−1}), the pattern P is said to occur at location i in T if, for every position j in P, p_j = t_{i+j}. The pattern-matching problem is to determine the number of locations at which P occurs in T.

MatMult: Given three n × n matrices A, B, and C, determine whether AB = C. (In practice, we do not expect C to truly be part of the input data stream. Rather, prior work [vldb, itcs] has shown that the GKR protocol works even if A and B are specified by a stream, while C is given later by the prover.)
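For concreteness, the (unverified) reference semantics of the four benchmark problems can be written as follows; this is only a specification aid, not part of any protocol:

```python
from collections import Counter

def f2(stream):
    """Second frequency moment: sum of squared occurrence counts."""
    return sum(c * c for c in Counter(stream).values())

def f0(stream):
    """Number of distinct elements in the stream."""
    return len(set(stream))

def pattern_match_count(text, pattern):
    """Number of locations i at which the pattern occurs in the text."""
    q = len(pattern)
    return sum(text[i:i + q] == pattern for i in range(len(text) - q + 1))

def matmult_check(A, B, C):
    """Does A times B equal C?  Naive O(n^3) reference check."""
    n = len(A)
    return all(sum(A[i][k] * B[k][j] for k in range(n)) == C[i][j]
               for i in range(n) for j in range(n))
```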
The first two problems, F2 and F0, are classical data-aggregation queries which have been studied for more than a decade in the data-streaming community. F0 is also a highly useful subroutine in more complicated computations, as it effectively allows for equality testing of vectors or matrices (by subtracting two vectors and checking whether the result equals the zero vector). We make use of this subroutine when designing our matrix-multiplication circuit below.
The third problem, PM, is a classic search problem, and is motivated, for example, by clients wishing to store (and search) their email on the cloud. Cormode et al. [itcs] considered the Pattern Matching with Wildcards problem, where the pattern and text can contain wildcard symbols that match any character, but for simplicity we did not implement this additional functionality.
We chose the fourth problem, matrix multiplication, for several reasons. First was its practical importance. Second was a desire to experiment with problems requiring superlinear time to solve (in contrast to F2 and F0): running on a superlinear problem allowed us to demonstrate that our implementation, as well as that of [itcs], saves the verifier time in addition to space, and it also forced us to grapple with the memory-intensive nature of the GKR protocol (see Section 4). Third was its status as a benchmark enabling us to compare the implementations of [itcs] and [ndss]. Although there are also efficient special-purpose protocols for verifying matrix multiplication (see Freivalds' algorithm [randombook, Section 7.1], as well as Chakrabarti et al. [icalp09, Theorem 5.2]), it is still interesting to see how a general-purpose implementation performs on this problem. Finally, matrix multiplication is an attractive primitive to have at one's disposal when verifying more complicated computations using the GKR protocol.
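Freivalds' classical check, mentioned above, is simple enough to sketch: it verifies a claimed product AB = C in O(n^2) time per repetition using a random vector. This Python sketch illustrates the flavor of special-purpose verification; it is not part of any implementation in this paper:

```python
import random

def freivalds(A, B, C, reps=20):
    """Freivalds' randomized check that A*B = C: pick a random 0/1
    vector x and test A(Bx) == Cx.  Each repetition costs O(n^2), and
    an incorrect C survives a repetition with probability at most 1/2,
    so `reps` repetitions give error probability at most 2**-reps."""
    n = len(A)
    def matvec(M, v):
        return [sum(M[i][j] * v[j] for j in range(n)) for i in range(n)]
    for _ in range(reps):
        x = [random.randint(0, 1) for _ in range(n)]
        if matvec(A, matvec(B, x)) != matvec(C, x):
            return False
    return True
```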
Description of circuits
We briefly review the circuits for our benchmark problems.
The circuit for F2 is by far the simplest (see Figure 4 for an illustration). This circuit simply computes the square of each input wire using a layer of multiplication gates, and then sums the results using a single sum-gate of very large fan-in. We remark that the GKR protocol typically assumes that all gates have fan-in two, but [itcs] explains how the protocol can be modified to handle a single sum-gate of very large fan-in at the output.
The circuit for F0 exploits Fermat's Little Theorem, which says that for prime p, a^(p−1) ≡ 1 (mod p) if and only if a ≠ 0. Thus, this circuit computes the (p−1)'th power of each input wire (taking all nonzero inputs to 1, and leaving all 0-inputs at 0), and sums the results via a single sum-gate of high fan-in.
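The gate-level identity can be sketched as follows, assuming the field of size p = 2^61 − 1; Python's built-in three-argument pow stands in for the circuit's logarithmic-depth stack of multiplication gates:

```python
P = (1 << 61) - 1  # a prime; any prime works for the identity

def nonzero_indicator(a):
    """Fermat's Little Theorem: for prime p, a**(p-1) mod p equals 1
    when a is nonzero mod p, and 0 when a is zero.  The F0 circuit
    computes this power with about log2(p) squaring layers."""
    return pow(a % P, P - 1, P)
```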
The circuit for PM is similar to that for F0: essentially, for each possible location of the pattern, it computes a value that is 0 if the pattern occurs at that location, and nonzero otherwise. It then computes the (p−1)'th power of each such value and sums the results (i.e., it uses the F0 circuit as a subroutine) to determine the number of locations where the pattern does (not) appear in the input.
Our circuit for MatMult uses similar ideas. We could run a separate instance of the GKR protocol to verify each of the n^2 entries in the output matrix AB and compare them to C, but this would be very expensive for both the client and the server. Instead, we specify a suitable circuit with a single output gate, allowing us to run a single instance of the protocol to verify the output. Our circuit computes the entries of AB via naive matrix multiplication, and subtracts the corresponding entry of C from each. It then computes the number of nonzero values using the F0 circuit as a subroutine. The final output of the circuit is zero if and only if AB = C.
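The arithmetic the circuit performs can be mirrored as follows (a Python sketch of the computation, assuming the same prime field; this is the circuit's logic, not the circuit itself):

```python
P = (1 << 61) - 1

def matmult_circuit_output(A, B, C):
    """Mirror of the MatMult circuit's arithmetic: form A*B - C
    entrywise over F_p, map each entry to 0/1 via the Fermat (F0)
    subcircuit, and sum.  The output is 0 if and only if A*B = C."""
    n = len(A)
    out = 0
    for i in range(n):
        for j in range(n):
            d = (sum(A[i][k] * B[k][j] for k in range(n)) - C[i][j]) % P
            out = (out + pow(d, P - 1, P)) % P  # adds 1 iff entry nonzero
    return out
```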
Scaling to large inputs
As described in Section 5, the memory-intensive nature of the GKR protocol made it challenging to scale to large inputs, especially given the limited amount of device memory. Indeed, with the no-copying approach (where we simply keep the entire circuit in device memory), we were only able to scale to modest inputs for the F2 problem, and to small matrices for the MatMult problem, on a machine with 1 GB of device memory. Using the copy-once-per-layer approach, we were able to scale to inputs with over 2 million entries for the F0 problem, and to larger matrices for the MatMult problem. By running on an NVIDIA Tesla C2070 GPU with 6 GB of device memory, we were able to push to 256 × 256 matrices for the MatMult problem; the data from this experiment is reported in Table 2.
Evaluation of previous implementations
To our knowledge, the only existing implementation for verifiable computation that can be directly compared to that of Cormode et al. [itcs] is that of Setty et al. [ndss]. We therefore performed a brief comparison of the sequential implementation of [itcs] with that of [ndss]. This provides important context in which to evaluate our results: our 40-120× speedups compared to the sequential implementation of [itcs] would be less interesting if that implementation was slower than alternative methods. Prior to this paper, these implementations had never been run on the same problems, so we picked a benchmark problem (matrix multiplication) evaluated in [ndss] and compared to the results reported there.
We stress that our goal is not to provide a rigorous quantitative comparison of the two implementations. Indeed, we only compare the implementation of [itcs] to the numbers reported in [ndss]; we never ran the implementations on the same system, leaving this more rigorous comparison for future work. Moreover, both implementations may be amenable to further optimization. Despite these caveats, the comparison between the two implementations seems clear. The results are summarized in Table 1.
Implementation      Prover Time   Verifier Time   Total Communication
[itcs]              3.11 hours    0.12 seconds    138.1 KB
[ndss], Pepper      8.1 years     14 hours        Not reported
[ndss], Habanero    17 days       2.1 minutes     17.1 GB
In Table 1, Pepper refers to an implementation in [ndss] which is actually proven secure against polynomial-time adversaries under cryptographic assumptions, while Habanero is an implementation in [ndss] which runs faster by allowing a very high soundness probability (the probability that a deviating prover can fool the verifier), and by utilizing what the authors themselves refer to as heuristics (not proven secure in [ndss], though the authors indicate this may be due to space constraints). In contrast, the soundness probability in the implementation of [itcs] is tiny (roughly proportional to the reciprocal of the field size p = 2^61 − 1), and the protocol is unconditionally secure even against computationally unbounded adversaries.
The implementation of [ndss] has very high setup costs for both the prover and the verifier, and therefore the costs of a single query are very high. But this setup cost can be amortized over many queries, and the most detailed experimental results provided in [ndss] give the costs for batches of hundreds or thousands of queries. The costs reported in the second and third rows of Table 1 are therefore the total costs of the implementation when run on a large number of queries.
When we run the implementation of [itcs] on a single matrix, the server takes 3.11 hours, the client takes 0.12 seconds, and the total length of all messages transmitted between the two parties is 138.1 KB. In contrast, the server in the heuristic implementation of [ndss], Habanero, requires 17 days amortized over 111 queries, when run on considerably smaller matrices. This translates to roughly 3.7 hours per query, but the cost of a single query without batching is likely about two orders of magnitude higher. The client in Habanero requires 2.1 minutes to process the same 111 queries, or a little over 1 second per query, while the total communication is 17.1 GB, or about 157 MB per query. Again, the per-query costs will be roughly two orders of magnitude higher without the batching.
We conclude that, even under large batching, the per-query time for the server of the sequential implementation of [itcs] is competitive with the heuristic implementation of [ndss], while the per-query time for the verifier is about two orders of magnitude smaller, and the per-query communication cost is between two and three orders of magnitude smaller. Without the batching, the per-query time of [itcs] is roughly 100× smaller for the server and 1,000× smaller for the client, and the communication cost is about 100,000× smaller.
Likewise, the implementation of [itcs] is over five orders of magnitude faster for the client than the non-heuristic implementation Pepper, and four orders of magnitude faster for the server.
Evaluation of our GPUbased implementation
Problem   Input Size    Circuit Size    Prover GPU   Prover Seq.   Circuit Eval.   Verifier GPU   Verifier Seq.   Unverified Alg.
          (entries)     (gates)         Time (s)     Time (s)      Time (s)        Time (s)       Time (s)        Time (s)
F2        8.4 million   25.2 million    3.7          424.6         0.1             0.035          3.600           0.028
F0        2.1 million   255.8 million   128.5        8,268.0       4.2             0.009          0.826           0.005
PM        524,288       76.0 million    38.9         1,893.1       1.2             0.004          0.124           0.006
MatMult   65,536        42.3 million    39.6         1,658.0       0.9             0.003          0.045           0.080
Figure 5 demonstrates the performance of our GPUbased implementation of the GKR protocol. Table 2 also gives a succinct summary of our results, showing the costs for the largest instance of each problem we ran on. We consider the main takeaways of our experiments to be the following.
Server-side speedup obtained by GPU computing. Compared to the sequential implementation of [itcs], our GPU-based server implementation ran close to 115× faster for the F2 circuit, about 60× faster for the F0 circuit, about 45× faster for PM, and about 40× faster for MatMult (see Figure 5).
Notice that for the first three problems, we need to look at large inputs to see the asymptotic behavior of the curve corresponding to the parallel prover's runtime. Due to the log-log scale in Figure 5, the curves for both the sequential and parallel implementations are asymptotically linear, and the 45-120× speedup obtained by our GPU-based implementation is manifested as an additive gap between the two curves. The explanation for this is simple: there is considerable overhead relative to the total computation time in parallelizing the computation at small inputs, but this overhead is more effectively amortized as the input size grows.
In contrast, notice that for MatMult the slope of the curve for the parallel prover remains significantly smaller than that of the sequential prover throughout the entire plot. This is because our GPU-based implementation ran out of device memory well before the overhead in parallelizing the prover's computation became negligible. We therefore believe the speedup for MatMult would be somewhat higher than the 40× speedup observed if we were able to run on larger inputs.
Could a parallel verifiable program be faster than a sequential unverifiable one? The very first step of the prover's computation in the GKR protocol is to evaluate the circuit. In theory this can be done efficiently in parallel, by proceeding sequentially layer by layer and evaluating all gates at a given layer in parallel. However, in practice we observed that the time it takes to copy the circuit to the device exceeds the time it takes to evaluate the circuit sequentially. This observation suggests that on the current generation of GPUs, no GPU-based implementation of the prover could run faster than a sequential unverifiable algorithm. This is because sequentially evaluating the circuit takes at least as long as the unverifiable sequential algorithm, and copying the data to the GPU takes longer than sequentially evaluating the circuit. This observation applies not just to the GKR protocol, but to any protocol that uses a circuit representation of the computation (which is a standard technique in the theory literature [ishai, hotos]). Nonetheless, we can certainly hope to obtain a GPU-based implementation that is competitive with sequential unverifiable algorithms.
Serverside slowdown relative to unverifiable sequential algorithms. For , the total slowdown for the prover was roughly 130 (3.7 seconds compared to 0.028 seconds for the unverifiable algorithm, which simply iterates over all entries of the frequency vector and computes the sum of the squares of each entry). We stress that it is likely that we overestimate the slowdown resulting from our protocol, because we did not count the time it takes for the unverifiable implementation to compute the number of occurrences of each item , that is, to aggregate the stream into its frequency vector representation . Instead, we simply generated the vector of frequencies at random (we did not count the generation time), and calculated the time to compute the sum of their squares. In practice, this aggregation step may take much longer than the time required to compute the sum of the squared frequencies once the stream is in aggregated form.
For F0, our GPU-based server implementation ran roughly 25,000× slower than the obvious unverifiable algorithm, which simply counts the number of nonzero items in a vector. The larger slowdown compared to the F2 problem is unsurprising: since F0 is a less arithmetic problem than F2, its circuit representation is much larger. Once again, it is likely that we overestimate the slowdown for this problem, as we did not count the time for an unverifiable algorithm to aggregate the stream into its frequency-vector representation. Despite the substantial slowdown incurred for F0 compared to a naive unverifiable algorithm, it remains valuable as a primitive for use in heavier-duty computations like PM and MatMult.
For PM, the bulk of the circuit consists of an F0 subroutine, and so the runtime of our GPU-based implementation was similar to that for F0. However, the sequential unverifiable algorithm for PM takes longer than that for F0. Thus, our GPU-based server implementation ran roughly 6,500× slower than the naive unverifiable algorithm, which exhaustively searches all possible locations for occurrences of the pattern.
For MatMult, our GPU-based server implementation ran roughly 500× slower than naive matrix multiplication for 256 × 256 matrices. Moreover, this number is likely inflated by cache effects from which the naive unverifiable algorithm benefited: the naive algorithm runs disproportionately faster on smaller matrices than on larger ones, likely because it experiences very few cache misses on the smaller matrices. We therefore expect the slowdown of our implementation to fall to under 100× if we were able to scale to larger matrices. Furthermore, the GKR protocol is capable of verifying matrix multiplication over the finite field rather than over the integers at no additional cost. Naive matrix multiplication over this field is between 2× and 3× slower than matrix multiplication over the integers (even using the fast arithmetic operations available for this field). Thus, if our goal were to work over this finite field rather than the integers, our slowdown would fall by another 2-3×. It is therefore possible that our server-side slowdown could be less than 50× at larger inputs, compared to naive matrix multiplication over the field.
Client-side speedup obtained by GPU computing. The bulk of the client's computation consists of evaluating a single symbol in an error-corrected encoding of the input; this computation is independent of the circuit being verified. For reasonably large inputs (see the row for F2 in Table 2), our GPU-based client implementation performed this computation over 100× faster than the sequential implementation of [itcs]. For smaller inputs the speedup was unsurprisingly smaller, due to increased overhead relative to total computation time. Still, we obtained a 15× speedup even for an input of length 65,536 (256 × 256 matrix multiplication).
Client-side speedup relative to unverifiable sequential algorithms. Our matrix-multiplication results clearly demonstrate that for problems requiring superlinear time to solve, even the sequential implementation of [itcs] saves the client time compared to doing the computation locally. Indeed, the runtime of the client is dominated by the cost of evaluating a single symbol in an error-corrected encoding of the input, and this cost grows only linearly with the input size. Even for relatively small matrices, the client in the implementation of [itcs] saved time. For matrices with tens of millions of entries, our results demonstrate that the client will still take just a few seconds, while performing the matrix-multiplication computation locally would require orders of magnitude more time. Our results demonstrate that GPU computing can be used to reduce the verifier's computation time by another 100×.
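The client's computation has the following flavor: it evaluates one symbol of a low-degree-extension encoding of the input at a point fixed in advance, and this evaluation can be folded in incrementally as the stream arrives. The sketch below is hypothetical (the function names are ours, and the exact encoding used in [itcs] may differ in details), but it conveys why the cost grows only linearly with the input size:

```python
P = (1 << 61) - 1  # field size assumed throughout our sketches

def chi(bits, r):
    """Multilinear Lagrange basis polynomial for the index whose binary
    representation is `bits`, evaluated at the point r."""
    prod = 1
    for b, rk in zip(bits, r):
        prod = prod * (rk if b else 1 - rk) % P
    return prod

def encoding_symbol(stream, log_n, r):
    """One symbol of a low-degree-extension encoding of the input:
    sum over updates (i, delta) of delta * chi_i(r).  The client folds
    in each stream update with O(log n) field operations, keeping only
    the running sum, so total work is linear in the stream length."""
    acc = 0
    for i, delta in stream:
        bits = [(i >> k) & 1 for k in range(log_n)]
        acc = (acc + delta * chi(bits, r)) % P
    return acc
```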
Verifier Space   Proof Length   Prover GPU   Prover Seq.   Verifier GPU   Verifier Seq.
(KB)             (KB)           Time (s)     Time (s)      Time (s)       Time (s)
39.1             78.1           2.901        43.773        0.019          0.858
78.2             39.1           1.872        43.544        0.010          0.639
156.5            19.5           1.154        37.254        0.010          0.577
313.2            9.8            0.909        36.554        0.008          0.552
1953.1           0.78           0.357        20.658        0.007          0.551
6.3 Special-purpose protocols
We implemented both the client and the server of the non-interactive protocol of [icalp09, itcs] on the GPU. As described in Section 2.3, this protocol is the fundamental building block for a host of non-interactive protocols achieving optimal tradeoffs between the space usage of the client and the length of the proof. Figure 6 demonstrates the performance of our GPU-based implementation of this protocol. Our GPU implementation obtained a 20-50× server-side speedup relative to the sequential implementation of [itcs]. This speedup was only possible after transposing the data grid into column-major order so as to achieve perfect memory coalescing, as described in Section 5.2.1.
The server-side speedups we observed depended on the desired tradeoff between proof length and space usage. That is, the protocol partitions the universe into a grid, with one dimension roughly determining the proof length and the other the verifier's space usage. The prover processes each row of the grid independently (many rows in parallel). When the rows are long, each row requires a substantial amount of processing; in this case, the overhead of parallelization is effectively amortized over the total computation time. When the rows are shorter, the overhead is less effectively amortized and we see less impressive speedups.
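The tradeoff can be made concrete with a toy partition (the names r and c are illustrative, not from the protocol's notation):

```python
def grid_shape(n, c):
    """Partition a universe of size n into a grid with rows of length c:
    r = ceil(n / c) rows.  One dimension governs the proof length and
    the other the verifier's space, so varying c trades one against the
    other while r * c stays roughly n."""
    r = -(-n // c)  # ceiling division
    return r, c

def grid_pos(i, c):
    """Row-major position of universe item i in a grid with row length c."""
    return i // c, i % c
```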
We note that Figure 6 depicts the prover runtime for both the sequential implementation of [itcs] and our GPU-based implementation with the proof length and the verifier's space usage roughly balanced. With these parameters, our GPU-based implementation achieved roughly a 20× speedup relative to the sequential program. Table 3 shows the costs of the protocol for a fixed universe size, corresponding to over 190 MB of data, as we vary the tradeoff between proof length and space usage. The data in this table shows that our parallel implementation enjoys a 40-60× speedup relative to the sequential implementation when the verifier's space usage is substantially larger than the proof length. This indicates that we would see similar speedups even with balanced parameters if we scaled to larger input sizes. Notice that the verifier's space usage and the proof length are hundreds or thousands of times smaller than the input in all our experiments, and our parallel server implementation achieved a slowdown of only 10-100× relative to an unverifiable sequential algorithm for computing the second frequency moment over this universe.
In contrast, the verifier’s computation was much easier to parallelize, as its memory access pattern is highly regular. Our GPUbased implementation obtained 4070 speedups relative to the sequential verifier of [itcs] across all input lengths , including when we set .
7 Conclusions
This paper adds to a growing line of work focused on obtaining fully practical methods for verifiable computation. Our primary contribution in this paper was in demonstrating the power of parallelization, and GPU computing in particular, to obtain robust speedups in some of the most promising protocols in this area. We believe the additional costs of obtaining correctness guarantees demonstrated in this paper would already be considered modest in many correctnesscritical applications. Moreover, it seems likely that future advances in interactive proof methodology will also be amenable to parallelization. This is because the protocols we implement utilize a number of common primitives (such as the sumcheck protocol [sumcheck]) as subroutines, and these primitives are likely to appear in future protocols as well.
Several avenues for future work suggest themselves. First, the GKR protocol is rather inefficient for the prover when applied to computations which are non-arithmetic in nature, as the circuit representation of such a computation is necessarily large. Developing improved protocols for such problems (even special-purpose ones) would be interesting. Prime candidates include many graph problems, like minimum spanning tree and perfect matching. More generally, a top priority is to further reduce the slowdown or the memory intensity of the prover in general-purpose protocols. Both of these goals could be accomplished by developing an entirely new construction that avoids the circuit representation of the computation; it is also possible that the prover within the GKR construction can be further optimized without fundamentally altering the protocol.