CHEETAH: An Ultra-Fast, Approximation-Free, and Privacy-Preserved Neural Network Framework based on Joint Obscure Linear and Nonlinear Computations


Qiao Zhang, Cong Wang, Chunsheng Xin, and Hongyi Wu. Qiao Zhang, Chunsheng Xin, and Hongyi Wu are with the Department of Electrical and Computer Engineering, Old Dominion University, Norfolk, VA 23529. Cong Wang is with the Department of Computer Science, Old Dominion University, Norfolk, VA 23529.
E-mail: {qzhan002, c1wang, cxin, h1wu}

Machine Learning as a Service (MLaaS) is enabling a wide range of smart applications on end devices. However, such convenience comes at the cost of privacy, because users have to upload their private data to the cloud. This research aims to provide effective and efficient MLaaS such that the cloud server learns nothing about user data and the users cannot infer the proprietary model parameters owned by the server. This work makes the following contributions. First, it unveils the fundamental performance bottleneck of existing schemes, which stems from the heavy permutations in computing linear transformations and the use of communication-intensive Garbled Circuits for nonlinear transformations. Second, it introduces an ultra-fast secure MLaaS framework, CHEETAH, which features a carefully crafted secret sharing scheme that runs significantly faster than existing schemes without accuracy loss. Third, CHEETAH is evaluated on benchmarks of well-known, practical deep networks such as AlexNet and VGG-16 on the MNIST and ImageNet datasets. The results demonstrate substantial speedup over the fastest GAZELLE (Usenix Security’18) and over MiniONN (ACM CCS’17), and five orders of magnitude speedup over CryptoNets (ICML’16). This significant speedup enables a wide range of practical applications based on privacy-preserved deep neural networks.

privacy; machine learning as a service; secure two party computation; joint obscure neural computing

1 Introduction

From Alexa and Google Assistant to self-driving vehicles and Cyborg technologies, deep learning is rapidly advancing and transforming the way we work and live. It is becoming prevalent and pervasive, embedded in many systems, e.g., for pattern recognition [26], medical diagnosis [8], speech recognition [5] and credit-risk assessment [10]. In particular, deep Convolutional Neural Network (CNN) has demonstrated superior performance in computer vision such as image classification [24, 39] and facial recognition [35], among many others.

Since training a deep neural network model is resource-intensive, cloud providers have begun to offer Machine Learning as a Service (MLaaS) [46], where a proprietary model is trained and hosted on clouds, and clients make queries (inference) and receive results through a web portal. While this emerging cloud service is embraced as an important tool for efficiency and productivity, the interaction between clients and cloud servers creates new vulnerabilities for unauthorized access to private information. This work focuses on ensuring privacy-preserved yet efficient inference in MLaaS.

Although communication can be readily secured from end to end, privacy still remains a fundamental challenge. On the one hand, the clients must submit their data to the cloud for inference, but they want the data privacy well protected, preventing curious cloud provider from mining valuable information. In many domains such as health care [31] and finance [40], data are extremely sensitive. For example, when patients transmit their physiological data to the server for medical diagnosis, they do not want anyone (including the cloud provider) to see it. Regulations such as Health Insurance Portability and Accountability Act (HIPAA) [1] and the recent General Data Protection Regulation (GDPR) in Europe [11] have been in place to impose restrictions on sharing sensitive user information. On the other hand, cloud providers do not want users to be able to extract their proprietary, valuable model that has been trained with significant resource and efforts, as it may turn customers into one-time shoppers [43]. Furthermore, the trained model contains private information about the training data set and can be exploited by malicious users [38, 41, 45]. To this end, there is an urgent need to develop effective and efficient schemes to ensure that, in MLaaS, a cloud server does not have access to users’ data and a user cannot learn the server’s model.

Scheme | Linear Computation | Nonlinear Computation | Speedup over CryptoNets [14]
CryptoNets [14] | HE | HE (square approx.) | 1 (baseline)
Faster CryptoNets [4] | HE | HE (polynomial approx.) | 10
GELU-Net [48] | HE | Plaintext (no approx.) | 14
E2DM [22] | Packed HE & matrix optimization | HE (square approx.) | 30
SecureML [30] | HE & secret share | GC (piecewise linear approx.) | 60
Chameleon [33] | Secret share | GMW & GC (piecewise linear approx.) | 150
MiniONN [28] | Packed HE & secret share | GC (piecewise linear approx.) | 230
DeepSecure [34] | GC | GC (polynomial approx.) | 527
SecureNN [44] | Secret share | GMW (piecewise linear approx.) | 1,000
FALCON [27] | Packed HE with FFT | GC (piecewise linear approx.) | 1,000
XONN [32] | GC | GC (piecewise linear approx.) | 1,000
GAZELLE [23] | Packed HE & matrix optimization | GC (piecewise linear approx.) | 1,000
CHEETAH | Packed HE & obscure matrix calc. | Obscure HE & SS (no approx.) | 100,000
TABLE I: Comparison of Privacy-Preserved Neural Networks.

1.1 Retrospection: Evolvement of Privacy-Preserved Neural Networks

The quest began in 2016 when CryptoNets [14] was proposed to embed Homomorphic Encryption (HE) [13] into CNN. It was the first work that successfully demonstrated the feasibility of computing inference over homomorphically encrypted data. While the idea is conceptually straightforward, its prohibitively high computation cost renders it impractical for most applications that rely on non-trivial deep neural networks of a practical size to characterize complex feature relations [39]. For instance, CryptoNets takes hundreds of seconds to compute inference even on a simple three-layer CNN architecture, and the computation time grows rapidly with the depth of the network. Moreover, several key functions in neural networks (e.g., activation and pooling) are nonlinear. CryptoNets had to use polynomial approximation, e.g., replacing the original activation function with a square function. Such approximation leads to not only degraded accuracy compared with the original model, but also instability and failure in training.

Following CryptoNets, the past two years have seen a multitude of works aiming to improve the computation accuracy and efficiency (as summarized in Table I). A neural network essentially consists of two types of computations, i.e., linear and nonlinear computations. The former focuses on matrix calculation to compute dot products (for fully-connected dense layers) and convolutions (for convolutional layers). The latter includes nonlinear functions such as activation, pooling and softmax. A series of studies have been carried out to accelerate the linear computation, the nonlinear computation, or both. For example, Faster CryptoNets [4] leveraged sparse polynomial multiplication to accelerate the linear computation, achieving about 10 times speedup over CryptoNets. SecureML [30], Chameleon [33] and MiniONN [28] adopted a similar design concept. Among them, MiniONN achieved the highest performance gain. It applied Secret Share (SS) for linear computation, and packed HE [9] to pre-share a noise vector between the client and server offline, in order to cancel the noise during secure online computation. In [28], nonlinear functions were approximated by piecewise linear segments and computed by using Garbled Circuits (GC), which resulted in 230 times speedup over CryptoNets. DeepSecure [34] took an all-GC approach, i.e., implemented both linear and nonlinear computations using GC. It optimized the gates in the traditional GC module to achieve a speedup of 527 times over CryptoNets. Finally, GAZELLE [23] focused on the linear computation, accelerating the matrix-vector multiplication based on packed HE, such that homomorphic computations can be efficiently parallelized on multiple packed ciphertexts. GAZELLE demonstrated an impressive speedup of several times over MiniONN and three orders of magnitude over CryptoNets. So far, GAZELLE is considered the state-of-the-art framework for secure inference computation.

Two recent works, unofficially published on arXiv, reported new designs that achieve computation speed at the same order of magnitude as GAZELLE. FALCON [27] leveraged the fast Fourier Transform (FFT) to accelerate linear computation. Its computing speed is similar to GAZELLE, while its communication cost is higher. SecureNN [44] adopted a design philosophy similar to Chameleon and MiniONN, but exploited a 3-party setting to accelerate the secure computation, obtaining a 4 times speedup over GAZELLE at the cost of using a semi-trusted third party. Additionally, XONN [32] worked in line with DeepSecure to explore the GC-based design for Binary Neural Networks (BNN), achieving up to 7 times speedup over GAZELLE, at the cost of an accuracy drop due to the binary quantization in BNN.

In addition, a few approaches were introduced not just to improve computation efficiency but also to provide other desirable features. For example, GELU-Net [48] aims to avoid approximation of nonlinear functions. It partitioned computation onto non-colluding parties: one party performs linear computations on encrypted data, and the other executes nonpolynomial computation in an unencrypted but secure manner. It showed over 14 times speedup over CryptoNets without accuracy loss. E2DM [22] aimed to encrypt both data and neural network models, assuming the latter are uploaded by users to an untrusted cloud. It focused on matrix optimization by combining homomorphic operations and ciphertext permutations, demonstrating 30 times speedup over CryptoNets.

1.2 Contribution of This Work

Despite the fast and promising improvements in computation speed, there is still a significant performance gap in applying privacy-preserved neural networks to practical applications. The time constraints of many real-time applications (such as speech recognition in Alexa and Google Assistant) are within seconds [2, 16]; self-driving cars even demand an immediate response in less than a second [6]. In contrast, our benchmark has shown that GAZELLE, which has achieved the best speed among existing schemes, takes 161 s and 1731 s to run the well-known practical deep neural networks AlexNet [24] and VGG-16 [39], respectively, which renders it impractical for real-world applications.

In this paper, we propose CHEETAH, an ultra-fast, secure MLaaS framework that features a carefully crafted secret sharing scheme to enable efficient, joint linear and nonlinear computation, so that it runs significantly faster than the state-of-the-art schemes. It eliminates the need to use approximation for nonlinear computations; hence, unlike the existing schemes, CHEETAH does not have accuracy loss. It, for the first time, reduces the computation delay to milliseconds and thus enables a wide range of practical applications to utilize privacy-preserved deep neural networks. To the best of our knowledge, this is also the first work that demonstrates privacy-preserved inference based on well-known, practical deep architectures such as AlexNet and VGG.

The significant performance improvement of CHEETAH stems from a creative design, called joint obscure neural computing. Computations in neural networks follow a series of operations alternating between linear and nonlinear transformations for feature extraction. Each operation takes the output from the previous layer as its input. For example, the nonlinear activation is computed on the weighted values of linear transformations (i.e., the dot product or convolution). All existing approaches discussed in Sec. 1.1 essentially follow the same framework, aiming to securely compute the results for each layer and then propagate them to the next layer. This seemingly logical approach, however, becomes the fundamental performance hurdle, as revealed by our analysis.

First, although matrix computation has been deeply optimized based on packed HE for the linear transformation in the state-of-the-art GAZELLE, it is still costly. The computation time of the linear transformation is dominated by the operation called ciphertext permutation (or Perm) [23], which generates the sum based on a packed vector. It is required in both convolution (for a convolutional layer) and dot product (for a dense layer). From our experiments, one Perm is 56 times slower than one Homomorphic addition and 34 times slower than one Homomorphic multiplication. We propose an approach to enable an incomplete (or obscure) linear transformation result to propagate to the next nonlinear transformation as the input to continue the neural computation, reducing the number of ciphertext permutations to zero in both convolution and linear dot product computation.

Second, most existing schemes (including GAZELLE) adopted GC to compute the nonlinear transformation (such as activation, pooling and softmax), because GC generally performs better than HE when the multiplicative depth is greater than 0 (i.e., nonlinear) [23]. However, the GC-based approach is still costly. The overall network must be represented as circuits and involves interactive communications between two parties to jointly evaluate neural functions over their private inputs. The time cost is often significant for large and deep networks. Specifically, our benchmark shows that it takes about 263 seconds to compute a nonlinear ReLu function with 3.2M input values, which is part of the VGG-16 framework [39]. Moreover, all existing GC-based solutions rely on piece-wise or polynomial approximation for nonlinear functions such as activation. This leads to degraded accuracy and the accuracy loss is often significant. The proposed scheme takes a secret sharing approach with 0-multiplicative-depth packed HE to avoid the use of computationally expensive GC. A novel design is developed to allow the server and client to each obtain a share of Homomorphic encrypted nonlinear transformation result based on the obscure linear transformation as discussed above. This approach eliminates the need to use approximation for nonlinear functions and achieves enormous speedup. For example, it is 1793 times faster than GAZELLE in computing the most common nonlinear ReLu activation function, under the output dimension of 10K.

Overall, the proposed CHEETAH is an ultra-fast privacy-preserved neural network inference framework without accuracy loss. It enables obscure neural computing that intrinsically merges the calculation of linear and nonlinear transformations and effectively reduces the computation time. We benchmark the performance of CHEETAH with well-known deep networks for secure inference. Our results show that it is 218 and 334 times faster than GAZELLE, respectively, for a 3-layer and a 4-layer CNN used in previous works. It achieves a significant speedup of 130 and 140 times, respectively, over GAZELLE in the well-known, practical deep networks AlexNet and VGG-16. Compared with CryptoNets, CHEETAH achieves a speedup of five orders of magnitudes.

The rest of the paper is organized as follows. Section 2 introduces the system and threat models. Section 3 elaborates the system design of CHEETAH, followed by the security analysis in Section 4. Experimental results are discussed in Section 5. Finally, Section 6 concludes the paper.

2 System and Threat Models

In this section, we introduce the overall system architecture and threat model, as well as the background knowledge about cryptographic tools used in our design.

2.1 System Model

We consider a MLaaS system as shown in Fig. 1. The client is the party that generates or owns the private data. The server is the party that has a well-trained deep learning model and provides the inference service based on the client’s data. For example, a doctor performs a chest X-ray for her patient and sends the X-ray image to the server on the cloud, which runs the neural network model and returns the inference result to assist the doctor’s diagnosis.

Fig. 1: An overview of the MLaaS system.

While various deep learning techniques can be employed to enable MLaaS, we focus on the Convolutional Neural Network (CNN), which has achieved wide success and demonstrated superior performance in computer vision such as image classification [24, 39] and face recognition [35]. A CNN consists of a stack of layers to learn a complex relation among the input data, e.g., the relations between pixels of an input image. It operates on a sequence of linear and nonlinear transformations to infer a result, e.g., whether an input medical image indicates the patient has pneumonia. The linear transformations are in two typical forms: dot product and convolution. The nonlinear transformations leverage activations such as the Rectified Linear Unit (ReLu) to approximate complex functions [20] and pooling (e.g., max pooling and mean pooling) for dimensionality reduction. CNN repeats the linear and nonlinear transformations recursively to reduce the high-dimensional input data to a low-dimensional feature vector for classification at the fully connected layer. Without loss of generality, we use image classification as an example in the following discussion, aiming to provide a lucid understanding of the CNN architecture as illustrated in Fig. 2.

(a) Overall network structure.
(b) Convolutional layer.
(c) Pooling.
(d) Fully connected layer.
Fig. 2: A three-layer CNN: (a) overall network structure, (b) convolutional layer, (c) pooling, (d) fully connected layer.

Convolutional Layer. As shown in Fig. 2(b), the input to a convolutional layer has the dimensions of w_i × h_i × c_i, where w_i and h_i are the width and height of the input feature map and c_i is the number of feature maps (or channels). For the first layer, the feature maps are simply the input images. Hereafter, we use the subscripts i and o to denote input and output. The input is convolved with c_o groups of kernels. The size of each group of kernels is k_w × k_h × c_i, in which k_w and k_h are the width and height of the kernel. The number of channels of each kernel group must match with the input, i.e., c_i. The convolution produces the feature output with a size of w_o × h_o × c_o. More specifically, the (x, y)-th element of the l-th output feature map is calculated as follows:

    O_l(x, y) = Σ_{c=1}^{c_i} Σ_{p=1}^{k_w} Σ_{q=1}^{k_h} K_{l,c}(p, q) · X_c(x + p − 1, y + q − 1),    (1)

where K and X are the kernel and input, respectively. For ease of description, we omit the bias in Eq. (1); nevertheless, it can be easily folded into the convolution or weight matrix multiplication [18].
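To make Eq. (1) concrete, the following plain-Python sketch computes one output feature map of an unpadded, stride-1 convolution for a single kernel group. The function name and the list-of-lists data layout are illustrative choices, not part of the protocol.

```python
def conv2d(x, k):
    """Valid 2-D convolution per Eq. (1): x is a list of c_i input maps,
    k is a list of c_i kernels (one group); returns one output feature map."""
    ci = len(x)
    h, w = len(x[0]), len(x[0][0])
    kh, kw = len(k[0]), len(k[0][0])
    ho, wo = h - kh + 1, w - kw + 1          # output height and width
    out = [[0.0] * wo for _ in range(ho)]
    for i in range(ho):
        for j in range(wo):
            s = 0.0
            # triple sum over channels and kernel positions, as in Eq. (1)
            for c in range(ci):
                for p in range(kh):
                    for q in range(kw):
                        s += k[c][p][q] * x[c][i + p][j + q]
            out[i][j] = s
    return out
```

For example, a single-channel 3 × 3 input convolved with a 2 × 2 kernel yields a 2 × 2 output map.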

The last convolutional layer is typically connected with the fully-connected layer, which computes the weighted sum, i.e., a dot product between the weight matrix W of size n_o × n_i and a flattened feature vector x of size n_i × 1. The output is a vector of size n_o × 1, whose j-th element is calculated as follows:

    y_j = Σ_{i=1}^{n_i} W_{j,i} · x_i.    (2)
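This weighted sum can be sketched in a few lines, with W represented as a list of n_o rows of length n_i (an illustrative layout):

```python
def dense(w, x):
    # Fully-connected weighted sum: one dot product per output neuron.
    return [sum(wj[i] * x[i] for i in range(len(x))) for wj in w]
```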
Activation. Nonlinear activation is applied to convolutional and weighted-sum outputs in an elementwise manner. The commonly used activation functions include ReLu, f(x) = max(0, x); sigmoid, f(x) = 1/(1 + e^{−x}); and tanh, f(x) = (e^x − e^{−x})/(e^x + e^{−x}). The last layer uses the softmax function to normalize the outputs into a probability vector.
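These standard functions can be written directly (the max-subtraction in softmax is a common numerical-stability trick, not specific to this paper):

```python
import math

def relu(x):
    return max(0.0, x)

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def tanh(x):
    return (math.exp(x) - math.exp(-x)) / (math.exp(x) + math.exp(-x))

def softmax(v):
    # Normalize a vector of scores into a probability vector.
    m = max(v)                       # subtract the max for numerical stability
    e = [math.exp(x - m) for x in v]
    s = sum(e)
    return [x / s for x in e]
```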

Pooling. Pooling conducts downsampling to reduce dimensionality. In this work, we consider mean pooling, which is implemented in CryptoNets and also commonly adopted in state-of-the-art CNNs. It splits a feature map into regions and averages the regional elements. Compared to max pooling (another pooling function, which selects the maximum value in each region), the authors of [49] claim that while the max and mean pooling functions are rather similar, the use of mean pooling encourages the network to identify the complete extent of the object, which builds a generic localizable deep representation that exposes the implicit attention of CNNs on an image.
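Mean pooling over non-overlapping r × r regions can be sketched as follows (assuming, for simplicity, that the feature-map height and width are divisible by r):

```python
def mean_pool(fmap, r):
    # Split the feature map into non-overlapping r x r regions and average each.
    h, w = len(fmap), len(fmap[0])
    return [[sum(fmap[i * r + p][j * r + q] for p in range(r) for q in range(r)) / (r * r)
             for j in range(w // r)]
            for i in range(h // r)]
```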

2.2 Threat Model

Similar to [28, 23, 30, 34], we adopt the semi-honest model, in which both parties follow the protocol but try to learn additional information from the messages they receive (assuming they have bounded computational capability). That is, the client wants to learn the model parameters and the server attempts to learn the data. Hence, the goal is to make the server oblivious of the private data from the clients, and to prevent the client from learning the model parameters of the server. We will prove that the proposed framework is secure under semi-honest corruption using the ideal/real security paradigm [15]. Our framework aims to protect clients' sensitive data and service providers' models, which have been trained by service providers with significant resources (e.g., private training data and computing power). Protecting a model usually reduces to protecting the model parameters, which are its most critical information. Moreover, many applications are even built on well-known deep network structures such as AlexNet [24], VGG16/19 [39] and ResNet50 [19]; hence it is typically not necessary to protect the structure (number of layers, kernel size, etc.). In the case that the implemented structure is proprietary and has to be protected, service providers can introduce redundant layers and kernels to hide the real structure [28, 23].

There is also an array of emerging attacks on the security and privacy of neural networks [45, 38, 43, 12, 29, 21]. They can be classified by the process they target: training, the model (inference), and the input. (1) Training. The attack in [45] attempts to steal the hyperparameters during training. The membership inference attack [38] aims to find out whether an input belongs to the training set, based on the similarities between models that are privately trained or duplicated by the attacker. This paper focuses on the inference stage and does not consider such attacks on training, since the necessary variables for launching these attacks have been released from memory and the training API is not provided. (2) Model. The model extraction attack [43] exploits the linear transformation at the inference stage to extract the model parameters, and the model inversion attack [12] attempts to deduce the training set by finding the input that maximizes the classification probability. The success of these attacks requires full knowledge of the softmax probability vectors. To mitigate them, the server can return only the predicted label rather than the probability vector, or limit the number of queries from the attacker. The Generative Adversarial Network (GAN) based attacks [29] can recover the training data by accessing the model. In this research, since the model parameters are protected from the clients, this attack is defended against effectively. (3) Input. A plethora of attacks adopt adversarial examples, adding a small perturbation to the input in order to cause the neural network to misclassify [21]. Since rational clients pay for prediction services, it is not in their interest to obtain an erroneous output. Thus, this attack does not apply to our framework.

2.3 Cryptographic Tools

The proposed privacy-preserved deep neural network framework, i.e., CHEETAH, employs two fundamental cryptographic tools as outlined below.

(1) Packed Homomorphic Encryption. Homomorphic Encryption (HE) is a cryptographic primitive that supports meaningful computations on encrypted data without the decryption key. It has found increasing applications in data communication, storage and computation [42]. Traditional HE operates on individual ciphertexts [48], while packed homomorphic encryption (PHE) enables packing of multiple values into a single ciphertext and performs component-wise homomorphic computation in a Single Instruction Multiple Data (SIMD) manner [3] to take advantage of parallelism. Among various PHE techniques, our work builds on the private-key Brakerski-Fan-Vercauteren (BFV) scheme [9], which involves four parameters (the readers are referred to [23] for more detail): 1) the ciphertext modulus q; 2) the plaintext modulus p; 3) the number of ciphertext slots n; and 4) a Gaussian noise with a standard deviation σ. The secure computation involves two parties, i.e., the client C and the server S.

In PHE, the encryption algorithm encrypts a plaintext message vector m from Z_p^n into a ciphertext [m] with n slots. We denote [m]_C and [m]_S as the ciphertexts encrypted by the client C and the server S, respectively. The decryption algorithm returns the plaintext vector m from the ciphertext [m]. Computation can be performed on the ciphertext. In a general sense, an evaluation algorithm takes several ciphertexts as input and outputs a ciphertext encrypting the result of a function f, which is constructed from homomorphic addition (Add), multiplication (Mult) and permutation (Perm). Add([m1], [m2]) outputs a ciphertext that encrypts the elementwise sum of m1 and m2. Mult([m1], m2) outputs a ciphertext that encrypts the elementwise multiplication of m1 and the plaintext m2. It is worth pointing out that CHEETAH is designed to require only multiplication between a ciphertext and a plaintext, not the much more expensive multiplication between two ciphertexts. Perm([m]) permutes the n elements in [m] into another ciphertext [m'], where m' = (m_{π(1)}, ..., m_{π(n)}) and π is a permutation of {1, ..., n}.
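The slot semantics of the three evaluation operations can be mocked on plaintext n-slot vectors. This sketch only illustrates the SIMD behavior; in a real implementation the slots would stay encrypted inside BFV ciphertexts (e.g., via the Microsoft SEAL library mentioned below).

```python
# Plaintext mock of the three PHE evaluation operations on n-slot vectors.
# No encryption is performed; this only shows what each op does to the slots.

def he_add(ct1, ct2):
    # Add: slot-wise sum of two "ciphertexts".
    return [a + b for a, b in zip(ct1, ct2)]

def he_mult(ct, pt):
    # Mult: slot-wise product of a "ciphertext" and a plaintext vector.
    return [a * b for a, b in zip(ct, pt)]

def he_perm(ct, pi):
    # Perm: reorder the slots according to the permutation pi.
    return [ct[pi[i]] for i in range(len(ct))]
```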

The run-time complexities of Add and Mult are significantly lower than Perm. From our experiments, one Perm is 56 times slower than one Add and 34 times slower than one Mult. This observation motivates the design of CHEETAH, which completely eliminates permutations in convolution and dot product transformations, thus substantially reducing the overall computation time.

It is worth pointing out that neural networks deal with floating-point numbers while PHE operates in the integer domain: neural networks typically use real-number arithmetic, not modular arithmetic. Simply increasing the plaintext modulus in PHE to accommodate larger values increases the noise budget consumption and decreases the initial noise budget, which limits the number of homomorphic operations. Our implementation adopts the highly efficient encoding for BFV in the Microsoft SEAL library [36] to establish a mapping from the real numbers in a neural network to plaintext elements in PHE. This makes real-number arithmetic workable in PHE without data overflow. Hereafter, our design is described in the floating-point domain with real-number inputs.
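For intuition, a toy fixed-point mapping between reals and the integer plaintext domain is sketched below. The scale and modulus here are illustrative assumptions for this sketch, not the parameters of SEAL's actual BFV encoder.

```python
SCALE = 2 ** 12      # fixed-point precision (assumed value for illustration)
P = 2 ** 20          # plaintext modulus (assumed value for illustration)

def encode(x):
    # Map a real number into Z_P by fixed-point scaling.
    return round(x * SCALE) % P

def decode(v):
    # Interpret values above P/2 as negatives, then undo the scaling.
    if v > P // 2:
        v -= P
    return v / SCALE
```

As long as intermediate values stay well below P/SCALE in magnitude, the round trip is lossless up to the fixed-point precision.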

(2) Secret Sharing. In the secret sharing protocol, a value is shared between two parties, such that combining the two secrets yields the true value [33]. To additively share a secret m, a random number r is selected and two shares are created as ⟨m⟩_1 = r and ⟨m⟩_2 = m − r. Here, m can be either plaintext or ciphertext. A party that wants to share a secret sends one of the shares to the other party. To reconstruct the secret, one only needs to add the two shares, i.e., m = ⟨m⟩_1 + ⟨m⟩_2.
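The additive scheme can be sketched in a few lines; the modulus Q is an assumed parameter for illustration, and `random` stands in for a cryptographically secure generator.

```python
import random

Q = 2 ** 31 - 1      # share space modulus (assumed for illustration)

def share(m):
    # Split m into two additive shares; each share alone is uniformly random.
    r = random.randrange(Q)
    return r, (m - r) % Q

def reconstruct(s1, s2):
    # Adding the two shares recovers the secret.
    return (s1 + s2) % Q
```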

While the overall idea of secret share (SS) is straightforward, creative designs are often required to enable its effective application in practice, because in many applications the two parties need to perform complex nonlinear computation on their respective shares, and it is thus non-trivial to reconstruct the final result based on the computed shares. Due to this fundamental hurdle, the existing approaches discussed in Sec. 1.1 predominantly chose GC, instead of SS, to implement the nonlinear functions. However, GC is computationally costly for large inputs [34, 47, 23]. Specifically, our benchmark shows that GC takes about 263 seconds to compute a nonlinear ReLu function with 3.2M input values, which is part of the VGG-16 framework [39]. In this work, we propose a creative PHE-based SS for CHEETAH to implement secret nonlinear computation, which requires only one round of communication for each nonlinear function, thus achieving multiple orders of magnitude reduction of the computation time. For example, CHEETAH achieves a speedup of 1793 times over GAZELLE in computing the nonlinear ReLu function.

3 Design of Privacy Preserved Inference

A neural network is organized into layers. For example, a CNN consists of convolutional layers and fully-connected dense layers. Each layer includes a linear transformation (i.e., weighted sum for a fully-connected dense layer or convolution for a convolutional layer), followed by a nonlinear transformation (such as activation and pooling). All existing schemes intend to securely compute the results of the linear transformation first, and then perform the nonlinear computation. Although it appears logical, such a design leads to a fundamental performance bottleneck, as discussed in Sec. 1. The proposed approach, CHEETAH, is based on a creative design, named joint obscure neural computing, which computes only a partial linear transformation output and uses it to complete the nonlinear transformation. It achieves several orders of magnitude speedup compared with existing schemes.

We introduce the basic idea of CHEETAH via a simple example based on a two-layer CNN (with a convolutional layer and a dense layer), which can be formulated as follows:

    y = W · f(X ⊛ K),

where f is the activation function, X is the input data, K is a kernel for the convolutional layer, ⊛ stands for convolution and W is the weight matrix for the dense layer. In this example, X is a 3 × 3 matrix with elements x11, ..., x33, and K is a 2 × 2 kernel with elements k11, k12, k21, k22.
Note that while we use the simple two-layer CNN to lucidly describe the main idea, CHEETAH is actually applicable to any neural networks with any layer structure and input data size. In the rest of this section, we first present CHEETAH for a Single Input Single Output (SISO) convolution layer and then discuss the cases for Multiple Input Multiple Output (MIMO) convolution and fully connected dense layers.

3.1 SISO Convolutional Layer

The process of convolution can be visualized as placing the kernel at different locations of the input data. At each location, an element-wise sum of products is computed between the kernel and the corresponding data values. If the convolution of the above example, i.e., X ⊛ K, is computed in plaintext, the result, denoted as V, should include four elements, v1 to v4:

    v1 = x11·k11 + x12·k12 + x21·k21 + x22·k22,
    v2 = x12·k11 + x13·k12 + x22·k21 + x23·k22,
    v3 = x21·k11 + x22·k12 + x31·k21 + x32·k22,
    v4 = x22·k11 + x23·k12 + x32·k21 + x33·k22.
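These four outputs can be checked numerically with a short script; the input and kernel values below are arbitrary illustrative numbers.

```python
# Plaintext convolution outputs for a 3x3 input and a 2x2 kernel.
x = [[1, 2, 3],
     [4, 5, 6],
     [7, 8, 9]]
k = [[1, 2],
     [3, 4]]

# v[0] is v1 = x11*k11 + x12*k12 + x21*k21 + x22*k22, and so on.
v = [sum(k[p][q] * x[i + p][j + q] for p in range(2) for q in range(2))
     for i in range(2) for j in range(2)]
```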

In the problem setting of secure MLaaS (as introduced in Sec. 2), the client C owns the data X, while the server S owns the CNN model (including K and W). The goal is to ensure that the server does not have access to X and the client cannot learn the server's model parameters. To this end, in GAZELLE, C encrypts X into [X]_C by using HE and sends it to S. In the following discussion, both server and client use private-key BFV encryption [9]. The subscript C denotes ciphertext encrypted by the client's private key, while S denotes ciphertext encrypted by the private key of the server.

S performs HE computation to calculate the convolution X ⊛ K. To accelerate the computation, packed HE is employed. For example, to compute the first element of the convolution (i.e., v1), a single ciphertext can be created to contain the vector (x11, x12, x21, x22). On the other hand, a packed plaintext vector (k11, k12, k21, k22) is created for K. Packed HE supports element-wise multiplication between the two vectors in a single operation, yielding a single ciphertext for the vector (x11k11, x12k12, x21k21, x22k22). However, we still need to add the vector's elements together to compute v1. Since the vector is in a single ciphertext, direct addition is not possible. GAZELLE uses permutation (Perm) to compute the sum [23]. However, computing the sum using Perm is costly: the number of permutations grows with the kernel size for convolution, and with the input and output dimensions for the weighted sum in the dense layer. From our experiments, one Perm is 56 times slower than one Add and 34 times slower than one Mult.

In this paper, we propose a novel idea to enable an incomplete (or obscure) linear transformation result to propagate to the next nonlinear transformation to continue the neural computation, thus eliminating the need for ciphertext permutations. The overall design is motivated by the double-secret scheme for solving linear systems of equations [7]. Our scheme is illustrated in Fig. 3.

Fig. 3: The overall design of CHEETAH.

(1) Packed HE Encryption. and transform the data and kernel into and , respectively, as follows:

Fig. 4: Data transformation at client and server.

As illustrated in Fig. 4, four convolutional blocks are computed. For example, the first convolutional block computes . The elements in each convolutional block are sequentially extracted into a packed ciphertext . Meanwhile, also transforms the kernel into according to each convolutional block. Note that the transformation is completed offline. encrypts and sends to .

(2) Perm-free Secure Linear Computation. Upon receiving , performs the linear computation based on the client-encrypted data. A distinguishing feature of the proposed design is the elimination of the costly permutations.

Let denote the elementwise multiplication between and . As we can see, the sum of the four elements of each block in corresponds to one element of the convolution result. For example, the four elements of the first block, i.e., and , correspond to . The next block (i.e., and ) corresponds to , and so on.

performs Mult to obtain . The result is the client-encrypted elementwise multiplication between and . But does not calculate the sum of each block to obtain the final convolution result as GAZELLE does, because that would require the costly permutations. Instead, it lets decrypt and compute the sum in plaintext.

However, naively sending to the client would allow the client to obtain the neural network model information, i.e., . To this end, disturbs each element of the convolution result with a random multiplicative blinding factor. Specifically, pre-generates a pair of random numbers that satisfy , for each -th to-be-summed block in , where in this example.
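A minimal sketch of generating such multiplicative blinding pairs over a prime plaintext modulus (the modulus value and function names are our assumptions, not the paper's parameters):

```python
import random

def gen_blind_pair(t):
    """Sample a nonzero r in Z_t together with its modular inverse, so that
    r * r_inv = 1 (mod t); t must be prime (as in the BFV plaintext space)."""
    r = random.randrange(1, t)
    r_inv = pow(r, -1, t)  # modular inverse (Python 3.8+)
    return r, r_inv

t = 65537  # toy prime plaintext modulus (illustrative only)
# One (r_i, 1/r_i) pair per to-be-summed block of the convolution result:
pairs = [gen_blind_pair(t) for _ in range(4)]
```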

constructs the following vector by using :

which will be used to scramble by multiplying it with before it is sent to . Note that, as each individual element in the -th four-element block is multiplied with the same factor (since are repeated four times in ), the relative magnitudes among those four elements in each block would be leaked. To address this, further constructs a zero-sum vector as follows:

where are random numbers subject to . The zero-sum strategy has been adopted in classic privacy-preserving data aggregation [37], where a secret key is distributed based on a zero sum of random noise.
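The zero-sum noise construction can be sketched as follows (a plaintext illustration with our own helper name): each block of random values is forced to cancel when the client later sums the block.

```python
import random

def zero_sum_vector(block_len, num_blocks):
    """Concatenate num_blocks blocks of random noise, each summing to zero."""
    out = []
    for _ in range(num_blocks):
        noise = [random.uniform(-1.0, 1.0) for _ in range(block_len - 1)]
        noise.append(-sum(noise))  # last entry forces the block sum to zero
        out.extend(noise)
    return out

v = zero_sum_vector(4, 3)  # three 4-element blocks, each summing to zero
```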

At the same time, uses to create the following vectors:

where is a pair of polar indicators,


encrypts and by using packed HE. The encrypted values, i.e., and , will be sent to for the nonlinear computation as to be discussed later. Note that, and can be transmitted to offline, as and are pre-generated by .

Now, let us put all pieces together for the secure computation of convolution: encrypts and sends to . pre-computes in plaintext and then multiplies the result with to obtain . As we can see, the -th convolution element (which corresponds to the sum of -th four-element block in ) is actually multiplied with a random number . Finally, adds the zero-sum vector by Add. In this way, disturbs each element of convolution result (the sum of four elements in each block) while disturbs each individual element in the convolutional block.
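The whole scrambling step can be simulated in plaintext as follows (all names are ours; in CHEETAH the products remain encrypted under the client's key, and the blinding and noise are applied homomorphically):

```python
import random

def scramble(prod, rs, block=4):
    """prod: element-wise data*kernel products, one block per output element.
    rs: one random blinding factor per block.  Returns the disturbed vector
    the server would send back to the client."""
    blind = [rs[i // block] for i in range(len(prod))]  # r_i repeated per block
    noise = []
    for _ in range(len(rs)):
        z = [random.uniform(-1.0, 1.0) for _ in range(block - 1)]
        z.append(-sum(z))  # zero-sum noise hides in-block magnitudes
        noise.extend(z)
    return [p * b + e for p, b, e in zip(prod, blind, noise)]

# Two blocks whose true convolution sums are 10 and 26:
prod = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
rs = [3.0, -2.0]
scrambled = scramble(prod, rs)
# Client side: summing each block yields r_i times the true convolution.
sums = [sum(scrambled[i:i + 4]) for i in range(0, 8, 4)]  # ~ [30.0, -52.0]
```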

Next, we will show that, although the convolution result is not explicitly calculated, the partial (obscure) result, i.e., , is sufficient to compute the nonlinear transformation (e.g., activation and pooling).

(3) PHE-based Secret Share for Non-Linear Transformation. sends , and to (note that and are transmitted to offline).

decrypts and sums up each four-element block in plaintext, yielding . It is not difficult to show that is times the true convolution, i.e., .

If had the true convolution outcome, i.e., , it would compute the ReLu function as follows:


However, only has . Since is a random number that could be positive or negative, it is infeasible to obtain the correct activation directly. Instead, computes


We can show that the above calculation essentially recovers the server-encrypted true ReLu function outcome, i.e., . Since , may yield four possible outputs, depending on the signs of and .


For example, when and , we have according to Eq. (4) and thus . On the other hand, . Since , we have . Note that we have chosen . Therefore, Eq. (6) should yield . This is clearly the server-encrypted ReLu output. Similarly, we can examine the other cases of and in Eq. (7) and show that Eq. (6) always produces the server-encrypted ReLu outcome.
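The case analysis can be checked with a short plaintext sketch, where s stands for the blinded value r*v and (p, q) plays the role of the server's polar-indicator pair (the notation is ours; the actual scheme evaluates this under packed HE):

```python
def recover_relu(s, r):
    """Recover ReLu(v) from the blinded s = r * v.  The client can compute
    max(s, 0) and max(-s, 0) from s alone; only the server knows the sign
    of r, which it encodes in the indicator pair (p, q)."""
    p = 1.0 / r if r > 0 else 0.0    # selects the "r positive" branch
    q = -1.0 / r if r < 0 else 0.0   # selects the "r negative" branch
    return max(s, 0.0) * p + max(-s, 0.0) * q

# All four sign combinations of r and v reproduce ReLu(v) = max(v, 0):
for r in (2.0, -2.0):
    for v in (3.0, -3.0):
        assert recover_relu(r * v, r) == max(v, 0.0)
```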

Eqs. (6) and (7) are created for the ReLu function only. A similar design can be developed for other activation functions, as shown in Appendix A.

Subsequently, creates a ReLu share and computes the server’s share as Add. sends it along with (i.e., the client-encrypted share , which can be pre-generated by ) to .

decrypts to obtain a share of the plaintext activation result, i.e., . It then computes Add to obtain , i.e., the client-encrypted true nonlinear transformation result. Note that, the random share has been canceled in this HE addition.
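The share exchange reduces to additive secret sharing, sketched below in plaintext (names ours; in CHEETAH the subtraction and the final addition are HE operations, so neither party sees the other's share in the clear):

```python
import random

def split_into_shares(relu_value):
    """Split an activation into two additive shares; neither share alone
    reveals the activation, but their sum reconstructs it exactly."""
    s_a = random.uniform(-100.0, 100.0)
    s_b = relu_value - s_a
    return s_a, s_b

relu = 7.5
s_a, s_b = split_into_shares(relu)
recovered = s_a + s_b  # the HE Add cancels the randomness, recovering ReLu
```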

At this point, the computation of the current layer (including the linear convolution and nonlinear activation) is completed. The output of this layer (i.e., ) will serve as the input to the next layer. If the next layer is again convolutional, the server simply repeats the above process. If it is instead a fully-connected dense layer, a similar approach can be taken, as discussed in Sec. 3.3.

Note that some CNN models employ pooling after activation to reduce its dimensionality. For example, mean pooling takes the activations as the input, which is divided into a number of regions. The averaged value of each region is used to represent that region. Both and can respectively average their activation shares (i.e., and ) to obtain the share of mean pooling. Meanwhile, a similar scheme can be applied if the bias is included.
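Because mean pooling is linear, each party can pool its own share; a toy 1-D sketch (our own construction, with plaintext shares for illustration):

```python
def mean_pool(vec, region=2):
    """Average consecutive regions of a flat activation vector."""
    return [sum(vec[i:i + region]) / region for i in range(0, len(vec), region)]

# Pooling each additive share and adding the results equals pooling the
# reconstructed activations:
activations = [4.0, 2.0, 6.0, 0.0]
share_c = [1.0, -3.0, 2.5, 4.0]                           # client's share
share_s = [a - c for a, c in zip(activations, share_c)]   # server's share
pooled = [c + s for c, s in zip(mean_pool(share_c), mean_pool(share_s))]
# pooled == mean_pool(activations) == [3.0, 3.0]
```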

3.2 MIMO Convolutional Layer

The above SISO method can be readily extended to the MIMO convolutional layer in order to process multiple inputs simultaneously. Assume there are input data (i.e., ). Let be the number of input data that can be packed into one ciphertext. Recall that each must be transformed to as discussed in Sec. 3.1. Let denote the number of kernels and the size of each kernel. After transformation, the size of is times that of the original . Therefore, each ciphertext can hold such transformed input data. Accordingly, the input data are transformed and encrypted into ciphertexts.

The remaining process for linear and nonlinear computation is similar to SISO, except that the computation on a ciphertext actually processes multiple input data simultaneously and that the convolution of all input ciphertexts based on one kernel is combined into one output ciphertext, yielding a total of output ciphertexts. MIMO is clearly more efficient for processing batches of input data.
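A sketch of the packing arithmetic (the concrete numbers are illustrative assumptions, not the paper's parameters):

```python
import math

def mimo_ciphertext_count(num_inputs, slots, transformed_size):
    """Ciphertexts needed to pack all transformed inputs.
    transformed_size: slots one transformed input occupies (the SISO
    transformation expands each input, roughly by the kernel area)."""
    per_ciphertext = slots // transformed_size  # inputs per ciphertext
    return math.ceil(num_inputs / per_ciphertext)

# E.g., 64 inputs, 4096 slots, 512 slots per transformed input:
# 8 inputs fit in one ciphertext, so 8 ciphertexts are needed in total.
count = mimo_ciphertext_count(64, 4096, 512)
```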

3.3 Fully-connected Dense Layer

In a fully-connected dense layer, uses the output of the previous layer (i.e., ) to compute the weighted sum. Taking the simple two-layer CNN as an example, the weighted sum computes

The computation of and is intrinsically the same as the computation of each convolution element (i.e., ) as discussed above.

3.4 Complexity Analysis

In this subsection, we analyze the computation and communication cost of CHEETAH and compare it with other schemes.

(1) Computation Complexity. The analysis of the computation complexity focuses on the number of ciphertext permutations (Perm), multiplications (Mult), and additions (Add). The notations to be used in the analysis are summarized as follows:

  • is the number of slots in a ciphertext.

  • is the ciphertext space.

  • is the number of bits of a ciphertext.

  • is the input dimension of a fully connected layer.

  • is the output dimension of a fully connected layer.

  • is the kernel size.

  • is the number of input data (channels) in MIMO.

  • is the number of kernels or the number of output feature maps in MIMO.

  • is the number of input data that can be packed into one ciphertext.

In SISO, recall that a ciphertext is first sent to . conducts one ciphertext multiplication and one addition to get . Then receives , performs the decryption, and gets the summed convolution in plaintext, which is followed by 2 multiplications and 1 addition to get the encrypted ReLu, according to Eq. (6). Next, performs another addition, namely Add, to generate ’s ReLu share. finally recovers the encrypted nonlinear result with another addition. Therefore, a total of 3 multiplications and 4 additions is required in SISO. The complexity is .

In MIMO, sends ciphertexts. Then performs Mult and Add to get an incomplete ciphertext for each of the kernels. After that, each of the incomplete ciphertexts is added with a zero-sum vector by one addition. Then sends those ciphertexts to , which decrypts them and obtains output features, creating plaintexts. Based on Eq. (6), gets the encrypted ReLu with multiplications and additions, because each plaintext involves multiplications and 1 addition. Finally, performs another addition on each of the ReLu ciphertexts to generate the ReLu share for . then gets its ReLu share by decryption and recovers the nonlinear result by Add. Therefore, MIMO needs multiplications and additions, both with the complexity of .

In a fully-connected (FC) dense layer, conducts multiplications to get the intermediate ciphertext, where is usually much larger than and . After that, the zero-sum vector is added to each intermediate ciphertext to form (the structure of is similar to that of ), which is sent to . does the decryption and gets the summed result in plaintext. Then calculates the encrypted ReLu with 2 multiplications and 1 addition by Eq. (6). Finally, one addition is performed to generate the ReLu share for , and needs another Add to recover the encrypted nonlinear result. So the FC layer needs multiplications and additions, resulting in the complexity of .

Table II compares the computation complexity of CHEETAH and other schemes. Specifically, in the SISO case, CHEETAH has a constant complexity without permutation, while GAZELLE has the complexity . In the MIMO case, GAZELLE has two traditional options for permutation, i.e., Input Rotation (IR) and Output Rotation (OR) [23]. CHEETAH eliminates the expensive permutation without incurring more multiplications and additions, thus yielding a considerable gain. In the FC layer, we compare CHEETAH with a naive method in [23] (the baseline of GAZELLE), Halevi-Shoup (HS) [17] and GAZELLE. Through the obscure matrix calculation, obscure HE and secret sharing, CHEETAH further reduces the complexity of addition by compared to GAZELLE. In particular, is usually much larger than , which makes this reduction significant. It is worth pointing out that CHEETAH completes both the linear and nonlinear operations within the above complexity, while existing schemes such as GAZELLE only finish the linear operation.

Layer Type Methodology Perm Mult Add Communication (bits)
MIMO IR[23] -
FC Naive method[23] -
HS[17] -
TABLE II: Comparison of computation and communication complexity.

(2) Communication Complexity. In the SISO case, CHEETAH has two transmissions: 1) sends the encrypted data to ; 2) sends to . Thus the communication cost is bits. Note that the third transmission in Fig. 4 where sends the encrypted ReLu share to is the beginning of the next layer.

Similarly, in MIMO, the two transmissions are: 1) sends ciphertexts for input images; 2) sends ciphertexts for kernels. Note that, in the first transmission, ciphertexts are transmitted at the first convolutional layer, while only ciphertexts are needed in the other layers. This is because the size of the -encrypted ReLu will not change. In the second transmission, since can send each of the ciphertexts immediately after each calculation, the actual communication cost is that of transmitting the last of the ciphertexts. Thus, the communication cost is bits.

In the FC layer, the two transmissions are: 1) sends an input ciphertext; 2) sends ciphertexts. As each of the ciphertexts can be transmitted immediately after each calculation, the actual communication cost is that of the last of the ciphertexts. The total cost is thus bits.

Table II also compares the communication costs of CHEETAH and GAZELLE. We can see that, through the obscure matrix calculation, obscure HE and secret sharing, CHEETAH completes the convolution and FC layers with a smaller communication cost, while GAZELLE needs expensive GC for nonlinear functions, resulting in a much higher communication cost.

4 Security Analysis

We prove the security of CHEETAH using the simulation approach [15]. As discussed in Sec. 2, the semi-honest adversary can compromise either the client or the server, but not both (i.e., the client and server do not collude). Here, security means that the adversary only learns the inputs from the party that it has compromised, but nothing else beyond that. It is modeled by two interactions: first, an interaction in the real world where parties follow the protocol in the presence of an adversary , and the environment machine which chooses the inputs to the parties; second, an ideal interaction where parties forward their inputs to a trusted functionality machine . To prove security, the goal is to demonstrate that no environment can distinguish the real and ideal interactions. In other words, we want to show that a simulator in the ideal interaction can achieve the same effect as the adversary in the real interaction. The proofs are sketched below. (Since SISO and MIMO only differ in the number of transmitted ciphertexts, it is sufficient to demonstrate the security of SISO. The proof is given for convolutional kernels; the same argument follows for dot products.)

(1) Security against a semi-honest client. We define a simulator that simulates an admissible adversary which has compromised the client in the real world. conducts the following: 1) receives the transformed input data from the environment , transmits it to and obtains the convolution result , an -encrypted ReLu and ReLu (pooling) share ; 2) constructs a ciphertext and sends to ; 3) receives the ciphertext from , decrypts it, performs summation; 4) receives encrypted polar indicator pair, and , from and calculates -encrypted ReLu by Eq. (6); 5) randomly generates ReLu (pooling) share and sends to , where is the true convolution of .

Here the view of that simulates in the ideal world is the convolution result , an -encrypted ReLu and ReLu (or pooling) share from , while ’s view in the real execution is the convolution result of , another -encrypted ReLu for and the ReLu (pooling) share . The above is secure against the semi-honest client: 1) the randomness of and in and makes the two convolution results from the ideal and real worlds indistinguishable; 2) the private-key HE is semantically secure [9], so the two -encrypted ReLu in the ideal and real worlds are also indistinguishable from ; 3) the ReLu (pooling) shares, and , are uniformly random, and thus indistinguishable. In summary, the output distribution of in the ideal world is indistinguishable from that in the real world.

(2) Security against a semi-honest server. Similarly, we construct a simulator, , to emulate an admissible adversary which has compromised the server in the real world. acts as follows: 1) receives the transformed kernel and zero-sum vector from , sends it to and obtains the -encrypted partial convolution and the ReLu (pooling) share; 2) constructs a transformed kernel, zero-sum vector and blinding vector as , and (corresponding to in section 3.1); 3) receives the encrypted input data from , calculates the -encrypted partial convolution and sends it to ; 4) constructs the polar indicator pair and , and sends them to ; 5) receives the ReLu (pooling) share and decrypts it to get .

Here the view of that simulates in the ideal world is the -encrypted partial convolution and the ReLu (pooling) share from , while ’s view in the real world is another -encrypted partial convolution and the ReLu (pooling) share . First, because the private-key HE [9] is semantically secure, the -encrypted partial convolutions from and are indistinguishable from . Second, the uniformly random at makes the ReLu (pooling) shares from and indistinguishable. So the output distribution of in the ideal world is indistinguishable from that in the real world.

5 Performance Evaluation

We implement CHEETAH in C++ based on the Microsoft SEAL library [36], and compare it with the best existing scheme, GAZELLE. We use two workstations as the client and server. Both machines run Ubuntu with an Intel i7-8700 3.2 GHz CPU with 12 cores and 16 GB RAM. The network link between them is Gigabit Ethernet. Recall the four parameters in the BFV scheme: 1) the ciphertext modulus ; 2) the plaintext modulus ; 3) the number of ciphertext slots; and 4) a Gaussian noise with a standard deviation . A larger tolerates more noise. We set to be a 20-bit number and to be a 60-bit pseudo-Mersenne prime. The number of slots for the packed encryption is set to 10,000.

5.1 Component-wise Benchmark

We first examine the performance of each functional component including Conv, FC and ReLu.

Convolution Benchmark. We define the time of the convolution operation as the duration from when receives the encrypted data or secret share from the previous layer (e.g., ReLu) until completes the convolution computation, just before sending the (partial) convolution results to . It does not include the communication time between and , such as transmitting the (partial) convolution results to , or the secret share to , or, in the case of GAZELLE, the time for the HE-to-GC transformation between and , for a fair comparison. All such communication time is accounted for in the ReLu and pooling benchmarks discussed later.

Table III benchmarks the convolution with different input and kernel sizes. ‘In_rot’ and ‘Out_rot’ denote the two GAZELLE variants with input or output rotation, one of which has to be used for convolution (see [23] for details). From Table III, CHEETAH significantly outperforms GAZELLE. E.g., with the kernel size , both the GAZELLE In_rot and Out_rot variants need more than 25 Mult, 24 Add and 24 Perm operations to yield the convolution result. In contrast, CHEETAH needs only 5 Mult and 5 Add operations, one for each kernel, to obtain the (partial) convolution results. Those results are then sent to for computing ReLu (to be discussed). Overall, CHEETAH accomplishes a speedup of and times compared with the GAZELLE In_rot and Out_rot variants, respectively, for the case with kernel size and input data size .

Input data size  Kernel size  Method
28×28@1  5×5@5  In_rot 7.4 247
                Out_rot 6.2 207
16×16@128  1×1@2  In_rot 21.4 306
                  Out_rot 4.65 66
32×32@2  3×3@1  In_rot 2.3 115
                Out_rot 1.94 97
TABLE III: Benchmark for the convolution operation.

Fig. 5 illustrates the speedup and communication cost vs. the kernel size . Large kernel sizes have large receptive fields and are thus capable of learning more information. (Classic structures such as VGG-16 cascade multiple small kernels to realize the same functionality as a large one.) CHEETAH offers more boost for large kernels, achieving an average of (see Fig. 5(c)) to (as shown in Fig. 5(a)) speedup, with a slight exception when . This is because, with a larger kernel, GAZELLE conducts more expensive permutations for the final convolutional results. When the kernel is sufficiently large, this speedup is offset by other computations, hence the curve turns flat. This is reasonable since large kernels (e.g. ) are not desired in practice due to their heavy computation even in plaintext. Fig. 5(d) compares the communication cost (the size of the data sent to ). For clarity, we sequentially denote the three rows of system configurations in Table III as R1, R2 and R3. As can be seen, CHEETAH reduces the communication cost by , and times for R1, R2 and R3, respectively.

Fig. 5: Speedup and communication cost with various kernel sizes: (a) input data size 28 and kernel size , (b) input data size 16 and kernel size , (c) input data size 32 and kernel size , (d) communication cost for a convolutional layer.

FC Benchmark. Table IV compares the time for the weighted-sum function in an FC layer (matrix-vector multiplication and nonlinear ReLu) for various input and output dimensions. The speedup of CHEETAH over GAZELLE is rather impressive, from about to over times, thanks to the cost savings from the elimination of permutation. For example, when is , the input is a column vector with a size of 2048, which can be packed into one ciphertext. GAZELLE first conducts the Mult between the input and the weight matrix. As there are 2048 chunks with , it then performs Perm and Add to compute the weighted sum. In contrast, CHEETAH needs only one Mult and one Add, without the Perm. The results are packed into one ciphertext, sent to and then used in the obscure HE to compute the nonlinear activation function. This gives CHEETAH a times speedup compared with GAZELLE. From Table IV, we can see that a larger ratio leads to a higher speedup, as GAZELLE needs more Perm and Add operations. In contrast, CHEETAH needs only one Mult and one Add operation, independent of the ratio or the value of . This is beneficial for designing privacy-preserved, large-scale learning tasks, since the objective is to map the inputs into a high-dimensional vector in order to characterize all the classes. A large ratio is common in networks such as VGG-16, GoogleNet and ResNet-50 for the ImageNet task.

        Method   #Perm  #Mult  #Add  Time (ms)  Speedup
1×2048  GAZELLE  11     1      11    3.8        422
        CHEETAH  0      1      1     0.009
2×1024  GAZELLE  10     1      10    3.63       403
        CHEETAH  0      1      1     0.009
4×512   GAZELLE  9      1      9     3.3        367
        CHEETAH  0      1      1     0.009
8×256   GAZELLE  8      1      8     3          333
        CHEETAH  0      1      1     0.009
16×128  GAZELLE  7      1      7     2.65       294
        CHEETAH  0      1      1     0.009
TABLE IV: Benchmark for matrix-vector multiplication.

Table V presents the communication cost for the FC layer. As can be seen, GAZELLE has a higher communication cost due to GC, especially for a large output dimension , while the communication cost of CHEETAH is independent of the input or output dimensions, as it needs only one ciphertext. Using GC, GAZELLE needs more communication overhead between and . Note that this is also true for the communication time after convolution, for computing the nonlinear ReLu activation.

         1×2048  2×1024  4×512  8×256  16×128
CHEETAH  143.1   143.1   143.1  143.1  143.1
GAZELLE  147.8   152.5   161.9  180.8  218.6
TABLE V: Communication cost for matrix-vector multiplication (KB).

ReLu Benchmark. Table VI shows the speedup of the nonlinear operation, i.e., the ReLu function. As its obscure HE conducts only a -multiplicative-depth HE operation to compute the ReLu function, followed by a one-way communication from to to send the ReLu share, CHEETAH dramatically improves the efficiency of the nonlinear operation, by up to times for ReLu compared with GAZELLE.

cost (ms)
ReLu 1000 GAZELLE 115 267
10000 GAZELLE 843 1793
TABLE VI: Benchmark for ReLu operation.
Fig. 6: Benchmark for VGG-16. (Best viewed in color.)

Fig. 7 plots the speedup and communication cost as a function of the output dimension. Again, CHEETAH achieves an outstanding speedup with a much smaller communication cost, independent of the output dimension, compared with GAZELLE. The speedup quickly increases as the output dimension grows. The communication cost of CHEETAH only involves the packed ciphertexts for the nonlinear share of , and CHEETAH needs only one round of communication. In comparison, GAZELLE needs the GC module to obtain the nonlinear result, which incurs a large communication cost proportional to the output dimension and requires multiple rounds of communication between and . Overall, CHEETAH achieves a communication cost reduction of up to two orders of magnitude compared with GAZELLE.

Fig. 7: (a) Speedup of CHEETAH over GAZELLE for computing ReLu. (b) Comparison of communication cost for ReLu.

5.2 Benchmark with Classic Networks

This section compares the overall end-to-end performance of CHEETAH and GAZELLE on complete networks, i.e., including all computation and communication from the data input to the final inference result. We benchmark on four neural network structures: (i) Network A [34]: 1 Conv and 2 FC layers with ReLu activation; (ii) Network B [28]: 2 Conv and 2 FC layers with ReLu activation and pooling; (iii) AlexNet [24]: Conv and FC layers with ReLu activation and pooling; (iv) VGG-16 [39]: Conv and FC layers with ReLu activation and pooling. Networks A and B represent relatively shallow structures like LeNet [25] that were used for simple tasks such as recognition of handwritten digits. AlexNet and VGG-16 achieve record-breaking performance on large-scale datasets like ImageNet [24]. These networks leverage a stack of convolutional layers to extract complex feature relations from inputs of large dimensions ( RGB images). Thus, the capability of running such deep networks in reasonable time is a significant step toward bridging the gap from research to practical privacy-preserved computer vision applications.

Table VII presents the speedup of CHEETAH over GAZELLE. CHEETAH achieves to times speedup across the four networks. It not only dramatically reduces the running time compared with GAZELLE, but also brings the running time down to a practical range. For instance, for the VGG-16 network, GAZELLE needs about half an hour to produce an inference result (image classification), while CHEETAH takes only 12 seconds, which is practical for many MLaaS applications, and even potential mobile applications. This is the first time that privacy-preserved learning can be applied to practical neural networks for real-world applications. Table