# An Adaptive and Fast Convergent Approach to Differentially Private Deep Learning

## Abstract

With the advent of the era of big data, deep learning has become a prevalent building block in a variety of machine learning and data mining tasks, such as signal processing, network modeling and traffic analysis, to name a few. Massive crowdsourced user data plays a crucial role in the success of deep learning models. However, it has been shown that user data may be inferred from trained neural models and thereby exposed to potential adversaries, which raises information security and privacy concerns. To address this issue, recent studies leverage the technique of differential privacy to design privacy-preserving deep learning algorithms. Albeit successful at privacy protection, differential privacy degrades the performance of neural models. In this paper, we develop AdaDp, an adaptive and fast convergent learning algorithm with a provable privacy guarantee. AdaDp significantly reduces the privacy cost by improving the convergence speed with an adaptive learning rate and mitigates the negative effect of differential privacy upon the model accuracy by introducing adaptive noise. The performance of AdaDp is evaluated on real-world datasets. Experiment results show that it outperforms state-of-the-art differentially private approaches in terms of both privacy cost and model accuracy.

## I Introduction

### I-A Background and Motivation

The past decade has witnessed the remarkable success of deep learning techniques in various machine learning and data mining tasks, such as signal processing [1], network modeling [2] and traffic analysis [3]. This success relies heavily on the massive collection of user data, which, however, often raises severe privacy and security issues. For example, Fredrikson et al. [4] demonstrate that private information about individuals in the training dataset can be recovered by repeatedly querying the output probabilities of a disease recognition classifier built upon a convolutional neural network (CNN). Existing privacy concerns are likely to discourage users from sharing their data and thereby obstruct the future development of deep learning itself.

This paper studies the problem of user privacy protection in the training process of neural models. We consider the white-box scenario where an adversary has access to the parameters of a trained model. In this scenario, many service providers allow users to download models to their personal devices (e.g., computers and smart phones), and malicious users could analyze the parameters of the model which may expose personal information in the training dataset.

### I-B Limitations of Prior Art

To address the privacy issue, several differential privacy (DP) [5] based approaches were proposed, which may be classified into two categories: data obfuscation and gradient obfuscation. Data obfuscation based approaches obfuscate data with noise prior to potential exposure of sensitive information [6, 7]. These approaches may suffer from significant accuracy degradation of the trained model. The reason is that, to guarantee the differential privacy bound, the added noise may be excessively intense and make differently labeled training instances almost indistinguishable. In contrast to data obfuscation, gradient obfuscation based approaches add noise to the gradient in the training process [8, 9, 10, 11, 12, 13]. However, they may not circumvent the accuracy degradation issue completely. Although some methods aim to improve the accuracy of gradient obfuscation [11, 13], they have three key limitations. First, the privacy cost is high because the convergence speed of these methods is slow while the privacy cost is accumulated for each gradient calculation. Second, the accuracy still cannot meet the high-precision requirements of many applications since they add identically distributed noise to all components of the gradient, which results in large distortion of the original gradient. Third, these methods are computationally inefficient because they need to evaluate the model multiple times or solve a large-scale optimization problem per iteration, which makes the task computationally prohibitive.

### I-C Proposed Approach

In this paper, we propose AdaDp, an adaptive and fast convergent approach to differentially private deep learning. Our key observation is that different components of the gradient have inhomogeneous sensitivity to the training data. In light of this observation, AdaDp mitigates the influence of noise on the model performance by adaptively sampling noise from different Gaussian distributions, based on the sensitivity of each component. In the first stage, AdaDp adjusts the learning rate in an adaptive manner based on historical gradients such that infrequently updated components tend to have a larger learning rate. Then, AdaDp samples noise from different Gaussian distributions according to the sensitivity of each gradient component and constructs differentially private gradients by adding the sensitivity-dependent noise to the original gradient. In this way, Gaussian noise with a lower variance is added to components with a smaller sensitivity.

Compared to existing data obfuscation and gradient obfuscation based methods, AdaDp has three key advantages.

First, the privacy cost of AdaDp is low because it exhibits remarkable improvement in the convergence speed due to an adaptive learning rate. Since the privacy cost is accumulated on each gradient update, a faster rate of convergence indicates that training a model using AdaDp incurs lower privacy cost.

Second, AdaDp achieves both a provable privacy guarantee and a comparable accuracy to non-differentially private models simultaneously. We will show later that this is attained by adding adaptive noise to different gradient components, depending on their sensitivity. As the model converges, we will see a decrease in the expected sensitivity of each gradient component, which thereby reduces the variance of noise distribution. In other words, in contrast to prior works, the noise distribution of AdaDp is adaptive to not only different gradient components but also different training iterations.

Third, AdaDp is computationally efficient since it does not need to solve any optimization problem to determine the noise distribution at each iteration, in sharp contrast to [13], which requires solving a large-scale non-convex optimization problem. We design an efficient scheme for adjusting the noise distribution, and the scheme is evaluated via both theoretical analysis and numerical experiments.

### I-D Technical Challenges and Solutions

First, it is technically challenging to mathematically analyze the influence of the noise distribution upon the prediction performance, in light of the complicated nature of deep neural networks. As an alternative, we analyze the sufficient and necessary condition for noise distributions to guarantee the target differential privacy level. The condition involves an inequality with respect to the sensitivity of each gradient component and the variance of each corresponding Gaussian distribution. Based on the analysis, we conduct an experiment to compare the influence of different noise distributions upon the original function which can be seen as a query on a dataset. According to the theoretical and experiment results, we propose a heuristic that adapts the noise distributions to the sensitivity of each gradient component. Finally, we perform another experiment to verify the effectiveness of this heuristic according to the influence of noise on the gradient descent algorithm which is widely used for optimizing deep learning models.

Another technical challenge is to compute the privacy cost without any assumptions on the parameters such as the noise level and the sampling ratio. This is in sharp contrast to prior methods. For example, the moments accountant method requires the noise level and the sampling ratio [10]. To remove the assumptions, we use a technique termed subsampled Rényi differential privacy (RDP) [14], which computes the privacy cost by analyzing the privacy amplification in the subsampling scenario. And importantly, it requires no assumption on the parameters in the analysis [15].

### I-E Summary of Experiment Results

We evaluated the privacy cost, the accuracy and the computational efficiency of AdaDp and baselines on two real datasets: MNIST [16] and CIFAR-10 [17]. The results show that AdaDp outperforms state-of-the-art methods with 50% privacy cost reduction. On MNIST and CIFAR-10, AdaDp achieves an improvement of up to 4.3% and 3.5%, respectively, in accuracy over state-of-the-art methods. Our experimental results also show that the processing time of AdaDp at each iteration is much less than that of [13, 11], validating the computational efficiency of AdaDp.

## II Related Work

To protect sensitive information in crowdsourced data collection, differentially private crowdsourcing mechanisms were designed [18, 19, 20, 21]. To preclude personal information from being inferred and/or identified from neural models [4], a line of works emerged which applied DP to deep learning [8, 9, 22, 10, 23, 24, 25, 26, 11, 12, 27, 28, 13]. For instance, [24, 26] integrate DP into a teacher-student framework to protect data on student nodes. [9, 8, 29, 27] study transferring features or gradients with privacy control in collaborative deep learning. [18, 19, 20, 21] apply DP to personal data collection. In the above works, black-box attacks are (implicitly) assumed, i.e., the learned model is inaccessible to adversaries. Other works studied privacy protection in a more realistic white-box model where adversaries may have full knowledge of the model [10, 25, 11, 12, 13]. [10] proposes a differentially private gradient descent algorithm DpSgd by adding Gaussian noise to the gradient. [12] uses an adaptive learning rate to improve the convergence rate and reduce the privacy cost. [25] introduces the Laplace mechanism such that the privacy budget consumption is independent of the number of training steps. [11] allocates different privacy budgets to each training iteration to counteract the influence of noise on the gradient. In all aforementioned works, however, the noise on each gradient component follows the same probability distribution. As a result, the original gradient is distorted to a large extent. Although [13] samples noise from different distributions for each gradient component, solving a large-scale optimization is required at every step.

## III Our Approach

To reduce the privacy cost, AdaDp uses an adaptive learning rate to accelerate convergence. Additionally, AdaDp adds inhomogeneous and adaptive noise to different coordinates of the gradient based on their sensitivity in order to mitigate the influence of noise on the model performance. The next two subsections elaborate on the adaptive learning rate and the adaptive noise, respectively. Then we present AdaDp and show its differential privacy guarantee.

### III-A Adaptive Learning Rate

The most popular methods for training deep models are gradient-descent-type algorithms. They iteratively update the parameters of a model by moving them in the direction opposite to the gradient of the loss function evaluated on the training data. The loss function is the difference between the predictions and the true labels. To minimize the loss $\mathcal{L}(\theta)$, stochastic gradient descent (Sgd) randomly chooses a subset of training data (denoted by $B$) at each iteration and performs the update $\theta_{t+1} = \theta_t - \eta\,\nabla_{\theta}\mathcal{L}(\theta_t; B)$. Sgd poses several challenges, e.g., selection of a proper learning rate and avoidance of local minimum traps.

More advanced optimizers, including RMSProp, Adam, Adadelta and Nadam, are proposed to address the above issues. They adjust the learning rate on a per-parameter basis in an adaptive manner and scale the coordinates of the gradient according to historical data.

AdaDp uses an adaptive strategy similar to that of RMSProp. Nevertheless, we would like to note that the framework proposed in this paper is applicable to other adaptive gradient-descent-type algorithms. Recall the update of RMSProp

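$$\mathbb{E}[g^2]_t = \gamma\,\mathbb{E}[g^2]_{t-1} + (1-\gamma)\,g_t^2, \qquad \theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{\mathbb{E}[g^2]_t + \epsilon_0}}\; g_t \tag{1}$$

where $\theta_t$ denotes the parameters at step $t$, $g_t$ denotes the original gradient, $\mathbb{E}[g^2]_t$ is the running average of squared gradients with decay rate $\gamma$, $\eta$ is the learning rate, and $\epsilon_0$ is the smoothing term (in case the denominator is 0).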

In AdaDp, let denote the update term at step and we have where is the noisy gradient, i.e., the gradient to which Gaussian noise has been added. In other words, AdaDp uses the denominator to adjust the learning rate in an adaptive manner.
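To make the per-coordinate scaling concrete, the following is a minimal sketch of one RMSProp-style step applied to an already-noised gradient. The function name `adaptive_step`, the decay rate of 0.9, and the placement of the smoothing term inside the square root are assumptions made for illustration; they are not taken from Algorithm 1.

```python
import math

def adaptive_step(theta, noisy_grad, avg_sq, lr=0.001, decay=0.9, eps=1e-8):
    """One RMSProp-style update, applied coordinate-wise.

    theta      : current parameters (list of floats)
    noisy_grad : gradient after noise has been added (assumed given)
    avg_sq     : running average of squared gradients (same length)
    """
    # Exponential moving average of the squared gradient, per coordinate.
    new_avg_sq = [decay * a + (1.0 - decay) * g * g
                  for a, g in zip(avg_sq, noisy_grad)]
    # Each coordinate gets its own effective learning rate lr / sqrt(avg + eps).
    new_theta = [t - lr * g / math.sqrt(a + eps)
                 for t, g, a in zip(theta, noisy_grad, new_avg_sq)]
    return new_theta, new_avg_sq

# Two coordinates whose gradients differ by a factor of 100.
theta, avg_sq = adaptive_step([1.0, -2.0], [2.0, 0.02], [0.0, 0.0], lr=0.1)
print(theta)
```

In this toy call both coordinates move by a similar amount even though their gradients differ by two orders of magnitude, which is the sense in which infrequently or weakly updated coordinates receive a larger effective learning rate.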

### III-B Adaptive Noise

The intuition of adaptive noise is that different coordinates of the gradient exhibit inhomogeneous sensitivities due to their different values. It significantly affects the direction of the gradient if noise with higher intensity is added to coordinates with a smaller value, and vice versa. In light of this intuition, AdaDp clips the gradient and adds Gaussian noise with a smaller/larger variance to dimensions of the clipped gradient with a smaller/larger sensitivity.

Formally, we define the $\ell_2$-sensitivity of a function $f$ as $\max_{D, D'} \|f(D) - f(D')\|_2$, where $D, D'$ are two datasets which differ in only one record. Correspondingly, adaptive noise is sampled from Gaussian distributions with different variances based on the $\ell_2$-sensitivity of each dimension. Meanwhile, adaptive noise must satisfy the constraint of differential privacy. To guarantee this, we first present the following lemma, which provides the theoretical foundation of adaptive noise.

###### Lemma 1.

Suppose mechanism , where is the input dataset, is a -dimension function that and noise , is -differentially private, then mechanism , where is also a -dimension function, and , is also -differentially private if where is the -sensitivity of the ^{th} dimension of .

To prove Lemma 1, we need an auxiliary result which involves the sufficient and necessary condition on the privacy loss variable to satisfy -DP.

###### Lemma 2 (Analytical differential privacy [30]).

A mechanism is -DP if and only if for each , the following holds:

(2) |

where is the privacy loss variable defined by .

We are now ready to present the proof of Lemma 1.

###### Proof.

The first step is to show that is a Gaussian random variable. Let . We consider the worst case of . In this case, we have , which yields . As a result, the following equations hold

(3) |

In light of , we obtain that also obeys a Gaussian distribution. Specifically, we have . If we let denote , it can be re-written as .

The second step is to show that is monotonically increasing in . Let us compute the first term

where . Similarly, the second term can be re-written as

where .

Then the derivative of about is:

Since , then we have

Therefore, is monotonically increasing with .

Lemma 1 shows the condition of differential privacy when we add noise from Gaussian distributions with different variances to different dimensions of a query function. For example, suppose we have a -dimension query function , and . If mechanism satisfies -DP with , then mechanism satisfies the same -DP with and since . Namely, we sample noise with standard deviation for the first dimension and for the second dimension of . In contrast, previous methods sample noise from the same Gaussian distribution for each dimension of since .

Continuing with the above example, we conducted a numerical experiment to compare the influence of different noise distributions on the original query function. Specifically, we calculate the cosine similarity between the noisy result () and the original result (). The higher the similarity between the two results, the smaller the influence of noise on the original query function. Without loss of generality, we assume given that . We then sample noise 10,000 times and compute the average cosine similarity. When we set and , the average cosine similarity between the noisy result and the original result is 0.52. However, when we set , the average is only 0.36. Another strategy is to set and ; the cosine similarity in this scenario is reduced to 0.28. Note that, like the third strategy, the optimization techniques in [13] tend to sample noise with higher variance for dimensions with smaller sensitivity. In other words, this numerical experiment shows that coordinates of the query function with larger sensitivity can tolerate noise with higher variance.
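The kind of comparison described above can be set up along the following lines. This is only a sketch under stated assumptions: the three allocation rules (`proportional`, `uniform`, `inverse`) are normalized so that the quantity $\sum_i (s_i/\sigma_i)^2$ is the same for all of them, which is one natural way to hold the Gaussian privacy level fixed but is assumed here rather than quoted from Lemma 1. The sensitivities, the budget value, and all function names are hypothetical, so the printed averages are not expected to reproduce the figures reported above.

```python
import math
import random

def cosine(u, v):
    """Cosine similarity between two vectors given as lists."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def privacy_load(sens, sigmas):
    """Assumed privacy-relevant quantity: sum_i (s_i / sigma_i)^2."""
    return sum((s / sg) ** 2 for s, sg in zip(sens, sigmas))

def proportional(sens, budget):
    """sigma_i proportional to s_i (larger sensitivity, larger noise)."""
    k = math.sqrt(len(sens) / budget)
    return [k * s for s in sens]

def uniform(sens, budget):
    """Same sigma for every coordinate."""
    sg = math.sqrt(sum(s * s for s in sens) / budget)
    return [sg] * len(sens)

def inverse(sens, budget):
    """sigma_i inversely proportional to s_i."""
    c = math.sqrt(sum(s ** 4 for s in sens) / budget)
    return [c / s for s in sens]

def avg_cosine(g, sigmas, trials=10000, seed=0):
    """Monte Carlo estimate of the mean cosine similarity between g and g + noise."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        noisy = [gi + rng.gauss(0.0, sg) for gi, sg in zip(g, sigmas)]
        total += cosine(noisy, g)
    return total / trials

sens = [1.0, 10.0]   # hypothetical per-coordinate sensitivities
g = list(sens)       # query output taken equal to the sensitivities
for rule in (proportional, uniform, inverse):
    sigmas = rule(sens, budget=1.0)
    print(rule.__name__, round(privacy_load(sens, sigmas), 6), round(avg_cosine(g, sigmas), 3))
```

Because every rule is normalized to the same load, any difference in the printed similarities comes from how the noise is distributed across coordinates rather than from a change in the total noise budget.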

In the scenario of training a deep learning model, we consider the gradient at each iteration as a query function of the training dataset. Specifically, given the gradient at the $t$-th iteration, we still use to denote the $\ell_2$-sensitivity of the $i$-th dimension of . Then, by Lemma 1, a Gaussian mechanism , where and , satisfies -DP if where will be determined later. The remaining problem is how to calculate and . Note that RMSProp uses the term to estimate the average square of historical gradients. We observe that historical gradients can also be used to estimate the current gradient. For example, if , then it is quite possible that . In other words, can be considered as a kind of prior knowledge about the value of . To distinguish from the term used in the adaptive learning rate, we use to denote this prior knowledge, which can be computed as follows:

(4) |

Then, based on the prior knowledge, can be calculated as where is a parameter. Since is the $\ell_2$-sensitivity of the $i$-th dimension, we clip each coordinate of the original gradient to guarantee this. Specifically, the $i$-th dimension of the clipped gradient (denoted by ) is calculated as:

(5) |

In contrast with previous methods, which clip the gradient with a global $\ell_2$-norm upper bound such that , we call our method local clipping since it operates on each coordinate separately, and call the local clipping factor. Then, considering , we have where is the number of dimensions. This heuristic means that for the $i$-th dimension of the gradient, the standard deviation of the Gaussian noise () scales with the $\ell_2$-sensitivity of the $i$-th dimension (). On the other hand, as the training converges, the gradual decline of will lead to a decrease of each . That is, the noise distributions are adaptive not only to different dimensions of the gradient but also to different iterations of the training.

Since we have at the first iteration, this value cannot be used to clip the gradient which will cause . We set another parameter called local clipping threshold and AdaDp applies local clipping only when . In other words, as a kind of prior knowledge about the gradient , lacks sufficient information at the beginning of the training. During this phase, AdaDp applies global clipping as follows:

(6) |

where is a parameter called the $\ell_2$-norm clipping bound. After several iterations, will become more stable and AdaDp will perform local clipping when .
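The difference between the two clipping modes can be sketched as follows. Global clipping is written in the usual DpSgd form, rescaling the whole gradient so that its $\ell_2$ norm does not exceed the bound. For local clipping, the sketch assumes that each coordinate is simply clamped to the interval $[-s_i, s_i]$ given by its own sensitivity; this clamp, and the names `global_clip` and `local_clip`, are illustrative assumptions rather than a transcription of Eqs. (5) and (6).

```python
import math

def global_clip(grad, bound):
    """Rescale the whole gradient so that its l2 norm is at most `bound`."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = max(1.0, norm / bound)
    return [g / scale for g in grad]

def local_clip(grad, sens):
    """Clamp each coordinate to [-s_i, s_i] (assumed form of local clipping)."""
    return [max(-s, min(s, g)) for g, s in zip(grad, sens)]

grad = [3.0, -4.0, 0.1]
print(global_clip(grad, bound=1.0))          # direction preserved, norm reduced to 1
print(local_clip(grad, [1.0, 1.0, 1.0]))     # only the large coordinates are touched
```

Global clipping preserves the direction of the gradient but shrinks small coordinates together with large ones, whereas the coordinate-wise clamp leaves coordinates that are already within their bound unchanged.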

To gain more insight into the advantages of adaptive noise, we illustrate and compare the updates of AdaDp, DpSgd, and their non-private counterparts by testing them on the Beale function $f(x, y) = (1.5 - x + xy)^2 + (2.25 - x + xy^2)^2 + (2.625 - x + xy^3)^2$, as shown in Figs. 0(b) and 0(a). In this experiment, we set , , and .

We observe that the trajectory of DpSgd exhibits a remarkable deviation from that of its non-private version, while AdaDp and its non-private version display similar trajectories. Quantitatively, we define the distance between two trajectories and of the same length as . Note that the length of a trajectory is the number of training iterations. The distance between the trajectories of DpSgd and its non-private version is 0.90 and the distance between AdaDp and its non-private version is 0.21. Experiments on all widely used test functions for optimization [31] also show similar results.

The above experiment suggests that adaptive noise can mitigate the deviation of the noisy result from the original result and the performance of AdaDp is more robust to the privacy-protecting Gaussian noise than DpSgd. Recall that DpSgd adds Gaussian noise with the same intensity to all dimensions. In contrast, AdaDp updates each dimension separately and the intensity of the added Gaussian noise relies on the sensitivity of each dimension of the gradient.
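A trajectory comparison of this kind can be reproduced in spirit with a few lines of code. The sketch below defines the Beale function and its analytic gradient, runs plain gradient descent with and without additive Gaussian noise, and measures how far the two paths drift apart. The starting point, step size, noise level, and the use of the mean pointwise Euclidean distance as the trajectory distance are all assumptions made for illustration; they are not the settings or the distance definition used in the experiment above.

```python
import math
import random

def beale(x, y):
    """Beale test function; its global minimum is 0 at (3, 0.5)."""
    return ((1.5 - x + x * y) ** 2
            + (2.25 - x + x * y ** 2) ** 2
            + (2.625 - x + x * y ** 3) ** 2)

def beale_grad(x, y):
    """Analytic gradient of the Beale function."""
    t1 = 1.5 - x + x * y
    t2 = 2.25 - x + x * y ** 2
    t3 = 2.625 - x + x * y ** 3
    dx = 2 * t1 * (y - 1) + 2 * t2 * (y ** 2 - 1) + 2 * t3 * (y ** 3 - 1)
    dy = 2 * t1 * x + 2 * t2 * (2 * x * y) + 2 * t3 * (3 * x * y ** 2)
    return dx, dy

def descend(start, steps, lr, sigma=0.0, seed=0):
    """Gradient descent; Gaussian noise of std `sigma` is added to each gradient."""
    rng = random.Random(seed)
    x, y = start
    path = [(x, y)]
    for _ in range(steps):
        dx, dy = beale_grad(x, y)
        x -= lr * (dx + rng.gauss(0.0, sigma))
        y -= lr * (dy + rng.gauss(0.0, sigma))
        path.append((x, y))
    return path

def mean_distance(p, q):
    """Assumed trajectory distance: mean pointwise Euclidean distance."""
    return sum(math.dist(a, b) for a, b in zip(p, q)) / len(p)

clean = descend((1.0, 1.0), steps=100, lr=0.01)
noisy = descend((1.0, 1.0), steps=100, lr=0.01, sigma=1.0)
print(mean_distance(clean, noisy))
```

Replacing the isotropic noise in `descend` with per-coordinate noise levels would give the corresponding experiment for an adaptive scheme.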

### III-C Algorithm and Main Results

Now we present AdaDp in Algorithm 1. Let denote the initial values of the trainable parameters. Note that a lot is a sample from the training dataset, which is different from a batch. There can be several batches in a lot, and the lot size is the batch size multiplied by the number of batches in it. At each step , we compute the gradient with the per-example gradient algorithm [32] (line 5). Then, for each dimension of the gradient, we perform local clipping at lines 9 and 10 if the local clipping threshold is satisfied (line 7). After that, we calculate and add Gaussian noise to the $i$-th dimension of the gradient (lines 12 and 13). If the local clipping threshold is not reached, we perform global clipping and add non-adaptive noise at lines 16 and 18. Then we update the two expectations used in the adaptive learning rate and the adaptive noise correspondingly (lines 22 and 23). Afterwards, we update the parameters at the $t$-th iteration with (line 26). After training steps, AdaDp returns the final values (line 28).

Before presenting our main results on the privacy guarantee of AdaDp, let us review the definition of differential privacy.

###### Definition 1 (Differential Privacy [5, 10]).

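A randomized mechanism $\mathcal{M}$ satisfies $(\epsilon, \delta)$-differential privacy if, for any two neighboring datasets $D$ and $D'$ that differ only in one tuple, and for any possible subset of outputs $S$ of $\mathcal{M}$, we have

$$\Pr[\mathcal{M}(D) \in S] \le e^{\epsilon}\,\Pr[\mathcal{M}(D') \in S] + \delta,$$

where $\Pr[\cdot]$ denotes the probability of an event. If $\delta = 0$, $\mathcal{M}$ is said to satisfy $\epsilon$-differential privacy.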
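The inequality in this definition can be checked exhaustively for a mechanism with a finite output space. As a small illustration that is separate from AdaDp, the sketch below uses randomized response on a single private bit, a textbook mechanism chosen here only because its output distribution is easy to enumerate. The function names and the numerical tolerance are assumptions of this example.

```python
import math

def randomized_response(bit, epsilon):
    """Output distribution of randomized response: report the true bit
    with probability e^eps / (1 + e^eps), the flipped bit otherwise."""
    p_truth = math.exp(epsilon) / (1.0 + math.exp(epsilon))
    return {bit: p_truth, 1 - bit: 1.0 - p_truth}

def satisfies_dp(mech_epsilon, claimed_epsilon, delta=0.0, tol=1e-12):
    """Check Pr[M(D) in S] <= e^eps * Pr[M(D') in S] + delta for every
    output subset S and both orderings of the two neighboring inputs."""
    p0 = randomized_response(0, mech_epsilon)
    p1 = randomized_response(1, mech_epsilon)
    bound = math.exp(claimed_epsilon)
    for subset in [(), (0,), (1,), (0, 1)]:
        a = sum(p0[o] for o in subset)
        b = sum(p1[o] for o in subset)
        if a > bound * b + delta + tol or b > bound * a + delta + tol:
            return False
    return True

print(satisfies_dp(1.0, 1.0))             # the mechanism meets its own epsilon
print(satisfies_dp(1.0, 0.5))             # a smaller epsilon is violated
print(satisfies_dp(1.0, 0.5, delta=0.3))  # unless a large enough delta is allowed
```

The last call shows the role of $\delta$: it absorbs the probability mass by which the pure $\epsilon$ bound is exceeded.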

We now show the privacy guarantee of AdaDp in Theorem 1.

###### Theorem 1.

To prove Theorem 1, we use the techniques of RDP to analyze the privacy cost of the composition of Gaussian mechanisms. The results that we obtain via RDP are later translated to DP.

###### Definition 2 (Rényi Divergence [14]).

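Given two probability distributions $P$ and $Q$, the Rényi divergence between $P$ and $Q$ with order $\alpha > 1$ is defined by

$$D_{\alpha}(P \,\|\, Q) = \frac{1}{\alpha - 1} \log \mathbb{E}_{x \sim Q}\left[\left(\frac{P(x)}{Q(x)}\right)^{\alpha}\right].$$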

###### Definition 3 (Rényi differential privacy [14]).

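A randomized mechanism $\mathcal{M}$ is said to be $(\alpha, \rho)$-Rényi differentially private if for any adjacent datasets $D, D'$, it holds that

$$D_{\alpha}\big(\mathcal{M}(D) \,\|\, \mathcal{M}(D')\big) \le \rho.$$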

###### Lemma 3 (RDP of Gaussian Mechanism [14]).

If $f$ has $\ell_2$-sensitivity 1, then the Gaussian mechanism $f(D) + \mathcal{N}(0, \sigma^2 I)$ satisfies $(\alpha, \alpha / (2\sigma^2))$-RDP.

###### Lemma 4 (Composition of RDP [14]).

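For two randomized mechanisms $\mathcal{M}_1, \mathcal{M}_2$ such that $\mathcal{M}_1$ is $(\alpha, \rho_1)$-RDP and $\mathcal{M}_2$ is $(\alpha, \rho_2)$-RDP, the composition of $\mathcal{M}_1$ and $\mathcal{M}_2$, which is defined as $(X, Y)$ (a sequence of results), where $X \sim \mathcal{M}_1$ and $Y \sim \mathcal{M}_2$, satisfies $(\alpha, \rho_1 + \rho_2)$-RDP.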

###### Lemma 5 (Translation from RDP to DP [14]).

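If a randomized mechanism $\mathcal{M}$ satisfies $(\alpha, \rho)$-RDP, then it satisfies $\left(\rho + \frac{\log(1/\delta)}{\alpha - 1}, \delta\right)$-DP where $0 < \delta < 1$.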
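The three lemmas above chain together into a simple accounting recipe: compute the RDP of one Gaussian step, add it up over the steps, and convert to an $(\epsilon, \delta)$ guarantee while searching over the order $\alpha$. The sketch below follows the standard forms of these results for a query with unit $\ell_2$-sensitivity. It deliberately ignores the privacy amplification from subsampling, so it gives a looser bound than a subsampled analysis would; the function names and the candidate range of orders are assumptions of this example.

```python
import math

def gaussian_rdp(alpha, sigma):
    """RDP of the Gaussian mechanism with unit l2-sensitivity: alpha / (2 sigma^2)."""
    return alpha / (2.0 * sigma ** 2)

def compose(rho_per_step, steps):
    """RDP composes additively at a fixed order alpha."""
    return steps * rho_per_step

def rdp_to_dp(alpha, rho, delta):
    """Convert (alpha, rho)-RDP to (epsilon, delta)-DP."""
    return rho + math.log(1.0 / delta) / (alpha - 1.0)

def best_epsilon(sigma, steps, delta, alphas=tuple(range(2, 65))):
    """Smallest epsilon over the candidate orders; returns (epsilon, alpha)."""
    return min((rdp_to_dp(a, compose(gaussian_rdp(a, sigma), steps), delta), a)
               for a in alphas)

eps, alpha = best_epsilon(sigma=4.0, steps=1, delta=1e-5)
print(round(eps, 4), alpha)
```

Searching over $\alpha$ matters because the first term grows with the order while the conversion term shrinks, so the tightest $\epsilon$ is attained at an intermediate order.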

Lemma 6 analyzes the privacy amplification in the subsampling setting.

###### Lemma 6 (RDP of subsampled Gaussian Mechanism [15]).

Define function . Given a dataset of records and a Gaussian mechanism satisfying -RDP, define a new randomized mechanism as: (1) subsample records where , (2) apply these records as the input of mechanism , then for any integer , satisfies -RDP, where:

(8) |

Before showing the proof of Theorem 1, we need to establish the following important lemma. In Lemma 7, we bound the noise level of a Gaussian mechanism in the subsampling setting.

###### Lemma 7.

Given the sampling probability , the number of steps , the Gaussian mechanism where , then the composition of these mechanisms satisfies -differentially private if satisfies:

(9) | ||||

(10) |

where can be any integer satisfying and the function is defined as .

###### Proof.

By Lemma 3, we could compute defined in Lemma 6 on Gaussian mechanisms as

(11) |

Since a lot used in AdaDp is a subsample of the training dataset, each step is -RDP where satisfies (6) in Lemma 6. Then after training steps of AdaDp, we could obtain the total privacy cost via composing such subsampled Gaussian mechanisms depending on Lemma 4. Then by substituting the function in (6) with (11), can be clearly expressed as

(12) |

At last, through converting to DP representation via Lemma 5, AdaDp satisfies -DP. Let where is the given privacy budget and combine this with (III-C) as follows:

then the proof is completed. ∎

Now we are ready to prove Theorem 1.

###### Proof.

Theorem 1 follows from Lemma 1 and Lemma 7. Suppose we use as a Gaussian mechanism at training step in Algorithm 1; then the composition of such mechanisms satisfies Eq. 9. Based on Lemma 1, if we use at each training step in which , then satisfies the same differential privacy guarantee as . Therefore, the proof of Lemma 7 also applies to . Combining this fact with the post-processing property of DP [33] completes the proof.

∎

We would like to remark that Theorem 1 covers the realm of small noise and high sampling ratio that the moments accountant method [10] omits (which requires and ). For instance, if we train a model using AdaDp with and on a dataset of examples, Theorem 1 bounds the privacy cost by choosing the optimal and implies that the model achieves -differential privacy after 1800 training steps.

## IV Experimental Results

### IV-A Evaluation Setup

We evaluate the privacy cost, the accuracy and the computational efficiency of AdaDp compared with state-of-the-art methods: DpSgd [10], AGD [11] and DpOpt [13] on two real datasets: MNIST and CIFAR-10. We implemented all these algorithms using TensorFlow [34] with a GTX 1080Ti GPU.

MNIST is a standard dataset for handwritten digit recognition, which consists of 60,000 training examples and 10,000 testing examples. Each example is a $28 \times 28$ gray-level image. CIFAR-10 consists of 60,000 labeled examples of $32 \times 32$ RGB images. There are 50,000 training images and 10,000 testing images.

The model for the MNIST task first performs a 60-dimensional differentially private PCA (DpPCA) projection [35] and then applies a single 1,000-unit ReLU hidden layer [36]. For the CIFAR-10 task, we use a variant of AlexNet [37], which contains two convolutional layers followed by two fully connected layers and one softmax layer. The first convolutional layer includes 64 filters of size with stride 1. The layer is followed by the ReLU activation, max pooling, and local response normalization. The structure of the second convolutional layer is identical to that of the first one except that the local response normalization is performed before max pooling. The output is then flattened into a vector as the input for the following fully connected layer.

As outlined in Algorithm 1, we compute gradients for each training example. However, it is prohibitive to compute per-example gradients due to the parameter sharing scheme of convolutional layers. Since convolutional layers are shown to be well transferred [38], we pre-train the model on CIFAR-100 and initialize the network with the trained parameters. When we train it on CIFAR-10, the parameters of convolutional layers are maintained and updates happen to the fully connected layers and the softmax layers.

We use the result of Theorem 1 to calculate the privacy cost. Specifically, given , , and , at each iteration, we select ’s from and determine the smallest that satisfies (1) in Theorem 1. The privacy cost is the pair .

For AdaDp, we set , and in all experiments. For AGD, since it was only evaluated on shallow machine learning tasks in [11], its privacy computation is not suitable for deep learning as it does not consider the privacy amplification due to subsampling. Therefore, we use Theorem 1 to compute the privacy guarantee for AGD to give a fair comparison. To implement DpOpt^{1}

### IV-B Privacy Cost

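Table I: Privacy cost required by DpSgd and AdaDp to attain a pre-specified accuracy level.

Dataset | Accuracy | Privacy cost (DpSgd) | Privacy cost (AdaDp) |
---|---|---|---|
MNIST | 0.92 | 1.10 | 0.75 |
MNIST | 0.94 | 1.61 | 0.78 |
MNIST | 0.95 | 1.96 | 0.80 |
MNIST | 0.96 | 5.48 | 1.40 |
CIFAR-10 | 0.68 | 0.89 | 0.62 |
CIFAR-10 | 0.69 | 1.85 | 1.07 |
CIFAR-10 | 0.70 | 4.75 | 2.13 |
CIFAR-10 | 0.71 | 8.36 | 3.55 |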

To illustrate the trade-off between the privacy cost and the accuracy, we measured the privacy cost of AdaDp and DpSgd to attain a pre-specified accuracy level. We set the noise level and on MNIST and on CIFAR-10, where is the noise level for DpPCA [35].

Compared with DpSgd (as a representative method with a non-adaptive learning rate), AdaDp achieves an average reduction of 54% and 46% in privacy cost on MNIST and CIFAR-10 respectively. Table I summarizes the results, where and denote the minimum privacy cost required by DpSgd and AdaDp, respectively, to attain the pre-specified accuracy level. We observe that AdaDp always requires much lower privacy cost than DpSgd to achieve the same accuracy level. This is mainly due to the faster convergence and fewer training steps of AdaDp.
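The average reductions quoted above can be recomputed directly from the entries of Table I. The short script below does this arithmetic, taking the relative reduction at each accuracy level as (DpSgd cost minus AdaDp cost) divided by the DpSgd cost and then averaging per dataset; the variable and function names are chosen only for this illustration.

```python
# (privacy cost of DpSgd, privacy cost of AdaDp) per accuracy level, from Table I.
table_1 = {
    "MNIST":    [(1.10, 0.75), (1.61, 0.78), (1.96, 0.80), (5.48, 1.40)],
    "CIFAR-10": [(0.89, 0.62), (1.85, 1.07), (4.75, 2.13), (8.36, 3.55)],
}

def average_reduction(rows):
    """Mean relative reduction in privacy cost across accuracy levels."""
    return sum((dpsgd - adadp) / dpsgd for dpsgd, adadp in rows) / len(rows)

for dataset, rows in table_1.items():
    print(dataset, f"{100 * average_reduction(rows):.0f}%")
```

The per-row reductions grow with the target accuracy on both datasets, so the averages are driven mostly by the higher accuracy levels.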

### IV-C Accuracy

Furthermore, given a fixed privacy budget, we evaluate the accuracy of AdaDp, DpSgd, AGD and DpOpt on both MNIST and CIFAR-10 datasets under the high privacy level, the medium privacy level and the low privacy level respectively. In all experiments on MNIST, we set the lot size and -norm clipping bound . For AdaDp, we set the local clipping factor . The noise levels for training the fully connected layer and the DpPCA layer are set to (8, 16), (4, 7), and (2, 4), respectively, for the three experiments. When testing DpSgd and DpOpt, we set the initial learning rate to 0.1 and linearly drop it down to 0.052 after 10 epochs and then keep it constant. When applying AdaDp, we set the learning rate to 0.002 and keep it unchanged. On CIFAR-10, we set the noise level , lot size , -norm clipping bound for fully connected layers and for the softmax layer. For AdaDp, we set the local clipping factor . For DpSgd and DpOpt, we set the initial learning rate to 0.1 and apply an exponentially decaying schedule in the first experiment and set the learning rate to 0.052 in the last two experiments. For AdaDp, the learning rate is set to 0.001.

The accuracy achieved by AdaDp significantly outperforms that of DpSgd, AGD and DpOpt under all three privacy levels on both MNIST and CIFAR-10 datasets. Fig. 2 illustrates how the accuracy varies with the number of epochs and on MNIST. The gray line and the black line (please refer to the right vertical axis) denote the accumulating privacy cost for AGD and other methods respectively in terms of given a fixed . We observe that the final test accuracy of AdaDp on the MNIST achieves an increase of , and respectively compared with DpSgd, , and respectively compared with AGD, , and respectively compared with DpOpt. The results on CIFAR-10 are shown in Fig. 3. In all three settings, AdaDp achieves an accuracy increase of 3.2%, 5.3% and 4.8% respectively compared with DpSgd, , and respectively compared with AGD, , and respectively compared with DpOpt.

We now analyze how we achieve higher accuracy than each state-of-the-art method respectively. Compared with DpSgd, the performance improvement of AdaDp is achieved by both the adaptive learning rate and the adaptive noise. As to AGD, in this algorithm can only decrease gradually which cannot be reversed such that the privacy budget is consumed too fast. Generally, a few epochs of training are not enough to guarantee the convergence for deep learning models. Therefore, AGD performs poorly in deep learning. For DpOpt, on the one hand, AdaDp adopts an adaptive learning rate to improve the convergence. On the other hand, as our first numeric experiment indicates, AdaDp samples noise with higher variance for dimensions of the gradient with larger sensitivity which mitigates the influence of noise significantly. In addition, the expected variance of each noise distribution will gradually decrease as the training progresses.

### IV-D Computational Efficiency

Table II: Average processing time per iteration.

Dataset | AdaDp | DpSgd | AGD | DpOpt |
---|---|---|---|---|
MNIST | 0.5s | 0.3s | 12.0s | 15.5s |
CIFAR-10 | 1.2s | 0.9s | 23.7s | 19.1s |

In addition, we evaluate the average processing time of AdaDp, DpSgd, AGD and DpOpt per iteration, as a measure of computational efficiency. The time involves processing a whole batch where we set the batch size as 120 on MNIST and 32 on CIFAR-10 respectively. Other settings are the same as Fig. 1(a) and Fig. 2(a) respectively.

AdaDp is much more computationally efficient than both AGD and DpOpt. The results are shown in Table II. For each iteration, the average processing time of AdaDp is close to that of DpSgd and far shorter than that of the other two algorithms (e.g., 0.5s versus 12.0s and 15.5s on MNIST). Note that AGD evaluates the model multiple times per iteration to obtain the best update step size, while DpOpt needs to solve a non-convex optimization problem with as many variables as the deep learning model has parameters. In contrast to AGD and DpOpt, AdaDp adopts a heuristic to avoid heavy computation at each iteration.

### Iv-E Micro Benchmarks

Since AdaDp has two components, namely the adaptive learning rate and the adaptive noise, we study their independent contributions to the final performance in this subsection. We call the methods with only one component AdaL (only the adaptive learning rate) and AdaN (only the adaptive noise), respectively. AdaL uses the global clipping method to clip the original gradient and samples the noise for every dimension from the same Gaussian distribution. The only difference between AdaL and DpSgd is that AdaL adjusts the learning rate based on Eq. 1. As to AdaN, we implement it directly by removing the adaptive learning rate part of AdaDp. For the experiment on MNIST, the learning rate is set to 0.1 and 0.001 for AdaN and AdaL, respectively. Other settings are the same as in Fig. 1(a). On CIFAR-10, the learning rate is set to 0.05 and 0.001 for AdaN and AdaL, respectively, and other settings are the same as in Fig. 2(a).
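To make the AdaL variant concrete, the following is a minimal sketch of one update step. It combines DpSgd-style global clipping and identically distributed Gaussian noise with an adaptive learning rate. Since Eq. 1 is not reproduced in this section, the sketch assumes an RMSProp-style running average of squared gradients in its place; the names `clip_global` and `adal_step` and the parameters `decay` and `eps` are hypothetical.

```python
import math
import random

def clip_global(grad, bound):
    """Scale grad so that its L2 norm is at most bound (global clipping)."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, bound / norm) if norm > 0 else 1.0
    return [g * scale for g in grad]

def adal_step(params, per_example_grads, avg_sq, lr, bound, sigma, rng,
              decay=0.9, eps=1e-8):
    """One AdaL-like step: noisy clipped gradient plus adaptive learning rate."""
    lot = len(per_example_grads)
    clipped = [clip_global(g, bound) for g in per_example_grads]
    # Every dimension draws noise from the same N(0, (sigma * bound)^2).
    noisy = [
        (sum(g[i] for g in clipped) + rng.gauss(0.0, sigma * bound)) / lot
        for i in range(len(params))
    ]
    # Assumed stand-in for Eq. 1: RMSProp-style average of squared gradients.
    avg_sq = [decay * a + (1 - decay) * g * g for a, g in zip(avg_sq, noisy)]
    params = [p - lr * g / (math.sqrt(a) + eps)
              for p, g, a in zip(params, noisy, avg_sq)]
    return params, avg_sq

# Hypothetical usage with two per-example gradients of dimension two.
rng = random.Random(0)
params, avg_sq = adal_step([0.0, 0.0], [[3.0, 4.0], [0.3, 0.4]], [0.0, 0.0],
                           lr=0.001, bound=1.0, sigma=4.0, rng=rng)
```

Under these assumptions, replacing the last two assignments with a plain `p - lr * g` update would recover a DpSgd-style step, which mirrors the single difference between AdaL and DpSgd stated above.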

Both adaptive components contribute to the performance gain of AdaDp, but the adaptive noise component contributes more. As illustrated in Fig. 3(a) and Fig. 3(b), AdaN achieves a higher accuracy than AdaL, showing that the adaptive noise component has a more significant impact on the performance of AdaDp than the adaptive learning rate.

Since AdaDp depends on whether the estimation of is accurate for , we further conducted another experiment to verify its effectiveness under the same settings as Fig. 1(a).

The estimation given by is accurate and can reflect the changing trend of . The results are shown in Fig. 3(c), from which we observe that the distribution of (the figure below) is close to that of (the figure above) at each iteration. Also, from the ^{th} iteration to the ^{th} iteration, both the gradient value and the estimated value show a declining trend.

Additionally, as mentioned above, since the expected sensitivity of the gradient decreases as the training progresses, each will also decrease gradually. We select the in AdaDp at the ^{th} iteration and the ^{th} iteration respectively to illustrate this property. All the settings are the same as in Fig. 1(a).

The standard deviation of most noise distributions for each coordinate of the gradient gradually decreases as the training progresses. As shown in Fig. 3(d) (the area of each circle is proportional to the value of ), we observe that most at the ^{th} iteration (the figure below) are more concentrated in the range less than 10 while there are many distributed from 10 to 20 at the ^{th} iteration (the figure above). Therefore, the noise distributions in AdaDp are adaptive not only to each dimension of the gradient, but also to different iterations during training.

### Iv-F Impact of Parameters

In this set of experiments on the MNIST dataset, we study the impact of the learning rate, the local clipping factor, the noise level and the lot size on the accuracy. In all experiments, unless otherwise specified, we set the learning rate , lot size , local clipping factor , -norm clipping bound , noise level , and privacy level .

Learning Rate. As shown in Fig. 4(a), the accuracy stays consistently above 93%, irrespective of the learning rate ranging from to . When the learning rate is lower than , the accuracy increases with the learning rate. When it is higher than , a higher learning rate results in a lower accuracy.

Local Clipping Factor. The local clipping factor controls the scale of each dimension of the gradient. Note that the estimation is not equal to ; a smaller drops more information of the gradient since some dimensions will be clipped to or . Fig. 4(b) shows that the performance of AdaDp is relatively robust to the local clipping factor and that a reasonable setting for is a value slightly greater than one. However, due to the adaptive noise scheme of AdaDp, a large value of will not raise the noise level drastically (especially because, when the gradient is close to 0 as the training converges, will also be close to 0 whatever value takes).
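The effect of the local clipping factor can be sketched as a per-coordinate clip. This is a minimal illustration that assumes each coordinate is bounded by the factor times a per-coordinate magnitude estimate; the names `clip_local`, `estimate` and `beta` are hypothetical and stand in for the symbols used in the paper.

```python
def clip_local(grad, estimate, beta):
    """Clip each coordinate g_i to the interval [-beta * e_i, beta * e_i].

    Assumption: estimate[i] approximates the magnitude of grad[i], and beta
    plays the role of the local clipping factor.
    """
    clipped = []
    for g, e in zip(grad, estimate):
        limit = beta * abs(e)
        clipped.append(max(-limit, min(limit, g)))
    return clipped

# Hypothetical usage: the first two coordinates exceed their bounds.
clipped = clip_local([1.0, -2.0, 0.1], [0.5, 1.0, 0.25], beta=1.5)
# clipped == [0.75, -1.5, 0.1]
```

In this sketch a smaller `beta` clips more coordinates and so discards more gradient information, while an estimate close to 0 keeps the bound close to 0 regardless of `beta`, which is consistent with the observation in the paragraph.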

Noise Scale. The noise scale determines the amount of Gaussian noise added to the update term at each step. Although a smaller noise scale mitigates the effect of noise, it allows fewer training steps under the same privacy budget, which makes it hard for the model to converge. Meanwhile, a larger noise scale allows more training steps, but excessive noise will ruin the original gradient. Results achieved with different noise scales are shown in Fig. 4(c), where our model attains its best performance with . DpSgd reaches its highest accuracy at [10], whereas AdaDp prefers a much larger value of . We attribute this to the adaptive noise, which alleviates the impact of the differential privacy mechanism, and to the fact that a larger noise scale yields more training steps.
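The trade-off between the noise scale and the number of training steps can be made concrete with a worked relation. As a hedged illustration, assume that the asymptotic moments-accountant bound known for DpSgd-style training applies and that training stays in the regime where this bound is valid. With sampling ratio $q$, number of steps $T$ and privacy parameters $(\epsilon, \delta)$, the guarantee holds when the noise scale $\sigma$ satisfies the first inequality below for some constant $c_2$. The symbols $q$, $T$ and $c_2$ are notation introduced here for illustration. Rearranging gives the number of steps that a fixed privacy budget affords:

$$
\sigma \ge c_2 \, \frac{q \sqrt{T \log(1/\delta)}}{\epsilon}
\quad \Longrightarrow \quad
T \le \frac{1}{\log(1/\delta)} \left( \frac{\sigma \, \epsilon}{c_2 \, q} \right)^{2}
$$

Under this assumption, doubling the noise scale roughly quadruples the number of affordable steps while making every individual step noisier, which is the tension that the experiment in Fig. 4(c) explores.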

Lot Size. The lot size controls the sampling ratio. A large lot size yields a higher sampling ratio and reduces the number of training steps. If the noise intensity is fixed, a smaller