Practical and Bilateral Privacy-preserving Federated Learning


Abstract

Federated learning, an emerging paradigm for training neural networks in a distributed manner without collecting raw data, has attracted widespread attention. However, almost all existing research on federated learning only considers protecting the privacy of clients, without preventing model iterates and final model parameters from leaking to untrusted clients and external attackers. In this paper, we present the first bilateral privacy-preserving federated learning scheme, which protects not only the raw training data of clients, but also the model iterates during the training phase as well as the final model parameters. Specifically, we present an efficient privacy-preserving technique to mask or encrypt the global model, which not only allows clients to train over the noisy global model, but also ensures that only the server can obtain the exact updated model. Detailed security analysis shows that clients can access neither the model iterates nor the final global model; meanwhile, the server cannot obtain the raw training data of clients from the additional information used for recovering the exact updated model. Finally, extensive experiments demonstrate that the proposed scheme achieves model accuracy comparable to traditional federated learning without incurring much extra communication overhead.

Appendix Abstract

The appendix contains the proofs for the noisy outputs of forward propagation in CNNs and the security analysis for the raw training data of clients in the proposed PBPFL. Then, an extended PBPFL is introduced to further improve the privacy protection of the raw training data. Finally, an additional performance comparison between our PBPFL and traditional FL, e.g., FedAvg, is also presented.


1 Introduction

With the continued emergence of privacy breaches and data abuse Wikipedia (2018), data privacy and security issues increasingly impede the flourishing development of deep learning Yang et al. (2019). In order to address the privacy concerns of users, federated learning (FL) McMahan et al. (2017) has recently been presented as a promising solution, where many clients collaboratively train a shared global model under the orchestration of a central server, while ensuring that each client's raw data is stored locally and never exchanged or transferred. Based on the type of clients, FL is divided into two settings Kairouz et al. (2019): cross-device FL, where clients are mobile or edge devices, and cross-silo FL, where clients are relatively reliable organizations (e.g., medical or financial institutions). In this paper, we focus on the challenges faced in cross-silo FL, which has received great interest recently.

Recently, many FL algorithms McMahan et al. (2017); Bonawitz et al. (2017) and corresponding variants Li et al. (2019c, b) have been developed to show their potential value in applications such as healthcare Li et al. (2019b) and vehicle-to-vehicle communication Samarakoon et al. (2018). Almost all existing research focuses on protecting local training data from being disclosed through the uploaded gradients, but does not consider protecting the intermediate iterates and the final model parameters. In fact, in cross-silo FL, clients (e.g., medical or financial institutions) contribute their data and consume resources (e.g., computation and communication) to collaboratively train a shared global model in order to gain profit Liu et al. (2016). Thus, it is necessary to protect the global model from being disclosed to external entities. In addition, due to competition among clients, it is impossible to ensure that all clients are trusted and not curious about the training data of others. In other words, they may try to obtain additional information from the intermediate iterates during training or from the final model parameters. Therefore, it is meaningful and urgent to design an FL solution that further protects the intermediate iterates and the final model parameters.

Although some existing privacy-preserving techniques, such as differential privacy Dwork et al. (2006) and homomorphic encryption Phong et al. (2018), seem to be alternatives, they cannot address the above problem well. For example, differential privacy usually involves a trade-off between accuracy and privacy, and cannot maintain the sparsity of the model updates Kairouz et al. (2019). Homomorphic encryption comes at the expense of high computation and communication overhead. Furthermore, as described in Yang et al. (2019), existing homomorphic encryption can only handle bounded polynomial operations and cannot effectively deal with the non-linear activation functions used in deep learning.

As stated in Kairouz et al. (2019), it remains a big challenge to effectively protect intermediate iterates and final model parameters in traditional FL. In this paper, we present the first Practical and Bilateral Privacy-preserving Federated Learning (PBPFL) scheme, whose main contributions are three-fold:

(1) We present a new privacy-preserving technique to encrypt intermediate iterates and final model parameters, which allows clients to train model updates on noisy intermediate iterates. The technique is versatile and applicable to most state-of-the-art models. More importantly, it ensures that only the server can obtain accurate model updates.

(2) Security analysis demonstrates that, while retaining the privacy guarantees of traditional FL, any honest-but-curious client can obtain neither the intermediate iterates nor the local training data of others during training, even when colluding with some other clients. After the training completes, clients can only use the final model (i.e., obtain correct predictions), but cannot learn the model parameters.

(3) Extensive experiments conducted on real-world data demonstrate the effectiveness of our scheme compared with the traditional FL, and the efficiency of computation and communication.

2 Preliminaries and Problem Statement

In this section, we first outline the concept of cross-silo FL and the Hadamard product. After that, we state the system model, threat model and design goals.

2.1 The Cross-Silo FL

In the cross-silo FL Kairouz et al. (2019), clients are different organizations (e.g., medical or financial institutions), the network connection is relatively stable, and the network bandwidth is relatively large. That is, all clients are always available and can afford a relatively large communication cost. Thus, the cross-silo FL allows all clients to join each iteration.

Formally, consider FL with $K$ clients, where the $k$-th client has the local training dataset $\mathcal{D}_k = \{(x_i, y_i)\}_{i=1}^{n_k}$, where $x_i$ and $y_i$ are the feature vector and the ground-truth label vector, respectively. Thus, the cross-silo FL aims to solve the optimization problem McMahan et al. (2017); Li et al. (2019c):

$$\min_{w} F(w) = \sum_{k=1}^{K} \frac{n_k}{n} F_k(w), \qquad (1)$$

where $n = \sum_{k=1}^{K} n_k$ is the total sample size, and $F_k(w)$ is the local objective of the $k$-th client such that

$$F_k(w) = \frac{1}{n_k} \sum_{(x_i, y_i) \in \mathcal{D}_k} \ell(w; x_i, y_i), \qquad (2)$$

where $\ell(\cdot)$ is the specific loss function. In this paper, we adopt the mean square error (MSE) loss function

$$\ell(w; x_i, y_i) = \frac{1}{2} \left\| y_i - \hat{y}_i \right\|_2^2, \qquad (3)$$

where $\|\cdot\|_2$ is the $\ell_2$ norm of a vector, and $\hat{y}_i$ is the prediction.

In general, this optimization problem can be handled with stochastic gradient descent (SGD). Thus, each client first computes local gradients by adopting the SGD technique and returns them to the server for aggregation and updating:

$$w_{t+1} = w_t - \eta \sum_{k=1}^{K} \frac{n_k}{n} g_k^t, \qquad (4)$$

where $g_k^t = \nabla F_k(w_t)$ is the local gradient on the local data of the $k$-th client at the current model $w_t$, and $\eta$ is the learning rate.
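To make the aggregation rule concrete, the following minimal Python sketch implements the weighted averaging and gradient step of Eq. (4). It is only an illustration of the update rule; the function name `server_update` and the toy values are ours, not part of the original scheme.

```python
import numpy as np

def server_update(w, client_grads, client_sizes, lr=0.1):
    """One update of Eq. (4): weighted average of the local gradients
    followed by a gradient-descent step with learning rate lr."""
    n = sum(client_sizes)
    agg = sum((n_k / n) * g_k for g_k, n_k in zip(client_grads, client_sizes))
    return w - lr * agg

# Toy usage: three clients with different sample counts.
w = np.zeros(4)
grads = [np.ones(4), 2 * np.ones(4), -np.ones(4)]
sizes = [10, 30, 60]
w = server_update(w, grads, sizes)
print(w)
```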

2.2 Hadamard Product

The Hadamard product Horn and Johnson (2012) takes two matrices of the same dimensions and produces another matrix of the same dimension as the operands.

Definition 1

For two matrices $A$ and $B$ of the same dimension $m \times n$, the Hadamard product $A \circ B$ is a matrix of the same dimension as the operands, with elements given by $(A \circ B)_{ij} = A_{ij} B_{ij}$.

Besides, two properties of the Hadamard product used in our scheme are given as follows:

(1) For any two matrices $A$ and $B$ of the same dimension and a diagonal matrix $D$, we have $D(A \circ B) = (DA) \circ B = A \circ (DB)$.

(2) For any two vectors $a$ and $b$, we have $a \circ b = D_a b$, where $D_a$ is the corresponding diagonal matrix with the vector $a$ as its main diagonal.
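Both properties can be checked numerically. The following NumPy snippet is a small sanity check of the stated identities; the variable names are ours.

```python
import numpy as np

A = np.arange(6, dtype=float).reshape(2, 3)
B = np.arange(6, 12, dtype=float).reshape(2, 3)
d = np.array([2.0, -3.0])
D = np.diag(d)

# Property (1): a diagonal matrix distributes over the Hadamard product.
assert np.allclose(D @ (A * B), (D @ A) * B)
assert np.allclose(D @ (A * B), A * (D @ B))

# Property (2): a Hadamard product of two vectors equals the diagonal
# matrix of one vector times the other vector.
a = np.array([1.0, 4.0, -2.0])
b = np.array([3.0, 0.5, 5.0])
assert np.allclose(a * b, np.diag(a) @ b)
```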

2.3 Problem Statement

In this part, we introduce the system and threat models considered in this paper, and identify our design goals.

System Model

As shown in Fig. 1, our system model mainly includes two components: a server and a number of clients, where the corresponding roles are described as follows:

The server is responsible for initializing the model and assisting clients in training the global model. Particularly, in order to protect the privacy, the server sends the noisy models to clients for training.

Clients have their own local training data and want to collaboratively train a global model. Specifically, each client computes local gradients with their own data and the noisy model received from the server, and then returns noisy local gradients to the server for aggregating and updating.

Figure 1: System architecture of the proposed scheme.

Threat Model and Design Goals

Similar to Kairouz et al. (2019), we assume the server and clients are honest-but-curious Hitaj et al. (2017); Phong et al. (2018), which means that they honestly follow the underlying scheme but attempt to infer other entities’ private data independently. Particularly, we allow each client to collude with multiple clients to obtain the most offensive capabilities. Besides, an eavesdropper is also considered, who tries to infer private information from the data observed by eavesdropping.

From the above, the design goals of our PBPFL mainly include the following three aspects:

(1) Functionality: Each client can only train models over noisy intermediate iterates; meanwhile, only the server can recover the exact model updates during the training phase and the true final model parameters during the inference phase.

(2) Privacy-preservation: Clients know neither local training data of others nor intermediate iterates and the final model parameters. Meanwhile, the server cannot obtain the training data of clients from the received information.

(3) Efficiency: The proposed scheme should minimize extra computation and communication overhead without reducing model accuracy.

3 Proposed Scheme

In this section, we introduce our PBPFL scheme, which mainly includes four parts: parameter perturbation (server-side), noisy gradient computation (client-side), model update (server-side), and data processing (server-side).

3.1 Parameter Perturbation

The parameter perturbation is performed by the server, which takes model parameters and random noises as inputs, and outputs noisy model parameters. In order to facilitate the understanding of our encryption method, we give a simple case of a multiple layer perceptron (MLP) with ReLU non-linear activation. Then, we show that this method can be easily applied to state-of-the-art models such as ResNet He et al. (2016) and DenseNet Huang et al. (2017).

Multiple Layer Perceptron

Consider an MLP with $L$ layers, where the parameter and the output of the $l$-th layer are denoted as $W^l \in \mathbb{R}^{d_l \times d_{l-1}}$ and $z^l \in \mathbb{R}^{d_l}$, respectively, and $d_l$ is the number of neurons in the $l$-th layer. Specifically, $z^l$ is computed as

$$z^l = \mathrm{ReLU}(W^l z^{l-1}), \quad l = 1, \dots, L. \qquad (5)$$

Note that when $l = 1$, $z^0$ is the input of the neural network, i.e., $z^0 = x$.
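As a reference for the noisy computations below, the following NumPy sketch implements the plain forward pass of Eq. (5). For simplicity it applies ReLU in every layer, including the last one, and all names and sizes are illustrative.

```python
import numpy as np

def relu(v):
    return np.maximum(v, 0.0)

def mlp_forward(weights, x):
    """Plain forward pass of Eq. (5): z^l = ReLU(W^l z^{l-1}), with z^0 = x."""
    z = x
    for W in weights:
        z = relu(W @ z)
    return z

# Toy 3-layer MLP with layer widths 4 -> 5 -> 3 -> 2.
rng = np.random.default_rng(0)
weights = [rng.standard_normal((5, 4)),
           rng.standard_normal((3, 5)),
           rng.standard_normal((2, 3))]
x = rng.standard_normal(4)
print(mlp_forward(weights, x))
```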

1. Parameter encryption. In order to encrypt the model parameters, the server first randomly selects a multiplicative noise vector for each layer, an additive noise vector with pairwise distinct components, and a random coefficient. Then, the server computes

(6)

where the noise terms satisfy

(7)
(8)

where the subscripts in Eq. (7) and the quantities in Eq. (8) satisfy the corresponding range constraints.

Finally, the server sends the noisy parameters together with the corresponding noisy vector to each client for local training.

2. Forward propagation. In order to facilitate the understanding of our parameter perturbation method, we introduce the forward propagation in advance. According to Eq. (5), each client computes the noisy output of each layer with the received noisy parameters and the sample, i.e.,

(9)

In the following lemma, we present the important relations between the noisy outputs and the true outputs.

Lemma 1

For any layer $l$, the noisy output vector and the true output vector satisfy the following relationships:

(10)
(11)

where and .

Proof.

Based on Eq. (7), we can deduce the stated relations among the noise terms, where $\mathbf{1}$ denotes the matrix whose entries are all 1s.

First, we prove Eq. (10) by induction on the layer index: the base case follows from the condition on the first-layer noise, and the inductive step follows from the same condition applied to layer $l$, assuming the claim holds for layer $l-1$.

Thus Eq. (10) is proved, and Eq. (11) follows by applying the same argument to the last layer.
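The key fact behind Lemma 1 is that ReLU commutes with positive coordinate-wise scaling, so multiplicative noise propagates cleanly through the layers. The following NumPy sketch checks this generic fact for a single hidden layer under a simple compensating construction (scale the weights by the new noise and divide by the previous layer's noise); it is an illustration of the principle only, not a reproduction of the exact masking in Eqs. (6)-(8).

```python
import numpy as np

def relu(v):
    return np.maximum(v, 0.0)

rng = np.random.default_rng(1)
W = rng.standard_normal((3, 4))         # true layer weights
z_prev = relu(rng.standard_normal(4))   # true output of the previous layer

r_prev = rng.uniform(0.5, 2.0, size=4)  # positive noise of the previous layer
r_cur = rng.uniform(0.5, 2.0, size=3)   # positive noise of the current layer

# Masked weights: absorb the previous layer's noise, inject fresh noise.
W_noisy = np.diag(r_cur) @ W @ np.diag(1.0 / r_prev)

z_noisy_prev = r_prev * z_prev           # noisy input seen by the client
z_noisy = relu(W_noisy @ z_noisy_prev)   # client-side noisy forward step

# Positive scaling commutes with ReLU, so the noisy output is the true
# output scaled coordinate-wise by the current layer's noise.
assert np.allclose(z_noisy, r_cur * relu(W @ z_prev))
```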

Convolutional Neural Networks

Convolutional Neural Networks (CNN) have proven to be effective in many computer vision and natural language processing tasks. In this section, we continue to demonstrate how to apply our encryption method to CNNs.

Although the convolution operation can be applied to inputs of any dimension, we focus on 3-dimensional inputs since we are most concerned with image data. For any input $X \in \mathbb{R}^{C \times W \times H}$, where $C$, $W$ and $H$ are the number of channels, the width and the height, respectively, the convolution operation, denoted by $\ast$, is defined as

$$Y = K \ast X,$$

where $K$ is a set of filters with each filter of shape $C \times w \times h$, and $Y$ is the output image.

From the above, a convolutional layer can be implemented as a convolution operation followed by a non-linear function, and a CNN can be constructed by interweaving several convolutional and spatial pooling layers. Finally, the CNN ends with a fully connected layer for regression or classification tasks. In order to apply our encryption method to CNNs, we choose ReLU as the non-linear function and MaxPooling as the spatial pooling layer. In fact, Ma and Lu (2017) proved the equivalence of convolutional and fully connected operations. Therefore, the multiplicative noise introduced for the MLP does not affect the operation of convolutional layers, and can be adopted with only small alterations.
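The same commutation argument holds channel-wise for convolutions, because each output channel is linear in the input channels. The following PyTorch sketch checks this under an assumed per-channel masking (divide each filter by the input-channel noise and multiply by the output-channel noise); it illustrates the principle rather than the exact construction of Eqs. (12)-(15).

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
x = torch.randn(1, 3, 8, 8)          # (batch, channels, height, width)
filters = torch.randn(6, 3, 3, 3)    # 6 output channels, 3x3 kernels

r_in = torch.rand(3) + 0.5           # positive noise on the input channels
r_out = torch.rand(6) + 0.5          # positive noise on the output channels

# Scale each filter: cancel the input-channel noise, inject output-channel noise.
noisy_filters = filters / r_in.view(1, 3, 1, 1) * r_out.view(6, 1, 1, 1)

noisy_x = x * r_in.view(1, 3, 1, 1)
noisy_out = F.relu(F.conv2d(noisy_x, noisy_filters, padding=1))
true_out = F.relu(F.conv2d(x, filters, padding=1))

# The noisy output equals the true output scaled per output channel.
assert torch.allclose(noisy_out, true_out * r_out.view(1, 6, 1, 1), atol=1e-5)
```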

Figure 2: Common connections in neural networks. can be element-wise summation, element-wise multiplication or concatenation. In this paper we choose concatenation.

Section 3.1.1 mainly discusses the “one-path” connection, i.e., the input of the $l$-th layer is exactly the output of the $(l-1)$-th layer. Most state-of-the-art CNN models, e.g., ResNet He et al. (2016) and DenseNet Huang et al. (2017), however, utilize residual connections in their structures, i.e., the $l$-th layer takes the outputs of multiple preceding layers as input, as illustrated in Fig. 2. There are various kinds of residual connections; for simplicity, we choose to connect the preceding outputs over the channel dimension.

1. Parameter encryption. Similar to Section 3.1.1, the server first randomly selects a multiplicative noise vector for each layer, an additive noise vector with pairwise distinct components, and a random coefficient. Then, the server computes

(12)

where the noise for each layer satisfies

(13)

where the concatenated noise vector is formed from the noise vectors of all layers connected with the $l$-th layer (concatenated along the channel dimension), and the remaining noise terms satisfy

(14)
(15)

where and .

It is worth noting that, compared with the noise vector of the MLP, we can ignore the spatial dimensions (i.e., height and width) and simply impose the same noise value across them.

2. Forward propagation. Similar to the MLP, each client computes the noisy output of each layer as follows.

(1) For each convolutional layer, the corresponding noisy output for each channel is a matrix whose elements are represented as

(16)

(2) For the last fully connected layer, the final noisy prediction vector is computed as

(17)

where the flattening operation expands the feature maps into a vector along the channel, height and width dimensions.

Since the derivations of the outputs in the CNN are similar to those of the MLP, we omit the derivations of Eqs. (16) and (17) here due to space limitations; the details can be found in Appendix A.

3.2 Noisy Gradient Computation

Generally, the local training performed by each client includes forward propagation and backward propagation. Since forward propagation has been introduced in Section 3.1, we only show backward propagation here, i.e., the noisy gradient computation. More specifically, after obtaining the noisy outputs of each layer in the forward propagation, the client calculates the corresponding gradients based on the MSE loss function (see Eq. (3)). Considering our encrypted model, for each sample, from Lemma 1, the noisy MSE is

In what follows, we give the relation between the noisy gradients and the true gradients.

Lemma 2

For any layer $l$, the noisy gradient matrix and the true gradient matrix satisfy

(18)

where , and .

Proof.

Based on the noisy MSE, we have

Thus, we can derive that

Note that the two noisy terms can be computed directly by the clients. For all samples in the mini-batch dataset, the $k$-th client computes the average gradients and the corresponding average noisy items. Finally, the $k$-th client returns the averaged noisy gradients together with the two noisy items to the server.
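For concreteness, the following PyTorch sketch shows the generic client-side step of this section: a forward/backward pass on the received (noisy) model with the MSE loss of Eq. (3), returning the averaged gradients. The two scheme-specific noisy items that a client additionally returns for gradient recovery are omitted here, and all names are illustrative.

```python
import torch
import torch.nn as nn

def client_noisy_gradients(noisy_model: nn.Module, batch_x, batch_y):
    """Run one forward/backward pass on the noisy model with the MSE loss
    and return the gradients averaged over the mini-batch."""
    noisy_model.zero_grad()
    preds = noisy_model(batch_x)
    loss = nn.functional.mse_loss(preds, batch_y)
    loss.backward()
    return [p.grad.detach().clone() for p in noisy_model.parameters()]

# Toy usage: a small MLP stands in for the noisy model received from the server.
model = nn.Sequential(nn.Linear(10, 16), nn.ReLU(), nn.Linear(16, 1))
x = torch.randn(32, 10)
y = torch.randn(32, 1)
grads = client_noisy_gradients(model, x, y)
```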

3.3 Model Update

After receiving the noisy local gradients from all clients, the server can recover the true model updates for the next iteration (i.e., the $(t+1)$-th iteration) by Lemma 2.

(1) Gradient recovery. For the noisy local gradients of the $k$-th client, the server computes, for each layer,

(19)

(2) Parameter aggregation. Based on Eq. (4), the server aggregates all clients’ recovered local gradients to update the global model parameters for the $(t+1)$-th iteration.

3.4 Data Processing

The server and all clients interactively iterate the processes in Sections 3.1-3.3 until convergence. Consequently, the server obtains the final model parameters. In order to protect them while still allowing each client to use the model, the server needs to encrypt the final parameters as well. Specifically, the operations are similar to those in Section 3.1; the only difference is that, in the last layer, the server does not apply the additive noises. Thus, the noisy model parameters are computed as before but without the last-layer additive terms.

Specifically, the multiplicative noise satisfies Eq. (7) for the MLP, and Eqs. (13) and (14) for the CNN. Obviously, without the influence of the additive noises, based on Lemma 1 and Eq. (17), it can be verified that the final prediction equals the true prediction.

4 Security Analysis

Based on design goals, we analyze the security properties of our scheme in this section.

(1) Privacy-preservation of model parameters. Whether during training or data processing, clients only possess the noisy model parameters. Since the noise matrices and vectors are randomly chosen as the secret keys, the server keeps them, together with the original model parameters, secret from the clients. For given noisy parameters, there exist infinitely many pairs and triples of noise terms and true parameters that produce exactly the same noisy parameters. Therefore, clients cannot obtain the exact model parameters, even in collusion with others.
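The non-identifiability of the multiplicative part of the mask is easy to verify numerically: any positive mask explains a given noisy matrix equally well. The following NumPy snippet illustrates only this multiplicative component, using our own notation.

```python
import numpy as np

rng = np.random.default_rng(2)
W_true = rng.standard_normal((4, 4))
R_true = rng.uniform(0.5, 2.0, size=(4, 4))
W_noisy = R_true * W_true       # what a client observes (multiplicative part only)

# Any other positive mask explains the same observation equally well.
R_alt = rng.uniform(0.5, 2.0, size=(4, 4))
W_alt = W_noisy / R_alt
assert np.allclose(R_alt * W_alt, W_noisy)
assert not np.allclose(W_alt, W_true)
```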

(2) Privacy-preservation of the true prediction in each iteration: As shown in Section 3.1, the client computes the noisy prediction vector from the true prediction vector. Part of the noise is known to the client, while the rest is chosen randomly by the server and is unknown to the client.

Lemma 3

For any two components of the noisy prediction vector, there exist infinitely many noise triples that are consistent with the client's observation and satisfy

(20)
(21)

Similarly, there exist infinitely many triples that are consistent with the observation and for which Eq. (21) holds.

Proof.

We only prove the first conclusion since the proof of the second conclusion is completely similar to the first one. Without loss of generality, we assume and , and let be any real number such that


We take the noise values accordingly; then Eq. (21) holds, which completes the proof. According to Lemma 3, for any two components of the noisy prediction, the clients cannot determine which of the corresponding true predictions is larger. Thus the clients cannot identify the largest component of the true prediction, i.e., they cannot obtain the true prediction.

From the above, there is no doubt that each client also cannot know the true local gradients. In other words, each client can obtain neither the intermediate iterates nor the final model parameters, let alone the training data of others.

(3) Privacy-preservation of the raw training data of clients: In order to help the server recover the true gradients, each client returns additional noisy terms to the server. Since these terms are still combinations of related gradients, they typically contain significantly less information than the raw training data. Besides, they are also affected by the ReLU non-linear activation, so they become complicated non-linear functions of the raw data. Similar to the traditional FL, recovering the raw data from the additional noisy terms is as difficult as recovering it from the gradients. In other words, our PBPFL provides clients with the same level of privacy protection as the traditional FL.

In fact, our PBPFL can further improve the privacy of local training data by adopting a secret sharing technique. For example, the clients agree on a set of random values that sum to zero and are unknown to the server. The $k$-th client holds one of these values and, after computing its averaged gradients and the two noisy items, adds this value to them. The server can then only obtain the aggregated results rather than the individual gradient of each client, as sketched below. Since our main goal is to achieve the privacy-preservation of intermediate iterates and final model parameters, we do not delve into this approach, but will improve our PBPFL along this line in future work.
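A minimal sketch of this zero-sum masking idea is given below (our own simplified construction, not the full extension): the masks cancel in the aggregate, so the server recovers the correct sum without seeing any individual upload in the clear.

```python
import numpy as np

def zero_sum_masks(num_clients, shape, rng):
    """Random masks that sum to zero: individual uploads are hidden,
    while the server's aggregate is unaffected."""
    masks = [rng.standard_normal(shape) for _ in range(num_clients - 1)]
    masks.append(-sum(masks))
    return masks

rng = np.random.default_rng(3)
grads = [rng.standard_normal(5) for _ in range(4)]   # true per-client gradients
masks = zero_sum_masks(4, 5, rng)

uploads = [g + s for g, s in zip(grads, masks)]      # what each client sends
assert np.allclose(sum(uploads), sum(grads))         # the aggregate is exact
```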

5 Performance Evaluation

We empirically evaluate our PBPFL algorithm on real-world datasets from two different perspectives: effectiveness (i.e., how well our algorithm performs on these datasets) and efficiency (i.e., how much extra computation and communication cost our method incurs).

5.1 Experimental Setup

We implement our methods based on the native network layers in PyTorch Paszke et al. (2019), running on a single Tesla M40 GPU. We adopt FedAvg McMahan et al. (2017) as the baseline algorithm for comparison. In all experiments, the number of training epochs and the batch size of each client are set to 200 and 32, respectively.

Datasets and Metrics. We evaluate our method on three privacy-sensitive datasets covering both the bank and medical scenarios.

(1) UCI Bank Marketing Dataset Moro et al. (2014) (UBMD) is related to direct marketing campaigns of a Portuguese banking institution and aims to predict the likelihood of clients subscribing to deposits. It contains instances of multi-dimensional bank data. Following conventional practice, we split the dataset into training/validation/test sets by 8:1:1. We adopt MSE as the evaluation metric.

(2) APTOS Blindness Detection (ABD) consists of separate training and test sets of retina images for predicting the severity of diabetic retinopathy. For preprocessing, we center-crop the images and resize them to a fixed size. Again, we use MSE as the evaluation metric.

(3) Lesion Disease Classification Tschandl et al. (2018); Codella et al. (2019) (LDC) provides separate training and test sets of skin images for the classification of lesion diseases. We downsample the images to a fixed resolution and adopt classification accuracy as the evaluation metric.

5.2 Empirical Results

We evaluate the training accuracy of our algorithm against native FedAvg on both regression and classification tasks. Besides, we present the computation and communication overhead of the basic building blocks in PBPFL.

Regression

We evaluate the performance of PBPFL on regression tasks with UBMD and ABD. For a more comprehensive comparison, we train ResNet20 on ABD and MLPs with 3, 5, and 7 layers on UBMD. Also, we evaluate the performance with 1, 5, and 10 clients on both datasets. Table 2 shows the MSE of the final converged models on the test sets.

From the table, the accuracy of PBPFL closely aligns with that of FedAvg under various settings, which verifies our derivation in Section 3. Although some operations (e.g., dividing by random vectors) may cause precision errors, we find that they have little impact on the training procedure.

Dataset  Method  Model     k=1    k=5    k=10
UBMD     FedAvg  MLP-3     0.059  0.079  0.097
UBMD     FedAvg  MLP-5     0.059  0.079  0.100
UBMD     FedAvg  MLP-7     0.058  0.086  0.113
UBMD     PBPFL   MLP-3     0.060  0.078  0.097
UBMD     PBPFL   MLP-5     0.059  0.077  0.101
UBMD     PBPFL   MLP-7     0.059  0.082  0.114
ABD      FedAvg  ResNet20  0.048  0.085  0.117
ABD      PBPFL   ResNet20  0.047  0.088  0.114
Table 2: MSE results for regression tasks (k is the number of clients). Lower MSE means better performance.

Classification

In this section, we evaluate PBPFL on the classification task with ResNet20 models. Note that the MSE loss is not primarily designed for classification tasks; thus, we evaluate the accuracy of FedAvg with both the MSE loss (denoted as FedAvg-MSE) and the cross-entropy loss (denoted as FedAvg-CE) as baselines. The accuracy of the converged models on the test set is shown in Tab. 3.

Similar to the regression tasks, the accuracy of PBPFL closely aligns with that of FedAvg-MSE. Compared to FedAvg-CE, which adopts a more suitable loss for classification, PBPFL suffers an acceptable accuracy loss, i.e., a decrease of 0.67, 3.2, and 2.11 points when the number of clients is 1, 5, and 10, respectively.

Dataset  Method      k=1    k=5    k=10
LDC      FedAvg-CE   66.90  64.93  63.38
LDC      FedAvg-MSE  66.24  61.68  61.44
LDC      PBPFL       66.23  61.74  61.27
Table 3: Accuracy (%) results for the classification task (k is the number of clients).

Communication and Computation

We dive into each component of PBPFL and compare the computation and communication overheads of our method with those of FedAvg. The experiments are conducted on ResNet56 with fixed input size, batch size, and number of iterations.

Tab. 5 shows the comparison of computational cost between PBPFL and FedAvg, where PP, LFP, LBP, and MU stand for parameter perturbation, local forward propagation, local backward propagation, and model update, respectively. From the table, we can see that, compared to FedAvg, the computational cost of PBPFL approximately doubles, which is mainly caused by the backward propagation on the client side (i.e., from 91.08 to 228.06 seconds in total). This is due to the computation of the additional noisy terms for gradient recovery. The increased cost of PBPFL on the server side is basically negligible, indicating that the server can afford the FL workload as usual.

Method  Role    PP    LFP   LBP     MU     Total
FedAvg  Client  0     2.19  88.89   0      91.08
FedAvg  Server  0     0     0       89.11  89.11
PBPFL   Client  0     2.20  225.86  0      228.06
PBPFL   Server  3.66  0     0       92.39  96.05
Table 5: Computational overhead of PBPFL and FedAvg (seconds).

The communication overhead mainly involves two interactions: the server sends the noisy parameters together with a noisy vector to the clients, and each client returns its local noisy gradients together with two extra noisy items. Obviously, compared to FedAvg, the added communication cost is dominated by the two extra noisy items, while the cost of the additional noisy vector is negligible. Therefore, theoretical analysis shows that the additional communication grows linearly with the size of the model parameters. The experiments also confirm this. In particular, both the server-to-client and client-to-server communication overheads of FedAvg are 0.85 MB, while the server-to-client and client-to-server communication overheads of PBPFL are 0.85 MB and 2.55 MB, respectively.
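As a quick back-of-the-envelope check, if the two extra noisy items each have roughly the same size as the local gradients (an assumption consistent with the measurements above), the client upload should be about three times that of FedAvg, matching the reported 0.85 MB versus 2.55 MB.

```python
# Assumes the two extra noisy items are each roughly the size of the gradients.
fedavg_upload_mb = 0.85
pbpfl_upload_mb = fedavg_upload_mb * 3   # gradients + two extra noisy items
print(round(pbpfl_upload_mb, 2))         # 2.55, matching the reported value
```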

In summary, in order to achieve privacy preservation, we introduce a certain amount of extra computation and communication cost. Nonetheless, we keep this additional cost at a constant factor of the original cost without decreasing the model accuracy compared to the original FL.


6 Related Work

Federated learning was formally introduced by Google in 2016 Konecný et al. (2016) to address data privacy in machine learning. Then, FedAvg McMahan et al. (2017) and its theoretical analysis Li et al. (2019c) were introduced to implement and popularize FL. After that, many improvements and variants of FedAvg were developed to deal with statistical challenges Smith et al. (2017); Eichner et al. (2019); Mohri et al. (2019), communication challenges Agarwal et al. (2018); Zhu and Jin (2019); Chen et al. (2019), and privacy issues Bonawitz et al. (2017); Xu et al. (2020); Li et al. (2019b); Bonawitz et al. (2016). Considering the potential value of federated learning, many promising applications, such as healthcare Li et al. (2019b); Xu and Wang (2019), virtual keyboard prediction Ramaswamy et al. (2019); Yang et al. (), and vehicle-to-vehicle communication Samarakoon et al. (2018), have tried to adopt FL as an innovative mechanism to train a global model from multiple parties in a privacy-preserving manner.

Recently, several surveys on FL have been presented Dai et al. (2018); Yang et al. (2019); Li et al. (2019a); Kairouz et al. (2019). Specifically, Dai et al. (2018) provided an overview of the architecture and optimization approaches for federated data analysis. Yang et al. (2019) identified architectures for the FL framework and summarized general privacy-preserving techniques that can be applied to FL. Li et al. (2019a) provided a broad overview of current approaches and outlined several directions of future work in FL. Kairouz et al. (2019) outlined the classification of FL, discussed recent advances, and presented an extensive collection of open problems and challenges.

From the above, it is still a big challenge to effectively protect the intermediate iterates during the training phase and the final model parameters in FL Kairouz et al. (2019).

7 Conclusion

In this paper, we present a practical and bilateral privacy-preserving federated learning scheme, which aims to prevent model iterates and final model parameters from being disclosed. We introduce an efficient privacy-preserving technique to encrypt model iterates and final model parameters. This technique allows clients to compute model updates over the noisy current model and, more importantly, ensures that only the server can eliminate the noise to obtain accurate results. Security analysis shows the security of our PBPFL under the honest-but-curious setting. Besides, experiments conducted on real data also demonstrate the practical performance of our PBPFL.

Appendix A Proofs for the Noisy Outputs of Forward Propagation in CNN

A.1 Proof of Eq. (16)

For each convolutional layer, the corresponding noisy output for each channel is a matrix whose elements are represented as

(16)
Proof.

As defined in Section 3.1.2, a convolutional layer is implemented as a convolution operation followed by a ReLU. Due to the existence of residual connections, the $l$-th layer may be connected to a set of preceding layers. Obviously, the input of the $l$-th convolutional layer is the concatenation of the outputs of the layers in this set along the channel dimension.

For example, if the $l$-th layer is connected to three preceding layers, then its input is the channel-wise concatenation of their three outputs.