Private Deep Learning with Teacher Ensembles
Abstract
Privacy-preserving deep learning is crucial for deploying deep neural network based solutions, especially when the model works on data that contains sensitive information. Most privacy-preserving methods lead to undesirable performance degradation. Ensemble learning is an effective way to improve model performance. In this work, we propose a new method for teacher ensembles that uses more informative network outputs under differentially private stochastic gradient descent and provides provable privacy guarantees. Our method employs knowledge distillation and hint learning on intermediate representations to facilitate the training of the student model. Additionally, we propose a simple weighted ensemble scheme that works more robustly across different teaching settings. Experimental results on three common image benchmark datasets (i.e., CIFAR10, MNIST, and SVHN) demonstrate that our approach outperforms previous state-of-the-art methods on both performance and privacy budget.
Lichao Sun, Philip S. Yu
University of Illinois at Chicago
james.lichao.sun@gmail.com, psyu@uic.edu

Ji Wang
National University of Defense Technology

Yingbo Zhou, Jia Li, Richard Socher, Caiming Xiong†
Salesforce Research
{yingbo.zhou, jia.li, rsocher, cxiong}@salesforce.com

† Corresponding author
Preprint. Under review.
1 Introduction
Recent years have witnessed impressive breakthroughs of deep learning techniques in a wide variety of domains, such as image classification He et al. (2016), language processing Devlin et al. (2018), and reinforcement learning Silver et al. (2017). Many attractive applications involve training models on highly sensitive data, for example, diagnosis of diseases with medical records or genetic sequences Alipanahi et al. (2015), mobile commerce behavior prediction Yan (2017), and location-based social network activity recognition Gong et al. (2018). However, recent studies exploiting privacy leakage from deep learning models have demonstrated that private, sensitive training data can be recovered from released models Nicolas Papernot (2017). Privacy protection is therefore a critical issue in this context: sensitive data must be protected from disclosure and attack throughout the application process.
Typically, according to the accessibility of the target model, there are two types of attacks that lead to leakage of private information: white-box and black-box attacks Samangouei et al. (2018). In the white-box attack setting, the adversary has full access to the model architecture and parameters, and can even modify the data during execution. In the black-box attack setting, the adversary does not have access to the model parameters, but can repeatedly query the target model to gather data for the attack’s analysis. To protect the privacy of the training data and mitigate the effects of adversarial attacks, various privacy protection approaches have been proposed in the literature Michie et al. (1994); Nissim et al. (2007); Samangouei et al. (2018); Ma et al. (2018). The “teacher-student” learning framework with privacy constraints is of particular interest here, since it can provide a private student model without touching any sensitive data directly Hamm et al. (2016); Pathak et al. (2010); Papernot et al. (2017). The original purpose of a teacher-student framework is to transfer knowledge from the teacher model to help train a student model that achieves performance similar to the teacher's. To satisfy the privacy-preserving requirement, we add small random perturbations, backed by a privacy guarantee analysis, to the knowledge from the teacher model and use the perturbed knowledge to build a private student model. In this way, no adversary can recover the original sensitive information even with full access to the student model.
Privacy preservation is important to prevent models from disclosing the identity of an individual. However, most privacy-preserving methods incur relatively large performance/privacy trade-offs on applied tasks, which makes such techniques less attractive in practice. To alleviate this issue, we propose a simple yet effective ensemble-based method designed to work on deep neural networks trained with gradient-based methods. Our major contributions are as follows:

We propose Private Deep Learning with Teacher Ensembles (PETDL), a framework for improving the performance of privately trained deep neural networks via multilayer privacy-preserving knowledge transfer. Our method employs knowledge distillation on teacher ensembles so that more information is transferred, which leads to a faster and more accurate student model. To facilitate learning with deep neural networks, we use the representations from intermediate layers of the teacher networks as corresponding targets for the student. We empirically demonstrate that our method significantly outperforms other privacy-preserving methods, in terms of both privacy budget and absolute performance.

We also propose a weighted ensemble scheme to handle more general cases, where each teacher may be trained on different or biased datasets. We show that the proposed method is robust under different experimental settings; in particular, it works far better than simple ensembles when the training data for different teachers are biased.

We give a detailed privacy analysis of PETDL. Compared with the sample-by-sample queries to multiple teachers in previous works, the proposed mechanism uses a new batch-by-batch query mode to reduce the number of queries. To provide a provable privacy guarantee, we carefully perturb the knowledge distilled from the teacher models to satisfy differential privacy.
2 PETDL: Private Deep Learning with Ensembles of Teachers
In this section, we introduce the details of our PETDL framework, while Figure 1 gives a graphical overview of the ideas.
2.1 Training Neural Network Teachers on Disjoint Data
We denote the large sensitive data as $D$ and the corresponding labels as $Y$. To enable ensemble learning, we first partition the sensitive data into $n$ disjoint subsets $(D_i, Y_i)$ for $i = 1, \dots, n$, and train one neural network on each subset, called a teacher and denoted as $T_i$. Unlike many other learning techniques, neural network training requires abundant data to perform well, so PETDL requires a reasonable number of teachers, denoted as $n$. Next, we introduce the knowledge transfer approach in PETDL, which exploits the rich information in the neural network structure.
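The partitioning step can be illustrated with a minimal sketch. The function name `partition_disjoint`, the round-robin assignment, and the sample counts are assumptions made for this illustration; the paper does not specify how the split is implemented.

```python
import random

def partition_disjoint(num_samples, num_teachers, seed=0):
    """Shuffle sample indices and split them into disjoint, near-equal subsets."""
    indices = list(range(num_samples))
    random.Random(seed).shuffle(indices)
    # Teacher i receives every num_teachers-th shuffled index, starting at offset i,
    # so no index is ever assigned to two teachers.
    return [indices[i::num_teachers] for i in range(num_teachers)]

# Hypothetical sizes: 40,000 sensitive samples shared among 4 teachers.
subsets = partition_disjoint(num_samples=40_000, num_teachers=4)
```

Each teacher is then trained only on its own index subset, which is what later allows the privacy analysis to treat a single data item as affecting a single teacher.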
2.2 Multilayer Knowledge Transfer via Deep Neural Networks
Compared with other machine learning techniques, neural networks contain richer information due to their complex structure. Each teacher that receives an unseen sample in a query from the student can transfer more than one type of knowledge to the student, including the outputs of selected intermediate layers and the outputs of the prediction layer.
Intermediate layer learning. For many deep learning techniques, besides the distribution of prediction results, there are intermediate representations of the model, such as the hidden layers of the neural network or low-rank representations of the original dataset. These intermediate representations from the teacher models contain valuable information, which can be used as a hint to guide the training of the student model. Analogously, the intermediate outputs of the student model can be trained to mimic the corresponding representations of the teachers by minimizing the loss function below:
$$\mathcal{L}_H(x_p, o_h; \Theta_s) = \frac{1}{2} \lVert h(x_p; \Theta_s) - o_h \rVert_2^2, \tag{1}$$
where $h(\cdot\,; \Theta_s)$ represents the student model up to the intermediate outputs with the parameters $\Theta_s$, $x_p$ denotes the public dataset, and $o_h$ is the teachers' hint output over the public samples.
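As an illustration of hint learning, the sketch below computes a hint loss for one public sample. It assumes the loss is half the squared L2 distance between the student's and the teacher's intermediate outputs, a common choice for hint learning; the function name and the toy vectors are hypothetical.

```python
def hint_loss(student_hidden, teacher_hint):
    """Half the squared L2 distance between student and teacher intermediate outputs."""
    if len(student_hidden) != len(teacher_hint):
        raise ValueError("student and teacher representations must have the same size")
    return 0.5 * sum((s - t) ** 2 for s, t in zip(student_hidden, teacher_hint))

# Toy 2-dimensional representations for a single public sample.
loss = hint_loss([1.0, 2.0], [1.0, 0.0])
```

If the student and teacher layers have different widths, a small adapter layer on the student side is a usual workaround; that detail is outside this sketch.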
Prediction layer learning. We use the knowledge distillation technique to transfer the knowledge learned by multiple teacher models to the student model.
Let $c_i^t$ denote the output of the last hidden layer of the $i$th teacher. The softened probability Hinton et al. (2014) is regarded as the knowledge: $P_i^t = \mathrm{softmax}(c_i^t / \tau)$, where $\tau$ is the temperature parameter, which is usually set to 1. Unlike the normal case $\tau = 1$, the cases $\tau > 1$ increase the probabilities of the classes whose normal values are near zero, so that the relationship between the various classes is embodied as knowledge in the softened probability. The third party aggregates the teachers' softened probabilities as $P^t = \frac{1}{n} \sum_{i=1}^{n} P_i^t$, where $n$ is the number of teachers. To learn the aggregated knowledge, the student model is trained to minimize the difference between its own softened probability and the aggregated softened probability from the teachers, i.e., the knowledge distillation loss:
$$\mathcal{L}_K(x_p, P^t; \Theta_s) = \mathcal{C}(P^s, P^t, \Theta_s), \tag{2}$$
where $\Theta_s$ is the student's trainable parameters, $\mathcal{C}$ denotes the cross-entropy loss, and $P^s$ denotes the student's softened probability over the public samples $x_p$, defined as $P^s = \mathrm{softmax}(c^s / \tau)$, where $c^s$ represents the logits of the student model.
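The prediction layer transfer described above can be sketched as follows, without any privacy noise. The sketch assumes a temperature softmax for the softened probabilities, a plain average over teachers, and cross-entropy as the distillation loss; all function names and logit values are illustrative.

```python
import math

def softened_probs(logits, tau=1.0):
    """Softmax of logits / tau; tau > 1 flattens the distribution."""
    scaled = [z / tau for z in logits]
    shift = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - shift) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def aggregate_teachers(teacher_probs):
    """Average the teachers' softened probabilities class by class."""
    n = len(teacher_probs)
    num_classes = len(teacher_probs[0])
    return [sum(p[c] for p in teacher_probs) / n for c in range(num_classes)]

def distillation_loss(student_logits, teacher_logits, tau=1.0):
    """Cross-entropy between aggregated teacher and student softened probabilities."""
    target = aggregate_teachers([softened_probs(z, tau) for z in teacher_logits])
    student = softened_probs(student_logits, tau)
    return -sum(t * math.log(s) for t, s in zip(target, student))

# Two hypothetical teachers and two candidate students on one public sample.
teachers = [[5.0, 1.0, 0.0], [4.0, 2.0, 0.0]]
loss_close = distillation_loss([4.5, 1.5, 0.0], teachers, tau=2.0)
loss_far = distillation_loss([0.0, 1.5, 4.5], teachers, tau=2.0)
```

A student whose logits resemble the teachers' receives a lower loss than one that ranks the classes in the opposite order, which is the signal the student is trained on.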
2.3 Aggregation Mechanisms
The most straightforward approach is to directly average the teachers' intermediate layer or prediction layer losses on the queried samples, as shown in Algorithm 1. Since each query exposes some private information from the teachers to the student, we first bound the loss of each teacher by a threshold $B$, so that the sensitivity of the aggregated loss can be controlled. Then, we add Gaussian noise to perturb the loss.
More specifically, let $\mathcal{L}_i$ represent the general knowledge transfer loss of the $i$th teacher (i.e., the hint learning loss $\mathcal{L}_H$ or the distillation learning loss $\mathcal{L}_K$) and let $\bar{\mathcal{L}}_i$ denote its clipped loss. In line 5 and line 13 of Algorithm 1, the max value of $\lVert \mathcal{L}_i \rVert_2$ for each teacher is clipped within a given bound $B$. If $\lVert \mathcal{L}_i \rVert_2 > B$, the value of $\mathcal{L}_i$ is scaled down as
$$\bar{\mathcal{L}}_i = \mathcal{L}_i \cdot \frac{B}{\lVert \mathcal{L}_i \rVert_2}. \tag{3}$$
Otherwise, it keeps its original value. After clipping, the sensitivity of $\bar{\mathcal{L}}_i$ for each teacher is bounded by $B$, which can be preset. Note that if $B$ is set too large, the loss will be perturbed by excessive noise. On the other hand, a $B$ that is too small will overly clip the loss.
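Let $\hat{\mathcal{L}}$ be the aggregated loss, defined as the average of each teacher's bounded loss $\bar{\mathcal{L}}_i$ over the $n$ teachers. We add Gaussian noise to the aggregated loss to preserve the privacy of the sensitive data from all teachers and define the aggregation mechanism as follows:
$$\mathcal{M}(x) = \frac{1}{n} \sum_{i=1}^{n} \bar{\mathcal{L}}_i + \mathcal{N}(0, \sigma^2), \tag{4}$$
where $\mathcal{N}(0, \sigma^2)$ is a normal (Gaussian) distribution with mean 0 and standard deviation $\sigma$.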
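The clip, average, and perturb steps of the aggregation mechanism can be sketched as below. Treating each teacher's loss as a vector of per-sample values, clipping by its L2 norm, and adding independent Gaussian noise to each coordinate are assumptions of this sketch; the names `clip_loss` and `noisy_aggregate` are hypothetical.

```python
import math
import random

def clip_loss(loss_vec, bound):
    """Scale a loss vector down so that its L2 norm is at most `bound`."""
    norm = math.sqrt(sum(v * v for v in loss_vec))
    scale = min(1.0, bound / norm) if norm > 0 else 1.0
    return [v * scale for v in loss_vec]

def noisy_aggregate(teacher_losses, bound, sigma, rng):
    """Average the clipped teacher losses and add Gaussian noise with std sigma."""
    clipped = [clip_loss(loss, bound) for loss in teacher_losses]
    n = len(clipped)
    mean = [sum(loss[j] for loss in clipped) / n for j in range(len(clipped[0]))]
    return [m + rng.gauss(0.0, sigma) for m in mean]

rng = random.Random(0)
clipped_example = clip_loss([3.0, 4.0], bound=1.0)
noiseless = noisy_aggregate([[3.0, 4.0], [0.0, 0.0]], bound=1.0, sigma=0.0, rng=rng)
noisy = noisy_aggregate([[3.0, 4.0], [0.0, 0.0]], bound=1.0, sigma=0.5, rng=rng)
```

Setting `sigma=0.0` recovers the plain average of the clipped losses, which makes the effect of the clipping bound easy to inspect in isolation.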
3 Privacy Analysis
In this section, we provide the privacy analysis of the multilayer knowledge transfer and give the implementation with a privacy guarantee. The sensitive data of multiple teachers are considered as a sensitive data pool. To enforce a theoretical privacy guarantee over the sensitive data pool, the information related to the sensitive data, i.e., the knowledge distillation loss and the hint loss, is perturbed by random noise while training the student model. To provide stronger privacy protection in the training process, we introduce two new private learning techniques, each of which helps PETDL consume less privacy budget during training.
3.1 Differential Privacy Preliminaries
Differential privacy Dwork et al. (2006); Dwork (2011b) constitutes a strong standard that provides privacy guarantees for machine learning algorithms, by limiting the range of the output distribution of an algorithm facing small perturbations on its inputs:
Definition 1.
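A randomized mechanism $\mathcal{M}$ with domain $\mathcal{D}$ and range $\mathcal{R}$ satisfies $(\epsilon, \delta)$-differential privacy if, for any two adjacent inputs $d, d' \in \mathcal{D}$ and any set of outputs $S \subseteq \mathcal{R}$, it holds that
$$\Pr[\mathcal{M}(d) \in S] \le e^{\epsilon} \Pr[\mathcal{M}(d') \in S] + \delta. \tag{5}$$
Two inputs $d$ and $d'$ are regarded as adjacent when they are identical except for one single data item. The parameter $\epsilon$ represents the privacy budget Dwork (2011a) that controls the privacy loss of $\mathcal{M}$. A larger value of $\epsilon$ indicates weaker privacy protection.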
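A general method for making a deterministic function $f$ differentially private is to add random noise calibrated to the sensitivity of $f$, represented by $S_f$, where $S_f = \max_{d, d'} \lVert f(d) - f(d') \rVert_2$ over adjacent inputs $d, d'$. For instance, the Gaussian mechanism is defined as follows.
Theorem 1.
If the L2-norm sensitivity of a deterministic function $f$ is $S_f$, we have:
$$\mathcal{M}(d) = f(d) + \mathcal{N}(0, S_f^2 \sigma^2), \tag{6}$$
where $\mathcal{N}(0, S_f^2 \sigma^2)$ is a random variable obeying the Gaussian distribution with mean 0 and standard deviation $S_f \sigma$. The randomized mechanism $\mathcal{M}(d)$ is $(\epsilon, \delta)$-differentially private if $\sigma \ge \sqrt{2 \ln(1.25/\delta)} / \epsilon$ and $\epsilon < 1$.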
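To make the Gaussian mechanism concrete, the following sketch computes a noise standard deviation from a target privacy level. It uses the standard calibration $\sqrt{2 \ln(1.25/\delta)}/\epsilon$ times the sensitivity, which is valid for $\epsilon < 1$; the function name and the example values are illustrative and not taken from the paper.

```python
import math

def gaussian_noise_std(sensitivity, epsilon, delta):
    """Noise std for the classical Gaussian mechanism (requires 0 < epsilon < 1)."""
    if not 0.0 < epsilon < 1.0:
        raise ValueError("this calibration is only valid for 0 < epsilon < 1")
    if not 0.0 < delta < 1.0:
        raise ValueError("delta must lie in (0, 1)")
    return sensitivity * math.sqrt(2.0 * math.log(1.25 / delta)) / epsilon

# Hypothetical per-query target: epsilon = 0.5, delta = 1e-5, sensitivity 1.
std_unit = gaussian_noise_std(sensitivity=1.0, epsilon=0.5, delta=1e-5)
# Averaging over 4 teachers divides the sensitivity, and hence the noise, by 4.
std_four_teachers = gaussian_noise_std(sensitivity=1.0 / 4, epsilon=0.5, delta=1e-5)
```

The second call shows why more teachers help: with a fixed per-teacher bound, the sensitivity of the average shrinks with the number of teachers, so less noise is needed for the same per-query guarantee.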
3.2 Privacy Analysis of PETDL
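Since the teachers' training data are disjoint from each other, the sensitivity of the aggregated loss $\hat{\mathcal{L}}$ is $B/n$, where $B$ is the clipping bound on each teacher's loss and $n$ is the number of teachers. Based on Theorem 1, each query answered by the above randomized mechanism is $(\epsilon, \delta)$-differentially private when $\sigma$ is set as $\sqrt{2 \ln(1.25/\delta)}\, B / (n \epsilon)$. During the training process, the student submits $T$ queries to the teachers. The overall privacy loss can be tracked by using the moments accountant Abadi et al. (2016), which leads to the following theorem: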
Theorem 2.
If , , where is a constant. The overall privacy budget of Algorithm 1 is , when is set as:
(7) 
where is a constant and is number of teachers.
Batch loss optimization. Each query exposes some private information from the teachers to the student. To provide stronger privacy protection, we propose a new optimization strategy in which the student model sends a batch of samples as a single query to the teacher models, instead of querying sample by sample. All teacher models then transfer their ensemble knowledge, perturbed with carefully calibrated random Gaussian noise for privacy. This reduces the number of queries during training and therefore lowers the overall privacy loss of Theorem 2, because the query count in the bound becomes the number of batches rather than the number of samples.
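The saving from batch queries can be illustrated by counting queries. The sketch assumes one query per batch per epoch, summed over hint and distillation epochs; the epoch counts and public set size below are hypothetical values chosen only for the example.

```python
import math

def num_queries(num_public, hint_epochs, distill_epochs, batch_size=1):
    """Number of teacher queries when each batch is submitted as a single query."""
    batches_per_epoch = math.ceil(num_public / batch_size)
    return (hint_epochs + distill_epochs) * batches_per_epoch

# Hypothetical setting: 10,000 public samples, 10 hint epochs, 30 distillation epochs.
per_sample = num_queries(10_000, hint_epochs=10, distill_epochs=30, batch_size=1)
per_batch = num_queries(10_000, hint_epochs=10, distill_epochs=30, batch_size=128)
```

Because the accumulated privacy loss grows with the number of queries, cutting the count by roughly the batch size is what makes the batch mode cheaper in privacy terms.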
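Weighted knowledge transfer via teachers. Rather than aggregating the teachers' losses directly, each teacher can use additional information, such as a confidence score, to weight its response before the aggregation. Here, we regard the highest probability among the different classes as the confidence score, and calculate the aggregated loss as follows:
$$\hat{\mathcal{L}} = \frac{1}{n} \sum_{i=1}^{n} \omega_i \bar{\mathcal{L}}_i, \qquad \omega_i = \max_{c \in C} P_i^t(c),$$
where $C$ is the set of classes of the samples, $P_i^t(c)$ is the softened probability that the $i$th teacher assigns to class $c$, and $\bar{\mathcal{L}}_i$ is the $i$th teacher's clipped loss. Since each weight $\omega_i$ is at most 1, the privacy loss of each query is the same as that of average aggregation. Therefore, the overall privacy loss is determined by Theorem 2.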
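A confidence-weighted aggregation could look like the sketch below. It assumes that each teacher's already clipped scalar loss is multiplied by its highest class probability and that the sum is divided by the number of teachers; this normalization is an assumption of the sketch, and the Gaussian noise step is omitted for brevity.

```python
def confidence(probs):
    """Confidence score of a teacher: its highest class probability."""
    return max(probs)

def weighted_aggregate(teacher_losses, teacher_probs):
    """Confidence-weighted mean of clipped teacher losses (noise omitted)."""
    n = len(teacher_losses)
    weighted = (confidence(p) * loss for loss, p in zip(teacher_losses, teacher_probs))
    return sum(weighted) / n

# A confident teacher (0.9) and an unsure teacher (0.5) report the same loss.
losses = [1.0, 1.0]
probs = [[0.9, 0.1], [0.5, 0.5]]
aggregated = weighted_aggregate(losses, probs)
```

Since every weight is at most 1, the weighted loss never exceeds the unweighted mean for non-negative losses, so the clipping bound used for plain averaging still applies.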
4 Experimental Evaluation
In this paper, we evaluate PETDL on deep neural networks on three popular image datasets: CIFAR10 Krizhevsky and Hinton (2009), SVHN Netzer et al. (2011) and MNIST LeCun et al. (1998). We first use CIFAR10 to examine the performance impact of different parameters and the effectiveness of the proposed techniques in PETDL. Then we verify the privacy protection, effectiveness and efficiency of the student model on MNIST, SVHN, and CIFAR10. Due to space limitations, the model evaluation on MNIST and SVHN and further experiments are in the appendix. The code will be released in the final publication version.
We present the performance of the student model with a strong privacy protection guarantee on MNIST, SVHN, and CIFAR10. MNIST and SVHN are two well-known digit image datasets consisting of 60K and 73K training samples, respectively. We split the training data of each of the three datasets equally into five subsets. Four of them are used to train four teacher models and the last one is treated as public data. For SVHN, the teachers use the same ConvMiddle network Laine (2017); Park et al. (2017) as for CIFAR10. Due to the properties of the dataset, we design a customized ConvSmall network Laine (2017); Park et al. (2017) as the teacher on MNIST. Many parameters affect the privacy budget, and we report the results that strike a good balance between effectiveness and the privacy protection guarantee.
4.1 Model Ablations
There are 50K training samples belonging to 10 classes in CIFAR10. The training data is randomly partitioned into 5 subsets of equal size; four of them are treated as sensitive data and used to train four teachers, and the last one is used as unlabelled public data for training the student model. Note that in all experiments we compare the student with the average performance of the teachers instead of a single ensemble teacher.
Each teacher is a pretrained convolutional deep neural network, ConvMiddle Laine (2017); Park et al. (2017). The student model is a customized ConvSmall network trained through private knowledge distillation in PETDL. The performance of the student model is affected by multiple parameters. We examine them individually, keeping the others fixed, to show their effects.
Privacy Budget. We present the results for different privacy budgets on the three datasets in Fig. 2, fixing all hyperparameters except one. Accuracy generally increases with a larger privacy budget. An additional table in the appendix shows that, even with a small privacy budget, the student models generally outperform their teachers on MNIST and SVHN, and come very close to their teachers on CIFAR10.
Epoch for hint learning. In Fig. 3(a), we find that without hint learning, i.e., with zero hint learning epochs, the accuracy of the student model is determined by the distillation learning alone. A small number of distillation learning epochs significantly deteriorates the student's performance. However, this deterioration can be mitigated by hint learning, even with a small number of hint learning epochs. When , the performance difference between and is negligible. This suggests that hint learning can help improve the student's performance with little privacy loss. Hint learning epoch is our recommendation for CIFAR10.
Epoch for distillation learning. The experimental results in Fig. 3(b) show that the student model performs better as the number of distillation learning epochs increases, because more private knowledge is transferred from the teachers. In CIFAR10, distillation learning epoch is recommended for training.
Batch size. The student's performance improves with a smaller batch size, as shown in Fig. 3(c). A large batch size leads to fewer query requests from the student model, and thus the privacy is better protected. To balance effectiveness and privacy, we recommend setting the batch size to 128 on CIFAR10.
Noise scale. In Fig. 3(d), we observe that a larger noise scale helps protect data privacy, but also generally reduces performance. However, since neural networks and other machine learning techniques frequently suffer from overfitting, the norm bound and the additional noise also act as regularizers during training. Compared with CIFAR10, the other two datasets are not very sensitive to the noise scale, so we can set a large noise scale for privacy preservation. The detailed analysis for SVHN and MNIST is in the appendix.
Compression rate. The teacher-student framework supports using a large teacher to train a small student. Fig. 3(e) shows that the student's performance rises with a larger neural network. A student model with a very large network, however, requires more public data and more queries to obtain a stable and effective model. Compared with CIFAR10, the other two datasets perform well, even slightly better than the average performance of the teachers, as shown in the appendix.
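Table 1: Accuracy and privacy budget of PETDL and the baselines ("-" marks a value that is not reported).

Model                             | CIFAR10 Accuracy | CIFAR10 Privacy Budget | MNIST Accuracy | MNIST Privacy Budget | SVHN Accuracy | SVHN Privacy Budget
DPSGD Abadi et al. (2016)         | 73.00%           | 8.00                   | 97.00%         | 8.00                 | -             | -
PATELNMax Papernot et al. (2017)  | -                | -                      | 98.10%         | 8.03                 | 90.77%        | 8.19
PATEGNMax Papernot et al. (2018)  | -                | -                      | 98.50%         | 1.97                 | 91.60%        | 4.96
PETDL (n = 2)                     | 74.33%           | 6.50                   | 99.12%         | 1.43                 | 94.67%        | 4.51
PETDL (n = 4)                     | 76.81%           | 5.48                   | 99.33%         | 1.21                 | 95.35%        | 2.96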
4.2 Comparison with State-of-the-art Baselines
We compare with three recent state-of-the-art approaches: DPSGD Abadi et al. (2016), PATE Papernot et al. (2017) and scalable PATE Papernot et al. (2018). DPSGD uses noisy gradients for optimization, and the two PATE-based approaches add random noise to the voting strategy.
Table 1 shows that the proposed PETDL outperforms the previous state-of-the-art baselines on both privacy cost and accuracy. The main reason is that PETDL leverages the richer information from the neural network structure, which helps the student train a better-performing neural network with less privacy cost. We also evaluate a variant of PETDL with the weighted learning approach. It shows that weighted learning generally yields a slight gain in performance. The weighted learning approach can prevent teachers from transferring wrong knowledge when they are not confident. It can improve the robustness of the training approach, and more details of the robustness evaluation are in the following section.
4.3 Evaluation on Unbalanced Dataset
In this experiment, we design an unbalanced setting to evaluate the robustness of PETDL with the weighted learning approach. First, we partition the sensitive dataset into ten subsets. In each subset, 82% of the data belongs to a single label, and the remaining 18% is spread uniformly over the other nine labels. Then, we train ten teachers, one on each subset, so that each teacher is very good at predicting only one label. Note that each teacher adopts oversampling during training to cope with its unbalanced subset.
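The label skew of one teacher's subset can be sketched as follows. The function name, the subset size of 5,000, and the way leftover samples are distributed are assumptions for illustration; only the 82%/18% split over ten labels comes from the text.

```python
def skewed_label_counts(subset_size, dominant_label, num_labels=10, dominant_frac=0.82):
    """Per-label sample counts for one teacher's deliberately unbalanced subset."""
    dominant = round(subset_size * dominant_frac)
    others = [label for label in range(num_labels) if label != dominant_label]
    base, extra = divmod(subset_size - dominant, len(others))
    counts = {dominant_label: dominant}
    for k, label in enumerate(others):
        # Spread the remaining samples as evenly as possible over the other labels.
        counts[label] = base + (1 if k < extra else 0)
    return counts

# Hypothetical subset of 5,000 samples dominated by label 3.
counts = skewed_label_counts(subset_size=5_000, dominant_label=3)
```

With these numbers, each of the nine minority labels makes up 2% of the subset, which is why a teacher trained on it is reliable on essentially one label only.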
In Table 2, we can see that PETDL with weighted learning is more robust than PETDL without it. Meanwhile, the more complex the dataset (i.e., CIFAR10 > SVHN > MNIST), the more the performance of the student model drops, because of the lower performance of each teacher. Another interesting observation is that the privacy cost is much lower than most results in Table 1, due to the larger number of teachers. As a result, we recommend training as many teachers as possible in PETDL until the student model's performance drops.
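Table 2: Accuracy and privacy budget of PETDL with and without weighted learning on unbalanced data.

Model            | CIFAR10 Accuracy | CIFAR10 Privacy Budget | MNIST Accuracy | MNIST Privacy Budget | SVHN Accuracy | SVHN Privacy Budget
PETDL (n = 10)   | 62.21%           | 2.33                   | 97.26%         | 1.17                 | 90.31%        | 2.51
PETDL weighted   | 71.3%            | 2.23                   | 98.69%         | 0.91                 | 91.15%        | 2.12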
5 Discussion and Related Work
Differential privacy is increasingly regarded as a standard privacy principle that guarantees provable privacy protection Beimel et al. (2014). Early work adopting differential privacy focused on restricted classifiers with convex loss Bassily et al. (2014); Chaudhuri et al. (2011); Hamm et al. (2016); Pathak et al. (2010); Song et al. (2013). Abadi et al. (2016) proposed DPSGD, a new optimizer that carefully adds random Gaussian noise to stochastic gradient descent to make deep learning privacy-preserving. At each step, given a random set of examples, DPSGD computes the gradient, clips the norm of each gradient, adds random Gaussian noise for privacy protection, and updates the model parameters based on the noisy gradient.
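The steps listed above can be sketched for a single update. This is a simplified illustration rather than the reference implementation: gradients are plain Python lists, and the function name, learning rate, and noise parameterization (noise std equal to `noise_multiplier * clip_norm` on the summed gradient) are assumptions of the sketch.

```python
import math
import random

def dp_sgd_step(params, per_example_grads, lr, clip_norm, noise_multiplier, rng):
    """One noisy gradient step: clip each example's gradient, sum, add noise, average."""
    clipped_sum = [0.0] * len(params)
    for grad in per_example_grads:
        norm = math.sqrt(sum(g * g for g in grad))
        scale = min(1.0, clip_norm / norm) if norm > 0 else 1.0
        for j, g in enumerate(grad):
            clipped_sum[j] += g * scale
    batch_size = len(per_example_grads)
    noisy_mean = [
        (s + rng.gauss(0.0, noise_multiplier * clip_norm)) / batch_size
        for s in clipped_sum
    ]
    return [p - lr * g for p, g in zip(params, noisy_mean)]

rng = random.Random(0)
grads = [[3.0, 4.0], [0.0, 0.0]]
# With the noise switched off, only the clipping is visible in the update.
updated = dp_sgd_step([0.0, 0.0], grads, lr=1.0, clip_norm=1.0,
                      noise_multiplier=0.0, rng=rng)
```

Clipping bounds the influence of any single example on the update, and the Gaussian noise is scaled to that bound, which is the same clip-then-perturb pattern PETDL applies to teacher losses.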
Intuitively, DPSGD can be easily adopted by most existing deep neural network models built on the SGD optimizer. Building on DPSGD, Agarwal et al. (2018) apply differential privacy to distributed stochastic gradient descent to achieve both communication efficiency and privacy preservation. McMahan et al. (2017) apply differential privacy to LSTM language models by combining federated learning and differentially private SGD to guarantee user-level privacy.
Papernot et al. (2017) proposed a general approach based on the aggregation of teacher ensembles, PATE, which uses the teacher models’ aggregate voting decisions to transfer knowledge for student model training. To address privacy, PATE adds carefully calibrated Laplacian noise to the aggregate voting decisions that are communicated to the student. To improve the scalability of the original PATE model, the same group published an advanced version of PATE Papernot et al. (2018) that optimizes the voting behavior of the teacher models. PATEGAN Jordon et al. (2018) applies PATE to GANs to provide a privacy guarantee for data generated from the original data. Compared with PATE, DPSGD makes fewer assumptions about the ML task, but this comes at the expense of modifying the training algorithm. Nonetheless, the PATE-based methods use only the output labels generated by the teachers as the knowledge. Our approach leverages richer information from deep neural networks to train a better model.
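For contrast with the loss-based transfer in PETDL, the label-only noisy voting used by PATE can be sketched as below. The sketch assumes Laplace noise with scale `1 / gamma` added to each class's vote count before taking the argmax; the function names and the vote values are illustrative.

```python
import random

def laplace_noise(scale, rng):
    """Laplace(0, scale) sample as the difference of two exponential samples."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def noisy_max_vote(teacher_labels, num_classes, gamma, rng):
    """Return the class with the highest Laplace-perturbed vote count."""
    counts = [0] * num_classes
    for label in teacher_labels:
        counts[label] += 1
    noisy_counts = [c + laplace_noise(1.0 / gamma, rng) for c in counts]
    return max(range(num_classes), key=lambda c: noisy_counts[c])

rng = random.Random(0)
votes = [1, 1, 1, 0, 2]  # five hypothetical teachers voting on one query
# A very large gamma means almost no noise, so the plurality label wins.
label = noisy_max_vote(votes, num_classes=3, gamma=1000.0, rng=rng)
```

Only a single label per query reaches the student here, whereas PETDL also passes softened probabilities and intermediate representations through its noisy loss.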
6 Conclusion
We propose a simple yet effective privacy-preserving ensemble-based method for deep neural networks. Our method employs knowledge distillation and hint learning on intermediate representations to facilitate the training of the student model. Empirically, our method significantly outperforms previous methods on both privacy budget and accuracy on three datasets. Additionally, we propose an alternative weighted ensemble method that works more robustly across different teacher settings. In particular, the weighted ensemble method works well when the teacher training sets are biased. Moreover, we provide a formal privacy analysis and a provable privacy guarantee for the proposed methods.
References
 Abadi et al. [2016] Martin Abadi, Andy Chu, Ian Goodfellow, H. Brendan McMahan, Ilya Mironov, Kunal Talwar, and Li Zhang. Deep learning with differential privacy. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security, CCS ’16, pages 308–318, 2016.
 Agarwal et al. [2018] Naman Agarwal, Ananda Theertha Suresh, Felix Xinnan X Yu, Sanjiv Kumar, and Brendan McMahan. cpSGD: Communication-efficient and differentially-private distributed SGD. In Advances in Neural Information Processing Systems, pages 7564–7575, 2018.
 Alipanahi et al. [2015] Babak Alipanahi, Andrew Delong, Matthew T Weirauch, and Brendan J Frey. Predicting the sequence specificities of DNA- and RNA-binding proteins by deep learning. Nature biotechnology, 33(8):831, 2015.
 Bassily et al. [2014] Raef Bassily, Adam Smith, and Abhradeep Thakurta. Differentially private empirical risk minimization: Efficient algorithms and tight error bounds. arXiv preprint arXiv:1405.7085, 2014.
 Beimel et al. [2014] Amos Beimel, Hai Brenner, Shiva Prasad Kasiviswanathan, and Kobbi Nissim. Bounds on the sample complexity for private learning and private data release. Machine Learning, 94(3):401–437, 2014.
 Chaudhuri et al. [2011] Kamalika Chaudhuri, Claire Monteleoni, and Anand D Sarwate. Differentially private empirical risk minimization. Journal of Machine Learning Research, 12(Mar):1069–1109, 2011.
 Devlin et al. [2018] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
 Dwork et al. [2006] Cynthia Dwork, Krishnaram Kenthapadi, Frank McSherry, Ilya Mironov, and Moni Naor. Our Data, Ourselves: Privacy Via Distributed Noise Generation, pages 486–503. Springer Berlin Heidelberg, Berlin, Heidelberg, 2006.
 Dwork [2011a] Cynthia Dwork. Differential Privacy, pages 338–340. 2011.
 Dwork [2011b] Cynthia Dwork. A firm foundation for private data analysis. Communications of the ACM, 54(1):86–95, 2011.
 Gong et al. [2018] Qingyuan Gong, Yang Chen, Xinlei He, Zhou Zhuang, Tianyi Wang, Hong Huang, Xin Wang, and Xiaoming Fu. DeepScan: Exploiting deep learning for malicious account detection in location-based social networks. IEEE Communications Magazine, 56(11):21–27, 2018.
 Hamm et al. [2016] Jihun Hamm, Yingjun Cao, and Mikhail Belkin. Learning privately from multiparty data. In International Conference on Machine Learning, pages 555–563, 2016.
 He et al. [2016] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
 Hinton et al. [2014] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. In Advances in Neural Information Processing Systems Deep Learning Workshop, NIPS ’14. 2014.
 Jordon et al. [2018] James Jordon, Jinsung Yoon, and Mihaela van der Schaar. PATE-GAN: Generating synthetic data with differential privacy guarantees. 2018.
 Krizhevsky and Hinton [2009] A Krizhevsky and G Hinton. Learning multiple layers of features from tiny images. 2009.
 Laine [2017] Samuli Laine. Temporal ensembling for semi-supervised learning. In 5th International Conference on Learning Representations, ICLR ’17, 2017.
 LeCun et al. [1998] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
 Ma et al. [2018] Xindi Ma, Jianfeng Ma, Hui Li, Qi Jiang, and Sheng Gao. PDLM: Privacy-preserving deep learning model on cloud with multiple keys. IEEE Transactions on Services Computing, 2018.
 McMahan et al. [2017] H Brendan McMahan, Daniel Ramage, Kunal Talwar, and Li Zhang. Learning differentially private recurrent language models. arXiv preprint arXiv:1710.06963, 2017.
 Michie et al. [1994] Donald Michie, David J Spiegelhalter, CC Taylor, et al. Machine learning. Neural and Statistical Classification, 13, 1994.
 Netzer et al. [2011] Yuval Netzer, Tao Wang, Adam Coates, Alessandro Bissacco, Bo Wu, and Andrew Y. Ng. Reading digits in natural images with unsupervised feature learning. In Proceedings of the 24th International Conference on Neural Information Processing Systems, NIPS ’11, pages 1–9. 2011.
 Nicolas Papernot [2017] Nicolas Papernot, Martin Abadi, Ulfar Erlingsson, Ian Goodfellow, and Kunal Talwar. Semi-supervised knowledge transfer for deep learning from private training data. In 5th International Conference on Learning Representations (ICLR), 2017.
 Nissim et al. [2007] Kobbi Nissim, Sofya Raskhodnikova, and Adam Smith. Smooth sensitivity and sampling in private data analysis. In Proceedings of the thirty-ninth annual ACM symposium on Theory of computing, pages 75–84. ACM, 2007.
 Papernot et al. [2017] Nicolas Papernot, Martin Abadi, Ulfar Erlingsson, Ian Goodfellow, and Kunal Talwar. Semi-supervised knowledge transfer for deep learning from private training data. In 5th International Conference on Learning Representations, ICLR ’17, 2017.
 Papernot et al. [2018] Nicolas Papernot, Shuang Song, Ilya Mironov, Ananth Raghunathan, Kunal Talwar, and Ulfar Erlingsson. Scalable private learning with PATE. In 6th International Conference on Learning Representations, ICLR ’18, 2018.
 Park et al. [2017] Sungrae Park, Jun-Keon Park, Su-Jin Shin, and Il-Chul Moon. Adversarial dropout for supervised and semi-supervised learning. arXiv:1707.03631, 2017.
 Pathak et al. [2010] Manas Pathak, Shantanu Rane, and Bhiksha Raj. Multiparty differential privacy via aggregation of locally trained classifiers. In Advances in Neural Information Processing Systems, pages 1876–1884, 2010.
 Samangouei et al. [2018] Pouya Samangouei, Maya Kabkab, and Rama Chellappa. Defense-GAN: Protecting classifiers against adversarial attacks using generative models. arXiv preprint arXiv:1805.06605, 2018.
 Silver et al. [2017] David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, et al. Mastering the game of go without human knowledge. Nature, 550(7676):354, 2017.
 Song et al. [2013] Shuang Song, Kamalika Chaudhuri, and Anand D Sarwate. Stochastic gradient descent with differentially private updates. In 2013 IEEE Global Conference on Signal and Information Processing, pages 245–248. IEEE, 2013.
 Wang et al. [2018] Ji Wang, Weidong Bao, Lichao Sun, Xiaomin Zhu, Bokai Cao, and Philip S. Yu. Private model compression via knowledge distillation. arXiv:1811.05072, 2018.
 Yan [2017] Zheng Yan. Mobile phone behavior. Cambridge University Press, 2017.
Appendix
Appendix A Proof of Theorem 2
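We introduce the moments accountant proposed in Abadi et al. [2016], which is also used to prove our Theorem 2. Regard the privacy loss as a random variable $c$. For two adjacent inputs $d$ and $d'$, a randomized mechanism $\mathcal{M}$, auxiliary input $\mathrm{aux}$, and any output $o$ of $\mathcal{M}$, we define:
$$c(o; \mathcal{M}, \mathrm{aux}, d, d') = \log \frac{\Pr[\mathcal{M}(\mathrm{aux}, d) = o]}{\Pr[\mathcal{M}(\mathrm{aux}, d') = o]}.$$
We can estimate the privacy loss by bounding the log of the $\lambda$th moment generating function, which is defined as:
$$\alpha_{\mathcal{M}}(\lambda; \mathrm{aux}, d, d') = \log \mathbb{E}_{o \sim \mathcal{M}(\mathrm{aux}, d)} \left[ \exp\left(\lambda\, c(o; \mathcal{M}, \mathrm{aux}, d, d')\right) \right].$$
The bound of $\alpha_{\mathcal{M}}(\lambda; \mathrm{aux}, d, d')$ is defined as the moments accountant over all possible $\mathrm{aux}$ and $d, d'$:
$$\alpha_{\mathcal{M}}(\lambda) = \max_{\mathrm{aux}, d, d'} \alpha_{\mathcal{M}}(\lambda; \mathrm{aux}, d, d').$$
The moments accountant enjoys good properties of composability and tail bound as given in Abadi et al. [2016]:
[Composability]. Suppose that a mechanism $\mathcal{M}$ consists of a sequence of adaptive mechanisms $\mathcal{M}_1, \dots, \mathcal{M}_k$, where $\mathcal{M}_i : \prod_{j=1}^{i-1} \mathcal{R}_j \times \mathcal{D} \to \mathcal{R}_i$. Then, for any $\lambda$
$$\alpha_{\mathcal{M}}(\lambda) \le \sum_{i=1}^{k} \alpha_{\mathcal{M}_i}(\lambda),$$
where $\alpha_{\mathcal{M}}$ is conditioned on $\mathcal{M}_i$’s output being $o_i$ for $i < k$.
[Tail bound] For any $\epsilon > 0$, the mechanism $\mathcal{M}$ is $(\epsilon, \delta)$-differentially private for
$$\delta = \min_{\lambda} \exp\left(\alpha_{\mathcal{M}}(\lambda) - \lambda \epsilon\right).$$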
By using the above two properties, we can bound the moments of randomized mechanism based on each submechanism, and then convert the moments accountant to differential privacy based on the tail bound.
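As a rough illustration of how the composability and tail bound properties combine, the following Python sketch sums a per-step log moment over a number of adaptive steps and then searches over integer moment orders for the smallest δ at a target ε. It assumes the leading-order per-step bound α(λ) ≤ q²λ(λ+1)/((1−q)σ²) from Lemma 3 of Abadi et al. [2016] with the higher-order term dropped, so the result is indicative only and not a rigorous guarantee. The function names, the cap on the moment order, and the parameter values (sampling ratio q, noise scale sigma, step count) are hypothetical and not taken from the paper.

```python
import math


def per_step_log_moment(lam, q, sigma):
    """Leading-order bound on one step's log moment (assumed form, higher-order term dropped)."""
    return q * q * lam * (lam + 1) / ((1.0 - q) * sigma * sigma)


def composed_log_moment(lam, q, sigma, steps):
    """Composability: the log moments of the adaptive steps add up."""
    return steps * per_step_log_moment(lam, q, sigma)


def delta_from_tail_bound(eps, q, sigma, steps, max_lambda=32):
    """Tail bound: delta = min over lambda of exp(alpha(lambda) - lambda * eps)."""
    best = 1.0  # delta is a probability, so it never needs to exceed 1
    for lam in range(1, max_lambda + 1):
        # Stay inside the assumed validity range lambda <= sigma^2 * ln(1 / (q * sigma)).
        if lam > sigma ** 2 * math.log(1.0 / (q * sigma)):
            break
        alpha = composed_log_moment(lam, q, sigma, steps)
        best = min(best, math.exp(alpha - lam * eps))
    return best


# Hypothetical setting: 1% sampling ratio, noise scale 4, 1000 queries, target eps = 1.
delta = delta_from_tail_bound(eps=1.0, q=0.01, sigma=4.0, steps=1000)
```

Under this sketch, increasing the number of steps raises δ for a fixed ε, while a larger ε lowers it, which matches the qualitative trade-off the proof relies on.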
Theorem 2.
Proof.
Assume the teacher models are pretrained by percentage of the total samples. According to Lemma 3 in Abadi et al. [2016], if , the moments accountant of each query is bounded as . Based on the composability property, we have
By the tail bound property, to enforce differential privacy, it suffices to have,
A straightforward calculation shows that all these conditions can be met by setting,
for two constants and . When , the claim follows. ∎
Appendix B Additional Experimental Evaluation
In this section, we first report the model evaluation on the SVHN and MNIST datasets.
We also train a small student to assess the effectiveness and efficiency of the compression strategy enabled by the teacher-student model.
B.1 Effect of Parameters on SVHN
SVHN contains 73K training samples belonging to 10 classes. As with CIFAR10, the training data is randomly split into 5 equal-size subsets; four of them are used to train four teachers separately, and the last one is used as sensitive data.
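The split described above can be sketched as follows. This is a minimal illustration using only the Python standard library; the function name, the fixed seed, and the choice to drop any remainder samples are assumptions made for the example, not details given in the paper.

```python
import random


def split_for_teachers(num_samples, num_subsets=5, seed=0):
    """Randomly split sample indices into equal-size subsets.

    The first num_subsets - 1 subsets each train one teacher; the last
    subset plays the role of the sensitive data.
    """
    indices = list(range(num_samples))
    random.Random(seed).shuffle(indices)  # seeded so the split is reproducible
    size = num_samples // num_subsets     # remainder samples are dropped (assumption)
    subsets = [indices[i * size:(i + 1) * size] for i in range(num_subsets)]
    return subsets[:-1], subsets[-1]


# Hypothetical usage with a round 73,000 samples standing in for the "73K" above.
teacher_subsets, sensitive_subset = split_for_teachers(73_000)
```

Because the subsets are disjoint, no teacher ever sees the samples held out as sensitive data.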
Epoch for hint learning . Fig. 4(a) shows that when , the student outperforms its teachers.
Epoch for distillation learning. The experimental results in Fig. 4(b) show that the student outperforms its teachers when .
Batch size. The student outperforms its teachers over a wide range of batch sizes (from 32 to 192), as shown in Fig. 4(c). We recommend setting the batch size to 128 on SVHN for better privacy preservation.
Noise scale. In Fig. 4(d), as with batch size, we observe that the student is always better than its teachers, even with a large noise scale.
Compression rate. Fig. 4(e) shows that the student’s performance rises with a larger neural network. We recommend setting the compression rate to 3 on SVHN.
Number of teachers. As with CIFAR10, we also evaluate the student with different numbers of teachers under the same settings. Each teacher's performance naturally increases with more training samples; however, Fig. 4(f) shows that using more teachers is more effective for training a better student.
B.2 Effect of Parameters on MNIST
MNIST contains 60K training samples belonging to 10 classes. As with the other two datasets, the training data is randomly split into 5 equal-size subsets; four of them are used to train four teachers separately, and the last one is used as sensitive data.
Epoch for hint learning . The experimental results in Fig. 5(a) show that the student outperforms its teachers even without hint learning (). However, the student's performance improves as the number of hint learning epochs increases.
Epoch for distillation learning. The results in Fig. 5(b) show that the student always outperforms its teachers, and that the student becomes more effective as the number of distillation learning epochs increases.
Batch size. The student outperforms its teachers when the batch size is less than or equal to 160, as shown in Fig. 5(c). The results also show a cliff-like drop when the batch size exceeds 192. We recommend setting the batch size to 128 on MNIST for a good balance between performance and privacy preservation.
Noise scale. In Fig. 5(d), as in the analysis on SVHN, we observe that the student is always better than its teachers, even with a large noise scale.
Compression rate. Fig. 5(e) shows that the student’s performance rises with a larger neural network. We recommend setting the compression rate between 2 and 4 on MNIST.
Number of teachers. As with the other two datasets, we also evaluate the student with different numbers of teachers under the same settings. The results are similar: each teacher's performance increases with more training samples; however, Fig. 5(f) shows that using more teachers is more effective for training a better student.
Dataset   Model   # Params   Time (s)   Acc (%)
CIFAR10   T       1.43M      9.293      77.66
CIFAR10   S1      0.37M      3.671      76.81
CIFAR10   S2      0.17M      2.260      72.20
MNIST     T       157.1K     2.021      99.21
MNIST     S1      23.0K      1.230      99.33
MNIST     S2      13.2K      1.157      99.23
SVHN      T       1.43M      9.237      94.68
SVHN      S1      0.37M      3.693      95.35
SVHN      S2      0.09M      2.011      94.02
B.3 Efficiency and Effectiveness
We evaluate the efficiency and effectiveness of the student model. “Time” in the table denotes the model's running time. The testing environment is a high-end desktop equipped with an Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz with 16 cores and 82.4 GB of memory.
Each teacher is trained on less data than in previous single-teacher approaches. Because of this reduced training set size, the average performance of our teachers is not as high as in previous work. Despite that, our student models outperform state-of-the-art approaches, as shown in Table 3. We achieve better accuracy with better privacy budgets.
The small students outperform the average performance of their teachers on both the SVHN and MNIST datasets, and still obtain comparable performance on CIFAR10. On MNIST, the student achieves an 11.9× compression ratio and a 1.75× speedup on sample evaluation with a 0.12% accuracy increase. On SVHN, the student model is also better than the teacher models in model size (3.8× compression ratio), efficiency (2.5× speedup), and effectiveness (+0.67% accuracy). On CIFAR10, the accuracy decreases by less than 1% while using only 39.5% of the running time. By applying hint learning and knowledge distillation in PETDL, the student is both effective and efficient, which helps build trust between model providers and model users.
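The compression ratios and speedups quoted above can be recomputed from the table entries. The sketch below does this for the MNIST and SVHN rows; it assumes the compression ratio is teacher parameters divided by student parameters and the speedup is teacher time divided by student time. The dictionary layout and the function name are illustrative only.

```python
# (parameter count, running time in seconds, accuracy in %) copied from the table.
results = {
    "MNIST": {
        "T": (157.1e3, 2.021, 99.21),
        "S1": (23.0e3, 1.230, 99.33),
        "S2": (13.2e3, 1.157, 99.23),
    },
    "SVHN": {
        "T": (1.43e6, 9.237, 94.68),
        "S1": (0.37e6, 3.693, 95.35),
        "S2": (0.09e6, 2.011, 94.02),
    },
}


def compare_to_teacher(dataset, student):
    """Return compression ratio, speedup, and accuracy change of a student vs. its teacher."""
    t_params, t_time, t_acc = results[dataset]["T"]
    s_params, s_time, s_acc = results[dataset][student]
    return {
        "compression": t_params / s_params,  # assumed definition: teacher / student size
        "speedup": t_time / s_time,          # assumed definition: teacher / student time
        "acc_delta": s_acc - t_acc,          # in percentage points
    }


svhn_s1 = compare_to_teacher("SVHN", "S1")
mnist_s2 = compare_to_teacher("MNIST", "S2")
```

Because the table reports rounded parameter counts, recomputed ratios can differ slightly in the last digit from the figures quoted in the text.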