Abstract
Federated learning is a recent approach to distributed model training that does not require sharing the raw data of clients. It allows model training using the large amounts of user data collected by edge and mobile devices, while preserving data privacy. A challenge in federated learning is that the devices usually have much lower computational power and communication bandwidth than machines in data centers, and training large-sized deep neural networks in such a federated setting can consume a large amount of time and resources. To overcome this challenge, we propose a method that integrates model pruning with federated learning: it includes initial model pruning at the server and further model pruning as part of the federated learning process, followed by the regular federated learning procedure. Our proposed approach can save computation, communication, and storage costs compared to standard federated learning approaches. Extensive experiments on real edge devices validate the benefit of our proposed method.
Model Pruning Enables Efficient Federated Learning on Edge Devices
Yuang Jiang (Yale University), Shiqiang Wang (IBM), Bong Jun Ko (Stanford HAI), Wei-Han Lee (IBM), Leandros Tassiulas (Yale University)

Affiliations: Yale University, New Haven, CT, USA; IBM T. J. Watson Research Center, Yorktown Heights, NY, USA; Stanford Institute for Human-Centered Artificial Intelligence (HAI), CA, USA

Correspondence: Yuang Jiang <yuang.jiang@yale.edu>, Shiqiang Wang <wangshiq@us.ibm.com>
Keywords: edge/mobile devices, federated learning, model pruning, resource efficiency
1 Introduction
Federated learning, since its inception, has attracted considerable attention due to its capability of distributed model training using data collected by a possibly large number of edge and mobile devices, such as Internet of Things (IoT) gateways and smartphones McMahan et al. (2017); Park et al. (2019); Li et al. (2019). The federated learning procedure includes local computations at clients (edge/mobile devices) and model parameter exchange between clients and a server in an iterative manner. The significance of this procedure is that the clients' data remain local and are not shared with others, which preserves data privacy and is more resource-efficient than transmitting all the data to a central server.
However, modern DNNs (deep neural networks) can contain hundreds of millions of parameters Simonyan & Zisserman (2014); He et al. (2016); training such large models directly on edge devices is often infeasible due to resource (such as memory) limitations or is otherwise very slow. In addition, the communication of such a large number of parameters between clients and the server is also a major impediment in federated learning, since the clients are often geo-distributed edge devices that need to upload their local models to the parameter server frequently McMahan et al. (2017); Lin et al. (2018).
Approaches to reducing the communication overhead in federated learning have been proposed recently Konečnỳ et al. (2016), which, however, do not reduce the complexity of local computation at clients. For efficient computation, new model architectures such as MobileNet Howard et al. (2017) and model pruning techniques Molchanov et al. (2016) have been developed, where the model is trained/pruned at the server with centrally available data, then deployed at the edge for efficient inference. These methods have not been applied in the federated learning setting where data are decentralized in clients.
In this paper, we aim to overcome the bottlenecks of computation, communication, and storage jointly, and propose a new federated learning paradigm combined with distributed model pruning. The benefit of model pruning is that it offers a high degree of freedom in the number of parameters to prune, which can be adapted to the computation and communication capabilities of the clients involved in the federated learning process. To the best of our knowledge, we are the first to propose such a paradigm for federated learning that focuses on both computation and communication efficiency.
Our main contributions in this paper are as follows.

- We propose a federated model pruning mechanism where the server can initially perform some degree of pruning on the initialized (untrained) model, possibly using a small amount of sample data if available; further pruning can then be performed distributedly.

- We integrate model pruning with the federated learning procedure, providing a mechanism that jointly trains and prunes the model in a federated manner.

- We conduct extensive experiments with various settings and measurements on real edge devices, and discuss the insights obtained from the results.
The rest of this paper is organized as follows. In Section 2, we present the related work. In Section 3, we introduce preliminaries and the proposed approach. Implementation details and complexity analysis can be found in Section 4. We evaluate the proposed approaches extensively in Section 5. Further issues are discussed in Section 6. Finally, we conclude our work in Section 7.
2 Related Work
In this section, we summarize existing works from three directions, including model compression, efficient communication, and implementation.
Model compression. Quantization mitigates computational cost by compressing the number of bits required for each parameter Gupta et al. (2015); Hubara et al. (2017). This, however, is orthogonal to our approach, and we will show in the following sections that our approach can easily exceed the compression upper bound of quantization (32x in the case of 32-bit parameters). Knowledge distillation Buciluǎ et al. (2006) transfers the knowledge from a large model, possibly an ensemble of models, to a smaller model, usually following a teacher-student paradigm Hinton et al. (2015). It enables the compression of pre-trained large models to much smaller ones with minimal loss of accuracy. Nevertheless, it does not provide a solution for small models to learn from scratch or to learn new knowledge incrementally. Model pruning was first proposed in LeCun et al. (1990) to remove redundant parameters according to their importance. It was further improved in Han et al. (2015) with iterative training and pruning. A single-shot pruning method was recently proposed by Lee et al. (2018).
Efficient communication. A number of works try to overcome the communication bottleneck by sending only important gradients for aggregation, e.g., sending gradients whose magnitudes are above a certain threshold Strom (2015), or sending a fixed proportion of gradients to the server Dryden et al. (2016). The work by Lin et al. (2018) demonstrated a high communication compression ratio while achieving similar accuracy, using warm-up training and momentum manipulation in addition to gradient clipping.
Implementation. In the literature, federated learning experiments using edge devices have been done Wang et al. (2019). There are also performance measurements of deep learning on TX2 platforms Liu et al. (2019) and implementation on IoT devices Li et al. (2018); Yao et al. (2017).
Our approach is superior to the existing methods for the following reasons: 1) it overcomes not only the communication bottleneck, but also computation and storage bottlenecks (compared with gradient sparsification methods); 2) the pruned model is capable of learning incrementally from new data (compared with knowledge distillation); 3) it can easily exceed the compression upper bound of quantization methods; 4) compared to existing model pruning approaches LeCun et al. (1990); Han et al. (2015); Lee et al. (2018), our approach is fundamentally different because in our paradigm, the parameter server has access to only a small subset of data, not the entire dataset, to protect clients’ privacy. Furthermore, quantization and gradient sparsification are orthogonal to our approach and can be applied simultaneously with our approach.
3 Federated Learning with Pruning
In this section, we present our method of joint federated learning and model pruning.
3.1 Preliminaries
Standard federated learning procedure. A federated learning system consists of multiple clients, each with its own data, and a server that acts as a model parameter aggregator. In the existing (standard) federated learning procedure, when a model training task is instantiated, the server first initializes the model parameters and then starts the iterative process of (i) the server distributing the model to clients, (ii) each client training the model with its own data (or a subset thereof) for a certain number of training iterations, (iii) each client sending the updated model to the server, and (iv) the server aggregating the model parameters updated by the clients. We refer to steps (i)-(iv) above as one federation.
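As a concrete illustration, one federation (steps (i)-(iv)) can be sketched in a few lines of Python; the scalar-list weights and the toy `client_update` rule below are our own simplifications for illustration, not the paper's implementation:

```python
# Sketch of one "federation": the server distributes weights, each client
# performs a toy local update, and the server averages the results.

def client_update(weights, data, lr=0.1):
    """Stand-in for local SGD: one gradient step on a squared-error
    objective pulling each weight toward the client's data mean."""
    target = sum(data) / len(data)
    return [w - lr * 2 * (w - target) for w in weights]

def federation(global_weights, client_datasets):
    # (i) distribute the model; (ii) each client trains locally
    local_models = [client_update(list(global_weights), d)
                    for d in client_datasets]
    # (iii) clients upload models; (iv) server averages them (FedAvg-style)
    n = len(local_models)
    return [sum(m[j] for m in local_models) / n
            for j in range(len(global_weights))]

new_w = federation([0.0, 1.0], [[1.0, 1.0], [3.0, 3.0]])
```

In real deployments, `client_update` would run several mini-batch SGD steps on the client's local dataset.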
As discussed in Section 1, in the edge/mobile computing environment where the clients’ resources and communication bandwidth are limited, the above standard federated learning procedure is significantly challenged when the model size is large. To address this challenge, we employ model pruning as a means to reduce the computation and communication overhead at clients.
Model pruning. In the iterative training and pruning approach proposed in Han et al. (2015) for the centralized machine learning setting, the model is first trained using stochastic gradient descent (SGD) for a given number of iterations. Then, a "level" of model pruning is performed by removing a certain percentage (referred to as the pruning rate) of the weights that have the smallest absolute values in each layer. This training and pruning process is repeated until a desired pruning level (corresponding to a model size) is reached. The benefit of this approach is that training and pruning occur at the same time, so that a trained model with a desired (small) size is obtained in the end. However, the approach in Han et al. (2015), like other pruning techniques LeCun et al. (1990); Lee et al. (2018), requires the training data to be available at a central location, which is not the case in federated learning.
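The layer-wise magnitude criterion can be sketched in plain Python; the function name and the flat-list weight representation are illustrative, not taken from the paper's code:

```python
# One level of layer-wise magnitude pruning (in the style of Han et al., 2015):
# zero out a given fraction of the smallest-magnitude weights still alive.

def prune_layer(weights, pruning_rate):
    """Return weights with the `pruning_rate` fraction of the
    smallest-magnitude nonzero entries set to zero
    (zeros represent already-pruned weights)."""
    alive = sorted(abs(w) for w in weights if w != 0.0)
    k = int(len(alive) * pruning_rate)   # how many weights to remove
    if k == 0:
        return list(weights)
    threshold = alive[k - 1]             # k-th smallest magnitude
    pruned, removed = [], 0
    for w in weights:
        if w != 0.0 and abs(w) <= threshold and removed < k:
            pruned.append(0.0)
            removed += 1
        else:
            pruned.append(w)
    return pruned

layer = [0.05, -0.8, 0.01, 0.3, -0.02]
pruned = prune_layer(layer, 0.4)   # remove 40% of the 5 weights
```

Calling `prune_layer` repeatedly, with training in between, yields the pruning "levels" described above.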
3.2 Proposed Approach
Our proposed approach includes initial model pruning at the server and further model pruning involving both the server and clients during the federated learning process. The details are described as follows.
3.2.1 Initial Pruning at the Server
Initial model pruning at the server can be done in one of the following two ways, depending on the data availability at the server:

- Sample-based pruning: This case assumes the availability of some data samples at the server, typically a small subset of the training dataset. These samples may be obtained by requesting each client to provide a few samples (a small portion of its own dataset that the client is willing to share) before the federated learning process starts, or by the server collecting a small amount of data on its own. In this case, existing approaches LeCun et al. (1990); Han et al. (2015); Lee et al. (2018) can be used for pruning with the available sample data. We primarily focus on the joint pruning and training approach of Han et al. (2015) since our ultimate goal is model training, but our framework applies to other pruning techniques as well.

- Sample-less pruning: In this case, no data samples are available at the server. The model is pruned in an "uninformed" way, either in an initialization-based manner, where a given amount of weights with small magnitudes are removed right after model initialization, or in a random manner, where randomly selected weights are pruned regardless of their magnitudes.
One would normally expect the quality of the pruned model in the above two cases to be poor compared to traditional pruning, which makes use of the full training dataset Han et al. (2015). Rather surprisingly, however, we find that the model quality can be effectively retained even when the pruning is done using only a small amount of data. Even more surprising is that sample-less pruning can also retain the model quality to a certain extent, according to our experiments in Section 5.
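A minimal sketch of the random variant of sample-less pruning (the initialization-based variant would instead apply the magnitude criterion to the freshly initialized weights); all names and numbers are illustrative:

```python
import random

# Random pruning: zero out a fraction of the weights chosen uniformly at
# random, ignoring their magnitudes entirely.

def random_prune(weights, pruning_rate, seed=0):
    """Zero out a randomly chosen `pruning_rate` fraction of the weights."""
    rng = random.Random(seed)
    pruned_idx = set(rng.sample(range(len(weights)),
                                int(len(weights) * pruning_rate)))
    return [0.0 if i in pruned_idx else w for i, w in enumerate(weights)]

w_init = [0.4, -0.1, 0.25, -0.33, 0.07, 0.5]   # untrained, initialized weights
w_pruned = random_prune(w_init, 0.5)           # half the weights removed
```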
3.2.2 Further Pruning Involving both Server and Clients
After the initial pruning, the system can further perform repeated training and pruning operations on top of the standard federated learning procedure. Here, the system performs one or a few federations (training using the federated learning procedure), followed by a pruning step that removes a certain number of small-magnitude parameters from the model. We call this federated pruning. Compared with pruning at the server, the benefit of federated pruning is that it incorporates the impact of the local data available at clients (reflected in the model updates produced by distributed gradient descent in the federated learning process). The federated pruning step is optional; when we only apply the initial server-based pruning and not federated pruning, we call it one-shot pruning in this paper.
3.2.3 Overall Procedure
The initial pruning at the server is often needed so that the initial model is small enough to be trained on edge devices without consuming too much time. If desired, federated pruning can further reduce the model to an even smaller size. In summary, our framework handles four cases of pruning: (i) sample-based and one-shot, (ii) sample-less and one-shot, (iii) sample-based and federated, and (iv) sample-less and federated. A consolidated procedure that covers all these cases can be described as follows (Figure 1):

1. When appropriate, the server requests the clients to send a small portion of their data.

2. The server initializes the model parameters and performs the initial pruning (either with or without sample data) until a desired model size is reached.

3. The server distributes the initially pruned model to the clients.

4. Each client updates the model using its own dataset, and the updated models are aggregated by the server. If federated pruning is used, the next round of pruning is performed at the server to remove weights with small magnitudes. This iterative process is repeated until a desired pruning level is reached; afterwards, regular federated learning is performed.
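The consolidated procedure above can be sketched as a driver loop; `prune_one_level` and `train_one_federation` are placeholders for the pruning and federation steps already described, and all parameter names are ours:

```python
# Consolidated procedure: initial pruning at the server, then optional
# federated pruning every `prune_every` federations until `target_level`,
# then regular federated learning for the remaining federations.

def run(model, initial_level, target_level, prune_every, total_federations,
        prune_one_level, train_one_federation):
    level = 0
    while level < initial_level:               # initial pruning at the server
        model = prune_one_level(model)
        level += 1
    for t in range(1, total_federations + 1):
        model = train_one_federation(model)    # one federation, steps (i)-(iv)
        if level < target_level and t % prune_every == 0:
            model = prune_one_level(model)     # federated pruning step
            level += 1
        # once target_level is reached, the loop above continues as
        # regular federated learning
    return model, level
```

Setting `initial_level == target_level` recovers one-shot pruning, since the federated pruning branch is then never taken.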
Table 1: Network architectures and training settings.

| Network | LeNet-300-100 | Conv-FashionMNIST | Conv-FEMNIST |
| Convolutions | None | 32, pool, 64, pool | 32, pool, 128, pool |
| Fully-connected layers | — | — | — |
| Conv/FC/all parameters | 0/266.6K/266.6K | 18.8K/722.6K/741.4K | 103.4K/2654.8K/2758.1K |
| Optimizer | SGD (LR = 0.5) | SGD (LR = 0.25) | SGD (LR = 0.25) |
| Pruning rates | — | — | — |
| Number of samples | 200 | 200 | 389 |
| Number of clients | 10 | 5 | 5 |
| Local updates/batch size | 5/20 | 5/20 | 5/20 |
4 Implementation
4.1 Using Sparse Matrices
Throughout this paper, we use dense matrices for the original networks and sparse matrices for the weights of fully-connected (FC) layers in pruned networks. The format we choose for sparse matrices is the coordinate list (COO): it stores a list of <row, column, value> triplets sorted first by row index and then by column index, in order to improve random access times. Although the computational benefit of model pruning is frequently mentioned in the literature from a theoretical point of view (e.g., Han et al. (2015)), most existing implementations substitute sparse parameters by applying binary masks to dense parameters. Applying masks increases the computational overhead instead of reducing it. In this paper, we implement model pruning with actual sparse matrices, and we show its efficacy in the following sections.
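The COO conversion described above can be sketched in pure Python (illustrative names; production code would use a library's sparse type instead):

```python
# Convert a dense matrix (list of rows) into COO triplets, sorted first by
# row index and then by column index as described in the text.

def to_coo(dense):
    """Return the <row, column, value> triplets of all nonzero entries."""
    return [(i, j, v)
            for i, row in enumerate(dense)
            for j, v in enumerate(row)
            if v != 0.0]

dense = [[0.0, 1.5],
         [2.0, 0.0]]
coo = to_coo(dense)   # [(0, 1, 1.5), (1, 0, 2.0)]
```

The row-major iteration order produces the sorted triplet list directly, so no explicit sort is needed.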
4.2 Complexity Analysis
Computation. Suppose $W$ is an $m \times n$ sparse matrix with a fraction $\rho$ of nonzero entries, stored in COO format, representing the weights at a layer in the neural network, and $X$ is a dense $n \times p$ matrix representing the layer's input. The computational complexity of the matrix multiplication $WX$ is

$O(\rho m n p)$ (1)

using Sparse BLAS Duff et al. (2002). Yet, regarding computation time, if matrix multiplication can be efficiently parallelized (e.g., on GPUs), sparse matrix computation is not necessarily faster than dense matrix computation. Because dense matrix multiplication is heavily optimized, sparse matrices show an advantage in computation time only when the matrix is beyond a certain degree of sparsity (percentage of zero entries), where this sparsity threshold depends on the specific hardware and software implementation.
Storage, memory, and communication. The storage, memory, and communication overhead is proportional to the size of the parameters. In our implementation, we use the integer type with the smallest number of bits needed to store the sparse matrix indices, and 32-bit floating point numbers to store the values. More precisely, when using $q$-bit integers for indices and 32-bit floating point numbers for values, the ratio of the sparse parameter size to the dense parameter size is

$\frac{(2q + 32)\,n}{32N}$ (2)

where $n$ is the number of nonzero entries and $N$ is the total number of entries in the sparse parameter. For example, in the case where $q = 16$, each nonzero entry costs $64$ bits, so it is advantageous to use sparse matrices when $n/N < 1/2$.
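A quick numerical check of this break-even point, assuming q-bit integer indices and 32-bit float values as in the text (the function name is ours):

```python
# Ratio of sparse (COO) parameter size to dense parameter size:
# each COO triplet costs 2q + 32 bits (two q-bit indices plus one 32-bit
# value), versus 32 bits per entry of the dense matrix.

def sparse_to_dense_ratio(n_nonzero, n_total, q=16):
    return (2 * q + 32) * n_nonzero / (32 * n_total)

# With q = 16, each triplet costs 64 bits, so the representation breaks
# even exactly when half of the entries are nonzero.
r = sparse_to_dense_ratio(n_nonzero=50, n_total=100, q=16)  # -> 1.0
```

This matches the observation later in the paper that, with 16-bit indices, the size reduction appears only after 50% of the parameters are pruned.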
4.3 Implementation Challenges
As of today, well-known machine learning frameworks have limited support for sparse matrix computation. For instance, in the current implementation of PyTorch, sparse matrices do not support persistent storage, computations on sparse matrices are slow, and sparse matrices are not supported for the convolutional layers of convolutional neural networks (CNNs). Therefore, in our implementation for the experiments presented in Section 5, we use actual sparse matrices in fully-connected layers and apply binary masks to the convolutional layers as a surrogate for sparse matrices. Consequently, for models where the computation in convolutional layers dominates, we do not see an apparent decrease in computation time. We note that this problem is solvable in the future by implementing and optimizing efficient sparse matrix multiplication (particularly for convolutional layers) at the software level, as well as by developing dedicated hardware for this purpose. Nevertheless, compared with existing work, the novelty of our implementation is that we use a sparse matrix representation in the fully-connected layers of the pruned model.
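The binary-mask surrogate amounts to an element-wise product of the weight tensor with a 0/1 mask before each use; plain nested lists stand in for the framework's tensors here, and all names are illustrative:

```python
# Binary-mask surrogate for pruned convolutional weights: the dense weight
# array is kept, and pruned positions are zeroed by an element-wise mask.

def apply_mask(weights, mask):
    """Element-wise product of a 2-D weight array and its 0/1 mask."""
    return [[w * m for w, m in zip(w_row, m_row)]
            for w_row, m_row in zip(weights, mask)]

kernel = [[0.5, 0.2], [0.1, 0.9]]
mask   = [[1, 0], [0, 1]]            # 0 marks a pruned weight
masked = apply_mask(kernel, mask)    # [[0.5, 0.0], [0.0, 0.9]]
```

Note that the full dense product is still computed, which is exactly why masking adds overhead rather than saving computation.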
We note that there also exist pruning techniques where the resulting pruned model is constrained to use a dense matrix representation Lym et al. (2019), in which case sparse matrix computation is not needed. Our framework of joint federated learning and model pruning can directly support such pruning methods as well. However, we focus on sparse matrices in this paper as they provide a higher degree of freedom for the pruned model.
5 Experimentation
To study the performance of our proposed approach, we conduct experiments in (i) a real edge computing system, where the server is a personal computer and the clients are Raspberry Pi devices, and (ii) a simulated setting with multiple clients and a server.
Pruning methods. For the initial sample-based pruning and federated pruning, we use the magnitude-based method described in Section 3.2. For sample-less pruning, we employ two methods: (i) initialization-based pruning, where pruning is based on the magnitudes of the model parameter values at initialization, and (ii) random pruning, where model parameters are pruned randomly. Besides the above pruning methods, two other benchmarks are also considered: the baseline accuracy and the upper bound accuracy. The baseline accuracy is the test accuracy of the model trained (but not pruned) only on the sample data available at the server, and the upper bound accuracy is the accuracy of the original (unpruned) model trained on the entire dataset (the union of all clients' local datasets).
Models and datasets. Three different models are studied in our experiments: LeNet-300-100 for the MNIST dataset LeCun et al. (1998), Conv-FashionMNIST for the Fashion-MNIST dataset Xiao et al. (2017), and Conv-FEMNIST for the Federated Extended MNIST (FEMNIST) dataset Caldas et al. (2018); their details can be found in Table 1. Due to space limitations, the results for Conv-FashionMNIST are included in the appendix. We use SGD with a constant learning rate for training the neural networks in all experiments. When using sample-based approaches, the models are trained on the samples for 50 epochs at the server. In each federation, all clients update their local parameters 5 times, each time using SGD with a mini-batch size of 20.
5.1 MNIST with i.i.d. Data Partition
We first consider the MNIST dataset using the fully-connected LeNet-300-100 architecture LeCun et al. (1998). We prune the network for up to 30 levels, removing 20%, 20%, and 10% of the weights in the three layers, respectively, at each level. We perform both experiments on Raspberry Pi's and simulations. For the experiments, we use a system where the server is a personal computer and the clients are ten Raspberry Pi devices (five Pi 3 Model B's and five Pi 4 Model B's); their specifications can be found in Table 2. Raspberry Pi devices are not equipped with GPUs, so training and inference are performed only on CPUs. The server and clients communicate with each other using the TCP protocol.
Table 2: Specifications of the Raspberry Pi devices.

| Device | CPU | RAM | Storage |
| Pi 3 | 4 cores/1.2GHz | 1GB | 32GB SD card |
| Pi 4 | 4 cores/1.5GHz | 2GB | 32GB SD card |
5.1.1 Time Measurements of One Federation
We first present the time measurements of one federation on real devices. We implement the original LeNet-300-100 model, as well as models pruned at every 2 levels up to level 30, on the Raspberry Pi devices, and measure the average elapsed time on both the server and the clients over 100 federations.
Figure 2 shows the average total time, computation time, and communication time in one federation as we vary the pruning level. We also plot in this figure the actual file size of the parameters that are exchanged between the server and the clients. The original network crashes the Raspberry Pi 3's, i.e., the system dies when training on the first mini-batch due to resource exhaustion. Thus, we annotate the time measurements at pruning level 0 with "inf.".
Computation time. We see from Figure 2 that the original model cannot be trained on Raspberry Pi 3's, and that, as the pruning level increases, the computation time decreases from 1.87 seconds per federation to 0.07 seconds per federation. This result agrees with the discussion in Section 4. Additionally, we plot in Figure 3 the inference time for 10,000 test data samples, averaged over 100 experiments. On Raspberry Pi versions 3 and 4, after pruning 13 and 9 levels, respectively, the inference time of models using sparse weights becomes smaller than that of the original model.
Communication time. Compared with the computation time, the decrease in communication time is even more noticeable: it drops from 10.57 seconds per federation to 0.48 seconds per federation. Since the sparse representation doubles the storage per parameter (we use 16-bit integers for the indices), the benefit of size reduction comes only after 50% of the parameters are pruned (level 4 in Figure 2), according to (2).
5.1.2 Model Training on Raspberry Pi’s
Table 3: Number of federations / time to reach the 95% accuracy target.

| Level | 5 | 10 | 15 | 20 |
| Sample-based | 0.5K/6322s | 1.0K/3958s | 1.5K/3157s | 7.3K/7214s |
| Initialization-based | 0.7K/10093s | 1.6K/5320s | 9.2K/5964s | N/A |
| Random | 0.8K/10907s | 1.9K/7130s | 12.4K/8996s | N/A |
Next, we look at how long it takes for the federated learning to reach a certain accuracy when a new learning task is assigned. Figure 4 exhibits the accuracy vs. elapsed training time at pruning levels 5, 10, 15, and 20. We set a target accuracy of 95% for all of them. Training will end once it reaches the threshold for ten consecutive evaluations or reaches the 6,000 seconds time limit, whichever comes first.
Clearly, there is a tradeoff between computation time and final accuracy: larger models result in better accuracies (at convergence) but slower training, and conversely, smaller models result in worse accuracies (at convergence) but faster training, as shown in Figure 5. A good choice is in between: using pruning level 15 takes the least amount of time to reach the 95% accuracy target. Table 3 shows the number of federations and the time spent to reach the given accuracy target. The sample-based approach reaches the accuracy target with fewer federations and, more importantly, less time. Additionally, we find that initialization-based pruning always performs comparably to or better than random pruning; an intuitive explanation can be found in the discussion in Section 6.3.
In Figure 6, we consider all four possible cases: {sample-based, sample-less} × {one-shot, federated} pruning. For federated pruning, we start from level 5 and prune the model every 100 federations until level 15. When sample data are available at the server, we use a sample data size of 200 for the initial pruning. It is clear that pruning with samples always gives better final accuracy than pruning without samples, and similarly, federated pruning is always better than one-shot pruning at convergence. Nevertheless, federated pruning slows down training at the early stage, and thus the sample-based, one-shot case gives better accuracy than both the sample-less, federated and sample-based, federated cases before 1,500 seconds.
5.1.3 Simulations of Sample-based, One-shot Pruning
To extend the experiments on Raspberry Pi's, we conduct simulations of sample-based, one-shot pruning for 10,000 federations at all 30 pruning levels. We repeat the simulation with 5 different random seeds.
Table 4: Test accuracies (%) of the pruning approaches and the baseline at different pruning levels.

| Level | 5 | 10 | 15 | 20 | 25 | 30 |
| Remaining parameters (%) | 32.97 | 10.97 | 3.73 | 1.35 | 0.56 | 0.30 |
| Sample-based | 98.1 | 97.7 | 97.1 | 95.4 | 91.7 | 84.5 |
| Initialization-based | 98.1 | 97.4 | 95.2 | 86.1 | 31.8 | 14.5 |
| Random | 98.0 | 97.3 | 95.1 | 85.5 | 30.1 | 13.2 |
| Baseline | 83.2 | 82.8 | 82.4 | 81.3 | 79.5 | 76.1 |
In Figure 7, we plot the test accuracy of the network at 1K, 2K, and 10K federations for the three pruning methods (sample-based, initialization-based, and random). The x-axis of each figure represents the pruning level (and the percentage of remaining parameters). At all federations, the test accuracy decreases monotonically as the pruning level increases. The reason is twofold: first, the learnability of sparse networks is significantly reduced as more and more weights are removed; second, since we use one-shot pruning, the model is pruned using a small fraction of the training data, so the resulting sparse model might not represent a good subnetwork architecture (per the winning ticket hypothesis in Frankle & Carbin (2019)). That being said, our sample-based approach achieves better test accuracies than the initialization-based and random approaches at all pruning levels, especially when the network is highly sparse after pruning (i.e., at high pruning levels). Table 4 lists the accuracies of all pruning approaches and their baselines. For example, at pruning level 20, where only 1.35% of the parameters are left, sample-based one-shot pruning still achieves 95.4% accuracy while the other two approaches achieve significantly lower accuracies (around 86%).
5.2 FEMNIST with Non-i.i.d. Data Partition
Next, we consider the FEMNIST dataset with the Conv-FEMNIST network. FEMNIST is a benchmark dataset for federated learning settings. It contains images comprising 62 classes of handwritten digits and letters (both lowercase and uppercase), collected from 3,500 different writers, and it provides an option of retrieving non-i.i.d. distributed images partitioned by writer. We extract a biased subset of 389 samples from 2 writers out of the 35,948 data samples and study whether a biased subset of sample data can still be effective for initial pruning. We also explore the performance when convolutional layers are pruned.
The Conv-FEMNIST network consists of 2 convolutional layers and 3 FC layers. At each pruning level, we prune 20%, 20%, and 10% of the weights in the FC layers and 5% and 10% in the convolutional layers, up to 30 pruning levels. Due to the limited support for sparse matrices, we use element-wise multiplication of the weights and their binary masks in the convolutional layers as a surrogate for actual sparse matrices. Because of the excessive training time on Raspberry Pi's, we first measure the computation and communication time on Raspberry Pi 4's and then use the measured times to simulate the model training process.
5.2.1 Time Measurements of One Federation
The measurements of computation, communication, and total time of each federation on Raspberry Pi 4's can be found in Figure 8. The communication time is 24.34 seconds per federation with the original network and 39.15 seconds per federation with the sparse matrix representation at level 2, from which it decreases as the level increases, finally down to 1.54 seconds per federation at level 30. This trend coincides with the change in the parameter size, which goes from 10.53 MB at level 0 to 16.62 MB at level 2 due to the use of sparse matrices, and then decreases to 0.44 MB at level 30. The parameter size becomes comparable to that of the original network at around level 4.
Compared with the results in Figure 2, where all layers are fully-connected, the reduction in computation time here is less prominent. The reason is that the majority of the computation is in the convolutional layers, for which sparse matrix computation is not supported, as explained in Section 4.3. To illustrate how pruning can potentially reduce computational complexity, the theoretical complexity and the actual computation time are shown in Figure 9. The complexity is defined as the total number of multiplication operations in the convolutional layers according to (1). The actual computation time per federation initially jumps from 5.68 seconds to 13.65 seconds, and then gradually decreases to 3.97 seconds at level 30. This is worse than the theoretical, near-exponentially decreasing computational complexity, leaving room for further optimization in reducing the computation time.
5.2.2 Simulated Model Training
In Figure 10, we compare the test accuracy vs. training time of the original network and the networks pruned to levels 4, 8, 12, 16, and 20 using sample-based, one-shot pruning. In this figure, the sparser the model, the faster it learns, mainly due to the reduction in training time. The average time to complete one federation with the original network is 4.52 times the time at pruning level 20.
In Figure 11, we compare the four possible combinations of {sample-based, sample-less} × {one-shot, federated} pruning. For federated pruning, we start from level 5 and prune the model every 100 federations until level 15. One-shot pruning takes the model directly to level 15 using initialization-based pruning. We see an apparent advantage of sample-based pruning. It is worth mentioning that sample-less, federated pruning remains unimproved throughout the entire run (Figure 11), possibly due to the "information loss" phenomenon detailed in Section 6.1.
Recall that our sample data at the server only include images from two writers. Hence, we have shown that model pruning works with slightly biased samples as well. Because of this, one can also consider a federated learning framework with only edge devices and no server: the model can be pruned at a powerful edge device (e.g., a Raspberry Pi 4) using its own data and then dispatched to the other devices. Federated learning can be carried out in a completely distributed manner afterwards.
Similar to the observations in Section 5.1, the sample-based approach always learns better than the other two approaches after any number of federations at all pruning levels. More importantly, Figure 12 reveals that initialization-based and random pruning have surprisingly slow convergence in the early stage of training. The sample-based approach starts learning immediately from the clients' data, but the other two approaches have an extremely long "cold start", i.e., a period during which the model performs no better than random guessing. This is particularly undesirable for time-sensitive tasks: if a task needs to predict reasonably well as quickly as possible, a sample-based approach is highly preferred.
Similar to Figure 7, we extend the simulation and compare the test accuracy of one-shot sample-based, initialization-based, and random pruning at all 30 levels over 10K federations using 5 different seeds. The results are as expected (Figure 13): higher pruning levels result in worse accuracy, and the sample-based approach outperforms initialization-based and random pruning.
6 Discussion
6.1 Robustness of Neural Network Architecture
The sample-less pruning methods we employ (initialization-based and random) are essentially "uninformed" ways to remove the network's weights. We observed that such methods can still retain model quality, but only up to a certain level. A natural question to ask is therefore: how much can we prune the network weights in an uninformed way before the input information is mostly lost during the forward pass due to extreme sparsity? In other words, when a network is pruned beyond a certain level without any training data, some output neuron values may remain constant no matter how the inputs vary (e.g., an output neuron losing all of its incoming connections).
To investigate this, we train the LeNet-300-100 model to level 30 using 5 different seeds, then vary the input and observe the variation in the output. On average, initialization-based and random pruning leave 3.2 and 3.8 constant entries, respectively, out of the ten entries in the output layer, while there are no such entries with the sample-based approach. This result explains why, in Figure 7, there is a dramatic degradation in accuracy and a surge in instability for initialization-based and random pruning at pruning levels close to 30. Although this structural information loss could be avoided by imposing specific constraints on the network structure, we found empirically that our proposed sample-based pruning automatically guarantees the robustness of the network architecture.
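The probe described above can be reproduced in a few lines. This sketch uses a stand-in two-layer masked MLP with uninformed (random) pruning, not our exact LeNet-300-100 training setup: feed many random inputs through the sparse network and count the output entries whose value never varies.

```python
import numpy as np

rng = np.random.RandomState(0)

# Stand-in for an extremely sparse masked MLP (784 -> 300 -> 10),
# with roughly 99% of the weights removed in an "uninformed" way.
W1 = rng.randn(784, 300) * (rng.rand(784, 300) < 0.01)
W2 = rng.randn(300, 10) * (rng.rand(300, 10) < 0.01)

def forward(x):
    h = np.maximum(x @ W1, 0.0)  # ReLU hidden layer
    return h @ W2

def count_constant_outputs(n_probes=100):
    """Count output entries whose value stays constant across random inputs."""
    outs = np.stack([forward(rng.randn(784)) for _ in range(n_probes)])
    return int(np.sum(outs.std(axis=0) < 1e-9))
```

For instance, severing all incoming connections of one output neuron (`W2[:, 0] = 0`) makes that entry constant (always zero), which is exactly the failure mode observed for sample-less pruning.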
6.2 Reinitializing Parameters?
It has recently been hypothesized that deep neural networks contain subnetworks (winning tickets) that can be trained to reach a similar accuracy as the original network, when the weights in the subnetwork are reinitialized to the same values they had in the original network Frankle & Carbin (2019). It is therefore interesting to study whether it is beneficial to reinitialize the model after pruning. In Figure 14, we plot reinitialization vs. no reinitialization using sample-based, one-shot pruning at level 15 with the LeNet-300-100 network. We observe no obvious difference between the two, although the no-reinitialization approach converges marginally faster in earlier federations. For this reason, we adhere to the iterative training and pruning approach without reinitialization in this paper.
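The comparison above reduces to a single choice applied after pruning. In this sketch, `init` denotes a snapshot of the parameters at initialization (as in the lottery-ticket hypothesis); the function name is illustrative, not from our implementation:

```python
import numpy as np

def apply_pruning(trained, init, mask, reinitialize):
    """Zero out pruned weights; optionally rewind the survivors to their
    initial values (lottery-ticket style) instead of keeping their
    trained values (our default in this paper)."""
    source = init if reinitialize else trained
    return np.where(mask, source, 0.0)
```

With `reinitialize=False` (our default), the surviving weights simply continue training from their current values after each pruning step.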
6.3 How Important is Parameter Initialization?
Pruning Level                   |    5 |   10 |   15 |   20
--------------------------------|------|------|------|-----
Pct. of #Params. Remaining (%)  | 59.2 | 35.1 | 20.9 | 12.6
Overlap (Largest) (%)           | 73.3 | 58.1 | 41.1 | 35.7
Overlap (Smallest) (%)          | 44.6 | 12.5 |  4.8 |  0.8
Here we intuitively explain why, in Figure 7, initialization-based pruning works better than random pruning. To do so, we first obtain the pruned model (using regular centralized pruning Han et al. (2015)) and count the number of parameters that survive pruning. We then extract an equal number of the largest/smallest parameters (by magnitude) from the original, unpruned model at initialization. Finally, we intersect each of these sets with the pruned model's parameter set and calculate the overlap ratio. We measure this for the parameters of the first FC layer in LeNet-300-100. The results are in Table 5, where the second row gives the percentage of remaining parameters, and the bottom two rows give the overlap ratios of the largest/smallest values as defined above. We find that parameters initialized to large values are likely to be kept by the pruning procedure, while those initialized to small values are likely to be eliminated. Admittedly, this observation is informal and depends on other hyperparameters such as the learning rate. Still, it qualitatively explains the phenomenon: initialization-based pruning removes parameters that are likely to be eliminated eventually and keeps those that are likely to be kept eventually.
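The overlap measurement can be sketched as follows (a minimal version, assuming a flattened weight array; here k is the number of surviving parameters, so a random selection would give an overlap ratio equal to the remaining-parameter percentage in the second row of Table 5):

```python
import numpy as np

def overlap_ratios(init_weights, surviving_mask):
    """Overlap between the surviving parameter set and the k largest /
    k smallest initial parameters (by magnitude), where k is the
    number of survivors after pruning."""
    k = int(surviving_mask.sum())
    order = np.argsort(np.abs(init_weights))  # ascending magnitude
    largest = np.zeros_like(surviving_mask)
    largest[order[-k:]] = True                # k largest initial params
    smallest = np.zeros_like(surviving_mask)
    smallest[order[:k]] = True                # k smallest initial params
    return ((largest & surviving_mask).sum() / k,
            (smallest & surviving_mask).sum() / k)
```

A largest-overlap well above the remaining-parameter fraction (and a smallest-overlap well below it) is what Table 5 shows, indicating that large initial weights tend to survive magnitude pruning.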
6.4 Impact of Sample Data Size
Intuitively, the more samples we use for pruning, the better the resulting subnetwork. We now increase the sample data size from 200 (as in Section 5.1) to 400, 600, and 800, and finally to the entire data set of 60,000 samples, which corresponds to the setting of Han et al. (2015). We study the impact of sample size with LeNet-300-100 and present the results in Figure 15.
It is clear that using more sample data is advantageous in several respects. Even without the federated learning stage, the starting test accuracy at training iteration 0 is immediately higher when more samples are used. Moreover, with more samples, both the convergence speed and the final achievable accuracy improve over using fewer samples.
It should be emphasized, however, that using more sample data inevitably increases the pruning time at the server. On an Amazon g3s.xlarge instance, training 50 epochs (after which the model is pruned by one level) on 200, 400, 600, 800, and 60,000 samples takes 10, 16, 24, 30, and 2,158 seconds, respectively. Training on large sample sets may not be affordable due to time or resource limits, and, most importantly, such samples are often not available at the server at all.
7 Conclusion
In this paper, we have proposed a new model pruning framework for federated learning in edge/mobile computing environments, with the goal of effectively reducing the size of deep neural network models so that resource-limited clients can train them on their own data and contribute to the federated learning process. Through complexity analysis and extensive experiments on both simulated and real devices, we have shown that the framework achieves this goal while requiring participating clients to share little or no data with others, thus preserving the main benefit of federated learning, i.e., the privacy of clients' data, while dramatically reducing the communication and computation load. We have also discussed additional insights gained from our experimental analysis on the effectiveness of model pruning under various conditions regarding data size and initialization. These insights and experimental results suggest further research directions on model pruning in federated learning, such as measuring the impact of different optimization methods on the efficacy and quality of model pruning in federated learning.
References
 Buciluǎ et al. (2006) Buciluǎ, C., Caruana, R., and NiculescuMizil, A. Model compression. In Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 535–541. ACM, 2006.
 Caldas et al. (2018) Caldas, S., Wu, P., Li, T., Konecný, J., McMahan, H. B., Smith, V., and Talwalkar, A. LEAF: A benchmark for federated settings. CoRR, abs/1812.01097, 2018. URL http://arxiv.org/abs/1812.01097.
 Dryden et al. (2016) Dryden, N., Moon, T., Jacobs, S. A., and Van Essen, B. Communication quantization for data-parallel training of deep neural networks. In 2016 2nd Workshop on Machine Learning in HPC Environments (MLHPC), pp. 1–8. IEEE, 2016.
 Duff et al. (2002) Duff, I. S., Heroux, M. A., and Pozo, R. An overview of the sparse basic linear algebra subprograms: The new standard from the BLAS Technical Forum. ACM Transactions on Mathematical Software (TOMS), 28(2):239–267, 2002.
 Frankle & Carbin (2019) Frankle, J. and Carbin, M. The lottery ticket hypothesis: Finding sparse, trainable neural networks. In International Conference on Learning Representations, 2019.
 Gupta et al. (2015) Gupta, S., Agrawal, A., Gopalakrishnan, K., and Narayanan, P. Deep learning with limited numerical precision. In International Conference on Machine Learning, pp. 1737–1746, 2015.
 Han et al. (2015) Han, S., Pool, J., Tran, J., and Dally, W. Learning both weights and connections for efficient neural network. In Advances in neural information processing systems, pp. 1135–1143, 2015.
 He et al. (2016) He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778, 2016.
 Hinton et al. (2015) Hinton, G., Vinyals, O., and Dean, J. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.
 Howard et al. (2017) Howard, A. G., Zhu, M., Chen, B., Kalenichenko, D., Wang, W., Weyand, T., Andreetto, M., and Adam, H. MobileNets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861, 2017.
 Hubara et al. (2017) Hubara, I., Courbariaux, M., Soudry, D., ElYaniv, R., and Bengio, Y. Quantized neural networks: Training neural networks with low precision weights and activations. The Journal of Machine Learning Research, 18(1):6869–6898, 2017.
 Konečnỳ et al. (2016) Konečnỳ, J., McMahan, H. B., Yu, F. X., Richtárik, P., Suresh, A. T., and Bacon, D. Federated learning: Strategies for improving communication efficiency. arXiv preprint arXiv:1610.05492, 2016.
 LeCun et al. (1990) LeCun, Y., Denker, J. S., and Solla, S. A. Optimal brain damage. In Advances in neural information processing systems, pp. 598–605, 1990.
 LeCun et al. (1998) LeCun, Y., Bottou, L., Bengio, Y., Haffner, P., et al. Gradientbased learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
 Lee et al. (2018) Lee, N., Ajanthan, T., and Torr, P. H. SNIP: Single-shot network pruning based on connection sensitivity. arXiv preprint arXiv:1810.02340, 2018.
 Li et al. (2018) Li, H., Ota, K., and Dong, M. Learning iot in edge: Deep learning for the internet of things with edge computing. IEEE Network, 32(1):96–101, 2018.
 Li et al. (2019) Li, T., Sahu, A. K., Talwalkar, A., and Smith, V. Federated learning: Challenges, methods, and future directions. arXiv preprint arXiv:1908.07873, 2019.
 Lin et al. (2018) Lin, Y., Han, S., Mao, H., Wang, Y., and Dally, B. Deep gradient compression: Reducing the communication bandwidth for distributed training. In International Conference on Learning Representations, 2018.
 Liu et al. (2019) Liu, J., Liu, J., Du, W., and Li, D. Performance analysis and characterization of training deep learning models on NVIDIA TX2. arXiv preprint arXiv:1906.04278, 2019.
 Lym et al. (2019) Lym, S., Choukse, E., Zangeneh, S., Wen, W., Erez, M., and Sanghavi, S. PruneTrain: Gradual structured pruning from scratch for faster neural network training. arXiv preprint arXiv:1901.09290, 2019.
 McMahan et al. (2017) McMahan, H. B., Moore, E., Ramage, D., Hampson, S., and y Arcas, B. A. Communicationefficient learning of deep networks from decentralized data. In International Conference on Artificial Intelligence and Statistics (AISTATS), 2017.
 Molchanov et al. (2016) Molchanov, P., Tyree, S., Karras, T., Aila, T., and Kautz, J. Pruning convolutional neural networks for resource efficient inference. arXiv preprint arXiv:1611.06440, 2016.
 Park et al. (2019) Park, J., Wang, S., Elgabli, A., Oh, S., Jeong, E., Cha, H., Kim, H., Kim, S.L., and Bennis, M. Distilling ondevice intelligence at the network edge. arXiv preprint arXiv:1908.05895, 2019.
 Simonyan & Zisserman (2014) Simonyan, K. and Zisserman, A. Very deep convolutional networks for largescale image recognition. arXiv preprint arXiv:1409.1556, 2014.
 Strom (2015) Strom, N. Scalable distributed dnn training using commodity gpu cloud computing. In Sixteenth Annual Conference of the International Speech Communication Association, 2015.
 Wang et al. (2019) Wang, S., Tuor, T., Salonidis, T., Leung, K. K., Makaya, C., He, T., and Chan, K. Adaptive federated learning in resource constrained edge computing systems. IEEE Journal on Selected Areas in Communications, 37(6):1205–1221, June 2019. ISSN 07338716. doi: 10.1109/JSAC.2019.2904348.
 Xiao et al. (2017) Xiao, H., Rasul, K., and Vollgraf, R. Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747, 2017.
 Yao et al. (2017) Yao, S., Zhao, Y., Zhang, A., Su, L., and Abdelzaher, T. DeepIoT: Compressing deep neural network structures for sensing systems with a compressor-critic framework. In Proceedings of the 15th ACM Conference on Embedded Network Sensor Systems, pp. 4. ACM, 2017.
Appendix A Fashion-MNIST with i.i.d. Data Partition
The analysis on Fashion-MNIST data agrees with the analysis in Section 5.2 to a large extent. We therefore present the experimental results in the appendix without further explanation; each figure is associated with its counterpart in the previous sections.
Figures 16 and 17 correspond to Figures 8 and 9, respectively; they show the actual computation/communication time as well as the theoretical computational complexity. Figure 18 corresponds to Figure 12 and illustrates the three one-shot pruning approaches at pruning levels 10, 20, and 30. Figure 19 corresponds to Figure 13 and compares the test accuracy of sample-based, initialization-based, and random pruning in a one-shot setting as the pruning level increases.