Abstract

Federated learning is a recent approach to distributed model training that does not require sharing clients' raw data. It allows model training on the large amounts of user data collected by edge and mobile devices while preserving data privacy. A challenge in federated learning is that the devices usually have much lower computational power and communication bandwidth than machines in data centers. Training large deep neural networks in such a federated setting can therefore consume a large amount of time and resources. To overcome this challenge, we propose a method that integrates model pruning with federated learning. It includes initial model pruning at the server and further model pruning as part of the federated learning process, followed by the regular federated learning procedure. Our proposed approach reduces the computation, communication, and storage costs compared to standard federated learning approaches. Extensive experiments on real edge devices validate the benefits of the proposed method.

Model Pruning Enables Efficient Federated Learning on Edge Devices

Yuang Jiang (Yale University), Shiqiang Wang (IBM T. J. Watson Research Center), Bong Jun Ko (Stanford Institute for Human-Centered Artificial Intelligence), Wei-Han Lee (IBM T. J. Watson Research Center), Leandros Tassiulas (Yale University)

Affiliations: Yale University, New Haven, CT, USA; IBM T. J. Watson Research Center, Yorktown Heights, NY, USA; Stanford Institute for Human-Centered Artificial Intelligence (HAI), CA, USA

Corresponding authors: Yuang Jiang <yuang.jiang@yale.edu>, Shiqiang Wang <wangshiq@us.ibm.com>

Keywords: Edge/mobile devices, federated learning, model pruning, resource efficiency

1 Introduction

Federated learning, since its inception, has attracted considerable attention due to its capability of distributed model training using data collected by possibly a large number of edge and mobile devices, such as Internet of Things (IoT) gateways and smartphones McMahan et al. (2017); Park et al. (2019); Li et al. (2019). The federated learning procedure includes local computations at clients (edge/mobile devices) and model parameter exchange among clients and a server in an iterative manner. The significance of this procedure is that the clients’ data remain local and are not shared with others, which preserves data privacy and is more resource-efficient than transmitting all the data to a central server.

However, modern DNNs (deep neural networks) could contain hundreds of millions of parameters Simonyan & Zisserman (2014); He et al. (2016); training such large models directly on edge devices is often infeasible due to resource (such as memory) limitation or is otherwise very slow. In addition, the communication of such a large number of parameters between clients and the server is also a major impediment in federated learning, since the clients are often geo-distributed edge devices that need to upload their local models to the parameter server frequently McMahan et al. (2017); Lin et al. (2018).

Approaches to reducing the communication overhead in federated learning have been proposed recently Konečnỳ et al. (2016), which, however, do not reduce the complexity of local computation at clients. For efficient computation, new model architectures such as MobileNet Howard et al. (2017) and model pruning techniques Molchanov et al. (2016) have been developed, where the model is trained/pruned at the server with centrally available data, then deployed at the edge for efficient inference. These methods have not been applied in the federated learning setting where data are decentralized in clients.

In this paper, we aim at jointly overcoming the bottlenecks of computation, communication, and storage, and propose a new federated learning paradigm combined with distributed model pruning. The benefit of model pruning is that it offers a high degree of freedom in the number of parameters to prune, which can be adapted to the computation and communication capabilities of the clients involved in the federated learning process. To the best of our knowledge, we are the first to propose such a paradigm for federated learning that focuses on both computation and communication efficiency.

Our main contributions in this paper are as follows.

  1. We propose a federated model pruning mechanism where the server can initially perform some degree of pruning based on the initialized (untrained) model, possibly using a small amount of sample data if available; further pruning can then be performed distributedly.

  2. We integrate model pruning with the federated learning procedure, providing a mechanism that jointly trains and prunes the model in a federated manner.

  3. We conduct extensive experiments with various settings and measurements on real edge devices, and discuss the insights obtained from the results.

The rest of this paper is organized as follows. In Section 2, we present the related work. In Section 3, we introduce preliminaries and the proposed approach. Implementation details and complexity analysis can be found in Section 4. We evaluate the proposed approaches extensively in Section 5. Further issues are discussed in Section 6. Finally, we conclude our work in Section 7.

2 Related Work

In this section, we summarize existing works from three directions, including model compression, efficient communication, and implementation.

Model compression. Quantization mitigates computational cost by compressing the number of bits required for each parameter Gupta et al. (2015); Hubara et al. (2017). This, however, is orthogonal to our approach, and we will show in the following sections that our approach can easily exceed the compression upper bound of quantization (32x in the case of 32-bit parameters). Knowledge distillation Buciluǎ et al. (2006) transfers the knowledge from a large model, or possibly an ensemble of models, to a smaller model, usually following a teacher-student paradigm Hinton et al. (2015). It enables the compression of pre-trained large models to much smaller ones with minimal loss of accuracy. Nevertheless, it by no means provides a solution for small models to learn knowledge from scratch or learn new knowledge incrementally. Model pruning was first proposed in LeCun et al. (1990) to remove redundant parameters according to their importance. It was further improved in Han et al. (2015) with iterative training and pruning. A single-shot pruning method was recently proposed by Lee et al. (2018).

Efficient communication. A number of works try to overcome the communication bottleneck by sending only important gradients for aggregation, e.g., sending gradients whose magnitudes are above a certain threshold Strom (2015), or sending a fixed proportion of gradients to the server Dryden et al. (2016). The work by Lin et al. (2018) demonstrated a high communication compression ratio while achieving similar accuracy, using warm-up training and momentum manipulation in addition to gradient clipping.

Implementation. In the literature, federated learning experiments using edge devices have been done Wang et al. (2019). There are also performance measurements of deep learning on TX2 platforms Liu et al. (2019) and implementation on IoT devices Li et al. (2018); Yao et al. (2017).

Our approach is superior to the existing methods for the following reasons: 1) it overcomes not only the communication bottleneck, but also computation and storage bottlenecks (compared with gradient sparsification methods); 2) the pruned model is capable of learning incrementally from new data (compared with knowledge distillation); 3) it can easily exceed the compression upper bound of quantization methods; 4) compared to existing model pruning approaches LeCun et al. (1990); Han et al. (2015); Lee et al. (2018), our approach is fundamentally different because in our paradigm, the parameter server has access to only a small subset of data, not the entire dataset, to protect clients’ privacy. Furthermore, quantization and gradient sparsification are orthogonal to our approach and can be applied simultaneously with our approach.

3 Federated Learning with Pruning

In this section, we present our method of joint federated learning and model pruning.

3.1 Preliminaries

Standard federated learning procedure. A federated learning system consists of multiple clients, each with its own data, and a server that acts as a model parameter aggregator. In the existing (standard) federated learning procedure, when a model training task is instantiated, the server first initializes the model parameters and starts the iterative process of (i) the server distributing the model to clients, (ii) each client training the model with its own data (or a subset thereof) for a certain number of model training iterations, (iii) each client sending the updated model to the server, and (iv) the server aggregating the model parameters updated by the clients. We refer to steps i–iv above as one federation.
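For concreteness, the sketch below illustrates one federation (steps i–iv) in PyTorch-style Python. It is a minimal illustration under our own assumptions, not the paper's actual implementation; in particular, `client.sample_batches` and the hyper-parameter values are placeholders.

```python
import copy
import torch


def one_federation(server_model, clients, local_iters=5, lr=0.25):
    """One federation: distribute, local training, upload, aggregate (steps i-iv)."""
    client_states = []
    for client in clients:                                   # (i) distribute the current model
        local_model = copy.deepcopy(server_model)
        opt = torch.optim.SGD(local_model.parameters(), lr=lr)
        for x, y in client.sample_batches(local_iters):      # (ii) local SGD updates (hypothetical data API)
            opt.zero_grad()
            loss = torch.nn.functional.cross_entropy(local_model(x), y)
            loss.backward()
            opt.step()
        client_states.append(local_model.state_dict())       # (iii) upload updated parameters
    new_state = {}                                            # (iv) average the parameters
    for key in client_states[0]:
        new_state[key] = torch.stack([s[key].float() for s in client_states]).mean(dim=0)
    server_model.load_state_dict(new_state)
    return server_model
```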

As discussed in Section 1, in the edge/mobile computing environment where the clients’ resources and communication bandwidth are limited, the above standard federated learning procedure is significantly challenged when the model size is large. To address this challenge, we employ model pruning as a means to reduce the computation and communication overhead at clients.

Model pruning. In the iterative training and pruning approach proposed in Han et al. (2015) for the centralized machine learning setting, the model is first trained using stochastic gradient descent (SGD) for a given number of iterations. Then, a "level" of model pruning is performed by removing a certain percentage (referred to as the pruning rate) of weights that have the smallest absolute values, layer-wise. This training and pruning process is repeated until a desired pruning level (corresponding to a model size) is reached. The benefit of this approach is that training and pruning occur at the same time, so that a trained model with a desired (small) size is obtained in the end. However, the approach in Han et al. (2015), as well as other pruning techniques LeCun et al. (1990); Lee et al. (2018), requires the availability of training data at a central location, which is not applicable to federated learning.
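The following is a minimal sketch of one magnitude-based pruning level in the spirit of Han et al. (2015): for each prunable layer, a given fraction of the smallest-magnitude remaining weights is removed and recorded in a binary mask. The function name and the `rates` dictionary are illustrative assumptions, not the authors' code.

```python
import torch


def prune_one_level(model, rates, masks=None):
    """Remove the given fraction of smallest-magnitude *remaining* weights, layer-wise.

    `rates` maps a parameter name to its per-level pruning rate (e.g., 0.2 for an FC layer);
    a binary mask per layer records which weights have been pruned so far.
    """
    masks = masks or {name: torch.ones_like(p) for name, p in model.named_parameters()}
    with torch.no_grad():
        for name, param in model.named_parameters():
            if name not in rates:
                continue
            alive = param[masks[name].bool()].abs()           # magnitudes of still-remaining weights
            k = int(rates[name] * alive.numel())
            if k == 0:
                continue
            threshold = alive.sort().values[k - 1]            # k-th smallest remaining magnitude
            masks[name] *= (param.abs() > threshold).float()  # prune weights at/below the threshold
            param *= masks[name]                              # zero out the pruned weights
    return masks
```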

3.2 Proposed Approach

Our proposed approach includes initial model pruning at the server and further model pruning involving both the server and clients during the federated learning process. The details are described as follows.

3.2.1 Initial Pruning at the Server

Initial model pruning at the server can be done with the following two cases of pruning with respect to the data availability at the server:

  • Sample-based pruning: This case assumes the availability of some data samples at the server, typically a small subset of the training dataset. These samples may be obtained by requesting each client to provide a few samples (a small portion of its own dataset that the client is willing to share) before the federated learning process starts, or by the server collecting a small amount of data on its own. In this case, existing approaches LeCun et al. (1990); Han et al. (2015); Lee et al. (2018) can be used for pruning with the available sample data. We primarily focus on the joint pruning and training approach Han et al. (2015) since our ultimate goal is model training, but our framework applies to other pruning techniques as well.

  • Sample-less pruning: In this case, there is no data sample available at the server. The model is pruned in an "uninformed" way, either in an initialization-based manner, where a given amount of weights with small magnitudes are removed right after model initialization, or in a random manner, where randomly selected weights are pruned regardless of their magnitudes (see the sketch after this list).
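Below is a minimal sketch of the two sample-less variants, assuming a per-layer keep fraction; it builds masks either from the magnitudes of the initialized weights (initialization-based) or uniformly at random. Names such as `sample_less_masks` are illustrative assumptions, not the authors' code.

```python
import torch


def sample_less_masks(model, keep_fraction, random_prune=False):
    """Build pruning masks without any data: keep the largest-magnitude initialized
    weights (initialization-based) or a random subset (random pruning), layer-wise."""
    masks = {}
    for name, param in model.named_parameters():
        n_keep = int(keep_fraction * param.numel())
        flat = param.detach().abs().flatten()
        if random_prune:
            keep_idx = torch.randperm(flat.numel())[:n_keep]   # random pruning
        else:
            keep_idx = flat.topk(n_keep).indices               # initialization-based pruning
        mask = torch.zeros_like(flat)
        mask[keep_idx] = 1.0
        masks[name] = mask.view_as(param)
    return masks
```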

One would normally expect the quality of the pruned model in the above two cases to be poor compared to traditional pruning which makes use of a full training dataset Han et al. (2015). Rather surprisingly, however, we find that the model quality can be effectively retained even when the pruning is done using only a small amount of data. Even more surprising is that sample-less pruning can also retain the model quality to a certain extent, according to our experiments in Section 5.

3.2.2 Further Pruning Involving both Server and Clients

After the initial pruning, the system can further perform repeated training and pruning operations in addition to the standard federated learning procedure. Here, the system performs one or a few federations (training using the federated learning procedure), followed by a pruning step that removes a certain number of small-magnitude parameters from the model. We call this federated pruning. Compared with pruning at the server, the benefit of federated pruning is that it incorporates the impact of the local data available at clients (reflected in the model updates by distributed gradient descent in the federated learning process). The federated pruning step is optional. When we only apply the initial server-based pruning and not federated pruning, we call it one-shot pruning in this paper.

3.2.3 Overall Procedure

Figure 1: Illustration of Proposed Framework

The initial pruning at the server is often needed so that the initial model is small enough for training on edge devices without consuming too much time. If desired, federated pruning can further reduce the model to an even smaller size. In summary, our framework handles the four cases of pruning: (i) sample-based and one-shot, (ii) sample-less and one-shot, (iii) sample-based and federated, and (iv) sample-less and federated. A consolidated procedure that covers all these cases can be described as follows (Figure 1), with a code sketch after the list:

  1. When appropriate, the server requests the clients to send a small portion of their data.

  2. The server initializes the model parameters and performs the initial pruning (either with or without sample data) until a desired model size is reached.

  3. The server distributes the initially pruned model to the clients.

  4. Each client updates the model using its own dataset. The updated model is aggregated by the server. If federated pruning is used, the next round of pruning is performed at the server to remove weights with small magnitudes. This iterative process is repeated until a desired pruning level is reached. Afterwards, regular federated learning is performed.
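The sketch below ties these steps together, reusing `one_federation`, `prune_one_level`, and (for the sample-less case) `sample_less_masks` from the earlier sketches; `initial_pruning` stands in for whichever initial pruning variant is used at the server and is a placeholder, as are the level and schedule values.

```python
import torch


def apply_masks(model, masks):
    """Keep pruned weights at zero after each aggregation."""
    with torch.no_grad():
        for name, param in model.named_parameters():
            if name in masks:
                param *= masks[name]


def federated_pruning(server_model, clients, pruning_rates, server_samples=None,
                      start_level=5, target_level=15,
                      federations_per_level=100, total_federations=10_000):
    """Consolidated procedure: initial pruning at the server (steps 1-2), then
    federations interleaved with optional further pruning (steps 3-4)."""
    # Step 2: initial pruning at the server. With `server_samples` this would be the
    # sample-based variant (train-and-prune on the samples); without, a sample-less
    # variant such as `sample_less_masks` can be used. `initial_pruning` is a placeholder.
    masks = initial_pruning(server_model, server_samples, pruning_rates, levels=start_level)
    level = start_level
    # Steps 3-4: distribute, local training, and aggregation (see `one_federation`),
    # pruning one more level every `federations_per_level` federations until the
    # target level is reached, followed by regular federated learning.
    for t in range(total_federations):
        server_model = one_federation(server_model, clients)
        apply_masks(server_model, masks)
        if level < target_level and (t + 1) % federations_per_level == 0:
            masks = prune_one_level(server_model, pruning_rates, masks)
            level += 1
    return server_model
```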

Network | LeNet-300-100 | Conv-FashionMNIST | Conv-FEMNIST
Convolutions | None | 32, pool, 64, pool | 32, pool, 128, pool
Fully-connected layers | | |
Conv/FC/all parameters | 0 / 266.6K / 266.6K | 18.8K / 722.6K / 741.4K | 103.4K / 2654.8K / 2758.1K
Optimizer | SGD (LR = 0.5) | SGD (LR = 0.25) | SGD (LR = 0.25)
Pruning rates | Conv: N/A; FC: 20%/20%/10% | Conv: 5%/10%; FC: 20%/20%/10% | Conv: 5%/10%; FC: 20%/20%/10%
Number of samples | 200 | 200 | 389
Number of clients | 10 | 5 | 5
Local updates / batch size | 5/20 | 5/20 | 5/20
Table 1: Neural Network Architectures

4 Implementation

4.1 Using Sparse Matrices

Throughout this paper, we use dense matrices for the original networks and sparse matrices for the weights of fully-connected (FC) layers in pruned networks. The format we choose for sparse matrices is the coordinate list (COO). It stores a list of <row, column, value> triplets sorted first by row index and then by column index in order to improve random access times. Although the computational benefit of model pruning is frequently mentioned in the literature from a theoretical point of view (e.g., Han et al. (2015)), most existing implementations substitute sparse parameters by applying binary masks to dense parameters. Applying masks increases the overhead of computation instead of reducing it. In this paper, we implement model pruning with actual sparse matrices, and we will show its efficacy in the following sections.
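As a small illustration of the COO format (not the paper's exact implementation), PyTorch can convert a pruned dense weight matrix to a COO sparse tensor and multiply it with a dense input:

```python
import torch

# Dense FC weight with most entries pruned (zeroed out).
dense_w = torch.tensor([[0.0, 0.0, 0.3],
                        [0.0, -1.2, 0.0]])

# COO representation: <row, column, value> triplets for the non-zero entries.
sparse_w = dense_w.to_sparse()            # indices: [[0, 1], [2, 1]], values: [0.3, -1.2]
x = torch.randn(3, 4)                     # layer input
out = torch.sparse.mm(sparse_w, x)        # sparse-dense matrix multiplication
assert torch.allclose(out, dense_w @ x)   # same result as the dense computation
```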

4.2 Complexity Analysis

Computation. Suppose $W$ is an $m \times n$ sparse matrix in COO format with a fraction $\theta$ of non-zero entries, representing the weights at a layer of the neural network, and $X$ is an $n \times p$ dense matrix representing the layer's input. The computational complexity of the matrix multiplication $WX$ is

$O(\theta m n p)$   (1)

using Sparse BLAS Duff et al. (2002). Yet, regarding computation time, if matrix multiplication can be efficiently parallelized (e.g., on GPUs), computation with sparse matrices is not necessarily faster than dense matrix computation. Because dense matrix multiplication is highly optimized, sparse matrices show an advantage in computation time only when the matrix is beyond a certain degree of sparsity (percentage of zero entries), where this sparsity threshold depends on the specific hardware and software implementation.

Storage, memory, and communication. The storage, memory, and communication overhead is proportional to the size of the parameters. In our implementation, we use the integer type (with the smallest number of bits needed) to store the sparse matrix indices, and 32-bit floating point numbers to store the values. More precisely, when using $b$-bit integers for indices and 32-bit floating point numbers for values, the ratio of the sparse parameter size to the dense parameter size is

$\frac{(2b + 32)\, n_{\mathrm{nz}}}{32\, N}$   (2)

where $n_{\mathrm{nz}}$ is the number of non-zero entries and $N$ is the total number of entries in the sparse parameter. For example, in the case where $b = 16$, it is advantageous to use sparse matrices when $n_{\mathrm{nz}}/N < 50\%$.
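Equation (2) can be checked with a few lines of code; the function name is ours, and 16-bit indices with 32-bit values match the setting used later in Section 5:

```python
def sparse_to_dense_size_ratio(n_nz, n_total, index_bits=16, value_bits=32):
    """Ratio of COO sparse storage (row index + column index + value per non-zero entry)
    to dense storage (one value per entry), as in (2)."""
    return (2 * index_bits + value_bits) * n_nz / (value_bits * n_total)


# With 16-bit indices, sparse storage wins only once more than half the entries are zero.
assert sparse_to_dense_size_ratio(n_nz=50, n_total=100) == 1.0
```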

4.3 Implementation Challenges

As of today, well-known machine learning frameworks have limited support for sparse matrix computation. For instance, in the current implementation of PyTorch, sparse matrices do not support persistent storage, computations on sparse matrices are slow, and sparse matrices are not supported for the convolutional layers of convolutional neural networks (CNNs). Therefore, in our implementation for the experiments presented in Section 5, we use actual sparse matrices in fully-connected layers and apply binary masks to the convolutional layers as a surrogate for sparse matrices. Consequently, for models where the computation in convolutional layers dominates, we will not see an apparent decrease in computation time. We note that this problem is solvable in the future by implementing and optimizing efficient sparse matrix multiplication (particularly for convolutional layers) at the software level, as well as by developing specific hardware for this purpose. Nevertheless, compared with existing work, the novelty of our implementation is that we use a sparse matrix representation in the fully-connected layers of the pruned model.
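A minimal sketch of the mask-based surrogate for convolutional layers is shown below; it reproduces the pruned model's output but, as noted above, does not reduce the convolution's computation time. The helper name is illustrative.

```python
import torch
import torch.nn.functional as F


def masked_conv2d(x, weight, mask, bias=None, stride=1, padding=0):
    """Surrogate for a sparse convolution: zero out pruned weights with a binary mask
    before running the ordinary (dense) convolution."""
    return F.conv2d(x, weight * mask, bias=bias, stride=stride, padding=padding)
```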

We note that there also exist pruning techniques where the resulting pruned model is constrained to use a dense matrix representation Lym et al. (2019), in which case sparse matrix computation is not needed. Our framework of joint federated learning and model pruning can directly support such pruning methods as well. However, we focus on sparse matrices in this paper as they provide a higher degree of freedom for the pruned model.

5 Experimentation

To study the performance of our proposed approach, we conduct experiments in (i) a real edge computing system, where the server is a personal computer and the clients are Raspberry Pi devices, and (ii) a simulated setting with multiple clients and a server.

Pruning methods. For the initial sample-based pruning and for federated pruning, we use the magnitude-based method described in Section 3.2. For sample-less pruning, we employ two methods, namely, (i) initialization-based pruning, where the pruning is based on the magnitudes of the model parameter values at initialization, and (ii) random pruning, where model parameters are pruned randomly. Besides the above pruning methods, two other benchmarks are also considered: the baseline accuracy and the upper-bound accuracy. The baseline accuracy is the test accuracy of the model trained (but not pruned) only on the sample data available at the server, and the upper-bound accuracy is the accuracy of the original (not pruned) model trained on the entire dataset (the union of all clients' local datasets).

Models and datasets. Three different models are studied in our experiments, namely, LeNet-300-100 for the MNIST dataset LeCun et al. (1998), Conv-FashionMNIST for the FashionMNIST dataset Xiao et al. (2017), and Conv-FEMNIST for the Federated Extended MNIST (FEMNIST) dataset Caldas et al. (2018); their details can be found in Table 1. Due to space limitations, the results for Conv-FashionMNIST are included in the appendix. We use SGD with a constant learning rate for training the neural networks across all experiments. When using sample-based approaches, the models are trained with the samples for 50 epochs at the server. In each federation, all clients update their local parameters 5 times, each update using SGD with a mini-batch size of 20.

5.1 MNIST with i.i.d. Data Partition

Figure 2: Time Measurements on Raspberry Pi 3 with LeNet-300-100 Network
Figure 3: Inference Time on Raspberry Pi versions 3 and 4

We first consider the MNIST dataset using the fully-connected LeNet-300-100 architecture LeCun et al. (1998). We prune the network up to 30 levels, removing 20%, 20%, and 10% of the weights of the three layers, respectively, at each level. Both experiments on Raspberry Pis and simulations are performed. For the experiments, we use a system where the server is a personal computer and the clients are ten Raspberry Pi devices (five Pi 3 Model B's and five Pi 4 Model B's). Their specs can be found in Table 2 (note that Raspberry Pi devices are not equipped with GPUs, so training and inference are performed only on CPUs). The server and clients communicate with each other using the TCP protocol.

Device CPU RAM Storage
Pi 3 4 cores/1.2GHz 1GB 32GB SD card
Pi 4 4 cores/1.5GHz 2GB 32GB SD card
Table 2: Device Specifications

5.1.1 Time Measurements of One Federation

Figure 4: Test Accuracy vs. Time with LeNet-300-100 at Pruning Level 5/10/15/20

We first present the time measurements of one federation on real devices. We implement the original LeNet-300-100 model as well as models pruned for every 2 levels up to 30 levels on the Raspberry Pi devices and measure the average elapsed time on both server and clients over 100 federations.

Figure 2 shows the average total time, computation time, and communication time in one federation as we vary the pruning level. We also plot the actual file size of the parameters that are exchanged between the server and clients in this figure. The original network crashes the Raspberry Pi 3's, i.e., the system dies when training on the first mini-batch due to resource exhaustion. Thus, we annotate the time measurements at pruning level 0 with "inf.".

Computation time. We see from Figure 2 that the original model cannot be trained on Raspberry Pi 3’s, and as the pruning level increases, the computation time decreases from 1.87 seconds per federation to 0.07 seconds per federation. This result agrees with the discussion in Section 4. Additionally, we plot in Figure 3 the inference time for 10,000 test data samples, averaged over 100 experiments. On Raspberry Pi versions 3 and 4, after pruning 13 and 9 levels, respectively, the inference time of models using sparse weights becomes smaller than using the original model.

Communication time. Compared with the computation time, the decrease in communication time is even more noticeable: it drops from 10.57 seconds per federation to 0.48 seconds per federation. Since the sparse representation uses 16-bit integers for the indices and thus doubles the storage size per remaining parameter, the benefit of size reduction comes only after 50% of the parameters are pruned (level 4 in Figure 2), according to (2).

5.1.2 Model Training on Raspberry Pi’s

Figure 5: Comparing Sample-based, One-shot Approach at Pruning Level 5/10/15/20 with LeNet-300-100
Figure 6: The 4 Possible Pruning Cases with LeNet-300-100
Figure 7: Test Accuracy of Three One-shot Pruning Approaches with LeNet-300-100 at Federation #1K/2K/10K
Level | 5 | 10 | 15 | 20
Sample-based | 0.5K/6322s | 1.0K/3958s | 1.5K/3157s | 7.3K/7214s
Init.-based | 0.7K/10093s | 1.6K/5320s | 9.2K/5964s | N/A/N/A
Random | 0.8K/10907s | 1.9K/7130s | 12.4K/8996s | N/A/N/A
Table 3: Number of Federations / Elapsed Time to Reach the 95% Accuracy Threshold

Next, we look at how long it takes for federated learning to reach a certain accuracy when a new learning task is assigned. Figure 4 shows the accuracy vs. elapsed training time at pruning levels 5, 10, 15, and 20. We set a target accuracy of 95% for all of them. Training ends once it reaches the threshold for ten consecutive evaluations or reaches the 6,000-second time limit, whichever comes first.

Clearly, there is a trade-off between computation time and final accuracy: larger models result in better accuracies (at convergence) but slower training, and conversely, smaller models result in worse accuracies (at convergence) but faster training, as shown in Figure 5. A good choice is in between: using pruning level 15 takes the least amount of time to reach the 95% accuracy target. Table 3 shows the number of federations and the time spent to reach the given accuracy target. The sample-based approach reaches the accuracy target with fewer federations and, more importantly, less time. Additionally, we find that initialization-based pruning always performs comparably to or better than random pruning; an intuitive explanation can be found in Section 6.3.

In Figure 6 we consider all four possible cases: {sample-based, sample-less} × {one-shot, federated} pruning. For federated pruning, we start from level 5 and prune the model every 100 federations until level 15. When sample data are available at the server, we consider a sample data size of 200 for the initial pruning. It is clear that pruning with samples always gives better final accuracy compared with no samples, and similarly federated pruning is always better than one-shot pruning at convergence. Nevertheless, federated pruning slows down training at the early stage, and thus we see that the sample-based, one-shot case gives better accuracy than sample-less, federated pruning and sample-based, federated pruning before 1,500 seconds.

5.1.3 Simulations of Sample-based, One-shot Pruning

To extend the experiments on Raspberry Pi’s, we conduct simulations of sample-based, one-shot pruning for 10,000 federations at all 30 pruning levels. We repeat the simulation with 5 different random seeds.

Level | 5 | 10 | 15 | 20 | 25 | 30
Percent | 32.97 | 10.97 | 3.73 | 1.35 | 0.56 | 0.30
Sample-based | 98.1 | 97.7 | 97.1 | 95.4 | 91.7 | 84.5
Init.-based | 98.1 | 97.4 | 95.2 | 86.1 | 31.8 | 14.5
Random | 98.0 | 97.3 | 95.1 | 85.5 | 30.1 | 13.2
Baseline | 83.2 | 82.8 | 82.4 | 81.3 | 79.5 | 76.1
Table 4: Test Accuracy (%) at Pruning Levels 5/10/15/20/25/30 at Federation #10K (Level: Pruning Level; Percent: Percentage of Remaining Parameters)

In Figure 7, we plot the test accuracy of the network at 1K, 2K, and 10K federations for the three pruning methods (sample-based, initialization-based, and random). The x-axis of each figure represents the pruning level (and the percentage of remaining parameters). At all federations, there is a monotonic decrease in test accuracy as the pruning level increases. The reason is twofold: first, the learnability of sparse networks is significantly reduced as we remove more and more weights; second, as we use one-shot pruning, the model is pruned using a small fraction of the training data, thus the resulting sparse model might not represent a good subnetwork architecture (per the winning ticket hypothesis in Frankle & Carbin (2019)). That being said, our sample-based approach achieves better test accuracies than the initialization-based and random approaches at all pruning levels, especially when the network is highly sparse after pruning (i.e., at high pruning levels). Table 4 lists the accuracies of all pruning approaches and their baselines. For example, at pruning level 20 where only 1.35% of the parameters are left, the sample-based one-shot pruning still achieves 95.4% accuracy while the other two approaches achieve significantly lower accuracies (around 86%).

5.2 FEMNIST with Non-i.i.d. Data Partition

Figure 8: Time Measurements on Pi 4 with Conv-FEMNIST
Figure 9: Computational Complexity of Conv-FEMNIST
Figure 10: Comparing Sample-based, One-shot Approach at Pruning Level 4/8/12/16/20 with Conv-FEMNIST
Figure 11: The 4 Possible Pruning Cases with Conv-FEMNIST
Figure 12: Test Accuracy of Three One-shot Pruning Approaches with Conv-FEMNIST at Pruning Level 10/20/30
Figure 13: Test Accuracy of Three One-shot Pruning Approaches with Conv-FEMNIST at Federation #1K/2K/10K

Next, we consider the FEMNIST dataset with the Conv-FEMNIST network. FEMNIST is a benchmark dataset for federated learning settings. It contains images that comprise 62 classes of handwritten digits and letters, including lower and upper cases. The data are collected from 3,500 different writers, and the dataset provides an option of retrieving non-i.i.d. distributed images according to the writers. We extract a biased subset of 389 samples from 2 writers out of the total of 35,948 data samples and study whether a biased subset of sample data can still be effective for initial pruning. We also explore the performance when convolutional layers are pruned.

The Conv-FEMNIST network consists of 2 convolutional layers and 3 FC layers. We prune 20%, 20%, and 10% of the weights in the three FC layers and 5% and 10% in the two convolutional layers at each pruning level, for up to 30 pruning levels. Due to the limited support for sparse matrices, we use element-wise multiplication of the weights and their binary masks in the convolutional layers as a surrogate for actual sparse matrices. Because of the excessive training time on Raspberry Pi's, we first measure the computation and communication time on Raspberry Pi 4's and then use the measured time to simulate the model training process.

5.2.1 Time Measurements of One Federation

The measurements of the computation, communication, and total time of each federation on Raspberry Pi 4's can be found in Figure 8. We observe that the communication time is 24.34 seconds per federation when the original network is used and 39.15 seconds per federation with the sparse matrix representation at level 2, after which it decreases as the level increases, finally down to 1.54 seconds per federation at level 30. This trend coincides with the change in the size of the parameters, which goes from 10.53 MB at level 0 to 16.62 MB at level 2 due to the use of sparse matrices, and then decreases to 0.44 MB at level 30. The parameter size becomes comparable to that of the original network at around level 4.

Compared with the results in Figure 2, where all layers are fully-connected, the reduction in computation time here is less prominent. The reason is that the majority of the computation is in convolutional layers, for which sparse matrix computation is not supported, as explained in Section 4.3. To conceptually illustrate how pruning can potentially reduce computational complexity, the theoretical complexity and the actual computation time are shown in Figure 9. The complexity is defined as the sum of multiplication operations in convolutional layers according to (1). The actual computation time per federation initially jumps from 5.68 seconds to 13.65 seconds, and then gradually decreases to 3.97 seconds at level 30. This is worse than the theoretical, near-exponentially decreasing computational complexity, leaving room for further optimization in reducing the computation time.

5.2.2 Simulated Model Training

In Figure 10, we compare the test accuracy vs. training time among the original network and the networks pruned to levels 4, 8, 12, 16, and 20 using sample-based, one-shot pruning. In this figure, the sparser the model, the faster it learns, mainly due to the reduction in training time: the average time to complete one federation with the original network is 4.52 times that at pruning level 20.

In Figure 11, we compare the four possible combinations of {sample-based, sample-less} × {one-shot, federated} pruning. For federated pruning, we start from level 5 and prune the model every 100 federations until level 15. One-shot pruning takes the model directly to level 15 using initialization-based pruning. We see an apparent advantage of sample-based pruning. It is worth mentioning that sample-less, federated pruning remained unimproved throughout the entire run (Figure 11). This is possibly due to the "information loss" phenomenon, which will be detailed in Section 6.1.

Recall that our sample data at the server only include images from two writers. Hence, we have shown that model pruning works on slightly biased samples as well. Because of this, one can also consider a federated learning framework with only edge devices and no server: the model can be pruned at a powerful edge device (e.g., a Raspberry Pi 4) with its own data and then dispatched to other devices. Federated learning can be carried out in a completely distributed manner afterwards.

Similar to the observations in Section 5.1, the sample-based approach always learns better than the other two approaches after any number of federations at all pruning levels. More importantly, Figure 12 reveals that initialization-based and random pruning have surprisingly slow convergence in the early stage of training. The sample-based approach starts learning immediately from the clients' data, but the other two approaches have an extremely long "cold start", i.e., a time period during which the model performs equivalently to random guessing. This is particularly undesirable for time-sensitive tasks. If a task aims to predict reasonably well as quickly as possible, a sample-based approach is highly preferred.

Similar to Figure 7, we extend the simulation of test accuracy comparing one-shot sample-based, initialization-based, and random pruning at all 30 levels for 10K federations using 5 different seeds. The results are as expected (Figure 13): higher pruning levels result in worse accuracies, and the sample-based approach performs better than initialization-based and random pruning.

6 Discussion

Figure 14: Reinitialization vs. No Reinitialization with LeNet-300-100 at Pruning Level 15
Figure 15: Impact of Sample Size with LeNet-300-100 at Pruning Level 5/10/15/20

6.1 Robustness of Neural Network Architecture

The sample-less pruning methods we employ (initialization-based and random) are essentially "uninformed" ways to remove the network's weights. We observed that such methods can still retain the model quality, but only up to a certain level. Hence, a natural question to ask is: how much can we prune the network weights in an uninformed way before the input information is mostly lost during the forward pass over the layers due to extreme sparsity? In other words, when a network is pruned beyond a certain level without any training data, some of the output neuron values may remain constant no matter how the inputs vary (e.g., an output neuron losing all incoming connections).

To investigate this, we train the LeNet-300-100 model to level 30 using 5 different seeds. We vary the input and observe the variation in the output. On average, initialization-based and random pruning leave 3.2 and 3.8 constant entries, respectively, out of the ten entries in the output layer, while there are no such entries with the sample-based approach. This result explains why, in Figure 7, at pruning levels close to 30 there is a dramatic degradation in accuracy and a surge in instability for the initialization-based and random pruning approaches. Though we could avoid this structural information loss by imposing specific constraints on the network structure, we found empirically that our proposed sample-based pruning automatically guarantees the robustness of the network architecture.

6.2 Reinitializing Parameters?

It has recently been hypothesized that deep neural networks contain subnetworks (winning tickets) that can be trained to reach a similar accuracy as the original network, when the weights in the subnetwork are reinitialized with the same values as in the original network Frankle & Carbin (2019). It is therefore interesting to study whether it is more beneficial to reinitialize the model after pruning. In Figure 14 we plot reinitialization vs. no reinitialization using sample-based, one-shot pruning at level 15 with the LeNet-300-100 network. We observe no obvious difference between the two, while the no-reinitialization approach converges marginally faster in earlier federations. For this reason, we adhere to the iterative training and pruning approach without reinitialization in this paper.

6.3 How Important is Parameter Initialization?

Pruning Level 5 10 15 20
Pct. of #Params. Remaining 59.2 35.1 20.9 12.6
Overlap (Largest) 73.3 58.1 41.1 35.7
Overlap (Smallest) 44.6 12.5 4.8 0.8
Table 5: Overlap Ratio at Pruning Level 5/10/15/20

Here we intuitively explain why, in Figure 7, initialization-based pruning works better than random pruning. To do so, we first obtain the pruned model (using regular centralized pruning Han et al. (2015)) and count the number of pruned parameters in this model. We then extract the largest/smallest parameters (by magnitude) from the original (not pruned) model at initialization. Finally, we intersect these sets with the pruned model's parameter set, respectively, and calculate the overlap ratio. We measure this for the parameters of the first FC layer in LeNet-300-100. The results are in Table 5, where the second row gives the percentage of remaining parameters, and the bottom two rows give the overlap ratios for the largest/smallest values as defined above. We find that parameters initialized to large values are likely to be kept in the pruning procedure, while those initialized to small values are likely to be eliminated. Admittedly, this result is rather intuitive and also depends on other hyper-parameters such as the learning rate. Still, from this result, we can qualitatively explain the phenomenon: with initialization-based pruning, we remove parameters that are likely to be eliminated eventually and keep those that are likely to be kept eventually.
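Under one plausible reading of this procedure (intersecting the largest/smallest initial weights with the set of weights kept by centralized pruning), the overlap ratio can be computed as follows; the helper is illustrative, not the authors' code.

```python
import torch


def overlap_ratio(init_weights, kept_mask, n_extract, largest=True):
    """Fraction of the n_extract largest (or smallest) weights at initialization that
    also appear in the set of weights kept by centralized pruning (kept_mask == 1)."""
    flat_init = init_weights.abs().flatten()
    extracted = flat_init.topk(n_extract, largest=largest).indices   # largest/smallest initial weights
    kept_idx = kept_mask.flatten().nonzero().squeeze(1)              # indices of kept weights
    overlap = len(set(extracted.tolist()) & set(kept_idx.tolist()))
    return overlap / n_extract
```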

6.4 Impact of Sample Data Size

Intuitively, the more samples we use for pruning, the better the resulting subnetwork. We now increase the sample data size from 200 (as in Section 5.1) to 400, 600, 800, and finally the entire dataset of 60,000 samples, the last of which corresponds to the centralized setting of Han et al. (2015). We study the impact of sample size with LeNet-300-100 and present the results in Figure 15.

It is clear that using more sample data has advantages in various aspects. Even without the federated learning stage, the starting test accuracy at training iteration 0 is immediately higher when using more samples. Also, with more samples, both the convergence speed and the final achievable accuracy are better than with fewer samples.

It should be emphasized, however, that using more sample data inevitably increases the pruning time at the server. On an Amazon g3s.xlarge instance, training 50 epochs (after which the model is pruned by one level) on 200, 400, 600, 800, and 60,000 samples takes 10 seconds, 16 seconds, 24 seconds, 30 seconds, and 2,158 seconds, respectively. Training on large sample sets might not be affordable due to time or resource limits, and most importantly, such samples are often not available at the server at all.

7 Conclusion

In this paper, we have proposed a new model pruning framework for federated learning in edge/mobile computing environments, where the goal is to effectively reduce the size of deep neural network models so that resource-limited clients can train them with their own data and contribute to the federated learning process. Through complexity analysis and extensive experiments on both simulated and real devices, we have shown that the framework enables a federated learning system to achieve this goal while having the participating clients share little or no data with others, preserving the main benefit of federated learning, i.e., the privacy of clients' data, while dramatically reducing the communication and computation load. We have also discussed additional insights gained from our experimental analysis on the effectiveness of model pruning under various conditions regarding data size and initialization. These insights and experimental results suggest further research directions on model pruning in federated learning, such as measuring the impact of different optimization methods on the efficacy and quality of model pruning in federated learning.

References

Appendix A FashionMNIST with i.i.d. Data Partition

The analysis of the FashionMNIST data agrees with the analysis in Section 5.2 to a large extent. Therefore, we present the experimental results in the appendix without further explanation. Each figure is associated with its counterpart in the previous sections.

Figure 16 and Figure 17 correspond to Figure 8 and Figure 9, respectively. They demonstrate the actual computation/communication time as well as the theoretical computational complexity. Figure 18 corresponds to Figure 12. It illustrates the three one-shot pruning approaches at pruning levels 10, 20, and 30. Figure 19 corresponds to Figure 13. It compares the test accuracy of sample-based, initialization-based, and random pruning in a one-shot setting as the pruning level increases.

Figure 16: Time Measurements on Raspberry Pi 4 with Conv-FashionMNIST Network
Figure 17: Computational Complexity of Conv-FashionMNIST
Figure 18: Test Accuracy of Three One-shot Pruning Approaches with Conv-FashionMNIST at Pruning Level 10/20/30
Figure 19: Test Accuracy of Three One-shot Pruning Approaches with Conv-FashionMNIST at Iteration 1K/2K/10K