# ExpandNets: Linear Over-parameterization to Train Compact Convolutional Networks

## Abstract

In this paper, we introduce an approach to training a given compact network. To this end, we leverage over-parameterization, which typically improves both optimization and generalization in neural network training, while being unnecessary at inference time. We propose to expand each linear layer, both fully-connected and convolutional, of the compact network into multiple linear layers, without adding any nonlinearity. As such, the resulting expanded network can benefit from over-parameterization during training but can be compressed back to the compact one algebraically at inference. We introduce several expansion strategies, together with an initialization scheme, and demonstrate the benefits of our ExpandNets on several tasks, including image classification, object detection, and semantic segmentation. As evidenced by our experiments, our approach outperforms both training the compact network from scratch and performing knowledge distillation from a teacher.

## 1 Introduction

With the growing availability of large-scale datasets and advanced computational resources, convolutional neural networks have achieved tremendous success in a variety of tasks, such as image classification (Krizhevsky et al., 2012; He et al., 2016), object detection (Ren et al., 2015; Redmon and Farhadi, 2018, 2016) and semantic segmentation (Long et al., 2015; Ronneberger et al., 2015). Over the past few years, “Wider and deeper are better” has become the rule of thumb to design network architectures (Simonyan and Zisserman, 2015; Szegedy et al., 2015; He et al., 2016; Huang et al., 2017). This trend, however, raises memory- and computation-related challenges, especially in the context of constrained environments, such as embedded platforms.

Deep and wide networks are well-known to be heavily over-parameterized, and thus a compact network, both shallow and thin, should often be sufficient. Unfortunately, compact networks are notoriously hard to train from scratch. As a consequence, designing strategies to train a given compact network has drawn growing attention, the most popular approach consisting of transferring the knowledge of a deep teacher network to the compact one of interest (Hinton et al., 2015; Romero et al., 2014; Yim et al., 2017a; Zagoruyko and Komodakis, 2017; Passalis and Tefas, 2018).

In this paper, we introduce an alternative approach to training compact neural networks, complementary to knowledge transfer. To this end, building upon the observation that network over-parameterization improves both optimization and generalization (Arora et al., 2018; Zhang et al., 2017; Reed, 1993; Allen-Zhu et al., 2018a, b; Kawaguchi et al., 2018), we propose to increase the number of parameters of a given compact network by incorporating additional layers. However, instead of separating every two layers with a nonlinearity, as in most of the deep learning literature, we advocate introducing consecutive linear layers. In other words, we expand each linear layer of a compact network into a succession of multiple linear layers, without any nonlinearity in between. Since consecutive linear layers are equivalent to a single one (Saxe et al., 2014), such an expanded network, or ExpandNet, can be algebraically compressed back to the original compact one without any information loss.

While the use of successive linear layers appears in the literature, existing work (Baldi and Hornik, 1989; Saxe et al., 2014; Kawaguchi, 2016; Laurent and von Brecht, 2018; Zhou and Liang, 2018; Arora et al., 2018) has been mostly confined to fully-connected networks without any nonlinearities and to the theoretical study of their behavior under statistical assumptions that typically do not hold in practice. In particular, these studies have attempted to understand the learning dynamics and the loss landscapes of deep networks. Here, by contrast, we focus on practical, nonlinear, compact convolutional neural networks, and investigate the use of linear expansion to introduce over-parameterization and improve training, so as to allow such networks to achieve better performance.

Specifically, as illustrated by Figure 1, we introduce three ways to expand a compact network: (i) replacing a fully-connected layer with multiple ones; (ii) replacing a $k\times k$ convolution by three convolutional layers with kernel size $1\times 1$, $k\times k$ and $1\times 1$, respectively; and (iii) replacing a $k\times k$ convolution with $k>3$ by multiple $3\times 3$ convolutions. Our experiments demonstrate that expanding convolutions is the key to obtaining more effective compact networks.

Furthermore, we introduce a natural initialization strategy for our ExpandNets. Specifically, an ExpandNet has a nonlinear counterpart, with an identical number of parameters, obtained by adding nonlinearity between every two consecutive linear layers. We therefore propose to directly transfer the weights of this nonlinear model to the ExpandNet for initialization. We show that this yields a further performance boost for the resulting compact network.

In short, our contributions are (i) a novel approach to training given compact, nonlinear convolutional networks by expanding their linear layers; (ii) two strategies to expand convolutional layers; and (iii) an effective initialization scheme for the resulting ExpandNets. We demonstrate the benefits of our approach on several tasks, including image classification on ImageNet, object detection on PASCAL VOC and image segmentation on Cityscapes. Our ExpandNets outperform both training the corresponding compact networks from scratch and using knowledge distillation. We further analyze the benefits of linear over-parameterization on training via a series of experiments studying generalization, gradient confusion and loss landscapes. We will make our code publicly available.

## 2 Related Work

Very deep convolutional neural networks currently constitute the state of the art for many tasks. These networks, however, are known to be heavily over-parameterized, and making them smaller would facilitate their use in resource-constrained environments, such as embedded platforms. As a consequence, much research has recently been devoted to developing more compact architectures.

Network compression constitutes one of the most popular trends in this area. In essence, it aims to reduce the size of a large network while losing as little accuracy as possible, or even none at all. In this context, existing approaches can be roughly grouped into two categories: (i) parameter pruning and sharing (LeCun et al., 1990; Han et al., 2015; Courbariaux et al., 2016; Figurnov et al., 2016; Molchanov et al., 2017; Ullrich et al., 2017; Carreira-Perpinán and Idelbayev, 2018), which aims to remove the least informative parameters; and (ii) low-rank factorization (Denil et al., 2013; Sainath et al., 2013; Lebedev et al., 2014; Jin et al., 2015; Liu et al., 2015), which uses decomposition techniques to reduce the size of the parameter matrix/tensor in each layer. While compression is typically performed as a post-processing step, it has been shown that incorporating it during training could be beneficial (Alvarez and Salzmann, 2016, 2017; Wen et al., 2016, 2017). In any event, even though compression reduces a network’s size, it does not provide one with the flexibility of designing a network with a specific architecture. Furthermore, it often produces networks that are much larger than the ones we consider here, e.g., compressed networks with parameters vs for the SmallNets used in our experiments.

In a parallel line of research, several works have proposed design strategies to reduce a network’s number of parameters (Wu et al., 2016; Szegedy et al., 2016; Howard et al., 2017; Romera et al., 2018; Sandler et al., 2018). Again, while more compact networks can indeed be developed with these mechanisms, they impose constraints on the network architecture, and thus do not allow one to simply train a given compact network. Furthermore, as shown by our experiments, our approach is complementary to these works. For example, we can improve the results of MobileNets (Howard et al., 2017; Sandler et al., 2018) by training them using our expansion strategy.

Here, in contrast to the above-mentioned literature on compact networks, we seek to train a given compact network with an arbitrary architecture. This is also the task addressed by knowledge transfer approaches. To achieve this, existing methods leverage the availability of a pre-trained very deep teacher network. In (Hinton et al., 2015), this is done by defining soft labels from the logits of the teacher; in (Romero et al., 2014; Zagoruyko and Komodakis, 2017; Yim et al., 2017b), by transferring intermediate representations, attention maps and Gram matrices, respectively, from the teacher to the compact network; in (Passalis and Tefas, 2018) by aligning the feature distributions of the deep and compact networks.

In this paper, we introduce an alternative strategy to train compact networks, complementary to knowledge transfer. Inspired by the theory showing that over-parameterization helps training (Arora et al., 2018; Zhang et al., 2017; Reed, 1993; Allen-Zhu et al., 2018a, b), we propose to expand each linear layer in a given compact network into a succession of multiple linear layers. Our experiments evidence that training such expanded networks, which can then be compressed back algebraically, yields better results than training the original compact networks, thus empirically confirming the benefits of over-parameterization. Our results also show that our approach outperforms knowledge transfer, even when not using a teacher network.

Note that linearly over-parameterized neural networks have been investigated both in the early neural network days (Baldi and Hornik, 1989) and more recently (Saxe et al., 2014; Kawaguchi, 2016; Laurent and von Brecht, 2018; Zhou and Liang, 2018; Arora et al., 2018). These methods, however, typically study purely linear networks, with a focus on the convergence behavior of training in this linear regime. For example, Arora et al. (2018) showed that linear over-parameterization modifies the gradient updates in a unique way that speeds up convergence. In contrast to the above-mentioned methods which all focus on fully-connected layers, we develop two strategies to expand convolutional layers, as well as an approach to initializing our expanded networks, and empirically demonstrate the impact of both contributions on prediction accuracy.

Note that the concurrent work of Ding et al. (2019) also advocates for expansion of convolutional layers. However, the two strategies we introduce differ from their use of 1D asymmetric convolutions. More importantly, we believe that their analysis, parallel to ours, further confirms the benefits of linear expansion.

## 3 ExpandNets

Let us now introduce our approach to training compact networks by expanding their linear layers. In particular, we propose to make use of linear expansion, such that the resulting ExpandNet is equivalent to the original compact network and can be compressed back to the original structure algebraically. Below, we describe three different expansion strategies, starting with the case of fully-connected layers, followed by two ways to expand convolutional layers.

### 3.1 Expanding Fully-connected Layers

The weights of a fully-connected layer can easily be represented in matrix form. Therefore, expanding such layers can be done in a straightforward manner by relying on matrix products. Specifically, let $\mathbf{W}_{n\times m}$ be the parameter matrix of a fully-connected layer with $m$ input channels and $n$ output ones. That is, given an input vector $\mathbf{x}\in\mathbb{R}^{m}$, the output $\mathbf{y}\in\mathbb{R}^{n}$ can be obtained as $\mathbf{y}=\mathbf{W}_{n\times m}\mathbf{x}$. Note that we ignore the bias, which can be taken into account by incorporating an additional channel with value 1 to $\mathbf{x}$.

Expanding such a fully-connected layer with an arbitrary number $l$ of linear layers can be achieved by observing that its parameter matrix can be equivalently written as

$$\mathbf{W}_{n\times m}=\mathbf{W}_{n\times p_{l-1}}\,\mathbf{W}_{p_{l-1}\times p_{l-2}}\cdots\mathbf{W}_{p_{1}\times m}. \tag{1}$$

While this yields an over-parameterization, it does not affect the underlying model capacity. In particular, we can compress this model back to the original one, and thus, at inference, use the original compact architecture. More importantly, this allows us to increase not only the number of layers, but also the number of channels by setting $p_i>n,m$ for all $i$. To this end, we rely on the notion of expansion rate. Specifically, for an expansion rate $r$, we define $p_1=rm$ and $p_i=rn$, $\forall i\neq 1$. Note that other strategies are possible, e.g., $p_i=r^{i}m$, but ours has the advantage of preventing the number of parameters from exploding. In practice, considering the computational complexity of fully-connected layers, we advocate expanding each layer into only two or three layers with a small expansion rate.
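
The factorization above can be checked numerically. The following is a minimal sketch in plain Python, not the authors' implementation. It assumes hidden widths of $rm$ for the first factor and $rn$ for the later ones, an integer expansion rate, and random weights standing in for trained ones. All helper names (`expand_fc`, `compress_fc`, `apply_layers`) are hypothetical.

```python
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def random_matrix(rows, cols, rng):
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def expand_fc(m, n, num_layers, r, rng):
    """Factors in application order: the first maps m -> p_1, the last maps p_{l-1} -> n."""
    # Assumed width rule: r*m for the first hidden width, r*n for the others.
    hidden = [r * m if i == 0 else r * n for i in range(num_layers - 1)]
    dims = [m] + hidden + [n]
    return [random_matrix(dims[i + 1], dims[i], rng) for i in range(num_layers)]


def compress_fc(factors):
    """Collapse the factors into a single n x m matrix by left-multiplication."""
    w = factors[0]
    for f in factors[1:]:
        w = matmul(f, w)
    return w


def apply_layers(factors, x):
    """Apply the factors one after another to the vector x."""
    col = [[v] for v in x]
    for f in factors:
        col = matmul(f, col)
    return [row[0] for row in col]


rng = random.Random(0)
m, n = 4, 3
factors = expand_fc(m, n, num_layers=3, r=2, rng=rng)
w = compress_fc(factors)

x = [rng.gauss(0.0, 1.0) for _ in range(m)]
y_expanded = apply_layers(factors, x)
y_compact = [row[0] for row in matmul(w, [[v] for v in x])]
max_gap = max(abs(a - b) for a, b in zip(y_expanded, y_compact))
print(f"largest output difference: {max_gap:.2e}")
```

The expanded and compressed forms agree up to floating-point rounding, which is the sense in which the compression is lossless.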

Note that this fully-connected expansion is similar to that used by Arora et al. (2018), whose work we discuss in more detail in the supplementary material. As will be shown by our experiments, however, expanding only fully-connected layers, as in (Arora et al., 2018), typically does not yield a performance boost. By contrast, the two convolutional expansion strategies we introduce below do.

### 3.2 Expanding Convolutional Layers

The operation performed by a convolutional layer can also be expressed in matrix form, by vectorizing the input tensor and defining a highly structured matrix whose elements are obtained from the vectorized convolutional filters. While this representation could therefore allow us to use the same strategy as with fully-connected layers, using arbitrary intermediate matrices would ignore the convolution structure, and thus alter the original operation performed by the layer. For a similar reason, one cannot simply expand a convolutional layer with kernel size $k\times k$ as a series of $k\times k$ convolutions, because, unless $k=1$, the resulting receptive field size would differ from the original one.

To overcome this, we note that $1\times 1$ convolutions retain the computational benefits of convolutional layers while not modifying the receptive field size. As illustrated in Figure 1, we therefore propose to expand a $k\times k$ convolutional layer into 3 consecutive convolutional layers: a $1\times 1$ convolution; a $k\times k$ one; and another $1\times 1$ one. This strategy still allows us to increase the number of channels in the intermediate layer. Specifically, for an original layer with $m$ input channels and $n$ output ones, given an expansion rate $r$, we define the number of output channels of the first $1\times 1$ layer as $p=rm$ and the number of output channels of the intermediate $k\times k$ layer as $q=rn$.

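Compressing an expanded convolutional layer into the original one can still be done algebraically. To this end, one can reason with the matrix representation of convolutions. For an input tensor of size $(w\times h\times m)$, the matrix representation of the original layer, with $n$ output channels, can be recovered as

$$\mathbf{W}^{c}_{nw'h'\times mwh}=\mathbf{W}^{c_3}_{nw'h'\times qw'h'}\,\mathbf{W}^{c_2}_{qw'h'\times pwh}\,\mathbf{W}^{c_1}_{pwh\times mwh}, \tag{2}$$

where $p$ and $q$ are the numbers of output channels of the first and intermediate layers, $w'\times h'$ is the spatial size of the output, and each intermediate matrix has a structure encoding a convolution. The resulting matrix will also have a convolution structure, with filter size $k\times k$.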
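
Rather than building the large structured matrices explicitly, the same compression can be done directly on the kernels. The sketch below is an illustration under stated assumptions, not the authors' code: stride 1, no padding, no bias, and the $1\times 1$, $k\times k$, $1\times 1$ layout described above. It merges the three kernels by contracting over the two intermediate channel dimensions and compares the outputs. The helper names (`conv2d`, `compress_cl`, `max_abs_diff`) are hypothetical.

```python
import random


def conv2d(x, w):
    """Valid cross-correlation, stride 1. x: [C_in][H][W], w: [C_out][C_in][k][k]."""
    k = len(w[0][0])
    h_out = len(x[0]) - k + 1
    w_out = len(x[0][0]) - k + 1
    return [[[sum(w[o][c][a][b] * x[c][i + a][j + b]
                  for c in range(len(x)) for a in range(k) for b in range(k))
              for j in range(w_out)]
             for i in range(h_out)]
            for o in range(len(w))]


def random_kernel(c_out, c_in, k, rng):
    return [[[[rng.gauss(0.0, 1.0) for _ in range(k)] for _ in range(k)]
             for _ in range(c_in)] for _ in range(c_out)]


def compress_cl(w1, w2, w3):
    """Merge a 1x1 kernel (w1), a kxk kernel (w2) and a 1x1 kernel (w3) into one kxk kernel."""
    n, q, p, m = len(w3), len(w2), len(w1), len(w1[0])
    k = len(w2[0][0])
    return [[[[sum(w3[o][t][0][0] * w2[t][s][a][b] * w1[s][i][0][0]
                   for t in range(q) for s in range(p))
               for b in range(k)] for a in range(k)]
             for i in range(m)] for o in range(n)]


def max_abs_diff(a, b):
    return max(abs(u - v)
               for pa, pb in zip(a, b) for ra, rb in zip(pa, pb) for u, v in zip(ra, rb))


rng = random.Random(0)
m, n, k, r = 2, 3, 3, 2
p, q = r * m, r * n  # assumed widths of the first and intermediate layers
w1 = random_kernel(p, m, 1, rng)
w2 = random_kernel(q, p, k, rng)
w3 = random_kernel(n, q, 1, rng)
merged = compress_cl(w1, w2, w3)

x = [[[rng.gauss(0.0, 1.0) for _ in range(5)] for _ in range(5)] for _ in range(m)]
y_expanded = conv2d(conv2d(conv2d(x, w1), w2), w3)
y_compact = conv2d(x, merged)
max_gap = max_abs_diff(y_expanded, y_compact)
print(f"largest output difference: {max_gap:.2e}")
```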

### 3.3 Expanding Convolutional Kernels

While $3\times 3$ kernels have become increasingly popular in very deep architectures (He et al., 2016), larger kernel sizes are often exploited in compact networks, so as to increase their expressiveness and their receptive fields. As illustrated in Figure 1, $k\times k$ kernels with $k>3$ can be equivalently represented with a series of $l$ $3\times 3$ convolutions, where $l=(k-1)/2$.

As before, the number of channels in the intermediate layers can be larger than that in the original one, thus allowing us to linearly over-parameterize the model. Similarly to the fully-connected case, for an expansion rate $r$, we set the number of output channels of the first $3\times 3$ layer to $p_1=rm$ and that of the subsequent layers to $p_i=rn$, where $m$ and $n$ denote the numbers of input and output channels of the original layer. The same matrix-based strategy as before can then be used to algebraically compress back the expanded kernels.
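
To make the kernel-level equivalence concrete, the sketch below composes two $3\times 3$ kernels into a single $5\times 5$ one and checks that both give the same output. It is a minimal illustration, not the authors' implementation, and assumes stride 1, no padding, no bias, and the cross-correlation convention used by most deep learning libraries. The helper names (`num_3x3_layers`, `compose_kernels`) are hypothetical.

```python
import random


def num_3x3_layers(k):
    """Number of stacked 3x3 convolutions with the receptive field of one k x k kernel."""
    if k < 3 or k % 2 == 0:
        raise ValueError("k must be odd and at least 3")
    return (k - 1) // 2


def conv2d(x, w):
    """Valid cross-correlation, stride 1. x: [C_in][H][W], w: [C_out][C_in][k][k]."""
    k = len(w[0][0])
    h_out, w_out = len(x[0]) - k + 1, len(x[0][0]) - k + 1
    return [[[sum(w[o][c][a][b] * x[c][i + a][j + b]
                  for c in range(len(x)) for a in range(k) for b in range(k))
              for j in range(w_out)] for i in range(h_out)] for o in range(len(w))]


def random_kernel(c_out, c_in, k, rng):
    return [[[[rng.gauss(0.0, 1.0) for _ in range(k)] for _ in range(k)]
             for _ in range(c_in)] for _ in range(c_out)]


def compose_kernels(first, second):
    """Single kernel equivalent to applying `first` and then `second`.

    first: [C_mid][C_in][k1][k1], second: [C_out][C_mid][k2][k2].
    The result has spatial size k1 + k2 - 1.
    """
    k1, k2 = len(first[0][0]), len(second[0][0])
    k = k1 + k2 - 1
    c_out, c_mid, c_in = len(second), len(first), len(first[0])
    out = [[[[0.0] * k for _ in range(k)] for _ in range(c_in)] for _ in range(c_out)]
    for o in range(c_out):
        for c in range(c_mid):
            for i in range(c_in):
                for a in range(k1):
                    for b in range(k1):
                        for u in range(k2):
                            for v in range(k2):
                                out[o][i][a + u][b + v] += second[o][c][u][v] * first[c][i][a][b]
    return out


rng = random.Random(0)
m, n, r = 1, 2, 2
first = random_kernel(r * m, m, 3, rng)   # first 3x3 layer: m -> r*m channels (assumed width)
second = random_kernel(n, r * m, 3, rng)  # last 3x3 layer: r*m -> n channels
merged = compose_kernels(first, second)   # one 5x5 kernel: m -> n channels

x = [[[rng.gauss(0.0, 1.0) for _ in range(7)] for _ in range(7)] for _ in range(m)]
y_expanded = conv2d(conv2d(x, first), second)
y_compact = conv2d(x, merged)
max_gap = max(abs(u - v) for pa, pb in zip(y_expanded, y_compact)
              for ra, rb in zip(pa, pb) for u, v in zip(ra, rb))
print(f"largest output difference: {max_gap:.2e}")
```

For a larger kernel, `compose_kernels` can be applied repeatedly, once per additional $3\times 3$ layer.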

### 3.4 Dealing with Convolution Padding and Strides

In modern convolutional networks, padding and strides are widely used to retain information from the input feature map while controlling the size of the output one. To expand a convolutional layer with padding $p$, we propose to use padding $p$ in the first layer of the expanded unit while not padding the remaining layers. Furthermore, to handle a stride $s$, when expanding convolutional layers, we set the stride of the middle layer to $s$ and that of the others to $1$. When expanding convolutional kernels, we use a stride of $1$ for all layers except for the last one, whose stride is set to $s$. These two strategies guarantee that the resulting ExpandNet can be compressed back to the compact model without any information loss.
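
As a quick sanity check of these rules, the sketch below lists the per-layer settings for both expansion types and verifies that the expanded unit produces the same spatial output size as the original layer. It is a hedged illustration, assuming that the first layer carries the original padding, that the stride sits on the middle layer (CL) or on the last layer (CK), and that all other layers use stride 1 and no padding. It only checks spatial sizes, not the equality of the computed values, and the function names are hypothetical.

```python
def out_size(size, kernel, stride=1, padding=0):
    """Spatial output size of a convolution (floor convention)."""
    return (size + 2 * padding - kernel) // stride + 1


def cl_layers(k, stride, padding):
    """(kernel, stride, padding) triples for the 1x1 -> kxk -> 1x1 expansion."""
    return [(1, 1, padding), (k, stride, 0), (1, 1, 0)]


def ck_layers(k, stride, padding):
    """(kernel, stride, padding) triples for the stack of 3x3 layers replacing a kxk one."""
    num = (k - 1) // 2
    layers = [(3, 1, padding if i == 0 else 0) for i in range(num)]
    last_kernel, _, last_padding = layers[-1]
    layers[-1] = (last_kernel, stride, last_padding)  # only the last layer is strided
    return layers


def chain_size(size, layers):
    """Output size after applying the layers in sequence."""
    for kernel, stride, padding in layers:
        size = out_size(size, kernel, stride, padding)
    return size


cases = [(size, k, s, p)
         for size in range(9, 40) for k in (5, 7) for s in (1, 2) for p in (0, 1, 2)]
cl_ok = all(chain_size(n, cl_layers(k, s, p)) == out_size(n, k, s, p) for n, k, s, p in cases)
ck_ok = all(chain_size(n, ck_layers(k, s, p)) == out_size(n, k, s, p) for n, k, s, p in cases)
print(cl_ok, ck_ok)
```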

Overall, the strategies discussed above allow us to expand an arbitrary compact network into an equivalent deeper and wider one. Note that these strategies can be used independently or together. In any event, once the resulting ExpandNet is trained, it can be compressed back to the original compact architecture in a purely algebraic manner, that is, at absolutely no loss of information.

### 3.5 Initializing ExpandNets

As will be demonstrated by our experiments, training an ExpandNet from scratch yields consistently better results than training the original compact network. However, with deep networks, initialization can have an important effect on the final results (Mishkin and Matas, 2015; He et al., 2015). While designing an initialization strategy specifically for compact networks is an unexplored research direction, our ExpandNets can be initialized in a natural manner.

To this end, we exploit the fact that an ExpandNet has a natural nonlinear counterpart, which can be obtained by incorporating a nonlinear activation function between each pair of linear layers. We therefore propose to initialize the parameters of an ExpandNet by simply training its nonlinear counterpart and transferring the resulting parameters to the ExpandNet. The initialized ExpandNet is then trained in the standard manner. As evidenced by our experiments below, when the nonlinear counterpart achieves better performance than the compact network, this simple strategy typically yields an additional accuracy boost to our approach.
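
The weight transfer itself is a plain copy, since the two models have identical parameter shapes. The sketch below illustrates this on a toy two-layer fully-connected unit. It is an assumption-laden illustration rather than the authors' code: random weights stand in for a trained nonlinear counterpart, ReLU is assumed as the activation, the training steps are omitted, and the function names are hypothetical.

```python
import copy
import random


def matvec(w, x):
    return [sum(a * b for a, b in zip(row, x)) for row in w]


def forward(weights, x, nonlinear):
    """Apply stacked layers; if nonlinear, a ReLU follows every layer except the last."""
    for idx, w in enumerate(weights):
        x = matvec(w, x)
        if nonlinear and idx < len(weights) - 1:
            x = [max(0.0, v) for v in x]
    return x


def init_expandnet_from(nonlinear_weights):
    """Copy the weights of the nonlinear counterpart into the ExpandNet."""
    return copy.deepcopy(nonlinear_weights)


rng = random.Random(0)
m, p, n = 4, 8, 3
# Stand-in for the weights of a *trained* nonlinear counterpart (training not shown).
nonlinear_weights = [
    [[rng.gauss(0.0, 1.0) for _ in range(m)] for _ in range(p)],
    [[rng.gauss(0.0, 1.0) for _ in range(p)] for _ in range(n)],
]

expandnet_weights = init_expandnet_from(nonlinear_weights)

x = [rng.gauss(0.0, 1.0) for _ in range(m)]
y_nonlinear = forward(nonlinear_weights, x, nonlinear=True)
y_expandnet = forward(expandnet_weights, x, nonlinear=False)
print(y_nonlinear, y_expandnet)
```

The two models generally compute different functions right after the transfer, since removing the activation changes the forward pass. The copied weights only serve as a starting point for the subsequent ExpandNet training.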

## 4 Experiments

In this section, we demonstrate the benefits of our ExpandNets on image classification, object detection, and semantic segmentation. We further provide an ablation study to analyze the influence of different expansion strategies and expansion rates in the supplementary material.

We denote the expansion of fully-connected layers by FC(Arora18) to indicate that it is similar to the strategy used in Arora et al. (2018), of convolutional layers by CL, and of convolutional kernels by CK. When combining convolutional expansions with fully-connected ones, we use CL+FC or CK+FC, and add +Init to indicate the use of our initialization strategy.

### 4.1 Image Classification

We first study the use of our approach with very small networks on CIFAR-10 and CIFAR-100, and then turn to the more challenging ImageNet dataset, where we show that our method can improve the results of the compact MobileNet (Howard et al., 2017), MobileNetV2 (Sandler et al., 2018) and ShuffleNetV2 0.5 (Ma et al., 2018).

#### CIFAR-10 and CIFAR-100

Experimental setup. For CIFAR-10 and CIFAR-100, we use the same compact network as in (Passalis and Tefas, 2018). It is composed of 3 convolutional layers with kernels and no padding. These 3 layers have 8, 16 and 32 output channels, respectively. Each of them is followed by a batch normalization layer, a ReLU layer and a max pooling layer. The output of the last layer is passed through a fully-connected layer with 64 units, followed by a logit layer with either 10 or 100 units. To evaluate our kernel expansion method, we also report results obtained with a similar network where the kernels were replaced by ones, with a padding of . All networks, including our ExpandNets, were trained for epochs using a batch size of . We used standard stochastic gradient descent (SGD) with a momentum of 0.9 and a learning rate of , divided by at epochs and . With this strategy, all networks reached convergence. In this set of experiments, the expansion rate is always set to to balance the accuracy-efficiency trade-off. We evaluate the influence of this parameter in the ablation study in the supplementary material. Note that our ExpandNet strategy is complementary to knowledge transfer; that is, we can apply any typical knowledge transfer method using our ExpandNet as student instead of the compact network directly. To demonstrate the benefits of ExpandNets in this scenario, we conduct experiments using knowledge distillation (KD) (Hinton et al., 2015), hint-based transfer (Hint)(Romero et al., 2014) or probabilistic knowledge transfer (PKT) (Passalis and Tefas, 2018) from a ResNet18 teacher network.

Results. We focus here on the SmallNet with kernels, for which we can evaluate all our expansion strategies, including the CK ones, and report the results obtained with the model with kernels in the supplementary material. Table 1 provides the results over 5 runs of all our networks with and without knowledge distillation, which we have found to be the most effective knowledge transfer strategy, as evidenced by comparing these results with those obtained by Hint and PKT reported in the supplementary material. As shown in the top portion of the table, only expanding the fully-connected layer, as proposed by Arora et al. (2018), yields mild improvement. However, expanding the convolutional ones clearly outperforms the compact network, and is further boosted by expanding the fully-connected one and by using our initialization scheme. Overall, on both datasets, expanding the kernels yields higher accuracy, with ExpandNet-CK+FC+Init achieving the best results. Note that even without KD, our ExpandNets outperform the SmallNet with KD. The gap is further increased when we also use KD, as can be seen in the bottom portion of the table.

Network | Transfer | CIFAR-10 | CIFAR-100 |
---|---|---|---|
SmallNet | w/o KD | | |
SmallNet | w/ KD | | |
FC(Arora18) | w/o KD | | |
ExpandNet-CL | w/o KD | | |
ExpandNet-CL+FC | w/o KD | | |
ExpandNet-CL+FC+Init | w/o KD | | |
ExpandNet-CK | w/o KD | | |
ExpandNet-CK+FC | w/o KD | | |
ExpandNet-CK+FC+Init | w/o KD | | |
ExpandNet-CL+FC | w/ KD | | |
ExpandNet-CL+FC+Init | w/ KD | | |
ExpandNet-CK+FC | w/ KD | | |
ExpandNet-CK+FC+Init | w/ KD | | |

#### ImageNet

Model | w/o KD | w/ KD |
---|---|---|
MobileNet | | |
ExpandNet-CL | 69.40 | 70.47 |
MobileNetV2 | | |
ExpandNet-CL | 65.62 | 67.19 |
ShuffleNetV2 0.5 | | |
ExpandNet-CL | 57.38 | 57.68 |

Experimental setup. For our experiments on ImageNet (Russakovsky et al., 2015), we make use of the compact MobileNet (Howard et al., 2017), MobileNetV2 (Sandler et al., 2018) and ShuffleNetV2 (Ma et al., 2018) models, which were designed to be compact and yet achieve good results. We rely on a pytorch implementation of these models. For our approach, we use our CL strategy to expand all convolutional layers with kernel size of in MobileNet and ShuffleNetV2, while only expanding the non-residual convolutional layers in MobileNetV2. We trained the MobileNets using the short-term regime advocated in (He et al., 2016), corresponding to training for 90 epochs with a weight decay of 0.0001 and an initial learning rate of 0.1, divided by 10 every 30 epochs. We employed SGD with a momentum of 0.9 and a batch size of 256. We also performed data augmentation by taking random image crops and resizing them to pixels. We further performed random horizontal flips and subtract the per-pixel mean. For ShuffleNet, we used the small ShuffleNetV2 0.5, trained in the same manner as in (Ma et al., 2018). We also conducted KD from a ResNet152 (with top-1 accuracy), tuning the KD hyper-parameters to the best accuracy for each method.

Results. We compare the results of the original models with those of our expanded versions in Table 2. Our expansion strategy yields a consistent improvement, increasing the top-1 accuracy of MobileNet, MobileNetV2 and ShuffleNetV2 0.5 by 2.92, 1.87 and 0.49 percentage points (pp), respectively. Furthermore, our vanilla ExpandNets outperform the MobileNets with KD, even though we do not require a teacher. Combining KD with the ShuffleNetV2 0.5 confirms that the performance of our ExpandNets can be further boosted with the help of a teacher network. Note that, on this dataset, the nonlinear counterparts of the ExpandNets did not outperform the original models; we therefore did not use our initialization strategy.

### 4.2 Object Detection

Our approach is not restricted to image classification. We first demonstrate its benefits in the context of one-stage object detection using the PASCAL VOC dataset (Everingham et al., 2007, 2012). To this end, we trained networks on the PASCAL VOC2007 + 2012 training and validation sets and report the mean average precision (mAP) on the PASCAL VOC2007 test set.

Experimental setup. We make use of YOLO-LITE (Huang et al., 2018), which was designed to work in constrained environments. It consists of a backbone part and a head part, and we only expand the convolutional layers in the backbone. YOLO-LITE is a very compact network, with only 5 convolutional layers in the backbone, each followed by a batch normalization layer, a leaky-ReLU layer and a max pooling layer. We expand all 5 convolutional layers using our CL strategy with . We trained all networks in the standard YOLO fashion (Redmon and Farhadi, 2018, 2016).

Model | VOC2007 Test |
---|---|
YOLO-LITE | |
ExpandNet-CL | |
ExpandNet-CL+Init | 35.14 |

Results. The results on the PASCAL VOC 2007 test set are reported in Table 3. They show that our expansion strategy also yields better performance for object detection. Since YOLO-LITE is very compact, our initialization scheme boosts the performance by over 7pp.

### 4.3 Semantic Segmentation

We then demonstrate the benefits of our approach on the task of semantic segmentation using the Cityscapes dataset (Cordts et al., 2016), which contains 5000 images of size . Following the standard protocol, we report the mean Intersection over Union (mIoU), mean recall (mRec) and mean precision (mPrec).

Experimental setup. For this experiment, we rely on the U-Net (Ronneberger et al., 2015), which is a relatively compact network consisting of a contracting path and an expansive path. The contracting path follows the typical architecture of a convolutional network. It consists of blocks of two convolutions, each followed by a ReLU, and a max pooling. The number of channels in each block is 32, 64, 128, 256, 512, respectively. We apply our CL expansion strategy with to all convolutions in the contracting path. We train the networks on 4 GPUs using the standard SGD optimizer with a momentum of 0.9 and a learning rate of .

Results. Our results on the Cityscapes validation set are shown in Table 4. Note that our ExpandNet also outperforms the original compact U-Net on this task.

Model | mIoU | mRec | mPrec |
---|---|---|---|
U-Net | | | |
ExpandNet-CL | 57.85 | 76.53 | 65.94 |

## 5 Analysis of our Approach

To further understand our approach, we first study its behavior during training and its generalization ability. For these experiments, we make use of the CIFAR-10 and CIFAR-100 datasets, and use the settings described in detail in our ablation study in the supplementary material. We then propose and analyze two hypotheses to empirically evidence that the better performance of our approach truly is a consequence of over-parameterization during training. In the supplementary material, we also showcase the use of our approach with the larger AlexNet architecture on ImageNet and evaluate the complexity of the models in terms of number of parameters, FLOPs, and training and testing inference speed. Note that, since our ExpandNets can be compressed back to the original networks, at test time, they have exactly the same number of parameters, FLOPS, and inference time as the original networks, but achieve higher accuracy.

### 5.1 Training Behavior

To investigate the benefits of linear over-parameterization on training, we make use of gradient confusion, a measure that Sankararaman et al. (2019) introduced to show that the gradients of nonlinearly over-parameterized networks are more consistent across mini-batches. Specifically, following (Sankararaman et al., 2019), we measure gradient confusion (or rather consistency) as the minimum cosine similarity from 100 randomly-sampled pairs of mini-batches at the end of each training epoch. As in (Sankararaman et al., 2019), we also combine the gradient cosine similarity of 100 pairs of sampled mini-batches at the end of training from each independent run and perform Gaussian kernel density estimation on this data. While, during training, we aim for the minimum cosine similarity to be high, at the end of training, the resulting distribution should be peaked around zero.
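
The consistency measure described above reduces to a minimum over pairwise cosine similarities. The following sketch illustrates that computation on toy vectors. It is an assumption-based illustration, not the code of Sankararaman et al. or of this paper: each mini-batch gradient is assumed to be already flattened into a list of floats, and the function names are hypothetical.

```python
import math
import random


def cosine_similarity(g1, g2):
    dot = sum(a * b for a, b in zip(g1, g2))
    norm = math.sqrt(sum(a * a for a in g1)) * math.sqrt(sum(b * b for b in g2))
    return dot / norm if norm > 0 else 0.0


def min_pairwise_cosine(gradients, num_pairs, rng):
    """Minimum cosine similarity over randomly sampled pairs of distinct mini-batch gradients.

    Also returns all sampled similarities, e.g. for a later density estimate.
    """
    sims = []
    for _ in range(num_pairs):
        i, j = rng.sample(range(len(gradients)), 2)
        sims.append(cosine_similarity(gradients[i], gradients[j]))
    return min(sims), sims


rng = random.Random(0)
# Toy stand-ins for flattened mini-batch gradients.
aligned = [[1.0 + rng.gauss(0.0, 0.1) for _ in range(10)] for _ in range(20)]
unrelated = [[rng.gauss(0.0, 1.0) for _ in range(10)] for _ in range(20)]

min_aligned, sims_aligned = min_pairwise_cosine(aligned, 100, rng)
min_unrelated, sims_unrelated = min_pairwise_cosine(unrelated, 100, rng)
print(f"aligned: {min_aligned:.3f}, unrelated: {min_unrelated:.3f}")
```

A higher minimum indicates gradients that agree more across mini-batches, that is, lower gradient confusion.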

We run each experiment 5 times and show the average values across all runs in Figure 2. From the training and test curves, we observe that our ExpandNets-CL/CK speed up convergence and yield a smaller generalization error gap. They also yield lower gradient confusion (higher minimum cosine similarity) and a more peaked density of pairwise gradient cosine similarity. This indicates that our ExpandNets-CL/CK are easier to train than the compact model. By contrast, only expanding the fully-connected layers, as in Arora et al. (2018), does not facilitate training. Additional plots are provided in the supplementary material.

| Dataset | Model | Kernel size 5: Best Test | Kernel size 5: Last Test | Kernel size 5: Train | Kernel size 9: Best Test | Kernel size 9: Last Test | Kernel size 9: Train |
|---|---|---|---|---|---|---|---|
| | SmallNet | | | | | | |
| | FC(Arora18) | | | | | | |
| | ExpandNet-CL | | | | | | |
| | ExpandNet-CK | | | | | | |
| | SmallNet | | | | | | |
| | FC(Arora18) | | | | | | |
| | ExpandNet-CL | | | | | | |
| | ExpandNet-CK | | | | | | |
| | SmallNet | | | | | | |
| | FC(Arora18) | | | | | | |
| | ExpandNet-CL | | | | | | |
| | ExpandNet-CK | | | | | | |

### 5.2 Generalization Ability

We then analyze the generalization ability of our approach. To this end, we first study the loss landscapes using the method in (Li et al., 2018). As shown in Figure 3, our ExpandNets, particularly with convolutional expansion, produce flatter minima, which, as discussed in (Li et al., 2018), indicates better generalization.

As a second study of generalization, we evaluate the memorization ability of our ExpandNets on corrupted datasets, as suggested by Zhang et al. (2017). To this end, we utilize the open-source implementation of Zhang et al. (2017) to generate three CIFAR-10 and CIFAR-100 training sets, containing 20%, 50% and 80% of random labels, respectively, while the test set remains clean.
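
For readers who want to reproduce this kind of corruption, the sketch below shows one simple way to replace a fraction of the training labels with uniformly random ones. It is not the implementation of Zhang et al. (2017), whose exact sampling scheme may differ. The function name, the fixed seed, and the choice to corrupt an exact number of samples are assumptions made for illustration.

```python
import random


def corrupt_labels(labels, num_classes, fraction, seed=0):
    """Replace a given fraction of the labels with labels drawn uniformly at random.

    A randomly drawn label may coincide with the original one, so the share of
    labels that actually change is slightly below `fraction`.
    """
    rng = random.Random(seed)
    corrupted = list(labels)
    num_corrupt = int(round(fraction * len(labels)))
    for idx in rng.sample(range(len(labels)), num_corrupt):
        corrupted[idx] = rng.randrange(num_classes)
    return corrupted


labels = [i % 10 for i in range(1000)]  # toy stand-in for a 10-class training set
noisy = corrupt_labels(labels, num_classes=10, fraction=0.5, seed=0)
changed = sum(a != b for a, b in zip(labels, noisy))
print(f"{changed} of {len(labels)} labels changed")
```

Only the training labels are altered in this way, while the test labels are left untouched.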

In Table 5 (and Tables A5, A6 and A7 in the supplementary material), we report the top-1 test errors of the best and last models, as well as the training errors of the last model. These results evidence that expanding convolutional layers and kernels typically yields lower test errors and higher training ones, which implies that our better results in the other experiments are not due to simply memorizing the datasets, but truly to better generalization ability.

### 5.3 Is Over-parameterization the Key to the Success?

In the previous experiments, we have shown the good training behavior and generalization ability of our expansion strategies. Below, we explore and reject two hypotheses other than over-parameterization that could be thought to explain our better results.

Hypothesis 1: The improvement comes from initialization with the linear product of the expanded networks.

The standard (e.g., Kaiming) initialization of our ExpandNets is in fact equivalent to a non-standard initialization of the compact network. In other words, an alternative would consist of initializing the compact network with an untrained algebraically-compressed ExpandNet. To investigate the influence of such different initialization schemes, we conduct several experiments on CIFAR-10, CIFAR-100 and ImageNet.

Network | Initialization | CIFAR-10 | CIFAR-100 |
---|---|---|---|
SmallNet | Standard | | |
SmallNet | FC(Arora18) | | |
SmallNet | ExpandNet-CL | | |
SmallNet | ExpandNet-CL+FC | | |
SmallNet | ExpandNet-CK | | |
SmallNet | ExpandNet-CK+FC | | |
ExpandNet-CK+FC | Standard | | |

Network | Initialization | ImageNet |
---|---|---|
MobileNet | Standard | |
MobileNet | ExpandNet-CL | |
ExpandNet-CL | Standard | 69.40 |
MobileNetV2 | Standard | |
MobileNetV2 | ExpandNet-CL | |
ExpandNet-CL | Standard | 65.62 |
ShuffleNetV2 0.5 | Standard | |
ShuffleNetV2 0.5 | ExpandNet-CL | |
ExpandNet-CL | Standard | 57.38 |

The results for these different initialization strategies on CIFAR-10, CIFAR-100 and ImageNet are provided in Table 6. On CIFAR-10, the compact networks initialized with FC(Arora18) and ExpandNet-CL yield slightly better results than training the ExpandNets by standard initialization. However, the same trend does not occur on CIFAR-100 and ImageNet, where, with ExpandNet-CL initialization, MobileNet, MobileNetV2 and ShuffleNetV2 0.5 reach results similar to or worse than standard initialization, while training their ExpandNet-CL counterparts always outperforms the baselines. Moreover, the compact networks initialized by ExpandNet-CK always yield worse results than training ExpandNets-CK from scratch. This confirms that initialization via linear expansion is not sufficient to yield better results.

Hypothesis 2: Some intrinsic property of the CK expansion strategy is more important than over-parameterization.

Expansion rate | #params(K) | CIFAR-10 | CIFAR-100 |
---|---|---|---|
0.25 | | | |
0.50 | | | |
0.75 | | | |
1.00 | | | |
SmallNet | | | |
2.0 | | | |
4.0 | | | |

#params(K) denotes the number of parameters (CIFAR-10 / CIFAR-100).

We note that the amount of over-parameterization is directly related to the expansion rate $r$. Therefore, if some property of the CK strategy were the sole reason for our better results, setting $r \leq 1$ should be sufficient. To study this, following the same experimental setting as for Table 1, we set the expansion rate $r$ to each value in [0.25, 0.50, 0.75, 1.0, 2.0, 4.0], and report the corresponding results in Table 7. For $r < 1$, the performance of ExpandNet-CK drops by 6.38pp, from 78.70% to 72.32%, on CIFAR-10 and by 7.09pp, from 46.41% to 39.32%, on CIFAR-100 as the number of parameters decreases. For $r > 1$, ExpandNet-CK yields consistently higher accuracy. Interestingly, with $r = 1$, ExpandNet-CK still yields better performance (79.22% vs 78.63% and 47.25% vs 46.63%) with fewer parameters (54.77K vs 66.19K and 60.62K vs 72.04K). This shows that both ExpandNet-CK and over-parameterization are important in our method.

## 6 Conclusion

We have introduced an approach to training a given compact network from scratch by exploiting over-parameterization. Specifically, we have shown that over-parameterizing the network linearly facilitates the training of compact networks, particularly when linearly expanding convolutional layers. Our experiments have demonstrated that, when applicable, our CK expansion strategy tends to yield the best results. When it is not applicable, that is, for a kernel size of 3, the CL strategy nonetheless remains highly effective. Our analysis has further evidenced that over-parameterization is the key to the success of our approach, improving both the training behavior and generalization ability of the networks, and ultimately leading to better performance at inference without any increase in computational cost. Our technique is general and can also be used in conjunction with knowledge transfer approaches. Furthermore, when the nonlinear ExpandNet counterpart outperforms the compact network, our initialization scheme yields a significant accuracy boost. This initialization strategy, however, is not the only possible one. In the future, we will therefore aim to develop other initialization schemes for our ExpandNets, and generally for compact networks.

## Appendix

## Appendix A Discussion of (Arora et al., 2018)

In this section, we discuss in more detail the work of Arora et al. (2018) to further highlight the differences from our work. Importantly, Arora et al. (2018) worked mostly with purely linear, fully-connected models, with only one example using a nonlinear model, where again only the fully-connected layer was expanded. By contrast, we focus on practical, nonlinear, compact convolutional networks, and we propose two ways to expand convolutional layers, which have not been studied before. As shown by our experiments in the main paper and in Section B of this supplementary material, our convolutional linear expansion strategies yield better solutions than vanilla training, with higher accuracy, more zero-centered gradient cosine similarity during training and minima that generalize better. This is in general not the case when expanding the fully-connected layers only, as proposed by Arora et al. (2018). Furthermore, in contrast with Arora et al. (2018), who only argue that depth speeds up convergence, we empirically show, by using different expansion rates, that increasing width helps to reach better solutions. We now discuss in more detail the only experiment in Arora et al. (2018) with a nonlinear network.

In their paper, Arora et al. (2018) performed a sanity test on MNIST with a CNN, but only expanded the fully-connected layer. According to our experiments, expanding fully-connected layers only (denoted as FC(Arora18) in our results) is typically insufficient to outperform vanilla training of the compact network. This was confirmed by using their code, with which we found that, in their setting, the over-parameterized model yields higher test error. We acknowledge that Arora et al. (2018) did not claim that expansion led to better results, only that it sped up convergence. Nevertheless, this seemed to contradict our experiments, in which our FC expansion achieved better results than that of Arora et al. (2018).

While analyzing the reasons for this, we found that Arora et al. (2018) used a different weight decay regularizer than us. Specifically, considering a single fully-connected layer expanded into two, this regularizer is defined as

$$\mathcal{L}_r = \|W_1 W_2\|_F^2, \tag{3}$$

where $W_1$ and $W_2$ represent the parameter matrices of the two fully-connected layers after expansion. That is, the regularizer is defined over the product of these parameter matrices. While this corresponds to weight decay on the original parameter matrix, without expansion, it contrasts with usual weight decay, which sums over the different parameter matrices, yielding a regularizer of the form

$$\mathcal{L}_r = \|W_1\|_F^2 + \|W_2\|_F^2. \tag{4}$$

The product norm regularizer used by Arora et al. (2018) imposes weaker constraints on the individual parameter matrices, and we observed their over-parameterized model to converge to a worse minimum and lead to worse test performance when used in a nonlinear CNN.
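The sense in which the product norm is a weaker constraint can be checked numerically. In the sketch below, which uses two small hypothetical matrices `w1` and `w2` standing in for the two expanded fully-connected layers, rescaling one factor by `c` and the other by `1 / c` leaves the product regularizer unchanged while the usual sum-of-norms weight decay grows without bound. The matrices and helper names are illustrative assumptions, not values from our experiments.

```python
def matmul(a, b):
    """Plain-Python matrix product of a (m x k) and b (k x n)."""
    b_cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in b_cols] for row in a]

def scale(matrix, factor):
    """Multiply every entry of a matrix by a scalar."""
    return [[factor * v for v in row] for row in matrix]

def squared_norm(matrix):
    """Sum of squared entries (squared Frobenius norm)."""
    return sum(v * v for row in matrix for v in row)

# Hypothetical parameter matrices of the two layers after expansion.
w1 = [[1.0, 2.0], [0.5, -1.0]]
w2 = [[0.5, 0.0], [1.0, 1.0]]

product_regs, sum_regs = [], []
for c in (1.0, 10.0, 100.0):
    a, b = scale(w1, c), scale(w2, 1.0 / c)
    # Regularizer on the product: invariant to the rescaling a -> c*a, b -> b/c.
    product_regs.append(squared_norm(matmul(a, b)))
    # Usual weight decay: penalizes each factor separately.
    sum_regs.append(squared_norm(a) + squared_norm(b))
    print(f"c={c:>5}: product={product_regs[-1]:.4f}  sum={sum_regs[-1]:.4f}")
```

The product regularizer therefore cannot distinguish between well-scaled and badly scaled factors, whereas the sum form penalizes the latter.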

To evidence this, in Figure A1, we compare the original model with an over-parameterized one relying on a product regularizer as in (Arora et al., 2018), and with an over-parameterized network with normal regularization, corresponding to our FC expansion strategy. Even though the overall loss of Arora et al. (2018)’s over-parameterized model decreases faster than that of the baseline, the cross-entropy loss term, the training error and the test error do not show the same trend. The test errors of the original model, Arora et al. (2018)’s over-parameterized model with product norm and our ExpandNet-FC with normal norm are 0.9%, 1.1% and 0.8%, respectively. Furthermore, we also compare Arora et al. (2018)’s over-parameterized model and our ExpandNet-FC with a larger expansion rate. We observe that Arora et al. (2018)’s over-parameterized model then performs even worse, while our ExpandNet-FC still works well.

Note that, in the experiments in the main paper and below, the models denoted by FC(Arora18) rely on a normal regularizer, which we observed to yield better results and makes the comparison fair as all models then use the same regularization strategy.

## Appendix B Complementary Experiments

We provide additional experimental results to further evidence the effectiveness of our expansion strategies.

### B.1 SmallNet with $3 \times 3$ Kernels

As mentioned in the main paper, we also evaluate our approach using the same small network as in (Passalis and Tefas, 2018), with a kernel size of $3 \times 3$ rather than the larger kernels used in the main paper, on CIFAR-10 and CIFAR-100. In this case, we make use of our CL expansion strategy, with and without our initialization scheme, because the CK one does not apply to $3 \times 3$ kernels. We trained all these networks for 100 epochs using a batch size of 128. We used Adam with a learning rate of 0.001, divided by 10 at epoch 50, which matches the setup in (Passalis and Tefas, 2018). As reported in Table A1, expanding the convolutional layers yields higher accuracy than the small network. This is further improved by also expanding the fully-connected layer, and even more so when using our initialization strategy.
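For reference, the learning-rate schedule described above can be written as a simple step function. The function name `learning_rate` and the zero-based epoch indexing are assumptions of this sketch; only the base rate, the division factor, the drop epoch and the 100-epoch budget come from the text.

```python
def learning_rate(epoch, base_lr=1e-3, drop_epoch=50, factor=10.0):
    """Step schedule: base_lr before drop_epoch, base_lr / factor from drop_epoch onwards."""
    return base_lr if epoch < drop_epoch else base_lr / factor

# One learning rate per epoch for the 100-epoch run (epochs counted from 0, an assumption).
schedule = [learning_rate(epoch) for epoch in range(100)]
print(schedule[0], schedule[49], schedule[50], schedule[99])
```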

### B.2 Knowledge Transfer with ExpandNets

Model | Transfer | CIFAR-10 | CIFAR-100 |
---|---|---|---|
SmallNet | w/o KD | | |
SmallNet | w/ KD | | |
FC(Arora18) | w/o KD | | |
ExpandNet-CL | w/o KD | | |
ExpandNet-CL+FC | w/o KD | | |
ExpandNet-CL+FC+Init | w/o KD | | |
ExpandNet-CL+FC | w/ KD | | |
ExpandNet-CL+FC+Init | w/ KD | | |

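Network | Transfer | Top-1 Accuracy |
---|---|---|
SmallNet | Baseline | |
SmallNet | KD | |
SmallNet | Hint | |
SmallNet | PKT | |
ExpandNet (CL+FC) | KD | |
ExpandNet (CL+FC) | Hint | |
ExpandNet (CL+FC) | PKT | |
ExpandNet (CL+FC+Init) | KD | |
ExpandNet (CL+FC+Init) | Hint | |
ExpandNet (CL+FC+Init) | PKT | |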

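Model | Expansion rate | Kernel size 3 | Kernel size 5 | Kernel size 7 | Kernel size 9 |
---|---|---|---|---|---|
SmallNet | | | | | |
FC(Arora18) | 2 | | | | |
FC(Arora18) | 4 | | | | |
FC(Arora18) | 8 | | | | |
ExpandNet-CL | 2 | | | | |
ExpandNet-CL | 4 | | | | |
ExpandNet-CL | 8 | | | | |
ExpandNet-CK | 2 | | | | |
ExpandNet-CK | 4 | | | | |
ExpandNet-CK | 8 | | | | |
SmallNet | | | | | |
FC(Arora18) | 2 | | | | |
FC(Arora18) | 4 | | | | |
FC(Arora18) | 8 | | | | |
ExpandNet-CL | 2 | | | | |
ExpandNet-CL | 4 | | | | |
ExpandNet-CL | 8 | | | | |
ExpandNet-CK | 2 | | | | |
ExpandNet-CK | 4 | | | | |
ExpandNet-CK | 8 | | | | |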

In the main paper, we claim that our ExpandNet strategy is complementary to knowledge transfer. Following (Passalis and Tefas, 2018), on CIFAR-10, we use ResNet18 as the teacher. Furthermore, we also use the same compact network with $3 \times 3$ kernels and the same training setting as in (Passalis and Tefas, 2018). In Table A2, we compare the results of different knowledge transfer strategies, including knowledge distillation (KD) (Hinton et al., 2015), hint-based transfer (Hint) (Romero et al., 2014) and probabilistic knowledge transfer (PKT) (Passalis and Tefas, 2018), applied to the compact network and to our ExpandNets, without and with our initialization scheme. Note that using knowledge transfer with our ExpandNets, with and without initialization, consistently outperforms using it with the compact network. Altogether, we therefore believe that, to train a given compact network, one should use both knowledge transfer and our ExpandNets to obtain the best results.

### B.3 Ablation Study

In this section, we conduct an ablation study on the hyper-parameter choice, specifically the expansion rate and kernel size. We evaluate the behavior of different expansion strategies, FC(Arora18), CL and CK, separately, when varying the expansion rate $r$ and the kernel size $k$. Compared to our previous CIFAR-10 and CIFAR-100 experiments, we use a deeper network with an extra convolutional layer with 64 channels, followed by a batch normalization layer, a ReLU layer and a max pooling layer. We train with SGD with momentum and weight decay, and divide the initial learning rate by a constant factor at two fixed epochs. Furthermore, we use zero-padding to keep the size of the input and output feature maps of each convolutional layer unchanged. We use the same networks in our approach analysis experiments.

The results of these experiments are provided in Table A3. We observe that our different strategies to expand convolutional layers outperform the compact network in almost all cases, while only expanding fully-connected layers does not work well. Moreover, the benefits of our convolution expansion strategies (CL and CK) grow as the expansion rate increases. In particular, for kernel sizes $k > 3$, ExpandNet-CK yields consistently higher accuracy than the corresponding compact network, independently of the expansion rate. For $k = 3$, where ExpandNet-CK is not applicable, ExpandNet-CL is an effective alternative, also consistently outperforming the baseline.

# Channels | 128 | 256 (Original) | 512 |
---|---|---|---|
Baseline | | | |
ExpandNet-CK | 49.66 | 55.46 | 58.75 |

Dataset | Model | Best Test (kernel size 3) | Last Test (kernel size 3) | Train (kernel size 3) | Best Test (kernel size 7) | Last Test (kernel size 7) | Train (kernel size 7) |
---|---|---|---|---|---|---|---|
| SmallNet | | | | | | |
| FC(Arora18) | | | | | | |
| ExpandNet-CL | | | | | | |
| ExpandNet-CK | | | | | | |
| SmallNet | | | | | | |
| FC(Arora18) | | | | | | |
| ExpandNet-CL | | | | | | |
| ExpandNet-CK | | | | | | |
| SmallNet | | | | | | |
| FC(Arora18) | | | | | | |
| ExpandNet-CL | | | | | | |
| ExpandNet-CK | | | | | | |

Dataset | Model | Best Test (kernel size 3) | Last Test (kernel size 3) | Train (kernel size 3) | Best Test (kernel size 5) | Last Test (kernel size 5) | Train (kernel size 5) |
---|---|---|---|---|---|---|---|
| SmallNet | | | | | | |
| FC(Arora18) | | | | | | |
| ExpandNet-CL | | | | | | |
| ExpandNet-CK | | | | | | |
| SmallNet | | | | | | |
| FC(Arora18) | | | | | | |
| ExpandNet-CL | | | | | | |
| ExpandNet-CK | | | | | | |
| SmallNet | | | | | | |
| FC(Arora18) | | | | | | |
| ExpandNet-CL | | | | | | |
| ExpandNet-CK | | | | | | |

Dataset | Model | Best Test (kernel size 7) | Last Test (kernel size 7) | Train (kernel size 7) | Best Test (kernel size 9) | Last Test (kernel size 9) | Train (kernel size 9) |
---|---|---|---|---|---|---|---|
| SmallNet | | | | | | |
| FC(Arora18) | | | | | | |
| ExpandNet-CL | | | | | | |
| ExpandNet-CK | | | | | | |
| SmallNet | | | | | | |
| FC(Arora18) | | | | | | |
| ExpandNet-CL | | | | | | |
| ExpandNet-CK | | | | | | |
| SmallNet | | | | | | |
| FC(Arora18) | | | | | | |
| ExpandNet-CL | | | | | | |
| ExpandNet-CK | | | | | | |

### B.4 Working with Larger Networks

We also evaluate the use of our approach with a larger network. To this end, we make use of AlexNet (Krizhevsky et al., 2012) on ImageNet. AlexNet relies on kernels of size $11 \times 11$ and $5 \times 5$ in its first two convolutional layers, which makes our CK expansion strategy applicable.

We use a modified, more compact version of AlexNet, where we replace the first fully-connected layer with a global average pooling layer, followed by a 1000-class fully-connected layer with softmax. To evaluate the impact of the network size, we explore the use of different dimensions, 128, 256 (the original) and 512, for the final convolutional features. We trained the resulting AlexNets and corresponding ExpandNets using the same training regime as for our MobileNets experiments in the main paper.
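To give a sense of how the classifier head of this modified AlexNet scales with the feature dimension, the sketch below implements global average pooling on a toy feature map and counts the parameters of the 1000-class fully-connected layer for the three channel counts listed in Table A4. The helper names and the toy feature map are assumptions of this example, and the count includes one bias per class.

```python
def global_average_pool(feature_map):
    """feature_map: list of channels, each an H x W nested list. Returns one mean per channel."""
    return [sum(sum(row) for row in channel) / (len(channel) * len(channel[0]))
            for channel in feature_map]

def classifier_params(num_features, num_classes=1000):
    """Weights plus biases of a single fully-connected layer."""
    return num_features * num_classes + num_classes

# Toy feature map with a single 2 x 2 channel (hypothetical values).
pooled = global_average_pool([[[1.0, 3.0], [5.0, 7.0]]])
print("pooled features:", pooled)

# Parameter count of the classifier for each feature dimension in Table A4.
head_sizes = {d: classifier_params(d) for d in (128, 256, 512)}
for d, count in head_sizes.items():
    print(f"{d} channels -> {count} classifier parameters")
```

Since pooling removes the spatial dimensions, the head grows only linearly with the number of channels, which is what keeps this AlexNet variant compact.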

As shown in Table A4, while our approach outperforms the baseline AlexNets for all feature dimensions, the benefits decrease as the feature dimension increases. This indicates that our approach is better suited for truly compact networks, and developing similar strategies for deeper ones will be the focus of our future research.

Model | # Params(M), Train | # Params(M), Test | # FLOPs, Train | # FLOPs, Test | GPU Speed (imgs/sec), Train | GPU Speed (imgs/sec), Test |
---|---|---|---|---|---|---|
SmallNet | | | M | M | | |
ExpandNet-CL | | | M | | | |
ExpandNet-CL+FC | | | M | | | |
ExpandNet-CK | | | M | | | |
ExpandNet-CK+FC | | | M | | | |
MobileNet | | | G | G | | |
ExpandNet-CL | | | G | | | |
MobileNetV2 | | | G | G | | |
ExpandNet-CL | | | G | | | |
ShuffleNetV2 | | | G | G | | |
ExpandNet-CL | | | G | | | |
YOLO-LITE | | | G | G | | |
ExpandNet-CL | | | G | | | |
U-Net | | | G | G | | |
ExpandNet-CL | | | G | | | |

### B.5 Generalization Ability on Corrupted CIFAR-10 and CIFAR-100

Our experiments on corrupted datasets in the main paper imply better generalization. Here, we provide more experimental results on Corrupted CIFAR-10 (in Table A5) and CIFAR-100 (in Tables A6 and A7) by using different networks with kernel sizes of 3, 5, 7, 9, respectively. Our method consistently improves the generalization error gap across all kernel sizes and corruption rates (20%, 50%, 80%) and yields from around 1pp to over 6pp error drop in testing.

## Appendix C Complexity Analysis

Here, we compare the complexity of our expanded networks and the original networks in terms of number of parameters, number of FLOPs and GPU inference speed.

In Table A8, we provide numbers for both training and testing. During training, because our approach introduces more parameters, the expanded network processes images 2 to 5 times more slowly than the original network for an expansion rate of 4. However, since our ExpandNets can be compressed back to the original network, at test time they have exactly the same number of parameters, FLOPs and inference time as the original network, while achieving higher accuracy.
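The reason the test-time cost is unchanged can be seen on a toy example: two consecutive linear convolutions with no nonlinearity in between are equivalent to a single convolution with a larger kernel. The sketch below checks this for a single-channel 1D signal, which is a deliberate simplification; the function names, the signal and the kernels are assumptions of this example and not our implementation.

```python
def correlate_valid(signal, kernel):
    """1D 'valid' cross-correlation (what deep-learning libraries call convolution)."""
    k = len(kernel)
    return [sum(signal[i + j] * kernel[j] for j in range(k))
            for i in range(len(signal) - k + 1)]

def merge_kernels(first, second):
    """Kernel of the single layer equivalent to applying `first` and then `second`."""
    merged = [0.0] * (len(first) + len(second) - 1)
    for j, a in enumerate(first):
        for k, b in enumerate(second):
            merged[j + k] += a * b
    return merged

x = [1.0, -2.0, 0.5, 3.0, 4.0, -1.0, 2.0, 0.0]  # toy input signal
k1 = [1.0, 0.0, -1.0]                             # first size-3 kernel
k2 = [0.5, 2.0, 1.0]                              # second size-3 kernel

# Training-time view: two stacked linear layers.
two_step = correlate_valid(correlate_valid(x, k1), k2)

# Test-time view: one layer with the algebraically merged size-5 kernel.
merged = merge_kernels(k1, k2)
one_step = correlate_valid(x, merged)

print("merged kernel size:", len(merged))
print("outputs match:", all(abs(a - b) < 1e-9 for a, b in zip(two_step, one_step)))
```

In the multi-channel case, the intermediate layers can be made arbitrarily wide during training, while the merged kernel keeps the shape of the original compact layer.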

## Appendix D Additional Visualizations

Here, we provide additional visualizations of the training behavior and the loss landscapes, as in the main paper, for networks with kernel sizes of 3, 5, 7 and 9.

We plot the loss landscapes of SmallNets and corresponding ExpandNets on CIFAR-10 in Figure A2, and analyze the training behavior on CIFAR-10 in Figure A3 and on CIFAR-100 in Figure A4. These plots further confirm that in almost all cases, our convolution expansion strategies (CL and CK) facilitate training (with lower gradient confusion and more 0-concentrated gradient cosine similarity density) and yield better generalization ability (with flatter minima).
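As a reminder of what the gradient cosine similarity statistics measure, the following sketch computes all pairwise cosine similarities between a set of gradient vectors and reports their minimum. The three toy gradients are hypothetical, and using the minimum pairwise cosine similarity as a proxy for gradient confusion is an assumption of this example rather than a description of our exact protocol.

```python
import math
from itertools import combinations

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def pairwise_cosines(gradients):
    """Cosine similarity for every unordered pair of gradient vectors."""
    return [cosine(g1, g2) for g1, g2 in combinations(gradients, 2)]

# Toy flattened gradients from three mini-batches (hypothetical values).
grads = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [-1.0, 0.0, 0.0]]

sims = pairwise_cosines(grads)
min_sim = min(sims)
print("pairwise cosine similarities:", sims)
print("minimum (most conflicting pair):", min_sim)
```

A density of such similarities concentrated around zero means that mini-batch gradients rarely point in opposing directions, which is the behavior referred to above as lower gradient confusion.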

### References

- Learning and generalization in overparameterized neural networks, going beyond two layers. arxiv. Cited by: §1, §2.
- A convergence theory for deep learning via over-parameterization. arxiv. Cited by: §1, §2.
- Learning the number of neurons in deep networks. In nips, Cited by: §2.
- Compression-aware training of deep networks. In nips, Cited by: §2.
- On the optimization of deep networks: implicit acceleration by overparameterization. In icml, Cited by: Appendix A, Appendix A, Appendix A, Appendix A, Appendix A, §1, §1, §2, §2, §3.1, §4.1.1, §4, §5.1.
- Neural networks and principal component analysis: learning from examples without local minima. tnn. Cited by: §1, §2.
- “Learning-compression” algorithms for neural net pruning. In cvpr, Cited by: §2.
- The cityscapes dataset for semantic urban scene understanding. In cvpr, Cited by: §4.3.
- Binarized neural networks: training deep neural networks with weights and activations constrained to +1 or -1. arxiv. Cited by: §2.
- Predicting parameters in deep learning. In nips, Cited by: §2.
- ACNet: strengthening the kernel skeletons for powerful cnn via asymmetric convolution blocks. In The IEEE International Conference on Computer Vision (ICCV), Cited by: §2.
- The PASCAL Visual Object Classes Challenge 2007 (VOC2007) Results. Note: http://www.pascal-network.org/challenges/VOC/voc2007/workshop/index.html Cited by: §4.2.
- The PASCAL Visual Object Classes Challenge 2012 (VOC2012) Results. Note: http://www.pascal-network.org/challenges/VOC/voc2012/workshop/index.html Cited by: §4.2.
- Perforatedcnns: acceleration through elimination of redundant convolutions. In nips, Cited by: §2.
- Deep compression: compressing deep neural networks with pruning, trained quantization and huffman coding. arxiv. Cited by: §2.
- Delving deep into rectifiers: surpassing human-level performance on imagenet classification. In iccv, Cited by: §3.5.
- Deep residual learning for image recognition. In cvpr, Cited by: §1, §3.3, §4.1.2.
- Distilling the knowledge in a neural network. arxiv. Cited by: §B.2, §1, §2, §4.1.1.
- Mobilenets: efficient convolutional neural networks for mobile vision applications. arxiv. Cited by: §2, §4.1.2, §4.1.
- Densely connected convolutional networks. In cvpr, Cited by: §1.
- YOLO-lite: a real-time object detection algorithm optimized for non-gpu computers. In icbd, Cited by: §4.2.
- Flattened convolutional neural networks for feedforward acceleration. In iclrw, Cited by: §2.
- Generalization in deep learning. In Mathematics of Deep Learning, Cambridge University Press. Preprint available as: MIT-CSAIL-TR-2018-014, Massachusetts Institute of Technology, Cited by: §1.
- Deep learning without poor local minima. In nips, Cited by: §1, §2.
- ImageNet classification with deep convolutional neural networks. In nips, Cited by: §B.4, §1.
- Deep linear networks with arbitrary loss: all local minima are global. In icml, Cited by: §1, §2.
- Speeding-up convolutional neural networks using fine-tuned cp-decomposition. arxiv. Cited by: §2.
- Optimal brain damage. In nips, Cited by: §2.
- Visualizing the loss landscape of neural nets. In nips, Cited by: §5.2.
- Sparse convolutional neural networks. In cvpr, Cited by: §2.
- Fully convolutional networks for semantic segmentation. In cvpr, Cited by: §1.
- ShuffleNet v2: practical guidelines for efficient cnn architecture design. In eccv, Cited by: §4.1.2, §4.1.
- All you need is a good init. corr. Cited by: §3.5.
- Pruning convolutional neural networks for resource efficient inference. In iclr, Cited by: §2.
- Learning deep representations with probabilistic knowledge transfer. In eccv, Cited by: §B.1, §B.2, §1, §2, §4.1.1.
- YOLO9000: better, faster, stronger. arxiv. Cited by: §1, §4.2.
- Yolov3: an incremental improvement. arxiv. Cited by: §1, §4.2.
- Pruning algorithms-a survey. tnn 4 (5), pp. 740–747. External Links: Document, ISSN 1045-9227 Cited by: §1, §2.
- Faster r-cnn: towards real-time object detection with region proposal networks. In nips, Cited by: §1.
- ERFNet: efficient residual factorized convnet for real-time semantic segmentation. tits. Cited by: §2.
- Fitnets: hints for thin deep nets. arxiv. Cited by: §B.2, §1, §2, §4.1.1.
- U-net: convolutional networks for biomedical image segmentation. In miccai, Cited by: §1, §4.3.
- ImageNet large scale visual recognition challenge. ijcv 115 (3), pp. 211–252. External Links: Document Cited by: §4.1.2.
- Low-rank matrix factorization for deep neural network training with high-dimensional output targets. In icassp, Cited by: §2.
- MobileNetV2: inverted residuals and linear bottlenecks. In cvpr, Cited by: §2, §4.1.2, §4.1.
- The impact of neural network overparameterization on gradient confusion and stochastic gradient descent. arxiv. Cited by: §5.1.
- Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. In iclr, Cited by: §1, §1, §2.
- Very deep convolutional networks for large-scale image recognition. In iclr, Cited by: §1.
- Going deeper with convolutions. In cvpr, Cited by: §1.
- Rethinking the inception architecture for computer vision. In cvpr, Cited by: §2.
- Soft weight-sharing for neural network compression. In iclr, Cited by: §2.
- Learning structured sparsity in deep neural networks. In nips, Cited by: §2.
- Coordinating filters for faster deep neural networks. In iccv, Cited by: §2.
- Squeezedet: unified, small, low power fully convolutional neural networks for real-time object detection for autonomous driving. arxiv. Cited by: §2.
- A gift from knowledge distillation: fast optimization, network minimization and transfer learning. In cvpr, Cited by: §1, §2.
- Paying more attention to attention: improving the performance of convolutional neural networks via attention transfer. In iclr, Cited by: §1, §2.
- Understanding deep learning requires rethinking generalization. In iclr, Cited by: §1, §2, §5.2.
- Critical points of linear neural networks: analytical forms and landscape properties. In iclr, Cited by: §1, §2.