Designing Network Design Spaces
Abstract
In this work, we present a new network design paradigm. Our goal is to help advance the understanding of network design and discover design principles that generalize across settings. Instead of focusing on designing individual network instances, we design network design spaces that parametrize populations of networks. The overall process is analogous to classic manual design of networks, but elevated to the design space level. Using our methodology we explore the structure aspect of network design and arrive at a low-dimensional design space consisting of simple, regular networks that we call RegNet. The core insight of the RegNet parametrization is surprisingly simple: widths and depths of good networks can be explained by a quantized linear function. We analyze the RegNet design space and arrive at interesting findings that do not match the current practice of network design. The RegNet design space provides simple and fast networks that work well across a wide range of flop regimes. Under comparable training settings and flops, the RegNet models outperform the popular EfficientNet models while being up to 5× faster on GPUs.
1 Introduction
Deep convolutional neural networks are the engine of visual recognition. Over the past several years better architectures have resulted in considerable progress in a wide range of visual recognition tasks. Examples include LeNet [15], AlexNet [13], VGG [26], and ResNet [8]. This body of work advanced both the effectiveness of neural networks as well as our understanding of network design. In particular, the above sequence of works demonstrated the importance of convolution, network and data size, depth, and residuals, respectively. The outcome of these works is not just particular network instantiations, but also design principles that can be generalized and applied to numerous settings.
While manual network design has led to large advances, finding well-optimized networks manually can be challenging, especially as the number of design choices increases. A popular approach to address this limitation is neural architecture search (NAS). Given a fixed search space of possible networks, NAS automatically finds a good model within the search space. Recently, NAS has received a lot of attention and shown excellent results [34, 18, 29].
Despite the effectiveness of NAS, the paradigm has limitations. The outcome of the search is a single network instance tuned to a specific setting (e.g., hardware platform). This is sufficient in some cases; however, it does not enable discovery of network design principles that deepen our understanding and allow us to generalize to new settings. In particular, our aim is to find simple models that are easy to understand, build upon, and generalize.
In this work, we present a new network design paradigm that combines the advantages of manual design and NAS. Instead of focusing on designing individual network instances, we design design spaces that parametrize populations of networks.
The general strategy we adopt is to progressively design simplified versions of an initial, relatively unconstrained, design space while maintaining or improving its quality (Figure 1). The overall process is analogous to manual design, elevated to the population level and guided via distribution estimates of network design spaces [21].
As a testbed for this paradigm, our focus is on exploring network structure (e.g., width, depth, groups, etc.) assuming standard model families including VGG [26], ResNet [8], and ResNeXt [31]. We start with a relatively unconstrained design space we call AnyNet (e.g., widths and depths vary freely across stages) and apply our human-in-the-loop methodology to arrive at a low-dimensional design space consisting of simple "regular" networks, that we call RegNet. The core of the RegNet design space is simple: stage widths and depths are determined by a quantized linear function. Compared to AnyNet, the RegNet design space has simpler models, is easier to interpret, and has a higher concentration of good models.
We design the RegNet design space in a low-compute, low-epoch regime using a single network block type on ImageNet [3]. We then show that the RegNet design space generalizes to larger compute regimes, schedule lengths, and network block types. Furthermore, an important property of the design space design is that it is more interpretable and can lead to insights that we can learn from. We analyze the RegNet design space and arrive at interesting findings that do not match the current practice of network design. For example, we find that the depth of the best models is stable across compute regimes (~20 blocks) and that the best models do not use either a bottleneck or inverted bottleneck.
We compare top RegNet models to existing networks in various settings. First, RegNet models are surprisingly effective in the mobile regime. We hope that these simple models can serve as strong baselines for future work. Next, RegNet models lead to considerable improvements over standard ResNe(X)t [8, 31] models in all metrics. We highlight the improvements for fixed activations, which is of high practical interest as the number of activations can strongly influence the runtime on accelerators such as GPUs. Next, we compare to the state-of-the-art EfficientNet [29] models across compute regimes. Under comparable training settings and flops, RegNet models outperform EfficientNet models while being up to 5× faster on GPUs. We further test generalization on ImageNetV2 [24].
We note that network structure is arguably the simplest form of a design space design one can consider. Focusing on designing richer design spaces (e.g., including operators) may lead to better networks. Nevertheless, the structure will likely remain a core component of such design spaces.
In order to facilitate future research we will release all code and pretrained models introduced in this work.
2 Related Work
Manual network design.
The introduction of AlexNet [13] catapulted network design into a thriving research area. In the following years, improved network designs were proposed; examples include VGG [26], Inception [27, 28], ResNet [8], ResNeXt [31], DenseNet [11], and MobileNet [9, 25]. The design process behind these networks was largely manual and focused on discovering new design choices that improve accuracy, e.g., the use of deeper models or residuals. We likewise share the goal of discovering new design principles. In fact, our methodology is analogous to manual design but performed at the design space level.
Automated network design.
Recently, the network design process has shifted from a manual exploration to more automated network design, popularized by NAS. NAS has proven to be an effective tool for finding good models, e.g., [35, 23, 17, 20, 18, 29]. The majority of work in NAS focuses on the search algorithm, i.e., efficiently finding the best network instances within a fixed, manually designed search space (which we call a design space). Instead, our focus is on a paradigm for designing novel design spaces. The two are complementary: better design spaces can improve the efficiency of NAS search algorithms and also lead to the existence of better models by enriching the design space.
Network scaling.
Both manual and semi-automated network design typically focus on finding best-performing network instances for a specific regime (e.g., number of flops comparable to ResNet-50). Since the result of this procedure is a single network instance, it is not clear how to adapt the instance to a different regime (e.g., fewer flops). A common practice is to apply network scaling rules, such as varying network depth [8], width [32], resolution [9], or all three jointly [29]. Instead, our goal is to discover general design principles that hold across regimes and allow for efficient tuning for the optimal network in any target regime.
Comparing networks.
Given the vast number of possible network design spaces, it is essential to use a reliable comparison metric to guide our design process. Recently, the authors of [21] proposed a methodology for comparing and analyzing populations of networks sampled from a design space. This distribution-level view is fully aligned with our goal of finding general design principles. Thus, we adopt this methodology and demonstrate that it can serve as a useful tool for the design space design process.
Parameterization.
Our final quantized linear parameterization shares similarity with previous work, e.g., how stage widths are set [26, 7, 32, 11, 9]. However, there are two key differences. First, we provide an empirical study justifying the design choices we make. Second, we give insights into structural design choices that were not previously understood (e.g., how to set the number of blocks in each stage).
3 Design Space Design
Our goal is to design better networks for visual recognition. Rather than designing or searching for a single best model under specific settings, we study the behavior of populations of models. We aim to discover general design principles that can apply to and improve an entire model population. Such design principles can provide insights into network design and are more likely to generalize to new settings (unlike a single model tuned for a specific scenario).
We rely on the concept of network design spaces introduced by Radosavovic et al. [21]. A design space is a large, possibly infinite, population of model architectures. The core insight from [21] is that we can sample models from a design space, giving rise to a model distribution, and turn to tools from classical statistics to analyze the design space. We note that this differs from architecture search, where the goal is to find the single best model from the space.
In this work, we propose to design progressively simplified versions of an initial, unconstrained design space. We refer to this process as design space design. Design space design is akin to sequential manual network design, but elevated to the population level. Specifically, in each step of our design process the input is an initial design space and the output is a refined design space, where the aim of each design step is to discover design principles that yield populations of simpler or better performing models.
We begin by describing the basic tools we use for design space design in §3.1. Next, in §3.2 we apply our methodology to a design space, called AnyNet, that allows unconstrained network structures. In §3.3, after a sequence of design steps, we obtain a simplified design space consisting of only regular network structures that we name RegNet. Finally, as our goal is not to design a design space for a single setting, but rather to discover general principles of network design that generalize to new settings, in §3.4 we test the generalization of the RegNet design space to new settings.
Relative to the AnyNet design space, the RegNet design space (1) is simplified both in terms of its dimension and the type of network configurations it permits, (2) contains a higher concentration of top-performing models, and (3) is more amenable to analysis and interpretation.
3.1 Tools for Design Space Design
We begin with an overview of tools for design space design. To evaluate and compare design spaces, we use the tools introduced by Radosavovic et al. [21], who propose to quantify the quality of a design space by sampling a set of models from that design space and characterizing the resulting model error distribution. The key intuition behind this approach is that comparing distributions is more robust and informative than using search (manual or automated) and comparing the best found models from two design spaces.
To obtain a distribution of models, we sample and train models from a design space. For efficiency, we primarily do so in a low-compute, low-epoch training regime. In particular, in this section we use the 400 million flop (400MF) regime and train each sampled model for 10 epochs on ImageNet [3].
As in [21], our primary tool for analyzing design space quality is the error empirical distribution function (EDF). The error EDF of $n$ models with errors $e_i$ is given by:
(1)  $F(e) = \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}[e_i < e]$
$F(e)$ gives the fraction of models with error less than $e$. We show the error EDF for $n = 500$ sampled models from the AnyNetX design space (described in §3.2) in Figure 2 (left).
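The error EDF can be computed directly from a list of model errors. The following is a minimal sketch of that definition (the fraction of models with error strictly below a threshold); the function name `error_edf` and the sample errors are hypothetical and only serve as illustration.

```python
def error_edf(errors):
    """Return a function F where F(e) is the fraction of models with error < e."""
    errors = list(errors)
    n = len(errors)
    if n == 0:
        raise ValueError("need at least one model error")

    def F(e):
        # Count the models whose error is strictly below the threshold e.
        return sum(1 for e_i in errors if e_i < e) / n

    return F


# Hypothetical top-1 errors (in %) of five sampled models.
sample_errors = [48.2, 51.0, 45.7, 60.3, 47.9]
F = error_edf(sample_errors)
print(F(48.0))  # two of the five errors are below 48.0, so this prints 0.4
```

Plotting `F` over a range of thresholds gives the kind of curve used to compare design spaces: a curve that rises earlier corresponds to a population with a larger share of low-error models.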
Given a population of trained models, we can plot and analyze various network properties versus network error, see Figure 2 (middle) and (right) for two examples taken from the AnyNetX design space. Such visualizations show 1D projections of a complex, high-dimensional space, and can help obtain insights into the design space. For these plots, we employ an empirical bootstrap to estimate the likely range in which the best models fall.
To summarize: (1) we generate distributions of models obtained by sampling and training models from a design space, (2) we compute and plot error EDFs to summarize design space quality, (3) we visualize various properties of a design space and use an empirical bootstrap to gain insight, and (4) we use these insights to refine the design space.
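The bootstrap step can be illustrated with a short sketch. The exact procedure is not spelled out in this section, so everything below is an assumption: we resample a fraction of (statistic, error) pairs with replacement, record the statistic of the lowest-error model in each resample, and report a central interval of those values. The function name and the defaults `frac`, `repeats`, and `ci` are hypothetical.

```python
import random


def bootstrap_best_range(pairs, frac=0.25, repeats=1000, ci=0.95, seed=0):
    """Estimate the range of a model statistic in which the best models fall.

    pairs: list of (statistic, error) tuples, one per trained model.
    Returns (lo, hi), a central `ci` interval of the statistic of the
    lowest-error model over `repeats` resamples.
    """
    rng = random.Random(seed)
    k = max(1, int(round(frac * len(pairs))))
    best_stats = []
    for _ in range(repeats):
        # Resample k pairs with replacement and keep the min-error model.
        sample = [rng.choice(pairs) for _ in range(k)]
        best_stats.append(min(sample, key=lambda p: p[1])[0])
    best_stats.sort()
    tail = (1.0 - ci) / 2.0
    lo = best_stats[int(tail * (repeats - 1))]
    hi = best_stats[int((1.0 - tail) * (repeats - 1))]
    return lo, hi


# Hypothetical population: error is lowest for depth 20 and grows away from it.
pairs = [(d, 30.0 + abs(d - 20)) for d in range(10, 31)]
print(bootstrap_best_range(pairs))
```

A narrow interval suggests that the statistic is strongly tied to low error, while a wide interval suggests that good models exist across many values of it.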
3.2 The AnyNet Design Space
We next introduce our initial AnyNet design space. Our focus is on exploring the structure of neural networks assuming standard, fixed network blocks (e.g., residual bottleneck blocks). In our terminology the structure of the network includes elements such as the number of blocks (i.e., network depth), block widths (i.e., number of channels), and other block parameters such as bottleneck ratios or group widths. The structure of the network determines the distribution of compute, parameters, and memory throughout the computational graph of the network and is key in determining its accuracy and efficiency.
The basic design of networks in our AnyNet design space is straightforward. Given an input image, a network consists of a simple stem, followed by the network body that performs the bulk of the computation, and a final network head that predicts the output classes, see Figure 3a. We keep the stem and head fixed and as simple as possible, and instead focus on the structure of the network body that is central in determining network compute and accuracy.
The network body consists of 4 stages operating at progressively reduced resolution, see Figure 3b (we explore varying the number of stages in §3.4). Each stage consists of a sequence of identical blocks, see Figure 3c. In total, for each stage $i$ the degrees of freedom include the number of blocks $d_i$, block width $w_i$, and any other block parameters. While the general structure is simple, the total number of possible networks in the AnyNet design space is vast.
Most of our experiments use the standard residual bottleneck block with group convolution [31], shown in Figure 4. We refer to this as the X block, and the AnyNet design space built on it as AnyNetX (we explore other blocks in §3.4). While the X block is quite rudimentary, we show it can be surprisingly effective when network structure is optimized.
The AnyNetX design space has 16 degrees of freedom as each network consists of 4 stages and each stage $i$ has 4 parameters: the number of blocks $d_i$, block width $w_i$, bottleneck ratio $b_i$, and group width $g_i$. We fix the input resolution $r = 224$ unless otherwise noted. To obtain valid models, we perform log-uniform sampling of $d_i \le 16$, $w_i \le 1024$ and divisible by 8, $b_i \in \{1, 2, 4\}$, and $g_i \in \{1, 2, \ldots, 32\}$ (we test these ranges later). We repeat the sampling until we obtain $n = 500$ models in our target complexity regime (360MF to 400MF), and train each model for 10 epochs.
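A minimal sketch of drawing one AnyNetX configuration follows. It takes the ranges to be $d_i \le 16$, $w_i \le 1024$ in multiples of 8, $b_i \in \{1, 2, 4\}$, and treats group widths as powers of two up to 32; these choices and the helper names `log_uniform_int` and `sample_anynetx` are assumptions of the sketch. The flop filter (360MF to 400MF) and any compatibility checks between group width and block width are omitted.

```python
import math
import random


def log_uniform_int(rng, low, high, multiple=1):
    """Sample an integer log-uniformly from [low, high], rounded to `multiple`."""
    x = math.exp(rng.uniform(math.log(low), math.log(high)))
    v = int(round(x / multiple)) * multiple
    return min(max(v, low), high)  # guard against rounding outside the range


def sample_anynetx(rng, num_stages=4):
    """Draw one unconstrained configuration: 4 parameters for each stage."""
    return [
        {
            "d": log_uniform_int(rng, 1, 16),               # number of blocks
            "w": log_uniform_int(rng, 8, 1024, multiple=8),  # block width
            "b": rng.choice([1, 2, 4]),                      # bottleneck ratio
            "g": rng.choice([1, 2, 4, 8, 16, 32]),           # group width
        }
        for _ in range(num_stages)
    ]


rng = random.Random(0)
for stage in sample_anynetx(rng):
    print(stage)
```

In a full pipeline one would presumably keep drawing configurations, compute their flops, and discard those outside the target complexity regime before training.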
There are $(16 \cdot 128 \cdot 3 \cdot 6)^4 \approx 10^{18}$ possible model configurations in the AnyNetX design space. Rather than searching for the single best model out of these $\sim 10^{18}$ configurations, we explore whether there are general design principles that can help us understand and refine this design space. To do so, we apply our approach of designing design spaces. In each step of this approach, our aims are:

to simplify the structure of the design space,

to improve the interpretability of the design space,

to improve or maintain the design space quality,

to maintain model diversity in the design space.
We now apply this approach to the AnyNetX design space.
AnyNetX$_A$.
For clarity, going forward we refer to the initial, unconstrained AnyNetX design space as AnyNetX$_A$.
AnyNetX$_B$.
We first test a shared bottleneck ratio $b_i = b$ for all stages $i$ for the AnyNetX$_A$ design space, and refer to the resulting design space as AnyNetX$_B$. As before, we sample and train 500 models from AnyNetX$_B$ in the same settings. The EDFs of AnyNetX$_A$ and AnyNetX$_B$, shown in Figure 5 (left), are virtually identical both in the average and best case. This indicates no loss in accuracy when coupling the $b_i$. In addition to being simpler, the AnyNetX$_B$ is more amenable to analysis, see for example Figure 5 (right).
AnyNetX$_C$.
Our second refinement step closely follows the first. Starting with AnyNetX$_B$, we additionally use a shared group width $g_i = g$ for all stages to obtain AnyNetX$_C$. As before, the EDFs are nearly unchanged, see Figure 5 (middle). Overall, AnyNetX$_C$ has 6 fewer degrees of freedom than AnyNetX$_A$, and reduces the design space size by nearly four orders of magnitude. Interestingly, we find $g > 1$ is best (not shown); we analyze this in more detail in §4.
AnyNetX$_D$.
Next, we examine typical network structures of both good and bad networks from AnyNetX$_C$ in Figure 6. A pattern emerges: good networks have increasing widths. We test the design principle of $w_{i+1} \ge w_i$, and refer to the design space with this constraint as AnyNetX$_D$. In Figure 7 (left) we see this improves the EDF substantially. We return to examining other options for controlling width shortly.
AnyNetX$_E$.
Upon further inspection of many models (not shown), we observed another interesting trend. In addition to stage widths $w_i$ increasing with $i$, the stage depths $d_i$ likewise tend to increase for the best models, although not necessarily in the last stage. Nevertheless, we test a design space variant AnyNetX$_E$ with $d_{i+1} \ge d_i$ in Figure 7 (right), and see it also improves results. Finally, we note that the constraints on $w_i$ and $d_i$ each reduce the design space by $4!$, with a cumulative reduction of $O(10^7)$ from AnyNetX$_A$.
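The successive restrictions can be written as simple predicates on a per-stage configuration. The sketch below assumes each stage is a dict with keys `d`, `w`, `b`, `g` (a hypothetical representation) and checks the four constraints: shared bottleneck ratio, shared group width, non-decreasing widths, and non-decreasing depths. It also illustrates why an ordering constraint on four stages shrinks the space by roughly $4!$: of the 24 orderings of four distinct values, exactly one is non-decreasing.

```python
from itertools import permutations


def satisfies_anynetx_e(stages):
    """Check the cumulative constraints B through E on a list of stage dicts."""
    shared_b = len({s["b"] for s in stages}) == 1   # B: one bottleneck ratio
    shared_g = len({s["g"] for s in stages}) == 1   # C: one group width
    pairs = list(zip(stages, stages[1:]))
    widths_up = all(a["w"] <= b["w"] for a, b in pairs)  # D: widths non-decreasing
    depths_up = all(a["d"] <= b["d"] for a, b in pairs)  # E: depths non-decreasing
    return shared_b and shared_g and widths_up and depths_up


# Hypothetical configuration that satisfies all four constraints.
good = [
    {"d": 1, "w": 32, "b": 1, "g": 8},
    {"d": 2, "w": 64, "b": 1, "g": 8},
    {"d": 6, "w": 160, "b": 1, "g": 8},
    {"d": 6, "w": 384, "b": 1, "g": 8},
]
print(satisfies_anynetx_e(good))  # True

# Of the 4! orderings of four distinct widths, only one is non-decreasing.
orders = list(permutations([32, 64, 160, 384]))
increasing = [o for o in orders if list(o) == sorted(o)]
print(len(orders), len(increasing))  # 24 1
```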
3.3 The RegNet Design Space
To gain further insight into the model structure, we show the best 20 models from AnyNetX$_E$ in a single plot, see Figure 8 (top-left). For each model, we plot the per-block width $w_j$ of every block $j$ up to the network depth $d$ (we use $i$ and $j$ to index over stages and blocks, respectively). See Figure 6 for reference of our model visualization.
While there is significant variance in the individual models (gray curves), in the aggregate a pattern emerges. In particular, in the same plot we show the line $w_j = 48 \cdot (j + 1)$ for $0 \le j \le 20$ (solid black curve, please note that the y-axis is logarithmic). Remarkably, this trivial linear fit seems to explain the population trend of the growth of network widths for top models. Note, however, that this linear fit assigns a different width $w_j$ to each block, whereas individual models have quantized widths (piecewise constant functions).
To see if a similar pattern applies to individual models, we need a strategy to quantize a line to a piecewise constant function. Inspired by our observations from AnyNetX\textsubscriptD and AnyNetX\textsubscriptE, we propose the following approach. First, we introduce a linear parameterization for block widths:
(2)  $u_j = w_0 + w_a \cdot j \quad \text{for} \quad 0 \le j < d$
This parameterization has three parameters: depth $d$, initial width $w_0 > 0$, and slope $w_a > 0$, and generates a different block width $u_j$ for each block $j < d$. To quantize $u_j$, we introduce an additional parameter $w_m > 0$ that controls quantization as follows. First, given $u_j$ from Eqn. (2), we compute $s_j$ for each block $j$ such that the following holds:
(3)  $u_j = w_0 \cdot w_m^{s_j}$
Then, to quantize $u_j$, we simply round $s_j$ (denoted by $\lfloor s_j \rceil$) and compute quantized per-block widths $w_j$ via:
(4)  $w_j = w_0 \cdot w_m^{\lfloor s_j \rceil}$
We can convert the per-block $w_j$ to our per-stage format by simply counting the number of blocks with constant width, that is, each stage $i$ has block width $w_i = w_0 \cdot w_m^{i}$ and number of blocks $d_i = \sum_j \mathbf{1}[\lfloor s_j \rceil = i]$. When only considering four stage networks, we ignore the parameter combinations that give rise to a different number of stages.
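The quantized linear parameterization can be sketched in a few lines. The code below follows Eqns. (2)-(4): a linear width per block, the exponent that reproduces it as a power of the width multiplier, rounding of that exponent, and grouping of equal-width blocks into stages. The function names and the example parameters are hypothetical, and the sketch assumes a positive initial width and slope and a width multiplier greater than 1 so that the logarithms are defined.

```python
import math
from itertools import groupby


def quantized_block_widths(d, w0, wa, wm):
    """Return (widths, exponents) for d blocks following Eqns. (2)-(4)."""
    widths, exponents = [], []
    for j in range(d):
        u_j = w0 + wa * j                         # Eqn. (2): linear width
        s_j = math.log(u_j / w0) / math.log(wm)   # Eqn. (3): solve for s_j
        k = math.floor(s_j + 0.5)                 # round s_j to nearest integer
        exponents.append(k)
        widths.append(w0 * wm ** k)               # Eqn. (4): quantized width
    return widths, exponents


def to_stages(d, w0, wa, wm):
    """Group consecutive blocks of equal width into (width, num_blocks) stages."""
    _, exponents = quantized_block_widths(d, w0, wa, wm)
    return [(w0 * wm ** k, len(list(group))) for k, group in groupby(exponents)]


# Hypothetical parameters, chosen only so the arithmetic is easy to follow.
print(to_stages(d=8, w0=32, wa=32, wm=2))
# [(32, 1), (64, 1), (128, 3), (256, 3)]
```

With these parameters the linear widths 32, 64, ..., 256 snap to the powers of two 32, 64, 128, 256, giving a four-stage network with depths 1, 1, 3, 3.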
We test this parameterization by fitting to models from AnyNetX. In particular, given a model, we compute the fit by setting $d$ to the network depth and performing a grid search over $w_0$, $w_a$, and $w_m$ to minimize the mean log-ratio (denoted by $e_{fit}$) of predicted to observed per-block widths. Results for two top networks from AnyNetX$_E$ are shown in Figure 8 (top-right). The quantized linear fits (dashed curves) are good fits of these best models (solid curves).
Next, we plot the fitting error $e_{fit}$ versus network error for every network in AnyNetX$_C$ through AnyNetX$_E$ in Figure 8 (bottom). First, we note that the best models in each design space all have good linear fits. Indeed, an empirical bootstrap gives a narrow band of $e_{fit}$ near 0 that likely contains the best models in each design space. Second, we note that on average, $e_{fit}$ improves going from AnyNetX$_C$ to AnyNetX$_E$, showing that the linear parametrization naturally enforces related constraints to $w_i$ and $d_i$ increasing.
To further test the linear parameterization, we design a design space that only contains models with such linear structure. In particular, we specify a network structure via 6 parameters: $d$, $w_0$, $w_a$, $w_m$ (and also $b$, $g$). Given these, we generate block widths and depths via Eqn. (2)-(4). We refer to the resulting design space as RegNet, as it contains only simple, regular models. We sample $d < 64$, $w_0, w_a < 256$, $1.5 \le w_m \le 3$, and $b$ and $g$ as before (ranges set based on $e_{fit}$ on AnyNetX$_E$).
The error EDF of RegNetX is shown in Figure 9 (left). Models in RegNetX have better average error than AnyNetX while maintaining the best models. In Figure 9 (middle) we test two further simplifications. First, using $w_m = 2$ (doubling width between stages) slightly improves the EDF, but we note that using $w_m \ge 2$ performs better (shown later). Second, we test setting $w_0 = w_a$, further simplifying the linear parameterization to $u_j = w_a \cdot (j + 1)$. Interestingly, this performs even better. However, to maintain the diversity of models, we do not impose either restriction. Finally, in Figure 9 (right) we show that random search efficiency is much higher for RegNetX; searching over just ~32 random models is likely to yield good models.
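The fitting procedure can be sketched as a small grid search. The exact form of the fitting error is not given in this section, so the sketch assumes the mean absolute log-ratio of predicted to observed per-block widths; the function names, the candidate grids, and the example network are likewise hypothetical.

```python
import math
from itertools import product


def quantized_widths(d, w0, wa, wm):
    """Per-block widths of the quantized linear parameterization."""
    widths = []
    for j in range(d):
        s_j = math.log((w0 + wa * j) / w0) / math.log(wm)
        widths.append(w0 * wm ** math.floor(s_j + 0.5))
    return widths


def fit_error(predicted, observed):
    """Mean absolute log-ratio of predicted to observed widths (assumed form)."""
    total = sum(abs(math.log(p / o)) for p, o in zip(predicted, observed))
    return total / len(observed)


def fit_quantized_linear(observed, w0_grid, wa_grid, wm_grid):
    """Grid search for the parameters that minimize the fit error."""
    d = len(observed)  # depth is fixed to the observed network depth
    best = None
    for w0, wa, wm in product(w0_grid, wa_grid, wm_grid):
        err = fit_error(quantized_widths(d, w0, wa, wm), observed)
        if best is None or err < best[0]:
            best = (err, w0, wa, wm)
    return best


# Hypothetical per-block widths of a 4-stage network with depths (1, 1, 3, 3).
observed = [32, 64, 128, 128, 128, 256, 256, 256]
err, w0, wa, wm = fit_quantized_linear(
    observed, w0_grid=[16, 32, 48], wa_grid=[16, 32, 48], wm_grid=[2.0, 2.5, 3.0]
)
print(err, w0, wa, wm)
```

A fit error of zero means the network's widths are exactly reproduced by some quantized line; larger values indicate networks whose stage widths or depths deviate from that regular structure.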
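design space  restriction  dim.  combinations  total
AnyNetX$_A$  none  16  $(16 \cdot 128 \cdot 3 \cdot 6)^4$  $\sim 1.8 \cdot 10^{18}$
AnyNetX$_B$  + $b_{i+1} = b_i$  13  $(16 \cdot 128 \cdot 6)^4 \cdot 3$  $\sim 6.8 \cdot 10^{16}$
AnyNetX$_C$  + $g_{i+1} = g_i$  10  $(16 \cdot 128)^4 \cdot 3 \cdot 6$  $\sim 3.2 \cdot 10^{14}$
AnyNetX$_D$  + $w_{i+1} \ge w_i$  10  $(16 \cdot 128)^4 \cdot 3 \cdot 6 / (4!)$  $\sim 1.3 \cdot 10^{13}$
AnyNetX$_E$  + $d_{i+1} \ge d_i$  10  $(16 \cdot 128)^4 \cdot 3 \cdot 6 / (4!)^2$  $\sim 5.5 \cdot 10^{11}$
RegNet  quantized linear  6  $\sim 64^4 \cdot 6 \cdot 3$  $\sim 3.0 \cdot 10^{8}$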
Table 1 shows a summary of the design space sizes (for RegNet we estimate the size by quantizing its continuous parameters). In designing RegNetX, we reduced the dimension of the original AnyNetX design space from 16 to 6 dimensions, and the size by nearly 10 orders of magnitude. We note, however, that RegNet still contains a good diversity of models that can be tuned for a variety of settings.
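The sizes of the AnyNetX design spaces follow from simple counting, which the sketch below reproduces. It assumes 16 depth options, 128 width options (multiples of 8 up to 1024), 3 bottleneck ratios, and 6 group widths per stage, and approximates each ordering constraint as a division by $4!$; these counts and the variable names are assumptions of the sketch.

```python
from math import factorial, log10

d_opts, w_opts, b_opts, g_opts, stages = 16, 128, 3, 6, 4

sizes = {
    # Every stage chooses d, w, b, g independently.
    "AnyNetX_A": (d_opts * w_opts * b_opts * g_opts) ** stages,
    # Bottleneck ratio shared across stages.
    "AnyNetX_B": (d_opts * w_opts * g_opts) ** stages * b_opts,
    # Group width shared across stages as well.
    "AnyNetX_C": (d_opts * w_opts) ** stages * b_opts * g_opts,
}
# Each ordering constraint (widths, then depths) removes a factor of about 4!.
sizes["AnyNetX_D"] = sizes["AnyNetX_C"] // factorial(stages)
sizes["AnyNetX_E"] = sizes["AnyNetX_D"] // factorial(stages)

for name, size in sizes.items():
    print(f"{name}: {size:.1e}")

# Orders of magnitude removed when going from A to C and from A to E.
print(round(log10(sizes["AnyNetX_A"] / sizes["AnyNetX_C"]), 2))
print(round(log10(sizes["AnyNetX_A"] / sizes["AnyNetX_E"]), 2))
```

Under these assumptions the step from A to C removes a factor of $18^3 = 5832$, which matches the "nearly four orders of magnitude" quoted for AnyNetX$_C$.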
3.4 Design Space Generalization
We designed the RegNet design space in a low-compute, low-epoch training regime with only a single block type. However, our goal is not to design a design space for a single setting, but rather to discover general principles of network design that can generalize to new settings.
In Figure 10, we compare the RegNetX design space to AnyNetX$_A$ and AnyNetX$_E$ at higher flops, higher epochs, with 5-stage networks, and with various block types (described in the appendix). In all cases the ordering of the design spaces is consistent, with RegNetX > AnyNetX$_E$ > AnyNetX$_A$. In other words, we see no signs of overfitting. These results are promising because they show RegNet can generalize to new settings. The 5-stage results show the regular structure of RegNet can generalize to more stages, where AnyNetX$_A$ has even more degrees of freedom.
4 Analyzing the RegNetX Design Space
We next further analyze the RegNetX design space and revisit common deep network design choices. Our analysis yields surprising insights that don’t match popular practice, which allows us to achieve good results with simple models.
As the RegNetX design space has a high concentration of good models, for the following results we switch to sampling fewer models (100) but training them for longer (25 epochs) with a learning rate of 0.1 (see appendix). We do so to observe more fine-grained trends in network behavior.
RegNet trends.
We show trends in the RegNetX parameters across flop regimes in Figure 11. Remarkably, the depth of best models is stable across regimes (top-left), with an optimal depth of ~20 blocks (60 layers). This is in contrast to the common practice of using deeper models for higher flop regimes. We also observe that the best models use a bottleneck ratio $b$ of 1.0 (top-middle), which effectively removes the bottleneck (commonly used in practice). Next, we observe that the width multiplier $w_m$ of good models is ~2.5 (top-right), similar but not identical to the popular recipe of doubling widths across stages. The remaining parameters ($g$, $w_a$, $w_0$) increase with complexity (bottom).
Complexity analysis.
In addition to flops and parameters, we analyze network activations, which we define as the size of the output tensors of all conv layers (we list complexity measures of common conv operators in Figure 12, top-left). While not a common measure of network complexity, activations can heavily affect runtime on memory-bound hardware accelerators (e.g., GPUs, TPUs), for example, see Figure 12 (top). In Figure 12 (bottom), we observe that for the best models in the population, activations increase with the square-root of flops, parameters increase linearly, and runtime is best modeled using both a linear and a square-root term due to its dependence on both flops and activations.
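The three complexity measures can be computed per conv layer with standard formulas. The sketch below assumes that flops are counted as multiply-adds, that biases are ignored, and that the output is a square feature map of side `r_out`; the function name and the example layer sizes are hypothetical. The `groups` argument is the number of groups, i.e. the width divided by the group width.

```python
def conv_complexity(w_in, w_out, k, r_out, groups=1):
    """Return (flops, params, activations) for a single k x k conv layer."""
    params = k * k * w_in * w_out // groups   # weights only, no bias
    flops = params * r_out * r_out            # one multiply-add per weight per output pixel
    acts = w_out * r_out * r_out              # size of the output tensor
    return flops, params, acts


# Hypothetical layers with 64 channels at 56 x 56 output resolution.
print(conv_complexity(64, 64, k=3, r_out=56))             # dense 3x3 conv
print(conv_complexity(64, 64, k=3, r_out=56, groups=8))   # group conv, group width 8
print(conv_complexity(64, 64, k=3, r_out=56, groups=64))  # depthwise conv
```

Note that grouping reduces flops and parameters but leaves activations unchanged, which is one way activations can dominate runtime even when flops are low.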
RegNetX constrained.
Using these findings, we refine the RegNetX design space. First, based on Figure 11 (top), we set $b = 1$, $d \le 40$, and $w_m \ge 2$. Second, we limit parameters and activations, following Figure 12 (bottom). This yields fast, low-parameter, low-memory models without affecting accuracy. In Figure 13, we test RegNetX with these constraints and observe that the constrained version is superior across all flop regimes. We use this version in §5, and further limit depth to $12 \le d \le 28$ (see also Appendix D).
Alternate design choices.
Modern mobile networks often employ the inverted bottleneck ($b < 1$) proposed in [25] along with depthwise conv [1] ($g = 1$). In Figure 14 (left), we observe that the inverted bottleneck degrades the EDF slightly and depthwise conv performs even worse relative to $b = 1$ and $g \ge 1$ (see appendix for further analysis). Next, motivated by [29] who found that scaling the input image resolution can be helpful, we test varying resolution in Figure 14 (middle). Contrary to [29], we find that for RegNetX a fixed resolution of $224 \times 224$ is best, even at higher flops.
SE.
5 Comparison to Existing Networks
We now compare top models from the RegNetX and RegNetY design spaces at various complexities to the state-of-the-art on ImageNet [3]. We denote individual models using small caps, e.g., RegNetX. We also suffix the models with the flop regime, e.g., 400MF. For each flop regime, we pick the best model from 25 random settings of the RegNet parameters ($d$, $g$, $w_m$, $w_a$, $w_0$), and retrain the top model 5 times at 100 epochs to obtain robust error estimates.
model  flops (B)  params (M)  acts (M)  batch size  infer (ms)  train (hr)  error (top-1)
RegNetX-200MF  0.2  2.7  2.2  1024  10  2.8  31.1±0.09
RegNetX-400MF  0.4  5.2  3.1  1024  15  3.9  27.3±0.15
RegNetX-600MF  0.6  6.2  4.0  1024  17  4.4  25.9±0.03
RegNetX-800MF  0.8  7.3  5.1  1024  21  5.7  24.8±0.09
RegNetX-1.6GF  1.6  9.2  7.9  1024  33  8.7  23.0±0.13
RegNetX-3.2GF  3.2  15.3  11.4  512  57  14.3  21.7±0.08
RegNetX-4.0GF  4.0  22.1  12.2  512  69  17.1  21.4±0.19
RegNetX-6.4GF  6.5  26.2  16.4  512  92  23.5  20.8±0.07
RegNetX-8.0GF  8.0  39.6  14.1  512  94  22.6  20.7±0.07
RegNetX-12GF  12.1  46.1  21.4  512  137  32.9  20.3±0.04
RegNetX-16GF  15.9  54.3  25.5  512  168  39.7  20.0±0.11
RegNetX-32GF  31.7  107.8  36.3  256  318  76.9  19.5±0.12
Resulting top RegNetX and RegNetY models for each flop regime are shown in Figures 15 and 16, respectively. In addition to the simple linear structure and the trends we analyzed in §4, we observe an interesting pattern. Namely, the higher flop models have a large number of blocks in the third stage and a small number of blocks in the last stage. This is similar to the design of standard ResNet models. Moreover, we observe that the group width increases with complexity, but depth saturates for large models.
Our goal is to perform fair comparisons and provide simple and easy-to-reproduce baselines. We note that along with better architectures, much of the recently reported gains in network performance are based on enhancements to the training setup and regularization scheme (see Table 7). As our focus is on evaluating network architectures, we perform carefully controlled experiments under the same training setup. In particular, to provide fair comparisons to classic work, we do not use any training-time enhancements.
model  flops (B)  params (M)  acts (M)  batch size  infer (ms)  train (hr)  error (top-1)
RegNetY-200MF  0.2  3.2  2.2  1024  11  3.1  29.6±0.11
RegNetY-400MF  0.4  4.3  3.9  1024  19  5.1  25.9±0.16
RegNetY-600MF  0.6  6.1  4.3  1024  19  5.2  24.5±0.07
RegNetY-800MF  0.8  6.3  5.2  1024  22  6.0  23.7±0.03
RegNetY-1.6GF  1.6  11.2  8.0  1024  39  10.1  22.0±0.08
RegNetY-3.2GF  3.2  19.4  11.3  512  67  16.5  21.0±0.05
RegNetY-4.0GF  4.0  20.6  12.3  512  68  16.8  20.6±0.08
RegNetY-6.4GF  6.4  30.6  16.4  512  104  26.1  20.1±0.04
RegNetY-8.0GF  8.0  39.2  18.0  512  113  28.1  20.1±0.09
RegNetY-12GF  12.1  51.8  21.4  512  150  36.0  19.7±0.06
RegNetY-16GF  15.9  83.6  23.0  512  189  45.6  19.6±0.16
RegNetY-32GF  32.3  145.0  30.3  256  319  76.0  19.0±0.12
5.1 State-of-the-Art Comparison: Mobile Regime
Much of the recent work on network design has focused on the mobile regime (600MF). In Table 2, we compare RegNet models at 600MF to existing mobile networks. We observe that RegNets are surprisingly effective in this regime considering the substantial body of work on finding better mobile networks via both manual design [9, 25, 19] and NAS [35, 23, 17, 18].
We emphasize that RegNet models use our basic 100 epoch schedule with no regularization except weight decay, while most mobile networks use longer schedules with various enhancements, such as deep supervision [16], Cutout [4], DropPath [14], AutoAugment [2], and so on. As such, we hope our strong results obtained with a short training schedule without enhancements can serve as a simple baseline for future work.
model  flops (B)  params (M)  top-1 error
MobileNet [9]  0.57  4.2  29.4
MobileNet-V2 [25]  0.59  6.9  25.3
ShuffleNet [33]  0.52  -  26.3
ShuffleNet-V2 [19]  0.59  -  25.1
NASNet-A [35]  0.56  5.3  26.0
AmoebaNet-C [23]  0.57  6.4  24.3
PNASNet-5 [17]  0.59  5.1  25.8
DARTS [18]  0.57  4.7  26.7
RegNetX-600MF  0.60  6.2  25.9±0.03
RegNetY-600MF  0.60  6.1  24.5±0.07
model  flops (B)  params (M)  acts (M)  infer (ms)  train (hr)  top-1 error, ours±std [orig]
ResNet-50  4.1  22.6  11.1  53  12.2  23.2±0.09 [23.9]
RegNetX-3.2GF  3.2  15.3  11.4  57  14.3  21.7±0.08
ResNeXt-50  4.2  25.0  14.4  78  18.0  21.9±0.10 [22.2]
ResNet-101  7.8  44.6  16.2  90  20.4  21.4±0.11 [22.0]
RegNetX-6.4GF  6.5  26.2  16.4  92  23.5  20.8±0.07
ResNeXt-101  8.0  44.2  21.2  137  31.8  20.7±0.08 [21.2]
ResNet-152  11.5  60.2  22.6  130  29.2  20.9±0.12 [21.6]
RegNetX-12GF  12.1  46.1  21.4  137  32.9  20.3±0.04
(a) Comparisons grouped by activations.
ResNet-50  4.1  22.6  11.1  53  12.2  23.2±0.09 [23.9]
ResNeXt-50  4.2  25.0  14.4  78  18.0  21.9±0.10 [22.2]
RegNetX-4.0GF  4.0  22.1  12.2  69  17.1  21.4±0.19
ResNet-101  7.8  44.6  16.2  90  20.4  21.4±0.11 [22.0]
ResNeXt-101  8.0  44.2  21.2  137  31.8  20.7±0.08 [21.2]
RegNetX-8.0GF  8.0  39.6  14.1  94  22.6  20.7±0.07
ResNet-152  11.5  60.2  22.6  130  29.2  20.9±0.12 [21.6]
ResNeXt-152  11.7  60.0  29.7  197  45.7  20.4±0.06 [21.1]
RegNetX-12GF  12.1  46.1  21.4  137  32.9  20.3±0.04
(b) Comparisons grouped by flops.
5.2 Standard Baselines Comparison: ResNe(X)t
Next, we compare RegNetX to standard ResNet [8] and ResNeXt [31] models. All of the models in this experiment come from the exact same design space, the former being manually designed, the latter being obtained through design space design. For fair comparisons, we compare RegNet and ResNe(X)t models under the same training setup (our standard RegNet training setup). We note that this results in improved ResNe(X)t baselines and highlights the importance of carefully controlling the training setup.
Comparisons are shown in Figure 17 and Table 3. Overall, we see that RegNetX models, by optimizing the network structure alone, provide considerable improvements under all complexity metrics. We emphasize that good RegNet models are available across a wide range of compute regimes, including in low-compute regimes where good ResNe(X)t models are not available.
Table 3a shows comparisons grouped by activations (which can strongly influence runtime on accelerators such as GPUs). This setting is of particular interest to the research community, where model training time is a bottleneck, and will likely have more real-world use cases in the future, especially as accelerators gain more use at inference time (e.g., in self-driving cars). RegNetX models are quite effective given a fixed inference or training time budget.
Table 4.
model            flops (B)  params (M)  acts (M)  batch size  infer (ms)  train (hr)  top-1 error: ours±std [orig]
EfficientNet-B0  0.4        5.3         6.7       256         34          11.7        24.9±0.03 [23.7]
RegNetY-400MF    0.4        4.3         3.9       1024        19          5.1         25.9±0.16
EfficientNet-B1  0.7        7.8         10.9      256         52          15.6        24.1±0.16 [21.2]
RegNetY-600MF    0.6        6.1         4.3       1024        19          5.2         24.5±0.07
EfficientNet-B2  1.0        9.2         13.8      256         68          18.4        23.4±0.06 [20.2]
RegNetY-800MF    0.8        6.3         5.2       1024        22          6.0         23.7±0.03
EfficientNet-B3  1.8        12.0        23.8      256         114         32.1        22.5±0.05 [18.9]
RegNetY-1.6GF    1.6        11.2        8.0       1024        39          10.1        22.0±0.08
EfficientNet-B4  4.2        19.0        48.5      128         240         65.1        21.2±0.06 [17.4]
RegNetY-4.0GF    4.0        20.6        12.3      512         68          16.8        20.6±0.08
EfficientNet-B5  9.9        30.0        98.9      64          504         135.1       21.5±0.11 [16.7]
RegNetY-8.0GF    8.0        39.2        18.0      512         113         28.1        20.1±0.09
5.3 State-of-the-Art Comparison: Full Regime
We focus our comparison on EfficientNet [29], which is representative of the state of the art and has reported impressive gains using a combination of NAS and an interesting model scaling rule across complexity regimes.
To enable direct comparisons, and to isolate gains due solely to improvements of the network architecture, we opt to reproduce the exact EfficientNet models but using our standard training setup, with a 100 epoch schedule and no regularization except weight decay (the effects of a longer schedule and stronger regularization are shown in Table 7). We optimize only the learning rate and weight decay; see Figure 22 in the appendix. This is the same setup as RegNet and enables fair comparisons.
Results are shown in Figure 18 and Table 4. At low flops, EfficientNet outperforms RegNetY. At intermediate flops, RegNetY outperforms EfficientNet, and at higher flops both RegNetX and RegNetY perform better.
We also observe that for EfficientNet, activations scale linearly with flops (due to the scaling of both resolution and depth), compared to activations scaling with the square-root of flops for RegNets. This leads to slow GPU training and inference times for EfficientNet. E.g., RegNetX-8000 is 5× faster than EfficientNet-B5, while having lower error.
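As a quick sanity check of the speed claim, the following sketch recomputes inference-time ratios from the timings reported in Tables 3 and 4. The variable and function names are ours, the timings are copied from the tables rather than measured, and we assume that RegNetX-8000 refers to the 8.0GF RegNetX model listed in Table 3.

```python
# Inference times in ms, copied from Table 4 (EfficientNet vs. RegNetY pairs).
# All names here are illustrative, not part of any released code.
PAIRS_MS = {
    ("EfficientNet-B0", "RegNetY-400MF"): (34, 19),
    ("EfficientNet-B1", "RegNetY-600MF"): (52, 19),
    ("EfficientNet-B2", "RegNetY-800MF"): (68, 22),
    ("EfficientNet-B3", "RegNetY-1.6GF"): (114, 39),
    ("EfficientNet-B4", "RegNetY-4.0GF"): (240, 68),
    ("EfficientNet-B5", "RegNetY-8.0GF"): (504, 113),
}

# RegNetX-8.0GF inference time from Table 3 (assumed to be "RegNetX-8000").
REGNETX_8GF_MS = 94
EFFICIENTNET_B5_MS = 504


def speedup(slow_ms: float, fast_ms: float) -> float:
    """Ratio of inference times; values above 1 mean the second model is faster."""
    return slow_ms / fast_ms


if __name__ == "__main__":
    for (eff, reg), (eff_ms, reg_ms) in PAIRS_MS.items():
        print(f"{reg} vs. {eff}: {speedup(eff_ms, reg_ms):.1f}x")
    ratio = speedup(EFFICIENTNET_B5_MS, REGNETX_8GF_MS)
    print(f"RegNetX-8.0GF vs. EfficientNet-B5: {ratio:.1f}x")
```

The ratios grow with model size, which is consistent with activations growing faster for EfficientNet than for RegNet as flops increase.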
6 Conclusion
In this work, we present a new network design paradigm. Our results suggest that designing network design spaces is a promising avenue for future research.
Appendix A: Test Set Evaluation
Table 5.
model          flops (B)  params (M)  acts (M)  infer (ms)  train (hr)  top-1 error
ResNet-50      4.1        22.6        11.1      53          12.2        35.0±0.20
RegNetX-3.2GF  3.2        15.3        11.4      57          14.3        33.6±0.25
ResNeXt-50     4.2        25.0        14.4      78          18.0        33.5±0.10
ResNet-101     7.8        44.6        16.2      90          20.4        33.2±0.24
RegNetX-6.4GF  6.5        26.2        16.4      92          23.5        32.6±0.15
ResNeXt-101    8.0        44.2        21.2      137         31.8        32.1±0.30
ResNet-152     11.5       60.2        22.6      130         29.2        32.2±0.22
RegNetX-12GF   12.1       46.1        21.4      137         32.9        32.0±0.27
(a) Comparisons grouped by activations.
ResNet-50      4.1        22.6        11.1      53          12.2        35.0±0.20
ResNeXt-50     4.2        25.0        14.4      78          18.0        33.5±0.10
RegNetX-4.0GF  4.0        22.1        12.2      69          17.1        33.2±0.20
ResNet-101     7.8        44.6        16.2      90          20.4        33.2±0.24
ResNeXt-101    8.0        44.2        21.2      137         31.8        32.1±0.30
RegNetX-8.0GF  8.0        39.6        14.1      94          22.6        32.5±0.18
ResNet-152     11.5       60.2        22.6      130         29.2        32.2±0.22
ResNeXt-152    11.7       60.0        29.7      197         45.7        31.5±0.26
RegNetX-12GF   12.1       46.1        21.4      137         32.9        32.0±0.27
(b) Comparisons grouped by flops.
Table 6.
model            flops (B)  params (M)  acts (M)  batch size  infer (ms)  train (hr)  top-1 error
EfficientNet-B0  0.4        5.3         6.7       256         34          11.7        37.1±0.22
RegNetY-400MF    0.4        4.3         3.9       1024        19          5.1         38.3±0.26
EfficientNet-B1  0.7        7.8         10.9      256         52          15.6        36.4±0.10
RegNetY-600MF    0.6        6.1         4.3       1024        19          5.2         36.9±0.17
EfficientNet-B2  1.0        9.2         13.8      256         68          18.4        35.3±0.25
RegNetY-800MF    0.8        6.3         5.2       1024        22          6.0         35.7±0.40
EfficientNet-B3  1.8        12.0        23.8      256         114         32.1        34.4±0.27
RegNetY-1.6GF    1.6        11.2        8.0       1024        39          10.1        33.9±0.19
EfficientNet-B4  4.2        19.0        48.5      128         240         65.1        32.5±0.23
RegNetY-4.0GF    4.0        20.6        12.3      512         68          16.8        32.3±0.28
EfficientNet-B5  9.9        30.0        98.9      64          504         135.1       31.5±0.17
RegNetY-8.0GF    8.0        39.2        18.0      512         113         28.1        31.3±0.08
In the main paper we perform all experiments on the ImageNet [3] validation set. Here we evaluate our models on the ImageNetV2 [24] test set (original test set unavailable).
Evaluation setup.
To study generalization of models developed on ImageNet, the authors of [24] collect a new test set following the original procedure (ImageNetV2). They find that the overall model ranks are preserved on the new test set. The absolute errors, however, increase. We repeat the comparisons from §5 on the ImageNetV2 test set.
ResNe(X)t comparisons.
We compare to ResNe(X)t models in Table 5. We observe that while model ranks are generally consistent, the gap between them decreases. Nevertheless, RegNetX models still compare favorably, and provide good models across flop regimes, including in low-compute regimes where good ResNe(X)t models are not available. Best results can be achieved using RegNetY.
EfficientNet comparisons.
We compare to EfficientNet models in Table 6. As before, we observe that the model ranks are generally consistent but the gap decreases. Overall, the results confirm that the RegNet models perform comparably to state-of-the-art EfficientNet while being up to 5× faster on GPUs.
Appendix B: Additional Ablations
In this section we perform additional ablations to further support or supplement the results of the main text.
Fixed depth.
In §5 we observed that the depths of our top models are fairly stable. In Figure 19 (left) we compare using a fixed depth across flop regimes. To compare to our best results, we trained each model for 100 epochs. Surprisingly, we find that fixed-depth networks can match the performance of variable-depth networks for all flop regimes, in both the average and best case. Indeed, these fixed-depth networks match our best results in §5.
Fewer stages.
In §5 we observed that the top RegNet models at high flops have few blocks in the fourth stage (one or two). Hence we tested three-stage networks at 6.4GF, trained for 100 epochs each. In Figure 19 (middle), we show the results and observe that the three-stage networks perform considerably worse. We note, however, that additional changes (e.g., in the stem or head) may be necessary for three-stage networks to perform well (left for future work).
Inverted Bottleneck.
Swish vs. ReLU.
Many recent methods employ the Swish [22] activation function, e.g., [29]. In Figure 20, we study RegNetY with Swish and ReLU. We find that Swish outperforms ReLU at low flops, but ReLU is better at high flops. Interestingly, if the group width g is restricted to be 1 (depthwise conv), Swish performs much better than ReLU. This suggests that depthwise conv and Swish interact favorably, although the underlying reason is not at all clear.
Appendix C: Optimization Settings
Our basic training settings follow [21] as discussed in §3. To tune the learning rate and weight decay for RegNet models, we perform a study, described in Figure 21. Based on this, we fix a single learning rate and weight decay for all models in §4 and §5. To enable faster training of our final models at 100 epochs, we increase the number of GPUs to 8, while keeping the number of images per GPU fixed. When scaling the batch size, we adjust the learning rate using the linear scaling rule and apply a 5 epoch gradual warmup [6].
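The schedule described above (linear scaling of the learning rate with batch size, a 5 epoch gradual warmup, and the half-period cosine decay mentioned in the training-setup footnote) can be sketched as follows. This is a minimal illustration rather than the exact implementation: the function names and the warmup start factor of 0.1 are assumptions, and the base learning rate of 0.05 at batch size 128 is the exploration value from the footnote, used here only as a placeholder.

```python
import math


def scaled_lr(base_lr: float, base_batch: int, batch: int) -> float:
    """Linear scaling rule: the learning rate grows in proportion to batch size."""
    return base_lr * batch / base_batch


def lr_at_epoch(
    epoch: float,
    peak_lr: float,
    total_epochs: int = 100,
    warmup_epochs: int = 5,
    warmup_start: float = 0.1,  # assumed starting fraction of the schedule value
) -> float:
    """Half-period cosine schedule, damped by a linear warmup in early epochs."""
    cosine = 0.5 * peak_lr * (1.0 + math.cos(math.pi * epoch / total_epochs))
    if epoch < warmup_epochs:
        alpha = epoch / warmup_epochs
        # Factor ramps linearly from warmup_start (epoch 0) to 1 (end of warmup).
        return cosine * (warmup_start * (1.0 - alpha) + alpha)
    return cosine


if __name__ == "__main__":
    # 8 GPUs with 128 images each gives a batch of 1024 (placeholder base values).
    peak = scaled_lr(base_lr=0.05, base_batch=128, batch=8 * 128)
    for e in (0, 1, 5, 50, 99):
        print(e, round(lr_at_epoch(e, peak), 5))
```

Keeping the number of images per GPU fixed means the batch size grows with the number of GPUs, which is why the learning rate is rescaled rather than retuned.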
To enable fair comparisons, we repeat the same optimization for EfficientNet in Figure 22. Interestingly, learning rate and weight decay are again stable across complexity regimes. Finally, in Table 7 we report the sizable effect of training enhancement on EfficientNetB0. The gap may be even larger for larger models (see Table 4).
Table 7.
model                 flops (B)  params (M)  acts (M)  epochs  enhance  error
EfficientNet-B0       0.39       5.3         6.7       100     -        25.6
EfficientNet-B0       0.39       5.3         6.7       250     -        25.0
EfficientNet-B0       0.39       5.3         6.7       250     ✓        24.4
EfficientNet-B0 [29]  0.39       5.3         6.7       350     ✓✓✓      23.7
Appendix D: Implementation Details
We conclude with additional implementation details.
Group width compatibility.
When sampling widths w and group widths g for our models, we may end up with incompatible values (i.e., w not divisible by g). To address this, we employ a simple strategy. Namely, we set g = w if g > w and round w to be divisible by g otherwise. The final w can be at most 1/3 different from the original w (proof omitted). For models with bottlenecks, we apply this strategy to the bottleneck width instead (and adjust widths accordingly).
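The compatibility strategy can be sketched in a few lines. The sketch assumes the rule is: set the group width g equal to the width w when g exceeds w, and otherwise round w to the nearest multiple of g. The function name and the choice to round ties upward are ours. The worst case is w = 1.5g, where rounding moves w by g/2, which is 1/3 of w.

```python
def make_compatible(w: int, g: int) -> tuple[int, int]:
    """Return (w, g) adjusted so that w is divisible by g.

    Hypothetical helper: names and tie-breaking are illustrative assumptions.
    """
    if g > w:
        # A group cannot be wider than the layer: use a single group.
        return w, w
    # Round w to the nearest multiple of g using integer arithmetic
    # (ties round up). Since g <= w, the multiple is at least g.
    q = (2 * w + g) // (2 * g)
    return q * g, g


if __name__ == "__main__":
    for w, g in [(48, 32), (24, 32), (40, 16), (100, 24)]:
        print((w, g), "->", make_compatible(w, g))
```

For example, (48, 32) becomes (64, 32), a relative change of exactly 1/3, while (24, 32) becomes (24, 24) and leaves the width untouched.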
Group width ranges.
As discussed in §4, we notice the general trend that the group widths of good models are larger in higher compute regimes. To account for this, we gradually adjust the group width ranges for higher compute regimes. For example, instead of sampling from the default range of group widths, at 3.2GF we use a higher range and allow any g divisible by 8.
Block types.
In §3, we showed that the RegNet design space generalizes to different block types. We describe these additional block types, shown in Figure 23, next:

R block: same as the X block except without groups,

V block: a basic block with only a single 3×3 conv,

VR block: same as V block plus residual connections.
We note that good parameter values may differ across block types. E.g., the best-performing setting for the R block differs from that for the X block. Our approach is robust to this.
Y block details.
To obtain the Y block, we add the SE op after the conv of the X block, and we use a fixed SE reduction ratio. We experimented with alternatives to these choices but found that they performed comparably (not shown).
Footnotes
1. We use the term design space following [21], rather than search space, to emphasize that we are not searching for network instances within the space. Instead, we are designing the space itself.
2. https://github.com/facebookresearch/pycls
3. Following common practice, we use flops to mean multiply-adds. Moreover, we use MF and GF to denote 10^6 and 10^9 flops, respectively.
4. Given pairs of a model statistic (e.g., depth) and the corresponding error, we compute the empirical bootstrap by: (1) sampling with replacement 25% of the pairs, (2) selecting the pair with min error in the sample, (3) repeating this a fixed number of times, and finally (4) computing the 95% CI for the min value. The median gives the most likely best value.
5. Our training setup in §3 exactly follows [21]. We use SGD with momentum of 0.9 and weight decay, a minibatch size of 128 on 1 GPU, and a half-period cosine schedule with initial learning rate of 0.05. Ten epochs are usually sufficient to give robust population statistics.
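The empirical bootstrap described in the footnote above can be sketched as follows. This is an illustrative sketch, not the released implementation: the function name, the default number of repeats, the fixed seed, and the simple index-based percentile are all assumptions.

```python
import random


def empirical_bootstrap(pairs, n_repeats=1000, frac=0.25, seed=0):
    """Estimate the best value of a model statistic from (statistic, error) pairs.

    Returns (low, median, high), where [low, high] is a 95% interval over the
    statistic of the min-error model in each resample.
    n_repeats and seed are illustrative defaults, not values from the paper.
    """
    rng = random.Random(seed)
    k = max(1, int(len(pairs) * frac))
    best_stats = []
    for _ in range(n_repeats):
        # (1) sample 25% of the pairs with replacement
        sample = rng.choices(pairs, k=k)
        # (2) keep the statistic of the pair with minimum error
        stat, _err = min(sample, key=lambda p: p[1])
        best_stats.append(stat)
    # (4) percentiles of the collected best statistics
    best_stats.sort()

    def pct(q):
        return best_stats[min(len(best_stats) - 1, int(q * len(best_stats)))]

    return pct(0.025), pct(0.5), pct(0.975)


if __name__ == "__main__":
    # Toy population: error is lowest at depth 20 (synthetic, for illustration).
    toy = [(d, 30.0 + 0.01 * (d - 20) ** 2) for d in range(10, 31)]
    print(empirical_bootstrap(toy))
```

On the synthetic population above, the median lands at or near the depth with the lowest error, and the interval widens as the error curve flattens.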
References
[1] (2017) Xception: deep learning with depthwise separable convolutions. In CVPR.
[2] (2018) AutoAugment: learning augmentation policies from data. arXiv:1805.09501.
[3] (2009) Imagenet: a large-scale hierarchical image database. In CVPR.
[4] (2017) Improved regularization of convolutional neural networks with cutout. arXiv:1708.04552.
[5] (1994) An introduction to the bootstrap. CRC press.
[6] (2017) Accurate, large minibatch sgd: training imagenet in 1 hour. arXiv:1706.02677.
[7] (2015) Delving deep into rectifiers: surpassing human-level performance on imagenet classification. In ICCV.
[8] (2016) Deep residual learning for image recognition. In CVPR.
[9] (2017) Mobilenets: efficient convolutional neural networks for mobile vision applications. arXiv:1704.04861.
[10] (2018) Squeeze-and-excitation networks. In CVPR.
[11] (2017) Densely connected convolutional networks. In CVPR.
[12] (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. In ICML.
[13] (2012) Imagenet classification with deep convolutional neural networks. In NIPS.
[14] (2017) FractalNet: ultra-deep neural networks without residuals. In ICLR.
[15] (1989) Backpropagation applied to handwritten zip code recognition. Neural computation.
[16] (2015) Deeply-supervised nets. In AISTATS.
[17] (2018) Progressive neural architecture search. In ECCV.
[18] (2019) Darts: differentiable architecture search. In ICLR.
[19] (2018) Shufflenet v2: practical guidelines for efficient cnn architecture design. In ECCV.
[20] (2018) Efficient neural architecture search via parameter sharing. In ICML.
[21] (2019) On network design spaces for visual recognition. In ICCV.
[22] (2017) Searching for activation functions. arXiv:1710.05941.
[23] (2019) Regularized evolution for image classifier architecture search. In AAAI.
[24] (2019) Do imagenet classifiers generalize to imagenet? arXiv:1902.10811.
[25] (2018) Mobilenetv2: inverted residuals and linear bottlenecks. In CVPR.
[26] (2015) Very deep convolutional networks for large-scale image recognition. In ICLR.
[27] (2015) Going deeper with convolutions. In CVPR.
[28] (2016) Rethinking the inception architecture for computer vision. In CVPR.
[29] (2019) EfficientNet: rethinking model scaling for convolutional neural networks. In ICML.
[30] (2012) Lecture 6.5-rmsprop: divide the gradient by a running average of its recent magnitude. Coursera: Neural networks for machine learning.
[31] (2017) Aggregated residual transformations for deep neural networks. In CVPR.
[32] (2016) Wide residual networks. In BMVC.
[33] (2018) Shufflenet: an extremely efficient convolutional neural network for mobile devices. In CVPR.
[34] (2017) Neural architecture search with reinforcement learning. In ICLR.
[35] (2018) Learning transferable architectures for scalable image recognition. In CVPR.