# Multi-Mode Inference Engine for Convolutional Neural Networks

## Abstract

During the past few years, interest in convolutional neural networks (CNNs) has risen constantly, thanks to their excellent performance on a wide range of recognition and classification tasks. However, they suffer from the high level of complexity imposed by the high-dimensional convolutions in convolutional layers. In scenarios with limited hardware resources and tight power and latency constraints, the high computational complexity of CNNs makes them difficult to exploit. Hardware solutions have striven to reduce the power consumption using low-power techniques, and to limit the processing time by increasing the number of processing elements (PEs). While most ASIC designs claim a peak performance of a few hundred giga operations per second, their average performance is substantially lower when applied to state-of-the-art CNNs such as AlexNet, VGGNet and ResNet, leading to low resource utilization. Their average performance efficiency is therefore limited, which leads to unnecessarily high processing latency and silicon area. In this paper, we propose a dataflow that makes it possible to perform both the fully-connected and convolutional computations for any filter/layer size using the same PEs. We then introduce a multi-mode inference engine (MMIE) based on the proposed dataflow. Finally, we show that the proposed MMIE achieves a performance efficiency of more than 84% when performing the computations of the three renowned CNNs (i.e., AlexNet, VGGNet and ResNet), outperforming the best architecture in the state of the art in terms of energy consumption, processing latency and silicon area.

## 1 Introduction

Deep neural networks (DNNs), especially convolutional neural networks (CNNs) [1], have received tremendous attention due to their ability to surpass human-level accuracy on a wide range of complex tasks such as recognition, classification and detection [2]. Depending on their size and complexity, these networks achieve different degrees of classification/recognition accuracy. A CNN is a stack of multiple convolutional layers followed by fully-connected layers: the convolutional layers extract high-level abstractions and features of raw data, whereas the fully-connected layers are used to learn non-linear combinations of the extracted features. In 2012, a CNN called AlexNet [3] was introduced: it consists of 5 convolutional layers followed by 3 fully-connected layers and achieves misclassification rate (MCR) on the ImageNet dataset. AlexNet contains 2.3M weights and 58.6M weights in its convolutional and fully-connected layers, respectively, performing 1332M operations (i.e., 666M multiply-accumulate operations) in its convolutional layers and 117.2M operations (i.e., 58.6M multiply-accumulate operations) in its fully-connected layers. VGGNet-16 [4] is another well-known CNN, containing 13 convolutional layers with 14.7M weights and 3 fully-connected layers with 124M weights. VGGNet-16 performs 30.6G operations in its convolutional layers and 248M operations in its fully-connected layers, achieving MCR on ImageNet. Recently, ResNet-50 [5], containing 49 convolutional layers with 23.5M weights and 1 fully-connected layer with 2M weights, achieved a better MCR (i.e., 22.85% on ImageNet) by going even deeper. ResNet-50 performs 7G and 4M operations in its convolutional and fully-connected layers, respectively. All these CNNs have won the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) [6].

Although in almost all the aforementioned CNNs the majority of weights is found in fully-connected layers, the number of operations is dominated by convolutions. As a result, the processing time of CNNs is also dominated by the convolutional processes. This issue can be addressed by exploiting parallel processing elements (PEs) to increase throughput. However, a straightforward parallelization requires high data movement and bandwidth, leading to high energy consumption [7]. It is worth noting that accesses to off-chip memories are more expensive than accesses to on-chip storage, as shown in [8].

Pruning techniques were first introduced in [9] to reduce the number of parameters and memory accesses to off-chip memory. In [9], CPU/GPU implementations were considered, showing that a 3× to 4× layer-wise speedup can be obtained for fully-connected layers without any practical speedup for convolutional layers. To accelerate convolutional processes on GPUs and CPUs, a new method was also introduced in [11], achieving up to 5.1× speedup. The work presented in [12] introduces a fully-connected accelerator, called efficient inference engine (EIE), for the pruning technique introduced in [9]. EIE can obtain a 13× to 307× speedup, and save 2700× to 24000× energy compared to CPUs or GPUs for fully-connected computations. Recently, a new pruning technique and its custom hardware were introduced in [13], using low-cost linear-feedback shift registers (LFSRs) to prune the connectivity of fully-connected layers. This technique also saves up to 90% energy compared to conventional implementations of fully-connected layers. However, as discussed earlier, convolutional processes are the bottleneck of the processing time of CNNs.

During the past few years, many convolutional accelerators with different dataflows have been introduced in literature [14]. While these ASIC architectures can successfully reduce the energy consumption of convolutional processes and meet the latency constraints of small CNNs such as AlexNet, they fail to employ the full potential of their architectures, resulting in a low performance efficiency. In fact, there is a huge gap between their peak performance and average runtime performance. For instance, in [14] the architecture known as Eyeriss achieves a peak performance of 84 Gops, where each MAC is considered as two operations. However, its performance efficiency is limited to 55% and 26% when performing the convolutional computations of AlexNet and VGGNet-16, respectively.

To improve the performance efficiency and to accelerate the convolutional processes for VGG and VGG-like networks, a dataflow called fully-connected inspired dataflow (FID), and the architecture implementing it, were introduced in [20]. This architecture achieves a high performance efficiency on the convolutional processes of VGGNet-16. Despite its high performance efficiency, high throughput and low silicon area, it is limited to architectures with 3 × 3 filters.

In this paper, we propose a dataflow supporting all types of filter sizes used in state-of-the-art CNNs by generalizing FID. We provide a theoretical framework showing that the proposed generalized FID (GFID) can perform both the fully-connected and convolutional processes while using the same hardware resources, resulting in a high utilization factor. We then propose a CNN accelerator based on the proposed GFID that performs both fully-connected and convolutional computations, hereafter referred to as multi-mode inference engine (MMIE). MMIE is optimized to achieve high performance efficiency and few accesses to the off-chip memory, while keeping the power consumption below the budget of mobile/embedded devices. Finally, we evaluate the performance of MMIE on state-of-the-art CNN models (i.e., AlexNet, VGGNet-16 and ResNet-50) and show that MMIE performs the convolutional computations of these CNNs with a performance efficiency of more than 84%.

## 2 Preliminaries

A fully-connected network is a stack of layers where each neuron is connected to every neuron in the previous and next layers, and each connection has an associated weight. A fully-connected layer performs the following computations with inputs and outputs:

where denotes the input pixels, the output pixels, the biases, and is the non-linear activation function . According to (Equation 1), the fully-connected computational kernel calculates numerous vector-matrix multiplications followed by the non-linear activation function. Because fully-parallel implementations of such networks require parallel memory accesses, a semi-parallel implementation is the typical approach in hardware [20]. In semi-parallel implementations, only a limited number of PEs is instantiated, and computations for each neuron are performed serially [21]. Different trade-offs between area occupation and latency can thus be obtained by changing the degree of parallelism.
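The semi-parallel scheme described above can be illustrated with a minimal behavioural sketch. It assumes ReLU as the activation function (the architecture description later applies a ReLU to the PE outputs), and all names (`fully_connected_semi_parallel`, `num_pes`) are hypothetical. One clock cycle is counted for every shared input pixel read by a batch of PEs.

```python
def relu(value):
    """Rectified linear unit, assumed here as the activation function."""
    return value if value > 0 else 0.0


def fully_connected_semi_parallel(x, weights, biases, num_pes):
    """Compute a fully-connected layer with a limited number of PEs.

    weights[j][i] connects input i to output neuron j.
    Returns the outputs and the number of clock cycles consumed.
    """
    n_in, n_out = len(x), len(biases)
    outputs = [0.0] * n_out
    cycles = 0
    # Each pass serves at most num_pes output neurons.
    for start in range(0, n_out, num_pes):
        batch = range(start, min(start + num_pes, n_out))
        acc = {j: 0.0 for j in batch}
        for i in range(n_in):          # one shared input pixel per clock cycle
            for j in batch:            # the PEs operate in parallel in hardware
                acc[j] += weights[j][i] * x[i]
            cycles += 1
        for j in batch:
            outputs[j] = relu(acc[j] + biases[j])
    return outputs, cycles


if __name__ == "__main__":
    x = [1.0, 2.0, 3.0]
    w = [[1, 0, 0], [0, 1, 0], [0, 0, -1], [1, 1, 1]]
    b = [0.0, 1.0, 0.0, 0.5]
    print(fully_connected_semi_parallel(x, w, b, num_pes=2))
```

Raising `num_pes` reduces the number of passes, and therefore the latency, at the cost of more hardware, which is the area/latency trade-off mentioned above.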

Inspired by the organization of the animal visual cortex, it was shown that the connectivity of neurons in convolutional layers can be mathematically described by a convolution operation [22]. All neurons in a convolutional layer share a set of weights, also referred to as a filter.

The main computational kernel of a convolutional layer involves high-dimensional convolutions, as shown in Figure 1. The convolutional layers take input pixels, which are also called input activation maps, arranged in 3 dimensions (i.e., height, width and channel), and generate output pixels, which are also called output activation maps, arranged in 3 dimensions (i.e., height, width and channel). This transformation is a result of the convolution between the input activation maps and a set of 3D filters. More precisely, every single 2D plane of the output activation maps is a result of the convolution between the 3D input activation maps and a set of 3D filters. In fact, a summation of multiple plane-wise 2D convolutions forms a 3D convolution. Finally, a 1D bias is added to the results of the 3D convolutions. In summary, the convolutional processes with the input activation maps, the output activation maps, the filters and the bias matrices denoted as , , and , respectively, can be expressed as

where , and . The stride represents the number of activation map pixels by which the filter is shifted after each convolution. Contrary to the fully-connected layers, convolutional computations are dominated by numerous MACs according to Eq. (Equation 2), leading to a high degree of computational complexity.
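As an illustration of the convolutional kernel described above, the following sketch implements a direct (loop-nest) convolution with a stride and a per-channel bias. The index layout (`x[channel][row][col]`, `w[out_channel][in_channel][row][col]`) and all names are assumptions made for this example, not the paper's notation; padding is not modelled.

```python
def conv_layer(x, w, b, stride=1):
    """Direct 3D convolution of input activation maps with a set of filters.

    x[k][r][c]    : input activation maps (channel, row, column)
    w[q][k][j][i] : filters (output channel, input channel, row, column)
    b[q]          : one bias per output channel
    Returns the output activation maps y[q][z][t] and the number of MACs.
    """
    c_in, h_in, w_in = len(x), len(x[0]), len(x[0][0])
    c_out, h_f, w_f = len(w), len(w[0][0]), len(w[0][0][0])
    h_out = (h_in - h_f) // stride + 1
    w_out = (w_in - w_f) // stride + 1
    # Start every output pixel from the bias of its output channel.
    y = [[[b[q]] * w_out for _ in range(h_out)] for q in range(c_out)]
    macs = 0
    for q in range(c_out):
        for z in range(h_out):
            for t in range(w_out):
                for k in range(c_in):          # sum of plane-wise 2D convolutions
                    for j in range(h_f):
                        for i in range(w_f):
                            pixel = x[k][z * stride + j][t * stride + i]
                            y[q][z][t] += pixel * w[q][k][j][i]
                            macs += 1
    return y, macs


if __name__ == "__main__":
    x = [[[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]]]
    w = [[[[1, 0], [0, 1]]]]
    print(conv_layer(x, w, b=[1], stride=1))
    print(conv_layer(x, w, b=[1], stride=2))
```

The six nested loops make the dominance of MAC operations explicit: the count grows with the product of the output size, the number of input channels and the filter size.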

### 2.1 Fully-Connected Inspired Dataflow for Convolutional Computations

FID was introduced in [20]. It can be used to efficiently perform the computations of convolutional layers with the filter width fixed to 3. Let us note that a 2D convolution is the weighted summation of each pixel of an input image with its neighboring pixels, and consider an input image as a matrix , a filter as a matrix and an output as a matrix , such that

Considering each output pixel assigned to a neuron, Table ? shows the convolutional process of this example in a way similar to the fully-connected layer computations, where input pixels are read sequentially at each clock cycle (CC) and the neurons share the same input pixels. This example considers , , , , , and . Similar to the fully-connected dataflow, each neuron loads a different weight at each time step, subsequently accumulating the weighted input pixels. The number of time steps required to perform the convolutional computations is also equal to the number of input pixels, . When passed to the next neuron belonging to the same row of the output activation map, the weights need to be shifted by one position. However, weight passing between neurons of different rows requires a shift of positions, as can be observed between output and in Table ?.

[Table: FID schedule of the example. Rows are clock cycles (CC); columns are the input pixel read at each cycle and the output neurons #1 to #12. Cell contents are not recoverable.]

A direct implementation of the convolutional process in Table ? requires a large number of PEs, or neurons, each of them with a low utilization factor (UF). In [20] it was shown that 3 PEs, which are denoted by different colors in Table ?, are sufficient to perform the convolutions. In fact, there are only 3 active neurons at each time step. Each PE thus receives its input at clock cycle , and . Their outputs are also valid after 3 clock cycles in the given example. So far, we only considered a case with . In case of , the procedure in Table ? has to be repeated two more times: the first iteration with , and , the second with , and , and the final one with , and . Similarly, for higher values of , the process has to be repeated times. Therefore, a memory is required to store the partial values generated by the 3 neurons for each output pixel. In general, output pixels can be computed using 3 neurons (i.e., PEs) and 3 separate -element SRAM memories working in parallel. The unit generating the output pixels of an output activation map is referred to as a 1D tile. Parallel 1D tiles can also be exploited to generate out of output activation maps in parallel. Using parallel 1D tiles reduces both the latency and memory access by a factor of . The input pixels are shared among all the 1D tiles.
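The claim that 3 PEs suffice can be checked with a small behavioural sketch of FID on a single row. It assumes a filter row of 3 weights, a stride of 1 and an 8-pixel input row (matching the 8 clock cycles and 6 output pixels quoted later for the example); the round-robin mapping of output neuron `n` to PE `n % 3` and all names are assumptions made for illustration.

```python
def fid_row(inputs, weights):
    """Simulate FID on one row: one input pixel is read per clock cycle.

    Returns the output pixels and, for each clock cycle, the number of busy PEs.
    """
    w_in, w_f = len(inputs), len(weights)
    w_out = w_in - w_f + 1                 # stride of 1, no padding
    outputs = [0] * w_out
    busy_pes_per_cycle = []
    for cc, pixel in enumerate(inputs):    # the pixel is shared by all neurons
        busy = {}                          # PE index -> neuron served this cycle
        for n in range(w_out):
            k = cc - n                     # weight index seen by neuron n
            if 0 <= k < w_f:
                pe = n % w_f               # assumed round-robin neuron-to-PE mapping
                assert pe not in busy, "two active neurons mapped to one PE"
                busy[pe] = n
                outputs[n] += weights[k] * pixel
        busy_pes_per_cycle.append(len(busy))
    return outputs, busy_pes_per_cycle


if __name__ == "__main__":
    outs, busy = fid_row([1, 2, 3, 4, 5, 6, 7, 8], [1, 2, 3])
    print(outs)   # six output pixels after eight clock cycles
    print(busy)   # never more than three PEs are busy
```

The internal assertion never fires because the neurons active in any clock cycle are consecutive, so their indices are distinct modulo 3.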

## 3 Generalized Fully-Connected Inspired Dataflow (GFID)

Let us define a generalized form of the FID as a matrix :

where each column of the matrix contains at most as many non-zero elements as the filter width. The shift amount within each row of the output activation map is equal to the stride, denoted with a dashed line in the matrix . The number of columns of the matrix indicates the output pixels that belong to the same row of the output activation map, while the number of rows of denotes the required number of clock cycles.
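The structure of the GFID matrix can be sketched as follows. The construction is an assumption reconstructed from the description above (one column per output pixel of a row, one row per clock cycle, consecutive columns shifted down by the stride); entries hold the 1-based index of the weight applied, or 0 when the neuron is idle. The function name and the parameters `w_f`, `stride` and `w_out` are hypothetical.

```python
def gfid_matrix(w_f, stride, w_out):
    """Build a GFID-style schedule matrix for one row of the output map.

    Rows are clock cycles, columns are output pixels (neurons).
    Entry k > 0 means the neuron multiplies the current input pixel by the
    k-th weight of the filter row; 0 means the neuron is idle.
    """
    w_in = (w_out - 1) * stride + w_f      # one clock cycle per input pixel
    matrix = []
    for cc in range(w_in):
        row = []
        for n in range(w_out):
            k = cc - n * stride            # column n starts n * stride cycles later
            row.append(k + 1 if 0 <= k < w_f else 0)
        matrix.append(row)
    return matrix


if __name__ == "__main__":
    for row in gfid_matrix(w_f=3, stride=1, w_out=6):
        print(row)
```

For a filter width of 3, a stride of 1 and 6 output pixels, the sketch yields 8 rows and 6 columns, and no row has more than 3 non-zero entries.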

In this section, we use the GFID matrix to represent different filter sizes used in the state-of-the-art CNNs (i.e., AlexNet, VGGNet and ResNet). AlexNet uses filter sizes of 11 × 11 with a stride of 4, 5 × 5 with a stride of 1, and 3 × 3 with a stride of 1. The filter sizes used in VGGNets are fixed to 3 × 3 with a stride of 1. Finally, ResNets use filter sizes of 7 × 7 with a stride of 2, 3 × 3 with a stride of 1, and 1 × 1 with a stride of 1.

### 3.1 Filters with a Width of 3 and a Stride of 1

In Section 2.1, we showed that 3 PEs are sufficient to perform the convolutions for a filter width of 3 with a stride of 1. Therefore, a 1D tile containing only 3 neurons can perform the convolutional computations. Considering a convolution of a row of a filter map with its corresponding input pixels, clock cycles are required to generate output pixels which belong to the same row of the output activation map. For instance, in the given example in Table ?, 8 clock cycles are required to generate the output pixels of the first row of the output activation map (i.e., the first 6 output pixels). This example can also be expressed using the GFID matrix as follows:

The matrix also confirms that there are only 3 active neurons at each time step, highlighted in dark gray.

### 3.2 Filters with a Width of 5 and a Stride of 1

The convolutional computations for filters with a width of 5 and a stride of 1 are performed in a way similar to the convolutional computations of the filters with a width of 3 and a stride of 1, with the difference that 5 neurons are active at each time step. Thus, a 1D tile with 5 PEs can perform the computations for this filter size. Moreover, clock cycles are required to generate output pixels which belong to the same row of the output activation map.

### 3.3 Filters with a Width of 1 and a Stride of 1

The following matrix shows the convolutional computations for filters with a width of 1 and a stride of 1.

Contrary to other filter sizes, its GFID matrix is square: the number of clock cycles required to generate output pixels is equal to . As denoted in the matrix , there is only one active neuron at each clock cycle. Consequently, its 1D tile requires only one PE to perform the convolutional computations.

### 3.4 Filters with a Width of 7 and a Stride of 2

So far, we only considered a stride value of 1. However, both AlexNet and ResNet contain layers computing convolutions with a larger stride. Considering filters with a width of 7 and a stride of 2, the shift amount within each row of the output activation map is equal to 2, as shown in the following matrix :

While the higher stride value linearly decreases the number of pixels in the output activation maps, it also reduces the number of neurons required to perform the convolutional computations. For instance, the above matrix shows that there are only 4 active neurons at each time step, while the width of the filter is 7. According to the matrix , clock cycles are required to generate output pixels in the given example.

### 3.5 Filters with a Width of 11 and a Stride of 4

The matrix for filters with a width of 11 and a stride of 4 is as follows:

Despite the large width of the filter, the number of active neurons at each time step is only 3, thanks to the large stride value. However, the number of clock cycles required to generate 4 output pixels is 23, which is rather high and can result in a long latency.
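The number of active neurons stated for the five filter configurations of this section can be cross-checked with a short sketch. It assumes the schedule described for the GFID matrix (output neuron `n` starts `n × stride` clock cycles after the first one and stays active for as many cycles as the filter width); the function name and parameters are hypothetical.

```python
from math import ceil


def active_neurons_per_cycle(w_f, stride, w_out):
    """Count, for every clock cycle, how many output neurons are active."""
    w_in = (w_out - 1) * stride + w_f      # clock cycles needed for one row
    return [
        sum(1 for n in range(w_out) if 0 <= cc - n * stride < w_f)
        for cc in range(w_in)
    ]


if __name__ == "__main__":
    # (filter width, stride) pairs discussed in Sections 3.1 to 3.5
    for w_f, stride in [(3, 1), (5, 1), (1, 1), (7, 2), (11, 4)]:
        counts = active_neurons_per_cycle(w_f, stride, w_out=8)
        print(f"width={w_f:2d} stride={stride}: "
              f"max active neurons={max(counts)}, "
              f"ceil(width/stride)={ceil(w_f / stride)}, "
              f"clock cycles for 8 outputs={len(counts)}")
```

Under these assumptions the peak number of active neurons equals the filter width divided by the stride, rounded up, which gives 3, 5, 1, 4 and 3 PEs per tile for the five cases.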

### 3.6 Utilization Factor for Different Filter Sizes

As discussed in Section 2.1, the number of clock cycles required to perform the convolutions using FID is equal to the number of input pixels, and it is the same for GFID. Considering and , in order to generate pixels of an output activation map, clock cycles are required to perform the convolutions according to Eq. (Equation 2). Let us define the number of required PEs in the 1D tile as . The number of pixels computed by each neuron is equal to when is a multiple of . Each neuron also requires clock cycles to generate an output pixel. Therefore, the utilization factor of GFID can be expressed as

In Section 1, we discussed the importance of high performance efficiency. The utilization factor of PEs in a convolutional accelerator is also linearly proportional to its performance efficiency. Any increase in the utilization factor of PEs exploited in the 1D tile results in an increase in performance efficiency. Considering the fact that and are usually small, a high UF is achieved for a large value of . In other words, the maximum achievable utilization factor can be obtained as

Eq. (Equation 4) suggests that the highest performance efficiency is obtained when . The maximum utilization factors for filters with [filter width, stride] equal to [1, 1], [3, 1], [5, 1], [7, 2] and [11, 4] are 100%, 100%, 100%, 88% and 92%, respectively, showing the high performance efficiency of the proposed GFID.
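The percentages above can be reproduced with a small sketch. The formulas are an assumption reconstructed from the prose of this section (useful MAC cycles divided by PE-cycles available, with one clock cycle per input pixel of a row); they are consistent with the quoted 100%, 88% and 92% figures but are not copied from the paper. All names are hypothetical.

```python
from math import ceil


def utilization_factor(w_f, stride, w_out, num_pes=None):
    """Assumed UF of a 1D tile producing w_out output pixels of one row."""
    n = num_pes if num_pes is not None else ceil(w_f / stride)
    w_in = (w_out - 1) * stride + w_f      # clock cycles = input pixels in the row
    return (w_out * w_f) / (n * w_in)      # useful MACs / (PEs * clock cycles)


def max_utilization_factor(w_f, stride, num_pes=None):
    """Limit of the assumed UF for a very large number of output pixels."""
    n = num_pes if num_pes is not None else ceil(w_f / stride)
    return w_f / (n * stride)


if __name__ == "__main__":
    for w_f, stride in [(1, 1), (3, 1), (5, 1), (7, 2), (11, 4)]:
        print(f"[{w_f}, {stride}]: "
              f"max UF = {100 * max_utilization_factor(w_f, stride):.1f}%")
```

In this form the maximum reaches 100% exactly when the filter width equals the number of PEs multiplied by the stride.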

## 4 Multi-Mode Inference Engine

| Network | Filter size | Stride | PEs per tile | # layers |
|---|---|---|---|---|
| AlexNet [3] | 11 × 11 | 4 | 3 | 1 out of 5 |
| | 5 × 5 | 1 | 5 | 1 out of 5 |
| | 3 × 3 | 1 | 3 | 3 out of 5 |
| ResNet-50 [5] | 7 × 7 | 2 | 4 | 1 out of 49 |
| | 3 × 3 | 1 | 3 | 16 out of 49 |
| | 1 × 1 | 1 | 1 | 32 out of 49 |
| VGG-16 [4] | 3 × 3 | 1 | 3 | 13 out of 13 |

In Section 3, we showed that different filter sizes require different numbers of PEs per tile. Table ? summarizes the number of required PEs per tile for each layer of AlexNet, VGGNet-16 and ResNet-50. AlexNet consists of 5 convolutional layers with filter sizes of 11 × 11, 5 × 5 and 3 × 3. Applying GFID to the AlexNet layers shows that 4 out of 5 layers (i.e., the layers with filter sizes of 11 × 11 and 3 × 3) only require 3 PEs to perform the computations, while the remaining layer requires 5 PEs per tile. Therefore, 3 is the most frequent number of PEs per tile in AlexNet for convolutional processes. The filter size and stride are fixed to 3 × 3 and one pixel, respectively, for the convolutional computations of VGGNets [4]: as a result, all the computations of VGGNets can be performed using 3 PEs per tile. There are different VGGNet models in the literature: in this paper, we use VGGNet-16, which contains 13 convolutional and 3 fully-connected layers, for experimental purposes. Similar to VGGNets, ResNets also come in different flavors. The first layer of ResNets is fixed to a receptive field of 7 × 7 and a stride of 2. The filter sizes of the remaining layers are either fixed to 3 × 3 (for ResNet-18 and ResNet-34) or a combination of 1 × 1 and 3 × 3 (for ResNet-50, ResNet-101 and ResNet-152) [5]. Therefore, the dominant filter sizes are 1 × 1 and 3 × 3, which require one and 3 PEs per tile to perform the convolutional computations, respectively. In Table ?, we report the requirements for ResNet-50.

Figure 2 shows the high level architecture of the 1D tile. It consists of two main sub-blocks: the weight generator and PEs working in parallel. All the PEs share the same input activation pixel while their weights are different. Each PE takes an input activation map and its corresponding weight according to the proposed GFID and performs the accumulation-multiplication for the first row of the first input filter, i.e., , , , . This process takes clock cycles and the computed partial value is stored in a memory of elements. Afterwards, the PE starts the processing of another output activation pixel, using the same weights. The convolutional computations of the first row of the first input filter require clock cycles, as discussed in Section 3.6. Upon reaching this point, the partial value of the first output activation pixel is read from the memory and the computations of the second row of the first input filter are performed for clock cycles. In general, this procedure is repeated for times until the computations of the first filter are finished (i.e., upon completion of clock cycles). At this point, the computation of the second of the filters starts. Upon completion of clock cycles, the output value of each PE is passed through the ReLU and the result is stored in the off-chip memory.

So far, we introduced a high-level architecture for the 1D tile and explained the high-level procedure of convolutional computations. In order to perform the computations while achieving a high performance efficiency, the number of PEs per tile has to be reconfigurable. In other words, instantiated PEs have to dynamically adapt to act as a multiple of PEs to achieve the maximum possible utilization factor. The closed-form solution for this strategy is

where LCM denotes the least common multiple. Using this approach, 60 PEs are required to achieve the maximum possible utilization factor for all the network sizes listed in Table ?. Depending on the required , the 60 PEs can dynamically behave as a set of PEs. For instance, they can act as 60, 20, 15 and 12 parallel tiles for equal to 1, 3, 4 and 5, respectively, where each tile also contains 1, 3, 4 and 5 PEs. However, using 60 reconfigurable PEs is not trivial and results in a complex address generator.
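The figure of 60 PEs and the resulting tile counts follow directly from the least common multiple of the per-tile PE requirements (1, 3, 4 and 5) listed above, as this short check shows. The variable names are illustrative only.

```python
from functools import reduce
from math import gcd


def lcm(a, b):
    """Least common multiple of two positive integers."""
    return a * b // gcd(a, b)


# Required PEs per tile for the filter configurations of the three CNNs.
pes_per_tile = [1, 3, 4, 5]

total_pes = reduce(lcm, pes_per_tile)
parallel_tiles = {n: total_pes // n for n in pes_per_tile}

if __name__ == "__main__":
    print(total_pes)        # total number of reconfigurable PEs
    print(parallel_tiles)   # number of parallel tiles for each tile size
```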

Table ? shows that 3 and 1 are the dominant minimum numbers of PEs per tile for the three well-known CNNs. More precisely, the two filter configurations that require 4 and 5 PEs per tile have the least impact on the overall performance efficiency of CNNs, since each of them is used in only one layer. Therefore, we use 6 PEs inside the reconfigurable tile, for two reasons. First, 6 PEs can easily be used as 2 tiles of 3 PEs or 6 tiles of 1 PE, which covers the dominant minimum numbers of PEs for the three well-known CNNs. Second, they can perform the computations that require 4 or 5 PEs with a minimum level of complexity for the address generator unit. In this case, with more PEs than what is strictly necessary, the number of clock cycles required to perform the convolutional computations remains the same. However, the utilization factors of the PEs for these cases decrease.

### 4.1 Reconfigurable Weight Generator Unit

The weight generator unit provides each neuron an appropriate weight according to the proposed GFID. The weight generator unit consists of 6 register sets where each set contains 11 registers. The appropriate weight for each neuron is provided by selecting among these shift registers.

#### Filters With a Width of 3 and a Stride of 1

As discussed in Section 4, in the case of filters with a width of 3 and a stride of 1, the 1D reconfigurable tile containing 6 neurons can function as two tiles of 3 neurons each. Fig. ? shows the weight generator unit, with its working path for this configuration highlighted in black. It is worth noting that tiles are separated using a dashed line. Each tile loads the weights of the first row of the first filter (i.e., , and ) through the input ports denoted as In #1 and In #2 in Fig. ?. These weights then loop through the first register of each set to provide one clock cycle delay for each neuron according to (Section 3.1). Considering Eq. (Equation 3), the utilization factor of each neuron for this case can be computed as

which approaches 100% for large values of .

#### Filters With a Width of 5 and a Stride of 1

In the case of filters with a width of 5 and a stride of 1, we use 6 neurons to perform the convolutional processes, although we showed that the minimum required number of neurons is 5 for this case (see Section 4). Therefore, the reconfigurable tile works as a single tile containing 6 PEs as shown in Fig. ?. The tile takes the first row of the first filter (i.e., , , , and ) through the input port denoted as In #1. It then provides the required one clock cycle delay for each PE by passing the weights through the first register of each register set as highlighted in black in Fig. ?. It is worth noting that 6 registers are used in this paradigm while only 5 of them are required to store the weights. Therefore, the value of one register among the 6 registers is always zero to cancel out its effect on the computations. More precisely, we can assume the 5 weights (i.e., , , , and ) as a set of 6 weights in which one of them is zero (i.e., , , , and ). The utilization factor of each PE can also be expressed as

In fact, using 6 neurons to perform the convolutions of filters with a width of 5 reduces the maximum achievable utilization factor from 100% to 83%.

#### Filters With a Width of 1 and a Stride of 1

In Section 3.3, we showed that only one PE is sufficient to perform the computations for filters with a width of 1 and a stride of 1. Therefore, the reconfigurable 1D tile can be used as 6 parallel tiles, as depicted in Fig. ?. The 6 tiles are separated using dashed lines and the involved hardware units and paths are highlighted in black. Each tile takes its weight (i.e., ) at the first clock cycle through the input ports In #1 to In #6. Afterwards, the imported weight loops through each tile and the first register of each register set. The utilization factor of each PE is equal to 100% regardless of , according to (Equation 3).

#### Filters With a Width of 7 and a Stride of 2

Similar to the case of filters with a width of 5 and a stride of 1, 6 PEs are used to compute the convolutions for filters with a width of 7 and a stride of 2, while only 4 neurons are sufficient. As a result, the reconfigurable tile functions as a single tile containing 6 PEs (see Fig. ?). The tile loads the weights of the first row of the first filter (i.e., , , , and ) through the input port In #1 and they loop through the black paths in Fig. ?. In this scheme, the first two registers of each register set are used to provide the required two delays for each PE, as shown in (Section 3.4). It is worth mentioning that while 12 registers are used in this case, only 7 of them contain the weights. The utilization factor for this configuration is computed as follows:

Since 4 PEs are sufficient to perform the computations of this case, using 6 neurons strongly affects the utilization factor and results in 58% (i.e., 7 weights out of 12 registers) for large values of . However, the final impact of this case when considering the computations of the whole system is negligible, since this configuration is only used for one layer out of 49 in ResNet-50.

#### Filters With a Width of 11 and a Stride of 4

Similar to the case of filters with a width of 3 and a stride of 1, 3 PEs are sufficient to perform the convolutional processes for filters with a width of 11 and a stride of 4. Therefore, the reconfigurable tile functions as two tiles where each contains 3 PEs. The weights of the first row of the first filter (i.e., , , , and ) are passed through input ports In #1 and In #4 to each tile, as shown in Fig. ?. Since a stride value of 4 is used, the first four registers of each register set are used to provide the required four clock cycle delays (Section 3.5). A total of 12 registers are used in each tile while only 11 weights exist. Therefore, the remaining register is zero. The utilization factor of this case is also computed as

achieving up to 92% for large values of .
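The five configurations of the 6-PE reconfigurable tile can be summarized in a short sketch. The splitting rule (split into sub-tiles only when the required number of PEs divides 6, otherwise run as a single 6-PE tile) and the maximum-UF expression (filter width divided by the product of PEs per sub-tile and stride) are assumptions inferred from the descriptions in this section; all names are hypothetical.

```python
from math import ceil

TILE_PES = 6  # PEs in one reconfigurable tile


def configure_tile(w_f, stride):
    """Return (sub-tiles, PEs per sub-tile, assumed maximum UF)."""
    needed = ceil(w_f / stride)            # minimum PEs for this filter configuration
    # Split only when the tile divides evenly; otherwise use all 6 PEs as one tile.
    sub_tiles = TILE_PES // needed if TILE_PES % needed == 0 else 1
    pes_per_sub_tile = TILE_PES // sub_tiles
    max_uf = w_f / (pes_per_sub_tile * stride)
    return sub_tiles, pes_per_sub_tile, max_uf


if __name__ == "__main__":
    for w_f, stride in [(3, 1), (5, 1), (1, 1), (7, 2), (11, 4)]:
        tiles, pes, uf = configure_tile(w_f, stride)
        print(f"width={w_f:2d} stride={stride}: {tiles} tile(s) x {pes} PE(s), "
              f"max UF = {100 * uf:.1f}%")
```

Under these assumptions the register counts quoted above map directly onto the ratio: 5 weights in 6 registers, 7 weights in 12 registers, and 11 weights in 12 registers.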

#### Fully-Connected Computations

As discussed in Section 2, semi-parallel architectures are a common approach to implement fully-connected layers, where the computations of each neuron are performed serially. For instance, considering a single neuron with 512 inputs (i.e., and ), 512 clock cycles are required to perform the computations of (Equation 1) using a single PE. We can perform the computations of multiple neurons by instantiating multiple PEs in parallel as discussed in [20]. In this way, each PE shares the same input pixels while loading different weights. This approach can be easily realized using the proposed reconfigurable tile as illustrated in Fig. ?. In fact, the reconfigurable tile passes the incoming 6 parallel weights directly to each PE through multiplexers highlighted in black. The utilization factor of PEs for fully-connected computations is 100%.

### 4.2 Handling Weight Passing

So far, we discussed both the convolutional and fully-connected computations while not considering the weight passing cases for the sake of simplicity. However, weight passing occurrence is inevitable in convolutional computations and impacts both the processing time and utilization factor of PEs. Weight passing occurs when a tile performs the computations of more than one row of the output activation map. In this case, the weight passing from a neuron of a row to a neuron of another row takes clock cycles regardless of the stride value , resulting in a longer latency and consequently a lower utilization factor for PEs. The total number of weight passing occurrences for computations of a single convolutional layer is equal to . We considered 11 registers for each register set to support the weight passing delay up to 11 clock cycles. Therefore, in case of weight passing in any of PEs, its corresponding register set provides the required delay depending on .

### 4.3 Exploiting Parallel Tiles

While the proposed reconfigurable tile can efficiently perform both fully-connected and convolutional computations, using a single tile results in a long latency and numerous memory accesses, as discussed in [20]. To address this issue, tiles are instantiated in parallel to generate multiple activation maps in parallel. Since the reconfigurable tile itself can function as up to 6 parallel tiles, the upper bound for the maximum number of tiles is in MMIE. Therefore, the computational latency of MMIE is effectively reduced by a factor when compared to a single reconfigurable tile. Moreover, the memory accesses are reduced as well, since the input pixels are shared among the parallel tiles (see Figure 1), while each tile is fed by a different set of weights.

Exploiting parallel tiles requires an input bandwidth of bits ( for weights and 16 for input pixels). However, most of the embedded and mobile devices cannot provide such a high bandwidth. To overcome this problem, MMIE leverages the pipelining technique first introduced in [20]. As discussed in Section 3, each input pixel is read at each clock cycle while weights are read only for the first clock cycles when performing the convolutional process of the first row of the first input filter. The parameter is also a small value compared to the processing time of convolutions for the first row of the first input filter (i.e., ). More precisely, the input bandwidth from clock cycle to clock cycle is only occupied with input pixels. Therefore, we can fill out this available bandwidth by pipelining the tiles with up to stages, while the additional latency overhead is negligible compared to the overall latency of the system.

### 4.4 Processing Time and Memory Accesses of MMIE

#### Convolutional Processes

Earlier in Section 4 we showed that in convolutional processes, a single tile computes out of pixels of one of output activation maps within clock cycles. We also showed that the total number of weight passing occurrences for the computation of a single convolutional layer is equal to , which causes additional clock cycles for the computations of each row of the input filters. Considering parallel tiles, the number of required clock cycles is expressed as

| | | | |
|---|---|---|---|
| 4 | 192 | 64 | |
| 2 | 384 | 32 | |
| 1 | 384 | 32 | |
| 1 | 192 | 64 | |
| 1 | 64 | 192 | |

Eq. (Equation 5) suggests that the number of required clock cycles for convolutional computations is independent of for large values of (i.e., ) when not considering the weight passing overheads. In Section 4.3 we showed that input pixels are shared among all tiles and each pixel is read at each clock cycle. This means that the number of memory accesses by input activation maps (MA) is equal to the number of clock cycles required to complete the convolution. On the other hand, the weights are read in the first clock cycles out of a total . As a result, the number of required memory accesses by filters to compute out of pixels of one out of output activation map is equal to . In general, the number of memory accesses by filters (MA) can be computed as follows:

Finally, the total number of memory accesses (MA) is a summation of memory accesses by filters, input activation maps and output activation maps, where the number of memory accesses by output activation maps (MA) is equal to . It is worth noting that while the number of clock cycles and MA are independent of , MA depends on it. On the other hand, MA is independent of while the number of clock cycles and MA are not. It is worth mentioning that while higher values of and optimize MMIE towards lower memory accesses and processing latencies, they also increase its power consumption and silicon area.

#### Fully-Connected Computations

In Section ?, we showed that MMIE can perform the fully-connected computations in a similar way to convolutional computations, with each PE loading a different set of weights. The processing time of each PE is thus equal to the number of inputs . The number of clock cycles required to generate output pixels can be expressed as

Unlike weights, input pixels are shared among PEs. Therefore, the number of memory accesses by input pixels (MA) is equal to the number of clock cycles required for fully-connected computations. Since each output pixel relies on a distinct set of weights, the number of memory accesses by weights (MA) is computed as follows:

The number of memory accesses by output pixels (MA) is equal to . The total number of memory accesses (MA) is also a summation of memory accesses by weights, input and output pixels.
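As a rough sanity check of the fully-connected processing time, the sketch below estimates an ideal latency by assuming that every PE performs one multiply-accumulate per clock cycle. It uses the 192 PEs (32 tiles of 6 PEs) and the 40 MHz fully-connected clock reported in Section 5, together with the fully-connected multiply-accumulate counts quoted in Section 1. The function name `ideal_fc_latency_ms` and the one-MAC-per-cycle model are illustrative assumptions, not the closed-form expression of this section.

```python
def ideal_fc_latency_ms(num_macs, num_pes=192, freq_hz=40e6):
    """Ideal latency in ms, assuming each PE does one MAC per clock cycle."""
    cycles = num_macs / num_pes          # all PEs busy on every cycle (assumption)
    return cycles / freq_hz * 1e3        # seconds -> milliseconds


# Fully-connected MAC counts quoted in Section 1 (one MAC per weight).
fc_macs = {"AlexNet": 58.6e6, "VGGNet-16": 124e6, "ResNet-50": 2e6}

fc_latency_ms = {name: ideal_fc_latency_ms(m) for name, m in fc_macs.items()}

for name, latency in fc_latency_ms.items():
    print(f"{name}: {latency:.2f} ms")
```

Under these assumptions the estimates come out at about 7.63 ms, 16.15 ms and 0.26 ms, which is close to the 7.6 ms, 16.4 ms and 0.3 ms reported in Section 5.2.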

## 5Implementation Results

In this paper, we optimize MMIE for a low-latency, low-memory access implementation while keeping its power consumption below the power budget of mobile devices, limited to a few hundred mW [23]. Fig. ? shows the architecture of MMIE, which consists of three main sub-blocks: tiles, pipelining stages and a distributor unit. MMIE contains 32 reconfigurable tiles, each with 6 PEs. Each PE is also associated with a 24-bit memory. The pipelining stages provide the required amount of shifts depending on the value of using shift registers and multiplexers, as discussed in Section 4.3. The distributor unit also provides the required bandwidth for fully-connected weights using shift registers working at a lower frequency than the off-chip memory.

The and parameters do not only affect latency and number of memory accesses, but also impact power and area costs. Therefore, it is possible to obtain different trade-offs between processing time, throughput and implementation costs depending on and . Since the reconfigurable tile functions differently based on and , the effective values of and vary for each case. Table ? shows the effective values of and for AlexNet, VGGNet and ResNet filter sizes. The effective values of and , denoted as and respectively, have to be used in all the equations reported in this paper that rely on these two values.
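As a rough illustration of how the effective number of parallel tiles can change with the filter configuration, the sketch below assumes that a configuration needing `pes_needed` PEs lets each 6-PE tile be split into `6 // pes_needed` independent groups. This splitting rule and the names `effective_tiles` and `pes_needed` are assumptions made purely for illustration. They reproduce the 64 and 192 parallel tiles quoted in Section 5.1, but the authoritative mapping is the one given in Table ?.

```python
PHYSICAL_TILES = 32   # reconfigurable tiles in MMIE (Section 5)
PES_PER_TILE = 6      # PEs in each tile (Section 5)


def effective_tiles(pes_needed):
    """Parallel tiles available when one logical tile needs `pes_needed` PEs.

    Assumption: a physical tile splits into floor(6 / pes_needed) logical tiles.
    """
    if not 1 <= pes_needed <= PES_PER_TILE:
        raise ValueError("configuration must need between 1 and 6 PEs per tile")
    return PHYSICAL_TILES * (PES_PER_TILE // pes_needed)


tiles_by_need = {k: effective_tiles(k) for k in range(1, PES_PER_TILE + 1)}
print(tiles_by_need)
```

With one PE per logical tile this gives 192 parallel tiles, with three PEs it gives 64, and with four to six PEs it gives 32.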

MMIE was implemented in TSMC 65nm GP CMOS technology, and its layout is shown in Fig. ?. MMIE works at a nominal frequency of 200 MHz for convolutional processes and 40 MHz for fully-connected processes. MMIE performs the fully-connected computations at a lower frequency since they require a high input bandwidth, as each neuron loads its own set of weights. We also used the run-length compression technique introduced in [14] to reduce the required bandwidth. MMIE uses the distributor unit to decode the compressed values. With MMIE working at a 10× lower frequency than the off-chip memory for fully-connected computations, the required bandwidth of 193 16-bit values is obtained using this technique.
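To make the idea of run-length compression concrete, the following is a minimal sketch of a zero run-length code, in which each non-zero value is stored together with the number of zeros that precede it. The pair layout and the function names `rle_encode` and `rle_decode` are illustrative assumptions. This is neither the exact bit-level format of [14] nor a model of the distributor unit.

```python
def rle_encode(values):
    """Encode a sequence as (zero_run, value) pairs plus a trailing zero count."""
    pairs = []
    run = 0
    for v in values:
        if v == 0:
            run += 1                 # extend the current run of zeros
        else:
            pairs.append((run, v))   # zeros seen before this non-zero value
            run = 0
    return pairs, run                # `run` now holds the trailing zeros


def rle_decode(pairs, trailing_zeros):
    """Rebuild the original sequence from the encoded pairs."""
    out = []
    for run, v in pairs:
        out.extend([0] * run)
        out.append(v)
    out.extend([0] * trailing_zeros)
    return out


data = [0, 0, 5, 0, 0, 0, 7, 3, 0]
encoded, tail = rle_encode(data)
decoded = rle_decode(encoded, tail)
print(encoded, tail)
```

The more zeros the weights or activations contain, the fewer pairs need to be transferred, which is why such a scheme lowers the bandwidth required from the off-chip memory.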

### 5.1Hardware Implementation Results on State-of-the-Art Networks

Fig. ? shows the breakdown of performance efficiency for each layer of AlexNet, VGGNet-16 and ResNet-50 when using MMIE. In our simulations, the input pixels and weight values are quantized to 16 bits while using 2 and 15 fractional bits, respectively. It is worth noting that this quantization scheme results in less than 0.5% accuracy degradation on the aforementioned CNNs using [24]. The implementation results show that the lowest performance efficiency of AlexNet and VGGNet-16 was obtained at the first layer of these networks. The number of output activation maps of the first layer of AlexNet is 96 while MMIE provides 64 parallel tiles when and . As a result, for the first 64 output activation maps, MMIE achieves a high performance efficiency while the remaining 32 output activation maps are computed using 32 parallel tiles out of 64, which explains the low performance efficiency of this layer. On the other hand, MMIE performs the computations of the first layer of VGGNet-16 with a high performance efficiency. However, since the time required for writing the computed output activation pixels is longer than the computation time, a low overall performance efficiency for this layer is inevitable. In ResNet-50, layers with a receptive field of show lower performance compared to other filter sizes, while it was shown in Section ? that such a receptive field yields a performance efficiency. Such performance efficiency degradation is expected, as of the layers with receptive field of are not multiples of the 192 available parallel tiles. For instance, the number of output activation maps of the second layer of ResNet-50 is 64, while 192 parallel tiles are available. Therefore, 128 tiles are not being used for this layer.
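The effect of the number of output activation maps on tile usage can be illustrated with a short calculation. The sketch below computes only the fraction of tiles kept busy when output activation maps are assigned to parallel tiles in passes. The function name `tile_occupancy` and the pass-based model are assumptions, and weight-passing and write-back overheads are ignored, so this is not the full performance efficiency metric.

```python
import math


def tile_occupancy(num_output_maps, parallel_tiles):
    """Fraction of tile slots used when maps are processed in full passes."""
    passes = math.ceil(num_output_maps / parallel_tiles)
    return num_output_maps / (passes * parallel_tiles)


# First layer of AlexNet: 96 output activation maps on 64 parallel tiles.
alexnet_first_layer = tile_occupancy(96, 64)

# Second layer of ResNet-50: 64 output activation maps on 192 parallel tiles.
resnet_second_layer = tile_occupancy(64, 192)

print(f"AlexNet layer 1: {alexnet_first_layer:.2f}")
print(f"ResNet-50 layer 2: {resnet_second_layer:.2f}")
```

For AlexNet the second pass uses only 32 of the 64 tiles, so the occupancy over both passes is 0.75. For the ResNet-50 layer only 64 of the 192 tiles are used, giving an occupancy of about 0.33.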

Fig. ? shows the breakdown of power consumption for each layer of AlexNet, VGGNet-16 and ResNet-50. The power consumption of MMIE follows a descending trend as the number of zeros in output/input activation maps and filters increases for each layer of AlexNet, VGGNet-16 and ResNet-50. Moreover, it also increases as the performance efficiency of layers rises. The power numbers reported in this paper are obtained by measuring switching activities of all models.

Fig. ? shows the breakdown of the memory accesses for each layer of AlexNet, VGGNet-16 and ResNet-50. The memory accesses for each layer of the aforementioned networks are limited to a few MB. More precisely, AlexNet and ResNet-50 layers require a lower number of memory accesses compared to VGGNet-16. While the memory accesses for each layer of AlexNet and ResNet-50 are roughly of the same order, the total memory accesses of ResNet-50 are significantly higher due to its numerous layers. The processing latency of each layer also follows a similar trend to the memory accesses as shown in Fig. ?. In fact, the latency of each layer in AlexNet and ResNet-50 is limited to a few milliseconds while each layer of VGGNet-16 requires roughly 10× more clock cycles.

### 5.2Comparison With State-of-the-Art Implementations

The implementation results of MMIE on AlexNet, VGGNet-16 and ResNet-50 are shown in Table ?. As discussed in Section 1, MCR of these networks varies depending on their sizes. Therefore, different implementation results are expected when running MMIE on these models. MMIE performs the convolutional and fully-connected computations of AlexNet within 20.8 ms and 7.6 ms while requiring 15.6 MB and 117.8 MB memory accesses to the off-chip memory, respectively. The convolutional and fully-connected processes of VGGNet-16 are performed within 421.8 ms and 16.4 ms and require 375.5 MB and 247.3 MB memory accesses, respectively. Finally, performing convolutional and fully-connected computations of ResNet-50 on MMIE requires 106.6 ms and 0.3 ms while memory accesses are 154.6 MB and 4.1 MB, respectively. Therefore, AlexNet computations require the lowest latency while its total memory accesses are roughly similar to those of ResNet-50. VGGNet-16 is the most complex network in terms of both processing latency and memory accesses. MMIE also yields 83%, 94% and 94% performance efficiency for convolutional computations of AlexNet, VGGNet-16 and ResNet-50, respectively. It is worth mentioning that the performance efficiency of fully-connected computations is roughly 100% for all the aforementioned networks.
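The per-network totals implied by the figures above can be tallied with a few lines of code. The sketch below simply sums the convolutional and fully-connected latency and memory-access values quoted in this paragraph. The dictionary layout and variable names are illustrative choices.

```python
# (latency in ms, off-chip memory accesses in MB), as quoted in the text.
results = {
    "AlexNet":   {"conv": (20.8, 15.6),   "fc": (7.6, 117.8)},
    "VGGNet-16": {"conv": (421.8, 375.5), "fc": (16.4, 247.3)},
    "ResNet-50": {"conv": (106.6, 154.6), "fc": (0.3, 4.1)},
}

totals = {}
for name, parts in results.items():
    latency_ms = parts["conv"][0] + parts["fc"][0]
    memory_mb = parts["conv"][1] + parts["fc"][1]
    totals[name] = (round(latency_ms, 1), round(memory_mb, 1))

for name, (latency_ms, memory_mb) in totals.items():
    print(f"{name}: {latency_ms} ms, {memory_mb} MB")
```

The totals are 28.4 ms and 133.4 MB for AlexNet, 438.2 ms and 622.8 MB for VGGNet-16, and 106.9 ms and 158.7 MB for ResNet-50. This matches the observation that AlexNet has the lowest latency while its total memory accesses are close to those of ResNet-50.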

During the past few years, numerous works have been conducted towards ASIC implementations of DNNs. However, most of them were only tested on either small datasets or outdated CNNs which require orders of magnitude fewer parameters and computations [26]. Recently, Google released a custom DNN accelerator, the tensor processing unit (TPU) [30]. TPU is a programmable and reconfigurable accelerator that can perform both fully-connected and convolutional computations. However, its power consumption exceeds the power budgets of embedded devices [20]. In [14], a convolutional accelerator, called Eyeriss, was introduced. Eyeriss was fabricated in 65 nm CMOS technology and tested on AlexNet and VGGNet-16. Eyeriss uses high batch sizes to obtain a lower number of memory accesses, but using this method results in a higher computational latency. Eyeriss performs convolutional computations of AlexNet and VGGNet-16 in 115.3 ms and 4.3 s while requiring 15.4 MB and 321.1 MB memory accesses and using batch sizes of 4 and 3, respectively. Its performance efficiency is also limited to only 55% and 26% on AlexNet and VGGNet-16, resulting in a large silicon area of 12.52 mm² (1852 kgates). Eyeriss also uses clock gating to reduce its power consumption.

Recently, a few works have focused on minimizing energy by modulating precision, frequency and supply voltage of their accelerator for each convolutional layer [32]. In [15], a precision-scalable convolutional accelerator, fabricated in 28 nm UTBB FD-SOI technology, was introduced. This architecture dynamically adapts itself depending on the required precision for each layer, instead of using a fixed precision. More precisely, it exploits a reconfigurable multiplier which is able to perform one 16-bit, two 8-bit or four 4-bit multiplications, depending on the required precision. As a result, using a dynamic fixed-point technique allows the frequency and supply voltage to be changed over time, which results in a lower power/energy consumption. This accelerator performs the convolutional computations of AlexNet in 21.3 ms, and those of ResNet in 598.8 ms, while its performance efficiency is respectively limited to 38% and 32% on average. Similar to Eyeriss, the low performance efficiency of this architecture results in a large gate count of 1950 kgates.

In [19], a DNN accelerator, fabricated in 65 nm CMOS 1P8M, was introduced. This accelerator can perform both fully-connected and convolutional computations while using two separate cores and the dynamic fixed-point technique to minimize power/energy consumption. This architecture exploits a reconfigurable 16-bit multiplier for convolutional processes which allows it to work with lower frequency and supply voltage. This architecture performs convolutional and fully-connected computations within 5.7 ms and 833 µs, respectively. The convolutional core of this architecture contains 768 16-bit reconfigurable PEs, which can be used as 3072 4-bit PEs. Despite its high convolutional throughput, its performance efficiency is limited to on average. The fully-connected core contains only 64 PEs, and uses a quantization table-based matrix multiplication to reduce off-chip memory accesses and remove redundancy. This technique reduces the memory accesses by and avoids of the 16-bit fixed-point multiplications in fully-connected computations [19]. While the fully-connected core is highly optimized, it requires separate PEs and hardware resources, which leads to a large silicon area of 16 mm².

In [20], a convolutional accelerator was proposed as a first attempt to improve the performance efficiency for filters fixed to . This architecture performs the convolutional computations of VGGNet-16 within 453.3 ms and requires 331.7 MB memory accesses.

In this paper, we proposed MMIE which supports all the filter sizes that require less than or equal to 6 parallel PEs in each tile. MMIE can perform both the convolutional and fully-connected computations while using the same PEs. Since both Eyeriss and MMIE were implemented in TSMC 65nm CMOS technology and use 16-bit fixed-point representations, a direct comparison of these two implementations constitutes a fair comparison. As shown in Table ?, MMIE outperforms Eyeriss [14] in terms of gate count (1.8× smaller), latency (5.5× and 10.2× lower), throughput (1.4× and 3.1× faster), performance efficiency (1.5× and 3.6× better) and energy scalability (1.5× and 2.7× more efficient) while having roughly the same number of memory accesses per batch. It is worth noting that a direct comparison of MMIE with the works published in [15] does not constitute a fair comparison, since they dynamically modulate precision, frequency and supply voltage and use advanced technology nodes, which allows them to instantiate more PEs while still having a low power/energy consumption. However, the introduced performance efficiency metric can be used for a fair comparison as it reflects the performance of the accelerators independent of their technology nodes, precisions and optimization techniques. Therefore, MMIE has better performance efficiency than the works published in [19] (2× better) and [15] (2.2× and 2.9× better) when performing convolutions of AlexNet and VGGNet-16.
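The latency and performance efficiency ratios quoted above can be reproduced directly from the numbers given earlier in this section. The sketch below recomputes them for the convolutional layers of AlexNet and VGGNet-16. The helper name `ratio` and the data layout are illustrative choices, and the Eyeriss VGGNet-16 latency of 4.3 s is expressed as 4300 ms.

```python
def ratio(a, b):
    """Return a / b rounded to one decimal place."""
    return round(a / b, 1)


# Convolutional latency in ms for (AlexNet, VGGNet-16).
latency_ms = {"MMIE": (20.8, 421.8), "Eyeriss": (115.3, 4300.0)}

# Convolutional performance efficiency in % for (AlexNet, VGGNet-16).
efficiency = {"MMIE": (83, 94), "Eyeriss": (55, 26), "[15]": (38, 32)}

latency_gain = tuple(
    ratio(e, m) for e, m in zip(latency_ms["Eyeriss"], latency_ms["MMIE"])
)
eff_gain_eyeriss = tuple(
    ratio(m, e) for m, e in zip(efficiency["MMIE"], efficiency["Eyeriss"])
)
eff_gain_ref15 = tuple(
    ratio(m, e) for m, e in zip(efficiency["MMIE"], efficiency["[15]"])
)

print(latency_gain, eff_gain_eyeriss, eff_gain_ref15)
```

The computed ratios are 5.5× and 10.2× for latency, 1.5× and 3.6× for performance efficiency against Eyeriss, and 2.2× and 2.9× against [15], in agreement with the figures stated in the text.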

## 6Conclusion

CNN accelerators in the literature promise a high peak throughput, but their performance efficiency is limited to less than 55% when running state-of-the-art networks such as AlexNet, VGGNets and ResNets. We proposed a dataflow inspired by the fully-connected computations to perform both convolutional and fully-connected processes with a high utilization factor. We then introduced a multi-mode inference engine (MMIE) based on the proposed dataflow and theoretically formalized its implementation performance. Finally, we implemented MMIE in TSMC 65nm CMOS technology and tested it on three state-of-the-art networks, AlexNet, VGGNet-16 and ResNet-50. The implementation results show that MMIE performs both the fully-connected and convolutional computations with performance efficiency no less than 84%, outperforming the state of the art also in terms of area occupation.

### References

- Y. Lecun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning applied to document recognition,” in *Proceedings of the IEEE*, pp. 2278–2324, 1998.
- Y. Lecun, Y. Bengio, and G. Hinton, “Deep learning,” *Nature*, vol. 521, pp. 436–444, 5 2015.
- A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet classification with deep convolutional neural networks,” in *Advances in Neural Information Processing Systems 25* (F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, eds.), pp. 1097–1105, Curran Associates, Inc., 2012.
- K. Simonyan and A. Zisserman, “Very Deep Convolutional Networks for Large-Scale Image Recognition,” *CoRR*, vol. abs/1409.1556, 2014.
- K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” *CoRR*, vol. abs/1512.03385, 2015.
- O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei, “ImageNet Large Scale Visual Recognition Challenge,” *International Journal of Computer Vision (IJCV)*, vol. 115, no. 3, pp. 211–252, 2015.
- Y.-H. Chen, J. Emer, and V. Sze, “Eyeriss: a Spatial Architecture for Energy-efficient Dataflow for Convolutional Neural Networks,” in *Proceedings of the 43rd International Symposium on Computer Architecture*, ISCA ’16, (Piscataway, NJ, USA), pp. 367–379, IEEE Press, 2016.
- M. Horowitz, “1.1 Computing’s energy problem (and what we can do about it),” in *2014 IEEE International Solid-State Circuits Conference Digest of Technical Papers (ISSCC)*, pp. 10–14, Feb 2014.
- S. Han, H. Mao, and W. J. Dally, “Deep Compression: compressing Deep Neural Network with Pruning, Trained Quantization and Huffman Coding,” *CoRR*, vol. abs/1510.00149, 2015.
- S. Han, J. Pool, J. Tran, and W. Dally, “Learning both weights and connections for efficient neural network,” in *Advances in Neural Information Processing Systems 28* (C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, eds.), pp. 1135–1143, Curran Associates, Inc., 2015.
- W. Wen, C. Wu, Y. Wang, Y. Chen, and H. Li, “Learning structured sparsity in deep neural networks,” *CoRR*, vol. abs/1608.03665, 2016.
- S. Han, X. Liu, H. Mao, J. Pu, A. Pedram, M. A. Horowitz, and W. J. Dally, “EIE: Efficient inference engine on compressed deep neural network,” in *2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA)*, pp. 243–254, June 2016.
- A. Ardakani, C. Condo, and W. J. Gross, “Sparsely-Connected Neural Networks: Towards Efficient VLSI Implementation of Deep Neural Networks,” *Proc. 5th Int. Conf. Learn. Represent. (ICLR)*, Nov. 2016.
- Y. H. Chen, T. Krishna, J. S. Emer, and V. Sze, “Eyeriss: An Energy-Efficient Reconfigurable Accelerator for Deep Convolutional Neural Networks,” *IEEE Journal of Solid-State Circuits*, vol. 52, pp. 127–138, Jan 2017.
- B. Moons, R. Uytterhoeven, W. Dehaene, and M. Verhelst, “14.5 Envision: A 0.26-to-10TOPS/W subword-parallel dynamic-voltage-accuracy-frequency-scalable Convolutional Neural Network processor in 28nm FDSOI,” in *2017 IEEE International Solid-State Circuits Conference (ISSCC)*, pp. 246–247, Feb 2017.
- Y. Chen, T. Luo, S. Liu, S. Zhang, L. He, J. Wang, L. Li, T. Chen, Z. Xu, N. Sun, and O. Temam, “DaDianNao: A Machine-Learning Supercomputer,” in *2014 47th Annual IEEE/ACM International Symposium on Microarchitecture*, pp. 609–622, Dec 2014.
- L. Cavigelli, D. Gschwend, C. Mayer, S. Willi, B. Muheim, and L. Benini, “Origami: a convolutional network accelerator,” *CoRR*, vol. abs/1512.04295, 2015.
- S. Wang, D. Zhou, X. Han, and T. Yoshimura, “Chain-NN: An energy-efficient 1D chain architecture for accelerating deep convolutional neural networks,” in *Design, Automation Test in Europe Conference Exhibition (DATE), 2017*, pp. 1032–1037, March 2017.
- D. Shin, J. Lee, J. Lee, and H. J. Yoo, “14.2 DNPU: An 8.1TOPS/W reconfigurable CNN-RNN processor for general-purpose deep neural networks,” in *2017 IEEE International Solid-State Circuits Conference (ISSCC)*, pp. 240–241, Feb 2017.
- A. Ardakani, C. Condo, M. Ahmadi, and W. J. Gross, “An Architecture to Accelerate Convolution in Deep Neural Networks,” *IEEE Transactions on Circuits and Systems I: Regular Papers*, Early Access, doi: 10.1109/TCSI.2017.2757036, 2017.
- F. Moreno, J. Alarcon, R. Salvador, and T. Riesgo, “FPGA implementation of an image recognition system based on tiny neural networks and on-line reconfiguration,” in *Industrial Electronics, 2008. IECON 2008. 34th Annual Conference of IEEE*, pp. 2445–2452, Nov 2008.
- Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel, “Backpropagation applied to handwritten zip code recognition,” *Neural Comput.*, vol. 1, pp. 541–551, Dec. 1989.
- R. Andri, L. Cavigelli, D. Rossi, and L. Benini, “YodaNN: An Ultra-Low Power Convolutional Neural Network Accelerator Based on Binary Weights,” in *2016 IEEE Computer Society Annual Symposium on VLSI (ISVLSI)*, pp. 236–241, July 2016.
- A. Vedaldi and K. Lenc, “MatConvNet: Convolutional Neural Networks for MATLAB,” in *Proceedings of the 23rd ACM International Conference on Multimedia*, MM ’15, (New York, NY, USA), pp. 689–692, ACM, 2015.
- Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell, “Caffe: Convolutional Architecture for Fast Feature Embedding,” *arXiv preprint arXiv:1408.5093*, 2014.
- L. Cavigelli, M. Magno, and L. Benini, “Accelerating real-time embedded scene labeling with convolutional networks,” in *2015 52nd ACM/EDAC/IEEE Design Automation Conference (DAC)*, pp. 1–6, June 2015.
- L. Cavigelli and L. Benini, “A 803 GOp/s/W convolutional network accelerator,” *IEEE Transactions on Circuits and Systems for Video Technology*, vol. PP, no. 99, pp. 1–1, 2016.
- A. Pullini, F. Conti, D. Rossi, I. Loi, M. Gautschi, and L. Benini, “A heterogeneous multi-core system-on-chip for energy efficient brain inspired computing,” *IEEE Transactions on Circuits and Systems II: Express Briefs*, vol. PP, no. 99, pp. 1–1, 2017.
- A. Shafiee, A. Nag, N. Muralimanohar, R. Balasubramonian, J. P. Strachan, M. Hu, R. S. Williams, and V. Srikumar, “ISAAC: a convolutional neural network accelerator with in-situ analog arithmetic in crossbars,” in *2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA)*, pp. 14–26, June 2016.
- N. P. Jouppi et al., “In-datacenter performance analysis of a tensor processing unit,” *CoRR*, vol. abs/1704.04760, 2017.
- Y.-H. Chen, T. Krishna, J. Emer, and V. Sze, “Eyeriss: An Energy-Efficient Reconfigurable Accelerator for Deep Convolutional Neural Networks,” in *IEEE International Solid-State Circuits Conference, ISSCC 2016, Digest of Technical Papers*, pp. 262–263, 2016.
- B. Moons and M. Verhelst, “An energy-efficient precision-scalable convnet processor in a 40-nm CMOS,” *IEEE Journal of Solid-State Circuits*, vol. PP, no. 99, pp. 1–12, 2016.
- B. Moons and M. Verhelst, “A 0.3–2.6 TOPS/W precision-scalable processor for real-time large-scale ConvNets,” in *2016 IEEE Symposium on VLSI Circuits (VLSI-Circuits)*, pp. 1–2, June 2016.