Ternary Neural Networks for Resource-Efficient AI Applications
Abstract
The computation and storage requirements for Deep Neural Networks (DNNs) are usually high. This issue limits their deployability on ubiquitous computing devices such as smart phones, wearables and autonomous drones. In this paper, we propose ternary neural networks (TNNs) in order to make deep learning more resource-efficient. We train these TNNs using a teacher-student approach based on a novel, layer-wise greedy methodology. Thanks to our two-stage training procedure, the teacher network is still able to use state-of-the-art methods such as dropout and batch normalization to increase accuracy and reduce training time. Using only ternary weights and activations, the student ternary network learns to mimic the behavior of its teacher network without using any multiplication. Unlike its $\{-1, 1\}$ binary counterparts, a ternary neural network inherently prunes the smaller weights by setting them to zero during training. This makes them sparser and thus more energy-efficient. We design a purpose-built hardware architecture for TNNs and implement it on FPGA and ASIC. We evaluate TNNs on several benchmark datasets and demonstrate up to 3.1× better energy efficiency with respect to the state of the art while also improving accuracy.
1 Introduction
Deep neural networks (DNNs) have achieved state-of-the-art results on a wide range of AI tasks including computer vision [1], speech recognition [2] and natural language processing [3]. As DNNs become more complex, their number of layers, number of weights, and computational cost increase. While DNNs are generally trained on powerful servers with the support of GPUs, they can be used for classification tasks on a variety of hardware. However, as the networks get bigger, their deployability on autonomous mobile devices such as drones, self-driving cars and mobile phones diminishes due to the extreme hardware resource requirements imposed by the high number of synaptic weights and floating-point multiplications. Our goal in this paper is to obtain DNNs that are able to classify at a high throughput on low-power devices without compromising too much accuracy.
In recent years, two main directions of research have been explored to reduce the cost of DNN classification. The first one preserves the floating-point precision of DNNs, but drastically increases sparsity and weight sharing for compression [4, 5]. This has the advantage of significantly diminishing memory and power consumption while preserving accuracy. However, the power savings are limited by the need for floating-point operations. The second direction reduces the need for floating-point operations using weight discretization [6, 7, 8, 9], with extreme cases such as binary neural networks completely eliminating the need for multiplications [10, 11, 12]. The main drawback of these approaches is a significant degradation in the classification accuracy in return for a limited gain in resource efficiency.
This paper introduces ternary neural networks (TNNs) to address these issues and makes the following contributions:

We propose a teacher-student approach for obtaining Ternary NNs with weights and activations constrained to $\{-1, 0, 1\}$. The teacher network is trained with stochastic firing using backpropagation, and can benefit from all techniques that exist in the literature such as dropout [13], batch normalization [14], and convolutions. The student network has the same architecture and, for each neuron, mimics the behavior of the equivalent neuron in the teacher network without using any multiplications,

We design a specialized hardware that is able to process TNNs at up to 2.7× better throughput, 3.1× better energy efficiency and 635× better area efficiency than the state of the art, with competitive accuracy,

We make the training code publicly available^{1} and provide a demonstration hardware design for TNNs using FPGA.^{2}
The rest of this paper is organized as follows. In the following section, we introduce our procedure for training the ternary NNs, detailing our use of the teacher-student paradigm to eliminate the need for multiplications altogether at test time, while still benefiting from all state-of-the-art techniques such as batch normalization and dropout during training. In Section 3, we provide a survey of related works that we compare our performance with. We present our experimental evaluation on ternarization and the classification performance on five different benchmark datasets in Section 4. In Section 5, we describe our purpose-built hardware that is able to handle both fully connected multi-layer perceptrons (MLPs) and convolutional NNs (CNNs) with a high throughput and a low-energy budget. Finally, we conclude with a discussion and future studies in Section 6.
2 Training Ternary Neural Networks
We use a two-stage teacher-student approach for obtaining TNNs. First, we train the teacher network with stochastically firing ternary neurons. Then, we let the student network learn how to imitate the teacher’s behavior using a layer-wise greedy algorithm. Both the teacher and the student networks have the same architecture. The student network’s weights are the ternarized version of the teacher network’s weights. The student network uses a step function with two thresholds as the activation function. In Table 1, we provide our notations and descriptions. In order to emphasize the difference, we denote the discrete-valued parameters and inputs with a bold font. Real-valued parameters are denoted by normal font. We use $[\cdot]$ to denote a matrix or a vector. We use subscripts for enumeration purposes and superscripts for differentiation. $n_i^t$ is defined as the output of neuron $i$ in the teacher network and $n_i^s$ is the output of neuron $i$ in the student network. We detail the two stages in the following subsections.
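 | Teacher network | Student network
Weights | $W_i = [w_j],\ w_j \in \mathbb{R}$ | $\mathbf{W}_i = [\mathbf{w}_j],\ \mathbf{w}_j \in \{-1, 0, 1\}$
Bias | $b_i \in \mathbb{R}$ | $b_i^{lo} \in \mathbb{Z},\ b_i^{hi} \in \mathbb{Z}$
Transfer function | $y_i = W_i^{T} X + b_i$ | $\mathbf{y}_i = \mathbf{W}_i^{T} \mathbf{X}$
Act. fun. | $n_i^t = -1$ with probability $-\rho$ if $\rho < 0$; $1$ with probability $\rho$ if $\rho > 0$; $0$ otherwise, where $\rho = \tanh(y_i)$ | $n_i^s = -1$ if $\mathbf{y}_i < b_i^{lo}$; $1$ if $\mathbf{y}_i > b_i^{hi}$; $0$ otherwise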
2.1 The Teacher Network
The teacher network can have any architecture with any number of neurons, and can be trained using any of the standard training algorithms. We train the teacher network with a single constraint only: it has stochastically firing ternary neurons with output values of $-1$, $0$, or $1$. The benefit of this approach is that we can use any technique that already exists for efficient NN training, such as batch normalization [14], dropout [13], etc. In order to have a ternary output for teacher neuron $i$, denoted as $n_i^t$, we add a stochastic firing step after the activation step. To achieve this stochastically, we use tanh (hyperbolic tangent), hard tanh, or softsign as the activation function of the teacher network so that the neuron output has range $(-1, 1)$ before ternarization. We use this range to determine the ternary output of the neuron as described in Table 1. Although we do not require any restrictions on the weights of the teacher network, several studies showed that restricting them has a regularization effect and reduces overfitting [8, 9]. Our approach is compatible with such a regularization technique as well.
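The stochastic firing step can be illustrated with a minimal NumPy sketch. It assumes a tanh activation and assumes that a neuron fires with probability equal to the magnitude of its activation, taking the sign of the activation when it fires; the function name `stochastic_ternary_fire` and the sample values are hypothetical and are not taken from the released training code.

```python
import numpy as np

def stochastic_ternary_fire(y, rng):
    """Sample ternary outputs in {-1, 0, 1} from real-valued pre-activations y.

    Assumption: the neuron fires with probability |tanh(y)| and, when it
    fires, outputs the sign of tanh(y); otherwise it outputs 0.
    """
    rho = np.tanh(y)                               # activation in (-1, 1)
    fires = rng.random(rho.shape) < np.abs(rho)    # Bernoulli(|rho|) draw
    return (np.sign(rho) * fires).astype(np.int8)

rng = np.random.default_rng(0)
y = np.array([-3.0, -0.2, 0.0, 0.2, 3.0])
samples = np.stack([stochastic_ternary_fire(y, rng) for _ in range(2000)])

# On average the ternary output approximates the real-valued activation.
print(samples.mean(axis=0))
print(np.tanh(y))
```

Under this assumption the expected value of the ternary output equals the real-valued activation, which is one reason such a firing rule is convenient for training with backpropagation.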
2.2 The Student Network
After the teacher network is trained, we begin the training of the student network. The goal of the student network is to predict the output of the real-valued teacher network. Since we use the same architecture for both networks, there is a one-to-one correspondence between the neurons of both. Each student neuron, denoted as $n_i^s$, learns to mimic the behavior of the corresponding teacher neuron $n_i^t$ individually and independently from the other neurons. In order to achieve this, a student neuron uses the corresponding teacher neuron’s weights as a guide to determine its own ternary weights using two thresholds $t_i^{lo}$ (for the lower threshold) and $t_i^{hi}$ (for the higher one) on the teacher neuron’s weights. This step is called the weight ternarization. In order to have a ternary neuron output, we have a step activation function with two thresholds $b_i^{lo}$ and $b_i^{hi}$. The output ternarization step determines these.
Figure 1 depicts the ternarization procedure for a sample neuron. In the top row, we plot the distributions of the weights, activations and ternary output of a sample neuron in the teacher network, respectively. The student neuron’s weight distribution that is determined by $t_i^{lo}$ and $t_i^{hi}$ is plotted below the teacher’s weight distribution. We use the transfer function output of the student neuron, grouped according to the teacher neuron’s output on the same input, to determine the thresholds for the step activation function. In this way, the resulting output distributions for the teacher and the student neurons are similar. In the following subsections we detail each step.
Output Ternarization
The student network uses a two-thresholded step activation function to have ternary output as described in Table 1. Output ternarization finds the step activation function’s thresholds $b_i^{lo}$ and $b_i^{hi}$, for a ternary neuron $i$, for a given set of ternary weights $\mathbf{W}_i$. In order to achieve this, we compute three different transfer function output distributions for the student neuron, using the teacher neuron’s ternary output value on the same input. We use $y^{-}$ to denote the set of transfer function outputs of the student neuron for which the teacher neuron’s output value is $-1$. $y^{0}$ and $y^{+}$ are defined in the same way for teacher neuron output values $0$ and $1$, respectively.
We use a simple classifier to find the boundaries between these three clusters of student neuron transfer function outputs, and use the boundaries as the two thresholds $b_i^{lo}$ and $b_i^{hi}$ of the step activation function. The classification is done using a linear discriminant on the kernel density estimates of the three distributions. The discriminant between $y^{-}$ and $y^{0}$ is selected as $b_i^{lo}$, and the discriminant between $y^{0}$ and $y^{+}$ gives $b_i^{hi}$.
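The grouping-and-boundary idea can be sketched as follows. This is not the kernel-density discriminant described above: as a simpler stand-in, the sketch picks each threshold by counting misclassified samples over all candidate boundaries. The function names, the strict inequalities of the step function, and the toy data are assumptions made for illustration.

```python
import numpy as np

def lower_threshold(y_neg, y_zero):
    """Boundary b_lo such that 'y < b_lo' best separates y_neg from y_zero."""
    candidates = np.unique(np.concatenate([y_neg, y_zero]))
    errors = [np.sum(y_neg >= t) + np.sum(y_zero < t) for t in candidates]
    return candidates[int(np.argmin(errors))]

def upper_threshold(y_zero, y_pos):
    """Boundary b_hi such that 'y > b_hi' best separates y_pos from y_zero."""
    candidates = np.unique(np.concatenate([y_zero, y_pos]))
    errors = [np.sum(y_pos <= t) + np.sum(y_zero > t) for t in candidates]
    return candidates[int(np.argmin(errors))]

def step_activation(y, b_lo, b_hi):
    """Two-threshold step function with outputs in {-1, 0, 1}."""
    return np.where(y > b_hi, 1, np.where(y < b_lo, -1, 0))

# Student transfer-function outputs, grouped by the teacher's ternary output
# on the same inputs (toy values; each group is assumed to be non-empty).
y_student = np.array([-5, -4, -3, -1, 0, 1, 2, 4, 5, 6])
n_teacher = np.array([-1, -1, -1, 0, 0, 0, 0, 1, 1, 1])

b_lo = lower_threshold(y_student[n_teacher == -1], y_student[n_teacher == 0])
b_hi = upper_threshold(y_student[n_teacher == 0], y_student[n_teacher == 1])
n_student = step_activation(y_student, b_lo, b_hi)
print(b_lo, b_hi)   # -1 2
print(n_student)
```

On this separable toy data the student reproduces the teacher's outputs exactly; with overlapping groups the same procedure returns the boundaries with the fewest disagreements.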
Weight Ternarization
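During weight ternarization, the order and the sign of the teacher network’s weights are preserved. We ternarize the weights $W_i$ of the $i$-th neuron of the teacher network using two thresholds $t_i^{lo}$ and $t_i^{hi}$ such that $\max(W_i) \geq t_i^{hi} \geq 0$ and $0 \geq t_i^{lo} \geq \min(W_i)$. The weights for the $i$-th student neuron are obtained by weight ternarization as follows
$$\mathbf{W}_i = \operatorname{ternarize}(W_i \mid t_i^{lo}, t_i^{hi}) = [\mathbf{w}_j] \tag{1}$$
where
$$\mathbf{w}_j = \begin{cases} -1 & \text{if } w_j < t_i^{lo} \\ 1 & \text{if } w_j > t_i^{hi} \\ 0 & \text{otherwise} \end{cases} \tag{2}$$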
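A minimal sketch of the weight ternarization step, assuming strict comparisons against the two thresholds; the function name `ternarize_weights` and the sample weights and thresholds are illustrative only.

```python
import numpy as np

def ternarize_weights(w, t_lo, t_hi):
    """Map real-valued weights to {-1, 0, 1} using two thresholds.

    Weights below t_lo become -1, weights above t_hi become +1, and
    everything in between is pruned to 0. Sign and order are preserved.
    """
    w = np.asarray(w, dtype=float)
    return np.where(w > t_hi, 1, np.where(w < t_lo, -1, 0)).astype(np.int8)

w = np.array([-0.9, -0.3, -0.05, 0.02, 0.4, 0.8])
w_t = ternarize_weights(w, t_lo=-0.2, t_hi=0.3)
print(w_t)                 # [-1 -1  0  0  1  1]
print(np.mean(w_t == 0))   # fraction of pruned weights
```

Moving the thresholds away from zero prunes more weights, which is how a per-neuron sparsity level emerges from the choice of thresholds.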
We find the optimal threshold values for the weights by evaluating the ternarization quality with a score function. For a given neuron with $n_p$ positive weights and $n_n$ negative weights, the total number of possible ternarization schemes is $n_p \times n_n$ since we respect the original sign and order of weights. For a given configuration of the positive and negative threshold values $t_i^{hi}$ and $t_i^{lo}$ for a given neuron, we calculate the following score for assessing the performance of the ternary network, mimicking the original network.
$$S_{t_i^{lo}, t_i^{hi}} = \sum_{d} \sum_{r \in \{-1, 0, 1\}} p(n_i^t = r \mid X_d)\, p(n_i^s = r \mid \mathbf{X}_d) \tag{3}$$
where $n_i^t$ and $n_i^s$ denote the output of the teacher neuron and student neuron, respectively.
$X_d$ is the input sample for the teacher neuron, and $\mathbf{X}_d$ is the input sample for the student neuron. Note that $X_d \neq \mathbf{X}_d$ after the first layer. Since we ternarize the network in a feed-forward manner, in order to prevent ternarization errors from propagating to upper layers, we always use the teacher’s original input to determine its output probability distribution. The output probability distribution for the teacher neuron for input $X_d$, $p(n_i^t \mid X_d)$, is calculated using stochastic firing as described in Table 1. The output probability distribution for the student neuron for input $\mathbf{X}_d$, $p(n_i^s \mid \mathbf{X}_d)$, is calculated using the ternary weights $\mathbf{W}_i$ with the current configuration of $t_i^{lo}$, $t_i^{hi}$, and the step activation function thresholds. These thresholds, $b_i^{lo}$ and $b_i^{hi}$, are selected according to the current ternary weight configuration $\mathbf{W}_i$.
The output probability values are accumulated as scores over all input samples only when the output of the student neuron matches the output of the teacher neuron. The optimal ternarization of weights is determined by selecting the configuration with the maximum score.
$$\mathbf{W}_i^{*} = \operatorname*{arg\,max}_{t_i^{lo},\, t_i^{hi}} S_{t_i^{lo}, t_i^{hi}} \tag{4}$$
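The scoring and exhaustive selection can be sketched in a few lines of NumPy. The sketch makes several simplifying assumptions that are not part of the method as described: the teacher fires with probability equal to the magnitude of a tanh activation, the student is deterministic (so its output probability is 1 for its actual output), the teacher and the student see the same ternary input, and the activation thresholds `b_lo` and `b_hi` are held fixed instead of being re-selected for every weight configuration. All function names are hypothetical.

```python
import numpy as np

def teacher_probs(rho):
    """Columns hold P(n=-1), P(n=0), P(n=+1) for activations rho in (-1, 1)."""
    return np.stack([np.maximum(-rho, 0), 1 - np.abs(rho), np.maximum(rho, 0)], axis=1)

def score(rho, n_student):
    """Sum, over samples, of the teacher's probability of the student's output."""
    p = teacher_probs(rho)
    return p[np.arange(len(rho)), n_student + 1].sum()

def exhaustive_search(w, X, rho, b_lo, b_hi):
    """Score every (t_lo, t_hi) pair drawn from the teacher's own weights."""
    neg = np.append(np.sort(w[w < 0]), 0.0)   # candidate lower thresholds
    pos = np.append(0.0, np.sort(w[w > 0]))   # candidate upper thresholds
    best = (-np.inf, None, None)
    for t_lo in neg:
        for t_hi in pos:
            w_t = np.where(w > t_hi, 1, np.where(w < t_lo, -1, 0))
            y = X @ w_t                        # additions/subtractions only
            n_s = np.where(y > b_hi, 1, np.where(y < b_lo, -1, 0))
            s = score(rho, n_s)
            if s > best[0]:
                best = (s, t_lo, t_hi)
    return best

rng = np.random.default_rng(0)
w = rng.normal(size=8)                         # teacher weights of one neuron
X = rng.integers(-1, 2, size=(200, 8))         # ternary input samples
rho = np.tanh(X @ w)                           # teacher activations
best_score, t_lo, t_hi = exhaustive_search(w, X, rho, b_lo=-1, b_hi=1)
print(best_score, t_lo, t_hi)
```

Because the student is deterministic here, the double sum of the score collapses to one term per sample, which matches the description of accumulating the teacher's probability only where the two outputs agree.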
The worst-case time complexity of the algorithm is $O(n_p n_n)$ score evaluations. We propose using a greedy dichotomic search strategy instead of a fully exhaustive one. We make a search grid over $n_p$ candidate values for $t_i^{hi}$ by $n_n$ values for $t_i^{lo}$. We select two equally spaced pivot points along one of the dimensions, $n_p$ or $n_n$. Using these pivot points, we calculate the maximum score along the other axis. We reduce the search space by selecting the region in which the maximum point lies. Since we have two pivot points, we reduce the search space to two-thirds at each step. Then, we repeat the search procedure in the reduced search space. This faster strategy needs only a logarithmic number of reduction steps, and when there are no local maxima it is guaranteed to find the optimal solution. When there are multiple local extrema, it may get stuck. Fortunately, we can detect the possible sub-optimal solutions using the score values obtained for the student neuron. By using a threshold on the output score for a student neuron, we can selectively use exhaustive search on a subset of neurons. Empirically, we find these cases to be rare. We provide a detailed analysis in Section 4.1.
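The two-pivot reduction can be sketched as a search over integer grid indices. The sketch assumes the score is strictly unimodal along each axis, and it nests the same two-pivot search along the second axis, which is one possible reading of "the maximum score along the other axis" and not necessarily the implementation used in our experiments. The names `two_pivot_search` and `dichotomic_2d` and the toy score are hypothetical.

```python
def two_pivot_search(f, lo, hi):
    """Return the argmax of a strictly unimodal f over the integers lo..hi.

    Two pivots split the interval in three parts; the part that cannot
    contain the maximum is dropped, keeping about two-thirds each step.
    """
    while hi - lo > 2:
        third = (hi - lo) // 3
        m1, m2 = lo + third, hi - third
        if f(m1) < f(m2):
            lo = m1 + 1      # maximum lies to the right of the first pivot
        else:
            hi = m2 - 1      # maximum lies to the left of the second pivot
    return max(range(lo, hi + 1), key=f)

def dichotomic_2d(score, n_rows, n_cols):
    """Search a grid of threshold indices with nested two-pivot searches."""
    def best_in_row(i):
        j = two_pivot_search(lambda j: score(i, j), 0, n_cols - 1)
        return score(i, j)
    i = two_pivot_search(best_in_row, 0, n_rows - 1)
    j = two_pivot_search(lambda j: score(i, j), 0, n_cols - 1)
    return i, j

# Toy unimodal score with its single maximum at grid index (7, 3).
def toy_score(i, j):
    return -(i - 7) ** 2 - (j - 3) ** 2

print(dichotomic_2d(toy_score, n_rows=20, n_cols=10))  # (7, 3)
```

When the score has several local maxima, this search can return a sub-optimal index pair, which is the case the score threshold described above is meant to detect.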
The ternarization of the output layer is slightly different since it is a softmax classifier. In the ternarization process, instead of using the teacher network’s output, we use the actual labels in the training set. Again, we treat neurons independently but we make several iterations over each output neuron in a round-robin fashion. After each iteration we check for convergence. In our experiments, we observed that the method converges after a few passes over all neurons.
Our layer-wise approach allows us to update the weights of the teacher network before ternarization of any layer. For this optional weight update, we use a staggered retraining approach in which only the non-ternarized layers are modified. After the teacher network’s weights are updated, the inputs to a layer for both teacher and student networks become equal, $X_d = \mathbf{X}_d$. We use early stopping during this optional retraining and we find that a few dozen iterations suffice.
3 Related Work
In this section, we give a brief survey of several related works in energy-efficient NNs. In Table 2, we provide a comparison between our approach and the related works that use binary or ternary weights in the deployment phase by summarizing the constraints put on inputs, weights and activations during training and testing.
Training  Deployment  
Method  Inputs  Weights  Activations  Inputs  Weights  Activations 
BC [15], TC [9], TWN [8]  
Binarized NN [6]  
XNORNet[16]  with  with  
EBP[17]  
Bitwise NN [10]  
TrueNorth [11]  
TNN (This Work) 
Courbariaux et al. [15] propose the BinaryConnect (BC) method for binarizing only the weights, leaving the inputs and the activations as real values. The same idea is also used as TernaryConnect (TC) in [9] and Ternary Weight Networks (TWN) in [8] with ternary weights $\{-1, 0, 1\}$ instead of binary $\{-1, 1\}$. They use the backpropagation algorithm with an additional weight binarization step. In the forward pass, weights are binarized either deterministically using their sign, or stochastically. Stochastic binarization converts the real-valued weights to probabilities with the use of the hard-sigmoid function, and then decides the final value of the weights by sampling from these probabilities. In the backpropagation phase, a quantization mechanism is used so that the multiplication operations are converted to bit-shift operations. While this binarization scheme helps reduce the number of multiplications during training and testing, it is not fully hardware-friendly since it only reduces the number of floating-point multiplication operations. Recently, the same idea has also been extended to the activations of the neurons [6]. In Binarized NN, the sign activation function is used for obtaining binary neuron activations. Also, shift-based operations are used during both training and test time in order to gain energy efficiency. Although this method helps improve the efficiency of multiplications, it does not eliminate them completely.
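The two binarization modes described above can be sketched as follows, assuming the hard sigmoid $\sigma(x) = \mathrm{clip}((x + 1)/2, 0, 1)$; the function names are illustrative and this is not code from [15].

```python
import numpy as np

def hard_sigmoid(x):
    """Piecewise-linear sigmoid: clip((x + 1) / 2, 0, 1)."""
    return np.clip((x + 1.0) / 2.0, 0.0, 1.0)

def binarize(w, rng=None):
    """Binarize weights to {-1, +1}.

    Without a random generator the sign of the weight is used
    (deterministic mode); with one, +1 is drawn with probability
    hard_sigmoid(w) (stochastic mode).
    """
    w = np.asarray(w, dtype=float)
    if rng is None:
        return np.where(w >= 0, 1, -1)
    return np.where(rng.random(w.shape) < hard_sigmoid(w), 1, -1)

w = np.array([-1.0, -0.5, 0.0, 0.5, 1.0])
print(binarize(w))                              # [-1 -1  1  1  1]
print(binarize(w, np.random.default_rng(0)))    # random for the inner values
```

In the stochastic mode, weights at or beyond the clipping points are binarized deterministically, while weights near zero are close to a fair coin flip.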
XNOR-Nets [16] provide a solution to convert convolution operations in CNNs to bitwise operations. The method first learns a discrete convolution together with a set of real-valued scaling factors ($K$, $\alpha$). After the convolution calculations are handled using bitwise operations, the scaling factors are applied to obtain the actual result. This approach is very similar to Binarized NN and helps reduce the number of floating-point operations to some extent.
Following the same goal, DoReFa-Net [7] and Quantized Neural Networks (QNN) [18] propose using bit quantization for weights, activations as well as gradients. In this way, it is possible to gain speed and energy efficiency to some extent not only during training but also during inference time. Han et al. [4] combine several techniques to achieve both quantization and compression of the weights by setting the relatively unimportant ones to 0. They also develop a dedicated hardware called Efficient Inference Engine (EIE) that exploits the quantization and sparsity to gain large speedups and energy savings, currently only on fully connected layers [5].
Soudry et al. [17] propose Expectation Backpropagation (EBP), an algorithm for learning the weights of a binary network using a variational Bayes technique. The algorithm can be used to train the network such that each weight is restricted to binary or ternary values. The strength of this approach is that the training algorithm does not require any tuning of hyperparameters, such as the learning rate in the standard backpropagation algorithm. Also, the neurons in the middle layers are binary, making it hardware-friendly. However, this approach assumes the bias is real and it is not currently applicable to CNNs.
All of the methods described above are only partially discretized, leading only to a reduction in the number of floating point multiplication operations. In order to completely eliminate the need for multiplications which will result in maximum resource efficiency, one has to discretize the network completely rather than partially. Under these extreme limitations, only a few studies exist in the literature.
Kim and Smaragdis propose Bitwise NN [10], which is a completely binary approach, where all the inputs, weights, and the outputs are binary. They use a straightforward extension of backpropagation to learn the bitwise network’s weights. First, a real-valued network is trained by constraining the weights of the network using $\tanh$. Also, the $\tanh$ nonlinearity is used for the activations to constrain the neuron output to $(-1, 1)$. Then, in a second training step, the binary network is trained using the real-valued network together with a global sparsity parameter. In each epoch during forward propagation, the weights and the activations are binarized using the sign function on the original constrained real-valued parameters and activations. Currently, CNNs are not supported in Bitwise NNs.
IBM announced an energy-efficient TrueNorth chip, designed for spiking neural network architectures [19]. Esser et al. [11] propose an algorithm for training networks that are compatible with the IBM TrueNorth chip. The algorithm is based on backpropagation with two modifications. First, a Gaussian approximation is used for the summation of several Bernoulli neurons, and second, values are clipped to satisfy the boundary requirements of the TrueNorth chip. The ternary weights are obtained by introducing a synaptic connection parameter that determines whether a connection exists. If the connection exists, the sign of the weight is used. Recently, the work has been extended to CNN architectures as well [12].
4 Experimental Assessment of Ternarization and Classification
The main goals of our experiments are to demonstrate (i) the performance of the ternarization procedure with respect to the real-valued teacher network, and (ii) the classification performance of fully discretized ternary networks.
We perform our experiments on several benchmarking datasets using both multi-layer perceptrons (MLP) in a permutation-invariant manner and convolutional neural networks (CNN) with varying sizes. For the MLPs, we experiment with different architectures in terms of depth and neuron count. We use 250, 500, 750, and 1000 neurons per layer for 2, 3, and 4 layer networks. For the CNNs, we use the following VGG-like architecture proposed by [15]:
$$(2 \times nC3) - MP2 - (2 \times 2nC3) - MP2 - (2 \times 4nC3) - MP2 - (2 \times 8nFC) - L2SVM$$
where $C3$ is a $3 \times 3$ convolutional layer, $MP2$ is a $2 \times 2$ max-pooling layer, and $FC$ is a fully connected layer. We use $L2SVM$ as our output layer. We use two different network sizes with this architecture, with $n = 64$ and $n = 128$. We call these networks CNN-Small and CNN-Big, respectively.
We perform our experiments on the following datasets:
MNIST database of handwritten digits [20] is a well-studied database for benchmarking methods on real-world data. MNIST has a training set of 60K examples, and a test set of 10K examples of $28 \times 28$ grayscale images. We use the last 10K samples of the training set as a validation set for early stopping and model selection.
CIFAR-10 and CIFAR-100 [21] are two color image classification datasets that contain $32 \times 32$ RGB images. Each dataset consists of 50K images in training and 10K images in test sets. In CIFAR-10, the images come from a set of 10 classes that contain airplanes, automobiles, birds, cats, deer, dogs, frogs, horses, ships and trucks. In CIFAR-100, the number of image classes is 100.
SVHN (Street View House Numbers) [22] consists of $32 \times 32$ RGB color images of digits cropped from Street View images. The total training set size is 604K examples (with 531K less difficult samples to be used as extra) and the test set contains 26K images.
GTSRB (German Traffic Sign Recognition Benchmark Dataset) [23] is composed of 51839 images of German road signs in 43 classes. The images have great variability in terms of size and illumination conditions. Also, the dataset has unbalanced class frequencies. The images in the dataset are extracted from 1-second video sequences recorded at 30 fps. In order to have a representative validation set, we extract 1 track at random per traffic sign for validation. The number of images in the train, validation and test sets are 37919, 1290 and 12630, respectively.
In order to allow a fair comparison against related works, we perform our experiments in similar configurations. On MNIST, we only use MLPs. We minimize cross-entropy loss using stochastic gradient descent with a mini-batch size of 100. During training we apply random rotations bounded by a fixed maximum angle. We report the test error rate associated with the best validation error rate after 1000 epochs. We do not perform any preprocessing on MNIST and we use a threshold-based binarization on the input.
For other datasets, we use two CNN architectures: CNN-Small and CNN-Big. As before, we train a teacher network before obtaining the student TNN. For the teacher network, we use a modified version of Binarized NN’s algorithm [6] and ternarize the weights during training. In this way, we obtain a better accuracy on the teacher network and we gain considerable speedup while obtaining the student network. Since we already have the discretized weights during the teacher network training, we only mimic the output of the neurons using the step activation function with two thresholds for the student network. During teacher network training, we minimize squared hinge loss with Adam with mini-batches of size 200. We train for at most 1000 epochs and report the test error rate associated with the best validation epoch. For input binarization, we use the approach described in [12] with either 12 or 24 (on CIFAR-100) transduction filters. We do not use any augmentation on the datasets.
4.1 Ternarization Performance
The ternarization performance is the ability of the student network to imitate the behavior of its teacher. We measure this by using the accuracy difference between the teacher network and the student network. Table 3 shows this difference between the teacher and student networks on training and test sets for three different exhaustive search threshold values $\epsilon$. $\epsilon = 1$ corresponds to the fully exhaustive search case whereas $\epsilon = 0$ represents fully dichotomic search.
Neurons | Layers | Train | Test | Train | Test | Train | Test
250 | 1 | 0.63 | 0.75 | 0.69 | 0.76 | 0.72 | 0.98
 | 2 | 0.30 | 0.44 | 0.25 | 0.56 | 0.15 | 0.54
 | 3 | 0.08 | 0.47 | 0.10 | 0.62 | 0.03 | 0.52
500 | 1 | 0.50 | 0.94 | 0.49 | 0.83 | 0.48 | 0.70
 | 2 | 0.23 | 0.36 | 0.29 | 0.39 | 0.26 | 0.49
 | 3 | 0.12 | 0.25 | 0.07 | 0.34 | 0.10 | 0.20
750 | 1 | 0.27 | 0.90 | 0.27 | 0.90 | 0.29 | 1.05
 | 2 | 0.00 | 0.56 | 0.01 | 0.65 | 0.00 | 0.67
 | 3 | 0.02 | 0.43 | 0.03 | 0.44 | 0.03 | 0.26
1000 | 1 | 0.44 | 0.66 | 0.60 | 0.87 | 0.79 | 0.95
 | 2 | 0.02 | 0.59 | 0.07 | 0.33 | 0.05 | 0.37
 | 3 | 0.16 | 0.29 | 0.14 | 0.34 | 0.14 | 0.34
The results show that the ternarization performance is better for deeper networks. Since we always use the teacher network’s original output as a reference, errors are not amplified in the network. On the contrary, deeper networks allow the student network to correct some of the mistakes in the upper layers, dampening the errors. Also, we perform a retraining step with early stopping before ternarizing a layer, since it slightly improves the performance. The ternarization performance generally decreases with lower threshold values, but the decrease is marginal. On occasion, performance even increases. This is due to the teacher network’s weight update, which allows the network to escape from local minima.
Method | MNIST | CIFAR-10 | SVHN | GTSRB | CIFAR-100
Fully Discretized |  |  |  |  | 
TNN (This Work) | 1.67 | 12.11 | 2.73 | 0.98 | 48.40
TrueNorth [11, 12] | 7.30 | 16.59 | 3.34 | 3.50 | 44.36
Bitwise NN [10] | 1.33 |  |  |  | 
Partially Discretized |  |  |  |  | 
Binarized NN [6] | 0.96 | 10.15 | 2.53 |  | 
BC [15] | 1.29 | 9.90 | 2.30 |  | 
TC [9] | 1.15 | 12.01 | 2.42 |  | 
TWN [8] | 0.65 | 7.44 |  |  | 
EBP [24] | 2.20 |  |  |  | 
XNOR-Net [16] |  | 9.88 |  |  | 
DoReFa-Net [7] |  |  | 2.40 |  | 
In order to demonstrate the effect of the exhaustive search threshold $\epsilon$ in terms of runtime and classification performance, we conduct a detailed analysis without the optional staggered retraining. Figure 2 shows the distribution of the ratio of neurons that are ternarized exhaustively with different values of $\epsilon$, together with the performance gaps on training and test datasets. The optimal trade-off is achieved at an intermediate value of $\epsilon$: exhaustive search is used for only 20% of the neurons, and the expected value of the accuracy gaps is practically 0. For the largest layer with 1000 neurons, the ternarization operations take 2 min and 63 min for dichotomic and exhaustive search, respectively, on a 40-core Intel(R) Xeon(R) CPU E5-2650 v3 @ 2.30GHz server with 128 GB RAM. For the output layer, the ternarization time is reduced to 21 min with exhaustive search.
4.2 Classification Performance
The classification performance in terms of error rate (%) on several benchmark datasets is provided in Table 4. We compare our results to several related methods that we described in the previous section. We make a distinction between the fully discretized methods and the partially discretized ones because only in the former is the resulting network completely discrete, requiring no floating-point operations and no multiplications and thus providing maximum energy efficiency.
Since the benchmark datasets we use are the most studied ones, there are several known techniques and tricks that give performance boosts. In order to eliminate unfair comparison among methods, we follow the majority’s lead and we do not use any data augmentation in our experiments. Moreover, using an ensemble of classifiers is a common technique for performance boosting in almost all classifiers and is not unique to neural networks [25]. For that reason, we do not use an ensemble of networks and we cite the compatible results in other related works.
MNIST is by far the most studied dataset in deep learning literature. The state of the art is already down to 21 erroneous classifications (0.21%), which is extremely difficult to obtain without extensive data augmentation. TNN’s error rate on MNIST is 1.67% with a single 3-layer MLP with 750 neurons in each layer. Bitwise NNs [10] with 1024 neurons in 3 layers achieve a slightly better performance. A TNN with an architecture of similar size to Bitwise NN is worse due to overfitting. Since TNN selects a different sparsity level for each neuron, it can perform better on smaller networks, and larger networks cause overfitting on MNIST. Bitwise NN’s global sparsity parameter has a better regularization effect on MNIST for relatively bigger networks. Its performance with smaller networks or on other datasets is unknown. TrueNorth [11] with a single network achieves only a 7.30% error rate. To alleviate the limitations of single network performance, a committee of networks can be used, reducing the error rate to 0.58% with 64 networks.
The error rate of TNN on CIFAR-10 is 12.11%. When compared to partially discretized alternatives, a fully discretized TNN is obtained at the cost of a few points in accuracy, and it exceeds the performance of TrueNorth by more than 4%. On SVHN, it has a similar achievement with lower margins. For CIFAR-100, on the other hand, it does not perform better than TrueNorth. Given the relatively lower number of related works that report results on CIFAR-100 as opposed to CIFAR-10, we can conclude that this is a more challenging dataset for resource-efficient deep learning with a lot of room for improvement. TNN has its most remarkable performance on the GTSRB dataset. With a 0.98% error rate, the CNN-Big model exceeds human performance, which is at 1.16%.
Partially discretized approaches use real-valued input, which contains more information. Therefore, it is expected that they are able to get higher classification accuracy. When compared to partially discretized studies, TNNs only lose a small percentage of accuracy and in return they provide better energy efficiency. Next, we describe the unique hardware design for TNNs and investigate to what extent TNNs are area and energy efficient.
5 Purpose-built Hardware for TNN
We designed a hardware architecture for TNNs that is optimized for ternary neuron weights and activation values $\{-1, 0, 1\}$. In this section we first describe the purpose-built hardware we designed and then evaluate its performance in terms of latency, throughput, and energy and area efficiency.
5.1 Hardware Architecture
Figure 3 outlines the hardware architecture of a fully-connected layer in a multi-layer NN. The design forms a pipeline that corresponds to the sequence of NN processing steps. For efficiency reasons, the number of layers and the maximum layer dimensions (input size and number of neurons) are decided at synthesis time. For a given NN architecture, the design is still user-programmable: each NN layer contains a memory that can be programmed at run time with neuron weights or output ternarization thresholds $b^{lo}$ and $b^{hi}$. As seen in the previous experiments of Section 4, a given NN architecture can be reused for different datasets with success.
Ternary values are represented with 2 bits using the usual two’s complement encoding. That way, the compute part of each neuron is reduced to just one integer adder/subtractor and one register, both with a width that grows only logarithmically with the input size of the neuron. So each neuron is only a few tens of ASIC gates, which is very small. Inside each layer, all neurons work in parallel so that one new input item is processed per clock cycle. Layers are pipelined in order to simultaneously work on different sets of inputs, i.e. layer $l$ processes image $m$ while layer $l+1$ processes image $m-1$. The ternarization block processes the neuron outputs sequentially, so it consists of the memory of threshold values, two signed comparators and a multiplexer.
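The behavior of one neuron and of the ternarization block can be modeled in software with a short sketch. It is a behavioral illustration only, not the RTL: the function names are hypothetical, and the strict comparisons against the thresholds are an assumption.

```python
def encode(t):
    """2-bit two's complement code of a ternary value: -1 -> 0b11, 0 -> 0b00, 1 -> 0b01."""
    return t & 0b11

def decode(bits):
    """Inverse of encode for the three codes in use."""
    return bits - 4 if bits & 0b10 else bits

def ternary_neuron(inputs, weights, b_lo, b_hi):
    """Accumulate one input item per step using only add/subtract, then ternarize."""
    acc = 0                                  # the neuron's single register
    for x, w in zip(inputs, weights):        # one input item per clock cycle
        if w == 1:
            acc += x                         # adder
        elif w == -1:
            acc -= x                         # subtractor
        # w == 0: pruned weight, nothing to do
    # Ternarization block: two signed comparators driving a multiplexer.
    if acc < b_lo:
        return -1
    if acc > b_hi:
        return 1
    return 0

inputs = [1, -1, 1, 0, 1]
weights = [1, -1, 0, 1, 1]
print([bin(encode(t)) for t in (-1, 0, 1)])               # ['0b11', '0b0', '0b1']
print(ternary_neuron(inputs, weights, b_lo=-1, b_hi=1))   # 1
```

The accumulator stays within plus or minus the number of inputs, which is why a narrow adder and register are enough, and zero weights contribute no work at all.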
We did a generic register transfer level (RTL) implementation that can be synthesized on both Field-Programmable Gate Array (FPGA) and Application-Specific Integrated Circuit (ASIC) technologies. FPGAs are reprogrammable off-the-shelf circuits and are ideal for general-purpose hardware acceleration. Typically, high-performance cloud solutions use high-end FPGAs tightly coupled with general-purpose multi-core processors [26], while ASICs are used for more throughput or in battery-powered embedded devices.
5.2 Hardware Performance
For the preliminary measurements, we used the MNIST dataset and the Sakura-X FPGA board [27] because it features precise power measurement capabilities. It can accommodate a 3-layer fully connected NN with 1024 neurons per layer (for a total of 3082 neurons), using 81% of the Kintex-7 160T FPGA.
Table 5: Performance of the FPGA design on MNIST.
Neurons  Throughput (fps)  Layers  Energy per image (µJ)  Latency per image (µs)  Accuracy (%)
250      255 102           1       1.24                   5.37                    97.76
250      255 102           2       2.44                   6.73                    98.13
250      255 102           3       3.63                   8.09                    98.14
500      255 102           1       2.44                   6.63                    97.75
500      255 102           2       4.83                   9.24                    98.14
500      255 102           3       7.22                   11.9                    98.29
750      255 102           1       3.63                   7.88                    97.73
750      255 102           2       7.22                   11.8                    98.10
750      255 102           3       10.8                   15.6                    98.33
1000     198 019           1       6.22                   10.2                    97.63
1000     198 019           2       12.4                   15.3                    98.09
1000     198 019           3       18.5                   20.5                    97.89
The performance of our FPGA design in terms of latency, throughput and energy efficiency is given in Table 5. With a 200 MHz clock frequency, the throughput (here limited by the number of neurons) is 195 K images/s, with a power consumption of 3.8 W and a classification latency of 20.5 µs.
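The throughput figures can be related to the pipeline organisation with a simple model. The sketch below is an approximation under stated assumptions, not a description of the actual design: it assumes that a layer needs one clock cycle per input item, that the slowest layer bounds the pipeline, and that there is no per-image overhead. The function name and the layer lists are hypothetical.

```python
from typing import Sequence


def pipeline_throughput_fps(clock_hz: float, layer_input_sizes: Sequence[int]) -> float:
    """Images per second of a layer pipeline under the assumptions above.

    Each layer consumes one input item per cycle, so it needs as many
    cycles per image as it has inputs; the largest layer sets the rate.
    """
    cycles_per_image = max(layer_input_sizes)
    return clock_hz / cycles_per_image


CLOCK_HZ = 200e6  # 200 MHz, as in the FPGA measurements

# MNIST images have 28 x 28 = 784 pixels, so with 250 neurons per layer
# the first layer (784 inputs) is the slowest stage.
small = pipeline_throughput_fps(CLOCK_HZ, [784, 250, 250, 250])

# With 1024 neurons per layer, the layers fed by 1024 neuron outputs
# become the slowest stages.
large = pipeline_throughput_fps(CLOCK_HZ, [784, 1024, 1024, 1024])

print(round(small), large)
```

Under these assumptions the model gives about 255 102 images/s for the 250-neuron case and about 195 K images/s for the 1024-neuron case. For the 1000-neuron rows of Table 5 it would give 200 000 images/s, slightly above the reported 198 019, so the real design presumably has a small overhead that this sketch does not capture.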
TrueNorth [11] can operate at the two extremes of power consumption and accuracy: it consumes 0.268 µJ with a low-accuracy network, and as much as 108 µJ with a committee of 64 networks that reaches a higher accuracy. Our hardware cannot operate at these two extremes, yet in the middle operating zone we outperform TrueNorth both in terms of the tradeoff between energy efficiency and accuracy and in terms of speed. There, TrueNorth consumes 4 µJ per image, with a throughput of 1000 images/s and a latency of 1 ms. Our TNN hardware, consuming 3.63 µJ per image, achieves 98.14% accuracy (Table 5) at a rate of 255 102 images/s and a latency of 8.09 µs.
For the rest of the FPGA experiments, we use the larger VC709 board, equipped with the Virtex-7 690T FPGA, because it can support much larger designs. We also synthesized the design as an ASIC using the STMicroelectronics 28 nm FDSOI manufacturing technology. The results are given in Table 6. We compare our FPGA and ASIC solutions with the state of the art: TrueNorth [12] and EIE [5].
Table 6: FPGA and ASIC results compared with the state of the art.
Platform                   TrueNorth [12]  EIE 64PE [5]  EIE 256PE [5]  TNN       TNN
Year                       2014            2016          2016           2016      2016
Type                       ASIC            ASIC          ASIC           FPGA      ASIC
Technology                 28 nm           45 nm         28 nm          Virtex-7  ST 28 nm
Clock (MHz)                Async.          800           1200           250       500
Quantization               1-bit           4-bit         4-bit          Ternary   Ternary
Throughput (fps)           1 989           81 967        426 230        61 035    122 070
Power (W)                  0.18            0.59          2.36           6.25      0.588
Energy efficiency (fps/W)  10 839          138 927       180 606        9 771     207 567
Area (mm²)                 430             40.8          63.8           -         6.36
Area efficiency (fps/mm²)  5               2 009         6 681          -         19 194
Table 7: CNN results on FPGA and ASIC, compared with TrueNorth [12].
                             TNN FPGA 200 MHz    TNN ASIC ST 28 nm 500 MHz           TrueNorth [12]
                             CNNBig   CNNSmall   CIFAR10  CIFAR100  GTSRB   SVHN     CIFAR10  CIFAR100  GTSRB  SVHN
Throughput (fps)             1 695    3 390      3 390    3 390     3 390   6 781    1 249    1 526     1 615  2 526
Power (W)                    9.58     4.8        0.377    0.224     0.310   0.224    0.204    0.208     0.200  0.256
Energy per image (µJ)        5 650    1 410      111      66.0      91.3    33.0     163      131       124    101
Energy efficiency (fps/W)    178      709        8 985    15 148    10 947  30 344   6 108    7 344     8 052  9 850
Area (mm²)                   -        -          6.06     6.06      6.06    1.79     424      424       424    424
Area efficiency (fps/mm²)    -        -          559      559       559     3 787    2.95     3.60      3.81   5.96
Accuracy (%)                 -        -          87.89    51.60     99.02   97.27    83.41    55.64     96.50  96.66
The ASIC version compares very well with TrueNorth on throughput, area efficiency (fps/mm²) and energy efficiency (fps/W). Even though EIE uses 16-bit precision, it achieves high throughput because it takes advantage of weight sparsity and skips many useless computations. However, we achieve better energy and area efficiencies, since all our hardware elements (memories, functional units, etc.) are significantly smaller thanks to ternarization. Our energy results would be even better if we took into account weight sparsity and zero-activations (e.g. when input values are zero), as is done in the EIE work.
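The two figures of merit used in this comparison are plain ratios, which the following short sketch makes explicit. The function names are illustrative assumptions; the input values are copied from Table 6.

```python
def energy_efficiency(throughput_fps: float, power_w: float) -> float:
    """Energy efficiency in fps/W, i.e. images classified per joule."""
    return throughput_fps / power_w


def area_efficiency(throughput_fps: float, area_mm2: float) -> float:
    """Area efficiency in fps/mm^2."""
    return throughput_fps / area_mm2


# (throughput in fps, power in W, area in mm^2), taken from Table 6.
platforms = {
    "EIE 64PE": (81_967, 0.59, 40.8),
    "EIE 256PE": (426_230, 2.36, 63.8),
    "TNN ASIC": (122_070, 0.588, 6.36),
}

for name, (fps, power_w, area_mm2) in platforms.items():
    print(name,
          round(energy_efficiency(fps, power_w)),
          round(area_efficiency(fps, area_mm2)))
```

Recomputed values can differ from the table in the last digits, because the power and area entries in the table are themselves rounded.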
Finally, we implemented the CNNBig and CNNSmall networks described in Section 4 on both FPGA and ASIC. Results are given in Table 7. We give worst-case FPGA results because this is what matters for users of general-purpose hardware accelerators. For the ASIC technology, we took into account per-dataset zero-activations to reduce power consumption, similar to what was done in the EIE work. We compare with TrueNorth because paper [12] is the only one that gives figures of merit related to CNNs on ASIC. The TrueNorth area is calculated according to the number of cores used. Using different CNN models than TrueNorth's, we achieve better accuracy on three datasets out of four, while having higher throughput, better energy efficiency and much better area efficiency.
6 Discussions and Future Work
In this study, we propose TNNs for resource-efficient applications of deep learning. Energy efficiency and area efficiency come from using neither multiplications nor floating-point operations. We develop a student-teacher approach to train the TNNs and devise purpose-built hardware to make them available for embedded applications with resource constraints. Through experimental evaluation, we demonstrate the performance of TNNs in terms of both accuracy and resource efficiency, with CNNs as well as MLPs. The only other related work that has these two features is TrueNorth [12], since Bitwise NNs do not support CNNs [10]. In terms of accuracy, TNNs perform better than TrueNorth with relatively smaller networks on all of the benchmark datasets except one. Unlike TrueNorth and Bitwise NNs, TNNs use ternary neuron activations obtained with a step function with two thresholds. This allows each neuron to choose a sparsity parameter for itself and gives an opportunity to remove the weights that contribute very little. In that respect, TNNs inherently prune the unnecessary connections.
We also develop purpose-built hardware for TNNs that offers significant throughput and area efficiency and highly competitive energy efficiency. Compared to TrueNorth, our TNN ASIC hardware offers improvements of 147× to 635× on area efficiency, 1.4× to 3.1× on energy efficiency and 2.1× to 2.7× on throughput. It also often has higher accuracy with our new training approach.
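These improvement ranges can be checked against Table 7 by dividing each TNN ASIC figure by the TrueNorth figure for the same dataset. The sketch below does this; the dictionary keys and the helper name are assumptions for illustration, and the numbers are copied from the four dataset columns of Table 7.

```python
from typing import Sequence, Tuple

# Table 7 values, one entry per dataset column, in table order.
tnn_asic = {
    "throughput_fps": [3390, 3390, 3390, 6781],
    "energy_eff_fps_per_w": [8985, 15148, 10947, 30344],
    "area_eff_fps_per_mm2": [559, 559, 559, 3787],
}
truenorth = {
    "throughput_fps": [1249, 1526, 1615, 2526],
    "energy_eff_fps_per_w": [6108, 7344, 8052, 9850],
    "area_eff_fps_per_mm2": [2.95, 3.60, 3.81, 5.96],
}


def improvement_range(ours: Sequence[float], theirs: Sequence[float]) -> Tuple[float, float]:
    """Smallest and largest per-dataset ratio ours / theirs."""
    ratios = [o / t for o, t in zip(ours, theirs)]
    return min(ratios), max(ratios)


for metric in tnn_asic:
    lo, hi = improvement_range(tnn_asic[metric], truenorth[metric])
    print(f"{metric}: {lo:.1f}x to {hi:.1f}x")
```

Rounded, the ratios match the ranges quoted in the text: roughly 2.1 to 2.7 for throughput, 1.4 to 3.1 for energy efficiency and 147 to 635 for area efficiency.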
Acknowledgment
This project is being funded in part by Grenoble Alpes Métropole through the Nano2017 Esprit project. The authors would like to thank Olivier Menut from STMicroelectronics for his valuable input and continuous support.
Footnotes
 https://github.com/slidelig/tnntrain
 http://tima.imag.fr/sls/researchprojects/tnnfpgaimplementation/
References
[1] L. Hertel, E. Barth, T. Käster, and T. Martinetz. Deep convolutional neural networks as generic feature extractors. In 2015 International Joint Conference on Neural Networks (IJCNN), pages 1–4, July 2015.
[2] A. Graves, A. Mohamed, and G. Hinton. Speech recognition with deep recurrent neural networks. In ICASSP, 2013.
[3] D. Bahdanau, K. Cho, and Y. Bengio. Neural machine translation by jointly learning to align and translate. In ICLR, 2015.
[4] S. Han, H. Mao, and W. J. Dally. Deep compression: Compressing deep neural network with pruning, trained quantization and Huffman coding. In ICLR, 2016.
[5] S. Han, X. Liu, H. Mao, J. Pu, A. Pedram, M. A. Horowitz, and W. J. Dally. EIE: Efficient Inference Engine on Compressed Deep Neural Network. In 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA), 2016.
[6] M. Courbariaux, I. Hubara, D. Soudry, R. El-Yaniv, and Y. Bengio. Binarized Neural Networks: Training Deep Neural Networks with Weights and Activations Constrained to +1 or -1. In NIPS, 2016.
[7] S. Zhou, Z. Ni, X. Zhou, H. Wen, Y. Wu, and Y. Zou. DoReFa-Net: Training Low Bitwidth Convolutional Neural Networks with Low Bitwidth Gradients. CoRR, abs/1606.06160, 2016.
[8] F. Li and B. Liu. Ternary weight networks. CoRR, abs/1605.04711, 2016.
[9] Z. Lin, M. Courbariaux, R. Memisevic, and Y. Bengio. Neural networks with few multiplications. In ICLR, 2016.
[10] M. Kim and P. Smaragdis. Bitwise neural networks. In International Conference on Machine Learning (ICML) Workshop on Resource-Efficient Machine Learning, 2015.
[11] S. K. Esser, R. Appuswamy, P. Merolla, J. V. Arthur, and D. S. Modha. Backpropagation for energy-efficient neuromorphic computing. In Advances in Neural Information Processing Systems (NIPS), pages 1117–1125, 2015.
[12] S. K. Esser, P. A. Merolla, J. V. Arthur, A. S. Cassidy, R. Appuswamy, A. Andreopoulos, D. J. Berg, J. L. McKinstry, T. Melano, D. R. Barch, C. di Nolfo, P. Datta, A. Amir, B. Taba, M. D. Flickner, and D. S. Modha. Convolutional networks for fast, energy-efficient neuromorphic computing. Proceedings of the National Academy of Sciences, 113(41):11441–11446, 2016.
[13] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929–1958, 2014.
[14] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proceedings of the 32nd International Conference on Machine Learning (ICML-15), pages 448–456, 2015.
[15] M. Courbariaux, Y. Bengio, and J. David. BinaryConnect: Training deep neural networks with binary weights during propagations. In Advances in Neural Information Processing Systems (NIPS), pages 3105–3113, 2015.
[16] M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi. XNOR-Net: ImageNet Classification Using Binary Convolutional Neural Networks. In ECCV, 2016.
[17] D. Soudry, I. Hubara, and R. Meir. Expectation backpropagation: Parameter-free training of multilayer neural networks with continuous or discrete weights. In Advances in Neural Information Processing Systems (NIPS), pages 963–971, 2014.
[18] I. Hubara, M. Courbariaux, D. Soudry, R. El-Yaniv, and Y. Bengio. Quantized neural networks: Training neural networks with low precision weights and activations. CoRR, 2016.
[19] P. A. Merolla, J. V. Arthur, R. Alvarez-Icaza, A. S. Cassidy, J. Sawada, F. Akopyan, B. L. Jackson, N. Imam, C. Guo, Y. Nakamura, B. Brezzo, I. Vo, S. K. Esser, R. Appuswamy, B. Taba, A. Amir, M. D. Flickner, W. P. Risk, R. Manohar, and D. S. Modha. A million spiking-neuron integrated circuit with a scalable communication network and interface. Science, 345(6197):668–673, 2014.
[20] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
[21] A. Krizhevsky. Learning multiple layers of features from tiny images. Technical report, Toronto University, 2009.
[22] Y. Netzer, T. Wang, A. Coates, A. Bissacco, B. Wu, and A. Y. Ng. Reading digits in natural images with unsupervised feature learning. In NIPS Workshop on Deep Learning and Unsupervised Feature Learning, 2011.
[23] J. Stallkamp, M. Schlipsing, J. Salmen, and C. Igel. Man vs. computer: Benchmarking machine learning algorithms for traffic sign recognition. In International Joint Conference on Neural Networks (IJCNN), 2011.
[24] Z. Cheng, D. Soudry, Z. Mao, and Z. Lan. Training binary multilayer neural networks for image classification using expectation backpropagation. arXiv preprint arXiv:1503.03562, 2015.
[25] D. C. Ciresan, U. Meier, J. Masci, and J. Schmidhuber. A committee of neural networks for traffic sign classification. In International Joint Conference on Neural Networks (IJCNN), pages 1918–1921, 2011.
[26] Y. Choi, J. Cong, Z. Fang, Y. Hao, G. Reinman, and P. Wei. A quantitative analysis on microarchitectures of modern CPU-FPGA platforms. In Proceedings of the 53rd Design Automation Conference, pages 109–115, 2016.
[27] Sakura-X Board, http://satoh.cs.uec.ac.jp/SAKURA/hardware/SAKURAX.html. [online], accessed 1 Nov 2016.