MicroNets

1 Introduction

Machine learning (ML) methods play an increasingly central role in a myriad of internet-of-things (IoT) applications. Using ML, we can interpret the wealth of sensor data that IoT devices generate. Prototypical IoT uses include monitoring environmental conditions such as temperature and air quality (e.g., carbon monoxide levels), monitoring mechanical vibrations from machinery to predict failure, and visual tasks such as detecting people or animals. User interfaces based on speech recognition and synthesis are also very common, as many IoT devices have limited user input features and small displays. In mobile applications, ML inference is often off-loaded to the cloud, where compute resources are more abundant. However, offloading introduces overheads in terms of latency, energy and privacy, and requires network connectivity, such as WiFi or cellular access. For the proliferating class of IoT devices, offloading can be prohibitively expensive, in terms of both the radio chips, which increase the bill of materials, and the network access costs.

Platform       | Architecture | Memory | Storage  | Power | Price
CloudML        | GPU          | HBM    | SSD/Disk |       |
  Nvidia V100  | Nvidia Volta | 16GB   | TB-PB    | 250W  | $9K
MobileML       | CPU          | DRAM   | Flash    |       |
  iPhone 11    | Arm A-Class  | 4GB    | 64GB     | 8W    | $750
TinyML         | MCU          | SRAM   | eFlash   |       |
  ST F446RE    | Arm M4       | 128KB  | 0.5MB    | 0.1W  | $3
  ST F746ZG    | Arm M7       | 320KB  | 1MB      | 0.3W  | $5
  ST F767ZI    | Arm M7       | 512KB  | 2MB      | 0.3W  | $8

Table 1: Illustrative comparison of hardware for CloudML, MobileML and TinyML, including the MCUs targeted in this work.

TinyML is an alternative paradigm, where we execute ML tasks locally on IoT devices. This allows for real-time analysis and interpretation of data at the point of collection, which translates to huge advantages in terms of cost and privacy. Microcontroller units (MCUs) are the ideal hardware platform for TinyML, as they are typically small (~1cm), cheap (~$1) and low-power (~1mW) compared to mobile and cloud platforms (Table 1). MCUs typically integrate a CPU, digital and analog peripherals, on-chip embedded flash (eFlash) memory for program storage and Static Random-Access Memory (SRAM) for intermediate data. However, deploying deep neural networks on MCUs is extremely challenging; the most severe limitation is the small and flat memory system (Figure 1) within which the model weights and activations must be stored. Therefore, to achieve the promise of TinyML, we must aggressively optimize models to best exploit the limited resources provided by an MCU hardware and software stack.

Figure 1: Illustration of memory hierarchies for (a) a mobile SoC which has a deep memory hierarchy with many levels of on-chip cache and a large off-chip DRAM main memory, and (b) an MCU with a flat on-chip memory system with no off-chip main memory.

Mounting interest in TinyML has led to some maturity in both software stacks and benchmarks. The open-source TensorFlow Lite for Microcontrollers (TFLM) inference runtime David et al. (2020) allows for straightforward and portable deployment of NN workloads. TFLM uses an interpreter to execute an NN graph, which means the same model graph can be deployed across different hardware platforms. Compared to code-generation-based methods such as uTensor, TFLM provides portability across MCU vendors, at the cost of a fairly minimal memory overhead. Recently, the ML performance (MLPerf) benchmarking organization has outlined a suite of benchmarks for TinyML called TinyMLPerf Banbury et al. (2020), which consists of three TinyML tasks: visual wake words (VWW), audio keyword spotting (KWS), and anomaly detection (AD). Standardizing TinyML research results around a common open-source runtime and benchmark suite makes comparing research results easier and fairer, hopefully driving research progress.

Previous work on TinyML has largely considered model design without consideration for the real deployment scenario (e.g. SpArSe Fedorov et al. (2019)), or has used closed-source software stacks which make deployment and comparison impossible (e.g. MCUNet Lin et al. (2020)). In this paper, we describe MicroNets, a family of models which can be deployed with the publicly available TFLM runtime, for the three TinyMLPerf tasks of VWW, KWS and AD. In contrast to previous TinyML work that uses black-box optimizations, such as Bayesian optimization Fedorov et al. (2019) and evolutionary search Lin et al. (2020), MicroNets are optimized for MCU inference performance using differentiable neural architecture search (DNAS).

The contributions of this work are summarized below.

  • Using an extensive characterization of NN inference performance on three representative MCUs, we demonstrate that the number of operations is a viable proxy for inference latency and energy.

  • We show that differentiable neural architecture search (DNAS) with appropriate constraints can be used to automatically construct models that fit the MCU resources, while maximizing performance and accuracy.

  • We provide state of the art models for all three TinyML tasks, deployable on standard MCUs using TFLM.

2 Related Work

Since its inception, deep learning has been synonymous with expensive, power-hungry GPUs Krizhevsky et al. (2012). However, the current interest in deploying deep learning models on MCUs is reflected in a small number of papers that have begun to explore this promising space. In this section, we briefly survey the literature related to TinyML, divided between hardware, software and machine learning.

Hardware The current interest in ML has led to a growing demand for arithmetic compute performance in MCU platforms, which was previously driven by digital signal processing workloads. Single-instruction multiple-data (SIMD) extensions Hennessy and Patterson (2011) are one of the most effective approaches to achieving this in the CPU context, but they increase silicon area and power consumption. The Arm Helium extensions address this with a lightweight SIMD implementation targeted at MCUs. Beyond CPUs, various accelerators Whatmough et al. (2019) and co-processors such as digital signal processors (DSPs) Efland et al. (2016) and micro neural processing units (microNPUs) such as the Arm Ethos-U55 typically offer greater performance and energy efficiency, at the cost of a more complex and less portable programming model. Finally, subthreshold circuit operation is another notable technology for increasing energy efficiency, which has been demonstrated in commercial MCUs such as those from Ambiq. In this work we specifically target commodity MCUs (Table 1), and expect results to improve with new hardware generations.

Figure 2: Breakdown of SRAM and eFlash memory occupancy for a KWS model on the TFLM runtime on the STM32F746ZG.

Software A critical element for TinyML is the inference software stack. MCU vendors typically provide low-level libraries with optimized primitives for basic NN operators like convolution, such as the CMSIS-NN Lai et al. (2018a) kernels for Arm Cortex-M devices. Alternatively, MicroTVM automatically generates low-level kernels. These kernels need to be stitched together to implement a neural network inference graph. Popular ML frameworks like TensorFlow and PyTorch are unsuitable for inference on MCUs, as their memory requirements are too large. A number of ML inference runtimes have emerged to fill this need on MCUs. There are two fundamental approaches: code generation and interpretation. The code generation approach takes a model definition and directly generates C code; this typically gives the best results, but the generated code is not portable between platforms. Examples include uTensor, tinyEngine Lin et al. (2020), and the Embedded Learning Library (ELL). In contrast, TensorFlow Lite for Microcontrollers (TFLM) is an interpreter-based runtime for executing TensorFlow Lite graphs on MCUs. TFLM supports most common NN layers, with the notable exception of recurrent networks. It is widely supported by hardware vendors and offers many optimized back-end kernels for specific platforms. Compared to code-generation-based methods, TFLM is more portable but incurs some overheads. We use TFLM due to its portability, ease of deployment and open-source nature.

Machine Learning The challenges of implementing CNNs on MCUs were discussed in Bonsai Kumar et al. (2017), namely that the feature maps of typical NNs require prohibitively large SRAM buffers. As a more storage-efficient alternative to CNNs, pruned decision trees were proposed to suit the smallest MCUs with as little as 2KB of SRAM. Gupta et al. (2017) propose a variant of k-nearest neighbors tailored for MCUs. Gural and Murmann (2019) propose a novel convolution kernel that reduces activation memory and enables inference on low-end MCUs. SpArSe Fedorov et al. (2019) demonstrated that, by optimizing the model architecture, CNNs can in fact be deployed on MCUs with SRAM down to 2KB. This was achieved using NAS, which has emerged as a vibrant area of research whereby ML algorithms construct application-specific NNs to meet very specific constraints Elsken et al. (2019). SpArSe employs a Bayesian optimization framework that jointly selects the model architecture and optimizations such as pruning. Similarly, MCUNet Lin et al. (2020) uses evolutionary search to design NNs for larger MCUs (2MB eFlash / 512KB SRAM) and larger datasets, including visual wake words (VWW) Chowdhery et al. (2019) and keyword spotting (KWS) Warden (2018). Reinforcement learning (RL) has also been used to choose quantization options in order to help fit an ImageNet model onto a larger MCU (2MB eFlash) Rusci et al. (2020a). As well as images, audio tasks are an important driver for TinyML. TinyLSTMs Fedorov et al. (2020) shows that LSTMs for speech enhancement in smart hearing aids are similarly amenable to deployment on MCUs, after targeted optimization.

In this paper, we use differentiable NAS (DNAS) Liu et al. (2019) to design specialized MCU models for the three TinyMLPerf tasks. Unlike the black-box optimization methods previously applied to TinyML problems, such as Bayesian optimization Fedorov et al. (2019) and evolutionary search Lin et al. (2020), DNAS uses gradient descent and lends itself to straightforward implementation in modern auto-differentiation software like TensorFlow, with acceleration on GPUs. Our work provides experimental evidence that DNAS is capable of satisfying MCU-specific model constraints, including eFlash, SRAM, and latency. In contrast to Lin et al. (2020), our work uses a standard deployment framework (TFLM).

3 Hardware Characterization

3.1 System Overview

In this section we characterize the performance of NN inference workloads on MCUs. The MCUs we use (Table 1) are fairly self-contained, consisting of an Arm Cortex-M processor, SRAM for working memory, embedded flash for non-volatile program storage, and a variety of digital and analog peripherals. Unlike their mobile, desktop and datacenter counterparts, MCUs have a rather flat memory system, as illustrated in Figure 1. Mobile and cloud computer systems universally employ a large off-chip main memory (usually DRAM). However, MCUs are typically equipped with only on-chip memory, which is relatively small to keep the die size reasonable. Figure 2 gives an example memory map showing how a KWS model is mapped onto the STM32F746ZG device by TFLM. Activation buffers are allocated in the SRAM, while the model weights, biases, and graph definition are allocated in the eFlash memory. Alternatively, weights can be stored in SRAM, but we found experimentally that this results in only about a 1% speedup in end-to-end latency, while significantly reducing the space available for activations, which cannot be stored in eFlash.

In terms of throughput, this flat memory system, coupled with the lower clock frequencies and simple (cheap) microarchitectures used in MCUs, results in a predominantly compute-bound system. The Cortex-M7 can dual issue load and ALU instructions, which the Cortex-M4 cannot. This gives higher IPC, which, combined with a 20% higher clock rate, makes the STM32F746ZG and the STM32F767ZI approximately twice as fast as the STM32F446RE.

Note that the runtime overhead for the TFLM interpreter is fairly minimal, requiring just 4KB of SRAM and 37KB of eFlash. The 34KB SRAM block labeled as persistent buffers in Figure 2 scales with the size of the model and contains buffered quantization parameters and the C structs that hold pointers to the intermediate tensors and to the operators.

3.2 Layer Latency

Figure 3: Measured latency of a range of different individual layer types and sizes on the STM32F767ZI using TFLM. Different layers can exhibit a spread in latencies for the same ops count, due to variations in, for example, data reuse and IM2COL overheads.

In this section, we examine the hardware performance of typical NN layers. To do this, we generate a large number of layer types [1] and sizes and characterize them on the hardware. Figure 3 shows the measured latency of each layer using TFLM with CMSIS-NN kernels, as a function of the number of operations [2] (ops). We observe that different layer types and sizes result in some spread in throughput, as previously observed by Lai et al. (2018b). 2D convolutions and fully connected layers exhibit lower latency per op than depthwise convolutions. This is likely because depthwise convolutions have fewer operations relative to their IM2COL overhead. We also note some variability in ops/s between 2D convolution layers. This is primarily caused by the sensitivity of the CMSIS-NN kernel to the input and output channel sizes: the CMSIS-NN CONV 2D kernel is substantially faster when the number of input and output channels is divisible by four. As an example, we observe that increasing the input/output channels of a convolution layer from 138/138 to 140/140 decreases the latency from 37.5ms to 21.5ms (a 1.7x speedup).
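For concreteness, the op counts on the x-axis of Figure 3 can be computed directly from layer dimensions, counting each multiply-accumulate as two operations (footnote 2). The sketch below uses illustrative layer shapes, not layers from our search spaces:

```python
# Op counting convention used in this paper: 1 multiply-accumulate (MAC) = 2 ops.
# The layer shapes below are illustrative examples only.

def conv2d_ops(h_out, w_out, c_in, c_out, k):
    """Standard 2D convolution: each output element requires k*k*c_in MACs."""
    return 2 * h_out * w_out * c_out * k * k * c_in

def depthwise_conv2d_ops(h_out, w_out, c, k):
    """Depthwise convolution: one k*k filter per channel, no c_in*c_out product."""
    return 2 * h_out * w_out * c * k * k

def fully_connected_ops(n_in, n_out):
    return 2 * n_in * n_out

# Example: 3x3 convolutions on a 32x32 feature map with 64 channels.
print(conv2d_ops(32, 32, 64, 64, 3) / 1e6)        # ~75.5 Mops
print(depthwise_conv2d_ops(32, 32, 64, 3) / 1e6)  # ~1.2 Mops
```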

3.3 Model Latency

Figure 4: Measured latency of whole models randomly sampled from two backbones, on the STM32F446RE and STM32F746ZG. Models sampled from a given search space exhibit latency linear with ops, despite the variation seen with individual layers.

Next, we characterize whole end-to-end models. To do this, we set up a parameterized supernet backbone that we randomly sample. This allows us to automatically generate a large number of random models with different layer types and dimensions, which we then characterize on the hardware in terms of latency and, in the next subsection, energy. Figure 4 shows measured model latency on the STM32F446RE and the STM32F746ZG. Measurements are shown for random models sampled from backbones tailored to two different tasks, viz. image classification (CIFAR10) and audio classification (KWS).

Interestingly, the measured latency for the end-to-end models is linear with op count. This is perhaps surprising given the variation seen in the layer-wise latency measurements (Figure 3). We also observe that models sampled from the two different backbones result in different slopes. The explanation is that although single layers exhibit variation in latency as a function of ops, in a whole model this is averaged across many layers. Since a given search space will typically be dominated by a particular layer type, in terms of ops, the result is that latency is linear in ops for models sampled from the same backbone. The KWS backbone has 40% higher throughput (Mops/s) than the CIFAR10 backbone, due to the mix of layer types and sizes. Finally, the STM32F746ZG is around twice as fast as the STM32F446RE (Section 3.1).
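Because latency is linear in ops within a backbone, a simple least-squares fit over a few measured models is enough to predict the latency of any other candidate sampled from the same search space on the same MCU. A minimal sketch (the measurements below are placeholders, not our data):

```python
import numpy as np

# (ops in Mops, measured latency in ms) for models sampled from one backbone
# on one MCU. Placeholder values for illustration only.
ops = np.array([10.0, 25.0, 40.0, 80.0, 120.0])
latency_ms = np.array([18.0, 42.0, 66.0, 131.0, 195.0])

# Fit latency ~= slope * ops + intercept.
slope, intercept = np.polyfit(ops, latency_ms, deg=1)
throughput_mops_per_s = 1000.0 / slope  # effective Mops/s of this backbone/MCU pair

def predict_latency_ms(candidate_ops_mops):
    return slope * candidate_ops_mops + intercept

print(f"~{throughput_mops_per_s:.0f} Mops/s; 60 Mops -> {predict_latency_ms(60.0):.1f} ms")
```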

3.4 Model Energy

Energy consumption is obviously a critical metric for TinyML. Following the same random model sampling methodology used to characterize latency, we measured the current consumption of 400 models from the CIFAR10 backbone. We use the Qoitech Otii Arc Otii () to power the MCU boards and measure the current draw with the inference workload looping. Figure 5 shows the average power consumption versus the op count of each model on two MCUs. Clearly, there is little variance in power consumption between models, i.e. power is essentially independent of model size or architecture. Additionally, Figure 5 shows the energy consumption versus the op count of each model. We observe that executing the same model on a smaller MCU reduces the total energy consumption despite an increase in latency. This decreased energy consumption motivates the design of models that can fit within the tighter constraints of smaller devices.

Figure 5: Measured energy and power of models randomly sampled from an image classification CNN backbone. MCUs have simple microarchitectures and memory systems, and hence power is fairly constant. Therefore, energy is largely determined by latency, which is in turn a linear function of model ops.

3.5 Summary

Below is a summary of the findings of this section:

  • Although ops is not a good predictor for the latency of a single layer, it is a viable proxy for the latency of an entire model sampled from a given backbone.

  • For a given MCU, power is largely independent of model size and design. Therefore, energy per inference is a function of the size of the MCU, which determines power, and the number of ops, which dictates latency.

Therefore, when designing a model from within a backbone for a given task, ops is a viable proxy for both latency and energy, as measured on the target hardware and software.
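Combining the two observations, energy per inference can be estimated from op count alone once the average power and effective throughput of a device are known. A small sketch; the power figures loosely follow Table 1, while the throughput numbers are purely illustrative placeholders:

```python
def energy_per_inference_mj(model_mops, avg_power_mw, throughput_mops_per_s):
    """Energy (mJ) = average power (mW) * latency (s), with latency ~= ops / throughput."""
    return avg_power_mw * (model_mops / throughput_mops_per_s)

# Illustrative comparison of a 50 Mops model on a small vs. a medium MCU.
# Power roughly follows Table 1 (0.1 W vs. 0.3 W); throughput values are placeholders.
print(energy_per_inference_mj(50, avg_power_mw=100, throughput_mops_per_s=150))  # small MCU
print(energy_per_inference_mj(50, avg_power_mw=300, throughput_mops_per_s=300))  # medium MCU
```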

4 TinyMLPerf Benchmark Tasks

This section describes the TinyMLPerf benchmark tasks: Visual Wake Words (VWW), Keyword Spotting (KWS), and Anomaly Detection (AD). These were selected by a committee from industry and academia, to represent common TinyML application domains Banbury et al. (2020).

4.1 Visual Wake Words

The VWW dataset used in TinyMLPerf is a visual classification task, where each image is labeled as person if a person occupies at least 0.5% of the frame and as non-person when no person is present Chowdhery et al. (2019). The dataset contains 82,783 train and 40,504 test images, which we resize to a common resolution. We use the standard ImageNet data preprocessing pipeline.

4.2 Audio Keyword Spotting

Audio KWS (a.k.a. wake words) finds application in a plethora of commercial IoT products (e.g. Google Assistant, Amazon Alexa, etc.). Recent research has explored various model architectures suitable for resource-constrained devices Chen et al. (2019); Wong et al. (2020); Kusupati et al. (2018); Thakker et al. (2019). Among these, CNNs achieve good accuracy Choi et al. (2019); Anderson et al. (2020); Zhang et al. (2017c); Gope et al. (2019) and have the advantage of being deployable on commodity hardware using existing software stacks. The KWS dataset in TinyMLPerf is Google Speech Commands (V2) Warden (2018). A model trained on this dataset must classify a one-second incoming audio clip as one of a set of target keyword classes, a "silence" class (i.e. no word spoken), or an "unknown" class covering the remaining words in the dataset. The raw time-domain speech signal is converted to 2-D MFCC (Mel-frequency cepstral coefficient) features, computed over short overlapping speech frames to yield a time-frequency input for each second of audio. Training samples are augmented by applying background noise and random timing jitter to provide robustness against noise and alignment errors. We follow the input data processing procedure described in Zhang et al. (2017b); Mo et al. (2020) for training the baselines and other DNAS variants.

4.3 Anomaly Detection

Anomaly detection is a task that classifies temporal signals as "normal" or "abnormal". It finds numerous applications, including in industrial factories, where it is deployed on smart sensors to monitor equipment and detect problems. The dataset used for anomaly detection in TinyMLPerf is MIMII (Slide Rail) Purohit et al. (2019), a dataset of industrial machine sounds operating under normal or anomalous conditions, recorded in real factory environments. The original dataset contains several machine types, but we focus on the Slide Rail task selected in the TinyMLPerf benchmark.

Anomaly detection is an unsupervised learning problem. The model only sees "normal" samples at training time and is expected to make predictions on a mix of normal and abnormal cases at test time. Many unsupervised learning methods can be applied; however, inspired by state-of-the-art solutions Giri et al. (2020), we reformulate the problem as a self-supervised Hendrycks et al. (2019) learning problem, so that it can be handled in a similar way to the other two tasks. The essential idea is to leverage the machine ID metadata provided in this dataset. The training dataset contains 4 different machine IDs, each corresponding to a different slide machine for which the audio is recorded. We train a classifier in a supervised way to identify the machine ID given the audio as input. The classifier needs to learn useful information about the normal operating sound of these machines in order to tell them apart, which can then be used to detect anomalies. At test time, we use the softmax score for the test sample's machine ID as an index of how confident the classifier is that the test sample falls within the normal operating regime on which it was trained. Its negative can therefore be used as an anomaly score (higher meaning more likely to be abnormal). The area under the curve (AUC) metric from the receiver operating characteristic (ROC) is calculated using this anomaly score.
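The scoring step is simple to express in code. Below is a minimal sketch, assuming a trained Keras machine-ID classifier `model` (a placeholder name) and using scikit-learn only for the AUC computation:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def anomaly_scores(model, spectrograms, machine_ids):
    """Negative softmax probability of the clip's own machine ID = anomaly score."""
    probs = model.predict(spectrograms)                        # shape (N, num_machine_ids)
    p_true_id = probs[np.arange(len(machine_ids)), machine_ids]
    return -p_true_id                                          # higher => more anomalous

# labels: 1 for anomalous clips, 0 for normal clips (test set only).
# scores = anomaly_scores(model, test_spectrograms, test_machine_ids)
# auc = roc_auc_score(labels, scores)
```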

Data preprocessing is done in a similar way as for KWS: the audio signal is transformed into log-Mel spectrograms, which are then input to a CNN classifier. An audio clip of length 10s is split into overlapping frames of length 64ms with a stride (hop length) of 32ms between frames, and 64 MFCC features are extracted for each frame. The preprocessed dataset is available on Kaggle Kaggle AD (). We then stack 64 frames together to get 64x64 images, with successive images overlapping by 44 frames. We found that CNNs can tolerate even lower resolution spectrograms, so each image is further down-sampled to 32x32 using bilinear interpolation. This is the input to our CNN classifier.
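The framing and stacking described above can be sketched as follows. We use librosa and TensorFlow here purely for illustration; the sample rate and exact feature-extraction parameters are assumptions rather than the benchmark's reference code:

```python
import numpy as np
import librosa
import tensorflow as tf

SR = 16000  # assumed sample rate; the dataset's actual rate may differ

def clip_to_examples(audio, n_mels=64, frame_ms=64, hop_ms=32,
                     frames_per_image=64, image_hop=20):
    """10 s clip -> log-Mel spectrogram -> stacked 64x64 images -> 32x32 inputs.
    image_hop = 64 - 44 overlapping frames between successive images."""
    mel = librosa.feature.melspectrogram(
        y=audio, sr=SR, n_fft=int(SR * frame_ms / 1000),
        hop_length=int(SR * hop_ms / 1000), n_mels=n_mels)
    log_mel = librosa.power_to_db(mel)                        # (n_mels, n_frames)
    images = []
    for start in range(0, log_mel.shape[1] - frames_per_image + 1, image_hop):
        img = log_mel[:, start:start + frames_per_image]      # (64, 64)
        img = tf.image.resize(img[..., None], (32, 32))       # bilinear by default
        images.append(img.numpy())
    return np.stack(images)                                    # (N, 32, 32, 1)
```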

5 MicroNet Models

5.1 Optimizations

We use DNAS to discover models which are highly accurate, while also satisfying SRAM, eFlash, and latency constraints. In the following, we briefly review DNAS and how it can be applied to ML model design for MCUs. For further information, we refer the reader to Liu et al. (2019); Cai et al. (2019); Dong and Yang (2019); Wan et al. (2020). The search begins with the definition of a supernet consisting of decision nodes. The output of a decision node expresses a choice between K options:

y = \sum_{i=1}^{K} z_i \, o_i(x; w_i), \qquad z \in \{0,1\}^K, \quad \sum_{i=1}^{K} z_i = 1    (1)

where x is the input tensor, o_i is the operation executed by choice i and parameterized by weights w_i, K is the total number of options for the decision node, and z represents the selection of one of the K options. The goal of the search is to select z for all of the decision nodes in the supernet. In the present work, we restrict our search to the width of each layer in the supernet. In this case, each option represents an operation with a different number of channels Wan et al. (2020).
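As a concrete illustration of Equation (1), the sketch below implements a width-choice decision node in TensorFlow with a simple softmax relaxation of z. Actual DNAS implementations differ in details (e.g., Gumbel-softmax sampling and weight sharing across options, as in FBNetV2), and a hard one-hot z is recovered by an argmax at the end of the search:

```python
import tensorflow as tf

class WidthChoice(tf.keras.layers.Layer):
    """Decision node from Eq. (1): y = sum_i z_i * o_i(x; w_i), where each option
    o_i is a convolution with a different number of output channels.
    Illustrative sketch only; not the exact implementation used in the paper."""

    def __init__(self, channel_options=(32, 48, 64), kernel_size=3):
        super().__init__()
        self.options = [tf.keras.layers.Conv2D(c, kernel_size, padding="same")
                        for c in channel_options]
        self.max_c = max(channel_options)
        # Architecture parameters alpha; z = softmax(alpha) during the search.
        self.alpha = self.add_weight(name="alpha", shape=(len(channel_options),),
                                     initializer="zeros", trainable=True)

    def call(self, x):
        z = tf.nn.softmax(self.alpha)
        outs = []
        for o in self.options:
            y = o(x)
            # Zero-pad channels so every option produces the same output shape.
            pad = self.max_c - y.shape[-1]
            outs.append(tf.pad(y, [[0, 0], [0, 0], [0, 0], [0, pad]]))
        return tf.add_n([z[i] * outs[i] for i in range(len(outs))])
```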

Optimizing for MCU Memory

Without any model constraints, DNAS may produce models which violate one or more MCU hardware limits. Given the model size and the size of the intermediate activations produced by modern NNs, eFlash and SRAM play an important role in model design. We incorporate appropriate regularization terms in our DNAS experiments such that the selected models both fit in eFlash memory and produce activations which can fit in available SRAM. For model size considerations, we express the size of a particular selection from the supernet using

S_{\text{node}}(z) = \sum_{i=1}^{K} z_i \, |w_i|    (2)

where |w_i| denotes the cardinality (number of parameters) of w_i. Summing the size of each node, we obtain the size of the supernet as a function of the decision parameters z of each decision node, which we use to regularize the DNAS such that the selected architecture meets the MCU eFlash constraint.

In order to ensure that the selected architecture satisfies SRAM constraints, we adopt the working memory model from Fedorov et al. (2019), which states that the working memory required for a particular node with inputs {x_j} and outputs {y_k} is given by

M_{\text{node}} = \sum_{j} |x_j| + \sum_{k} |y_k|    (3)

For tensors which are outputs of decision nodes, we replace |y_k| by its expected size under z, analogous to (2). The total model working memory is then defined as the maximum over the working memory of every network node, which we include in the DNAS objective function such that the discovered architecture meets the MCU SRAM constraint. We define the constraint as the available SRAM minus the expected TFLM overhead.
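The memory terms above translate directly into differentiable quantities of the relaxed decision variables. The sketch below illustrates Equations (2) and (3) together with a hinge-style penalty; the exact penalty form and weights used in our runs are not spelled out here, so treat this as an assumption-laden sketch:

```python
import tensorflow as tf

def expected_option_size(z, option_param_counts):
    """Eq. (2): expected parameter count of a decision node, sum_i z_i * |w_i|."""
    return tf.reduce_sum(z * tf.constant(option_param_counts, tf.float32))

def working_memory(input_sizes, output_sizes):
    """Eq. (3): working memory of a node = sum of its input and output tensor sizes
    (for 8-bit tensors, size in bytes equals the element count)."""
    return sum(input_sizes) + sum(output_sizes)

def memory_penalty(model_bytes, peak_working_bytes, eflash_limit, sram_limit,
                   weight=1e-5):
    """Hinge penalties added to the DNAS loss when the relaxed model exceeds the
    eFlash or SRAM budgets. The weight is a placeholder, not our exact value."""
    over_flash = tf.nn.relu(model_bytes - eflash_limit)
    over_sram = tf.nn.relu(peak_working_bytes - sram_limit)
    return weight * (over_flash + over_sram)
```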

Optimizing for Latency

In addition to making sure that discovered models are deployable, we also incorporate a latency constraint into our DNAS experiments. Due to the (almost) linear relationship between latency and the number of operations for ML inference on MCUs (Section 3), we treat the operation count as a strong proxy for latency during optimization. As with memory, we begin by defining the operation count of each decision node as a function of the decision vector z:

C_{\text{node}}(z) = \sum_{i=1}^{K} z_i \, c_i    (4)

where c_i is the number of ops required to execute option i. Note that the number of operations for a particular option typically depends on the input and output tensor sizes, which are themselves a function of the decision parameters Wan et al. (2020).
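Equation (4) and the overall regularized search objective can be sketched in the same style; the weighting scheme below is a placeholder rather than our exact recipe:

```python
import tensorflow as tf

def expected_ops(z, option_ops):
    """Eq. (4): expected op count of a decision node, sum_i z_i * c_i."""
    return tf.reduce_sum(z * tf.constant(option_ops, tf.float32))

def dnas_loss(task_loss, total_ops, ops_target, mem_penalty, ops_weight=1e-8):
    """Overall search objective: task loss plus memory and latency (op-count)
    regularizers. The penalty form and weights are illustrative placeholders."""
    latency_penalty = ops_weight * tf.nn.relu(total_ops - ops_target)
    return task_loss + mem_penalty + latency_penalty
```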

Sub-Byte Quantization

The predominant datatype for NN inference on microcontrollers is 8-bit integer. The use of smaller 4-bit datatypes Esser et al. (2020); Park and Yoo (2020); Banner et al. (2019) for weights (and/or activations) allows for more parameters (and/or feature maps), potentially realizing higher accuracy in the same memory footprint. However, current MCUs do not natively support sub-byte datatypes, so they must be emulated using wider, natively supported instructions. We investigated the benefit of 4-bit quantization on the KWS task.

Currently, CMSIS-NN Lai et al. (2018a) does not provide convolution operators for 4-bit values. Therefore, we developed optimized kernels for 4-bit datatypes and incorporated them into CMSIS-NN for use in our experiments. This allows our DNAS to expand the search space to fit models with more weights and/or activations, potentially achieving higher accuracy in the same memory footprint. The unpacking and packing routines required to emulate hardware support for 4-bit values using the native wider operations add negligible latency overhead. These optimized kernels can efficiently support sub-byte quantization on weights, activations, or both. Prior work on mixed-precision inference (CMix-NN Capotondi et al. (2020)) supports neither operations on signed sub-byte weight and activation values nor non-modulo-4 feature-map channel numbers, and is therefore not compatible with the current CMSIS-NN software and TFLM runtime stack. We anticipate that future MCUs may provide native hardware support for 4-bit datatypes, further increasing the value of this research direction.
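The on-device kernels are written in C as CMSIS-NN extensions; the Python sketch below only illustrates the nibble packing format that such kernels emulate (two signed 4-bit values per byte):

```python
import numpy as np

def pack_int4(weights_int4):
    """Pack signed 4-bit weights (values in [-8, 7]) two per byte.
    Assumes an even number of weights; illustrative of the on-device format."""
    w = (weights_int4.astype(np.int8) & 0x0F).astype(np.uint8).reshape(-1, 2)
    return (w[:, 0] | (w[:, 1] << 4)).astype(np.uint8)

def unpack_int4(packed):
    """Recover signed 4-bit values by splitting and sign-extending each nibble."""
    lo = (packed & 0x0F).astype(np.int8)
    hi = ((packed >> 4) & 0x0F).astype(np.int8)
    vals = np.stack([lo, hi], axis=1).reshape(-1)
    vals[vals >= 8] -= 16  # sign extension for 4-bit two's complement
    return vals

# Round trip check on illustrative values.
w = np.array([-8, 7, -3, 0, 5, -1], dtype=np.int8)
assert np.array_equal(unpack_int4(pack_int4(w)), w)
```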

5.2 DNAS Backbones and Training Recipes

DNAS requires a backbone supernet to be defined as the starting point for the search. The design of the backbone is an important step which requires human experience of the network operators and connectivity patterns that work well for a given task. If the backbone is too large, the supernet will not fit in GPU memory. On the other hand, if the backbone is too small, it may not provide a rich enough search space within which to find models that satisfy the constraints. In this section, we describe the backbones used for the TinyMLPerf tasks and the training methodology.

Visual Wake Words (VWW)

We use a MobileNetV2 Sandler et al. (2018) backbone, consisting of a series of inverted bottleneck (IBN) blocks. Each IBN block includes the sequence: pointwise (1x1) conv, depthwise conv, pointwise (1x1) conv. We restrict our search space to the width of the first and last convolutions in each IBN, as well as the convolutions preceding and following the sequence of IBN blocks. For each convolution, we choose among a range of fractions of the width of the corresponding layer in MobileNetV2. In order to ensure that the input itself does not violate the SRAM constraint, we resize the input images to smaller resolutions for the small (STM32F446RE) and medium (STM32F746ZG) sized MCUs, respectively. Note that we convert the RGB images to grayscale, such that the input only has one channel, in order to trade off color resolution for spatial resolution Fedorov et al. (2019); Chowdhery et al. (2019).

We run DNAS for 200 epochs with a batch size of 768, decaying the learning rate with a cosine schedule. We use quantization-aware training Krishnamoorthi (2018) to emulate 8-bit quantization of both weights and activations during training. Discovered architectures are finetuned for 200 epochs with the same learning rate schedule, weight decay, and knowledge distillation Hinton et al. (2015) using MobileNetV2 as the teacher.
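For reference, the distillation objective follows the standard formulation of Hinton et al. (2015); the sketch below uses placeholder values for the distillation coefficient and temperature rather than our exact settings:

```python
import tensorflow as tf

def distillation_loss(student_logits, teacher_logits, labels,
                      alpha=0.5, temperature=4.0):
    """Standard knowledge-distillation loss (Hinton et al., 2015).
    alpha and temperature are placeholders, not the paper's exact values."""
    hard = tf.keras.losses.sparse_categorical_crossentropy(
        labels, student_logits, from_logits=True)
    soft = tf.keras.losses.kl_divergence(
        tf.nn.softmax(teacher_logits / temperature),
        tf.nn.softmax(student_logits / temperature)) * temperature ** 2
    return (1.0 - alpha) * hard + alpha * soft
```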

Figure 6: VWW architectures discovered by DNAS targeting (a) the small (STM32F446RE) and (b) the medium (STM32F746ZG) MCUs. The two numbers following IBN denote the number of expansion and compression filters. Tensor dimensions are provided in black text.

Keyword Spotting (KWS)

After experimenting with different architectures on the KWS task, we settled on an enlarged DS-CNN(L) Zhang et al. (2017b) model as the backbone for DNAS. The backbone is built by adding four more depthwise-separable blocks with 276 output channels to the largest variant of DS-CNN. A skip connection branch (average pooling if the parallel convolutional block downsamples the input) is also added in parallel to each depthwise-separable block in the backbone to create shortcuts for choosing the number of layers. We use DNAS to choose the number of channels and the number of layers in this backbone network, while trying to satisfy the hardware constraints. The number of channels is restricted to multiples of 4 for good performance on hardware. For the small and medium models, the constraints were set to achieve 10FPS and 5FPS on the medium (STM32F746ZG) board while also fitting the smallest (STM32F446RE) board; this is thus a combination of latency and working memory constraints. For the large model, we target a latency of less than one second, in order to achieve real-time throughput.

DNAS is run for 100 epochs with a batch size of 512, decaying the learning rate with a cosine schedule, and weight decay is applied. Additionally, we quantize the weights, activations and input to 8 bits using fake quantization nodes to simulate deployment. The ranges of the quantizers are learned with gradient descent. We observe that model accuracy at the end of DNAS is often very good and no further fine-tuning is needed.
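The fake quantization with learned ranges can be sketched as follows, using a straight-through estimator for the rounding step; this is a simplified illustration (LSQ-style gradient scaling is omitted) rather than our exact implementation:

```python
import tensorflow as tf

def round_ste(v):
    """Round with a straight-through gradient (identity in the backward pass)."""
    return v + tf.stop_gradient(tf.round(v) - v)

def fake_quant(x, scale, num_bits=8):
    """Simulated integer quantization with a learnable scale.
    Gradients flow to x through the clip, and to scale through the divide/multiply."""
    qmin = -2 ** (num_bits - 1)
    qmax = 2 ** (num_bits - 1) - 1
    v = tf.clip_by_value(x / scale, qmin, qmax)
    return round_ste(v) * scale
```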

Anomaly Detection (AD)

For AD, the model operates on spectrograms of audio signals in a similar way to KWS, so it makes sense for the two tasks to share similar backbone networks. Hence, the backbone network we used for AD is DS-CNN(L), with parallel skip connections (or average pooling when downsampling) used to skip layers. The strides of the last two depthwise-separable blocks are increased to 2 to downsample the input patch to 4x4 before applying the final pooling. DNAS searches for the channel numbers and the total number of layers to meet the hardware deployment constraints. An anomaly detection system is expected to run in real time for continuous monitoring, and should therefore take less time than the stride between two successive spectrogram images (accounting for their overlap). In our setting, this latency cutoff is 20 frames x 32 ms = 640 ms. This latency constraint, together with the SRAM limit of each board, is used as a constraint in our DNAS runs.

We use the same set of DNAS hyperparameters as for KWS, except that we only train for 50 epochs, as convergence is faster on this task. We also apply mixup Zhang et al. (2017a) augmentation to avoid overfitting. We experimented with the spectral warping augmentations suggested in Giri et al. (2020) but did not observe benefits in our setting. It is likely that since our models are relatively parameter-efficient and we use quantization-aware training, less data augmentation is required.
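For completeness, mixup simply trains on convex combinations of example pairs and their labels; the sketch below uses a placeholder mixing coefficient rather than our exact value:

```python
import numpy as np
import tensorflow as tf

def mixup_batch(images, labels_onehot, alpha=0.2):
    """Mixup (Zhang et al., 2017a): train on convex combinations of example pairs.
    alpha is a placeholder, not necessarily the coefficient used in the paper."""
    lam = np.random.beta(alpha, alpha)
    idx = tf.random.shuffle(tf.range(tf.shape(images)[0]))
    mixed_x = lam * images + (1.0 - lam) * tf.gather(images, idx)
    mixed_y = lam * labels_onehot + (1.0 - lam) * tf.gather(labels_onehot, idx)
    return mixed_x, mixed_y
```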

6 Results

6.1 Methodology

To deploy our models, we convert them to the TFLite format and then execute them on each MCU using the TFLM runtime. The eFlash occupancy is determined using the Mbed compiler Mbed OS () and the SRAM consumption is obtained using the TFLM recording memory APIs. We measure latency on the MCU using the Mbed Timer API.

6.2 Visual Wake Words (VWW)

Figure 8 compares our DNAS results (MicroNets) to three state-of-the-art results: ProxylessNAS Cai et al. (2019), MSNet Cheng et al. (2019), and the TFLM example model Chowdhery et al. (2019). The largest network in our search space is MobileNetV2 itself. The MicroNet models are visualized in Figure 6. We found that the model produced by targeting the medium MCU nearly matched the accuracy of MobileNetV2, obviating the need to search for a large-MCU specific model. MicroNets are Pareto-optimal for the small and medium sized MCUs. For the small MCU, our MicroNet is more accurate than the TFLM reference, the only other network considered which can be deployed on the small MCU with TFLM, while also being faster (Table 4). For the medium MCU, our MicroNet model was the only model considered that could be deployed on that MCU.

Figure 7: KWS results. The DNAS search targets the smallest MCU for both small and medium models. MicroNet medium model is more accurate than DS-CNN(L) and runs faster. The largest MobileNetV2 variant model does not fit and is omitted.
Figure 8: VWW results. We target DNAS to find models which can be deployed on the small and medium MCUs. Our search yields MicroNets which are pareto optimal.

The limitations of previous TinyML models are clear in Figure 8: they do not fully exploit the precious memory resources. For instance, the ProxylessNAS model easily fits in the flash memory of the smallest MCU (STM32F446RE), but requires the largest MCU (STM32F767ZI) to fit the activations in SRAM. Therefore, ProxylessNAS will only run on the large MCU. MSNet shows similar characteristics. These limitations underline the motivation for DNAS-optimized models that target a specific MCU size.

MicroNet models are not tight against the SRAM and flash constraints for a number of reasons. TFLM has to schedule every graph and perform memory management, which leads to some variability in the resulting memory footprint. Therefore, we estimate the maximum model size and activation footprint possible for a given hardware platform by performing some experiments with TFLM. However, the final binary size is still somewhat dependent on the graph itself, which prevents us from tightly meeting the constraints. In a real application context there will also be application logic and potentially even a real-time operating system (RTOS) Mbed OS (), which will take additional eFlash memory resources that must be budgeted into the model constraints.

6.3 Keyword Spotting (KWS)

The results on KWS are shown in Figure 7, where we compare our results (MicroNet) to DS-CNNs Zhang et al. (2017b) and models built by stacking MobileNetV2 Sandler et al. (2018) bottleneck blocks. Few results are currently available for version 2 of the Google Speech Commands dataset; we therefore train baseline models for comparison. We trained all the models with exactly the same training recipe and quantized them to 8-bit weights and activations (including the input) before measuring their accuracy. MicroNet models are Pareto-optimal for latency, SRAM usage and model size. The MicroNet small and medium models also come very close to the latency constraints set for them, achieving 9.2FPS and 5.4FPS on the medium sized MCU with accuracies of 93.2% and 94.2% respectively (Table 4), while being deployable on the smallest MCU. A detailed description of the MicroNet KWS model architectures can be found in the Appendix.

We can further leverage sub-byte quantization to make bigger but more accurate models deployable on smaller MCUs. Table 2 demonstrates the accuracy, latency, and SRAM memory trade-off of a 4-bit MicroNet discovered by DNAS, targeting the small MCU (STM32F446RE) on the KWS task. The 4-bit KWS MicroNet outperforms the 8-bit medium-sized model by 0.3%, because it is able to fit more weights and activations on the same small MCU board. Table 2 reports the latency of the different models measured on the medium MCU (STM32F746ZG). The increase in latency of the 4-bit model is primarily attributed to the increase in ops due to larger feature maps. Compared to the sub-byte kernels of CMix-NN Capotondi et al. (2020), our 4-bit kernels can substantially hide the latency overhead of software-emulated 4-bit operations by fully exploiting the available instruction-level parallelism (ILP) on Cortex-M microcontrollers. Furthermore, we believe the accuracy of the 4-bit KWS MicroNet can be further improved by selectively quantizing the lightweight depthwise layers at higher precision, while keeping the remaining memory- and latency-heavy pointwise and standard convolutional layers at 4 bits Rusci et al. (2020b); Gope et al. (2020).

6.4 Anomaly Detection (AD)

Results for AD are given in Table 3. To aid comparison, we use an "uptime" metric defined as model latency divided by the stride time between two successive inputs. For real-time AD, this ratio is the duty cycle of the MCU workload. Our models expect a longer stride between inputs than some other models, which therefore need to be run more often. This metric also relates directly to power consumption.

                   MN-KWS (L)      MN-KWS (M)      MN-KWS (S)
                   (8-b W/8-b A)   (8-b W/8-b A)   (4-b W/4-b A)
Accuracy (%)       95.3            94.2            94.5
Latency (s) (M)    0.59            0.18            0.66
Model Size (KB)    612             163             290
SRAM (KB)          208             103             112

Table 2: KWS results for 4-bit quantized MicroNet models.

The models obtained from DNAS are called MicroNet-AD. The three differently sized models each target a different sized MCU. Our solutions are compared with a baseline fully-connected auto-encoder (FC-AE) for anomaly detection Purohit et al. (2019), which has a 640-dimensional input, followed by 4 fully-connected hidden layers of 128 neurons each, a bottleneck layer of 8 neurons, 4 more fully-connected hidden layers of 128 neurons, and the output. This baseline model achieves 84.76% AUC and runs fast. However, once we try to scale it up for better anomaly detection performance, the size of the model quickly exceeds the flash limit of all MCUs used in this work. The wide FC-AE model, which scales the hidden layers up from 128 to 512 neurons, achieves only 87.1% AUC while exceeding 2MB in size at 8 bits, making it undeployable on our MCUs. A more parameter-efficient alternative to a fully-connected AE is a convolutional AE, which we also include in the comparison; however, convolutional AEs require the transposed convolution operator, which is not supported in TFLM.

We also compare with the MobileNetV2-0.5AD model trained in a similar self-supervised way to ours, which is a component of the winning solution of the DCASE2020 challenge DCASE () presented in Giri et al. (2020). Since the authors submitted ensembles of multiple classifiers, we take the average of the AUCs reported for ensembles containing MobileNetV2-0.5AD as an estimate of its accuracy. This model can only be deployed on the largest MCU because of its relatively large size (close to 1MB). The model is light in operation count, but its uptime requirement is worse than our solutions, since it expects a shorter time stride between inputs. Our large MicroNet model performs equally well in terms of AUC, requires less than half the flash, and consumes less compute in terms of uptime. The smallest MicroNet-AD model can be deployed on the small MCU, performing real-time AD with 95.35% AUC. As shown previously, the small MCU draws only a fraction of the power of the medium one, which is attractive since most of these tiny IoT devices run on batteries or need to be energy self-sufficient. MicroNet-AD model architectures can be found in the Appendix.

Models            | AUC (%) | Ops (M) | Size    | Mem     | Uptime (%)
MicroNet-AD (L)   | 97.28   | 129     | 442KB   | 383KB   | 95.9 (L)
MicroNet-AD (M)   | 96.22   | 124.7   | 453KB   | 274KB   | 94.8 (M)
MicroNet-AD (S)   | 95.35   | 37.5    | 247KB   | 114KB   | 71.4 (S)
FC-AE (Baseline)  | 84.76   | 0.52    | 270KB   | 4.7KB   | 10.3 (M)
FC-AE (Wide)      | 87.1    | 4.47    | 2.2MB   | 4.7KB*  | ND
Conv-AE           | 91.77   | 578     | 4.1MB*  | 160KB*  | ND
MBNETV2-0.5AD     | 97.24*  | 31.1    | 965KB   | 206KB   | 98.8 (L)

* denotes estimated value. ND denotes not deployable on MCU.

Table 3: AD results. The AUC for our models is reported with 8-bit weights/activations, while Conv-AE Ribeiro et al. (2020) and MBNETV2-0.5AD Giri et al. (2020) are reported for unquantized FP32. S, M and L denote small, medium and large MCU targets.

6.5 Comparison with State-of-the-Art

A key previous TinyML work is SpArSe Fedorov et al. (2019), which focuses on even smaller MCUs, with memory down to 2KB, but generally targets smaller datasets with smaller input dimensions than the industry-standard TinyMLPerf tasks used in our work. In a parallel line of work, MCUNet Lin et al. (2020) demonstrated SOTA MCU models using a framework that jointly designs the model architecture and a lightweight code-generation inference engine. Their latency and SRAM measurements rely on a closed-source software stack that is not available to us, so it is difficult to compare directly with their results. Nonetheless, our models are Pareto-optimal compared to MCUNet on the KWS task, even though we use a readily available, open-source software stack.

In support of this paper, we will also open source our models for all three TinyMLPerf tasks. We hope these models will be especially useful to MCU vendors and researchers as a set of standard models for benchmarking.

7 Conclusion

TinyML promises to enable a broad array of IoT applications, but is technically challenging. This is primarily due to the memory demands of deep neural network inference, which are in conflict with the limitations of MCUs. We start by analyzing measured MCU inference performance. The measurements demonstrate that for models sampled from a given network search space, the inference latency of the model is, in fact, linear with the total operation count. Since MCU power is largely independent of workload, operation count is also a strong proxy for energy per inference. Therefore, we use operation count as a proxy for both latency and energy, and set up a differentiable NAS to design a family of models called MicroNets. MicroNet models optimized for multiple MCUs demonstrate state-of-the-art performance on all three TinyMLPerf tasks: visual wake words, audio keyword spotting and anomaly detection.

Dataset | Model | Accuracy | Flash (KB) | SRAM (bytes) | S latency (s) | M latency (s) | L latency (s) | S energy (mJ) | M energy (mJ)
GSC | MicroNet-KWS-L | 95.3 | 612 | 208840 | - | 0.610128 | 0.595811 | - | 274.32
GSC | MicroNet-KWS-M | 94.2 | 163 | 103360 | 0.425768 | 0.18671 | 0.181068 | 70.56 | 83.16
GSC | MicroNet-KWS-S | 93.2 | 102 | 53208 | 0.249885 | 0.108839 | 0.107573 | 40.68 | 48.6
VWW | MicroNet-VWW-M | 87.3 | 855 | 284692 | - | 1.165829 | 1.126236 | - | 507.6
VWW | MicroNet-VWW-S | 79.6 | 217 | 70060 | 0.187745 | 0.084759 | 0.083967 | 29.196 | 38.16
Anomaly | MicroNet-AD-L | AUC: 97.28 | 442 | 383776 | - | - | 0.613999 | - | -
Anomaly | MicroNet-AD-M | AUC: 96.05 | 453 | 274528 | - | 0.607672 | 0.566909 | - | 269.64
Anomaly | MicroNet-AD-S | AUC: 95.35 | 247 | 114292 | 0.457176 | 0.19404 |  | 74.16 | 91.8
GSC | DSCNN-L | 93.9 | 490 | 201392 | - | 0.515165 | 0.497092 | - | 229.32
GSC | DSCNN-M | 93.5 | 181 | 123348 | - | 0.219424 | 0.211568 | - | 98.64
GSC | DSCNN-S | 92.1 | 49 | 47188 | 0.130769 | 0.058416 | 0.057685 | 21.132 | 25.956
GSC | MBNETV2-L | 91.2 | 988 | 530000* | - | - | - | - | -
GSC | MBNETV2-M | 90.4 | 233 | 265976 | - | 0.330264 | 0.31697 | - | 147.6
GSC | MBNETV2-S | 89.2 | 87 | 134200 | - | 0.119629 | 0.115458 | 54 | 15.264
VWW | ProxylessNAS Cai et al. (2019) | 94.6 | 309 | 349772 | - | 7.72* | 7.543249 | - | -
VWW | MSNet Cheng et al. (2019) | 95.13 | 264 | 413020 | - | 8.69* | 8.498659 | - | -
VWW | TFLM Person Detection TFLM () | 76 | 294 | 82276 | 0.254136 | 0.107987 | 0.107831 | 39.96 | 49.32
Anomaly | AD-baseline | AUC: 84.76 | 270 | 4740 | 0.007115 | 0.003326 | 0.002947 | 1.1736 | 1.26
Anomaly | MBNetV2-0.5 Giri et al. (2020) | AUC: 97.62? | 965 | 206832 | - | - | 0.252809 | - | -
Table 4: Results table. (*) Estimated. (-) Unable to measure due to SRAM or eFlash constraints.

Appendix A Results Table

In Table 4 we provide our results and baselines for easy comparison with future work. We report the flash consumption of the model flatbuffer and the SRAM consumption of the whole model. We also report latency on the STM32F446RE (S), STM32F746ZG (M) and STM32F767ZI (L), as well as energy consumption on the STM32F446RE (S) and STM32F746ZG (M). All of the models are deployed using the TFLM inference framework. The eFlash consumption is determined by the Mbed compiler and the SRAM consumption is obtained using the TFLM recording micro interpreter. We measure latency on the MCU using the Mbed Timer API. Finally, we use the Qoitech Otii Arc Otii () to measure energy consumption.

Appendix B Power Trace

We plot the current vs. time for a small model and a medium model on the STM32F446RE and STM32F746ZG in Figure 9. We also report the average power consumption over 1 second, to illustrate the impact of the deep sleep power consumption on the overall energy consumption of a TinyML application with a duty cycle of one inference per second. We show that the current consumption varies little between models, but the smaller model consumes significantly less energy due to its reduced latency. Figure 9 also demonstrates that the smaller MCU consumes less power on average, despite being active for longer.

Figure 9: Current consumption of a small and a medium model on the STM32F446RE and STM32F746ZG. We report the average power consumption over one second.

Footnotes

  1. Excluding RNN layers, not currently supported in TFLM.
  2. A single multiply-accumulate is defined as two operations.

References

  1. Ambiq Micro. Subthreshold Power Optimized Technology (SPOT).
  2. Performance-oriented neural architecture search. arXiv:2001.02976.
  3. Benchmarking TinyML systems: challenges and direction. arXiv:2003.04821.
  4. Post-training 4-bit quantization of convolution networks for rapid-deployment.
  5. ProxylessNAS: direct neural architecture search on target task and hardware. In International Conference on Learning Representations.
  6. CMix-NN: mixed low-precision CNN library for memory-constrained edge devices. IEEE Transactions on Circuits and Systems II: Express Briefs 67 (5), pp. 871–875.
  7. Small-footprint keyword spotting with graph convolutional network. arXiv:1912.05124.
  8. MSNet: structural wired neural architecture search for internet of things. In Proceedings of the IEEE International Conference on Computer Vision Workshops.
  9. Temporal convolution for real-time keyword spotting on mobile devices. arXiv:1904.03814.
  10. Visual wake words dataset. arXiv:1906.05721.
  11. TensorFlow Lite Micro: embedded machine learning on TinyML systems. arXiv:2010.08678.
  12. DCASE2020 challenge task 2.
  13. Network pruning via transformable architecture search. In Advances in Neural Information Processing Systems, pp. 759–770.
  14. High performance DSP for vision, imaging and neural networks. In Hot Chips Symposium, pp. 1–30.
  15. ELL: Embedded Learning Library.
  16. Neural architecture search: a survey. Journal of Machine Learning Research 20 (55), pp. 1–21.
  17. Learned step size quantization.
  18. Arm Ethos-U55.
  19. SpArSe: sparse architecture search for CNNs on resource-constrained microcontrollers. In Advances in Neural Information Processing Systems, pp. 4977–4989.
  20. TinyLSTMs: efficient neural speech enhancement for hearing aids. arXiv:2005.11138.
  21. Unsupervised anomalous sound detection using self-supervised classification and group masked autoencoder for density estimation. Technical report, DCASE2020 Challenge.
  22. Ternary MobileNets via per-layer hybrid filter banks.
  23. Ternary hybrid neural-tree networks for highly constrained IoT applications. In Proceedings of Machine Learning and Systems 2019, pp. 190–200.
  24. ProtoNN: compressed and accurate kNN for resource-scarce devices. In Proceedings of the 34th International Conference on Machine Learning, Sydney, Australia, PMLR 70.
  25. Memory-optimal direct convolutions for maximizing classification accuracy in embedded applications. In Proceedings of Machine Learning Research, Vol. 97, Long Beach, California, USA, pp. 2515–2524.
  26. Arm Helium Technology.
  27. Using self-supervised learning can improve model robustness and uncertainty. In Advances in Neural Information Processing Systems, pp. 15663–15674.
  28. Computer Architecture: A Quantitative Approach, fifth edition. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA. ISBN 012383872X.
  29. Distilling the knowledge in a neural network. arXiv:1503.02531.
  30. Preprocessed anomaly detection dataset hosted on Kaggle.
  31. Quantizing deep convolutional networks for efficient inference: a whitepaper. arXiv:1806.08342.
  32. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems (NIPS'12).
  33. Resource-efficient machine learning in 2 KB RAM for the internet of things.
  34. FastGRNN: a fast, accurate, stable and tiny kilobyte sized gated recurrent neural network. pp. 9031–9042.
  35. CMSIS-NN: efficient neural network kernels for Arm Cortex-M CPUs. arXiv:1801.06601.
  36. Not all ops are created equal! arXiv:1801.04326.
  37. MCUNet: tiny deep learning on IoT devices. arXiv:2007.10319.
  38. DARTS: differentiable architecture search. In International Conference on Learning Representations.
  39. Arm Cortex-M4.
  40. Arm Cortex-M7.
  41. Mbed OS.
  42. MicroTVM.
  43. Neural architecture search for keyword spotting. arXiv:2009.00165.
  44. Qoitech Otii.
  45. PROFIT: a novel training method for sub-4-bit MobileNet models. arXiv:2008.04693.
  46. MIMII dataset: sound dataset for malfunctioning industrial machine investigation and inspection. arXiv:1909.09347.
  47. PyTorch.
  48. Deep dense and convolutional autoencoders for unsupervised anomaly detection in machine condition sounds. arXiv:2006.10417.
  49. Leveraging automated mixed-low-precision quantization for tiny edge microcontrollers. arXiv:2008.05124.
  50. Leveraging automated mixed-low-precision quantization for tiny edge microcontrollers. arXiv:2008.05124.
  51. MobileNetV2: inverted residuals and linear bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4510–4520.
  52. STM32F446RE microcontroller.
  53. STM32F746ZG microcontroller.
  54. STM32F767ZI microcontroller.
  55. TensorFlow.
  56. TFLM: TensorFlow Lite for Microcontrollers.
  57. Compressing RNNs for IoT devices by 15-38x using Kronecker products. arXiv:1906.02876.
  58. uTensor.
  59. FBNetV2: differentiable neural architecture search for spatial and channel dimensions. In The IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
  60. Speech Commands: a dataset for limited-vocabulary speech recognition. arXiv:1804.03209.
  61. FixyNN: efficient hardware for mobile computer vision via transfer learning.
  62. TinySpeech: attention condensers for deep speech recognition neural networks on edge devices. arXiv:2008.04245.
  63. Mixup: beyond empirical risk minimization. arXiv:1710.09412.
  64. Hello Edge: keyword spotting on microcontrollers. arXiv:1711.07128.
  65. Hello Edge: keyword spotting on microcontrollers. arXiv:1711.07128.