EdgeSpeechNets: Highly Efficient Deep Neural Networks for Speech Recognition on the Edge
Despite showing state-of-the-art performance, deep learning for speech recognition remains challenging to deploy in on-device edge scenarios such as mobile and other consumer devices. Recently, there have been greater efforts in the design of small, low-footprint deep neural networks (DNNs) that are more appropriate for edge devices, with much of the focus on design principles for hand-crafting efficient network architectures. In this study, we explore a human-machine collaborative design strategy for building low-footprint DNN architectures for speech recognition through a marriage of human-driven principled network design prototyping and machine-driven design exploration. The efficacy of this design strategy is demonstrated through the design of a family of highly-efficient DNNs (nicknamed EdgeSpeechNets) for limited-vocabulary speech recognition. Experimental results on the Google Speech Commands dataset showed that EdgeSpeechNets achieve higher accuracies than state-of-the-art DNNs (with the best EdgeSpeechNet achieving 97% accuracy), while attaining significantly smaller network sizes and lower computational cost (fewer multiply-add operations, lower prediction latency, and a smaller memory footprint on a Motorola Moto E phone), making them very well-suited for on-device edge voice interface applications.
Deep learning has seen widespread interest in recent years, and has been demonstrated to achieve state-of-the-art performance for a wide range of applications in speech recognition. In particular, limited-vocabulary speech recognition Warden2018 (), also known as keyword spotting, has recently seen significant interest as an important application of deep learning for mobile, IoT, and other edge devices. The ability to rapidly recognize specific keywords from a stream of verbal utterances can enable voice interfaces with which the user can interact in a natural, verbal manner without the need for cloud computing, which is particularly important in scenarios where privacy and internet connectivity are of concern.
Despite these promises, deep learning for speech recognition tasks such as limited-vocabulary speech recognition remains challenging to deploy in on-device edge scenarios such as mobile and other consumer devices due to computational and memory requirements. As such, there have been greater recent efforts to design small, low-footprint deep neural network (DNN) architectures that are more appropriate for edge devices, with much of the focus on design principles for hand-crafting efficient network architectures Warden2018 (); Sainath2015 (); Tang2017 (). In this study, we explore a human-machine collaborative design strategy for building low-footprint DNN architectures for speech recognition through a marriage of human-driven principled network design prototyping and machine-driven design exploration via generative synthesis Wong2018 (). More specifically, a family of highly-efficient DNNs (nicknamed EdgeSpeechNets) is designed for limited-vocabulary speech recognition using this strategy.
The human-machine collaborative design strategy presented in this study for building EdgeSpeechNets (low-footprint DNN architectures for limited-vocabulary speech recognition) comprises two main steps. First, design principles are leveraged to construct an initial design prototype catered towards the task of limited-vocabulary speech recognition. Second, machine-driven design exploration is performed based on the constructed initial design prototype and a set of design requirements to generate a set of alternative highly-efficient DNN designs appropriate to the problem space. As such, the goal of this strategy is to combine human ingenuity with the meticulousness of a machine. Details of these two steps are described below.
2.1 Human-driven Design Prototyping
The first step of the presented design strategy is to leverage design principles to construct an initial design prototype catered towards the task of limited-vocabulary speech recognition. Based on past literature, a very effective strategy for leveraging deep learning for limited-vocabulary speech recognition is to first transform the input audio signal into mel-frequency cepstrum coefficient (MFCC) representations. Inspired by Tang2017_Honk (), the input layer of the design prototype takes in a two-dimensional stack of MFCC representations computed using a 30ms window with a 10ms time shift over a one-second audio sample that has been band-pass filtered (cutoffs at 20Hz and 4kHz) to reduce noise.
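The input pipeline described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the sample rate (assumed here to be 16 kHz) is not specified in the text, the band-pass filter is approximated with a crude FFT mask, and the MFCC computation itself (which a library such as librosa would normally provide) is omitted in favor of showing the framing geometry.

```python
import numpy as np

SAMPLE_RATE = 16000                  # assumed sample rate; the paper does not specify one
WINDOW = int(0.030 * SAMPLE_RATE)    # 30 ms analysis window -> 480 samples
SHIFT = int(0.010 * SAMPLE_RATE)     # 10 ms time shift -> 160 samples

def band_pass(audio, low_hz=20.0, high_hz=4000.0, rate=SAMPLE_RATE):
    """Crude FFT-mask band-pass filter (20 Hz - 4 kHz) for noise reduction."""
    spectrum = np.fft.rfft(audio)
    freqs = np.fft.rfftfreq(len(audio), d=1.0 / rate)
    spectrum[(freqs < low_hz) | (freqs > high_hz)] = 0.0
    return np.fft.irfft(spectrum, n=len(audio))

def frame(audio, window=WINDOW, shift=SHIFT):
    """Slice a one-second sample into overlapping analysis frames;
    MFCCs would then be computed per frame and stacked into a 2-D input."""
    n_frames = 1 + (len(audio) - window) // shift
    return np.stack([audio[i * shift: i * shift + window] for i in range(n_frames)])

one_second = np.random.randn(SAMPLE_RATE)
frames = frame(band_pass(one_second))
print(frames.shape)  # (98, 480): 98 overlapping 30 ms frames per one-second sample
```

With these window and shift sizes, a one-second utterance yields 98 frames, so the two-dimensional input to the network has 98 rows of MFCC coefficients.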
For the intermediate representation layers of the initial design prototype, we leverage the concept of deep residual learning He2016 () and specify the use of residual blocks comprised of alternating convolution and batch normalization layers, with skip connections between residual blocks. Networks built around deep residual stacks have been shown to enable easier learning of DNNs with greater representation capabilities, and have been previously demonstrated to enable state-of-the-art speech recognition performance Tang2017 (). After the intermediate representation layers, average pooling is specified in the initial design prototype, followed by a dense layer. Finally, a softmax layer is defined as the output of the initial design prototype to indicate which of the keywords was detected from the verbal utterance. Driven by these design principles, the initial design prototype is shown in Figure 1.
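The residual block structure described above (alternating convolution and batch normalization with a skip connection, following He2016) can be sketched in NumPy. This is an illustrative single-channel toy, not the prototype's actual layer configuration: kernel sizes, channel counts, and learned batch-norm scale/shift parameters are all simplified away.

```python
import numpy as np

def conv2d_same(x, kernel):
    """Naive single-channel 'same' convolution (illustrative only, not optimized)."""
    kh, kw = kernel.shape
    padded = np.pad(x, ((kh // 2, kh // 2), (kw // 2, kw // 2)))
    out = np.zeros_like(x)
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            out[i, j] = np.sum(padded[i:i + kh, j:j + kw] * kernel)
    return out

def batch_norm(x, eps=1e-5):
    """Normalize activations to zero mean / unit variance (learned scale/shift omitted)."""
    return (x - x.mean()) / np.sqrt(x.var() + eps)

def residual_block(x, k1, k2):
    """Two conv + batch-norm stages with a skip connection, in the spirit of He2016."""
    h = np.maximum(batch_norm(conv2d_same(x, k1)), 0.0)  # conv -> BN -> ReLU
    h = batch_norm(conv2d_same(h, k2))                   # conv -> BN
    return np.maximum(h + x, 0.0)                        # add skip connection, then ReLU

x = np.random.randn(10, 10)
k1, k2 = np.random.randn(3, 3), np.random.randn(3, 3)
y = residual_block(x, k1, k2)
print(y.shape)  # (10, 10): output matches input shape, so blocks can be chained
```

Because the 'same' convolution preserves spatial dimensions, the skip connection is a plain addition and blocks can be stacked to arbitrary depth.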
2.2 Machine-driven Design Exploration
The second step in the presented design strategy is to perform machine-driven design exploration based on the initial design prototype, while considering a set of design requirements that ensure the generation of a set of alternative highly-efficient DNN designs appropriate for on-device limited-vocabulary speech recognition. Previous literature Tang2017 () has leveraged manual design exploration for designing low-footprint DNNs for speech recognition, where high-level parameters such as the width and depth of a network are varied to observe the trade-offs between resource usage and accuracy, in a similar vein as Howard2017 (); Sandler2018 (). However, the coarse-grained nature of such a design exploration strategy can be quite limiting in terms of the diversity of network architectures that can be found. Motivated to overcome these limitations, we instead take advantage of a highly flexible machine-driven design exploration strategy in the form of generative synthesis Wong2018 (). Briefly, the goal of generative synthesis is to learn a generator G that, given a set of seeds S, can generate DNNs {N_s = G(s)} that maximize a universal performance function U (e.g., Wong2018_Netscore ()) while satisfying requirements defined by an indicator function 1_r(·), i.e.,

G = max_G U(G(s)), subject to 1_r(G(s)) = 1, for all s ∈ S.
An approximate solution to this optimization problem is found in a progressive manner, with the generator initialized based on the prototype, U, and 1_r(·), and a number of successive generators being constructed along the way. We take full advantage of this by leveraging this set of generators to synthesize a family of EdgeSpeechNets that satisfy the design requirements. More specifically, we configure the indicator function such that the validation accuracy is at least 95% on the Google Speech Commands dataset Warden2017 ().
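The interplay between the indicator function and the universal performance function can be illustrated with a toy filter-and-rank loop. The candidate designs and their numbers below are purely hypothetical (they are not the paper's measured results), and the NetScore-style formula is an assumed form of the metric in Wong2018_Netscore (), with accuracy in percent and parameters/multiply-adds in millions.

```python
import math

# Hypothetical candidate designs: (name, validation accuracy %, params in M, mult-adds in M).
# All numbers are illustrative placeholders, not results from the paper.
candidates = [
    ("net-a", 96.8, 0.11, 29.0),
    ("net-b", 94.2, 0.04, 12.0),   # fails the indicator function (accuracy < 95%)
    ("net-c", 97.0, 0.31, 88.0),
]

def indicator(accuracy, threshold=95.0):
    """1_r(.): returns 1 only when the design meets the accuracy requirement."""
    return 1 if accuracy >= threshold else 0

def netscore(acc, params_m, macs_m, alpha=2.0, beta=0.5, gamma=0.5):
    """NetScore-style universal performance function U (assumed form of Wong2018_Netscore):
    rewards accuracy, penalizes parameter count and multiply-add operations."""
    return 20.0 * math.log10(acc ** alpha / (params_m ** beta * macs_m ** gamma))

# Keep only designs satisfying the indicator, then rank by the performance function.
feasible = [(name, netscore(a, p, m)) for name, a, p, m in candidates if indicator(a)]
best = max(feasible, key=lambda t: t[1])
print(best[0])  # -> net-a: slightly lower accuracy, but far cheaper than net-c
```

The actual generative synthesis procedure learns a generator rather than enumerating candidates, but the same two ingredients, a hard feasibility constraint and a efficiency-aware score, drive the exploration.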
2.3 Final Architecture Designs
The architecture designs of the EdgeSpeechNets produced using the presented human-machine collaborative strategy are summarized in Tables 1 and 2. It can be observed that the produced architecture designs exhibit diverse architectural differences that can only be achieved via fine-grained machine-driven design exploration.
3 Results and Discussion
| Model | Parameters | Multiply-Adds |
| trad-fpool13 Sainath2015 () | 1.37M | 125M |
| tpool2 Sainath2015 () | 1.09M | 103M |
| res15 Tang2017 () | 238K | 894M |
| res15-narrow Tang2017 () | 42.6K | 160M |
The efficacy of the produced EdgeSpeechNets was evaluated using the Google Speech Commands dataset Warden2017 ().
This work was supported by NSERC, Canada Research Chairs Program, and DarwinAI Corp.
- https://research.googleblog.com/2017/08/launching-speech-commands-dataset.html
- P. Warden, “Speech commands: A dataset for limited-vocabulary speech recognition,” arXiv preprint arXiv:1804.03209, 2018.
- T. N. Sainath and C. Parada, “Convolutional neural networks for small-footprint keyword spotting,” in Sixteenth Annual Conference of the International Speech Communication Association, 2015.
- R. Tang and J. Lin, “Deep residual learning for small-footprint keyword spotting,” arXiv preprint arXiv:1710.10361, 2017.
- A. Wong, M. J. Shafiee, B. Chwyl, and F. Li, “Ferminets: Learning generative machines to generate efficient neural networks via generative synthesis,” arXiv preprint arXiv:1809.05989, 2018.
- R. Tang and J. Lin, “Honk: A pytorch reimplementation of convolutional neural networks for keyword spotting,” arXiv preprint arXiv:1710.06554, 2017.
- K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.
- A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam, “Mobilenets: Efficient convolutional neural networks for mobile vision applications,” arXiv preprint arXiv:1704.04861, 2017.
- M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L.-C. Chen, “Mobilenetv2: Inverted residuals and linear bottlenecks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 4510–4520.
- A. Wong, “Netscore: Towards universal metrics for large-scale performance analysis of deep neural networks for practical usage,” arXiv preprint arXiv:1806.05512, 2018.
- P. Warden, “Launching the speech commands dataset,” in Google Research Blog, 2017.