EdgeSpeechNets: Highly Efficient Deep Neural Networks for Speech Recognition on the Edge

Abstract

Despite showing state-of-the-art performance, deep learning for speech recognition remains challenging to deploy in on-device edge scenarios such as mobile and other consumer devices. Recently, there have been greater efforts in the design of small, low-footprint deep neural networks (DNNs) that are more appropriate for edge devices, with much of the focus on design principles for hand-crafting efficient network architectures. In this study, we explore a human-machine collaborative design strategy for building low-footprint DNN architectures for speech recognition through a marriage of human-driven principled network design prototyping and machine-driven design exploration. The efficacy of this design strategy is demonstrated through the design of a family of highly-efficient DNNs (nicknamed EdgeSpeechNets) for limited-vocabulary speech recognition. Experimental results using the Google Speech Commands dataset for limited-vocabulary speech recognition showed that EdgeSpeechNets have higher accuracies than state-of-the-art DNNs (with the best EdgeSpeechNet achieving 97% accuracy), while achieving significantly smaller network sizes (as much as ~7.8× smaller) and lower computational cost (as much as ~36× fewer multiply-add operations, >10× lower prediction latency, and >16.5× smaller memory footprint on a Motorola Moto E phone), making them very well-suited for on-device edge voice interface applications.

1 Introduction

Deep learning has seen widespread interest in recent years, and has been demonstrated to achieve state-of-the-art performance for a wide range of applications in speech recognition. In particular, limited-vocabulary speech recognition [1], also known as keyword spotting, has recently seen significant interest as an important application of deep learning for mobile, IoT, and other edge devices. The ability to rapidly recognize specific keywords from a stream of verbal utterances can enable voice interfaces with which the user can interact in a natural, verbal manner without the need for cloud computing, which is particularly important in scenarios where privacy and internet connectivity are of concern.

Despite these promises, deep learning for speech recognition tasks such as limited-vocabulary speech recognition remains challenging to deploy in on-device edge scenarios such as mobile and other consumer devices due to computational and memory requirements. As such, there have been greater recent efforts to design small, low-footprint deep neural network (DNN) architectures that are more appropriate for edge devices, with much of the focus on design principles for hand-crafting efficient network architectures [1, 2, 3]. In this study, we explore a human-machine collaborative design strategy for building low-footprint DNN architectures for speech recognition through a marriage of human-driven principled network design prototyping and machine-driven design exploration via generative synthesis [4]. More specifically, a family of highly-efficient DNNs (nicknamed EdgeSpeechNets) is designed for limited-vocabulary speech recognition using this strategy.

2 Methods

The human-machine collaborative design strategy presented in this study for building EdgeSpeechNets (low-footprint DNN architectures for limited-vocabulary speech recognition) comprises two main steps. First, design principles are leveraged to construct an initial design prototype catered towards the task of limited-vocabulary speech recognition. Second, machine-driven design exploration is performed based on the constructed initial design prototype and a set of design requirements to generate a set of alternative highly-efficient DNN designs appropriate to the problem space. As such, the goal of this strategy is to combine human ingenuity with the meticulousness of a machine. Details of these two steps are described below.

2.1 Human-driven Design Prototyping

The first step of the presented design strategy is to leverage design principles to construct an initial design prototype catered towards the task of limited-vocabulary speech recognition. Based on past literature, a very effective strategy for leveraging deep learning for limited-vocabulary speech recognition is to first transform the input audio signal into mel-frequency cepstral coefficient (MFCC) representations. Inspired by [5], the input layer of the design prototype takes in a two-dimensional stack of MFCC representations computed using a 30 ms window with a 10 ms time shift over a one-second band-pass filtered (cutoff from 20 Hz to 4 kHz for reducing noise) audio sample.
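To make this front-end concrete, the following is a minimal sketch of the MFCC preprocessing described above. The 16 kHz sampling rate and the choice of 40 MFCC coefficients are assumptions (neither is specified here), and librosa/SciPy are used purely for illustration rather than being the toolchain used in this work:

```python
# Hedged sketch of the MFCC front-end: one-second audio, band-pass filtered
# (20 Hz - 4 kHz), then MFCCs over 30 ms windows with a 10 ms time shift.
# Sampling rate (16 kHz) and n_mfcc=40 are assumptions, not from the paper.
import numpy as np
import librosa
from scipy.signal import butter, filtfilt

SR = 16000                 # assumed sampling rate
WIN = int(0.030 * SR)      # 30 ms analysis window
HOP = int(0.010 * SR)      # 10 ms time shift

def mfcc_input(path, n_mfcc=40):
    y, _ = librosa.load(path, sr=SR, duration=1.0)        # one-second sample
    y = np.pad(y, (0, max(0, SR - len(y))))               # pad to exactly 1 s
    b, a = butter(4, [20, 4000], btype="band", fs=SR)     # band-pass filter
    y = filtfilt(b, a, y)
    m = librosa.feature.mfcc(y=y, sr=SR, n_mfcc=n_mfcc,
                             n_fft=512, win_length=WIN, hop_length=HOP)
    return m.T[np.newaxis, ..., np.newaxis]               # (1, frames, n_mfcc, 1)
```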

For the intermediate representation layers of the initial design prototype, we leverage the concept of deep residual learning [6] and specify the use of residual blocks composed of alternating convolution and batch normalization layers, with skip connections between residual blocks. Networks built around deep residual stacks have been shown to be easier to train while having greater representational capabilities, and have previously been demonstrated to enable state-of-the-art speech recognition performance [3]. After the intermediate representation layers, average pooling is specified in the initial design prototype, followed by a dense layer. Finally, a softmax layer is defined as the output of the initial design prototype to indicate which of the keywords was detected from the verbal utterance. Driven by these design principles, the initial design prototype is shown in Figure 1.

Figure 1: (a) Audio signal; (b) MFCC representations; (c) initial design prototype.
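As an illustration of the residual block pattern specified above, the Keras sketch below stacks two 3×3 convolutions, each followed by batch normalization, around an identity skip connection. The ReLU activations and their placement are assumptions, since the prototype only fixes the general block structure:

```python
# Minimal sketch of a residual block of alternating convolution and batch
# normalization layers with an identity skip connection (activation placement
# is an assumption).
import tensorflow as tf
from tensorflow.keras import layers

def residual_block(x, filters):
    shortcut = x
    y = layers.Conv2D(filters, 3, padding="same", use_bias=False)(x)
    y = layers.BatchNormalization()(y)
    y = layers.ReLU()(y)
    y = layers.Conv2D(filters, 3, padding="same", use_bias=False)(y)
    y = layers.BatchNormalization()(y)
    y = layers.Add()([shortcut, y])
    return layers.ReLU()(y)
```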

2.2 Machine-driven Design Exploration

The second step in the presented design strategy is to perform machine-driven design exploration based on the initial design prototype, while considering a set of design requirements that ensure the generation of a set of alternative highly-efficient DNN designs appropriate for on-device limited-vocabulary speech recognition. Previous literature [3] has leveraged manual design exploration for designing low-footprint DNNs for speech recognition, where high-level parameters such as the width and depth of a network are varied to observe the trade-offs between resource usage and accuracy, in a similar vein to [7, 8]. However, the coarse-grained nature of such a design exploration strategy can be quite limiting in terms of the diversity of network architectures that can be found. Motivated to overcome these limitations, we instead take advantage of a highly flexible machine-driven design exploration strategy in the form of generative synthesis [4]. Briefly, the goal of generative synthesis is to learn a generator $\mathcal{G}$ that, given a set of seeds $S$, can generate DNNs that maximize a universal performance function $\mathcal{U}$ (e.g., the NetScore metric [9]) while satisfying requirements defined by an indicator function $1_r(\cdot)$, i.e.,

$\mathcal{G}^{\ast} = \max_{\mathcal{G}} \; \mathcal{U}(\mathcal{G}(s)) \quad \text{subject to} \quad 1_r(\mathcal{G}(s)) = 1, \;\; \forall s \in S$   (1)

An approximate solution to this optimization problem is found in a progressive manner, with an initial generator constructed based on the prototype, $\mathcal{U}$, and $1_r(\cdot)$, and a number of successive generators being constructed thereafter. We take full advantage of this by leveraging this set of generators to synthesize a family of EdgeSpeechNets that satisfy the requirements. More specifically, we configure the indicator function $1_r(\cdot)$ such that the validation accuracy is at least 95% on the Google Speech Commands dataset [10].
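The generative synthesis procedure itself is described in [4] and is not reproduced here; the sketch below only illustrates how the indicator function $1_r(\cdot)$ acts as a hard accuracy gate on generated designs. The candidate names and accuracies are hypothetical:

```python
# Illustrative sketch (not the actual generative synthesis implementation of [4]):
# 1_r admits a generated design only if its validation accuracy is at least 95%.
def indicator(val_accuracy, floor=0.95):
    """Hard constraint 1_r: 1 if the design meets the accuracy floor, else 0."""
    return int(val_accuracy >= floor)

# Hypothetical (design, validation accuracy) pairs produced during exploration;
# only designs with indicator(...) == 1 are retained as admissible candidates.
candidates = [("design_1", 0.942), ("design_2", 0.961), ("design_3", 0.957)]
admissible = [name for name, acc in candidates if indicator(acc) == 1]
print(admissible)  # ['design_2', 'design_3']
```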

2.3 Final Architecture Designs

The architecture designs of the EdgeSpeechNets produced using the presented human-machine collaborative strategy are summarized in Tables 1 and 2. It can be observed that the produced architectures exhibit diverse architectural differences that can only be achieved via fine-grained machine-driven design exploration.

Table 1: Network architectures of EdgeSpeechNet-A and EdgeSpeechNet-B

EdgeSpeechNet-A
Type       Filter size   Filters   Params
conv       3×3           39        351
conv       3×3           20        7020
conv       3×3           39        7020
conv       3×3           15        5265
conv       3×3           39        5265
conv       3×3           25        8775
conv       3×3           39        8775
conv       3×3           22        7722
conv       3×3           39        7722
conv       3×3           22        7722
conv       3×3           39        7722
conv       3×3           25        8775
conv       3×3           39        8775
conv       3×3           45        15795
avg-pool   -             -         -
dense      -             12        540
softmax    -             -         -
Total      -             -         107K

EdgeSpeechNet-B
Type       Filter size   Filters   Params
conv       3×3           30        270
conv       3×3           8         2160
conv       3×3           30        2160
conv       3×3           9         2430
conv       3×3           30        2430
conv       3×3           11        2970
conv       3×3           30        2970
conv       3×3           10        2700
conv       3×3           30        2700
conv       3×3           8         2160
conv       3×3           30        2160
conv       3×3           11        2970
conv       3×3           30        2970
conv       3×3           45        12150
avg-pool   -             -         -
dense      -             12        540
softmax    -             -         -
Total      -             -         43.7K

Table 2: Network architectures of EdgeSpeechNet-C and EdgeSpeechNet-D

EdgeSpeechNet-C
Type       Filter size   Filters   Params
conv       3×3           24        216
conv       3×3           6         1296
conv       3×3           24        1296
conv       3×3           9         1944
conv       3×3           24        1944
conv       3×3           12        2592
conv       3×3           24        2592
conv       3×3           6         1296
conv       3×3           24        1296
conv       3×3           5         1080
conv       3×3           24        1080
conv       3×3           6         1296
conv       3×3           24        1296
conv       3×3           2         432
conv       3×3           24        432
conv       3×3           45        9720
avg-pool   -             -         -
dense      -             12        540
softmax    -             -         -
Total      -             -         30.3K

EdgeSpeechNet-D
Type       Filter size   Filters   Params
conv       3×3           45        405
avg-pool   -             -         -
conv       3×3           30        12150
conv       3×3           45        12150
conv       3×3           33        13365
conv       3×3           45        13365
conv       3×3           35        14175
conv       3×3           45        14175
avg-pool   -             -         -
dense      -             12        540
softmax    -             -         -
Total      -             -         80.3K
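To show how these layer listings translate into a concrete network, the Keras sketch below assembles EdgeSpeechNet-A from Table 1. Only the 3×3 filter counts come from the table; the input shape (roughly 101 MFCC frames × 40 coefficients), the batch-normalization/ReLU placement, and the pairing of one skip connection around each narrow/wide convolution pair are assumptions based on the design prototype in Section 2.1:

```python
# Hedged sketch of EdgeSpeechNet-A assembled from the layer listing in Table 1.
# Input shape, BN/ReLU placement, and skip-connection pairing are assumptions;
# the table's parameter counts cover convolution and dense weights only.
import tensorflow as tf
from tensorflow.keras import layers, Model

def conv_bn_relu(x, filters):
    x = layers.Conv2D(filters, 3, padding="same", use_bias=False)(x)
    x = layers.BatchNormalization()(x)
    return layers.ReLU()(x)

def edgespeechnet_a(input_shape=(101, 40, 1), num_classes=12):
    inp = layers.Input(shape=input_shape)
    x = conv_bn_relu(inp, 39)                        # conv 3x3, 39 filters (stem)
    for narrow in (20, 15, 25, 22, 22, 25):          # narrow widths from Table 1
        shortcut = x
        x = conv_bn_relu(x, narrow)                  # conv 3x3, narrow
        x = conv_bn_relu(x, 39)                      # conv 3x3, back to 39 filters
        x = layers.Add()([shortcut, x])              # skip connection
    x = conv_bn_relu(x, 45)                          # conv 3x3, 45 filters
    x = layers.GlobalAveragePooling2D()(x)           # avg-pool
    out = layers.Dense(num_classes, activation="softmax")(x)  # dense + softmax
    return Model(inp, out)

model = edgespeechnet_a()
model.summary()  # conv/dense weights total ~107K, matching Table 1
```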

3 Results and Discussion

Model               Test Accuracy   NetScore   Params   Mult-Adds
trad-fpool13 [2]    –               –          1.37M    125M
tpool2 [2]          –               –          1.09M    103M
res15 [3]           –               –          238K     894M
res15-narrow [3]    –               –          42.6K    160M
EdgeSpeechNet-A     96.8%           –          107K     343M
EdgeSpeechNet-B     –               –          43.7K    126M
EdgeSpeechNet-C     –               –          30.3K    83.5M
EdgeSpeechNet-D     –               106.67     80.3K    24.5M
Table 3: Test accuracy of EdgeSpeechNets in comparison to trad-fpool13 [2], tpool2 [2], res15 [3], and res15-narrow [3], along with NetScores and model sizes in terms of number of parameters and multiply-add operations. All results are the mean across 5 runs. Best results are in bold.

The efficacy of the produced EdgeSpeechNets was evaluated using the Google Speech Commands dataset [10]¹. The Speech Commands dataset was designed for limited-vocabulary speech recognition and contains 65,000 one-second samples of 30 short words along with background noise samples. For comparison purposes, results are also presented for two state-of-the-art deep neural networks from [3] (res15 and res15-narrow) and the Google networks from [2] (trad-fpool13 and tpool2).

As shown in Table 3, the produced EdgeSpeechNets achieved higher accuracies at much smaller sizes and lower computational costs than state-of-the-art deep neural networks. In terms of best accuracy, EdgeSpeechNet-A achieved 1% higher accuracy than the state-of-the-art res15 [3] while having >2.2× fewer parameters and requiring >2.6× fewer multiply-add operations. In fact, the best of 5 runs for EdgeSpeechNet-A reached a test accuracy of 97%, noticeably outperforming previously published results. More interestingly, EdgeSpeechNet-B still achieved higher accuracy (0.5% higher) than res15 while having >5.4× fewer parameters and requiring 7.1× fewer multiply-add operations. In terms of smallest size, EdgeSpeechNet-C achieved higher accuracy (0.4% higher) than res15 while having >7.8× fewer parameters and requiring >10.7× fewer multiply-add operations. In terms of lowest computational cost, EdgeSpeechNet-D achieved the same accuracy as res15 while requiring 36.5× fewer multiply-add operations. When compared to the Google network tpool2 [2], EdgeSpeechNet-D achieved 4.1% higher accuracy while having >13.5× fewer parameters and requiring 4.2× fewer multiply-add operations. In terms of the highest NetScore, EdgeSpeechNet-D achieved a NetScore more than 20 points higher than res15, which demonstrates a strong balance between accuracy, computational cost, and size. Finally, running on a 1.4 GHz Cortex-A53 mobile processor in a Motorola Moto E phone using TensorFlow Mobile, EdgeSpeechNet-D ran with an average prediction latency of 34 ms and a memory footprint of 1 MB (>10× lower latency and >16.5× smaller memory footprint than res15).

These results demonstrate that the EdgeSpeechNets achieve state-of-the-art performance while being noticeably smaller and requiring significantly fewer computations, making them very well-suited for on-device edge voice interface applications. Given the promising prospects of the presented human-machine collaborative design strategy, we aim to further explore this strategy for designing highly-efficient deep neural networks for other applications such as visual perception and natural language processing.
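The on-device latency and memory figures above were obtained with TensorFlow Mobile running on the Moto E itself. As a rough illustration of how single-sample prediction latency can be timed for such a model, the sketch below uses the TensorFlow Lite Python interpreter with a hypothetical converted model file (edgespeechnet_d.tflite); it is not the benchmarking setup used in this work:

```python
# Illustrative latency timing with the TFLite interpreter; the model file name
# is hypothetical and desktop timings will differ from on-device measurements.
import time
import numpy as np
import tensorflow as tf

interpreter = tf.lite.Interpreter(model_path="edgespeechnet_d.tflite")
interpreter.allocate_tensors()
inp = interpreter.get_input_details()[0]

x = np.random.rand(*inp["shape"]).astype(np.float32)   # dummy MFCC stack
interpreter.set_tensor(inp["index"], x)
interpreter.invoke()                                    # warm-up run

runs = 100
start = time.perf_counter()
for _ in range(runs):
    interpreter.set_tensor(inp["index"], x)
    interpreter.invoke()
elapsed_ms = 1000 * (time.perf_counter() - start) / runs
print(f"mean prediction latency: {elapsed_ms:.1f} ms")
```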

Acknowledgements

This work was supported by NSERC, Canada Research Chairs Program, and DarwinAI Corp.

Footnotes

  1. https://research.googleblog.com/2017/08/launching-speech-commands-dataset.html

References

  1. P. Warden, “Speech commands: A dataset for limited-vocabulary speech recognition,” arXiv preprint arXiv:1804.03209, 2018.
  2. T. N. Sainath and C. Parada, “Convolutional neural networks for small-footprint keyword spotting,” in Sixteenth Annual Conference of the International Speech Communication Association, 2015.
  3. R. Tang and J. Lin, “Deep residual learning for small-footprint keyword spotting,” arXiv preprint arXiv:1710.10361, 2017.
  4. A. Wong, M. J. Shafiee, B. Chwyl, and F. Li, “Ferminets: Learning generative machines to generate efficient neural networks via generative synthesis,” arXiv preprint arXiv:1809.05989, 2018.
  5. R. Tang and J. Lin, “Honk: A pytorch reimplementation of convolutional neural networks for keyword spotting,” arXiv preprint arXiv:1710.06554, 2017.
  6. K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.
  7. A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam, “Mobilenets: Efficient convolutional neural networks for mobile vision applications,” arXiv preprint arXiv:1704.04861, 2017.
  8. M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L.-C. Chen, “Mobilenetv2: Inverted residuals and linear bottlenecks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 4510–4520.
  9. A. Wong, “Netscore: Towards universal metrics for large-scale performance analysis of deep neural networks for practical usage,” arXiv preprint arXiv:1806.05512, 2018.
  10. P. Warden, “Launching the speech commands dataset,” in Google Research Blog, 2017.