AttendNets: Tiny Deep Image Recognition Neural Networks for the Edge via Visual Attention Condensers
While significant advances in deep learning have resulted in state-of-the-art performance across a large number of complex visual perception tasks, the widespread deployment of deep neural networks for TinyML applications involving on-device, low-power image recognition remains a big challenge given the complexity of deep neural networks. In this study, we introduce AttendNets, low-precision, highly compact deep neural networks tailored for on-device image recognition. More specifically, AttendNets possess deep self-attention architectures based on visual attention condensers, which extend the recently introduced stand-alone attention condensers to improve spatial-channel selective attention. Furthermore, AttendNets have unique machine-designed macroarchitecture and microarchitecture designs achieved via a machine-driven design exploration strategy. Experimental results on the ImageNet$_{50}$ benchmark dataset for the task of on-device image recognition showed that AttendNets have significantly lower architectural and computational complexity when compared to several deep neural networks in the research literature designed for efficiency, while achieving the highest accuracies (with the smallest AttendNet achieving 7.2% higher accuracy, while requiring 3× fewer multiply-add operations, 4.17× fewer parameters, and 16.7× lower weight memory requirements than MobileNet-V1). Based on these promising results, AttendNets illustrate the effectiveness of visual attention condensers as building blocks for enabling various on-device visual perception tasks for TinyML applications.
1 Introduction
Deep learning lecun2015deep () has resulted in significant breakthroughs in the area of computer vision, with state-of-the-art performance in a wide range of visual perception tasks such as image recognition krizhevsky2012imagenet (); ResNet (); hu2017squeezeandexcitation (), object detection fasterrcnn (); liu2016ssd (), and semantic and instance segmentation lin2017refinenet (); chen2018deeplab (); he2017mask (); lin2019edgesegnet (). Despite these breakthroughs, the widespread deployment of deep neural networks for tiny machine learning (TinyML) applications involving on-device visual perception on low-cost, low-power devices remains a major challenge given the increasing complexity of deep neural networks. Motivated by the tremendous potential of deep learning for empowering TinyML applications and inspired to tackle the aforementioned complexity challenge, there has been significant effort in recent years on the creation of highly efficient deep neural networks for edge scenarios. These efforts in efficient deep learning have yielded a number of effective strategies, which can typically be grouped into two main categories: i) model compression, and ii) efficient architecture design. In the realm of model compression, a popular approach is precision reduction Jacob (); Meng2017 (); courbariaux2015binaryconnect (), where the weights are represented at reduced data precision (e.g., fixed-point or integer precision Jacob (), 2-bit precision Meng2017 (), 1-bit precision courbariaux2015binaryconnect ()). Another model compression strategy leverages traditional data compression methods such as hashing, coding, and thresholding han2015deep (). More recently, teacher-student strategies for model compression have also been explored hinton2015distilling (); projectionnet (), where a larger teacher network is leveraged to train a smaller student network.
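As a rough illustration of precision reduction, the sketch below maps floating-point weights to 8-bit integer codes with a simple affine (scale and offset) scheme and then recovers approximate values. It is a generic uniform quantizer written for illustration only, not the scheme of any of the cited works; the function names `quantize_uint8` and `dequantize` and the example weights are hypothetical.

```python
def quantize_uint8(weights):
    """Map a list of floats to integer codes in [0, 255] (affine scheme)."""
    w_min, w_max = min(weights), max(weights)
    # Guard against a constant weight list, where the range would be zero.
    scale = (w_max - w_min) / 255.0 or 1.0
    codes = [round((w - w_min) / scale) for w in weights]
    return codes, scale, w_min


def dequantize(codes, scale, w_min):
    """Recover approximate float weights from their 8-bit codes."""
    return [c * scale + w_min for c in codes]


# Hypothetical weights, chosen only to exercise the functions.
weights = [-0.51, -0.1, 0.0, 0.23, 0.77]
codes, scale, w_min = quantize_uint8(weights)
restored = dequantize(codes, scale, w_min)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
```

With this kind of rounding, the reconstruction error of each weight is bounded by half of one quantization step, which is the price paid for storing one byte per weight instead of four.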
In the realm of efficient architecture design, a number of effective architecture design patterns geared around network efficiency have been introduced MobileNetv1 (); MobileNetv2 (); SqueezeNet (); SquishedNets (); TinySSD (); ShuffleNetv1 (); ShuffleNetv2 (); ResNet (). One popular design pattern involves the introduction of bottlenecks ResNet (); MobileNetv2 (); SqueezeNet (), which serve the purpose of reducing dimensionality for more complex operations such as spatial convolutions. Another efficient network design pattern involves the introduction of factorized convolutions MobileNetv1 (); MobileNetv2 (), which reduce architectural and computational complexity by factorizing spatial convolutions into smaller, more efficient operations. A third efficient network design pattern involves the introduction of new operations such as pointwise group convolutions and channel shuffling ShuffleNetv1 (); ShuffleNetv2 (). Despite the range of strategies explored, one area that has not been well explored and is ripe for innovation is leveraging the concept of self-attention bahdanau2014neural (); vaswani2017attention (); hu2017squeezeandexcitation (); woo2018cbam (); devlin2018bert (), seen as one of the recent big breakthroughs in deep learning, for the purpose of building highly efficient deep neural network architectures.
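To make the benefit of factorized convolutions concrete, the following sketch counts the weights of a standard spatial convolution and of its factorization into a depthwise convolution followed by a pointwise convolution. The channel sizes are hypothetical and biases are ignored; the function names are introduced here for illustration only.

```python
def standard_conv_params(kernel, c_in, c_out):
    """Weights in a standard kernel x kernel convolution (no bias)."""
    return kernel * kernel * c_in * c_out


def separable_conv_params(kernel, c_in, c_out):
    """Weights in a depthwise convolution followed by a pointwise one."""
    depthwise = kernel * kernel * c_in   # one spatial filter per channel
    pointwise = c_in * c_out             # 1x1 channel mixing
    return depthwise + pointwise


# Hypothetical layer: 3x3 kernel, 64 input channels, 128 output channels.
standard = standard_conv_params(3, 64, 128)
separable = separable_conv_params(3, 64, 128)
reduction = standard / separable
```

For this hypothetical layer the factorized form needs roughly an eighth of the weights, which is the kind of saving that motivates the design pattern.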
In this study, we introduce AttendNets, low-precision, highly compact deep neural networks tailored for on-device image recognition. More specifically, AttendNets possess deep self-attention architectures based on visual attention condensers, which extend the recently introduced stand-alone attention condensers to improve spatial-channel selective attention. Furthermore, AttendNets have unique machine-designed macroarchitecture and microarchitecture designs achieved via a machine-driven design exploration strategy.
The paper is organized as follows. In Section 2, we describe in detail the concept of visual attention condensers and the machine-driven design exploration strategy leveraged to build the proposed AttendNets. In Section 3, we describe the produced AttendNet deep self-attention network architectures and discuss some interesting observations about their architecture designs. In Section 4, we present experimental results that quantitatively explore the efficacy of the proposed AttendNets when compared to previously proposed efficient deep neural networks for the task of on-device image recognition. Finally, we draw conclusions and discuss potential future directions in Section 5.
2 Methods
In this study, we introduce AttendNets, a family of low-precision, highly compact deep neural networks tailored specifically for on-device image recognition. Two key concepts are leveraged to construct the proposed AttendNets. First, we introduce a stand-alone self-attention mechanism called the visual attention condenser, which extends the recently introduced attention condensers alex2020tinyspeech () to further improve their efficacy and efficiency for spatial-channel selective attention. Second, we leverage a machine-driven design exploration strategy incorporating visual attention condensers to automate the network architecture design process and produce the proposed AttendNets in a way that strikes a strong balance between network complexity and image recognition accuracy. Details pertaining to these two key concepts, along with details of the AttendNet network architectures, are provided below.
2.1 Visual Attention Condensers
The first concept we leverage to construct the proposed AttendNets is the concept of visual attention condensers. The concept of self-attention in deep learning has led to significant advances in recent years bahdanau2014neural (); vaswani2017attention (); hu2017squeezeandexcitation (); woo2018cbam (); devlin2018bert (), particularly with the advent of Transformers vaswani2017attention (), which have reshaped the landscape of machine learning for natural language processing. Much of the research on self-attention in deep learning has focused on improving accuracy, and this has had a heavy influence on the design of self-attention mechanisms. For example, in the realm of computer vision, self-attention mechanisms have largely been explored as a means of augmenting existing deep convolutional neural network architectures hu2017squeezeandexcitation (); woo2018cbam (); bello2019attention (). Many of the self-attention mechanisms introduced for augmenting deep convolutional neural network architectures have focused on decoupling attention into channel-wise attention hu2017squeezeandexcitation () and local attention woo2018cbam (), and are integrated to improve the selective attention capabilities of existing convolution modules, boosting overall network accuracy at the expense of architectural and computational complexity.
Motivated to explore the design of self-attention mechanisms primarily in the direction of enabling more efficient deep neural network architectures rather than primarily for accuracy, Wong et al. alex2020tinyspeech () introduced the concept of attention condensers as a stand-alone building block for deep neural networks geared around condensed self-attention. More specifically, attention condensers are stand-alone self-attention mechanisms that learn and produce unified embeddings at reduced dimensionality for characterizing joint cross-dimensional activation relationships. The result is that attention condensers can better model activation relationships than decoupled mechanisms, resulting in improved selective attention while maintaining low complexity. Furthermore, these improved modeling capabilities enable attention condensers to be leveraged as stand-alone modules, rather than as dependent modules geared for augmenting convolution modules. As such, the heavier use of attention condensers and sparser use of complex stand-alone convolution modules can lead to more efficient deep neural network architectures while maintaining high modeling accuracy. The efficacy of attention condensers as a means for building highly efficient deep neural networks was illustrated in alex2020tinyspeech () for the task of on-device speech recognition, where deep attention condenser networks achieved computational and architectural complexities more than an order of magnitude lower than those of previous deep speech recognition networks in the literature.
Motivated by the promise of attention condensers, we extend the attention condenser design to further improve its efficiency and effectiveness for tackling visual perception tasks such as image recognition. More specifically, we take inspiration from the observation that deep convolutional neural network architectures for tackling complex visual perception tasks often have very high channel dimensionality. While the existing attention condenser design can still achieve significant reductions in network complexity under such scenarios, we hypothesize that further complexity reductions can be gained through better handling of the high channel dimensionality when learning the condensed embedding of joint spatial-channel activation relationships. As such, we introduce an extended visual attention condenser design with a pair of learned channel mixing layers that further reduce spatial-channel embedding dimensionality while preserving selective attention performance.
An overview of the proposed visual attention condenser (VAC) is shown in Figure 1. More specifically, a visual attention condenser is a self-attention mechanism consisting of a down-mixing layer $M^{-}$, a condensation layer $C$, an embedding structure $E$, an expansion layer $X$, a selective attention mechanism $F$, and an up-mixing layer $M^{+}$. The down-mixing layer learns and produces a projection of the input activations $V$ to a reduced channel dimensionality to obtain $V'$. The condensation layer (i.e., $Q = C(V')$) condenses $V'$ to a reduced-dimensionality representation $Q$, with an emphasis on strong activation proximity to better promote relevant regions of interest despite the condensed nature of the spatial-channel representation. An embedding structure (i.e., $K = E(Q)$) then learns and produces a condensed embedding $K$ from $Q$ characterizing joint spatial-channel activation relationships. An expansion layer (i.e., $A = X(K)$) then projects the condensed embedding $K$ to an increased dimensionality to produce self-attention values $A$ emphasizing regions of interest in the same domain as $V'$. The output $V''$ is a product of $V'$, the self-attention values $A$, and a scale $S$ via selective attention (i.e., $V'' = F(V', A, S)$). Finally, the up-mixing layer learns and produces a projection of $V''$ to a higher channel dimensionality for the final output $V'''$, which has the same channel dimensionality as the input activation $V$. Overall, through the introduction of the pair of learned mixing layers into the attention condenser design, a better balance between joint spatial-channel embedding dimensionality and selective attention performance can be achieved for building highly efficient deep neural networks for tackling complex visual perception problems on the edge.
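The data flow described in this paragraph can be summarized as a chain of mappings. The symbols used below ($M^{-}$, $C$, $E$, $X$, $F$, $M^{+}$ for the six components, and $V$, $V'$, $Q$, $K$, $A$, $S$, $V''$, $V'''$ for the tensors) are notation assumed for this summary and should be read as a sketch rather than as the definitive formulation.

```latex
\begin{aligned}
V'   &= M^{-}(V)    && \text{down-mixing to a reduced channel dimensionality} \\
Q    &= C(V')       && \text{condensation} \\
K    &= E(Q)        && \text{condensed embedding} \\
A    &= X(K)        && \text{expansion to self-attention values} \\
V''  &= F(V', A, S) && \text{selective attention with scale } S \\
V''' &= M^{+}(V'')  && \text{up-mixing back to the channel dimensionality of } V
\end{aligned}
```

Written this way, it is apparent that the two mixing layers only wrap the attention condenser core: everything between the down-mixing and up-mixing steps operates at the reduced channel dimensionality.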
2.2 Machine-driven Design Exploration
The second concept we leverage to construct the proposed AttendNets is the concept of machine-driven design exploration. Similar to self-attention, the concept of machine-driven design exploration has gained tremendous research interest in recent years. In the realm of machine-driven design exploration for efficient deep neural network architectures, several notable strategies have been proposed. One strategy, taken by MONAS MONAS (), ParetoNASH ParetoNASH (), and MNAS MNAS (), involves formulating the search for efficient deep neural network architectures as a multi-objective optimization problem, where the objectives may include model size, accuracy, FLOPs, device inference latency, etc., and a solution is found via reinforcement learning and evolutionary algorithms. Another strategy, taken by generative synthesis Wong2018 (), involves formulating the search for a generator of efficient deep neural networks as a constrained optimization problem, where the constraints may include model size, accuracy, FLOPs, device inference latency, etc., and a solution is found through an iterative solving process. In this study, we adopt the latter strategy and leverage generative synthesis Wong2018 () to automate the process of generating the macroarchitecture and microarchitecture designs of the final AttendNet network architectures such that they are tailored specifically for on-device image recognition in computationally constrained and memory-constrained scenarios, such as on low-cost, low-power edge devices, with an optimal balance between image recognition accuracy and network efficiency.
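To illustrate what a multi-objective formulation trades off, the sketch below filters a set of candidate networks down to those that are not dominated in both accuracy and parameter count. This is only the notion of a Pareto front, not the search procedure of MONAS, ParetoNASH, or MNAS; the candidate names and numbers are made up for the example.

```python
def pareto_front(candidates):
    """Return names of candidates not dominated by any other candidate.

    A candidate is dominated if another one is at least as accurate and
    at least as small, and strictly better in one of the two objectives.
    """
    front = []
    for name, acc, params in candidates:
        dominated = any(
            (o_acc >= acc and o_params <= params)
            and (o_acc > acc or o_params < params)
            for _, o_acc, o_params in candidates
        )
        if not dominated:
            front.append(name)
    return front


# Hypothetical candidates: (name, top-1 accuracy in %, parameters in K).
candidates = [
    ("net-a", 70.0, 2000),
    ("net-b", 72.0, 2500),
    ("net-c", 69.0, 2200),
    ("net-d", 72.0, 3000),
]
front = pareto_front(candidates)
```

A multi-objective search returns such a front and leaves the final choice to the designer, whereas the constrained formulation fixes the operational requirements up front.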
Briefly (the theoretical underpinnings of generative synthesis are presented in detail in Wong2018 ()), generative synthesis formulates the following constrained optimization problem, which involves the search for a generator $\mathcal{G}$ whose generated deep neural network architectures $\{N_s \mid s \in S\}$ maximize a universal performance function $\mathcal{U}$ (e.g., Wong2018_Netscore ()), with constraints around operational requirements defined by an indicator function $1_r(\cdot)$,

$$\mathcal{G} = \max_{\mathcal{G}} \; \mathcal{U}(\mathcal{G}(s)) \quad \text{subject to} \quad 1_r(\mathcal{G}(s)) = 1, \;\; \forall s \in S,$$

where $S$ denotes a set of seeds. The approximate solution to this constrained optimization problem is found in generative synthesis through an iterative solving process, with the process initialized based on a prototype $\varphi$, $\mathcal{U}$, and $1_r(\cdot)$. As the goal for the proposed AttendNets is to achieve a strong balance between accuracy and network efficiency for the task of on-device image recognition, we explore an indicator function $1_r(\cdot)$ with two key constraints: i) the top-1 validation accuracy is greater than or equal to 71% on the ImageNet$_{50}$ edge vision benchmark dataset introduced by Fang et al. ImageNet50 () for evaluating the performance of deep neural networks for on-device vision applications, and ii) 8-bit weight precision. First, a top-1 validation accuracy constraint of 71% was chosen to make AttendNets comparable in accuracy to a state-of-the-art efficient deep neural network proposed in alex2019attonets () for on-device image recognition. Second, an 8-bit weight precision constraint was chosen to account for the memory constraints of low-cost edge devices. Taking advantage of the fact that the generative synthesis process is iterative and produces a number of successive generators, we leverage two of the constructed generators at different stages to automatically generate two compact deep image recognition networks (AttendNet-A and AttendNet-B) with different tradeoffs between image recognition accuracy and network efficiency.
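The role of the indicator function can be sketched as a feasibility check applied before scoring. The code below encodes the two constraints stated above (top-1 validation accuracy of at least 71% and 8-bit weight precision) and picks the best feasible candidate under a toy scoring function. It is not generative synthesis itself: `toy_performance` is a hypothetical stand-in rather than the universal performance function used by the authors, and the candidate records are invented for illustration.

```python
def satisfies_requirements(candidate, min_top1=71.0, weight_bits=8):
    """Indicator function: 1 if both operational constraints hold, else 0."""
    ok = (candidate["top1"] >= min_top1
          and candidate["weight_bits"] == weight_bits)
    return 1 if ok else 0


def toy_performance(candidate):
    """Hypothetical score rewarding accuracy and penalizing complexity."""
    complexity = (candidate["params_k"] * candidate["madds_m"]) ** 0.5
    return candidate["top1"] ** 2 / complexity


def best_feasible(candidates):
    """Return the highest-scoring candidate that meets the constraints."""
    feasible = [c for c in candidates if satisfies_requirements(c) == 1]
    return max(feasible, key=toy_performance) if feasible else None


# Invented candidates, not results from the paper.
candidates = [
    {"name": "cand-1", "top1": 72.0, "weight_bits": 8,
     "params_k": 1500, "madds_m": 280},
    {"name": "cand-2", "top1": 71.5, "weight_bits": 8,
     "params_k": 800, "madds_m": 190},
    {"name": "cand-3", "top1": 74.0, "weight_bits": 32,
     "params_k": 900, "madds_m": 200},
    {"name": "cand-4", "top1": 70.0, "weight_bits": 8,
     "params_k": 500, "madds_m": 150},
]
best = best_feasible(candidates)
```

Here the most accurate candidate is rejected for violating the precision constraint and the smallest one for falling below the accuracy threshold, so the selection is made only among networks that meet both requirements.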
In terms of the prototype $\varphi$, we define a residual design prototype whose input layer takes in an RGB image, and whose last layers consist of a global average pooling layer, a fully-connected layer, and a softmax layer indicating the image category. How and where visual attention condensers should be leveraged is not defined in the prototype, and thus the macroarchitecture and microarchitecture designs of the final AttendNets are automatically determined by the machine-driven design exploration process so as to best satisfy the constraints specified in the indicator function $1_r(\cdot)$.
Finally, to realize the concept of visual attention condensers in a way that enables the learning of condensed embeddings characterizing joint spatial-channel activation relationships in an efficient yet effective manner, we leveraged max pooling, a lightweight two-layer neural network (grouped then pointwise convolution), unpooling, and pointwise convolution for the condensation layer $C$, the embedding structure $E$, the expansion layer $X$, and the mixing layers $M^{-}$ and $M^{+}$, respectively, within a visual attention condenser.
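A minimal NumPy sketch of a forward pass through such a module is given below, using the layer choices named in this paragraph: pointwise convolution for the mixing layers, max pooling for condensation, a grouped then pointwise convolution for the embedding, and unpooling for expansion. Several details are assumptions made to keep the example short: the grouped convolution uses a 1x1 kernel, a ReLU sits between the two embedding layers, unpooling is approximated by nearest-neighbour upsampling, and the channel sizes, group count, and random (untrained) weights are hypothetical.

```python
import numpy as np


def pointwise_conv(x, weight):
    """1x1 convolution: mix channels independently at every pixel."""
    return np.einsum("oc,chw->ohw", weight, x)


def grouped_pointwise_conv(x, group_weights):
    """Grouped 1x1 convolution: each channel group has its own weights."""
    chunks = np.split(x, len(group_weights), axis=0)
    outputs = [pointwise_conv(c, w) for c, w in zip(chunks, group_weights)]
    return np.concatenate(outputs, axis=0)


def max_pool_2x2(x):
    """Non-overlapping 2x2 max pooling (height and width must be even)."""
    c, h, w = x.shape
    return x.reshape(c, h // 2, 2, w // 2, 2).max(axis=(2, 4))


def unpool_2x2(x):
    """Nearest-neighbour upsampling, a simple stand-in for unpooling."""
    return x.repeat(2, axis=1).repeat(2, axis=2)


def visual_attention_condenser(v, params, scale=1.0):
    """Forward pass on activations of shape (channels, height, width)."""
    v_mixed = pointwise_conv(v, params["down"])       # down-mixing
    q = max_pool_2x2(v_mixed)                         # condensation
    hidden = grouped_pointwise_conv(q, params["grouped"])
    hidden = np.maximum(hidden, 0.0)                  # ReLU (assumed)
    k = pointwise_conv(hidden, params["pointwise"])   # condensed embedding
    a = unpool_2x2(k)                                 # expansion
    v_attended = v_mixed * a * scale                  # selective attention
    return pointwise_conv(v_attended, params["up"])   # up-mixing


rng = np.random.default_rng(0)
channels, reduced, groups = 8, 4, 2  # hypothetical sizes
params = {
    "down": rng.normal(size=(reduced, channels)),
    "grouped": [rng.normal(size=(reduced // groups, reduced // groups))
                for _ in range(groups)],
    "pointwise": rng.normal(size=(reduced, reduced)),
    "up": rng.normal(size=(channels, reduced)),
}
v = rng.normal(size=(channels, 6, 6))
out = visual_attention_condenser(v, params)
```

The output has the same shape as the input, so the module can be dropped into a network as a stand-alone block, while all of the attention computation happens at the reduced channel count and at half the spatial resolution.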
3 AttendNet Architecture Designs
Figure 2 shows the architecture designs of the two AttendNets, produced via the aforementioned machine-driven design exploration strategy that incorporates visual attention condensers in its design considerations. A number of interesting observations can be made about the AttendNet architecture designs. First, it can be observed that the AttendNet architecture designs comprise a mix of consecutive stand-alone visual attention condensers, which perform consecutive visual selective attention, and projection-expansion-projection-expansion (PEPE) modules for efficient feature representation. The PEPE module was discovered by the machine-driven exploration strategy and consists of a projection layer that reduces dimensionality via pointwise convolution, an expansion layer that increases dimensionality efficiently via depthwise convolution, a projection layer that reduces dimensionality again via pointwise convolution, and finally an expansion layer that increases dimensionality again via pointwise convolution.
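The efficiency of a PEPE module can be illustrated by counting its weights. The sketch below follows the four layers listed above; modelling the depthwise expansion as a depthwise convolution with a channel multiplier is an assumption, as are the 3x3 kernel and the channel sizes, none of which are taken from the actual AttendNet designs. Biases are ignored.

```python
def pepe_param_count(c_in, c_proj1, depth_multiplier, c_proj2, c_out,
                     kernel=3):
    """Count weights in a projection-expansion-projection-expansion module."""
    proj1 = c_in * c_proj1                  # pointwise projection
    c_dw = c_proj1 * depth_multiplier       # channels after depthwise step
    expand1 = kernel * kernel * c_dw        # depthwise expansion
    proj2 = c_dw * c_proj2                  # pointwise projection
    expand2 = c_proj2 * c_out               # pointwise expansion
    return proj1 + expand1 + proj2 + expand2


# Hypothetical configuration: 64 -> 16 -> 32 -> 16 -> 64 channels.
pepe = pepe_param_count(64, 16, 2, 16, 64)
standard = 3 * 3 * 64 * 64  # a plain 3x3 convolution, 64 -> 64 channels
```

For this hypothetical configuration the PEPE module uses well under a tenth of the weights of a plain 3x3 convolution with the same input and output channels.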
Second, it can be observed that the machine-driven design exploration strategy makes heavy use of visual attention condensers early in the AttendNet architecture, while relying on PEPE modules later in the network architecture. This interesting design choice may be a result of selective attention being more important at the earlier, low-level to medium-level stages of visual abstraction for image recognition, where it enables better focus on the relevant regions of interest critical to decision-making within a complex scene.
Third and finally, it can be observed that the AttendNet network architectures exhibit high architectural diversity at both the macroarchitecture level and the microarchitecture level. For example, at the macroarchitecture level, there is a heterogeneous mix of visual attention condensers, PEPE modules, spatial and pointwise convolutions, and fully-connected layers. At the microarchitecture level, the visual attention condensers and PEPE modules have a diversity of microarchitecture designs, as seen in the differences in channel configurations. This level of architectural diversity is a result of the machine-driven design exploration process, which has the benefit of determining the best architecture design at a fine-grained level to achieve a strong balance of network efficiency and accuracy for the specific task at hand. Based on these three observations, it can be seen that the proposed AttendNet network architectures are highly tailored for on-device image recognition for the edge, and that they show the merits of leveraging both visual attention condensers and machine-driven design exploration for achieving such highly efficient, high-performance deep neural networks.
4 Results and Discussion
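Table 1: Top-1 accuracy, number of parameters, and number of multiply-add operations of the tested efficient deep image recognition networks.
|Model|Top-1 Accuracy|Params|Mult-Adds|
|---|---|---|---|
|MobileNet-V1 MobileNetv1 ()| |3260K|567.5M|
|MobileNet-V2 MobileNetv2 ()| |2290K|299.7M|
|AttoNet-A alex2019attonets ()|73.0%|2970K|424.8M|
|AttoNet-B alex2019attonets ()|71.1%|1870K|277.5M|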
In this study, we evaluate the efficacy of the proposed low-precision AttendNets on the task of image recognition to empirically study the balance between accuracy and network efficiency. More specifically, we leverage ImageNet$_{50}$, a benchmark dataset derived from the popular ImageNet ImageNet () dataset that was introduced by Fang et al. ImageNet50 () for evaluating the performance of deep neural networks for on-device vision applications on the edge. To quantify accuracy and network efficiency, we computed the following performance metrics: i) top-1 accuracy, ii) the number of parameters (to quantify architectural complexity), and iii) the number of multiply-add operations (to quantify computational complexity). For comparative purposes, the same performance metrics were also evaluated on MobileNet-V1 MobileNetv1 (), MobileNet-V2 MobileNetv2 (), AttoNet-A alex2019attonets (), and AttoNet-B alex2019attonets (), four highly efficient deep image recognition networks that were all designed for on-device image recognition purposes.
Table 1 shows the top-1 accuracy, the number of parameters, and the number of multiply-add operations of the AttendNets alongside the four other tested efficient deep image recognition networks. From the quantitative results, it can be clearly observed that the proposed AttendNets achieved a significantly better balance between accuracy and architectural and computational complexity when compared to the other tested efficient deep neural networks. In terms of lowest architectural and computational complexity, AttendNet-B achieved significantly higher accuracy than MobileNet-V1 (7.2% higher) while requiring 4.17× fewer parameters, 16.7× lower weight memory requirements, and 3× fewer multiply-add operations. Compared to the similarly accurate state-of-the-art AttoNet-B, AttendNet-B achieved 0.6% higher accuracy while requiring 2.4× fewer parameters, 9.6× lower weight memory requirements, and 1.45× fewer multiply-add operations. In terms of the highest top-1 accuracy, AttendNet-A achieved significantly higher accuracy than MobileNet-V1 and MobileNet-V2 (8.7% higher and 4.5% higher, respectively) while requiring 2.35× fewer parameters, 9.4× lower weight memory requirements, and 2.1× fewer multiply-add operations than MobileNet-V1, and 1.65× fewer parameters, 6.6× lower weight memory requirements, and 1.1× fewer multiply-add operations than MobileNet-V2. Compared to the similarly accurate state-of-the-art AttoNet-A, AttendNet-A achieved slightly higher accuracy (0.2% higher) while requiring 2.1× fewer parameters, 8.4× lower weight memory requirements, and 1.53× fewer multiply-add operations.
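The reported reduction factors can be cross-checked with simple arithmetic. The sketch below makes two assumptions that are not stated explicitly in the text: that the baseline networks store weights at 32-bit precision (against the 8-bit weights of AttendNets), and that the accuracy differences are absolute percentage points. Under these assumptions, the weight memory reduction is the parameter reduction multiplied by four, and the accuracies of the networks can be inferred from the two AttoNet accuracies in Table 1. The inferred accuracies are implied values, not reported ones.

```python
def weight_memory_ratio(param_ratio, baseline_bits=32, attendnet_bits=8):
    """Weight memory reduction implied by a parameter reduction factor."""
    return param_ratio * baseline_bits / attendnet_bits


# (parameter reduction, reported weight memory reduction) from the text.
reported = {
    "AttendNet-B vs MobileNet-V1": (4.17, 16.7),
    "AttendNet-B vs AttoNet-B": (2.4, 9.6),
    "AttendNet-A vs MobileNet-V1": (2.35, 9.4),
    "AttendNet-A vs MobileNet-V2": (1.65, 6.6),
    "AttendNet-A vs AttoNet-A": (2.1, 8.4),
}
computed = {name: round(weight_memory_ratio(p), 1)
            for name, (p, _) in reported.items()}

# Accuracies implied by the stated differences (percentage points assumed).
attendnet_a = round(73.0 + 0.2, 1)            # relative to AttoNet-A
attendnet_b = round(71.1 + 0.6, 1)            # relative to AttoNet-B
mobilenet_v1_via_a = round(attendnet_a - 8.7, 1)
mobilenet_v1_via_b = round(attendnet_b - 7.2, 1)
mobilenet_v2 = round(attendnet_a - 4.5, 1)
```

All five computed memory ratios match the reported ones, and the two independent routes to the MobileNet-V1 accuracy agree with each other, which supports both assumptions.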
These quantitative performance results illustrate the efficacy of leveraging both visual attention condensers and machine-driven design exploration to create highly efficient deep neural network architectures tailored for on-device image recognition that strike a strong balance between accuracy and network complexity.
5 Conclusions
In this study, we introduced AttendNets, low-precision, highly compact yet high-performance deep neural networks tailored for on-device image recognition. The strong balance between accuracy and network efficiency exhibited by AttendNets was achieved through the introduction of visual attention condensers, stand-alone self-attention mechanisms that extend the concept of attention condensers to improve spatial-channel selective attention in an efficient manner, and the use of a machine-driven design exploration strategy to automatically determine the macroarchitecture and microarchitecture designs of the AttendNet deep self-attention architectures. We demonstrated the efficacy of the resulting AttendNets on the task of on-device image recognition, where they achieved a significantly better balance between accuracy and network efficiency than previously proposed efficient deep neural networks designed for on-device image recognition. As such, the visual attention condensers used to create AttendNets have the potential to be a useful building block for constructing highly efficient, high-performance deep neural networks for real-time visual perception tasks on low-power, low-cost edge devices for TinyML applications.
Given the promising results associated with AttendNets and the use of visual attention condensers to achieve highly efficient yet high-performance deep neural networks, future work involves exploring the effectiveness of AttendNets on downstream tasks such as object detection, semantic segmentation, and instance segmentation to empower a wider variety of vision-related TinyML applications, ranging from autonomous vehicles to smart city monitoring to wearable assistive technologies to remote sensing. We also aim to explore different design choices for the individual components of the visual attention condenser (e.g., mixing layers, embedding structure, condensation layer, expansion layer) and their impact on accuracy and efficiency. Finally, as alluded to in alex2020tinyspeech (), exploring the adversarial robustness of self-attention architectures based on visual attention condensers is a worthwhile endeavor, particularly given the recent focus on the robustness and dependability of deep learning.
6 Broader Impact
There has been tremendous recent interest in TinyML (tiny machine learning) as one of the key disruptive technologies towards widespread adoption of machine learning in industry and society. By enabling automated decision-making and predictions via machine learning to operate directly and in real time on low-cost, low-power embedded hardware such as microcontrollers and embedded microprocessors, TinyML facilitates a wide variety of applications, ranging from smart homes to smart factories to smart cities to smart grids, where a large number of intelligent nodes work in unison to improve productivity, efficiency, consistency, and performance. Furthermore, in mission-critical scenarios where privacy and security are crucial, such as in automotive, security, and aerospace, TinyML facilitates tetherless intelligence without the requirement of continuous connectivity, thus enabling greater dependability. As such, TinyML can have a tremendous impact across all facets of society and industry and thus has important socioeconomic implications that need to be considered. In our exploration of new strategies such as the proposed visual attention condensers in this study, the hope is that the insights gained from such explorations can be leveraged by the community to advance efforts in TinyML for greater adoption of machine learning as a ubiquitous technology.
References
- Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nature, vol. 521, no. 7553, p. 436, 2015.
- A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Advances in neural information processing systems, 2012, pp. 1097–1105.
- K. He et al., “Deep residual learning for image recognition,” arXiv:1512.03385, 2015.
- J. Hu, L. Shen, S. Albanie, G. Sun, and E. Wu, “Squeeze-and-excitation networks,” 2017.
- R. Girshick, “Fast r-cnn,” in Proceedings of the IEEE international conference on computer vision, 2015, pp. 1440–1448.
- W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg, “Ssd: Single shot multibox detector,” in European conference on computer vision. Springer, 2016, pp. 21–37.
- G. Lin, A. Milan, C. Shen, and I. Reid, “Refinenet: Multi-path refinement networks for high-resolution semantic segmentation,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 1925–1934.
- L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille, “Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs,” IEEE transactions on pattern analysis and machine intelligence, vol. 40, no. 4, pp. 834–848, 2018.
- K. He, G. Gkioxari, P. Dollár, and R. Girshick, “Mask r-cnn,” in Proceedings of the IEEE international conference on computer vision, 2017, pp. 2961–2969.
- Z. Q. Lin, B. Chwyl, and A. Wong, “Edgesegnet: A compact network for semantic segmentation,” 2019.
- B. Jacob et al., “Quantization and training of neural networks for efficient integer-arithmetic-only inference,” arXiv:1712.05877, 2017.
- W. Meng et al., “Two-bit networks for deep learning on resource-constrained embedded devices,” arXiv:1701.00485, 2017.
- M. Courbariaux, Y. Bengio, and J.-P. David, “Binaryconnect: Training deep neural networks with binary weights during propagations,” in Advances in neural information processing systems, 2015, pp. 3123–3131.
- S. Han, H. Mao, and W. J. Dally, “Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding,” arXiv preprint arXiv:1510.00149, 2015.
- G. Hinton, O. Vinyals, and J. Dean, “Distilling the knowledge in a neural network,” arXiv preprint arXiv:1503.02531, 2015.
- S. Ravi, “ProjectionNet: Learning efficient on-device deep networks using neural projections,” arXiv:1708.00630, 2017.
- A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam, “Mobilenets: Efficient convolutional neural networks for mobile vision applications,” arXiv preprint arXiv:1704.04861, 2017.
- M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L.-C. Chen, “Mobilenetv2: Inverted residuals and linear bottlenecks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 4510–4520.
- F. N. Iandola, S. Han, M. W. Moskewicz, K. Ashraf, W. J. Dally, and K. Keutzer, “Squeezenet: Alexnet-level accuracy with 50x fewer parameters and <0.5 MB model size,” arXiv preprint arXiv:1602.07360, 2016.
- M. J. Shafiee, F. Li, B. Chwyl, and A. Wong, “Squishednets: Squishing squeezenet further for edge device scenarios via deep evolutionary synthesis,” NIPS Workshop on Machine Learning on the Phone and other Consumer Devices, 2017.
- A. Wong, M. J. Shafiee, F. Li, and B. Chwyl, “Tiny ssd: A tiny single-shot detection deep convolutional neural network for real-time embedded object detection,” Proceedings of the Conference on Computer and Robot Vision, 2018.
- X. Zhang, X. Zhou, M. Lin, and J. Sun, “Shufflenet: An extremely efficient convolutional neural network for mobile devices,” 2017.
- N. Ma, X. Zhang, H.-T. Zheng, and J. Sun, “Shufflenet v2: Practical guidelines for efficient cnn architecture design,” in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 116–131.
- D. Bahdanau, K. Cho, and Y. Bengio, “Neural machine translation by jointly learning to align and translate,” 2014.
- A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, “Attention is all you need,” 2017.
- S. Woo, J. Park, J.-Y. Lee, and I. S. Kweon, “Cbam: Convolutional block attention module,” 2018.
- J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “Bert: Pre-training of deep bidirectional transformers for language understanding,” 2018.
- A. Wong, M. Famouri, M. Pavlova, and S. Surana, “Tinyspeech: Attention condensers for deep speech recognition neural networks on edge devices,” arXiv preprint arXiv:2008.04245, 2020.
- I. Bello, B. Zoph, A. Vaswani, J. Shlens, and Q. V. Le, “Attention augmented convolutional networks,” 2019.
- C.-H. Hsu, S.-H. Chang, D.-C. Juan, J.-Y. Pan, Y.-T. Chen, W. Wei, and S.-C. Chang, “Monas: Multi-objective neural architecture search using reinforcement learning,” arXiv preprint arXiv:1806.10332, 2018.
- T. Elsken, J. H. Metzen, and F. Hutter, “Multi-objective architecture search for cnns,” arXiv preprint arXiv:1804.09081, 2018.
- M. Tan, B. Chen, R. Pang, V. Vasudevan, and Q. V. Le, “Mnasnet: Platform-aware neural architecture search for mobile,” arXiv preprint arXiv:1807.11626, 2018.
- A. Wong, M. J. Shafiee, B. Chwyl, and F. Li, “Ferminets: Learning generative machines to generate efficient neural networks via generative synthesis,” Advances in neural information processing systems Workshops, 2018.
- A. Wong, “Netscore: Towards universal metrics for large-scale performance analysis of deep neural networks for practical usage,” arXiv preprint arXiv:1806.05512, 2018.
- B. Fang, X. Zeng, and M. Zhang, “Nestdnn: Resource-aware multi-tenant on-device deep learning for continuous mobile vision,” Proceedings of the 24th Annual International Conference on Mobile Computing and Networking, 2018.
- A. Wong, Z. Q. Lin, and B. Chwyl, “Attonets: Compact and efficient deep neural networks for the edge via human-machine collaborative design,” 2019.
- J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “Imagenet: A large-scale hierarchical image database,” 2009.