EdgeCNN: Convolutional Neural Network Classification Model with small inputs for Edge Computing
With the development of the Internet of Things (IoT), data is increasingly generated at the edge of the network. Processing tasks at the edge of the network can effectively mitigate personal privacy leaks and server overload. As a result, edge computing has attracted a great deal of attention and made substantial progress, including efficient convolutional neural network (CNN) models such as MobileNet and ShuffleNet. However, all of these networks are designed as general-purpose models that usually need to identify multiple targets when applied, so their input size is very large. In some specific cases, only a single target needs to be classified, so a network with small inputs can be designed to reduce computation. In addition, other efficient neural network models are primarily designed for mobile phones, whose faster memory access allows them to use group convolution. In particular, this paper finds that the recently widely used group convolution is not suitable for devices with very slow memory access. Therefore, the EdgeCNN of this paper is designed for edge computing devices with low memory access speed and low computing resources. EdgeCNN has been run successfully on the Raspberry Pi 3B+ at a speed of 1.37 frames per second. Its accuracy of facial expression classification on the FER-2013 and RAF-DB datasets exceeds that of other proposed networks that are compatible with the Raspberry Pi 3B+. The implementation of EdgeCNN is available at https://github.com/yangshunzhi1994/EdgeCNN
keywords: Internet of Things (IoT), edge computing, convolutional neural networks, target classification, efficiency
It is estimated that there will be 18 billion IoT devices by 2022 52 . The trend in IoT is to distribute and move computing from centralized cloud devices to edge devices, which are closer to data sources 53 . Computing performed at the data source is also called edge computing. We define edge computing as a way to process data on resource-constrained terminal devices. Edge computing has the potential to address the concerns of response time requirements, bandwidth cost saving, connectivity, data safety and privacy, and server overload 47 ; 48 ; 49 ; 55 ; 57 . However, many edge computing devices are characterized by limited computational and energy resources 56 ; 58 . At the same time, many tasks have strict requirements on computational cost. This means that running tasks that require high computing power on edge computing devices is very challenging. Currently, there are four main ways to solve this problem.
The first method is to perform data encryption on the edge computing device before uploading it to a more powerful server 10 ; 16 ; 17 ; 46 ; 54 . This approach introduces a slight computational overhead on the edge computing device and moves most workload to the server, which has sufficient computing resources. Moreover, the use of data encryption protects the privacy of users. However, the data transmission also increases the pressure on network bandwidth and introduces the possibility of network latency. In particular, when the network is inaccessible, the edge computing device cannot operate normally.
The second method is to run the task directly on the edge computing device using traditional machine learning methods, such as those in the field of facial expression classification 1 ; 2 ; 3 ; 4 ; 5 ; 6 . This is feasible because traditional machine learning methods require relatively little computing power and memory 14 . This means that they are relatively easy to run in resource-constrained edge computing devices. However, the feature attributes extracted by these traditional machine learning methods rely on manual design. Manually designed features are not very robust to the diversity of targets 11 . This problem also occurs in the field of facial expression recognition. As described in 12 , the features extracted by these traditional methods are generally low-level features. They are often effective in specific small datasets, but cannot sufficiently adjust to identifying new facial expression images.
The third method is to use multiple edge computing devices to perform a task together 53 ; 60 . This method needs to focus on computation and communication resource allocation and data sharing issues. It is particularly useful for tasks that require data to be collected from multiple edge computing devices for further processing. But for tasks that only require a single data source, this results in increased network load and transmission latency 61 .
The fourth method is to use an edge computing device to run deep learning methods, such as CNNs. The success of CNNs comes from their rich representation capacity and powerful generalization ability for image recognition 13 . The features extracted by CNNs are advanced and are highly robust to multi-style changes of targets. Unlike traditional machine learning methods, deep learning-based methods, using CNNs, extract optimal features with the desired characteristics directly from data 14 . Moreover, they also reduce the dependence on physics-based models and other pre-processing techniques 15 . The deep learning method is usually considered the most promising. However, deep learning requires high computing power, which means that running deep learning on edge computing devices is very challenging. Compute and memory demands of deep learning methods must be addressed to make them useful on IoT endnodes 59 .
To be able to run CNN models on edge computing devices, some efficient CNN models have been proposed 29 ; 30 ; 31 ; 36 ; 37 ; 38 ; 39 ; 27 ; 43 ; 9 ; 8 . However, all of these networks are designed as general-purpose models, and they usually need to identify multiple targets when applied. Therefore, the input to the network model is very large, which leads to a large number of parameters and requires more computational power (FLOPs) in a given situation. Target recognition requires finding and classifying targets at unknown locations, whereas target classification only needs to classify targets at known locations. In some specific cases, only the target needs to be classified. The input size of those general-purpose network models is generally 224 × 224, which contains multiple targets and related background. If these networks are applied to target classification, this can result in wasted resources and possibly poor performance. Therefore, this paper proposes a new CNN model for edge computing devices named EdgeCNN. EdgeCNN uses an input size of 44 × 44 because an image of this size can already be classified adequately. If the input size of the network is reduced, the learned features are coarse relative to those of a network with large inputs. However, classification is based on learning the similarities of individual features.
Note that the last layer of current deep learning classification networks is a fully connected layer. Take the example of identifying whether the target in the picture is "human". The "head", "body", etc., represent the feature maps learned by the fully connected layer. EdgeCNN's fully connected layer learns relatively large feature maps such as "head" and "body", while the fully connected layer in the 224 × 224 network learns the ears, eyes, etc. of the "head". The fully connected layer scores the input based on the similarity of these learned features. This means that as long as the network is well designed, EdgeCNN's classification does not suffer from low accuracy due to small inputs. In addition, since the learned features are relatively large, the fully connected layer requires only a low-dimensional feature vector. This not only reduces complexity, but also allows the network to focus on learning relatively large features.
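The saving from a smaller input can be made concrete. For a convolutional layer with a fixed kernel size and fixed channel counts, the number of multiply-accumulate operations (MACs) is proportional to the number of output positions, so, assuming the same layers and strides, shrinking the input from 224 × 224 to 44 × 44 reduces the convolutional cost by roughly (224/44)². The sketch below illustrates this; the layer shape (3 × 3 kernel, 32 to 64 channels) is a hypothetical example chosen for illustration and is not an EdgeCNN layer.

```python
def conv_macs(height, width, c_in, c_out, kernel):
    """Multiply-accumulates of a stride-1, same-padded convolution."""
    return height * width * c_in * c_out * kernel * kernel


# Hypothetical layer shape, used only for illustration (not taken from EdgeCNN).
C_IN, C_OUT, K = 32, 64, 3

large = conv_macs(224, 224, C_IN, C_OUT, K)   # general-purpose input size
small = conv_macs(44, 44, C_IN, C_OUT, K)     # small input size used in this paper
ratio = large / small

print(f"224x224 input: {large / 1e6:.1f} M MACs")
print(f"44x44 input:   {small / 1e6:.1f} M MACs")
print(f"reduction:     {ratio:.1f}x")
```

The ratio depends only on the spatial sizes, so it holds for any kernel size and channel configuration under the stated assumptions.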
To the best of our knowledge, this is the first paper discussing a network with small inputs for target classification. Therefore, this paper can compare the performance of networks with small inputs on the facial expression task. The recognition of facial expressions on edge computing devices has high practical value, such as detecting security risks in real time through cameras. In addition, we found that these efficient network models with small inputs have low accuracy on facial expression datasets. More importantly, the group convolution of other efficient network models is not suitable for edge computing devices with low memory access speeds. Therefore, we propose EdgeCNN in this paper, which runs on edge computing devices and has small inputs. The feasibility of EdgeCNN is demonstrated on the classification of facial expressions. The contributions of this paper are as follows:
- We propose EdgeCNN, based on edge computing devices with small inputs. The accuracy of EdgeCNN for facial expression classification exceeds that of other proposed networks with small inputs.
- We show experimentally that group convolution is not optimal in all cases. On edge computing devices with slow memory access, a network with group convolution runs slower than one without it. The actual running speed of the EdgeCNN model without group convolution on the edge computing device basically meets the actual needs.
- We run the EdgeCNN model on the Raspberry Pi 3B+ to show that facial expressions can be classified and recognized by using deep learning on edge computing devices.
The remainder of this paper is organized as follows: Section 2 briefly introduces the principles of ResNet 22 , DenseNet 23 , and other efficient networks and explains the superiority of the DenseNet principle on resource-constrained devices. Section 3 introduces the design of EdgeCNN in detail. Section 4 shows a performance comparison between EdgeCNN and other efficient networks with small inputs on the facial expression datasets. Section 5 is a summary of this paper.
2 Related Work
Since the advent of the first CNN model in 1989 18 , the main feature extraction networks that have been developed include LeNet-5 18 , AlexNet 19 , VGG 20 , GoogLeNet 21 , ResNet 22 , and DenseNet 23 . Among these, the best and most widely used are ResNet and DenseNet because these two networks are very deep compared with previous networks. Researchers have shown that a deeper feature extraction network can learn more advanced features and can classify better. Therefore, we reference the theoretical methods of these two feature extraction networks to design an efficient network suitable for facial expression classification.
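In a general CNN, the output at the $\ell$-th layer is:
$$x_\ell = H_\ell(x_{\ell-1})$$
In ResNet, the output at the $\ell$-th layer is:
$$x_\ell = H_\ell(x_{\ell-1}) + x_{\ell-1}$$
In DenseNet, the output at the $\ell$-th layer is:
$$x_\ell = H_\ell([x_0, x_1, \ldots, x_{\ell-1}])$$
Among these, $H_\ell(\cdot)$ is a composite function that represents a combined operation. It may include a series of operations such as batch normalization 25 , convolution, and so on. The notation $[x_0, x_1, \ldots, x_{\ell-1}]$ denotes the concatenation of the feature maps produced by layers $0, \ldots, \ell-1$.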
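The three connectivity patterns (plain, residual, and dense) can be sketched in a few lines. This is a toy illustration only: features are plain Python lists of channel values, and `toy_H` is a hypothetical stand-in for the composite function, not the convolutional operation used by any of the cited networks.

```python
def toy_H(layer, x, out_channels):
    """Toy stand-in for the composite function of a layer (hypothetical)."""
    mean = sum(x) / len(x)
    return [mean + layer] * out_channels


def plain_forward(x0, num_layers):
    x = x0
    for layer in range(1, num_layers + 1):
        x = toy_H(layer, x, len(x))                  # output depends only on the previous layer
    return x


def resnet_forward(x0, num_layers):
    x = x0
    for layer in range(1, num_layers + 1):
        h = toy_H(layer, x, len(x))
        x = [a + b for a, b in zip(h, x)]            # identity skip connection (element-wise sum)
    return x


def densenet_forward(x0, num_layers, growth_rate):
    features = [x0]                                  # every earlier output is kept for reuse
    for layer in range(1, num_layers + 1):
        concat = [v for f in features for v in f]    # concatenation of all earlier outputs
        features.append(toy_H(layer, concat, growth_rate))
    return [v for f in features for v in f]


x0 = [1.0, 2.0, 3.0, 4.0]
print(len(plain_forward(x0, 3)))         # channel count is unchanged
print(len(resnet_forward(x0, 3)))        # channel count is unchanged
print(len(densenet_forward(x0, 3, 2)))   # 4 input channels + 3 layers * growth rate 2
```

In the dense variant each layer contributes only `growth_rate` new channels, which is the feature-reuse property the following discussion relies on.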
The feature of the $\ell$-th layer of ResNet comes only from the features of the previous layer and the features obtained by the composite operation of the $\ell$-th layer. The network depth is continuously increased through the identity mapping of skip connections to obtain better network performance. From another perspective, ResNet continuously abstracts features through identity mapping to obtain more advanced features. Consequently, there will be many transition features in the middle of the network, which need to be saved in memory when the program is running. Therefore, ResNet is an unwise basis for the design of an efficient network on a resource-constrained device.
The feature of the $\ell$-th layer of DenseNet is derived from all previous layers, which improves the performance of the model through feature reuse. DenseNet can be divided into multiple dense blocks, and the feature maps of all dense blocks have the same size. Each layer’s feature map in a dense block is passed to all subsequent layers in the block. The last layer of feature mapping in each dense block is used as input for all subsequent blocks. Thus, each layer in DenseNet generates only a small number of features, but after concatenation, the last layer of the network can obtain many feature maps. In the extreme case, only one feature is learned in each layer of the network; naturally, this is very difficult to achieve. Consequently, DenseNet greatly reduces the number of intermediate transition features. This is equivalent to reducing the consumption of memory resources, which is very important for resource-constrained devices. Moreover, we can also exploit the shared memory allocation strategy in Ref. 28 to further reduce the use of memory.
However, substantial computing power is required to implement an efficient network with low memory consumption and high accuracy by using DenseNet’s principles alone. For example, DenseNet (growthRate=12, depth=40) with a 44 × 44-pixel input image executes 166.9 MFLOPs. The current main methods of improving network efficiency are group convolution 31 ; 38 ; 39 , depthwise separable convolution 36 ; 30 ; 31 ; 37 , and grouping 31 ; 37 ; 38 ; 39 ; 40 to reduce the computational power of the network model. In fact, depthwise separable convolution is a special case of group convolution. Therefore, the problem of network computing power is currently solved mainly by the two mainstream methods of group convolution and grouping. Note that group convolution reduces the amount of computation but increases the cost of memory access 37 . Therefore, in this paper, a network using group convolution (EdgeCNN-G) and a network without group convolution (EdgeCNN) are designed. The grouping method addresses the problem of information flow between groups after grouping. Currently, there are three methods: the channel shuffle of ShuffleNet 31 ; 37 , the learned group convolution of CondenseNet 40 , and the interleaved group convolutions of IGCV 38 ; 39 . In the channel shuffle and interleaved group convolution methods, assigning features to disjoint groups can hinder the effective reuse of features in the DenseNet network 40 . Therefore, we use the learned group convolution of CondenseNet, which allows the network to automatically discover good connectivity patterns during learning to reduce the computational power.
CondenseNet is an upgraded version of the DenseNet network. However, the computing power required by the CondenseNet network is still very large. For an input image size of 44 × 44 pixels, the CondenseNet network performs 74.11 MFLOPs. We directly reduced parameters of CondenseNet, such as the growth rate, but found that the accuracy on datasets for facial expression classification and recognition was not good. There is a no-free-lunch theorem in machine learning, which states that no algorithm can be optimal on all sets of problems 32 . The no-free-lunch theorem tells us that we need to select and construct appropriate deep learning models based on specific sets of problems and specific datasets.
To further reduce the parameters and computational power required by the network model for edge computing, we adopt DenseNet’s feature reuse method and learned group convolution to design an EdgeCNN-G model with small inputs. To isolate the effect of group convolution, we also design an EdgeCNN model with small inputs that uses only DenseNet’s feature reuse method.
First, we use the dlib 50 library to capture faces. The implementation of this part can be viewed in our other project (https://github.com/tobysunx/face_recognition). It captures a face at random intervals of 0.6 to 3 seconds, and capturing faces from the video runs smoothly without stalling. We use dlib because its face detector recognizes faces very accurately 35 ; 7 and because it consumes much less computing power than a deep convolutional network model. In this paper, the face image captured by the dlib library is resized to a uniform size of 44 × 44 pixels. A face image of 44 × 44 pixels can still be classified sufficiently well.
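The computational saving of group convolution, and the sense in which the depthwise convolution is its special case, can be illustrated with a simple operation count. The sketch below counts multiply-accumulates (MACs) for one layer; the layer shape is a hypothetical example, and the count deliberately ignores memory access cost, which is the quantity that group convolution increases.

```python
def conv_macs(height, width, c_in, c_out, kernel, groups=1):
    """MACs of a stride-1, same-padded convolution split into `groups` groups.

    Each output channel only sees c_in / groups input channels, so the cost
    is divided by the number of groups.
    """
    if c_in % groups or c_out % groups:
        raise ValueError("channel counts must be divisible by groups")
    return height * width * (c_in // groups) * c_out * kernel * kernel


# Hypothetical layer shape, for illustration only.
H, W, C, K = 22, 22, 64, 3

standard = conv_macs(H, W, C, C, K)               # ordinary convolution
grouped = conv_macs(H, W, C, C, K, groups=4)      # group convolution with 4 groups
depthwise = conv_macs(H, W, C, C, K, groups=C)    # one group per channel (depthwise)

print(standard, grouped, depthwise)
```

Setting `groups` equal to the channel count gives the depthwise case, which is why the text treats depthwise separable convolution as a special case of group convolution.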
Next, the CNN model is used for feature extraction and classification. This is equivalent to translating a multi-target recognition problem to a single-target classification problem. Therefore, it decreases the size of the network input, reducing the number of parameters and reducing the computational power.
Although this paper demonstrates the feasibility of EdgeCNN and EdgeCNN-G on facial expression datasets, they can also be applied to the classification and recognition of other targets. However, to use them, it is necessary to find the location of the target first. For example, the cascading 45 method can be used. First, a network input image of 224 × 224 pixels is used. Second, the RPN 44 method is used to find the target. Finally, EdgeCNN and EdgeCNN-G are employed to determine the specific categories.
3 Architecture Design
One of the differences between the original DenseNet network and EdgeCNN and EdgeCNN-G is the extraction of the low-level layer features, as shown in Table 1. This part of the network contains many details, so we need to extract as many features as possible. If the low-level layers do not learn enough features, it is difficult to extract effective advanced features in the subsequent layers. However, a larger network is not necessarily better: this would waste memory, time, and computing power. Considering that a facial expression occupies a certain area of the image, it is necessary to appropriately increase the size of the convolution kernel in the low-level layers to adapt to the size of the facial expressions. Increasing the size of the convolution kernel enlarges the receptive field, which allows more detailed information to be learned at the cost of more parameters. In the low-level layer feature of DenseNet, a convolutional layer is used, which requires very substantial computing power. This paper uses a convolutional layer to obtain the low-level layer features of the image, and this is sufficient for networks with small inputs.
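Table 4: Overall structure of the EdgeCNN model.

| Stage | Layer | Output size |
|---|---|---|
| EdgeCNN’s EdgeBlock (1) | | 22 × 22 × 64 |
| Transition Layer (1) | 2 × 2 average pool, stride=2 | 11 × 11 × 64 |
| EdgeCNN’s EdgeBlock (2) | | 11 × 11 × 96 |
| Transition Layer (2) | 2 × 2 average pool, stride=2 | 5 × 5 × 96 |
| EdgeCNN’s EdgeBlock (3) | | 5 × 5 × 152 |
| Classification Layer | 5 × 5 global average pool | 1 × 1 × 152 |
| | 152D fully-connected, softmax | |

Table 5: Overall structure of the EdgeCNN-G model.

| Stage | Layer | Output size |
|---|---|---|
| EdgeCNN-G’s EdgeBlock (1) | | 22 × 22 × 64 |
| Transition Layer (1) | 2 × 2 average pool, stride=2 | 11 × 11 × 64 |
| EdgeCNN-G’s EdgeBlock (2) | | 11 × 11 × 96 |
| Transition Layer (2) | 2 × 2 average pool, stride=2 | 5 × 5 × 96 |
| EdgeCNN-G’s EdgeBlock (3) | | 5 × 5 × 152 |
| Classification Layer | 5 × 5 global average pool | 1 × 1 × 152 |
| | 152D fully-connected, softmax | |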
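The output sizes listed for the overall structure (22 × 22 × 64, 11 × 11 × 96, 5 × 5 × 152) are consistent with DenseNet-style channel growth at the growth rate of 8 selected for EdgeCNN. The sketch below checks this bookkeeping under two assumptions that are not stated in the tables: each layer of an EdgeBlock appends exactly `growth_rate` channels, and a transition layer only pools without changing the channel count. The layer counts it prints are therefore inferred, not quoted from the paper.

```python
GROWTH_RATE = 8  # growth rate selected for EdgeCNN in this paper


def layers_needed(c_in, c_out, growth_rate=GROWTH_RATE):
    """Number of dense layers that grow c_in channels to c_out channels (inferred)."""
    extra = c_out - c_in
    if extra < 0 or extra % growth_rate:
        raise ValueError("channel counts are not consistent with the growth rate")
    return extra // growth_rate


def pooled(size, kernel=2, stride=2):
    """Spatial size after an unpadded pooling layer."""
    return (size - kernel) // stride + 1


print(layers_needed(64, 96))     # EdgeBlock (2): 64 -> 96 channels
print(layers_needed(96, 152))    # EdgeBlock (3): 96 -> 152 channels
print(pooled(22), pooled(11))    # transition layers: 22 -> 11 and 11 -> 5
```

The depth of EdgeBlock (1) cannot be inferred in the same way because the number of channels entering it is not listed here.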
In the composite function $H_\ell(\cdot)$, DenseNet uses the pre-activation mode in Ref. 24 , as shown in convolution block 1 of Table 2 and Table 3. That is, the batch normalization layer (BN) 25 is followed by a rectified linear unit (ReLU) 26 and a convolutional layer. However, in an experiment, we found that this mode is less accurate in networks with small inputs. Like Pelee, we use the traditional post-activation mode 25 ; that is, the convolutional layer is followed by a batch normalization layer and a ReLU layer. Similarly, we also modified the pre-activation mode in the learned group convolution (L-Conv is used to represent the learned group convolution in Table 3, where the parameter G represents the number of groups and C is the condensation factor), replacing it with the post-activation mode, which obtained good performance in the experiment. Inspired by MobileNetV2 30 , we eliminate the activation layer in the second convolution block of the bottleneck layer. This prevents nonlinearities from destroying too much information; moreover, it reduces element-wise operations 37 . This paper refers to the structure modified from the bottleneck layer of DenseNet as EdgeBlock. The only difference between the EdgeCNN and EdgeCNN-G models is whether the learned group convolution is used in the EdgeBlock, as shown in Table 2 and Table 3, respectively. The overall structure of the EdgeCNN and EdgeCNN-G models is shown in Table 4 and Table 5, respectively.
We use two consecutive convolutional layers in the EdgeBlock, as shown in Table 2 and Table 3. This is because a single convolutional layer is not sufficient to accommodate the size of the facial expressions. In theory, when computational complexity is not a concern, deep learning networks tend to favor larger convolution kernels, because a larger receptive field can learn richer features. In addition, we eliminated the 1 × 1 convolutional layer because this layer is less accurate in networks with small inputs and adds extra memory consumption. A detailed explanation follows.
First, in other efficient networks the 1 × 1 convolutional layer is attractive because it has few parameters and low computational complexity. The trained model is smaller, which is important on some space-constrained terminal devices. However, in general, the storage space of the edge computing device is still sufficient. For example, the Raspberry Pi 3B+ uses an SD card for storage; SD cards are cheap compared with smart devices and have ample storage space. Therefore, this advantage of the 1 × 1 convolutional layer can be ignored.
Second, in other efficient networks, a large number of 1 × 1 convolutional layers are used because they reduce the number of input data channels. The 1 × 1 convolutional layer greatly reduces the required computational power, improving calculation efficiency while maintaining accuracy. Although the 1 × 1 convolutional layer reduces the number of parameters, adding additional layers consumes more memory. When running a convolutional neural network, in addition to the parameters, the output of each network layer also needs to occupy memory. When a 1 × 1 convolutional layer is added to EdgeCNN and EdgeCNN-G to reduce the dimension of the data, our experiments show that the added extra layer makes the whole network consume the same amount of memory. Memory is important on resource-constrained edge computing devices. In addition, the original convolutional layer, with its larger kernel, can learn more abundant features than the 1 × 1 convolutional layer. Most importantly, our experiments found that the increased number of convolutional layers reduces accuracy in a network with small inputs.
Finally, the use of a 1 × 1 convolutional layer has another purpose: to increase nonlinearity, which allows the neural network to better fit the data. However, adding an activation layer or adding a bias to the convolutional layer can also increase nonlinearity. Therefore, we eliminate the 1 × 1 convolutional layer and add bias to all convolutional layers, including learned group convolution, to increase the nonlinear features of the network. Moreover, in EdgeCNN and EdgeCNN-G, no layer, including DenseNet’s transition layers, uses a 1 × 1 convolutional layer. This is because we found that the presence of a 1 × 1 convolutional layer in a model with small inputs leads to performance degradation.
In summary, the performance of the 1 × 1 convolutional layer in networks with small inputs is poor, which may be the reason for the poor performance of other efficient networks. The EdgeCNN model designed in this paper is shown in Table 4, with a selected growth rate of 8. After the model is designed, the loss function and the optimizer must be chosen. This paper uses the softmax loss function and the stochastic gradient descent (SGD) optimizer. We use SGD to minimize the softmax loss with a learning rate of 1e-2 and a weight decay of 5e-4. After 80 epochs, the learning rate drops by 0.1 every 5 epochs.
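The learning rate schedule described above can be written as a small step function. This is a sketch under stated assumptions: epochs are counted from 0, the first decay is applied 5 epochs after epoch 80, and "dropped by 0.1" is read as multiplying the rate by 0.1. The text does not settle that last point (a reduction of 10 percent per step, that is a factor of 0.9, is another possible reading), so `decay_factor` is left as a parameter. All names are illustrative.

```python
def learning_rate(epoch, base_lr=1e-2, decay_start=80, decay_every=5, decay_factor=0.1):
    """Step schedule: constant until decay_start, then scaled every decay_every epochs.

    decay_factor=0.1 is an assumed reading of "dropped by 0.1"; pass 0.9 for
    the alternative reading (a 10 percent reduction per step).
    """
    if epoch <= decay_start:
        return base_lr
    steps = (epoch - decay_start) // decay_every
    return base_lr * decay_factor ** steps


for epoch in (0, 80, 84, 85, 90):
    print(epoch, learning_rate(epoch))
```

The weight decay of 5e-4 is a separate optimizer setting and is not part of this schedule.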
Table 6: Performance of EdgeCNN, EdgeCNN-G, and other efficient networks on the Raspberry Pi 3B+ (FPS: frames per second).

| Model | Parameters | FLOPs | Speed | Memory consumed | FER-2013 Acc | RAF-DB Acc |
|---|---|---|---|---|---|---|
| SqueezeNet 29 | 0.70 MB | 18.12 M | 3.00 FPS | 18.6% | 68.26% | 79.98% |
| SqueezeNext 43 | 0.56 MB | 9.84 M | 2.16 FPS | 19.0% | 67.48% | 78.87% |
| MobileNet 36 | 3.06 MB | 27.69 M | 0.21 FPS | 24.2% | 67.84% | 82.59% |
| MobileNetV2 30 | 2.12 MB | 16.57 M | 0.25 FPS | 25.0% | 69.46% | 81.16% |
| MobileNetV3 8 | 1.18 MB | 4.76 M | 0.92 FPS | 20.8% | 66.39% | 79.23% |
| ShuffleNet 31 | 0.88 MB | 7.42 M | 0.94 FPS | 20.5% | 68.51% | 78.68% |
| ShuffleNetV2 37 | 1.20 MB | 8.14 M | 0.62 FPS | 20.6% | 67.84% | 79.88% |
| IGCV3 39 | 2.11 MB | 17.98 M | 0.24 FPS | 30.6% | 69.90% | 81.90% |
| EfficientNet 9 | 6.82 MB | 28.94 M | 0.13 FPS | 32.8% | 69.43% | 81.22% |
| EdgeCNN-G | 0.40 MB | 2.7 M | 0.87 FPS | 19.7% | 71.80% | 84.90% |
| EdgeCNN | 0.40 MB | 52.28 M | 1.37 FPS | 19.1% | 71.80% | 85.13% |
Our experiments were designed to take a model trained on a graphics processing unit (GPU) and run it on the Raspberry Pi 3B+ to perform facial expression recognition. Because the Raspberry Pi 3B+ uses a central processing unit (CPU) instead of a GPU, we needed to compare the real running speed and resource consumption of each model in the final use environment. As described in ShuffleNetV2 37 , the latest cuDNN library is specially optimized for 3 × 3 convolutions, so a 3 × 3 convolutional layer is unlikely to be nine times slower than a 1 × 1 convolutional layer. Therefore, this paper compares the actual running speed and memory consumption of each efficient network on the Raspberry Pi 3B+, as shown in Table 6 (in which FPS stands for frames per second).
The total memory size of the Raspberry Pi 3B+ is 875 MB, and the input image size for all efficient network models was 44 × 44 pixels in our experiments. Because the other efficient networks have not been evaluated on facial expression datasets, we also needed to tune their parameters. EdgeCNN and other efficient network models with small inputs were evaluated on the FER-2013 33 and RAF-DB 34 datasets. The accuracy values shown are the highest accuracy achieved after parameter tuning.
The FER-2013 33 dataset contains 28,709 training images, 3,589 validation images, and 3,589 test images, with seven expression labels (anger, disgust, fear, happiness, sadness, surprise, and neutral). The size of each picture is 48 × 48 pixels. First, we randomly cropped each image into 10 images of 44 × 44 pixels. Then, the average score (over all 10 images) of each facial expression category was calculated by the efficient network to determine the category of the whole picture.
The basic facial expression dataset of the Real-world Affective Face Database (RAF-DB) 34 has 12,271 training images and 3,068 test images and also contains seven expression labels. The size of each picture is 100 × 100 pixels. First, we resized each picture to 48 × 48 pixels. Second, we randomly cropped each image into 10 images of 44 × 44 pixels. Finally, the average score of each facial expression category was calculated by the efficient network to determine the category of the whole picture.
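The crop-and-average evaluation used for both datasets can be sketched as follows. This is a minimal illustration with pure Python lists rather than the authors' implementation: `dummy_scores` is a hypothetical stand-in for the trained network, and the function names, the fixed random seed, and the constant test image are assumptions made for the example.

```python
import random


def random_crop(image, size, rng):
    """Cut a size x size window at a random position from a 2-D list of pixels."""
    height, width = len(image), len(image[0])
    top = rng.randint(0, height - size)
    left = rng.randint(0, width - size)
    return [row[left:left + size] for row in image[top:top + size]]


def predict_with_crops(image, score_fn, num_classes, crop_size=44, num_crops=10, seed=0):
    """Average the class scores of several random crops and pick the best class."""
    rng = random.Random(seed)
    totals = [0.0] * num_classes
    for _ in range(num_crops):
        scores = score_fn(random_crop(image, crop_size, rng))
        for c in range(num_classes):
            totals[c] += scores[c]
    averages = [t / num_crops for t in totals]
    best = max(range(num_classes), key=averages.__getitem__)
    return best, averages


def dummy_scores(crop):
    """Hypothetical stand-in for the trained network: 7 scores from mean brightness."""
    mean = sum(map(sum, crop)) / (len(crop) * len(crop[0]))
    return [1.0 if c == int(mean) % 7 else 0.0 for c in range(7)]


image = [[3] * 48 for _ in range(48)]   # constant 48 x 48 test image
label, averages = predict_with_crops(image, dummy_scores, num_classes=7)
print(label, averages)
```

Averaging over several crops makes the prediction less sensitive to where the face sits inside the 48 × 48 frame.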
Because there are multiple versions of other efficient networks (for example, SqueezeNext 43 has 1.0-SqNxt-23, 1.0-SqNxt-44, and other models), we only used the models that were detailed in the respective paper or that have good performance. That is, SqueezeNext 43 uses the 1.0-SqNxt-23 model, SqueezeNet 29 uses the 1.0 version of the model, MobileNet 36 and MobileNetV2 30 use a 1× expansion factor, MobileNetV3 8 uses the MobileNetV3-Small model, ShuffleNet 31 uses the ShuffleNet 1× (g=3) model, ShuffleNetV2 37 uses the ShuffleNet v2 1× model, IGCV1 38 uses the IGC-L24M2 model, IGCV3 39 uses the IGCV3-D 1.0 model, MixNet 51 uses the MixNet-L model, and EfficientNet 9 uses the EfficientNet-B0 model.
The performance of EdgeCNN, EdgeCNN-G, and other efficient networks is shown in Table 6 and Table 7. In Table 7, the speed and memory of the IGCV1 and Pelee models are shown as Null, indicating that these two network models are too large to run on the Raspberry Pi 3B+. They can be run by reducing the size of the network parameters, but the performance is even worse than that shown. As shown in Table 6, although the SqueezeNet and SqueezeNext models run very fast, their accuracy is not high. EdgeCNN and EdgeCNN-G are therefore advantageous in terms of accuracy. In addition, EdgeCNN-G outperforms other efficient network models in terms of computing power. Although the computational power is not necessarily related to speed, the computational power required by the model is particularly important if the edge computing device needs to perform multiple tasks.
We can discuss the drawbacks of group convolution from the FLOPs and speeds in Table 6. In efficient network design, group convolution was popularized by MobileNet 36 and IGCV1 38 . The SqueezeNet 29 , SqueezeNext 43 , and EdgeCNN models do not use group convolution. The remaining network models use group convolution. In terms of actual running speed on the Raspberry Pi 3B+, the network models that use group convolution are slower than the network models that do not. The reason for this is that group convolution increases the cost of memory access 37 . At the same time, the memory access speed of the Raspberry Pi 3B+ is very slow.
We need to pay attention to the fact that current efficient network models are mainly designed for embedded devices such as mobile phones. We measured the memory access speed of the Raspberry Pi 3B+ at 22.61 MB/s, while the average mobile phone can reach 222.26 MB/s. This means that those embedded devices have roughly 10 times faster memory access than edge computing devices. Therefore, they can use group convolution to design an efficient network model. However, it can be seen from the experiments in Table 6 that group convolution is not suitable for edge computing devices with limited computing resources. The use of group convolution requires a comprehensive consideration of the actual network operating environment.
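The relation between FLOPs, group convolution, and measured speed can be checked directly against the figures reported in Table 6. The sketch below copies the FLOPs and FPS columns from that table; the group-convolution flag follows the text (SqueezeNet, SqueezeNext, and EdgeCNN do not use it, the remaining models do). The variable names are illustrative.

```python
# (model, MFLOPs, FPS on the Raspberry Pi 3B+, uses group convolution), from Table 6
TABLE_6 = [
    ("SqueezeNet", 18.12, 3.00, False),
    ("SqueezeNext", 9.84, 2.16, False),
    ("MobileNet", 27.69, 0.21, True),
    ("MobileNetV2", 16.57, 0.25, True),
    ("MobileNetV3", 4.76, 0.92, True),
    ("ShuffleNet", 7.42, 0.94, True),
    ("ShuffleNetV2", 8.14, 0.62, True),
    ("IGCV3", 17.98, 0.24, True),
    ("EfficientNet", 28.94, 0.13, True),
    ("EdgeCNN-G", 2.7, 0.87, True),
    ("EdgeCNN", 52.28, 1.37, False),
]

fps_without = [fps for _, _, fps, grouped in TABLE_6 if not grouped]
fps_with = [fps for _, _, fps, grouped in TABLE_6 if grouped]

# Slowest model without group convolution vs. fastest model with it.
print(min(fps_without), max(fps_with))

by_flops = sorted(TABLE_6, key=lambda row: row[1])    # fewest FLOPs first
by_speed = sorted(TABLE_6, key=lambda row: -row[2])   # fastest first
print([name for name, *_ in by_flops])
print([name for name, *_ in by_speed])
```

On these figures every model without group convolution is faster than every model with it, and EdgeCNN has the largest FLOP count yet the third-highest frame rate, which is the sense in which FLOPs alone do not predict speed on this device.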
Ref. 37 showed that the ShuffleNetV2 network model is more accurate than MobileNetV2 on the ImageNet 2012 dataset 41 ; 42 . However, as shown in Table 6, in networks with small inputs (as designed in this paper), ShuffleNetV2 has lower accuracy than MobileNetV2 on the FER-2013 and RAF-DB datasets. Therefore, for practical applications, we need to design the appropriate network model based on specific problems.
This paper proposes the EdgeCNN-G and EdgeCNN models, which can perform deep learning tasks directly on edge computing devices. EdgeCNN-G uses group convolution and EdgeCNN does not. It is worth noting that current efficient network models are mainly designed for embedded devices, such as mobile phones. However, the computing resources of most edge computing devices, such as the Raspberry Pi, are scarce. The memory access speed of the Raspberry Pi is only about 1/10 of that of an average mobile phone. Our experiments show that group convolution is not suitable for edge computing devices with low memory access speed, such as the Raspberry Pi. Therefore, the use of group convolution requires a comprehensive consideration of the actual network operating environment.
EdgeCNN and EdgeCNN-G have been run successfully on the Raspberry Pi 3B+. EdgeCNN runs on the Raspberry Pi 3B+ at 1.37 FPS, which basically meets the actual needs. The parameter counts and accuracy of the EdgeCNN and EdgeCNN-G models improve on other proposed network models that are compatible with the Raspberry Pi 3B+. However, the EdgeCNN and EdgeCNN-G models have scope for improvement in terms of speed. We tried to use the Intel Movidius Neural Compute Stick 2 (NCS 2; OpenVINO toolkit: https://docs.openvinotoolkit.org/latest/index.html) on the Raspberry Pi 3B+ to increase the speed of EdgeCNN and EdgeCNN-G, but NCS 2 does not support feature detection, which is a necessary condition for EdgeCNN and EdgeCNN-G. In addition, the range of edge computing devices is very broad, including, for example, the Advanced RISC Machine (ARM) chip. Currently, the only edge computing devices supported by NCS 2 are Raspberry Pi boards. This means that NCS 2 is not yet widely applicable to edge computing devices. Moreover, we need a ready-made CNN model before using NCS 2. Therefore, we still need to focus on using deep learning methods to improve the speed of task processing on edge computing devices.
- (1) Suja P, Tripathi S. Real-time emotion recognition from facial images using Raspberry Pi II[C]//2016 3rd International Conference on Signal Processing and Integrated Networks (SPIN). IEEE, 2016: 666-670.
- (2) Priyanka P, Kumar T N R. Real-time Facial Expression Recognition System using Raspberry Pi. 2017.
- (3) Mano L Y, Faiçal B S, Nakamura L H V, et al. Exploiting IoT technologies for enhancing Health Smart Homes through patient identification and emotion recognition[J]. Computer Communications, 2016, 89: 178-190.
- (4) Chavan P M, Jadhav M C, Mashruwala J B, et al. Real time emotion recognition through facial expressions for desktop devices[J]. International Journal of Emerging Science and Engineering (IJESE), 2013, 1(7): 104-108.
- (5) Turabzadeh S, Meng H, Swash R, et al. Facial Expression Emotion Detection for Real-Time Embedded Systems[J]. Technologies, 2018, 6(1): 17.
- (6) Suk M, Prabhakaran B. Real-time mobile facial expression recognition system-a case study[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops. 2014: 132-137.
- (7) Cheney J, Klein B, Jain A K, et al. Unconstrained face detection: State of the art baseline and challenges[C]//2015 International Conference on Biometrics (ICB). IEEE, 2015: 229-236.
- (8) Howard A, Sandler M, Chu G, et al. Searching for MobileNetV3[J]. arXiv preprint arXiv:1905.02244, 2019.
- (9) Tan M, Le Q V. EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks[J]. arXiv preprint arXiv:1905.11946, 2019.
- (10) Gilad-Bachrach R, Dowlin N, Laine K, et al. Cryptonets: Applying neural networks to encrypted data with high throughput and accuracy[C]//International Conference on Machine Learning. 2016: 201-210.
- (11) Zhou X, Wang K, Li L. Review of object detection based on deep learning[J]. Electronic Measurement Technology, 2017, 40(11): 89-93. (in Chinese)
- (12) Chang T, Wen G, Hu Y, et al. Facial Expression Recognition Based on Complexity Perception Classification Algorithm[J]. arXiv preprint arXiv:1803.00185, 2018.
- (13) Yang B, Yan J, Lei Z, et al. Craft objects from images[J]. arXiv preprint arXiv:1604.03239, 2016.
- (14) Ko B. A brief review of facial emotion recognition based on visual information[J]. Sensors, 2018, 18(2): 401.
- (15) Walecki R, Pavlovic V, Schuller B, et al. Deep structured learning for facial action unit intensity estimation[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017: 3405-3414.
- (16) Tian Y, Yuan J, Yu S, et al. LEP-CNN: A Lightweight Edge Device Assisted Privacy-preserving CNN Inference Solution for IoT[J]. arXiv preprint arXiv:1901.04100, 2019.
- (17) Jiang L, Tan R, Lou X, et al. On Lightweight Privacy-Preserving Collaborative Learning for Internet-of-Things Objects[J]. 2019.
- (18) LeCun Y, Bottou L, Bengio Y, et al. Gradient-based learning applied to document recognition[J]. Proceedings of the IEEE, 1998, 86(11): 2278-2324.
- (19) Krizhevsky A, Sutskever I, Hinton G E. Imagenet classification with deep convolutional neural networks[C]//Advances in neural information processing systems. 2012: 1097-1105.
- (20) Simonyan K, Zisserman A. Very deep convolutional networks for large-scale image recognition[J]. arXiv preprint arXiv:1409.1556, 2014.
- (21) Szegedy C, Liu W, Jia Y, et al. Going deeper with convolutions[C]//Proceedings of the IEEE conference on computer vision and pattern recognition. 2015: 1-9.
- (22) He K, Zhang X, Ren S, et al. Deep residual learning for image recognition[C]//Proceedings of the IEEE conference on computer vision and pattern recognition. 2016: 770-778.
- (23) Huang G, Liu Z, Van Der Maaten L, et al. Densely connected convolutional networks[C]//Proceedings of the IEEE conference on computer vision and pattern recognition. 2017: 4700-4708.
- (24) He K, Zhang X, Ren S, et al. Identity mappings in deep residual networks[C]//European conference on computer vision. Springer, Cham, 2016: 630-645.
- (25) Ioffe S, Szegedy C. Batch normalization: Accelerating deep network training by reducing internal covariate shift[J]. arXiv preprint arXiv:1502.03167, 2015.
- (26) Nair V, Hinton G E. Rectified linear units improve restricted boltzmann machines[C]//Proceedings of the 27th international conference on machine learning (ICML-10). 2010: 807-814.
- (27) Wang R J, Li X, Ling C X. Pelee: A real-time object detection system on mobile devices[C]//Advances in Neural Information Processing Systems. 2018: 1963-1972.
- (28) Pleiss G, Chen D, Huang G, et al. Memory-efficient implementation of densenets[J]. arXiv preprint arXiv:1707.06990, 2017.
- (29) Iandola F N, Han S, Moskewicz M W, et al. SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5 MB model size[J]. arXiv preprint arXiv:1602.07360, 2016.
- (30) Sandler M, Howard A, Zhu M, et al. Mobilenetv2: Inverted residuals and linear bottlenecks[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018: 4510-4520.
- (31) Zhang X, Zhou X, Lin M, et al. Shufflenet: An extremely efficient convolutional neural network for mobile devices[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018: 6848-6856.
- (32) Ho Y C, Pepyne D L. Simple explanation of the no-free-lunch theorem and its implications[J]. Journal of optimization theory and applications, 2002, 115(3): 549-570.
- (33) Carrier P L, Courville A, Goodfellow I J, et al. FER-2013 face database[J]. Technical report, 2013.
- (34) Li S, Deng W. Reliable crowdsourcing and deep locality-preserving learning for unconstrained facial expression recognition[J]. IEEE Transactions on Image Processing, 2019, 28(1): 356-370.
- (35) Wang D, Otto C, Jain A K. Face search at scale[J]. IEEE transactions on pattern analysis and machine intelligence, 2016, 39(6): 1122-1136.
- (36) Howard A G, Zhu M, Chen B, et al. Mobilenets: Efficient convolutional neural networks for mobile vision applications[J]. arXiv preprint arXiv:1704.04861, 2017.
- (37) Ma N, Zhang X, Zheng H T, et al. Shufflenet v2: Practical guidelines for efficient cnn architecture design[C]//Proceedings of the European Conference on Computer Vision (ECCV). 2018: 116-131.
- (38) Zhang T, Qi G J, Xiao B, et al. Interleaved group convolutions[C]//Proceedings of the IEEE International Conference on Computer Vision. 2017: 4373-4382.
- (39) Sun K, Li M, Liu D, et al. Igcv3: Interleaved low-rank group convolutions for efficient deep neural networks[J]. arXiv preprint arXiv:1806.00178, 2018.
- (40) Huang G, Liu S, Van der Maaten L, et al. Condensenet: An efficient densenet using learned group convolutions[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018: 2752-2761.
- (41) Deng J, Dong W, Socher R, et al. Imagenet: A large-scale hierarchical image database[C]//2009 IEEE conference on computer vision and pattern recognition. IEEE, 2009: 248-255.
- (42) Russakovsky O, Deng J, Su H, et al. Imagenet large scale visual recognition challenge[J]. International journal of computer vision, 2015, 115(3): 211-252.
- (43) Gholami A, Kwon K, Wu B, et al. Squeezenext: Hardware-aware neural network design[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops. 2018: 1638-1647.
- (44) He K, Gkioxari G, Dollár P, et al. Mask r-cnn[C]//Proceedings of the IEEE international conference on computer vision. 2017: 2961-2969.
- (45) Cai Z, Vasconcelos N. Cascade r-cnn: Delving into high quality object detection[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018: 6154-6162.
- (46) Osia S A, Shamsabadi A S, Taheri A, et al. Private and Scalable Personal Data Analytics Using Hybrid Edge-to-Cloud Deep Learning[J]. Computer, 2018, 51(5): 42-49.
- (47) Yu W, Liang F, He X, et al. A survey on the edge computing for the Internet of Things[J]. IEEE access, 2017, 6: 6900-6919.
- (48) Shi W, Cao J, Zhang Q, et al. Edge computing: Vision and challenges[J]. IEEE Internet of Things Journal, 2016, 3(5): 637-646.
- (49) Shi W, Dustdar S. The promise of edge computing[J]. Computer, 2016, 49(5): 78-81.
- (50) King D E. Dlib-ml: A machine learning toolkit[J]. Journal of Machine Learning Research, 2009, 10(Jul): 1755-1758.
- (51) Tan M, Le Q V. MixConv: Mixed Depthwise Convolutional Kernels[J]. 2019.
- (52) Novo O. Blockchain meets IoT: An architecture for scalable access management in IoT[J]. IEEE Internet of Things Journal, 2018, 5(2): 1184-1195.
- (53) Sahni Y, Cao J, Yang L. Data-aware task allocation for achieving low latency in collaborative edge computing[J]. IEEE Internet of Things Journal, 2018, 6(2): 3512-3524.
- (54) Li X, Liu S, Wu F, et al. Privacy preserving data aggregation scheme for mobile edge computing assisted IoT applications[J]. IEEE Internet of Things Journal, 2018.
- (55) Min M, Wan X, Xiao L, et al. Learning-based privacy-aware offloading for healthcare IoT with energy harvesting[J]. IEEE Internet of Things Journal, 2018.
- (56) Chiang M, Zhang T. Fog and IoT: An overview of research opportunities[J]. IEEE Internet of Things Journal, 2016, 3(6): 854-864.
- (57) Omoniwa B, Hussain R, Javed M A, et al. Fog/Edge Computing-based IoT (FECIoT): Architecture, Applications, and Research Issues[J]. IEEE Internet of Things Journal, 2018.
- (58) Premsankar G, Di Francesco M, Taleb T. Edge computing for the Internet of Things: A case study[J]. IEEE Internet of Things Journal, 2018, 5(2): 1275-1284.
- (59) Blanco-Filgueira B, García-Lesta D, Fernández-Sanjurjo M, et al. Deep learning-based multiple object visual tracking on embedded system for iot and mobile edge computing applications[J]. IEEE Internet of Things Journal, 2019.
- (60) Chen J, Chen S, Wang Q, et al. iRAF: a Deep Reinforcement Learning Approach for Collaborative Mobile Edge Computing IoT Networks[J]. IEEE Internet of Things Journal, 2019.
- (61) Zhang J, Hu X, Ning Z, et al. Energy-latency tradeoff for energy-aware offloading in mobile edge computing networks[J]. IEEE Internet of Things Journal, 2017, 5(4): 2633-2645.