Efficient Convolutional Neural Network for Audio Event Detection


Wireless distributed systems, as used in sensor networks, the Internet-of-Things and cyber-physical systems, impose high requirements on resource efficiency. Advanced preprocessing and classification of data at the network edge can help to decrease the communication demand and to reduce the amount of data to be processed centrally. In the area of distributed acoustic sensing, the combination of algorithms with a high classification rate and resource-constrained embedded systems is essential. Unfortunately, algorithms for acoustic event detection have a high memory and computational demand and are not suited for execution at the network edge. This paper addresses these aspects by applying structural optimizations to a convolutional neural network for audio event detection, reducing the memory requirement by a factor of more than 500 and the computational effort by a factor of 2.1 while improving accuracy by 9.2 %.


Matthias Meyer, Lukas Cavigelli, Lothar Thiele
Computer Engineering and Networks Laboratory, ETH Zurich, Switzerland
{matthias.meyer, thiele}@tik.ee.ethz.ch, cavigelli@iis.ee.ethz.ch

Index Terms—  Acoustic Event Detection, Convolutional Neural Networks, Low-Power Embedded Systems, Acoustic Sensor Networks, Mobile Computing

1 Introduction

Many applications for sensor networks, cyber-physical systems or the Internet-of-Things require low power consumption for long-term autonomous operation. Preprocessing and classifying data locally at the edge nodes can reduce data transmission and, therefore, energy consumption. In addition, such an approach avoids having to communicate enormous amounts of data to a centralized data analysis infrastructure and to process them there.

The detection of acoustic events is a typical example: instead of streaming audio through the network to perform server-side event detection, the acoustic events of interest can be pre-detected directly at the sensing device, which reduces the network’s data throughput significantly.

The accurate detection and classification of individual acoustic events from a sound-emitting environment is of interest for many applications such as surveillance [1] or environmental monitoring [2]. The low-power embedded devices used for such systems, however, come with very stringent memory and throughput limitations. These resource constraints impose severe limits on the complexity of suitable event detection algorithms.

Recently, a convolutional neural network (CNN)-based approach has been proposed for acoustic event detection [3] using a network design adapted from image classification [4]. It has been shown that this approach outperforms previous state-of-the-art approaches by a large margin. Unfortunately, CNNs are computationally expensive, and the proposed algorithm also has a very large memory requirement due to its large number of parameters. For image classification, a way to reduce the complexity of CNN-based classification systems has been presented [5]. These two approaches are joined in this paper to build a highly accurate CNN capable of running on embedded platforms with limited resources.

In this context, the present paper contains the following contributions:

  • A novel algorithm for acoustic event detection is presented that focuses on improving the accuracy while reducing the memory requirements and number of operations.

  • The algorithm is experimentally verified and compared to a state-of-the-art convolutional neural network for acoustic event detection. The experiment shows that reducing the memory requirement by a factor of 515 and the number of operations by a factor of 2.1 does not degrade performance; the accuracy even increases by 9.2 %.

2 Related Work

Hardware platforms for battery-operated devices have stringent power requirements. On such low-power devices, which are limited to a few 100 mW, like the ARM Cortex M7 series, the overall on-chip storage is typically limited to less than 2 MB and the available digital signal processing performance is limited to a few hundred million multiply-accumulate (MAC) operations per second, even on the most advanced components [6, 7].

To achieve maximum energy efficiency while using CNNs, Systems-on-Chip (SoCs) with hardware accelerators for CNN workloads or, more generally, 2D convolutions can be considered. Such platforms can provide speed-ups and improved energy efficiency [8, 9]. Such systems perform optimally if the CNN has a simple and structured architecture. While this concept relaxes the limit on the admissible computational effort, the strong limitations on available memory remain because any external memory would deteriorate the device’s energy efficiency substantially. A CNN architecture without memory-demanding, non-convolutional layers is therefore well suited to maximize the efficiency of hardware convolution accelerators [10].

For acoustic event detection, different algorithms have been presented based on Non-Negative Matrix Factorization [11], Hidden Markov Models [12] or Recurrent Neural Networks [13]. As in many other machine learning applications, CNNs have proven to be key to high classification accuracy. The advantage of such an architecture for acoustic event detection is its inherent inclusion of a temporal neighborhood, since acoustic events are strongly characterized by temporal changes. Besides the mentioned CNN for acoustic event detection, CNNs have so far been used mainly for speech recognition [14] or music classification [15] tasks. These algorithms are all computationally expensive, and the current state of the art also comes at the expense of a very large number of parameters [3]. Implementing such a network on mobile devices or sensor nodes is difficult due to the memory and computational restrictions of these devices. A way to reduce the complexity of CNN-based classification systems has been presented for image classification [5], and a similar structure is used in [16].

These approaches are joined in this paper to build a highly-accurate CNN for acoustic event detection capable of running on embedded platforms with limited resources.

3 Model Architecture

CNNs, by default, are not designed to run on low-power embedded devices, so the structure and learning algorithms must be chosen carefully. To explain the different steps necessary to detect an acoustic event, the detection system is divided into three major components, which are illustrated in Figure 1. First, the raw time-domain audio waveform is transformed by a front-end into a time-frequency representation. Then the system extracts features from this representation. In a final step the features are classified.

Fig. 1: Model architecture

In the following, for each step the best option in terms of the challenges mentioned in section 2 is highlighted and chosen.

3.1 Front-end

The front-end transforms the raw audio signal into a time-frequency representation from which features in both the frequency and the time domain can be extracted. In general these transforms are based on the Short-Time Fourier Transform (STFT), which can be implemented efficiently via the Fast Fourier Transform (FFT). This makes it favorable compared to novel techniques [15], which show good results by learning filterbank coefficients from the audio data but have the major drawback of employing large filterbanks.

Mel-scaling, which mirrors the human auditory system, is often applied [17] in addition to the STFT in order to compensate for its linear frequency scale. These additional processing steps may also reduce the amount of data that needs to be processed in later stages of the processing pipeline, which is important since the number of MAC operations of a CNN is directly related to the size of the input field.

As a consequence, in this work a mel-spectrogram has been chosen as front-end to satisfy the real-time constraint. It is calculated with a window size of 32 ms and a hop size of 10 ms using a Hamming window. The number of mel coefficients is 64. From the spectrogram, multiple frames consisting of 400 vectors each are extracted. These frames are the input to the feature extractor, so the network analyzes a time span of 4 s for each frame.
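The framing arithmetic above can be sketched in a few lines. This is only an illustration: the constant and function names are hypothetical, and whether consecutive frames overlap is not stated in the text, so non-overlapping frames are assumed here.

```python
# Front-end settings taken from the text above.
WINDOW_MS = 32       # STFT window size
HOP_MS = 10          # STFT hop size
N_MELS = 64          # mel coefficients per spectrogram vector
FRAME_VECTORS = 400  # spectrogram vectors per network input frame


def frame_span_ms(n_vectors=FRAME_VECTORS, hop_ms=HOP_MS, window_ms=WINDOW_MS):
    """Audio duration covered by n_vectors consecutive spectrogram vectors."""
    return (n_vectors - 1) * hop_ms + window_ms


def input_field_size(n_mels=N_MELS, n_vectors=FRAME_VECTORS):
    """Number of values in one input frame of the feature extractor."""
    return n_mels * n_vectors


def split_into_frames(n_columns, n_vectors=FRAME_VECTORS):
    """Start/stop column indices of frames (assumption: no overlap)."""
    last_start = n_columns - n_vectors
    return [(s, s + n_vectors) for s in range(0, last_start + 1, n_vectors)]


print(frame_span_ms())          # 4022 ms, i.e. roughly the 4 s stated above
print(input_field_size())       # 25600 values per frame
print(split_into_frames(1000))  # [(0, 400), (400, 800)]
```

The exact span is slightly above 4 s because the last window extends 32 ms beyond its hop position.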

Fig. 2: Data flow of proposed CNN
| CNN-FC layer | # param. | # MAC | CNN-C layer | # param. | # MAC | CNN-CNP layer | # param. | # MAC |
|---|---|---|---|---|---|---|---|---|
| input | 0 | 0 | input | 0 | 0 | input | 0 | 0 |
| frontend | 25.6 k | 12.7 M | frontend | 25.6 k | 12.7 M | frontend | 25.6 k | 12.7 M |
| conv 3, 1, 64 | 1.8 k | 32.3 M | conv 3, 1, 64 | 640 | 14.8 M | conv 3, 1, 64 | 640 | 14.8 M |
| conv 3, 1, 64 | 36.9 k | 656.9 M | conv 3, 1, 64 | 36.9 k | 943.7 M | conv 3, 2, 64 | 36.9 k | 236.0 M |
| max pool 1x2 | 0 | 0 | max pool 2x2 | 0 | 0 | - | - | - |
| conv 3, 1, 128 | 73.9 k | 581.0 M | conv 3, 1, 128 | 73.9 k | 471.9 M | conv 3, 1, 128 | 73.9 k | 471.9 M |
| conv 3, 1, 128 | 147.6 k | 1040.5 M | conv 3, 1, 128 | 147.6 k | 943.7 M | conv 3, 2, 128 | 147.6 k | 236.0 M |
| max pool 2x2 | 0 | 0 | max pool 2x2 | 0 | 0 | - | - | - |
| fc 1024 | 231.2 M | 231.2 M | conv 3, 1, 128 | 147.6 k | 236.0 M | conv 3, 1, 128 | 147.6 k | 236.0 M |
| fc 1024 | 1.1 M | 1.1 M | conv 1, 1, 128 | 16.5 k | 26.2 M | conv 1, 1, 128 | 16.5 k | 26.2 M |
| fc 28 | 28.7 k | 28.7 k | conv 1, 1, 28 | 3.6 k | 5.7 M | conv 1, 1, 28 | 3.6 k | 5.7 M |
| - | - | - | avg pool | 0 | 0 | avg pool | 0 | 0 |
| activation | 0 | 0 | activation | 0 | 0 | activation | 0 | 0 |
| Total: | 233 M | 2555 M | Total: | 452 k | 2655 M | Total: | 452 k | 1239 M |

Table 1: Structure, number of parameters and number of required MAC operations for three CNNs. The first (CNN-FC) uses fully-connected layers as classifier, the second (CNN-C) uses convolutional layers as classifier, and the third (CNN-CNP) uses convolutional layers as classifier but no max pooling layers. Convolutional layers are denoted as conv filter_size, stride, number_of_filters; fully-connected layers as fc output_dimensions; max pooling layers as max pool pool_size.

3.2 Feature Extraction

The feature extraction block uses a CNN to learn the features of the time-frequency representation. Most CNNs are built from very few basic building blocks: convolution, activation and pooling layers. Concatenating these blocks increases the depth of the network, which has been shown to enhance accuracy [4]. However, a higher depth also results in a higher number of required operations and a higher parameter count. In this work the feature extraction block is therefore limited to two sections consisting of two convolutional layers each, which provides enough parameters to learn significant features while keeping the parameter count moderate. Moreover, one convolutional layer with a larger filter size (e.g. 5x5) can be replaced by several layers using 3x3 filters, which reduces the number of parameters and potentially even improves the classification performance [4]. In most CNNs max pooling is used to regularize the network, but it has been shown that for small-scale datasets the removal of the max pooling layer does not affect the performance [5]. As a consequence, max pooling is removed from the network and the stride of the preceding layer is increased from 1 to 2, which divides the required MAC operations for this layer by four while maintaining the same network structure in principle.
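The two cost arguments in this section can be checked with simple counting formulas. The sketch below is not the authors' code; it assumes 'same' padding, counts one MAC per weight and output position (biases excluded), and uses hypothetical helper names.

```python
def out_size(n, stride):
    """Output length along one axis, assuming 'same' padding."""
    return -(-n // stride)  # ceiling division


def conv_params(k, c_in, c_out):
    """Weights plus biases of a k x k convolution."""
    return k * k * c_in * c_out + c_out


def conv_macs(k, c_in, c_out, out_h, out_w):
    """MAC operations of a k x k convolution, biases not counted."""
    return k * k * c_in * c_out * out_h * out_w


# Second convolutional layer of Table 1: 64 -> 64 channels on a 64 x 400 input.
h, w = 64, 400
macs_stride1 = conv_macs(3, 64, 64, out_size(h, 1), out_size(w, 1))
macs_stride2 = conv_macs(3, 64, 64, out_size(h, 2), out_size(w, 2))
print(macs_stride1)                  # 943718400, the 943.7 M of CNN-C
print(macs_stride2)                  # 235929600, about the 236.0 M of CNN-CNP
print(macs_stride1 // macs_stride2)  # 4

# One 5x5 layer versus two stacked 3x3 layers (64 -> 64 channels each).
print(conv_params(5, 64, 64))        # 102464
print(2 * conv_params(3, 64, 64))    # 73856
```

Increasing the stride to 2 halves both output dimensions, which is where the factor of four comes from; the parameter count of the layer is unchanged.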

3.3 Classification

Table 1 illustrates the structure of a network with fully-connected dense layers as classifier (CNN-FC) and a network using only convolutional layers as classifier (CNN-C). For both networks the number of parameters and the number of required MAC operations are listed. The fully-connected layers clearly dominate the parameter count, making the contribution of the preceding convolutional layers almost negligible. It has been shown that a fully-connected layer can be replaced by a 1x1 convolutional layer [18], which reduces the total number of parameters, and thus the memory footprint, from 233 M (CNN-FC) to 452 k (CNN-C), as listed in Table 1. The reduction of parameters also further regularizes the network, which is an advantage for training, so an improvement in accuracy can be expected. As a last layer, average pooling reduces the output of the last convolutional layer to an array whose size matches the number of labels. The replacement of the fully-connected layers slightly increases the number of MAC operations.
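To make the parameter comparison concrete, the following sketch counts the parameters of the classifier layers in Table 1 and shows how average pooling turns per-class feature maps into one score per label. It is a plain-Python illustration with hypothetical function names, not the implementation used in the paper.

```python
def fc_params(n_in, n_out):
    """Weights plus biases of a fully-connected layer."""
    return n_in * n_out + n_out


def conv1x1_params(c_in, c_out):
    """Weights plus biases of a 1x1 convolution."""
    return c_in * c_out + c_out


def global_avg_pool(feature_maps):
    """Reduce a list of 2D maps (one per class) to one mean value per map."""
    result = []
    for fmap in feature_maps:
        values = [v for row in fmap for v in row]
        result.append(sum(values) / len(values))
    return result


print(fc_params(1024, 1024))     # 1049600, the 1.1 M of the second fc layer
print(fc_params(1024, 28))       # 28700, the 28.7 k of the last fc layer
print(conv1x1_params(128, 128))  # 16512, the 16.5 k of conv 1, 1, 128
print(conv1x1_params(128, 28))   # 3612, the 3.6 k of conv 1, 1, 28

# Two tiny 2x2 "class maps" collapse to two class scores.
print(global_avg_pool([[[1, 3], [5, 7]], [[0, 0], [0, 4]]]))  # [4.0, 1.0]
```

A 1x1 convolution shares its weights across all spatial positions, so its parameter count depends only on the channel counts and not on the size of the feature map, unlike the first fully-connected layer.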

3.4 Final design

The design proposed in this paper is denoted as CNN-CNP and is listed in Table 1. After applying each optimization step as described above, the parameter count is decreased by a factor of 515 to 452 k parameters and the number of MAC operations by a factor of 2.1 to 1239 M MAC. Moreover, after optimization the network consists only of convolutional layers. This unified architecture is beneficial for hardware implementation and especially for the use of convolutional hardware accelerators.
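The totals for CNN-CNP can be approximately reproduced by walking through the layer list of Table 1. The sketch below rests on several assumptions that are not stated explicitly in the text: a single-channel 64 x 400 input (64 mel coefficients, 400 vectors), 'same' padding, MACs counted per weight and output position without biases, and front-end costs taken directly from the table. All names are hypothetical.

```python
FRONTEND_PARAMS = 25_600    # "25.6 k" in Table 1
FRONTEND_MACS = 12_700_000  # "12.7 M" in Table 1

# (filter_size, stride, number_of_filters), CNN-CNP column of Table 1.
CNN_CNP = [
    (3, 1, 64), (3, 2, 64),
    (3, 1, 128), (3, 2, 128),
    (3, 1, 128), (1, 1, 128), (1, 1, 28),
]


def count(layers, height=64, width=400, channels=1):
    """Return (parameters, MACs) of the front-end plus a stack of conv layers."""
    params, macs = FRONTEND_PARAMS, FRONTEND_MACS
    for k, stride, filters in layers:
        height = -(-height // stride)  # 'same' padding, ceiling division
        width = -(-width // stride)
        weights = k * k * channels * filters
        params += weights + filters    # weights plus one bias per filter
        macs += weights * height * width
        channels = filters
    return params, macs


params, macs = count(CNN_CNP)
print(params)  # 452316, i.e. about 452 k
print(macs)    # 1239042400, i.e. about 1239 M
```

Under these assumptions the result agrees with the totals in Table 1 up to rounding of the individual table entries.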

4 Experiment

After applying these fundamental changes to the network and substantially reducing the number of parameters and arithmetic operations, it needs to be validated that the classification performance remains at an acceptable level. For this purpose the two networks (CNN-C and CNN-CNP) have been implemented using Keras [19]. These networks are compared against the best performing implementations from [3], which are referred to as A and B. Network A has the same structure as the CNN-FC network from Table 1; network B is a more complex network with a higher depth and bigger fully-connected layers.

4.1 Dataset

The dataset [3] contains various sound files collected from freely available online sources. It consists of 28 different event types of variable length, e.g. airplanes, violins, birds or footsteps. The total length of all 5223 audio files is 768.4 minutes. The data is split into a training and a test set. The training set contains 75 % of the original data and is further subdivided into a training and a validation set with a ratio of 0.25. Although the dataset is strongly biased, no data augmentation was performed in the following experiments, since the main focus is on comparing the algorithms in terms of structure, resources and classification performance, and not on the augmentation technique. Both networks were trained by minimizing the cross-entropy loss using the gradient-based optimizer ADAM [20] with mini-batches of size 128. The optimizer’s parameters were left at their default values.
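One possible reading of the split described above is sketched below. The text does not say whether the split is made per file or per duration, whether it is stratified by class, or how exactly the ratio of 0.25 is defined; the sketch assumes a random per-file split in which 25 % of the remaining training data is held out for validation. The function name and the seed are hypothetical.

```python
import random


def split_indices(n_items, test_fraction=0.25, val_fraction=0.25, seed=0):
    """Randomly split item indices into training, validation and test sets."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    n_test = int(n_items * test_fraction)
    test, rest = indices[:n_test], indices[n_test:]
    n_val = int(len(rest) * val_fraction)  # assumption: fraction of the training data
    val, train = rest[:n_val], rest[n_val:]
    return train, val, test


train, val, test = split_indices(5223)
print(len(train), len(val), len(test))  # 2939 979 1305
```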

Testing was done by predicting the probabilities for each class on a 4-second window randomly extracted from the test set. The class with the highest probability was chosen as the predicted class and compared against the ground truth.
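The test procedure can be sketched as follows. The helper names are hypothetical, and the window length of 400 spectrogram vectors is an assumption derived from the 10 ms hop size in Section 3.1; the probabilities would come from the trained network, which is not part of this sketch.

```python
import random


def random_window(n_columns, window=400, rng=random):
    """Pick start/stop indices of one random window of spectrogram vectors."""
    start = rng.randrange(0, n_columns - window + 1)
    return start, start + window


def predict_class(probabilities):
    """Index of the class with the highest predicted probability."""
    return max(range(len(probabilities)), key=probabilities.__getitem__)


def accuracy(prob_rows, labels):
    """Fraction of windows whose predicted class matches the ground truth."""
    hits = sum(predict_class(p) == y for p, y in zip(prob_rows, labels))
    return hits / len(labels)


print(predict_class([0.1, 0.7, 0.2]))                               # 1
print(accuracy([[0.9, 0.1], [0.2, 0.8], [0.6, 0.4]], [0, 1, 1]))    # 2 of 3 correct
```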

4.2 Results

The experimental results are listed in Table 2. The values for accuracy of network A and B are taken from the original publication, the values for the number of parameters and number of MACs are calculated based on the information taken from the original publication. The first line of Table 2 shows the accuracy results for networks A and B with data augmentation which are 91.7 % and 92.8 %, respectively. As expected, these are better than the corresponding accuracy results without complex data augmentation which are 77.9 % and 80.3 %, respectively. The networks CNN-C and CNN-CNP have an accuracy of 86 % and 85.1 %, respectively. Thus, without sophisticated data augmentation both proposed networks perform better than the reference network A and even better than the more complex network B.
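The headline figures of the paper follow from the values in Tables 1 and 2 by simple arithmetic, as the sketch below illustrates. Reading the 9.2 % as a relative improvement of CNN-CNP over network A without augmentation is an interpretation that matches the numbers, not something the text states explicitly; the function names are hypothetical.

```python
def relative_gain_percent(new, reference):
    """Relative improvement of `new` over `reference`, in percent."""
    return 100.0 * (new - reference) / reference


def reduction_factor(before, after):
    """How many times smaller `after` is than `before`."""
    return before / after


# Accuracy without augmentation: CNN-CNP versus network A (Table 2).
print(round(relative_gain_percent(85.1, 77.9), 1))   # 9.2

# Parameter count and MAC operations: CNN-FC versus CNN-CNP (Table 1).
print(int(reduction_factor(233_000_000, 452_000)))   # 515
print(round(reduction_factor(2555, 1239), 1))        # 2.1
```

In absolute terms the accuracy difference between CNN-CNP and network A is 7.2 percentage points.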

The analysis of parameter count shows that when 16 bit parameters are assumed, the total memory consumption for the CNN-FC network’s weights is 466 MB, which is not feasible for most edge computing applications considering the fierce power and memory constraints for distributed sensing devices. In contrast, the weight storage for the CNN-C and CNN-CNP is approx. 904 kB. Even when considering additional overhead by the implementation on low-power devices such as the NXP KV5x or ST STM32F7 [6, 7], their flash memory of up to 2 MB is still sufficient to store the parameters of the presented acoustic event detection algorithm.

Considering that the devices mentioned above have a processing capability higher than 430 M MAC/s and that the input buffer only needs to be processed every 4 seconds, they are able to handle the 1239 M MACs of CNN-CNP in less than 4 seconds. Thus the classification can be considered real-time. In practical applications, however, the claimed processing capability might not be reached, or the real-time requirements might be stricter. Therefore the use of hardware accelerators is suggested, which allows the computation to be offloaded to a dedicated processor while maintaining good energy efficiency. As has been explained, the novel network structure matches the communication and memory access pattern that is expected by current accelerators.
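The memory and real-time estimates above amount to a short calculation, sketched here with hypothetical function names. The throughput of 430 M MAC/s is the lower bound quoted in the text; as noted there, the throughput reached in practice may be lower.

```python
def weight_memory_bytes(n_params, bits_per_param=16):
    """Storage needed for the weights at a given parameter width."""
    return n_params * bits_per_param // 8


def processing_time_s(macs, macs_per_second):
    """Time needed to process one input frame."""
    return macs / macs_per_second


print(weight_memory_bytes(233_000_000))  # 466000000 bytes, 466 MB for CNN-FC
print(weight_memory_bytes(452_000))      # 904000 bytes, 904 kB for CNN-C / CNN-CNP

t = processing_time_s(1239e6, 430e6)
print(round(t, 2))  # 2.88 s per frame
print(t < 4.0)      # True: faster than the 4 s frame period
```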

The evaluation shows that the memory and computational requirements could be reduced considerably without a negative impact on classification performance.

| Model | A [3] | B [3] | CNN-C | CNN-CNP |
|---|---|---|---|---|
| Accuracy w/ aug | 91.7 % | 92.8 % | - | - |
| Accuracy w/o aug | 77.9 % | 80.3 % | 86.0 % | 85.1 % |
| # params | 233 M | 257 M | 452 k | 452 k |
| # MACs | 2543 M | 3533 M | 2655 M | 1239 M |

Table 2: Accuracy (with and without data augmentation), parameter count and number of operations for the proposed networks CNN-C and CNN-CNP compared to the top scoring implementations A and B from [3].

5 Conclusion

In this paper, an acoustic event detection algorithm was presented that exploits the advantages of CNNs while being implementable on low-power microcontrollers. First, a convolutional neural network has been proposed that is able to supersede state-of-the-art acoustic event detection algorithms. Second, the network can be efficiently implemented on resource-limited low-power embedded devices. It was demonstrated that it is possible to reduce the memory requirement by a factor of 515 and the number of operations by a factor of 2.1, while outperforming a similar network with fully-connected layers by 9.2 %. The structured approach of the CNN consisting mainly of convolutional layers makes it easily portable to novel convolutional hardware accelerators which can further increase the energy efficiency.

6 Acknowledgement

The work presented in this paper was scientifically evaluated by the SNSF, and financed by the Swiss Confederation and by nano-tera.ch.


References

  • [1] M. Cristani, M. Bicego, and V. Murino, “Audio-Visual Event Recognition in Surveillance Video Sequences,” IEEE Transactions on Multimedia, vol. 9, pp. 257–267, Feb. 2007.
  • [2] L. Girard, J. Beutel, S. Gruber, J. Hunziker, R. Lim, and S. Weber, “A custom acoustic emission monitoring system for harsh environments: application to freezing-induced damage in alpine rock walls,” Geoscientific Instrumentation, Methods and Data Systems, vol. 1, no. 2, pp. 155–167, 2012.
  • [3] N. Takahashi, M. Gygli, B. Pfister, and L. V. Gool, “Deep Convolutional Neural Networks and Data Augmentation for Acoustic Event Recognition,” in Proc. Interspeech 2016, (San Francisco), 2016.
  • [4] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” CoRR, vol. abs/1409.1556, 2014.
  • [5] J. Springenberg, A. Dosovitskiy, T. Brox, and M. Riedmiller, “Striving for simplicity: The all convolutional net,” in ICLR (workshop track), 2015.
  • [6] NXP Semiconductors, KV5x Data Sheet, 6 2016. Rev. 4.
  • [7] STMicroelectronics, STM32F765xx STM32F767xx STM32F768Ax STM32F769xx, 5 2016. Rev. 3.
  • [8] F. Conti and L. Benini, “A ultra-low-energy convolution engine for fast brain-inspired vision in multicore clusters,” in Proceedings of the 2015 Design, Automation & Test in Europe Conference & Exhibition, pp. 683–688, EDA Consortium, 2015.
  • [9] L. Cavigelli, D. Gschwend, C. Mayer, S. Willi, B. Muheim, and L. Benini, “Origami: A convolutional network accelerator,” in Proceedings of the 25th Edition on Great Lakes Symposium on VLSI, GLSVLSI ’15, (New York, NY, USA), pp. 199–204, ACM, 2015.
  • [10] D. Rossi, F. Conti, A. Marongiu, A. Pullini, I. Loi, M. Gautschi, G. Tagliavini, A. Capotondi, P. Flatresse, and L. Benini, “PULP: A parallel ultra low power platform for next generation IoT applications,” in 2015 IEEE Hot Chips 27 Symposium (HCS), pp. 1–39, Aug. 2015.
  • [11] T. Komatsu, T. Toizumi, R. Kondo, and Y. Senda, “Acoustic Event Detection Method Using Semi-Supervised Non-Negative Matrix Factorization with a Mixture of Local Dictionaries,” tech. rep., DCASE2016 Challenge, Sept. 2016.
  • [12] X. Zhou, X. Zhuang, M. Liu, H. Tang, M. Hasegawa-Johnson, and T. Huang, “HMM-Based Acoustic Event Detection with AdaBoost Feature Selection,” in Multimodal Technologies for Perception of Humans (R. Stiefelhagen, R. Bowers, and J. Fiscus, eds.), no. 4625 in Lecture Notes in Computer Science, pp. 345–353, Springer Berlin Heidelberg, 2008.
  • [13] G. Parascandolo, H. Huttunen, and T. Virtanen, “Recurrent neural networks for polyphonic sound event detection in real life recordings,” in 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6440–6444, March 2016.
  • [14] O. Abdel-Hamid, A. r. Mohamed, H. Jiang, L. Deng, G. Penn, and D. Yu, “Convolutional Neural Networks for Speech Recognition,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 22, pp. 1533–1545, Oct. 2014.
  • [15] S. Dieleman and B. Schrauwen, “End-to-end learning for music audio,” in 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6964–6968, May 2014.
  • [16] K. Choi, G. Fazekas, and M. B. Sandler, “Automatic tagging using deep convolutional neural networks,” in Proceedings of the 17th International Society for Music Information Retrieval Conference, ISMIR 2016, New York City, United States, August 7-11, 2016, pp. 805–811, 2016.
  • [17] E. Cakir, T. Heittola, H. Huttunen, and T. Virtanen, “Polyphonic sound event detection using multi label deep neural networks,” in 2015 International Joint Conference on Neural Networks (IJCNN), pp. 1–7, July 2015.
  • [18] M. Lin, Q. Chen, and S. Yan, “Network In Network,” ICLR: Conference Track, 2014.
  • [19] F. Chollet, “Keras.” https://github.com/fchollet/keras, 2015.
  • [20] D. Kingma and J. Ba, “Adam: A Method for Stochastic Optimization,” arXiv:1412.6980 [cs], Dec. 2014.