AudioAR: Audio-Based Activity Recognition with Large-Scale Acoustic Embeddings from YouTube Videos
Activity sensing and recognition have been demonstrated to be critical in health care and smart home applications. Compared to traditional methods that use accelerometers or gyroscopes for activity recognition, acoustic-based methods can collect rich information about human activities together with their context, and are therefore more suitable for recognizing high-level compound activities. However, audio-based activity recognition in practice suffers from the tedious and time-consuming process of collecting ground truth audio data from individual users. In this paper, we propose a new mechanism for audio-based activity recognition that is entirely free from user training data by using millions of embedding features from general YouTube video sound clips. Based on a combination of oversampling and deep learning approaches, our scheme does not require further feature extraction or outlier filtering. We developed our scheme for recognition of 15 common home-related activities and evaluated its performance under dedicated scenarios and in-the-wild scripted scenarios. In the dedicated recording test, our scheme yielded 81.1% overall accuracy and 80.0% overall F-score for all 15 activities. In the in-the-wild scripted tests, we obtained an averaged top-1 classification accuracy of 64.9% and an averaged top-3 classification accuracy of 80.6% for 4 subjects in actual home environments. Several design considerations, including the association between dataset labels and target activities, the effects of segmentation size, and privacy concerns, are also discussed in the paper.
Manuscript submitted to ACM
Sensing and recognizing human activities of daily living (ADL) have been demonstrated to be useful in many areas such as smart home or health care. For example, recognizing inhabitants’ activities and changes of situated environment can be helpful for smart home systems to take appropriate assistance to the inhabitants (chen2012knowledge, ). Patients with specific mental or brain disease can show typical symptoms in terms of behaviors that can be tracked using ADL (grunerbl2015smartphone, ). Traditional methods of ADL focus on usage of embedded motion sensors such as accelerometers and gyroscopes. Such sensors have the advantages of being relatively simple and cheap for implementation, and they can be effective to capture our body information in real time. However, as we can see from most prior research (ravi2005activity, ; kwapisz2011activity, ; anjum2013activity, ), the fact that they can only collect local information of the body heavily limits their application for recognizing compound activities, as in order to fully model such activities one is required to carry a smart phone or wear a bunch of sensors for a long time which can be quite awkward in practice. In fact, many activities in our daily life can not be modeled at all using only wearable motion sensors. Because of such reasons, researchers turn to use cameras and microphones as ideal supplements to traditional approaches of activity sensing.
Many of our daily activities generate specific types of sounds. Compared to video, sound can also be collected more easily and flexibly on mobile and wearable platforms or with smart home assistants such as Google Home. Hence, in recent years researchers have proposed several types of audio sensing approaches, from wearable and mobile devices (thomaz2015inferring, ; rossi2013ambientsense, ) to home-based sensor systems (laput2017synthetic, ; chen2005bathroom, ). This work has largely broadened the range of activity types covered by activity recognition research and has significantly improved recognition performance when combined with traditional methods. However, as with most prior work, such systems remain troublesome to implement in practice because of the tedious process of collecting ground truth data for modeling. The sound of human activities can vary a lot depending on the objects in use, the placement and type of the recorder, and the environment in which the activity is performed. Sometimes even a difference in behavioral trajectory can affect the model decision. Hence, individual users have to repeat the same activities several times so that the model can collect sufficient data and work reliably. Some attempts have been made to alleviate the problem with crowd-sourced data, for example by Nguyen et al. and Rossi et al. (nguyen2013combining, ; rossi2012recognizing, ). But due to the limited size of the available datasets and the immaturity of deep learning approaches in past years, such attempts either yielded less satisfactory performance or still partially relied on user data.
In this work, we propose a novel scheme for recognizing 15 home-related activities based on ambient sound. Instead of directly collecting ground truth data and labels from users as in most prior research, we explore the feasibility of using millions of audio embeddings from general YouTube videos as the only training set, so our mechanism does not require any training data from users. The audio embeddings are released in the Google Audio Set (gemmeke2017audio, ) database. Due to the large size and highly unbalanced characteristics of the on-line data, our scheme combines oversampling and deep learning approaches. In this paper, we first describe in detail the implementation of our idea, including the label association process, the proposed oversampling techniques and the neural network architecture. We then develop both dedicated tests and in-the-wild scripted tests to verify the feasibility of the scheme. Results show that our scheme yields promising performance under different test scenarios. In addition, comparisons among different implementation architectures, the choice of segmentation size and the related privacy issues are discussed.
2. Related Work
Activity recognition based on sensor data is not new. It has been widely used for several domains including self monitoring (thomaz2015practical, ), assistance in smart home (chen2012knowledge, ) or diagnosis of some activity-related disease (grunerbl2015smartphone, ). Traditional approaches of activity recognition rely on body sensors such as accelerometers and gyroscopes. They can be implemented flexibly on smart phones (kwapisz2011activity, ; anjum2013activity, ), smart watches (thomaz2015practical, ; shoaib2015towards, ) or wearable sensor boards (ravi2005activity, ). For example, Kwapisz et al. (kwapisz2011activity, ) managed to recognize walking, jogging, upstairs, downstairs, sitting and standing by just using a smart phone in subjects’ pockets. Thomaz et al.(thomaz2015practical, ) proposed the usage of 3-axis accelerometers embedded in an off-the-shelf smart watch for detection of eating moment. Similarly, Ravi et al. (ravi2005activity, ) showed the feasibility of attaching a sensor board on human body for simple movement classification. Most of such work focused on recognition of very simple activities involving limited types of sensors. The limitation of recognizable activity types can be improved by introducing more complex groups of sensors. However, implementation of complex sensor arrays always brings challenges in insufficiency of training sets and constraints of energy or computing resource. The subject and location sensitivity of traditional inertial sensing can make it even harder for generalization of activity models (su2014activity, ).
For these reasons, researchers have proposed to empower activity recognition with video and audio approaches. Compared to cameras, microphones can be carried more flexibly on the human body or deployed in indoor scenarios such as smart homes. Yatani and Truong (yatani2012bodyscope, ) proposed recognition of 12 activities related to throat movement, such as eating, drinking, speaking and coughing, from acoustic data collected around the throat area. This was achieved with a simple wearable headset consisting of a tiny microphone, a piece of stethoscope and a Bluetooth module. Another study showed that eating activity can also be effectively inferred using wrist-mounted acoustic sensing (thomaz2015inferring, ). This implies the practicality of simple audio-based activity recognition with off-the-shelf products such as smart watches. With the development of smart phones in recent years, phone-based acoustic sensing has also shown great capability in activity recognition tasks. The AmbientSense application (rossi2013ambientsense, ) is an example. It is an Android app that can process ambient sound data in real time either on the user front end or on an on-line server. It was tested on mainstream smart phones (Samsung Galaxy SII and Google Nexus One) and yielded satisfactory results on classification of 23 daily-life contexts. Acoustic sensing can also be used in indoor scenarios, especially where video-based methods may raise privacy concerns. Laput et al. (laput2017synthetic, ) described the concept of general-purpose sensing, where multiple sensor units including a microphone were embedded on a single home-oriented sensor tag. Chen et al. (chen2005bathroom, ) provided an audio solution for detection of 6 common bathroom activities based on MFCC features. That work is aimed at elder care, since direct behavioral observations can be quite embarrassing to share with clinicians.
Most of the prior work requires a manual collection of ground truth audio data from individual users. This can be quite laborious especially if we are targeting at multiple classes of activities. Also, it is actually unrealistic to ask the users to train the model on their own before using it. Hence, Hwang and Lee (hwang2012environmental, ) introduced a crowd-sourcing framework to the problem. They developed a mobile platform for collection of audio data from multiple users. The platform could then generate a global K-nearest neighbors (KNN) classifier based on Gaussian histogram of MFCC features to recognize basic audio scenes. However, this still requires collection of user data and the performance of the system highly depends on the size and quality of the training set. General-purpose acoustic database, on the other hand, can serve as ideal data source to existing systems. Over the past years, the Freesound database (https://freesound.org/) has been one of the most commonly used database for audio research. Started in 2005 and currently maintained by the Freesound team, it is a crowd-sourced dataset consisting of over 120’000 annotated audio recordings. Besides, Salamon et al. (salamon2014dataset, ) released the UrbanSound database containing 18.5 hours of urban sound clips selected from the Freesound. Säger et al. (sager2018audiopairbank, ) improved the Freesound recordings by adding adjective-noun and verb-noun pairs to the audio tags and constructed a new AudioPairBank dataset. Rossi et al. (rossi2012recognizing, ) first attempted context recognition based on MFCC features extracted from the on-line Freesound database by using a Gaussian Mixture Model (GMM). However, due to limited size of the training set (4678 audio samples for 23 target context), the top-1 classification accuracy based on dedicated sound recordings was just 38%. The performance was improved to 57% of accuracy by manually filtering 38% of the samples as outliers. 
Based on that, Nguyen et al. (nguyen2013combining, ; nguyen2013towards, ) leveraged semi-supervised learning methods to combine the on-line Freesound data with users’ own recordings. After manually filtering outliers for quality, they trained a semi-supervised GMM on MFCC features extracted from 163 Freesound audio clips for 9 context classes. The model was then applied to unlabeled user-centric data recorded by smart phones with a headset microphone. The performance was evaluated based on the second half of the user data with an average accuracy of 54% for 7 users. To further improve the performance, Nguyen et al. (nguyen2013towards, ) also presented two active learning mechanisms, where a supervised GMM was first trained on the same Freesound data or well-labeled user data and then interactively queried users for labeling the unlabeled user-centric data. Clearly, from the prior work we can see that the existing crowd-sourced datasets do not generalize sufficiently the audio recorded across users, and previous research still needs to rely on user data and manual filtering of outliers for better performance.
In April 2017, Google’s Sound Understanding team released the Audio Set database (gemmeke2017audio, ) (https://research.google.com/audioset/) containing ontology and embedding features of over 2 million audio clips drawn from general YouTube videos. The clips are categorized by 527 audio labels and many of the labels can be potentially applied to activity recognition. Due to the ambiguous types of source videos from movie scenes, cartoons to real-world recordings, activity recognition based on such data is close to the concept of cross-domain transfer learning. Actually, it is not the first time that transfer learning on web data was introduced to activity recognition research. Hu et al. (hu2011cross, ) proposed to use web search text as a bridge for similarity measures between sensor readings. Fast et al. (fast2016augur, ) developed the Augur, a system leveraging context in on-line fictions to predict human activities in the real world. To the best of our knowledge, however, very few attempts have been made in terms of transfer learning for audio-based activity recognition, especially leveraging such tremendous amount of on-line audio data.
Generally speaking, our contributions can be summarized as:
We proposed a novel scheme for activity recognition that is entirely free from collection of user training data. It leverages the idea of transfer learning to empower traditional audio-based activity recognition by applying over 2 million audio embedding features from nearly 52,000 YouTube videos.
Our study verified the feasibility of using syntactic audio embeddings generated from on-line general dataset for activity recognition in the real world.
We aimed to provide a practical solution to activity recognition specifically in the home. Audio-based activity recognition shows significant potential for smart homes, and our scheme was tested on 15 activities that appear very frequently at home. Our study showed that the proposed scheme could yield promising performance for both dedicated tests and in-the-wild tests using only off-the-shelf smart phones.
3.1. Audio Set
In 2017, Google’s Sound Understanding team released a large-scale acoustic dataset, named Audio Set (gemmeke2017audio, ), endeavoring to bridge the gap in data availability between image and audio research. The Audio Set contains information of over 2 million audio soundtracks drawn from general YouTube videos. The dataset is structured as a hierarchical ontology consisting of 527 class labels and the size is still growing now. All audio clips are equally chunked as 10 seconds long and labeled by human experts.
The dataset does not provide the original waveforms of the audio clips. Instead, the samples are presented in the form of both source indexes and bottleneck embedding features. The audio index contains the audio ID, URL, class labels, and the start and end time of the sample within the corresponding source video. The embedding features are generated from the embedding layer of a VGG-like deep neural network (DNN) architecture trained on the YouTube-100M dataset (hershey2017cnn, ). The generation frequency is roughly 1 Hz (96 audio frames of 10 ms each, i.e. 0.96 seconds of audio per embedding vector). In other words, one embedding vector describes about one second of audio, and therefore there are 10 embedding vectors for each audio clip within the dataset. Before release, the embedding vectors were also post-processed by principal component analysis (PCA) and whitening, and quantized to 8 bits per embedding element. Only the first 128 PCA coefficients are kept and released.
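The layout described above can be illustrated with a minimal sketch. It shows the arithmetic behind the 0.96-second embedding period and the per-clip storage size, together with a generic linear 8-bit quantizer. The clipping range `Q_MIN`/`Q_MAX` and the helper names are assumptions made for illustration only; the text states only that each element is quantized to 8 bits.

```python
# Illustrative sketch, not the released Audio Set post-processing code.
# The clipping range is a hypothetical choice for this example.
Q_MIN, Q_MAX = -2.0, 2.0

def quantize(value, lo=Q_MIN, hi=Q_MAX):
    """Clip value to [lo, hi] and map it linearly onto an integer in 0..255."""
    clipped = min(max(value, lo), hi)
    return int((clipped - lo) * 255.0 / (hi - lo))

def dequantize(code, lo=Q_MIN, hi=Q_MAX):
    """Map an 8-bit code back to an approximate float value."""
    return lo + code * (hi - lo) / 255.0

# One embedding vector summarizes 96 frames of 10 ms each.
frame_ms, frames_per_embedding = 10, 96
seconds_per_embedding = frame_ms * frames_per_embedding / 1000  # 0.96

# A 10-second clip is stored as 10 vectors of 128 one-byte elements.
EMBEDDINGS_PER_CLIP = 10
EMBEDDING_DIM = 128
bytes_per_clip = EMBEDDINGS_PER_CLIP * EMBEDDING_DIM  # 1280
```

Under these assumptions, a round trip through `quantize` and `dequantize` loses at most one quantization step, (Q_MAX - Q_MIN) / 255, for values inside the clipping range.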
The original vectors are all stored within TensorFlow (tensorflow2015-whitepaper, ) Record files. Given the significant size of the embeddings and the lack of convenience for data processing, Kong et al. (kong2017audio, ) provided a converted Python Numpy version of the raw embeddings which are adopted in our research. Their source codes and released data can be accessed on https://github.com/qiuqiangkong/ICASSP2018_audioset.
3.2. Label Association
Before implementation, we need to consider the range of target activities and how we can associate the class labels in the Audio Set with them. Our research leverages existing on-line audio samples and labels as the training set, and we aim at target activities that frequently appear in the home. However, not all typical home-related activities can be found in the Audio Set. For example, we did not include activities such as ‘using dishwasher’ or ‘using coffee machine’ in our study because of the absence of related labels in the Audio Set ontology. Besides, the range of our target classes also has to be limited to activities that are suitable for audio recognition. Here ‘suitable’ means that the sound of the activity should be distinctive and easily captured in practice. Hence, we did not choose any pet categories for our study because such sounds are normally sparse in natural home scenarios, despite the fact that they do sometimes appear in our daily life. Some audio classes such as ‘silence’ were also not selected because the same type of sound can be attributed to various activities including sleeping, standing, or just absence of the subject from the room. Body movement with very weak sound features is not suitable for audio-based recognition either. Furthermore, even for the chosen target activities, it is not always possible to find an exact match among the Audio Set labels. In such cases, we adopted an indirect matching process. That is, we first determined the objects and environmental context most relevant to the target activity. Then we chose labels with such objects and context as the representation of the activity. For example, we used the classes ‘water tap’ and ‘sink’ as the representation of ‘washing hands and faces’, as all three classes involve usage of water and their features are quite similar.
This is actually a very subjective process as there is no quantized measurement to determine the similarity between such relevant classes and the actual target class. Also, the overlaps of chosen audio features among different types of activities can affect the model performance in practice. They would be discussed in the scripted test section 5.2 as a typical type of label co-occurrence.
It is noted that the dataset provides a quality rating of audio labels based on manual assessment. Most of the labels have been assessed by experts based on a random check of 10 audio segments within the label. The samples of each label are actually divided into three subsets (evaluation, balanced training, and unbalanced training) for training and evaluation purposes. The evaluation and balanced training sets are of much smaller size than the rest unbalanced training set, and most of the evaluation and balanced training sets are re-rated by experts. However, due to the considerable size of samples and factors such as misinterpretation or confusibility, many class labels are still of poor rating results. In our label determination process we did not actually rely on the sample ratings and we consider the whole evaluation, balanced training and unbalanced training sets as our training set.
Based on the above considerations, we determined 15 common home-related activities for our study. They are associated with 18 Audio Set labels. Table 1 shows the association between our target activities and the Audio Set class labels. The target activities are categorized in the table based on their frequently relevant locations in the home. Embeddings of the listed labels on the Audio Set will then serve as the only training data in our proposed scheme.
Another characteristic of the Audio Set database is the unbalanced distribution of class sizes. Table 2 lists the original number of embedding vectors (raw embeddings) per class used in our study without any sampling process. The numbers here include embeddings from all three parts (evaluation set, balanced training set and unbalanced training set) for all classes. These are not the whole sets of embeddings for the classes, as in our implementation we removed samples with label co-occurrence among the target classes to ensure mutual exclusiveness. Also, since we used the converted Python Numpy version of the features as mentioned above, the actual size of some classes is slightly smaller than reported on the original Audio Set website. As we can see, the classes ‘Chatting’ and ‘Playing music’ have the most embeddings (174,220 and 115,200 respectively). The class ‘Brushing teeth’ has the fewest, only 1230, which is 0.7% of the size of the largest class. Fig. 1 shows the distribution of class sizes, and we can see that the two majority classes account for over half of the whole training set. This unbalanced distribution leads to highly unbalanced training in our study. As we will see in the dedicated test section, the distribution of training classes can heavily affect the recognition performance, and we implemented two oversampling processes to address the problem.
The unbalanced distribution of labels can mainly be affected by two facts. Firstly, the distribution actually reflects the diversity and frequency of the class labels within the source YouTube videos. For example, elements of chatting or musics can be captured in a large amount of video topics, from advertisement, news to cartoons. Brushing teeth, on the contrary, appears much less and typically just in some movie scenes or daily life recordings. Chatting can also involve several modalities according to the speaker’s gender, age and the context of the speech, while types of brushing teeth activities seem to be much more similar. Secondly, we are using only samples without label co-occurrence among the target classes. The distribution of such disjoint embeddings can therefore affect the actual distribution in our training set.
The effects of unbalanced training on classification have been discussed in several works (liu2007generative, ; chawla2002smote, ; han2005borderline, ). Without prior knowledge of the unbalanced priors, a classifier tends to predict the majority classes, and there should be a higher cost for misclassifying the minority classes (liu2007generative, ). In our scheme, we implemented random oversampling and the synthetic minority oversampling technique (SMOTE) (chawla2002smote, ) to handle the problem. The process of random oversampling can be divided into two steps. The first is to calculate the sampling size for each minority class, i.e. the difference in size between the target class and the majority class. Then each minority class is re-sampled with replacement until the sampling size is filled. This simply replicates existing data without introducing any extra information into the dataset. SMOTE, on the contrary, works by adding new elements to the minority classes. For an existing minority data point, it uses the K-nearest-neighbors (KNN) approach to find nearby points of the same class. One of these neighbors is then randomly selected, and a synthetic new element is generated between the data point and the selected neighbor and added to the minority class. In our implementation, the oversampling process was developed based on the Python imbalanced-learn package (JMLR:v18:16-365, ; chawla2002smote, ). All parameters were left at their defaults in imbalanced-learn version 0.3.3 except that the random state was set to 0. The oversampling process yielded 2,613,300 embedding vectors in total for the 15 classes, the same for both the random oversampling and SMOTE settings.
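The two oversampling strategies can be sketched in a few lines of plain Python. This is a minimal illustration, not the imbalanced-learn implementation used in the study: the helper names, the toy two-dimensional points and the neighbor count `k=2` are all assumptions chosen for the example. The last lines check that raising each of the 15 classes to the size of the largest class (174,220) reproduces the reported total.

```python
import random

def random_oversample(samples, target_size, rng):
    """Replicate existing samples, drawn with replacement, up to target_size."""
    extra = [rng.choice(samples) for _ in range(target_size - len(samples))]
    return list(samples) + extra

def k_nearest(sample, others, k):
    """Return the k points in `others` closest to `sample` (excluding itself)."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    candidates = [o for o in others if o is not sample]
    return sorted(candidates, key=lambda o: sq_dist(sample, o))[:k]

def smote_point(sample, neighbors, rng):
    """Create one synthetic point between `sample` and a random neighbor."""
    neighbor = rng.choice(neighbors)
    gap = rng.random()  # interpolation factor in [0, 1)
    return [s + gap * (n - s) for s, n in zip(sample, neighbor)]

rng = random.Random(0)  # fixed seed, mirroring the random state of 0 in the text
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # toy data

balanced = random_oversample(minority, 10, rng)
synthetic = smote_point(minority[0], k_nearest(minority[0], minority, 2), rng)

# Every class is raised to the size of the largest class.
MAJORITY_SIZE = 174_220
N_CLASSES = 15
total_after_oversampling = MAJORITY_SIZE * N_CLASSES  # 2,613,300
```

Random oversampling only ever returns points that already exist, whereas the SMOTE-style step returns a point on the segment between two existing points, which is why it adds new information to the minority class.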
Deep learning has been proven to be powerful for large-scale classification. Due to the considerable number of audio samples involved in our study, and also to keep the same feature format as released in the Audio Set, we adopted neural networks for both embedding feature extraction and classification in our proposed scheme. Fig. 2 shows the architecture of our scheme. Overall we implemented two networks, a pre-trained feature extraction network and a fine-tuned classification network. We used the VGGish model (hershey2017cnn, ) as the extraction network, and all parameters of this network were fixed during our training process. The classification network consists of 1-dimensional convolutional layers and dense layers. The parameters and weights of the classification network were trained and fine-tuned on the Audio Set data. Besides, we added an embedding segmentation process between the two networks to improve recognition performance.
In the initial Audio Set, the frame-level features of the audio clips were generated by a VGG-like acoustic model pre-trained on the YouTube-100M dataset. To enable researchers to extract features in the same format, Hershey et al. (hershey2017cnn, ) provided a TensorFlow version of the model called VGGish. It has been trained on the same YouTube-100M dataset and produces the same format of 128-dimensional embeddings for every second of audio. The VGGish model takes as input non-overlapping frames of the log mel spectrogram of the raw audio waveform. The source code and weights of the pre-trained VGGish model are released in the public Audio Set model GitHub repository: https://github.com/tensorflow/models/tree/master/research/audioset. The source code also includes the pre-processing steps for extracting the log mel spectrogram features that feed the model, and the post-processing steps for the PCA transform and element-wise quantization that were also applied to the released Audio Set data. In our implementation, the audio pre-processing step takes as input audio waveforms with 16-bit resolution. We therefore manually converted other audio formats (such as raw recordings from smart phones) using a free on-line converter (https://audio.online-convert.com/convert-to-wav) before passing the raw audio on for processing. The parameters of the VGGish network were kept constant during the whole training and validation process. The network then outputs a vector of 128 syntactic embeddings for every second of the input audio.
Our classification network consists of 3 plain convolutional layers and 2 dense (fully connected) layers. The structure is shown in Fig. 3. The convolutional layers are all 1-dimensional, with linear activation and ‘same’ padding to preserve the feature size. The numbers of channels are 19, 20 and 30 respectively for the 3 layers. The kernel size was set to 5 for all layers, with a stride of 1. We applied 500 neurons for the first dense layer. The second dense layer is the output layer, so it has 15 neurons and its activation was set to softmax. A flatten layer was used to connect the convolutional layers and the dense layers. We chose categorical cross entropy as the loss. As the optimizer, we applied stochastic gradient descent with Nesterov momentum. The learning rate was set to 0.001 with 1e-6 decay and 0.9 momentum. The network takes as input 128 * 1 segmented and normalized embeddings from the segmentation step of our architecture and outputs the predicted probability distribution over the labels. Under the top-1 classification scenario, the label with the highest probability is selected as the final prediction. Our classification network was built and compiled with the Python Keras API (chollet2015keras, ) on a TensorFlow (tensorflow2015-whitepaper, ) backend. The weights were trained and fine-tuned on the Audio Set embeddings.
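The layer sizes above determine the tensor shapes and the number of trainable parameters. The following sketch traces them through the network in plain Python rather than building the Keras model itself. It assumes every layer has a bias term, which is not stated in the text; the helper names are hypothetical.

```python
def conv1d_same(shape, out_channels, kernel_size):
    """Shape and parameter count of a stride-1 Conv1D with 'same' padding."""
    length, in_channels = shape
    # One kernel of size kernel_size * in_channels per output channel, plus biases.
    params = kernel_size * in_channels * out_channels + out_channels
    return (length, out_channels), params

def dense(in_units, out_units):
    """Output width and parameter count of a fully connected layer with biases."""
    return out_units, in_units * out_units + out_units

shape = (128, 1)        # one averaged 128-dimensional embedding, one channel
total_params = 0

for channels in (19, 20, 30):           # the three convolutional layers
    shape, n = conv1d_same(shape, channels, kernel_size=5)
    total_params += n

flat_units = shape[0] * shape[1]        # flatten layer: 128 * 30

units, n = dense(flat_units, 500)       # first dense layer
total_params += n
units, n = dense(units, 15)             # softmax output layer, one unit per activity
total_params += n
```

Under these assumptions the flattened feature has 3,840 elements and the network has 1,933,079 trainable parameters, almost all of them in the first dense layer.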
In addition to the neural networks and audio processing steps, we also applied embedding segmentation to determine the unit length of an audio segment for recognition. This is natural because the length of a single embedding vector (1 second) can be too short for some activities and may not capture enough information for recognition. Also, increasing the segment length can help to alleviate the effects of outliers and noise within real-world recordings. Hence, we introduced a segmentation process on embeddings between the two networks. For convenience, in the following sections we describe the length of a unit segment by its number of embedding vectors (1 second each). In our architecture, the segmentation is completed by grouping the embedding vectors using a fixed-size window with no overlap. The vectors are then averaged within each group to yield a new embedding vector. In other words, each unit audio segment is described by an averaged embedding vector. Activity labels are then assigned to the averaged vectors, and those vectors serve as the instances for classification. The embeddings are standardized using min-max scaling before being fed to the classification network.
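The segmentation step can be sketched as follows. This is a minimal illustration under two assumptions that the text does not specify: a trailing group shorter than the window is dropped, and the min-max statistics are computed per dimension over the vectors passed in. The function names and the toy two-dimensional vectors are hypothetical.

```python
def segment_and_average(vectors, window):
    """Group vectors with a non-overlapping fixed-size window and average each group."""
    segments = []
    # Assumption: an incomplete trailing group is discarded.
    for start in range(0, len(vectors) - window + 1, window):
        group = vectors[start:start + window]
        segments.append([sum(column) / window for column in zip(*group)])
    return segments

def min_max_scale(vectors):
    """Scale each dimension to [0, 1] using its min and max over `vectors`."""
    lows = [min(column) for column in zip(*vectors)]
    highs = [max(column) for column in zip(*vectors)]
    return [
        [(v - lo) / (hi - lo) if hi > lo else 0.0
         for v, lo, hi in zip(vector, lows, highs)]
        for vector in vectors
    ]

# Toy stand-in for 60 one-second embeddings (2 dimensions instead of 128).
embeddings = [[float(i), 2.0 * i] for i in range(60)]

segments = segment_and_average(embeddings, window=10)
scaled = min_max_scale(segments)
```

With a window of 10 vectors, 60 embeddings yield six averaged instances, so a longer window trades the number of classification instances for more context per instance.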
The source code of our overall architecture has been made publicly available at https://github.com/dawei-liang/Audioset_research. Both the oversampling and training processes were run on the Texas Advanced Computing Center (TACC) Maverick server. Specifically, we used the NVIDIA K40 GPU on the server to accelerate the training process. The training embeddings were split into 90% for training and 10% for validation using the Python Scikit-learn package (scikit-learn, ). The TensorFlow version provided was TensorFlow-GPU 1.0.0 (tensorflow2015-whitepaper, ). Before training, we set all random seeds to 0 to ensure the same training status. Besides, the batch size was 100 embedding vectors. The classification network was trained until the validation performance no longer improved (in our study, 15 to 20 epochs depending on the re-sampling set in use).
4. Feasibility Study
4.1. Dedicated Test
We evaluated the feasibility of our scheme with a single-subject study under a dedicated scenario. There were two purposes in doing so. Firstly, we wanted to check whether our proposed scheme can actually work on real-world ambient recordings. Although the architecture had been well trained on the Audio Set data, the characteristics of YouTube video sounds and real-world ambient sounds could be different. Secondly, we needed a real-world test to determine the best combination of sampling process and classifier. In this dedicated study, we collected sounds of the target activities in the wild by placing an off-the-shelf smart phone (Huawei P9) nearby for recording. The study is called ‘dedicated’ because the context of the activities was well controlled with low variability. Specifically, we tried our best to avoid irrelevant environmental noise such as the sound of a toilet fan or air conditioner during the collection. Also, when a target activity was performed there were no other on-going activities. When the study began, an expert (one of the authors of the paper) first placed the smart phone near where the activity was going to be performed. Then the expert started performing the activity and manually started the collection when the activity sound could be clearly captured. Sound collection for each activity lasted for 60 seconds, and the expert stopped recording when the time ended. The same process was repeated for each individual activity until the collection for all 15 activities was completed.
We chose a segmentation size of 10 embedding vectors (10 seconds) for the dedicated study. The recognition performance was evaluated with 3 different sampling processes (raw embeddings as input with no oversampling, random oversampling, and SMOTE). To make it clearer how the classification network performs, we also tuned and trained a random forest classifier on the same training sets as a baseline. The random forest was developed using the Python Scikit-learn package (scikit-learn, ). We used the overall accuracy and overall F-score as the performance metrics. In binary classification, the F-score is calculated as 2 * (precision * recall) / (precision + recall), so it incorporates both precision and recall. In our study, the overall F-score across multiple classes is the weighted average of the F-scores of the individual labels. Table 3 shows the recognition performance of the different architectures. For convenience, the random forest is abbreviated as RF and the classification network is referred to as CNN in the table. The results show that the random forest without any sampling process yields the worst accuracy and F-score (34.4% and 24.5% respectively). This is comparable to the dedicated study by Rossi et al. (rossi2012recognizing, ), where the authors trained a GMM on 4678 raw samples from the crowd-sourced Freesound dataset and obtained 38% overall accuracy for 23 context categories. Our classification network clearly improves the recognition performance, especially when combined with the oversampling processes. The combination of random oversampling and our classification network yields the best performance (81.1% overall accuracy and 80.0% overall F-score). In general, classifiers with oversampling outperform those without it. Fig. 4 shows in detail the performance of individual classes with and without oversampling.
The entries have been normalized for each class. As we can see, the classification network fed with raw embeddings over-fits to some of the majority classes such as ‘playing music’ and ‘strolling’. The network fed with the randomly oversampled embeddings, in contrast, yields similarly promising results for most classes. The worst class for the top-1 architecture was ‘flushing toilet’, with only 17% class accuracy. This is probably because the segmentation length was too long for the flushing activity, so too much irrelevant information was captured within the segments.
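The two evaluation quantities used in this section, the overall weighted F-score and the per-class normalized confusion entries, can be sketched in plain Python as follows. This is an illustrative sketch only: it assumes that the weights in the weighted average are the class supports (the number of true instances per class), and the function names are hypothetical.

```python
from collections import Counter


def weighted_f_score(y_true, y_pred):
    """Support-weighted average of per-class F-scores."""
    support = Counter(y_true)
    total = len(y_true)
    pairs = list(zip(y_true, y_pred))
    score = 0.0
    for label, n_true in support.items():
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = n_true - tp
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        if precision + recall:
            f_score = 2 * precision * recall / (precision + recall)
        else:
            f_score = 0.0
        score += f_score * n_true / total  # weight by class support
    return score


def row_normalized_confusion(y_true, y_pred):
    """Confusion matrix whose rows (true classes) each sum to 1."""
    labels = sorted(set(y_true) | set(y_pred))
    index = {label: i for i, label in enumerate(labels)}
    counts = [[0] * len(labels) for _ in labels]
    for t, p in zip(y_true, y_pred):
        counts[index[t]][index[p]] += 1
    matrix = []
    for row in counts:
        row_total = sum(row)
        matrix.append([c / row_total if row_total else 0.0 for c in row])
    return labels, matrix
```

The diagonal of the normalized matrix gives the per-class accuracy, which is the quantity quoted for individual classes above.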
To determine how the segmentation process affects the classification performance, we compared the overall F-score under different embedding segmentation sizes. The comparison is shown in Fig. 5. For reference, we also plotted the random guess level (around 0.07). The figure shows that the performance was worst when no segmentation was applied (i.e., 1 embedding vector per segment), with an F-score of only 0.65. With larger segment sizes, the F-score increased significantly to over 0.8. In addition, a length of 5 embedding vectors per segment already allowed the instances to capture enough information for the classification. Enlarging the segment size further did not improve the overall recognition performance.
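To make the notion of segment size concrete, the sketch below groups a sequence of per-second embedding vectors into fixed-size segments. It is an assumption-laden illustration: the segments here are non-overlapping, a trailing remainder shorter than one segment is dropped, and the function name `segment_embeddings` is hypothetical. How each segment is then fed to the classifier is not specified here.

```python
def segment_embeddings(vectors, segment_size):
    """Group consecutive embedding vectors into non-overlapping segments.

    Assumption: a trailing remainder shorter than segment_size is dropped.
    """
    if segment_size < 1:
        raise ValueError("segment_size must be at least 1")
    n_full = len(vectors) // segment_size
    return [
        vectors[i * segment_size:(i + 1) * segment_size]
        for i in range(n_full)
    ]


# The random-guess level for 15 equally likely classes.
RANDOM_GUESS_LEVEL = 1.0 / 15  # about 0.067, i.e. around 0.07
```

Under these assumptions, a 60-second recording with one embedding vector per second gives 6 segments at a segment size of 10, and 60 single-vector instances when no segmentation is applied.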
4.2. More Discussions towards Transfer Learning
We introduced the concept of transfer learning at the beginning of the paper. Typically it refers to modeling across different domains of data, such as from video to audio. We believe that our scheme also leverages this idea, because soundtracks from on-line YouTube videos can be very different from real-world audio recordings for activity recognition. In fact, our classification network only yielded 53% validation and training accuracy on the randomly oversampled Audio Set embeddings, yet the overall accuracy of the top-1 scheme reached over 80% on the ambient recordings. We also noticed that the validation performance on the training data could have been further improved by adopting a deeper architecture. However, increasing the depth of the model no longer helped its performance on real-world data (it might even harm the performance). A possible reason is that ambient sounds from the real world (especially in home settings) are generally less complex and more ‘linearly separable’ than those in YouTube videos. In other words, a model fit too closely to the Audio Set data is likely to over-fit with respect to sound recordings from our homes. Our research aims to provide a feasible mechanism based on the on-line Audio Set data to empower existing activity recognition systems, so we did not explore in depth the difference in acoustic features between these two domains. Further work can be done in the future to model such similarity from the perspective of feature engineering.
5. In-the-wild Tests
5.1. Test Design
With the dedicated recording tests, we verified the feasibility of the proposed scheme and determined the appropriate combination of oversampling and segmentation strategies within the proposed architecture. To generalize the study to more natural settings, we then conducted in-the-wild scripted tests with 4 human subjects in varied test environments. In the previous feasibility study, we made two assumptions about the test environment. First, there was little irrelevant environmental noise during the collection process. The audio samples were recorded by a smart phone nearby with almost no artificial or ambient disturbance. The start and end points of the collection were carefully selected to ensure high-quality recordings. Second, there was almost no overlap of similar sound features among the target activities. In other words, the activities were strictly mutually exclusive while being performed. In real-world settings, however, such assumptions are easily broken. For example, chatting or washing commonly occurs together with cleaning or kitchen work. In addition, human artifacts and ambient noise, such as the sounds of air conditioners or fans, are almost inevitable in practical audio collection. Also, in actual use people tend to perform activities in a more continuous way, and the scheme would not be convincing if it only worked on individual discrete samples. Hence, we are interested in how the proposed architecture performs under such natural circumstances.
The real-world tests were performed based on a scripted scenario. A key advantage of scripted tests is that following the script simulates the continuous flow of human activities in natural home settings. All target activities were listed in advance in the form of instructions such as “First head to the bathroom, wash your hands and face” or “After juice prepared, please warm some food using the microwave oven”. Each human subject then simply followed the instructions on paper and freely performed the activities. To collect the sounds, we used the same off-the-shelf device (Huawei P9). The smart phone was attached to the subject’s arm with a wristband (except for the class ‘Bathing/Showering’, where it could be placed nearby) so that the subjects could perform the activities without paying attention to the collection process. During the whole collection, an expert (one of the authors of the paper) followed the subjects while they were performing the activities but kept a distance (e.g., waiting outside the room while the subject was cleaning the room) to allow the subjects to act freely. The key roles of the expert were to answer the subjects’ questions during the test and to label the time stamps of the target activities using a timer started simultaneously with the smart phone. To avoid subjective bias, the volunteers should not be aware of the collected audio data, so they were not told the purpose of the study until the whole collection was completed. All participants of the study were required to sign an IRB protocol form before the tests.
As mentioned above, two important factors that need to be incorporated in the real-world tests are the co-occurrence of activity labels and environmental noise. To simulate concurrent activities, the expert occasionally introduced a small amount of free chatting during some of the activities, such as watching TV, frying or strolling. The subjects were also free to perform some activities simultaneously, such as washing hands or dishes while doing kitchen work, to simulate natural kitchen scenarios. All 4 tests were performed in the volunteers’ own homes, and they were allowed to leave other irrelevant household appliances, such as air conditioners or refrigerator compressors, on during the tests.
In our script, most activities only had to be performed once, and their length was determined freely by the participants. Participants were also encouraged to use their own tools or devices, such as their own shavers, vacuums and blenders, to reflect their normal lifestyles; the exception was that we brought them bacon, cucumbers or tomatoes for the activities ‘Frying food’, ‘Chopping food’ and ‘Squeezing juice’. Given the high variability of television programs, the participants were asked to watch 5 different channels for around 30 seconds each for the class ‘Watching TV’. Although we used the class ‘Piano’ for training, the subjects were allowed to play or listen to other types of music, such as guitar music or symphonies, for the class ‘Playing music’. In addition, the class ‘Shavering’ was waived for female subjects.
5.2. Results and Discussions
Based on the labeled time stamps, we manually segmented the relevant parts of the target activities from the raw recordings. In total, we obtained 9163 seconds (153 minutes) of audio for all 4 subjects. Within the collected audio data, 3493 seconds (58 minutes) of the clips were target-related, accounting for 38.12% of the total. The resulting sparsity is comparable to audio-based activity recognition in practice, as not all home-related activities generate specific sound features and some of them may not be suitable for audio-based recognition. We then applied the best architecture of the proposed scheme (classification network with random oversampling) for the evaluation. The segmentation length was chosen as 10 embedding vectors (10 seconds).
The test results were first examined for each individual participant. Table 4 shows the overall performance of the activity classes for single subjects. Because the lengths of the audio clips were highly unbalanced among the activities, we adopted the overall weighted average as the performance metric. That is, for a single subject, the contribution of each tested instance to the overall accuracy is inversely proportional to the length of its corresponding activity. With this weighting, each activity class within the subject contributes equally to the overall performance. As shown in Table 4, the average top-1 classification accuracy was 64.93% across all 4 tested subjects. Compared with the traditional top-1 classification, it may be more reasonable to evaluate the overall performance using a top-3 classification scenario, given the co-occurrence of activities during the tests. In the top-3 classification, the 3 predicted labels with the highest probabilities are all treated as final predictions, and a true positive is counted if any of the 3 labels matches the ground truth. This accommodates variation in the predictions due to similar sound features or concurrent activities. Table 4 shows that the top-3 performance was much better, with an average accuracy of 80.55% across all 4 subjects.
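The weighting and the top-3 criterion described above can be sketched as follows. This is a minimal illustration under stated assumptions: each test instance carries a dictionary of predicted class probabilities, activity length is measured by the number of instances of that class, and the function names are hypothetical.

```python
from collections import Counter


def top_k_hit(probabilities, true_label, k):
    """Return True if true_label is among the k highest-scoring labels."""
    ranked = sorted(probabilities, key=probabilities.get, reverse=True)
    return true_label in ranked[:k]


def class_balanced_top_k_accuracy(instances, k):
    """Top-k accuracy where every activity class contributes equally.

    instances: list of (true_label, {label: probability}) pairs.
    Each instance is weighted by 1 / (instances in its class), so the
    result equals the mean of the per-class top-k accuracies.
    """
    class_sizes = Counter(label for label, _ in instances)
    n_classes = len(class_sizes)
    score = 0.0
    for label, probabilities in instances:
        if top_k_hit(probabilities, label, k):
            score += 1.0 / (class_sizes[label] * n_classes)
    return score
```

Calling the function with `k=1` and `k=3` on the same instances gives the top-1 and top-3 figures for one subject; a long activity with many segments then counts no more than a short one.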
To evaluate the performance of individual activity classes, we also summarized the class accuracies across all tested subjects. Table 5 shows the statistics of the class performance. We calculated the average of both the top-1 and the top-3 classification accuracies for each activity class. We adopted a similar approach as above to calculate the cross-subject class performance. In other words, the data from each subject was weighted inversely proportional to its length, so that each subject contributed equally to the target class. In addition, the standard deviations (abbreviated as ‘Std’ in the table) of the class accuracies across the subjects are presented in Table 5. A lower standard deviation indicates more stable predictions and implies stronger robustness of the scheme to variation in actual recordings. Among the activities, ‘Shavering’, ‘Chopping food’ and ‘Squeezing juice’ showed the best performance, with 100% average class accuracy and 0 standard deviation. The class ‘Floor cleaning’ also yielded satisfactory results due to its clear and unique sound features. As Table 5 also shows, none of the flushing activities were successfully recognized by the proposed scheme. That is probably because the pumping process was too short for recognition and the sounds of water flushing were quite similar to those of washing or frying activities. The activities ‘Frying food’, ‘Boiling water’ and ‘Brushing teeth’ yielded a high standard deviation in the predictions. This is to be expected, because the modalities of cooking and boiling vary in practice depending on the choice between kettles and pans, cooking styles and types of food. The performance of kitchen activities was also affected by the use of range hoods by some of the participants. The brushing activity was probably affected mainly by the noise of toilet fans and the use of particular electric toothbrushes.
In addition, the predictive performance for outdoor strolling and bathing/showering increased significantly from the top-1 to the top-3 scenario. This reflects the effects of activity co-occurrence, as subjects tended to chat with the expert while walking, and showering or bathing was often mixed with washing activities. In particular, we noticed two main reasons for label co-occurrence in the real-world studies. The first was activity concurrence, as the subjects were allowed to perform more than one activity at the same time, including simultaneous showering and washing, or dish washing with frying. The other was the existence of similar sound elements among activities. For example, the sounds of nearby crowds could affect the classifier’s decision between strolling and chatting. Such factors largely distinguished the in-the-wild tests from the dedicated recording scenarios.
5.3. Discussions on Privacy Concerns
As presented by Chen et al. and Thomaz et al. (chen2005bathroom, ; thomaz2015inferring, ), the collection of audio for activity recognition can sometimes raise privacy concerns among the public. This is particularly true if the collection involves private activities or if it is performed at an inappropriate time. For example, the recordings of bathroom activities by Chen et al. (chen2005bathroom, ) can still be embarrassing for some people to share with clinicians. In our scripted tests, we faced similar cases when recording activities such as bathing. Such privacy issues can sometimes block audio-based activity recognition in practice. A feasible solution is to avoid using cloud or public servers when processing the user data. Our proposed scheme can also alleviate the concerns, since the training data comes only from public YouTube videos and no specific user data is required. In fact, audio-based activity recognition can sometimes reduce the need for privacy disclosure, for example as a substitute for in-person recordings in medical treatments. Some other feasible techniques for privacy protection in audio streams have been discussed by Wyatt et al. (wyatt2007conversation, ).
6. Conclusion
The collection of ground truth user data can be extremely time-consuming in multi-class activity recognition. This paper presented a novel scheme that leverages general YouTube video soundtracks as the only training set for audio-based activity recognition. Given the potential applications of audio-based activity recognition in smart homes, we designed the proposed scheme to recognize 15 common home-related activities. Because of the tremendous number of audio clips and the highly unbalanced distribution of the training classes, our scheme combined oversampling with deep learning architectures. To verify the idea and evaluate its performance under different real-world scenarios, we designed both a dedicated recording test and in-the-wild scripted tests. In the dedicated test, the proposed scheme yielded 81.1% overall accuracy and 80.0% F-score. In the real-world tests, the proposed scheme yielded 64.9% average top-1 classification accuracy and 80.6% top-3 classification accuracy based on 4 subjects. Several design considerations, such as the association of activity labels, the effects of segmentation and privacy concerns, were also discussed in the paper.
- ccs: Human-centered computing Empirical studies in ubiquitous and mobile computing
- Liming Chen, Chris D Nugent, and Hui Wang. A knowledge-driven approach to activity recognition in smart homes. IEEE Transactions on Knowledge and Data Engineering, 24(6):961–974, 2012.
- Agnes Grünerbl, Amir Muaremi, Venet Osmani, Gernot Bahle, Stefan Oehler, Gerhard Tröster, Oscar Mayora, Christian Haring, and Paul Lukowicz. Smartphone-based recognition of states and state changes in bipolar disorder patients. IEEE Journal of Biomedical and Health Informatics, 19(1):140–148, 2015.
- Nishkam Ravi, Nikhil Dandekar, Preetham Mysore, and Michael L Littman. Activity recognition from accelerometer data. In AAAI, volume 5, pages 1541–1546, 2005.
- Jennifer R Kwapisz, Gary M Weiss, and Samuel A Moore. Activity recognition using cell phone accelerometers. ACM SIGKDD Explorations Newsletter, 12(2):74–82, 2011.
- Alvina Anjum and Muhammad Usman Ilyas. Activity recognition using smartphone sensors. In Consumer Communications and Networking Conference (CCNC), 2013 IEEE, pages 914–919. IEEE, 2013.
- Edison Thomaz, Cheng Zhang, Irfan Essa, and Gregory D Abowd. Inferring meal eating activities in real world settings from ambient sounds: A feasibility study. In Proceedings of the 20th International Conference on Intelligent User Interfaces, pages 427–431. ACM, 2015.
- Mirco Rossi, Sebastian Feese, Oliver Amft, Nils Braune, Sandro Martis, and Gerhard Tröster. Ambientsense: A real-time ambient sound recognition system for smartphones. In Pervasive Computing and Communications Workshops (PERCOM Workshops), 2013 IEEE International Conference on, pages 230–235. IEEE, 2013.
- Gierad Laput, Yang Zhang, and Chris Harrison. Synthetic sensors: Towards general-purpose sensing. In Proceedings of the 2017 CHI Conference on Human Factors in Computing Systems, pages 3986–3999. ACM, 2017.
- Jianfeng Chen, Alvin Harvey Kam, Jianmin Zhang, Ning Liu, and Louis Shue. Bathroom activity monitoring based on sound. In International Conference on Pervasive Computing, pages 47–61. Springer, 2005.
- Long-Van Nguyen-Dinh, Mirco Rossi, Ulf Blanke, and Gerhard Tröster. Combining crowd-generated media and personal data: semi-supervised learning for context recognition. In Proceedings of the 1st ACM international workshop on Personal data meets distributed multimedia, pages 35–38. ACM, 2013.
- Mirco Rossi, Gerhard Tröster, and Oliver Amft. Recognizing daily life context using web-collected audio data. In Wearable Computers (ISWC), 2012 16th International Symposium on, pages 25–28. IEEE, 2012.
- Jort F Gemmeke, Daniel PW Ellis, Dylan Freedman, Aren Jansen, Wade Lawrence, R Channing Moore, Manoj Plakal, and Marvin Ritter. Audio set: An ontology and human-labeled dataset for audio events. In Acoustics, Speech and Signal Processing (ICASSP), 2017 IEEE International Conference on, pages 776–780. IEEE, 2017.
- Edison Thomaz, Irfan Essa, and Gregory D Abowd. A practical approach for recognizing eating moments with wrist-mounted inertial sensing. In Proceedings of the 2015 ACM International Joint Conference on Pervasive and Ubiquitous Computing, pages 1029–1040. ACM, 2015.
- Muhammad Shoaib, Stephan Bosch, Hans Scholten, Paul JM Havinga, and Ozlem Durmaz Incel. Towards detection of bad habits by fusing smartphone and smartwatch sensors. In Pervasive Computing and Communication Workshops (PerCom Workshops), 2015 IEEE International Conference on, pages 591–596. IEEE, 2015.
- Xing Su, Hanghang Tong, and Ping Ji. Activity recognition with smartphone sensors. Tsinghua Science and Technology, 19(3):235–249, 2014.
- Koji Yatani and Khai N Truong. Bodyscope: a wearable acoustic sensor for activity recognition. In Proceedings of the 2012 ACM Conference on Ubiquitous Computing, pages 341–350. ACM, 2012.
- Kyuwoong Hwang and Soo-Young Lee. Environmental audio scene and activity recognition through mobile-based crowdsourcing. IEEE Transactions on Consumer Electronics, 58(2), 2012.
- Justin Salamon, Christopher Jacoby, and Juan Pablo Bello. A dataset and taxonomy for urban sound research. In Proceedings of the 22nd ACM international conference on Multimedia, pages 1041–1044. ACM, 2014.
- Sebastian Säger, Benjamin Elizalde, Damian Borth, Christian Schulze, Bhiksha Raj, and Ian Lane. Audiopairbank: towards a large-scale tag-pair-based audio content analysis. EURASIP Journal on Audio, Speech, and Music Processing, 2018(1):12, 2018.
- Long-Van Nguyen-Dinh, Ulf Blanke, and Gerhard Tröster. Towards scalable activity recognition: Adapting zero-effort crowdsourced acoustic models. In Proceedings of the 12th International Conference on Mobile and Ubiquitous Multimedia, page 18. ACM, 2013.
- Derek Hao Hu, Vincent Wenchen Zheng, and Qiang Yang. Cross-domain activity recognition via transfer learning. Pervasive and Mobile Computing, 7(3):344–358, 2011.
- Ethan Fast, William McGrath, Pranav Rajpurkar, and Michael S Bernstein. Augur: Mining human behaviors from fiction to power interactive systems. In Proceedings of the 2016 CHI Conference on Human Factors in Computing Systems, pages 237–247. ACM, 2016.
- Shawn Hershey, Sourish Chaudhuri, Daniel PW Ellis, Jort F Gemmeke, Aren Jansen, R Channing Moore, Manoj Plakal, Devin Platt, Rif A Saurous, Bryan Seybold, et al. CNN architectures for large-scale audio classification. In Acoustics, Speech and Signal Processing (ICASSP), 2017 IEEE International Conference on, pages 131–135. IEEE, 2017.
- Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dandelion Mané, Rajat Monga, Sherry Moore, Derek Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda Viégas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. Software available from tensorflow.org.
- Qiuqiang Kong, Yong Xu, Wenwu Wang, and Mark D Plumbley. Audio set classification with attention model: A probabilistic perspective. arXiv preprint arXiv:1711.00927, 2017.
- Alexander Liu, Joydeep Ghosh, and Cheryl E Martin. Generative oversampling for mining imbalanced datasets. In DMIN, pages 66–72, 2007.
- Nitesh V Chawla, Kevin W Bowyer, Lawrence O Hall, and W Philip Kegelmeyer. SMOTE: synthetic minority over-sampling technique. Journal of Artificial Intelligence Research, 16:321–357, 2002.
- Hui Han, Wen-Yuan Wang, and Bing-Huan Mao. Borderline-SMOTE: a new over-sampling method in imbalanced data sets learning. In International Conference on Intelligent Computing, pages 878–887. Springer, 2005.
- Guillaume Lemaître, Fernando Nogueira, and Christos K. Aridas. Imbalanced-learn: A python toolbox to tackle the curse of imbalanced datasets in machine learning. Journal of Machine Learning Research, 18(17):1–5, 2017.
- François Chollet et al. Keras. https://keras.io, 2015.
- F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.
- Danny Wyatt, Tanzeem Choudhury, and Jeff Bilmes. Conversation detection and speaker segmentation in privacy-sensitive situated speech data. In Eighth Annual Conference of the International Speech Communication Association, 2007.