ActiveHARNet: Towards On-Device Deep Bayesian Active Learning for Human Activity Recognition
Various health-care applications such as assisted living and fall detection require modeling of user behavior through Human Activity Recognition (HAR). HAR using mobile- and wearable-based deep learning algorithms has been on the rise owing to advancements in pervasive computing. However, two challenges remain to be addressed: first, the deep learning model should support on-device incremental training (model update) from real-time incoming data points to learn user behavior over time, while also being resource-friendly; second, a suitable ground truthing technique (like Active Learning) should help establish labels on-the-fly while selecting only the most informative data points to query from an oracle. Hence, in this paper, we propose ActiveHARNet, a resource-efficient deep ensembled model which supports on-device Incremental Learning and inference, with capabilities to represent model uncertainties through approximations in Bayesian Neural Networks using dropout. This is combined with suitable acquisition functions for active learning. Empirical results on two publicly available wrist-worn HAR and fall detection datasets indicate that ActiveHARNet achieves a considerable efficiency boost during inference across different users, with a substantially lower number of acquired pool points (at least 60% reduction) during incremental learning on both datasets with various acquisition functions, thus demonstrating the feasibility of deployment and Incremental Learning.
1. Introduction
Human Activity Recognition (HAR) is an important technique to model user behavior for various health-care applications such as fall detection, fitness tracking, and health monitoring. The significant escalation in usage of mobile and wearable devices has opened up multiple avenues for sensor-based HAR research with various machine learning algorithms. Until recently, machine learning and deep learning algorithms for HAR have been restricted to cloud/server platforms and Graphics Processing Units (GPUs) for obtaining good performance. However, this paradigm has begun to shift with the increasing compute capabilities of the latest smartwatches and mobile phones. In particular, on-device machine learning for monitoring physical activities has been on the rise as an alternative to server-based computation owing to communication and latency overheads (survey, ).
Bringing deep learning to mobile and wearable devices (deep learning on the edge) has been an active area of research owing to its automatic feature extraction capabilities, in contrast to conventional machine learning models which require domain knowledge to craft shallow heuristic features. One of the unexplored areas involving deep learning for such HAR tasks is Active Learning, a technique which gives a model the ability to learn from real-world unlabeled data by querying an oracle (ActiveActivity, ). The integration of Bayesian techniques with deep learning provides a convenient way to represent model uncertainties by linking Bayesian Neural Networks (BNNs) with Gaussian processes using Dropout (dropout_yaringal, ) (Section 3.1). These are effectively combined with contemporary deep active learning acquisition functions for querying the most uncertain data points from the oracle (Section 3.3).
However, these works have not been discussed much in resource-constrained (on-device) HAR scenarios, particularly during Incremental Learning, where a small portion of unseen user data is utilized to update the model, thereby adapting to the new user’s characteristics (HARNet, ). In this paper, we investigate uncertainty-based deep active learning strategies in HAR and fall detection scenarios for wearable devices. The main scientific contributions of this paper include:
- A study of a sensor-based light-weight Bayesian deep learning model across various users on wrist-worn heterogeneous HAR and Fall Detection datasets.
- Leveraging the benefits of Bayesian Active Learning to model uncertainties, and exploiting several acquisition functions to acquire ground truths on-the-fly, thereby substantially reducing the labeling load on the oracle.
- Enabling Incremental Learning to facilitate continuous on-device model updates from incoming real-world data independent of users (User Adaptability), in turn eliminating the need to retrain the model from scratch.
The rest of the paper is organized as follows. Section 2 presents the related work in the area of deep learning and active learning for HAR. Section 3 discusses our approach to model uncertainties, the Bayesian HARNet model architecture, and the acquisition functions used for querying the oracle. The baseline evaluations for the model in a user-independent scenario on two different datasets are presented in Section 4. This is followed in Section 5 by a systematic evaluation of the proposed ActiveHARNet architecture in resource-constrained incremental active learning scenarios.
2. Related Work
Technological advancements in pervasive and ubiquitous health-care have drastically improved the quality of human life, and hence this has been an actively explored research area (Healthcare1, ), (Healthcare3, ). Sensor-based deep learning has been an evolving domain for computational behavior analysis and health-care research. Mobile/wearable deep learning for HAR and fall detection (wearable_fall, ) among elders has become the need of the hour for patient monitoring; it has been widely used with promising empirical results by effectively capturing the most discriminative features using convolutional (iot_wearable, ) and recurrent neural networks, Restricted Boltzmann Machines (deepwatch_wristsense, ) and other ensembled models (HARNet, ), (DeepSense, ). However, these works assume that the incoming streams of data points are labeled in real-time, thereby necessitating ground truthing techniques like active learning to handle unlabeled data.
Conventional Active Learning (AL) literature (Settles, ) mostly handles low-dimensional data for uncertainty estimation, and does not generalize to deep neural networks as the data is inherently high-dimensional (BayesianAL, ). Deep active learning using uncertainty representations, which is presently the state-of-the-art AL technique for high-dimensional data, has had very sparse literature. With the advent of Bayesian approaches to deep learning, Gal et al. (BayesianAL, ) proposed Bayesian Active Learning for image classification tasks, which was shown to learn from small amounts of unseen data, while Shen et al. (NER_AL, ) incorporate similar techniques for NLP sequence tagging. However, these techniques have largely not been discussed for sensor and time-series data.
Incorporating AL for obtaining ground truth in mobile sensing systems has been addressed in a few previous works, discussed as follows. Hossain et al. (ActiveActivity, ) incorporate a dynamic k-means clustering approach with AL in HAR tasks. However, this work predates the ubiquity of deep learning algorithms, thereby making the model dependent on heuristic hand-picked features. Lasecki et al. (Legion, ) discuss real-time, on-demand crowd sourcing to recognize activities using Hidden Markov Models (HMMs) on videos, but not on inertial data. Moreover, deep learning models have vastly outperformed HMMs in the video classification setting. Bhattacharya et al. (SouravUnlabeled, ) propose a compact and sparse coding framework to reduce the amount of ground truth annotation using unsupervised learning. Although these works achieve impressive results, the feasibility of on-device Incremental Learning (model update) scenarios with unlabeled data remains an open question.
In this paper, we propose ActiveHARNet: a unified and novel deep Incremental Active Learning framework for Human Activity Recognition and Fall Detection tasks for ground truthing on resource-efficient platforms, by modeling uncertainty estimates on deep neural networks using Bayesian approximations, thereby efficiently adapting to new user behavior.
3. Our Approach
In this section, we discuss in detail our proposed ActiveHARNet pipeline/architecture (showcased in Figure 1), and our approach to perform Incremental Active Learning.
3.1. Background on Modeling Uncertainties
Bayesian Neural Networks (BNNs) offer a probabilistic interpretation of deep learning models by placing a Gaussian prior (a probability distribution) $p(\omega)$ over the model parameters (weights $\omega$), thereby modeling output uncertainties. The likelihood model for a classification setting with $c$ classes and input points $x$ is given by

$$p(y = c \mid x, \omega) = \mathrm{softmax}(f^{\omega}(x)),$$

where $f^{\omega}(x)$ is the model output. However, the posterior distribution of a BNN is not easily tractable, which makes training and inference computationally intensive.
Gal et al. propose that Dropout, a stochastic regularization technique (dropout, ), can also perform approximate inference over a deep Gaussian process (dropout_yaringal, ), thereby learning the model posterior uncertainties without high computational complexity. This is equivalent to performing Variational Inference (VI), where the posterior distribution is approximated by finding another distribution $q^{*}_{\theta}(\omega)$, parameterized by $\theta$, within a family of simplified tractable distributions, while minimizing the Kullback-Leibler (KL) divergence between $q^{*}_{\theta}(\omega)$ and the true model posterior $p(\omega \mid D_{train})$.
During inference, we can estimate the mean and variance of the BNN’s output by applying dropout before every fully-connected layer during train and test time for multiple stochastic passes ($T$). This is equivalent to obtaining predictions and uncertainty estimates respectively from the approximate posterior output of the neural network, thereby making the Bayesian NN non-deterministic (dropout_yaringal, ). The predictive distribution for a new input data point $x^{*}$ can be obtained by

$$p(y^{*} \mid x^{*}, D_{train}) = \int p(y^{*} \mid x^{*}, \omega)\, p(\omega \mid D_{train})\, d\omega \approx \int p(y^{*} \mid x^{*}, \omega)\, q^{*}_{\theta}(\omega)\, d\omega \approx \frac{1}{T} \sum_{t=1}^{T} p(y^{*} \mid x^{*}, \hat{\omega}_{t}),$$

where $\hat{\omega}_{t} \sim q^{*}_{\theta}(\omega)$, and $q^{*}_{\theta}(\omega)$ is the dropout distribution approximated using VI. Dropout, being a light-weight operation in most existing NN architectures, enables easier and faster approximation of posterior uncertainties.
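The Monte Carlo averaging described above can be illustrated with a minimal sketch. It is not the HARNet implementation: the single fully-connected layer, the toy weights and the helper names (`stochastic_pass`, `mc_dropout_predict`) are assumptions made only to show how $T$ stochastic passes with dropout kept active are averaged into one predictive distribution.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def dropout(features, rate, rng):
    """Inverted dropout: zero each unit with probability `rate`, rescale the rest."""
    keep = 1.0 - rate
    return [f / keep if rng.random() < keep else 0.0 for f in features]


def stochastic_pass(features, weights, bias, rate, rng):
    """One forward pass: dropout, then a single fully-connected softmax layer."""
    dropped = dropout(features, rate, rng)
    logits = [sum(w * f for w, f in zip(row, dropped)) + b
              for row, b in zip(weights, bias)]
    return softmax(logits)


def mc_dropout_predict(features, weights, bias, rate=0.3, T=10, seed=0):
    """Run T stochastic passes with dropout active and average the outputs."""
    rng = random.Random(seed)
    passes = [stochastic_pass(features, weights, bias, rate, rng) for _ in range(T)]
    mean = [sum(p[c] for p in passes) / T for c in range(len(bias))]
    return passes, mean


# Toy (hypothetical) feature vector and layer for a 3-class problem.
features = [0.5, -1.2, 0.8, 2.0]
weights = [[0.2, -0.1, 0.4, 0.3],
           [-0.3, 0.5, 0.1, -0.2],
           [0.1, 0.1, -0.4, 0.2]]
bias = [0.0, 0.1, -0.1]

passes, mean_probs = mc_dropout_predict(features, weights, bias, rate=0.3, T=10)
print("averaged class probabilities:", mean_probs)
```

The per-pass outputs in `passes` are kept alongside the mean because uncertainty-based acquisition functions need the spread across passes, not only the average.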
3.2. Model Architecture
To extract discriminative features from inertial data, a combination of local features and spatial interactions between the three axes of the accelerometer can be exploited, using a combination of convolutional 1D and 2D layers.
Intra-axial dependencies are captured using a two-layer stacked convolutional 1D network, with 8 and 16 filters respectively and kernel size 2. Batch normalization is performed, followed by a max-pooling layer of size 2.
Inter-axial dependencies from the concatenated intra-axial features are captured using a two-layer stacked 2D CNN comprising 8 and 16 filters respectively with receptive field size 3x3, followed by batch normalization and a max-pooling layer of size 3x2.
This is followed by two Fully-Connected (FC) layers with 16 and 8 neurons respectively, with weight regularization (L2-regularizer with a weight decay constant) and ReLU activations. A dropout layer with a probability of 0.3 is applied, followed by a softmax layer to get the probability estimates (scores). The categorical cross-entropy loss of the model is minimized using the Adam optimizer with a learning rate of and implemented using the TensorFlow framework. We choose this HARNet architecture (HARNet, ) after extensive parametric optimization, as it is a state-of-the-art architecture for heterogeneous HAR tasks when efficiency, model size and inference times are taken into account, with very few parameters (31,000 parameters) compared to contemporary deep learning architectures.
In order to make HARNet (Figure 2) a Bayesian NN so as to obtain uncertainty estimates, we introduce a standard Gaussian prior on the set of our model parameters. Also, to perform approximate inference in our model, we perform dropout at train and test time as discussed in Section 3.1 to sample from the approximate posterior using multiple stochastic forward passes (MC-dropout) (dropout_yaringal, ). After experimenting with multiple numbers of dropout iterations (forward passes, T), T=10 is utilized in this paper to determine uncertainties. Effectively, HARNet is a Bayesian ensembled Convolutional Neural Network (B-CNN) which can model uncertainties, and which can be used with existing acquisition functions for AL.
3.3. Acquisition functions for Active Learning
As stated in (BayesianAL, ), given a classification model $M$, pool data $D_{pool}$ obtained from the real world, and inputs $x \in D_{pool}$, an acquisition function $a(x, M)$ is a function of $x$ that the active learning system uses to infer the next query point:

$$x^{*} = \arg\max_{x \in D_{pool}} a(x, M).$$

Acquisition functions are used in active learning scenarios for approximations in Bayesian CNNs, thereby arriving at the most efficient set of data points to query from $D_{pool}$. We examine the following acquisition functions to determine the most suitable function for on-device computation:
3.3.1. Max Entropy
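Pool points are chosen that maximize the predictive entropy (Max_entropy, ):

$$\mathbb{H}[y \mid x, D_{train}] := -\sum_{c} p(y = c \mid x, D_{train}) \log p(y = c \mid x, D_{train}).$$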
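As a small worked example of the max entropy criterion, the sketch below computes the predictive entropy of the MC-dropout-averaged class probabilities for one pool window. The function names and the toy probability values are hypothetical and are not taken from the paper's experiments.

```python
import math


def mean_prediction(mc_probs):
    """Average per-class probabilities over T stochastic passes."""
    T = len(mc_probs)
    return [sum(p[c] for p in mc_probs) / T for c in range(len(mc_probs[0]))]


def predictive_entropy(mc_probs):
    """Entropy (in nats) of the averaged predictive distribution."""
    mean = mean_prediction(mc_probs)
    return -sum(p * math.log(p) for p in mean if p > 0.0)


# Two hypothetical pool windows, each with T=2 passes over 3 classes.
confident_window = [[0.98, 0.01, 0.01], [0.97, 0.02, 0.01]]
uncertain_window = [[0.40, 0.30, 0.30], [0.30, 0.40, 0.30]]

print("confident:", predictive_entropy(confident_window))
print("uncertain:", predictive_entropy(uncertain_window))
```

The window with the flatter averaged distribution receives the higher score and would therefore be queried first.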
3.3.2. Bayesian Active Learning by Disagreement (BALD)
In BALD, pool points are chosen that maximize the mutual information between predictions and model posterior (BALD, ). The points that maximize this acquisition function are those on which the model is uncertain on average, while individual parameter settings drawn from the posterior disagree the most about the outcome:

$$\mathbb{I}[y, \omega \mid x, D_{train}] = \mathbb{H}[y \mid x, D_{train}] - \mathbb{E}_{p(\omega \mid D_{train})}\big[\mathbb{H}[y \mid x, \omega]\big],$$

where $\mathbb{H}[y \mid x, \omega]$ is the entropy of $y$, given model weights $\omega$.
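A minimal sketch of how the BALD score can be estimated from MC-dropout outputs follows, assuming the expectation over the posterior is replaced by an average over the $T$ stochastic passes. The helper names and the toy values are hypothetical.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a single probability vector."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def bald_score(mc_probs):
    """Entropy of the mean prediction minus the mean entropy of each pass."""
    T = len(mc_probs)
    mean = [sum(p[c] for p in mc_probs) / T for c in range(len(mc_probs[0]))]
    expected_entropy = sum(entropy(p) for p in mc_probs) / T
    return entropy(mean) - expected_entropy


# Passes that are individually confident but disagree with each other.
disagreeing = [[0.9, 0.1], [0.1, 0.9]]
# Passes that agree on being unsure: high entropy, no disagreement.
agreeing = [[0.5, 0.5], [0.5, 0.5]]

print("disagreeing passes:", bald_score(disagreeing))
print("agreeing passes:", bald_score(agreeing))
```

Both toy windows have the same averaged prediction, so max entropy cannot tell them apart, whereas BALD scores only the first one highly because its uncertainty comes from disagreement between passes.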
3.3.3. Variation Ratios
The LC (Least Confident) method for uncertainty-based pool sampling is performed as in (Var_Ratios, ):

$$\text{variation-ratio}[x] := 1 - \max_{y} p(y \mid x, D_{train}).$$
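The least-confident form of Variation Ratios can be sketched as one minus the largest averaged class probability. The sketch assumes the predictive probabilities are MC-dropout averages; the function name and toy values are hypothetical.

```python
def variation_ratio(mc_probs):
    """1 - max class probability of the prediction averaged over T passes."""
    T = len(mc_probs)
    mean = [sum(p[c] for p in mc_probs) / T for c in range(len(mc_probs[0]))]
    return 1.0 - max(mean)


# Hypothetical pool windows with T=2 passes over 3 classes.
confident_window = [[0.98, 0.01, 0.01], [0.96, 0.03, 0.01]]
uncertain_window = [[0.40, 0.35, 0.25], [0.30, 0.45, 0.25]]

print("confident:", variation_ratio(confident_window))
print("uncertain:", variation_ratio(uncertain_window))
```

Compared with the entropy-based scores, this needs only a mean and a maximum per window, with no logarithms.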
3.3.4. Random Sampling
This acquisition function is equivalent to selecting a point from a pool of data points uniformly at random.
4. Baseline Evaluation
To evaluate our Bayesian ActiveHARNet framework on an embedded platform, we experiment and analyze the results on two public wrist-worn datasets, which were collected from multiple users in real-world conditions. Datasets with multiple users were deliberately selected to exhibit the capabilities of ActiveHARNet such as incremental active learning and user adaptability. This is essentially the pre-training phase, after which the model is stored on the embedded system. To approximate our posterior with predictive uncertainties, we test our BNN model over T=10 stochastic iterations, and average our predictions to calculate our final efficiencies on both datasets. Also, each model is trained for a maximum of 50 epochs to establish the baseline efficiencies, as we observe that the loss saturates and does not improve further beyond this point.
4.1. Heterogeneous Human Activity Recognition (HHAR) Smartwatch Dataset
The HHAR dataset, proposed by Stisen et al. (HHAR, ), contains accelerometer data from different wearables (two LG G smartwatches and two Samsung Galaxy Gears) across nine users performing six activities: Biking, Sitting, Standing, Walking, Stairs-Up, and Stairs-Down in real-time heterogeneous conditions.
As performed in (HARNet, ), we first segment the raw inertial accelerometer data into two-second non-overlapping windows. To handle the disparity in sampling frequencies across devices, we perform Decimation, a down-sampling technique, on all windows to the lowest sampling frequency (100 Hz, Samsung Galaxy Gear) to obtain uniformly sampled data. Hence, the size of each window is 200 samples. Further, to obtain temporal and frequency information, we perform Discrete Wavelet Transform (DWT) and take only the Approximate coefficients. Note that performing such operations compresses the size of the sensor data by more than 50%. Also, we utilize only accelerometer data and not gyroscope data, since this halves the size of the dataset without compromising much on accuracy.
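The preprocessing chain above (windowing, decimation, DWT approximation coefficients) can be sketched on a single axis as follows. This is an illustration under stated assumptions rather than the authors' pipeline: the 200 Hz source rate, the block-averaging decimator and the Haar wavelet are all assumptions, since the text does not name the filter or the wavelet family.

```python
import math


def segment(signal, rate_hz, window_s=2.0):
    """Split a 1-D signal into non-overlapping windows of `window_s` seconds."""
    size = int(rate_hz * window_s)
    return [signal[i:i + size] for i in range(0, len(signal) - size + 1, size)]


def decimate(window, factor):
    """Naive decimation: average each block of `factor` samples into one value.

    The block average acts as a crude low-pass filter (an assumption).
    """
    return [sum(window[i:i + factor]) / factor
            for i in range(0, len(window) - factor + 1, factor)]


def haar_approx(window):
    """Single-level Haar DWT approximation coefficients (assumed wavelet)."""
    s = math.sqrt(2.0)
    return [(window[i] + window[i + 1]) / s for i in range(0, len(window) - 1, 2)]


# Hypothetical 4-second single-axis recording sampled at 200 Hz.
source_rate, target_rate = 200, 100
signal = [math.sin(0.05 * n) for n in range(4 * source_rate)]

windows = segment(signal, source_rate, window_s=2.0)
decimated = [decimate(w, source_rate // target_rate) for w in windows]
coeffs = [haar_approx(w) for w in decimated]

print(len(windows), "windows;", len(decimated[0]), "samples after decimation;",
      len(coeffs[0]), "approximation coefficients")
```

In this sketch each 200-sample window is reduced to 100 coefficients, i.e. a single-level transform halves the data per window.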
Initially, we benchmark our baseline accuracies on the server using Bayesian HARNet with the Leave-One-User-Out (LOOCV) strategy. The test user samples are split at random into pool ($D_{pool}$) and test ($D_{test}$) points, with $D_{pool}$ larger than $D_{test}$ (70-30 ratio) as an approximation of real-world incoming data, while the unseen $D_{test}$ is always used for evaluation purposes only in our experiments on both datasets. The exact numbers of $D_{pool}$ and $D_{test}$ points differ across users, and would vary in real-time use. The average accuracy using LOOCV is observed to be 61%. Also, from Figure 3, we can infer that the model performs the best on user ‘d’ data with 84% classification accuracy, while the classification accuracies of user ‘i’ and user ‘g’ are the lowest with 25% and 36% respectively. These disparities in accuracy can be attributed to the unique execution style of activities by the users.
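The evaluation protocol above can be sketched as a leave-one-user-out loop followed by a random 70-30 pool/test split of the held-out user's windows. The user identifiers, the integer stand-ins for windows and the function names are hypothetical.

```python
import random


def leave_one_user_out(data_by_user):
    """Yield (held_out_user, training_windows, held_out_windows) per fold."""
    for held_out in sorted(data_by_user):
        train = [w for user, ws in data_by_user.items()
                 if user != held_out for w in ws]
        yield held_out, train, data_by_user[held_out]


def pool_test_split(windows, pool_fraction=0.7, seed=0):
    """Randomly split one user's windows into a pool set and a test set."""
    rng = random.Random(seed)
    shuffled = list(windows)
    rng.shuffle(shuffled)
    cut = int(round(pool_fraction * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]


# Hypothetical data: integers stand in for preprocessed windows.
data_by_user = {
    "a": list(range(0, 10)),
    "b": list(range(10, 20)),
    "c": list(range(20, 30)),
}

folds = []
for user, train, held_out in leave_one_user_out(data_by_user):
    pool, test = pool_test_split(held_out, pool_fraction=0.7, seed=42)
    folds.append((user, len(train), pool, test))
    print(user, "train:", len(train), "pool:", len(pool), "test:", len(test))
```

The pre-trained model sees only `train`; `pool` feeds the acquisition step, and `test` is never used for updates.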
4.2. Notch Wrist-worn Fall Detection Dataset
This dataset uses an off-the-shelf Notch sensor as in (smartfall, ) (only the wrist-worn accelerometer data is used). The dataset was collected from seven volunteers across various age groups performing simulated falls and activities (activities are termed as not-falls).
We segment the raw inertial data of each activity into non-overlapping windows with the given standardized sampling frequency of 31.25 Hz as in (smartfall, ), hence no decimation is performed here. Further, similar to the HHAR dataset preprocessing, we perform Discrete Wavelet Transform (DWT) to obtain temporal and frequency information of the sensor data and take only the Approximate coefficients. Note that performing these operations compresses the size of the sensor data by more than 50%.
Table 1 shows the f1-scores and accuracies with the LOOCV strategy on the Bayesian HARNet. The f1-score is a better estimate for handling the imbalance between falls and activities, since a fall is the rare event in this binary classification setting. A similar randomized pool/test split (70-30 ratio) is performed, and the average f1-score on the test split using Bayesian HARNet is found to be 0.927, which is substantially higher than that of (smartfall, ), whose best-performing deep learning model’s f1-score is calculated to be 0.837 from its precision and recall scores.
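For reference, an f1-score is derived from precision and recall as their harmonic mean. The sketch below uses hypothetical precision and recall values; they are not the figures reported in (smartfall, ).

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0.0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)


# Hypothetical values for the rare "fall" class.
example_precision, example_recall = 0.9, 0.8
print("f1:", f1_score(example_precision, example_recall))
```

Because the harmonic mean is pulled toward the smaller of the two values, a model that rarely predicts the fall class cannot reach a high f1-score through precision alone.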
5. Incremental Active Learning
In order to handle ground truth labeling and incorporate model weight updates on incoming test user data, we experiment with incremental active training with LOOCV on both datasets by deploying the system on a Raspberry Pi 2. We choose this single-board computing platform since its hardware and software specifications are similar to those of predominant contemporary wearable devices. The number of acquisition windows used for incremental active training from the pool data can be governed by an acquisition adaptation factor in the range [0, 1]. The incremental model update was simulated for a maximum of 10 epochs, as the loss does not improve further thereafter.
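The role of the acquisition adaptation factor can be sketched as a top-k selection over per-window acquisition scores. The ceiling rounding rule and the function names are assumptions; with that rule, a pool of 123 windows at a factor of 0.5 gives 62 windows, which matches the count reported for user ‘i’ in Section 5.1.

```python
import math


def num_to_acquire(pool_size, factor):
    """Number of pool windows to query for a given adaptation factor in [0, 1]."""
    if not 0.0 <= factor <= 1.0:
        raise ValueError("adaptation factor must lie in [0, 1]")
    return math.ceil(factor * pool_size)  # rounding rule is an assumption


def select_queries(scores, factor):
    """Indices of the highest-scoring pool windows, most uncertain first."""
    k = num_to_acquire(len(scores), factor)
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]


# Hypothetical acquisition scores for four pool windows.
scores = [0.1, 0.9, 0.5, 0.7]
print("queried indices:", select_queries(scores, 0.5))
print("windows at factor 0.5 for a pool of 123:", num_to_acquire(123, 0.5))
```

A factor of 0.0 selects nothing and leaves the pre-trained model unchanged, while 1.0 queries the oracle for every pool window.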
5.1. ActiveHARNet on HHAR dataset
We analyze the AL acquisition functions mentioned in Section 3.3 for all users, particularly for the worst-performing user ‘i’ (lowest classification accuracy), and we can infer from Figure 4 that the Variation Ratios acquisition function performs the best, while Random Sampling has the lowest classification accuracy, as expected. With Variation Ratios, only 62 acquisition windows (adaptation factor 0.5, i.e. 50%) out of the total pool of 123 windows were required for user ‘i’ to reach a test accuracy of 70% from a baseline accuracy of 25%, a substantial increase of 45 percentage points. With an adaptation factor of 1.0 (all windows), a maximum of 73% is achieved.
We test ActiveHARNet using Variation Ratios across all users, and observe that the average accuracy increases from the 61% baseline (adaptation factor 0) to a maximum of 86% (adaptation factor 1) (Table 2). Also, relatively few acquisition windows (adaptation factor 0.4, accuracy 83.05%) from the incoming pool data are found to be sufficient for achieving accuracy competitive with the 85.87% obtained at an adaptation factor of 1.0. Note that an adaptation factor of 0.0 gives the efficiency of the pre-trained model without any data points acquired during incremental learning.
5.2. ActiveHARNet on Notch dataset
We analyze the AL acquisition functions for all users as in Section 5.1, and showcase the f1-scores of user 5 (the worst-performing user) in Figure 5. Variation Ratios again has the highest classification efficiency, requiring only 150 acquisition windows (adaptation factor 0.4) for a competitive f1-score of 0.956, out of the total of 265 windows (adaptation factor 1.0), for which the f1-score is 0.969. Random acquisition again yields the lowest f1-score.
Also, from Table 3, Variation Ratios across all users yields average f1-scores of 0.943 and 0.948 for adaptation factors of 0.4 and 0.6 respectively, from a baseline of 0.928 (adaptation factor 0.0), thus scaling well to new users with substantially fewer data points.
5.3. On-Device Incremental Learning
A Raspberry Pi 2 is used for evaluating incremental active learning, and each stochastic forward pass (T) with dropout takes an average of 1.4 seconds for acquisition of the top windows. Since we perform T=10 dropout iterations in our experiment, an average of 14 seconds is needed for querying the most uncertain data points. Variation Ratios was found to converge slightly faster, with better test accuracies, than the BALD and Max Entropy acquisition functions on both datasets, while Random Sampling converged relatively slower. Also, the relative increase in efficiency during incremental active learning is substantially larger for HHAR than for Notch. This can be attributed to the nature of the datasets, and the different ways people perform activities. Notch, being a fall detection problem with two classes, is more likely to yield high recognition efficiencies than the multi-class HHAR problem.
| Inference time | 14 ms | 11 ms |
| Discrete Wavelet Transform | 0.5 ms | 0.39 ms |
| Time taken per epoch | 1.8 sec | 1.2 sec |
The model size for HHAR was kB, while for Notch, it was found to be 180 kB, which is substantially small compared to conventional deep learning models. For real-time deployment feasibility, it is practical to have a threshold (upper limit) on the amount of pool data collected at a single point of time. This can be quantified by either the number of windows or the time taken (in seconds). We propose time as a benchmark (for instance, 15 minutes since the start of an activity cycle; the setting could be personalized based on the user), so that the oracle can still remember the recently performed activities when queried for ground truth. Many such acquisition iterations, and in turn model updates, would ideally happen in real time, and our experiments showcase the practical feasibility of one such acquisition iteration taking 14 seconds. The end-user can also trade off efficiency against model update time, which is proportional to the number of data points queried from the oracle.
6. Conclusion
This paper presents three new empirical contributions with emphasis on Bayesian Active Learning for embedded/wearable deep learning in HAR and fall detection scenarios. First, we benchmark our efficiencies for both datasets using the Bayesian HARNet model, which can incorporate uncertainties. Second, we systematically analyze various acquisition functions for active learning and exploit Bayesian Neural Networks with stochastic dropout for extracting the most informative data points to be queried from the oracle, using uncertainty approximations. Third, we propose ActiveHARNet, a resource-friendly unified framework which facilitates on-device Incremental Learning (model update) for seamless physical activity monitoring and fall detection over time, and which can further be extended to other behavior monitoring tasks in pervasive healthcare.
References
- (1) Bhattacharya, S., and Lane, N. D. From smart to deep: Robust activity recognition on smartwatches using deep learning. In 2016 IEEE International Conference on Pervasive Computing and Communication Workshops (PerCom Workshops) (2016), IEEE, pp. 1–6.
- (2) Bhattacharya, S., Nurmi, P., Hammerla, N., and Plötz, T. Using unlabeled data in a sparse-coding framework for human activity recognition. Pervasive and Mobile Computing (2014), 242–262.
- (3) Freeman, L. C. Elementary Applied Statistics: For Students in Behavioral Science. John Wiley & Sons, 1965.
- (4) Gal, Y., and Ghahramani, Z. Dropout as a bayesian approximation: Representing model uncertainty in deep learning. In Proceedings of the 33rd International Conference on Machine Learning - Volume 48 (2016), ICML’16, pp. 1050–1059.
- (5) Gal, Y., Islam, R., and Ghahramani, Z. Deep bayesian active learning with image data. In Proceedings of the 34th International Conference on Machine Learning - Volume 70 (2017), ICML’17, pp. 1183–1192.
- (6) Hossain, H. M. S., Roy, N., and Khan, M. A. A. H. Active learning enabled activity recognition. In 2016 IEEE International Conference on Pervasive Computing and Communications (PerCom) (2016), pp. 1–9.
- (7) Houlsby, N., Huszár, F., Ghahramani, Z., and Lengyel, M. Bayesian active learning for classification and preference learning. arXiv preprint arXiv:1112.5745 (2011).
- (8) Lane, N. D., Bhattacharya, S., Georgiev, P., Forlivesi, C., and Kawsar, F. An early resource characterization of deep learning on wearables, smartphones and internet-of-things devices. In Proceedings of the 2015 International Workshop on Internet of Things Towards Applications (2015), IoT-App ’15, ACM, pp. 7–12.
- (9) Lasecki, W. S., Song, Y. C., Kautz, H., and Bigham, J. P. Real-time crowd labeling for deployable activity recognition. In Proceedings of the 2013 Conference on Computer Supported Cooperative Work (2013), CSCW ’13, ACM, pp. 1203–1212.
- (10) Liu, X., Liu, L., Simske, S. J., and Liu, J. Human daily activity recognition for healthcare using wearable and visual sensing data. In IEEE International Conference on Healthcare Informatics (ICHI) (2016), IEEE, pp. 24–31.
- (11) Mauldin, T. R., Canby, M. E., Metsis, V., Ngu, A. H. H., and Rivera, C. C. Smartfall: A smartwatch-based fall detection system using deep learning. Sensors 18 (2018).
- (12) Ojetola, O., Gaura, E. I., and Brusey, J. Fall detection with wearable sensors–safe (smart fall detection). In Seventh International Conference on Intelligent Environments (2011), pp. 318–321.
- (13) Osmani, V., Balasubramaniam, S., and Botvich, D. Human activity recognition in pervasive health-care: Supporting efficient remote collaboration. Journal of Network and Computer Applications 31, 4 (2008), 628–655.
- (14) Settles, B. Active learning. Synthesis Lectures on Artificial Intelligence and Machine Learning (2012).
- (15) Shannon, C. E. A mathematical theory of communication. Bell system technical journal 27, 3 (1948), 379–423.
- (16) Shen, Y., Yun, H., Lipton, Z., Kronrod, Y., and Anandkumar, A. Deep active learning for named entity recognition. Proceedings of the 2nd Workshop on Representation Learning for NLP (2017).
- (17) Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15 (2014), 1929–1958.
- (18) Stisen, A., Blunck, H., Bhattacharya, S., Prentow, T. S., Kjærgaard, M. B., Dey, A., Sonne, T., and Jensen, M. M. Smart devices are different: Assessing and mitigating mobile sensing heterogeneities for activity recognition. In Proceedings of the 13th ACM Conference on Embedded Networked Sensor Systems (2015), SenSys ’15, ACM, pp. 127–140.
- (19) Sundaramoorthy, P., Gudur, G. K., Moorthy, M. R., Bhandari, R. N., and Vijayaraghavan, V. Harnet: Towards on-device incremental learning using deep ensembles on constrained devices. In Proceedings of the 2nd International Workshop on Embedded and Mobile Deep Learning (2018), EMDL’18, ACM, pp. 31–36.
- (20) Wang, J., Chen, Y., Hao, S., Peng, X., and Hu, L. Deep learning for sensor-based activity recognition: A survey. Pattern Recognition Letters (2019), 3–11.
- (21) Yao, S., Hu, S., Zhao, Y., Zhang, A., and Abdelzaher, T. Deepsense: A unified deep learning framework for time-series mobile sensing data processing. In Proceedings of the 26th International Conference on World Wide Web (2017), WWW ’17, pp. 351–360.