DFTerNet: Towards 2-bit Dynamic Fusion Networks for Accurate Human Activity Recognition


Zhan Yang, Osolo Ian Raymond, ChengYuan Zhang, Ying Wan, Jun Long
School of Information Science and Engineering, Central South University, Changsha 410083, China
Network Resources Management and Trust Evaluation Key Laboratory of Hunan Province
junlong@csu.edu.cn
Corresponding author. This work was supported in part by the National Natural Science Foundation of China (61472450), the Natural Science Foundation of Hunan Province (2017JJ3417) and the Science and Technology Plan of Hunan (2016TP1003).©2018 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.
July 15, 2019

I Introduction

Artificial Intelligence (AI), as an auxiliary technology in modern games, has played an indispensable role in improving the gaming experience over the last decade. The film Ready Player One vividly shows the charm that future virtual games will hold for the world, and it demonstrates that one of the core technologies of virtual-realistic interaction is recognizing all kinds of complex activities.

Convolutional neural networks (CNNs) are very powerful and have been used successfully in many neural network models. They have been widely applied in many practical virtual-realistic interactive applications, e.g., object recognition [1, 2, 3], the Internet of Things [4, 5] and human activity recognition (HAR) [6, 7]. Their success has been driven by the recent data explosion as well as the increase in model size. However, their large computational cost limits their practical application on portable devices without high-performance graphics processing units (GPUs), as shown in Figure 1.


Figure 1: Rough numbers for the energy consumption of basic computations (45 nm technology) [18] and some effective methods for deploying deep neural networks on portable devices.

With the development of VR/AR technology, sensor-based portable gaming devices increasingly rely on human activity recognition and detection techniques, so it is desirable to deploy advanced, accurate CNNs, e.g., Inception-Nets [8], ResNets [9] and VGG-Nets [10], on smart portable devices. However, the following problems limit the applicability of this idea. First, the winner of the ILSVRC-2015 competition, ResNet-152 [9], is trained with nearly 19.4 million real-valued parameters to classify images, making it resource-intensive in several respects; it cannot run on portable devices for real-time applications due to its high CPU/GPU workload and memory usage requirements. A similar situation occurs with other deep networks, such as VGG-Net and AlexNet [11]. Second, in practical applications, multiple sensors located at different positions on the body each produce signals that must be processed separately. Depending on the type of activity being performed and on its location, a sensor may contribute more or less to the overall result than the other sensors. Therefore, computational complexity can be decreased and the performance of the model improved by reducing the representations from sensors that have "less contribution" during a particular activity.

Recently, methods for resolving these storage and computational problems [12, 13] have been categorized into three families: network pruning, low-rank decomposition and network quantization. Among them, network quantization has received more and more research attention, and DCNNs with binary weights and activations have been designed [14, 15, 16]. Binary Convolutional Neural Networks (BCNNs), with weights and activations constrained to only two values (e.g., -1, +1), can bring great benefits to specialized machine learning hardware for the following major reasons: (1) the quantized weights and activations reduce memory usage and model size by a factor of 32 compared to the full-precision version; (2) if networks are binary, then most multiply-accumulate operations (which require at least hundreds of logic gates) can be replaced by popcount-XNOR operations (which require only a single logic gate per bit), and these are especially well suited to FPGAs and ASICs [17].
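
To make the arithmetic saving concrete, the following minimal sketch (our illustration, not code from any of the cited papers) shows how the dot product of two {-1, +1} vectors packed into machine words reduces to a single XNOR followed by a population count:

    # Minimal sketch: dot product of two {-1,+1} vectors via XNOR + popcount.
    # Bit encoding (our assumption): bit 1 <-> +1, bit 0 <-> -1.
    def binary_dot(x_bits: int, w_bits: int, n: int) -> int:
        """Dot product of two n-element {-1,+1} vectors packed as integers."""
        agree = ~(x_bits ^ w_bits) & ((1 << n) - 1)  # XNOR, masked to n bits
        matches = bin(agree).count("1")              # popcount
        return 2 * matches - n                       # (#agree) - (#disagree)

    # Example: x = [+1, +1, -1, +1], w = [+1, -1, +1, +1] (LSB first) -> dot = 0
    assert binary_dot(0b1011, 0b1101, 4) == 0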

However, quantization usually causes severe degradation in prediction accuracy. The reported accuracy of the resulting models is unsatisfactory on complex tasks (e.g., the ImageNet dataset). More concretely, Reference [16] shows that binarizing the weights causes the accuracy of ResNet-18 to drop by about 9% (GoogLenet drops by about 6%) on the ImageNet dataset. It is obvious that there is a considerable gap between the accuracy of a quantized model and that of the full-precision model. In light of these considerations, this paper proposes a novel quantization method and a dynamic fusion strategy to enable the deployment of an accurate, low-computation-cost neural network model on portable devices. The main contributions of this paper are summarized as follows:

  1. We propose a quantization function $Q(\cdot)$ with an elastic scale parameter to quantize the entire full-precision convolutional neural network. The quantization of weights, activations and fusion weights is derived from the same quantization function with different scale parameters. We quantize the weights and activations to 2-bit (ternary) values and use a masked Hamming distance instead of floating-point matrix multiplication. This setup achieves an accuracy close to that of the full-precision counterpart while using about 11× less memory and achieving a 9× speedup.

  2. We introduce a dynamic fusion strategy for multi-sensor activity recognition. For sensors whose "contribution" (sub-network) is less than the others, we randomly reduce their representations through fusion weights, which are sampled from a Bernoulli distribution given by the scale parameter of the quantization method. Experimental results show that by adopting the dynamic fusion strategy, we achieve higher accuracy and lower memory usage than the baseline model.

Ideally, combining the quantized weights, the quantized activation modules and the fusion strategy will result in better accuracy, eventually exceeding the full-precision baselines. The training strategy reduces the amount of computing power and energy required, thereby realizing the main objective of designing a system that can be deployed on portable devices. More importantly, adopting a dynamic fusion strategy for different types of activities is more in line with practical situations. This was verified by using both the quantization method and the fusion strategy on the OPPORTUNITY and PAMAP2 datasets; only the quantization method was applied to the UniMiB-SHAR dataset. To the best of our knowledge, this is the first time that both quantization and a dynamic fusion strategy have been adopted in convolutional networks to achieve high prediction accuracy on complex human activity recognition tasks.

The remainder of this paper is structured as follows. In Section II, we briefly introduce related work on human activity recognition and on quantization methods for convolutional neural networks. In Section III, we highlight the motivation of our method and provide some theoretical analyses of its implementation. In Section IV, we introduce our experiments. Section V experimentally demonstrates the efficacy of our method, and Section VI draws conclusions and outlines future work.

II Related Work

i Convolutional Neural Networks for Human Activity Recognition

Several advanced approaches to human activity recognition (HAR) have been evaluated in the last few years. Accuracies for HAR without deep learning methods are often relatively low. For example, hand-crafted feature methods [19] use simple statistics (e.g., std, avg, mean, max, min, median) or frequency-domain correlation features based on the Fourier transform of the signal to analyze time series of human activity data. Because of their simple setup and low computational cost, such methods are still used in some areas, but their accuracy cannot satisfy the expectations of modern AI games. Reference [20] adopts SVM, Random Forest, Dynamic Time Warping or HMM for predicting action classes; these methods work well when data are scarce and highly unbalanced. However, when faced with recognizing complex high-level behaviors, identifying the relevant features through these traditional approaches is time-consuming [21].

Recently, many studies have adopted Convolutional Neural Networks (CNNs) to build HAR systems, e.g., [6, 21, 22, 23, 24]. Convolutional Neural Networks were inspired by the discovery of visual cortical cells and retain the spatial information of the data through receptive fields. It is known that the power of CNNs stems in large part from their ability to exploit symmetries through a combination of weight sharing and translation equivariance. With their ability to act as feature extractors, multiple convolution operators are stacked to create a hierarchy of progressively more abstract features. Apart from image recognition [11, 25, 26], NLP [27, 28] and video recognition [29], more and more studies in recent years have used CNNs to learn sensor-based data representations for human activity recognition and have achieved remarkable performance [6, 30]. The model of [24] consists of two or three temporal convolution layers with the ReLU activation function, each followed by a max-pooling layer, and a softmax classifier, and it can be applied over all sensors simultaneously. Reference [6] introduces four temporal convolutional layers applied to a single sensor, followed by a fully-connected layer and a softmax classifier. Research shows that deeper networks can find correlations between different sensors.

Like the works discussed above, we adopt convolutional neural networks to learn representations from wearable multi-sensor signal sources. However, these advanced high-precision models are difficult to deploy on portable devices due to their computational complexity and energy consumption. Fortunately, quantization of convolutional neural networks has become a hot research topic; it aims to reduce memory usage and computational complexity while maintaining acceptable accuracy.

ii Quantization Model of Convolutional Neural Networks

The binary convolutional neural network is not a new topic. Inspired by neuroscience, the unit step function has been used as an activation function in artificial neural networks [31]. The binary activation mode can use spiking responses for computing and communication, which is an energy-efficient method because it only consumes energy when necessary [12].

Recently, Binarized Neural Networks (BNNs) [15] successfully quantized the weights and activations of each layer to binary values. Two binarization functions were proposed: the first is deterministic, as shown in (1), and the second is stochastic, as shown in (2), where $x^b$ is the binarized variable, $x$ is the full-precision variable and $\sigma(x)$ is the "hard sigmoid" function.

$x^b = \mathrm{sign}(x) = \begin{cases} +1, & \text{if } x \ge 0 \\ -1, & \text{otherwise} \end{cases}$   (1)

$x^b = \begin{cases} +1, & \text{with probability } p = \sigma(x) \\ -1, & \text{with probability } 1 - p \end{cases}, \qquad \sigma(x) = \mathrm{clip}\!\left(\frac{x+1}{2},\, 0,\, 1\right)$   (2)
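
A minimal NumPy sketch of the two binarization schemes of [15] (the hard-sigmoid form is taken from that paper; the rest is our illustration):

    import numpy as np

    def hard_sigmoid(x):
        # "hard sigmoid": clip((x + 1) / 2, 0, 1)
        return np.clip((x + 1.0) / 2.0, 0.0, 1.0)

    def binarize_deterministic(x):
        # Eq. (1): sign function, x >= 0 -> +1, otherwise -> -1
        return np.where(x >= 0, 1.0, -1.0)

    def binarize_stochastic(x, rng=np.random.default_rng(0)):
        # Eq. (2): +1 with probability p = hard_sigmoid(x), -1 otherwise
        return np.where(rng.random(x.shape) < hard_sigmoid(x), 1.0, -1.0)

    x = np.array([-1.5, -0.2, 0.0, 0.7])
    print(binarize_deterministic(x))  # [-1. -1.  1.  1.]
    print(binarize_stochastic(x))     # random draw, e.g. [-1. -1.  1.  1.]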

TWN [32] constrains the weights to ternary values $\{-1, 0, +1\}$ using symmetric thresholds. In each layer, the quantization of TWN is shown in (3), where $\Delta$ is a positive threshold parameter. The authors claim a better trade-off between model complexity and generalization.

$w^t = \begin{cases} +1, & \text{if } w > \Delta \\ 0, & \text{if } |w| \le \Delta \\ -1, & \text{if } w < -\Delta \end{cases}$   (3)

DoReFa-Net [33] is derived from AlexNet; it has 1-bit weights, 2-bit activations and 6-bit gradients, and achieves 46.1% top-1 accuracy on the ImageNet validation set. DoReFa-Net adopts the method shown in (4), where $w$ and $w^q$ are the full-precision (original) and quantized weights, respectively, and $\mathbb{E}(|w|)$ is the mean absolute value of the weights.

$w^q = \mathbb{E}(|w|) \times \mathrm{sign}(w)$   (4)
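
The following sketch (our reading of (3) and (4)) contrasts the TWN thresholding rule with the DoReFa-style 1-bit weight quantization:

    import numpy as np

    def twn_ternarize(w, delta):
        # Eq. (3): map each weight to {-1, 0, +1} using the symmetric threshold delta > 0
        return np.where(w > delta, 1.0, np.where(w < -delta, -1.0, 0.0))

    def dorefa_binarize_weights(w):
        # Eq. (4): scale the sign of each weight by the mean absolute value E(|w|)
        return np.mean(np.abs(w)) * np.sign(w)

    w = np.array([0.9, -0.05, 0.3, -0.6])
    print(twn_ternarize(w, delta=0.2))  # [ 1.  0.  1. -1.]
    print(dorefa_binarize_weights(w))   # 0.4625 * sign(w)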

iii Quantization Method for Convolutional Neural Networks

The idea of quantizing both weights and activations was first proposed by [15]. That research made the following two contributions: 1) the costly arithmetic operations between weights and activations in a full-precision network can be replaced with cheap bitcount and XNOR operations, which can result in significant speed improvements, and, compared with the full-precision counterpart, 1-bit quantization reduces memory usage by a factor of 32; 2) in some visual classification tasks, 1-bit quantization achieves fairly good performance.

Some researchers [16, 34] introduce simple, high-performance and accurate approximations of convolutional neural networks by quantizing the weights with a uniform quantization method, which first scales each value to the range $[0, 1]$ and then adopts the $k$-bit quantization shown in (5), where $\mathrm{round}(\cdot)$ approximates continuous values by their nearest discrete states. The benefit of this quantization method is that, when calculating the inner product of two quantized vectors, costly arithmetic calculations can be replaced by cheap operations (e.g., bit shifts and count operations). In addition, this quantization method is rule-based and thus easy to implement.

$q = \frac{1}{2^k - 1}\, \mathrm{round}\!\big( (2^k - 1)\, x \big)$   (5)
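
A short sketch of the rule-based uniform quantizer in (5), assuming the input has already been scaled to [0, 1]:

    import numpy as np

    def uniform_quantize(x, k):
        # Eq. (5): round x in [0, 1] to the nearest of 2^k evenly spaced levels
        n = 2 ** k - 1
        return np.round(n * x) / n

    x = np.array([0.0, 0.13, 0.5, 0.88, 1.0])
    print(uniform_quantize(x, k=2))  # levels {0, 1/3, 2/3, 1}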

Reference [35] proposes a network compression method called INQ. After a network is obtained through training, the (full-precision) parameters of each layer are first divided into two groups: the parameters in the first group are directly quantized and fixed, while the other group is retrained to compensate for the accuracy loss caused by quantization. This process iterates until all parameters are quantized. With incremental quantization, weights with small bit widths (e.g., 3-bit, 4-bit and 5-bit) result in almost no accuracy loss compared with the full-precision counterpart. The quantization method is shown in (6), where $w$ and $\hat{w}$ are the full-precision (original) and quantized weights, respectively, and $\alpha$ and $\beta$ are the lower and upper bounds of the quantized set, respectively.

$\hat{w} = \begin{cases} \beta\, \mathrm{sign}(w), & \text{if } (\alpha + \beta)/2 \le |w| < 3\beta/2 \\ 0, & \text{otherwise} \end{cases}$   (6)

Reference [36] proposes the method shown in (7), where $s$ is a scalar parameter, $\circ$ is the Hadamard product, and $\mathrm{sign}(\cdot)$ and $\mathrm{abs}(\cdot)$ return the sign and absolute value of each element, respectively. The method quantizes gradients to ternary values, which can effectively improve client-to-server communication in distributed learning.

$\tilde{g} = s \cdot \mathrm{sign}(g) \circ b, \qquad s = \max\big(\mathrm{abs}(g)\big)$   (7)

Reference [37] proposes greedy approximation, which instead tries to learn the quantization as shown in (8), where the $B_i$ are binary filters and the $\alpha_i$ are optimization parameters.

$\min_{\{\alpha_i, B_i\}} \left\| W - \sum_{i=1}^{m} \alpha_i B_i \right\|^2$   (8)

The greedy approximation extends to $m$-bit quantization by minimizing the residue at each step in turn. Although it cannot achieve a high-precision solution, the formulation of minimizing the quantization error is very promising, and quantized neural networks designed in this manner can be effectively deployed on modern portable devices.


Figure 2: An overview of the three fusion strategies and the architecture of the hierarchical DFTerNet for activity recognition. (a) Early fusion, (b) Late fusion and (c) Dynamic fusion are summarized in Section iv. From the left of each sub-figure, the multi-sensor signal sources from the different body positions are processed by a common convolutional network in (a) and by three sub-convolutional networks in (b) and (c). The input sensor signals have size (length of feature maps) × (number of sensor channels). The Conv. settings ((kernel size), (sliding stride), number of kernels) for blocks 1, 2 and 3 are ((11,1),(1,1),50), ((10,1),(1,1),40) and ((6,1),(1,1),30), respectively. The pool sizes for blocks 1, 2 and 3 are (2,1), (3,1) and (1,1), respectively. The dense layer has 1000 neurons. The tensors $\pi$ are the fusion weights.

III Method

In this section, we introduce our quantization method and dynamic fusion strategy, which we term DFTerNet (Dynamic-Fusion-Ternary(2-bit)-Convolutional-Network) for convenience. We aim to recognize human activities from IMU sensor data. For this purpose, a fully-convolutional architecture is chosen, and we focus on the recognition accuracy of the final model. At train-time (training), we still use the full-precision network (the real-valued weights are retained and updated at each epoch); at run-time (inference), we use ternary weights in the convolutions.

i Linear Mapping

In this paper, we propose a quantization function $Q(\cdot)$ that converts a floating-point value $x$ into its $k$-bitwidth signed representation. Formally, it can be defined as follows:

$Q(x \mid k, s) = \mathrm{clip}\!\left( \Delta \cdot \mathrm{round}\!\left( \frac{x}{\Delta} \right),\, -s,\, s \right)$   (9)

where $\Delta$ is the uniform distance, whose role is to discretize the $k$-bit linear mapping of continuous, unbounded values, $s$ is a scale parameter, $\mathrm{round}(\cdot)$ is the approximation function that approximates continuous values by their nearest discrete states, and $\mathrm{clip}(\cdot)$ clips unbounded values to $[-s, s]$.

For example, when the scale parameter $s = 0.5$, $Q(\cdot)$ quantizes an input to $\{-0.5, 0, 0.5\}$. Different settings of the scale parameter produce different quantization thresholds and clipping ranges for the same input. Clearly, each quantization function can use its scale parameter to adjust the quantization threshold and clip the input differently to represent the input value.
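
As a concrete (and assumed) reading of (9), the sketch below implements the clip-then-round linear quantizer with a uniform distance of $\Delta = 2s/(2^k - 2)$; with $k = 2$ and $s = 0.5$ it reproduces the ternary codebook $\{-0.5, 0, 0.5\}$ used later in this paper. The choice of $\Delta$ is our assumption, not a formula given in the text:

    import numpy as np

    def Q(x, k=2, s=0.5):
        # Assumed reading of Eq. (9): the uniform distance delta discretizes the
        # k-bit linear mapping, round() snaps each value to its nearest level,
        # and clip() bounds the result to [-s, s].
        delta = 2.0 * s / (2 ** k - 2)
        return np.clip(delta * np.round(x / delta), -s, s)

    x = np.array([-0.9, 0.2, 0.1, 0.4, 1.3])
    print(Q(x, k=2, s=0.5))  # [-0.5  0.   0.   0.5  0.5]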

ii Approximate weights

Consider an $L$-layer CNN model. Suppose the learnable weights of each convolutional layer are represented as $W \in \mathbb{R}^{c_{in} \times c_{out} \times w \times h}$, in which $c_{in}$, $c_{out}$, $w$ and $h$ indicate the number of input channels, the number of output channels, the filter width and the filter height, respectively. When 32-bit (full-precision) floating-point arithmetic is used, storing all these weights requires $32 \cdot c_{in} \cdot c_{out} \cdot w \cdot h$ bits of memory per layer.

As claimed above, at each layer our goal is to estimate the real-valued weight filter $W$ using a 2-bit filter $\hat{W}$. Generally, we define a reconstruction error as shown in (10):

$\mathcal{E}(\alpha, \hat{W}) = \big\| W - \alpha \hat{W} \big\|^2$   (10)

where $\alpha$ is a nonnegative scaling parameter. To retain the accuracy of the quantized network, the reconstruction error should be minimized. However, minimizing the reconstruction error directly is NP-hard, so solving it by brute force would be very time-consuming [38]. In order to solve the above problem in reasonable time, we need an efficient estimation algorithm; that is, the goal is to solve the following optimization problem:

$\alpha^*, \hat{W}^* = \operatorname*{arg\,min}_{\alpha, \hat{W}} \big\| W - \alpha \hat{W} \big\|^2$   (11)

in which the norm is defined entry-wise, i.e., $\|T\|^2 = \sum_{i,j,k} T_{i,j,k}^2$ for any three-dimensional tensor $T$.

One way to solve the optimization problem shown in (11) is to expand the cost function and take derivatives w.r.t. $\alpha$ and $\hat{W}$, respectively. However, in this case one must obtain correlated, mutually dependent values of $\alpha$ and $\hat{W}$. To overcome this problem, we use the quantization function (9) to quantize $\hat{W}$ directly:

$\hat{W} = Q(W \mid k, s_w)$   (12)

In this work, we aim to quantize the real-valued weight filter to ternary values $\{-0.5, 0, 0.5\}$, so the parameter $k = 2$, and the threshold of the weights is controlled by the scale parameter $s_w$ as shown in (13),

(13)

where $\sigma$ is a shift threshold parameter which can be used to constrain the thresholds.

With $\hat{W}$ fixed through (12), Equation (11) becomes a linear regression problem:

$\alpha^* = \frac{\operatorname{vec}(W)^{\top} \operatorname{vec}(\hat{W})}{\operatorname{vec}(\hat{W})^{\top} \operatorname{vec}(\hat{W})}$   (14)

We can use the "straight-through (ST) estimator" [39] to back-propagate through $Q(\cdot)$. This is shown in detail in Algorithm 1. Note that at run-time, only the quantized pair $(\alpha^*, \hat{W})$ is required.

Algorithm 1 Training with the "straight-through (ST) estimator" [39]: the forward and backward passes of an approximated convolution.
Require: quantization function $Q(\cdot)$, shift parameter $\sigma$. Assume $\mathcal{L}$ is the loss function, and $I$ and $O$ are the input and output tensors of a convolutional layer, respectively.
A. Forward propagation:
    1. $\hat{W} = Q(W \mid 2, s_w)$,    #Quantization
    2. Solve Eq. (14) for $\alpha^*$,
    3. $O = \mathrm{Conv}(I, \alpha^* \hat{W})$.
B. Back propagation:
    By the chain rule of gradients and the ST estimator we have:
    1. $\frac{\partial \mathcal{L}}{\partial W} = \frac{\partial \mathcal{L}}{\partial \hat{W}}$.
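
A minimal PyTorch sketch of Algorithm 1 (our reconstruction; the concrete quantizer form is the one assumed for (9)): quantize the weights in the forward pass, solve (14) for $\alpha^*$ in closed form, and let the gradients pass straight through to the full-precision weights:

    import torch

    class TernaryQuantizeSTE(torch.autograd.Function):
        """Forward: w -> alpha* * w_hat with w_hat in {-0.5, 0, 0.5}.
        Backward: straight-through estimator (gradient passes unchanged)."""

        @staticmethod
        def forward(ctx, w, s=0.5):
            w_hat = torch.clamp(s * torch.round(w / s), -s, s)  # Eq. (12), assumed form
            # Eq. (14): closed-form least-squares scale alpha* = <w, w_hat> / <w_hat, w_hat>
            alpha = (w * w_hat).sum() / w_hat.pow(2).sum().clamp(min=1e-8)
            return alpha * w_hat

        @staticmethod
        def backward(ctx, grad_out):
            return grad_out, None  # ST estimator: dL/dW approximated by dL/dW_hat

    # Usage: convolve with quantized weights while keeping a full-precision master copy
    w = torch.randn(8, 4, 3, 3, requires_grad=True)
    w_q = TernaryQuantizeSTE.apply(w)
    w_q.sum().backward()   # gradients flow to the real-valued weights w
    print(w.grad.shape)    # torch.Size([8, 4, 3, 3])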

iii Activation quantization

In order to avoid the substantial memory consumption and computational requirements caused by cumbersome floating-point calculations, we should use bitwise operations. Therefore, the activations as well as the weights must be quantized.

If the activations are 1-bit values, we can quantize them after they pass through a function similar to the activation quantization procedure in [33]. Formally, this can be defined as:

(15)

If the activations are represented with $k$ bits, the quantization of the real-valued activations $A$ can be defined as:

$A^q = Q(A \mid k, s_a)$   (16)

In this paper, we constrain the weights to ternary values $\{-0.5, 0, 0.5\}$. In order to transform the real-valued activations into ternary activations, we set the parameter $k = 2$. The scale parameter $s_a$ controls the clip threshold and can be varied throughout the learning process. Note that quantization operations in networks cause the variance of the weights to be scaled relative to the original, which can make the network's outputs explode. XNOR-Net [16] proposes a filter-wise scaling factor, calculated continuously at full precision, to alleviate this amplification effect. In our implementation, we instead control the activation threshold to attenuate the amplification effect by setting the scale parameter as

where the scale parameter depends on a pre-defined constant for each layer and is updated in each epoch from the trained weights of that layer. The forward and backward passes of the activation are shown in detail in Algorithm 2.

Algorithm 2 Training with the "straight-through (ST) estimator" [39]: the forward and backward passes of the activation.
Require: quantization function $Q(\cdot)$, shift parameter $\sigma$; the indicator $1_{|A| \le s_a}$ can be seen as propagating the gradient through $\mathrm{clip}$; $\circ$ indicates the Hadamard product. Assume $\mathcal{L}$ is the loss function.
A. Forward propagation:
    1. $A^q = Q(A \mid 2, s_a)$,    #Quantization
B. Back propagation:
    1. $\frac{\partial \mathcal{L}}{\partial A} = \frac{\partial \mathcal{L}}{\partial A^q} \circ 1_{|A| \le s_a}$,    #using STE
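
A matching sketch of Algorithm 2 (again our reconstruction): the forward pass ternarizes the activations, and the backward pass propagates the gradient only where the input fell inside the clip range, i.e., through the indicator $1_{|A| \le s_a}$:

    import torch

    class TernaryActivationSTE(torch.autograd.Function):
        @staticmethod
        def forward(ctx, a, s=0.5):
            ctx.save_for_backward(a)
            ctx.s = s
            return torch.clamp(s * torch.round(a / s), -s, s)  # assumed form of Eq. (16)

        @staticmethod
        def backward(ctx, grad_out):
            (a,) = ctx.saved_tensors
            # STE with clipping: zero the gradient where |a| exceeded the clip range
            return grad_out * (a.abs() <= ctx.s).to(grad_out.dtype), None

    a = torch.randn(2, 30, 8, 1, requires_grad=True)
    TernaryActivationSTE.apply(a).sum().backward()
    print(a.grad.unique())  # only 0s and 1s: the clip-range gradient mask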

iv Scalability to Multiple Sensors (IMUs)

Each activity in the OPPORTUNITY and PAMAP2 datasets is collected by multiple sensors on different parts of the body, and each sensor is independent. For different types of activities, different sensors may not have the same "contribution". In order to improve the accuracy of our model, we conducted a comprehensive evaluation using the different feature fusion strategies shown in Figure 2. Note that the UniMiB-SHAR dataset only has 3 channels of data (a 3D accelerometer), so we only apply early fusion to it.

Early fusion. The signals from all sensors on the different body parts are stacked as the input of a single network [21, 40].

Late fusion. Independent sensor signal sources are processed by their own sub-networks, and their Conv3 feature maps are concatenated with fusion weights, as in [41]; in this case each element of the fusion weights equals 1.

Dynamic fusion. Different parts of the body (different sensor locations) have different levels of participation in different types of activities. For example, for ankle-hand-based activities (e.g., running and jumping), the "contribution" of the back-based sensor is lower than that of the sensors on the hands and ankles. In the case of hand-based activities (e.g., opening or closing a drawer), the "contribution" of the sensors on the ankles and back is lower than that of the hands, and so on. Therefore, unlike in the late fusion method, the fusion weights of dynamic fusion are not uniform. Formally, the full-precision sub-network Conv3 weights and feature maps are represented as $W$ and $F$, respectively, and each sub-network has a corresponding fusion weight tensor $\pi$. More specifically, the dynamic fusion weights aim to randomly reduce the representations of the signal sources with less "contribution", which can be considered a "dynamic dropout" method, i.e., a dynamic, non-fixed clip parameter. Given quantized weights, each fusion weight independently follows the Bernoulli distribution shown in (17):

$\pi_i = \begin{cases} 1, & \text{with probability } p \\ 0, & \text{with probability } 1 - p \end{cases}$   (17)

where the probability $p$ is given by the scale parameters of the quantization function applied to the corresponding quantized sub-network weights.

Training-time. The full-precision Conv3 sub-network weights are quantized by (9):

$\hat{W}_{\mathrm{conv3}} = Q(W_{\mathrm{conv3}} \mid 2, s_w)$   (18)

According to (17), the generated fusion weight shown in (19) is given by:

(19)

Assumption. Suppose some of the sub-networks (e.g., the back- and ankle-based ones) are the less "contributing" ones; the feature maps after the dynamic fusion strategy can then be expressed as:

$F_{\mathrm{fusion}} = \big[\, F_1,\; \pi_2 \circ F_2,\; \pi_3 \circ F_3 \,\big]$   (20)

where $\circ$ denotes the Hadamard product. An example of this process is shown in Figure 3.

Run-time. To match train-time, only (19) and (20) are used.
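
The sketch below illustrates the dynamic fusion step under our reading of (17)-(20): the feature map of each less-contributing sub-network is masked element-wise by Bernoulli fusion weights before concatenation. The keep-probabilities here are placeholders standing in for the scale-parameter-derived probabilities of the paper:

    import torch

    def dynamic_fusion(feat_hand, feat_back, feat_ankle, p_back=0.5, p_ankle=0.5):
        """Concatenate sub-network Conv3 feature maps; less-contributing sources
        are thinned by 0/1 Bernoulli fusion weights (our reading of (17)-(20))."""
        pi_back = torch.bernoulli(torch.full_like(feat_back, p_back))    # Eq. (19)
        pi_ankle = torch.bernoulli(torch.full_like(feat_ankle, p_ankle))
        # Eq. (20): Hadamard-mask the weaker sources, keep the dominant one intact
        return torch.cat([feat_hand, pi_back * feat_back, pi_ankle * feat_ankle], dim=1)

    # Example: gestures are hand-dominated, so back/ankle representations are thinned
    f_hand, f_back, f_ankle = (torch.randn(32, 30, 7, 1) for _ in range(3))
    print(dynamic_fusion(f_hand, f_back, f_ankle).shape)  # torch.Size([32, 90, 7, 1])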

Stochastic rounding, instead of deterministic rounding, is also chosen by TernGrad [36] and QSGD [42]. Some researchers (e.g., [15, 43]) have proved that stochastic rounding has an unbiased expectation, and it has been used successfully in low-precision training.


Figure 3: An example of dynamic fusion processing at train-time. (a) The sub-network feature maps. (b) The full-precision weights. (c) The quantized weights. (d) The fusion weights. (e) The feature maps after fusion. The map from (b) to (c) is the quantization function, which quantizes the full-precision weights to 2-bit weights by (18). The map from (c) to (d) is the Bernoulli distribution (17), which stochastically samples each fusion weight to either 0 or 1; $\circ$ is the Hadamard product.

v Error and Complexity Analysis

Reconstruction Error. In (10) and (11), we defined the reconstruction error $\mathcal{E}$. In this section, we analyze the bound satisfied by $\mathcal{E}$.

Theorem 1 (Reconstruction Error Bound). The reconstruction error $\mathcal{E}$ is bounded as

$\mathcal{E}_t \le \|W\|^2 \, \lambda^t$   (21)

where $\lambda = 1 - \frac{1}{n}$ and $n$ denotes the number of elements in $W$.

Proof. We define the residue $r_t$, which indicates the approximation error remaining after combining all the previous tensors, as

(22)

Through derivative calculations, (10) is equivalent to

(23)

From the optimality condition, we can obtain

(24)

in which each term is an entry of the residue. According to (22) and (24), we have

(25)

in which $t$ varies from 0 to $m$. ∎

We can see from Theorem 1 that the reconstruction error decays exponentially at rate $\lambda$. This means that, given a small size, i.e., when $n$ is small, the reconstruction achieved by the algorithm can be quite good.

Efficient Operations. Both modern CPUs and SoCs contain instructions that efficiently compute over 64-bit strings in a few cycles [44], whereas floating-point calculations require very complex logic. Calculation efficiency can therefore be improved by tens of times by adopting bit-count operations in place of 64-bit floating-point addition and multiplication.

In classic deep learning architectures, floating-point multiplication is the most time-consuming part. However, when the weights and activations are ternary values, floating-point calculations can be avoided. In order to efficiently reduce the computational complexity and time consumption, we design a new operation that replaces the full-precision multiply-accumulate between the input tensor and the filter. Previous works [15, 16] on 1-bit networks have been successfully implemented using Hamming-space calculations (bit-counting), which can compute matrix multiplications and inner products, as a replacement for matrix multiplication. For example, for binary vectors $x, w \in \{-1, +1\}^N$, the matrix multiplication can be replaced by (26):

$x \cdot w = N - 2 \times \mathrm{bitcount}\big(\mathrm{xor}(x, w)\big)$   (26)

where $\mathrm{bitcount}(\cdot)$ counts the set bits in the rows of $x$ and $w$, and $\mathrm{xor}(\cdot)$ is an exclusive-OR operator.

In this paper, we aim to extend this concept to 2-bit networks. The quantized input tensor and filter take values in $\{-0.5, 0, +0.5\}$, so each of them can be encoded by two binary tensors: one storing the signs and one storing a mask of the nonzero entries. Given a fixed scale parameter, this encoding is fixed as well.

In this work, our goal is to replace matrix multiplication with bit-counting in 2-bit convolutional networks. Using the sign and mask encodings of the input tensor and filter, the inner product can be calculated with two bit-counts in Hamming space:

$x \cdot w = s^2 \Big[ \mathrm{bitcount}\big(\mathrm{xnor}(x_s, w_s) \wedge m\big) - \mathrm{bitcount}\big(\mathrm{xor}(x_s, w_s) \wedge m\big) \Big], \qquad m = x_m \wedge w_m$   (27)

where $\mathrm{xnor}(\cdot)$ denotes the negated XOR and $\wedge$ is an AND operator. Note that for other bit widths, the behavior of the element-wise operator must be customized.
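
To check the two-bit-count identity in (27), here is a small self-contained sketch using a (sign, mask) bit encoding of ternary values in {-0.5, 0, +0.5}; the encoding itself is our assumption, chosen so that XNOR/AND plus two popcounts reproduces the floating-point inner product:

    def encode(v):
        """Pack a ternary vector (values in {-s, 0, +s}) into sign and mask bit strings."""
        sign = mask = 0
        for i, x in enumerate(v):
            if x != 0:
                mask |= 1 << i         # nonzero positions
                if x > 0:
                    sign |= 1 << i     # positive positions
        return sign, mask

    def ternary_dot(a, b, s=0.5):
        (sa, ma), (sb, mb) = encode(a), encode(b)
        m = ma & mb                                # both operands nonzero
        xnor = ~(sa ^ sb) & ((1 << len(a)) - 1)
        agree = bin(xnor & m).count("1")           # first bit-count
        disagree = bin((sa ^ sb) & m).count("1")   # second bit-count
        return s * s * (agree - disagree)          # Eq. (27), assumed encoding

    a = [0.5, 0.0, -0.5, 0.5]
    b = [0.5, -0.5, 0.5, 0.5]
    assert ternary_dot(a, b) == sum(x * y for x, y in zip(a, b))  # both 0.25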

Batch Normalization. In previous works, weights are quantized to binary values using a sign function [15] or to ternary values using a positive threshold parameter [32] at train-time. However, neural networks with quantized weights consistently fail to converge without batch normalization, because the quantized values are a rather coarse discretization of the full-precision values. Batch Normalization [45] efficiently avoids the exploding and vanishing gradient problems. In this part, we briefly discuss whether the batch normalization operation adds extra computational cost. Simply put, batch normalization is an affine function:

$\mathrm{BN}(x) = \gamma \cdot \frac{x - \mu}{\sigma} + \beta$   (28)

where $\mu$ and $\sigma$ here are the mean and standard deviation, respectively, and $\gamma$ and $\beta$ are the scale and shift parameters, respectively. More specifically, a batch normalization can be quantized to 2-bit values by the following quantization method:

(29)

Equation (29) can be converted to the following:

(30)

Therefore, batch normalization can be accomplished at no extra cost.
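
A small numerical sketch of why (28)-(30) imply no extra run-time cost: since batch normalization is a monotonic affine map (for $\gamma > 0$), quantizing BN(x) is equivalent to comparing x against two pre-shifted thresholds, so the BN arithmetic can be folded away offline. The folding below is our illustration, not the paper's exact derivation:

    import numpy as np

    def Q(y, s=0.5):
        # assumed ternary quantizer from Eq. (9): levels {-s, 0, +s}
        return np.clip(s * np.round(y / s), -s, s)

    # BN (Eq. (28)) is affine: bn(x) = a * x + b, a = gamma / sigma, b = beta - gamma * mu / sigma
    mu, sigma, gamma, beta = 0.1, 0.8, 1.2, -0.05
    a, b = gamma / sigma, beta - gamma * mu / sigma

    # Folding: for a > 0, Q(a * x + b) is a 3-way comparison of x against two
    # pre-shifted thresholds, so no BN arithmetic is needed at run-time.
    t_lo, t_hi = (-0.25 - b) / a, (0.25 - b) / a
    x = np.random.default_rng(0).standard_normal(10_000)
    folded = np.where(x < t_lo, -0.5, np.where(x > t_hi, 0.5, 0.0))
    assert np.allclose(folded, Q(a * x + b))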

IV Experiments

Our experiments aim to demonstrate the usefulness of the proposed quantization method and fusion strategies in convolutional neural networks for high-precision human activity recognition on portable devices, and to show that extending our model and training strategies to complex combined activity recognition is straightforward, thereby providing a better game experience for virtual-realistic interactive games on VR/AR and other portable devices. The memory requirements and the quantized weights of each layer are also analyzed in detail. Complex naturalistic activities involve several parts of the body, and some activities exhibit weak contrast, which makes recognition very difficult. Therefore, networks with good generalization ability that robustly fuse the data features from sensors on different body parts are necessary; at the same time, an automatic method should sketch the features of an activity and accurately recognize it.

The primary parameter of any experimental setup is the choice of datasets. To choose the optimal datasets for this study, we considered the complexity and richness of the datasets. Based on the background of our research, we selected the OPPORTUNITY [46], PAMAP2 [47] and UniMiB-SHAR [48] benchmark datasets for our experiments.

i Data Description and Performance Measure

i.1 Opportunity

The OPPORTUNITY public dataset has been used in many open activity recognition challenges. It contains four subjects performing 17 different (morning) Activities of Daily Living (ADLs) in a sensor-rich environment, as listed in Table 1. The data were acquired at a sampling frequency of 30 Hz from 7 wireless body-worn inertial measurement units (IMUs), each consisting of a 3D accelerometer, a 3D gyroscope and a 3D magnetic sensor, as well as from 12 additional 3D accelerometers placed on the back, arms, ankles and hips, accounting for a total of 145 different sensor channels. During the data collection process, each subject performed 5 ADL sessions and 1 drill session. During each ADL session, subjects were asked to perform the activities naturally (these sessions are named "ADL1" to "ADL5"). During the drill session, subjects performed 20 repetitions of each of the 17 ADLs of the dataset. The dataset contains about 6 hours of recordings in total, and the data are labeled at the timestamp level. In our experiments, the training and testing sets have 63 dimensions (36-D on the hands, 9-D on the back and 18-D on the ankles, respectively).

In this paper, the models were trained on the data of ADL1, ADL2 and ADL3 plus the drill session, and tested on the data of ADL4 and ADL5.

i.2 Pamap2

The PAMAP2 dataset contains recordings from 9 subjects who carried out 12 activities, including household activities and a variety of exercise activities, as shown in Table 2. An IMU and a heart-rate monitor are worn on the hand, chest and ankle, respectively, and are sampled at a constant rate of 100 Hz (following [22], the PAMAP2 dataset is downsampled in order to have a temporal resolution comparable to the OPPORTUNITY dataset). The accelerometers, gyroscopes, magnetometers, temperature sensor and heart-rate monitor provide 40 sensor channels, recorded over 10 hours in total. In our experiments, the training and testing sets have 36 dimensions (12-D on the hand, 12-D on the back and 12-D on the ankle, respectively).

In this paper, data from subjects 5 and 6 are used as the testing set; the remaining data are used for training.

i.3 UniMiB-SHAR

The UniMiB-SHAR dataset collects data from 30 healthy subjects (6 male and 24 female), acquired using the 3D accelerometer of a Samsung Galaxy Nexus I9250 with Android OS version 5.1.1. The data are sampled at a constant rate of 50 Hz and split into 17 different activity classes: 9 safety activities and 8 dangerous activities (falls), as shown in Table 3. Unlike the OPPORTUNITY dataset, this dataset does not have any NULL class and remains relatively balanced. In our experiments, the training and testing sets have 3 dimensions.

i.4 Performance Measure

ADL datasets, like the OPPORTUNITY dataset, are often highly unbalanced. For such datasets, the overall classification accuracy is not an appropriate measure of performance, because the recognition rate of the majority classes might skew the performance statistics to the detriment of the least-represented classes. As a result, many previous works such as [21] use an evaluation metric independent of the class repartition: the F1-score. The F1-score combines two measures, precision and recall: precision is the number of correct positives divided by the number of all positives returned by the classifier, and recall is the number of correct positives divided by the number of all positive samples. The F1-score is the harmonic mean of precision and recall, with a best value of 1 and a worst value of 0. In this paper, we use an additional evaluation metric to make comparison with previous work easier: the weighted F1-score (the sum of class F1-scores, weighted by the class proportion):

$F_w = \sum_{c} 2 \cdot \frac{N_c}{N_{\mathrm{total}}} \cdot \frac{\mathrm{precision}_c \times \mathrm{recall}_c}{\mathrm{precision}_c + \mathrm{recall}_c}$   (31)

where $N_c$ is the number of samples in class $c$, and $N_{\mathrm{total}}$ is the total number of samples.
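
For reference, the weighted F1-score in (31) coincides with scikit-learn's average='weighted' option, as the short sketch below shows:

    from sklearn.metrics import f1_score

    y_true = [0, 0, 0, 1, 1, 2]   # class proportions: 3/6, 2/6, 1/6
    y_pred = [0, 0, 1, 1, 1, 2]

    # Eq. (31): sum of per-class F1 scores weighted by class proportion N_c / N_total
    print(f1_score(y_true, y_pred, average="weighted"))  # 0.8333...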

Class Proportion Class Proportion
Open Door 1/2 1.87%/1.26% Open Fridge 1.60%
Close Door 1/2 6.15%/1.54% Close Fridge 0.79%
Open Dishwasher 1.85% Close Dishwasher 1.32%
Open Drawer 1/2/3 1.09%/1.64%/0.94% Clean Table 1.23%
Close Drawer 1/2/3 0.87%/1.69%/2.04% Drink from Cup 1.07%
Toggle Switch 0.78% NULL 72.28%
Table 1: Classes and proportions of the OPPORTUNITY dataset
Class Proportion Class Proportion
Lying 6.00% Sitting 5.78%
Standing 5.92% Walking 7.45%
Running 3.06% Cycling 5.13%
Nordic walking 5.87% Ascending stairs 3.66%
Descending stairs 3.27% Vacuum cleaning 5.47%
Ironing 7.44% House cleaning 5.84%
Null 35.12%
Table 2: Classes and proportions of the PAMAP2 dataset
Class Proportion Class Proportion
StandingUpfromSitting 1.30% Walking 14.77%
StandingUpfromLaying 1.83% Running 16.86%
LyingDownfromStanding 2.51% Going Up 7.82%
Jumping 6.34% Going Down 11.25%
F(alling) Forward 4.49% F and Hitting Obstacle 5.62%
F Backward 4.47% Syncope 4.36%
F Right 4.34% F with ProStrategies 4.11%
F Backward SittingChair 3.69% F Left 4.54%
Sitting Down 1.70%
Table 3: Classes and proportions of the UniMiB-SHAR dataset

ii Experimental Setup

Sliding Window. Our selected data are recorded continuously. We can think of the continuous HAR data as analogous to video: we use a sliding time-window of fixed length to segment the data, and each segment can be viewed as a frame of the video (a picture). We define $T$, $C$ and $\delta$ as the length of the time-window, the number of sensor channels and the sliding stride, respectively. Through this segmentation, each "picture" consists of a $T \times C$ matrix. We set the segmentation parameters following [40]: a time-window of 2 s on the OPPORTUNITY and PAMAP2 datasets, resulting in $T = 64$, and a time-window of 2 s on the UniMiB-SHAR dataset, resulting in $T = 96$. Due to the timestamp-level labeling, each segment can contain multiple labels; we choose the majority label, i.e., the one that appears most frequently among its timestamps.
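
A short sketch of this sliding-window segmentation with majority labeling (the window length and stride below are illustrative placeholders):

    import numpy as np

    def sliding_windows(data, labels, T=64, stride=32):
        """Segment a (time, channels) stream into (T, channels) 'pictures';
        each window takes the majority label of its timestamps."""
        segments, seg_labels = [], []
        for start in range(0, len(data) - T + 1, stride):
            segments.append(data[start:start + T])
            window_labels = labels[start:start + T]
            seg_labels.append(np.bincount(window_labels).argmax())  # most frequent label
        return np.stack(segments), np.array(seg_labels)

    data = np.random.randn(1000, 63)            # e.g., the 63 OPPORTUNITY channels
    labels = np.random.randint(0, 18, size=1000)
    X, y = sliding_windows(data, labels)
    print(X.shape, y.shape)                     # (30, 64, 63) (30,)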

Dynamic Fusion Weights. Our selected datasets (OPPORTUNITY and PAMAP2) include two families of human activities: periodic activities (the locomotion task of the OPPORTUNITY dataset and all of PAMAP2) and sporadic activities (the gestures task of the OPPORTUNITY dataset). To design the dynamic fusion strategies for these two families, we design two groups of feature maps after dynamic fusion. For periodic activities, we take into account the fact that the back-based sensors have less "contribution"; for sporadic activities, we consider that both the back-based and ankle-based sensors have less "contribution". Formally, at training time and run-time, according to (17), (18) and (19), the feature maps after the dynamic fusion strategies can be expressed as:

(32)

Pooling Layer. The role of common pooling layers is to take the maximum (max-pooling) or the average (avg-pooling) of the output of each filter. Our experiments do not use avg-pooling, because the averaging operation generates values outside the ternary set. We also observe that applying max-pooling to ternary inputs skews the distribution toward the largest value, which results in a noticeable drop in recognition accuracy. Therefore, we put the max-pooling layer before the batch normalization (BN) and activation (A).

iii Baseline Model

The aim of this paper is not necessarily to exceed current state-of-the-art accuracies, but rather to demonstrate and analyze the impact of network quantization and fusion strategies. Therefore, the benchmark model we use should not be very complex, because increasing the network topology and computational complexity to improve model performance runs counter to the aim of deploying advanced networks on portable devices. In this paper, we consider improving the performance of the model through a training strategy that is more in line with practical applications. We therefore chose the CNN architecture of [40] as the baseline model. It contains three convolutional blocks, a dense layer and a softmax layer. Each convolutional kernel performs a 1D convolution on each sensor channel independently over the time dimension. To fairly evaluate the computational cost and memory usage of quantization on the CNNs, we employ the same number of channels and convolution filters for all compared models. Layer-wise details are shown in Table 4, in which "Conv2" is the most computationally expensive layer and "Fc" requires the most memory. For example, using floating-point precision on the OPPORTUNITY dataset, the entire model requires approximately 82 MFLOPs (FLOPs here consist of equal numbers of FMULs and FADDs) and approximately 2 million weights, thus 0.38 MBytes of storage for the model weights. At train-time the model requires more than 12 GBytes of memory (batch size of 1024); for inference at run-time this is reduced to approximately 1.8 GBytes.

OPPORTUNITY PAMAP2
Layer Name Params (b) FLOPs Params (b) FLOPs
Conv1 0.6k 4.84M 0.6k 2.76M
Conv2 20k 68.18M 20k 38.96M
Conv3 7.2k 5.47M 7.2k 3.12M
Fc 1.89M 3.78M 1.89M 2.16M
UniMiB-SHAR
Layer Name Params (b) FLOPs
Conv1 0.6k 0.23M
Conv2 20k 3.25M
Conv3 7.2k 0.26M
Fc 1.89M 0.18M
Table 4: Details of the learnable layers in our experimental model.

iv Implementation Details

In this section, we provide the implementation details of the convolutional neural network architecture. Our method is implemented in PyTorch. The model is trained with a mini-batch size of 1024 for 50 epochs, using AdaDelta with its default initial learning rate [49]. A softmax function is used to normalize the output of the model. The probability that a sequence belongs to the $c$-th class is given by (33):

$P(c \mid z) = \frac{e^{z_c}}{\sum_{k=1}^{C} e^{z_k}}$   (33)

where $z$ is the output of the model and $C$ is the number of activities.

Algorithm 3 Training an $L$-layer DFTerNet. $\mathcal{L}$ is the loss function for the minibatch, the indicator $1_{|\cdot| \le s}$ can be seen as propagating the gradient through $\mathrm{clip}$, and $\lambda$ is the learning rate decay factor. $\circ$ indicates the Hadamard product. BatchNorm() specifies how to batch-normalize the output of a convolution. BackBatchNorm() specifies how to backpropagate through the normalization [45]. Update() specifies how to update the parameters when their gradients are known, e.g., with AdaDelta [49].
Require: a minibatch of inputs and targets $(I, Y)$, previous weights $W$, the weight bit width, the activation bit width, the shift threshold parameter $\sigma$ and the learning rate $\eta$.
Ensure: updated weights $W$.
  1. Computing the parameter gradients:
  1.1. Forward propagation:
    for $l = 1$ to $L$ do
      quantize the weights $\hat{W}_l$ with (12)
      compute $\alpha_l^*$ with (14)
      convolve the layer input with $\alpha_l^* \hat{W}_l$
      apply max-pooling
      batch-normalize the result with BatchNorm()
      if $l < L$ then
        quantize the activations with (16)
  1.2. Backward propagation:
  {note that the gradients are full-precision.}
  Compute the output gradient from the network output and the targets $Y$
  for $l = L$ to 1 do
    if $l < L$ then
      back-propagate through the activation quantizer by Algorithm 2
    end if
    back-propagate through the normalization with BackBatchNorm()
    propagate the gradients to the layer inputs
    propagate the gradients to the layer weights
  end for
  2. Accumulating the parameter gradients:
  for $l = 1$ to $L$ do
    with the layer gradient known, compute the full-precision weight gradient by Algorithm 1
    $W_l \leftarrow$ Update($W_l$, gradient, $\eta$)
    $\eta \leftarrow \lambda \eta$
  end for

Experiments were carried out on a platform with two Intel E5-2600 CPUs, 128 GB of RAM and an NVIDIA TITAN Xp 12 GB GPU. The hyper-parameters of the model are provided in Figure 2 (the early fusion network is a common convolutional neural network architecture and can be regarded as a sub-network; therefore, its hyper-parameters equal those of any sub-network of the late fusion or dynamic fusion models). The training procedure of DFTerNet is summarized in Algorithm 3.

V Result and Discussion

In this section, the proposed quantization method and fusion strategies are evaluated on three well-known benchmark datasets. We consider: 1) comparing the proposed dynamic fusion models with other baseline models, 2) evaluating the effect of the weight shift threshold parameter $\sigma$, and 3) the trade-off between quantization and model accuracy. In the first method (which we call the Baseline or TerNet method), the required sensor signal sources are stacked together. In the second method (referred to as FTerNet), the different sensor signal sources are processed through their own sub-networks and fused together with the learned representations before the dense layer, i.e., each element of the fusion weights is equal to 1. The model proposed in this paper (DFTerNet) differs from the second method in the way it handles the fusion part: in DFTerNet, each element of the fusion weights is sampled from a Bernoulli distribution given by the scale parameter of our proposed quantization method.

i Multi-sensor Signals Fusion

In order to evaluate the different fusion strategies described in Section iv, an ablation study was performed on the OPPORTUNITY and PAMAP2 datasets. The first set of experiments compares the three fusion strategies on each dataset. As shown by the bold scores in Table 6, the order of fusion performance is: matched dynamic fusion first, followed by late fusion and finally early fusion. The reason could be that it is better for each sensor signal source to have its own network, and it is improper to apply a single network to unify all signal sources. Meanwhile, there is a correlation between the different signal sources and the activity types, so the recognition result should be more reliable when the signal sources are highly correlated with the activity type. According to these two points, the recognition result should be weighted by the learned representations of the multiple signal sources, and the weight of the learned representation of each signal source should reflect the similarity between the signal source and the activity type.

σ 2.7 2.8 2.9 3.0
DFTerNet (locomotion) 0.884 0.897 0.910 0.909 0.893
DFTerNet (gestures) 0.879 0.894 0.905 0.905 0.891
Table 5: Comparison of the shift threshold parameter σ's value for activity recognition with DFTerNet.

ii Analysis of Weight Shift Threshold Parameter

In our DFTerNet, the weight shift threshold parameter $\sigma$ directly affects the scale parameter and hence the resulting fusion weights. Therefore, the second set of experiments considers the effect of $\sigma$'s value. As mentioned in the previous section, the scale parameter is related to $\sigma$, and the fusion weights are sampled from the corresponding Bernoulli distribution. We use matched dynamic fusion on the locomotion and gestures tasks of the OPPORTUNITY dataset as test cases to compare performance across $\sigma$ values. In this experiment, the parameter settings are the same as described in Sections ii and iv. Table 5 summarizes the results of different $\sigma$ values with matched dynamic fusion. It can be seen that the proposed quantization method achieves its best performance in the middle of the tested range of $\sigma$. A similar phenomenon can also be found in the literature, e.g., [12].

Method O (locomotion) O (gestures) P (Activities) U (ADLs and falls)
Early fusion [40] 0.876 ± 0.09 0.881 ± 0.11 0.867 ± 0.09 0.7981 ± 0.12
TerNet (Early fusion) 0.865 ± 0.14 0.876 ± 0.19 0.850 ± 0.11 0.7727 ± 0.20
Late fusion [41] 0.897 ± 0.06 0.917 ± 0.05 0.908 ± 0.05 -
FTerNet (Late fusion) 0.883 ± 0.10 0.908 ± 0.14 0.893 ± 0.13 -
Dynamic fusion (periodic) 0.915 ± 0.04 - 0.914 ± 0.06 -
DFTerNet (periodic) 0.909 ± 0.06 - 0.901 ± 0.11 -
Dynamic fusion (sporadic) - 0.920 ± 0.07 - -
DFTerNet (sporadic) - 0.910 ± 0.10 - -
  • Dynamic fusion (periodic) is matched with the locomotion task of OPPORTUNITY and with PAMAP2.

  • Dynamic fusion (sporadic) is matched with the gestures task.

Memory
TerNet O 39k
P 20k
U 1.8k
FTerNet O 40k
P 24k
U 1.9k
DFTerNet O 34k
P 17k
U 1.8k
FP O 0.38M
P 0.22M
U 17k
Table 6: (a) Weighted F1 performance of the different fusion strategies and our proposed quantization method for activity recognition on the OPPORTUNITY, PAMAP2 and UniMiB-SHAR datasets. (b) The proposed quantization method generates convolutional networks with 2-bit weights and activations for activity recognition, with the ability to make faithful inferences using roughly an order of magnitude fewer parameters than their full-precision (FP) counterparts.

iii Visualization of the Quantized Weights

In addition to analyzing the quantized weights, we further looked inside the learned layers and checked the values. We plot the heatmap of the fraction of zero values produced by DFTerNet on the locomotion task of the OPPORTUNITY dataset across epochs. As shown in Figure 5, the fraction of zero values increases in later epochs; similar phenomena also appear for DFTerNet on the PAMAP2 dataset and on the gestures task of the OPPORTUNITY dataset. Section v proves the reconstruction error bound, which the model can drive to a very small value. Table 4 shows that the layers containing the most free parameters show increased sparsity at the end of training; this indicates that our proposed quantization method can avoid overfitting, with the sparsity acting as a regularizer.


Figure 4: Validation weighted F1-score curves on the three datasets.

iv The Trade-off Between Quantization and Model Accuracy

The third set of experiments explores the model accuracy of the quantization method. As in the first and second sets of experiments, a four-layer convolutional network is used, and the parameter settings for the sliding window and batch size are kept exactly the same. The weight shift threshold parameter is set to the best value found above. Finally, TerNet, FTerNet and DFTerNet, together with their own full-precision counterparts, are generated for comparison. Table 6 shows the weighted F1-score performance of the different full-precision models described in Figure 2 and of their quantized counterparts, as well as the memory usage for the model parameters. It shows that the proposed quantization method results in a very small difference in performance between a 2-bit network and its full-precision counterpart. Figure 4 shows the validation weighted F1-score curves on these datasets: our quantized models (TerNet, FTerNet and DFTerNet) converge almost as fast and as stably as their counterparts. This demonstrates the robustness of the proposed quantization technique.

We also test the efficiency of the Hamming-distance calculation in (27). For example, training a dynamic fusion model on the OPPORTUNITY dataset took about 12 minutes on the NVIDIA TITAN Xp 12 GB GPU test platform. Inference with the full-precision network on a CPU takes about 15 seconds, while we estimate the DFTerNet inference time to be 1.8 seconds on a mobile CPU. This shows that the proposed quantization technique can achieve a roughly 9× speedup.


Figure 5: Visualization of the fraction of zero values at each epoch of DFTerNet on the locomotion task of the OPPORTUNITY dataset.

VI Conclusion and Future Work

In this paper, we presented DFTerNet, a new network quantization method and a novel dynamic fusion strategy, to address the problem of how to better recognize activities from multi-sensor signal sources and deploy the resulting models on portable devices with low computational capability. First, the proposed quantization function $Q(\cdot)$ is used for two operations, weight quantization and activation quantization, by adjusting the scale parameter. Second, the bit-count scheme proposed in this work to replace matrix multiplication is hardware-friendly and realizes a 9× speedup as well as requiring 11× less memory. Third, a novel dynamic fusion strategy is proposed: unlike existing methods, which treat the representations from different sensor signal sources equally, it considers that different sensor signal sources need to be learned separately, and it reduces the representations of the signal sources with less "contribution" through fusion weights sampled from a Bernoulli distribution given by the scale parameter. The experiments performed demonstrate the effectiveness of the proposed quantization method and dynamic fusion strategy. As future work, we plan to extend the quantization method to quantize gradients and errors so that the model can be both trained and run directly on portable devices; since improving model performance requires continuous online learning, separating training from inference would limit this.

References

  • [1] S. Ren, K. He, R. Girshick, and J. Sun, “Faster R-CNN: towards real-time object detection with region proposal networks,” in International Conference on Neural Information Processing Systems (NIPS), Dec. 2015, pp. 91-99.
  • [2] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, and S. Reed, et al., “Going deeper with convolutions,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Jun. 2015, pp. 1-9.
  • [3] Q. Lu, C. Liu, Z. Jiang, A. Men, and B. Yang, “G-CNN: Object Detection via Grid Convolutional Neural Network,” IEEE Access, vol. 5, pp. 24023-24031, 2017.
  • [4] W. Yin, X. Yang, L. Zhang, and E. Oki, “ECG Monitoring System Integrated With IR-UWB Radar Based on CNN,” IEEE Access, vol. 4, pp. 6344-6351, 2016.
  • [5] Y. Shen, T. Han, Q. Yang, Y. Wang, F. Li, and H. Wen, “CS-CNN: Enabling Robust and Efficient Convolutional Neural Networks Inference for Internet-of-Things Applications,” IEEE Access, vol. 6, pp. 13439-13448, 2018.
  • [6] J. B. Yang, M. N. Nguyen, P. P. San, X. L. Li and S. Krishnaswamy, “Deep convolutional neural networks on multichannel time series for human activity recognition,” in International Joint Conference on Artificial Intelligence (IJCAI), Jul. 2015, pp. 3995-4001.
  • [7] Y. Liu, Q. Wu, L. Tang, and H. Shi, “Gaze-Assisted Multi-Stream Deep Neural Network for Action Recognition,” IEEE Access, vol. 5, pp. 19432-19441, 2017.
  • [8] C. Szegedy, S. Ioffe, V. Vanhoucke, and A. Alemi. (Feb. 2016). “Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning.” [Online]. Available: https://arxiv.org/abs/1602.07261
  • [9] K. M. He, X. Y. Zhang, S. Q. Ren, and J. Sun, “Deep Residual Learning for Image Recognition,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Jun. 2016. pp. 770-778.
  • [10] K. Simonyan and A. Zisserman. (Sep. 2014). “Very deep convolutional networks for large-scale image recognition.” [Online]. Available: https://arxiv.org/abs/1409.1556
  • [11] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Advances in neural information processing systems (NIPS), Dec. 2012, pp. 1097-1105.
  • [12] S. K. Esser, P. A. Merolla, J. V. Arthur, A. S. Cassidy, and R. Appuswamy, et al. (Mar. 2016). “Convolutional networks for fast, energy-efficient neuromorphic computing.” [Online]. Available: https://arxiv.org/abs/1603.08270
  • [13] S. Han, H. Mao, and W. J. Dally. (Oct. 2015). “Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding.” [Online]. Available: https://arxiv.org/abs/1510.00149
  • [14] M. Courbariaux, Y. Bengio, and J. P. David, “BinaryConnect: training deep neural networks with binary weights during propagations,” in Advances in Neural Information Processing Systems (NIPS), Dec. 2015, pp. 3123-3131.
  • [15] M. Courbariaux, I. Hubara, D. Soudry, R. El-Yaniv and Y. Bengio. (Feb. 2016). “Binarized Neural Networks: Training Deep Neural Networks with Weights and Activations Constrained to +1 or -1.” [Online]. Available: https://arxiv.org/abs/1602.02830
  • [16] M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi. (Mar. 2016). “XNOR-Net: ImageNet Classification Using Binary Convolutional Neural Networks.” [Online]. Available: https://arxiv.org/abs/1603.05279v2
  • [17] A. Ehliar, “Area efficient floating-point adder and multiplier with IEEE-754 compatible semantics,” in International Conference on Field-Programmable Technology, Dec. 2014, pp. 131-138.
  • [18] M. Horowitz, “Computing’s energy problem (and what we can do about it),” in IEEE Internatinal Solid-State Circuits Conference Digest of Technical Papers, pp. 10-14, 2014.
  • [19] M. D. J. Cook and N. C. Krishnan, “Activity learning: Discovering, recognizing, and predicting human behavior from sensor data,” John Wiley&Sons: Hoboken, NJ, USA, 2015.
  • [20] S. Feldhorst, M. Masoudenijad, M. T. Hompel, and G. A. Fink, “Motion Classification for Analyzing the Order Picking Process using Mobile Sensors,” in Proc. of the International Conference on Pattern Recognition Applications and Methods, SCITEPRESS-Science and Technology Publications, Feb. 2016, pp. 706-713.
  • [21] F. J. Ordóñez and D. Roggen, “Deep Convolutional and LSTM Recurrent Neural Networks for Multimodal Wearable Activity Recognition,” Sensors, vol. 16, no. 1, pp. 115-140, 2016.
  • [22] N. Y. Hammerla, S. Halloran, and T. Ploetz, “Deep, Convolutional, and Recurrent Models for Human Activity Recognition using Wearables,” Journal of Scientific Computing, vol. 61, no. 2, pp. 454-476, 2016.
  • [23] R. Grzeszick, J. M. Lenk, F. M. Rueda, G. A. Fink, S. Feldhorst, and M. Ten Hompel, “Deep Neural Network based Human Activity Recognition for the Order Picking Process,” in Proc. of the International Workshop on Sensor-Based Activity Recognition and Interaction, Sep. 2017, pp. 1-6.
  • [24] C. A. Ronao and S. B. Cho, “Deep Convolutional Neural Networks for Human Activity Recognition with Smartphone Sensors,” in International Conference on Neural Information Processing (NIPS), Dec. 2015, pp. 46-53.
  • [25] M. Z. Uddin, W. Khaksar, and J. Torresen, “Facial Expression Recognition Using Salient Features and Convolutional Neural Network,” IEEE Access, vol. 5, pp. 26146-26161, 2017.
  • [26] J. Li, G. Li, and H. Fan, “Image Dehazing using Residual-based Deep CNN,” IEEE Access, vol. 6, pp. 26831-26842, 2018.
  • [27] Y. Kim. (Aug. 2014). “Convolutional neural networks for sentence classification.” [Online]. Available: https://arxiv.org/abs/1408.5882
  • [28] J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional networks for semantic segmentation,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Jun. 2015, pp. 3431-3440.
  • [29] K. Xu, J. Ba, R. Kiros, K. Cho and Y. Bengio, et al., “Show, attend and tell: Neural image caption generation with visual attention,” in International Conference on Machine Learning (ICML), Jul. 2015, pp.2048-2057.
  • [30] L. Sun, K. Jia, D. Y. Yeung, and B. E. Shi, “Human Action Recognition Using Factorized Spatio-Temporal Convolutional Networks,” in IEEE International Conference on Computer Vision (ICCV), Dec. 2015, pp. 4597-4605.
  • [31] D. J. Toms, “Training binary node feedforward neural networks by back propagation of error,” Electronics Letters, vol. 26, no. 21, pp. 1745-1746, 1990.
  • [32] F. Li, B. Zhang, and B. Liu. (May. 2016). “Ternary weight networks.” [Online]. Available: https://arxiv.org/abs/1605.04711
  • [33] S. Zhou, Y. Wu, Z. Ni, X. Zhou, H. Wen, and Y. Zou. (Jun. 2016). “DoReFa-Net: Training Low Bitwidth Convolutional Neural Networks with Low Bitwidth Gradients,” [Online]. Available: https://arxiv.org/abs/1606.06160
  • [34] I. Hubara, M. Courbariaux, D. Soudry, R. El-Yaniv, and Y. Bengio. (Sep. 2016). “Quantized Neural Networks: Training Neural Networks with Low Precision Weights and Activations.” [Online]. Available: https://arxiv.org/abs/1609.07061
  • [35] A. Zhou, A. Yao, Y. Guo, L. Xu and Y. Chen. (Feb. 2017). “Incremental Network Quantization: Towards Lossless CNNs with Low-Precision Weights.” [Online]. Available: https://arxiv.org/abs/1702.03044
  • [36] W. Wen, C. Xu, F. Yan, C. Wu, and Y. Wang, et al., “TernGrad: Ternary Gradients to Reduce Communication in Distributed Deep Learning,” in Advances in Neural Information Processing Systems (NIPS), Dec. 2017, pp. 1508-1518.
  • [37] Y. Guo, A. Yao, H. Zhao, and Y. Chen, “Network Sketching: Exploiting Binary Structure in Deep CNNs,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Jul. 2017, pp. 4040-4048.
  • [38] G. Davis, S. Mallat, and M. Avellaneda, “Adaptive greedy approximations,” Constructive Approximation, vol. 13, no. 1, pp. 57-98, 1997.
  • [39] Y. Bengio. (Aug. 2013). “Estimating or Propagating Gradients Through Stochastic Neurons.” [Online]. Available: https://arxiv.org/abs/1308.3432
  • [40] F. Li, K. Shirahama, M. A. Nisar, L. Köping, and M. Grzegorzek, “Comparison of Feature Learning Methods for Human Activity Recognition Using Wearable Sensors,” Sensors, vol. 18, no. 2, pp. 679-701, 2018.
  • [41] F. M. Rueda and G. A. Fink. (Feb. 2018). “Learning Attribute Representation for Human Activity Recognition.” [Online]. Available: https://arxiv.org/abs/1802.00761
  • [42] D. Alistarh, D. Grubic, J. Li, R. Tomioka, and M. Vojnovic, “QSGD: Communication-Efficient SGD via Gradient Quantization and Encoding,” in Advances in Neural Information Processing Systems (NIPS), Dec. 2017, pp. 1707-1718.
  • [43] S. Gupta, A. Agrawal, K. Gopalakrishnan, and P. Narayanan, “Deep Learning with Limited Numerical Precision,” in International Conference on Machine Learning (ICML), Jul. 2015, pp. 1737-1746.
  • [44] W. Mula, N. Kurz, and D. Lemire. (Nov. 2016). “Faster Population Counts Using AVX2 Instructions.” [Online]. Available: https://arxiv.org/abs/1611.07612
  • [45] S. Ioffe and C. Szegedy. (Feb. 2015). “Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift.” [Online]. Available: https://arxiv.org/abs/1502.03167
  • [46] R. Chavarriaga, H. Sagha, A. Calatroni, S. T. Digumarti, and D. Roggen, “The Opportunity challenge: A benchmark database for on-body sensor-based activity recognition,” Pattern Recognition Letters, vol. 34, no. 15, pp. 2033-2042, 2013.
  • [47] A. Reiss and D. Stricker, “Introducing a New Benchmarked Dataset for Activity Monitoring,” in The 16th IEEE International Symposium on Wearable Computers (ISWC), Jun. 2012, pp. 108-109.
  • [48] D. Micucci, M. Mobilio, and P. Napoletano. (Nov. 2016). “UniMiB SHAR: a new dataset for human activity recognition using acceleration data from smartphones.” [Online]. Available: https://arxiv.org/abs/1611.07688
  • [49] M. D. Zeiler. (Dec. 2012). “ADADELTA: An Adaptive Learning Rate Method.” [Online]. Available: https://arxiv.org/abs/1212.5701