DFTerNet: Towards 2-bit Dynamic Fusion Networks for Accurate Human Activity Recognition
I Introduction
Artificial Intelligence (AI), as an auxiliary technology in modern games, has played an indispensable role in improving the gaming experience over the last decade. The film Ready Player One vividly shows the world the charm of future virtual games. It demonstrates that one of the core technologies of virtual-realistic interaction is recognizing all kinds of complex activities.
Convolutional neural networks (CNNs) are very powerful and have been used successfully in many neural network models. They have been widely applied in many practical virtual-realistic interactive applications, e.g., object recognition [1, 2, 3], the Internet of Things [4, 5], and human activity recognition (HAR) [6, 7]. Their success has been driven by the recent data explosion as well as the increase in model size. However, their large computational cost limits their practical application in portable devices without high-performance graphics processing units (GPUs), as shown in Figure 1.
With the development of VR/AR technology, sensor-based portable gaming devices that recognize and detect human activity have become appealing to operate, hence it is desirable to deploy advanced, accurate CNNs, e.g., InceptionNets [8], ResNets [9] and VGGNets [10], on smart portable devices. However, the following problems limit the applicability of the above-mentioned idea. Firstly, as the winner of the ILSVRC2015 competition, ResNet-152 [9] is trained with nearly 19.4 million real-valued parameters to classify images, making it resource-intensive in several aspects. It is unable to run on portable devices for real-time applications due to its high CPU/GPU workload and memory usage requirements. A similar phenomenon occurs in other deep networks, such as VGGNet and AlexNet [11]. Secondly, in practical applications, multiple sensors located at different positions on the body are each required to process the signals they collect separately. Depending on the type of activity being performed and on its location, a sensor may contribute more or less to the overall result than the other sensors. Therefore, computational complexity can be decreased and model performance improved by reducing the representations from sensors that have “less contribution” during a particular activity.
Recently, approaches to resolving these storage and computational problems [12, 13] have fallen into three categories: network pruning, low-rank decomposition and network quantization. Among them, network quantization has received more and more research attention. DCNNs with binary weights and activations have been designed [14, 15, 16]. Binary Convolutional Neural Networks (BCNNs), with weights and activations constrained to only two values (e.g., −1, +1), can bring great benefits to specialized machine learning hardware for the following major reasons: (1) the quantized weights and activations reduce memory usage and model size by 32× compared to the full-precision version; (2) if networks are binary, then most multiply-accumulate operations (which require at least hundreds of logic gates) can be replaced by popcount-XNOR operations (which require only a single logic gate), and these are especially well suited to FPGAs and ASICs [17].
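To make the popcount-XNOR replacement concrete, here is a minimal Python sketch (not from the paper) of a {−1, +1} inner product computed with a single XNOR and a popcount over bit-packed vectors; the encoding (bit = 1 for +1) is an illustrative assumption.

```python
def binary_dot(a_bits: int, b_bits: int, n: int) -> int:
    """Inner product of two {-1,+1} vectors of length n, each packed into an
    integer bitmask where bit i = 1 encodes +1 and bit i = 0 encodes -1."""
    matches = ~(a_bits ^ b_bits) & ((1 << n) - 1)  # XNOR, masked to n bits
    pop = bin(matches).count("1")                  # popcount of agreeing positions
    return 2 * pop - n                             # (#matches) - (#mismatches)

# a = [+1,+1,-1,+1] (0b1011), b = [+1,-1,+1,+1] (0b1101) -> dot = 0
```

On FPGAs and ASICs, the XNOR and popcount map to cheap hardware primitives, which is the source of the speedup claimed above.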
However, quantization usually causes severe degradation of prediction accuracy. The reported accuracy of the resulting models is unsatisfactory on complex tasks (e.g., the ImageNet dataset). More concretely, Reference [16] shows that binarizing the weights causes the accuracy of ResNet-18 to drop by about 9% (GoogLeNet drops by about 6%) on the ImageNet dataset. It is obvious that there is a considerable gap between the accuracy of a quantized model and that of the full-precision model. In light of these considerations, this paper proposes a novel quantization method and dynamic fusion strategy to enable the deployment of an advanced, high-precision, computationally low-cost neural network model on portable devices. The main contributions of this paper are summarized as follows:

We propose a quantization function with an elastic scale parameter to quantize the entire full-precision convolutional neural network. The quantizations of weights, activations and fusion weights are all derived from this quantization function with different scale parameters. We quantize the weights and activations to 2-bit (ternary) values, and use a masked Hamming distance instead of floating-point matrix multiplication. This setup is able to achieve an accuracy score close to the full-precision counterpart, using about 11× less memory and achieving a 9× speedup.

We introduce a dynamic fusion strategy for multi-sensor activity recognition. For sensors whose “contribution” (sub-network) is less than the others, we randomly reduce their representations through fusion weights, which are sampled from a Bernoulli distribution parameterized by the scale parameter from the quantization method. Experimental results show that by adopting the dynamic fusion strategy, we achieve higher accuracy and lower memory usage than the baseline model.
Ideally, combining the quantized weight and activation modules with the fusion strategy will result in better accuracy, eventually exceeding the full-precision baselines. The training strategy can reduce the amount of computing power required and the energy consumption, thereby realizing the main objective of designing a system that can be deployed on portable devices. More importantly, adopting dynamic fusion strategies for different types of activities is more in line with real situations. This was verified by using both the quantization method and the fusion strategy on the OPPORTUNITY and PAMAP2 datasets; only the quantization method was applied on the UniMiB-SHAR dataset. This is the first time that both quantization and a dynamic fusion strategy have been adopted in convolutional networks to achieve high prediction accuracy on complex human activity recognition tasks.
The remainder of this paper is structured as follows. In Section II, we briefly introduce related work on human activity recognition and quantization methods for convolutional neural networks. In Section III, we highlight the motivation of our method and provide some theoretical analyses of its implementation. In Section IV, we introduce our experiments. Section V experimentally demonstrates the efficacy of our method, and Section VI draws conclusions and discusses future work.
II Related Work
i Convolutional Neural Networks for Human Activity Recognition
Several advanced approaches have been evaluated on Human Activity Recognition (HAR) in the last few years. Accuracies for HAR without deep learning methods are often relatively low. For example, hand-crafted feature methods [19] use simple statistical values (e.g., std, avg, mean, max, min, median) or frequency-domain correlation features based on the signal's Fourier transform to analyze human activity time series. Due to their simple setup and low computational cost, they are still used in some areas, but their accuracy cannot satisfy the demands of modern AI games. Reference [20] adopts SVMs, Random Forests, Dynamic Time Warping or HMMs for predicting action classes; these methods work well when data are scarce and highly unbalanced. However, when faced with recognizing complex high-level behaviors, identifying the relevant features through these traditional approaches is time-consuming [21].
Recently, many studies have adopted Convolutional Neural Networks (CNNs) to build HAR systems, such as [6, 21, 22, 23, 24]. Convolutional Neural Networks are based on the discovery of visual cortical cells and retain the spatial information of the data through receptive fields. It is known that the power of CNNs stems in large part from their ability to exploit symmetries through a combination of weight sharing and translation equivariance. Also, owing to their ability to act as feature extractors, a plurality of convolution operators are stacked to create a hierarchy of progressively more abstract features. Apart from image recognition [11, 25, 26], NLP [27, 28] and video recognition [29], a growing body of literature has used CNNs in recent years to learn sensor-based data representations for human activity recognition (HAR) and has achieved remarkable performance [6, 30]. The model of [24] consists of two or three temporal-convolution layers with the ReLU activation function, followed by a max-pooling layer and a softmax classifier, which can be applied over all sensors simultaneously. Reference [6] introduces four temporal-convolutional layers on a single sensor, followed by a fully-connected layer and a softmax classifier. Research shows that deeper networks can find correlations between different sensors.
Like the works discussed above, we adopt convolutional neural networks to learn representations from wearable multi-sensor signal sources. However, these advanced high-precision models are difficult to deploy on portable devices due to their computational complexity and energy consumption. Fortunately, quantization of convolutional neural networks has become a hot research topic; it aims to reduce memory usage and computational complexity while maintaining acceptable accuracy.
ii Quantization Model of Convolutional Neural Networks
The convolutional binary neural network is not a new topic. Inspired by neuroscience, the unit step function has been used as an activation function in artificial neural networks [31]. The binary activation mode can use spiking responses for computing and communication, which is an energy-efficient method because it only consumes energy when necessary [12].
Recently, Binarized Neural Networks (BNNs) [15] successfully quantized the weights and activations of each layer to binary values. The authors proposed two binarization functions: the first is deterministic, as shown in (1), and the second is stochastic, as shown in (2), where x^b is the binarized variable, x is the full-precision variable, and σ(x) is the “hard sigmoid” function.
x^b = \operatorname{sign}(x) = \begin{cases} +1, & \text{if } x \ge 0, \\ -1, & \text{otherwise} \end{cases}  (1)
x^b = \begin{cases} +1, & \text{with probability } p = \sigma(x), \\ -1, & \text{with probability } 1 - p, \end{cases} \qquad \sigma(x) = \operatorname{clip}\!\left(\frac{x+1}{2},\, 0,\, 1\right)  (2)
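A small Python sketch of the two BNN binarization rules just described (deterministic sign, and stochastic binarization via the “hard sigmoid”); the function names are mine, the formulas follow [15].

```python
import random

def hard_sigmoid(x):
    """sigma(x) = clip((x + 1) / 2, 0, 1)."""
    return max(0.0, min(1.0, (x + 1) / 2))

def binarize_det(x):
    """Deterministic binarization: x^b = sign(x)."""
    return 1.0 if x >= 0 else -1.0

def binarize_stoch(x, rng=random):
    """Stochastic binarization: +1 with probability hard_sigmoid(x), else -1."""
    return 1.0 if rng.random() < hard_sigmoid(x) else -1.0
```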
TWN [32] constrains the weights to ternary values {−1, 0, +1} by referencing symmetric thresholds. In each layer, the quantization of TWN is shown in (3), where Δ is a positive threshold parameter. The authors claim a trade-off between model complexity and generalization.
W^t = \begin{cases} +1, & W > \Delta, \\ 0, & |W| \le \Delta, \\ -1, & W < -\Delta \end{cases}  (3)
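As an illustration of the TWN rule, the following sketch ternarizes a weight vector with a symmetric threshold; the 0.7·E(|W|) threshold heuristic is the one commonly used with TWN, and the per-layer scaling factor is omitted for brevity.

```python
def ternarize(w, delta):
    """TWN-style ternarization: +1 above delta, -1 below -delta, else 0."""
    return [1 if x > delta else (-1 if x < -delta else 0) for x in w]

w = [0.9, -0.05, -0.8, 0.2]
delta = 0.7 * sum(abs(x) for x in w) / len(w)  # threshold from the mean |w|
# ternarize(w, delta) -> [1, 0, -1, 0]
```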
DoReFa-Net [33] is derived from AlexNet; it has 1-bit weights, 2-bit activations and 6-bit gradients, and achieves 46.1% top-1 accuracy on the ImageNet validation set. DoReFa-Net adopts the method shown in (4), where w and w^q are the full-precision (original) and quantized weights, respectively, and E(|w|) is the mean of the absolute values of the weights.
w^q = \operatorname{sign}(w) \times \mathrm{E}(|w|)  (4)
iii Quantization Method for Convolutional Neural Networks
The idea of quantizing weights and activations was first proposed by [15]. That research made the following two contributions: 1) the costly arithmetic operations between weights and activations in a full-precision network can be replaced with cheap bit-count and XNOR operations, which can result in significant speed improvements; compared with the full-precision counterpart, 1-bit quantization also reduces memory usage by a factor of 32; and 2) on some visual classification tasks, 1-bit quantization achieves fairly good performance.
Some researchers [16, 34] introduce simple, high-performance and accurate approximations to convolutional neural networks by quantizing the weights with a uniform quantization method, which first scales the values into the range [0, 1] and then adopts the k-bit quantization shown in (5), where round(·) approximates continuous values by their nearest discrete states. The benefit of this quantization method is that, when calculating the inner product of two quantized vectors, costly arithmetic calculations can be replaced by cheap operations (e.g., bit shifts and count operations). In addition, this quantization method is rule-based and thus easy to implement.
q(x) = \frac{\operatorname{round}\big((2^k - 1)\, x\big)}{2^k - 1}  (5)
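The rule in (5) can be sketched in a few lines; this is a generic k-bit uniform quantizer for values already scaled into [0, 1], not the authors' code.

```python
def quantize_k(x: float, k: int) -> float:
    """Round x in [0, 1] to the nearest of 2**k evenly spaced levels."""
    n = (1 << k) - 1          # number of quantization steps, 2**k - 1
    return round(x * n) / n   # snap to the nearest discrete state
```

For example, with k = 2 the representable levels are 0, 1/3, 2/3 and 1, so inner products reduce to small integer arithmetic plus a final rescale.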
Reference [35] proposes a network compression method called INQ. After obtaining a network through training, the parameters (full-precision parameters) of each layer are first divided into two groups. The parameters in the first group are directly quantized and fixed, while the other group of parameters is retrained to compensate for the loss of accuracy caused by quantization. The above process iterates until all parameters are quantized. With incremental quantization, using weights with small bit-width values (e.g., 3-bit, 4-bit and 5-bit) results in almost no accuracy loss compared with the full-precision counterpart. The quantization method is shown in (6), where W and Ŵ are the full-precision (original) and quantized weights, respectively, and n₁ and n₂ are the upper and lower bounds of the quantized set, respectively.
\hat{W} = \begin{cases} 2^{n} \operatorname{sgn}(W), & 3 \cdot 2^{\,n-2} \le |W| < 3 \cdot 2^{\,n-1},\; n_2 \le n \le n_1, \\ 0, & \text{otherwise} \end{cases}  (6)
Reference [36] proposes the method shown in (7), where s is a scalar parameter, ∘ is the Hadamard product, and sign(·) and |·| return the sign and absolute value of each element, respectively. This method of quantizing gradients to ternary values can effectively improve clients-to-server communication in distributed learning.
\tilde{g} = s \cdot \operatorname{sign}(g) \circ b, \qquad s = \max(|g|), \qquad P(b_k = 1) = |g_k| / s  (7)
Reference [37] proposes a greedy approximation, which instead tries to learn the quantization as shown in (8), where B_i is a binary filter and α_i are optimization parameters.
\min_{\alpha_i, B_i} \left\| W - \sum_{i=1}^{M} \alpha_i B_i \right\|^2, \qquad B_i \in \{-1, +1\}^{c \times w \times h}  (8)
The greedy approximation extends to M-bit quantization by minimizing the residue in order. Although it cannot achieve a high-precision solution, the formulation of minimizing the quantization error is very promising, and quantized neural networks designed in this manner can be effectively deployed on modern portable devices.
III Method
In this section, we introduce our quantization method and dynamic fusion strategy, termed DFTerNet (Dynamic-Fusion-Ternary(2-bit)-Convolutional-Network) for convenience. We aim to recognize human activities from IMU sensor data. For this purpose, a fully-convolutional architecture is chosen, and we focus on the recognition accuracy of the final model. During train-time (Training), we still use the full-precision network (the real-valued weights are retained and updated at each epoch). During run-time (Inference), we use ternary weights in the convolutions.
i Linear Mapping
In this paper, we propose a quantization function Q(·) that converts a floating-point value into its k-bit-width signed-integer counterpart. Formally, it can be defined as follows:
Q(x \mid \Delta, \tau) = \operatorname{clip}\!\left(\Delta \cdot \operatorname{round}\!\left(\frac{x}{\Delta}\right),\, -\tau,\, \tau\right)  (9)
where Δ is the uniform distance, whose role is to perform a discretization of the k-bit linear mapping of continuous and unbounded values; τ is a scale parameter; round(·) is the approximation function that approximates continuous values by their nearest discrete states; and clip(·, −τ, τ) clips unbounded values to the range [−τ, τ].
Each quantization function can thus use its scale parameter τ to adjust the quantization threshold and the clip range used to represent the input value: for two different scale parameters τ₁ < τ₂, for example, the same input may be quantized to 0 under the first and to 0.5 under the second.
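A minimal sketch of such a clip-and-round linear quantizer, assuming the form described above (step Δ derived from the clip range [−τ, τ] and the number of discrete levels); the paper's exact definition of Q is not reproduced here.

```python
def quantize(x: float, levels: int, tau: float) -> float:
    """Clip x to [-tau, tau], then round to one of `levels` uniform states."""
    delta = 2 * tau / (levels - 1)     # uniform distance between states
    x = max(-tau, min(tau, x))         # clip unbounded values
    return round(x / delta) * delta    # snap to the nearest discrete state
```

With `levels=3` and `tau=0.5`, this yields exactly the ternary set {−0.5, 0, +0.5} used for the weights later in the paper.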
ii Approximate weights
Consider an L-layer CNN model. Suppose the learnable weights of each convolutional layer are represented as a tensor of size c × n × w × h, in which c, n, w and h indicate the input channels, output channels, filter width and filter height, respectively. When using 32-bit (full-precision) floating-point arithmetic, storing all these weights would require 32 · c · n · w · h bits of memory.
As claimed above, at each layer our goal is to estimate the real-valued weight filter W using a 2-bit filter Ŵ. Generally, we define a reconstruction error as shown in (10):
E(\alpha, \hat{W}) = \|W - \alpha \hat{W}\|^2  (10)
where α denotes a non-negative scaling parameter. To retain the accuracy of the quantized network, the reconstruction error should be minimized. However, directly minimizing the reconstruction error is NP-hard, so forcibly solving it is very time-consuming [38]. In order to solve the above problem in a reasonable time, we need an efficient estimation algorithm; namely, the goal is to solve the following optimization problem:
\alpha^*, \hat{W}^* = \operatorname*{arg\,min}_{\alpha, \hat{W}} \|W - \alpha \hat{W}\|^2  (11)
in which Ŵ takes ternary values, and the norm ‖T‖² is defined as the sum of squared entries ⟨T, T⟩ for any three-dimensional tensor T.
One way to solve the optimization problem in (11) is to expand the cost function and take the derivative w.r.t. α and Ŵ, respectively. However, in this case, one must handle the correlated, interdependent values of α and Ŵ. To overcome this problem, we use the quantization function to quantize W by (9):
\hat{W} = Q(W \mid \Delta, \tau)  (12)
In this work, we aim to quantize the real-valued weight filter to the ternary values {−0.5, 0, +0.5}, so the parameter τ = 0.5, and the thresholds of the weights are controlled by Δ as shown in (13),
\Delta = \sigma \cdot \mathrm{E}(|W|)  (13)
where σ is a shift threshold parameter which can be used to constrain the thresholds.
With Ŵ fixed through (12), Equation (11) becomes a linear regression problem:
\alpha^* = \operatorname*{arg\,min}_{\alpha} \|W - \alpha \hat{W}\|^2 = \frac{\langle W, \hat{W} \rangle}{\langle \hat{W}, \hat{W} \rangle}  (14)
We can use the “straight-through (ST) estimator” [39] to backpropagate through Q. This is shown in detail in Algorithm 1. Note that at Runtime, only the product α*Ŵ is required.
Algorithm 1 Training with the “straight-through (ST) estimator” [39] for the forward and backward passes of an approximated convolution.

Require: full-precision weights W, shift parameter σ. Assume ℓ is the loss function, and I and O are the input and output tensors of a convolutional layer, respectively.
A. Forward propagation:
1. Ŵ = Q(W), #Quantization
2. Solve Eq. (14) for α*,
3. O = Conv(I, α*Ŵ).
B. Back propagation:
By the chain rule of gradients and the ST estimator we have:
1. ∂ℓ/∂W = ∂ℓ/∂Ŵ.
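Step 2 above has the closed-form least-squares solution of (14), α* = ⟨W, Ŵ⟩ / ⟨Ŵ, Ŵ⟩; a flat-vector sketch (illustrative, not the paper's code):

```python
def optimal_scale(w, w_hat):
    """Least-squares alpha minimizing ||w - alpha * w_hat||^2 for fixed w_hat."""
    num = sum(wi * hi for wi, hi in zip(w, w_hat))  # <W, W_hat>
    den = sum(hi * hi for hi in w_hat)              # <W_hat, W_hat>
    return num / den
```

For example, with w = [0.9, −0.05, −0.8, 0.2] and ternary ŵ = [0.5, 0, −0.5, 0], this returns α* = 1.7.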
iii Activation quantization
In order to avoid the substantial memory consumption and computational requirements caused by cumbersome floating-point calculations, we should use bitwise operations. Therefore, the activations, as well as the weights, must be quantized.
If activations are 1-bit values, we can quantize them after they pass through a function, similar to the activation quantization procedure in [33]. Formally, it can be defined as:
A^q = \operatorname{sign}(A)  (15)
If activations are represented with k bits, the quantization of the real-valued activations can be defined as:
A^q = Q(A \mid \Delta, \tau_A) = \operatorname{clip}\!\left(\Delta \cdot \operatorname{round}\!\left(\frac{A}{\Delta}\right),\, -\tau_A,\, \tau_A\right)  (16)
In this paper, we constrain the weights to the ternary values {−0.5, 0, +0.5}. In order to transform the real-valued activations into ternary activations, we set the activation scale parameter accordingly. The scale parameter controls the clip threshold and can be varied throughout the process of learning. Note that quantization operations in networks cause the variance of the weights to be scaled compared to the original range, which can cause the network's outputs to explode. XNOR-Net [16] proposes a filter-wise scaling factor, calculated continuously at full precision, to alleviate this amplification effect. In our implementation, we instead control the activation threshold to attenuate the amplification effect by setting the activation scale parameter from a predefined constant for each layer, which is updated in each epoch from the trained weights of that layer. The forward and backward passes of the activation are shown in detail in Algorithm 2.
Algorithm 2 Training with the “straight-through (ST) estimator” [39] for the forward and backward passes of the activation.

Require: real-valued activations A, shift parameter σ; the ST estimator can be seen as propagating the gradient through the quantizer, and ∘ indicates the Hadamard product. Assume ℓ is the loss function.
A. Forward propagation:
1. A^q = Q(A), #Quantization
B. Back propagation:
1. ∂ℓ/∂A = ∂ℓ/∂A^q ∘ 1_{|A| ≤ τ_A}, #using the STE
iv Scalability to Multiple Sensors (IMUs)
Each activity in the OPPORTUNITY and PAMAP2 datasets is collected by multiple sensors placed on different parts of the body, and each sensor is independent. For different types of activities, different sensors may not have the same “contribution”. In order to improve the accuracy of our model, we conducted a comprehensive evaluation using the different feature fusion strategies shown in Figure 2. Note that the UniMiB-SHAR dataset only has 3-channel data (a 3D accelerometer), so we apply Early fusion to it.
Early fusion. All joints from the multiple sensors on different body parts are stacked as the input of the network [21, 40].
Late fusion. Independent sensors from different signal sources pass through their own Conv3 layers, and the resulting feature maps are concatenated using fixed fusion weights, as in [41].
Dynamic fusion. Different parts of the body (different sensor locations) have different levels of participation in different types of activities. For example, for ankle/hand-based activities (e.g., running and jumping), the “contribution” of the back-based sensor is lower than that of the sensors on the hands and ankles. In the case of hand-based activities (e.g., opening or closing a drawer), the “contribution” of the sensors on the ankles and back is lower than that of the hands, etc. Therefore, unlike in the Late fusion method, the fusion weight settings of Dynamic fusion differ across sub-networks. Formally, the full-precision Sub-network-Conv3 weights and feature maps are represented as W^{(i)} and F_i respectively, with corresponding fusion weights π_i. More specifically, the dynamic fusion weights aim to randomly reduce the representations of signal sources with less “contribution”, which can be considered a “dynamic dropout method”, i.e., a dynamic clip parameter (a non-fixed parameter). Given the quantized weights, each fusion weight independently follows the Bernoulli distribution shown in (17):
P(\pi_i = 1) = \tau_i, \qquad P(\pi_i = 0) = 1 - \tau_i  (17)
where τ_i is the scale parameter used to quantize the i-th sub-network's weights, and π_i is the corresponding fusion weight.
Training-time. The full-precision Conv3 sub-network weights are quantized by (9):
\hat{W}^{(i)} = Q\big(W^{(i)} \mid \Delta, \tau_i\big)  (18)
According to (17), the generated fusion weight shown in (19) is given by:
\pi_i \sim \operatorname{Bernoulli}(\tau_i)  (19)
Assumption. Suppose the i-th sub-networks are those with less “contribution”; then the feature maps after the dynamic fusion strategy can be expressed as:
F = \operatorname{concat}\big(F_1, \ldots, \pi_i \circ F_i, \ldots, F_n\big)  (20)
where ∘ denotes the Hadamard product. An example of this process is shown in Figure 3.
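A sketch of the dynamic fusion gating (my illustration, with feature maps flattened to lists): each sub-network's feature map is kept or zeroed by a Bernoulli draw with its own probability.

```python
import random

def dynamic_fusion(feature_maps, probs, seed=None):
    """Gate each sub-network's feature map by a Bernoulli(p) fusion weight."""
    rng = random.Random(seed)
    gates = [1.0 if rng.random() < p else 0.0 for p in probs]
    # Hadamard product of each gate with its feature map
    return [[g * v for v in fm] for g, fm in zip(gates, feature_maps)]
```

A sub-network assigned probability 1.0 always passes through, while probability 0.0 always suppresses its representation; intermediate probabilities give the “dynamic dropout” behavior described above.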
v Error and Complexity Analysis
Reconstruction Error. According to (10) and (11), we have defined the reconstruction error E. In this section, we analyze the bound satisfied by E.
Theorem 1 (Reconstruction Error Bound). The reconstruction error E is bounded as
E \le \|W\|^2 \, \lambda^{m}, \qquad \lambda = 1 - \frac{1}{n}  (21)
where m is the number of quantization terms and n denotes the number of elements in W.
Proof. We define r_j, which denotes the approximation residue after combining all previously obtained tensors, as
r_j = W - \sum_{i=1}^{j} \alpha_i \hat{W}_i  (22)
Through derivative calculations, (10) is equivalent to
\|r_j\|^2 = \|r_{j-1}\|^2 - \frac{\langle r_{j-1}, \hat{W}_j \rangle^2}{\|\hat{W}_j\|^2}  (23)
Since Ŵ_j takes ternary values, we can obtain
\frac{\langle r_{j-1}, \hat{W}_j \rangle^2}{\|\hat{W}_j\|^2} \ge \frac{\|r_{j-1}\|^2}{n}  (24)
in which r_{j−1,i} is an entry of r_{j−1}. According to (22) and (24), we have
\|r_j\|^2 \le \left(1 - \frac{1}{n}\right) \|r_{j-1}\|^2 \le \cdots \le \left(1 - \frac{1}{n}\right)^{j} \|W\|^2  (25)
in which j varies from 0 to m.
We can see from Theorem 1 that the reconstruction error decays “exponentially” with rate λ. This means that, given a small size, i.e., when n is small, the reconstruction achieved by the algorithm can be quite good.
Efficient Operations. Both modern CPUs and SoCs contain instructions that efficiently compute 64-bit strings in short time cycles [44], whereas floating-point calculations require very complex logic. Calculation efficiency can be improved by several tens of times by adopting bit-count operators instead of 64-bit floating-point addition and multiplication.
In the classic deep learning architecture, floating-point multiplication is the most time-consuming part. However, when the weights and activations are ternary values, floating-point calculations can be avoided. In order to efficiently reduce the computational complexity and time consumption, we have to design a new operation, which aims to replace the full-precision multiply-accumulate operation between the input tensor and the filter. Some previous works [15, 16] on 1-bit networks have been successfully implemented using Hamming-space calculation (bit-counting; the Hamming space can be used to compute matrix multiplications and inner products) as a replacement for matrix multiplication. For binary vectors x, y ∈ {−1, +1}^N, the matrix multiplication can be replaced by (26):
x \cdot y = N - 2 \times \operatorname{bitcount}(x \oplus y)  (26)
where bitcount(·) counts the number of set bits in the rows of x and y, and ⊕ is an exclusive-OR operator.
In this paper, we aim to extend this concept to 2-bit networks. The quantized input tensor and filter take ternary values composed of −0.5, 0 and +0.5. Given a fixed scale parameter, the quantization threshold is fixed as well. Therefore, we define two tensors per operand: one storing the sign bits and one storing the nonzero (magnitude) bits of the quantized values.
In this work, our goal is to replace matrix multiplication with the notion of bit-counting in 2-bit convolutional networks. The inner-product calculation can then be performed using two bitcounts in Hamming space:
x \cdot y = \frac{1}{4} \Big( 2 \cdot \operatorname{bitcount}\big( \overline{s_x \oplus s_y} \wedge m_x \wedge m_y \big) - \operatorname{bitcount}\big( m_x \wedge m_y \big) \Big)  (27)
where the overbar defines the negated XOR (XNOR) and ∧ an AND operator. Note that when zero values are present, the behavior of the element-wise operator must be custom-implemented.
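A sketch of this two-bitcount idea for ternary vectors, assuming a sign-bitmask/magnitude-bitmask encoding (the encoding is my illustration, not necessarily the paper's exact layout):

```python
def popcount(x: int) -> int:
    return bin(x).count("1")

def ternary_dot(sx: int, mx: int, sy: int, my: int, n: int) -> int:
    """Inner product of two {-1,0,+1} vectors of length n. Each vector is
    encoded as a sign mask (bit=1 -> positive) and a magnitude mask
    (bit=1 -> nonzero); for values in {-0.5,0,+0.5}, scale the result by 1/4."""
    nz = mx & my & ((1 << n) - 1)   # positions where both entries are nonzero
    agree = ~(sx ^ sy) & nz         # nonzero positions with matching signs
    return 2 * popcount(agree) - popcount(nz)

# x = [+1, 0, -1, +1], y = [+1, -1, -1, 0] -> dot = 2
```

Zero entries are handled by the magnitude mask: any position where either operand is zero contributes nothing, which is exactly the custom element-wise behavior mentioned above.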
Batch Normalization. In previous works, weights are quantized to binary values by using a sign function [15], or to ternary values by using a positive threshold parameter [32], during train-time. However, neural networks with quantized weights all fail to converge without batch normalization, because the quantized values are a rather coarse discretization of the full-precision values. Batch Normalization [45] efficiently avoids the exploding- and vanishing-gradient problems. In this part, we briefly discuss whether the batch normalization operation incurs extra computational cost. Simply put, batch normalization is an affine function:
y = \gamma \, \frac{x - \mu}{\sigma} + \beta  (28)
where μ and σ are the mean and standard deviation, respectively, and γ and β are scale and shift parameters, respectively. More specifically, a batch normalization can be quantized to 2-bit values by the following quantization method:
y^q = Q\!\left(\gamma \, \frac{x - \mu}{\sigma} + \beta \;\Big|\; \Delta, \tau\right)  (29)
Equation (29) can be converted to the following form, with the affine parameters folded into the quantizer:
y^q = \operatorname{clip}\!\left(\Delta \cdot \operatorname{round}\!\left(\frac{x - \mu'}{\Delta'}\right), -\tau, \tau\right), \qquad \Delta' = \frac{\sigma \, \Delta}{\gamma}, \quad \mu' = \mu - \frac{\beta \, \sigma}{\gamma}  (30)
Therefore, batch normalization will be accomplished at no extra cost.
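Since batch normalization is affine, its parameters can indeed be folded into the quantizer's comparison thresholds ahead of time. The sketch below (my illustration, assuming γ > 0 and a 1-bit sign quantizer for simplicity) shows that the folded form makes identical decisions with a single comparison.

```python
def bn_then_sign(x, gamma, beta, mu, sigma):
    """Reference path: apply batch norm, then a sign quantizer."""
    return 1.0 if gamma * (x - mu) / sigma + beta >= 0 else -1.0

def folded_sign(x, gamma, beta, mu, sigma):
    """Folded path: gamma*(x-mu)/sigma + beta >= 0  <=>  x >= mu - beta*sigma/gamma."""
    threshold = mu - beta * sigma / gamma  # precomputed once per channel
    return 1.0 if x >= threshold else -1.0
```

The threshold is computed once per channel offline, so inference pays nothing extra for batch normalization.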
IV Experiments
Our experiments aim to demonstrate the usefulness of quantization methods and fusion strategies in convolutional neural networks for high-precision human activity recognition on portable devices, to show that extending our model and training strategies to complex combined activity recognition is straightforward, and thereby to provide a better gaming experience for virtual-realistic interactive games on VR/AR and other portable devices. The memory requirements and the quantized weights of each layer are also analyzed in detail. Complex naturalistic activities involve several parts of the body, and the weak contrast between some activities makes recognition very difficult. Therefore, networks with enough generalization ability to robustly fuse the data features from sensors on different body parts are necessary; at the same time, an automatic method should capture the outline of the activity features and accurately recognize the activity.
The primary parameter of any experimental setup is the choice of datasets. To choose the optimal datasets for this study, we considered the complexity and richness of the datasets. Based on the background of our research, we selected the OPPORTUNITY [46], PAMAP2 [47] and UniMiBSHAR [48] benchmark datasets for our experiments.
i Data Description and Performance Measure
i.1 Opportunity
The OPPORTUNITY public dataset has been used in many open activity recognition challenges. It contains four subjects performing 17 different (morning) Activities of Daily Life (ADLs) in a sensor-rich environment, as listed in Table 1. The data were acquired at a sampling frequency of 30 Hz from 7 wireless body-worn inertial measurement units (IMUs), each consisting of a 3D accelerometer, a 3D gyroscope and a 3D magnetic sensor, as well as 12 additional 3D accelerometers placed on the back, arms, ankles and hips, accounting for a total of 145 different sensor channels. During the data collection process, each subject performed 5 ADL sessions and 1 drill session. During each ADL session, subjects were asked to perform the activities naturally; these sessions are named “ADL1” to “ADL5”. During the drill session, subjects performed 20 repetitions of each of the 17 ADLs of the dataset. The dataset contains about 6 hours of recordings in total, and the data are labeled at the timestamp level. In our experiments, the training and testing sets have 63 dimensions (36D from the hands, 9D from the back and 18D from the ankles, respectively).
In this paper, the models were trained on the data of ADL1, ADL2, ADL3 and the drill session, and tested on the data of ADL4 and ADL5.
i.2 Pamap2
The PAMAP2 dataset contains recordings from 9 subjects who participated in 12 activities, including household activities and a variety of exercise activities, as shown in Table 2. An IMU is worn on the hand, chest and ankle, together with a heart-rate (HR) monitor, all sampled at a constant rate of 100 Hz (note that, following [22], the PAMAP2 dataset is downsampled in order to have a temporal resolution comparable to the OPPORTUNITY dataset). The accelerometers, gyroscopes, magnetometers, temperature sensor and heart-rate monitor yield 40 sensor channels, recorded over 10 hours in total. In our experiments, the training and testing sets have 36 dimensions (12D from the hand, 12D from the chest and 12D from the ankle, respectively).
In this paper, the data from subjects 5 and 6 are used as the testing set, and the remaining data are used for training.
i.3 UniMiBSHAR
The UniMiB-SHAR dataset collects data from 30 healthy subjects (6 male and 24 female), acquired using the 3D accelerometer of a Samsung Galaxy Nexus I9250 running Android OS version 5.1.1. The data are sampled at a constant rate of 50 Hz and split into 17 different activity classes: 9 safe activities and 8 dangerous activities (falling actions), as shown in Table 3. Unlike the OPPORTUNITY dataset, this dataset does not have any NULL class and remains relatively balanced. In our experiments, the training and testing sets have 3 dimensions.
i.4 Performance Measure
ADL datasets are often highly unbalanced, like the OPPORTUNITY dataset. For such datasets, the overall classification accuracy is not an appropriate measure of performance, because the recognition rate of the majority classes might skew the performance statistics to the detriment of the least represented classes. As a result, many previous works such as [21] use an evaluation metric independent of the class repartition: the F1 score. The F1 score combines two measures, precision P and recall R: P is the number of correct positive results divided by the number of all positive results returned by the classifier, and R is the number of correct positive results divided by the number of all samples that should have been identified as positive. The F1 score is the harmonic average of P and R, where the best value is 1 and the worst is 0. In this paper, we use an additional evaluation metric to ease comparison with prior work: the weighted F1 score (the sum of the class F1 scores, weighted by the class proportions):
F_w = \sum_{c} 2 \cdot \frac{n_c}{N} \cdot \frac{P_c \times R_c}{P_c + R_c}  (31)
where n_c is the number of samples in class c, and N is the total number of samples.
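The weighted F1 of (31) can be computed directly from per-class precision and recall; a small sketch (not the paper's evaluation code):

```python
def weighted_f1(precisions, recalls, class_counts):
    """F1_w = sum_c (n_c / N) * 2 * P_c * R_c / (P_c + R_c)."""
    total = sum(class_counts)
    score = 0.0
    for p, r, n in zip(precisions, recalls, class_counts):
        f1 = 2 * p * r / (p + r) if (p + r) > 0 else 0.0
        score += (n / total) * f1
    return score
```

Because each class is weighted by its proportion n_c/N, a dominant NULL class cannot mask poor recognition of the rare classes as plain accuracy would.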
Table 1: Classes and proportions of the OPPORTUNITY dataset.

Class  Proportion  Class  Proportion 

Open Door 1/2  1.87%/1.26%  Open Fridge  1.60% 
Close Door 1/2  6.15%/1.54%  Close Fridge  0.79% 
Open Dishwasher  1.85%  Close Dishwasher  1.32% 
Open Drawer 1/2/3  1.09%/1.64%/0.94%  Clean Table  1.23% 
Close Drawer 1/2/3  0.87%/1.69%/2.04%  Drink from Cup  1.07% 
Toggle Switch  0.78%  NULL  72.28% 
Table 2: Classes and proportions of the PAMAP2 dataset.

Class  Proportion  Class  Proportion 

Lying  6.00%  Sitting  5.78% 
Standing  5.92%  Walking  7.45% 
Running  3.06%  Cycling  5.13% 
Nordic walking  5.87%  Ascending stairs  3.66% 
Descending stairs  3.27%  Vacuum cleaning  5.47% 
Ironing  7.44%  House cleaning  5.84% 
Null  35.12% 
Table 3: Classes and proportions of the UniMiB-SHAR dataset.

Class  Proportion  Class  Proportion 

Standing Up from Sitting  1.30%  Walking  14.77% 
Standing Up from Laying  1.83%  Running  16.86% 
Lying Down from Standing  2.51%  Going Up  7.82% 
Jumping  6.34%  Going Down  11.25% 
F(alling) Forward  4.49%  F and Hitting Obstacle  5.62% 
F Backward  4.47%  Syncope  4.36% 
F Right  4.34%  F with ProStrategies  4.11% 
F Backward SittingChair  3.69%  F Left  4.54% 
Sitting Down  1.70% 
ii Experimental Setup
Sliding Window. Our selected data are recorded continuously. We can think of the continuous HAR data as a video feature: we use a sliding time window of fixed length to segment the data, and each segment can then be viewed as a frame of the video (a picture). We define T, C and s as the length of the time window, the number of sensor channels and the sliding stride, respectively. Through this segmentation approach, each “picture” is a T × C matrix. We set the segmentation parameters as in [40], using a time window of 2 s on the OPPORTUNITY and PAMAP2 datasets, resulting in T = 64; on the UniMiB-SHAR dataset, a time window of 2 s results in T = 96. Due to the timestamp-level labeling, each segment can contain multiple labels; we choose the majority label, i.e., the one that appears most frequently among those of the segment's timestamps.
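The segmentation just described can be sketched as follows (list-based for clarity; the stride value and majority-label choice follow the text, the rest is illustrative):

```python
from collections import Counter

def sliding_windows(samples, labels, T, stride):
    """Cut a timestamp-labeled series into length-T windows; each window gets
    the majority label among its T timestamps."""
    out = []
    for start in range(0, len(samples) - T + 1, stride):
        window = samples[start:start + T]
        label = Counter(labels[start:start + T]).most_common(1)[0][0]
        out.append((window, label))
    return out
```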
Dynamic Fusion Weights. Our selected datasets (OPPORTUNITY and PAMAP2) cover two families of human activity: periodic activities (locomotion in the OPPORTUNITY dataset and all of PAMAP2) and sporadic activities (gestures in the OPPORTUNITY dataset). To design dynamic fusion strategies for the two families, we construct one group of fused feature maps for each. For periodic activities, we take into account the fact that back-mounted sensors have less “contribution”; for sporadic activities, we consider that both back-mounted and ankle-mounted sensors have less “contribution”. Formally, at training time and at run time, according to (17), (18) and (19), the feature maps after the dynamic fusion strategies can be expressed as:
(32) 
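A minimal sketch of the fusion step: each sensor's feature maps are masked element-wise by fusion weights drawn from a Bernoulli distribution before being fused. In the paper the Bernoulli parameters come from the quantization scale parameter via (17)-(19); here they are passed in directly as a simplification, and the feature-map shapes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def dynamic_fuse(reps, p):
    """Fuse per-sensor feature maps using element-wise Bernoulli masks.

    reps : list of (H, W) feature maps, one per sensor sub-network.
    p    : list of Bernoulli parameters, one per sensor. Sensors with
           less "contribution" get a smaller p, so more of their
           entries are zeroed before fusion.
    """
    fused = []
    for r, pi in zip(reps, p):
        mask = rng.binomial(1, pi, size=r.shape)  # fusion weights in {0, 1}
        fused.append(r * mask)
    return np.concatenate(fused, axis=-1)  # fuse before the dense layer
```

With p = 1 for every sensor this degenerates to the plain late-fusion model (FTerNet); with p = 0 a sensor's representation is dropped entirely.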
Pooling Layer. The role of common pooling layers is to take the maximum (max-pooling) or the average (avg-pooling) of the output of each filter. Our experiments do not use avg-pooling, because the averaging operation generates values outside the ternary set {−1, 0, +1}. We also observe that applying max-pooling to ternary-valued inputs skews the distribution of values toward +1, which results in a noticeable drop in recognition accuracy. Therefore, we place the max-pooling layer before the batch normalization (BN) and activation (A).
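A quick numerical illustration of both observations: averaging ternary values produces values outside {−1, 0, +1}, and max-pooling over ternary values skews the result toward +1 (for a window of four i.i.d. uniform ternary values, P(max = +1) = 1 − (2/3)^4 ≈ 0.80 instead of 1/3).

```python
import numpy as np

rng = np.random.default_rng(0)
# Ternary activations, uniform over {-1, 0, +1}, pooled in windows of 4.
x = rng.choice([-1, 0, 1], size=(10000, 4))

max_pooled = x.max(axis=1)
avg_pooled = x.mean(axis=1)

# Max-pooling skews the distribution toward +1 (~0.80 instead of 1/3).
frac_plus_one = (max_pooled == 1).mean()

# Average pooling yields non-ternary values (e.g. 0.25, 0.5) most of
# the time, which is why avg-pooling is avoided here.
non_ternary = ~np.isin(avg_pooled, (-1.0, 0.0, 1.0))
```

Pooling the real-valued pre-BN outputs instead (as the paper does) avoids both effects, since quantization to ternary values happens only after BN and activation.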
iii Baseline Model
The aim of this paper is not necessarily to exceed current state-of-the-art accuracies, but rather to demonstrate and analyze the impact of network quantization and of the fusion strategy. The benchmark model should therefore not be overly complex, because increasing the network topology and computational complexity to improve performance runs counter to the goal of deploying advanced networks on portable devices. Instead, we considered improving model performance through a training strategy that is better aligned with practical applications. We therefore chose the CNN architecture of [40] as the baseline model. It contains three convolutional blocks, a dense layer and a softmax layer. Each convolutional kernel performs a 1-D convolution on each sensor channel independently over the time dimension. To fairly evaluate the computational cost and memory usage of quantizing the CNNs, we employ the same number of channels and convolution filters in all compared models. Layer-wise details are shown in Table 4, in which “Conv2” is the most computationally expensive layer and “Fc” consumes the most memory. For example, using floating-point precision on the OPPORTUNITY dataset, the entire model requires approximately 82 MFLOPs (note that FLOPs consist of equal numbers of FMULs and FADDs) and approximately 2 million weights, i.e., about 0.38 MBytes of storage for the model weights. At training time the model requires more than 12 GBytes of memory (with a batch size of 1024); for inference at run time this can be reduced to approximately 1.8 GBytes.
Table 4 layer-wise parameters and FLOPs:

OPPORTUNITY / PAMAP2
Layer Name | Params (b) | FLOPs (OPPORTUNITY) | FLOPs (PAMAP2)
Conv1 | 0.6k | 4.84M | 2.76M
Conv2 | 20k | 68.18M | 38.96M
Conv3 | 7.2k | 5.47M | 3.12M
Fc | 1.89M | 3.78M | 2.16M

UniMiB SHAR
Layer Name | Params (b) | FLOPs
Conv1 | 0.6k | 0.23M
Conv2 | 20k | 3.25M
Conv3 | 7.2k | 0.26M
Fc | 1.89M | 0.18M
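The counts in Table 4 follow the standard cost formulas for convolutional and dense layers, counting one FMUL plus one FADD per multiply-accumulate (the paper's FLOP convention). The helpers below are generic sketches; the shapes used in them are illustrative and not the paper's exact layer dimensions.

```python
def conv1d_cost(c_in, c_out, kernel, t_out):
    """Parameters and FLOPs of one 1-D convolutional layer. FLOPs grow
    with the output length t_out, which is why "Conv2" dominates
    computation despite having few parameters."""
    params = c_in * c_out * kernel + c_out      # weights + biases
    flops = 2 * c_in * c_out * kernel * t_out   # 2 FLOPs per MAC per output step
    return params, flops

def dense_cost(n_in, n_out):
    """A dense layer stores one weight per connection, so it dominates
    memory (the "Fc" row) while doing only 2 FLOPs per weight."""
    params = n_in * n_out + n_out
    flops = 2 * n_in * n_out
    return params, flops
```

This asymmetry explains the table: quantizing weights to 2 bits mostly shrinks the dense layer's memory, while replacing multiplications with bit-counts mostly speeds up the convolutions.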
iv Implementation Details
In this section, we provide the implementation details of the convolutional neural network architecture. Our method is implemented in PyTorch. The model is trained with a mini-batch size of 1024 for 50 epochs, using AdaDelta with its default initial learning rate [49]. A softmax function is used to normalize the output of the model; the probability that a sequence belongs to the i-th class is given by (33):
$$P(c = i \mid \mathbf{x}) = \frac{\exp(x_i)}{\sum_{j=1}^{C} \exp(x_j)} \qquad (33)$$

where $\mathbf{x}$ is the output of the model and $C$ is the number of activity classes.
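Eq. (33) is the standard softmax; a numerically stable sketch subtracts the maximum logit before exponentiating, which does not change the result but avoids overflow for large outputs.

```python
import math

def softmax(logits):
    """Softmax of Eq. (33), computed in a numerically stable way."""
    m = max(logits)                              # shift-invariance of softmax
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]
```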
Algorithm 3 Training an L-layer DFTerNet. C denotes the loss function for the mini-batch, g_x the gradient of C propagated to x, and λ the learning-rate decay factor. ⊙ indicates the Hadamard product. BatchNorm(·) specifies how to batch-normalize the output of a convolution and BackBatchNorm(·) how to back-propagate through the normalization [45]. Update(·) specifies how to update the parameters when their gradients are known, e.g., with AdaDelta [49].

Require: a mini-batch of inputs and targets (x, y), the previous weights W, the weight and activation bit-widths (2 bit), the shift threshold parameter ε, and the learning rate η.
Ensure: the updated weights W.

1. Computing the parameter gradients:
1.1. Forward propagation:
for k = 1 to L do
    quantize the weights of layer k with (12)
    compute the layer output a_k with (14)
    apply max-pooling
    a_k ← BatchNorm(a_k)
    if k < L then
        quantize the activations a_k with (16)
    end if
end for
1.2. Backward propagation:
{note that the gradients are full-precision.}
compute g_{a_L} knowing a_L and the targets y
for k = L to 1 do
    if k < L then
        back-propagate through the activation quantizer by Algorithm 2
    end if
    apply BackBatchNorm(·)
end for
2. Accumulating the parameter gradients:
for k = 1 to L do
    with g_{a_k} known, compute the weight gradient g_{W_k} by Algorithm 1
    W_k ← Update(W_k, g_{W_k}, η)
end for
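The overall pattern of Algorithm 3 (forward with quantized weights, full-precision gradients via the straight-through estimator, updates applied to a full-precision shadow copy) can be sketched on a single linear layer. The `ternarize` helper below is a generic threshold quantizer standing in for Eq. (12), not the paper's exact rule, and plain SGD with squared loss stands in for AdaDelta and the real network.

```python
import numpy as np

def ternarize(w, delta):
    """Generic threshold-based ternary quantization (a stand-in for Eq. (12))."""
    return np.where(w > delta, 1.0, np.where(w < -delta, -1.0, 0.0))

def train_step(w_fp, x, y, lr, delta=0.05):
    """One shadow-weight training step: forward with quantized weights,
    straight-through estimator backward, update the full-precision copy."""
    w_q = ternarize(w_fp, delta)           # forward pass uses quantized weights
    pred = x @ w_q
    grad_pred = 2 * (pred - y) / len(y)    # d(MSE)/d(pred)
    grad_w = x.T @ grad_pred               # STE: gradient w.r.t. w_q ...
    return w_fp - lr * grad_w              # ... applied directly to w_fp
```

Keeping the full-precision copy lets many small gradient steps accumulate before a weight crosses the threshold and flips its ternary value, which is what makes training with 2-bit forward passes stable.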
Experiments were carried out on a platform with two Intel E5-2600 CPUs, 128 GB of RAM and an NVIDIA TITAN Xp 12 GB GPU. The hyperparameters of the model are provided in Figure 2 (the early-fusion model is a common convolutional neural network architecture and can be regarded as a sub-network; its hyperparameters therefore equal those of any sub-network of the late-fusion or dynamic-fusion models). The training procedure of DFTerNet is summarized in Algorithm 3.
V Results and Discussion
In this section, the proposed quantization method and fusion strategies are evaluated on three well-known benchmark datasets. We consider: 1) a comparison of the proposed dynamic fusion models with other baseline models; 2) the effect of the weight shift threshold parameter; 3) the trade-off between quantization and model accuracy. In the first method (the baseline, TerNet), the required sensor signal sources are stacked together. In the second method (FTerNet), the different sensor signal sources are processed through their own sub-networks and the learned representations are fused before the dense layer, i.e., each element of the fusion weights equals 1. The model proposed in this paper (DFTerNet) differs from the second method in how it handles the fusion step: each element of the fusion weights is sampled from a Bernoulli distribution whose parameter is given by the scale parameter of the quantization method that we propose.
i Multi-sensor Signal Fusion
In order to evaluate the different fusion strategies described in Section iv, an ablation study was performed on the OPPORTUNITY and PAMAP2 datasets. The first set of experiments compared the three fusion strategies on each dataset. As shown by the bold scores in Table 6, the fusion strategies rank as follows: matched dynamic fusion first, followed by late fusion and finally early fusion. The reason could be that it is better for each sensor signal source to have its own network; it is improper to apply a single network to all signal sources at once. Meanwhile, there is a correlation between the different signal sources and the activity types, so the recognition result should be more reliable when the signal sources are highly correlated with the activity type. Following these two observations, the recognition result should be weighted by the learned representations of the multiple signal sources, and the weight of each source's representation should reflect the similarity between that source and the activity type.
Weighted F1 score of DFTerNet for different values of the weight shift threshold:

Threshold | – | 2.7 | 2.8 | 2.9 | 3.0
DFTerNet (–) | 0.884 | 0.897 | 0.910 | 0.909 | 0.893
DFTerNet (–) | 0.879 | 0.894 | 0.905 | 0.905 | 0.891
ii Analysis of Weight Shift Threshold Parameter
In our DFTerNet, the weight shift threshold parameter of the shift operation directly affects the subsequent fusion weights. The second set of experiments therefore considers the effect of this parameter's value. As mentioned in the previous section, its value is related to the scale parameter, and the fusion weights are sampled from a Bernoulli distribution. We use matched dynamic fusion on the OPPORTUNITY dataset as a test case to compare performance across threshold values. In this experiment, the parameter settings are the same as described in Sections ii and iv. Table 5 summarizes the results on matched dynamic fusion: the proposed quantization method achieves its best performance for the two middle threshold values. A similar phenomenon can also be found in the literature, e.g., [12].
iii Visualization of the Quantized Weights
In addition to analyzing the quantized weights, we further looked inside the learned layers and inspected their values. We plot the heatmap of the fraction of zero-valued weights for DFTerNet on the locomotion task of the OPPORTUNITY dataset across epochs. As shown in Figure 5, the fraction of zero values increases in later epochs; similar phenomena appear for DFTerNet on the PAMAP2 dataset and on the gestures task of the OPPORTUNITY dataset. Section v bounds the reconstruction error, which the model can drive to a very small value. Table 4 shows that the layers containing the most free parameters end training with increased sparsity, which indicates that our proposed quantization method can avoid overfitting: the sparsity acts as a regularizer.
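The quantity plotted in the heatmap, the per-layer fraction of exactly-zero entries in a ternary weight tensor, is straightforward to compute:

```python
import numpy as np

def zero_fraction(weights):
    """Fraction of exactly-zero entries in a (ternary) weight tensor,
    i.e. the per-layer sparsity tracked across epochs in Figure 5."""
    w = np.asarray(weights)
    return float((w == 0).mean())
```

Logging this value per layer per epoch reproduces the kind of sparsity trajectory discussed above.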
iv The Tradeoff Between Quantization and Model Accuracy
The third set of experiments explores the accuracy cost of the quantization method. As in the first and second sets of experiments, a four-layer convolutional network is used, and the sliding-window and batch-size settings are kept identical. The weight shift threshold parameter is fixed. TerNet, FTerNet and DFTerNet are then compared with their own full-precision counterparts. Table 6 reports the weighted F1 scores of the different full-precision models described in Figure 2 and of their quantized counterparts, together with the memory usage of the model parameters. It shows that with the proposed quantization method, the performance difference between a 2-bit network and its full-precision counterpart is very small. Figure 4 shows the validation weighted-F1 curves on these datasets: our quantized models (TerNet, FTerNet and DFTerNet) converge almost as fast and as stably as their full-precision counterparts, which demonstrates the robustness of the quantization technique we propose.
We also tested the efficiency of the Hamming-distance computation of (27). For example, training a dynamic fusion model on the OPPORTUNITY dataset took about 12 minutes on an NVIDIA TITAN Xp 12 GB GPU. Inference with the full-precision network on a CPU takes about 15 seconds; we estimate the DFTerNet inference time at 1.8 seconds on a mobile CPU. This shows that the proposed quantization technique can achieve a substantial speedup.
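For the binary case, the bit-count trick behind this speedup works as sketched below: two {−1, +1} vectors packed into machine words have dot product n − 2·popcount(a XOR b), replacing n multiplications with one XOR and one population count. The paper's 2-bit ternary scheme of (27) additionally tracks a zero mask; this sketch shows only the binary core.

```python
def pack(vec):
    """Pack a {-1, +1} vector into an int (bit i set where vec[i] == +1)."""
    bits = 0
    for i, v in enumerate(vec):
        if v == 1:
            bits |= 1 << i
    return bits

def binary_dot(a_bits, b_bits, n):
    """Dot product of two packed {-1, +1} vectors of length n:
    every differing bit contributes -1, every matching bit +1,
    hence n - 2 * popcount(a XOR b)."""
    return n - 2 * bin(a_bits ^ b_bits).count("1")
```

On real hardware the popcount maps to a single instruction (cf. [44]), which is where the order-of-magnitude gain over floating-point multiply-accumulate comes from.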
VI Conclusion and Future Work
In this paper, we presented DFTerNet, which combines a new network quantization method with a novel dynamic fusion strategy, to address the problem of recognizing activities from multi-sensor signal sources and deploying the models on computation-limited portable devices. Firstly, the proposed quantization method performs two operations, weight quantization and activation quantization, by adjusting the scale parameter. Secondly, the bit-count scheme proposed in this work, which replaces matrix multiplication, is hardware-friendly and realizes a 9× speedup while requiring 11× less memory. Thirdly, a novel dynamic fusion strategy is proposed. Unlike existing methods, which treat the representations from different sensor signal sources equally, it learns each sensor signal source separately and attenuates the representations of “less contributing” sources through fusion weights sampled from a Bernoulli distribution. The experiments performed demonstrate the effectiveness of both the proposed quantization method and the dynamic fusion strategy. As future work, we plan to extend the quantization method to gradients and errors, so that models can be both trained and run directly on portable devices; since improving model performance requires continuous online learning, separating training from inference would limit this.
References
 [1] S. Ren, K. He, R. Girshick, and J. Sun, “Faster R-CNN: Towards real-time object detection with region proposal networks,” in International Conference on Neural Information Processing Systems (NIPS), Dec. 2015, pp. 91–99.
 [2] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, et al., “Going deeper with convolutions,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Jun. 2015, pp. 1–9.
 [3] Q. Lu, C. Liu, Z. Jiang, A. Men, and B. Yang, “G-CNN: Object detection via grid convolutional neural network,” IEEE Access, vol. 5, pp. 24023–24031, 2017.
 [4] W. Yin, X. Yang, L. Zhang, and E. Oki, “ECG monitoring system integrated with IR-UWB radar based on CNN,” IEEE Access, vol. 4, pp. 6344–6351, 2016.
 [5] Y. Shen, T. Han, Q. Yang, Y. Wang, F. Li, and H. Wen, “CS-CNN: Enabling robust and efficient convolutional neural networks inference for Internet-of-Things applications,” IEEE Access, vol. 6, pp. 13439–13448, 2018.
 [6] J. B. Yang, M. N. Nguyen, P. P. San, X. L. Li, and S. Krishnaswamy, “Deep convolutional neural networks on multichannel time series for human activity recognition,” in International Joint Conference on Artificial Intelligence (IJCAI), Jul. 2015, pp. 3995–4001.
 [7] Y. Liu, Q. Wu, L. Tang, and H. Shi, “Gaze-assisted multi-stream deep neural network for action recognition,” IEEE Access, vol. 5, pp. 19432–19441, 2017.
 [8] C. Szegedy, S. Ioffe, V. Vanhoucke, and A. Alemi. (Feb. 2016). “Inception-v4, Inception-ResNet and the impact of residual connections on learning.” [Online]. Available: https://arxiv.org/abs/1602.07261
 [9] K. M. He, X. Y. Zhang, S. Q. Ren, and J. Sun, “Deep residual learning for image recognition,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Jun. 2016, pp. 770–778.
 [10] K. Simonyan and A. Zisserman. (Sep. 2014). “Very deep convolutional networks for large-scale image recognition.” [Online]. Available: https://arxiv.org/abs/1409.1556
 [11] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet classification with deep convolutional neural networks,” in Advances in Neural Information Processing Systems (NIPS), Dec. 2012, pp. 1097–1105.
 [12] S. K. Esser, P. A. Merolla, J. V. Arthur, A. S. Cassidy, R. Appuswamy, et al. (Mar. 2016). “Convolutional networks for fast, energy-efficient neuromorphic computing.” [Online]. Available: https://arxiv.org/abs/1603.08270
 [13] S. Han, H. Mao, and W. J. Dally. (Oct. 2015). “Deep Compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding.” [Online]. Available: https://arxiv.org/abs/1510.00149
 [14] M. Courbariaux, Y. Bengio, and J. P. David, “BinaryConnect: Training deep neural networks with binary weights during propagations,” in Advances in Neural Information Processing Systems (NIPS), Dec. 2015, pp. 3123–3131.
 [15] M. Courbariaux, I. Hubara, D. Soudry, R. El-Yaniv, and Y. Bengio. (Feb. 2016). “Binarized Neural Networks: Training deep neural networks with weights and activations constrained to +1 or −1.” [Online]. Available: https://arxiv.org/abs/1602.02830
 [16] M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi. (Mar. 2016). “XNOR-Net: ImageNet classification using binary convolutional neural networks.” [Online]. Available: https://arxiv.org/abs/1603.05279v2
 [17] A. Ehliar, “Area efficient floating-point adder and multiplier with IEEE-754 compatible semantics,” in International Conference on Field-Programmable Technology, Dec. 2014, pp. 131–138.
 [18] M. Horowitz, “Computing's energy problem (and what we can do about it),” in IEEE International Solid-State Circuits Conference Digest of Technical Papers, 2014, pp. 10–14.
 [19] D. J. Cook and N. C. Krishnan, Activity Learning: Discovering, Recognizing, and Predicting Human Behavior from Sensor Data. Hoboken, NJ, USA: John Wiley & Sons, 2015.
 [20] S. Feldhorst, M. Masoudenijad, M. ten Hompel, and G. A. Fink, “Motion classification for analyzing the order picking process using mobile sensors,” in Proc. of the International Conference on Pattern Recognition Applications and Methods, SCITEPRESS, Feb. 2016, pp. 706–713.
 [21] F. J. Ordóñez and D. Roggen, “Deep convolutional and LSTM recurrent neural networks for multimodal wearable activity recognition,” Sensors, vol. 16, no. 1, pp. 115–140, 2016.
 [22] N. Y. Hammerla, S. Halloran, and T. Ploetz, “Deep, convolutional, and recurrent models for human activity recognition using wearables,” Journal of Scientific Computing, vol. 61, no. 2, pp. 454–476, 2016.
 [23] R. Grzeszick, J. M. Lenk, F. M. Rueda, G. A. Fink, S. Feldhorst, and M. ten Hompel, “Deep neural network based human activity recognition for the order picking process,” in Proc. of the International Workshop on Sensor-Based Activity Recognition and Interaction, Sep. 2017, pp. 1–6.
 [24] C. A. Ronao and S. B. Cho, “Deep convolutional neural networks for human activity recognition with smartphone sensors,” in International Conference on Neural Information Processing (ICONIP), Dec. 2015, pp. 46–53.
 [25] M. Z. Uddin, W. Khaksar, and J. Torresen, “Facial expression recognition using salient features and convolutional neural network,” IEEE Access, vol. 5, pp. 26146–26161, 2017.
 [26] J. Li, G. Li, and H. Fan, “Image dehazing using residual-based deep CNN,” IEEE Access, vol. 6, pp. 26831–26842, 2018.
 [27] Y. Kim. (Aug. 2014). “Convolutional neural networks for sentence classification.” [Online]. Available: https://arxiv.org/abs/1408.5882
 [28] J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional networks for semantic segmentation,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Jun. 2015, pp. 3431–3440.
 [29] K. Xu, J. Ba, R. Kiros, K. Cho, Y. Bengio, et al., “Show, attend and tell: Neural image caption generation with visual attention,” in International Conference on Machine Learning (ICML), Jul. 2015, pp. 2048–2057.
 [30] L. Sun, K. Jia, D. Y. Yeung, and B. E. Shi, “Human action recognition using factorized spatio-temporal convolutional networks,” in IEEE International Conference on Computer Vision (ICCV), Dec. 2015, pp. 4597–4605.
 [31] D. J. Toms, “Training binary node feedforward neural networks by back propagation of error,” Electronics Letters, vol. 26, no. 21, pp. 1745–1746, 1990.
 [32] F. Li, B. Zhang, and B. Liu. (May 2016). “Ternary weight networks.” [Online]. Available: https://arxiv.org/abs/1605.04711
 [33] S. Zhou, Y. Wu, Z. Ni, X. Zhou, H. Wen, and Y. Zou. (Jun. 2016). “DoReFa-Net: Training low bitwidth convolutional neural networks with low bitwidth gradients.” [Online]. Available: https://arxiv.org/abs/1606.06160
 [34] I. Hubara, M. Courbariaux, D. Soudry, R. El-Yaniv, and Y. Bengio. (Sep. 2016). “Quantized Neural Networks: Training neural networks with low precision weights and activations.” [Online]. Available: https://arxiv.org/abs/1609.07061
 [35] A. Zhou, A. Yao, Y. Guo, L. Xu, and Y. Chen. (Feb. 2017). “Incremental network quantization: Towards lossless CNNs with low-precision weights.” [Online]. Available: https://arxiv.org/abs/1702.03044
 [36] W. Wen, C. Xu, F. Yan, C. Wu, Y. Wang, et al., “TernGrad: Ternary gradients to reduce communication in distributed deep learning,” in Advances in Neural Information Processing Systems (NIPS), Dec. 2017, pp. 1508–1518.
 [37] Y. Guo, A. Yao, H. Zhao, and Y. Chen, “Network sketching: Exploiting binary structure in deep CNNs,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Jul. 2017, pp. 4040–4048.
 [38] G. Davis, S. Mallat, and M. Avellaneda, “Adaptive greedy approximations,” Constructive Approximation, vol. 13, no. 1, pp. 57–98, 1997.
 [39] Y. Bengio. (Aug. 2013). “Estimating or propagating gradients through stochastic neurons.” [Online]. Available: https://arxiv.org/abs/1308.3432
 [40] F. Li, K. Shirahama, M. A. Nisar, L. Köping, and M. Grzegorzek, “Comparison of feature learning methods for human activity recognition using wearable sensors,” Sensors, vol. 18, no. 2, pp. 679–701, 2018.
 [41] F. M. Rueda and G. A. Fink. (Feb. 2018). “Learning attribute representation for human activity recognition.” [Online]. Available: https://arxiv.org/abs/1802.00761
 [42] D. Alistarh, D. Grubic, J. Li, R. Tomioka, and M. Vojnovic, “QSGD: Communication-efficient SGD via gradient quantization and encoding,” in Advances in Neural Information Processing Systems (NIPS), Dec. 2017, pp. 1707–1718.
 [43] S. Gupta, A. Agrawal, K. Gopalakrishnan, and P. Narayanan, “Deep learning with limited numerical precision,” in International Conference on Machine Learning (ICML), Jul. 2015, pp. 1737–1746.
 [44] W. Mula, N. Kurz, and D. Lemire. (Nov. 2016). “Faster population counts using AVX2 instructions.” [Online]. Available: https://arxiv.org/abs/1611.07612
 [45] S. Ioffe and C. Szegedy. (Feb. 2015). “Batch Normalization: Accelerating deep network training by reducing internal covariate shift.” [Online]. Available: https://arxiv.org/abs/1502.03167
 [46] R. Chavarriaga, H. Sagha, A. Calatroni, S. T. Digumarti, and D. Roggen, “The Opportunity challenge: A benchmark database for on-body sensor-based activity recognition,” Pattern Recognition Letters, vol. 34, no. 15, pp. 2033–2042, 2013.
 [47] A. Reiss and D. Stricker, “Introducing a new benchmarked dataset for activity monitoring,” in The 16th IEEE International Symposium on Wearable Computers (ISWC), Jun. 2012, pp. 108–109.
 [48] D. Micucci, M. Mobilio, and P. Napoletano. (Nov. 2016). “UniMiB SHAR: A new dataset for human activity recognition using acceleration data from smartphones.” [Online]. Available: https://arxiv.org/abs/1611.07688
 [49] M. D. Zeiler. (Dec. 2012). “ADADELTA: An adaptive learning rate method.” [Online]. Available: https://arxiv.org/abs/1212.5701