Adversarial Unsupervised Representation Learning for Activity TimeSeries
Abstract
Sufficient physical activity and restful sleep play a major role in the prevention and cure of many chronic conditions. Being able to proactively screen and monitor such chronic conditions would be a big step forward for overall health. The rapid increase in the popularity of wearable devices provides a significant new data source, making it possible to track the user’s lifestyle in real time. In this paper, we propose a novel unsupervised representation learning technique called activity2vec that learns and “summarizes” discrete-valued activity timeseries. It learns the representations with three components: (i) the cooccurrence and magnitude of the activity levels in a timesegment, (ii) the neighboring context of the timesegment, and (iii) promoting subject invariance with adversarial training. We evaluate our method on four disorder prediction tasks using linear classifiers. Empirical evaluation demonstrates that our proposed method scales and performs better than many strong baselines. The adversarial regime helps improve the generalizability of our representations by promoting subject-invariant features. We also show that using representations at the level of a day works best, since human activity is structured in terms of daily routines.
Introduction
Physical activity and sleep are crucial to health and wellbeing. Requisite activity and sufficient sleep prevent various illnesses such as diabetes and depression [\citeauthoryearWarburton, Nicol, and Bredin2006]. The rise in chronic conditions, mainly due to aging and unhealthy lifestyles, is putting our healthcare system under stress. In the current approach to sleep-disorder screening, subjects have to go through different diagnosis steps involving questionnaires and polysomnography (PSG). With the increasing popularity of wearable devices like Fitbit, which collect detailed data about the body’s movements, there is increased interest in using actigraphy for detecting sleep-related disorders and tracking longitudinal changes in a subject’s condition. Although of much lower fidelity than clinical devices, wearables provide a novel opportunity owing to their nonintrusive and realtime capabilities; specifically, if we can develop techniques to extract information from the vast amounts of body-monitoring data. Such techniques could assist healthcare professionals as well as help monitor behavioral therapy, e.g., exercise. This application is an essential motivation of this work.
Only a minuscule proportion of the population has both clinical data and wearables data available. Thus a purely supervised approach utilizing the wearables-clinical data corpus is suboptimal, since it renders the activity data from a majority of subjects redundant. Hence, any approach towards using activity signals should utilize unsupervised learning. A second important aspect is that the information in actigraphy signals depends on the subjects and their environments, such as their routines and surroundings [\citeauthoryearStorm, Heller, and Mazzà2015], along with measurement errors owing to device design. We propose a two-pronged approach. First, a mapping of the temporal relations of the activities from the timeseries to a feature space should be learned. Second, this feature space should take the subject’s environment into account, making the representations invariant to the subject and their environment.
We propose a new method that addresses the challenges mentioned above. The core of our method is an unsupervised representation learning model, activity2vec, that learns distributed representations for activity signals spanning a time segment (e.g., at the day level) in a subject-invariant manner. We use two public datasets to evaluate our approach against baselines on four disorder prediction tasks. Using a linear classifier (logistic regression), we show that our proposed representation learning method outperforms the baseline timeseries representation methods by a good margin, with day-level timesegment representations performing the best. The linear classifier with our learned features performs at par with the nonlinear convolutional neural network baseline trained end-to-end on the classification tasks. We also demonstrate the effectiveness of the subject-invariance loss in inducing subject-invariant features.
Unlike traditional timeseries methods, our feature vectors can be used to boost the performance of the supervised learning models. It has been shown that using unsupervised pretrained vectors to initialize the supervised models produces better performance [\citeauthoryearHinton et al.2012]. Our method is general enough to be applied to other domains with similar timeseries like activity tracking through smartphones, traffic monitoring, or other types of sensor data. Specifically, we make the following contributions:

Unsupervised scalable embeddings for activity timeseries: To utilize enormous unlabeled human activity data, we propose a novel unsupervised representation learning method activity2vec that uncovers activity patterns through distributed representation in a scalable fashion, which can be leveraged towards prediction tasks.

A hierarchical model of representation: One of the persistent challenges in the timeseries domain is selecting the granularity of the timesegments that serve as the primary analysis units. We explore learning representations at various levels of time granularity. We devise a novel algorithm that optimizes two different measures to capture local and global patterns in the activity timeseries, along with an ordinal loss to account for the natural ordering in the activity signals.

Subject-invariant representations: Noise from the subjects’ environments can hinder generalization. In order to make the representations invariant to subject environments, we train them with a subject-invariance loss in an adversarial training setting.
Related Work

Human activity for health informatics: Wearable sensors have mostly been used for the human activity recognition (HAR) task in machine learning [\citeauthoryearBulling, Blanke, and Schiele2014, \citeauthoryearAlsheikh et al.2016], while medical practitioners perform manual examination of actigraphy data, mostly for diagnosing sleep disorders [\citeauthoryearSadeh2011]. Recent works have tried using actigraphy data for quantifying sleep quality using deep learning [\citeauthoryearSathyanarayana et al.2016] or for actively monitoring human behavioral patterns [\citeauthoryearAlthoff et al.2017]. The novelty of our method is that we propose task-agnostic models rather than plain supervised learning.

Representation learning: \citeauthorbengio2013representation provide an overview of representation learning, used to construct a space that is discriminative for downstream tasks. It is based on ideas of better network convergence by adding (unsupervised) pretrained vectors that encode the mutual information between the input features [\citeauthoryearGoodfellow, Bengio, and Courville2016]. Recently, the area has made enormous progress in NLP, vision, and speech [\citeauthoryearCollobert et al.2011, \citeauthoryearHinton et al.2012]. Of particular interest are the distributed bagofwords (DBOW) architectures [\citeauthoryearMikolov et al.2013, \citeauthoryearGrover and Leskovec2016, \citeauthoryearSaha et al.2017] optimized to predict the context of the structure, unlike continuousbagofwords (CBOW) that predicts the structure from its context. In a similar fashion, we use DBOW to capture local patterns in a time segment. Our novelty lies in integrating it in an adversarial setting [\citeauthoryearGanin et al.2016] with an unsupervised predictor, which is a combination of DBOW with global context and ordinal constraints for activity timeseries.

Time series analysis literature: These methods mainly use the concept of pairwise similarity to perform classification [\citeauthoryearBagnall et al.2017] and clustering tasks, with a distance metric like Euclidean. Dynamic Time Warping [\citeauthoryearBerndt and Clifford1994] is a widely used technique for finding similarity between two timeseries, but is computationally expensive due to its pairwise similarity approach. This has led to the creation of symbolic representation techniques like SAX [\citeauthoryearLin et al.2007] and BOSS [\citeauthoryearSchäfer2015], which convert a timeseries into a symbolic sequence based on amplitudes or frequency analysis, respectively. SAXVSM and BOSSVS [\citeauthoryearSchäfer2016] use a tf-idf (term frequency-inverse document frequency) transformation of these symbolic sequences to get vector representations of sequence windows. HCTSA [\citeauthoryearFulcher and Jones2014] is an unsupervised timeseries feature extraction engine with a space of over 7,800 features based on frequency-domain and time-domain analysis of the timeseries, unlike the supervised vector space models above. These timeseries models, however, cannot complement supervised learning models, unlike our model’s embeddings, which can be used to initialize architectures trained with backpropagation, like neural networks.
Our Approach
In this section, we describe our method activity2vec. We first describe challenges, followed by the model components that address these challenges.
Challenges for activity2vec
To create a representational schema for activity timeseries, the first natural challenge is determining the right granularity of the analysis unit. For example, consider the sample timeseries in Figure 1, where the x-axis represents time in 30-second epochs and the y-axis represents the activity levels (or counts), which in our setting are discrete values ranging from 0 to 5000. Learning vector representations for each activity level might result in sparse vectors that are too fine-grained to be effective in the downstream tasks. Similarly, learning a representation with too big an analysis unit (e.g., spanning a week) could result in generic vectors lacking the required discriminative power. As we demonstrate later in our experiments, the right level of granularity is somewhere in between (a day span). Within the analysis units, the relative magnitude of the signal values (e.g., ‘10’ < ‘15’) should also be taken into account.
Considering a granularity of analysis unit shorter than the sequence length poses another challenge – how to capture the global contextual dependencies between the units. Since the units are parts of a sequence that describes a person’s activity over a timespan, they are likely to be interdependent, which should be captured in the representations. Moreover, the same activity can look very different across subjects owing to subject-specific noise and environment, e.g., how they wear the device on their wrist. Our activity2vec model addresses these challenges as described in the next section.
Problem Formulation and Time Granularity
Table 1: Notation.

P : set of subjects
D : timeseries dataset
x^(i) : ith timeseries sequence from set D
s : timeseries segment
S : set of timesegments
g : time granularity of the model
d : dimension of the distributed representation space
Φ : embedding matrix representing the mapping function
z_s : representation of timesegment s
w : a timeseries value/symbol in the segment
c : a neighboring segment of the timesegment under consideration
N(s) : neighborhood of timesegment s
L_seg(s, w) : segment-specific loss based on the pair (s, w)
L_ord(s, w) : ordinal regression loss based on the pair (s, w)
L_ngh(s, c) : neighbor prediction loss based on the pair (s, c) with c ∈ N(s)
L_sm(s, s') : smoothing loss based on the pair (s, s') with s' ∈ N(s)
L_adv(s, p) : adversarial loss based on the pair (s, p) with p ∈ P
Let D denote a set of activity sequences for the unique subjects in P, where each sequence is a length-T activity signal (e.g., step counts) for a subject over a time period (a week here). Let g ∈ {30 sec, 1 hour, 1 day, 1 week} specify the granularity we want to encode. We first break each sequence into consecutive time segments of equal length based on g (top of Figure 2). Let s be such a segment of length g that starts at time t. Our aim is to learn a mapping function Φ to represent each time segment by a d-dimensional representation z_s. Equivalently, if we represent each time segment in the dataset with a unique identifier (ID), the mapping function is a lookup operation on an embedding matrix Φ of a single hidden layer neural network with nonlinear activations. Our goal is to learn the embedding matrix Φ by considering the segment’s content and context while promoting subject invariance. A sequence-level representation can be obtained by concatenating (or averaging) the segment-level vectors; we use concatenation. In this work, we consider the following time spans for a comparative analysis:

Sample (sample2vec): representation for each 30-second sample. Our 20,160-length timeseries yields a representation space of dimension 20,160 × d.

Hour (hour2vec): representation for one-hour chunks, producing a vector space of dimension 168 × d.

Day (day2vec): embeds the timeseries at the level of a day span, giving us a representational space of dimension 7 × d.

Week (week2vec): provides embeddings at the scale of a week, yielding a single vector in d-dimensional space.
For a given granularity level, activity2vec learns the mapping function Φ. While it is possible to use a preprocessing step with changepoint detection (CPD) to obtain the timesegments instead of manually setting the granularity, this step needs careful adjustment of the thresholds for CPD models, adding further complexity to the model. In principle, we could search over the space of possible granularities instead of using the predefined set above. We skip that step in this work for the sake of simplicity. Instead, we demonstrate the model’s behavior for choices of granularity that are intuitive to humans. Figure 2 presents the graphical flow of activity2vec.
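The segmentation step above can be sketched as follows; this is a minimal illustration assuming the paper's 30-second sampling and week-long sequences (the function and constant names are ours, not the authors' code).

```python
import numpy as np

# Samples per unit, given one sample every 30 seconds (2 per minute).
GRANULARITY = {"sample": 1, "hour": 120, "day": 2880, "week": 20160}

def segment(series, level):
    """Break a 1-D activity series into consecutive equal-length segments;
    any trailing partial segment is dropped."""
    seg_len = GRANULARITY[level]
    n = len(series) // seg_len
    return series[: n * seg_len].reshape(n, seg_len)

# One week of discrete activity counts in [0, 5000].
week = np.random.randint(0, 5001, size=20160)
days = segment(week, "day")
hours = segment(week, "hour")
```

Each row of `days` (or `hours`) would then receive a unique segment ID and a looked-up embedding.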
Our model relates to sequential methods like LSTMs in that it takes into account global temporal dependencies. Considering the intersegment and intrasegment context is analogous to a shallow autoencoder with very dense localized connections and sparse neighboring connections. While we leverage the discrete-valued nature of our timeseries, the method can be applied to continuous-valued timeseries via discretization, as done in the traditional timeseries literature. The adversarial setting is motivated by environmental noise, and might not apply to other domains. We first present the loss components, and then the combined loss.
Modeling Segment Content in activity2vec
We use two loss functions in activity2vec to capture the content of a segment – the ordinal relation between timeseries values and their cooccurrence patterns.
SegmentSpecific Loss
We use the segment-specific loss to learn a representation for each time segment by predicting its own values. Given an input segment s, we first map it to a unique vector z_s by looking up the corresponding vector in the shared embedding matrix Φ. We then use z_s to predict each symbol w_t sampled randomly from a window in s. However, using a softmax layer for the prediction is computationally very expensive. To compute the prediction loss efficiently, we use Noise-Contrastive Estimation [\citeauthoryearGutmann and Hyvärinen2012] as an alternative to the softmax:
\mathcal{L}_{seg}(s, w_t) = -\log\sigma(u_{w_t}^{\top} z_s) - \sum_{i=1}^{k} \mathbb{E}_{w_i \sim P_n}\left[\log\sigma(-u_{w_i}^{\top} z_s)\right]  (1)
where σ is the sigmoid function defined as σ(x) = 1/(1 + e^{-x}), u_{w_t} and u_{w_i} are the weight vectors associated with the target symbol w_t and the negative symbols w_i, respectively, P_n is the noise distribution from which negative examples are sampled, and k is the number of negative examples sampled for each positive example. In our experiments, we use the unigram distribution as the noise distribution.
Since we ask the same segmentlevel vector to predict its symbols, the model captures the overall pattern of a segment. Note that except for sample2vec, the model learns embeddings for both segments and symbols. A segmentbased approach is commonly used in the timeseries analysis, though among the representational models only models like SAXVSM [\citeauthoryearSenin and Malinchik2013] look at the cooccurrence statistics, with a bagofwords assumption.
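A minimal numeric sketch of a negative-sampling loss of this form may help; the vectors and the `nce_loss` helper below are illustrative stand-ins, not the authors' implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def nce_loss(z, u_pos, U_neg):
    """Negative-sampling loss for one (segment, symbol) pair: pull the
    segment vector z toward the observed symbol's weight vector u_pos,
    push it away from the k sampled noise symbols in U_neg."""
    pos = -np.log(sigmoid(u_pos @ z))
    neg = -np.sum(np.log(sigmoid(-(U_neg @ z))))
    return pos + neg

rng = np.random.default_rng(0)
d = 8
z = rng.normal(size=d)            # looked-up segment embedding
u_pos = rng.normal(size=d)        # weight vector of the observed symbol
U_neg = rng.normal(size=(5, d))   # k = 5 negatives drawn from the noise dist.
loss = nce_loss(z, u_pos, U_neg)
```

The same form is reused later for the neighbor-prediction loss, with segment IDs in place of symbols.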
Ordinal Regression Loss
When predicting an activity symbol, the segment-specific loss above treats each symbol independently. However, since the symbols represent activity levels, there is a natural ordering in their values (e.g., ‘5’ > ‘1’), which should be preserved in the representations. In order to embed this ordinal relation, we use an ordinal regression loss while learning the representation for each activity value:
\mathcal{L}_{ord}(s, w) = -\sum_{j=1}^{V-1} \Big( \mathbb{1}[w \le j]\,\log P(w \le j) + \mathbb{1}[w > j]\,\log\big(1 - P(w \le j)\big) \Big)  (2)
where σ is the sigmoid function as defined before, 1[·] is the indicator function, and V is the number of distinct discrete values that the timeseries can take (0–5000 in our case). Here, P(w ≤ j) is the cumulative probability of w being at most j, with θ_j being the ordered threshold for the regression, such that θ_1 ≤ θ_2 ≤ … ≤ θ_{V-1}.
Remark: In principle, we can integrate the ordinal relation in Equation 1 with the NCE loss. However, owing to the hierarchical nature of our algorithm, the segment ID is also predicted while learning the symbols and segments representations. Since these segmentlevel IDs do not have an ordinal relation with the symbol IDs, we resort to using an ordinal loss applicable only to timeseries symbols.
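The ordinal loss can be illustrated with an all-threshold construction, one common form of ordinal regression; the scoring and threshold parameterization below is our assumption for the sketch, not necessarily the authors' exact formulation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def ordinal_loss(score, label, thresholds):
    """All-threshold ordinal loss for one activity symbol.
    P(w <= j) = sigmoid(theta_j - score); each ordered threshold theta_j
    contributes a binary cross-entropy term, so mistakes between distant
    activity levels cost more than mistakes between adjacent ones."""
    cum = sigmoid(thresholds - score)                      # P(w <= j) for each j
    below = (label <= np.arange(len(thresholds))).astype(float)  # 1[w <= j]
    return -np.sum(below * np.log(cum) + (1.0 - below) * np.log(1.0 - cum))

thresholds = np.array([-1.0, 0.0, 1.0])   # ordered: theta_1 <= ... <= theta_{V-1}
low_score = -2.0                          # a score suggesting a low activity value
```

With a low score, predicting the lowest level is penalized far less than the highest, which is exactly the ordering the plain NCE loss cannot express.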
Modeling Segment Context in activity2vec
Loss functions presented above capture local patterns in a segment. However, since the segments are contiguous and describe activities of a person, they are likely to be related. For example, after a strenuous activity, there might be lighter activity periods. Likewise, one can expect a smooth transition from one segment to the next. The algorithm should capture relations between proximal segments. With this motivation, we use two loss functions to model this relationship.
Neighbor Context Loss
Similar to activity symbols, each segment in the dataset is assigned a unique identifier that we can use to look up its corresponding vector in the embedding matrix or to predict the segment ID. We first formulate the relation between neighboring segments by asking the current segment vector z_s (i.e., the looked-up vector for segment s) to predict its neighboring segments in the timeseries. If c is a neighbor of s, the neighbor context loss is the neighbor prediction task using NCE as before:
\mathcal{L}_{ngh}(s, c) = -\log\sigma(u_c'^{\top} z_s) - \sum_{i=1}^{k} \mathbb{E}_{c_i \sim P_n}\left[\log\sigma(-u_{c_i}'^{\top} z_s)\right]  (3)
where u'_c and u'_{c_i} are the weight vectors associated with the neighboring segment c and the negative segments c_i, respectively. The noise distribution is as described before, over segment IDs.
Smoothing Loss
While the previous objective attempts to capture neighborhood information, we also hypothesize that there is a “continuity” between neighboring segments. The learning algorithm should discourage any abrupt changes in the representation of proximal segments. We apply smoothing between the neighboring segments by minimizing the distance between the representations of the neighbors:
\mathcal{L}_{sm}(s) = \lambda \sum_{s' \in N(s)} \left\| z_s - z_{s'} \right\|_2^2  (4)
where N(s) is the set of timesegments in proximity to s and λ is the smoothing strength parameter. Note that the smoothing loss is not applicable to week2vec.
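A direct sketch of this smoothing penalty: the squared L2 distance to each neighboring segment's embedding, scaled by the smoothing strength (all names are illustrative).

```python
import numpy as np

def smoothing_loss(Z, idx, neighbors, lam=0.25):
    """Penalize abrupt changes between segment idx's embedding and its
    temporal neighbors: lam * sum of squared L2 distances."""
    z = Z[idx]
    return lam * sum(np.sum((z - Z[j]) ** 2) for j in neighbors)

# Three toy segment embeddings; segment 1 has neighbors 0 and 2.
Z = np.array([[0.0, 0.0],
              [1.0, 1.0],
              [1.0, 0.0]])
loss = smoothing_loss(Z, 1, [0, 2])
# ||(1,1)-(0,0)||^2 = 2 and ||(1,1)-(1,0)||^2 = 1, so loss = 0.25 * 3 = 0.75
```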
Modeling Subject Invariance in activity2vec
One challenge in dealing with human activity data is that it heavily depends on the subject’s environment. To promote subject invariance, we use an adversary loss [\citeauthoryearGanin et al.2016]. We use a multiclass classifier as a discriminator to predict the timesegment’s source subject p ∈ P from the encoded segment representation z_s. In other words, the discriminator tries to identify the subject from whose activity timeseries the encoder (the embedding matrix) has derived the segment’s representation. Note that we want to emphasize invariance with respect to the source subject of the timesegment, not the particular sequence from which the segment is derived. Hence, if we have multiple sequences for the same subject, the sequences share the same subject label in our model. Formally, the discriminator is defined by a softmax:
P(y = p \mid z_s; \theta_D) = \frac{\exp(v_p^{\top} z_s)}{\sum_{p' \in P} \exp(v_{p'}^{\top} z_s)}  (5)
where v_p is the weight vector associated with subject p, and θ_D defines all the parameters of the discriminator. We use a cross-entropy loss to optimize the subject discriminator:
\mathcal{L}_D(s) = -\log P(y = p_s \mid z_s; \theta_D)  (6)
where p_s is the subject from whom the segment s is encoded.
With a goal to promote subject invariance, we put the encoder (the embedding matrix) in adversary with the discriminator. The encoder attempts to encode features that are indistinguishable to the subject discriminator by minimizing (negative of discriminator loss):
\mathcal{L}_{adv}(s) = -\mathcal{L}_D(s) = \log P(y = p_s \mid z_s; \theta_D)  (7)
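The discriminator and its adversarial counterpart can be sketched as follows; the softmax-over-subjects form matches the description above, while the variable names are ours.

```python
import numpy as np

def subject_probs(z, V):
    """Softmax over subject identities given a segment embedding z;
    row V[p] is the weight vector for subject p."""
    logits = V @ z
    logits -= logits.max()          # shift for numerical stability
    p = np.exp(logits)
    return p / p.sum()

def discriminator_loss(z, V, subject):
    """Cross-entropy: how badly the discriminator identifies the subject."""
    return -np.log(subject_probs(z, V)[subject])

def adversarial_loss(z, V, subject):
    """Encoder side of the game: minimize the *negative* of the
    discriminator loss, i.e., make the true subject harder to identify."""
    return -discriminator_loss(z, V, subject)

rng = np.random.default_rng(1)
V = rng.normal(size=(4, 8))   # 4 subjects, 8-dim embeddings
z = rng.normal(size=8)
probs = subject_probs(z, V)
```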
Combined Loss for activity2vec
We define our activity2vec model as the combination of the losses described in Equations 1, 2, 3, 4, and 7:
\mathcal{L}_{a2v} = \sum_{s \in S} \left( \mathcal{L}_{seg} + \mathcal{L}_{ngh} + \mathcal{L}_{sm} + \alpha\,\mathcal{L}_{ord} + \beta\,\mathcal{L}_{adv} \right)  (8)
where α and β are the relative weights for the ordinal regression loss and the subject invariance loss, respectively. Concurrently, we also minimize the discriminator loss in Equation 6. As shown at the right in Figure 2, the training of activity2vec involves an optimization game between three players: the encoder (E), the combined predictor (F), and the discriminator (D). E plays a cooperative game with F to allow it to induce the necessary information. E also plays an adversarial minmax game against D to prevent it from identifying the subject from the encoded vector, promoting subject invariance. Algorithm 1 presents our training procedure based on stochastic gradient descent (SGD).
The main challenge in adversarial training is to balance the two components – the combined loss in Eq. 8 vs. the discriminator loss in Eq. 6, as shown (right) in Figure 2. If one player becomes smarter, its loss signal overwhelms the shared encoder (embedding matrix), and the training fails to converge [\citeauthoryearArjovsky, Chintala, and Bottou2017]. In our experiments, the discriminator converges much faster. To stabilize the training, we update the discriminator only once every five gradient steps of the algorithm, chosen randomly. Also, we follow the weighting schedule proposed by [\citeauthoryearGanin et al.2016, p. 21], which initializes the adversarial loss weight to 0 and then gradually increases it to 1 as training progresses. Through our experiments, we demonstrate that the intuitions captured by the components are synergistic, since they improve performance incrementally as the components are added.
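The alternating schedule described above can be sketched as below. The ramp follows Ganin et al.'s 2/(1 + e^{-10p}) - 1 schedule, and the discriminator update is drawn at random on roughly one in five steps; the function names and callback structure are ours, not the authors' code.

```python
import math
import random

def adv_weight(progress, gamma=10.0):
    """Ganin-style schedule: weight grows from 0 toward 1 over training.
    `progress` is the fraction of training completed, in [0, 1]."""
    return 2.0 / (1.0 + math.exp(-gamma * progress)) - 1.0

def train(steps, update_discriminator, update_encoder_predictor):
    """Alternating SGD: E and F update every step; D updates on roughly
    one in five randomly chosen steps so it cannot converge too fast."""
    random.seed(0)
    for t in range(steps):
        beta = adv_weight(t / max(steps - 1, 1))
        update_encoder_predictor(beta)      # minimize Eq. 8 with current weight
        if random.random() < 0.2:
            update_discriminator()          # minimize Eq. 6

counts = {"D": 0, "EF": 0}
train(1000,
      lambda: counts.__setitem__("D", counts["D"] + 1),
      lambda beta: counts.__setitem__("EF", counts["EF"] + 1))
```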
Experimental Settings
In this section, we describe experimental settings — datasets, the prediction tasks on which we evaluate the embeddings, the baselines models, and parameter selection.
Datasets
We use the Study of Latinos (SOL) [\citeauthoryearSorlie et al.2010] and Multi-Ethnic Study of Atherosclerosis (MESA) [\citeauthoryearBild et al.2002] datasets. SOL has physical activity and clinical data for 1887 subjects, while MESA has only activity data for 2237 subjects. This simulates the scenario where disorder condition labels are available only for a portion of users. These datasets contain activity data (actigraphy) from each subject for a minimum of 7 days, measured with a wrist-worn Philips Actiwatch Spectrum. The timeseries for each subject is sampled at a rate of one sample per 30 seconds. Actigraphy records the mean activity count per 30 seconds, providing us with a signal that takes only integer values. The mean activity count is reported in ZCM (Zero Crossing Mode), which counts the number of times the waveform crosses zero. Since these counts capture movement intensity, the approach generalizes to signals like step counts. This makes embedding the input symbols straightforward, without a dedicated discretization step for our proposed activity2vec method.
The datasets are not skewed with respect to the class distribution of the different prediction tasks described in the next section. The very few missing values observed in the dataset were replaced by an unknown (UNK) token. Any future unseen or out-of-vocabulary (OOV) signal value can be handled by a procedure that assigns a representation by averaging those of neighboring ordinal signal values, rather than a generic unknown-symbol assignment.
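The OOV handling described above might look like the following sketch; the helper and the toy embeddings are hypothetical, serving only to illustrate "average the nearest known ordinal values".

```python
import numpy as np

def oov_vector(value, embeddings, radius=2):
    """Represent an unseen activity value by averaging the embeddings of
    its `radius` nearest known values by magnitude, exploiting the
    ordinal structure of the symbols instead of a generic UNK vector."""
    nearest = sorted(embeddings, key=lambda v: abs(v - value))[:radius]
    return np.mean([embeddings[v] for v in nearest], axis=0)

# Toy symbol embeddings for known activity values 0, 10, and 20.
emb = {0: np.array([0.0, 0.0]),
       10: np.array([1.0, 0.0]),
       20: np.array([0.0, 1.0])}
v = oov_vector(12, emb)   # nearest known values to 12 are 10 and 20
```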
Prediction Tasks
We evaluate the effectiveness of the learned embeddings on the following health disorder prediction tasks:

Sleep Apnea: Sleep apnea syndrome is a sleep disorder characterized by reduced respiration during sleep. We use the Apnea-Hypopnea Index (AHI) at the 3% desaturation level, with AHI < 5 characterized as non-apneic and AHI ≥ 5 indicating mild-to-severe apnea.

Diabetes: Diabetes (type 2) is the body’s insensitivity to insulin, leading to elevated levels of blood sugar. The task is defined as a three-class classification problem: deciding whether a subject is non-diabetic, pre-diabetic, or diabetic.

Hypertension: Hypertension refers to abnormally high blood pressure, an indicator of stress. Hypertension prediction is a binary classification problem for increased blood pressure.

Insomnia: Insomnia is a sleep disorder characterized by an inability to fall asleep easily, leading to low energy levels during the day. We use a 3-class prediction problem for classifying subjects into non-insomniac, pre-insomniac, and insomniac groups.
Baseline Models
We compare our method with a number of naive baselines and existing systems that use timeseries representations:
(a) Majority Class: This baseline always predicts the class that is most frequent in a dataset.
(b) Random: This baseline randomly picks a class label.
(c) SAXVSM: SAXVSM [\citeauthoryearSenin and Malinchik2013] uses SAX, one of the most widely used timeseries representation techniques.
(d) BOSS: BOSS [\citeauthoryearSchäfer2015] is a symbolic representational learning technique that uses Discrete Fourier Transform (DFT) of timeseries windows. BOSS creates equal sized bins from histograms of DFT coefficients, which are then assigned representational symbols. Labels are assigned based on the class that gets the highest similarity score using nearest neighbor approach.
(e) BOSSVS: BOSS in Vector Space [\citeauthoryearSchäfer2016] is similar to SAXVSM and creates vector space representation of the timeseries from BOSS. BOSS is known to be one of the most accurate methods on standard timeseries classification tasks, with BOSSVS performing marginally lower.
(f) CNN: We use a Convolutional Neural Network (CNN) with Conv, ReLU, AvgPool, BatchNorm, and Dropout layers for supervised prediction. We add layers until we see no performance improvement on the held-out set.
(g) HCTSA: Highly Comparative Time-Series Analysis [\citeauthoryearFulcher and Jones2014] is a timeseries feature extraction engine with over 7,800 features extracted via frequency- and time-domain analysis (e.g., kurtosis). Unlike the timeseries baselines above, this is an unsupervised method, making it the most relevant baseline to our method.
(h) LSTM: We train an LSTM-based unsupervised model similar to [\citeauthoryearSundermeyer, Schlüter, and Ney2012] that learns to encode sequences by predicting the next timeseries value (as a language model does in NLP). Next, we use this pretrained network to initialize an LSTM network that is further trained on the supervised learning tasks.
Variants of activity2vec
(a) Unregularized models: This group of models contains only the two NCE loss components from Equation 8: the segment-specific loss and the neighbor context loss. In the Results section, we refer to these models as sample2vec, hour2vec, day2vec, and week2vec.
(b) Regularized models: We add smoothing loss to models in (a). This group includes hour2vec+Reg and day2vec+Reg. We omit sample2vec+Reg since it performed extremely poorly on all the tasks. Recall that smoothing is not applicable to week2vec.
(c) Ordinal loss model: We add the ordinal loss to the day2vec+Reg model in (b), our best-performing model as discussed in the next section. We omit the other time-unit models for brevity.
(d) Adversarial model: These models add the adversarial loss to our day2vec+Reg with ordinal loss.
Table 2: F1 scores (%) on the four disorder prediction tasks. Diabetes and Insomnia (3-class) report macro and micro F1; Speed is runtime relative to the full model.

Method            Clf.   Sleep   Diabetes        Insomnia        Hyper-    Speed
                         Apnea   macro   micro   macro   micro   tension

Supervised
Majority          0R     00.0    21.7    31.9    47.4    25.5    00.0      -
Random            -      33.9    34.3    31.3    36.2    30.0    33.4      -
SAXVSM            -      00.0    38.6    24.3    47.4    25.5    00.0      -
BOSS              -      17.6    38.9    31.5    49.8    34.9    29.6      -
BOSSVS            -      11.7    40.1    32.7    47.5    33.1    31.3      -
Task-specific     CNN    41.5    45.2    41.0    50.7    40.1    36.6      2.0x

Unsupervised
sample2vec        LR     36.7    40.0    36.7    42.4    35.3    39.4      -
hour2vec          LR     30.0    41.4    33.3    44.6    28.5    24.4      -
hour2vec+Reg      LR     20.5    42.1    32.0    43.5    28.7    23.1      -
day2vec           LR     36.8    40.9    38.0    45.2    35.8    40.3      0.3x
day2vec+Reg       LR     38.9    41.8    39.5    46.6    39.7    43.4      0.3x
week2vec          LR     14.5    40.6    34.1    44.2    31.5    18.7      -
HCTSA             LR     20.3    40.0    35.0    46.7    33.7    22.0      8.2x
LSTM              LSTM   32.2    41.4    33.3    46.1    30.4    37.8     10.5x
day2vec+Reg+O     LR     40.5    45.3    40.2    50.9    40.3    43.6      0.4x
day2vec+Reg+O+A   LR     43.6    45.8    42.5    55.7    41.4    44.1      1.0x
HyperParameter Selection
We use an 80%/10%/10% split for the train, validation, and test sets, repeated 10 times, and report the mean scores. As mentioned earlier, we only have the disorder task labels for the SOL dataset. For the supervised models we use only the SOL dataset, while for the unsupervised models we use the combined SOL and MESA data. The embedding size was fixed for all the models. The weighting parameters for the ordinal regression loss and the subject invariance loss were chosen to be 0.05 and 0.5, respectively. The remaining hyperparameters in activity2vec are: the window size for the segment-specific loss, the number of neighboring segments, and the regularization strength for day2vec and hour2vec, all tuned on the development set. We chose window sizes of 20, 20, 30, and 50 for sample2vec, hour2vec, day2vec, and week2vec, respectively. Regularization strengths of 0.25 and 0.5 were chosen for day2vec and hour2vec, respectively. A neighbor set size of 2 was chosen. For the CNN baseline, 3-, 4-, 3-, and 3-layered networks were used for sleep apnea, diabetes, insomnia, and hypertension, with a dropout of 0.5, trained with the Adam optimizer. We tuned all the parameters to maximize the F1 scores.
Results and Discussion
In this section, we present our results for the four prediction tasks described in the previous section. The results are presented in Table 2 in four groups: (i) baselines, (ii) existing timeseries representation methods and the CNN, (iii) our unregularized and regularized activity2vec variants, and (iv) our ordinal and adversarial activity2vec models. We show classification performance in terms of F1 scores.

Performance on disorder classification tasks: Since our goal is to evaluate the effectiveness of the learned vectors, we use a simple linear classifier, Logistic Regression (LR), with our activity2vec models. For multiclass classification problems like Diabetes and Insomnia, we use One-vs-All classifiers, tuning for the micro F1 score. We run each experiment 10 times and average the evaluation measures to smooth out any randomness in the results. We notice that day2vec consistently gives absolute 2-4% improvements in F1 over the other activity2vec variants, and 6-20% over the baseline timeseries representation models and the LSTM, on all tasks. The adversarial-ordinal-regularized variant of day2vec gives the best results among all the variants, performing at par with or better than the task-specific supervised CNN on all tasks. Due to the inherently noisy nature of activity timeseries and the long sequence length, the LSTM-based model did not work well.
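Since results are reported as macro and micro F1, a self-contained sketch of how the two metrics differ may be useful; this is a pure-Python illustration with toy labels, not the paper's evaluation code.

```python
def f1_scores(y_true, y_pred, n_classes):
    """Macro F1 averages per-class F1 (each class weighted equally);
    micro F1 pools true/false positives over all classes, so for
    single-label multiclass it equals accuracy."""
    tp = [0] * n_classes
    fp = [0] * n_classes
    fn = [0] * n_classes
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1
            fn[t] += 1

    def f1(tp_c, fp_c, fn_c):
        denom = 2 * tp_c + fp_c + fn_c
        return 2 * tp_c / denom if denom else 0.0

    macro = sum(f1(tp[c], fp[c], fn[c]) for c in range(n_classes)) / n_classes
    micro = f1(sum(tp), sum(fp), sum(fn))
    return macro, micro

# Toy 3-class predictions: one mistake out of four.
macro, micro = f1_scores([0, 0, 1, 2], [0, 1, 1, 2], 3)
```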

Selecting time granularity: Across the tasks, the adversarial ordinal day2vec+Reg outperforms all other models. Models with day-level granularity perform better than all other granularities as well as the baseline methods. Among the activity2vec variants, the week2vec models perform the worst, while hour2vec models perform only slightly better on average. Hour- and week-level models perform similarly to the baseline timeseries methods. The high-dimensional models based on samples (sample2vec) perform better than the hour-level, week-level, and baseline models. day2vec produces marginally better results than sample2vec despite a much lower dimensional space (2880x smaller). The level of granularity makes a large difference in the performance of our models. From the above results, we conclude that while the finest granularity (sample2vec) suffered from sparse, overly fine-grained embeddings, the coarsest granularity (week2vec) produced generic embeddings that lost the ability to discriminate.
Figure 3: Comparison of our unsupervised day2vec+Reg+O+A with LR vs. supervised CNN as a function of labeled data. 
Effect of smoothing: The intuition behind adding the temporal smoothing loss (Eq. 4) was to test the hypothesis that human activities happen in continuity and follow a macro-routine, which should make neighboring time-segments similar in structure. As the results suggest, this continuity hypothesis was misguided at the sample and hour levels. Regularization at the sample level made the embeddings lose their discriminative power for classification, given the noise in activity levels at such a fine granularity. hour2vec+Reg exhibits a significant drop (sleep apnea) or at best on-par performance compared to hour2vec; since humans tend to switch between activity types on the order of hours or less, the continuity hypothesis was inappropriate at the hour level as well. However, adding smoothing does produce gains for day2vec, our best model. We argue that smoothing regularization helps capture a higher-level global context, since humans structure routines at the level of the day while switching between activities on the order of hours or less. This is supported by the periodogram analysis, where day-level frequencies dominate.
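Eq. 4 is not reproduced in this section, but a common form of such a temporal smoothing regularizer penalizes the distance between embeddings of adjacent time-segments; the sketch below is a hypothetical stand-in under that assumption, not the paper's exact loss:

```python
import numpy as np

def smoothing_loss(seg_embs, lam=0.1):
    # Hypothetical temporal smoothing regularizer: penalize the mean squared
    # L2 distance between embeddings of adjacent time-segments, pulling each
    # segment's vector toward its temporal neighbors.
    diffs = seg_embs[1:] - seg_embs[:-1]
    return lam * np.mean(np.sum(diffs ** 2, axis=1))
```
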

Effect of ordinal loss: Adding the ordinal loss improves the accuracy of our models, albeit marginally. While the other loss functions operate on the co-occurrence of activity values in local and global contexts, the ordinal loss explicitly models the relationship between the magnitudes of the activity values, e.g., '25' > '5'. One could argue that similar ordinal values tend to co-occur and are therefore already reflected in the learned embeddings; nevertheless, adding an explicit ordinal constraint helps, though marginally.
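One standard way to encode such a magnitude ordering is a margin-based constraint: for activity values lo < mid < hi, the embedding of lo should lie closer to mid than to hi. The exact ordinal loss used in the paper is not shown in this section, so the hinge form below is an illustrative assumption:

```python
import numpy as np

def ordinal_hinge(emb, lo, mid, hi, margin=1.0):
    # Hypothetical ordinal constraint for activity values lo < mid < hi:
    # penalize configurations where emb[lo] is not closer to emb[mid] than
    # to emb[hi] by at least the margin.
    d_near = np.linalg.norm(emb[lo] - emb[mid])
    d_far = np.linalg.norm(emb[lo] - emb[hi])
    return max(0.0, margin + d_near - d_far)
```
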

Effect of adversarial training: Adding the adversarial loss to the ordinal-regularized models improves the scores across the tasks by 0.5%-3.1% in absolute terms and 2%-7.5% in relative terms. These differences exceed the 95% confidence intervals around the repeated experimental means reported here. We conclude that making the representations invariant to their source (the subjects) removes noise introduced by the subjects and their environments, e.g., how they wear their devices and their individual routines, making the embeddings more generalizable. With the adversarial setting, we create a representation space more relevant to the subjects' health conditions by factoring out the subject source domain.
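Subject-invariance of this kind is commonly implemented with a gradient reversal layer, as in domain-adversarial training (cf. the cited Ganin et al. 2016); whether the paper uses exactly this construction is an assumption here. A minimal sketch of the two passes:

```python
import numpy as np

def grl_forward(x):
    # Forward pass is the identity: the subject discriminator sees the
    # embeddings unchanged.
    return x

def grl_backward(grad, lam=1.0):
    # Backward pass reverses (and scales) the gradient flowing into the
    # encoder, so the encoder is trained to *confuse* the subject
    # discriminator, i.e., to hide subject identity from the embeddings.
    return -lam * grad
```
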
Figure 4 shows the t-SNE [\citeauthoryearMaaten and Hinton2008] plot of regularized day2vec (left) vs. adversarial day2vec embeddings (right) for the SOL dataset subjects. For each subject, we concatenate the embeddings of the constituent day-level time-segments to form the subject representation. We label subjects by lifestyle as determined by the study questionnaire, identifying each subject as highly active, moderately active, or sedentary. With non-adversarial day2vec, we obtain clusters, or subject phenotypes, with clean separation between the lifestyle types within each cluster. Unsurprisingly, the clusters become much more compact in the adversarial setting: the lifestyle classes form markedly separate clusters rather than the within-cluster separations observed in the non-adversarial setting. Hence, the adversary helps activity2vec remove subject-wide variance from the learned embeddings while still capturing the properties of the subject phenotypes. The adversarial loss also improved results on the disorder prediction tasks; by reducing subject-level variance, it helps encode a better global representation.

Scalability: As reflected in Table 2, our model takes much less training time than a deep unsupervised model like the LSTM; in fact, it trains faster than even the supervised CNN, which runs only on the labeled data. Since our model has a single hidden layer (the embedding layer) and uses NCE for training, it is scalable in practical settings compared to deep neural models. Additionally, our model offers more flexibility to incorporate alternative intuitions, including human knowledge (e.g., the day/hour-level representational hierarchy), which would be difficult with other deep methods. More sophisticated deep neural models for the semi-supervised setting, such as deep autoencoders or variational autoencoders, are sensitive to noisy data like ours and would clearly pose scalability challenges.
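NCE is what keeps each update cheap: instead of normalizing over every possible context, the model classifies the observed context against k sampled noise contexts, so the per-update cost is O(k). The sketch below assumes word2vec-style scores (dot products between segment and context vectors); the paper's exact scoring function is not shown in this section:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def nce_loss(pos_score, neg_scores):
    # Noise-contrastive estimation: a binary objective that pushes the score
    # of the true (positive) context up and the scores of k sampled noise
    # contexts down, avoiding a softmax over the full context vocabulary.
    neg_scores = np.asarray(neg_scores, dtype=float)
    return -(np.log(sigmoid(pos_score)) + np.sum(np.log(sigmoid(-neg_scores))))
```
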
Note: We restrict our models to a transductive setting, where representations of test segments are learned along with training segments. However, the approach can easily be applied in an inductive setting, where the representation of an unseen segment is derived with a single backprop step [\citeauthoryearLe and Mikolov2014].
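The inductive step works as in paragraph-vector inference (cf. Le & Mikolov 2014): freeze all trained model weights and update only the unseen segment's embedding against the unsupervised loss. A toy sketch, where `grad_fn` stands in for the gradient of that loss with respect to the single new embedding (the quadratic loss in the test is purely illustrative):

```python
import numpy as np

def infer_embedding(grad_fn, dim, lr=0.5, steps=1):
    # Inductive-inference sketch: all trained model weights stay frozen;
    # only the new segment's embedding e is updated by gradient descent.
    e = np.zeros(dim)
    for _ in range(steps):
        e -= lr * grad_fn(e)  # grad of the unsupervised loss w.r.t. e only
    return e
```
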

Supervised vs. Unsupervised: To show the utility of the unsupervised scheme, Figure 3 compares the supervised CNN against our adversarial day2vec method as a function of the percentage of labeled data. Our model outperforms the CNN across the board, and the gap is drastic when the proportion of labeled data is low, which is usually the case in practice.
Conclusions
In this work, we present activity2vec, a novel unsupervised representation learning technique that encodes human activity time-series by modeling local and global activity patterns. We train our model on two datasets and test it on prediction tasks for four commonly occurring disorders. We find that day-level granularity yields the best representations, which is unsurprising since a day is the natural timescale for a full cycle of human activities. Using only simple linear classifiers, our task-agnostic representations beat the baseline time-series representation models on all disorder prediction tasks, and even perform on par with or better than a supervised convolutional neural network baseline. Our model learns the representations with a combination of non-linear loss functions, giving better performance on multiple tasks with simple linear classifiers. We further demonstrate that adding an adversarial loss to our embedding encoder increases the performance and generalizability of the embeddings. Additionally, unlike existing representation approaches, our method can complement supervised learning via initialization.
References
 [\citeauthoryearAlsheikh et al.2016] Alsheikh, M. A.; Selim, A.; Niyato, D.; Doyle, L.; Lin, S.; and Tan, H.P. 2016. Deep activity recognition models with triaxial accelerometers. In AAAI Workshops.
 [\citeauthoryearAlthoff et al.2017] Althoff, T.; Horvitz, E.; White, R. W.; and Zeitzer, J. 2017. Harnessing the web for populationscale physiological sensing: A case study of sleep and performance. In WWW, 113–122.
 [\citeauthoryearArjovsky, Chintala, and Bottou2017] Arjovsky, M.; Chintala, S.; and Bottou, L. 2017. Wasserstein GAN. CoRR abs/1701.07875.
 [\citeauthoryearBagnall et al.2017] Bagnall, A.; Lines, J.; Bostrom, A.; Large, J.; and Keogh, E. 2017. The great time series classification bake off: a review and experimental evaluation of recent algorithmic advances. DMKD 31(3):606–660.
 [\citeauthoryearBengio, Courville, and Vincent2013] Bengio, Y.; Courville, A.; and Vincent, P. 2013. Representation learning: A review and new perspectives. TPAMI 35(8):1798–1828.
 [\citeauthoryearBerndt and Clifford1994] Berndt, D. J., and Clifford, J. 1994. Using dynamic time warping to find patterns in time series. In KDD workshop, volume 10, 359–370. Seattle, WA.
 [\citeauthoryearBild et al.2002] Bild, D. E.; Bluemke, D. A.; Burke, G. L.; Detrano, R.; Diez Roux, A. V.; Folsom, A. R.; Greenland, P.; Jacobs Jr., D. R.; Kronmal, R.; Liu, K.; et al. 2002. Multi-ethnic study of atherosclerosis: objectives and design. American journal of epidemiology 156(9):871–881.
 [\citeauthoryearBulling, Blanke, and Schiele2014] Bulling, A.; Blanke, U.; and Schiele, B. 2014. A tutorial on human activity recognition using bodyworn inertial sensors. ACM CSUR 46(3):33.
 [\citeauthoryearCollobert et al.2011] Collobert, R.; Weston, J.; Bottou, L.; Karlen, M.; Kavukcuoglu, K.; and Kuksa, P. 2011. Natural language processing (almost) from scratch. JMLR 12(Aug):2493–2537.
 [\citeauthoryearFulcher and Jones2014] Fulcher, B. D., and Jones, N. S. 2014. Highly comparative featurebased timeseries classification. TKDE 26(12):3026–3037.
 [\citeauthoryearGanin et al.2016] Ganin, Y.; Ustinova, E.; Ajakan, H.; Germain, P.; Larochelle, H.; Laviolette, F.; Marchand, M.; and Lempitsky, V. 2016. Domainadversarial training of neural networks. JMLR 17(1):2096–2030.
 [\citeauthoryearGoodfellow, Bengio, and Courville2016] Goodfellow, I.; Bengio, Y.; and Courville, A. 2016. Deep Learning. MIT Press.
 [\citeauthoryearGrover and Leskovec2016] Grover, A., and Leskovec, J. 2016. node2vec: Scalable feature learning for networks. In ACM SIGKDD, 855–864.
 [\citeauthoryearGutmann and Hyvärinen2012] Gutmann, M. U., and Hyvärinen, A. 2012. Noisecontrastive estimation of unnormalized statistical models, with applications to natural image statistics. JMLR 2012 13(Feb):307–361.
 [\citeauthoryearHinton et al.2012] Hinton, G.; Deng, L.; Yu, D.; Dahl, G. E.; Mohamed, A.r.; Jaitly, N.; Senior, A.; Vanhoucke, V.; Nguyen, P.; Sainath, T. N.; et al. 2012. Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. IEEE Signal Processing Magazine 29(6):82–97.
 [\citeauthoryearLe and Mikolov2014] Le, Q., and Mikolov, T. 2014. Distributed representations of sentences and documents. In ICML, 1188–1196.
 [\citeauthoryearLin et al.2007] Lin, J.; Keogh, E.; Wei, L.; and Lonardi, S. 2007. Experiencing SAX: a novel symbolic representation of time series. DMKD 15(2):107–144.
 [\citeauthoryearMaaten and Hinton2008] Maaten, L. v. d., and Hinton, G. 2008. Visualizing data using t-SNE. JMLR 9(Nov):2579–2605.
 [\citeauthoryearMikolov et al.2013] Mikolov, T.; Sutskever, I.; Chen, K.; Corrado, G. S.; and Dean, J. 2013. Distributed representations of words and phrases and their compositionality. In NIPS, 3111–3119.
 [\citeauthoryearSadeh2011] Sadeh, A. 2011. The role and validity of actigraphy in sleep medicine: an update. Sleep medicine reviews 15(4):259–267.
 [\citeauthoryearSaha et al.2017] Saha, T. K.; Joty, S.; Hassan, N.; and Hasan, M. A. 2017. Regularized and retrofitted models for learning sentence representation with context. In CIKM, 547–556. ACM.
 [\citeauthoryearSathyanarayana et al.2016] Sathyanarayana, A.; Joty, S.; FernandezLuque, L.; Ofli, F.; Srivastava, J.; Elmagarmid, A.; Taheri, S.; and Arora, T. 2016. Impact of physical activity on sleep: A deep learning based exploration. arXiv preprint arXiv:1607.07034.
 [\citeauthoryearSchäfer2015] Schäfer, P. 2015. The boss is concerned with time series classification in the presence of noise. DMKD 29(6):1505–1530.
 [\citeauthoryearSchäfer2016] Schäfer, P. 2016. Scalable time series classification. DMKD 30(5):1273–1298.
 [\citeauthoryearSenin and Malinchik2013] Senin, P., and Malinchik, S. 2013. SAX-VSM: Interpretable time series classification using SAX and vector space model. In ICDM, 1175–1180. IEEE.
 [\citeauthoryearSorlie et al.2010] Sorlie, P. D.; AvilésSanta, L. M.; WassertheilSmoller, S.; Kaplan, R. C.; Daviglus, M. L.; Giachello, A. L.; Schneiderman, N.; Raij, L.; Talavera, G.; Allison, M.; et al. 2010. Design and implementation of the hispanic community health study/study of latinos. Annals of epidemiology 20(8):629–641.
 [\citeauthoryearStorm, Heller, and Mazzà2015] Storm, F. A.; Heller, B. W.; and Mazzà, C. 2015. Step detection and activity recognition accuracy of seven physical activity monitors. PloS one 10(3):e0118723.
 [\citeauthoryearSundermeyer, Schlüter, and Ney2012] Sundermeyer, M.; Schlüter, R.; and Ney, H. 2012. LSTM neural networks for language modeling. In INTERSPEECH.
 [\citeauthoryearWarburton, Nicol, and Bredin2006] Warburton, D. E.; Nicol, C. W.; and Bredin, S. S. 2006. Health benefits of physical activity: the evidence. Canadian medical association journal 174(6):801–809.