Time Perception Machine: Temporal Point Processes for the When, Where and What of Activity Prediction
Abstract
Numerous powerful point process models have been developed to understand temporal patterns in sequential data from fields such as healthcare, electronic commerce, social networks, and natural disaster forecasting. In this paper, we develop novel models for learning the temporal distribution of human activities in streaming data (e.g., videos and person trajectories). We propose an integrated framework of neural networks and temporal point processes for predicting when the next activity will happen. Because point processes are limited to taking event frames as input, we propose a simple yet effective mechanism to extract features at frames of interest while also preserving the rich information in the remaining frames. We evaluate our model on two challenging datasets. The results show that our model outperforms traditional statistical point process approaches significantly, demonstrating its effectiveness in capturing the underlying temporal dynamics as well as the correlation within sequential activities. Furthermore, we also extend our model to a joint estimation framework for predicting the timing, spatial location, and category of the activity simultaneously, to answer the when, where, and what of activity prediction.
1 Introduction
During the past decades, researchers have made substantial progress in computer vision algorithms that can automatically detect [1, 2, 3] and recognize [4, 5, 6, 7] actions in video sequences. However, the ability to go beyond this and estimate how past actions will affect future activities opens exciting possibilities. A good estimation of future behaviour is an essential sensory component for an automated system to fully comprehend the real world. In this paper, we tackle the problem of estimating the prospective occurrence of future activity. Our goal is to predict the timing, spatial location, and category of the next activity given past information. We aim to answer the when, where, and what questions of activity prediction.
Consider the sports video example shown in Fig. 1. In our work, we directly model the occurrence of discrete activity events that occur in a data stream. Within a sports context, these activities could include key moments in a game, such as passes, shots, or goals. More generally, they could correspond to important human actions along a sequence: such as a person leaving a building, stopping to engage in conversation with a friend, or sitting down on a park bench. Predicting where and when these semantically meaningful events occur would enable many applications within robotics, autonomous vehicles, security and surveillance, and other video processing domains.
Problem Definition.
Let the input be a sequence of $N$ frames. Among these, $n$ ($n \ll N$) frames are each marked by an activity, whose timestamps are denoted as $t_1, \dots, t_n$. Our goal is to estimate when and where the next activity ($n+1$) will happen and what type of activity it will be, given the past sequence of activities and frames up to $t_n$.
Importantly, we are interested in predictions regarding the semantically meaningful, sparsely occurring events within a sequence. This discrete time moment representation for actions is commonplace in numerous applications: e.g., where and when will the next shot take place in this hockey game, where do we need to be to intercept it; from where and when will the next person hail a rideshare, where should we drive to pick him/her up; when is the next nursing home patient going to request assistance, what will he/she request and where will that request be made? Generalizations of this paradigm are possible, where we consider multiple people, such as players in a sports game. We elaborate on this idea and demonstrate that we can model events corresponding to important, actionable inferences.
Following the standard terminology [8], we use the term arrival pattern to refer to the temporal distribution of activities throughout the paper. We wish to model this distribution and infer when and where the next activity will take place. However, in vision tasks the raw input has $N$ frames, whereas we are interested in the $n$ moments, sparsely distributed in the sequence, at which activities commence. Therefore, we need a mechanism to build features from these $n$ frames of interest while also preserving the information in the other, regular frames. To address this problem, we utilize a hierarchical recurrent neural network with skip connections for multi-resolution temporal data processing.
Similar to variational autoencoders [9, 10], which model the distribution of latent variables with deep learning, our model leverages the same advantage of neural networks to fit the arrival pattern (temporal distribution of activities) in the data. A network is used to learn the conditional intensity of a temporal point process and the likelihood is maximized during training. In contrast to traditional statistical approaches that demand expert domain knowledge, our model does not require a handcrafted conditional intensity. Instead, it is automatically learned on top of raw data. We name our model the Time Perception Machine (TPM).
Our work has three main contributions:

Proposing a new task – predicting the occurrence of future activity – for human action analysis, which has not been explored before on streaming data such as videos and person trajectories;

Developing a novel hierarchical RNN with skip connections for feature extraction at finer resolution (frames of interest) while preserving information at coarser resolution;

Formulating a generic conditional intensity and extending the model to a joint prediction framework for the when, where and what of activity forecasting.
2 Related Work
2.1 Activity Forecasting
Seminal work on activity forecasting was done by Kitani et al. [11], who modeled the effect of physical surroundings using semantic scene labeling and inverse reinforcement learning to predict plausible future paths and destinations of pedestrians.
Subsequent work [12] reasons about the long-term behaviors and goals of an individual given first-person visual observations. Similarly, Xie et al. [13] attempted to infer human intents by leveraging agent-based Lagrangian mechanics to model the latent needs that drive people toward functional objects. Park et al. [14] proposed an EgoRetinal map for motion planning from egocentric stereo videos. Vondrick et al. [15] presented a framework for predicting the visual representations of future frames, which is employed to anticipate actions and objects in the future. Unlike the previous work on activity forecasting, which focuses on planning paths and predicting intent, our work addresses a different problem in that we aim to predict the discrete attributes (the when, where, and what) of future activities.
Recent temporal activity detection / prediction methods build on recurrent neural network architectures. These include connectionist temporal classification (CTC) architectures [16, 17]. CTC models conduct classification by generalizing away from actual time stamps, while prediction methods regress actual temporal values. A variety of temporal neural network structures exist (convolutional [18], GRU, LSTM, Phased LSTM [19]), many of which have been applied to activity recognition. Our contribution is complementary in that it focuses on a novel point process model for distributions of discrete events for activity prediction.
2.2 Temporal Point Processes
A temporal point process is a stochastic model used to capture the arrival pattern of a series of events in time. Temporal point processes are studied in various areas including healthcare analysis [20], electronic commerce [21], modeling earthquakes and aftershocks [22], etc.
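A temporal point process model can be fully characterized by the “conditional intensity” quantity, denoted by $\lambda(t \mid \mathcal{H}_t)$, which is conditioned on the past information $\mathcal{H}_t$. The conditional intensity encodes the expected rate of arrivals within an infinitesimal neighborhood at time $t$. Once we determine the intensity, we determine a temporal point process. Mathematically, given the history $\mathcal{H}_t$ up to the event $j$ and the conditional intensity $\lambda(t \mid \mathcal{H}_t)$, we can formulate the probability density function $f(t \mid \mathcal{H}_t)$ and the cumulative distribution function $F(t \mid \mathcal{H}_t)$ for the time of the next event $t_{j+1}$, shown in Eq. 1 and Eq. 2. We defer the full derivation of both formulas to Appendix A.
$$f(t \mid \mathcal{H}_t) = \lambda(t \mid \mathcal{H}_t)\,\exp\Big(-\int_{t_j}^{t} \lambda(u \mid \mathcal{H}_t)\,du\Big) \tag{1}$$
$$F(t \mid \mathcal{H}_t) = 1 - \exp\Big(-\int_{t_j}^{t} \lambda(u \mid \mathcal{H}_t)\,du\Big) \tag{2}$$
For notational convenience, we use “*” to indicate that a quantity is conditioned on the past throughout this paper. For example, $\lambda^*(t) = \lambda(t \mid \mathcal{H}_t)$, $f^*(t) = f(t \mid \mathcal{H}_t)$ and $F^*(t) = F(t \mid \mathcal{H}_t)$. Below we show the conditional intensities of several temporal point process models.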
Poisson Process [23]. $\lambda^*(t) = \lambda$, where $\lambda$ is a positive constant.
Hawkes Process [24]. $\lambda^*(t) = \lambda + \alpha \sum_{t_j < t} \exp\big(-(t - t_j)/\sigma\big)$, where $\lambda$, $\alpha$ and $\sigma$ are positive constants. This process is an “aggregated” process, where one event is likely to trigger a series of other events in a short period of time, but the likelihood drops exponentially with regard to time.
Self-Correcting Process [25]. $\lambda^*(t) = \exp\big(\alpha t - \sum_{t_j < t} \beta\big)$, where $\alpha$ and $\beta$ are positive constants. This process is more “averaged” in time. A previous event is likely to inhibit the occurrence of the next one (by decreasing the intensity). Then the intensity will increase again until the next event happens.
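To make the qualitative difference between the three baselines concrete, the following is a minimal sketch that evaluates each intensity at a query time. It assumes the common textbook parameterizations (constant rate; exponentially decaying excitation; exponential growth with a fixed drop per past event). The function names, parameter names and default values are illustrative assumptions, not values used in the experiments.

```python
import math

def poisson_intensity(t, history, lam=1.0):
    """Homogeneous Poisson process: constant rate, independent of the history."""
    return lam

def hawkes_intensity(t, history, lam=0.5, alpha=0.8, sigma=1.0):
    """Hawkes process: every past event adds an exponentially decaying bump."""
    return lam + alpha * sum(math.exp(-(t - tj) / sigma) for tj in history if tj < t)

def self_correcting_intensity(t, history, alpha=1.0, beta=0.5):
    """Self-correcting process: grows with time, drops by a factor exp(-beta) per past event."""
    n_past = sum(1 for tj in history if tj < t)
    return math.exp(alpha * t - beta * n_past)

events = [1.0, 1.2, 3.0]  # hypothetical event timestamps
for t in (0.5, 1.3, 2.9):
    print(t, poisson_intensity(t, events), hawkes_intensity(t, events),
          self_correcting_intensity(t, events))
```

Right after the two closely spaced events the Hawkes intensity is elevated and then decays, whereas the self-correcting intensity is pushed down by each event and recovers over time.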
Furthermore, a recent work by Du et al. [26] explored temporal process models using neural networks, but only experimented with sparse timestamp data. We extend their approach to dense streaming data with the proposed hierarchical RNN to extract features at frames of interest. Additionally, we demonstrate the effectiveness of a more generic intensity function in modeling the arrival pattern. We also show how a more powerful joint estimation framework can be formulated for simultaneous prediction of the timing, spatial location and category of the next activity event.
3 Model
We will first introduce the hierarchical RNN structure upon which our model is built. Then we will present in detail the formulation and derivation of the proposed model for predicting the timing of future activities. Finally we show how our model can be extended to a joint estimation framework for the simultaneous prediction of the time, location, and category of the next activity.
3.1 Hierarchical RNN
The input to our model is an entire sequence of frames. In our experiments, these include visual data in the form of bounding boxes cropped around people in video sequences and/or representations of human motion trajectories as 2D coordinates of person location over time.
A typical temporal point process model only takes as input the frames annotated with activities. These are very sparse compared to the entire dense sequence of frames ($n \ll N$). We expect these significant frames to contain important features. However, we do not want to lose any information inherent in the remaining ($N - n$) frames. To this end, we need a hierarchical RNN capable of feature extraction at different time resolutions. This is similar in vein to tasks from the natural language processing domain, such as recent work [27, 28, 29] in language modeling, with character-to-word and word-to-phrase networks for feature extraction at multiple scales. More generally, this is an instance of the classic multiple-time-scales problem in recurrent neural networks [30].
In our case, we use a hierarchical RNN model composed of two stacked RNNs. The lower-level RNN looks into the details by covering every frame in the input sequence. The higher-level RNN fixes its attention only on frames of activities so as to capture the temporal dynamics among these significant times. We implement the RNN with LSTM cells. Fig. 2 shows the model structure.
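The two-level idea can be sketched in a few lines. The toy below replaces the LSTM cells with an element-wise tanh recurrence so that it runs without any deep learning library; the weights, the function names and the way the lower-level state is handed to the higher level are assumptions made for illustration, and the skip connections of Fig. 2 are not reproduced.

```python
import math

def rnn_step(state, inputs, w_state=0.5, w_input=1.0):
    """Toy element-wise tanh recurrence, a stand-in for an LSTM cell."""
    return [math.tanh(w_state * s + w_input * x) for s, x in zip(state, inputs)]

def hierarchical_rnn(frames, event_indices):
    """Run the lower level on every frame and the higher level on event frames only.

    Returns one higher-level state per event frame.
    """
    dim = len(frames[0])
    low = [0.0] * dim
    high = [0.0] * dim
    events = set(event_indices)
    high_states = []
    for i, frame in enumerate(frames):
        low = rnn_step(low, frame)        # lower level: sees all frames
        if i in events:
            high = rnn_step(high, low)    # higher level: sees event frames only
            high_states.append(high)
    return high_states

frames = [[0.1, 0.2], [0.3, -0.1], [0.0, 0.5], [0.2, 0.2]]  # hypothetical per-frame features
states = hierarchical_rnn(frames, event_indices=[1, 3])
```

Because the higher level reads the lower-level state, information from non-event frames still reaches the representation used at event frames.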
3.2 Conditional Intensity Function
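Instead of handcrafting the conditional intensity $\lambda^*(t)$, we view it as the output of the hierarchical RNN and learn the conditional intensity directly from raw data. However, an arbitrary choice of the conditional intensity could be potentially problematic, because it needs to characterize a probability distribution. Thus, we need to validate the resultant probability density function in Eq. 1 and the cumulative distribution function in Eq. 2.
Proposition. A conditional intensity $\lambda^*(t)$ characterizes a valid temporal point process if and only if $\lambda^*(t) > 0$ and $\lim_{t \to \infty} \int_{t_j}^{t} \lambda^*(u)\,du = \infty$.
Proof.
Sufficiency ($\Leftarrow$). Given $\lim_{t \to \infty} \int_{t_j}^{t} \lambda^*(u)\,du = \infty$ and Eq. 2, we have $\lim_{t \to \infty} F^*(t) = 1$, from which it follows that $\int_{t_j}^{\infty} f^*(t)\,dt = 1$. Since $f^*(t)$ is positive, under this condition it defines a valid probability distribution, hence a well-established temporal point process.
Necessity ($\Rightarrow$). First, $\lambda^*(t)$ must be positive for it to define a valid probability density by Eq. 1. If $\lim_{t \to \infty} \int_{t_j}^{t} \lambda^*(u)\,du = c < \infty$, i.e., the integral converges to a finite positive value $c$, then $\lim_{t \to \infty} F^*(t) = 1 - e^{-c} < 1$. This would be an invalid cumulative distribution function, since $F^*(\infty)$ must equal 1. ∎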
We formally define two forms of conditional intensity as follows.
Explicit time dependence $\lambda_a^*(t)$: The first form is inspired by [26], which models the conditional intensity based on the hidden states $\mathbf{h}_j$ and the time $t$.
$$\lambda_a^*(t) = \exp\big(\mathbf{v}^\top \mathbf{h}_j + w\,(t - t_j) + b\big), \quad w > 0 \tag{3}$$
Note that we make an important correction to [26]. The conditional intensity without the positivity constraint on $w$ in Eq. 3 does not conform to the necessary condition above. By imposing the constraint $w > 0$, we can prove that the revised intensity in Eq. 3 satisfies the condition in the above proposition.
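The effect of the sign constraint can be checked numerically. The sketch below assumes an intensity of the form $\exp(\eta + w\tau)$ with $\tau = t - t_j$, where $\eta$ is a hypothetical scalar standing in for the hidden-state term; it evaluates the cumulative distribution of Eq. 2 in closed form for a positive and a negative $w$.

```python
import math

def cumulative_intensity(eta, w, tau):
    """Integral of exp(eta + w*s) over s in [0, tau]; eta stands in for the hidden-state term."""
    if w == 0.0:
        return math.exp(eta) * tau
    return math.exp(eta) / w * (math.exp(w * tau) - 1.0)

def cdf(eta, w, tau):
    """F*(t_j + tau) = 1 - exp(-cumulative intensity), following Eq. 2."""
    return 1.0 - math.exp(-cumulative_intensity(eta, w, tau))

for w in (0.5, -0.5):
    print(w, [cdf(0.0, w, tau) for tau in (1.0, 10.0, 100.0)])
```

With $w > 0$ the values approach 1. With $w < 0$ the integral converges to $e^{\eta}/|w|$, so the cumulative distribution saturates at $1 - \exp(-e^{\eta}/|w|) < 1$ and some probability mass is never assigned to any finite time.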
Implicit time dependence $\lambda_b^*(t)$: Note that the design of $\lambda_a^*(t)$ in Eq. 3, to some extent, assumes how it is a function of time $t$. As $t$ is part of the input, we believe it is possible to acquire the time information from the hidden states $\mathbf{h}_j$ without any specification about $t$. We use an exponential activation to ensure the positivity of the resultant conditional intensity. Formally, we have:
$$\lambda_b^*(t) = \exp\big(\mathbf{v}^\top \mathbf{h}_j + b\big) \tag{4}$$
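Substituting either intensity into Eq. 1 gives a closed-form density for the time of the next event, which is what the later references to Eq. 5 and Eq. 6 appear to denote. The following is a sketch under the assumption that the two intensities are $\exp(\mathbf{v}^\top \mathbf{h}_j + w(t - t_j) + b)$ with $w > 0$ and $\exp(\mathbf{v}^\top \mathbf{h}_j + b)$; the symbols $\mathbf{v}$, $w$ and $b$ are assumed names for the learnable parameters.

```latex
\begin{align*}
\text{explicit:}\quad f^*(t) &= \exp\Big( \mathbf{v}^\top \mathbf{h}_j + w\,(t - t_j) + b
  - \tfrac{1}{w}\big[ e^{\mathbf{v}^\top \mathbf{h}_j + w\,(t - t_j) + b} - e^{\mathbf{v}^\top \mathbf{h}_j + b} \big] \Big), \\
\text{implicit:}\quad f^*(t) &= \exp\Big( \mathbf{v}^\top \mathbf{h}_j + b
  - e^{\mathbf{v}^\top \mathbf{h}_j + b}\,(t - t_j) \Big).
\end{align*}
```

In the implicit case the intensity is constant between two events, so the waiting time is exponentially distributed with a rate that the network re-estimates after every event.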
3.3 Joint Likelihood
Now we show our model can be readily plugged into a joint estimation framework by formulating a joint likelihood for the timing, spatial location and category of activities. However, instead of directly modeling the next activity location, we use an incremental approach that models the space shift from the current position. Let $L$ be the joint likelihood for a sequence of $n$ activities; $t_j$, $a_j$ and $\mathbf{s}_j$ denote the timestamp, action category, and space shift of activity $j$, respectively. To derive the joint likelihood, we make the following assumption.
For mathematical convenience, we assume the timing, action category and space shift of event $j+1$ are conditionally independent given the history up to event $j$ ($\mathcal{H}_j$). That is, $p(t_{j+1}, a_{j+1}, \mathbf{s}_{j+1} \mid \mathcal{H}_j) = f(t_{j+1} \mid \mathcal{H}_j)\,p(a_{j+1} \mid \mathcal{H}_j)\,p(\mathbf{s}_{j+1} \mid \mathcal{H}_j)$, or $f^*(t_{j+1})\,p^*(a_{j+1})\,p^*(\mathbf{s}_{j+1})$ if we use the “*” notation. Therefore, we have the joint likelihood parameterized by $\theta$:
$$L(\theta) = \prod_{j=1}^{n-1} f^*(t_{j+1})\,p^*(a_{j+1})\,p^*(\mathbf{s}_{j+1}) \tag{7}$$
We drop the subscript “$j+1$” whenever possible for clean notation. Since we have already obtained the form of $f^*(t)$ in Eq. 5 and Eq. 6, in the next section we derive the form of $p^*(a)$ and $p^*(\mathbf{s})$.
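Estimating the Action Category: The action category likelihood $p^*(a_{j+1})$ represents the distribution over the type of action. Since the history is encoded by the RNN hidden states $\mathbf{h}_j$, we have $p^*(a_{j+1}) = p(a_{j+1} \mid \mathbf{h}_j)$. Given the hidden states $\mathbf{h}_j$, our model outputs a discrete distribution over action classes:
$$\hat{\mathbf{y}}_{j+1} = \mathrm{softmax}(\mathbf{W}_a \mathbf{h}_j + \mathbf{b}_a) \tag{8}$$
We then model this likelihood with a Gibbs distribution:
$$p(a_{j+1} \mid \mathbf{h}_j) = \frac{1}{Z}\exp\big(-E(a_{j+1}, \hat{\mathbf{y}}_{j+1})\big) \tag{9}$$
where $Z$ is a normalizing constant and the energy function $E(a_{j+1}, \hat{\mathbf{y}}_{j+1}) = D_{\mathrm{KL}}(\mathbf{y}_{j+1} \,\|\, \hat{\mathbf{y}}_{j+1})$ is the Kullback-Leibler divergence between the ground-truth distribution $\mathbf{y}_{j+1}$ (encoded as a one-hot vector) and the predicted distribution $\hat{\mathbf{y}}_{j+1}$.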
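Estimating the Space Shift: The space shift likelihood $p^*(\mathbf{s}_{j+1})$ gives the spatial distribution of the next move. Similar to $p^*(a_{j+1})$, we have $p^*(\mathbf{s}_{j+1}) = p(\mathbf{s}_{j+1} \mid \mathbf{h}_j)$. We model the likelihood using a bivariate Gaussian distribution:
$$p(\mathbf{s}_{j+1} \mid \mathbf{h}_j) = \frac{1}{2\pi\,|\Sigma|^{1/2}} \exp\Big(-\tfrac{1}{2}(\mathbf{s}_{j+1} - \boldsymbol{\mu}_{j+1})^\top \Sigma^{-1} (\mathbf{s}_{j+1} - \boldsymbol{\mu}_{j+1})\Big) \tag{10}$$
where $\boldsymbol{\mu}_{j+1}$ is the mean and $\Sigma$ is a $2 \times 2$ covariance matrix. We find that learning all the parameters in $\Sigma$ is unstable, so we assume the shifts along the $x$ and $y$ directions are independent, hence $\Sigma = \mathrm{diag}(\sigma_x^2, \sigma_y^2)$. We set $\sigma_x$ and $\sigma_y$ to be constant and, given the hidden states $\mathbf{h}_j$, we use
$$\boldsymbol{\mu}_{j+1} = \mathbf{W}_s \mathbf{h}_j + \mathbf{b}_s \tag{11}$$
to parameterize Eq. 10, where $\mathbf{W}_s$ and $\mathbf{b}_s$ are learnable parameters.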
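The two likelihood terms of this section can be sketched in plain Python. The code assumes that the class distribution comes from a softmax over logits, that the energy is the KL divergence from a one-hot target (which reduces to the negative log-probability of the true class), and that the covariance is diagonal with fixed standard deviations. All function and argument names are illustrative.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def category_energy(logits, true_class):
    """KL(one-hot || softmax(logits)), which reduces to -log p[true_class]."""
    return -math.log(softmax(logits)[true_class])

def space_shift_log_likelihood(shift, mean, sigma_x=2.0, sigma_y=2.0):
    """Log-density of a bivariate Gaussian with diagonal covariance."""
    dx, dy = shift[0] - mean[0], shift[1] - mean[1]
    quad = (dx / sigma_x) ** 2 + (dy / sigma_y) ** 2
    return -0.5 * quad - math.log(2.0 * math.pi * sigma_x * sigma_y)
```

With a one-hot target the Gibbs likelihood equals the predicted probability of the true class, so maximizing it is the familiar cross-entropy objective; with fixed standard deviations the space-shift term is a scaled squared error on the predicted mean.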
3.4 Training
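The model parameters $\theta$ can be learned in a supervised learning framework by maximizing the likelihood of event sequences. To formulate the data log-likelihood, we substitute Eqs. 5, 6, 9 and 10 into Eq. 7. Taking the logarithm yields Eq. 12 and Eq. 13 for the intensities $\lambda_a^*$ and $\lambda_b^*$ in Eq. 3 and Eq. 4, respectively.
$$\log L_a(\theta) = \sum_{j} \Big[ \mathbf{v}^\top \mathbf{h}_j + w\,(t_{j+1} - t_j) + b - \tfrac{1}{w}\big( e^{\mathbf{v}^\top \mathbf{h}_j + w\,(t_{j+1} - t_j) + b} - e^{\mathbf{v}^\top \mathbf{h}_j + b} \big) - E(a_{j+1}, \hat{\mathbf{y}}_{j+1}) - \tfrac{1}{2}(\mathbf{s}_{j+1} - \boldsymbol{\mu}_{j+1})^\top \Sigma^{-1} (\mathbf{s}_{j+1} - \boldsymbol{\mu}_{j+1}) \Big] + C \tag{12}$$
$$\log L_b(\theta) = \sum_{j} \Big[ \mathbf{v}^\top \mathbf{h}_j + b - e^{\mathbf{v}^\top \mathbf{h}_j + b}\,(t_{j+1} - t_j) - E(a_{j+1}, \hat{\mathbf{y}}_{j+1}) - \tfrac{1}{2}(\mathbf{s}_{j+1} - \boldsymbol{\mu}_{j+1})^\top \Sigma^{-1} (\mathbf{s}_{j+1} - \boldsymbol{\mu}_{j+1}) \Big] + C \tag{13}$$
Here $E$ is the energy function of Eq. 9, $\boldsymbol{\mu}_{j+1}$ and $\Sigma$ are the Gaussian parameters of Eq. 10, and $C$ absorbs all constants in the derivation above and can be dropped during optimization. The joint log-likelihood for all sample sequences is obtained by summing the log-likelihood for each sequence. Because the log-likelihood is fully differentiable, we can apply backpropagation algorithms for maximization.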
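As an illustration of the timing part of the objective, the sketch below sums the log-density of each inter-event interval for both intensity forms. It assumes intensities of the form $\exp(\eta_j + w\,(t - t_j))$ and $\exp(\eta_j)$, where `etas[j]` is a hypothetical scalar standing in for the hidden-state term at event $j$; the category and space terms are left out.

```python
import math

def log_density_explicit(eta, w, dt):
    """log f*(t) for the intensity exp(eta + w*dt), w > 0, with dt = t - t_j."""
    return eta + w * dt - (math.exp(eta + w * dt) - math.exp(eta)) / w

def log_density_implicit(eta, dt):
    """log f*(t) for the constant-in-time intensity exp(eta)."""
    return eta - math.exp(eta) * dt

def time_log_likelihood(etas, timestamps, w=None):
    """Sum of log f*(t_{j+1}) over consecutive events; uses the implicit form if w is None."""
    total = 0.0
    for j in range(len(timestamps) - 1):
        dt = timestamps[j + 1] - timestamps[j]
        if w is None:
            total += log_density_implicit(etas[j], dt)
        else:
            total += log_density_explicit(etas[j], w, dt)
    return total
```

As $w$ tends to zero the explicit form approaches the implicit one, which gives a quick sanity check for an implementation.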
3.5 Inference
To infer the timing of the next activity, we follow the same inference procedure as in the standard point process literature: given all ground-truth history up to activity $j$, we predict when the next activity $j+1$ will happen. Then we proceed to predict the timing of activity $j+2$ given all ground-truth history up to activity $j+1$. Therefore, the errors will not accumulate exponentially. This is a reasonable approach in many practical scenarios (knowing what has happened up to now, predict the next event). While we have a full model of the distribution, to obtain a point estimate, we take the expected time $\hat{t}_{j+1} = \int_{t_j}^{\infty} t\,f^*(t)\,dt$ as our prediction. Eq. 14 is the result obtained using the conditional intensity in Eq. 3, where $\Gamma(0, \cdot)$ is an incomplete gamma function whose value can be evaluated using numerical integration algorithms. Eq. 15 is acquired using the conditional intensity in Eq. 4. The derivation makes use of Eq. 1, and we include the full details in the supplementary material.
$$\hat{t}_{j+1} = t_j + \frac{e^{c}}{w}\,\Gamma(0, c), \quad c = \frac{1}{w}\exp\big(\mathbf{v}^\top \mathbf{h}_j + b\big) \tag{14}$$
$$\hat{t}_{j+1} = t_j + \exp\big(-(\mathbf{v}^\top \mathbf{h}_j + b)\big) \tag{15}$$
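To predict the category of the next activity, we take the most confident class in the output distribution as the prediction:
$$\hat{a}_{j+1} = \arg\max_{k}\; \hat{y}_{j+1}^{(k)} \tag{16}$$
where $\hat{y}_{j+1}^{(k)}$ is the predicted probability of class $k$. To estimate the spatial location of the next activity, we take the expected space shift added to the current position $\mathbf{l}_j$ as the result:
$$\hat{\mathbf{l}}_{j+1} = \mathbf{l}_j + \mathbb{E}[\mathbf{s}_{j+1}] = \mathbf{l}_j + \boldsymbol{\mu}_{j+1} \tag{17}$$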
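The expected-time prediction can be sketched without special-function libraries. The code assumes intensities of the form $\exp(\eta)$ (implicit) and $\exp(\eta + w\tau)$ with $w > 0$ (explicit), where $\eta$ is a hypothetical scalar standing in for the hidden-state term. For the explicit form it integrates the survival function $1 - F^*(t_j + \tau)$ with a trapezoidal rule instead of calling an incomplete gamma routine; the integration range and step count are arbitrary choices.

```python
import math

def expected_time_implicit(t_j, eta):
    """E[t_{j+1}] = t_j + 1/lambda for the constant intensity lambda = exp(eta)."""
    return t_j + math.exp(-eta)

def expected_time_explicit(t_j, eta, w, upper=50.0, steps=100000):
    """E[t_{j+1}] = t_j + integral of the survival function over [0, upper] (trapezoidal rule)."""
    c = math.exp(eta) / w

    def survival(tau):
        # 1 - F*(t_j + tau) = exp(-c * (exp(w * tau) - 1))
        return math.exp(-c * math.expm1(w * tau))

    h = upper / steps
    total = 0.5 * (survival(0.0) + survival(upper))
    total += sum(survival(i * h) for i in range(1, steps))
    return t_j + h * total
```

The default range assumes that `w * upper` stays well below about 700 so that `math.expm1` does not overflow. For the same $\eta$, a positive $w$ makes the intensity grow over time and therefore yields an earlier expected event time than the implicit form.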
4 Experiments
We evaluate the model on two challenging datasets collected from real world sports games. These datasets include activities in basketball and ice hockey with extremely fast movement.
All of our baselines consist of two components: a Markov chain and a conventional point process. The Markov chain models action category and space shift distribution; the point process models action timestamps. In our experiments, we compare TPM’s performance in time estimation with three other typical temporal point processes: Poisson process, Hawkes process and self-correcting process (Sec. 2). We compare TPM’s performance in space and category prediction with $k$-th order Markov chains ($k \in \{1, 3, 5, 7, 9\}$). Also note that TPM has two variants, TPM$_a$ and TPM$_b$, using the two conditional intensity functions $\lambda_a^*$ and $\lambda_b^*$ in Eq. 3 and Eq. 4, respectively.
4.1 Datasets
STATS SportVU NBA dataset. This dataset contains the trajectories of 10 players and the ball in court coordinates. During each basketball game possession, there are annotations about when and where a predefined activity is performed, such as pass, rebound, shot, etc.
The frame data are obtained by concatenating the court coordinates of the offensive players, defensive players and the ball. The order of concatenation within each team is determined by how far a player is from the ball: the closest player is the first entry and the farthest is the last. The frame data are fed into the hierarchical RNN with a single-layer perceptron as the feature extractor of each frame. The maximum number of frames is 150 for each sequence. A basketball possession is at most 24 seconds, so this results in an effective frame rate of about 6.25 fps (150 frames over 24 s). During training, we set both $\sigma_x$ and $\sigma_y$ to 2 ft.
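The frame construction described above can be sketched as follows. The function name and the use of plain coordinate tuples are assumptions for illustration; the ordering rule (offense, defense, ball, with each team sorted by distance to the ball) follows the description.

```python
import math

def build_frame_vector(offense, defense, ball):
    """Concatenate court coordinates: offense, defense, then ball.

    Players within each team are sorted from closest to farthest from the ball.
    """
    def sort_by_ball_distance(players):
        return sorted(players, key=lambda p: math.dist(p, ball))

    ordered = sort_by_ball_distance(offense) + sort_by_ball_distance(defense) + [ball]
    return [coord for point in ordered for coord in point]

# Hypothetical toy frame with two offensive players and one defender.
vector = build_frame_vector([(10.0, 10.0), (1.0, 1.0)], [(5.0, 5.0)], (0.0, 0.0))
```

With five players per team and the ball, the resulting vector has 22 entries per frame. Sorting by distance gives the perceptron a consistent slot for the player most likely to be involved in the next activity.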
SPORTLOGiQ NHL dataset. This dataset includes the raw broadcast videos, player bounding boxes and trajectories with similar annotations to the NBA dataset. However, unlike the NBA dataset, the number of players in each frame may change due to the nature of broadcast videos. To solve this problem, we fix the number of players used per frame. If a frame contains fewer players than this number, we zero out the missing entries. If it contains more, we select the players that are most clustered, which essentially assumes that the players cluster around where the actions are. We use closeness centrality to implement this intuition. We build a complete graph over the players in a frame, each player being a node in the graph. Then we compute the closeness centrality for each node using Euclidean distance and choose the players with the highest closeness scores.
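A minimal sketch of this selection step is given below. It relies on the fact that in a complete graph weighted by Euclidean distance the shortest path between two players is the direct edge, so closeness centrality reduces to the number of other players divided by the sum of distances to them. The function names, the parameter `k` and the zero-padding helper are illustrative assumptions.

```python
import math

def select_players(positions, k):
    """Return the indices of the k players with the highest closeness centrality."""
    n = len(positions)
    if n <= k:
        return list(range(n))

    def closeness(i):
        total = sum(math.dist(positions[i], positions[j]) for j in range(n) if j != i)
        return (n - 1) / total if total > 0 else float("inf")

    ranked = sorted(range(n), key=closeness, reverse=True)
    return sorted(ranked[:k])

def pad_players(features, k, dim):
    """Append zero vectors when fewer than k players are visible."""
    return features + [[0.0] * dim for _ in range(k - len(features))]

# Hypothetical frame: three clustered players and one far away.
chosen = select_players([(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (50.0, 50.0)], k=3)
```

The isolated player has the largest summed distance to everyone else and is therefore the one dropped.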
Given the pixels inside the bounding box and the coordinates of a single player, we feed them into a VGG-16 network [31] and a single-layer perceptron, respectively. The outputs are then summed. This is repeated for every selected player, and finally we apply element-wise max-pooling over the resulting feature vectors to obtain a holistic feature representation for the players. Fig. 3 outlines this workflow.
In the experiments, we use . For each sequence, we use at most 80 frames for training and 200 frames for evaluation. After downsampling the videos, the frame rate is 7.5fps. Thus the longest sequence allowed is approximately 10.7s for training and 26.7s for evaluation. We again use ft.
4.2 Performance Measures
We use mean absolute error (mAE) to evaluate the estimation of time and space, and mean average precision (mAP) to measure the performance of action category prediction. However, given the nature of sports games, there are significant variations among the time intervals between neighboring activities (intervals range from milliseconds to seconds). Reporting mAE alone ignores these variations. For example, an error of 100 ms is considered less significant if the ground-truth time interval is 1 s as opposed to merely 100 ms. Therefore we advocate mean deviation rate (mDR) as a better measure. Deviation rate (DR) is calculated as below, where $\hat{t}_{j+1}$ is the predicted time and $t_{j+1}$ the ground-truth time of the next activity; mDR is DR averaged over all time steps.
$$\mathrm{DR} = \frac{|\hat{t}_{j+1} - t_{j+1}|}{t_{j+1} - t_j} \times 100\% \tag{18}$$
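The measure can be sketched as follows, assuming that the deviation rate normalizes the absolute timing error by the ground-truth interval between the current and the next activity, as the 100 ms example above suggests. The function names are illustrative.

```python
def deviation_rate(t_pred, t_true, t_prev):
    """Absolute timing error relative to the ground-truth inter-event interval, in percent."""
    return abs(t_pred - t_true) / (t_true - t_prev) * 100.0

def mean_deviation_rate(predictions, timestamps):
    """predictions[j] is the predicted time of event j+1 given the history up to event j."""
    rates = [deviation_rate(p, timestamps[j + 1], timestamps[j])
             for j, p in enumerate(predictions)]
    return sum(rates) / len(rates)
```

The same 100 ms error gives a deviation rate of about 10% for a 1 s interval and 100% for a 100 ms interval, which is the asymmetry that mAE alone cannot express.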
4.3 Baselines
The baseline models predict the time of the next activity with a conventional temporal point process, such as the Poisson process, Hawkes process and self-correcting process. In order to predict the category and location of the next activity, we utilize $k$-th order Markov chains, where $k \in \{1, 3, 5, 7, 9\}$. We do not use higher orders since most sample possessions do not have a sequence length larger than 10.
The inference stage of a $k$-th order Markov chain works as follows. Given the $k$ most recent activities, we find the next activity with the highest transition probability. If the number of historical activities at the current time step is less than $k$, or we are unable to find the exact historical activities in the transition matrix, we relax the dependency requirement by using the $k-1$ most recent activities. This is repeated until we find a valid transition to the next activity. The worst case is a degenerate Markov chain of order 0, which amounts to majority voting. Given the selected transition to the next activity, we compute the mean space shift of all such transitions collected during training, which is added to the current location to obtain the prediction of the next activity location.
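The back-off procedure can be sketched with transition counts stored for every context length, so that an unseen or too-short history falls back to a shorter one and finally to a majority vote. The function names and the toy sequences are hypothetical, and the mean space shift attached to each transition is omitted for brevity.

```python
from collections import Counter, defaultdict

def train_markov(sequences, order):
    """Count transitions from every context of length 0..order to the next activity."""
    counts = defaultdict(Counter)
    for seq in sequences:
        for i, nxt in enumerate(seq):
            for k in range(0, order + 1):
                if i - k >= 0:
                    counts[tuple(seq[i - k:i])][nxt] += 1
    return counts

def predict_next(counts, history, order):
    """Back off from the longest usable context to shorter ones until a transition is found."""
    for k in range(min(order, len(history)), -1, -1):
        context = tuple(history[len(history) - k:])
        if context in counts:
            return counts[context].most_common(1)[0][0]
    return None

# Hypothetical training possessions.
sequences = [["pass", "reception", "shoot"],
             ["pass", "reception", "pass"],
             ["dribble", "pass", "reception"]]
counts = train_markov(sequences, order=2)
```

Here `predict_next(counts, ["dribble", "pass"], 2)` uses the full two-activity context, while a history containing only an unseen activity falls back to the empty context, i.e. the most frequent activity overall.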
4.4 Results
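The results in Tab. I show that the proposed TPMs outperform traditional statistical approaches. On the other hand, by comparing the two TPM variants, we find that TPM$_b$ performs better than TPM$_a$ on all measures except mAE on the NHL dataset. Thus, the proposed conditional intensity $\lambda_b^*$ (Eq. 4) can be more generic and effective than $\lambda_a^*$ (Eq. 3).

Table I: Time prediction results on the NBA and NHL datasets.

| Model | NBA mAE (ms) | NBA mDR (%) | NHL mAE (ms) | NHL mDR (%) |
|---|---|---|---|---|
| TPM$_a$ (Eq. 3) | 288.1 | 54.6 | 527.9 | 174.5 |
| TPM$_b$ (Eq. 4) | 282.1 | 52.0 | 530.7 | 172.0 |
| Poisson | 365.5 | 547.0 | 645.4 | 297.6 |
| Hawkes | 363.8 | 541.2 | 643.7 | 296.5 |
| Self-Correcting | 382.4 | 522.4 | 643.0 | 291.5 |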
To see what the model has learned, we visualize the TPM model predictions versus groundtruth annotations in Fig. 4. We find that our model generally is able to approximate and keep track of the true arrival pattern in the input sequence (e.g., the upper row in each of the four subfigures in Fig. 4). There are some large gaps between prediction and groundtruth when there comes a sudden high spike in the groundtruth. We believe this is because of the inherent randomness in sports games. In addition to the past series of activities, the action to be performed depends on many other factors such as tactics, which have not been explicitly observed and annotated during training and are challenging for the model to learn.
The lower row of each of the four subfigures in Fig. 4 visualizes how the predicted time distribution changes as a basketball possession proceeds. The ability to capture the temporal distribution is a key advantage of the TPM.
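Table II: Space and category prediction results. MC$k$ denotes a $k$-th order Markov chain.

| Dataset | Measure | TPM$_a$ | TPM$_b$ | MC1 | MC3 | MC5 | MC7 | MC9 |
|---|---|---|---|---|---|---|---|---|
| NBA | space mAE (ft) | 3.43 | 3.28 | 6.91 | 6.86 | 6.73 | 6.69 | 6.69 |
| NBA | AP (%): shoot | 57.9 | 58.0 | 10.1 | 32.9 | 35.7 | 37.0 | 37.4 |
| NBA | AP (%): dribble | 92.4 | 92.7 | 86.2 | 76.2 | 80.6 | 82.1 | 82.6 |
| NBA | AP (%): pass | 44.5 | 45.9 | 34.3 | 21.4 | 22.6 | 24.5 | 24.7 |
| NBA | AP (%): reception | 98.4 | 98.4 | 96.2 | 95.3 | 95.3 | 95.2 | 95.1 |
| NBA | AP (%): assist | 8.7 | 8.6 | 2.1 | 2.5 | 3.3 | 3.7 | 3.7 |
| NBA | AP (%): end | 99.9 | 99.9 | 99.9 | 99.9 | 99.9 | 99.9 | 99.9 |
| NBA | mAP (%) | 67.0 | 67.2 | 54.8 | 54.7 | 56.2 | 57.0 | 57.3 |
| NHL | space mAE (ft) | 56.95 | 57.01 | 65.96 | 66.60 | 66.85 | 66.88 | 67.24 |
| NHL | AP (%): pass | 61.2 | 61.8 | 66.9 | 51.8 | 52.4 | 53.1 | 52.9 |
| NHL | AP (%): reception | 64.4 | 64.3 | 78.8 | 50.8 | 51.8 | 52.3 | 52.1 |
| NHL | AP (%): carry | 21.3 | 21.2 | 30.8 | 20.0 | 18.7 | 19.2 | 18.8 |
| NHL | AP (%): shoot | 11.1 | 9.6 | 11.4 | 10.9 | 9.9 | 10.4 | 10.3 |
| NHL | AP (%): dump-in | 11.3 | 12.2 | 30.0 | 8.6 | 9.5 | 9.2 | 9.3 |
| NHL | AP (%): protection | 32.8 | 32.8 | 28.3 | 24.4 | 23.6 | 24.8 | 24.4 |
| NHL | AP (%): dump-out | 4.7 | 5.5 | 22.8 | 4.6 | 4.6 | 4.6 | 4.6 |
| NHL | AP (%): check | 11.0 | 11.8 | 19.5 | 7.2 | 7.3 | 8.0 | 8.7 |
| NHL | AP (%): block | 25.9 | 23.0 | 21.7 | 15.5 | 16.2 | 15.8 | 15.8 |
| NHL | AP (%): end | 80.6 | 79.5 | 47.0 | 32.9 | 26.6 | 25.0 | 25.0 |
| NHL | mAP (%) | 32.4 | 32.2 | 35.7 | 22.7 | 22.1 | 22.2 | 22.2 |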
In terms of space prediction, Tab. II shows quantitative results. We see that TPMs have consistently better performance than Markov chains on both datasets. A sample qualitative result is presented in Fig. 5. Note that the court in NBA games is 94ft by 50ft and the rink in NHL games is 200ft by 85ft.
The space mAE (in Euclidean distance) on the NHL dataset is significantly greater than that on the NBA dataset. We believe this is because, in ice hockey games, players and the puck exhibit extremely quick motions. For example, the puck can be moved from one end of the rink to the other in less than a second, after which a puck reception could happen immediately, making the spatial location hard to predict. In contrast to hockey, our models are more accurate for basketball, where the relatively slower motions make space prediction more precise. Space prediction relies heavily on the speed of motion, but category prediction is not subject to such a constraint, so our models exhibit reasonable performance on inferring the type of the next activity.
An interesting finding is that a first-order Markov chain has surprisingly good mAP on the NHL dataset when compared to Markov chains of other orders. After we look into the precision of each category (provided in the supplementary material), we find that it performs exceptionally better on activities such as carry, dump-out and dump-in, which are very rare in the training data as opposed to other types of activities. We did not observe similar behaviour on the NBA dataset, so we believe this results from the highly unbalanced ground-truth annotations in the NHL dataset.

Table III: Time prediction error of TPM versus a regression neural network.

| Dataset | TPM | Regression NN |
|---|---|---|
| NBA | 51.6% | 56.9% |
| NHL | 138.0% | 188.2% |
5 Discussion
Regression vs. distribution. An intuitive way to predict the next activity time is to train a regression neural network with a mean squared error loss. However, we believe that learning a distribution captures more than regressing a scalar does. We validate this with a simple experiment. We train TPM solely for time prediction. All else being equal, we train a vanilla regression neural network to predict the time interval between the current activity and the next activity, which is then added to the current timestamp to obtain the predicted time of the next activity. Results are presented in Tab. III. TPM clearly does a better job of predicting the next activity occurrence. Additionally, since TPM is trained explicitly by maximizing the raw likelihood function, it readily enables us to inspect the temporal distribution of predictions as in Fig. 4, whereas this feature is not available for a regression model.
Framework and generality. The proposed TPM is a general framework for prediction and modeling the arrival pattern of an activity sequence. It does not rely on a specific neural network structure. For example, in our experiment, we use a simple VGG16 as the backbone network, but one can use other more advanced networks such as [32, 33, 34]. Networks [35, 36, 37, 7] exclusively designed for action recognition can be used as well.
Applicable scenarios. TPM is a powerful model of the arrival pattern of sparsely distributed activities and can forecast the exact time of occurrence of the next activity. Here “sparsely distributed” does not imply any concepts regarding weak supervision/annotation. TPM conforms to a fully supervised learning paradigm. Existing work such as [16] uses sparsely annotated data as well, but it addresses a totally different task than TPM. Furthermore, TPM specializes in dealing with sequences where activity events can be approximated as mass points in time. Activities with a long temporal span do not fit into the TPM framework. Therefore, TPM is positioned in contrast to existing benchmarks such as Breakfast [38] and MPII-Cooking [39], but is useful for the sports analytics, surveillance, and autonomous vehicle scenarios outlined above.
6 Conclusion
We have presented a novel take on the problem of activity forecasting. Predicting when and where discrete, important activity events will occur is the task we explore. In contrast with previous activity forecasting methods, this emphasizes semantically meaningful action categories and is explicit about when and where they will next take place. We construct a novel hierarchical RNN based temporal point process model for this task. Empirical results on challenging sports action datasets demonstrate the efficacy of the proposed methods.
Appendix A Probability density and cumulative distribution of temporal point processes
The cumulative distribution $F^*(t)$ is defined as the probability that at least one event happens by time $t$ since the last event time $t_j$. The “*” is a reminder that a quantity depends on the past. Let $f^*(t)$ denote the probability density function and $N(t)$ the number of events till time $t$. Then we have
$$F^*(t) = \int_{t_j}^{t} f^*(u)\,du = P\big(N(t) - N(t_j) \ge 1\big) \tag{19}$$
This is equivalent to
$$F^*(t) = 1 - P\big(N(t) - N(t_j) = 0\big) \tag{20}$$
Because the temporal point process models we are dealing with belong to the general class of nonhomogeneous Poisson processes whose conditional intensity is a function of time $t$, by definition the number of events in $(t_j, t]$ conforms to a Poisson distribution parameterized by $\Lambda$:
$$P\big(N(t) - N(t_j) = k\big) = \frac{\Lambda^{k}}{k!}\,e^{-\Lambda} \tag{21}$$
where $\Lambda = \int_{t_j}^{t} \lambda^*(u)\,du$ is the expected number of events in the interval.
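The remaining steps follow directly. As a sketch, writing $\Lambda = \int_{t_j}^{t} \lambda^*(u)\,du$ for the Poisson parameter (an assumed symbol), setting the event count to zero gives the probability of no event in the interval, and differentiating the resulting cumulative distribution with respect to $t$ gives the density:

```latex
\begin{align*}
F^*(t) &= 1 - P\big(N(t) - N(t_j) = 0\big)
        = 1 - \exp\Big(-\int_{t_j}^{t} \lambda^*(u)\,du\Big), \\
f^*(t) &= \frac{dF^*(t)}{dt}
        = \lambda^*(t)\,\exp\Big(-\int_{t_j}^{t} \lambda^*(u)\,du\Big).
\end{align*}
```

These are the cumulative distribution function and the probability density function referred to as Eq. 2 and Eq. 1 in Sec. 2.2.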
Appendix B The validity of conditional intensities
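This section provides the proof that the two conditional intensities (Eq. 3 and Eq. 4) used in our experiments characterize valid temporal point processes.
Proof.
When $w > 0$, the quantity $\lambda_a^*(t) = \exp\big(\mathbf{v}^\top \mathbf{h}_j + w\,(t - t_j) + b\big)$ is monotonically increasing in terms of $t$. As $t$ approaches infinity, $\int_{t_j}^{t} \lambda_a^*(u)\,du$ approaches infinity as well. Substituting into Eq. 2, we have $\lim_{t \to \infty} F^*(t) = 1$, so $\lambda_a^*(t)$ is a valid conditional intensity when $w > 0$.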
Appendix C Inference of time
In this section, we derive the predicted time for the two conditional intensities (Eq.3 and Eq.4) we used.
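C.1 When $\lambda^*(t)$ takes the form in Eq. 3
$$\begin{aligned}
\hat{t}_{j+1} &= \int_{t_j}^{\infty} t\,f^*(t)\,dt \\
&= \int_{0}^{\infty} (\tau + t_j)\,f^*(t_j + \tau)\,d\tau && \text{(Obtained by letting } \tau = t - t_j\text{)} \\
&= t_j + \int_{0}^{\infty} \tau\,f^*(t_j + \tau)\,d\tau && \text{(}\textstyle\int_{0}^{\infty} f^*(t_j + \tau)\,d\tau \text{ is } F^*(\infty)\text{, so equal to 1)} \\
&= t_j + \frac{e^{c}}{w} \int_{c}^{\infty} \ln\frac{x}{c}\; e^{-x}\,dx && \text{(Obtained by letting } x = c\,e^{w\tau}\text{)} \\
& && \text{(where } c = \tfrac{1}{w}\exp(\mathbf{v}^\top \mathbf{h}_j + b)\text{)} \\
&= t_j + \frac{e^{c}}{w} \int_{c}^{\infty} x^{-1} e^{-x}\,dx && \text{(Integrate by parts)} \\
&= t_j + \frac{e^{c}}{w}\,\Gamma(0, c) && \text{(where } \Gamma \text{ is an incomplete gamma function)}
\end{aligned}$$
C.2 When $\lambda^*(t)$ takes the form in Eq. 4
$$\begin{aligned}
\hat{t}_{j+1} &= \int_{t_j}^{\infty} t\,f^*(t)\,dt \\
&= \int_{t_j}^{\infty} t\,\lambda\,e^{-\lambda (t - t_j)}\,dt && \text{(Obtained by letting } \lambda = \exp(\mathbf{v}^\top \mathbf{h}_j + b) \text{ since } \lambda^*(t) \text{ does not actually rely on } t\text{)} \\
&= \int_{0}^{\infty} (\tau + t_j)\,\lambda\,e^{-\lambda \tau}\,d\tau && \text{(Obtained by letting } \tau = t - t_j\text{)} \\
&= t_j + \int_{0}^{\infty} \tau\,\lambda\,e^{-\lambda \tau}\,d\tau && \text{(}\textstyle\int_{0}^{\infty} \lambda e^{-\lambda \tau}\,d\tau \text{ is } F^*(\infty)\text{, so equal to 1)} \\
&= t_j + \frac{1}{\lambda} = t_j + \exp\big(-(\mathbf{v}^\top \mathbf{h}_j + b)\big) && \text{(Integrate by parts)}
\end{aligned}$$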
References
 [1] Z. Shou, D. Wang, and S.-F. Chang, “Temporal action localization in untrimmed videos via multi-stage CNNs,” in CVPR, 2016.
 [2] Z. Shou, J. Chan, A. Zareian, K. Miyazawa, and S.-F. Chang, “CDC: Convolutional-de-convolutional networks for precise temporal action localization in untrimmed videos,” in CVPR, 2017.
 [3] Z. Yuan, J. C. Stroud, T. Lu, and J. Deng, “Temporal action localization by structured maximal sums,” in CVPR, 2017.
 [4] A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and L. Fei-Fei, “Large-scale video classification with convolutional neural networks,” in CVPR, 2014.
 [5] J. Yue-Hei Ng, M. Hausknecht, S. Vijayanarasimhan, O. Vinyals, R. Monga, and G. Toderici, “Beyond short snippets: Deep networks for video classification,” in CVPR, 2015.
 [6] C. Feichtenhofer, A. Pinz, and A. Zisserman, “Convolutional two-stream network fusion for video action recognition,” in CVPR, 2016.
 [7] Z. Qiu, T. Yao, and T. Mei, “Learning spatio-temporal representation with pseudo-3D residual networks,” in CVPR, 2017.
 [8] W. Lian, R. Henao, V. Rao, J. Lucas, and L. Carin, “A multitask point process predictive model,” in ICML, 2015.
 [9] D. P. Kingma and M. Welling, “Auto-encoding variational Bayes,” in ICLR, 2014.
 [10] K. Gregor, I. Danihelka, A. Graves, D. J. Rezende, and D. Wierstra, “DRAW: A recurrent neural network for image generation,” in ICML, 2015.
 [11] K. Kitani, B. Ziebart, J. Bagnell, and M. Hebert, “Activity forecasting,” in ECCV, 2012.
 [12] N. Rhinehart and K. M. Kitani, “First-person activity forecasting with online inverse reinforcement learning,” in ICCV, 2017.
 [13] D. Xie, T. Shu, S. Todorovic, and S.-C. Zhu, “Modeling and inferring human intents and latent functional objects for trajectory prediction,” arXiv preprint arXiv:1606.07827, 2016.
 [14] H. Soo Park, J.-J. Hwang, Y. Niu, and J. Shi, “Egocentric future localization,” in CVPR, 2016.
 [15] C. Vondrick, H. Pirsiavash, and A. Torralba, “Anticipating visual representations from unlabeled video,” in CVPR, 2016.
 [16] D.-A. Huang, L. Fei-Fei, and J. C. Niebles, “Connectionist temporal modeling for weakly supervised action labeling,” in ECCV, 2016.
 [17] A. Graves, S. Fernández, F. Gomez, and J. Schmidhuber, “Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks,” in ICML, 2006.
 [18] S. Bai, J. Zico Kolter, and V. Koltun, “An empirical evaluation of generic convolutional and recurrent networks for sequence modeling,” arXiv preprint arXiv:1803.01271, 2018.
 [19] D. Neil, M. Pfeiffer, and S.-C. Liu, “Phased LSTM: Accelerating recurrent network training for long or event-based sequences,” in NIPS, 2016.
 [20] T. A. Lasko, “Efficient inference of Gaussian-process-modulated renewal processes with application to medical event data,” in UAI, 2014.
 [21] L. Xu, J. A. Duan, and A. Whinston, “Path to purchase: A mutually exciting point process model for online advertising and conversion,” Management Science, vol. 60, no. 6, pp. 1392–1412, 2014.
 [22] Y. Ogata, “Space-time point-process models for earthquake occurrences,” Annals of the Institute of Statistical Mathematics, vol. 50, no. 2, pp. 379–402, 1998.
 [23] J. F. C. Kingman, Poisson processes. Wiley Online Library, 1993.
 [24] A. G. Hawkes, “Spectra of some self-exciting and mutually exciting point processes,” Biometrika, vol. 58, no. 1, pp. 83–90, 1971.
 [25] V. Isham and M. Westcott, “A self-correcting point process,” Stochastic Processes and Their Applications, vol. 8, no. 3, pp. 335–347, 1979.
 [26] N. Du, H. Dai, R. Trivedi, U. Upadhyay, M. Gomez-Rodriguez, and L. Song, “Recurrent marked temporal point processes: Embedding event history to vector,” in SIGKDD, 2016.
 [27] J. Chung, S. Ahn, and Y. Bengio, “Hierarchical multiscale recurrent neural networks,” in ICLR, 2017.
 [28] A. Sordoni, Y. Bengio, H. Vahabi, C. Lioma, J. Grue Simonsen, and J.-Y. Nie, “A hierarchical recurrent encoder-decoder for generative context-aware query suggestion,” in CIKM, 2015.
 [29] L. Kong, C. Dyer, and N. A. Smith, “Segmental recurrent neural networks,” in ICLR, 2016.
 [30] S. El Hihi and Y. Bengio, “Hierarchical recurrent neural networks for long-term dependencies,” in NIPS, 1996.
 [31] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” in ICLR, 2015.
 [32] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, A. Rabinovich et al., “Going deeper with convolutions,” in CVPR, 2015.
 [33] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in CVPR, 2016.
 [34] G. Huang, Z. Liu, K. Q. Weinberger, and L. van der Maaten, “Densely connected convolutional networks,” in CVPR, 2017.
 [35] M. S. Ibrahim, S. Muralidharan, Z. Deng, A. Vahdat, and G. Mori, “A hierarchical deep temporal model for group activity recognition,” in CVPR, 2016.
 [36] D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri, “Learning spatiotemporal features with 3D convolutional networks,” in ICCV, 2015.
 [37] K. Simonyan and A. Zisserman, “Two-stream convolutional networks for action recognition in videos,” in NIPS, 2014.
 [38] H. Kuehne, A. B. Arslan, and T. Serre, “The language of actions: Recovering the syntax and semantics of goal-directed human activities,” in CVPR, 2014.
 [39] M. Rohrbach, A. Rohrbach, M. Regneri, S. Amin, M. Andriluka, M. Pinkal, and B. Schiele, “Recognizing fine-grained and composite activities using hand-centric features and script data,” International Journal of Computer Vision, vol. 119, no. 3, pp. 346–373, 2016.