Decoupled Learning for Factorial Marked Temporal Point Processes

Weichang Wu, Junchi Yan*, Xiaokang Yang, Hongyuan Zha

W. Wu and X. Yang are with the Department of Electrical Engineering, Shanghai Jiao Tong University, Shanghai, 200240, China. J. Yan (corresponding author) is with the Department of Computer Science and Engineering, Shanghai Jiao Tong University, Shanghai, 200240, China. H. Zha is with the School of Computational Science and Engineering, College of Computing, Georgia Institute of Technology, Atlanta, Georgia, 30332, USA, and East China Normal University, Shanghai, 200062, China.

This paper introduces the factorial marked temporal point process model and presents efficient learning methods. In conventional (multi-dimensional) marked temporal point process models, an event is often encoded by a single discrete variable, i.e. a marker. In this paper, we describe factorial marked point processes, whereby a time-stamped event is factored into multiple markers. Accordingly, the size of the infectivity matrix modeling the effect between pairwise markers grows exponentially with the number of markers. We propose a decoupled learning method with two learning procedures: i) directly solving the model based on two techniques, the Alternating Direction Method of Multipliers and the Fast Iterative Shrinkage-Thresholding Algorithm; ii) a reformulation that transforms the original problem into a Logistic Regression model for more efficient learning. Moreover, a sparse group regularizer is added to identify the key profile features and event labels. Empirical results on real-world datasets demonstrate the efficiency of our decoupled and reformulated method. The source code is available online.

Factorial Temporal Point Process, Decoupled Learning, Alternating Direction Method of Multipliers, Fast Iterative Shrinkage-Thresholding Algorithm

I Introduction and Background

Events are ubiquitous across different domains and applications. In e-commerce, events refer to the transactions associated with users and items. In health informatics, an event sequence can be a series of treatments of a patient over time. In predictive maintenance, events can carry important log data on when a failure occurs and what type it is. In all these examples, effectively modeling and predicting the dynamic behavior is of vital practical importance.

Marked temporal point process. A point process [daley2007introduction] is a useful tool for modeling event sequences in which each event carries an arbitrary timestamp. An event in a point process can carry extra information called a marker. The marker typically refers to the event type and lies in a discrete label space, i.e. a finite category set (see footnote 1).

Footnote 1: A general concept can be found in [Gelfand2010SS]: a marked point pattern is one in which each point of the process carries extra information called a mark, which may be a random variable, several random variables, a geometrical shape, or some other information. In this paper, we focus on discrete labels for marks. The marked point process is also termed a multi-dimensional point process [LinigerPhD2009], where each dimension refers to a discrete mark value.

Factorial marked temporal point process. In the marked point process described above, an event is represented by a single mark, i.e. a single discrete variable, but in many application scenarios an event can carry multiple markers. For instance, a move to a new job carries both a label for the position and a label for the company, which can be treated as two orthogonal markers with different values. Although such cases are ubiquitous in the real world, factorial marked point processes have drawn little attention, as existing work mostly deals with a single marker [LinigerPhD2009, ZhouAISTATS13, XiaoAAAI17]. Inspired by Factorial Hidden Markov Models [ghahramani1996factorial], we introduce the factorial marked temporal point process, in which an event is represented by multiple markers, and propose a decoupling method to learn the process.

Intensity function and problem statement. One core concept for point processes is the intensity function $\lambda(t)$, which represents the expected instantaneous rate of events at time $t$ conditioned on the history. One basic choice is an intensity that is constant over time, as used in the homogeneous Poisson process. Another popular form is the one used by the Hawkes process [HawkesBiometrika71], in which the intensity is a base rate plus a sum, over the events in the history, of a marker-vs.-marker infectivity kernel capturing the temporal dependency between the event at the current time and each earlier event.
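As a rough illustration of the two intensity forms mentioned above, the sketch below evaluates a constant (homogeneous Poisson) intensity and a Hawkes-style intensity with marker-dependent excitation. The names `mu`, `A`, `beta` and the exponential decay kernel are assumptions made for this example; they are not notation taken from the paper.

```python
import math


def poisson_intensity(t, rate):
    """Homogeneous Poisson process: the intensity is constant over time."""
    return rate


def hawkes_intensity(t, marker, history, mu, A, beta):
    """Hawkes-style intensity of `marker` at time t.

    history: list of (t_i, m_i) pairs; only events strictly before t count.
    mu:      per-marker base rates (assumed name).
    A:       A[m][m_i] is the assumed infectivity of marker m_i on marker m.
    beta:    decay rate of the assumed exponential kernel.
    """
    excitation = sum(
        A[marker][m_i] * math.exp(-beta * (t - t_i))
        for t_i, m_i in history
        if t_i < t
    )
    return mu[marker] + excitation
```

With an empty history the Hawkes-style intensity reduces to the base rate, which is the homogeneous Poisson case.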

In this paper, we are interested in describing a factorial marked point process for the event marker prediction task, using the history event information and the individual-level profile of an event taker. We focus on next-event label estimation, distributed over more than one marker. In particular, our empirical study focuses on individual-level next job prediction involving both position and company for LinkedIn users, and on duration prediction in the current ICU department and transition prediction to the next ICU department for patients in the MIMIC-II database [goldberger2000physiobank].

II Related Work and Contribution

Learning for temporal point process. The point process is a powerful tool for modeling event sequences with timestamps in continuous time. Early work dates back to the Hawkes process [HawkesBiometrika71], which is appropriate for self-exciting and mutually-exciting processes such as earthquakes and their aftershocks [OgataJASA88, OgataJASA98]. Learning is carried out by maximum likelihood estimation, directly computing the gradient and Hessian matrix of the log-likelihood function [ozaki1979maximum]. More recent machine learning approaches devise efficient algorithms for learning the parameters of a specified point process. A nonparametric Expectation-Maximization (EM) algorithm is proposed in [LewisJNS2011] for multiscale Hawkes processes using the majorization-minimization framework, which shows superior efficiency and robustness compared with sampling-based estimation methods. [ZhouAISTATS13] extends the technique to the multi-dimensional Hawkes process by adding low-rank and sparsity regularization terms to the maximum likelihood estimation (MLE) based loss function.

Factorial model. Although almost all of the works mentioned above involve an infectivity matrix in model parameter learning, none of them considers the factorial temporal point process case, i.e. an event type factored into multiple markers, which leads to an explosion of the infectivity matrix size. The idea of factorizing events or states into multiple variables is employed in [ghahramani1996factorial] for Hidden Markov Models (HMM), using variational methods to solve data mining tasks such as capturing statistical structure, but little literature is found on its use in point processes. To the best of our knowledge, this is the first work on factorial marked point process learning for event marker prediction. Note that timestamp prediction can be approximated by predicting a predefined time interval as a duration marker, as done in this paper.

Sparse regularization for point process. Sparse regularization is a well-established technique in traditional classification and regression models, e.g. the $\ell_1$ regularizer [ng2004feature], group Lasso [meier2008group], and the sparse group regularizer [simon2013sparse]. It has also found applications in recent point process models, such as the $\ell_1$ regularization used in [li2014learning] to ensure the sparsity of the social infectivity matrix, the nuclear norm in [ZhouAISTATS13] to obtain a low-rank network structure, and the group Lasso in [xu2016icu] for feature selection. We propose to use the sparse group regularizer, which encourages the nonzero elements to concentrate on a few columns, in line with the intuition that only a few features and labels play the major role in event dynamics. We find little work in the literature on the sparse group regularizer for point process learning.

Contributions The main contributions of this paper are:

1) We introduce the concept of factorial marked point process for event marker prediction, and propose a decoupled learning algorithm to simplify the factorial model by decoupling the marker mutual relation in modeling. The method outperforms general marked point process on real-world datasets.

2) We present a multi-label Logistic Regression (LR) perspective and devise a reformulation for a class of point process discriminative learning problems. It eases the learning of these processes by using off-the-shelf LR solvers.

Besides these major contributions, we also make additional improvements in proposing a regularized learning objective, which we will include for completeness.

III Proposed Model and Objective Function

III-A Factorial point process

A factorial point process is a process in which each event can be factorized into multiple markers. Besides the job movement prediction and ICU department prediction mentioned in the Introduction, many applications can be described by a factorial point process but have not been explored yet. For instance, a weather forecast containing temperature, humidity, precipitation, and wind can be seen as a factorial point process with four markers, each marker having discrete or continuous values. These factors clearly affect each other, e.g. the humidity today is influenced by the precipitation and temperature of the last few days. A conventional marked point process can only model one of these factors with a single marker, without considering the infectivity between the factors. A factorial point process with multiple markers per event is therefore essential.

Learning factorial point process is challenging. Taking job movement prediction with two markers company and position as example: to predict the probability of user ’s -th job , we need to learn a -dimension tensor representing the impact of history companies on and , the impact of history positions on and , respectively. In point process, it means we need to learn a set of intensity functions including , , and . This simple case considers no infectivity between different sequences, i.e., if we also consider the impact of another user y’s job movement on user ’s choice of and , we would compute a -dimension tensor to measure the complete infectivity, with two extra intensities and .

There are ways to simplify factorial point process learning, e.g. we can treat the combination of multiple factors as one marker and use the conventional marked point process model, but this leads to an explosion of the size of the infectivity matrix. In this paper, we explore a simple decoupling solution that decouples the factorial point process into separate models, one for each marker. As shown in Fig.1 for the instance of two markers, we decouple the original infectivity matrix into smaller ones by introducing smaller tensor variables. We present the decoupled model in detail in the following section.

III-B Decoupled learning for factorial point process

More generally, we discuss the situation that event can be factorized into markers . Given event sequence with event marker for where , the intensity function of a conventional marked point process model for marker is defined by:


where is the time-invariant features of sequence taker extracted from its profile, like Self-introduction of LinkedIn users or patients’ diagnose in MIMIC-II database, and is the corresponding coefficients.

For the choice of the three functions , , in Eq. 1, there are many forms in the literature that can be abstracted by the above framework, and popular ones are depicted in Table I.

For marked point process model, when the marker contains multiple label dimensions, one major bottleneck is that this model involves the infectivity matrix with size to measure the directional effect between and . More generally, the size of the infectivity matrix is (assume all dimensions have same number of values by ), which incurs learning difficulty.
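To make the size argument concrete, the following sketch counts infectivity entries for a coupled and a decoupled parameterization. It assumes that the coupled model treats every combination of marker values as one composite label, and that the decoupled model lets every marker value influence every other marker value; the function names are hypothetical, and the example sizes are the LinkedIn-Career counts (57 companies, 10 positions, 4 durations) reported in the experiments.

```python
from math import prod


def coupled_states(marker_sizes):
    """Number of composite labels when all markers are merged into one."""
    return prod(marker_sizes)


def coupled_infectivity_entries(marker_sizes):
    """Entries of a composite-label vs. composite-label infectivity matrix."""
    n = coupled_states(marker_sizes)
    return n * n


def decoupled_infectivity_entries(marker_sizes):
    """Entries when marker values only interact pairwise (assumed layout)."""
    n = sum(marker_sizes)
    return n * n


sizes = [57, 10, 4]  # companies, positions, durations
print(coupled_states(sizes), coupled_infectivity_entries(sizes))
print(sum(sizes), decoupled_infectivity_entries(sizes))
```

Under these assumptions the coupled matrix grows with the product of the marker sizes, while the decoupled one grows with their sum.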

To mitigate the challenge for learning the above parameter matrix, especially when is large while the sequences are relatively short, we propose the decoupled factorial point process model to linearly decouple the above intensity function into interdependent point processes for the -th marker as written by:


where , is the binary indicators connecting through the influence of one’s former marker and markers . Note the row vector is the parameter for intra-influence within marker , and is for inter-influence between and . The above vectors are illustrated in Fig.1 when , using notation as and as .

In fact, function , , are predefined and some embodiments can be chosen from Table I. For the time being, we do not specify these functions while focus on solving the learning problem in a general setting.

A concrete example is presented in Fig.2. See the caption for more details.

Fig. 1: Decoupling infectivity matrix between two markers and from size to . Note the indicated rows for , , , .
Fig. 2: Example of our decoupling perspective on factorial event sequence learning. The raw event sequence is represented with three-marker of label space . Our decouple model treats the raw sequence as an overlay of three sequences whose marker space is , , respectively and then the whole marker space’s dimension is in linear: . The directed dash arrows between events sketch the effect from previous events to future events: not only within one of the three sequences but also across the three sequences. Two particular attentions shall be paid to our model: 1) the method is designated to predict next event’s marker but not for its continuous occurrence timestamp (see more details later in the paper). To enable next event time prediction, we discretize the time interval into several levels as illustrated by . 2) On the other hand, the accurate timestamp rather than the discretized version , is used for learning of the point process model which makes sure our model can capture the fine-grained raw time information.

III-C Next-event marker prediction with regularizer

Loss function for discriminative learning Based on the defined intensity function, we write out the probability for event happens at time , conditioned on ’s history :


where is the event intensity, and is the conditional probability that this event happens at time given history . is the probability that the happened event is given current time and history .

Based on the above equation, most existing point process learning methods e.g. [LewisJNS2011, ZhouICML13, DuNIPS15] fall into the generative learning framework aiming to maximize the joint probability of all observed events via a maximum likelihood estimator .

However, such an objective function is not tailored to the particular task at hand: instead of taking care of handling the posterior probability of the whole event sequence, we are more interested in predicting the next event and its mark information. To enable a more discriminative learning paradigm to boost the next event prediction accuracy, a recent work [xu2016icu] suggests to focus on instead of as the loss function for learning.

In the decoupled point process model, the dependency between different markers has been measured by inter-influence parameters, i.e., the dependency between process for marker and process for marker have been measured by parameter and (see Eq. 2) in an independent fashion. In the same spirit, here we simplify the probability by an independence assumption for marker and marker :


where is the normalized intensity function. This simplification leads to the following loss:


where is an indicator returning 1 if true, otherwise 0. is the parameters whereby

Modulated Poisson process (MPP) [lloyd2014variational]
Hawkes process (HP) [LewisJNS2011]
Self-correcting process (SCP) [isham1979self]
Mutually-correcting process (MCP) [xu2016icu]
TABLE I: Parametric forms of popular point processes.

Sparse group regularization. Since the model involves many parameters, a natural idea is to introduce sparsity to reduce the complexity. Incorporating both group sparsity and overall sparsity, we use a sparse group regularizer [simon2013sparse], a combination of $\ell_1$ regularization and a group lasso. The group lasso encourages the nonzero elements to concentrate on a few columns of the parameter matrix, with the rest assumed to be zero. The $\ell_1$ regularization encourages the whole matrix to be sparse. The underlying rationale is that only a few profile features and event marker values are the main contributors to the point process, so only a few columns will be activated. As a result, the regularized objective is:


where , , and , is the regularization weight, controls the balance between overall and group sparsity.
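A minimal sketch of a sparse group penalty of the kind described above is given below. It assumes that the groups are the columns of the parameter matrix and that the penalty takes the common form lam * (alpha * l1 + (1 - alpha) * group lasso); the names `lam` and `alpha` and the exact weighting are assumptions of this example rather than the paper's definition.

```python
import math


def sparse_group_penalty(W, lam, alpha):
    """Sparse group penalty on a matrix W given as a list of rows.

    l1 term:    sum of absolute values of all entries (overall sparsity).
    group term: sum over columns of the column's Euclidean norm (group sparsity).
    lam:        overall regularization weight (assumed name).
    alpha:      balance between overall and group sparsity (assumed name).
    """
    l1 = sum(abs(w) for row in W for w in row)
    n_cols = len(W[0])
    group = sum(
        math.sqrt(sum(row[j] ** 2 for row in W)) for j in range(n_cols)
    )
    return lam * (alpha * l1 + (1.0 - alpha) * group)
```

A matrix whose nonzero entries sit in a single column pays the group term only once, which is why this penalty favors a few active columns.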

IV Learning Algorithm

In this section, we first present our tailored algorithm to the presented model. Then we give a new perspective and show how to reformulate it into a Logistic Regression task.

Following the scheme of ADMM, we propose a FISTA [beck2009fast] based method with Line Search and two soft-shrinkage operators to solve the subproblems of ADMM. The whole algorithm is summarized in Alg. 2.

IV-A Soft Shrinkage Operator

First we review two soft-shrinkage operators [donoho1995noising] that solve the following two basic minimization problems and that will be used in our algorithm.

  • The minimization problem

    with , , , has a closed-form solution given by the soft-shrinkage operator defined:

    where is the sign function.

  • The minimization problem

    with , , , has a closed-form solution given by the soft-shrinkage operator defined:
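The two closed-form solutions can be sketched as follows, assuming the standard problems: an element-wise one, min over x of kappa * |x| + 0.5 * (x - v)^2, and a vector one, min over x of kappa * ||x||_2 + 0.5 * ||x - v||_2^2. The function names and the parameter name `kappa` are hypothetical.

```python
import math


def soft_shrink(v, kappa):
    """Element-wise soft shrinkage: sign(v_i) * max(|v_i| - kappa, 0)."""
    return [math.copysign(max(abs(x) - kappa, 0.0), x) for x in v]


def group_soft_shrink(v, kappa):
    """Vector soft shrinkage: max(1 - kappa / ||v||_2, 0) * v.

    Equivalently, v minus its projection onto the Euclidean ball of
    radius kappa centered at 0.
    """
    norm = math.sqrt(sum(x * x for x in v))
    if norm <= kappa:
        return [0.0] * len(v)
    scale = 1.0 - kappa / norm
    return [scale * x for x in v]
```

The element-wise operator zeroes individual small entries, while the vector operator zeroes a whole group at once when its norm falls below the threshold.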

IV-B ADMM iteration scheme

To solve the minimization problem defined in Eq. 6 using ADMM solver, we add two auxiliary variables , . The augmented Lagrangian function for Eq. 6 is


where , , , and .

The iterative scheme is given by:


Therefore the optimization of Function 7 has been divided into two sub-problems defined as Eq. 8 and Eq. 9. While for Eq. 9, the update of , it has a closed-form solution given by operator as follows


where , denotes the ball in -dimension centered at 0 with radius [donoho1995noising].

IV-C FISTA with line search

To solve Eq. 8 we define , then can be obtained by solving through a FISTA method [beck2009fast] with line search to compute the step size. The Algorithm is summarized in Alg.1

Input: from last iteration
Initialize , threshold , ,,, ;
while  do
      , ;
      ;
      while  do
            ;
            ;
      , ;
return ;
Algorithm 1 FISTA()

Input: two associated marked point process , ,, threshold
Initialize ;
while  do
      Update via ;
      Compute via Eq. 11; Update via ;
      ;
Algorithm 2 Decoupled Learning of Factorial Point Process
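For readers unfamiliar with FISTA, the following is a generic sketch of an accelerated gradient iteration with a backtracking line search for a smooth objective. It is not a transcription of Alg. 1: the function name, the default parameters, the stopping rule, and the sufficient-decrease test are assumptions chosen for illustration.

```python
import math


def fista(f, grad, x0, step=1.0, shrink=0.5, tol=1e-8, max_iter=1000):
    """Minimize a smooth function f with FISTA-style momentum.

    f, grad: objective and its gradient, both taking a list of floats.
    step:    initial step size, reduced by `shrink` during backtracking.
    """
    x = list(x0)
    y = list(x0)  # extrapolated point
    t = 1.0       # momentum parameter
    for _ in range(max_iter):
        g = grad(y)
        fy = f(y)
        gnorm2 = sum(gi * gi for gi in g)
        # Backtracking: shrink the step until the quadratic upper bound holds.
        while True:
            x_new = [yi - step * gi for yi, gi in zip(y, g)]
            if f(x_new) <= fy - 0.5 * step * gnorm2 or step < 1e-12:
                break
            step *= shrink
        t_new = (1.0 + math.sqrt(1.0 + 4.0 * t * t)) / 2.0
        beta = (t - 1.0) / t_new
        y = [xn + beta * (xn - xo) for xn, xo in zip(x_new, x)]
        change = math.sqrt(sum((xn - xo) ** 2 for xn, xo in zip(x_new, x)))
        x, t = x_new, t_new
        if change < tol:
            break
    return x
```

For example, calling `fista` on the quadratic `(x[0] - 1)**2 + 2 * (x[1] + 3)**2` from the origin converges to a point close to `[1, -3]`.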

IV-D Reformulating to Logistic Regression task

Based on the intensity function in Eq. 2 and the loss function in Sec. III-C, we show how to reformulate the learning of the decoupled point process as a multi-class Logistic Regression task. One obvious merit of this reformulation is the reuse of off-the-shelf LR solvers, e.g. [Liu:2009:SLEP:manual], with little parameter tuning. In contrast, the algorithm presented in Alg.2 involves more parameters and is more computationally costly, as shown in Fig.4 and Table II.

For event taker at time , by separating the event taker ’s feature from the parameters , the conditional intensity function in Eq. 2 can be written as:


Therefore the probability can be written as:


This is exactly the same probability function of sample belonging to class for a Softmax classifier of classes, and is the parameter.

Hence the log-loss function in Sec. III-C becomes:

which is the sum of Softmax classifiers’ loss functions.

So far we have reformulated the decoupled learning of the factorial marked point process to the learning of Softmax classifiers. For the -th classifier, it takes from the sample and classify it to one of markers . In the following experiments, we will show that the reformulated learning method in fact optimizes the same loss function as Alg.2.
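The reformulated loss can be sketched as a sum of Softmax log-losses, one per marker. The sketch assumes that each marker's intensity is the exponential of a linear score of a shared feature vector, so that the normalized intensity is a Softmax probability; the function and argument names are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable Softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def marker_log_loss(weights, features, label):
    """Negative log-probability of the observed label for one marker.

    weights:  one weight vector per candidate label of this marker.
    features: input vector shared by all classifiers (assumed layout).
    """
    scores = [sum(w * x for w, x in zip(row, features)) for row in weights]
    return -math.log(softmax(scores)[label])


def decoupled_loss(weight_sets, features, labels):
    """Sum of the per-marker Softmax losses for one event."""
    return sum(
        marker_log_loss(W, features, y) for W, y in zip(weight_sets, labels)
    )
```

With all-zero weights every label is equally likely, so the loss of a marker with K labels equals log K.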

IV-E Event marker prediction

After learning the parameters, we can predict the next event markers given the history by computing the probability in Eq. 13 for every candidate label and selecting, for each marker, the label that maximizes it. It is important to note that our model technically issues only discrete outputs, as it is inherently a classification model; in practice, however, the timestamp of a future event can be predicted through an approximate discrete duration, as done in our experiments. In this regard, we treat the future timestamp as a marker.
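A minimal sketch of this prediction step is shown below, assuming one linear classifier per marker over a shared feature vector. Because the Softmax is monotone in the scores, picking the largest score gives the same label as picking the largest probability. The function name and data layout are hypothetical.

```python
def predict_markers(weight_sets, features):
    """Return, for each marker, the index of the highest-scoring label.

    weight_sets: one weight matrix per marker, each a list of weight vectors
                 (one vector per candidate label).
    features:    input vector shared by all markers (assumed layout).
    """
    predictions = []
    for weights in weight_sets:
        scores = [sum(w * x for w, x in zip(row, features)) for row in weights]
        best = max(range(len(scores)), key=scores.__getitem__)
        predictions.append(best)
    return predictions
```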

V Empirical Study and Discussion

V-A Dataset and protocol

To verify the potential of the proposed model, we apply it to two datasets: a LinkedIn-Career dataset, crawled and de-identified from LinkedIn, to predict a user's next company, next position, and duration of the current job; and an ICU dataset, extracted from the public medical database MIMIC-II [goldberger2000physiobank], to predict a patient's transition to the next ICU department and duration of stay in the current department. Experiments are conducted under 64-bit Ubuntu 16.04 LTS, with an i7-5557U 3.10GHz×4 CPU and 8GB RAM. For the convenience of replicating the experiments, the crawled de-identified LinkedIn-Career dataset and the code are available on GitHub.

Dataset. The LinkedIn-Career dataset contains users crawled from the information technology (IT) industry on LinkedIn, including their Self-introduction, Technical skills and Working Experience after a de-identification preprocessing step. We collect samples in the IT industry because: i) the staff turnover rate is high, which makes it easier to collect suitable samples; ii) the IT industry is the most familiar to the authors, and our domain knowledge helps to better curate the raw data. We extract profile features from users' Self-introduction and Technical skills, and obtain users' history of companies and positions from Working Experience. After excluding samples with zero job movement, we obtain the so-called LinkedIn-Career benchmark, involving 2,403 users, 57 IT companies, 10 kinds of positions and 4 kinds of durations. The dataset is to some extent representative of the IT industry. For companies, we have large corporations like Google, Facebook and Microsoft, and medium-sized enterprises like Adobe, Hulu and VMWare. For positions, we have technical positions like engineer, senior engineer and tech lead, and management positions like manager, director and CEO. For durations, we discretize the duration of stay in a position or company as temporary (within 1 year), short-term (1-2 years), medium-term (2-3 years) and long-term (more than 3 years). The goal is to predict a user's next company from the 57 companies, next position from the 10 positions, and duration of stay in the current company and position from the 4 durations.
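The duration discretization described above can be sketched as follows. The handling of the exact boundary values (for instance, whether a stay of exactly 1 year counts as temporary or short-term) is not specified in the text, so the half-open intervals used here are an assumption, as is the function name.

```python
def discretize_duration(years):
    """Map a duration in years to one of the four LinkedIn-Career levels.

    Assumed intervals: [0, 1) temporary, [1, 2) short-term,
    [2, 3) medium-term, [3, inf) long-term.
    """
    if years < 1:
        return "temporary"
    if years < 2:
        return "short-term"
    if years < 3:
        return "medium-term"
    return "long-term"
```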

The ICU dataset contains patients from the MIMIC-II database, including patients' diagnoses, treatment records, transitions between different ICU departments and durations of stay in the departments. The goal is to predict a patient's next ICU department from 8 departments, namely Coronary care unit (CC), Anesthesia care unit (ACU), Fetal ICU (FICU), Cardiac surgery recovery unit (CSRU), Medical ICU (MICU), Trauma Surgical ICU (TSICU), Neonatal ICU (NICU), and General Ward (GW), and to predict the patient's duration of stay from 3 kinds of duration: temporary (within 1 day), short-term (1-5 days), and long-term (more than 5 days). The profile features are extracted from patients' diagnoses (ICD9 codes of patients' diseases) and treatment records (nursing, medication, treatment).

Many peer methods are evaluated as follows:

Intensity function choices. Our framework is tested with four point process embodiments, namely: i) Mutually-correcting Process (MCP), ii) Hawkes Process (HP), iii) Self-correcting Process (SCP), and iv) Modulated Poisson Process (MPP). Their characteristics are briefly compared in Table I. Note that in our experiments, all these models are learned via the reformulated LR algorithm described in Sec. IV-D.

Comparison to classic Logistic Regression. We test a non-point-process approach, i.e. the plain LR. The point process based LR solver takes the history-dependent input derived from the intensity function, while the plain LR takes only the raw features, consisting of the user profile feature and binary indicators representing one's current state, without considering the history states.

Comparison to RNN and RMTPP We also experiment on RNN by treating the prediction task as a sequence classification problem. A dynamic RNN that can compute over sequences with variable length is implemented.

Moreover, to explore the effect of discretizing the time interval when making duration prediction, we also experiment on RMTPP (Recurrent Marked Temporal Point Process) proposed by [du2016recurrent]. Instead of predicting a discrete label for duration, it gives a continuous prediction result.

Prediction performance metrics. We use the prediction accuracy AC to evaluate the performance of the model. It is reported separately for each marker (i.e. the number of correct predictions out of the total number of predictions) and jointly for all markers, where a joint prediction counts as correct only if every marker is predicted correctly.
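The per-marker and joint accuracies can be computed as in the sketch below, which assumes that each event's labels are stored as a tuple with one entry per marker. The function name and data layout are hypothetical.

```python
def marker_accuracies(y_true, y_pred):
    """Return (per-marker accuracies, joint accuracy).

    y_true, y_pred: lists of tuples, one tuple of marker labels per event.
    The joint accuracy counts an event as correct only if all markers match.
    """
    n = len(y_true)
    n_markers = len(y_true[0])
    per_marker = [
        sum(t[k] == p[k] for t, p in zip(y_true, y_pred)) / n
        for k in range(n_markers)
    ]
    joint = sum(tuple(t) == tuple(p) for t, p in zip(y_true, y_pred)) / n
    return per_marker, joint
```

The joint accuracy can never exceed the smallest per-marker accuracy, which is consistent with the low joint numbers reported for the three-marker LinkedIn-Career task.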

To evaluate the performance of our discrete duration prediction against RMTPP, both MSE (Mean Squared Error) and AC are computed. To compute the prediction MSE, each predicted discrete duration is replaced by a representative time point of its interval, e.g., 0.5 years for a temporary stay, 1.5 years for a short-term stay and 4 years for a long-term stay. To compute the prediction AC, the continuous duration predicted by RMTPP is discretized using the same criterion as the proposed model.
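The MSE computation for discrete duration predictions can be sketched as follows. The representative values 0.5, 1.5 and 4 years are taken from the text; the value 2.5 years for medium-term is an assumption of this example, and the names are hypothetical.

```python
# Representative duration (in years) for each discrete level.
# The medium-term value is assumed; the other three are stated in the text.
REPRESENTATIVE_YEARS = {
    "temporary": 0.5,
    "short-term": 1.5,
    "medium-term": 2.5,
    "long-term": 4.0,
}


def duration_mse(true_years, predicted_labels):
    """Mean squared error between true durations and predicted levels."""
    errors = [
        (REPRESENTATIVE_YEARS[label] - years) ** 2
        for years, label in zip(true_years, predicted_labels)
    ]
    return sum(errors) / len(errors)
```

Because every prediction is mapped to one of a few bounded values, a wrong discrete prediction can only be off by a limited amount, unlike an unbounded continuous prediction.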

For LinkedIn-Career data, we further compute the precision curve for the top-K position, company and duration predictions, as shown in Fig. 3. These metrics are widely used for recommender systems. In fact, since our model predicts the next company, next position and duration given the career history, it can be used for recommending companies and posts for the predicted time period.
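A top-K accuracy of the kind plotted in Fig. 3 can be sketched as below: a prediction counts as a hit if the true label is among the K highest-scoring labels. The function name and the score layout are hypothetical.

```python
def top_k_accuracy(score_lists, true_labels, k):
    """Fraction of events whose true label is among the top-k scored labels.

    score_lists: one list of per-label scores (or probabilities) per event.
    true_labels: index of the true label for each event.
    """
    hits = 0
    for scores, label in zip(score_lists, true_labels):
        ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
        hits += label in ranked[:k]
    return hits / len(true_labels)
```

With K equal to 1 this reduces to the plain prediction accuracy.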

All the experimental results are obtained by 10-fold cross validation, as commonly adopted, e.g. in [YanIJCAI13].

Fig. 3: Top-K prediction accuracy on the collected LinkedIn-Career dataset out of 57 companies, 10 positions and 4 durations. Panels: (a) Company, (b) Position, (c) Duration, (d) Company & Position & Duration.
Fig. 4: Convergence curve of reformulated LR solver & ADMM solver (Alg.2) by similar random initialization. Left: ICU dataset from MIMIC-II; Right: LinkedIn-Career.
| Dataset | Method | Company / Transition | Position | Duration | Joint | Time | Iter. # |
|---|---|---|---|---|---|---|---|
| Career | Alg.2 | 32.81 | 60.67 | 52.41 | 10.74 | 123.8m | 147.7 |
| Career | LR | 33.58 | 60.13 | 53.96 | 10.96 | 46.8s | 11.2 |
| ICU | Alg.2 | 76.63 | - | 55.74 | 45.64 | 764.2m | 121.5 |
| ICU | LR | 76.98 | - | 55.63 | 45.55 | 55.1s | 9.7 |

TABLE II: Comparison of the raw ADMM solver (Alg.2) and the reformulated LR solver: prediction accuracy (in percent) for each marker, joint prediction accuracy, time cost, and iteration count, with random initialization over 10 trials. Time and iteration count are averages.
| Sequence | Intensity | Company: decoupled | Company: coupled | Company: uni-com | Position: decoupled | Position: coupled | Position: uni-pos | Duration: decoupled | Duration: coupled | Duration: uni-dur | Joint: decoupled | Joint: coupled | Joint: uni-cpt |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| short | HP | 15.37 | 14.23 | 14.50 | 56.99 | 56.99 | 56.99 | 38.97 | 36.61 | 36.61 | 4.59 | 4.05 | 4.26 |
| short | SCP | 17.99 | 10.01 | 12.61 | 58.98 | 49.75 | 55.38 | 43.56 | 41.81 | 38.22 | 4.96 | 2.69 | 2.11 |
| short | MPP | 15.58 | 13.58 | 13.96 | 56.99 | 56.99 | 56.99 | 40.16 | 36.61 | 36.61 | 4.71 | 3.72 | 4.21 |
| short | MCP | 18.54 | 11.04 | 13.14 | 59.42 | 51.49 | 56.90 | 46.49 | 46.46 | 41.32 | 5.41 | 2.52 | 3.26 |
| short | LR | 15.35 | | | 56.99 | | | 38.31 | | | 4.46 | | |
| short | RNN | 13.18 | | | 56.98 | | | 36.94 | | | 3.77 | | |
| short | RMTPP | 12.02 | | | 56.03 | | | 44.95 | | | 4.15 | | |
| long | HP | 24.45 | 20.57 | 25.61 | 50.80 | 49.25 | 49.51 | 34.52 | 33.61 | 34.33 | 4.36 | 3.53 | 4.32 |
| long | SCP | 35.41 | 20.49 | 26.46 | 51.00 | 40.54 | 46.96 | 33.27 | 28.99 | 28.64 | 7.47 | 5.46 | 5.84 |
| long | MPP | 28.71 | 21.44 | 29.57 | 51.64 | 49.02 | 52.90 | 36.04 | 33.20 | 33.76 | 6.37 | 4.17 | 6.49 |
| long | MCP | 50.23 | 29.98 | 44.12 | 60.45 | 50.16 | 55.65 | 47.00 | 40.33 | 37.78 | 14.61 | 7.21 | 10.05 |
| long | LR | 24.59 | | | 49.51 | | | 35.47 | | | 4.14 | | |
| long | RNN | 19.92 | | | 49.24 | | | 33.80 | | | 3.26 | | |
| long | RMTPP | 19.88 | | | 49.17 | | | 44.07 | | | 4.58 | | |
| all | HP | 19.33 | 16.23 | 19.15 | 53.50 | 52.95 | 52.95 | 39.85 | 37.72 | 37.80 | 4.39 | 3.20 | 4.05 |
| all | SCP | 26.52 | 15.54 | 20.77 | 53.68 | 46.88 | 51.90 | 37.56 | 32.30 | 29.45 | 5.06 | 3.32 | 3.70 |
| all | MPP | 21.59 | 16.49 | 21.73 | 54.10 | 52.77 | 54.49 | 39.81 | 35.17 | 35.72 | 4.96 | 3.50 | 4.81 |
| all | MCP | 33.58 | 20.30 | 27.80 | 60.13 | 52.48 | 57.56 | 53.96 | 48.34 | 45.16 | 10.96 | 5.44 | 6.96 |
| all | LR | 18.74 | | | 52.98 | | | 40.01 | | | 4.21 | | |
| all | RNN | 16.24 | | | 52.93 | | | 35.21 | | | 3.02 | | |
| all | RMTPP | 16.10 | | | 51.07 | | | 50.16 | | | 4.70 | | |

TABLE III: Accuracy comparison for different intensity functions on LinkedIn-Career (HP, SCP, MPP, MCP, see Table I). Numbers in bold denote the best or second-best accuracy on the specified metric and dataset. Learning for all point process based models is via the reformulated LR solver as discussed in the main paper. Long sequence denotes those with more than 2 job transitions. For the non-point process based (classic) LR, as well as RNN and RMTPP, a single accuracy is reported for each prediction target.
| Sequence | Intensity | Duration: decoupled | Duration: coupled | Duration: uni-duration | Transition: decoupled | Transition: coupled | Transition: uni-transition | Joint: decoupled | Joint: coupled | Joint: uni-dt |
|---|---|---|---|---|---|---|---|---|---|---|
| all | HP | 52.48 | 51.64 | 52.91 | 74.61 | 73.31 | 73.77 | 42.85 | 41.62 | 42.40 |
| all | SCP | 50.14 | 49.77 | 49.05 | 74.22 | 74.01 | 70.75 | 41.04 | 40.77 | 39.48 |
| all | MPP | 53.27 | 51.88 | 52.02 | 74.74 | 73.42 | 72.05 | 43.64 | 42.37 | 43.28 |
| all | MCP | 55.63 | 54.62 | 50.14 | 76.98 | 76.58 | 74.02 | 45.55 | 45.32 | 44.89 |
| all | LR | 39.88 | | | 69.61 | | | 31.13 | | |
| all | RNN | 47.01 | | | 70.54 | | | 36.44 | | |
| all | RMTPP | 54.28 | | | 67.49 | | | 41.93 | | |

TABLE IV: Accuracy comparison for different intensity function models on ICU dataset from MIMIC-II.

V-B Results and discussion

We are particularly interested in analyzing the following main points via empirical studies and quantitative results.

i) LR solver vs. ADMM solver. To make a fair comparison, the LR solver and the ADMM solver, i.e. Alg.2, share the same initial parameters, sampled from a uniform distribution, and the running time and iteration count are averaged over the 10-fold cross validation. Table II compares the LR solver and the ADMM solver in terms of accuracy and time cost on the LinkedIn-Career and ICU datasets. The prediction accuracy is similar, while the ADMM solver is more costly because it converges more slowly, as shown in Fig.4. Also, as shown in Alg.2, it involves more hyper-parameters to tune, and they have been tuned to their best performance. Comparing the reformulated LR under the point process framework with the raw LR that uses only the user profile, i.e. LR, we find that the former outperforms the latter in most cases in Table III and IV.

Comparing running time in Table II, the LR solver has better scalability than ADMM solver. This is because the ADMM solver is a general algorithm for convex optimization with sparse group regularization, while the LR solver works by a special design for the objectives that can be reformulated to Logistic Regression loss. Many algorithmic optimizations for Logistic Regression can be used in LR solver, like Efficient Projection in SLEP [Liu:2009:SLEP:manual].

ii) Decoupled learning vs. RNN As shown in Table III and Table IV, the decoupled marked point process model has much better performance than RNN. This is because the next-event prediction task for relatively short sequences like dataset LinkedIn-Career and ICU is not a typical sequence classification task. We need to make prediction on every step of the sequence, rather than make prediction at the end of the whole sequence. That means for the end-to-end RNN sequence classification model, it needs to deal with sequences with considerably variable length, including a large number of sequences of length .

We also compare the accuracy of RMTPP, which makes continuous duration predictions, with the decoupled model and the general RNN in Table III and Table IV. Although the duration prediction accuracy of RMTPP is improved compared to the general RNN, the decoupled model still has better performance.

To further verify the effect of discretizing the time interval when making duration prediction, the MSE of decoupled-MCP (de-MCP) and RMTPP is also compared in Table V. Admittedly, a wrong discrete prediction tends to yield a smaller MSE than a wrong continuous prediction: for example, if a ground-truth medium-term duration of 2.5 years is misclassified as long-term by de-MCP, the error is computed against the representative value of 4 years, whereas RMTPP may give a continuous prediction of 10 years. Even so, the MSE of de-MCP is small not only relative to RMTPP but also in absolute terms, which suggests that discretizing the time interval is reasonable to some extent.

iii) Infectivity matrix decoupling vs. coupling Table III and Table IV also compare our decoupled model (see Eq. 2) against the raw coupled model (see Eq. 1) and against the simplified model in which only marker , or is considered. The latter boils down to the single-dimension case, and the resulting methods are termed uni-com, uni-pos and uni-dur for the LinkedIn-Career dataset, and uni-duration and uni-transition for the ICU dataset. The uni-cpt method involves no new model; it combines the outputs of uni-c, uni-p and uni-t into a joint prediction. The results show that the decoupled model consistently achieves the best performance, which is perhaps attributable to the reduced model complexity given the relatively limited training data.
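
One plausible reading of how single-marker outputs are combined into a joint prediction is sketched below. It assumes that a joint prediction counts as correct only when every marker is correct; the function names and the toy labels are hypothetical.

```python
from typing import List, Sequence, Tuple

Joint = Tuple[str, str, str]  # (company, position, duration); assumed layout


def combine(pred_c: Sequence[str], pred_p: Sequence[str],
            pred_t: Sequence[str]) -> List[Joint]:
    """Stack three single-marker predictions into joint predictions."""
    return list(zip(pred_c, pred_p, pred_t))


def joint_accuracy(pred: Sequence[Joint], truth: Sequence[Joint]) -> float:
    """Fraction of samples for which all markers are predicted correctly."""
    if len(pred) != len(truth) or not truth:
        raise ValueError("pred and truth must be non-empty and equally long")
    return sum(p == t for p, t in zip(pred, truth)) / len(truth)


# Toy outputs of three single-dimension predictors (invented values).
joint_pred = combine(["A", "B", "A", "C"],
                     ["eng", "eng", "mgr", "mgr"],
                     ["short", "long", "short", "long"])
joint_truth = [("A", "eng", "short"), ("B", "mgr", "long"),
               ("A", "mgr", "short"), ("B", "mgr", "long")]
print(joint_accuracy(joint_pred, joint_truth))  # 0.5
```

Under this reading the joint accuracy can never exceed the lowest single-marker accuracy, which is consistent with the joint rows being the lowest in each dataset block of Table VI.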

Comparing the accuracies in Table III for LinkedIn-Career and Table IV for ICU, the improvement of the decoupled model over the coupled model or the single-dimension model is more pronounced for LinkedIn-Career than for ICU. The reason is that for LinkedIn-Career the coupled state space is decoupled from to , whereas for ICU it is decoupled from to . LinkedIn-Career thus has a larger coupled state space than ICU, so decoupling it into smaller state spaces brings a more notable improvement.
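
The gap between a coupled and a decoupled state space can be illustrated with a small calculation. The sketch assumes that coupling fuses all markers into one joint label, so the label count is the product of the marker cardinalities, while decoupling keeps one label space per marker. The cardinalities below are hypothetical and are not the values used in the paper.

```python
import math
from typing import Sequence


def coupled_size(cardinalities: Sequence[int]) -> int:
    """Number of joint labels when all markers are fused into one variable."""
    return math.prod(cardinalities)


def decoupled_size(cardinalities: Sequence[int]) -> int:
    """Total number of labels when each marker keeps its own label space."""
    return sum(cardinalities)


# Hypothetical marker cardinalities, NOT the values used in the paper.
cards = (50, 10, 3)
print(coupled_size(cards), decoupled_size(cards))  # 1500 63
```

The more markers an event carries and the larger their cardinalities, the larger this gap becomes, so a dataset with a bigger coupled state space has more to gain from decoupling.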

iv) Choice of intensity function There are many popular intensity forms, some of which are listed in Table I. According to Table III and Table IV, the mutually-correcting process (MCP) consistently outperforms the other intensity function embodiments. This supports two simple assumptions: i) the intensity tends to decrease at the moment an event happens, e.g., in job prediction the desire for a new job is suppressed once a new job is taken, and in ICU department transition prediction a patient's demand for transfer to the next department decreases after moving into a new department; ii) the probability of future events is influenced by historical events (see Table I), e.g., one's likelihood of moving to a new job is related to his/her career history, and a patient's future ICU department transitions are related to his/her treatment history.
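
Assumption i) can be illustrated with the classic single-dimension self-correcting process, whose intensity is exp(mu * t - alpha * N(t)), with N(t) the number of events before time t. This is only an analogue: the MCP form used in the paper is given in Table I, which is outside this section, and the parameter names and default values below are assumptions.

```python
import math
from typing import Sequence


def self_correcting_intensity(t: float, event_times: Sequence[float],
                              mu: float = 1.0, alpha: float = 0.5) -> float:
    """Intensity exp(mu * t - alpha * N(t)) of a self-correcting process.

    N(t) counts the events that happened strictly before t, so the
    intensity drops by a factor exp(-alpha) right after each event.
    """
    n_past = sum(1 for s in event_times if s < t)
    return math.exp(mu * t - alpha * n_past)


events = [1.0]
just_before = self_correcting_intensity(1.0, events)
just_after = self_correcting_intensity(1.0 + 1e-9, events)
print(just_after < just_before)  # True
```

The intensity grows with elapsed time and is pushed down each time an event occurs, matching the intuition that the desire for a new job is suppressed right after a job change.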

v) Influence of sequence length To further explore the performance behavior, we experiment on short sequences and long sequences separately on LinkedIn-Career. (ICU data are not included in the length test because patients in the ICU have no more than three transitions.) Results in Table III show that the decoupled MCP algorithm has a greater advantage in long-sequence prediction, suggesting that the decoupled MCP can make better use of history information.

dataset    LinkedIn (year)       ICU (day)
method     RMTPP     de-MCP      RMTPP     de-MCP
MSE        9.625     2.934       14.602    4.272
TABLE V: MSE comparison of future event duration prediction for RMTPP and decoupled MCP (de-MCP) on the LinkedIn (in years) and ICU (in days) datasets. Note that the RMTPP model predicts a continuous timestamp value for future events.
dataset   marker       w/o sparse   group lasso   sparse group
Career    company      29.16        31.47         33.58
          position     56.99        58.16         60.13
          duration     50.56        52.33         53.96
          joint (3)    9.53         10.04         10.96
ICU       duration     52.68        55.13         55.63
          transition   73.45        76.09         76.98
          joint (2)    42.20        45.49         45.55
TABLE VI: Accuracy by different regularizers. Sparse group (Eq. 6) combines the ℓ1 regularizer and group lasso.

vi) Influence of sparsity To verify the effect of sparse group regularization, we compare the accuracy of the decoupled MCP model under different regularization settings: without sparse regularization, with group lasso for group sparsity, and with sparse group regularization (a combination of ℓ1 regularization and group lasso, see Eq. 6) for both overall sparsity and group sparsity. As shown in Table VI, the sparse group regularizer achieves the highest accuracy for every marker on both datasets.
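
To make the two kinds of sparsity concrete, the following sketch applies the standard proximal operator of an unweighted sparse group penalty (an elementwise penalty plus a per-group Euclidean norm penalty), as typically used inside proximal-gradient methods such as FISTA. It is not taken from the paper: Eq. 6 is outside this section, the groups are assumed to be non-overlapping and unweighted, and the names `lam_l1` and `lam_group` are hypothetical.

```python
import math
from typing import List, Sequence


def soft_threshold(x: float, lam: float) -> float:
    """Elementwise shrinkage: proximal operator of lam * |x|."""
    if x > lam:
        return x - lam
    if x < -lam:
        return x + lam
    return 0.0


def prox_sparse_group(groups: Sequence[Sequence[float]], lam_l1: float,
                      lam_group: float) -> List[List[float]]:
    """Soft-threshold each element, then shrink each group as a whole."""
    out = []
    for g in groups:
        shrunk = [soft_threshold(x, lam_l1) for x in g]
        norm = math.sqrt(sum(v * v for v in shrunk))
        scale = max(0.0, 1.0 - lam_group / norm) if norm > 0 else 0.0
        out.append([scale * v for v in shrunk])
    return out


# Toy coefficient groups (invented values).
result = prox_sparse_group([[3.5, -4.5], [0.2, -0.1], [0.9, 0.0]],
                           lam_l1=0.5, lam_group=1.0)
print(result)
```

In this toy run the first group survives with shrunken entries, the second is removed by the elementwise step alone, and the third is removed by the group step even though one entry survived the elementwise step. This is the sense in which the combined penalty gives both overall and group sparsity.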

We also explore the feature selection functionality by investigating the magnitudes of the elements of matrix . The element measures the influence of profile feature or marker on label . Small (large) values indicate that the corresponding features or markers have little (high) influence on the label. For example, in the LinkedIn-Career dataset, the values in the coefficient column vector corresponding to marker are all nonzero, showing that work experience as an is important in the IT industry. For marker , most of the elements in the corresponding coefficient column vector are zero except for the rows of positions and , suggesting an ascending career path in general.
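
The inspection described above can be sketched as a simple scan over the columns of a learned coefficient matrix. The layout (rows as labels, columns as features or markers), the tolerance, and all names and values below are assumptions made for illustration; they are not the paper's learned coefficients.

```python
from typing import Dict, List, Sequence


def influential_labels(coef: Sequence[Sequence[float]],
                       feature_names: Sequence[str],
                       label_names: Sequence[str],
                       tol: float = 1e-8) -> Dict[str, List[str]]:
    """For each feature column, list the labels with a nonzero coefficient."""
    result = {}
    for j, feature in enumerate(feature_names):
        result[feature] = [label_names[i] for i, row in enumerate(coef)
                           if abs(row[j]) > tol]
    return result


# Toy coefficient matrix: 3 labels (rows) x 2 features (columns).
coef = [[0.8, 0.0],
        [0.3, 0.0],
        [0.5, 1.2]]
selected = influential_labels(coef, ["feature_a", "feature_b"],
                              ["label_1", "label_2", "label_3"])
print(selected)
```

A column with all entries nonzero marks a feature that matters for every label, while a mostly zero column marks a feature that matters only for a few specific labels.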

VI Conclusion

We study the problem of factorial point process learning, in which an event can carry multiple markers; a related concept can be found in Factorial Hidden Markov Models [ghahramani1996factorial]. Two learning algorithms are presented: the first works directly on the raw regularized discriminative prediction objective and employs ADMM and FISTA for optimization; the second is a simple LR solver based on a key reformulation of the raw objective. Experimental results on two real-world datasets corroborate the effectiveness of our approach.


The work is partially supported by National Natural Science Foundation of China (61602176, 61628203, 61672231), The National Key Research and Development Program of China (2016YFB1001003), NSFC-Zhejiang Joint Fund for the Integration of Industrialization and Informatization (U1609220).


Weichang Wu received the B.S. degree in electronic engineering from Huazhong University of Science and Technology, Wuhan, China, in 2013. He is currently pursuing the Ph.D. degree with the Department of Electronic Engineering, Shanghai Jiao Tong University, Shanghai, China. His current research interests include data mining, especially medical information mining and disease modeling based on event sequence learning.

Junchi Yan (M’10) is currently an Associate Professor with Shanghai Jiao Tong University. Before that, he was a Senior Research Staff Member and Principal Scientist for visual computing with IBM Research, where he started his career in April 2011. He obtained the Ph.D. degree from the Department of Electronic Engineering of Shanghai Jiao Tong University, China. He received the ACM China Doctoral Dissertation Nomination Award and the China Computer Federation Doctoral Dissertation Award. His research interests are machine learning and visual computing. He serves as an Associate Editor for IEEE ACCESS and on the executive board of the ACM China Multimedia Chapter.

Xiaokang Yang (M’00-SM’04) received the B.S. degree from Xiamen University, Xiamen, China, in 1994, the M.S. degree from Chinese Academy of Sciences in 1997, and the Ph.D. degree from Shanghai Jiao Tong University in 2000. He is currently a Distinguished Professor of School of Electronic Information and Electrical Engineering, Shanghai Jiao Tong University, Shanghai, China. His research interests include visual signal processing and communication, media analysis and retrieval, and pattern recognition. He serves as an Associate Editor of IEEE Transactions on Multimedia and an Associate Editor of IEEE Signal Processing Letters.

Hongyuan Zha is a Professor at the School of Computational Science and Engineering, College of Computing, Georgia Institute of Technology and East China Normal University. He earned his PhD degree in scientific computing from Stanford University in 1993. Since then he has been working on information retrieval, machine learning applications and numerical methods. He is the recipient of the Leslie Fox Prize (1991, second prize) of the Institute of Mathematics and its Applications, the Outstanding Paper Awards of the 26th International Conference on Advances in Neural Information Processing Systems (NIPS 2013) and the Best Student Paper Award (advisor) of the 34th ACM SIGIR International Conference on Information Retrieval (SIGIR 2011). He was an Associate Editor of IEEE Transactions on Knowledge and Data Engineering.
