Decoupled Learning for Factorial Marked Temporal Point Processes
Abstract
This paper introduces the factorial marked temporal point process model and presents efficient learning methods. In conventional (multi-dimensional) marked temporal point process models, an event is often encoded by a single discrete variable, i.e. a marker. In this paper, we describe factorial marked point processes whereby a timestamped event is factored into multiple markers. Accordingly, the size of the infectivity matrix modeling the effect between pairwise markers grows in power order w.r.t. the size of the discrete marker space. We propose a decoupled learning method with two learning procedures: i) directly solving the model based on two techniques, the Alternating Direction Method of Multipliers and the Fast Iterative Shrinkage-Thresholding Algorithm; ii) a reformulation that transforms the original problem into a Logistic Regression model for more efficient learning. Moreover, a sparse group regularizer is added to identify the key profile features and event labels. Empirical results on real-world datasets demonstrate the efficiency of our decoupled and reformulated method. The source code is available online.
I Introduction and Background
Events are ubiquitous across different domains and applications. In e-commerce, events refer to the transactions associated with users and items. In health informatics, an event sequence can be a series of treatments for a patient over time. In predictive maintenance, events can carry important log data about when a failure occurs and what type it is. In all these examples, effectively modeling and predicting the dynamic behavior is of vital importance for practical usefulness.
Marked temporal point process A point process [daley2007introduction] is a useful tool for modeling event sequences with an arbitrary timestamp associated with each event. An event in a point process can carry extra information called a marker. The marker typically refers to the event type and lies in a discrete label space, i.e. a finite category set.¹ (¹A general concept can be found in [Gelfand2010SS]: a marked point pattern is one in which each point of the process carries extra information called a mark, which may be a random variable, several random variables, a geometrical shape, or some other information. In this paper, we focus on discrete labels for marks. The marked point process is also termed the multidimensional point process [LinigerPhD2009], where each dimension refers to a discrete mark value.)
Factorial marked temporal point process In the above-mentioned marked point process, an event is represented by a single mark as a single discrete variable, but in many application scenarios an event can carry multiple markers. For instance, a movement to a new job carries both a label for position and a label for company, which can be treated as two orthogonal markers with different values. Though such cases are ubiquitous in the real world, factorial marked point processes have drawn little attention, as the existing literature mostly works with a single marker [LinigerPhD2009, ZhouAISTATS13, XiaoAAAI17]. Inspired by Factorial Hidden Markov Models [ghahramani1996factorial], we introduce the factorial marked temporal point process, in which an event is represented by multiple markers, and propose a decoupling method to learn the process.
Intensity function and problem statement One core concept of a point process is the intensity function $\lambda(t)$, which represents the expected instantaneous rate of events at time $t$ conditioned on the history. One basic form is a constant over time, as used in the homogeneous Poisson process. Another popular form is the one used by the Hawkes process [HawkesBiometrika71]: $\lambda_m(t\mid\mathcal{H}_t)=\mu_m+\sum_{t_i<t}\alpha_{m_i,m}\,\kappa(t-t_i)$, where $\mathcal{H}_t$ denotes the event history and $\alpha_{m',m}$ is a marker-vs.-marker infectivity kernel capturing the temporal dependency between the event at $t_i$ and that at $t$.
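As a concrete sketch, the Hawkes intensity above can be computed as follows (the exponential kernel and the names `mu`, `alpha`, `beta` are our illustrative assumptions, not the paper's notation):

```python
import numpy as np

def hawkes_intensity(t, history, mu, alpha, beta=1.0):
    """Conditional intensity of a marked Hawkes process for every mark m:
    lambda_m(t | H_t) = mu_m + sum_{t_i < t} alpha[m_i, m] * kappa(t - t_i),
    here with an exponential kernel kappa(dt) = beta * exp(-beta * dt)."""
    lam = mu.astype(float).copy()
    for t_i, m_i in history:          # history: list of (timestamp, mark) pairs
        if t_i < t:
            lam += alpha[m_i] * beta * np.exp(-beta * (t - t_i))
    return lam                        # shape (K,): one intensity per mark

mu = np.array([0.1, 0.2])             # base rates for K = 2 marks
alpha = np.array([[0.5, 0.1],         # alpha[m', m]: effect of mark m' on mark m
                  [0.2, 0.3]])
lam = hawkes_intensity(2.0, [(0.5, 0), (1.2, 1)], mu, alpha)
```

With non-negative `alpha`, every past event only raises the intensity above the base rate, which is the self-/mutual-excitation mentioned above.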
In this paper, we are interested in describing a factorial marked point process for the event marker prediction task, using the historical event information and the individual-level profile of an event taker. We focus on next-event label estimation, distributed over more than one marker. In particular, our empirical study focuses on individual-level next-job prediction involving both position and company for LinkedIn users, and on predicting the duration of stay in the current ICU department and the transition to the next ICU department for patients in the MIMIC-II database [goldberger2000physiobank].
II Related Work and Contribution
Learning for temporal point processes The point process is a powerful tool for modeling event sequences with timestamps in continuous time. Early work dates back to the Hawkes process [HawkesBiometrika71], which is well suited to self-exciting and mutually-exciting processes such as earthquakes and their aftershocks [OgataJASA88, OgataJASA98]. Learning is fulfilled by maximum likelihood estimation, directly computing the gradient and Hessian matrix of the log-likelihood function [ozaki1979maximum]. More recently, modernized machine learning approaches have devised efficient algorithms for learning the parameters of specified point processes. A nonparametric Expectation-Maximization (EM) algorithm is proposed in [LewisJNS2011] for multiscale Hawkes processes using the majorization-minimization framework, showing superior efficiency and robustness compared with sampling-based estimation methods. [ZhouAISTATS13] extends the technique to the multidimensional Hawkes process by adding a low-rank and sparsity regularization term to the maximum likelihood estimation (MLE) based loss function.
Factorial model Though almost all of the works mentioned above involve the infectivity matrix in model parameter learning, none of them considers the factorial temporal point process case, i.e. an event type factored into multiple markers, which leads to an explosion of the infectivity matrix size. The idea of factorizing events or states into multiple variables is employed in [ghahramani1996factorial] for Hidden Markov Models (HMM), using variational methods for data mining tasks like capturing statistical structure, but little literature exists on its utility in point processes. To the best of our knowledge, this is the first work on factorial marked point process learning for event marker prediction. Note that timestamp prediction can be approximated by predicting a predefined time interval as a time-duration marker, as done in this paper.
Sparse regularization for point processes Sparse regularization is a well-established technique in traditional classification and regression models, such as the $\ell_1$ regularizer [ng2004feature], the group Lasso [meier2008group], and the sparse group regularizer [simon2013sparse]. Recent point process models have also found applications for it, like the sparsity-inducing regularization used in [li2014learning] for the social infectivity matrix, the nuclear norm in [ZhouAISTATS13] to obtain a low-rank network structure, and the group Lasso in [xu2016icu] for feature selection. We propose to use the sparse group regularizer, which encourages the nonzero elements to concentrate in a few columns, consistent with the intuition that only a few features and labels play a major role in the event dynamics. We find little prior work on group sparse regularizers for point process learning.
Contributions The main contributions of this paper are:
1) We introduce the concept of the factorial marked point process for event marker prediction, and propose a decoupled learning algorithm that simplifies the factorial model by decoupling the mutual relations among markers in modeling. The method outperforms the general marked point process on real-world datasets.
2) We present a multi-label Logistic Regression (LR) perspective and devise a reformulation for a class of point process discriminative learning problems. It eases the learning of these processes by using an off-the-shelf LR solver.
Besides these major contributions, we also propose a regularized learning objective, which we include for completeness.
III Proposed Model and Objective Function
III-A Factorial point process
A factorial point process refers to a process in which an event can be factorized into multiple markers. Besides the job movement prediction and ICU department prediction mentioned in the Introduction, many application cases can be described by a factorial point process but have not been explored yet. For instance, a weather forecast containing temperature, humidity, precipitation, and wind can be seen as a factorial point process with four markers, each having discrete or continuous values. Obviously these factors affect each other, e.g. today's humidity is influenced by the precipitation and temperature of the past few days. A conventional marked point process can only model one of these factors using a single marker, without considering the infectivity between the factors. A factorial point process with multiple markers per event is therefore essential.
Learning a factorial point process is challenging. Take job movement prediction with two markers, company and position, as an example: to predict the probability of a user's next job, we need to learn a tensor representing the impact of history companies on the next company and position, and the impact of history positions on the next company and position, respectively. In point process terms, this means learning a set of intensity functions covering each of these directed effects. This simple case considers no infectivity between different sequences; if we also consider the impact of another user's job movements on this user's choice, we would need an even larger tensor to measure the complete infectivity, with extra cross-sequence intensities.
There are ways to simplify factorial point process learning; e.g. we can treat the combination of multiple factors as one marker and use the conventional marked point process model, but this leads to an explosion of the infectivity matrix size. In this paper, we explore a simple decoupling solution that decouples the factorial point process into separate models for the different markers. As shown in Fig. 1 for the two-marker instance, we decouple the original infectivity matrix into smaller blocks by introducing separate intra- and inter-influence variables. We present the decoupled model in detail in the following section.
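A back-of-the-envelope comparison of parameter counts illustrates why the combined-marker route explodes while the decoupled one stays manageable (counting only pairwise infectivity entries; the helper names are ours):

```python
def coupled_params(marker_sizes):
    """Treat the marker combination as one label: the joint space is the
    product of the individual sizes, and the infectivity matrix is its square."""
    joint = 1
    for k in marker_sizes:
        joint *= k
    return joint * joint

def decoupled_params(marker_sizes):
    """Decoupled model: intra- and inter-influence blocks between every pair
    of markers, i.e. (sum of sizes)^2 entries in total."""
    total = sum(marker_sizes)
    return total * total

# LinkedIn-Career markers from Sec. V-A: 57 companies, 10 positions, 4 durations
print(coupled_params([57, 10, 4]))    # 5198400
print(decoupled_params([57, 10, 4]))  # 5041
```

On the LinkedIn-Career marker sizes, decoupling shrinks the pairwise infectivity parameterization by roughly three orders of magnitude.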
III-B Decoupled learning for factorial point process
More generally, we discuss the situation where an event can be factorized into multiple markers. Given an event sequence in which each event carries a marker for every dimension, the intensity function of a conventional marked point process model for a combined marker value is defined by:
(1) 
where the time-invariant features of the sequence taker are extracted from its profile, like the Self-introduction of LinkedIn users or the patients' diagnoses in the MIMIC-II database, and are weighted by the corresponding coefficients.
For the choice of the three functions in Eq. 1, there are many forms in the literature that can be abstracted by the above framework; popular ones are depicted in Table I.
For the marked point process model, when the marker contains multiple label dimensions, one major bottleneck is that the model involves an infectivity matrix whose size grows with the square of the combined marker space, to measure the directional effect between every pair of combined marker values. More generally, if all dimensions share the same number of values, the size of the infectivity matrix grows exponentially with the number of marker dimensions, which incurs learning difficulty.
To mitigate the challenge of learning the above parameter matrix, especially when the marker space is large while the sequences are relatively short, we propose the decoupled factorial point process model, which linearly decouples the above intensity function into interdependent point processes, one per marker, written as:
(2) 
where binary indicators connect the process to the influence of one's most recent marker values. The row vectors of the parameter matrix capture the intra-influence within a marker and the inter-influence between pairs of markers. These vectors are illustrated in Fig. 1 for the two-marker case.
In fact, the functions are predefined and some embodiments can be chosen from Table I. For the time being, we do not specify these functions but instead focus on solving the learning problem in a general setting.
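Since Eq. 2 is kept generic here, the following is only a hedged sketch of how a decoupled intensity for one marker might combine a profile term with intra- and inter-influence blocks (all names and the exponential link are our assumptions, not the paper's exact form):

```python
import numpy as np

def decoupled_intensity(u, last_marks, x, w, A, K_u):
    """Sketch of a decoupled intensity for marker u over its K_u labels.

    last_marks : dict v -> current value (index) of marker v
    x, w[u]    : profile features and their coefficients for marker u
    A[(v, u)]  : (K_v, K_u) influence block of marker v's value on marker u;
                 v == u gives intra-influence, v != u inter-influence
    """
    base = np.exp(x @ w[u])                 # time-invariant profile term
    score = np.zeros(K_u)
    for v, m_v in last_marks.items():
        score += A[(v, u)][m_v]             # row selected by the last value of v
    return base * np.exp(score)             # one intensity per candidate label

# toy example: marker 0 has 3 labels, marker 1 has 2 labels
w = {0: np.zeros(2)}
A = {(0, 0): np.zeros((3, 3)), (1, 0): np.zeros((2, 3))}
lam = decoupled_intensity(0, {0: 1, 1: 0}, np.ones(2), w, A, 3)
```

With all parameters at zero the intensity is uniform over the labels; training moves mass toward labels favored by the profile and the last marker values.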
A concrete example is presented in Fig.2. See the caption for more details.
III-C Next-event marker prediction with regularizer
Loss function for discriminative learning Based on the defined intensity function, we write out the probability that an event happens at time $t$, conditioned on the history:
(3) 
where the total intensity governs event occurrence, the first factor is the conditional probability that an event happens at time $t$ given the history, and the second factor is the probability that the happened event carries the observed marker given the current time and the history.
Based on the above equation, most existing point process learning methods, e.g. [LewisJNS2011, ZhouICML13, DuNIPS15], fall into the generative learning framework, aiming to maximize the joint probability of all observed events via a maximum likelihood estimator.
However, such an objective function is not tailored to the particular task at hand: instead of handling the posterior probability of the whole event sequence, we are more interested in predicting the next event and its marker information. To enable a more discriminative learning paradigm that boosts next-event prediction accuracy, a recent work [xu2016icu] suggests focusing on the conditional marker probability rather than the full sequence likelihood as the loss function for learning.
In the decoupled point process model, the dependency between different markers is already measured by the inter-influence parameters (see Eq. 2) in an independent fashion. In the same spirit, we simplify the probability here by an independence assumption across markers:
(4)  
where each factor is a normalized intensity function. This simplification leads to the following loss:
(5) 
where the indicator returns 1 if its argument is true and 0 otherwise, and the parameters collect the profile coefficients together with the intra- and inter-influence vectors.
TABLE I: Popular intensity function embodiments considered in this paper.
Model
Modulated Poisson process (MPP) [lloyd2014variational]
Hawkes process (HP) [LewisJNS2011]
Self-correcting process (SCP) [isham1979self]
Mutually-correcting process (MCP) [xu2016icu]
Sparse group regularization Since the model involves many parameters, a natural idea is to introduce sparsity to reduce complexity. Incorporating both group sparsity and overall sparsity, we use a sparse group regularizer [simon2013sparse], a combination of $\ell_1$ regularization and a group lasso. The group lasso encourages the nonzero elements to concentrate in a few columns of the whole parameter matrix, with the remainder assumed to be zeros, while the $\ell_1$ regularization encourages the whole matrix to be sparse. The rationale behind this is that only a few profile features and event marker values are the main contributors to the point process, so only a few columns will be activated. As a result, the regularized objective is:
(6) 
where one hyperparameter sets the overall regularization weight and another controls the balance between overall and group sparsity.
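For concreteness, the sparse group penalty of Eq. 6 can be evaluated as below (the names `lam` and `alpha` stand for the two hyperparameters just mentioned; the exact weighting convention is our assumption):

```python
import numpy as np

def sparse_group_penalty(W, lam, alpha):
    """Sparse group regularizer [simon2013sparse]: alpha weighs the overall
    l1 sparsity, (1 - alpha) the column-wise group lasso."""
    l1 = np.abs(W).sum()
    group = np.linalg.norm(W, axis=0).sum()   # l2 norm of each column, summed
    return lam * (alpha * l1 + (1.0 - alpha) * group)

W = np.array([[3.0, 0.0],
              [4.0, 0.0]])                    # only the first column is active
print(sparse_group_penalty(W, lam=1.0, alpha=0.5))  # 0.5*7 + 0.5*5 = 6.0
```

The group term charges each column by its Euclidean norm, so inactive columns (here the second) contribute nothing, matching the "few activated columns" intuition above.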
IV Learning Algorithm
In this section, we first present our algorithm tailored to the presented model. Then we give a new perspective and show how to reformulate the problem as a Logistic Regression task.
Following the ADMM scheme, we propose a FISTA-based [beck2009fast] method with line search and two soft-shrinkage operators to solve the ADMM subproblems. The whole algorithm is summarized in Alg. 2.
IV-A Soft Shrinkage Operator
First we review two soft-shrinkage operators [donoho1995noising] that solve the following two basic minimization problems, which will be used in our algorithm.

The minimization problem $\min_{x}\;\tfrac{1}{2}\|x-z\|_2^2+\tau\|x\|_1$, with $\tau>0$, has a closed-form solution given by the element-wise soft-shrinkage operator $[\mathcal{S}_{\tau}(z)]_i=\operatorname{sign}(z_i)\max(|z_i|-\tau,0)$, where $\operatorname{sign}(\cdot)$ is the sign function.

The minimization problem $\min_{X}\;\tfrac{1}{2}\|X-Z\|_F^2+\tau\sum_{j}\|X_{\cdot j}\|_2$, with $\tau>0$, has a closed-form solution given by the column-wise (group) soft-shrinkage operator $[\mathcal{S}^{g}_{\tau}(Z)]_{\cdot j}=\max\!\big(1-\tau/\|Z_{\cdot j}\|_2,\,0\big)\,Z_{\cdot j}$.
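Both operators have one-line implementations (a sketch; `tau` denotes the threshold):

```python
import numpy as np

def soft_shrink_l1(z, tau):
    """Element-wise soft-thresholding: prox of tau * ||.||_1."""
    return np.sign(z) * np.maximum(np.abs(z) - tau, 0.0)

def soft_shrink_group(Z, tau):
    """Column-wise shrinkage: prox of tau * (sum of column l2 norms);
    a column is zeroed out entirely when its norm falls below tau."""
    norms = np.linalg.norm(Z, axis=0, keepdims=True)
    scale = np.maximum(1.0 - tau / np.maximum(norms, 1e-12), 0.0)
    return Z * scale
```

The first zeroes individual entries, yielding overall sparsity; the second zeroes whole columns, yielding the group sparsity used by the regularizer in Eq. 6.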
IV-B ADMM iteration scheme
To solve the minimization problem defined in Eq. 6 using the ADMM solver, we introduce two auxiliary variables, one for the $\ell_1$ term and one for the group-lasso term. The augmented Lagrangian function for Eq. 6 is
(7) 
where the dual variables and penalty parameters follow the standard ADMM construction.
The iterative scheme is given by:
(8)  
(9)  
(10) 
IV-C FISTA with line search
IV-D Reformulating to a Logistic Regression task
Based on the intensity function in Eq. 2 and the loss function in Eq. 5, we show how to reformulate the learning of the decoupled point process as a multiclass Logistic Regression task. One obvious merit of this reformulation is the reuse of off-the-shelf LR solvers, e.g. http://www.yelab.net/software/SLEP/ [Liu:2009:SLEP:manual], with little parameter tuning. In contrast, the algorithm presented in Alg. 2 involves more parameters and is more computationally costly, as shown in Fig. 4 and Table II.
For an event taker at time $t$, by separating the taker's features from the parameters, the conditional intensity function in Eq. 2 can be written as:
(12) 
Therefore the probability can be written as:
(13) 
This is exactly the probability of a sample belonging to a given class under a Softmax classifier, with the stacked coefficients as its parameter.
Hence the log-loss function in Eq. 5 becomes:
which is the sum of the Softmax classifiers' loss functions.
So far we have reformulated the decoupled learning of the factorial marked point process as the learning of a set of Softmax classifiers, one per marker. Each classifier takes the constructed feature vector as its sample and classifies it into one of the corresponding marker's labels. In the following experiments, we show that the reformulated learning method in fact optimizes the same loss function as Alg. 2.
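The per-marker softmax probability of Eq. 13 reduces to the standard multiclass LR form, e.g. (`theta` and `z` are generic stand-ins for the stacked parameters and the constructed feature vector):

```python
import numpy as np

def softmax_prob(theta, z):
    """P(label = k | z) for a K-class softmax classifier: theta is (K, d),
    z is the (d,) feature [profile ; indicators of current marker values]."""
    s = theta @ z
    s -= s.max()                 # subtract max for numerical stability
    e = np.exp(s)
    return e / e.sum()

theta = np.zeros((3, 4))         # untrained parameters -> uniform probabilities
p = softmax_prob(theta, np.ones(4))
pred = int(p.argmax())           # next-marker prediction as in Sec. IV-E
```

Any off-the-shelf multinomial LR solver fitting `theta` against the observed next-marker labels thus optimizes the same per-marker loss.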
IV-E Event marker prediction
After learning the parameters, we can predict the next event's markers given the history by computing the class probabilities (see Eq. 13) and taking the arg max for each marker. It is important to note that though our model technically only issues discrete output, as it is inherently a classification model, in practice a future event's timestamp can be predicted via an approximated discrete duration, as done in our experiments. In this regard, we treat the future timestamp as a marker.
V Empirical Study and Discussion
V-A Dataset and protocol
To verify the potential of the proposed model, we apply it to a LinkedIn-Career dataset crawled and de-identified from LinkedIn to predict a user's next company, next position and duration of the current job, and to an ICU dataset extracted from the public medical database MIMIC-II [goldberger2000physiobank] to predict a patient's transition to the next ICU department and duration of stay in the current department. Experiments are conducted under 64-bit Ubuntu 16.04 LTS, with an i7-5557U 3.10GHz ×4 CPU and 8GB RAM. For the convenience of replicating the experiments, the crawled de-identified LinkedIn-Career dataset and the code are available on GitHub: https://github.com/blade091shenwei/factorialmarkedpointprocess.
Dataset The LinkedIn-Career dataset contains users crawled from the information technology (IT) industry on LinkedIn (https://www.linkedin.com/), including their Self-introduction, Technical skills and Working Experience after a de-identification preprocess. We collect samples from the IT industry because: i) the staff turnover rate is high, which makes it easier to collect suitable samples; ii) the IT industry is most familiar to the authors, and our domain knowledge can help better curate the raw data. We extract profile features from users' Self-introduction and Technical skills, and obtain users' history companies and positions from Working Experience. After excluding samples with zero job movement, we have the LinkedIn-Career benchmark, involving 2,403 users, 57 IT companies, 10 kinds of positions and 4 kinds of durations. The dataset is to some extent representative of the IT industry. For companies, we have large corporations like Google, Facebook, Microsoft and medium-sized enterprises like Adobe, Hulu, VMware. For positions we have technical positions like engineer, senior engineer, tech lead, and management positions like manager, director, CEO. For durations, we discretize the duration of stay in a position or company as temporary (within 1 year), short-term (1-2 years), medium-term (2-3 years) and long-term (more than 3 years). The goal is to predict a user's next company from the 57 companies, next position from the 10 positions and duration of stay in the current company and position from the 4 durations.
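The duration discretization above maps a continuous stay to one of the four markers (handling of exact boundary values is our assumption):

```python
def duration_bin(years):
    """LinkedIn-Career duration marker (Sec. V-A): temporary (<1y),
    short-term (1-2y), medium-term (2-3y), long-term (>3y)."""
    if years < 1:
        return "temporary"
    if years < 2:
        return "short-term"
    if years < 3:
        return "medium-term"
    return "long-term"

print(duration_bin(0.5), duration_bin(4.0))  # temporary long-term
```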
The ICU dataset contains patients from the MIMIC-II database, including patients' diagnoses, treatment records, transitions between different ICU departments and durations of stay in the departments. The goal is to predict a patient's next ICU department from the 8 departments, including Coronary Care Unit (CC), Anesthesia Care Unit (ACU), Fetal ICU (FICU), Cardiac Surgery Recovery Unit (CSRU), Medical ICU (MICU), Trauma Surgical ICU (TSICU), Neonatal ICU (NICU), and General Ward (GW), and to predict the patient's duration of stay from 3 kinds of durations: temporary (within 1 day), short-term (1-5 days), and long-term (more than 5 days). The profile features are extracted from patients' diagnoses (ICD-9 codes of patients' diseases) and treatment records (nursing, medication, treatment).
Many peer methods are evaluated as follows:
Intensity function choices Our framework is tested with four point process embodiments, namely: i) Mutually-correcting process (MCP), ii) Hawkes process (HP), iii) Self-correcting process (SCP) and iv) Modulated Poisson process (MPP). Their characteristics are briefly compared in Table I. Note that in our experiments, all these models are learned via the reformulated LR algorithm described in Sec. IV-D.
Comparison to classic Logistic Regression We also test a non-point-process approach, i.e. plain LR. The point-process-based LR solver takes the constructed feature vector as input, while the plain LR takes the raw features, including the user profile features and binary indicators representing one's current state, without considering the history states.
Comparison to RNN and RMTPP We also experiment with an RNN by treating the prediction task as a sequence classification problem. A dynamic RNN that can compute over sequences of variable length is implemented.
Moreover, to explore the effect of discretizing the time interval for duration prediction, we also experiment with RMTPP (Recurrent Marked Temporal Point Process) proposed by [du2016recurrent]. Instead of predicting a discrete label for the duration, it gives a continuous prediction.
Prediction performance metrics We use prediction accuracy (AC) to evaluate the model, with variants denoting the prediction accuracy (i.e. # correct ones out of total predictions) for each individual marker (company, position and duration for LinkedIn-Career; transition and duration for ICU) and for the joint prediction over all markers.
To evaluate our discrete duration prediction against RMTPP, both MSE (Mean Squared Error) and AC are computed. To compute prediction MSE, the predicted discrete duration is substituted by the midpoint of its discrete interval, e.g., 0.5 years for a temporary stay, 1.5 years for a short-term stay and 4 years for a long-term stay. To compute prediction AC, the predicted continuous duration of RMTPP is discretized using the same criterion as the proposed model.
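The midpoint substitution for the MSE comparison can be sketched as follows (only the three midpoints stated above are covered; the medium-term midpoint is not listed in the text, so it is omitted here):

```python
import numpy as np

# Midpoints from Sec. V-A used to score discrete duration predictions:
MIDPOINT = {"temporary": 0.5, "short-term": 1.5, "long-term": 4.0}

def discrete_mse(pred_bins, true_years):
    """MSE after substituting each predicted bin by its interval midpoint."""
    pred_years = np.array([MIDPOINT[b] for b in pred_bins])
    return float(np.mean((pred_years - np.asarray(true_years)) ** 2))

print(discrete_mse(["long-term"], [2.0]))  # (4.0 - 2.0)^2 = 4.0
```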
For the LinkedIn-Career data, we further compute precision curves for the top-K position, company and duration predictions, as shown in Fig. 3. These metrics are widely used for recommender systems. In fact, as our model predicts the next company, next position and duration given the career history, it can be used for recommending companies and positions for the predicted time period.
All experimental results are obtained by 10-fold cross validation, as commonly adopted, e.g. in [YanIJCAI13].
TABLE II: Accuracy (%), running time and iteration count of the ADMM solver (Alg. 2) vs. the reformulated LR solver; accuracy columns are company/transition, position (n/a for ICU), duration and joint prediction.
Dataset  Method  AC-1  AC-2  AC-3  AC-joint  Time  Iter. #
Career  Alg. 2  32.81  60.67  52.41  10.74  123.8m  147.7
LR  33.58  60.13  53.96  10.96  46.8s  11.2
ICU  Alg. 2  76.63  —  55.74  45.64  764.2m  121.5
LR  76.98  —  55.63  45.55  55.1s  9.7
TABLE III: Prediction accuracy (%) on LinkedIn-Career for short, long and all sequences.
company pred. accuracy  position pred. accuracy  duration pred. accuracy  joint pred. accuracy
sequence  intensity  decoupled  coupled  uni-com  decoupled  coupled  uni-pos  decoupled  coupled  uni-dur  decoupled  coupled  uni-cpt
short  HP  15.37  14.23  14.50  56.99  56.99  56.99  38.97  36.61  36.61  4.59  4.05  4.26 
SCP  17.99  10.01  12.61  58.98  49.75  55.38  43.56  41.81  38.22  4.96  2.69  2.11  
MPP  15.58  13.58  13.96  56.99  56.99  56.99  40.16  36.61  36.61  4.71  3.72  4.21  
MCP  18.54  11.04  13.14  59.42  51.49  56.90  46.49  46.46  41.32  5.41  2.52  3.26  
LR  —  15.35  —  —  56.99  —  —  38.31  —  —  4.46  —  
RNN  —  13.18  —  —  56.98  —  —  36.94  —  —  3.77  —  
RMTPP  —  12.02  —  —  56.03  —  —  44.95  —  —  4.15  —  
long  HP  24.45  20.57  25.61  50.80  49.25  49.51  34.52  33.61  34.33  4.36  3.53  4.32 
SCP  35.41  20.49  26.46  51.00  40.54  46.96  33.27  28.99  28.64  7.47  5.46  5.84  
MPP  28.71  21.44  29.57  51.64  49.02  52.90  36.04  33.20  33.76  6.37  4.17  6.49  
MCP  50.23  29.98  44.12  60.45  50.16  55.65  47.00  40.33  37.78  14.61  7.21  10.05  
LR  —  24.59  —  —  49.51  —  —  35.47  —  —  4.14  —  
RNN  —  19.92  —  —  49.24  —  —  33.80  —  —  3.26  —  
RMTPP  —  19.88  —  —  49.17  —  —  44.07  —  —  4.58  —  
all  HP  19.33  16.23  19.15  53.50  52.95  52.95  39.85  37.72  37.80  4.39  3.20  4.05 
SCP  26.52  15.54  20.77  53.68  46.88  51.90  37.56  32.30  29.45  5.06  3.32  3.70  
MPP  21.59  16.49  21.73  54.10  52.77  54.49  39.81  35.17  35.72  4.96  3.50  4.81  
MCP  33.58  20.30  27.80  60.13  52.48  57.56  53.96  48.34  45.16  10.96  5.44  6.96  
LR  —  18.74  —  —  52.98  —  —  40.01  —  —  4.21  —  
RNN  —  16.24  —  —  52.93  —  —  35.21  —  —  3.02  —  
RMTPP  —  16.10  —  —  51.07  —  —  50.16  —  —  4.70  — 
TABLE IV: Prediction accuracy (%) on ICU.
duration prediction accuracy  transition prediction accuracy  joint prediction accuracy
sequence  intensity  decoupled  coupled  uni-duration  decoupled  coupled  uni-transition  decoupled  coupled  uni-dt
all  HP  52.48  51.64  52.91  74.61  73.31  73.77  42.85  41.62  42.40 
SCP  50.14  49.77  49.05  74.22  74.01  70.75  41.04  40.77  39.48  
MPP  53.27  51.88  52.02  74.74  73.42  72.05  43.64  42.37  43.28  
MCP  55.63  54.62  50.14  76.98  76.58  74.02  45.55  45.32  44.89  
LR  —  39.88  —  —  69.61  —  —  31.13  —  
RNN  —  47.01  —  —  70.54  —  —  36.44  —  
RMTPP  —  54.28  —  —  67.49  —  —  41.93  — 
V-B Results and discussion
We are particularly interested in analyzing the following main questions via empirical studies and quantitative results.
i) LR solver vs. ADMM solver To make a fair comparison, the LR solver and the ADMM solver (i.e. Alg. 2) share the same initial parameters, initialized by sampling from a uniform distribution, and the running time and iteration count are averaged over 10-fold cross validation. Table II compares the LR solver and the ADMM solver in terms of accuracy and time cost on the LinkedIn-Career and ICU datasets. One can find the prediction accuracy is similar while the ADMM solver is more costly, as we find it converges more slowly (see Fig. 4). Also, as shown in Alg. 2, it involves more hyperparameters to tune, and they have been tuned to their best performance. Comparing the LR reformulated via the point process framework with the raw LR using only the user profile, we find the former outperforms in most cases in Tables III and IV.
Comparing running times in Table II, the LR solver has better scalability than the ADMM solver. This is because the ADMM solver is a general algorithm for convex optimization with sparse group regularization, while the LR solver is specially designed for objectives that can be reformulated into a Logistic Regression loss. Many algorithmic optimizations for Logistic Regression can be used in the LR solver, like the Efficient Projection in SLEP [Liu:2009:SLEP:manual].
ii) Decoupled learning vs. RNN As shown in Tables III and IV, the decoupled marked point process model performs much better than the RNN. This is because the next-event prediction task on relatively short sequences, like the LinkedIn-Career and ICU datasets, is not a typical sequence classification task: we need to make a prediction at every step of the sequence, rather than only at the end of the whole sequence. That means the end-to-end RNN sequence classification model has to deal with sequences of considerably variable length, including a large number of very short sequences.
We also compare the accuracy of RMTPP, which makes continuous duration predictions, with the decoupled model and the general RNN in Tables III and IV. Though the duration prediction accuracy of RMTPP improves upon the general RNN, the decoupled model still performs better.
To further verify the effect of discretizing the time interval for duration prediction, the MSE of the decoupled MCP (de-MCP) and RMTPP is also compared in Table V. Admittedly, a wrong discrete prediction tends to incur a smaller MSE than a wrong continuous one, e.g., if a ground-truth medium-term duration of 2.5 years is misclassified as a long-term duration of 4 years by de-MCP, RMTPP may give a continuous prediction of 10 years; still, the MSE of de-MCP is not only relatively but also absolutely small. This shows that the discretization of the time interval is reasonably rational.
iii) Infectivity matrix decoupling vs. coupling Tables III and IV also compare our decoupled model (see Eq. 2) against the raw coupled model (see Eq. 1) and the simplified single-marker models that consider only one marker at a time, termed uni-com, uni-pos and uni-dur for LinkedIn-Career, and uni-duration and uni-transition for ICU, respectively. The uni-cpt entry involves no new model; it combines the outputs of uni-com, uni-pos and uni-dur into a joint prediction. The results show that the decoupled model consistently achieves the best performance, which we attribute to the reduction of model complexity given relatively limited training data.
Comparing the accuracies in Table III for LinkedIn-Career and Table IV for ICU, we can see that the accuracy improvement of the decoupled model over the coupled or single-marker models is more remarkable for LinkedIn-Career than for ICU. The reason is that for LinkedIn-Career, the coupled state space is decoupled from 57 × 10 × 4 = 2,280 joint states to 57 + 10 + 4 = 71, while for ICU it is decoupled from 8 × 3 = 24 to 8 + 3 = 11. The LinkedIn-Career dataset has a much larger coupled state space than ICU, so when decoupled into smaller state spaces, the improvement for LinkedIn-Career is more notable.
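The state-space arithmetic behind this argument, using the dataset statistics from Sec. V-A:

```python
# Coupled joint-marker space vs. decoupled per-marker spaces
career_coupled, career_decoupled = 57 * 10 * 4, 57 + 10 + 4
icu_coupled, icu_decoupled = 8 * 3, 8 + 3
print(career_coupled, career_decoupled)  # 2280 71
print(icu_coupled, icu_decoupled)        # 24 11
```

The reduction factor is about 32× for LinkedIn-Career but only about 2× for ICU, consistent with the larger accuracy gain observed on LinkedIn-Career.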
iv) Choice of intensity function There are many popular intensity forms; some are listed in Table I. According to Tables III and IV, the mutually-correcting process (MCP) consistently shows superior performance over the other intensity function embodiments. This verifies two simple assumptions: i) the intensity tends to decrease at the moment an event happens, i.e., the desire for a new job is suppressed once a new job is fulfilled, and a patient's demand for transition to the next ICU department decreases after moving into a new department; ii) the probability of future events is influenced by the history events, i.e., one's transition probability to a new job is related to his/her career history, and a patient's future ICU department transitions are related to his/her treatment history.
v) Influence of sequence length To further explore the performance behavior, we experiment on short sequences and long sequences separately on LinkedIn-Career (ICU data is not included in the length test because ICU patients have no more than three transitions). Results in Table III show that the decoupled MCP algorithm has more advantage in long-sequence prediction, suggesting that decoupled MCP makes better use of history information.
TABLE V: Duration prediction MSE of RMTPP vs. the decoupled MCP (de-MCP).
dataset  Career  ICU
model  RMTPP  de-MCP  RMTPP  de-MCP
MSE  9.625  2.934  14.602  4.272
TABLE VI: Prediction accuracy (%) of decoupled MCP under different regularization settings.
dataset  marker  w/o sparse  group lasso  sparse group
Career  company  29.16  31.47  33.58 
position  56.99  58.16  60.13  
duration  50.56  52.33  53.96  
joint (3)  9.53  10.04  10.96  
ICU  duration  52.68  55.13  55.63 
transition  73.45  76.09  76.98  
joint (2)  42.20  45.49  45.55 
vi) Influence of sparsity. To verify the effect of sparse group regularization, we compare the accuracy of the decoupled MCP model under different regularization settings: without sparse regularization, with group lasso for group sparsity, and with sparse group regularization (a combination of ℓ1 regularization and group lasso; see Eq. 6) for both overall sparsity and group sparsity. As shown in Table VI, the sparse group regularizer outperforms the other two settings.
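The combined penalty can be sketched as follows. This is a minimal illustration of the sparse group regularizer (ℓ1 plus group lasso), not the paper's exact Eq. 6; the group partition and weights are assumed for the example.

```python
import numpy as np

def sparse_group_penalty(A, groups, lam1=0.1, lam2=0.1):
    """Sketch of a sparse group regularizer on a coefficient matrix A.

    lam1 * ||A||_1 promotes overall (element-wise) sparsity;
    lam2 * sum of group-wise l2 norms (group lasso) zeroes out whole
    groups, e.g. all coefficients of one profile feature or marker.
    `groups` is a list of row-index arrays (an assumed partition).
    """
    l1 = lam1 * np.abs(A).sum()
    group = lam2 * sum(np.linalg.norm(A[g, :]) for g in groups)
    return l1 + group

A = np.array([[0.0, 2.0],
              [0.0, 0.0],
              [1.0, 1.0]])
groups = [np.array([0, 1]), np.array([2])]  # hypothetical feature blocks
print(sparse_group_penalty(A, groups, lam1=1.0, lam2=1.0))
```

The ℓ1 term alone would zero individual entries; the group term additionally drives entire feature blocks to zero, which is what enables the block-level selection discussed next.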
We also explore the feature-selection functionality by investigating the magnitudes of the elements in the learned coefficient matrix. Each element measures the influence of a profile feature or marker on an event label: small (large) values indicate that the corresponding feature or marker has little (high) influence on the label. For example, in the LinkedIn-Career dataset, the numerical values in the coefficient column vector corresponding to one position marker are all nonzero, showing that working experience in that position is important in the IT industry. For another marker, most of the elements in the corresponding coefficient column vector are zero except for the rows of two positions, suggesting an ascending career path in general.
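This inspection can be mechanized as a simple ranking of coefficient columns, as sketched below; the column names and the coefficient values are hypothetical, and the norm-plus-threshold rule is one plausible way to read off what the sparse group regularizer pruned.

```python
import numpy as np

def select_influential(A, names, tol=1e-6):
    """Rank profile features/markers by the l2 norm of their coefficient
    column in A (rows = labels, columns = features/markers); columns
    whose norm falls below `tol` are treated as pruned by the sparse
    group regularizer. Names and values below are made up.
    """
    norms = np.linalg.norm(A, axis=0)
    kept = [(names[j], norms[j]) for j in range(A.shape[1]) if norms[j] > tol]
    return sorted(kept, key=lambda p: -p[1])

A = np.array([[0.8, 0.0, 0.1],
              [0.5, 0.0, 0.0]])
names = ["marker_engineer", "marker_intern", "feature_age"]  # hypothetical
print(select_influential(A, names))  # marker_intern is pruned entirely
```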
VI Conclusion
We study the problem of factorial point process learning, in which an event can carry multiple markers; a related concept can be found in Factorial Hidden Markov Models [ghahramani1996factorial]. Two learning algorithms are presented: the first directly optimizes the raw regularized discriminative prediction objective, employing the ADMM and FISTA techniques; the second is a simple Logistic Regression solver based on a key reformulation of the raw objective. Experimental results on two real-world datasets corroborate the effectiveness of our approach.
Acknowledgments
The work is partially supported by the National Natural Science Foundation of China (61602176, 61628203, 61672231), the National Key Research and Development Program of China (2016YFB1001003), and the NSFC-Zhejiang Joint Fund for the Integration of Industrialization and Informatization (U1609220).
References
Weichang Wu received the B.S. degree in electronic engineering from Huazhong University of Science and Technology, Wuhan, China, in 2013. He is currently pursuing the Ph.D. degree with the Department of Electronic Engineering, Shanghai Jiao Tong University, Shanghai, China. His current research interests include data mining, especially medical information mining and disease modeling based on event sequence learning.
Junchi Yan (M'10) is currently an Associate Professor with Shanghai Jiao Tong University. Before that, he was a Senior Research Staff Member and Principal Scientist for visual computing with IBM Research, where he started his career in April 2011. He obtained the Ph.D. degree at the Department of Electronic Engineering of Shanghai Jiao Tong University, China. He received the ACM China Doctoral Dissertation Nomination Award and the China Computer Federation Doctoral Dissertation Award. His research interests are machine learning and visual computing. He serves as an Associate Editor for IEEE ACCESS and on the executive board of the ACM China Multimedia Chapter.
Xiaokang Yang (M'00-SM'04) received the B.S. degree from Xiamen University, Xiamen, China, in 1994, the M.S. degree from the Chinese Academy of Sciences in 1997, and the Ph.D. degree from Shanghai Jiao Tong University in 2000. He is currently a Distinguished Professor with the School of Electronic Information and Electrical Engineering, Shanghai Jiao Tong University, Shanghai, China. His research interests include visual signal processing and communication, media analysis and retrieval, and pattern recognition. He serves as an Associate Editor of IEEE Transactions on Multimedia and an Associate Editor of IEEE Signal Processing Letters.
Hongyuan Zha is a Professor at the School of Computational Science and Engineering, College of Computing, Georgia Institute of Technology and East China Normal University. He earned his PhD degree in scientific computing from Stanford University in 1993. Since then he has been working on information retrieval, machine learning applications and numerical methods. He is the recipient of the Leslie Fox Prize (1991, second prize) of the Institute of Mathematics and its Applications, the Outstanding Paper Awards of the 26th International Conference on Advances in Neural Information Processing Systems (NIPS 2013) and the Best Student Paper Award (advisor) of the 34th ACM SIGIR International Conference on Information Retrieval (SIGIR 2011). He was an Associate Editor of IEEE Transactions on Knowledge and Data Engineering. 