Topic Modeling on Health Journals with Regularized Variational Inference


Robert Giaquinto (University of Minnesota, Keller Hall, 200 Union Street SE, Minneapolis, MN 55455), giaquinto.ra@gmail.com, and Arindam Banerjee (University of Minnesota, Keller Hall, 200 Union Street SE, Minneapolis, MN 55455), banerjee42@gmail.com
Abstract.

Topic modeling enables exploration and compact representation of a corpus. The CaringBridge (CB) dataset is a massive collection of journals written by patients and caregivers during a health crisis. Topic modeling on the CB dataset, however, is challenging due to the asynchronous nature of multiple authors writing about their health journeys. To overcome this challenge we introduce the Dynamic Author-Persona topic model (DAP), a probabilistic graphical model designed for temporal corpora with multiple authors. The novelty of the DAP model lies in its representation of authors by a persona — where personas capture the propensity to write about certain topics over time. Further, we present a regularized variational inference algorithm, which we use to encourage the DAP model’s personas to be distinct. Our results show significant improvements over competing topic models — particularly after regularization, and highlight the DAP model’s unique ability to capture common journeys shared by different authors.

machine learning, topic modeling, graphical model, regularized variational inference, healthcare
Conference: Thirty-Second AAAI Conference on Artificial Intelligence; February 2018; New Orleans, Louisiana, USA.

1. Introduction

Topic models can compactly represent large collections of documents by the themes running through them. We introduce a topic model designed for the unique challenges presented by the CaringBridge (CB) dataset. The CB dataset includes journals written by patients and caregivers during a health crisis. CB journals function like a blog, and are shared to a private community of friends and family. The full dataset includes 13.1 million journals written by approximately half a million authors between 2006 and 2016. From the CB dataset we’re interested in capturing health journeys, that is, authors writing about the same topics over time.

The challenges in topic modeling on the CB dataset stem from the asynchronous nature of authors’ posts. Specifically, authors start and stop journaling at different times — both in terms of calendar dates and how far along they are in their health journey. Additionally, authors post at irregular frequencies. While about 15% of CB authors post nearly every day, the majority of authors typically post less frequently, often corresponding to a major update, event, or anniversary of an event. What’s more, the length of these posts can range from just a few words to thousands.

State-of-the-art topic models can identify topics (Blei et al., 2003), track how topics change over time (Blei and Lafferty, 2006; Wang and McCallum, 2006; Wei et al., 2007; Wang et al., 2008), or associate authors with certain topics (Rosen-Zvi et al., 2004; Steyvers et al., 2004; McCallum et al., 2005; Mimno and McCallum, 2007). These models cannot, however, describe common narratives and the authors sharing them. We present the Dynamic Author-Persona topic model (DAP), a novel approach that represents authors by latent personas. Personas act as a soft clustering on authors based on their propensity to write about similar topics over time. Our approach is unique in multiple respects. First, unlike other temporal topic models, the words making up a topic don’t evolve over time — rather, DAP’s personas reflect the flow of conversation from one topic to the next. Second, we introduce a regularized variational inference (RVI) algorithm, an approach we use to encourage personas to be distinct from one another.

Our results show that the DAP model outperforms competing topic models, producing better likelihoods on heldout data. Finally, we demonstrate that using RVI further improves the DAP model’s performance, and results in personas that are rich and compelling descriptions of the health journeys experienced by CB authors.

The rest of the paper is as follows: in Section 2, a brief background on temporal topic models is given. Section 3 presents the DAP model. Section 4 details the model’s RVI algorithm. Section 5 introduces the evaluation dataset and procedure. Section 6 shares the results of the experiments. Finally, in Section 7 we summarize the contributions of this paper.

2. Background

Much of the research on topic modeling builds on the latent Dirichlet allocation (LDA) model (Blei et al., 2003). The LDA model doesn’t account for meta-information like authorship or time. Nevertheless, interest in LDA has endured, in part, due to its ability to richly describe topics as distributions over words and documents as mixtures of topics. In the years since LDA’s introduction, others have extended the idea to complement corpora with a variety of structures and metadata.

Author information is common in many corpora. A few topic models are designed to identify authors’ preferences for certain topics, and the relationships between authors (Rosen-Zvi et al., 2004; Steyvers et al., 2004; McCallum et al., 2005; Mimno and McCallum, 2007; Pathak et al., 2008). Corpora with a temporal structure, such as scientific journals or newspaper articles, are the focus of many temporal topic models (Blei and Lafferty, 2006; Wang and McCallum, 2006; Wei et al., 2007; Wang et al., 2008).

Temporal Topic Models. Two topic models set the standard of comparison for topic modeling on corpora with a temporal element: the dynamic topic model (DTM) (Blei and Lafferty, 2006) and the topics over time (TOT) model (Wang and McCallum, 2006). These two models represent very different approaches to modeling time in a topic model.

The TOT model defines time as an observed variable, which leads to a continuous treatment of time and the ability to predict timestamps of documents. Alternatively, the DTM evolves topics over time using a Markov process. In many corpora the evolution of topics provides interesting insights. For example, Blei’s model of the Science corpus shows words associated with a topic on physics changing over a century.

Building directly on the DTM, in 2008 Wang et al. developed the continuous time dynamic topic model (CDTM), which uses continuous Brownian motion to model the evolution of topics over time (Wang et al., 2008). This is a major development in temporal topic models because, unlike the DTM, it doesn’t require partitioning the data into discrete time periods. Instead, the model assumes that at each time step the variance in the topic proportions increases in proportion to the duration since the previous document. Similar in spirit, the Dynamic Mixture Model (DMM) is built for continuous streams of text (Wei et al., 2007). In the DMM, however, topics are fixed in time and the model captures the evolution of document-level topic proportions over time.

Topic Modeling of Health Journeys. In many topic modeling applications to temporal corpora, the time component is ignored. For example, Wen et al. model cancer event trajectories from users of an online forum for breast cancer support (Wen and Rose, 2012). Wen’s approach uses LDA to extract cancer event keywords, which are then linked together in time by temporal descriptions mined from the text. This work demonstrates a quantitative approach to studying the dynamics of social support networks, and offers a powerful look at the experiences of users in these networks.

Numerous studies have shown that support networks, both in person and online, are valuable tools for caregivers and for those suffering from chronic conditions or life-threatening illness (Wen et al., 2011; Rodgers and Chen, 2005; Beaudoin and Tao, 2007). Additionally, online social networks can serve as a way to efficiently disseminate information regarding someone’s status to their community. Understanding the health journeys of users in these social support communities is valuable for improving user experience. Topic models are uniquely suited to succinctly describing and analyzing these health journeys.

3. The DAP Model

The DAP model was designed with journaling behavior in mind. Consider a CB author journaling about their surgery: initially they may write about topics related to the surgical procedure, but as time progresses the author is more likely to discuss recovery, physical therapy, or returning to normal life. In other words, the likelihood of a topic for some document depends on where the document’s author is in their health journey. As such, DAP assumes that (1) a state space model controls the likelihood of a topic at each time step, (2) each persona represents a different flow of topics over time, and (3) each author has a distribution over personas.

The DAP’s approach for modeling topics in a document, and words in a topic, follows the correlated topic model (CTM) and LDA, respectively (Lafferty and Blei, 2006; Blei et al., 2003). The idea of modeling latent personas was originally proposed by Mimno and McCallum (2007); however, the personas in their Author-Persona Topic model (APT) differ significantly from those proposed in the DAP model. First, in the APT model each author may have a different number of personas, whereas DAP models each author as a distribution over a fixed number of personas — which acts as a soft clustering on authors. Second, APT assigns a persona to documents, which indirectly defines a topic distribution for each cluster of documents, whereas we model documents as each having their own distribution over topics. Lastly, while DAP’s personas also correspond to a distribution over topics, DAP evolves these topic distributions over time — thereby capturing the inherent temporal structure resulting from an author writing multiple documents.

The DAP model directly addresses the challenges presented by the CB dataset. First, the asynchronous nature of health journals is handled by: (1) transforming each journal’s timestamp to the time elapsed since the author’s first post, and (2) learning multiple personas to account for a wide variety of topic trajectories. Second, irregular posting behavior is managed by employing the Brownian motion model, originally used in topic modeling by Wang et al. (2008), to model topic variance as proportional to the gap in time between documents.

The generative process of the model is described below. The model assumes that each document in the corpus has a timestamp associated with it. Similar to the CDTM (Wang et al., 2008), timestamps are used in a continuous Brownian motion model to capture an increase in topic variance as time between observations increases. More formally, if $s_t$ and $s_{t-1}$ are the timestamps at steps $t$ and $t-1$, then $\Delta_{s_t, s_{t-1}}$ is the difference in time between $s_t$ and $s_{t-1}$. We use the shorthand $\Delta_{s_t}$ to denote the difference in time between timestamps $s_t$ and $s_{t-1}$. For brevity, the variance at step $t$ is denoted $\sigma^2 \Delta_{s_t}$, where $\sigma^2$ is a known process noise in the state space model.

  1. Draw distribution over words for each topic $k = 1, \ldots, K$: $\beta_k \sim \mathrm{Dir}(\eta)$.

  2. Draw distribution over personas for each author $a = 1, \ldots, A$: $\kappa_a \sim \mathrm{Dir}(\delta)$.

  3. For each persona $p = 1, \ldots, P$, draw initial distribution over topics: $\alpha_{0,p} \sim \mathcal{N}(\mu_0, \Sigma_0)$.

  4. For each time step $t$, where $t = 1, \ldots, T$:

    • Draw distribution over topics: $\alpha_{t,p} \sim \mathcal{N}(\alpha_{t-1,p}, \Sigma_t)$ for each persona $p$.

    • Update $\Sigma_t$ according to the Brownian motion model: $\Sigma_t = \sigma^2 \Delta_{s_t} I$.

    • For each document $d$, where $d = 1, \ldots, D_t$:

      1. Choose persona indicator $x_d \sim \mathrm{Mult}(\kappa_{a_d})$, where $a_d$ corresponds to the author of document $d$.

      2. Draw topic distribution $\gamma_d \sim \mathcal{N}(\alpha_t x_d, \Sigma)$ for document $d$, where $\alpha_t = [\alpha_{t,1}, \ldots, \alpha_{t,P}]$.

      3. For each word $n$, where $n = 1, \ldots, N_d$:

        1. Choose word topic indicator $z_{d,n} \sim \mathrm{Mult}(\pi(\gamma_d))$.

        2. Choose word $w_{d,n} \sim \mathrm{Mult}(\beta_{z_{d,n}})$, a multinomial probability conditioned on the topic indicator $z_{d,n}$.

Following the approach in the CTM and DTM, we use the function $\pi$ to map the Logistic Normal draw $\gamma_d$, parameterized by a mean and covariance, to the multinomial’s natural parameters via $\pi(\gamma_d)_k = \exp(\gamma_{d,k}) / \sum_{k'=1}^{K} \exp(\gamma_{d,k'})$, in order to obey the constraint that the parameters lie on the simplex.
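To make the generative process concrete, the sketch below simulates it with numpy under the notation introduced above. The dimensions, hyperparameter values, and per-step document count are illustrative assumptions, not settings taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
K, V, P, A, T = 25, 5000, 15, 100, 10   # topics, vocab, personas, authors, steps
sigma2 = 0.05                            # known process noise
delta_s = np.ones(T)                     # time elapsed between steps (uniform here)

beta = rng.dirichlet(np.full(V, 0.01), size=K)   # topic-word distributions
kappa = rng.dirichlet(np.full(P, 0.1), size=A)   # author-persona distributions
alpha = np.zeros((T, P, K))
alpha[0] = rng.normal(0.0, 1.0, size=(P, K))     # initial persona-topic draws

def pi(gamma):
    """Map a Logistic-Normal draw onto the simplex (softmax)."""
    e = np.exp(gamma - gamma.max())
    return e / e.sum()

docs = []
for t in range(1, T):
    # Brownian motion: variance grows with the time elapsed since step t-1
    alpha[t] = rng.normal(alpha[t - 1], np.sqrt(sigma2 * delta_s[t]))
    for _ in range(50):                          # documents at step t (assumed)
        a = rng.integers(A)                      # the document's author
        p = rng.choice(P, p=kappa[a])            # persona indicator x_d
        gamma_d = rng.normal(alpha[t, p], 1.0)   # document-topic draw gamma_d
        theta = pi(gamma_d)                      # map to the simplex
        words = [rng.choice(V, p=beta[rng.choice(K, p=theta)])
                 for _ in range(rng.integers(10, 100))]
        docs.append((t, a, words))
```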

The graphical model corresponding to this process is shown in Figure 1. In LDA and its extensions the parameter $\alpha$ represents a prior probability of each topic. In the DAP model, $\alpha$ takes on an expanded role: $\alpha_{t,p}$ is a distribution over topics at time step $t$ for persona $p$. The choice of letting $\alpha$ evolve over time, as opposed to the topics $\beta$ as in the DTM, reflects the fact that in a collection of journals there is less interest in changes to the topics themselves. In other words, we model the words associated with a topic as static in time, while the topics an author writes about change over time.

Figure 1. Graphical representation of the Dynamic Author-Persona topic model (DAP). On top, topic distributions for each persona evolve over time: $\alpha_{t-1,p} \rightarrow \alpha_{t,p} \rightarrow \alpha_{t+1,p}$. The distribution over words for each topic, $\beta_k$, is fixed in time. Each author $a$ is represented by a distribution over personas, $\kappa_a$. The distribution over topics for each document, $\gamma_d$, depends on the persona distribution for that document’s author and on the evolving topic distribution $\alpha_{t,p}$.

4. Variational EM Algorithm

Given the model structure, we next derive an inference algorithm to estimate the model’s latent parameters. Much like LDA and its extensions, the DAP model’s posterior:

$$p(\alpha, \kappa, x, \gamma, z \mid w) = \frac{p(\alpha, \kappa, x, \gamma, z, w)}{p(w)}$$

is intractable due to the normalization term $p(w)$. In order to learn optimal values for the model’s parameters we use a form of variational inference (VI), which approximates the difficult-to-compute posterior distribution with a simpler distribution $q$ (see Blei et al., 2016 for a review). Variational inference casts an inference problem as an optimization problem, with the goal of finding parameters of the variational distribution $q$ such that $q$ closely approximates the true posterior $p$. Our regularized variational inference (RVI) algorithm seeks a distribution $q^*$ such that

$$q^{*} = \operatorname*{arg\,min}_{q}\; \mathrm{KL}\big(q \,\|\, p\big) + \lambda\, r(q) \qquad (1)$$

where $\mathrm{KL}(\cdot \,\|\, \cdot)$ denotes the KL-divergence. The added term $r(\cdot)$ is a regularization function we’ve introduced to discourage similar personas (further detail is given in Section 4.2), and $\lambda$ is the corresponding hyperparameter.

To make $q$ easy to compute, we apply mean field variational inference, which assumes that the parameters are a posteriori independent. Under the mean field assumption the variational distribution factorizes as:

$$q(\alpha, \kappa, x, \gamma, z) = \prod_{p=1}^{P} q(\alpha_{1:T,p} \mid \hat{\alpha}_{1:T,p}) \prod_{a=1}^{A} q(\kappa_a \mid \rho_a) \prod_{t=1}^{T} \prod_{d=1}^{D_t} q(x_d \mid \lambda_d)\, q(\gamma_d \mid \hat{\gamma}_d, \hat{\nu}_d^2 I) \prod_{n=1}^{N_d} q(z_{d,n} \mid \phi_{d,n}) \qquad (2)$$

where we have introduced the following variational parameters: the persona distribution for each author is endowed with a free Dirichlet parameter $\rho_a$; each assignment of a persona to an author’s document is endowed with a free multinomial parameter $\lambda_d$; in the variational distribution of $\alpha$ the sequential structure is kept intact with variational observations $\hat{\alpha}_{t,p}$; each document-topic proportion vector is endowed with a free mean $\hat{\gamma}_d$. The variances for the document-topic parameters are $\Sigma$ and $\hat{\nu}_d^2$, for the model and variational parameters, respectively; each word-topic indicator is endowed with a free multinomial parameter $\phi_{d,n}$.

An optimal $q$ cannot be computed directly, but following Jordan et al. (1999) an optimization of the variational parameters proceeds by maximizing a bound on the log-likelihood of the data. In the DAP model, data is observed as words $w_{d,n}$ for each document $d$ at time step $t$; hence we re-write the log-likelihood of the data as:

$$\log p(w) \;\geq\; \mathbb{E}_q\big[\log p(\alpha, \kappa, x, \gamma, z, w)\big] - \mathbb{E}_q\big[\log q(\alpha, \kappa, x, \gamma, z)\big] \qquad (3)$$

The inequality in (3) follows from Jensen’s inequality. Moreover, it can be shown that the difference between $\log p(w)$ and the bound in (3) is $\mathrm{KL}(q \,\|\, p)$. Hence maximizing the bound in (3) is equivalent to minimizing the KL divergence between the variational and true posteriors. We denote the Evidence Lower BOund (ELBO) by $\mathcal{L}$. Since our objective defined in (1) includes a regularization term, we therefore maximize a surrogate likelihood consisting of the ELBO minus the regularization term (see Wainwright and Jordan, 2007 for a review of penalized surrogate likelihoods). Hence our objective function for some regularization $r(\cdot)$ is defined as:

$$\mathcal{L}_{\lambda} = \mathcal{L} - \lambda\, r(\hat{\alpha}) \qquad (4)$$

where $\mathbb{E}_q[\log p(\cdot)]$ expands over each term in the model, that is:

$$\mathbb{E}_q[\log p] = \sum_{k=1}^{K} \mathbb{E}_q[\log p(\beta_k)] + \sum_{a=1}^{A} \mathbb{E}_q[\log p(\kappa_a)] + \sum_{t=1}^{T} \sum_{p=1}^{P} \mathbb{E}_q[\log p(\alpha_{t,p} \mid \alpha_{t-1,p})] + \sum_{t=1}^{T} \sum_{d=1}^{D_t} \Big( \mathbb{E}_q[\log p(x_d \mid \kappa_{a_d})] + \mathbb{E}_q[\log p(\gamma_d \mid \alpha_t, x_d)] + \sum_{n=1}^{N_d} \big( \mathbb{E}_q[\log p(z_{d,n} \mid \gamma_d)] + \mathbb{E}_q[\log p(w_{d,n} \mid z_{d,n}, \beta)] \big) \Big) \qquad (5)$$

And, similarly, $\mathbb{E}_q[\log q]$ is the entropy term associated with each of the variational parameters. Some terms in (5) are simple and well known from foundational topic models like LDA and CTM (Blei et al., 2003; Lafferty and Blei, 2006). For example, the topic distribution over words term $\mathbb{E}_q[\log p(\beta_k)]$ is found in LDA, and in the DAP model the distribution over personas for each author, $\mathbb{E}_q[\log p(\kappa_a)]$, follows a similar structure. Similarly, the non-conjugate pair for word-topic assignment, $\mathbb{E}_q[\log p(z_{d,n} \mid \gamma_d)]$, has been studied in the CTM. For completeness, we show the expansion of the more unique terms $\mathbb{E}_q[\log p(\gamma_d \mid \alpha_t, x_d)]$ and $\mathbb{E}_q[\log p(\alpha_{t,p} \mid \alpha_{t-1,p})]$ in the Appendix.

Expanding the objective function according to the distribution associated with each parameter allows updates to be derived for each parameter. The parameters are optimized using a variational expectation-maximization algorithm; the details of the algorithm are given below.

4.1. Variational E-Step

During the E-step the model estimates variational parameters for each document and saves the sufficient statistics required to compute global parameters. The structure of the DAP model, while unique, has some components that mimic previous topic models. Specifically, the word-topic assignment parameter $\phi$ has the same update found in the CTM due to the Logistic-Normal parameterization. Hence $\phi$ has a closed form update: $\phi_{d,n,k} \propto \exp(\hat{\gamma}_{d,k})\, \beta_{k, w_{d,n}}$ (Lafferty and Blei, 2006).
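In code, this update combines the document's topic weights with the topic-word probabilities of the observed tokens, normalized over topics. A minimal sketch, assuming beta is the K x V topic-word matrix and word_ids indexes the document's tokens:

```python
import numpy as np

def update_phi(gamma_hat, beta, word_ids):
    """Closed-form CTM-style update: phi[n, k] is proportional to
    exp(gamma_hat[k]) * beta[k, w_n], normalized over topics k."""
    log_phi = gamma_hat[None, :] + np.log(beta[:, word_ids].T + 1e-100)
    log_phi -= log_phi.max(axis=1, keepdims=True)   # numerical stabilization
    phi = np.exp(log_phi)
    return phi / phi.sum(axis=1, keepdims=True)
```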

Each author’s persona assignment for document $d$ is parameterized by $\lambda_d$. To find an update for $\lambda_d$ we select the ELBO terms featuring $\lambda_d$, and then take the derivative with respect to each document and persona. The terms in the ELBO containing $\lambda_d$ are:

$$\mathcal{L}_{[\lambda_d]} = \mathbb{E}_q[\log p(x_d \mid \kappa_{a_d})] + \mathbb{E}_q[\log p(\gamma_d \mid \alpha_t, x_d)] - \mathbb{E}_q[\log q(x_d \mid \lambda_d)] + \mu \Big( \sum_{p=1}^{P} \lambda_{d,p} - 1 \Big)$$

where the last term is the Lagrange constraint ensuring each $\lambda_d$ vector sums to one. Taking the derivative with respect to one specific document $d$ and persona $p$, we find that a closed form solution for $\lambda_{d,p}$ doesn’t exist. We therefore estimate $\lambda_d$ using exponentiated gradient descent.
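Exponentiated gradient ascent keeps its iterates on the simplex through multiplicative updates followed by renormalization, which is what makes it a natural fit here. A generic sketch follows; grad_fn stands in for the gradient of the selected ELBO terms, and the step size and iteration count are assumptions:

```python
import numpy as np

def exponentiated_gradient(lam, grad_fn, step=0.1, iters=50):
    """Estimate a simplex-constrained vector (e.g. a document's persona
    weights lambda_d) by exponentiated gradient ascent."""
    for _ in range(iters):
        lam = lam * np.exp(step * grad_fn(lam))   # multiplicative update
        lam = lam / lam.sum()                     # re-project onto the simplex
    return lam
```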

Since the model includes non-conjugate terms (much like the DTM, CDTM, and CTM), an additional variational parameter $\zeta_d$ is introduced to preserve the lower bound during the expansion of the term containing a non-conjugate pair: $\mathbb{E}_q[\log p(z_{d,n} \mid \gamma_d)]$. Taking the derivative of all terms containing $\zeta_d$ and setting it to zero yields a closed form update analogous to the one found in the CTM:

$$\zeta_d = \sum_{k=1}^{K} \exp\big(\hat{\gamma}_{d,k} + \hat{\nu}_{d,k}^2 / 2\big)$$

Finally, the DAP model estimates a topic distribution for each document via the parameter $\hat{\gamma}_d$. To update $\hat{\gamma}_d$, the terms in the ELBO featuring $\hat{\gamma}_d$ are selected:

$$\mathcal{L}_{[\hat{\gamma}_d]} = \mathbb{E}_q[\log p(\gamma_d \mid \alpha_t, x_d)] + \sum_{n=1}^{N_d} \mathbb{E}_q[\log p(z_{d,n} \mid \gamma_d)] - \mathbb{E}_q[\log q(\gamma_d)]$$

Taking the derivative of these terms with respect to $\hat{\gamma}_d$ yields:

$$\frac{\partial \mathcal{L}}{\partial \hat{\gamma}_d} = -\Sigma^{-1}\Big(\hat{\gamma}_d - \sum_{p=1}^{P} \lambda_{d,p}\, \hat{\alpha}_{t,p}\Big) + \sum_{n=1}^{N_d} \phi_{d,n} - \frac{N_d}{\zeta_d} \exp\big(\hat{\gamma}_d + \hat{\nu}_d^2 / 2\big) \qquad (6)$$

Since a closed form solution isn’t available, a conjugate gradient algorithm is run using the gradient in (6).
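A sketch of the alternating updates for the bound parameter and the document-topic mean is below, using scipy's conjugate gradient routine on the negated ELBO terms. The objective and gradient follow the reconstructed expressions above, with zeta held fixed during the inner optimization; all argument names are illustrative:

```python
import numpy as np
from scipy.optimize import minimize

def update_zeta(gamma_hat, nu2):
    """Closed-form update preserving the log-sum-exp bound (as in the CTM)."""
    return np.exp(gamma_hat + nu2 / 2.0).sum()

def update_gamma_hat(gamma0, nu2, mean_d, Sigma_inv, phi, zeta):
    """Conjugate gradient step for the document-topic mean gamma_hat.
    mean_d is the persona-weighted mean sum_p lambda_dp * alpha_hat_tp,
    and phi is the N x K responsibility matrix from the phi update."""
    N = phi.shape[0]
    phi_sum = phi.sum(axis=0)

    def neg_elbo_and_grad(gamma):
        resid = gamma - mean_d
        bound = np.exp(gamma + nu2 / 2.0)
        obj = (-0.5 * resid @ Sigma_inv @ resid
               + phi_sum @ gamma
               - (N / zeta) * bound.sum())
        grad = -Sigma_inv @ resid + phi_sum - (N / zeta) * bound
        return -obj, -grad                      # scipy minimizes

    return minimize(neg_elbo_and_grad, gamma0, jac=True, method="CG").x
```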

Whereas $\hat{\gamma}_d$ represents the mean of the Logistic-Normal for a document’s topic distribution, the parameter $\hat{\nu}_d^2$ is the variance. The ELBO terms featuring $\hat{\nu}_d^2$ are:

$$\mathcal{L}_{[\hat{\nu}_d^2]} = \mathbb{E}_q[\log p(\gamma_d \mid \alpha_t, x_d)] + \sum_{n=1}^{N_d} \mathbb{E}_q[\log p(z_{d,n} \mid \gamma_d)] - \mathbb{E}_q[\log q(\gamma_d)]$$

Therefore, setting the derivative of $\mathcal{L}$ with respect to $\hat{\nu}_{d,k}^2$ to zero and solving yields:

$$-\frac{\Sigma^{-1}_{kk}}{2} - \frac{N_d}{2 \zeta_d} \exp\big(\hat{\gamma}_{d,k} + \hat{\nu}_{d,k}^2 / 2\big) + \frac{1}{2 \hat{\nu}_{d,k}^2} = 0$$

which requires Newton’s method for each coordinate, constrained such that $\hat{\nu}_{d,k}^2 > 0$.
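Since each coordinate's stationarity condition is independent, the Newton iteration vectorizes cleanly. A sketch under the reconstructed condition above, with positivity enforced by clipping:

```python
import numpy as np

def update_nu2(nu2, gamma_hat, Sigma_inv_diag, N, zeta, iters=20):
    """Per-coordinate Newton's method for the variational variances nu2 > 0."""
    for _ in range(iters):
        e = np.exp(gamma_hat + nu2 / 2.0)
        g = -0.5 * Sigma_inv_diag - (N / (2.0 * zeta)) * e + 1.0 / (2.0 * nu2)
        h = -(N / (4.0 * zeta)) * e - 1.0 / (2.0 * nu2 ** 2)  # second derivative
        nu2 = np.maximum(nu2 - g / h, 1e-10)                  # Newton step
    return nu2
```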

The parameter $\hat{\alpha}_{t,p}$ represents the noisy estimate of $\alpha_{t,p}$. After calculating $\hat{\alpha}_{t,p}$, the forward and backward equations will be applied in the M-step to give a final posterior estimate of $\alpha_{t,p}$. The terms in the ELBO containing $\hat{\alpha}_{t,p}$ are found by expanding $\mathbb{E}_q[\log p(\gamma_d \mid \alpha_t, x_d)]$ for (7a) and $\mathbb{E}_q[\log p(\alpha_{t,p} \mid \alpha_{t-1,p})]$ for (7b) and (7c):

$$-\frac{1}{2} \sum_{d=1}^{D_t} \mathbb{E}_q\big[(\gamma_d - \alpha_t x_d)^{\top} \Sigma^{-1} (\gamma_d - \alpha_t x_d)\big] \qquad (7a)$$
$$-\frac{1}{2} \mathbb{E}_q\big[(\alpha_{t,p} - \alpha_{t-1,p})^{\top} \Sigma_t^{-1} (\alpha_{t,p} - \alpha_{t-1,p})\big] \qquad (7b)$$
$$-\frac{1}{2} \mathbb{E}_q\big[(\alpha_{t+1,p} - \alpha_{t,p})^{\top} \Sigma_{t+1}^{-1} (\alpha_{t+1,p} - \alpha_{t,p})\big] \qquad (7c)$$

Taking the derivative with respect to the mean term $\hat{\alpha}_{t,p}$ for each persona gives the closed form update:

$$\hat{\alpha}_{t,p} = \Big(1 + \sum_{d=1}^{D_t} \lambda_{d,p}\Big)^{-1} \Big(\hat{\alpha}_{t-1,p} + \sum_{d=1}^{D_t} \lambda_{d,p}\, \hat{\gamma}_d\Big) \qquad (8)$$

We solve for $\hat{\alpha}_{t,p}$ sequentially over time steps. For the initial time step $t = 0$, we use the prior $\mu_0$ in place of $\hat{\alpha}_{t-1,p}$. Note that the summations in (8) are collected during the E-step and need only be computed once after performing inference on all documents.

4.2. Regularized Variational Inference

Our RVI algorithm nudges $q$ toward topic distributions that are different for each persona. A natural choice for capturing this idea is an inner product between each pair of personas (excluding a persona with itself). Hence, we define the regularization function as:

$$r(\hat{\alpha}) = \sum_{t=1}^{T} D_t \sum_{p=1}^{P} \sum_{p' \neq p} \hat{\alpha}_{t,p}^{\top}\, \Sigma^{-1}\, \hat{\alpha}_{t,p'} \qquad (9)$$

The matrix $\Sigma^{-1}$ is included in the regularization for two reasons. First, it simplifies the update to $\hat{\alpha}_{t,p}$. In (7) the term $\Sigma^{-1}$ appears in every term, which allows it to be factored out and canceled; by including $\Sigma^{-1}$ in the regularization the same cancellation can occur. Second, since $\Sigma^{-1}$ is positive definite, the quantity in (9) is a valid inner product between personas, and penalizing it has the effect of encouraging personas to be orthogonal to one another. We include the number of documents $D_t$ at time $t$ in $r(\hat{\alpha})$ so that the regularization is applied evenly, regardless of dataset size or a skewed distribution of documents over time. After including the regularization term in (9) with the ELBO terms in (7), the regularized update is:

$$\Big(1 + \sum_{d=1}^{D_t} \lambda_{d,p}\Big)\, \hat{\alpha}_{t,p} + \lambda\, D_t \sum_{p' \neq p} \hat{\alpha}_{t,p'} = \hat{\alpha}_{t-1,p} + \sum_{d=1}^{D_t} \lambda_{d,p}\, \hat{\gamma}_d \qquad (10)$$

Since the vector on the right-hand side (of length $K$) is computed during the E-step, the RHS is known. Similarly, the term $\sum_{d} \lambda_{d,p}$ is known, and in combination with $\lambda D_t$ forms the weights over the unknown vectors $\hat{\alpha}_{t,p}$, also of length $K$. Therefore, (10) can be solved as a system of linear equations. Through experiments we’ve found an optimal value of $\lambda = 0.2$. The model exhibits sensitivity to the hyperparameter $\lambda$: if $\lambda$ is large then model quality drops due to personas overfitting to a single topic. Since $\hat{\alpha}$ is only used to estimate the global parameter $\alpha$ during the M-step, computing $r(\hat{\alpha})$ isn’t necessary for inference on holdout datasets.
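Because the regularizer couples the personas at each time step, the update amounts to one P x P linear system shared across the K topic coordinates. A sketch of that solve, following the reconstructed form of (10); variable names are illustrative:

```python
import numpy as np

def update_alpha_hat_regularized(alpha_prev, lam, gamma_hat, reg):
    """Solve the regularized update (10) as a linear system.
    alpha_prev: P x K estimates from the previous time step (or the prior);
    lam: D_t x P persona responsibilities; gamma_hat: D_t x K document means;
    reg: the regularization hyperparameter lambda."""
    D_t, P = lam.shape
    s = lam.sum(axis=0)                      # sum_d lambda_dp for each persona
    A = np.full((P, P), reg * D_t)           # off-diagonal coupling from r()
    np.fill_diagonal(A, 1.0 + s)             # diagonal weights
    b = alpha_prev + lam.T @ gamma_hat       # known P x K right-hand side
    return np.linalg.solve(A, b)             # solves all K coordinates at once
```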

4.3. M-Step

In the M-step the global parameters $\alpha$, $\beta$, and $\kappa$ are updated such that the lower bound on the log likelihood of the data is maximized. Note that the update for $\beta$ is exactly the same as the one derived for the LDA model, and is hence omitted.

The parameter $\kappa_a$ represents the distribution over personas for each author. The closed form update for $\kappa_a$ is:

$$\kappa_{a,p} \propto \delta_p + \sum_{d \,:\, a_d = a} \lambda_{d,p}$$

which shows that $\kappa_a$’s closed form update is an average (after normalization) of the persona assignments to author $a$’s documents, smoothed by the author-persona prior $\delta$.

Once the variational observations $\hat{\alpha}_{t,p}$ are computed, our approach follows the variational Kalman filtering method from Wang’s continuous time dynamic topic model (see the Appendix for further details). Specifically, we employ Brownian motion to model the time dynamics. However, because the DAP model’s time-varying parameter is a distribution over latent topics, it performs best on data discretized in time (resulting in a smaller number of time steps $T$). The forward equations mimic a Kalman filter, with forward mean $m_{t,p}$, variance $V_{t,p}$, and observation noise $\hat{v}$:

$$m_{t,p} = \frac{\hat{v}}{V_{t-1,p} + \sigma^2 \Delta_{s_t} + \hat{v}}\, m_{t-1,p} + \frac{V_{t-1,p} + \sigma^2 \Delta_{s_t}}{V_{t-1,p} + \sigma^2 \Delta_{s_t} + \hat{v}}\, \hat{\alpha}_{t,p}$$
$$V_{t,p} = \frac{\hat{v}\, (V_{t-1,p} + \sigma^2 \Delta_{s_t})}{V_{t-1,p} + \sigma^2 \Delta_{s_t} + \hat{v}}$$

where $\sigma^2$ is the known process noise, and $\sigma^2 \Delta_{s_t}$ captures the increase in variance as the time between data points grows. Finally, the backward equations:

$$\tilde{m}_{t-1,p} = \frac{\sigma^2 \Delta_{s_t}}{V_{t-1,p} + \sigma^2 \Delta_{s_t}}\, m_{t-1,p} + \frac{V_{t-1,p}}{V_{t-1,p} + \sigma^2 \Delta_{s_t}}\, \tilde{m}_{t,p}$$
$$\tilde{V}_{t-1,p} = V_{t-1,p} + \left(\frac{V_{t-1,p}}{V_{t-1,p} + \sigma^2 \Delta_{s_t}}\right)^{2} \left(\tilde{V}_{t,p} - (V_{t-1,p} + \sigma^2 \Delta_{s_t})\right)$$

give the updates to the remaining global parameters.
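A coordinate-wise sketch of the forward filter and backward smoother over one persona's trajectory of variational observations, matching the recursions above. The observation noise v_hat and the initial conditions m0, V0 are assumptions:

```python
import numpy as np

def kalman_smooth(alpha_hat, delta_s, sigma2, v_hat, m0, V0):
    """Variational Kalman filter/smoother for one persona's topic trajectory.
    alpha_hat: T x K variational observations; sigma2: known process noise;
    delta_s: elapsed time per step. Returns smoothed means and variances."""
    T = len(alpha_hat)
    m, V = [m0], [V0]
    for t in range(1, T):                          # forward pass
        pred_var = V[t - 1] + sigma2 * delta_s[t]  # variance grows with the gap
        gain = pred_var / (pred_var + v_hat)
        m.append((1 - gain) * m[t - 1] + gain * alpha_hat[t])
        V.append(gain * v_hat)
    m_s, V_s = m[:], V[:]
    for t in range(T - 2, -1, -1):                 # backward (smoothing) pass
        pred_var = V[t] + sigma2 * delta_s[t + 1]
        J = V[t] / pred_var
        m_s[t] = m[t] + J * (m_s[t + 1] - m[t])
        V_s[t] = V[t] + J ** 2 * (V_s[t + 1] - pred_var)
    return np.array(m_s), np.array(V_s)
```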

5. Experiments

5.1. CaringBridge Dataset

The creation of our model is inspired by a desire to discover topics in a unique dataset consisting of 13.1 million journals posted by approximately half a million authors on the social networking site CaringBridge (CB). Established in 1997, CaringBridge is a 501(c)(3) non-profit organization focused on connecting people and reducing the feelings of isolation that often accompany a patient’s health journey. Due to the sensitive nature of its content, the CB data was anonymized prior to analysis.

From the CB dataset we draw an evaluation dataset consisting of journals written by authors who posted, on average, at least twice a month over a one year period. Journal posts are only kept if they contain 10 or more words. These constraints help identify a set of active users. From the 123K authors meeting these criteria, 2,000 were randomly selected. Journals written by these 2,000 authors total 114,532. Overall, authors in this dataset journal an average of 57 times, with a mean of 5 days between journal posts.

The journal texts were pre-processed in a standard way: any HTML and non-ASCII characters (including emojis) were removed; hyphenated words and contractions were split; excess whitespace was ignored; texts were tokenized, and common stopwords along with words appearing in over 90% of documents were removed; all punctuation was stripped; and words were reduced to their lemmas. Finally, the texts were transformed to a bag-of-words format, keeping the 5,000 most used words as the vocabulary. Because the dataset includes journals written between 2006 and 2016, the timestamps are transformed into relative values and discretized, reflecting the number of weeks since an author’s first journal.
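A sketch of this pre-processing pipeline is below, assuming NLTK for lemmatization and scikit-learn for the bag-of-words step; the paper does not name the tools actually used, and NLTK's lemmatizer additionally requires the WordNet data to be installed:

```python
import re
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import CountVectorizer

lemmatizer = WordNetLemmatizer()

def clean(text):
    text = re.sub(r"<[^>]+>", " ", text)             # strip HTML tags
    text = text.encode("ascii", "ignore").decode()   # drop non-ASCII/emojis
    text = re.sub(r"[-']", " ", text)                # split hyphens/contractions
    tokens = re.findall(r"[a-z]+", text.lower())     # tokenize, drop punctuation
    return " ".join(lemmatizer.lemmatize(t) for t in tokens)

vectorizer = CountVectorizer(
    preprocessor=clean,
    stop_words="english",   # remove common stopwords
    max_df=0.9,             # drop words appearing in over 90% of documents
    max_features=5000,      # keep the 5,000 most used words
)
# bow = vectorizer.fit_transform(journal_texts)   # journal_texts: list of str
```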

5.2. Evaluation

Journals are split into training and test sets, with 90% of each author’s journals used for training and 10% for testing. Further, the variance in model performance is estimated by repeating this splitting procedure for 10-fold cross validation.

The performance of our model is compared to three other models representing the state of the art in this area. The first model for comparison is LDA, which ignores authorship and temporal structure in the data. In order to evaluate LDA’s performance over time, we train LDA on time steps up through $t-1$ and test on time step $t$ (similar to the evaluation method in Wang and McCallum, 2006). The DTM also serves as an important baseline for comparison because it models the evolution of topics over discrete time steps. Lastly, we compare our model to the CDTM, which builds on the DTM and introduces a continuous treatment of time. Following the approach of others, we simply fix the number of topics at 25 for all models. The number of personas learned by the DAP model is fixed at 15.

To evaluate the models we compute the per-word log-likelihood (PWLL) on heldout data, which measures how well the model fits the data and is computed by dividing the total log-likelihood of the heldout documents by their total word count. Note that perplexity, another common metric used to compare topic models, is related to the PWLL via $\mathrm{perplexity} = \exp(-\mathrm{PWLL})$. It has been shown that perplexity (and hence PWLLs) doesn’t correlate with a model finding coherent topics (Chang et al., 2009). Nevertheless, PWLLs provide a fair way to compare how well each model optimizes its objective function.
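Once per-document log-likelihoods are available, the metric and its perplexity counterpart are one-liners:

```python
import numpy as np

def per_word_log_likelihood(doc_log_likelihoods, doc_lengths):
    """PWLL: total heldout log-likelihood divided by the total word count.
    Perplexity is recovered as exp(-PWLL)."""
    pwll = np.sum(doc_log_likelihoods) / np.sum(doc_lengths)
    return pwll, np.exp(-pwll)
```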

6. Results

In addition to evaluating model fit, we perform a qualitative analysis of the DAP model to highlight the quality and usefulness of the personas discovered. In particular, we establish that the personas are distinct from one another and capture meaningful experiences shared by authors.

6.1. Model Comparison

Model          | Per-word Log-Likelihood | Std. Dev.
DAP (λ = 0.0)  | -7.22                   | 0.04
DAP (λ = 0.2)  | -6.47                   | 0.04
LDA            | -9.23                   | 0.02
DTM            | -9.65                   | 0.03
CDTM           | -8.82                   | 0.03
Table 1. Overall comparison of models. Per-word log-likelihoods for documents in the test dataset are computed; the standard deviation in performance is computed over the cross-validation sets. While the basic DAP model without regularization performs significantly better than competing models, the RVI approach further increases log-likelihoods.

In Table 1 we list the per-word log-likelihood and standard deviation between cross-validation sets for each of the competing models. There is a significant improvement in the DAP model’s performance after regularization. Further analysis of the likelihood computation reveals that the regularization term contributes a relatively small drop in likelihood compared to the total likelihood during training. Nevertheless, these results show that even a small amount of regularization can nudge the model toward quality results. In testing additional values of $\lambda$ we found that, in general, nearby values fared comparably. Larger values of $\lambda$ can cause model instability and lead the document likelihoods to have long-tailed distributions. The emergence of outlier document-likelihoods is unsurprising: regularization encourages the personas to focus on different topics, hence large values of $\lambda$ inevitably result in personas that overfit.

Figure 2 shows the mean per-word log-likelihood at each time step. The best performing DAP model shows consistently better results than competing models. However, the unregularized DAP model suffers a significant drop in performance in the first time step.

Figure 2. In general the DAP model performs better than competing models across time steps. The regularized DAP model further improves performance and reduces the variability found in the first time step of the unregularized model. Error bars show one standard deviation in document-level PWLL.

6.2. Persona Quality

Figure 3. The regularized DAP model finds compelling, distinct personas corresponding to common health journeys experienced by CaringBridge users. The three most likely topics for each persona are plotted over time. Results are shown for six personas that highlight the diversity in topic focus. Personas 0, 6, 8, and 14 highlight nuances in how an author writes about a topic like cancer. Personas 0 and 14 engage with their community, and are less clinical when writing about cancer. Persona 14’s journals, however, are more religious and often include prayer. On the other hand, when discussing health, Personas 6 and 8 write about cancer using clinical terminology. When Persona 6 is not sharing health updates the conversation is often about school, family, and celebrations, whereas Persona 8’s non-health updates are deep, reflective, and prayerful.
Community Support | Physical Therapy | Reflect on Life | Hopeful Prayer | Family Fun | Infection | Weather | School
family | therapy | life | god | christmas | blood | nice | school
friend | rehab | know | pray | play | infection | weather | shot
church | therapist | child | prayer | birthday | fluid | walk | go
thank | physical | never | lord | game | fever | lunch | appt
card | pt | love | bless | fun | antibiotic | cold | class
love | chair | year | please | kid | pressure | snow | tomorrow
service | speech | live | heal | party | kidney | outside | grandma
friends | progress | people | trust | year | iv | breakfast | teacher
support | move | cancer | peace | enjoy | lung | rain | home
gift | arm | moment | continue | dinner | clot | go | aunt

Cancer (clinical) | Cancer (general) | Intensive Care | Well Wishes | Hair Loss | Surgery | Bedtime | Weight
chemo | cancer | tube | dad | hair | surgery | sleep | weight
blood | treatment | breathe | mom | leg | surgeon | night | mommy
count | radiation | oxygen | everyone | wear | heart | bed | gain
bone | scan | lung | message | head | dr | wake | feed
marrow | chemo | feed | guestbook | look | office | nurse | daddy
platelet | tumor | x_ray | please | cut | op | say | bottle
round | oncologist | chest | prayer | knee | procedure | asleep | pound
clinic | dr | nurse | read | hat | cardiologist | _time_ | feeding
transfusion | ct | vent | visit | wig | valve | room | oz
_url_ | result | stomach | update | shave | ha | tell | milk
Table 2. Top 10 words associated with the most prevalent topics found by the regularized DAP model (λ = 0.2). Topic labels are selected manually in order to aid reference with Figure 3. The words _time_ and _url_ refer to the result of text pre-processing steps for capturing common patterns like the time of day and website URLs, respectively.

To evaluate the quality of personas, we focus on three key elements: authors are described by one clear persona; personas are distinct from one another, as shown in the combination of topics most associated with that persona; and personas capture a coherent health journey experienced by authors.

1:1 Author-Persona Mappings. Authors are modeled as a distribution over personas; however, to create interpretable results we want these distributions to concentrate on a single persona. The DAP model achieves this in the majority of cases: 71% of authors are concentrated on a single persona, and 27% of authors are evenly split between two personas. This shows that, in general, the model finds personas that generalize well enough to describe the majority of authors.
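These summary statistics can be read directly off the learned author-persona matrix. A sketch, where kappa is the A x P matrix of author-persona probabilities; the 90% concentration threshold is an illustrative assumption, since the exact cutoff is not stated here:

```python
import numpy as np

def persona_concentration(kappa, threshold=0.9):
    """Fraction of authors concentrated on one persona, and fraction whose
    mass is split across their top two personas."""
    top = np.sort(kappa, axis=1)[:, ::-1]            # sort personas descending
    single = np.mean(top[:, 0] >= threshold)
    split = np.mean((top[:, 0] < threshold) &
                    (top[:, :2].sum(axis=1) >= threshold))
    return single, split
```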

Distinct Personas. The DAP model includes a regularization term specifically for encouraging personas with unique combinations of topics. We examined the top three topics associated with each persona. In the unregularized model, the 15 personas are a mix of only 6 different topics. In fact, a topic on “Weather” appears as a common topic for all 15 personas. On the other hand, the regularized DAP model’s personas are a mix of 18 different topics. Further, the most frequently appearing topic is “Cancer (general)” (in 6 of 15 personas), which is appropriate given that approximately half of authors report cancer as a health condition.

Personas Reflect Coherent Health Journeys. In Figure 3 we show the top three topics evolving over time for selected personas. The labels listed for each topic are created manually based on the words and journals most associated with the topic. Words most associated with each topic are listed in Table 2. The persona plots in Figure 3 paint a compelling picture of common health journeys experienced by CB users.

Personas reflect broad trends, often encompassing a range of health journeys. Consider Persona 9, which reflects health journeys beginning with a physical element, such as physical therapy or a health issue taking a physical toll, followed by intensive care and attention to weight. Many Persona 9 authors begin physical therapy following an accident, or are caring for a premature baby or a child with a congenital disorder. A number of rarer disorders also follow Persona 9’s pattern. For instance, one Persona 9 author writes about a family member with Guillain-Barré syndrome, a rare rapid-onset disorder in which the immune system attacks the nervous system, resulting in muscle pain, weakness, and even paralysis. The syndrome often requires admission to an intensive care unit, followed by rehabilitation – all common themes of Persona 9.

7. Conclusion

The Dynamic Author-Persona topic model is uniquely suited to modeling text data that has a temporal structure and is written by multiple authors. Unlike previous temporal topic models, DAP discovers latent personas — a novel component that identifies authors with similar topic trajectories. Our RVI algorithm further improves the DAP model’s performance over competing models and results in the discovery of distinct personas. In evaluating the DAP model, we introduce the CaringBridge dataset: a massive collection of journals written by patients and caregivers, many of whom face serious, life-threatening illnesses. From this dataset the DAP model extracts compelling descriptions of health journeys.

Many opportunities exist for further research. Currently, we deal with non-conjugate terms using the approach established in the CTM. Recent advances in non-conjugate inference (Ranganath et al., 2013; Kingma and Welling, 2014; Khan et al., 2015; Khan and Lin, 2017; Srivastava and Sutton, 2017) may lead to a more efficient approach to dealing with these difficult terms.

Acknowledgments

We thank reviewers for their valuable comments, University of Minnesota Supercomputing Institute (MSI) for technical support, and CaringBridge for their support and collaboration. The research was supported by NSF grants IIS-1563950, IIS-1447566, IIS-1447574, IIS-1422557, CCF-1451986, CNS-1314560.

References

  • Beaudoin and Tao (2007) Christopher E. Beaudoin and Chen-Chao Tao. 2007. Benefiting from Social Capital in Online Support Groups: An Empirical Study of Cancer Patients. CyberPsychology & Behavior 10, 4 (2007), 587–590.
  • Blei et al. (2016) David M. Blei, Alp Kucukelbir, and Jon D. McAuliffe. 2016. Variational Inference: A Review for Statisticians. arXiv:1601.00670 [stat.CO]. http://arxiv.org/abs/1601.00670
  • Blei and Lafferty (2006) David M. Blei and John D. Lafferty. 2006. Dynamic Topic Models. International Conference on Machine Learning (2006), 113–120. https://doi.org/10.1145/1143844.1143859
  • Blei et al. (2003) David M. Blei, Andrew Y. Ng, and Michael I. Jordan. 2003. Latent Dirichlet Allocation. Journal of Machine Learning Research 3 (2003), 993–1022. http://www.jmlr.org/papers/volume3/blei03a/blei03a.pdf
  • Chang et al. (2009) Jonathan Chang, Sean Gerrish, Chong Wang, and David M. Blei. 2009. Reading Tea Leaves: How Humans Interpret Topic Models. Advances in Neural Information Processing Systems 22 (2009), 288–296.
  • Jordan et al. (1999) Michael I. Jordan, Zoubin Ghahramani, Tommi S. Jaakkola, and Lawrence K. Saul. 1999. Introduction to variational methods for graphical models. Machine Learning 37, 2 (1999), 183–233.
  • Khan and Lin (2017) Mohammad Khan and Wu Lin. 2017. Conjugate-Computation Variational Inference : Converting Variational Inference in Non-Conjugate Models to Inferences in Conjugate Models. Proceedings of the 20th International Conference on Artificial Intelligence and Statistics 54 (2017), 878–887. arXiv:1703.04265 http://proceedings.mlr.press/v54/khan17a.html
  • Khan et al. (2015) Mohammad Emtiyaz Khan, Reza Babanezhad, Wu Lin, Mark Schmidt, and Masashi Sugiyama. 2015. Faster Stochastic Variational Inference using Proximal-Gradient Methods with General Divergence Functions. (2015). arXiv:1511.00146 http://arxiv.org/abs/1511.00146
  • Kingma and Welling (2014) Diederik P Kingma and Max Welling. 2014. Auto-Encoding Variational Bayes. Proceedings of the 2nd International Conference on Learning Representations (ICLR) (dec 2014), 1–14. arXiv:1312.6114 http://arxiv.org/abs/1312.6114
  • Lafferty and Blei (2006) John D. Lafferty and David M. Blei. 2006. Correlated Topic Models. Advances in Neural Information Processing Systems 18 (2006), 147–154.
  • McCallum et al. (2005) A. McCallum, A. Corrada-Emmanuel, and Xuerui Wang. 2005. The Author-Recipient-Topic Model for Topic and Role Discovery in Social Networks: Experiments with Enron and Academic Email. NIPS’04 Workshop on Structured Data and Representations in Probabilistic Models for Categorization (2005), 1–16.
  • Mimno and McCallum (2007) David Mimno and Andrew McCallum. 2007. Expertise modeling for matching papers with reviewers. Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (2007), 500–509.
  • Pathak et al. (2008) Nishith Pathak, Colin DeLong, Kendrick Erickson, and Arindam Banerjee. 2008. Social topic models for community extraction. The 2nd SNA-KDD Workshop ’08 (SNA-KDD’08) (2008).
  • Ranganath et al. (2013) Rajesh Ranganath, Sean Gerrish, and David M. Blei. 2013. Black Box Variational Inference. AISTATS 33 (2013). arXiv:1401.0118
  • Rodgers and Chen (2005) S Rodgers and Q Chen. 2005. Internet community group participation: Psychosocial benefits for women with breast cancer. Journal of Computer-Mediated Communication 10, 4 (2005), 1–27. http://www.scopus.com/scopus/inward/record.url?eid=2-s2.0-24144454821
  • Rosen-Zvi et al. (2004) M. Rosen-Zvi, T. Griffiths, M. Steyvers, and P. Smyth. 2004. The author-topic model for authors and documents. Proceedings of the 20th conference on Uncertainty in artificial intelligence (2004), 487–494. http://portal.acm.org/citation.cfm?id=1036902
  • Srivastava and Sutton (2017) Akash Srivastava and Charles Sutton. 2017. Autoencoding Variational Inference For Topic Models. ICLR (mar 2017), 1–18. arXiv:1703.01488 http://arxiv.org/abs/1703.01488
  • Steyvers et al. (2004) Mark Steyvers, Padhraic Smyth, Michal Rosen-Zvi, and Thomas Griffiths. 2004. Probabilistic author-topic models for information discovery. In Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 306–315.
  • Wainwright and Jordan (2007) Martin J. Wainwright and Michael I. Jordan. 2007. Graphical Models, Exponential Families, and Variational Inference. Foundations and Trends® in Machine Learning 1, 1–2 (2007), 1–305. http://www.nowpublishers.com/article/Details/MAL-001
  • Wang et al. (2008) Chong Wang, David Blei, and David Heckerman. 2008. Continuous Time Dynamic Topic Models. Proc of UAI (2008), 579–586. arXiv:1206.3298 http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.139.4535
  • Wang and McCallum (2006) Xuerui Wang and Andrew McCallum. 2006. Topics over time: a non-Markov continuous-time model of topical trends. Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining (2006), 424–433. http://portal.acm.org/citation.cfm?doid=1150402.1150450
  • Wei et al. (2007) Xing Wei, Jimeng Sun, and Xuerui Wang. 2007. Dynamic Mixture Models for Multiple Time Series. Ijcai (2007), 2909–2914.
  • Wen et al. (2011) Kuang Yi Wen, Fiona McTavish, Gary Kreps, Meg Wise, and David Gustafson. 2011. From Diagnosis to Death: A Case Study of Coping With Breast Cancer as Seen Through Online Discussion Group Messages. Journal of Computer-Mediated Communication 16, 2 (2011), 331–361.
  • Wen and Rose (2012) Miaomiao Wen and Carolyn Penstein Rose. 2012. Understanding participant behavior trajectories in online health support groups using automatic extraction methods. Proceedings of the 17th ACM international conference on Supporting group work - GROUP ’12 (2012), 179. http://dl.acm.org/citation.cfm?id=2389176.2389205

Appendix A ELBO Terms Unique to the DAP Model

The expansion of the ELBO referenced in (5) includes a number of terms previously derived for LDA and the CTM (Blei et al., 2003; Lafferty and Blei, 2006). The DAP model’s introduction of personas, and of the parameters $\alpha$, $\kappa$, and $x$ that govern them, leads to a few new terms. Terms unique to the DAP model are detailed below.

A.1. Expanding $\mathbb{E}_q[\log p(\gamma_d \mid \alpha_t, x_d)]$

Expansion of the ELBO term $\mathbb{E}_q[\log p(\gamma_d \mid \alpha_t, x_d)]$ is unique to the DAP model, and particularly challenging because the topic distribution for each document is drawn from a Gaussian with mean $\alpha_t x_d$. Hence, the term is expanded to:

$$\mathbb{E}_q[\log p(\gamma_d \mid \alpha_t, x_d)] = -\frac{1}{2} \mathbb{E}_q\big[(\gamma_d - \alpha_t x_d)^{\top} \Sigma^{-1} (\gamma_d - \alpha_t x_d)\big] + \mathrm{const}$$

Note that the expectation is over all the terms — that is, $\gamma_d$, $\alpha_t$, and $x_d$ are not constants. Factorizing this expectation gives:

$$\mathbb{E}_q\big[\gamma_d^{\top} \Sigma^{-1} \gamma_d\big] \qquad (11a)$$
$$-2\, \mathbb{E}_q\big[\gamma_d^{\top} \Sigma^{-1} \alpha_t x_d\big] \qquad (11b)$$
$$\mathbb{E}_q\big[x_d^{\top} \alpha_t^{\top} \Sigma^{-1} \alpha_t x_d\big] \qquad (11c)$$

where each of the terms in (11) is evaluated below.

Term (11a): Since $\gamma_d$ is Gaussian under $q$, this is a straightforward case of the Gaussian quadratic identity:

$$\mathbb{E}_q\big[\gamma_d^{\top} \Sigma^{-1} \gamma_d\big] = \hat{\gamma}_d^{\top} \Sigma^{-1} \hat{\gamma}_d + \operatorname{tr}\big(\Sigma^{-1} \hat{\nu}_d^2 I\big)$$

where $\hat{\gamma}_d$ is the variational mean for $\gamma_d$ and $\hat{\nu}_d^2$ is the variance parameter associated with the topic distribution over document $d$.

Term (11b): This doesn’t take a Gaussian quadratic form. To solve it, recall that $x_d$ and $\alpha_t$ are independent under the mean-field assumption, thus:

$$\mathbb{E}_q\big[\gamma_d^{\top} \Sigma^{-1} \alpha_t x_d\big] = \hat{\gamma}_d^{\top} \Sigma^{-1} \sum_{p=1}^{P} \lambda_{d,p}\, \hat{\alpha}_{t,p}$$

Term (11c): Expanding the last term yields:

$$\mathbb{E}_q\big[x_d^{\top} \alpha_t^{\top} \Sigma^{-1} \alpha_t x_d\big] = \sum_{p=1}^{P} \sum_{p'=1}^{P} \mathbb{E}_q\big[\alpha_{t,p}^{\top} \Sigma^{-1} \alpha_{t,p'}\big]\, \mathbb{E}_q\big[x_{d,p}\, x_{d,p'}\big]$$

which requires the second moments of the persona indicator $x_d$. To evaluate $\mathbb{E}_q[x_{d,p} x_{d,p'}]$, consider personas $p$ and $p'$, which simplifies the computation because $x_{d,p}$ and $x_{d,p'}$ are scalars and $x_d$ refers to a column vector of indicators:

The resulting second-moment matrix has off-diagonal elements that are all 0, since $x_d$ – a draw from a multinomial – is a 1-of-$P$ vector, so $x_{d,p} x_{d,p'} = 0$ whenever $p \neq p'$. Elements along the diagonal are given by $\mathbb{E}_q[x_{d,p}^2] = \mathbb{E}_q[x_{d,p}]$. Thus, for persona $p$ we have $\mathbb{E}_q[x_{d,p}^2] = \lambda_{d,p}$. Therefore term (11c) is:

$$\sum_{p=1}^{P} \lambda_{d,p}\, \mathbb{E}_q\big[\alpha_{t,p}^{\top} \Sigma^{-1} \alpha_{t,p}\big]$$

Combining the three expanded terms from (11), the expectation can be reduced to:

$$\mathbb{E}_q\big[(\gamma_d - \alpha_t x_d)^{\top} \Sigma^{-1} (\gamma_d - \alpha_t x_d)\big] = \hat{\gamma}_d^{\top} \Sigma^{-1} \hat{\gamma}_d + \operatorname{tr}\big(\Sigma^{-1} \hat{\nu}_d^2 I\big) - 2\, \hat{\gamma}_d^{\top} \Sigma^{-1} \sum_{p} \lambda_{d,p}\, \hat{\alpha}_{t,p} + \sum_{p} \lambda_{d,p}\, \mathbb{E}_q\big[\alpha_{t,p}^{\top} \Sigma^{-1} \alpha_{t,p}\big]$$

Finally, the ELBO term for $\mathbb{E}_q[\log p(\gamma_d \mid \alpha_t, x_d)]$ is expanded out fully to:

$$\mathbb{E}_q[\log p(\gamma_d \mid \alpha_t, x_d)] = -\frac{K}{2} \log 2\pi - \frac{1}{2} \log |\Sigma| - \frac{1}{2} \Big( \hat{\gamma}_d^{\top} \Sigma^{-1} \hat{\gamma}_d + \operatorname{tr}\big(\Sigma^{-1} \hat{\nu}_d^2 I\big) - 2\, \hat{\gamma}_d^{\top} \Sigma^{-1} \sum_{p} \lambda_{d,p}\, \hat{\alpha}_{t,p} + \sum_{p} \lambda_{d,p}\, \mathbb{E}_q\big[\alpha_{t,p}^{\top} \Sigma^{-1} \alpha_{t,p}\big] \Big)$$

A.2. Expanding $\mathbb{E}_q[\log p(\alpha_{t,p} \mid \alpha_{t-1,p})]$

Expanding the ELBO term $\mathbb{E}_q[\log p(\alpha_{t,p} \mid \alpha_{t-1,p})]$ is similar to the DTM, and follows from the Gaussian quadratic form identity, which states that for $x \sim \mathcal{N}(\mu, V)$: $\mathbb{E}[x^{\top} A x] = \mu^{\top} A \mu + \operatorname{tr}(A V)$.
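The identity is easy to check numerically; the following sketch compares a Monte Carlo estimate of the quadratic form's expectation against the closed form:

```python
import numpy as np

# Check: for x ~ N(mu, V), E[x' A x] = mu' A mu + tr(A V).
rng = np.random.default_rng(1)
K = 4
mu = rng.normal(size=K)
L = rng.normal(size=(K, K))
V = L @ L.T + K * np.eye(K)                      # a valid covariance matrix
A = rng.normal(size=(K, K))

x = rng.multivariate_normal(mu, V, size=200_000)
mc = np.einsum("ni,ij,nj->n", x, A, x).mean()    # Monte Carlo estimate
exact = mu @ A @ mu + np.trace(A @ V)
print(mc, exact)                                 # agree up to sampling noise
```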