Discriminative Relational Topic Models
Abstract
Many scientific and engineering fields involve analyzing network data. For document networks, relational topic models (RTMs) provide a probabilistic generative process that describes both the link structure and the document contents, and they have shown promise in predicting network structures and discovering latent topic representations. However, existing RTMs suffer from restricted model expressiveness and an inability to handle imbalanced network data. To expand the scope and improve the inference accuracy of RTMs, this paper presents three extensions: 1) unlike the common link likelihood with a diagonal weight matrix, which allows only same-topic interactions, we generalize it to a full weight matrix that captures all pairwise topic interactions and is applicable to asymmetric networks; 2) instead of performing standard Bayesian inference, we perform regularized Bayesian inference (RegBayes) with a regularization parameter that deals with the imbalanced link structure common in real networks and improves the discriminative ability of the learned latent representations; and 3) instead of using variational approximation with strict mean-field assumptions, we present collapsed Gibbs sampling algorithms for the generalized relational topic models by exploiting data augmentation, without making restrictive assumptions. Under the generic RegBayes framework, we carefully investigate two popular discriminative loss functions, namely the logistic log-loss and the max-margin hinge loss. Experimental results on several real network datasets demonstrate the significance of these extensions in improving prediction performance, and show that time efficiency can be dramatically improved with a simple fast approximation method.
statistical network analysis, relational topic models, data augmentation, regularized Bayesian inference
1 Introduction
Many scientific and engineering fields involve analyzing large collections of data that can be well described by networks, where vertices represent entities and edges represent relationships or interactions between entities; examples include online social networks, communication networks, protein interaction networks, and academic paper citation and co-authorship networks. As the availability and scope of network data increase, statistical network analysis (SNA) has attracted considerable attention (see [Goldenberg:2010] for a comprehensive survey). Among the many tasks studied in SNA, link prediction [liben_nowell, backstrom] is one of the most fundamental: it attempts to estimate the link structure of a network based on partially observed links and/or entity attributes (if they exist). Link prediction can provide useful predictive models, for example for suggesting friends to social network users or citations to scientific articles.
Many link prediction methods have been proposed, including early work on designing good similarity measures [liben_nowell] that are used to rank unobserved links, and work on learning supervised classifiers with well-conceived features [Hasan:2006, Lichtenwalter:2010]. Though specific domain knowledge can be used to design effective feature representations, feature engineering is generally a labor-intensive process. To expand the scope and ease of applicability of machine learning methods, fast-growing interest has been devoted to learning feature representations from data [Bengio:2012]. Along this line, recent research on link prediction has focused on learning latent variable models, including both parametric [Hoff:02, Hoff:07, Airoldi:nips08] and nonparametric Bayesian methods [Miller:nips09, Zhu:ICML12]. Though these methods can model network structures well, little attention has been paid to the observed attributes of entities, such as the text contents of papers in a citation network or of web pages in a hyperlinked network. One model that accounts for both text contents and network structure is the relational topic model (RTM) [Chang:RTM09], an extension of latent Dirichlet allocation (LDA) [Blei:03] that predicts link structures among documents as well as discovering their latent topic structures.
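The similarity-measure approach mentioned above can be sketched concretely; the toy graph and the two classic scores below (common neighbors and Adamic-Adar) are illustrative examples, not methods from this paper:

```python
import math

# Toy undirected graph as adjacency sets (hypothetical example data).
graph = {
    "A": {"B", "C"},
    "B": {"A", "C", "D"},
    "C": {"A", "B"},
    "D": {"B"},
}

def common_neighbors(g, u, v):
    """Score an unobserved pair by the number of shared neighbors."""
    return len(g[u] & g[v])

def adamic_adar(g, u, v):
    """Weight each shared neighbor inversely by the log of its degree."""
    return sum(1.0 / math.log(len(g[w])) for w in g[u] & g[v] if len(g[w]) > 1)

# Rank candidate (non-linked) pairs by score; higher means more likely to link.
print(common_neighbors(graph, "A", "D"))  # -> 1 (the shared neighbor B)
```

Such measures need no training, but they ignore entity attributes, which motivates the latent variable models discussed next.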
Though powerful, existing RTMs make some assumptions that can limit their applicability and inference accuracy. First, RTMs define a symmetric link likelihood model with a diagonal weight matrix that allows same-topic interactions only, and this symmetric nature also makes RTMs unsuitable for asymmetric networks. Second, by performing standard Bayesian inference under a generative modeling process, RTMs do not explicitly deal with the imbalance common in real networks, which normally have only a few observed links while most entity pairs are unlinked; as a result, the learned topic representations can be weak at predicting link structures. Finally, RTMs and other variants [LiuYan:09] apply variational methods to estimate model parameters under mean-field assumptions [Jordan:99], which are normally too restrictive to be realistic in practice.
To address the above limitations, this paper presents discriminative relational topic models, which consist of three extensions that improve RTMs:

we relax the symmetric assumption and define the generalized relational topic models (gRTMs) with a full weight matrix that allows all pairwise topic interactions and is more suitable for asymmetric networks;

we perform regularized Bayesian inference (RegBayes) [Zhu:nips11] that introduces a regularization parameter to deal with the imbalance problem in common real networks;

we present a collapsed Gibbs sampling algorithm for gRTMs by exploring the classical ideas of data augmentation [Dempster1977, Tanner:1987, DykMeng2001].
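The first extension can be illustrated with a small numerical sketch. It assumes a logistic link likelihood of the form sigma(z_i' W z_j + b) over mean topic assignments, which matches the description above; the dimensions, values, and variable names here are ours, for illustration only:

```python
import numpy as np

rng = np.random.default_rng(0)
K = 4  # number of topics (illustrative)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Mean topic assignments of two documents (each sums to 1).
z_i = np.array([0.7, 0.1, 0.1, 0.1])
z_j = np.array([0.1, 0.7, 0.1, 0.1])

# RTM-style diagonal weights: only same-topic interactions contribute,
# and the score is symmetric in (i, j) by construction.
eta = rng.normal(size=K)
p_diag = sigmoid(z_i @ np.diag(eta) @ z_j)

# gRTM-style full weight matrix: all pairwise topic interactions count,
# and since W need not be symmetric, p(i -> j) can differ from p(j -> i),
# which is what makes the model applicable to asymmetric networks.
W = rng.normal(size=(K, K))
b = 0.0
p_full_ij = sigmoid(z_i @ W @ z_j + b)
p_full_ji = sigmoid(z_j @ W @ z_i + b)
```

Note that with the diagonal model, two documents dominated by different topics get essentially no interaction signal, whereas the full matrix can assign a large weight to that specific topic pair.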
Our methods are quite generic, in the sense that various loss functions can be used to learn discriminative latent representations. In this paper, we focus on two popular types of loss functions, namely the logistic log-loss and the max-margin hinge loss. For the max-margin loss, the resulting max-margin RTMs are themselves a new contribution to the field of statistical network analysis.
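The two loss functions differ in how they penalize a link's discriminant score. The sketch below uses the standard textbook definitions on a hypothetical scalar score (how scores are computed from topic representations is omitted here):

```python
import numpy as np

def log_loss(y, score):
    """Logistic log-loss for a binary link label y in {0, 1}."""
    return np.log1p(np.exp(score)) - y * score

def hinge_loss(y, score, ell=1.0):
    """Max-margin hinge loss with margin ell; y in {0, 1} is mapped to {-1, +1}."""
    return np.maximum(0.0, ell - (2.0 * y - 1.0) * score)

score = 2.5  # discriminant value of a candidate link (illustrative)
print(log_loss(1, score))    # small but nonzero: log-loss never saturates
print(hinge_loss(1, score))  # -> 0.0: margin satisfied, no penalty
print(hinge_loss(0, score))  # -> 3.5: margin violated by 2.5 + 1.0
```

The qualitative difference is that the hinge loss is exactly zero once the margin is met, while the log-loss keeps shrinking smoothly; both are convex surrogates for the link prediction error.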
For posterior inference, we present efficient Markov chain Monte Carlo (MCMC) methods for both types of loss functions by introducing auxiliary variables. Specifically, for the logistic log-loss, we introduce a set of Pólya-Gamma random variables [Polson:arXiv12], one per training link, to derive an exact mixture representation of the logistic link likelihood; for the max-margin hinge loss, we introduce a set of generalized inverse Gaussian variables [Devroye:book1986] to derive a mixture representation of the corresponding unnormalized pseudo-likelihood. We then integrate out the intermediate Dirichlet variables and derive the local conditional distributions for collapsed Gibbs sampling analytically. These "augment-and-collapse" algorithms are simple and efficient. More importantly, they do not make any restrictive assumptions about the desired posterior distribution. Experimental results on several real networks demonstrate that these extensions are important and can significantly improve prediction performance.
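The Pólya-Gamma mixture representation of the logistic likelihood can be checked numerically. The sketch below uses a naive truncated-series sampler for PG(b, 0) (exact samplers are given in [Polson:arXiv12]); the value of the linear predictor and all variable names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)

def sample_pg0(b, trunc=200, size=20000):
    """Naive truncated-series sampler for PolyaGamma(b, 0):
    omega = (1 / (2 pi^2)) * sum_k g_k / (k - 1/2)^2, with g_k ~ Gamma(b, 1)."""
    k = np.arange(1, trunc + 1)
    g = rng.gamma(b, 1.0, size=(size, trunc))
    return (g / (2.0 * np.pi**2 * (k - 0.5) ** 2)).sum(axis=1)

psi = 0.8        # linear predictor of one training link (illustrative value)
y = 1.0          # observed binary link label
kappa = y - 0.5  # kappa = a - b/2 with a = y and b = 1

# Exact Bernoulli-logistic likelihood of the link.
exact = np.exp(y * psi) / (1.0 + np.exp(psi))

# Mixture representation: 2^{-b} exp(kappa*psi) E_omega[exp(-omega psi^2 / 2)]
# with omega ~ PG(b, 0). Conditioned on omega, the exponent is Gaussian in psi,
# which is what makes the collapsed Gibbs updates tractable.
omega = sample_pg0(b=1.0)
mixture = 0.5 * np.exp(kappa * psi) * np.mean(np.exp(-omega * psi**2 / 2.0))
# exact and mixture agree up to Monte Carlo and truncation error.
```

The key point is that the awkward logistic term becomes a scale mixture of Gaussians in the linear predictor once omega is introduced, so the augmented conditionals have standard forms.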
The rest of the paper is structured as follows. Section 2 summarizes related work. Section 3 presents the generalized RTMs with both the log-loss and the hinge loss. Section 4 presents the "augment-and-collapse" Gibbs sampling algorithms for both types of loss functions. Section 5 presents experimental results. Finally, Section 6 concludes and discusses future directions.