Abstract

Boolean tensor decomposition approximates data of multi-way binary relationships as a product of interpretable low-rank binary factors, following the rules of Boolean algebra. Here, we present its first probabilistic treatment. We facilitate scalable sampling-based posterior inference by exploiting the combinatorial structure of the factor conditionals. Maximum a posteriori decompositions feature higher accuracies than existing techniques throughout a wide range of simulated conditions. Moreover, the probabilistic approach facilitates the treatment of missing data and enables model selection with much greater accuracy. We investigate three real-world datasets. First, temporal interaction networks in a hospital ward and behavioural data of university students demonstrate the inference of instructive latent patterns. Next, we decompose a tensor with more than 10 billion data points, indicating relations of gene expression in cancer patients. Not only does this demonstrate scalability, it also provides an entirely novel perspective on relational properties of continuous data and, in the present example, on the molecular heterogeneity of cancer. Our implementation is available on GitHub: https://github.com/TammoR/LogicalFactorisationMachines.


TensOrMachine: Probabilistic Boolean Tensor Decomposition

 

Tammo Rukat, Chris C. Holmes, Christopher Yau


Correspondence to: Tammo Rukat <tammo.rukat@stats.ox.ac.uk>.
Introduction

Matrix decomposition methods such as factor analysis are a widely used class of models for unsupervised learning. Typically, they operate on two-way matrices with rows thought of as objects and columns thought of as properties. The decomposition amounts to computing two low-rank matrices, one that contains prototypes of properties (loadings) and one that denotes a compressed representation of each object as a combination of the prototypes (factors). However, these methods are ill-suited for data with higher arity such as ternary interactions. Examples include network interactions at different time-points, gene-gene interactions for different individuals or any two-way data under different experimental conditions. Such data requires methods that specifically account for the higher-order relationship and that are commonly referred to as tensor decomposition.

Boolean tensor decomposition factorises a $K$-way binary tensor $X \in \{0,1\}^{N_1 \times \cdots \times N_K}$ into $K$ binary factor matrices $Z^{(k)} \in \{0,1\}^{N_k \times L}$, using the Boolean algebra, such that

x_{\bar{n}} = \bigvee_{l=1}^{L} \bigwedge_{k=1}^{K} z^{(k)}_{n_k l} .    (1)

Here, $x_{\bar{n}}$ is a single tensor entry and $\bar{n}$ denotes a tuple of indices $(n_1, \ldots, n_K)$. We call $k = 1, \ldots, K$ the modes of the tensor; $\vee$ and $\wedge$ denote the logical OR and the logical AND operation, respectively.
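To make the generative structure of eq. (1) concrete, the following sketch computes the Boolean tensor product of $K$ binary factor matrices with NumPy. This is our own illustration, written for this presentation; the function name boolean_tensor_product and the toy sizes are illustrative, not part of the reference implementation.

    import numpy as np

    def boolean_tensor_product(factors):
        # Boolean product of K binary factor matrices Z^(k), each of shape (N_k, L).
        # Returns the K-way binary tensor with x[n1,...,nK] = OR_l AND_k Z^(k)[n_k, l].
        L = factors[0].shape[1]
        X = np.zeros(tuple(Z.shape[0] for Z in factors), dtype=bool)
        for l in range(L):
            # AND across modes: chained outer products build the rank-1 tensor of dim l
            rank1 = factors[0][:, l].astype(bool)
            for Z in factors[1:]:
                rank1 = np.logical_and.outer(rank1, Z[:, l].astype(bool))
            X |= rank1  # OR over latent dimensions
        return X

    # toy example: 3-way tensor from random binary factors
    rng = np.random.default_rng(0)
    Z = [rng.random((n, 4)) < 0.3 for n in (5, 6, 7)]
    print(boolean_tensor_product(Z).shape)  # (5, 6, 7)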

Figure 1: Graphical intuition for the 3-way Boolean tensor decomposition. Logical conjunction ($\wedge$) of Boolean vectors (columns of the factor matrices) generates rank-1 tensors. The full tensor is the logical disjunction ($\vee$) of all rank-1 tensors.

In plain English, eq. (1) says that an observation $x_{\bar{n}}$ is 1 if and only if there exist one or more latent dimensions in which all corresponding factors are 1. This leads to easily interpretable factor matrices, $Z^{(k)}$, where each latent dimension denotes a subset of entries along mode $k$, for instance subsets of objects, properties and conditions that occur jointly in the data. This can be thought of as a particular type of canonical polyadic (CP) decomposition, as illustrated in Fig. 1. A different graphical intuition starts from the factor rows and considers the 3-way Boolean tensor product as a three-stage template-assignment procedure. Rows of the first factor matrix, $Z^{(1)}$, are one-dimensional binary templates of the size of the first tensor mode. Rows of the second factor matrix, $Z^{(2)}$, indicate possible patterns of appearance of these templates along the second tensor dimension. In the same manner, the third factor matrix, $Z^{(3)}$, indicates which disjunction of these 2D patterns appears in each slice of the tensor, and so forth.

We present the first probabilistic approach to Boolean tensor decomposition, the TensOrMachine, featuring distinctly improved accuracy compared to the previous state-of-the-art methods. We develop scalable, sampling-based posterior inference and infer distributions over factor matrices, which enables full quantification of uncertainty. In contrast to previous approaches, the probabilistic framework readily treats missing data and allows for tensor completion and the integration of prior information. Importantly, the latent representations are amenable to informative and intuitive interpretation.

Related Work

Tensor decomposition methods are widely used in many domains of application, as reviewed by Kolda & Bader (2009). Boolean tensor decomposition was first introduced in the psychometric literature by Leenen et al. (1999) and has received lasting attention following the formal study by Miettinen (2011). The author shows that computing the optimal decomposition is NP-hard and proposes an alternating least squares heuristic as an approximate strategy. A different approach, based on a random-walk procedure, is described by Erdos & Miettinen (2013b) and trades accuracy in the factors for computational scalability. More recently, a distributed Apache Spark implementation of the alternating least squares approach has been brought forward (Park et al., 2017). It demonstrates that alternating least squares is the state-of-the-art for computing accurate decompositions and serves as the baseline method for our experiments. Boolean matrix factorisation (Miettinen et al., 2006) shares the logical structure of Boolean tensor decomposition. It has many real-world applications, such as collaborative filtering (Su & Khoshgoftaar, 2009) and computer vision (Lázaro-Gredilla et al., 2016), and has previously been approached from a probabilistic perspective (Ravanbakhsh et al., 2016; Rukat et al., 2017). Despite the sustained theoretical interest, there have been only a few practical applications of Boolean tensor decomposition, for instance information extraction (Erdos & Miettinen, 2013a) and clustering (Metzler & Miettinen, 2015). One of the contributions of this work is the presentation of Boolean tensor decomposition as an interpretable, versatile and scalable analysis method.

Method

Generative Model

We propose a probabilistic generative process for the model described in eq. (1). Each tensor entry, $x_{\bar{n}}$, is a Bernoulli random variable that equals 1 with a probability greater than $\tfrac{1}{2}$ if the following holds true:

\exists\, l \in \{1, \ldots, L\} \;:\; \forall\, k \in \{1, \ldots, K\} \;:\; z^{(k)}_{n_k l} = 1 .    (2)

The logistic sigmoid $\sigma(x) = (1 + e^{-x})^{-1}$ has the convenient property $\sigma(-x) = 1 - \sigma(x)$ and lets us readily parametrise the corresponding likelihood:

p\big(x_{\bar{n}} \mid \{Z^{(k)}\}, \lambda\big) = \sigma\!\left[ \lambda\, \tilde{x}_{\bar{n}} \left( 1 - 2 \prod_{l=1}^{L} \Big( 1 - \prod_{k=1}^{K} z^{(k)}_{n_k l} \Big) \right) \right] .    (3)

The term inside the parenthesis evaluates to 1 if the condition in eq. (2) is met and to -1 otherwise. Throughout this work, we use a tilde to denote the mapping from $\{0,1\}$ to $\{-1,1\}$, such that $\tilde{x} = 2x - 1$. The noise is parametrised by $\lambda \geq 0$, such that for $\lambda = 0$ the likelihood is constant, independent of the factors, and for $\lambda \to \infty$ all probability mass is put on the deterministic Boolean tensor product following eq. (1). We can specify Bernoulli priors on the factor entries or choose more structured prior distributions. A beta prior on $\sigma(\lambda)$ is a computationally convenient choice, as we discuss in the following section. However, in the relevant regime of thousands or more data points such priors are easily outweighed by the observed data. We therefore assume constant prior distributions in the following derivations and experiments.
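The likelihood of eq. (3) can be evaluated directly, as the sketch below illustrates; it reuses the boolean_tensor_product helper from the previous listing and is a minimal illustration of the parametrisation, not the optimised implementation from our repository.

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def log_likelihood(X, factors, lam):
        # Log-likelihood of a {0,1} tensor X under eq. (3), noise parameter lam >= 0.
        X_tilde = 2.0 * X - 1.0                    # tilde-mapping {0,1} -> {-1,1}
        det = boolean_tensor_product(factors)      # deterministic Boolean product
        det_tilde = 2.0 * det - 1.0                # +1 where eq. (2) holds, else -1
        return np.sum(np.log(sigmoid(lam * X_tilde * det_tilde)))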

Relation to the Noisy-OR

Boolean decomposition models have an interesting relation to the noisy-OR, a canonical model for multi-causal interactions, which makes the assumption of independence between the inhibitions of latent causes (Pearl, 1988). TensOrM violates this assumption, with inhibition occurring on the event level rather than on the cause level. As an example, in analogy to the student data analysed below, consider the event of a student attending a lecture. We assume that this has two possible causes: they may (i) use the opportunity to meet fellow students, or (ii) have a particular question that they wish to clarify in class. An inhibition on the level of causes may occur if they found a satisfying answer to their question in a textbook. Then, the desire to meet their friends still contributes to the likelihood of attending class. This is naturally modelled by the noisy-OR. In contrast, inhibition on the event level may occur if a traffic disruption makes it impossible to get to campus. This is naturally described by a Boolean factor model, where any additional causes to attend the lecture will not contribute to the likelihood. We argue that for many practical purposes the latter is an interesting alternative. Even if we were to believe in the independent inhibition of latent causes, usually only one latent cause triggers an event. Compared to the noisy-OR, TensOrM acts as an Occam's razor and enforces sparse and simple explanations, aiming to associate exactly one hidden cause with every event.

Posterior Inference

We condition any entry of a factor matrix, $z^{(k)}_{n_k l}$, on the state of all other parameters, which yields the full conditional probability for Gibbs sampling:

p\big(z^{(k)}_{n_k l} = 1 \mid \cdot\big) = \sigma\!\left[ \lambda \sum_{\bar{n} \setminus n_k} \tilde{x}_{\bar{n}}\; e^{(k)}_{\bar{n} l} \right] .    (4)

We give a formal derivation in the Supplementary Information A and provide an intuitive explanation in the following. The sigmoid maps from the real line to the interval $(0,1)$. The sum in its argument is taken over all observations whose likelihood may depend on $z^{(k)}_{n_k l}$. These observations are given by the $(K{-}1)$-way sub-tensor of $X$ with index $n_k$ fixed in mode $k$ and correspond to all children of $z^{(k)}_{n_k l}$ in a graphical-model representation. For each of them, the term inside the sigmoid contributes a value in $\{-1, 0, 1\}$. This depends on whether or not $x_{\bar{n}}$ and $z^{(k)}_{n_k l}$ are aligned and on another indicator variable, $e^{(k)}_{\bar{n} l}$, that takes values in $\{0, 1\}$ and denotes whether the state of $z^{(k)}_{n_k l}$ has any relevance for the likelihood of $x_{\bar{n}}$. It takes the form

e^{(k)}_{\bar{n} l} = \prod_{k' \neq k} z^{(k')}_{n_{k'} l} \;\; \prod_{l' \neq l} \left( 1 - \prod_{k'=1}^{K} z^{(k')}_{n_{k'} l'} \right) .    (5)

This variable is again composed of two indicators. The first term is a product over the states of all co-parents of $x_{\bar{n}}$ in the same latent dimension $l$. Following the rules of the Boolean product, all these co-parents need to be one in order for $z^{(k)}_{n_k l}$ to contribute to the likelihood of $x_{\bar{n}}$. This corresponds to the AND operation in eq. (1), which evaluates to zero if any of its arguments is zero. The second term is a product over the co-parents in all other latent dimensions and evaluates to zero if any of them explains away observation $x_{\bar{n}}$. Following the rules of the Boolean product, a single latent dimension is sufficient to explain an observation and makes its likelihood independent of the state of $z^{(k)}_{n_k l}$. This corresponds to the OR operation in eq. (1), which evaluates to one if any of its arguments is one. It is crucial for the speed of our implementation that eq. (5), and thus eq. (4), can be evaluated rapidly. In particular, discovering a zero in either of the two terms of eq. (5) suffices for the whole expression to be zero and avoids the necessity of considering the remainder of the Markov blanket. Pseudocode for the computational procedure is given in Algorithm 1, and a direct Python sketch follows below. Note that updates of $z^{(k)}_{n_k l}$ can trivially be computed in parallel across all rows $n_k$ for a fixed tensor mode $k$. The computation time scales linearly in each tensor dimension and sub-linearly in the latent dimension.

  count ← 0   // initialise integer count for the sum in eq. (4)
  for all tensor indices $\bar{n}$ with $n_k$ fixed do
     // check relevance of $z^{(k)}_{n_k l}$ for $x_{\bar{n}}$
     for $k' \neq k$ do
        if $z^{(k')}_{n_{k'} l} = 0$ then
           // $z^{(k)}_{n_k l}$ has no relevance for $x_{\bar{n}}$
           continue with next $\bar{n}$
        end if
     end for
     // check for explaining away
     for $l' \neq l$ do
        for $k' = 1, \ldots, K$ do
           if $z^{(k')}_{n_{k'} l'} = 0$ then
              continue with next $l'$
           end if
        end for
        // $x_{\bar{n}}$ is explained away
        continue with next $\bar{n}$
     end for
     count ← count + $\tilde{x}_{\bar{n}}$
  end for
  return $\sigma(\lambda \cdot \text{count})$

Algorithm 1: Computation of the full conditional of $z^{(k)}_{n_k l}$, eq. (4).
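For completeness, here is a direct, unvectorised Python transcription of Algorithm 1, reusing the sigmoid helper from above. It is a sketch for exposition; the reference implementation parallelises this computation and exploits early loop termination far more aggressively.

    from itertools import product

    def conditional_prob(factors, X_tilde, k, n_k, l, lam):
        # p(z^(k)[n_k, l] = 1 | everything else), following eq. (4).
        # X_tilde holds entries in {-1, +1}; 0 encodes missing data (see below).
        K, L = len(factors), factors[0].shape[1]
        other = [kk for kk in range(K) if kk != k]
        count = 0
        # iterate over the (K-1)-way sub-tensor with index n_k fixed in mode k
        for idx in product(*(range(factors[kk].shape[0]) for kk in other)):
            full = list(idx)
            full.insert(k, n_k)
            # relevance: all co-parents in latent dimension l must be 1
            if any(factors[kk][i, l] == 0 for kk, i in zip(other, idx)):
                continue
            # explaining away: another latent dimension already explains x
            if any(all(factors[kk][full[kk], lp] == 1 for kk in range(K))
                   for lp in range(L) if lp != l):
                continue
            count += X_tilde[tuple(full)]
        return sigmoid(lam * count)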

After sampling all factor entries from their full conditionals, we set the noise parameter, $\lambda$, to its conditional MAP estimate and repeat until convergence, before drawing posterior samples. The conditional MLE of $\lambda$ is available in closed form and is given by the logit of the fraction of data points that are correctly reconstructed by the deterministic Boolean product of the current state of the factors. As a prior for $\sigma(\lambda)$, it is natural to employ a beta distribution, $\mathrm{Beta}(\alpha, \beta)$. This intuitively affects the MAP estimate, adding $\alpha$ correctly predicted and $\beta$ incorrectly predicted data points, such that

\hat{\lambda} = \mathrm{logit}\!\left( \frac{\sum_{\bar{n}} \mathbb{1}\big[ x_{\bar{n}} = \hat{x}_{\bar{n}} \big] + \alpha}{|X| + \alpha + \beta} \right) .    (6)

Here, the indicator $\mathbb{1}[\cdot]$ evaluates to 1 if its argument is true and to 0 otherwise, and $\hat{x}_{\bar{n}}$ denotes the deterministic Boolean product of the current factors following eq. (1). We see that a uniform prior, $\mathrm{Beta}(1, 1)$, corresponds to applying Laplace's rule of succession to the maximum likelihood estimate.
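A minimal sketch of this update, assuming fully observed data and the $\mathrm{Beta}(\alpha, \beta)$ convention just described (where $\alpha$ and $\beta$ act as pseudo-counts of correct and incorrect predictions, respectively):

    def update_lambda(X, factors, alpha=1.0, beta=1.0):
        # Conditional MAP update of lambda following eq. (6).
        # Note: diverges if every entry is reconstructed exactly and alpha = beta = 0.
        X_hat = boolean_tensor_product(factors)    # deterministic reconstruction
        p = (np.sum(X.astype(bool) == X_hat) + alpha) / (X.size + alpha + beta)
        return np.log(p / (1.0 - p))               # logit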

Missing Data and Prediction

An important application of latent variable models is the prediction of missing observations, e.g. in collaborative filtering or data imputation problems (Su & Khoshgoftaar, 2009). Our probabilistic construction allows us to deal with missing data in a unified way. When computing updates for the latent factors, we can marginalise over any missing observations. Practically, this is equivalent to setting the corresponding entries of $\tilde{x}$ to 0, such that they do not contribute to the conditional probability in eq. (4) and contribute a constant factor of $\sigma(0) = \tfrac{1}{2}$ to the likelihood in eq. (9). Thus, encoding observed data as $\pm 1$ and unobserved data as 0 is the most convenient choice for any practical implementation. For MAP updates of $\lambda$, following eq. (6), unobserved data points are simply excluded.
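A small helper makes this encoding explicit; here we assume, purely for illustration, that missing entries arrive as NaN in a float array:

    def encode_observations(X):
        # Map a {0,1} tensor with NaNs for missing entries to the {-1, 0, +1} encoding:
        # observed entries become -1/+1 (the tilde-mapping), missing entries become 0
        # and thus drop out of both the Gibbs updates and the likelihood.
        return np.where(np.isnan(X), 0.0, 2.0 * X - 1.0)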

For a direct comparison of predictions to previous methods that only provide a point estimate of the factors, we can reconstruct the data based on the Boolean product of the factor MAP estimates. We use the marginal MAP of each factor entry and find no relevant difference to using the joint MAP. If instead we want to use the full posterior, we can determine the reconstruction of $x_{\bar{n}}$ by rounding the posterior predictive,

p(x_{\bar{n}} = 1 \mid X) \approx \frac{1}{S} \sum_{s=1}^{S} p\big(x_{\bar{n}} = 1 \mid \{Z^{(k)}_s\}, \lambda\big) ,    (7)

to the closest binary value. Here, $\{Z^{(k)}_s\}$ is one of $S$ posterior samples. Yet another alternative is to compute the estimate from the factor matrix means, thus taking posterior uncertainty but no higher-order correlations into account. Denoting the posterior mean of a factor entry as $\mu^{(k)}_{n_k l}$, we have $p(x_{\bar{n}} = 1) \approx 1 - \prod_l \big( 1 - \prod_k \mu^{(k)}_{n_k l} \big)$. The predictive accuracy of this computationally cheap approximation is on par with the more expensive posterior predictive estimate, as we show empirically in the next section and in Fig. 2(b).
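The two posterior-based reconstructions can be sketched as follows; samples is a list of factor-matrix lists, one per posterior draw, and for simplicity we use a single $\lambda$ for all draws (per-sample values $\lambda_s$ would be a straightforward extension):

    def posterior_predictive(samples, lam, shape):
        # Monte Carlo estimate of eq. (7): average p(x=1 | Z_s, lam) over draws.
        probs = np.zeros(shape)
        for factors in samples:
            det_tilde = 2.0 * boolean_tensor_product(factors) - 1.0
            probs += sigmoid(lam * det_tilde)
        return probs / len(samples)

    def mean_based_reconstruction(factor_means):
        # Cheap alternative using only the posterior means mu^(k) of the factors.
        L = factor_means[0].shape[1]
        p_off = np.ones(tuple(M.shape[0] for M in factor_means))
        for l in range(L):
            rank1 = factor_means[0][:, l]
            for M in factor_means[1:]:
                rank1 = np.multiply.outer(rank1, M[:, l])
            p_off *= 1.0 - rank1    # prob. that latent dimension l is inactive
        return 1.0 - p_off          # p(x = 1), ignoring higher-order correlations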

Experiments

Simulated Data
Figure 2: The TensOrMachine is compared to the previous state-of-the-art method, distributed Boolean tensor factorisation (dbtf), and properties of its solutions are investigated. All examples are based on tensors generated from the Boolean product of i.i.d. Bernoulli random matrices. Noise is introduced by flipping every entry in the tensor with a given probability. The reconstruction accuracy indicates the fraction of correctly reconstructed entries in the noise-free test data or in the noisy training data. Reconstructions are based on rounding the posterior predictive to the closest value if not mentioned otherwise. All plots indicate means and standard deviations (some are too small to be visible). (a) Random-tensor reconstruction accuracy under variation of the noise level (top), the expected tensor density (centre) and the underlying tensor rank (bottom). Averages are taken across all shown combinations of the other two parameters and across ten random tensors for each such configuration. Test accuracy on noise-free data is shown in solid colours, training accuracy on noisy data in faint colours. Compare to Fig. 8 by Park et al. (2017). (b) Comparison of reconstruction methods: we compare predictions based on the TensOrM posterior predictive, the factor matrix MAP, and the factor matrix mean, as described in the Missing Data and Prediction section, to the dbtf baseline. M1 indicates the performance margin that is due to finding more accurate factor matrices; M2 denotes the margin that is due to posterior averaging. The underlying tensors are of rank 10. The results are averaged across tensor densities and 10 random repetitions. (c) Robustness of training performance under random perturbations: we show reconstruction accuracy on the noisy training data based on the factor matrix MAP estimate under random flips of the factor entries. The underlying tensor has a noise level of 0.4 and an expected density of 10%. We perform 10 random repetitions, but standard deviations are too small to be visible. TensOrM recovers exactly the expected reconstruction accuracy of 60%, while dbtf overfits to the training data. Its reconstruction accuracy deteriorates rapidly under random perturbations, indicating that it has converged to a comparatively narrow mode of the posterior.
Figure 3: Model selection accuracy in terms of the relative error in predicting the tensor rank. Model selection with a Bayesian Occam's razor or with cross-validation (see the Model Selection section) outperforms the previously suggested MDL approach.

We first demonstrate the capabilities of TensOrM on simulated data and compare to the state-of-the-art method, distributed Boolean tensor factorisation (dbtf) (Park et al., 2017). We simulate random 3-way tensors, $X$, varying the tensor rank $L$, the expected density and the noise level. To this end, we take the Boolean product of binary i.i.d. Bernoulli($q$) random matrices, $Z^{(k)}$, such that the expected tensor density is given by $1 - (1 - q^3)^L$. We introduce noise by flipping each entry in the tensor with a given probability. Posterior samples of the factors are drawn following the procedure described in the Posterior Inference section, with the factors initialised i.i.d. at random. The reconstruction of $X$ is determined from the posterior predictive following eq. (7). The reconstruction accuracy is computed as the fraction of correctly reconstructed entries in the noise-free tensor and is shown across a variety of conditions in Fig. 2(a). Our method achieves distinctly higher accuracies throughout all conditions. The margin widens for very noisy data, as well as for higher tensor ranks and for particularly dense or particularly sparse data.
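The simulation setup can be reproduced along the following lines; the shape and parameter values are illustrative defaults, not the exact settings of the experiments:

    def simulate_tensor(shape=(30, 30, 30), L=5, q=0.3, noise=0.1, seed=1):
        # Noise-free Boolean low-rank tensor from i.i.d. Bernoulli(q) factors,
        # plus a noisy copy in which every entry is flipped with probability `noise`.
        rng = np.random.default_rng(seed)
        factors = [rng.random((n, L)) < q for n in shape]
        X_clean = boolean_tensor_product(factors)
        X_noisy = np.logical_xor(X_clean, rng.random(shape) < noise)
        return X_clean, X_noisy, factors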

Can the superior reconstruction performance be explained by implicit model averaging when computing the posterior predictive following eq. (7)? In order to address this conjecture, Fig. 2(b) compares the reconstruction accuracies of the posterior predictive to reconstructions based solely on the MAP estimates of each factor. We can see that posterior averaging plays an important role only in scenarios with high noise levels. The main performance gain, however, is simply due to a more accurate decomposition. In the practically relevant regime of moderate noise levels, posterior averaging has virtually no impact. We further note in Fig. 2(a) that dbtf features a similar or higher reconstruction accuracy with respect to the noisy training data. This indicates over-fitting and becomes particularly apparent for large noise levels and large tensor ranks. What distinguishes the decompositions of dbtf and TensOrM in this regime? In Fig. 2(c) we show the training performance under random perturbations of the inferred factors for a noise level of 0.4 and an expected density of 10%. Our method has a lower training accuracy but is more stable towards random perturbations of the parameters, whereas the training accuracy of dbtf decreases rapidly. This shows that TensOrM converges to solutions in wider posterior modes that have better generalisation properties. Recently, this phenomenon has been studied in order to improve the generalisation of deep neural networks (Chaudhari et al., 2016). By estimating posterior distributions, Bayesian techniques naturally assign more probability to such solutions.

Model Selection

A notorious challenge for latent variable models is the choice of the latent dimensionality. Previously, Erdos & Miettinen (2013b) have used the minimum description length (MDL) principle for this task in Boolean tensor decomposition. We follow their derivation and compare to two approaches that our Bayesian treatment offers readily. For the Bayesian Occam's razor, we start our inference procedure with a large latent dimensionality. After convergence of the Markov chain, we remove all latent dimensions that do not contribute to the likelihood; in these dimensions, one of the factors is usually all zeros and the other factors are uniformly random. After removal we restart the burn-in procedure and repeat until only contributing dimensions remain, as sketched in the code below. In the second approach, cross-validation, we treat 20% of the data as unobserved during training and choose the model dimensionality that achieves the highest posterior predictive accuracy on the held-out data. Results on random tensors for different noise levels are shown in Fig. 3, following the previously described simulation procedure. We find that both approaches clearly outperform MDL. In the case of noise-free observations, the Bayesian Occam's razor features virtually perfect accuracy. For the more realistic scenario of moderate noise levels, cross-validation is superior.
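The pruning step of the Bayesian Occam's razor can be sketched as below; the criterion of dropping any latent dimension in which some factor's column is entirely zero (so that the conjunction can never fire) follows the description above, while the helper name and the tolerance knob are ours:

    def prune_dimensions(factors, tol=0):
        # Drop latent dimensions that cannot contribute to the Boolean product.
        L = factors[0].shape[1]
        keep = [l for l in range(L) if all(Z[:, l].sum() > tol for Z in factors)]
        return [Z[:, keep] for Z in factors]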

Real-World Applications

Here we demonstrate the ability of TensOrM to infer meaningful, interpretable representations from real-world 3-way datasets of temporal interaction networks and temporal seating relations. With these moderately sized datasets of fewer than 100,000 data points, sampling until convergence and drawing 50 samples takes only a few seconds on a single core. Finally, we turn to a large-scale biological example, analysing networks of relative gene expression in cancer patients with more than 10 billion data points. Here, the inference procedure takes around 10 hours.

(b) Dynamic student seating throughout the course: decomposition of the week-seat-student tensor indicating seating positions for a 9-week university course on Android programming.
(c) Representations of cancer patients: each column corresponds to one out of approximately 8,000 cancer patients and indicates which of the latent properties of relative expression among approximately 2,000 genes they exhibit. Patients are ordered by type of cancer. See Fig. 5 in the Supplementary Material for the corresponding representation from PCA on the continuous expression data, as well as for a legend of the disease types.
Figure 4: Real-world examples of factors from TensOrMachine decompositions. Colours indicate posterior means.
Hospital Ward Interaction Networks

Records of contact between pairs of individuals in a university hospital were originally acquired to investigate transmission routes of infectious diseases (Vanhems et al., 2013). Proximity sensor measurements were taken for 75 individuals in 20-second time windows across 5 days. In order to examine daily patterns, we group the time points into 13 time-of-day windows. This leaves us with a binary time-adjacency tensor. We compute the decomposition, choosing 8 latent dimensions based on cross-validation as described in the Model Selection section; the resulting model achieves good predictive accuracy on the held-out test data. Fig. 4(a) shows posterior means of the time-specific and (one of the two equivalent) individual-specific factor matrices. Individuals are labelled as either administrative staff (ADM), medical doctors (MED), nurses (NUR) or patients (PAT). Using this proximity data alone, we were able to recapitulate information about the work patterns of staff groups. For example, 10am-12pm and 3pm-5pm represent the main morning and afternoon clinic hours, when all administrative, medical and nursing staff came into interaction, shown in latent dimensions 1 and 3. Whilst medical doctors are closely interacting throughout typical work hours, 9am-7pm, their network does not include any nurses or patients, as found in dimension 0. This can be explained by frequent interaction in the doctors' room but also hints at a lack of interaction with the nursing staff, a phenomenon frequently discussed in the literature (see e.g. Manias & Street, 2001). 2pm-3pm indicates a period in which nurses and administrative staff are heavily interacting without participation of medical doctors, presumably indicating an afternoon break or daily joint meeting, see latent dimension 3. Nurses, however, were active throughout the day, including being the main staffing body at night, as dimension 7 shows. In comparison, dbtf finds only 2 latent dimensions (0 and 5 in our analysis).

Student Seating Behaviour

Our second example is part of the student-life dataset introduced by Harari et al. (2017) and is given by records of the seating positions of students throughout a 9-week Android programming course. We partition the time points into the weeks of the course and the seating positions into six regions: front/back - left/centre/right. An additional seating category is labelled with a question mark, for cases where the seating coordinates were not provided but the students appeared in class. This yields a week-seat-student tensor. We choose seven latent dimensions, again by optimising for the test-set likelihood, with a reconstruction accuracy of approximately 92% on the held-out data. The decomposition is shown in Fig. 4(b) and is, once more, open to straightforward interpretation. The majority of students show up to class in the first week of the course and sit scattered throughout the lecture hall, as can be seen in dimensions 0 and 1. The group of students that attends lectures most consistently, dimension 4, sits in the front-centre. The remaining groups describe subsets of students, each corresponding to exactly one of the four front/back left/right seating regions. The formation of some of these groups starts only in week 2 of the course, and they all eventually stop appearing in class after 6-7 weeks. The input tensor confirms that only two students attend class in weeks 8 and 9. Similar solutions are found by dbtf and shown in the SI (E), but they lack the ability to characterise uncertainty around the inferred mode.

Relative Gene Expression in Cancer

Dim.     Cancer types                           Pathways
1 (-)    SKCM                                   Tight junctions
4 (+)    LUAD, BLCA                             Hippo
5 (+)    CESC, COAD, SKCM, LUSC, ESCA, OV       Hippo, Wnt
7 (+)    PRAD, THCA                             AGE-RAGE
9 (+)    LGG, PCPG, LAML, GBM, SKCM, HNSC       Rap1 signalling
11 (+)   LAML, LIHC, COAD, STAD                 Actin cytoskeleton
15 (+)   PCPG, LGG, PRAD, GBM, BRCA, SARC       Focal adhesion
16 (-)   HNSC, LUSC, CESC, ESCA                 PI3K-Akt
17 (+)   LAML, SARC, TGCT                       Platelet activation
18 (+)   KIRC, KIRP, THCA, PAAD                 Ras
24 (+)   KIRC, KIRP, PAAD                       Ras

Table 1: Gene-set analysis of selected gene networks indicates their correspondence to cellular pathways. The sign indicates whether genes were relatively over- or under-expressed.

Gene expression profiling has been used extensively in cancer research, e.g. for the characterisation of cancer sub-types, for the stratification of patients or for the identification of therapeutic targets. Correlations in gene expression are at the basis of protein interaction in cellular pathways, and latent variable models are frequently applied to extract biologically meaningful information from such data. We propose a novel way of analysing gene expression data that can be applied to any continuous measurement with object-feature structure (here patient-gene). The publicly available TCGA dataset (Weinstein et al., 2013) contains gene expression measurements of a large variety of cancer patients across different types of cancer. After preprocessing, we are left with approximately 8,000 patients and 1,100 genes. We normalise the expression values for each gene across all patients to allow for comparison between genes. Then the data is transformed into a 3-way gene-gene-patient tensor using the relational encoding

x_{ijn} = \mathbb{1}\big[ e_{in} > e_{jn} \big] ,    (8)

where $e_{in}$ denotes the normalised expression of gene $i$ in patient $n$.

This provides an entirely new view of the data in terms of networks of relative expression. For every latent dimension, the two gene-specific factors indicate two subsets of genes, where genes in the first subset are likely to be more expressed than genes in the second subset. The patient factor indicates which of these relational expression properties each patient exhibits. Importantly, the factors are amenable to distinct and intuitive interpretation with immediate biological relevance. We show the patient-specific factor in Fig. 4(c) with patients ordered by cancer type. We observe a remarkable disease specificity, with many latent expression networks being exclusively present in virtually all cancers of certain types. In addition, some latent properties are scattered throughout cancers of various types, highlighting the heterogeneity of the disease. Application of a traditional method, principal component analysis (PCA) on the continuous array, is shown in the Supplementary Material C. It lacks both the interpretability and the degree of disease-type specificity. We use the representations of both methods as features for random forest classifiers. The predictive accuracy for the disease type is approximately 91% for the binary TensOrM features, 86% for continuous features from PCA and 81% for binary dbtf features; see SI (C, D).
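The relational encoding of eq. (8) is a one-liner with broadcasting; here E is assumed to be the (genes × patients) array of normalised expression values:

    def relational_encoding(E):
        # Binary gene x gene x patient tensor with
        # x[i, j, n] = 1 iff gene i is more expressed than gene j in patient n.
        return E[:, None, :] > E[None, :, :]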

The corresponding gene-specific factors characterise latent gene networks but are more difficult to analyse for a non-specialist. In particular, the richness of the relational latent encoding can only partially be captured by traditional pathway analyses. Nevertheless, we investigate the biological plausibility of the inferred gene sets by running a gene-set enrichment analysis of the genes in each factor dimension against the KEGG pathway database (Kanehisa et al., 2017). Results are indicated in Tab. 1 and underline the biological plausibility and significance of our analysis. We highlight a few examples in the following. Firstly, Hippo signalling has been shown to be active in LUAD, LUSC, COAD, OV and LIHC (Harvey et al., 2013), all recovered by Dims. 4 and 5. Secondly, aberrant Wnt signalling is observed in many cancers but most prominently in COAD (Polakis, 2012), which is found in Dim. 5. Next, PRAD progression is known to be controlled by focal adhesion (Figel & Gelman, 2011), as found in Dim. 15, as well as by the AGE-RAGE signalling pathway, recovered in Dim. 7 (Bao et al., 2015). Finally, Ras signalling is aberrant in most tumours, but most significantly in PAAD (Downward, 2003), as confirmed in Dims. 18 and 24.

Conclusion

We have introduced the first probabilistic approach to Boolean tensor decomposition. It reaches state-of-the-art performance and can readily deal with missing data, which is ubiquitous in large datasets. Due to the particular combinatorial structure of the factor matrix posterior, sampling-based inference scales to large datasets and enables the inference of posterior distributions over factor matrices, providing full uncertainty quantification. We further show that Boolean tensor decomposition leads to insightful latent representations and provides a novel view on the molecular basis of cancer by relationally encoding continuous expression data.

Acknowledgements

This work was supported by The Alan Turing Institute under the EPSRC grant EP/N510129/1. CY is supported by MRC grant MR/P02646X/1. TR is funded by EPSRC grant EP/G037280/1. TR thanks Taha Ceritli for helpful comments.

References

  • Bao et al. (2015) Bao, J. M., He, M. Y., Liu, Y. W., Lu, Y. J., Hong, Y. Q., Luo, H. H., Ren, Z. L., Zhao, S. C., and Jiang, Y. AGE/RAGE/Akt pathway contributes to prostate cancer cell proliferation by promoting Rb phosphorylation and degradation. Am J Cancer Res, 5(5):1741–1750, 2015.
  • Chaudhari et al. (2016) Chaudhari, Pratik, Choromanska, Anna, Soatto, Stefano, and LeCun, Yann. Entropy-sgd: Biasing gradient descent into wide valleys. arXiv preprint arXiv:1611.01838, 2016.
  • Downward (2003) Downward, Julian. Targeting ras signalling pathways in cancer therapy. Nature Reviews Cancer, 3(1):11, 2003.
  • Erdos & Miettinen (2013a) Erdos, Dóra and Miettinen, Pauli. Discovering facts with boolean tensor tucker decomposition. In Proceedings of the 22nd ACM international conference on Conference on information & knowledge management, pp. 1569–1572. ACM, 2013a.
  • Erdos & Miettinen (2013b) Erdos, Dóra and Miettinen, Pauli. Walk’n’merge: A scalable algorithm for boolean tensor factorization. In Data Mining (ICDM), 2013 IEEE 13th International Conference on, pp. 1037–1042. IEEE, 2013b.
  • Figel & Gelman (2011) Figel, S. and Gelman, I. H. Focal adhesion kinase controls prostate cancer progression via intrinsic kinase and scaffolding functions. Anticancer Agents Med Chem, 11(7):607–616, Sep 2011.
  • Harari et al. (2017) Harari, Gabriella M., Gosling, Samuel D., Wang, Rui, Chen, Fanglin, Chen, Zhenyu, and Campbell, Andrew T. Patterns of behavior change in students over an academic term: A preliminary study of activity and sociability behaviors using smartphone sensing methods. Computers in Human Behavior, 67:129 – 138, 2017. ISSN 0747-5632. doi: https://doi.org/10.1016/j.chb.2016.10.027.
  • Harvey et al. (2013) Harvey, Kieran F, Zhang, Xiaomeng, and Thomas, David M. The hippo pathway and human cancer. Nature reviews Cancer, 13(4):nrc3458, 2013.
  • Kanehisa et al. (2017) Kanehisa, M., Furumichi, M., Tanabe, M., Sato, Y., and Morishima, K. KEGG: new perspectives on genomes, pathways, diseases and drugs. Nucleic Acids Res., 45(D1):D353–D361, Jan 2017.
  • Kolda & Bader (2009) Kolda, Tamara G. and Bader, Brett W. Tensor decompositions and applications. SIAM REVIEW, 51(3):455–500, 2009.
  • Lázaro-Gredilla et al. (2016) Lázaro-Gredilla, Miguel, Liu, Yi, Phoenix, D Scott, and George, Dileep. Hierarchical Compositional Feature Learning. arXiv preprint arXiv:1611.02252, 2016.
  • Leenen et al. (1999) Leenen, Iwin, Van Mechelen, Iven, De Boeck, Paul, and Rosenberg, Seymour. Indclas: A three-way hierarchical classes model. Psychometrika, 64(1):9–24, Mar 1999. ISSN 1860-0980. doi: 10.1007/BF02294316.
  • Manias & Street (2001) Manias, E. and Street, A. Nurse-doctor interactions during critical care ward rounds. J Clin Nurs, 10(4):442–450, Jul 2001.
  • Metzler & Miettinen (2015) Metzler, Saskia and Miettinen, Pauli. Clustering boolean tensors. Data Mining and Knowledge Discovery, 29(5):1343–1373, 2015.
  • Miettinen (2011) Miettinen, Pauli. Boolean tensor factorizations. In Data Mining (ICDM), 2011 IEEE 11th International Conference on, pp. 447–456. IEEE, 2011.
  • Miettinen et al. (2006) Miettinen, Pauli, Mielikäinen, Taneli, Gionis, Aristides, Das, Gautam, and Mannila, Heikki. The Discrete Basis Problem, pp. 335–346. Springer Berlin Heidelberg, 2006. ISBN 978-3-540-46048-0. doi: 10.1007/11871637˙33.
  • Park et al. (2017) Park, Namyong, Oh, Sejoon, and Kang, U. Fast and scalable distributed boolean tensor factorization, 04 2017.
  • Pearl (1988) Pearl, Judea. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 1988. ISBN 0-934613-73-7.
  • Polakis (2012) Polakis, Paul. Wnt signaling in cancer. Cold Spring Harbor perspectives in biology, 4(5):a008052, 2012.
  • Ravanbakhsh et al. (2016) Ravanbakhsh, Siamak, Póczos, Barnabás, and Greiner, Russell. Boolean Matrix Factorization and Noisy Completion via Message Passing. In Proceedings of The 33rd International Conference on Machine Learning, volume 48 of JMLR: W&CP, 2016.
  • Rukat et al. (2017) Rukat, Tammo, Holmes, Chris C., Titsias, Michalis K., and Yau, Christopher. Bayesian Boolean Matrix Factorisation. Proceedings of the 34th Annual International Conference on Machine Learning, pp. 2969–2978, jul 2017. ISSN 1938-7228.
  • Su & Khoshgoftaar (2009) Su, Xiaoyuan and Khoshgoftaar, Taghi M. A Survey of Collaborative Filtering Techniques. Adv. in Artif. Intell., 2009:4:2–4:2, jan 2009. ISSN 1687-7470. doi: 10.1155/2009/421425.
  • Vanhems et al. (2013) Vanhems, Philippe, Barrat, Alain, Cattuto, Ciro, Pinton, Jean-François, Khanafer, Nagham, Régis, Corinne, Kim, Byeul-a, Comte, Brigitte, and Voirin, Nicolas. Estimating potential infection transmission routes in hospital wards using wearable proximity sensors. PLoS ONE, 8(9):e73970, 09 2013. doi: 10.1371/journal.pone.0073970.
  • Weinstein et al. (2013) Weinstein, John N, Collisson, Eric A, Mills, Gordon B, Shaw, Kenna R Mills, Ozenberger, Brad A, Ellrott, Kyle, Shmulevich, Ilya, Sander, Chris, Stuart, Joshua M, Network, Cancer Genome Atlas Research, et al. The cancer genome atlas pan-cancer analysis project. Nature genetics, 45(10):1113, 2013.

 

TensOrMachine – Supplementary Information

 

Tammo Rukat, Chris C. Holmes, Christopher Yau

A. Derivation of the Full Conditional

Here we derive the full conditional distribution for a factor entry $z^{(k)}_{n_k l}$, as given in eq. (4) of the main paper. For constant priors, the conditional is given by normalising the likelihood over $z^{(k)}_{n_k l} \in \{0, 1\}$. The likelihood has the form

p\big(X \mid \{Z^{(k)}\}, \lambda\big) = \prod_{\bar{n}} \sigma\!\left[ \lambda \tilde{x}_{\bar{n}} \left( 1 - 2 \prod_{l=1}^{L} \Big( 1 - \prod_{k=1}^{K} z^{(k)}_{n_k l} \Big) \right) \right]    (9)

and is factorial in the data points $x_{\bar{n}}$. Terms that do not depend on $z^{(k)}_{n_k l}$ cancel in the conditional, and thus we take the product over all $\bar{n}$ with $n_k$ fixed. Then normalising gives

p\big(z^{(k)}_{n_k l} = 1 \mid \cdot\big) = \frac{p\big(X \mid z^{(k)}_{n_k l} = 1, \cdot\big)}{p\big(X \mid z^{(k)}_{n_k l} = 1, \cdot\big) + p\big(X \mid z^{(k)}_{n_k l} = 0, \cdot\big)}    (10)

= \sigma\!\left[ \sum_{\bar{n} \setminus n_k} \log \frac{p\big(x_{\bar{n}} \mid z^{(k)}_{n_k l} = 1, \cdot\big)}{p\big(x_{\bar{n}} \mid z^{(k)}_{n_k l} = 0, \cdot\big)} \right] .    (11)

Considering the term inside the logarithm in eq. (11) and using eq. (9), we find

\log \frac{p\big(x_{\bar{n}} \mid z^{(k)}_{n_k l} = 1, \cdot\big)}{p\big(x_{\bar{n}} \mid z^{(k)}_{n_k l} = 0, \cdot\big)} = \begin{cases} 0 & \text{if } e^{(k)}_{\bar{n} l} = 0 \\ \lambda \tilde{x}_{\bar{n}} & \text{if } e^{(k)}_{\bar{n} l} = 1 \end{cases}    (12)

The first case describes all situations where the term inside the parenthesis in eq. (9) takes a value that is independent of the value of $z^{(k)}_{n_k l}$. The second case follows in all other situations by expanding the logistic sigmoid, using $\sigma(a)/\sigma(-a) = e^{a}$. Hence, we can write eq. (11) as

p\big(z^{(k)}_{n_k l} = 1 \mid \cdot\big) = \sigma\!\left[ \lambda \sum_{\bar{n} \setminus n_k} \tilde{x}_{\bar{n}}\, e^{(k)}_{\bar{n} l} \right] .    (13)
B. Cancer Type Abbreviations
  • BLCA: Bladder Urothelial Carcinoma

  • BRCA: Breast invasive carcinoma

  • CESC: Cervical squamous cell carcinoma and endocervical adenocarcinoma

  • COAD: Colon adenocarcinoma

  • ESCA: Esophageal carcinoma

  • GBM: Glioblastoma multiforme

  • HNSC: Head and Neck squamous cell carcinoma

  • KIRC: Kidney renal clear cell carcinoma

  • KIRP: Kidney renal papillary cell carcinoma

  • LAML: Acute Myeloid Leukemia

  • LGG: Brain Lower Grade Glioma

  • LIHC: Liver hepatocellular carcinoma

  • LUAD: Lung adenocarcinoma

  • LUSC: Lung squamous cell carcinoma

  • OV: Ovarian serous cystadenocarcinoma

  • PAAD: Pancreatic adenocarcinoma

  • PCPG: Pheochromocytoma and Paraganglioma

  • PRAD: Prostate adenocarcinoma

  • SARC: Sarcoma

  • SKCM: Skin Cutaneous Melanoma

  • STAD: Stomach adenocarcinoma

  • TGCT: Testicular Germ Cell Tumours

  • THCA: Thyroid carcinoma

C. Patient Representation under PCA

Figure 5: Latent representation of the cancer patients under principal component analysis, arranged in analogy to Fig. 4(c) in the main paper.
D. Patient Representation under dbtf

Fig. 6 shows the patient representations for relative gene-expression networks computed by dbtf. The analysis is limited to 350 genes, since our 32-core, 128 GB machine runs out of memory for larger analyses using the implementation provided by Park et al. (2017). Data that was treated as missing in our original analysis is treated as 0 here. Comparing to Fig. 4(c), we observe similar but noisier disease-specific patterns.

Figure 6: Latent representation of the cancer patients under dbtf, arranged in analogy to Fig. 4(c) in the main paper.
E. dbtf Analysis of the Student Seating Data

Fig. 7 shows the results of dbtf, corresponding to the analysis in Fig. 4(b) of the main paper. While this solution looks similar to a possible point estimate of the TensOrM analysis, dbtf lacks the ability to characterise posterior uncertainty, which is crucial in this example. For the hospital data, dbtf infers only 3 latent dimensions. These correspond to dimensions 2, 5, and 7 of the TensOrM analysis. Note that we cannot assess the predictive performance, since dbtf is unable to deal with held-out data.

Figure 7: Student seating throughout the course, analysed with dbtf (left). The latent representations are ordered to emphasise similarity to Fig. 4(b) in the main paper, a copy of which is shown on the right-hand side for easier reference. The correspondence is as follows [dbtf latent dimension → TensOrM latent dimension]: [0 → (0,1)], [1 → 2], [2 → 3], [3 → 4], [4 → 5], [(5,6) → 6], [7 → 7].