Abstract
Boolean tensor decomposition approximates data of multiway binary relationships as product of interpretable lowrank binary factors, following the rules of Boolean algebra. Here, we present its first probabilistic treatment. We facilitate scalable samplingbased posterior inference by exploitation of the combinatorial structure of the factor conditionals. Maximum a posteriori decompositions feature higher accuracies than existing techniques throughout a wide range of simulated conditions. Moreover, the probabilistic approach facilitates the treatment of missing data and enables model selection with much greater accuracy. We investigate three realworld datasets. First, temporal interaction networks in a hospital ward and behavioural data of university students demonstrate the inference of instructive latent patterns. Next, we decompose a tensor with more than 10 billion data points, indicating relations of gene expression in cancer patients. Not only does this demonstrate scalability, it also provides an entirely novel perspective on relational properties of continuous data and, in the present example, on the molecular heterogeneity of cancer. Our implementation is available on GitHub^{2}^{2}2https://github.com/TammoR/LogicalFactorisationMachines.
oddsidemargin has been altered.
marginparsep has been altered.
topmargin has been altered.
marginparwidth has been altered.
marginparpush has been altered.
paperheight has been altered.
The page layout violates the ICML style.
Please do not change the page layout, or include packages like geometry,
savetrees, or fullpage, which change it for you.
We’re not able to reliably undo arbitrary changes to the style. Please remove
the offending package(s), or layoutchanging commands and try again.
TensOrMachine: Probabilistic Boolean Tensor Decomposition
Tammo Rukat ^{0 }^{0 } Chris C. Holmes ^{0 }^{0 } Christopher Yau ^{0 }^{0 }
\@xsect
Matrix decomposition methods such as factor analysis are a widely used class of models for unsupervised learning. Typically, they operate on twoway matrices with rows thought of as objects and columns thought of as properties. The decomposition amounts to computing two lowrank matrices, one that contains prototypes of properties (loadings) and one that denotes a compressed representation of each object as a combination of the prototypes (factors). However, these methods are illsuited for data with higher arity such as ternary interactions. Examples include network interactions at different timepoints, genegene interactions for different individuals or any twoway data under different experimental conditions. Such data requires methods that specifically account for the higherorder relationship and that are commonly referred to as tensor decomposition.
Boolean tensor decomposition factorises a way binary tensor , into binary factor matrices , using the Boolean algebra such that
(1) 
Here, is a single tensor entry and denotes a tuple of indices . We call the modes of the tensor, and denote the logical OR and the logical AND operation, respectively.
In plain English, eq. (1) says that an observation is , if and only if there exist one or more latent dimensions in which all corresponding factors are . This leads to easily interpretable factor matrices, , where each latent dimension denotes a subset of entries along mode , for instance subsets of objects, properties and conditions that occur jointly in the data. This can be thought of as a particular type of canonical polyadic (CP) decomposition, as illustrated in Fig. 1. A different graphical intuition starts from the factor rows and considers the 3way Boolean tensor product as a threestage template assignment procedure. Rows of a factor matrix are onedimensional binary templates of size of the first tensor mode. Rows of the second factor matrix indicates possible patterns of appearance of these templates along the second tensor dimensions. In the same manner, the third factor matrix, , indicates which disjunction of these 2D patterns appears in each slice of the tensor, and so forth.
We present the first probabilistic approach to Boolean tensor decomposition, the TensOrMachine, featuring distinctly improved accuracy compared to the previous stateoftheart methods. We develop scalable, samplingbased posterior inference and infer distributions over factor matrices which enables full quantification of uncertainty. In contrast to previous approaches, the probabilistic framework readily treats missing data, allows for tensor completion and integration of prior information. Importantly, the latent representations are amenable to informative and intuitive interpretation.
Tensor decomposition methods are widely used in many domains of application as reviewed by Kolda & Bader (2009). Boolean Tensor decomposition was first introduced in the Psychometric literature by Leenen et al. (1999) and has received lasting attention, following the formal study by Miettinen (2011). The author shows that computing the optimal decomposition is NPhard and proposes an alternating least squares heuristics as approximate strategy. A different approach based on a randomwalk procedure is described by Erdos & Miettinen (2013b) and trades accuracy in the factors for computational scalability. More recently a distributed Apache Spark implementation of the alternating least squares approach has been brought forward (Park et al., 2017). It demonstrates that alternating leasts squares is the stateoftheart for computing accurate decompositions and serves as baseline method for our experiments. Boolean Matrix Factorisation (Miettinen et al., 2006) shares the logical structure of Boolean tensor decomposition. It has many realworld applications such as collaborative filtering (Su & Khoshgoftaar, 2009) and computer vision (LázaroGredilla et al., 2016) and has previously been approached under a probabilistic perspective (Ravanbakhsh et al., 2016; Rukat et al., 2017). Despite the sustained theoretical interest, there have been only few practical applications of Boolean tensor decomposition, as for instance for information extraction (Erdos & Miettinen, 2013a) and clustering (Metzler & Miettinen, 2015). One of the contributions of this work is the presentation of Boolean Tensor decomposition as an interpretable, versatile and scalable analysis method.
We propose a probabilistic generative process for the model described in eq. (1). Each tensor entry, , is a Bernoulli random variable that equals 1 with a probability greater if the following holds true
(2) 
The logistic sigmoid has the convenient property and lets us readily parametrise the corresponding likelihood
(3) 
The term inside parenthesis evaluates to 1 if the condition in eq. (2) is met and to 1, otherwise. Throughout this work, we use a tilde to denote the mapping from to such that . The noise is parametrised by , such that for the likelihood is constant, independent of the factors and for all probability mass is put on the deterministic Boolean tensor product following eq. (1). We can specify Bernoulli priors on the observations or choose more structured prior distributions. A betaprior on is a computationally convenient choice as we discuss in the following section. However, in the relevant regime of thousand and more datapoints such priors are easily outweighed by the observed data. We therefore assume constant prior distributions in the following derivations and experiments.
Boolean decomposition models have an interesting relation to the noisyOR, a canonical model for multicausal interactions, which makes the assumption of independence between the inhibition of latent causes (Pearl, 1988). TensOrM violates this assumption with inhibition occurring on the eventlevel rather than on the causelevel. As an example, in analogy to the data presented in Section. id1, consider the event of a students attending a lecture. We assume that this has two possible causes. They may (i) use the opportunity to meet fellow students (ii) have a particular question that they wish to clarify in class. An inhibition on the level of causes may occur if they found a satisfying answer to their question in a textbook. Then, the desire to meet their friends still contributes to the likelihood of attending class. This naturally modelled by the noisyOR. In contrast, inhibition on the eventlevel may occur if a traffic disruption makes it impossible to get to campus. This is naturally described by a Boolean factor model, where any additional causes to attend the lecture will not contribute to the likelihood. We argue that for many practical purposes the latter is an interesting alternative. Even if we were to believe in the independent inhibition of latent causes, usually only one latent cause triggers an event. Compared to the noisyOR, TensOrM acts as an Occam’s razor and enforces sparse and simple explanations, aiming to associate exactly one hidden cause with every event.
We condition any entry of a factor matrix, , on the state of all other parameters which yields the full conditional probability for Gibbs sampling:
(4) 
We give a formal derivation in the Supplementary Information A and provide an intuitive explanation in the following. The sigmoid maps from the real line to the interval of . The sum in its argument is taken over all observations whose likelihood may depend on . These observations are given by the –way subtensor of X with dimension fixed and correspond to all children of in a graphical model representation. For each of them, the term inside the sigmoid contributes values in . This depends on whether or not and are aligned and an another indicator variable that take values in and denotes whether the state of has any relevance for the likelihood of . It takes the form
(5) 
This variable is again composed of two indicators. The first term is a product over the state of all coparents of to observation in the same latent dimension . Following the rules of the Boolean product, all these coparents need to be one in order for to contribute to the likelihood of . This corresponds to the ANDoperation in eq. (1) that evaluates to zero if any its arguments are zero. The second term is a product of the coparents in all other latent dimensions and evaluates to zero if any of them explains away observation . Following the rules of the Boolean product, a single latent dimension is sufficient to explain an observation and makes its likelihood independent of the state of . This corresponds to the ORoperation in eq. (1) that evaluates to one if any of its arguments is one. It is crucial for the speed of our implementation that eq. (5), and thus eq. (4), can be rapidly evaluated. In particular, discovering a zero in any of the two terms of eq. (5) suffices for the whole expression to be zero and avoids the necessity of considering the remainder of the Markov blanket. Pseudocode for the computational procedure is given in Algorithm 1. Note that updates of can trivially be computed in parallel across for a fixed tensormode . The computation time scales linearly in each tensor dimension and sublinear in the latent dimension.
After a sampling all factor entries from their full conditional, we set the noiseparameter, , to its conditional MAP estimate and repeat until convergence before drawing posterior samples. The conditional MLE of is available in closed form and given by the logit of the fraction of datapoints that are correctly reconstructed by the deterministic Boolean product of the current state of the factors. As a prior for , it is natural to employ a Betadistribution, . This intuitively affects the MAP estimate, adding correctly predicted data points and incorrectly predicted data points, such that
(6) 
Here, the indicator evaluates to 1 if its arguments is true and to 0 otherwise. We see that a uniform prior, , corresponds to applying Laplace’s rule of succession to the maximum likelihood estimate.
An important application of latent variable models is the prediction of missing observations, e.g. in collaborative filtering or data imputation problems (Su & Khoshgoftaar, 2009). Our probabilistic construction allows us to deal with missing data in a unified way. When computing updates for the latent factors, we can marginalise over any missing observations. Practically, this is equivalent to setting any missing entries to , such that they do not contribute to the conditional probability in eq. (4) and contribute a factor of to the likelihood in eq. (9). Thus, encoding observed data as and unobserved data as 0 is the most convenient choice for any practical implementation. For MAP updates of , following eq. (6), unobserved datapoints points are simply excluded.
For a direct comparison of predictions to previous methods that only provide a point estimate of the factors, we can reconstruct the data based on the Boolean product of the factor MAP estimates. We use the marginal MAP of each factor entry, and find no relevant difference to using the joint MAP. If instead we want to use the full posterior, we can determine the reconstruction of by rounding the posterior predictive,
(7) 
to the closest binary value. Here, is a posterior sample. Yet another alternative is to compute the estimate from the factor matrix mean, thus taking posterior uncertainty but no higherorder correlations into account. Denoting the posterior mean of a factor entry as , we have . The predictive accuracy of this computationally cheap approximation is on par with the more expensive posterior predictive estimate as we show empirically in the next section and in Fig. 2(b).
We first demonstrate the capabilities of TensOrM on simulated data and compare to the stateoftheart method distributed boolean tensor factorisation (dbtf) (Park et al., 2017). We simulate random 3way tensors, , of size and vary rank , expected density and noiselevel. To this end, we take the Boolean product between binary i.i.d. random matrices, , of size , such that the expected tensor density is given by . We introduce noise by flipping each entry in the tensor with a given probability. Posterior samples of the factors are drawn, following the procedure described in Section id1 with initialised to and the initial factors drawn i.i.d. Bern(). The reconstruction of is determined based on the posterior predictive following eq. (7). The reconstruction accuracy is computed as the fraction of correctly reconstructed entries in the noisefree tensor and is shown across a variety of conditions in Fig. 2(a). Our method achieves distinctly higher accuracies throughout all conditions. The margin becomes bigger for very noisy data, as well as for higher tensor ranks and for particularly dense or particularly sparse data.
Can the superior reconstruction performance be explained by implicit model averaging when computing the posterior predictive following eq. (7)? In order to address this conjecture, Fig. 2(b) compares the reconstruction accuracies of the posterior predictive compared to the reconstruction based solely on the MAP estimates of each factor. We can see, that posterior averaging plays an important role only in scenarios with high noise levels. The main performance gain, however, is simply due to a more accurate decomposition. In the practically relevant regime of moderate noise levels below , posterior averaging has virtually no impact. We further note in Fig. 2(a), that dbtf features a similar or higher reconstruction accuracy with respect to the noisy training data. This indicates overfitting and becomes particularly apparent for large noise levels and large tensor ranks. What distinguishes the decompositions of dbtf and TensOrM in this regime? In Fig. 2(c) we show the training performance under random perturbations of the inferred factors for a noise level of and a expected density of 10%. Our method has a lower training accuracy but is more stable towards random perturbations of the parameters whereas the training accuracy of dbtf decreases rapidly. This shows that TensOrM converges to solutions of larger pointwise density in the parameter space that have better generalisation properties. Recently, this phenomenon has been studied in order to improve the generalisation of deep neural networks (Chaudhari et al., 2016). Estimating posterior distributions, Bayesian techniques naturally assign more probability to such solutions.
A notorious challenge for latent variable models is the choice of dimensionality. Previously, Erdos & Miettinen (2013b) have used the Minimum description length (MDL) principle for this task in Boolean Tensor Decomposition. We follow their derivation and compare to two approaches that our Bayesian treatment offers readily. For the Bayesian Occam’s razor, we start our inference procedure with a large latent dimensionality. After convergence of the Markov chain, we remove all latent dimensions that do not contribute to the likelihood. In these dimensions one of the factors is usually all zeros and the other factors are uniformly random. After removal we restart the burnin procedure and repeat until only contributing dimensions remain. In the second approach, cross validation, we treat 20% of the data as unobserved during training and choose the model dimensionality that achieves the highest posterior predictive accuracy on the heldout data. Results on random matrices for different noise levels are shown in Fig.3, following the previously described simulation procedure. We find that both approaches clearly outperform MDL. In the case of noisefree observations the Bayesian Occam’s razor features virtually perfect accuracy. For the more realistic scenario of moderate noise levels cross validation is superior.
Here we demonstrate the ability of TensOrM to infer meaningful, interpretable representations from realworld 3way datasets of temporal interaction networks and temporal –relations. With these moderately sized datasets of less than 100,000 datapoints, sampling until convergence and drawing 50 samples takes only few seconds on a single core. Eventually, we turn to a largescale biological example, analysing networks of relative geneexpression in cancer patients with more than 10 billion data points. Here, the inference procedure takes around 10 hours.
Records of contact between pairs of individuals in a university hospital have originally been acquired to investigate transmission routes of infectious diseases (Vanhems et al., 2013). Proximity sensor measurements were taken for 75 individuals in 20 second time windows across 5 days. In order to examine daily patterns, we group the timepoints into 13 timeofday windows. This leaves us with a binary timeadjacency tensor. We compute the decomposition, choosing 8 latent dimensions based on crossvalidation as described in Section id1. The predictive accuracy on the heldout test data is approximately . Fig. 4(a) shows posterior means of the timespecific and (one of the two equivalent) individualspecific factor matrices. Individuals are labelled as either administrative staff (ADM), medical doctors (MED), nurses (NUR) or patients (PAT). Using this proximity data alone, we were able to recapitulate information about the work patterns of staff groups. For example, 10am12pm and 3pm5pm represent the main morning and afternoon clinic hours when all administrative, medical and nursing staff came into interaction, shown in latent dimensions 1 and 3. Whilst medical doctors are closely interacting throughout typical work hours, 9am7pm, their network does not include any nurses or patients as found in dimension 0. This can be explained by frequent interaction in the doctor’s room but also hints to a lack of interaction with the nursing staff, a phenomenon frequently discussed in the literature (see e.g. Manias & Street, 2001). 2pm3pm indicates a period in which nurses and administrative staff are heavily interacting without participation of medical doctors, presumably indicating an afternoon break or daily joint meeting, see latent dimension 3. Nurses, however, were active throughout the day including being the main staffing body at night as dimension 7 shows. In comparison, dbtf finds only 2 latent dimensions (0 and 5 in our analysis).
Our second example is part of the studentlife dataset introduced by Harari et al. (2017) and given by records of the seating positions of students throughout a 9week Android programming course. We partition the timepoints into weeks of the course and the seating positions into six regions front/back  left/centre/right. An additional seating category is labelled with a question mark, where the seating coordinates were not provided but the students appeared in class. This yields a weekseatstudent tensor. We choose seven latent dimensions, again, by optimising for the testset likelihood with a reconstruction accuracy of approximately 92% on the held out data. The decomposition is shown in Fig. 4(b) and, once more, open to straightforward interpretation. The majority of students shows up to class in the first week of the course and sits scattered throughout the lecture hall, as can be seen in dimensions 0 and 1. The group of students that attends lectures most consistently, dimension 4, sits in the frontcentre. The remaining groups describe subsets of students, each corresponding to exactly one of the four front/back left/right seating regions. The formation of some of these groups starts only in week 2 of the course and they all eventually stop to appear in class after 67 weeks. The input tensor confirms that only two students attend class in weeks 8 and 9. Similar solutions are found by dbtf and shown in the SI (E) but lack the ability to characterise uncertainty around the inferred mode.

Cancer types  Pathways  

1 ()  SKCM  Tight junctions  
4 (+) 
LUAD, BLCA  Hippo  
5 (+) 

Hippo, Wnt  
7 (+) 
PRAD, THCA  AGERAGE  
9 (+) 

Rap1 signalling  
11 (+) 



15 (+) 



16 () 

PI3KAkt  
17 (+) 
LAML, SARC, TGCT 


18 (+) 

Ras  
24 (+) 
KIRC, KIRP, PAAD  Ras 
Gene expression profiling has been used extensively in cancer research, e.g. for the characterisation of cancer subtypes, for the stratification of patients or for the identification of therapeutic targets. Correlations in gene expression are at the basis of protein interaction in cellular pathways and latent variable models are frequently applied to extract biologically meaningful information from such data. We propose a novel way of analysing gene expression data that can be applied to any continuous measurement with structure (here ). The publicly available TCGA dataset (Weinstein et al., 2013) contains gene expression measurements of a large variety of cancer patients across different types of cancer. After preprocessing, we are left with approximately 8,000 patients and 1,100 genes. We normalise the expression values for each gene across all patients to allow for comparison between genes. Then the data is transformed into a 3way tensor using the relational encoding
(8) 
This provides an entirely new view of the data in terms of networks of relative expression. For every latent dimension, the two gene specific factors indicate a subset of genes, where genes in the first subset are likely to be more expressed than genes in the second subset. The patient factor indicates which of these relational expression properties each patient exhibits. Importantly, the factors are amenable to distinct and intuitive interpretation with immediate biological relevance. We show the patientspecific factor in Fig. 4(c) with patients ordered by cancertype. We observe a remarkable disease specificity with many latent expression networks being exclusively present in virtually all cancers of certain types. In addition, some latent properties are scattered throughout cancers of various types, highlighting the heterogeneity of the disease. Application of a traditional method, principal component analysis (PCA) on the continuous array, is shown in the Supplementary Material C. It lacks both, the interpretability and the degree of diseasetype specificity. We use the representations of both methods as features for random forest classifiers. The predictive accuracy of the disease type is approximately 91% for the binary TensOrM features, 86% for continuous features from PCA and 81% for binary dbtf features, shown in see SI (C, D).
The corresponding genespecific factors characterise latent genenetworks but are more difficult to analyse for a nonspecialist. In particular, the richness of the relational latent encoding can only partially be captured by traditional pathway analyses. Nevertheless, we investigate the biological plausibility of the inferred gene sets by running a gene set enrichment analysis of the genes in each factor dimension against the KEGG pathway database (Kanehisa et al., 2017). Results are indicated in Tab. 1 and highlight the biological plausibility and significance of our analysis. We underline a few examples in the following. Firstly, Hippo signalling has been shown to be active in LUAD, LUSC, COAD, OV and LIHC (Harvey et al., 2013), all recovered by Dims. 4, 5. Secondly, aberrant Wnt signalling is observed in many cancers but most prominently in COAD (Polakis, 2012) which is found in Dim. 5. Next, PRAD progression is known to be controlled by focal adhesion (Figel & Gelman, 2011) as found in Dim. 15 as well as by the AGERAGE signalling pathway recovered in Dim. 7 (Bao et al., 2015). Finally, Ras signalling is aberrant in most tumours, but most significantly in PAAD (Downward, 2003) as confirmed in dims. 18, 24.
We have introduced the first probabilistic approach to Boolean tensor decomposition. It reaches stateoftheart performance and can readily deal with missing data which is ubiquitous in large datasets. Due to the particular combinatorial structure of the factor matrix posterior, sampling based inference scales to large datasets and enables the inference of posterior distributions over factor matrices, providing full uncertainty quantification. We further show that Boolean tensor decomposition leads to insightful latent representations and provides a novel view on the molecular basis of cancer by relationally encoding continuous expression data.
Acknowledgements This work was supported by The Alan Turing Institute under the EPSRC grant EP/N510129/1. CY is supported by MRC grant MR/P02646X/1. TR is funded by EPSRC grant EP/G037280/1. TR thanks Taha Ceritli for helpful comments.
References
 Bao et al. (2015) Bao, J. M., He, M. Y., Liu, Y. W., Lu, Y. J., Hong, Y. Q., Luo, H. H., Ren, Z. L., Zhao, S. C., and Jiang, Y. AGE/RAGE/Akt pathway contributes to prostate cancer cell proliferation by promoting Rb phosphorylation and degradation. Am J Cancer Res, 5(5):1741–1750, 2015.
 Chaudhari et al. (2016) Chaudhari, Pratik, Choromanska, Anna, Soatto, Stefano, and LeCun, Yann. Entropysgd: Biasing gradient descent into wide valleys. arXiv preprint arXiv:1611.01838, 2016.
 Downward (2003) Downward, Julian. Targeting ras signalling pathways in cancer therapy. Nature Reviews Cancer, 3(1):11, 2003.
 Erdos & Miettinen (2013a) Erdos, Dóra and Miettinen, Pauli. Discovering facts with boolean tensor tucker decomposition. In Proceedings of the 22nd ACM international conference on Conference on information & knowledge management, pp. 1569–1572. ACM, 2013a.
 Erdos & Miettinen (2013b) Erdos, Dóra and Miettinen, Pauli. Walk’n’merge: A scalable algorithm for boolean tensor factorization. In Data Mining (ICDM), 2013 IEEE 13th International Conference on, pp. 1037–1042. IEEE, 2013b.
 Figel & Gelman (2011) Figel, S. and Gelman, I. H. Focal adhesion kinase controls prostate cancer progression via intrinsic kinase and scaffolding functions. Anticancer Agents Med Chem, 11(7):607–616, Sep 2011.
 Harari et al. (2017) Harari, Gabriella M., Gosling, Samuel D., Wang, Rui, Chen, Fanglin, Chen, Zhenyu, and Campbell, Andrew T. Patterns of behavior change in students over an academic term: A preliminary study of activity and sociability behaviors using smartphone sensing methods. Computers in Human Behavior, 67:129 – 138, 2017. ISSN 07475632. doi: https://doi.org/10.1016/j.chb.2016.10.027.
 Harvey et al. (2013) Harvey, Kieran F, Zhang, Xiaomeng, and Thomas, David M. The hippo pathway and human cancer. Nature reviews Cancer, 13(4):nrc3458, 2013.
 Kanehisa et al. (2017) Kanehisa, M., Furumichi, M., Tanabe, M., Sato, Y., and Morishima, K. KEGG: new perspectives on genomes, pathways, diseases and drugs. Nucleic Acids Res., 45(D1):D353–D361, Jan 2017.
 Kolda & Bader (2009) Kolda, Tamara G. and Bader, Brett W. Tensor decompositions and applications. SIAM REVIEW, 51(3):455–500, 2009.
 LázaroGredilla et al. (2016) LázaroGredilla, Miguel, Liu, Yi, Phoenix, D Scott, and George, Dileep. Hierarchical Compositional Feature Learning. arXiv preprint arXiv:1611.02252, 2016.
 Leenen et al. (1999) Leenen, Iwin, Van Mechelen, Iven, De Boeck, Paul, and Rosenberg, Seymour. Indclas: A threeway hierarchical classes model. Psychometrika, 64(1):9–24, Mar 1999. ISSN 18600980. doi: 10.1007/BF02294316.
 Manias & Street (2001) Manias, E. and Street, A. Nursedoctor interactions during critical care ward rounds. J Clin Nurs, 10(4):442–450, Jul 2001.
 Metzler & Miettinen (2015) Metzler, Saskia and Miettinen, Pauli. Clustering boolean tensors. Data Mining and Knowledge Discovery, 29(5):1343–1373, 2015.
 Miettinen (2011) Miettinen, Pauli. Boolean tensor factorizations. In Data Mining (ICDM), 2011 IEEE 11th International Conference on, pp. 447–456. IEEE, 2011.
 Miettinen et al. (2006) Miettinen, Pauli, Mielikäinen, Taneli, Gionis, Aristides, Das, Gautam, and Mannila, Heikki. The Discrete Basis Problem, pp. 335–346. Springer Berlin Heidelberg, 2006. ISBN 9783540460480. doi: 10.1007/11871637˙33.
 Park et al. (2017) Park, Namyong, Oh, Sejoon, and Kang, U. Fast and scalable distributed boolean tensor factorization, 04 2017.
 Pearl (1988) Pearl, Judea. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 1988. ISBN 0934613737.
 Polakis (2012) Polakis, Paul. Wnt signaling in cancer. Cold Spring Harbor perspectives in biology, 4(5):a008052, 2012.
 Ravanbakhsh et al. (2016) Ravanbakhsh, Siamak, Póczos, Barnabás, and Greiner, Russell. Boolean Matrix Factorization and Noisy Completion via Message Passing. In Proceedings of The 33rd International Conference on Machine Learning, volume 48 of JMLR: W&CP, 2016.
 Rukat et al. (2017) Rukat, Tammo, Holmes, Chris C., Titsias, Michalis K., and Yau, Christopher. Bayesian Boolean Matrix Factorisation. Proceedings of the 34th Annual International Conference on Machine Learning, pp. 2969–2978, jul 2017. ISSN 19387228.
 Su & Khoshgoftaar (2009) Su, Xiaoyuan and Khoshgoftaar, Taghi M. A Survey of Collaborative Filtering Techniques. Adv. in Artif. Intell., 2009:4:2—4:2, jan 2009. ISSN 16877470. doi: 10.1155/2009/421425.
 Vanhems et al. (2013) Vanhems, Philippe, Barrat, Alain, Cattuto, Ciro, Pinton, JeanFranâßois, Khanafer, Nagham, Râ©gis, Corinne, Kim, Byeula, Comte, Brigitte, and Voirin, Nicolas. Estimating potential infection transmission routes in hospital wards using wearable proximity sensors. PLoS ONE, 8(9):e73970, 09 2013. doi: 10.1371/journal.pone.0073970.
 Weinstein et al. (2013) Weinstein, John N, Collisson, Eric A, Mills, Gordon B, Shaw, Kenna R Mills, Ozenberger, Brad A, Ellrott, Kyle, Shmulevich, Ilya, Sander, Chris, Stuart, Joshua M, Network, Cancer Genome Atlas Research, et al. The cancer genome atlas pancancer analysis project. Nature genetics, 45(10):1113, 2013.
TensOrMachine – Supplementary Information
Tammo Rukat ^{0 }^{0 } Chris C. Holmes ^{0 } Christopher Yau ^{0 }
Here we derive the full conditional distribution for a factor entry as given in eq. (4) in the main paper. For constant priors, , the conditional is given by normalising the likelihood for . The likelihood has the form
(9) 
and is factorial in the datapoints . Terms that do not depend on cancel in the conditional and thus we take the product over all with fixed. Then normalising gives
(10)  
(11) 
The first equality describes all cases where the term inside the parenthesis in eq. (9) takes a value that is independent of the value of . The second term follows in all other cases and by expanding the logistic sigmoid. Hence, we can write eq. (11) as
(13) 

BLCA: Bladder Urothelial Carcinoma

BRCA: Breast invasive carcinoma

CESC: Cervical squamous cell carcinoma and endocervical adenocarcinoma

COAD: Colon adenocarcinoma

ESCA: Esophageal carcinoma

GBM: Glioblastoma multiforme

HNSC: Head and Neck squamous cell carcinoma

KIRC: Kidney renal clear cell carcinoma

KIRP: Kidney renal papillary cell carcinoma

LAML: Acute Myeloid Leukemia

LGG: Brain Lower Grade Glioma

LIHC: Liver hepatocellular carcinoma

LUAD: Lung adenocarcinoma

LUSC: Lung squamous cell carcinoma

OV: Ovarian serous cystadenocarcinoma

PAAD: Pancreatic adenocarcinoma

PCPG: Pheochromocytoma and Paraganglioma

PRAD: Prostate adenocarcinoma

SARC: Sarcoma

SKCM: Skin Cutaneous Melanoma

STAD: Stomach adenocarcinoma

TGCT: Testicular Germ Cell Tumours

THCA: Thyroid carcinoma
Fig. 6 shows patient representation for relative gene expression networks computed by dbtf. The analysis is limited to 350 genes, since our 32 core, 128 GB machine runs out of memory for larger analyses using the implementation provided by Park et al. [2017]. Data that was treated as missing in our original analysis is treated as 0 here. Comparing to Fig. 4(c) we observe similar but noisier diseasespecific patterns.
Fig. 7 shows the results of dbtf, corresponding to the analysis in Fig. 4(b) of the main paper. While this solution looks similar to a possible point estimate of the TensOrM analysis, dbtf lack the ability to characterise posterior uncertainty which is crucial in this example. For the hospital data, dbtf infers only 3 latent dimensions. These correspond to dimensions 2, 5, and 7 of the TensOrM analysis. Note, that we can not assess the predictive performance since dbtf is unable to deal with heldout data