# PairClone: A Bayesian Subclone Caller Based on Mutation Pairs

## Abstract

Tumor cell populations can be thought of as being composed of homogeneous cell subpopulations, with each subpopulation being characterized by overlapping sets of single nucleotide variants (SNVs). Such subpopulations are known as subclones and are an important target for precision medicine. Reconstructing such subclones from next-generation sequencing (NGS) data is one of the major challenges in precision medicine. We present PairClone as a new tool to implement this reconstruction. The main idea of PairClone is to model short reads mapped to pairs of proximal SNVs. In contrast, most existing methods use only marginal reads for unpaired SNVs. Using Bayesian nonparametric models, we estimate posterior probabilities of the number, genotypes and population frequencies of subclones in one or more tumor sample. We use the categorical Indian buffet process (cIBP) as a prior probability model for subclones that are represented as vectors of categorical matrices that record the corresponding sets of mutation pairs. Performance of PairClone is assessed using simulated and real datasets. An open source software package can be obtained at http://www.compgenome.org/pairclone.

[sub]font=normalsize

Keywords: Categorical Indian buffet process; Latent feature model; Local haplotype; Next-generation sequencing; Random categorical matrices; Subclone; Tumor heterogeneity.

## 1 Introduction

We explain intra-tumor heterogeneity by representing tumor cell populations as a mixture of subclones. We reconstruct unobserved subclones by utilizing information from pairs of proximal mutations that are obtained from next-generation sequencing (NGS) data. We exploit the fact that some short reads in NGS data cover pairs of phased mutations that reside on two sufficiently proximal loci. Therefore haplotypes of the mutation pairs can be observed and used for subclonal inference.

We develop a suitable sampling model that represents the paired nature of the data, and construct a nonparametric Bayesian feature allocation model as a prior for the hypothetical subclones. Both models together allow us to develop a fully probabilistic description of the composition of the tumor as a mixture of homogeneous underlying subclones, including the genotypes and number of such subclones.

### 1.1 Background

NGS technology (Mardis, 2008) has enabled researchers to develop bioinformatics tools that are being used to understand the landscape of tumors within and across different samples. An important related task is to reconstruct cellular subpopulations in one or more tumor samples, known as subclones. Mixtures of such subclones with varying population frequencies across spatial locations in the same tumor, across tumors from different time points, or across tumors from the primary and metastatic sites can provide information about the mechanisms of tumor evolution and metastasis. Heterogeneity of cell populations is seen, for example, in varying frequencies of distinct somatic mutations. The hypothetical tumor subclones are homogeneous. That is, a subclone is characterized by unique genomic variants in its genome (Marjanovic et al., 2013; Almendro et al., 2013; Polyak, 2011; Stingl and Caldas, 2007; Shackleton et al., 2009; Dexter et al., 1978). Such subclones arise as the result of cellular evolution, which can be described by a phylogenetic tree that records how a sequence of somatic mutations gives rise to different cell subpopulations. Figure 4 provides a stylized and simple illustration in which a homogeneous sample with one original normal clone evolves into a heterogeneous sample with three subclones. Subclone 1 is the original parent cell population, and subclones 2 and 3 are descendant subclones of subclone 1, each possessing somatic mutations marked by the red letters. Each subclone possesses two homologous chromosomes (in black and green), and each chromosome in Figure 4 is marked by a triplet of letters representing the nucleotide on the three genomic loci. Together, the three subclones include four different haplotypes, (A, G, C), (A, G, T), (C, G, C), and (A, A, T), at these three genomic loci. In addition, each subclone has a different population frequency shown as the percentage values in Figure 4.

We use NGS data to infer such tumor heterogeneity. In an NGS experiment, DNA fragments are first produced by extracting the DNA molecules from the cells in a tumor sample. The fragments are then sequenced using short reads. For the three subclones in Figure 4, there are four aforementioned haplotypes at the three loci. Consequently, short reads that cover some of these three loci may manifest different alleles. For example, if a large number of reads cover the first two loci, we might observe (A, G), (C, G), (A, T) and (C, T), four alleles for the mutation pair. Observing four alleles is direct evidence supporting the presence of subclones (Sengupta et al., 2015). This is because, in the absence of copy number variations there can be only two haploid genomes at any loci for a homogeneous human sample. Therefore, one can use mutation pairs in copy neutral regions to develop statistical inference on the presence and frequency of subclones. This is the goal of our paper.

Almost all mutation-based subclone-calling methods in the literature use only single nucleotide variants (SNVs) (Oesper et al., 2013; Strino et al., 2013; Jiao et al., 2014; Miller et al., 2014; Roth et al., 2014; Zare et al., 2014; Deshwar et al., 2015; Sengupta et al., 2015; Lee et al., 2015, 2016). Instead of examining mutation pairs, SNV-based methods use marginal counts for each recorded locus only. Consider, for example, the first locus in Figure 4. At this locus, the reference genome has an “A” nucleotide while subclones 2 and 3 have a “C” nucleotide. In the entire sample, the “C” nucleotide is roughly present in 17.5% of the DNA molecules based on the population frequencies illustrated in Figure 4. The percentage of a mutated allele is called variant allele fraction (VAF). If a sample is homogeneous and assuming no copy number variations at the locus, the population frequency for the “C” nucleotide should be close to 0, 50%, or 100%, depending on the heterozygosity of the locus. Therefore, if the population frequency of “C” deviates from 0%, 50%, or 100%, the sample is likely to be heterogeneous. Based on this argument, SNV-based subclone callers search for SNVs with VAFs that are different from these frequencies (0, 50%, 100%), which are evidence for the presence of different (homogeneous) subpopulations. In the event of copy number variations, a similar but slightly more sophisticated reasoning can be applied, see for example, Lee et al. (2016).

### 1.2 Using mutation pairs

NGS data usually contain substantially fewer mutation pairs than marginal SNVs. However, this does not weaken the power of using mutation pairs as mutation pairs naturally carry important phasing information that improves the accuracy of subclone reconstruction. For example, imagine a tumor sample that is a mixture of subclones 2 and 3 in Figure 4. Suppose a sufficient amount of short reads cover the first two loci, we should observe relatively large reads counts for four alleles (C, G), (A, T), (C, T) and (A, G). One can then reliably infer that there are heterogeneous cell subpopulations in the tumor sample. In contrast, if we ignore the phasing information and only consider the (marginal) VAFs for each SNV, then the observed VAFs for both SNVs are 50%, which could be heterogeneous mutations from a single cell population. See Simulation 1 for an illustration. In summary, we leverage the power of using mutation pairs over marginal SNVs by incorporating partial phasing information in our model. Besides the simulation study we will later also empirically confirm these considerations in actual data analysis.

The relative advantage of using mutation pairs over marginal SNV’s can be also be understood as a special case of a more general theme. In biomedical data it is often important to avoid overinterpretation of noisy data and to distill a relatively weak signal. A typical example is the probability of expression (POE) model of Parmigiani et al. (2002). Similarly, the modeling of mutation pairs is a way to extract the pertinent information from the massive noisy data. Due to noise and artifact in NGS data, such as base-calling or mapping error, many called SNVs might record unusual population frequencies, for reasons unrelated to the presence of subclones (Li, 2014). Direct modeling of all marginal read counts one ends up with noise swamping the desired signal (Nik-Zainal et al., 2012; Jiao et al., 2014). See our analysis of a real data set in Section 6 for an example. To mitigate this challenge, most methods use clustering of the VAFs, including, for example, Roth et al. (2014). One would then use the resulting cluster centers to infer subclones, which is one way of extracting more concise information. In addition, the vast majority of the methods in the literature show that even though a tumor sample could possess thousands to millions of SNVs, the number of inferred subclones usually is in the low single digit, no more than 10. To this end, we propose instead an alternative approach to extract useful information by modeling (fewer) mutation pairs, as mutation pairs contain more information and are of higher quality. We show in our numerical examples later that with a few dozens of these mutation pairs, the inference on the subclones is strikingly similar to cluster-based subclone callers using much more SNVs.

Finally, using mutation pairs does not exclude the possibility of making use of marginal SNVs. In Section 5.1, we show it is straightforward to jointly model mutation pairs and SNVs. Other biological complexities, such as tumor purity and copy number variations, can also be incorporated in our model. See Sections 5.2 and 5.3 for more details.

### 1.3 Representation of subclones

We construct a categorical valued matrix (Figure 4) to represent the subclone structure. Rows of are indexed by and represent mutation pairs, and a column of , denoted by , records the phased mutation pairs on the two homologous chromosomes of subclone , . As in Figure 4, let index the two homologous chromosomes, index the two mutation loci, be the genotype consisting of two alleles for mutation pair in subclone , and denote the allele of the -th homologous chromosome. Therefore, each entry of the matrix is a binary submatrix itself. For example, in Figure 4 the entry is a pair of 2-dimension binary row vectors, and , representing the genotypes for both alleles at mutation pair of subclone ; each vector indicates the allele for the mutation pair on a homologous chromosome. The first vector indicates that locus harbors no mutation (0) and locus harbors a mutation (1). Similarly, the second vector marks two mutations on both loci.

In summary, each entry of ,

is a matrix (with the two row vectors horizontally displayed for convenience). Each is a binary indicator and (or ) indicates a mutation (or reference). Thus, can take possible values. That is, , where we write short for etc., and refers to the genotype on the reference genome. Formally, is a binary matrix, and is a matrix of such binary matrices. Moreover, we can collapse some values as we do not have phasing across mutation pairs. For example, and , etc. have mirrored rows and are indistinguishable in defining a subclone (a column of ). (More details in Section 2.2). Typically distinct mutation pairs are distant from each other, and in NGS data they are almost never phased. Therefore, we can reduce the number of possible outcomes of to , due to the mirrored outcomes. We list them below for later reference: and . In summary, the entire matrix fully specifies the genomes of each subclone at all the mutation pairs.

Suppose tumor samples are available from the same patient, obtained either at different time points (such as initial diagnosis and relapses), at the same time but from different spatial locations within the same tumor, or from tumors at different metastatic sites. We assume those samples share the same subclones, while the subclonal population frequencies may vary across samples. For clinical decisions it can be important to know the population frequencies of the subclones. To facilitate such inference, we introduce a matrix to represent the population frequencies of subclones. The element refers to the proportion of subclone in sample , where for all and , and . A background subclone, which has no biological meaning and is indexed by , is included to account for artifacts and experimental noise. We will discuss more about this later.

The remainder of this article is organized as follows. In Sections 2 and 3, we propose a Bayesian feature allocation model and the corresponding posterior inference scheme to estimate the latent subclone structure. In Section 4, we evaluate the model with three simulation studies. Section 5 extends the models to accommodate other biological complexities and present additional simulation results. Section 6 reports the analysis results for a lung cancer patient with multiple tumor biopsies. We conclude with a final discussion in Section 7.

## 2 The PairClone Model

### 2.1 Sampling Model

Suppose paired-end short reads data are obtained by deep DNA sequencing of multiple tumor samples. In such data, a short read is obtained by sequencing two ends of the same DNA fragment. Usually a DNA fragment is much longer than a short read, and the two ends do not overlap and must be mapped separately. However, since the paired-end reads are from the same DNA fragment, they are naturally phased and can be used for inference of alleles and subclones. We use LocHap (Sengupta et al., 2015) to find pairs of mutations that are no more than a fixed number, say 500, base pairs apart. Such mutation pairs can be mapped by paired-end reads, making them eligible for PairClone analysis. See Figure 4 for an example. For each mutation pair, a number of short reads are mapped to at least one of the two loci. Denote the two sequences on short read mapped to mutation pair in tissue sample by , where index the two loci, indicates that the short read sequence is a reference or mutation. Theoretically, each can take four values, A, C, G, T, the four nucleotide sequences. However, at a single locus, the probability of observing more than two sequences across short reads is negligible since it would require the same locus to be mutated twice throughout the life span of the person or tumor, which is unlikely. We therefore code as a binary value. Also, sometimes a short read may cover only one of the two loci in a pair, and we use to represent a missing base when there is no overlap between a short read and the corresponding SNV. Therefore, . For example, in Figure 4 locus , for read , for read , and for read . Reads that are not mapped to either locus are excluded from analysis since they do not provide any information for subclones. Altogether, can take possible values, and its sample space is denoted by . Each value corresponds to an allele of two loci, with being a special “missing” coverage. For mutation pair in sample , the number of short reads bearing allele is denoted by , where is the indicator function, and the total number of reads mapped to the mutation pair is then . Finally, depending upon whether a read covers both loci or only one locus we distinguish three cases: (i) a read maps to both loci (complete), taking values ; (ii) a read maps to the second locus only (left missing), ; and (iii) a read maps to the first locus only (right missing), . We assume a multinomial sampling model for the observed read counts

(1) |

Here are the probabilities for the 8 possible values of . For the upcoming discussion, we separate out the probabilities for the three missingness cases. Let denote the probabilities of observing a short read satisfying cases (i), (ii) and (iii), respectively. We write , , and . Here are the probabilities conditional on case (i), (ii) or (iii). That is, . We still use a single running index, , to match the notation in . Below we link the multinomial sampling model with the underlying subclone structure by expressing in terms of and . Regarding we assume non-informative missingness and therefore do not proceed with inference on them (and ’s remain constant factors in the likelihood).

### 2.2 Prior Model

**Construction of
** The construction of a prior model for is based on
the following generative model.
To generate a short read, we first select a subclone from which the read
arises, using the population frequencies for sample
. Next we select with probability one of the two
DNA strands,
.
Finally, we record the read , or , corresponding to the chosen allele
.
In the case of left (or right) missing locus we observe ,
or (or or ), corresponding to the observed locus of the chosen allele.
Reflecting these three generative steps, we denote the probability of
observing a short read that bears sequence by

(2) |

with the understanding that for missing reads. Implicit in (2) is the restriction , depending on the arguments.

Finally, using the definition of we model the probability of observing a short read as

(3) |

In (3) we include to model a background subclone denoted by with population frequency . The background subclone does not exist and has no biological interpretation. It is only used as a mathematical device to account for noise and artifacts in the NGS data (sequencing errors, mapping errors, etc.). The weights are the conditional probabilities of observing a short read harboring allele if the recorded read were due to experimental noise. Note that .

**Prior for
** We assume a geometric distribution prior on , , to describe the random number of subclones (columns
of ), . A priori .

**Prior for
** We use the finite version of the categorical Indian buffet process
(cIBP) (Sengupta
et al., 2013) as the prior for the latent
categorical matrix . The cIBP is a categorical extension of the
Indian buffet process (Griffiths and
Ghahramani, 2011) and defines feature
allocation (Broderick
et al., 2013) for categorical matrices.
In our application, the mutation pairs are the objects, and the
subclones are the latent features chosen by the objects. The number of
subclones is random, with the geometric prior .
Conditional on , we now introduce for
each column of vector , where , and . Recall that are
the possible genotypes for the mutation pairs defined in Section 1.3,
, for possible genotypes.

As prior model for , we use a Beta-Dirichlet distribution (Kim et al., 2012). Let , . Conditional on , follows a beta distribution, and follows a Dirichlet distribution. Here corresponds to the situation that subclone is not chosen by mutation pair , because refers to the reference genome. We write

This construction includes a positive probability for all-zero columns . In our application, refers to normal cells with no somatic mutations, which could be included in the cell subpopulations.

In the definition of the cIBP prior, we would have one more step of dropping all zero columns. This leaves a categorical matrix with at most columns. As shown in Sengupta et al. (2013), the marginal limiting distribution of follows the cIBP as .

**Prior for
** We assume follows a Dirichlet prior,

for . We set to reflect the nature of as a background noise and model mis-specification term.

**Prior for
** We complete the model with a prior for .
Recall is the
conditional probability of observing a short read with allele
due to experimental noise. We consider complete read, left
missing read and right missing read separately, and assume

where , and .

## 3 Posterior Inference

Let denote the unknown parameters except , where , , , and . We use Markov chain Monte Carlo (MCMC) simulations to generate samples from the posterior , . With fixed such MCMC simulation is straightforward. See, for example, Brooks et al. (2011) for a review of MCMC. Gibbs sampling transition probabilities are used to update and , and Metropolis-Hastings transition probabilities are used to update and . Since is expected to be highly multi-modal, we use additional parallel tempering to improve mixing of the Markov chain. Details of MCMC simulation and parallel tempering are described in Appendix A.1.

**Updating
** Updating the value of is more difficult as it involves
trans-dimensional MCMC (Green, 1995).
At each iteration, we propose a new value by
generating from a proposal distribution .
In the later examples we assume that is a priori restricted to
, and use
a uniform proposal
.

Next, we split the data into a training set and a test set with and , respectively, for . Denote by the posterior of conditional on evaluated on the training set only. We use in two instances. First, we replace the original prior by , and second, we use as a proposal distribution for , as . Finally, we evaluate the acceptance probability of on the test data by

(4) |

The use of the prior is similar to the construction of the fractional Bayes factor (FBF) (O’Hagan, 1995) which uses a fraction of the data to define an informative prior that allows the evaluation of Bayes factors. In contrast, here is used as an informative proposal distribution for . Without the use of a training sample it would be difficult to generate proposals with reasonable acceptance rate. In other words, we use to achieve a better mixing Markov chain Monte Carlo simulation. The use of the same to replace the original prior avoids the otherwise prohibitive evaluation of in the acceptance probability (4). See more details in Appendix A.2 and A.5.

**Point estimates for parameters** We use the posterior mode as a point estimate of
. Conditional on , we follow Lee
et al. (2015) to find a
point estimate of . For any two matrices
and , a distance between the -th column of and the
-th column of is defined by , where , and we take the vectorized form of and to compute distance between them. Then, we define
the distance between and as , where is
a permutation of , and the minimum is taken over
all possible permutations.
This addresses the potential label-switching
issue across the columns of . Let be a set of posterior Monte Carlo samples of . A posterior
point estimate for , denoted by , is reported as
, where

Based on , we report posterior point estimates of and , given by and , respectively.

## 4 Simulation

We evaluate the proposed model with three simulation studies. In the first simulation we use single sample data (), since in most current applications only a single sample is available for analysis. Inferring subclonal structure accurately under only one sample is a major challenge, and not completely resolved in the current literature. The single sample does not rule out meaningful inference, as the relevant sample size is the number of SNVs or mutation pairs, or the (even larger) number of reads. In the second and third simulations we consider multi-sample data, similar to the lung cancer data that we analyze later. In all simulations, we assume the missing probabilities and to be 30% or 35%. Recall that these probabilities represent the probabilities that a short read will only cover one of the two loci in the mutation pair.

Details of the three simulation studies are reported in Appendix A.3. We briefly summarize the results here. In the first simulation, we illustrate the advantage of using mutation pair data over marginal SNV counts. We generate hypothetical short reads data for sample and mutation pairs, using a simulation truth with . Figure 11(a, d) summarizes the simulation results. See Appendix Figures A.1 and A.2 for more summaries, including a comparison with results under methods based on marginal read counts only.

In the second simulation, we consider data with mutation pairs and a more complicated subclonal structure with latent subclones and samples. Inference summaries are show in Figure 11(b, e). Again, more details of the simulation study, including inference on the weights and a comparison with inference using marginal cell counts only, are shown in the appendix.

Finally, in a third simulation we use samples with and latent subclones. Some results are summarized in Figure 11(c, f), and, again, more details are shown in the appendix.

In all simulations, panels (a, b, c) vs. (d, e, f) in Figure 11 show that posterior estimated is close to the true .

## 5 PairClone Extensions

### 5.1 Incorporating Marginal Read Counts

Most somatic mutations are not part of the paired reads that we use in PairClone. We refer to these single mutations as SNVs (single nucleotide variants) and consider the following simple extension to incorporate marginal counts for SNVs in PairClone. We introduce a new matrix to represent the genotype of the subclones for these additional SNVs. To avoid confusion, we denote the earlier subclone matrix by in this section. The element of reports the genotype of SNV in subclone , with denoting homozygous wild-type (), heterozygous variant (), and homozygous variant (), respectively. The -th column of and together define subclone . We continue to assume copy number neutrality in all SNVs and mutation pairs (we discuss an extension to incorporating subclonal copy number variations in the next subsection). The marginal read counts are easiest incorporated in the PairClone model by recording them as right (or left) missing reads (as described in Section 2.1) for hypothetical pairs, . Let and denote the total count and the number of reads bearing a variant allele, respectively, for SNV in sample . Treating as a mutation pair with missing second read, we record , and . We then proceed as before, now with mutation pairs. Inference reports an augmented subclone matrix . We record the first rows of as , and transform the remaining rows to by only recording the genotypes of the observed loci.

We evaluate the proposed modeling approach with a simulation study. The simulation setting is the same as simulation 3 in Section 4, except that we discard the phasing information of mutation pairs and only record their marginal read counts. Figure 18(a)–(f) summarizes the simulation results. Panels (a, b) show the simulation truth for the mutation pairs and SNVs, respectively. Panel (c) shows the posterior and panels (d, e) show the estimated genotypes and . Inference for the weights recovers the simulation truth (not shown). The result compares favorably to inference under BayClone (Figure A.6 in the appendix), due to the additional phasing information for the first 50 mutation pairs.

For a direct evaluation of the information in the additional marginal counts we also evaluate posterior inference with only the first 50 mutation pairs, shown in Figure 18 (c, f). Comparison with Figure 18 (c, d) shows that the additional marginal counts do not noticeably improve inference on tumor heterogeneity.

### 5.2 Incorporating Tumor Purity

Usually, tumor samples are not pure in the sense that they contain certain proportions of normal cells. Tumor purity refers to the fraction of tumor cells in a tumor sample. To explicitly model tumor purity, we introduce a normal subclone, the proportion of which in sample is denoted by , . The normal subclone does not possess any mutation (since we only consider somatic mutations). The tumor purity for sample is thus . The normal subclone is denoted by , with for all . The remaining subclones are still denoted by , , with proportion in sample , and .

The probability model needs to be slightly modified to accommodate the normal subclone. The sampling model remains unchanged as (1). Same for the prior models for , and . We only change the construction of and as follows. With a new normal subclone, the probability of observing a short read becomes based on the same generative model described in Section 2.2. Let . We use a Beta-Dirichlet prior, , and , , , . An informative prior for could be based on an estimate from a purity caller, for example, Van Loo et al. (2010) or Carter et al. (2012).

We evaluate the modified model with a simulation study. The simulation setting is the same as simulation 3 in Section 4, except that we substitute the first subclone with a normal subclone. Posterior inference (not shown) recovers the simulation truth, with posterior mode . Inference on almost perfectly recovers the simulation truth shown in Figure 11(c) (with the first column replaced by an all normal “subclone”). Similarly for . See Appendix A.4 for details.

### 5.3 Incorporating Copy Number Changes

Tumor cells not only harbor sequence mutations such as SNVs and mutation pairs, they often undergo copy number changes and produce copy number variants (CNVs). Genomic regions with CNVs have copy number . We briefly outline an extension of PairClone that includes CNVs in the inference. In addition to which describes sequence variation we introduce a matrix to represent subclonal copy number variation with reporting the copy number for mutation pair in subclone . We use to augment the sampling model to include the total read count . Earlier in (1), the multinomial sample size was considered fixed. We now add a sampling model. Following Lee et al. (2016) we assume

Here, is the expected number of reads in sample under copy-neutral conditions, and is a weighted average copy number across subclones,

The last term accounts for noise and artifacts, where and are the population frequency and copy number of the background subclone, respectively. We assume no CNVs for the background subclone, that is, for all . We complete the model with a prior . Assuming , i.e., a maximum copy number , we use another instance of a finite cIBP. For each column of , we introduce and assume , again with a Beta-Dirichlet prior for .

Recall the construction of in (3), including in particular the generative model. This generative model is now updated to include the varying . To generate a short read for mutation pair , we first select a subclone from which the read arises, using the population frequencies for sample . Next we select with probability one of the four possible alleles, , or , where we now use to denote numbers of alleles having genotypes , , or , and . In the case of left (or right) missing locus we observe , or (or or ), corresponding to the observed locus of the chosen allele, similar to before. In summary, the probability of observing a short read can be written as

where corresponds to the described generative model.

## 6 Lung Cancer Data

### 6.1 Using PairClone

We apply PairClone to analyze whole-exome in-house data. Whole-exome sequencing data is generated from four () surgically dissected tumor samples taken from a single patient diagnosed with lung adenocarcinoma. The resected tumor is divided into two portions. One portion is flash frozen and another portion is formalin fixed and paraffin embedded (FFPE). Four different samples (two from each portion) are taken. DNA is extracted from all four samples. Agilent SureSelect v5+UTR probe kit (targeting coding regions plus UTRs) is used for exome capture. The exome library is sequenced in paired-end fashion on an Illumina HiSeq 2000 platform. About 60 million reads are obtained in FASTQ file format, each of which is 100 bases long. We map paired-end reads to the human genome (version HG19) (Church et al., 2011) using BWA (Li and Durbin, 2009) to generate BAM files for each individual sample. After mapping the mean coverage of the samples is around fold. We call variants using UnifiedGenotyper from GATK toolchain (McKenna et al., 2010) and generate a single VCF file for all of them. A total of nearly SNVs and small indels are called within the exome coordinates.

Next, using LocHap (Sengupta et al., 2015) we find mutation pair positions, the number of alleles and number of reads mapped to them. LocHap searches for multiple SNVs that are scaffolded by the same pair-end reads, that is, they can be recorded on one paired end read. We refer to such sets of multiple SNV’s as local haplotypes (LH). When more than two genotypes are exhibited by an LH, it is called a LH variant (LHV). Using individual BAM files and the combined VCF file, LocHap generates four individual output file in HCF format (Sengupta et al., 2015). An HCF file contains LHV segments with two or three SNV positions. In this analysis, we are only interested in mutation pair, and therefore filter out all the LHV segments consisting of more than two SNV locations. We restrict our analysis to copy number neutral regions. To further improve data quality, we drop all LHVs where two SNVs are very close to each other (within, say, bps) or close to any type of structural variants such as indels. We also remove those LHVs where either of the SNVs is mapped with strand bias by most reads, or either of the SNVs is mapped towards the end of the most aligned reads. Finally, we only consider mutation pairs that have strong evidence of heterogeneity. Since LHVs exhibit genotypes in the short reads, by definition they are somatic mutations.

At the end of this process, mutation pairs are left and we record the read data from HCF files for the analysis. In addition, in the hope of utilizing more information from the data, we randomly choose 69 un-paired SNVs and include them in the analysis. Since in practice, tumor samples often include contamination with normal cells, we incorporate inference for tumor purity as described in Section 5.2. We run MCMC simulation for iterations, discarding the first iterations as initial burn-in and keeping every 10th MCMC sample. We set the hyperparameter exactly as in the simulation study of Section 5.2.

**Results** The posterior distribution (shown in Appendix Figure A.14) reports
, , and for , , and , respectively, and then quickly drops below , with posterior mode .
This means, excluding the effect of normal cell contamination, the tumor samples have two subclones. Figure 22(a, b) show the estimated subclone matrix and corresponding to mutation pairs and SNVs, respectively.
The first column of and represents the normal
subclone. The rows for both matrices are reordered for a better
display. Figure 22(c)
shows the estimated subclone proportions for the four samples.
The second column of represents the proportions of normal
subclones in the four samples. The small values indicate high purity
of the tumor samples.
The similar proportions across the four
samples reflect the spatial proximity of the samples.
Furthermore, excluding a few exceptions that might be due to model
mis-fitting, the subclones form a simple phylogenetic tree:
. Subclones 1 and 2 share a large portion
of common mutations, while subclone 2 has some private
mutations that are missing in subclone 1.

For informal model checking we inspect a histogram of realized residuals (Appendix Figure A.14). To define residuals, we calculate estimated multinomial probabilities according to , and empirical values of . Let . The figure plots the residuals . The resulting histogram of residuals is centered around zero with little mass beyond , indicating a good model fit.

### 6.2 Using SNVs only

For comparison, we also run BayClone and PyClone on the same dataset. Using the log pseudo marginal likelihood (LPML), BayClone reports subclones. The estimated subclone matrix in BayClone’s format is shown in Figure 27(a), with the rows reordered in the same way as in Figure 22(a, b). In light of the earlier simulation results we believe that the inference under PairClone is more reliable. Figure 27(b) shows the estimated subclone proportions under BayClone. Figure 27(c) shows the estimated clustering of the SNV loci under PyClone (the color coding along the axes). PyClone identifies 6 different clusters. The largest cluster (shown in brown) corresponds to loci that have heterozygous variants in both subclones 1 and 2, the second-largest cluster (shown in blueish green) corresponds to loci that have homozygous wild types in subclone 1 and homozygous variants in subclone 2, and the other smaller clusters represent other less common combinations. The clusters match with clustering of rows of and . PyClone does not immediately give inference on subclones, but combing clusters with similar cellular prevalence across samples one is able to conjecture subclones. In this sense, PyClone gives similar result compared with PairClone. Finally, Figure 27(d) displays PyClone’s estimated cellular prevalences of clusters across different samples. The estimated subclone proportions and cellular prevalences across the four samples remain very similar also under the BayClone and PyClone output, which strengthens our inference that the four samples possess the same subclonal profile, each with two subclones.

For another comparison, we run PyClone with a much larger number of SNVs (, which include the 69 pairs and 69 SNVs we ran analysis before) to evaluate the information gain by using additional marginal counts. The results are summarized in Figure 30, with panel (a) showing the estimated clustering of the 1800 SNVs. PyClone reports 34 clusters. The two largest clusters (olive and green clusters) in Panel (a) match with the two largest clusters (brown and bluish green clusters) in Figure 27(c) and also corroborate the two subclones inferred by PairClone. In addition, PyClone infers lots of noisy tiny clusters using 1800 SNVs, which we argue model only noise. In summary, this comparison shows the additional marginal counts do not noticeably improve inference on tumor heterogeneity, and modeling mutation pairs is a reasonable way to extract useful information from the data.

## 7 Conclusions

We can significantly enrich our understanding of cancer development by using high throughput NGS data to infer co-existence of subpopulations which are genetically different across tumors and within a single tumor (inter and intra tumor heterogeneity, respectively). In this paper, we have presented a novel feature allocation model for reconstructing such subclonal structure using mutation pair data. Proposed inference explicitly models overlapping mutation pairs. We have shown that more accurate inference can be obtained using mutation pairs data compared to using only marginal counts for single SNVs. Short reads mapped to mutation pairs can provide direct evidence for heterogeneity in the tumor samples. In this way the proposed approach is more reliable than methods for subclonal reconstruction that rely on marginal variant allele fractions only.

The proposed model is easily extended for data where an LH segment consists of more than two SNVs. We can easily accommodate -tuples instead of pairs of SNVs by increasing the number of categorical values () that the entries in the matrix can take. There are several more interesting directions of extending the current model. For example, one could account for the potential phylogenetic relationship among subclones (i.e the columns in the matrix). Such extensions would enable one to infer mutational timing and allow the reconstruction of tumor evolutionary histories.

Lastly, we focus on statistical inference using bulk sequencing data on tumor samples. Alternatively, biologists can apply single-cell sequencing on each tumor cell and study its genome one by one. This is a gold standard that can examine tumor heterogeneity at the single-cell level. However, single-cell sequencing is still expensive and cannot scale up. Also, many bioinformatics and statistical challenges are unmet in analyzing single-cell sequencing data.

## Appendix A

### a.1 MCMC Implementation Details

We first introduce as an unscaled abundance level of subclone in sample . Assume and . Let , then . We make inference on instead of as the value of is not restricted in a -simplex. Similarly, we introduce as an unscaled version of . We let and for , and for , and and for .

Conditional on , the posterior distribution for the other parameters is given by

(5) |

where counts the number of mutation pairs in subclone having genotype .

**Updating .** We update by
sampling each from:

**Updating .** The posterior distribution for is

For each , we update by sampling from

and transforming by .

**Updating .** We update each sequentially. For ,

A Metropolis-Hastings transition probability is used to update . At each iteration, we propose a new (on the log scale) by , and evaluate the acceptance probability by . The term takes into account the Jacobian of the log transformation. For , the only difference is to substitute with .

**Updating .** We update each sequentially. For ,

A Metropolis-Hastings transition probability is used to update . At each iteration, we propose a new (on the log scale) by , and evaluate the acceptance probability by . The term takes into account the Jacobian of the log transformation. For , the only difference is to substitute with .

**Parallel tempering.** Parallel tempering (PT) is a MCMC technique first proposed by Geyer (1991). A good review can be found in Liu (2008).
PT is suitable for sampling from a multi-modal state space. It helps
the MCMC chain to move freely among local modes which is desired in
our application, and to create a better mixing Markov chain.