Matrix Completion and Performance Guarantees for Single Individual Haplotyping

Matrix Completion and Performance Guarantees for Single Individual Haplotyping

Somsubhra Barik and Haris Vikalo,  S. Barik and H. Vikalo are with the Department of Electrical and Computer Engineering, The University of Texas at Austin, Austin, TX, 78712 USA e-mail: sbarik@utexas.edu.
Abstract

Single individual haplotyping is an NP-hard problem that emerges when attempting to reconstruct an organism’s inherited genetic variations using data typically generated by high-throughput DNA sequencing platforms. Genomes of diploid organisms, including humans, are organized into homologous pairs of chromosomes that differ from each other in a relatively small number of variant positions. Haplotypes are ordered sequences of the nucleotides in the variant positions of the chromosomes in a homologous pair; for diploids, haplotypes associated with a pair of chromosomes may be conveniently represented by means of complementary binary sequences. In this paper, we consider a binary matrix factorization formulation of the single individual haplotyping problem and efficiently solve it by means of alternating minimization. We analyze the convergence properties of the alternating minimization algorithm and establish theoretical bounds for the achievable haplotype reconstruction error. The proposed technique is shown to outperform existing methods when applied to synthetic as well as real-world Fosmid-based HapMap NA12878 datasets.

matrix completion, single individual haplotyping, chromosomes, sparsity, alternating minimization.

I Introduction

DNA of diploid organisms, including humans, is organized into pairs of homologous chromosomes. The two chromosomes in a pair differ from each other due to point mutations, i.e., they contain so-called single nucleotide polymorphisms (SNPs) in a fraction of locations. SNPs are relatively rare; for humans, the SNP rate between two homologous chromosomes is roughly in base-pairs [1]. The ordered sequence of SNPs located on a chromosome in a homologous pair is referred to as a haplotype. Haplotype information is of critical importance for personalized medical applications, including the discovery of an individual’s susceptibility to diseases [2], whole genome association studies [3], gene detection under positive selection and discovery of recombination patterns [4]. High-throughput DNA sequencing platforms rely on so-called shotgun sequencing strategy to randomly oversample the pairs of chromosomes and generate a library of overlapping paired-end reads (fragments). Parts of the reads that do not cover variant positions are typically discarded; the remaining data is conveniently organized in a read-fragment matrix where the rows correspond to reads and columns correspond to SNPs. Since the SNPs are relatively rare and reads are relatively short, the read-fragment matrix is typically very sparse. If the reads were free of sequencing errors, haplotype assembly would be straightforward and would require partitioning the reads into two clusters, one for each chromosome in a pair. However, presence of sequencing errors (of the order to ) gives rise to ambiguities about the origin of reads and renders the single individual haplotyping (SIH) problem computationally very challenging.

Fig. 1: An illustration of the single individual haplotyping problem (often also referred to as haplotype assembly). The nucleotides marked in red denote the SNPs on a pair of chromosomes. During sequencing, paired-end reads (i.e., DNA fragments) containing the SNPs are generated from multiple copies of the chromosomal sequences; one can think of each read as being obtained by sampling (with replacement) one of the chromosomes. We assume that the relative ordering of reads can be determined by mapping them to a reference genome. The goal of single individual haplotyping is to determine the order of SNPs associated with each chromosome in the pair.

Approaches to SIH attempt to perform optimization of various criteria including minimum fragment removal, minimum SNP removal and most widely used minimum error correction (MEC) objectives [5]. Finding the optimal MEC solution to the SIH problem is known to be NP-hard [5, 6]. A branch-and-bound approach in [7] solves the problem optimally but the complexity of the scheme grows exponentially with the haplotype length. Similar approach was adopted in [8] where statistical information about sequencing errors was exploited to solve the MEC problem using sphere decoding. However, the complexity of this scheme grows exponentially with haplotype length and quickly becomes prohibitive. Suboptimal yet efficient methods for SIH include greedy approach [9], max-cut based solution [10], Bayesian methods based on MCMC [11], greedy cut based [12] and flow-graph based approaches [13]. More recent heuristic haplotype assembly approaches include a convex optimization program for minimizing the MEC score [14], a communication-theoretic approach solved using belief propagation [15], dynamic programming based approach using graphical models [16] and probabilistic mixture model based approach [17]. Generally, these heuristic methods come without performance guarantees.

Motivated by the recent developments in the research on matrix completion (overviewed in Section II-A), in this paper we formulate SIH as a rank-one matrix completion problem and propose a binary-constrained variant of alternating minimization algorithm to solve it. We analyze the performance and convergence properties of the proposed algorithm, and provide theoretical guarantees for haplotype reconstruction expressed in the form of an upper bound on the MEC score. Furthermore, we determine the sample complexity (essentially, sequencing coverage) that is sufficient for the algorithm to converge. Experiments performed on both synthetic and HapMap sample NA12878 datasets demonstrate the superiority of the proposed framework over competing methods. Note that a matrix factorization framework was previously leveraged to solve SIH via gradient descent in [18]; however, [18] does not provide theoretical performance guarantees that are established for the alternating minimization algorithm proposed in the current manuscript. An early preliminary version of our work was presented in [19].

I-a Notation

Matrices are represented by uppercase bold letters and vectors by lowercase bold letters. For a matrix , and represent its row and column, respectively. denotes the entry of matrix and denotes the entry of vector . , and represent respectively the transpose, the spectral norm (or 2-norm), the Frobenius norm and entry-wise norm (i.e., ) of the matrix , whereas the 2-norm of a vector is denoted by . Each vector is assumed to be a column vector unless otherwise specified. A range of integers from to is denoted by . stands for the identity matrix of an appropriate dimension. Sign of an entry is if , otherwise, and is the vector of entry-wise signs of . A standard basis vector with in the entry and everywhere else is denoted by . The singular value decomposition (SVD) of a matrix of rank is given by , where and are matrices of the left and right singular vectors, respectively, of , with and , and is a diagonal matrix whose entries are , where are the singular values of . Projection of a matrix on the subspace spanned by the columns of another matrix is denoted by and the projection to the orthogonal subspace is denoted by . Subspace spanned by vectors is denoted by span. Lastly, denotes the indicator function for the event , i.e., = 1, if is true, 0 otherwise.

Ii Matrix Completion Formulation of Single Individual Haplotyping

Ii-a Brief background on matrix completion

Matrix completion is concerned with finding a low rank approximation to a partially observed matrix, and has been an active area of research in recent years. Finding a rank- approximation , , to a partially observed matrix is often reduced to the search for factors and such that [20, 21, 22, 23, 24, 25]. Formally, the low rank matrix completion problem for with noisy entries over a partial set is stated as

(1)

where is the partially observed noisy version of . The task of inferring missing entries of by the above factorization is generally ill-posed unless additional assumptions are made about the structure of [22], e.g., satisfies the incoherence property (see definition (II.1)) and the entries of are sampled uniformly at random.

Definition II.1.

[22] A rank- matrix with SVD given by is said to be incoherent with parameter if

The optimization in (1) is NP-hard [26]; a commonly used heuristic for approximately solving (1) is the alternating minimization approach that keeps one of and fixed and optimizes over the other factor, and then switches and repeats the process [24, 27, 28]. Each of the alternating optimization steps is convex and can be solved efficiently. One of the few works that provide a theoretical understanding of the convergence properties of alternating minimization based matrix completion methods is [24], where it was shown that for a sufficiently large sampling probability of , the reconstruction error can be minimized to an arbitrary accuracy. The original noiseless analysis was later extended to a noisy case in [27].

In a host of applications, factors , or both may exhibit structural properties such as sparsity, non-negativity or discreteness. Such applications include blind source separation [29], gene network inference [30], and clustering with overlapping clusters [31], to name a few. In this work, we consider the rank-one decomposition of a binary matrix from its partial observations that are perturbed by bit-flipping noise. This formulation belongs to a broader category of non-negative matrix factorization [21] or, more specifically, binary matrix factorization [32, 33, 34, 35]. Related prior works include [32, 33], which consider decomposition of a binary matrix in terms of non-binary and , while [34] explores a Bayesian approach to factorizing matrices having binary components. The approach in [35] constrains , and to all be binary; however, it requires a fully observed input matrix . On the other hand, [36] considers a factorization of a non-binary into a binary and a non-binary factor, with the latter having “soft” clustering constraints imposed. As opposed to these works, we aim for approximate factorization in the scenario where all of , and are binary, having only limited and noisy access to the entries of .

Next, we define the notion of distance between two vectors, which will be used throughout the rest of this paper.

Definition II.2.

[37] Given two vectors and , the principal angle distance between and is defined as

where and are normalized111Normalized version of any vector is given by . forms of and .

Ii-B System model

Let the number of reads carrying information about the haplotypes (after discarding reads which cover no more than one SNP) be . If denotes the haplotype length (), then the reads can be organized into an SNP fragment matrix , whose column contains information carried by the read and whose row contains information about the SNP position on the chromosomes in a pair. Since diploid organisms typically have bi-allelic chromosomes (i.e., only possible nucleotides at each SNP position), binary labels or can be ascribed to the informative entries of , where the mapping between nucleotides and the binary labels follows arbitrary convention. Let be the set of entries of that carry information about the SNPs; note that the number of informative entries in each column of in much smaller than , reflecting the fact that the reads are much shorter than chromosomes. Let us define the sample probability as . Furthermore, let us define the operator as

(2)

represents the information about the SNP site provided by the read. Adopting the convention that ’s in column correspond to SNP positions not covered by the read, becomes a matrix with entries in . Let be the set of haplotype sequences of the diploid organism under consideration, with . Note that the binary encoding of SNPs along the haplotypes implies that the haplotypes are binary complements of each other, i.e., .

can be thought of as obtained by sampling, with errors, a rank one matrix with entries from , given by

(3)

where and are vectors of lengths and respectively, with entries from , and are normalized versions of and , and is the singular value of . represents the haplotype or (the choice can be arbitrary) and denotes the membership of read, i.e., if and only if the read is sampled from . If denotes the sequencing error noise matrix, then the erroneous SNP fragment matrix is given by

(4)

The objective of SIH is to infer (and ) from the data matrix which is both sparse as well as noisy.

Ii-C Noise model

Let denote the sequencing error probability. The noise matrix capturing the sequencing errors can be modeled as an matrix with entries in ,222For the labeling scheme adopted in this paper, . where each entry is given by

(5)

has full rank since the errors occur independently across the reads and SNPs. The SVD of is given by , where .

An important observation about the noise model defined in (5) is that it fits naturally into the worst case noise model considered in [27, 25]. Under this model, the entries of are assumed to be distributed arbitrarily, with the only restriction that there exists an entry-wise uniform upper bound on the absolute value i.e., , where is a constant. This is trivially true for the above formulation of SIH, where , leading to . With the entries of modeled as Bernoulli variables with probability , the following lemma provides a bound on the spectral norm of the partially observed noise matrix and is proved in the Appendix A.

Lemma 1.

Let be an sequencing error matrix as defined in (5). Let be the sample set of observed entries and let be the observation probability. If denotes the sequencing error probability, then with high probability we have

Iii Single Individual Haplotyping via Alternating Minimization

As seen in Section II-B, SIH can be formulated as the problem of low-rank factorization of the underlying SNP-fragment matrix ,

(6)

To perform the factorization, we optimize the loss function given by

(7)

which is identical to the MEC score associated with the factorization in (6). However, -norm optimization problems are non-convex and computationally hard; instead, we use a relaxed -norm loss function

Then, the optimization problem can be rewritten as finding and such that

The above optimization problem can be further reduced to a continuous and simpler version by relaxing the binary constraints on and ,

(8)

Iii-a Basic alternating minimization for SIH

The minimization (8) is a non-convex problem and often eludes globally optimal solutions. However, (8) can be solved in a computationally efficient manner by using heuristics such as alternating minimization, which essentially alternates between least-squares solution to or . In other words, the minimization problem boils down to an ordinary least-squares update at each step of an iterative procedure, summarized as

(9)
(10)

Once a termination condition is met for the above iterative steps, entries of are rounded off to to give the estimated haplotype vector. Following [24], the basic alternating minimization algorithm for the SIH problem is formalized as Algorithm 1.

SNP-fragment matrix , observed set , estimated probability
Power Iteration: Use power iteration to generate the top singular vector of ; denote it by
Clipping: Set entries of greater than to zero, and then normalize to get
for  do
   
   
end for
Output: Round-off entries of to to get estimate of the haplotype vector
Algorithm 1 SIH via alternating minimization

Remark 1: Performance of Algorithm 1 depends on the choice of the initial vector . The singular vector corresponding to the topmost singular value of the noisy and partially observed matrix serves as a reasonable starting point since, as shown later in Section IV, this vector has a small distance333Please refer to Section II-A for a definition of distance measure. to . However, performing singular value decomposition requires operations; therefore, it is computationally prohibitive for large-scale problems typically associated with haplotyping tasks. In practice, the power method is employed to find the topmost singular vector of the appropriately scaled matrix by iteratively computing vectors and as

(11)

with the initial chosen to be a random vector. Let us assume that the singular values of are . The power method is guaranteed to converge to the singular vector (say, ) corresponding to , provided holds strictly. The convergence is geometric with a ratio . Through successive iterations, the iterate gets closer to the true singular vector; specifically,

(12)

with per iteration complexity of [18] (see Definition 1 for ). The following lemma provides the complexity of iterations required for the convergence of the described method.

Lemma 2 ([38]).

Let be an matrix and be a random vector in . Let for , , where ’s and ’s are respectively the left singular vectors and singular values of . Then, after iterations of the power method, the iterate satisfies .

Remarks 2: It has been shown in [24] that the convergence guarantees for Algorithm 1 can be established, provided the incoherence of the iterates and is maintained for iterations (see Definition II.1). To ensure incoherence at the initial step, one needs to threshold or “clip” the absolute values of the entries of , as described in Algorithm 1. Although the singular vector obtained by power iterations minimizes the distance from the true singular vector, it is the clipping step that makes sure that the information contained in is spread across every dimension instead of being concentrated in only few, much like the true vector (see Lemma 3).

Iii-B Binary-constrained alternating minimization

The updates at each iteration of Algorithm 1 ignore the fact that the underlying true factors, namely and , have discrete entries; instead, the procedure imposes binary constraints on and at the final step only. This may adversely impact the convergence of alternating minimization; to see this, note that when is updated in Algorithm 1 according to (9), its entry is found as

(13)

Clearly, if the absolute value of is very large (or very small) compared to at a given iteration , then, given that for , we see from (13) that the absolute value of at iteration becomes close to (or much bigger than ). We empirically observe that as the iterations progress, the value of becomes increasingly bounded away from , which leads to potential incoherence of the iterates in subsequent iterations. To maintain incoherence, it is desirable that the entries of and remain close to . It is therefore of interest to explore if we can do better by restricting the update steps in the discrete domain; in other words, enforce the discreteness condition in each step, rather than using it at the final step only.

One way of enforcing discreteness is to project the solution of each update onto the set , i.e., impose the inherent binary structure of and in (10) and (9). This leads us to the updates

(14)
(15)

Replacing and updates in Algorithm 1 by (15) and (14) leads to a discretized version of the alternating minimization algorithm for single individual haplotyping, given as Algorithm 2. Clearly, rounding-off of the final iterate is no longer required since the individual iterates are constrained to be binary at each step of the algorithm.

SNP-fragment matrix , observed set , estimated sequencing error probability .
Power Iteration: Use power iteration to generate the top singular vector of and denote it by
Clipping: Set entries of greater than to zero, and then normalize to get .
for  do
   
   
end for
Output: is the estimate of the haplotype vector
Algorithm 2 SIH via discrete alternating minimization

A closer look at the iterative update of in Algorithm 2 reveals that the update can be written as

(16)

Similar update can be stated for . The non-differentiability of the update (16), however, makes the analysis of convergence of Algorithm 2 intractable. In order to remedy this problem, the “hard” update in (16) is approximated by a “soft” update using a logistic function , thus replacing the and updates at iteration in Algorithm 2 by

(17)
and
(18)

where and are vectors representing normalized and . Note that the update steps (17) and (18) can be represented in terms of the normalized vectors since it holds that sign = sign . The updates (17) and (18) relax the integer constraints on and while ensuring that the values remain in the interval . It should be mentioned here that in the multiple sets of experiments that we have run with synthetic and biological datasets, this approximation did not lead to any noticeable loss of performance when compared to Algorithm 2; on the other hand, the same allows us to derive an upper bound on the MEC score through an analysis of convergence of the algorithm (see Section IV). Algorithm 3 presents the variant of alternating minimization that relies on soft update steps as given by (17) and (18).

SNP-fragment matrix , observed set , estimated sequencing error probability .
Power Iterations: Use power iterations to generate the top singular vector of and denote it by .
Clipping: Set entries of greater than to zero, and then normalize to get .
for  do
                     
                     
                     
   
end for
Output: Round-off entries of to to get estimate of the haplotype vector.
Algorithm 3 SIH via discrete alternating minimization with soft updates

Remark 4: It is worthwhile pointing out the main differences between the approach considered here and the method in [18]. In the latter, the authors propose an alternating minimization based haplotype assembly method by imposing structural constraints on only one of the two factors, namely, the read membership factor. However, for the diploid case considered in this work, use of binary labels allow us to impose similar constraints on both and , thereby leading to computationally efficient yet accurate (as demonstrated in the results section) method outlined in Algorithm 2. Moreover, the alternating minimization algorithm in [18] is not amenable to performance analysis. Our aim in this paper is to recover (up to noise terms) the underlying true factors and analytically explore relation between the recovery error and the number of iterations required.

Iv Analysis of Performance

We begin this section by presenting our main result on the convergence of Algorithm 3. The following theorem provides a sufficient condition for the convergence of this algorithm.

Theorem 1.

Let and denote the haplotype and read membership vectors, respectively, and let denote the observed SNP-fragment matrix where , is the noise matrix with and as defined in (5), and are normalized versions of and respectively, and is the singular value of . Let and be the desired accuracy of reconstruction. Assume that each entry of is observed uniformly randomly with probability

(19)

where and are global constants. Then, after iterations of Algorithm 2, the estimate with high probability satisfies

(20)

The following corollary follows directly from Theorem 1.

Corollary 1.

Define . Under the conditions of Theorem 1, the normalized minimum error correction score with respect to , defined as , satisfies

(21)

Theorem 1 and Corollary 1 imply that if the sample probability satisfies the condition (19) for a given sequencing error probability , then Algorithm 2 can minimize the MEC score up to certain noise factors in iterations. The corresponding sample complexity, i.e., the number of entries of needed for the recovery of is . Note that compared to (20), expression (21) has an additional noise term. This is due to the fact that unlike the loss function in (20), the MEC score of is calculated with respect to the observed matrix .

Factor in the expression for sample complexity (19) is due to using independent samples at each of iterations [24]. This circumvents potentially complex dependencies between successive iterates which are typically hard to analyze [39]. We implicitly assume independent samples of in each iteration of Algorithm 3 for the sake of analysis, and consider fixed sample set in our experiments. As pointed out in [39], practitioners typically utilize alternating minimization to process the entire sample set at each iteration, rather than the samples thereof.

The analysis of convergence is based on the assumption that the samples of are observed uniformly at random. This implies that each read contains SNPs located independently and uniformly at random along the length of the haplotype. In practice, however, the reads have uniformly random starting locations but the positions of SNPs sampled by each read are correlated. To facilitate our analysis, the first of its kind for single individual haplotyping, we approximate the practical setting by assuming random SNP positions.

An interesting observation in this context is that sequencing coverage, defined as the number of reads that cover a given base in the genome, can conveniently be represented as the product of the sample probability and the number of reads. Then, (19) implies that the required sequencing coverage for convergence is , which is roughly logarithmic in .

Fig. 2: A comparison of the normalized minimum error correction score of Algorithm 3 and the trace-norm minimization (SVT) method for matrices of dimensions and , plotted as a function of sample size. is and error probability is set to . The values shown are averaged over simulation runs.
Fig. 3: A comparison of the runtime of Algorithm 3 and the trace-norm minimization (SVT) method for matrices of dimensions and , plotted as a function of sample size. is and error probability is set to . The values shown are averaged over simulation runs.

In Figure 2 we compare the MEC error rate performance of Algorithm 3 (denoted as HapAltMin in the figure) with another matrix completion approach, namely singular value thresholding (SVT) [40]; SVT is a widely used trace-norm minimization based method. We compare their performance on randomly generated binary rank-one matrices of size and , and flip the entries in a uniformly randomly chosen sample set with probability . Both methods are run times for each chosen sample size and error rates are averaged over those runs. Results of the trace-norm minimization method are rounded off in the final iterations. Figure 2 suggests that alternating minimization based matrix completion approach performs better than the trace-norm minimization based method for both problem dimensions, and the performance gap is wider for compared to . Figure 3 plots the runtime of the two methods for the same problem instances. Trace-norm minimization based method, that was reported in literature as the most accurate version of SVT, is much slower of the two, primarily due to the computationally expensive SVD operation at each iteration step.

Next, we present the analysis frameworks for the initial and subsequent iterative steps of Algorithm 3.

Iv-a Initialization Analysis

In order for Algorithm 3 to converge, a suitable initial starting point close to the ground truth is necessary. In addition to the power iteration step, which gives a singular vector that is close to the true vector , the subsequent clipping step helps retain the incoherence of the same, without sacrificing the closeness property. The following lemma uses Theorem 3 to establish the incoherence of , and that remains close to the true left singular vector .

Lemma 3.

Let be obtained after normalizing the output of the power iteration step in Algorithm 3. Let be the vector obtained after setting entries of greater than to zero. If is the normalized , then, with high probability, we have , and is incoherent with parameter , where is the incoherence parameter of .

Proof.

The proof follows directly from Lemma C.2 from [24] and Lemma 2 from [27] after suitably using conditions from Lemma 1 and Theorem 3. ∎

Iv-B Convergence Analysis

Let us denote . The Taylor series expansion of from (17) is given by

(22)

Now, we have

Clearly, for a given , implying that the absolute value of is close to the entry-wise observation probability . This, in turn, implies that in (22) all terms with higher powers of are much smaller than the dominant linear term, and the Taylor’s series expansion can be written as , where the error term can be bounded as using the Lagrange error bound. Therefore, we approximate the update in (17) as

(23)

for .

Let us introduce an error vector as , where and is diagonal with . Furthermore, let us define a noise vector as , where the quantities are as follows:

  • where ;

  • is a diagonal matrix given by ;

  • where is the column of .

We also define for any given . Using the above definitions, (23) can be written as

(24)

Therefore, using vector-matrix notation, the update of can be written as

(25)