# A Graph Auto-Encoder for Haplotype Assembly and Viral Quasispecies Reconstruction

###### Abstract

Reconstructing components of a genomic mixture from data obtained by means of DNA sequencing is a challenging problem encountered in a variety of applications including single individual haplotyping and studies of viral communities. High-throughput DNA sequencing platforms oversample mixture components to provide massive amounts of reads whose relative positions can be determined by mapping the reads to a known reference genome; assembly of the components, however, requires discovery of the reads’ origin – an NP-hard problem that the existing methods struggle to solve with the required level of accuracy. In this paper, we present a learning framework based on a graph auto-encoder designed to exploit structural properties of sequencing data. The algorithm is a neural network which essentially trains to ignore sequencing errors and infers the posterior probabilities of the origin of sequencing reads. Mixture components are then reconstructed by finding consensus of the reads determined to originate from the same genomic component. Results on realistic synthetic as well as experimental data demonstrate that the proposed framework reliably assembles haplotypes and reconstructs viral communities, often significantly outperforming state-of-the-art techniques. Source codes and datasets are publicly available at https://github.com/WuLoli/GAEseq.

## 1 Introduction

Genetic makeup of a biological sample, inferred by means of DNA sequencing, will help determine an individual’s susceptibility to a broad range of chronic and acute diseases, support the discovery of new pharmaceutical products, and personalize and improve the delivery of health care. However, before the promises of personalized medicine come to fruition, efficient methods for accurate inference of genetic variations from massive DNA sequencing data must be devised.

Information about variations in an individual genome is provided by haplotypes, ordered lists of single nucleotide polymorphisms (SNPs) on the individual’s chromosomes (\citeauthorschw2010). High-throughput DNA sequencing technologies generate massive amounts of reads that sample an individual genome and thus enable studies of genetic variations (\citeauthorschw2010,clark2004role,sabe02). Haplotype reconstruction, however, remains challenging due to limited lengths of reads and presence of sequencing errors (\citeauthorhash18). Particularly difficult is the assembly of haplotypes in polyploids, organisms with chromosomes organized in -tuples with , where deep coverage is typically required to achieve desired accuracy. This implies high cost and often renders existing haplotype assembly techniques practically infeasible (\citeauthormotazedi2018exploit).

A closely related problem to haplotype assembly is that of reconstructing viral communities. RNA viruses such as hepatitis, HIV, and Ebola, are characterized by high mutation rates which give rise to communities of viral genomes, the so-called viral quasispecies. Determining genetic diversity of a virus is essential for the understanding of its origin and mutation patterns, and the development of effective drug treatments. Reconstructing viral quasispecies (i.e., viral haplotypes, as we refer to them for convenience) is even more challenging than haplotype assembly (\citeauthorahn2018viral) since the number of constituent strains in a community is typically unknown, and its spectra (i.e., strain frequencies) non-uniform.

^{1}

^{1}footnotetext: This work was funded in part by the NSF grant CCF 1618427.

Existing methods often approach haplotype assembly as the task of grouping sequencing reads according to their chromosomal origin into as many clusters as there are chromosomes. Separation of reads into clusters is rendered challenging by their limited lengths and the presence of sequencing errors (\citeauthorhash18); such artifacts create ambiguities regarding the origin of the reads. The vast majority of existing haplotype assembly methods attempt to remove the aforementioned ambiguity by altering or even discarding the data, leading to minimum SNP removal (\citeauthorlanc01), maximum fragments cut (\citeauthorduit10), and minimum error correction (MEC) score (\citeauthorlipp02) optimization criteria. Majority of haplotype assembly methods developed in recent years are focused on optimizing the MEC score, i.e., determining the smallest possible number of nucleotides in sequencing reads that should be altered such that the resulting dataset is consistent with having originated from haplotypes ( denotes the ploidy of an organism) (\citeauthorxie16,piro15,kule14,patt15,boni16). These include the branch-and-bound scheme (\citeauthorwang05), an integer linear programming formulation in (\citeauthorchen13), and a dynamic programming framework in (\citeauthorkule14). All these techniques attempt to find exact solution to the MEC score minimization problem; the resulting high complexity has motivated search for computationally efficient heuristics. They include the greedy algorithm in (\citeauthorlevy07) and methods that compute posterior joint probability of the alleles in a haplotype sequence via MCMC (\citeauthorbans08) and Gibbs (\citeauthorkim07) sampling. A max-cut algorithm for haplotype assembly in (\citeauthorbans08) is motivated by the clustering interpretation of the problem. The efficient algorithm proposed there, HapCUT, has recently been upgraded as HapCUT2 (\citeauthoredge17). In (\citeauthoragui12), a novel flow-graph approach to haplotype assembly was proposed, demonstrating performance superior to state-of-the-art methods. More recent methods include a greedy max-cut approach in (\citeauthorduit11), convex optimization framework in (\citeauthordas15), and a communication-theoretic motivated algorithm in (\citeauthorpulj16).

Haplotype assembly for polyploids () is more challenging than that for diploids () due to a much larger space of possible solutions to be searched. Among the aforementioned methods, only HapCompass (\citeauthoragui12), SDhaP (\citeauthordas15) and BP (\citeauthorpulj16) are capable of solving the haplotype assembly problem for . Other techniques that can handle reconstruction of haplotypes for both diploid and polyploid genomes include a Bayesian method HapTree (\citeauthorberg14), a dynamic programming method H-PoP (\citeauthorxie16) shown to be more accurate than the techniques in (\citeauthoragui12,berg14,das15), and the matrix factorization schemes in (\citeauthorcai16,hash18).

On another note, a number of viral quasispecies reconstruction methods were proposed in recent years. Examples include ShoRAH (\citeauthorzagordi2011shorah) and ViSpA (\citeauthorastrovskaya2011inferring) that perform read clustering and read-graph path search, respectively, to identify distinct viral components. QuasiRecomb (\citeauthortopfer2013probabilistic) casts the problem as the decoding in a hidden Markov model while QuRe (\citeauthorprosperi2012qure) formulates it as a combinatorial optimization. PredictHaplo (\citeauthorprabhakaran2014hiv) employs non-parametric Bayesian techniques to automatically discover the number of viral strains in a quasispecies. More recently, aBayesQR (\citeauthorahn2017abayesqr) approached viral quasispecies reconstruction with a combination of hierarchical clustering and Bayesian inference while (\citeauthorahn2018viral) relies on tensor factorization.

In this paper, we propose a first ever neural network-based learning framework, named GAEseq, to both haplotype assembly and viral quasispecies reconstruction problems. The framework aims to estimate the posterior probabilities of the origins of sequencing reads using an auto-encoder whose design incorporates salient characteristics of the sequencing data. Auto-encoders (\citeauthorFukushima1975) are neural networks that in an unsupervised manner learn a low-dimensional representation of data; more specifically, they attempt to perform a dimensionality reduction while robustly capturing essential content of high-dimensional data (\citeauthorgoodfellow2016deep). Auto-encoders have shown outstanding performance in a variety of applications across different fields including natural language processing (\citeauthorrichard2011), collaborative filtering (\citeauthorrianne2017graph), and information retrieval (\citeauthorthomas2016variational), to name a few. Typically, auto-encoders consist of two blocks: an encoder and a decoder. The encoder converts input data into the so-called codes while the decoder reconstructs the input from the codes. The act of copying the input data to the output would be of little interest without an important additional constraint – namely, the constraint that the dimension of codes is smaller than the dimension of the input. This enables auto-encoders to extract salient features of the input data. For both the single individual and viral haplotype reconstruction problems, the salient features of data are the origins of sequencing reads. In our work, we propose a graph auto-encoder architecture with an encoder featuring a softmax function placed after the dense layer that follows graph convolutional layers (\citeauthorMasci2011; \citeauthorrianne2017graph); the softmax function acts as an estimator of the posterior probabilities of the origins of sequencing reads. The decoder assembles haplotypes by finding the consensus sequence for each component of the mixture, thus enabling end-to-end solution to the reconstruction problems.

## 2 Methods

### 2.1 Problem formulation

Let denote a haplotype matrix where is the number of (single individual or viral) haplotypes and is the haplotype length. Furthermore, let denote an SNP fragment matrix whose rows correspond to sequencing reads and columns correspond to SNP positions. Matrix is formed by first aligning reads to a reference genome and then identifying and retaining only the information that the reads provide about heterozygous genomic sites. One can interpret as being obtained by sparsely sampling an underlying ground truth matrix , where the row of is the haplotype sampled by the read. The sampling is sparse because the reads are much shorter than the haplotypes; moreover, the reads may be erroneous due to sequencing errors. Following (\citeauthorahn2018viral) , we formalize the sampling operation as

(1) |

where denotes the set of informative entries in , i.e., the set of such that the SNP is covered by the read, and is the projection operator denoting the sampling of haplotypes by reads. Sequencing is erroneous and thus may differ from ; in particular, given sequencing error rate , with probability .

Since each read samples one of the haplotypes, where denotes the matrix indicating origins of the reads in . In particular, each row of matrix is one of the -dimensional standard unit vectors , with in the position and the remaining entries . If read samples haplotype, the row of is . If the origins of reads were known, each haplotype could be reconstructed by finding consensus of reads which sample that particular haplotype. We think of the assembly as a two-step procedure: given the SNP fragment matrix we first identify the read origin indicator matrix and then use to reconstruct the haplotype matrix .

To characterize the performance of haplotype assembly methods we rely on two metrics: the minimum error correction (MEC) score, which can be traced back to (\citeauthorlippert2002algorithmic), and the correct phasing rate, also referred to as reconstruction rate. The MEC score is defined as the smallest number of observed entries in that need to be altered (i.e., corrected) such that the resulting data is consistent with having originated from distinct haplotypes, i.e.,

(2) |

where HD denotes the Hamming distance between its arguments (sequences, evaluated only over informative entries), denotes the row of and denotes the row of . The correct phasing rate (CPR) is defined as

(3) |

where is the one-to-one mapping from the set of reconstructed haplotype to the set of true haplotype (\citeauthorhash18), i.e., mapping that determines the best possible match between the two sets of haplotypes. To characterize performance of methods for reconstruction of viral quasispecies with generally a priori unknown number of components, in addition to correct phasing rate we also quantify recall rate, defined as the fraction of perfectly reconstructed components in a population (i.e., recall rate = ), and predicted proportion, defined as the ratio of the estimated and the true number of components in a genomic mixture (\citeauthorahn2018viral).

To assemble haplotypes from a set of reads we design and employ a graph auto-encoder. Fig. 1 (b) shows the entire end-to-end pipeline that takes the collection of erroneous reads and generates reconstructed haplotypes. First, the SNP fragment matrix is processed by the graph encoder to infer the read origin indicator matrix ; then, a haplotype decoder reconstructs matrix . The graph auto-encoder is formalized in the next section.

### 2.2 Graph auto-encoders

Graph auto-encoders are a family of auto-encoders specifically designed for learning on graph-structured data (\citeauthorrianne2017graph; \citeauthorthomas2016variational). In this paper, we design graph auto-encoders for the assembly of the components of a genomic mixture. As in conventional auto-encoder structures, the developed architecture consists of two parts: the graph encoder and the decoder. The graph encoder takes the SNP fragment matrix and the graph adjacency matrix as inputs, and outputs the node embedding matrix . Note that we impose constraints on the node embedding matrix so that the salient features extracted by a graph auto-encoder approximate the read origin indicator matrix . Such a constraint does not prevent efficient training of the auto-encoders via backpropagation. The decoder is utilized to reconstruct the SNP fragment matrix and the haplotype matrix from the node embedding matrix ; this implies that the decoder is essentially capable of imputing the unobserved entries in the SNP fragment matrix.

To numerically represent information in the SNP fragment matrix , we encode its entries using a set of discrete values – one for each of possible nucleotides – where the mapping between nucleotides and the discrete values can be decided arbitrarily. To this end, we may simply represent the nucleotides A, C, G and T by , , and , respectively; non-informative entries in each row of , i.e., SNP positions not covered by a read, are represented by . Note that the SNP fragment matrix can be represented by an undirected bipartite graph where the set of read nodes with and the set of SNP nodes with together form the set of vertices , i.e., . The weights assigned to edges are the discrete values used to represent nucleotides. With this model in place, we can rephrase the graph encoder as , where represents the graph adjacency matrix for a nucleotide encoded by . Equivalently, has 1’s for the entries whose corresponding positions in are encoded by . Since we are interested in imputing the unobserved entries based on the observed entries in instead of simply copying the observed entries to , it is beneficial to reformulate the decoder as . In other words, the auto-encoder is trained to learn from the observed entries in order to determine origin of reads, impute unobserved entries of , and reconstruct haplotypes in the genomic mixture.

### 2.3 Read origin detection via graph encoder

Recall the interpretation that the SNP fragment matrix is obtained by erroneously sampling an underlying ground truth matrix . This motivates development of a specific graph encoder architecture, motivated by the ideas of the design in (\citeauthorrianne2017graph), that is capable of detecting origin of sequencing reads in via estimating the posterior probabilities of the origin of each read.

Let denote an diagonal read degree matrix whose entries indicate the number of SNPs covered by each read, and let denote an diagonal SNP degree matrix whose entries indicate the number of reads covering each SNP. We facilitate exchange of messages between read nodes and SNP nodes in the graph, initiating it from the set of read nodes ; doing so helps reduce the dimensions of weights and biases since the number of reads is far greater than the haplotype length . Note that the dimension of messages keeps reducing during the message passing procedure.

The messages from read nodes to SNP nodes are

(4) |

where and denote the weights and biases of the first convolutional layer for the nucleotide encoded with , respectively, denotes an element-wise activation function such as , and denotes the transpose of a matrix. The dimension of both and is , where denotes the message length after the first message passing step.

The messages from SNP nodes to read nodes are

(5) |

where and denote the weights and biases of the second convolutional layer for the nucleotide encoded with , respectively. The dimension of both and is , where denotes the message length after the second message passing step.

Repeating message passing and stacking the convolutional layers leads to formation of a deep model. The read nodes to SNP nodes layer is readily generalized as

(6) |

where and for . The dimension of is . Furthermore, the SNP nodes to read nodes layer is generalized as

(7) |

where . The dimension of is . Note that the messages are passed from read nodes to SNP nodes when the subscript of is odd, and otherwise traverse in the opposite direction.

Equation (6) and (7) specify the graph convolutional layer while the dense layer is defined as

(8) |

where denotes the output of the dense layer, and are the weights and biases of the dense layer, respectively, is the output of the last graph convolutional layer, and represents the number of graph convolutional layers. The dimension of is and the dimension of and is , where denotes the ploidy (i.e., the number of components in a genomic mixture).

To find which approximates the read origin indicator matrix (i.e., with each row close in the -norm sense to a -dimensional standard basis vector), we employ the softmax function

(9) |

where in our experiments we set to 200. Having estimated read origins by the node embedding matrix , the reads can be organized into clusters. This enables straightforward reconstruction of haplotypes by determining the consensus sequence for each cluster.

### 2.4 Haplotype decoder

Thus far, we have conveniently been representing alleles as the numbers in . It is desirable, however, that in the definition of a loss function the distance between numerical values representing any two alleles is identical, no matter which pair of alleles is considered; this ensures the loss function relates to the MEC score – the metric of interest in haplotype assembly problems. Following (\citeauthorahn2018viral), we define the loss function of the auto-encoder as the squared Frobenius norm of the difference between a one-hot SNP fragment matrix and the reconstructed matrix at the informative positions, i.e., , where and are formed by substituting discrete values by the set of four dimensional standard basis vectors . With such a notational convention, the proposed loss function approximates the MEC score; it only approximates the score, rather than coincides with it, because is an approximation of the read-origin matrix . Therefore, the graph auto-encoder is trained to approximately minimize the MEC score. Fig. 2 illustrates the data processing pipeline that takes as inputs reads in the SNP fragment matrix and produces the matrix of haplotypes as well as imputes missing entries in the SNP fragment matrix. The proposed graph auto-encoders for haplotype assembly and viral quasispecies reconstruction are formalized as Algorithm 1 and Algorithm 2, respectively. For the viral quasispecies reconstruction problem, the number of clusters is typically unknown; detailed strategy based on (\citeauthorahn2018viral) for the automated inference of can be found in Supplementary Document B.

## 3 Results

The hyper-parameters of GAEseq are determined by training on 5 synthetic triploid datasets with coverage 30 and validated on different 5 synthetic triploid datasets with the same coverage. The results reported in this section are obtained on test data. Detailed description of the computational platform and the choice of hyper-parameters can be found in Supplementary Document A.

MEC | CPR | ||||
---|---|---|---|---|---|

Coverage | Mean | SD | Mean | SD | |

15 | GAEseq | 8.200 | 4.686 | 0.822 | 0.048 |

HapCompass | 100.700 | 66.150 | 0.763 | 0.046 | |

H-PoP | 28.700 | 32.667 | 0.783 | 0.066 | |

AltHap | 59.100 | 28.125 | 0.709 | 0.054 | |

25 | GAEseq | 8.400 | 4.719 | 0.831 | 0.081 |

HapCompass | 124.800 | 132.156 | 0.810 | 0.063 | |

H-PoP | 33.800 | 47.434 | 0.798 | 0.046 | |

AltHap | 92.600 | 83.649 | 0.756 | 0.068 | |

35 | GAEseq | 10.700 | 3.234 | 0.857 | 0.087 |

HapCompass | 217.400 | 174.135 | 0.775 | 0.072 | |

H-PoP | 41.700 | 53.971 | 0.823 | 0.094 | |

AltHap | 164.000 | 101.583 | 0.754 | 0.093 |

p17 | p24 | p2-p6 | PR | RT | RNase | int | vif | vpr | vpu | gp120 | gp41 | nef | ||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|

GAEseq | PredProp | 1 | 1 | 1 | 1 | 1.2 | 1 | 1 | 1 | 1 | 1.2 | 1 | 1 | 1 |

CPR | 100 | 99.4 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 96.2 | 96.7 | 100 | |

CPR | 100 | 99.4 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 99.2 | 99.4 | 100 | 98.2 | |

CPR | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 99.9 | 100 | 99.3 | |

CPR | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 99.8 | |

CPR | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 99.6 | 100 | 98.1 | |

PredictHap | PredProp | 1 | 0.6 | 1 | 1 | 1 | 0.8 | 0.8 | 0.8 | 1 | 0.8 | 0.8 | 0.8 | 0.8 |

CPR | 100 | 0 | 100 | 100 | 100 | 98.9 | 100 | 100 | 100 | 93.2 | 0 | 0 | 0 | |

CPR | 100 | 100 | 100 | 100 | 100 | 100 | 99.8 | 100 | 100 | 0 | 97.8 | 100 | 98.8 | |

CPR | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 99.7 | 100 | 100 | |

CPR | 100 | 99.1 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | |

CPR | 100 | 0 | 100 | 100 | 100 | 0 | 0 | 0 | 100 | 100 | 98.6 | 100 | 100 | |

TenSQR | PredProp | 1 | 1.6 | 1 | 1 | 1.4 | 1 | 1 | 1 | 1 | 1.6 | 2.2 | 1.2 | 0.8 |

CPR | 100 | 98.9 | 100 | 100 | 99.2 | 100 | 100 | 100 | 100 | 92.8 | 96.0 | 99.0 | 0 | |

CPR | 100 | 100 | 100 | 100 | 98.0 | 100 | 100 | 100 | 100 | 94.0 | 97.2 | 100 | 95.7 | |

CPR | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 98.3 | 97.7 | 99.8 | |

CPR | 100 | 99.3 | 100 | 100 | 99.5 | 100 | 100 | 100 | 100 | 100 | 99.8 | 99.5 | 99.7 | |

CPR | 100 | 99.3 | 100 | 99.7 | 99.7 | 100 | 100 | 100 | 100 | 100 | 94.9 | 100 | 98.6 | |

aBayesQR | PredProp | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1.2 | 1 | 0.8 | 0.8 | 1.2 |

CPR | 100 | 99.4 | 100 | 100 | 98.5 | 100 | 99.9 | 100 | 100 | 99.6 | 98 | 0 | 95.8 | |

CPR | 100 | 98.7 | 100 | 100 | 98.6 | 100 | 100 | 100 | 100 | 92 | 96.5 | 98.9 | 95.5 | |

CPR | 100 | 99.6 | 100 | 100 | 99 | 100 | 100 | 100 | 100 | 98.8 | 97.7 | 99.1 | 98.2 | |

CPR | 100 | 100 | 100 | 100 | 98.9 | 100 | 100 | 99.8 | 100 | 100 | 96.3 | 98.8 | 100 | |

CPR | 100 | 99.7 | 100 | 100 | 99.2 | 100 | 99.5 | 99.7 | 100 | 100 | 0 | 98.6 | 99.2 |

### 3.1 Performance comparison on biallelic Solanum Tuberosum semi-experimental data

We first evaluate performance of GAEseq on realistic simulations which, for convenience and to distinguish from perhaps more rich synthetic and experimental data discussed in supplementary documents, we refer to as ”semi-experimental data”. The semi-experimental data is obtained by simulating mutations, shotgun sequencing procedure, read alignment and SNP calling steps in a fictitious experiment on a single individual Solanum Tuberosum (polyploid with ). Details on how exactly the semi-experimental data is generated and processed can be found in Supplementary Document C. We compare the performance of GAEseq on this data with publicly available software HapCompass (\citeauthoragui12), an algorithm that relies on graph-theoretic models to perform haplotype assembly, H-PoP (\citeauthorxie16), a dynamic programming method, and AltHap (\citeauthorhash18), a method based on tensor factorization. The performance of different methods is evaluated in terms of the MEC score and CPR. All the considered softwares were executed with their default settings, i.e. we follow instructions in the papers they were originally proposed; there are no parameter tuning steps required for these methods. We report the MEC scores and CPR achieved by the considered algorithms in Table 1. For each sequencing coverage, the mean and standard deviation (SD) of the adopted metrics are evaluated over 10 samples. As shown in the table, GAEseq achieves the lowest average MEC score as well as the lowest standard deviation of the MEC score at all sequencing coverage settings. Moreover, GAEseq achieves the highest average CPR at all coverage settings. Note that the MEC score increases with sequencing coverage since higher coverage implies more reads. The results demonstrate that the adopted graph abstraction enables GAEseq to achieve high accuracy of the reconstruction task by learning posterior probabilities of the origins of reads. Fig. 3 shows the precision-recall curves for data with coverage 15, 25 and 35. Note that GAEseq performs very accuratly at high sequencing coverage while its performance deteriorates at low coverage. An extended version of Table 1 with additional coverage settings is in Supplementary Document C.

We further test the performance of GAEseq on simulated biallelic diploid, polyallelic triploid and tetraploid data, and on real Solanum Tuberosum data; in addition to H-Pop, AltHap and HapCompass, comparisons on diploid data also include performance of HapCUT2 (\citeauthoredge17). GAEseq outperforms all the considered algorithms by achieving lower MEC score and higher CPR. Further details can be found in Supplementary Document D and E.

### 3.2 Performance comparison on gene-wise reconstruction of real HIV-1 data

The real HIV-1 data with pairwise distances between and relative frequencies between and is an *in vitro* viral population of 5 known HIV-1 strains generated by Illumina’s MiSeq Benchtop Sequencer (\citeauthordi2014full). These reads are then aligned to the HIV- reference genome. According to (\citeauthordi2014full), we remove reads of length lower than 150bp and mapping quality scores lower than 60 for better results. We compare the performance of GAEseq on gene-wise reconstruction of the HIV population to that of other state-of-the-art methods such as PredictHaplo (\citeauthorprabhakaran2014hiv), TenSQR (\citeauthorahn2018viral) and aBayesQR (\citeauthorahn2017abayesqr), following their default settings. For fair benchmarking, we use the same dataset as (\citeauthorahn2018viral) which is why the results of our benchmarking tests match those in (\citeauthorahn2018viral). The correct phasing rate and the inferred strain frequencies are evaluated for all reconstructed strains because the ground truth for the 5 HIV-1 strains is available at (https://bmda.dmi.unibas.ch/software.html). Following (\citeauthorahn2018viral), we evaluate predicted proportion by setting the parameter needed to detect the number of HIV-1 strains to . The results in Table 2 show that GAEseq perfectly reconstructs all 5 HIV-1 strains in 8 genes while other methods correctly reconstruct components in 5 or 6 genes. This demonstrates that GAEseq’s inference of read origins based on posterior probabilities enables high accuracy of the reconstruction tasks. Regarding the 5 genes where GAEseq and other methods do not achieve perfect reconstruction (p24, vpu, gp120, gp41, nef): closer examination of viral strains reconstructed by various methods suggests translocations of short viral segments within those 5 genes in the “gold standard” dataset created by (\citeauthordi2014full). Those short translocation cause mismatch between the actual ground truth and the sequences (\citeauthordi2014full) generated. Further results on reconstruction of HIV viral communities can be found in Supplement Document F.

## 4 Conclusions

In this article, we introduce auto-encoders to the problem of reconstructing components of a genomic mixture from high-throughput sequencing data that is encountered in haplotype assembly and analysis of viral communities. In particular, a graph auto-encoder is trained to group together reads that originate from the same component of a genomic mixture and impute missing information in the SNP fragment matrix by learning from the available data. The graph convolutional encoder attempts to discover origin of the reads while the decoder aims to reconstruct haplotypes and impute missing information, effectively correcting sequencing errors. Studies on semi-experimental data show that GAEseq can achieve significantly lower MEC scores and higher CPR than the competing methods. Benchmarking tests on simulated and experimental data demonstrate that GAEseq maintains good performance even at low sequencing coverage. Studies on real HIV-1 data illustrate that GAEseq outperforms existing state-of-the-art methods in viral quasispecies reconstruction.

## References

- [Schwartz 2010] Schwartz, R. 2010. Theory and algorithms for the haplotype assembly problem. Communications in Info. & Sys. vol. 10, no. 1, 2010, 23–38.
- [Clark 2004] Clark, A. G. 2004. The role of haplotypes in candidate gene studies. Genet Epidemiol. vol. 27, no. 4, 2004, 321–33.
- [Sabeti 2002] Sabeti, P.; Reich, D.; Higgins, J.; Levine, H.; and Richter, D. 2002. Detecting recent positive selection in the human genome from haplotype structure. Nature, 419(6909):832–37, 2002.
- [Lancia 2001] Lancia, G.; Bafna, V.; Istrail, S.; Lippert, R.; and Schwartz, R. 2001. SNPs problems, complexity, and algorithms. European symposium on algorithms, vol. 1, Springer: 2001.182–93.
- [Duitama 2010] Duitama, J.; Huebsch, T.; McEwen, G.; Suk, E.; and Hoehe, MR. 2010. Refhap: a reliable and fast algorithm for single individual haplotyping. Proceedings of the First ACM International Conference on Bioinformatics and Computational Biology, ACM: 2010. 160–169.
- [Lippert 2002] Lippert, R.; Schwartz, R.; Lancia, G.; and Istrail, S. 2002 Algorithmic strategies for the single nucleotide polymorphism haplotype assembly problem. Brief Bioinformatics, 2002, 3(1):23–31.
- [Xie 2016] Xie, M.; Wu, Q.; Wang, J.; and Jiang, T. 2016. H-PoP and H-PoPG: Heuristic partitioning algorithms for single individual haplotyping of polyploids. Bioinformatics, 2016, 32(24):3735–44.
- [Pirola 2015] Pirola, Y.; Zaccaria, S.; Dondi, R.; Klau, G.; Pisanti, N.; and Bonizzoni, P. 2015. Hapcol: accurate and memory-efficient haplotype assembly from long reads. Bioinformatics, 2015, 32(11), 1610–1617.
- [Kuleshov 2014] Kuleshov, V. 2014. Probabilistic single-individual haplotyping. Bioinformatics, 2014, 30(17):379–85.
- [Patterson 2015] Patterson, M.; Marschall, T.; Pisanti, N.; Van, Iersel L.; and Stougie, L. 2015. Whatshap: weighted haplotype assembly for future-generation sequencing reads. J Comput Biol., 2015, 22(6):498–509.
- [Bonizzoni 2016] Bonizzoni, P.; Dondi, R.; Klau, G.; Pirola, Y.; Pisanti, N.; and Zaccaria, S. 2016. On the minimum error correction problem for haplotype assembly in diploid and polyploid genomes. J Comput Biol., 2016, 23(9):718–36.
- [Edge 2017] Edge, P.; Bafna, V.; and Bansal, V. 2017. Hapcut2: robust and accurate haplotype assembly for diverse sequencing technologies. Genome Res., 2017; 27(5):801–12.
- [Aguiar 2012] Aguiar, D., and Istrail, S. 2012. Hapcompass: a fast cycle basis algorithm for accurate haplotype assembly of sequence data. J Comput Biol. 2012; 19(6):577–90.
- [Duitama 2011] Duitama, J.; McEwen, G.; Huebsch, T.; Palczewski, S.; and Schulz, S. 2011. Fosmid-based whole genome haplotyping of a hapmap trio child: evaluation of single individual haplotyping techniques. Nucleic Acids Res., 2011, 40(5), 2041–2053.
- [Das 2015] Das, S.; and Vikalo, H. 2015. SDhaP: haplotype assembly for diploids and polyploids via semi-definite programming. BMC Genomics, 2015; 16(1):260.
- [Puljiz 2016] Puljiz, Z.; and Vikalo, H. 2016. Decoding genetic variations: communications-inspired haplotype assembly. IEEE/ACM Trans Comput Biol Bioinform (TCBB), 2016; 13(3):518–30.
- [Berger 2014] Berger, E.; Yorukoglu, D.; Peng, J.; and Berger, B. 2014. Haptree: A novel Bayesian framework for single individual polyplotyping using NGS data. PLoS Comput Biol. 2014; 10(3):e1003502.
- [Ahn 2018] Ahn, S.; Ke, Z.; and Vikalo, H. (2018). Viral quasispecies reconstruction via tensor factorization with successive read removal. Bioinformatics (Oxford, England), 34(13), i23–i31.
- [Wang 2005] Wang, R.; Wu, L.; Li, Z.; and Zhang, X. 2005. Haplotype reconstruction from SNP fragments by minimum error correction. Bioinformatics, 2005, 21(10):2456–2462.
- [Chen 2013] Chen, Z.; Deng, F.; and Wang, L. 2013. Exact algorithms for haplotype assembly from whole-genome sequence data. Bioinformatics, 2013, 29(16):1938–1945.
- [Levy 2007] Levy, S.; Sutton, G.; Ng, P.; Feuk, L.; and Halpern, A. 2007 The diploid genome sequence of an individual human. PLoS Biol., 2007, 5(10):254.
- [Bansal 2008] Bansal, V.; Halpern, A.; Axelrod, N.; and Bafna, V. 2008. An MCMC algorithm for haplotype assembly from whole-genome sequence data. Genome Res., 2008; 18(8):1336–46.
- [Kim 2007] Kim, J.; Waterman M.; and Li, L. 2007. Diploid genome reconstruction of ciona intestinalis and comparative analysis with ciona savignyi. Genome Res. 2007; 17(7):1101–10.
- [Goodfellow 2016] Goodfellow I.; Bengio, Y.; and Courville A. 2016. Deep Learning. MIT Press.
- [Lippert 2002] Lippert, R.; Schwartz, R.; Lancia, G.; and Istrail, S. 2002. Algorithmic strategies for the single nucleotide polymorphism haplotype assembly problem. Briefings in bioinformatics, 3(1), 23–31.
- [Motazedi 2018] Motazedi, E.; Finkers, R.; Maliepaard, C.; and de Ridder, D. 2018. Exploiting next-generation sequencing to solve the haplotyping puzzle in polyploids: a simulation study. Briefings in bioinformatics, 19(3), 387–403.
- [Cai 2016] Cai, C.; Sanghavi, S.; and Vikalo, H. 2016. Structured low-rank matrix factorization for haplotype assembly. IEEE J Sel Top Sign Proc. 2016; 10(4):647–57.
- [Hashemi 2018] Hashemi, A.; Zhu, B.; and Vikalo, H. 2018. Sparse tensor decomposition for haplotype assembly of diploids and polyploids. BMC Genomics, 2018; 19(4), 191.
- [Rianne 2017] Rianne van den Berg, Thomas N. Kipf and Max Welling 2017. Graph Convolutional Matrix Completion. arXiv:1706.02263.
- [Thomas 2016] Kipf, Thomas N. and Max Welling 2016. Variational Graph Auto-Encoders. CoRR, abs/1611.07308.
- [Di 2014] Di, F.; Töpfer, A.; Rey, M.; Prabhakaran, S.; Duport, Y.; Leemann, C.; Schmutz, S.; Campbell, N. K.; Joos, B.; and Lecca, M. R. 2014. Full-length haplotype reconstruction to infer the structure of heterogeneous virus populations. Nucleic acids research, 42(14), e115.
- [Ahn 2017] Ahn, S.; and Vikalo, H. 2017. aBayesQR: A bayesian method for reconstruction of viral populations characterized by low diversity. In International Conference on Research in Computational Molecular Biology, pages 353–369. Springer.
- [Zagordi 2011] Zagordi, O.; Bhattacharya, A.; Eriksson, N.; and Beerenwinkel, N. 2011. Shorah: estimating the genetic diversity of a mixed sample from next-generation sequencing data. BMC bioinformatics, 12(1), 119.
- [Astrovskaya 2011] Astrovskaya, I.; Tork, B.; Mangul, S.; Westbrooks, K.; Măndoiu, I.; Balfe, P.; Zelikovsky, A. 2011. Inferring viral quasispecies spectra from 454 pyrosequencing reads. BMC bioinformatics, 12(6), 1.
- [Prosperi 2012] Prosperi, M. C; and Salemi, M. 2012. Qure: software for viral quasispecies reconstruction from next-generation sequencing data. Bioinformatics, 28(1), 132–133.
- [Prabhakaran 2014] Prabhakaran, S.; Rey, M.; Zagordi, O.; Beerenwinkel, N.; and Roth, V. 2014. Hiv haplotype inference using a propagating dirichlet process mixture model. IEEE/ACM Trans. on Comput. Biol. Bioinform. (TCBB), 11(1), 182–191.
- [Töpfer 2013] Töpfer, A.; Zagordi, O.; Prabhakaran, S.; Roth, V.; Halperin, E.; and Beerenwinkel, N. 2013. Probabilistic inference of viral quasispecies subject to recombination. Journal of Computational Biology, 20(2), 113–123.
- [Socher 2011] Socher, R.; Pennington, J.; Huang, Eric H.; Ng, Andrew Y.; and Manning, Christopher D. 2011. Semi-supervised Recursive Autoencoders for Predicting Sentiment Distributions. Conference on Empirical Methods in Natural Language Processing, EMNLP(11), 151-161.
- [Fukushima 1975] Fukushima K. C. 1975. a self-organizing multilayered neural network. Biol Cybern, 20(3-4), 121-36.
- [Masci 2011] Masci J.; Meier U.; and Ciresan D. 2011. Stacked convolutional auto-encoders for hierarchical feature extraction. Artificial Neural Networks and Machine Learning – ICANN, 52–9.

### Supplementary Document A : Computational settings

The models were implemented on a 3.70GHz Intel i7-8700K processor, 2 NVIDIA GeForce GTX 1080Ti computer graphics cards and 32GB RAM. Randomness of the initial weights of the auto-encoder may cause the neural network to remain in a local minimum during training. To overcome this, we run GAEseq multiple times and choose the result with the lowest MEC score. Given a SNP fragment matrix, we run GAEseq 200 times, train 200 models and get 200 reconstructed haplotype matrix candidates; the algorithm selects the candidate corresponding to the lowest MEC score automatically. In the GAEseq software, users can specify how many times to run the algorithm; the software will run automatically as opposed to manually running GAEseq multiple times.

As for hyperparameters, the number of graph convolutional layers in the auto-encoder was set to 2 (in particular, one read-nodes-to-SNP-nodes layer and one SNP-nodes-to-read-nodes layer); dimension of messages reduces linearly during message passing between the layers. For example, if the number of reads is , the haplotype length is , the ploidy is and we use 2 graph convolutional layers and a dense layer, the dimensions of the weight and bias matrix of the first and second layer are and , respectively. The dimensions of the weight and bias matrix of the denser layer are set to and , respectively. We use the Adam optimizer (\citeauthoradam2015), set the step size to 0.0001, and set all other parameters to their default values in Tensorflow. We also use Xavier initialization (\citeauthorXavier2010) with default settings and the number of epoches set to 100. In our studies, we found that an architecture with two graph convolutional layers achieves significantly better results than state-of-the-art haplotype assembly methods. To ensure the model generalizes well to unobserved entries, we added dropout regularization to message passing between the layers; for each layer, the dropout probability is 0.1. For all experiments on viral quasispecies reconstruction, the initial number of clusters is set to 2.

### Supplementary Document B : Determining the number of components in a viral quasispecies

When reconstructing haplotypes sampled by a collection of sequencing reads, GAEseq requires as input the number of haplotypes, . While the ploidy of an individual organism in the haplotype assembly problem is known a priori, cardinality of a viral community needs to be estimated. To determine , we examine the improvement rate of the MEC score defined as

(10) |

Recall that the MEC score is defined as the smallest number of the observed entries in that need to be altered such that the resulting data is consistent with having originated from distinct haplotypes. The score decreases monotonically with ; however, once reaches the actual number of components, the improvement rate of the MEC score (MECimpr) saturates. To find the saturation point, we compare MECimpr with a pre-defined threshold. Following (\citeauthorahn2018viral), the number of components is determined via binary search. Specifically, starting from an initial , the number of components is updated as until ; at this point, the number of components starts to decrease as where . Once , the number of components increases again as where . If , the search procedure stops by assigning which is the estimated number of strains. The recommended choice of the threshold is discussed in (\citeauthorahn2017abayesqr) where the estimation of the number of components via MECimpr was demonstrated to be robust with respect to the choice of the threshold.

### Supplementary Document C : Performance comparison on biallelic Solanum Tuberosum semi-experimental data

The semi-experimental data is obtained by simulating mutations, shotgun sequencing procedure, read alignment and SNP calling steps in an experiment on a single individual Solanum Tuberosum. In particular, we use Haplogenerator (\citeauthormotazedi2018exploit) to generate haplotypes by introducing independent mutations that follow the lognormal distribution of a randomly selected genome region from Solanum Tuberosum chromosome 5 (\citeauthorpotato2011genome) of length 5000 bp. The mean distance between neighboring SNPs and the standard deviation (SD) are set to 21 bp and 27 bp, respectively, as previously suggested by (\citeauthormotazedi2018exploit). Due to Haplogenerator’s limitations, we constrain mutations to transitions and do not consider transversions (i.e., mutations are constrained to be between A and C and between G and T). bp-long Illumina’s MiSeq reads of inner distance 50 bp and standard deviation 10 bp are generated to uniformly sample haplotypes using ART software (\citeauthorhuang2012art) with default setting. Following this step, the generated reads are aligned to the reference genome using the BWA-MEM algorithm (\citeauthorli2009fast); the reads having mapping quality score lower than 60 or being shorter than 70 bp are discarded. SNPs are called if, at any given site, the abundance of a minor allele exceeds a predetermined threshold; the SNP fragment matrix is formed by collecting all such heterozygous sites. Seven different sets of semi-experimental data obtained by sampling at varying coverage (10, 15, 20, 25, 30, 35 and 40) are generated; each set consists of 10 samples. We first generate genome regions of length 5000 bp by partitioning the Solanum Tuberosum chromosome 5 and then randomly select 70 among them (generated haplotypes and reads are different for each sample). The sequencing error rate is automatically set by the built-in quality profiles of ART inferred from large amounts of recalibrated sequencing data (\citeauthorhuang2012art). Table 1 shows the performance comparison of GAEseq, AltHap, HapCompass and H-PoP on biallelic Solanum Tuberosum semi-experimental data.

MEC | CPR | ||||
---|---|---|---|---|---|

Coverage | Mean | SD | Mean | SD | |

10 | GAEseq | 18.500 | 4.552 | 0.848 | 0.074 |

HapCompass | 100.300 | 43.584 | 0.769 | 0.039 | |

H-PoP | 19.700 | 25.254 | 0.803 | 0.086 | |

AltHap | 64.100 | 32.953 | 0.727 | 0.072 | |

15 | GAEseq | 8.200 | 4.686 | 0.822 | 0.048 |

HapCompass | 100.700 | 66.150 | 0.763 | 0.046 | |

H-PoP | 28.700 | 32.667 | 0.783 | 0.066 | |

AltHap | 59.100 | 28.125 | 0.709 | 0.054 | |

20 | GAEseq | 16.800 | 15.873 | 0.862 | 0.062 |

HapCompass | 95.600 | 53.883 | 0.795 | 0.047 | |

H-PoP | 30.500 | 37.023 | 0.791 | 0.078 | |

AltHap | 82.100 | 56.658 | 0.737 | 0.068 | |

25 | GAEseq | 8.400 | 4.719 | 0.831 | 0.081 |

HapCompass | 124.800 | 132.156 | 0.810 | 0.063 | |

H-PoP | 33.800 | 47.434 | 0.798 | 0.046 | |

AltHap | 92.600 | 83.649 | 0.756 | 0.068 | |

30 | GAEseq | 27.200 | 19.887 | 0.914 | 0.033 |

HapCompass | 306.800 | 187.934 | 0.796 | 0.081 | |

H-PoP | 34.200 | 32.798 | 0.879 | 0.088 | |

AltHap | 263.000 | 499.659 | 0.762 | 0.133 | |

35 | GAEseq | 10.700 | 3.234 | 0.857 | 0.087 |

HapCompass | 217.400 | 174.135 | 0.775 | 0.072 | |

H-PoP | 41.700 | 53.971 | 0.823 | 0.094 | |

AltHap | 164.000 | 101.583 | 0.754 | 0.093 | |

40 | GAEseq | 16.400 | 7.333 | 0.835 | 0.034 |

HapCompass | 208.000 | 176.699 | 0.833 | 0.070 | |

H-PoP | 30.4 | 28.487 | 0.823 | 0.102 | |

AltHap | 195.8 | 281.641 | 0.762 | 0.084 |

### Supplementary Document D : Performance comparison on simulated biallelic diploid data and polyallelic triploid and tetraploid data.

To further test GAEseq, we evaluate its performance on synthetic data. Once again we use Haplogenerator (\citeauthormotazedi2018exploit) to generate haplotypes of a randomly synthesized reference genome of length 5000 bp. The mean distance between neighboring SNPs and the standard deviation (SD) are set to 5 bp and 3 bp respectively, creating haplotype blocks of length about 500. All the possible mutations were allowed and set to be equally likely, leading to not only biallelic but also polyallelic SNPs in the synthesized haplotype data. Illumina’s MiSeq read generation, read alignment and SNP calling procedures are implemented following the same procedure as in the case of semi-experimental data from Section 3.1. The data synthesized in this fashion consists of different sets, each with 10 samples, as we explore different ploidy (k = 2, 3 and 4) and sequencing coverage (5, 10, 15, 20, 25, 30, 35 and 40).

MEC | CPR | ||||
---|---|---|---|---|---|

Coverage | Mean | SD | Mean | SD | |

5 | GAEseq | 23.300 | 4.165 | 0.996 | 0.002 |

HapCUT2 | 110.500 | 23.922 | 0.975 | 0.006 | |

HapCompass | 87.500 | 25.903 | 0.965 | 0.010 | |

H-Pop | 40.000 | 30.551 | 0.989 | 0.011 | |

AltHap | 884.200 | 659.565 | 0.699 | 0.204 | |

10 | GAEseq | 30.700 | 6.667 | 0.999 | 0.001 |

HapCUT2 | 213.600 | 63.132 | 0.980 | 0.005 | |

HapCompass | 159.600 | 58.329 | 0.974 | 0.005 | |

H-Pop | 34.600 | 6.736 | 0.997 | 0.004 | |

AltHap | 583.900 | 948.344 | 0.796 | 0.218 | |

15 | GAEseq | 47.800 | 8.587 | 0.999 | 0.001 |

HapCUT2 | 339.800 | 59.066 | 0.978 | 0.003 | |

HapCompass | 268.300 | 67.003 | 0.971 | 0.005 | |

H-Pop | 47.900 | 9.539 | 0.998 | 0.002 | |

AltHap | 342.900 | 379.213 | 0.852 | 0.169 | |

20 | GAEseq | 70.900 | 10.754 | 1.000 | 0.001 |

HapCUT2 | 519.400 | 57.386 | 0.972 | 0.010 | |

HapCompass | 408.000 | 81.067 | 0.961 | 0.018 | |

H-Pop | 129.700 | 191.788 | 0.989 | 0.030 | |

AltHap | 668.400 | 579.261 | 0.787 | 0.201 | |

25 | GAEseq | 85.200 | 16.130 | 1.000 | 0.001 |

HapCUT2 | 613.000 | 157.786 | 0.977 | 0.006 | |

HapCompass | 460.700 | 97.637 | 0.968 | 0.007 | |

H-Pop | 85.700 | 17.192 | 0.998 | 0.003 | |

AltHap | 1151.600 | 649.058 | 0.743 | 0.150 | |

30 | GAEseq | 97.800 | 8.954 | 1.000 | 0.000 |

HapCUT2 | 685.300 | 180.714 | 0.979 | 0.006 | |

HapCompass | 591.600 | 150.400 | 0.968 | 0.009 | |

H-Pop | 98.000 | 8.743 | 0.999 | 0.001 | |

AltHap | 554.000 | 612.292 | 0.871 | 0.185 | |

35 | GAEseq | 107.300 | 8.138 | 1.000 | 0.001 |

HapCUT2 | 827.600 | 202.643 | 0.978 | 0.006 | |

H-Pop | 702.200 | 180.647 | 0.968 | 0.007 | |

H-Pop | 107.900 | 8.006 | 0.999 | 0.001 | |

AltHap | 668.800 | 730.814 | 0.891 | 0.146 | |

40 | GAEseq | 124.000 | 10.499 | 1.000 | 0.001 |

HapCUT2 | 1015.400 | 219.442 | 0.977 | 0.006 | |

HapCompass | 896.500 | 204.603 | 0.965 | 0.008 | |

H-Pop | 124.500 | 10.277 | 0.999 | 0.001 | |

AltHap | 1073.300 | 1099.181 | 0.847 | 0.184 |

MEC | CPR | ||||
---|---|---|---|---|---|

Coverage | Mean | SD | Mean | SD | |

5 | GAEseq | 103.400 | 51.379 | 0.920 | 0.047 |

AltHap | 1908.500 | 237.324 | 0.559 | 0.059 | |

10 | GAEseq | 112.800 | 45.917 | 0.958 | 0.037 |

AltHap | 1769.300 | 948.754 | 0.760 | 0.091 | |

15 | GAEseq | 165.800 | 106.999 | 0.945 | 0.073 |

AltHap | 1058.100 | 864.563 | 0.796 | 0.123 | |

20 | GAEseq | 241.300 | 159.657 | 0.959 | 0.047 |

AltHap | 1287.100 | 578.507 | 0.682 | 0.070 | |

25 | GAEseq | 314.900 | 158.326 | 0.934 | 0.070 |

AltHap | 1430.200 | 757.482 | 0.775 | 0.093 | |

30 | GAEseq | 292.400 | 203.242 | 0.974 | 0.040 |

AltHap | 2133.200 | 1082.576 | 0.729 | 0.109 | |

35 | GAEseq | 306.200 | 196.918 | 0.982 | 0.037 |

AltHap | 2928.700 | 869.617 | 0.723 | 0.075 | |

40 | GAEseq | 502.200 | 247.380 | 0.922 | 0.088 |

AltHap | 2943.600 | 1113.480 | 0.737 | 0.104 |

MEC | CPR | ||||
---|---|---|---|---|---|

Coverage | Mean | SD | Mean | SD | |

5 | GAEseq | 266.700 | 46.371 | 0.739 | 0.041 |

AltHap | 2641.700 | 410.159 | 0.544 | 0.056 | |

10 | GAEseq | 415.100 | 74.608 | 0.800 | 0.051 |

AltHap | 2807.200 | 938.668 | 0.658 | 0.075 | |

15 | GAEseq | 592.200 | 112.282 | 0.798 | 0.054 |

AltHap | 2742.500 | 1055.672 | 0.718 | 0.081 | |

20 | GAEseq | 628.900 | 245.841 | 0.843 | 0.047 |

AltHap | 1929.700 | 1008.766 | 0.729 | 0.063 | |

25 | GAEseq | 881.900 | 189.987 | 0.845 | 0.058 |

AltHap | 1987.100 | 1091.893 | 0.779 | 0.084 | |

30 | GAEseq | 944.100 | 182.440 | 0.848 | 0.041 |

AltHap | 2265.200 | 1277.366 | 0.759 | 0.051 | |

35 | GAEseq | 815.900 | 295.195 | 0.866 | 0.063 |

AltHap | 3906.400 | 1131.654 | 0.747 | 0.056 | |

40 | GAEseq | 949.500 | 319.238 | 0.878 | 0.046 |

AltHap | 3775.300 | 1036.702 | 0.762 | 0.075 |

For the diploid synthetic data sets, we represent an allele by 0 if it coincides with the corresponding reference allele and by 1 if it is an alternative allele. SNP positions with only alternative alleles are removed. In addition to H-PoP, HapCompass and AltHap, we also compare GAEseq with HapCUT2 (\citeauthoredge17); by design, use of HapCUT2 is limited to haplotype assembly of diploids. The metrics of performance are the previously introduced MEC score and CPR. Table 2 shows the mean and standard deviation of the MEC score and CPR for diploid data. The results are evaluated over 10 samples for each combination of ploidy and coverage. GAEseq achieves the lowest average MEC score and the lowest standard deviation of the MEC score for almost all coverage settings; its performance is followed by those of H-PoP, HapCompass, HapCut2 and AltHap. The average CPR achieved by GAEseq is very close to 1 for all coverage settings, indicating that GAEseq is able to near-perfectly reconstruct haplotypes of diploid species even when the coverage is very low; its performance is followed by those of H-PoP, HapCut2, HapCompass and AltHap. When the coverage is , the average CPR achieved by GAEseq is while it is approximately , , and for H-PoP, HapCut2, HapCompass and AltHap, respectively.

For the polyploid synthetic data sets, both H-PoP and HapCompass are restricted to reconstruction of biallelic haplotypes and are not applicable to the assembly of polyallelic ones. Furthermore, recall that HapCUT2 can only be applied to diploid haplotypes. We therefore limit performance comparison of GAEseq on polyploid synthetic data to only AltHap; Tables 3 and 4 illustrate the mean and standard deviation of the MEC score and CPR for triploid and tetraploid data, respectively. The results are evaluated over 10 samples for each combination of ploidy and coverage. As can be seen in these tables, GAEseq outperforms AltHap for all ploidy and coverage settings. As shown in Table 3, GAEseq performs well on triploid data, achieving average CPR and relatively small standard deviation even for the low coverage of ; at the same time, performance of AltHap deteriorates rapidly with increased ploidy, achieving average CPR while GAEseq achieves at coverage . As illustrated in Table 4, in applications to tetraploid data the performance of GAEseq starts to gracefully deteriorate – when the coverage is , GAEseq achieves average CPR of approximately while in the same scenario AltHap achieves average CPR of approximately . When the coverage is increased to , GAEseq achieves average CPR of approximately while AltHap achieves average CPR of approximately .

### Supplementary Document E : Performance comparison on real Solanum Tuberosum data

We further test the performance of GAEseq on real potato data (accession SRR6173308) at Solanum Tuberosum chromosome 5 (\citeauthorpotato2011genome). The 10 samples of real potato data are generated by first randomly selecting 10 genome regions of length varying from 5032 to 7573 and then aligning the Illumina HiSeq 2000 paired-end reads to the selected genome regions. After the read alignment step using the BWA-MEM algorithm (\citeauthorli2009fast), the SNP calling step is implemented to create the SNP fragment matrix. Reads having mapping quality score lower than 60 or shorter than 70 bp are discarded. Since the ground truth haplotypes are not available for this dataset, we only evaluate the performance of GAEseq and the competing methods in terms of the MEC score. Table 5 compares the performance of GAEseq, AltHap, HapCompass and H-PoP averaged over 10 selected regions of the real Solanum Tuberosum data. As can be seen from the table, GAEseq outperforms all the competing schemes in terms of both the average MEC score and its standard deviation, achieving 379.8 average MEC score. GAEseq is followed by H-PoP and AltHap while HapCompass achieves the highest average MEC score.

MEC | ||
---|---|---|

Mean | SD | |

GAEseq | 379.8 | 271.61 |

HapCompass | 2726 | 2393.7 |

H-PoP | 409.5 | 282.24 |

AltHap | 742.1 | 469.5 |

### Supplementary Document F : Further results on reconstruction of HIV viral communities

Table 6 shows the gene-wise reconstruction results on the real HIV-1 data that include inferred frequencies (omitted from Table 2 in the main paper for brevity).

We further evaluate the performance of GAEseq on the 4036bp long gag-pol region. Following (\citeauthorahn2018viral), we divide the gag-pol region into overlapping blocks, reconstruct the viral components in each block independently, and combine the results to reconstruct the full region of interest. Specifically, the region is divided into a sequence of blocks of length 500bp where the consecutive blocks overlap by 250bp. We run GAEseq to perform reconstruction of viral components in each of the total 18 blocks and merge the results to retrieve the entire region of interest. Particularly, the mismatches between strains reconstructed on two consecutive blocks in the overlapping region are corrected based on majority voting using reads that are covering the mismatched positions and are assigned to the aligned strains. Following this procedure, GAEseq perfectly reconstructed all of 5 HIV-1 strains in the gag-pol region, achieving 100% Reconstruction Rate for all 5 strains and Predicted Proportion of 1 on 355241 remained paired-end reads. The frequencies of 5 HIV-1 strains are estimated as and by counting the proportion of reads assigned to the same strain; these results are consistent with the frequencies estimated by aBayesQR and TenSQR softwares.

p17 | p24 | p2-p6 | PR | RT | RNase | int | vif | vpr | vpu | gp120 | gp41 | nef | ||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|

GAEseq | PredProp | 1 | 1 | 1 | 1 | 1.2 | 1 | 1 | 1 | 1 | 1.2 | 1 | 1 | 1 |

CPR | 100(20.5) | 99.4(17.1) | 100(21) | 100(30.9) | 100(12.1) | 100(9.6) | 100(13.6) | 100(10.4) | 100(6.6) | 100(34.3) | 96.2(8.7) | 96.7(2.8) | 100(6.6) | |

CPR | 100(18.8) | 99.4(21.8) | 100(20) | 100(18) | 100(18.2) | 100(20.9) | 100(18.2) | 100(20.4) | 100(20.3) | 99.2(10.5) | 99.4(24.1) | 100(25.6) | 98.2(22.9) | |

CPR | 100(30.9) | 100(31.5) | 100(27) | 100(21.6) | 100(23.5) | 100(20.2) | 100(21.5) | 100(29.8) | 100(34.6) | 100(36.4) | 99.9(33.0) | 100(27) | 99.3(19.7) | |

CPR | 100(17.4) | 100(18.3) | 100(14.1) | 100(19.7) | 100(30.6) | 100(33.1) | 100(37.1) | 100(32.8) | 100(30.6) | 100(7.9) | 100(29.6) | 100(34.5) | 99.8(39.1) | |

CPR | 100(12.3) | 100(11.2) | 100(18) | 100(9.9) | 100(11.9) | 100(16.2) | 100(9.6) | 100(6.6) | 100(7.9) | 100(10.2) | 99.6(4.6) | 100(10) | 98.1(11.7) | |

PredictHap | PredProp | 1 | 0.6 | 1 | 1 | 1 | 0.8 | 0.8 | 0.8 | 1 | 0.8 | 0.8 | 0.8 | 0.8 |

CPR | 100(17.8) | 0(0) | 100(18.7) | 100(15.2) | 100(12.2) | 98.9(25.4) | 100(12.1) | 100(17.7) | 100(10.2) | 93.2(10.8) | 0(0) | 0(0) | 0(0) | |

CPR | 100(19.9) | 100(46.4) | 100(21.7) | 100(22.2) | 100(19.4) | 100(18.2) | 99.8(27.6) | 100(20.9) | 100(22.1) | 0(0) | 97.8(20.7) | 100(26.7) | 98.8(20.7) | |

CPR | 100(31.9) | 100(21.8) | 100(30.3) | 100(26.9) | 100(23.4) | 100(23.2) | 100(22.3) | 100(24.9) | 100(23.7) | 100(34.1) | 99.7(42.7) | 100(28.9) | 100(23.2) | |

CPR | 100(17) | 99.1(31.8) | 100(16.4) | 100(20.9) | 100(30.2) | 100(33.2) | 100(38.1) | 100(36.6) | 100(35.5) | 100(47.1) | 100(28.6) | 100(32.7) | 100(39.3) | |

CPR | 100(13.4) | 0(0) | 100(12.9) | 100(14.8) | 100(14.7) | 0(0) | 0(0) | 0(0) | 100(8.5) | 100(7.9) | 98.6(7.9) | 100(11.7) | 100(16.9) | |

TenSQR | PredProp | 1 | 1.6 | 1 | 1 | 1.4 | 1 | 1 | 1 | 1 | 1.6 | 2.2 | 1.2 | 0.8 |

CPR | 100(18.7) | 98.9(13.1) | 100()17.4 | 100(9.9) | 99.2(12.1) | 100(9.2) | 100(8.1) | 100(9.6) | 100(7.2) | 92.8(5.9) | 96.0(18) | 99.0(11.5) | 0(0) | |

CPR | 100(18.4) | 100(19.6) | 100(20.1) | 100(17.2) | 98.0(13.5) | 100(17.2) | 100(16.7) | 100(25) | 100(19.3) | 94.0(15) | 97.2(10.3) | 100(27.8) | 95.7(26) | |

CPR | 100(33.8) | 100(33) | 100(33.6) | 100(21.7) | 100(20.7) | 100(24.6) | 100(23.3) | 100(20.5) | 100(20.3) | 100(31.4) | 98.3(33.5) | 97.7(18.8) | 99.8(19) | |

CPR | 100(17) | 99.3(19.7) | 100(17.2) | 100(21.4) | 99.5(26.7) | 100(37.7) | 100(41.2) | 100(38.4) | 100(46.2) | 100(38.8) | 99.8(9.2) | 99.5(23.2) | 99.7(42.7) | |

CPR | 100(12.1) | 99.3(14.6) | 100(7.7) | 99.7(29.8) | 99.7(14.5) | 100(11.4) | 100(10.7) | 100(6.5) | 100(7.1) | 100(4.1) | 94.9(10.5) | 100(10.2) | 98.6(12.3) | |

aBayesQR | PredProp | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1.2 | 1 | 0.8 | 0.8 | 1.2 |

CPR | 100(16.3) | 99.4(21.1) | 100(22.2) | 100(12.5) | 98.5(24.3) | 100(16.1) | 99.9(9.7) | 100(9.2) | 100(16.4) | 99.6(17) | 98(30.3) | 0(0) | 95.8(11.4) | |

CPR | 100(27.1) | 98.7(17) | 100(17.3) | 100(17.3) | 98.6(18.1) | 100(19.7) | 100(22.2) | 100(20.6) | 100(16.3) | 92(10.4) | 96.5(20.2) | 98.9(23.7) | 95.5(16.4) | |

CPR | 100(31.3) | 99.6(24.6) | 100(25.8) | 100(29.9) | 99(21.5) | 100(22.1) | 100(20.8) | 100(32.7) | 100(27) | 98.8(26.7) | 97.7(21.4) | 99.1(29.7) | 98.2(21.1) | |

CPR | 100(12.9) | 100(21.6) | 100(25.6) | 100(20.1) | 98.9(17.7) | 100(30) | 100(39.5) | 99.8(28.5) | 100(23.2) | 100(41.3) | 96.3(28) | 98.8(36.6) | 100(31.8) | |

CPR | 100(12.4) | 99.7(15.8) | 100(9.2) | 100(20.3) | 99.2(18.5) | 100(12.2) | 99.5(7.9) | 99.7(9) | 100(17.1) | 100(4.6) | 0(0) | 98.6(10.1) | 99.2(14) |

Predicted Proportion (PredProp) and Correct Phasing Rate (CPR ()) for GAEseq, PredictHaplo, TenSQR and aBayesQR applied to reconstruction of HIV-1HXB2, HIV-189.6, HIV-1JR-CSF, HIV-1NL4-3 and HIV-1YU2 for all 13 genes of the HIV-1 dataset. Frequencies are reported in parenthesis.

## References

- [Diederik 2015] Diederik P. Kingma and Jimmy Ba 2015. Adam: A Method for Stochastic Optimization. arXiv:1412.6980 [cs.LG].
- [Xavier 2010] Xavier Glorot and Yoshua Bengio 2010. Understanding the difficulty of training deep feedforward neural networks. Proceedings of the thirteenth international conference on artificial intelligence and statistics, 249-256.
- [Ahn 2018] Ahn, S.; Ke, Z.; and Vikalo, H. (2018). Viral quasispecies reconstruction via tensor factorization with successive read removal. Bioinformatics (Oxford, England), 34(13), i23–i31.
- [Ahn 2017] Ahn, S.; and Vikalo, H. 2017. aBayesQR: A bayesian method for reconstruction of viral populations characterized by low diversity. In International Conference on Research in Computational Molecular Biology, pages 353–369. Springer.
- [Motazedi 2018] Motazedi, E.; Finkers, R.; Maliepaard, C.; and de Ridder, D. 2018. Exploiting next-generation sequencing to solve the haplotyping puzzle in polyploids: a simulation study. Briefings in bioinformatics, 19(3), 387–403.
- [Potato Genome Sequencing Consortium 2011] Potato Genome Sequencing Consortium 2011. Genome sequence and analysis of the tuber crop potato. Nature, 475, 189–195.
- [Huang 2012] Huang, W.; and Li, L.; Myers, J. R. and Marth, G. T. 2012. ART: a next-generation sequencing read simulator. Bioinformatics, 28(4), 593–594.
- [Li 2009] Li, H.; and Durbin, R. 2009. Fast and accurate short read alignment with burrows-wheeler transform. Bioinformatics, 25(14), 1754–1760.
- [Edge 2017] Edge, P.; Bafna, V.; and Bansal, V. 2017. Hapcut2: robust and accurate haplotype assembly for diverse sequencing technologies. Genome Res., 2017; 27(5):801–12.