Computing Petaflops over Terabytes of Data:
The Case of Genome-Wide Association Studies
Abstract
In many scientific and engineering applications, one has to solve not one but a sequence of instances of the same problem. Oftentimes, the problems in the sequence are linked in a way that allows intermediate results to be reused. A characteristic example of this class of applications is given by Genome-Wide Association Studies (GWAS), a widespread tool in computational biology. GWAS entails the solution of up to trillions (10^12) of correlated generalized least-squares problems, posing a daunting challenge: the performance of petaflops (10^15 floating-point operations) over terabytes of data.
In this paper, we design an algorithm for performing GWAS on multi-core architectures. This is accomplished in three steps. First, we show how to exploit the relation among successive problems, thus reducing the overall computational complexity. Then, through an analysis of the required data transfers, we identify how to eliminate any overhead due to input/output operations. Finally, we study how to decompose the computation into tasks to be distributed among the available cores, to attain high performance and scalability.
We believe the paper contributes valuable guidelines of general applicability for computational scientists on how to develop and optimize numerical algorithms.
Categories and Subject Descriptors: G.4 [Mathematical Software]: Algorithm design and analysis; Efficiency
General Terms: Algorithms, Performance
Authors’ addresses:
Diego Fabregat, AICES, RWTH Aachen, Aachen, Germany,
fabregat@aices.rwth-aachen.de.
Paolo Bientinesi, AICES, RWTH Aachen, Aachen, Germany,
pauldj@aices.rwth-aachen.de.
1 Introduction
Many traditional linear algebra libraries, such as LAPACK [Anderson et al. (1999)] and ScaLAPACK [Blackford et al. (1997)], and tools such as Matlab, focus on providing efficient building blocks for solving one instance of many standard problems. By contrast, engineering and scientific applications often originate multiple instances of the same problem, thus leading to a (possibly very large) sequence of independent library invocations. The drawback of this situation is the missed opportunity for optimizations and data reuse; in fact, due to their black-box nature, libraries cannot avoid redundant computations or exploit problem-specific properties across invocations. The underlying theme of this paper is that a solution scheme devised for a specific sequence may be much more efficient than the repeated execution of the best single-instance routine [Bientinesi et al. (2010), Di Napoli et al. (2012)].
Genome-wide Association Studies (GWAS) clearly exemplify this issue, posing a formidable computational challenge: the solution of billions, or even trillions, of correlated generalized least-squares (GLS) problems. As described below, a solution based on the traditional black-box approach is entirely unfeasible. With this paper we demonstrate that, by tackling all the problems as a whole, and by exploiting application-specific knowledge in combination with a careful parallelization scheme, it becomes possible for biologists to complete GWAS in a matter of hours.
Any algorithm to perform GWAS needs to address the following issues.

Complexity. In a representative study, one has to solve a two-dimensional grid of m × t generalized least-squares problems, where m and t are of the order of millions and hundreds of thousands, respectively. The complexity for the solution of one problem (with an n × n covariance matrix) in isolation is O(n³) floating-point operations (flops), adding up to O(m t n³) flops for the entire study. For the largest problem addressed in our experiments (Section 6), this approach would require an enormous number of flops; even if performed at the theoretical peak of Sequoia, the fastest supercomputer in the world (http://www.top500.org/, as of October 2012), the computation would take more than 2 months. Here we illustrate how, by exploiting the structure that links the different problems, the complexity reduces to O(p m t n), with p ≪ n the width of each design matrix, and the same experiment on a 40-core node completes in about 12 hours.

Size of the datasets. GWAS entails the processing of terabytes of data. Specifically, the aforementioned analysis involves reading 10 GBs and writing 3.2 TBs of data. Largely exceeding even the combined main memory of current typical clusters, these requirements demand an out-of-core mechanism to efficiently handle datasets residing on disk. In such a scenario, data is partitioned in slabs, and it becomes critical to determine the most appropriate traversal, as well as the size and shape of each slab. The problem is far from trivial, because these factors affect the amount of transfers and computation performed, as well as the necessary extra storage. By modeling all these factors, we find the best traversal direction, and show how to determine the shape and size of the slabs to achieve a complete overlap of transfers with computations. As a result, irrespective of the data size, the efficiency of our in-core solver is sustained.

Parallelism. The partitioning of data in slabs that fit in main memory translates to the computation of the two-dimensional grid of GLSs in terms of sub-grids, or tiles, of such problems. The computation of each tile must be organized to exploit multi-threaded parallelism and to attain scalability even with a large number of cores. To this end, we present a study on how to decompose the tiles into smaller computational tasks (to take advantage of the cache memories) and how to distribute such tasks among the computing cores.
Our study of these three issues is not specifically tied to GWAS; we keep the discussion in general terms, to concentrate on the methodology rather than on the problem at hand. We believe this paper contributes valuable guidelines for computational scientists on how to optimize numerical algorithms. For the specific case of GWAS, thanks to the combination of a lower complexity, a perfect overlapping of data movement with computation, and nearly perfect scalability, we outperform the current state-of-the-art tools, GenABEL and FaST-LMM [Aulchenko et al. (2007), Lippert et al. (2011)], by a factor of more than 1000.
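As a sanity check on the orders of magnitude discussed above, the following sketch compares the naive per-instance cost against a reduced cost of the form O(p m t n) derived later in the paper. All sizes are illustrative assumptions, not the exact values of the experiments in Section 6.

```python
# Back-of-envelope flop counts for a multi-trait GWAS grid.
# All sizes below are illustrative assumptions.
m = 10**6   # number of SNPs (assumed)
t = 10**5   # number of traits (assumed)
n = 10**3   # number of individuals (assumed)
p = 4       # width of each design matrix (assumed)

naive_flops = m * t * n**3         # one O(n^3) solve per GLS instance
reduced_flops = 2 * p * m * t * n  # dominant term once work is reused

print(f"naive:   {naive_flops:.1e} flops")
print(f"reduced: {reduced_flops:.1e} flops")
print(f"saving:  {naive_flops // reduced_flops}x")
```

Even granting generous machine rates, the naive count is out of reach, while the reduced count is attainable on a single multi-core node.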
Section 2 introduces GWAS in both biology and linear algebra terms. In Section 3, we detail how our algorithm for GWAS exploits application-specific knowledge to reduce the asymptotic cost of the best existing algorithms, while in Section 4 we analyze the required data transfers, and discuss the application of out-of-core techniques to completely hide the overhead due to input/output operations. In Section 5, we tailor our algorithm to exploit shared-memory parallelism, and we provide performance results in Section 6. Future work is discussed in Section 7, and conclusions are drawn in Section 8.
2 Multi-trait Genome-Wide Association Studies
In a living being, observable characteristics (traits or phenotypes) such as eye color, height, and susceptibility to disease are influenced by information encoded in the genome. The identification of specific regions of the genome (single-nucleotide polymorphisms, or SNPs) associated with a given trait enhances the understanding of the trait and, in the case of diseases, facilitates prevention and treatment. Genome-wide Association Studies (GWAS) are a powerful statistical tool for locating the SNPs involved in the control of a trait [Lauc et al. (2010), Levy et al. (2009), Speliotes et al. (2010)]. The simultaneous analysis of many phenotypes is the objective of so-called multi-trait GWAS.
Every year, computational biologists publish hundreds of GWAS-related papers [Hindorff et al.], with a clear trend towards analyses that include more and more SNPs. Ideally, computational biologists aim at testing the whole human genome against as many traits as possible. Mathematically, for each SNP i and trait j (1 ≤ i ≤ m, 1 ≤ j ≤ t), one has to solve the generalized least-squares problem (GLS)
b_ij := (X_i^T M_j^{-1} X_i)^{-1} X_i^T M_j^{-1} y_j,    (1)
where

y_j ∈ R^n is the vector of observations;

X_i ∈ R^{n×p} is the design matrix;

M_j ∈ R^{n×n} is the covariance matrix, representing dependencies among observations; and

the vector b_ij ∈ R^p expresses the relation between a variation in the SNP (X_i) and a variation in the trait (y_j).
In multi-trait GWAS, the covariance matrix satisfies

M_j = σ_j² (h_j² Φ + (1 − h_j²) I),    (2)

where

the kinship matrix Φ ∈ R^{n×n} contains the relationship among all studied individuals,

h_j² and σ_j² are trait-dependent scalar estimates, and

I is the identity matrix.
The number of SNPs, m, ranges from the order of millions up to 180,000,000 (the full human genome), and the number of traits, t, is either 1 (single-trait analysis), or ranges up to the order of hundreds of thousands.
In Fig. 1, we provide a visual interpretation of the problem: each point in the grid, i.e., the vertical vector of size p at position (i, j), corresponds to the solution of one GLS (association between the i-th SNP and the j-th trait). A column in the figure represents a single-trait analysis (association between all the SNPs and a given trait); in this case y_j, as well as h_j² and σ_j², and therefore M_j, are fixed. The full grid depicts the whole multi-trait analysis; as expected, y_j, h_j², σ_j², and M_j vary along the t dimension, while the SNP X_i is fixed; the kinship matrix Φ is constant throughout the two-dimensional grid.
In linear algebra terms, Eq. (1) solves a linear regression with non-independent outcomes, where X_i ∈ R^{n×p} is full rank, M_j ∈ R^{n×n} is symmetric positive definite (SPD), y_j ∈ R^n, Φ ∈ R^{n×n} is symmetric, 0 < h_j² < 1, σ_j² > 0, and b_ij ∈ R^p. Moreover, the design matrix presents a special structure: each X_i may be partitioned as X_i = [X_L | X_Ri], where X_L is the same for all X_i's, while X_Ri varies; the sizes are X_L ∈ R^{n×(p−1)} and X_Ri ∈ R^{n×1}. The quantities X_i, y_j, Φ, h_j², and σ_j² are known, and the vector b_ij is to be computed.
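To make the formulation concrete, the following NumPy sketch solves a single GLS instance directly from Eqs. (1) and (2). The data is synthetic and the sizes are illustrative assumptions; a real kinship matrix would come from the study population.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 200, 4                              # illustrative sizes (assumed)

X = rng.standard_normal((n, p))            # design matrix [X_L | x_R]
y = rng.standard_normal(n)                 # observations
A = rng.standard_normal((n, n))
Phi = A @ A.T / n + np.eye(n)              # synthetic SPD "kinship" matrix
h2, sigma2 = 0.5, 1.0                      # trait-dependent estimates (assumed)
M = sigma2 * (h2 * Phi + (1 - h2) * np.eye(n))   # covariance, Eq. (2)

# b = (X^T M^{-1} X)^{-1} X^T M^{-1} y, via explicit linear solves
Minv_X = np.linalg.solve(M, X)
Minv_y = np.linalg.solve(M, y)
b = np.linalg.solve(X.T @ Minv_X, X.T @ Minv_y)
```

This direct formulation is only a reference point; the following sections show why repeating it billions of times is not viable.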
2.1 Related Work
The black-box nature of traditional numerical libraries limits them to offering routines for the zero-dimensional case, a single point in the grid, which must then be used repeatedly for each problem in the sequence. As discussed previously, such an approach is absolutely unviable. Instead, GWAS-specific tools, such as GenABEL and FaST-LMM, incorporate and exploit knowledge specific to the application. However, these tools still present limitations with respect to all three of the GWAS challenges: 1) they are tailored for a one-dimensional sequence of problems, i.e., individual columns of Fig. 1; for the solution of the entire two-dimensional grid they only offer the possibility of repeatedly invoking their one-dimensional solver. From a multi-trait GWAS perspective, this still represents a black-box approach, missing opportunities for further computation reuse; 2) they incorporate rudimentary out-of-core mechanisms that do not overlap data transfers with computation, thus incurring considerable overhead; and 3) they attain only poor scalability.
2.2 Terminology
Here we give a brief description of the acronyms used throughout the paper.

GWAS: Genome-Wide Association Studies.

GLS: Generalized Least-Squares problems.

GenABEL: A framework for statistical genomics, including one of the most widely used packages to perform GWAS.

gwfgls: GenABEL’s state-of-the-art routine for GWAS.

fastlmm: The most recent high-performance tool for GWAS.

mtgwas: our novel solver for multi-trait GWAS.
Additionally, we will repeatedly reference

BLAS (Basic Linear Algebra Subprograms) [Dongarra et al. (1990)], and

LAPACK (Linear Algebra PACKage),
as the de facto standard libraries for high-performance dense linear algebra computations.
3 The multi-trait GWAS algorithm
In this section we introduce mtgwas, a novel algorithm for multi-trait GWAS: first, we describe a simplified version that solves one single GLS instance, and then we show how the algorithm can be tailored for Eqs. (1) and (2), i.e., the solution of the whole GWAS. We stress that while mtgwas is suboptimal for the solution of one single GLS, it exploits the specific structure and properties of the full problem, and dramatically reduces its computational complexity, thus achieving remarkable speedups.
3.1 Single-instance
The fastest approach for solving a single instance of Eq. (1) involves computing the Cholesky factorization of the covariance matrix M [Paige (1979)], for a cost of O(n³) flops. In the context of GWAS, in which M_j varies with the index j (Eq. (2)), such a factorization has to be performed t times, thus originating a computational bottleneck. Instead, the main idea behind mtgwas is to take advantage of the fact that M_j results from scaling and shifting the matrix Φ, which remains constant for all i's and j's. The algorithm thus computes the eigendecomposition Φ = Z Λ Z^T (Z orthogonal, Λ diagonal).
While the eigendecomposition is much more expensive than the Cholesky factorization (both require O(n³) flops, with a considerably larger constant for the eigendecomposition), it allows us to express M_j, for any j, with only O(n) additional flops:

M_j = σ_j² (h_j² Z Λ Z^T + (1 − h_j²) I) = Z [σ_j² (h_j² Λ + (1 − h_j²) I)] Z^T =: Z D_j Z^T,

where D_j is diagonal; hence M_j^{-1} = Z D_j^{-1} Z^T.
After this initial step, Eq. (1) can be rewritten as

b_ij = (X_i^T Z D_j^{-1} Z^T X_i)^{-1} X_i^T Z D_j^{-1} Z^T y_j.
Since M_j is SPD, its eigenvalues, i.e., the entries of D_j, are all positive. Accordingly, the algorithm may take advantage of the symmetry of the expression X_i^T Z D_j^{-1} Z^T X_i, and save computations. This is accomplished by computing the inverse of the square root of each entry of D_j (K_j := D_j^{-1/2}), resulting in

b_ij = ((K_j Z^T X_i)^T (K_j Z^T X_i))^{-1} (K_j Z^T X_i)^T (K_j Z^T y_j).
Next, the products X̃_i := Z^T X_i and ỹ_j := Z^T y_j are computed, followed by X̂_ij := K_j X̃_i and ŷ_j := K_j ỹ_j.
What remains is an ordinary least-squares problem

b_ij = (X̂_ij^T X̂_ij)^{-1} X̂_ij^T ŷ_j.

Numerical considerations allow us to safely rely on forming S_ij := X̂_ij^T X̂_ij without incurring instabilities. The algorithm completes by computing S_ij and solving the corresponding SPD linear system S_ij b_ij = X̂_ij^T ŷ_j. mtgwas is detailed in Algorithm 1.
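The whole single-instance procedure can be sketched in NumPy and checked against the direct solution of Eq. (1). Sizes and data are again synthetic placeholders.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 200, 4                              # illustrative sizes (assumed)
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)
A = rng.standard_normal((n, n))
Phi = A @ A.T / n + np.eye(n)              # synthetic SPD "kinship" matrix
h2, sigma2 = 0.5, 1.0                      # trait-dependent estimates (assumed)

# One-time eigendecomposition of the kinship matrix: Phi = Z Lam Z^T
Lam, Z = np.linalg.eigh(Phi)

# Per-trait diagonal D_j and its inverse square root K_j
D = sigma2 * (h2 * Lam + (1 - h2))
K = 1.0 / np.sqrt(D)

# Whitened operands, then an ordinary least-squares problem
Xh = K[:, None] * (Z.T @ X)
yh = K * (Z.T @ y)
S = Xh.T @ Xh                              # small p x p SPD system
b = np.linalg.solve(S, Xh.T @ yh)

# Reference: direct GLS solution of Eq. (1)
M = sigma2 * (h2 * Phi + (1 - h2) * np.eye(n))
b_ref = np.linalg.solve(X.T @ np.linalg.solve(M, X),
                        X.T @ np.linalg.solve(M, y))
```

The point of the reformulation is not this single solve, but that Z and Λ are computed once and amortized over the entire grid.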
3.2 Tailoring for the two-dimensional sequence
We now discuss how to extend Algorithm 1 for the solution of the two-dimensional sequence of GLSs specific to GWAS. The general objective is to identify opportunities for reusing partial calculations across different problems, in order to reduce the overall cost.
As a first step, we expose further application-specific knowledge: between any two design matrices X_i and X_k, only the right portion changes. In Algorithm 2, every appearance of X_i is replaced with its partitioned counterpart [X_L | X_Ri], and the partitioning is propagated. The subscripts L, R, T, and B stand for Left, Right, Top, and Bottom, respectively. Since S_ij is symmetric, the star in the top-right quadrant indicates the transpose of the bottom-left one, and this quadrant needs to be neither stored nor computed. This intricate refinement of the algorithm is absolutely necessary to expose, for each operand, which portion varies along which dimension.
Next, we wrap Algorithm 2 with a double loop, corresponding to the traversal of the two-dimensional m × t grid, and reorganize the operations, aiming at eliminating redundant computation. Both traversals of the grid, by rows (for i, for j) and by columns (for j, for i), are provided in Algorithms 3 and 4, respectively. We will use the latter to describe how the computation can be rearranged; the same reasoning applies to the former.
For each operation in the algorithm, the dependencies on the indices i and j are determined by the left-hand side operand(s). Any operation whose left-hand side does not include any subscript is invariant across the two-dimensional sequence; therefore it can be computed once and reused in every other iteration of the loops. This is the case for the eigendecomposition of Φ, and for the computation of X̃_L := Z^T X_L. Operations that only vary across the j dimension (subscript j) are performed once per iteration of the outer loop over j, and reused across the iterations of the inner loop over i. Finally, the operations labeled with i or ij are placed in the innermost loop.
As the reader might have noticed, the algorithm still performs redundant computations: lines 6–9, 11–12, and 15 in Algorithm 3, and line 12 in Algorithm 4, depend only on the dimension traversed by the inner loop, and are therefore recomputed at each iteration of the outer loop. This flaw can be resolved by precomputing these quantities outside of the double loop, and then accessing them from within the loops. While mathematically and algorithmically possible in both cases, the approach poses practical constraints on Algorithm 3, as it would require a fairly large amount of extra storage. The solution is instead applicable in Algorithm 4: the operation X̃_R := Z^T X_R may overwrite X_R, making the size of the temporary storage negligible. Let us stress the significance of this improvement: by avoiding redundant calculations, Algorithm 4 saves O(m t n²) flops, thus drastically lowering the cost with respect to the other state-of-the-art algorithms, and eliminating the main computational bottleneck. Henceforth, Algorithm 4 is our algorithm of choice.
Practical considerations allow for one more optimization. Notice that the operation X̃_Ri := Z^T X_Ri is performed independently for each i, resulting in m matrix-vector multiplications. Instead, these matrix-vector operations may be bundled together in a single large matrix-matrix product, known to be a much more efficient operation. Similar reasoning applies to the operation at line 6, ỹ_j := Z^T y_j.
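The effect of this bundling can be illustrated with a small NumPy sketch: a loop of matrix-vector products (BLAS-2) and a single matrix-matrix product (BLAS-3) produce the same slab X̃_R := Z^T X_R, but the latter maps to one efficient gemm call. Sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
n, m_chunk = 500, 1000                    # illustrative sizes (assumed)
Z = rng.standard_normal((n, n))           # stand-in for the eigenvector matrix
XR = rng.standard_normal((n, m_chunk))    # slab of m_chunk SNP columns

# m_chunk separate matrix-vector products (BLAS-2, memory bound) ...
cols = [Z.T @ XR[:, i] for i in range(m_chunk)]

# ... versus one matrix-matrix product (BLAS-3, compute bound)
XR_tilde = Z.T @ XR
```

Both variants compute the same numbers; the second exposes far more data reuse to the BLAS.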
mtgwas, the final algorithm that includes all these optimizations, is provided in Algorithm 5. There, X_R and Y are used to represent the collections of all the X_Ri's and all the y_j's, respectively: X_R ∈ R^{n×m} and Y ∈ R^{n×t}. The second and third columns indicate, for each individual operation, the corresponding BLAS or LAPACK routine and its associated cost, respectively.
3.3 Computational cost
Let us now compare the computational cost of mtgwas with that of the aforementioned alternatives: LAPACK, FaST-LMM, and GenABEL. To this end, we recall the sizes of the input and output operands:

X_L ∈ R^{n×(p−1)},

X_R ∈ R^{n×m},

Y ∈ R^{n×t},

h² ∈ R^t, σ² ∈ R^t,

Φ ∈ R^{n×n},

B ∈ R^{p×m×t} (the collection of all the b_ij's).
Typical values for these dimensions are:

n of the order of thousands,

p smaller than 20,

m of the order of millions, and

t of the order of tens or hundreds of thousands.
The asymptotic cost of mtgwas (see the third column in Algorithm 5) is O(n³ + n²(m + t) + p n m t). Since in a typical scenario for multi-trait GWAS m and t are much larger than n and p, the dominating factor is O(p n m t). By contrast, the cost for the traditional library approach of LAPACK, which optimizes the solution of a single GLS (a point of the grid) and uses it for each point in the two-dimensional grid, is O(m t n³); the cost for the state-of-the-art tools, FaST-LMM and GenABEL, which optimize a one-dimensional analysis (a column of the grid) and use it for each column in Fig. 1, is O(m t n²).
Table 1 collects the mentioned costs together with their ratio with respect to mtgwas. The message is clear: no matter how optimized a solver for a single GLS or for a one-dimensional analysis is, it cannot compete with a solver specifically tailored for the entire multi-trait analysis.
mtgwas exploits the specific structure of the operands and the correlation among the GLSs, and lowers the cost of the best existing methods by a factor of O(n/p). For a problem with n of the order of thousands, and large m and t, we can expect mtgwas to be around two orders of magnitude faster than fastlmm and gwfgls.
Table 1. Computational cost of the different solvers, and ratio over mtgwas.

           Computational cost    Ratio over mtgwas
lapack     O(m t n³)             O(n²/p)
gwfgls     O(m t n²)             O(n/p)
fastlmm    O(m t n²)             O(n/p)
mtgwas     O(p m t n)            1
4 Out-of-core: Analysis of computation and data transfers
In addition to the formidable computational complexity, GWAS poses a second challenge: the management of large datasets. In the prospective scenario in which m, t, and n grow toward the upper end of their typical ranges, the sizes of the input and output data amount to tens and hundreds of terabytes, respectively. Current analyses already involve the processing and generation of a few terabytes of data.
Obviously, present shared-memory architectures are not equipped with such an amount of main memory; the size of the datasets thus becomes a limiting factor. In order to overcome this limitation, we turn our attention to out-of-core algorithms [Toledo (1999), Grimes (1988), Agullo et al. (2007)]. The goal is to design an algorithm that makes effective use of the available input/output (I/O) mechanisms, deals with data sets as large as the available secondary storage, and minimizes the overhead due to data transfers.
In the general case, the operands X_R, Y, and B are too large to fit in main memory and need to be streamed from disk to memory and vice versa. A naive approach, which too often becomes the choice for actual implementations, is sketched in Algorithm 6. The algorithm loads one X_Ri and one y_j, and stores one b_ij at a time, resembling an unblocked algorithm. A quick study of the computation and data transfers suffices to understand the poor quality of this approach. Consider a representative problem size for which the computation amounts to approximately 1.1 petaflops; at a rate of 25 GFlops/sec (see Sections 5 and 6 for details), it would complete in about 12 hours. On the other hand, loading the whole operand X_R once per trait, i.e., t times (assuming 8-byte, double-precision data), already causes 800 TBs of traffic between disk and main memory. At a peak bandwidth of 2 GB/sec, which is by no means reached due to the small size of the transfers, the data movement alone takes 111 hours, about 10 times more than the time spent in actual computation. In order to attain an efficient out-of-core design, one has to a) study how different parameters affect the ratio between computation and data movement, aiming at reducing the latter, and b) use techniques for overlapping I/O with computation, thus leading to a complete elimination of the I/O overhead.
Algorithm 6: Naive out-of-core approach (sketch).

  B := mtgwas( X_L, X_R, Y, h²_j, σ²_j, Φ )
    [...]
    for 1 ≤ j ≤ t
      load y_j
      [...]
      for 1 ≤ i ≤ m
        load X_Ri
        [...]
        store b_ij

The key behind a drastic reduction of data movement is data reuse; this is commonly attained by blocked algorithms or, in the context of out-of-core, tiled algorithms. The idea consists in loading not one single X_Ri and y_j at a time, but many of them, in a chunk or slab; once loaded in memory, each of the elements in the slabs can be used repeatedly. Following this approach, the cost of an expensive operation, such as a disk-memory transfer, is amortized by performing many more arithmetic operations per transferred element.
The ratio of computation over transferred data

    r = #flops / #elements transferred    (3)

gives an approximate idea of the potential for minimizing the impact of the data transfers. Since the time for loading an element from disk is much larger than that for performing a scalar operation, large ratios are desired. When applied to a concrete situation, the ratio exposes a number of parameters, or degrees of freedom, that may be adjusted to improve it. In the context of Algorithm 5, these degrees of freedom are the number of X_Ri's and y_j's loaded at a time.
The above ratio may be extended to incorporate the concept of overlapping. For the algorithm to completely hide the I/O under computation, the time spent in computation must be larger than the time spent loading and storing data. Hence, the inequality

    time(computation) ≥ time(data transfers)

must hold. The inequality may be refined as

    #flops / flops_per_sec ≥ (#elements × bytes_per_element) / IO_bandwidth.

This refined inequality enables the identification of (ranges of) values for the aforementioned degrees of freedom, so that a perfect overlapping is achieved.
As the overlapping out-of-core mechanism we choose so-called double-buffering. In short, to deploy double-buffering, main memory is divided into two workspaces. Each workspace contains one buffer per operand to be streamed; in this case, two buffers for the input operands X_R and Y, and one for the output stream B. At a given iteration over the streams, one workspace is used for computation, while the other one is used for storing previous results and loading the data for the next iteration. After each iteration, the workspaces swap roles. For more details, we refer the reader to [Fabregat-Traver et al. (2012)].
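A minimal skeleton of the double-buffering scheme, with a Python thread standing in for asynchronous I/O, might look as follows; load, compute, and store are placeholders for the real disk reads, per-tile computation, and disk writes.

```python
import threading

def load(slab_id):      # stand-in for an asynchronous disk read
    return f"data-{slab_id}"

def compute(data):      # stand-in for the per-tile computation
    return data.upper()

def store(result):      # stand-in for an asynchronous disk write
    results.append(result)

results = []
num_slabs = 4
workspaces = [load(0), None]      # workspace 0 pre-filled, workspace 1 empty

for s in range(num_slabs):
    cur, nxt = s % 2, (s + 1) % 2
    # I/O thread: prefetch the next slab into the idle workspace
    def io_work(slot=nxt, nxt_id=s + 1):
        if nxt_id < num_slabs:
            workspaces[slot] = load(nxt_id)
    io = threading.Thread(target=io_work)
    io.start()
    store(compute(workspaces[cur]))   # compute on the current workspace
    io.join()                          # workspaces swap roles next iteration
```

In the real solver, the I/O side would also flush the previously computed slab of B to disk while the compute side works.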
The rest of this section is dedicated to the analysis of the computation and data transfers required by Algorithm 5. During the analysis, we expose the degrees of freedom, and provide the specific constraints they must satisfy.
4.1 Pre-loops: Single sweep over the input streams
We logically divide Algorithm 5 into two sections: 1) the pre-loops (lines 1 to 4), and 2) the loops (lines 5 to 17). In the pre-loops, the operations at lines 1 and 2 involve operands that fit in main memory; they are therefore computed through direct calls to the corresponding BLAS and LAPACK routines. On the contrary, the two matrix products at lines 3 and 4 involve the operands X_R and Y, which reside on disk, and must therefore be performed in a streaming fashion.
The approach to compute X̃_R := Z^T X_R (line 4), and equivalently Ỹ := Z^T Y (line 3), consists in a traversal of the stream in slabs containing x_b X_Ri's each, where the optimal value for x_b is to be estimated. At every step over the stream (Fig. 1(a)), the algorithm

loads x_b X_Ri's, each of them of size n, i.e., x_b × n elements;

stores x_b X̃_Ri's, each of them also of size n; and

performs 2 n² x_b flops, corresponding to the matrix product of Z^T with the current slab.
The ratio of computation over data movement is

    r = 2 n² x_b / (2 n x_b) = n.

For typical values of n (from one thousand to tens of thousands), such a ratio shows great potential for a perfect overlapping.
Notice that, even though the ratio is independent of the value of x_b, two quantities in the overlap condition are influenced by this parameter: the performance of gemm (#flops/sec), and the data transfer rate (IO_bandwidth). The objective for both quantities is clear: one wants to maximize them both, to minimize execution time and I/O time. The constraint to satisfy is

    2 n² x_b / gemm_performance ≥ 2 n x_b × bytes_per_element / IO_bandwidth,    (4)

which simplifies to

    n ≥ bytes_per_element × gemm_performance / IO_bandwidth.    (5)
Since these quantities are specific to the architecture and the problem, we defer their evaluation to Sec. 6, where the experimental setup is defined.
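Under assumed machine rates (placeholders to be measured on the target system), the simplified pre-loop constraint can be evaluated as follows; per streamed element, 2n flops are performed and 16 bytes are moved (one load plus one store).

```python
# Evaluating the pre-loop overlap condition for assumed machine rates.
gemm_flops_per_sec = 200e9   # sustained gemm rate (assumed placeholder)
io_bytes_per_sec = 2e9       # disk bandwidth (assumed placeholder)
bytes_per_element = 8        # double precision

# Overlap requires 2*n/gemm_flops_per_sec >= 2*bytes_per_element/io_bytes_per_sec,
# i.e., n >= bytes_per_element * gemm_flops_per_sec / io_bytes_per_sec.
n_min = bytes_per_element * gemm_flops_per_sec / io_bytes_per_sec
print(f"overlap achieved for n >= {n_min:.0f}")
```

For these placeholder rates the condition is met comfortably by typical values of n, which start around one thousand.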
4.2 Loops: Cross product of the input streams
The analysis and application of double-buffering to the second section of the algorithm requires a deeper discussion. Instead of a single sweep through the streams, this section operates on every pair (X_Ri, y_j) of data, i.e., a "cross product" of the input streams X_R and Y. The problem can be translated into how the object B is built. In view of the impossibility of fitting the entire data set in memory, the problem of computing B is decomposed into the computation of smaller parts, or tiles (Fig. 1(b)). Let us assign each tile the size m_b × t_b, where m_b and t_b are the number of X_Ri's and y_j's processed, respectively. Henceforth we concentrate on determining adequate values for these two parameters.
As for the first section of the algorithm, the goal is to achieve a perfect overlapping of I/O with computation. To this end, we must once more study the ratio of computation over data movement as a function of the tile size. To compute a tile, the algorithm must load m_b n + t_b n elements, corresponding to the loading of a slab of m_b X_Ri's and a slab of t_b y_j's, and store p m_b t_b elements, corresponding to a slab of B. Per tile, approximately 2 p n m_b t_b flops are performed, for a ratio of

    r = 2 p n m_b t_b / (m_b n + t_b n + p m_b t_b).
In Section 3.2 we deduced that a traversal by columns of the grid depicted in Fig. 1 is favorable in terms of computational cost and temporary storage. Hence, the algorithm will load a slab of y_j's and reuse it for all the X_Ri's. Consequently, the cost of loading the slabs of Y can be neglected, and the ratio simplifies to

    r = 2 p n m_b t_b / (m_b n + p m_b t_b) = 2 p n t_b / (n + p t_b).
This ratio is independent of the value of m_b. Intuitively, if m_b is multiplied by 2, both the amount of computation within the tile and the required I/O are doubled (twice the number of X_Ri's are loaded, and twice the number of b_ij's are computed and stored). On the contrary, the ratio grows monotonically with t_b. For the sake of clarity, we illustrate in Fig. 3 the behavior of r as a function of t_b. For t_b = 1, r is small, of the order of p; the ratio rapidly grows with t_b until it approaches a plateau of the order of n, from where the growth is much smaller.
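The shape of this curve is easy to reproduce; the sketch below assumes the simplified per-tile flop count of roughly 2·p·n flops per grid point, so that the ratio takes the form r(t_b) = 2·p·n·t_b / (n + p·t_b), with illustrative values for p and n.

```python
p, n = 4, 1000   # illustrative problem parameters (assumed)

def r(tb):
    # computation-to-I/O ratio for a tile spanning tb traits
    return 2.0 * p * n * tb / (n + p * tb)

# r(1) is of the order of p; r plateaus at about 2*n for large tb
print(r(1), r(100), r(10**6))
```

The plateau explains why, past a certain point, enlarging t_b brings diminishing returns.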
As discussed previously, large values of the ratio are favored. Therefore, tiling along the t dimension (t_b > 1) is imperative to reduce the I/O overhead.
However, since we are making use of double-buffering, we do not need to choose the largest possible t_b; we only require t_b to be large enough to completely hide the overhead due to I/O. The minimum value of t_b leading to a perfect overlap can be determined analytically from the overlap condition above: the time for computing a tile must be larger than the time for loading a slab of X_R and storing the corresponding slab of B:

    2 p n m_b t_b / tile_performance ≥ (m_b n + p m_b t_b) × bytes_per_element / IO_bandwidth.

Dividing both sides by m_b:

    2 p n t_b / tile_performance ≥ (n + p t_b) × bytes_per_element / IO_bandwidth.    (6)
In Eq. (6), the values for n and p are given as input; the datatype, and thus bytes_per_element, is known; the hard-drive bandwidth is to be determined empirically; and the performance attained in the computation of a tile is determined when tailoring for the architecture, as discussed in Section 5. As we observed for the computation-to-transfer ratio, the parameter m_b does not influence the reduction of data transfers, and it has to satisfy no explicit constraint. Still, tiling along the m dimension (m_b > 1) is recommended for performance reasons, as we demonstrate in Section 5.
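Solving the tile constraint for the minimum t_b gives a closed form; the sketch below assumes the per-tile flop and traffic model above, with placeholder machine rates that would have to be measured on the target system.

```python
from math import ceil

def min_tb(n, p, perf, bw, bytes_per_el=8):
    """Smallest t_b with 2*p*n*t_b/perf >= (n + p*t_b)*bytes_per_el/bw."""
    num = n * bytes_per_el * perf
    den = 2 * p * n * bw - p * bytes_per_el * perf
    if den <= 0:
        raise ValueError("I/O can never be fully hidden at these rates")
    return ceil(num / den)

# Illustrative values: n = 1000, p = 4, 25 GFlops/sec, 2 GB/sec (all assumed)
print(min_tb(n=1000, p=4, perf=25e9, bw=2e9))
```

Note that only a modest t_b is required; double-buffering makes any larger value equally effective at hiding I/O.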
Let us summarize the study performed in this section. We exposed and analyzed the degrees of freedom in the tiling of our algorithm: x_b, m_b, and t_b. The main message of the study is that t_b is the key to reducing the data movement and making the computation feasible. Additionally, we discussed the implications of the values of all three parameters in terms of performance, and stated the two key constraints, Eqs. (5) and (6), to be satisfied in order to completely eliminate the overhead due to I/O operations.
5 Tailoring for shared-memory parallelism
Algorithm 5 (described in Sections 3 and 4) both has a lower computational complexity than all the current alternatives and eliminates the overhead due to I/O. Yet, a careful implementation is required to exploit shared-memory parallelism. In this section, we explain how to select the best-suited type of parallelism for each section of the algorithm, and outline the steps to achieve almost optimal scalability.
5.1 Pre-loops: Multi-threaded BLAS
The operations in the pre-loop section, lines 2–5, correspond either to LAPACK routines (eigendecomposition) that cast most of the computation in terms of BLAS-3 kernels, or to direct calls to BLAS-3 routines (gemm). For this class of routines, it is well known that optimized multi-threaded BLAS libraries deliver both high performance and scalability. This is therefore the solution we adopt.
5.2 Loops: Single-threaded BLAS + OpenMP parallelism
In sharp contrast to the pre-loops, the computation performed within the loops, lines 6–18, maps to non-scalable BLAS-1 and BLAS-2 operations, thus making the use of a multi-threaded BLAS not viable. Instead, we exploit the multi-core parallelism by utilizing a single-threaded BLAS in combination with OpenMP threads, and by decomposing the computation into tasks.
In Section 4, it was shown that tiling the computation of B along the t dimension is the key to eliminating penalties due to I/O data transfers. In this section we demonstrate that, for a high-performance shared-memory implementation, it is also crucial to tile along the m dimension, and we discuss how to select the best tile size and shape. Furthermore, we explore different multi-threading strategies to manage the data transfers, and to split and assign computational tasks to cores; since the total number of options is daunting, here we only describe the two that we considered most promising:

a single master thread is responsible for the streaming of all tiles, and all threads collaborate in the computation of each tile; and

each thread operates on whole tiles, and is responsible for their streaming.
5.2.1 Two-dimensional tiling
If we only consider tiles of size 1 × t_b, the largest possible tile is of size 1 × t (see Fig. 1(b), with m_b = 1 and t_b = t). In our experimental environment with 40 threads, each thread then computes a task, or block, of size 1 × t/40, attaining poor efficiency: 1.54%. Instead, by tiling (and blocking) along m as well as along t, the efficiency increases:

a) for a two-dimensional tile decomposed into two-dimensional blocks, the attained efficiency is 3.75%;

b) for a tile with a larger extent along the m dimension, the efficiency rises to 6.35%.

A higher workload per thread (case a) results in a speedup of about 2.5x; by further increasing the value of m_b (case b), data locality improves, for speedups of about 4x. The benefits of using two-dimensional tiles are thus clear; it remains to determine how such tiles should be decomposed to maximize performance.
5.2.2 Estimating the best block size
Typically, as long as the tile size satisfies the constraints (5) and (6), one chooses m_b and t_b to maximize the usage of the available RAM. For performance purposes, the computation of such large tiles must be split into smaller blocks, to exploit both the memory hierarchy and all the available cores. In the same way that the tile size is chosen according to the amount of available main memory, in cache-based architectures the block size is determined according to the size of the cache memories. The challenge lies in finding the optimal block size.
We consider blocking for the highest level of cache. Fig. 4 provides a heatmap representing the efficiency attained while varying the block sizes. The red line delimits the space of block sizes that fit in the last level of cache, level 3 (L3) for the architecture used in our experiments (see Section 6 for details). As anticipated, the block sizes attaining higher efficiency (green line: Efficient region) lie above the red line, i.e., fit in the L3 cache. Also expected is the fact that relatively large square blocks attain the highest efficiency (yellow line: Peak region).
Interestingly, we observed that inside the Peak region performance plateaus, exhibiting variations below 0.5%. This suggests that one should focus on finding this region rather than the most efficient block size. As a matter of fact, since within the Peak region the fluctuations due to the hardware and the operating system are greater than the differences in performance, it can be argued that a best block size does not even exist.
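The cache constraint delimited by the red line can be made concrete with a small sketch. The 24 MB cache size and the assumption of three resident blocks (as in a gemm-like kernel) are illustrative choices, not the paper's exact figures:

```python
import math

BYTES_PER_DOUBLE = 8

def largest_square_block(cache_bytes, operands=3, fill=1.0):
    """Largest b such that `operands` square b x b double-precision
    blocks occupy at most a `fill` fraction of the cache."""
    per_block = operands * BYTES_PER_DOUBLE
    return math.isqrt(int(fill * cache_bytes) // per_block)

# For an assumed 24 MB L3 and three resident blocks:
b_max = largest_square_block(24 * 1024**2)   # 1024
```

Since performance plateaus inside the Peak region, any square block comfortably below this bound is a reasonable choice; pinpointing a single optimum is unnecessary.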
5.2.3 Work distribution
We conclude by describing two possible strategies to distribute the blocks among the cores.

Cooperative threads (mtgwasct). A common scheme for tile-based implementations of out-of-core algorithms consists of a master thread that takes care of all data transfers, and a number of spawned threads that cooperate in the computation of each tile. As Fig. 5 illustrates, each tile is divided into blocks, and these are distributed among the computing threads, thus sharing the workload.

Independent threads (mtgwasit). Alternatively, the work may be distributed so that each thread is responsible for loading, computing, and downloading entire tiles of its own. As Fig. 5 illustrates, each tile is still decomposed into blocks for performance reasons.
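The two schemes can be contrasted with a minimal sketch. The functions below are stand-ins for the actual tile streaming and per-block GLS computation, not the paper's implementation; `compute_block` is a hypothetical placeholder:

```python
from concurrent.futures import ThreadPoolExecutor

def compute_block(block):
    return sum(block)  # placeholder for the per-block computation

def cooperative(tiles, n_threads):
    """The master (this thread) streams tiles one by one; all workers
    share the blocks of the current tile."""
    results = []
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        for tile in tiles:                       # master handles each tile
            results.extend(pool.map(compute_block, tile))
    return results

def independent(tiles, n_threads):
    """Each worker streams and computes entire tiles on its own."""
    def work_on_tile(tile):
        return [compute_block(b) for b in tile]
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        return [r for res in pool.map(work_on_tile, tiles) for r in res]

tiles = [[(1, 2), (3, 4)], [(5, 6), (7, 8)]]     # tiles made of blocks
```

In the cooperative scheme all threads synchronize at tile boundaries; in the independent scheme each thread proceeds at its own pace, at the cost of every thread performing its own I/O.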
6 Experimental results
We now focus on experimental results. We compare the performance and scalability of the state-of-the-art tools, fastlmm and gwfgls, with those of the presented algorithm mtgwas. For mtgwas we provide results for both parallel approaches described in Section 5.2.3: mtgwasct for cooperative threads, and mtgwasit for independent threads.
6.1 Experimental setup
All tests were run on an SMP system consisting of 8 Intel Xeon E7-4850 multi-core processors. Each processor comprises ten cores operating at a frequency of 2.00 GHz, for a combined peak performance of 320 GFlops/sec. The system is equipped with 512 GB of RAM and 4 TB of disk as secondary memory. The I/O subsystem attains a maximum bandwidth of 2 GB/sec for data transfers of at least 2 MB.
The routines were compiled with the GNU C compiler (gcc, version 4.4.5), and linked to the multi-threaded Intel MKL library (version 10.3). Our routines make use of the OpenMP parallelism provided by the compiler through a number of pragma directives. All computations were performed in double precision.
6.2 Configuring the degrees of freedom
In order to attain maximum performance, we need to estimate the most effective values for the algorithm parameters: , , , , and .
6.2.1
As described in Section 4, this parameter is chosen to be greater than or equal to the minimum value that maximizes both the data transfer rate and gemm's performance. The maximum I/O bandwidth is attained for transfers of at least 2 MB; the minimum value of to reach this size is 250. We also determined empirically that the minimum value of to maximize gemm's performance for 40 cores on the specified architecture is (attaining 240 GFlops/sec). Therefore, we set to . Substituting in Eq. (5), we see that we will achieve a perfect overlap:
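The bandwidth-driven lower bound can be recovered with simple arithmetic. The row count of 1000 below is an assumption (the problem dimensions are elided in the text), chosen here only to illustrate how the 2 MB threshold translates into a minimum number of tile columns:

```python
import math

BYTES_PER_DOUBLE = 8
MIN_TRANSFER_BYTES = 2 * 10**6   # 2 MB threshold for full I/O bandwidth

def min_tile_width(n_rows, min_bytes=MIN_TRANSFER_BYTES):
    """Smallest number of columns for which a double-precision tile
    of n_rows rows reaches the minimum transfer size."""
    return math.ceil(min_bytes / (n_rows * BYTES_PER_DOUBLE))

width = min_tile_width(1000)   # 250 columns for an assumed 1000 rows
```

The same computation with the application's true row count yields the value used to saturate the I/O subsystem.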
6.2.2 and
According to the results shown in Fig. 4, we simply choose a square block size within the Peak region: .
6.2.3 and
We determined that the maximum performance attained in the computation of a tile is about 25 GFlops/sec. Substituting each variable in Eq. (6):
Therefore, given the chosen block size, the constraint above () does not impose any restriction on our choice of tile size ( will be at least 160). In any case, we emphasize the importance of always carrying out such an analysis, as differences in performance or I/O bandwidth could lead to more restrictive constraints.
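Since Eq. (6) itself is not reproduced here, the following is only a sketch of the flavor of such an analysis, using the two rates quoted in the text (25 GFlops/sec of per-tile compute and 2 GB/sec of I/O): transfers hide behind computation whenever computing a tile takes at least as long as streaming it.

```python
COMPUTE_RATE = 25e9   # flops/sec attained on a tile (from the text)
IO_BANDWIDTH = 2e9    # bytes/sec sustained I/O (from the text)

def overlaps(flops_per_tile, bytes_per_tile):
    """True if computing a tile takes at least as long as streaming
    it, so the data transfer is fully hidden by computation."""
    return flops_per_tile / COMPUTE_RATE >= bytes_per_tile / IO_BANDWIDTH

# Break-even arithmetic intensity for these rates: flops per byte moved.
threshold = COMPUTE_RATE / IO_BANDWIDTH   # 12.5
```

Any tile shape whose arithmetic intensity exceeds this threshold keeps the I/O subsystem off the critical path; slower disks or faster kernels shift the threshold and may constrain the tile size.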
As discussed in Section 5, we choose different tile sizes for the different approaches to work distribution. In the case of mtgwasct, we consider large square tiles maximizing the usage of free main memory; accordingly, we choose . For mtgwasit, we fix the value of to that of the block size, i.e., , while for we select a small multiple of the block size (). The reason for such a small tile size is to show that our routine can achieve impressive speedups even with a limited amount of main memory, emphasizing that the only restriction for mtgwas is the available amount of secondary storage. As an example, for the execution of the largest test reported in this section, which involves the processing of more than 3 terabytes of data, mtgwasit required less than 2 gigabytes of main memory.
6.3 Experiments
We present performance results for fastlmm, gwfgls, mtgwasct, and mtgwasit. First, results for a single core are given, to highlight the speedup due solely to the improvement of the algorithm, i.e., the reduction in computational cost. Then, we compare the scalability and performance of all four algorithms using all 40 available cores.
In Fig. 6, we provide timings for the four routines for increasing values of . The experiments were run using a single thread. The gap between the state-of-the-art routines and our novel algorithm is substantial: while fastlmm and gwfgls would take, respectively, about 10 and 8.5 days to complete, mtgwasct and mtgwasit reduce the execution time to 5.3 and 4.8 hours, respectively. In terms of speedups, our best routine, mtgwasit, is 50 and 43 times faster than fastlmm and gwfgls, respectively. It is also worth noting the 10% speedup of mtgwasit with respect to mtgwasct.
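These speedups follow directly from the quoted timings; a quick consistency check (the 42.5 for gwfgls rounds to the 43x stated above):

```python
fastlmm_h = 10 * 24     # ~10 days, in hours
gwfgls_h = 8.5 * 24     # ~8.5 days, in hours
mtgwas_ct_h = 5.3
mtgwas_it_h = 4.8

speedup_vs_fastlmm = fastlmm_h / mtgwas_it_h   # 240 / 4.8 ≈ 50
speedup_vs_gwfgls = gwfgls_h / mtgwas_it_h     # 204 / 4.8 ≈ 42.5
it_vs_ct = mtgwas_ct_h / mtgwas_it_h           # ≈ 1.10, the ~10% gap
```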
Scalability results are shown in Fig. 7. The figure reveals clear scalability issues in the parallel implementations of both fastlmm and gwfgls; they achieve speedups between 1.4 and 1.6, and plateau when 16 or more cores are used. By contrast, the results for our routines clearly demonstrate the benefit of carefully tailoring Algorithm 5 for large shared-memory architectures: using all 40 available cores, mtgwasct and mtgwasit attain speedups of 35 and 36.6, respectively. Furthermore, the trend presented by both our routines suggests even larger speedups, should more cores become available.
Figure 8 presents performance results for the same four routines when using 40 threads. Here, the effects of the computational cost reduction, the perfect I/O overlap, and the better scalability are all combined, yielding speedups of and over fastlmm and gwfgls, respectively. The time to compute the largest presented problem, , , , and , is reduced from (unfeasible) years to 12 hours.
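The order of magnitude of the combined speedup can be sanity-checked by multiplying the single-core algorithmic speedup by the ratio of the scaling behaviors reported above. This is a rough back-of-the-envelope estimate, not a measured value, and taking 1.6 (the upper end of the tools' 1.4-1.6 range) as their scaling factor is an assumption:

```python
single_core_speedup = 43   # mtgwas-it vs gwfgls on one core (Fig. 6)
mtgwas_scaling = 36.6      # mtgwas-it speedup on 40 cores (Fig. 7)
tool_scaling = 1.6         # assumed scaling plateau of fastlmm/gwfgls

combined = single_core_speedup * mtgwas_scaling / tool_scaling  # ≈ 984
```

The result is consistent with the roughly 1000-fold reduction in time to solution claimed in the conclusions.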
7 Future Work
Multi-trait GWAS is constrained by three dimensions: , ranging from to , , ranging from to , and , ranging from to . The work presented in this paper allows and to grow as large as desired. By contrast, our routines assume the operand to fit in main memory, and thus are constrained by the size of . For very large values of , the problem demands a distributed-memory version of mtgwas, and therefore requires an extension of the analysis of data transfers and work distribution carried out in this paper.
Additionally, there is an increasing demand for support of coprocessors such as GPUs. While the use of GPUs has proven successful for the single-trait case () [Beyer (2012)], routines for the more general multi-trait case are not yet available. There, the challenge lies in writing optimized kernels for the computation within the loops, and in tuning for the intricate memory hierarchy of the architecture.
8 Conclusions
We addressed an extremely challenging and widespread problem in computational biology: the genome-wide association study (GWAS) with multiple traits. GWAS involves large-scale computations (petaflops) on large data sets (terabytes of data), and the existing state-of-the-art tools are only effective in conjunction with supercomputers. In this paper we introduced mtgwas, a novel algorithm for sequences of least-squares problems, tailored to take advantage of both application-specific knowledge and shared-memory parallelism, and demonstrated that a single multi-core node suffices to perform the full GWAS analysis.
First, we described the derivation of an algorithm that exploits all the knowledge available from the application: from the specific two-dimensional sequence of generalized least-squares problems, to the special structure of the operands. By eliminating redundant computations, this algorithm lowers the asymptotic cost of state-of-the-art tools by several orders of magnitude.
Then, we discussed how to deal with large-scale datasets. In order to incorporate an out-of-core mechanism, we analyzed the ratio between data movement and computation, and derived the best tile size and shape for a perfect overlap of data transfers with computation. This mechanism enables the processing of data sets as large as the available secondary storage, without any overhead due to I/O operations.
Finally, we tailored our algorithm to shared-memory parallel architectures. The study of the different sections of the algorithm suggested the use of mixed parallelism: 1) multi-threaded BLAS, and 2) single-threaded BLAS combined with OpenMP parallelism. We empirically estimated the best size for the computational tasks, and studied two different approaches to distribute those tasks among threads. The resulting routines attain speedups of 35x and 36.6x on 40 cores.
By combining the effects of the computational cost reduction, the perfect I/O overlap, and high scalability, our routines yield a 1000-fold reduction in time to solution when compared to the state-of-the-art tools. Thanks to this algorithm, analyses that were thought to be feasible only with the help of supercomputers can now be completed in a matter of hours on a single multi-core node.
9 Acknowledgements
Financial support from the Deutsche Forschungsgemeinschaft (German Research Association) through grant GSC 111 is gratefully acknowledged. The authors thank Matthias Petschow for discussion on the algorithms, and the Center for Computing and Communication at RWTH Aachen for the computing resources.
References
 Agullo et al. (2007) Agullo, E. and Guermouche, A. 2007. Towards a parallel out-of-core multifrontal solver: Preliminary study. Research report 6120, INRIA.
 Anderson et al. (1999) Anderson, E., Bai, Z., Bischof, C., Blackford, S., Demmel, J., Dongarra, J., Du Croz, J., Greenbaum, A., Hammarling, S., McKenney, A., and Sorensen, D. 1999. LAPACK Users’ Guide, Third ed. Society for Industrial and Applied Mathematics, Philadelphia, PA.
 Aulchenko et al. (2007) Aulchenko, Y. S., Ripke, S., Isaacs, A., and van Duijn, C. M. 2007. GenABEL: an R library for genome-wide association analysis. Bioinformatics 23, 10 (May), 1294–1296.
 Beyer (2012) Beyer, L. 2012. Exploiting graphics accelerators for computational biology. M.S. thesis, Aachen Institute for Computational Engineering Science, RWTH Aachen.
 Bientinesi et al. (2010) Bientinesi, P., Eijkhout, V., Kim, K., Kurtz, J., and van de Geijn, R. 2010. Sparse direct factorizations through unassembled hypermatrices. Computer Methods in Applied Mechanics and Engineering 199, 430–438.
 Blackford et al. (1997) Blackford, L. S., Choi, J., Cleary, A., D’Azevedo, E., Demmel, J., Dhillon, I., Dongarra, J., Hammarling, S., Henry, G., Petitet, A., Stanley, K., Walker, D., and Whaley, R. C. 1997. ScaLAPACK Users’ Guide. Society for Industrial and Applied Mathematics, Philadelphia, PA.
 Di Napoli et al. (2012) Di Napoli, E., Blügel, S., and Bientinesi, P. 2012. Correlations in sequences of generalized eigenproblems arising in density functional theory. Computer Physics Communications (CPC) 183, 8 (Aug.), 1674–1682.
 Dongarra et al. (1990) Dongarra, J., Croz, J. D., Hammarling, S., and Duff, I. S. 1990. A set of level 3 basic linear algebra subprograms. ACM Trans. Math. Softw. 16, 1, 1–17.
 Fabregat-Traver et al. (2012) Fabregat-Traver, D., Aulchenko, Y. S., and Bientinesi, P. 2012. Fast and scalable algorithms for genome studies. Tech. rep., Aachen Institute for Advanced Study in Computational Engineering Science. Available at http://www.aices.rwthaachen.de:8080/aices/preprint/documents/AICES20120501.pdf.
 Grimes (1988) Grimes, R. G. 1988. Solving systems of large dense linear equations. The Journal of Supercomputing 1, 291–299. DOI: 10.1007/BF00154340.
 Hindorff et al. Hindorff, L., MacArthur, J., Wise, A., Junkins, H., Hall, P., Klemm, A., and Manolio, T. A catalog of published genome-wide association studies. Available at: www.genome.gov/gwastudies. Accessed July 22, 2012.
 Lauc et al. (2010) Lauc, G. et al. 2010. Genomics Meets Glycomics: The First GWAS Study of Human N-Glycome Identifies HNF1 as a Master Regulator of Plasma Protein Fucosylation. PLoS Genetics 6, 12 (Dec), e1001256.
 Levy et al. (2009) Levy, D. et al. 2009. Genomewide association study of blood pressure and hypertension. Nature Genetics 41, 6 (Jun), 677–687.
 Lippert et al. (2011) Lippert, C., Listgarten, J., Liu, Y., Kadie, C. M., Davidson, R. I., and Heckerman, D. 2011. Fast linear mixed models for genome-wide association studies. Nature Methods 8, 10 (Oct), 833–835.
 Paige (1979) Paige, C. C. 1979. Fast numerically stable computations for generalized linear least squares problems. SIAM Journal on Numerical Analysis 16, 1, 165–171.
 Speliotes et al. (2010) Speliotes, E. K. et al. 2010. Association analyses of 249,796 individuals reveal 18 new loci associated with body mass index. Nature Genetics 42, 11 (Nov), 937–948.
 Toledo (1999) Toledo, S. 1999. External memory algorithms. American Mathematical Society, Boston, MA, USA, Chapter A survey of outofcore algorithms in numerical linear algebra, 161–179.