Solving Sequences of Generalized LeastSquares Problems on Multithreaded Architectures
Abstract
Generalized linear mixedeffects models in the context of genomewide association studies (GWAS) represent a formidable computational challenge: the solution of millions of correlated generalized leastsquares problems, and the processing of terabytes of data. We present high performance incore and outofcore sharedmemory algorithms for GWAS: By taking advantage of domainspecific knowledge, exploiting multicore parallelism, and handling data efficiently, our algorithms attain unequalled performance. When compared to GenABEL, one of the most widely used libraries for GWAS, on a 12core processor we obtain 50fold speedups. As a consequence, our routines enable genome studies of unprecedented size.
V \acmNumberN \acmYearYY \acmMonthMonth \categoryG.4Mathematical Software[Algorithm design and analysis,Efficiency] \termsAlgorithms, Performance
Authors’ addresses: Diego Fabregat, AICES, RWTH Aachen, Aachen, Germany, fabregat@aices.rwthaachen.de. Yurii Aulchenko, Institute of Cytology and Genetics, SD RAS, Novosibirsk, Russia, yurii.aulchenko@gmail.com. Paolo Bientinesi, AICES, RWTH Aachen, Aachen, Germany, pauldj@aices.rwthaachen.de.
1 Introduction
Generalized linear mixedeffects models (GLMMs) are a type of statistical model widespread in many different disciplines such as genomics, econometrics, and social sciences [Teslovich et al. (2010), Antonio and Beirlant (2007), Gibbons et al. (2010)]. Applications based on GLMMs face two computational challenges: the solution of a sequence comprising millions of generalized leastsquares problems (GLSs), and the processing of data sets so large that they only fit in secondary storage devices. In this paper, we target the computation of GLMMs in the context of genomewide association studies (GWAS).
GWAS is the tool of choice to analyze the relationship between DNA sequence variations and complex traits such as diabetes and coronary heart diseases [Lauc et al. (2010), Levy et al. (2009), Speliotes et al. (2010)]. More than 1400 papers published during the last five years endorse the relevance of GWAS [Hindorff, MacArthur, Wise, Junkins, Hall, Klemm, and Manolio (Hindorff et al.)]. The GLMM specific to GWAS solves the equation
(1) 
where is the vector of observations, representing a given trait or phenotype; is the design matrix, including covariates and genome measurements; represents dependencies among observations; and represents the relation between a variation in the genome sequence () and a variation in the trait (). In linear algebra terms, Eq. (1) solves a linear regression with nonindependent outcomes where , is full rank, is symmetric positive definite (SPD), and ; the sizes are as follows: , , and , the length of the sequence, ranges from to . The quantities , , and are known. Additionally, the ’s present a special structure that will prove to be critical for performance: each may be partitioned as , where is the same for all ’s.
1.1 Limitations
Computational biologists performing GWAS aim for the sizes described above; in a typical scenario, 3 Terabytes of data have to be processed through arithmetic operations (Petaflops). In practice, current GWAS solvers are constrained to much smaller problems due to time limitations. For instance, in [Aulchenko et al. (2010)], the authors carry out a study that takes almost 4 hours for the following problem sizes: , , and . The time to perform the same study for million is estimated to be roughly 43 hours. With our routines, the time to complete the latter reduces to 10 minutes.
1.2 Terminology
We collect and give a brief description of the acronyms used throughout the paper.

gwas: GenomeWide Association Studies

gls: Generalized LeastSquares problems

GenABEL: One of the most widely used frameworks to perform gwas

gwfgls: GenABEL’s stateoftheart routine for the solution of Eq. (1)

hpgwas: our novel incore solver for Eq. (1)

oochpgwas: outofcore version of hpgwas.
Table 1 enumerates the BLAS (Basic Linear Algebra Subprograms) [Dongarra et al. (1990)] and LAPACK (Linear Algebra PACKage) [Anderson et al. (1999)] routines used in the algorithms presented in this paper. LAPACK and BLAS are the defacto standard libraries for highperformance dense linear algebra computations.
BLAS 1 and 2  
dot  Dot product  
gemv  Matrixvector product  
trsv  Triangular system with single righthand side  
BLAS 3  
gemm  Matrixmatrix product  
syrk  Rankk update  
trsm  Triangular system with multiple righthand sides  
LAPACK  
getri  Inversion of a general matrix  
gesv  General system with multiple righthand sides  
posv  SPD system with multiple righthand sides  
potrf  Cholesky factorization  
gels  Solution of a leastsquares problem  
ggglm  Solution of a general GaussMarkov linear model 
1.3 Related Work
Traditionally, LAPACK is the tool of choice to develop highperformance algorithms and routines for linear algebra operations. Although LAPACK does not support the solution of a single GLS directly, it offers routines for closely related problems: gels for least squares problems, and ggglm for the general GaussMarkov linear model. Algorithms 1 and 2 provide examples for the reduction of GLS problems to gels and ggglm, respectively. Unfortunately, none of the algorithms provided by LAPACK is able to exploit the sequence of GLSs within GWAS, nor the specific structure of its operands. Conversely, existing adhoc routines for Eq. (1), such as the widely used gwfgls, are aware of the specific knowledge arising from the application, but exploit it in a suboptimal way.
LAPACK and GenABEL present additional drawbacks: LAPACK routines are incore, i.e., data must fit in main memory; since GWAS may involve terabytes of data, it is in general not feasible to use these routines directly. Contrarily, GenABEL incorporates an outofcore mechanism, but it suffers from significant overhead.
1.4 Contributions
We present highperformance incore and outofcore algorithms, hpgwas and oochpgwas, and their corresponding routines for the computation of GWAS on multithreaded architectures.
Our algorithms are optimized not for a single instance of the GLS problem but for the whole sequence of such problems. This is accomplished by

breaking the black box structure of traditional libraries, which impose a separate routine call for each individual GLS,

exploiting domainspecific knowledge such as the particular structure of the operands,

grouping successive problems, allowing the use of high performance kernels at their full potential, and

organizing the computation to use multiple types of parallelism.
When combined, these optimizations lead to an incore routine that outperforms GenABEL’s gwfgls by a factor of 50.
Additionally, we enable the solution of very large sequences of problems by incorporating an efficient outofcore mechanism to our incore routine. Thanks to this extension, the outofcore routine is capable of sustaining the high performance of the incore one for data sets as large as the secondary storage.
1.5 Organization
Section 2 details, through a series of improvements, how hpgwas exploits both domainspecific knowledge and the BLAS library to attain high performance and scalability. In Section 3, we quantify the gain of each improvement and present a performance comparison between hpgwas and gwfgls. Section 4 exposes the key ideas behind the outofcore mechanism leading to oochpgwas, which maintains hpgwas performance for very large sets of data. Outofcore results are provided in Section 5. We discuss future work in Section 6, and draw conclusions in Section 7.
2 Incore algorithm
We commence the discussion by describing the incremental steps to transform a generic algorithm for the solution of a single GLS problem into a highperformance algorithm that 1) solves a sequence of GLS problems, 2) exploits GWASspecific knowledge, and 3) exploits multicore parallelism. The resulting algorithm is then used in Section 4 as a starting point towards a highperformance outofcore algorithm.
Algorithm 3 solves a generic GLS problem. The approach consists in first reducing the GLS to a linear leastsquares problem (as shown in Algorithm 1), and then solving the associated normal equations , where the coefficient matrix is full rank and . To this end, Algorithm 3 first factors via a Cholesky factorization: ; and then, it solves the systems and . Several alternatives exist for the solution of the normal equations; for a detailed discussion we refer the reader to [Golub and Van Loan (1996), Å. Björck (1996)]. Numerical considerations allow us to safely rely on the Cholesky factorization of the SPD matrix without incurring instabilities. The algorithm completes by computing and solving the linear system . For each operation in the algorithm, we specify in brackets the corresponding BLAS/LAPACK routine.
Algorithm 3 solves a single GLS problem. The algorithm may be used to solve a sequence of problems in a black box fashion, i.e., for each individual coefficient matrix , use Algorithm 3 to solve the corresponding GLS problem. As the reader might have noticed, this approach leads to a considerable amount of redundant computation. We avoid the black box approach, and exploit the fact that we are solving a sequence of correlated problems. A closer look at Algorithm 3 reveals that, since only varies from problem to problem, operations at lines 1 and 3 may be performed once and reused across the sequence. The resulting Algorithm 4 greatly reduces the computation performed using a black box approach.
Although Algorithm 4 already solves a sequence of GLS problems, it is still suboptimal in a number of ways. The first crucial step towards high performance is a reorganization of the computation. A large percent of the computation in the loop is carried out by the trsm operation at line 4. Even though trsm is a BLAS3 operation, the fact that the system is solved for, at most, 20 righthand sides does not allow trsm to reach its peak performance; thus, the overall performance is affected. To overcome this limitation, we take advantage again from the sequence of problems: we group multiple trsms corresponding to successive problems into a larger trsm with enough righthand sides to deliver its maximum performance, i.e., , where represents the collection of all ’s: ().
As a further improvement, we focus on the knowledge specific to GWAS: the special structure of . Each individual may be partitioned in , where is fixed; thus the trsm operation may be split into two trsms: and , where represents the collection of all ’s: (). Additionally, the fact that is fixed allows for more computation reuse: as shown in Fig. 1, the top left part of , and the top part of are also fixed. The resulting algorithm is assembled in Algorithm 5, hpgwas.
2.1 GenABEL’s gwfgls
For completeness we provide in Algorithm 6 the algorithm implemented by GenABEL’s gwfgls routine. The algorithm takes advantage from the specific structure of GWAS by computing lines 1 and 2 once, and reusing the results across the sequence of problems. Unfortunately, a number of choices prevent it from attaining high performance:

the inversion of (line 1) performs 6 times more computation than a Cholesky factorization of ,

line 2 breaks the symmetry of the expression , which translates into doubling the amount of computation performed,

the BLAS2 operation at line 4 (gemv) could be cast as a single BLAS3 gemm involving all ’s (what we called in Algorithm 5); gwfgls does not include this improvement, thus it does not benefit from gemm’s high performance.
2.2 Computational cost
Table 2 includes the asymptotic cost of Algorithms 3 – 6 together with the ratio over our best algorithm, hpgwas. A discussion of the provided data follows.
Computational cost  Ratio Alg.# / Alg. 5  

Alg. 3 (blackbox)  
Alg. 4 (seqgls)  
Alg. 5 (hpgwas)  
Alg. 6 (gwfgls) 

The solution of a single GLS problem via Algorithm 3, blackbox, has a computational cost of . The solution of a sequence of such problems using this algorithm as a black box entails thus flops, corresponding to the computation of Cholesky factorizations. Clearly, this is not the best approach to solve a sequence of correlated problems: it performs times more operations than hpgwas.

The key insight in Algorithm 4, seqgls, is to take advantage from the fact that we are solving not one but a sequence of correlated problems. Based on an analysis of dependencies, the algorithm breaks the rigidity of blackbox, and rearranges the computation. As a result, the computational cost is reduced by a factor of . Even though seqgls represent a great improvement with respect to blackbox, it is still not optimal for GWAS: a further reduction of redundant computation is possible.

hpgwas incorporates two further optimizations to overcome the limitations of seqgls. First, the algorithm exposes the structure of , and the quantities computed from it, completely eliminating redundant operations. Then, the computation is carefully reorganized to exploit the full potential of the underlying libraries, resulting in an extremely efficient algorithm (see Fig. 2).

As for gwfgls, it benefits from both the sequence of problems and the specific structure of . Unfortunately, the algorithm fails at exploiting the existing symmetries, thus performing twice as much computation as hpgwas. Additionally, the algorithm is not properly designed to benefit from the highlyoptimized BLAS library, having a negative impact on its performance.
2.3 Parallelism
hpgwas relies on a set of kernels provided by the highlyoptimized BLAS and LAPACK libraries. In this situation, a straightforward approach to target multicore architectures is to link the routine to a multithreaded version of the libraries. While the first section of hpgwas (lines 1 to 6) benefits from this approach, showing high scalability, the second section (lines 7 to 11) does not scale. Therefore the weight of the second, although small in the sequential case, increases with the number of cores, affecting the overall scalability. To address this shortcoming we use a different parallelization scheme for the two sections: multithreaded BLAS for lines 1 to 6, and OpenMP parallelism with singlethreaded BLAS for lines 7 to 11. As we show in the next section, the resulting routine is highly scalable.
3 Performance results (I)
We turn now the attention towards the experimental results. We first report on timings for all four presented algorithms for the sequential case; the goal is to show and discuss the effect of the improvements described in the previous section. Then we focus on hpgwas and gwfgls; we concentrate on timings for the multithreaded versions of the routines and their scalability.
3.1 Experimental setup
As a computing environment we chose an architecture that we believe is readily available to most computational scientists. All four algorithms were implemented in C. Although GenABEL’s interface is written in R, gwfgls and most of its routines are written in C. We ran all tests on a SMP system made of two Intel Xeon X5675 multicore processors. Each processor has six cores, operating at a frequency of 3.06 GHz, for a combined peak performance of 146.88 GFlops/sec. The system is equipped with 32GB of RAM and 1TB of disk as secondary memory. We compiled the routines with the GNU C Compiler (gcc, version 4.4.5), and linked to a multithreaded Intel’s MKL library (version 10.3). hpgwas also makes use of the OpenMP parallelism provided by the compiler through a number of pragma directives.
3.2 Results
Fig. 2 shows the timings of all four algorithms for an increasing value of , the number of GLS problems to be solved. The experiments were run using a single thread. The results for blackbox exemplify the limitations of solving a sequence of correlated problems as if they are unrelated: no matter how optimized the algorithm is for a single instance, it cannot compete with algorithms specially tailored to solve the sequence as a whole. As a first step towards high performance, seqgls reuses computation across the sequence of problems. Consequently, the algorithm reduces dramatically the execution time of the naive blackbox approach, leading to a speedup greater than 250.
hpgwas further reduces the execution time of seqgls by a factor of 12. The gain is explained by the effect of two optimizations. On the one hand, hpgwas exploits applicationspecific knowledge, the structure of , leading to a speedup of (larger values of result in even larger speedups). On the other hand, the computation is reorganized taking into account highperformance considerations. It is a common misconception that every BLAS routine attains the same efficiency. However, due to architectural constraints such as memory hierarchy and associated latency, BLAS3 routines attain higher efficiency than BLAS1 and BLAS2. Therefore, rewriting multiple trsvs (BLAS2) as a single large trsm (BLAS3), our algorithm achieves an extra speedup of 3. As shown in Fig. 2, hpgwas is an efficient algorithm to carry out GWAS; it attains 94% of the architecture’s peak performance.
Although gwfgls is aware of the specific properties of GWAS and benefits from such knowledge, the algorithm suffers from inefficiencies similar to seqgls: it still performs redundant computation, and it is not properly tailored to benefit from BLAS3 performance. The combination of both shortcomings results in a routine that is 8 times slower than hpgwas.
Henceforth, we concentrate on hpgwas and gwfgls. In Fig. 3 we report on the scalability of both algorithms. As the figure reflects, while gwfgls barely reaches a speedup of 2, completely stalling after 6 cores are used, hpgwas attains a speedup of almost 11 when using 12 cores. Most interestingly, the tendency clearly shows that larger speedups are expected for hpgwas when increasing the number of cores available.
The disparity in the scalability of these two algorithms is mainly due to their use of the BLAS library. In the case of gwfgls, the algorithm casts most of the computation in terms of the BLAS2 operation gemv, which, being a memorybound operation, is limited not only in performance but also in scalability. Instead, as described earlier, hpgwas mainly builds on top of trsm (BLAS3), which attains high scalability when operating on a large number of righthand sides.
We provide in Fig. 4 timings for both algorithms when using 12 threads. As expected, the speedup of hpgwas with respect to gwfgls soars from 8 to 50.
4 Outofcore algorithm
So far, we have developed an algorithm for GWAS that overcomes the limitations of current approaches. It

solves a sequence of GLS problems,

exploits the available knowledge specific to GWAS, and

achieves high performance and scalability.
However, the algorithm presents a critical limitation: data must fit in main memory. The most common scenarios of GWAS require the processing of data sets that greatly exceed common main memory capacity: in a typical scenario, where 36 millions of GLS problems are to be solved with , the size of the input operand is roughly 3 terabytes. To overcome this limitation, we turn our attention to outofcore algorithms [Toledo (1999)]. The goal is to design algorithms that make a proper use of available input/output (I/O) mechanisms to deal with data sets as large as the harddrive size, while sustaining incore high performance.
We regard the solution of GWAS as a process that takes as input a large stream of data, corresponding to successive GLS problems, and generates as output a large stream of data corresponding to the solution of such problems; thus, it demands outofcore algorithms that efficiently stream data from secondary storage to main memory and vice versa.
We compare two approaches to data streaming. The first, used by GenABEL, is based on nonoverlapping synchronous I/O; because of wait states, this approach introduces a considerable overhead in the execution time. The second, based on the wellknown doublebuffering technique, allows the overlapping of I/O with computation; thanks to the overlapping, wait states, and the associated overhead, are reduced or even completely eliminated.
4.1 Nonoverlapping approach
The application of nonoverlapping synchronous I/O to our incore algorithm (hpgwas) results in Algorithm 7. The algorithm first computes the operations common to every GLS problem (lines 1 to 5) and then iterates over the stream of ’s (lines 6 to 14). At each iteration, the following actions are performed:

read the ’s for a block of successive GLS problems,

compute the solutions, ’s, of such problems, and

write the ’s to disk.
Both I/O requests (lines 7 and 14) are synchronous: after the requests are issued, the processor enters a wait state until the I/O transfer has completed. Fig. 5 depicts this shortcoming: red (dark) regions represent computation stalls where the processor waits for data to be read or written; blue (light) regions represent actual computation. Since loading data from secondary memory is orders of magnitude slower than loading data from main memory, I/O operations introduce a considerable overhead that negatively impacts performance. For the scenario described above, in which , , and , synchronous I/O applied to hpgwas causes a 5% to 10% overhead.
4.2 Overlapping approach  Double buffering
To put double buffering into practice, the main memory is split into two workspaces: one for downloading and uploading data and one for computation. Also, the data streams are divided into blocks such that they fit in the corresponding workspaces. While iterating over the blocks, the workspaces alternate their role, allowing the overlapping of I/O with computation, and reducing or even eliminating the overhead due to I/O. Specifically for GWAS, both workspaces are subdivided in individual buffers, one for each operand to be streamed, and . As illustrated in Fig. 6, at iteration , results from the previous iteration are located in Workspace1::b and input data for next iteration is to be loaded in Workspace1::XR. Simultaneously, GLS problems corresponding to the current iteration are computed and stored in Workspace2::b. At iteration the workspaces exchange their role: Workspace1 is used for computation, and Workspace2 is used for I/O.
It remains to be addressed how the downloading and uploading of data is actually performed in the background while computation is being carried out. Our approach is based on the use of asynchronous libraries, which allow a process to request the prefetching of data needed for the next iteration: data is loaded in the background while the process carries out computation with the current data set.
In Algorithm 8 we provide the outofcore algorithm oochpgwas that applies doublebuffering to hpgwas. At each iteration over the blocks of data (lines 6 to 16), the algorithm performs the following steps:

request the loading of the next block of input data (XR[i+1]),

wait, if necessary, for the current block of data (XR[i]),

compute the current set of problems defined by the current set of data,

request the storage of current results (b[i]), and

wait, if necessary, until previous results are stored (b[i1]).
As illustrated in Fig. 7, a perfect overlapping of I/O with computation means that no I/O is exposed and no processor idles waiting for I/O operations.
4.3 Sustaining incore high performance
A perfect overlapping is only one of two requirements for the outofcore routine to sustain incore high performance. The second is to ensure that the operations within the loop over the stream of data (lines 6 to 16) attain the same efficiency as in the incore routine. Both requirements depend on the number of threads and the block size, i.e., the number of ’s loaded at each iteration.
To completely eliminate the overhead due to data movement from disk to memory and vice versa, the following equation must hold:
The block size has to be large enough to ensure that for each iteration the time spent in computing is larger than the time spent on storing the previous results and loading data for the next iteration. Since the computation time varies with the number of threads, the block size needs to be adjusted accordingly.
Although it may seem that the best approach to select a block size is to simply maximize memory usage, the initial overhead must be taken into account: the loading of the first block of data is not overlapped with computation. In systems equipped with large amounts of main memory, it is advised to initiate the computation with a small block size to avoid exposed I/O, and increase it after a few iterations.
5 Performance results (II)
In this section, we focus on the experimental results for oochpgwas, our outofcore routine. To measure the performance of the incorporated outofcore mechanism, we compare the timings with those of the incore routine, previously shown in Fig. 4. We complete the picture with a comparison between the timings of our routines and those of the incore and outofcore implementations of gwfgls. All routines were written in C, and the experiments were run in the same environment as Section 3. In addition, oochpgwas utilizes the AIO (asynchronous input/output) library, available on UNIX systems as part of their standard libraries.
In Fig. 8, we combine timings for both the incore routine hpgwas and the outofcore routine oochpgwas. The incore routine is used for problems whose data sets fit in main memory, and we switch to the outofcore routine for larger problems. The vertical line indicates the size of the largest problem that can be solved incore. The figure shows that, thanks to the doublebuffering and an appropriate choice of the block size, oochpgwas achieves a perfect overlapping of I/O with computation. As a consequence, oochpgwas is able to sustain incore performance for problems as large as the harddrive size.
In Table 3, we collect timings for both our routines and gwfgls in both incore and outofcore scenarios. The provided ratios confirm the impact of using a suboptimal approach to outofcore: While, as seen in Fig. 8, the overlapping I/O mechanism incorporated in oochpgwas sustains incore performance, the nonoverlapping approach in gwfgls results in a 10% to 15% overhead. As a consequence, the speedup over gwfgls raises from 50 to 58.
gwfgls  429.2  
*hpgwas  10.9  43.0  82.6  414  816  
Ratio  39.2  48.1  49.9  58.1  58.9  57.5  57.7 
The largest tests presented, involving 10 millions of genetic markers (’s), took less than 2.5 hours with oochpgwas. This means that a complete genomewide scan of association between 36 millions of genetic markers in a population of individuals takes now slightly more than 8 hours, and is only limited by the availability of a large (and cheap) secondary storage device.
6 Future Work
As current biomedical research experiences a large boost in the amount of available genomic data, computational biologists are eager to solve problems of ever increasing size. Even though the time spent to perform genomewide analysis is reduced significantly thanks to the techniques presented in this paper, further speedups are required to satisfy future needs.
Throughout the paper we assumed, as is the case in current analyses, that the matrix fits in main memory. In this context, a further reduction of execution time can be achieved through distributedmemory architectures via an MPI parallelization: the processes would first perform the Cholesky factorization of redundantly, and then operate on distinct chunks of . In addition, if GPUs with enough memory to host were available, the routines could be further sped up by offloading the computation of the trsm (line 8 in Alg. 8) onto the devices.
When instead does not fit in main memory, one should rely on approaches based on outofcore algorithmsbyblocks [QuintanaOrtí et al. (2009), QuintanaOrtí et al. (2012)] and libraries such as ScaLAPACK [Blackford et al. (1997)] and Elemental [Poulson et al. (2012)], respectively for sharedmemory and distributedmemory architectures.
7 Conclusions
We tackled a problem, extremely common in bioinformatics, that requires both highperformance computing and storage. Neither general nor domainspecific libraries provide a viable solution. Indeed, due to the expected execution time and storage requirements, it was believed that these problems could be solved exclusively with the aid of supercomputers. This paper instead demonstrates that a single multicore node suffices.
We presented highperformance algorithms, and their corresponding implementations, for the solution of sequences of generalized leastsquares problems (GLSs) in the context of genomewide association studies (GWAS). When compared to the widely used gwfgls routine from the GenABEL package, our routines attain speedups larger than 50.
Our routines are specifically tailored for multithreaded architectures. We followed an incremental approach: starting from an algorithm to solve one single GLS, we detailed the steps towards a highperformance algorithm for GWAS. At each step, we identified the limitations of current existing libraries and tools, and described the key insight to overcome such limitations.
First, we showed that no matter how optimized is a routine to solve a single GLS instance, it cannot possibly compete with tools specifically designed for GWAS: it is imperative to take advantage of the sequence of correlated problems. This discards the blackbox approach of traditional libraries.
Then, we identified gwfgls’ issues regarding efficiency, scalability, and data handling, and detailed how we addressed them. Taking advantage of problem symmetries and applicationspecific knowledge, we were able to completely eliminate redundant computation. Next, we pointed out that even BLAS3 kernels might suffer from low efficiency. A careful rearrangement of the operations leads to an efficiency of 94%. Combining two kinds of parallelism –a multithreaded version of BLAS and OpenMP parallelism–, our incore solver attains speedups close to 11 on 12 cores.
Finally, thanks to an adequate utilization of the doublebuffering technique, allowing for a perfect overlapping of data transfers with computation, our outofcore routine not only inherits incore efficiency and scalability, but it is also capable of sustaining the achieved high performance for problems as large as the available secondary storage.
As an immediate result, our routines enable genomewide association studies of unprecedented size and shift the limitation from computation time to size of secondary storage devices.
Acknowledgments
Financial support from the Deutsche Forschungsgemeinschaft (German Research Association) through grant GSC 111 is gratefully acknowledged. Y. Aulchenko was funded by grants from the Russian Foundation of Basic Research (RFBR), the Helmholtz society (RFBRHelmholtz Joint Research Groups), and the MIMOmics project supported by FP7. The authors thank Matthias Petschow for discussion on the algorithms, and the Center for Computing and Communication at RWTH Aachen for the computing resources.
References
 Anderson et al. (1999) Anderson, E., Bai, Z., Bischof, C., Blackford, S., Demmel, J., Dongarra, J., Du Croz, J., Greenbaum, A., Hammarling, S., McKenney, A., and Sorensen, D. 1999. LAPACK Users’ Guide, Third ed. Society for Industrial and Applied Mathematics, Philadelphia, PA.
 Antonio and Beirlant (2007) Antonio, K. and Beirlant, J. 2007. Actuarial statistics with generalized linear mixed models. Insurance: Mathematics and Economics 40, 1, 58 – 76.
 Aulchenko et al. (2010) Aulchenko, Y. S., Struchalin, M., and van Duijn, C. 2010. Probabel package for genomewide association analysis of imputed data. BMC Bioinformatics 11, 1, 134.
 Blackford et al. (1997) Blackford, L. S., Choi, J., Cleary, A., D’Azevedo, E., Demmel, J., Dhillon, I., Dongarra, J., Hammarling, S., Henry, G., Petitet, A., Stanley, K., Walker, D., and Whaley, R. C. 1997. ScaLAPACK Users’ Guide. Society for Industrial and Applied Mathematics, Philadelphia, PA.
 Dongarra et al. (1990) Dongarra, J., Croz, J. D., Hammarling, S., and Duff, I. S. 1990. A set of level 3 basic linear algebra subprograms. ACM Trans. Math. Softw. 16, 1, 1–17.
 Gibbons et al. (2010) Gibbons, R. D., Hedeker, D., and DuToit, S. 2010. Advances in Analysis of Longitudinal Data. Annual Review of Clinical Psychology, vol. 6. ANNUAL REVIEWS, 4139 EL CAMINO WAY, PO BOX 10139, PALO ALTO, CA 943030897 USA, 79–107.
 Golub and Van Loan (1996) Golub, G. H. and Van Loan, C. F. 1996. Matrix computations (3rd ed.). Johns Hopkins University Press, Baltimore, MD, USA.
 Hindorff, MacArthur, Wise, Junkins, Hall, Klemm, and Manolio (Hindorff et al.) Hindorff, L., MacArthur, J., Wise, A., Junkins, H., Hall, P., Klemm, A., and Manolio, T. A catalog of published genomewide association studies. Available at: www.genome.gov/gwastudies. Accessed July 22nd, 2012.
 Lauc et al. (2010) Lauc, G. et al. 2010. Genomics Meets Glycomics–The First GWAS Study of Human NGlycome Identifies HNF1 as a Master Regulator of Plasma Protein Fucosylation. PLoS Genetics 6, 12 (12), e1001256.
 Levy et al. (2009) Levy, D. et al. 2009. Genomewide association study of blood pressure and hypertension. Nature Genetics 41, 6 (Jun), 677–687.
 Poulson et al. (2012) Poulson, J., Marker, B., van de Geijn, R. A., Hammond, J. R., and Romero, N. A. 2012. Elemental: A new framework for distributed memory dense matrix computations. ACM Trans. Math. Softw.. To appear.
 QuintanaOrtí et al. (2012) QuintanaOrtí, G., Igual, F. D., Marqués, M., QuintanaOrtí, E. S., and Van de Geijn, R. A. 2012. A runtime system for programming outofcore matrix algorithmsbytiles on multithreaded architectures. ACM Transactions on Mathematical Software 38, 4.
 QuintanaOrtí et al. (2009) QuintanaOrtí, G., QuintanaOrtí, E. S., Van de Geijn, R. A., Van Zee, F. G., and Chan, E. 2009. Programming matrix algorithmsbyblocks for threadlevel parallelism. ACM Trans. Math. Softw. 36, 3 (July), 14:1–14:26.
 Å. Björck (1996) Å. Björck. 1996. Numerical Methods for Least Squares Problems. SIAM, Philadelphia, USA.
 Speliotes et al. (2010) Speliotes, E. K. et al. 2010. Association analyses of 249,796 individuals reveal 18 new loci associated with body mass index. Nature Genetics 42, 11 (Nov), 937–948.
 Teslovich et al. (2010) Teslovich, T. M. et al. 2010. Biological, clinical and population relevance of 95 loci for blood lipids. Nature 466, 7307 (Aug), 707–713.
 Toledo (1999) Toledo, S. 1999. External memory algorithms. American Mathematical Society, Boston, MA, USA, Chapter A survey of outofcore algorithms in numerical linear algebra, 161–179.
eceived Month Year; revised Month Year; accepted Month Year