Programming Parallel Dense Matrix Factorizations with Look-Ahead and OpenMP

Sandra Catalán*, Adrián Castelló*, Francisco D. Igual†, Rafael Rodríguez-Sánchez†, Enrique S. Quintana-Ortí*

*Depto. Ingeniería y Ciencia de Computadores, Universidad Jaume I, Castellón, Spain. {catalans,adcastel,quintana}@icc.uji.es
†Depto. de Arquitectura de Computadores y Automática, Universidad Complutense de Madrid, Spain. {figual,rafaelrs}@ucm.es
July 31, 2019
Abstract

We investigate a parallelization strategy for dense matrix factorization (DMF) algorithms, using OpenMP, that departs from the legacy (or conventional) solution, which simply extracts concurrency from a multi-threaded version of BLAS. This approach is also different from the more sophisticated runtime-assisted implementations, which decompose the operation into tasks and identify dependencies via directives and runtime support. Instead, our strategy attains high performance by explicitly embedding a static look-ahead technique into the DMF code, in order to overcome the performance bottleneck of the panel factorization, and by realizing the trailing update via a cache-aware multi-threaded implementation of the BLAS. Although the parallel algorithms are specified at a high level of abstraction, the actual implementation can be easily derived from them, paving the road to a high performance implementation of a considerable fraction of LAPACK functionality on any multicore platform with an OpenMP-like runtime.

1 Introduction

Dense linear algebra (DLA) lies at the bottom of the “food chain” for many scientific and engineering applications, which require numerical kernels to tackle linear systems, linear least squares problems or eigenvalue computations, among other problems [Dem97]. In response, the scientific community has created the Basic Linear Algebra Subroutines (BLAS) and the Linear Algebra Package (LAPACK) [BLAS3, lapack]. These libraries standardize domain-specific interfaces for DLA operations that aim to ensure performance portability across a wide range of computer architectures.

For multicore processors, the conventional approach to exploit parallelism in the dense matrix factorization (DMF) routines implemented in LAPACK has relied, for many years, on the use of a multi-threaded BLAS (MTB). Current high performance instances of this library of basic building blocks include Intel MKL [MKL], IBM ESSL [ESSL], GotoBLAS [Goto:2008:AHM:1356052.1356053, Goto:2008:HPI], OpenBLAS [OpenBLAS], ATLAS [atlas] and BLIS [BLIS1]. These implementations exert a strict control over the data movements and can be expected to make an extremely efficient use of the cache memories. Unfortunately, for complex DLA operations, this approach constrains the concurrency that can be leveraged, by imposing an artificial fork-join model of execution on the algorithm. Specifically, with this solution, parallelism does not expand across multiple invocations of BLAS kernels, even if they are independent and could therefore be executed in parallel.

The increase in hardware concurrency of multicore processors in recent years has led to the development of parallel versions of some DLA operations that exploit task-parallelism via a runtime (RTM). Relevant examples include the efforts with OmpSs [ompssweb], PLASMA-Quark [plasmaweb], StarPU [starpuweb], Chameleon [chamaleonweb] and libflame-SuperMatrix [flameweb]. In brief, the task-parallel RTM-assisted parallelizations decompose a DLA operation into a collection of fine-grained tasks, interconnected with dependencies, and issue the execution of each task to a single core, simultaneously executing independent tasks on different cores while fulfilling the dependency constraints. The RTM-based solution is better equipped to tackle the increasing number of cores of current and future architectures, because it leverages the natural concurrency that is present in the algorithm. However, with this type of solution, the cores compete for the shared memory resources and may not amortize completely the overhead of invoking the BLAS to perform fine-grain tasks [catalan17].
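To make this mechanism concrete, the following minimal sketch (our own illustration, not the code of any of the cited runtimes; factor_panel and update_block are hypothetical stand-ins for sequential BLAS-based kernels) shows how a panel factorization and the per-block updates of the trailing submatrix can be expressed as OpenMP tasks whose ordering is derived from in/out dependencies, in contrast with the fork-join execution imposed by a multi-threaded BLAS:

#define NB 8   // number of blocks in the trailing submatrix (assumed)

// Hypothetical sequential kernels standing in for single-threaded BLAS calls;
// the bodies below are placeholders only.
static void factor_panel(double *panel)                     { panel[0] += 1.0; }
static void update_block(const double *panel, double *blk)  { blk[0] += panel[0]; }

void factor_step_rtm(double *panel, double *blocks[NB])
{
  #pragma omp parallel
  #pragma omp single
  {
    // The panel factorization writes the panel ...
    #pragma omp task depend(inout: panel[0])
    factor_panel(panel);

    // ... and every block update reads it, so the runtime orders the updates
    // after the factorization while running them in parallel with each other.
    for (int j = 0; j < NB; j++) {
      #pragma omp task depend(in: panel[0]) depend(inout: blocks[j][0])
      update_block(panel, blocks[j]);
    }
  } // all tasks complete at the implicit barrier closing the single/parallel region
}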

In this paper we demonstrate that, for complex DMFs, it is possible to leverage the advantages of both approaches, combining the coarse-grain task-parallelism extracted via a static look-ahead strategy [Str98] with the fine-grain multi-threaded execution of certain highly parallel BLAS kernels. Our solution thus exhibits some relevant differences with respect to an approach based solely on either MTB or RTM, making the following contributions:

  • From the point of view of abstraction, we use a high-level parallel application programming interface (API), such as OpenMP [openmp], to identify two parallel sections (per iteration of the DMF algorithm) that become coarse-grain tasks to be run in parallel.

  • Within some of these coarse tasks, we employ OpenMP as well to extract loop-parallelism while strictly controlling the data movements across the cache hierarchy, yielding two nested levels of parallelism.

  • In contrast with an RTM-based approach, we apply a static version of look-ahead [Str98] (instead of a dynamic one), in order to remove the panel factorization from the critical path of the algorithm’s execution. This is combined with a cache-aware parallelization of the trailing update where all threads efficiently share the memory resources.

  • We offer a high-level description of the DMF algorithms, yet with enough details about their parallelization to allow the practical development of a library for dense linear algebra on multicore processors.

  • We expose the distinct behaviors of the DMF algorithms on top of GNU’s or Intel’s OpenMP runtimes when dealing with nested parallelism on multicore processors. For the latter, we illustrate how to correctly set a few environment variables that are key to avoiding oversubscription and obtaining high performance for DMFs.

  • We investigate the performance of the DMF algorithms when running on top of an alternative multi-threading runtime based on the light-weight thread (LWT) library in Argobots [argobots], accessed via the OpenMP-compatible APIs GLT+GLTO [GLTAPI, lwthlpm].

  • We provide a complete experimental evaluation that shows the performance advantages of our approach using three representative DMFs on an 8-core server with recent Intel Xeon technology.

The rest of the paper is organized as follows. In Section 2, we review the cache-aware implementation and multi-threaded parallelization of the BLAS-3 in the BLIS framework. In Section 3, we present a general framework that accommodates a variety of DMFs, elaborating on their conventional MTB-based and the more recent RTM-assisted parallelizations. In Section 4, we present our alternative that combines task-loop parallelization, static look-ahead, and a “malleable” instance of BLAS. In Section 5, we discuss nested parallelism and inspect the parallelization of DMFs via the LWT runtime library underlying Argobots and the OpenMP APIs GLT and GLTO [argobots, GLTAPI, GLTO]. Finally, in Section 6 we provide an experimental evaluation of the different algorithms/implementations for three representative DMFs, and in Section 7 we close the paper with a few concluding remarks.

2 Multi-threaded BLIS

BLIS is a framework to develop high-performance implementations of BLAS and BLAS-like operations on current architectures [BLIS1]. We next review the design principles that underlie BLIS. For this purpose, we use the implementation of the general matrix-matrix multiplication (gemm) in this framework/library in order to expose how to exploit fine-grain loop-parallelism within the BLIS kernels, while carefully taking into account the cache organization.

2.1 Exploiting the cache hierarchy

Consider three matrices A, B and C, of dimensions m\times k, k\times n and m\times n, respectively. BLIS mimics GotoBLAS to implement the gemm operation

C += A · B      (1)

(as well as variants of this operation with transposed/conjugate A and/or B) as three nested loops around a macro-kernel plus two packing routines; see Loops 1–3 in Listing 1. The macro-kernel is realized as two additional loops around a micro-kernel; see Loops 4 and 5 in that listing. In the code, C_c(i_r:i_r+m_r-1, j_r:j_r+n_r-1) is a notation artifact, introduced to ease the presentation of the algorithm, and no data copies are involved. In contrast, A_c and B_c correspond to actual buffers that are involved in data copies.

The loop ordering in BLIS, together with the packing routines and an appropriate choice of the cache configuration parameters n_{c}, k_{c}, m_{c}, n_{r} and m_{r}, dictate a regular movement of the data across the memory hierarchy. Furthermore, these selections aim to amortize the cost of these transfers with enough computation from within the micro-kernel to deliver high performance [BLIS1]. In particular, BLIS is designed to maintain B_{c} into the L3 cache (if present), A_{c} into the L2 cache, and a micro-panel of B_{c} (of dimension k_{c}\times n_{r}) into the L1 cache; in contrast, C is directly streamed from main memory to the core registers.

void Gemm(int m, int n, int k, double *A, double *B, double *C) {
  // Declarations: mc, nc, kc,...
  for ( jc = 0; jc < n; jc += nc )                       // Loop 1
    for ( pc = 0; pc < k; pc += kc ) {                   // Loop 2
      // B(pc:pc+kc-1, jc:jc+nc-1) -> Bc
      Pack_buffer_B(kc, nc, &B(pc,jc), &Bc);
      for ( ic = 0; ic < m; ic += mc ) {                 // Loop 3
        // A(ic:ic+mc-1, pc:pc+kc-1) -> Ac
        Pack_buffer_A(mc, kc, &A(ic,pc), &Ac);
        // Macro-kernel:
        for ( jr = 0; jr < nc; jr += nr )                // Loop 4
          for ( ir = 0; ir < mc; ir += mr ) {            // Loop 5
            // Micro-kernel:
            //   Cc(ir:ir+mr-1, jr:jr+nr-1) +=
            //       Ac(ir:ir+mr-1, 1:1+kc-1) *
            //       Bc(1:1+kc-1, jr:jr+nr-1)
            Gemm_mkernel( mr, nr, kc, &Ac(ir,1), &Bc(1,jr),
                                                 &Cc(ir,jr) );
          }
      }
    }
}
Listing 1: High performance implementation of gemm in BLIS.

2.2 Multi-threaded parallelization

The parallelization strategy of BLIS for multi-threaded architectures takes advantage of the loop-parallelism exposed by the five nested-loop organization of gemm at one or more levels. A convenient option in most single-socket systems is to parallelize either Loop 3 (indexed by i_{c}), Loop 4 (indexed by j_{r}), or a combination of both [BLIS2, BLIS3, Catalan2016].

For example, we can leverage the OpenMP parallel application programming interface (API) to parallelize Loop 4 inside gemm, with t_{\textsc{mm}} threads, by inserting a simple parallel for directive before that loop (hereafter, for brevity, we omit most of the parts of the codes that do not experience any change with respect to their baseline reference):

// Fragment of Gemm: Reference code in Listing 1
void Gemm(int m, int n, int k, double *A, double *B, double *C) {
  // Declarations: mc, nc, kc,...
  for ( jc = 0; jc < n; jc += nc )                       // Loop 1
    // Loops 2, 3 and packing of Bc, Ac (omitted for simplicity)
    // ...
        #pragma omp parallel for num_threads(tMM)
        for ( jr = 0; jr < nc; jr += nr )                // Loop 4
          // Loop 5 and GEMM micro-kernel (omitted)
          // ...
}

Unless otherwise stated, in the remainder of the paper we will consider a version of BLIS gemm that extracts loop-parallelism from Loop 4 only, using t_mm threads; see Figure 1. To improve performance, the packing of A_c and B_c is also performed in parallel so that, for example, at each iteration of Loop 3, all t_mm threads collaborate to copy and re-organize the entries of A(i_c:i_c+m_c-1, p_c:p_c+k_c-1) into the buffer A_c. From the point of view of cache utilization, with this parallelization strategy all threads share the same buffers A_c and B_c, while each thread operates on a distinct micro-panel of B_c, of dimension k_c × n_r. The shared buffers A_c and B_c are stored in the L2 and L3 caches, respectively, while the micro-panels of B_c reside in the L1 cache.
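As an illustration of the parallel packing, the copy of A(i_c:i_c+m_c-1, p_c:p_c+k_c-1) into A_c can be split by micro-panels among the t_mm threads; the helper Pack_micropanel_A below is a hypothetical routine (the actual BLIS packing code is organized differently) that packs one m_r x k_c micro-panel into its position inside the buffer:

    // Fragment of Gemm: parallel packing of Ac (sketch only)
    #pragma omp parallel for num_threads(tMM)
    for ( ir = 0; ir < mc; ir += mr )
      // Hypothetical helper: packs one mr x kc micro-panel of A into Ac
      Pack_micropanel_A( mr, kc, &A(ic+ir,pc), &Ac(ir,1) );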

Figure 1: Distribution of the workload among t_{\textsc{mm}}=3 threads when Loop 4 of BLIS gemm is parallelized. Different colors in the output C distinguish the micro-panels of this matrix that are computed by each thread as the product of A_{c} and corresponding micro-panels of the input B_{c}.

3 Parallel Dense Matrix Factorizations

3.1 A general framework

Many of the routines for DMFs in LAPACK fit into a common algorithmic skeleton, consisting of a loop that processes the input matrix in steps of b columns/rows per iteration. In general, the parameter b is referred to as the algorithmic block size. We next offer a general framework that accommodates the routines for the LU, Cholesky, QR and LDL^T factorizations (as well as matrix inversion via Gauss-Jordan elimination) [GVL3]. To some extent, it also applies to two-sided decompositions for the reduction to compact band forms in two-stage methods for the solution of eigenvalue problems and the computation of the singular value decomposition (SVD) [Bischof:2000:AST].

Let us denote the input m × n matrix to factorize as A and assume, for simplicity, that m = n and that this dimension is an integer multiple of the block size b. Many routines for the aforementioned DMFs (and matrix inversion) fit into a general code skeleton, which is partially based on the FLAME API for the C programming language [FLAME:Recipe]. In that scheme, before the loop commences, and in preparation for the first iteration, routine FLA_Part_2x2 decouples the input matrix as

    A → ( A_TL  A_TR )
        ( A_BL  A_BR )
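In the absence of look-ahead, the resulting procedure can be summarized by the following minimal sketch, in which the routine names FLA_DMF, PF(k) and TU(k) are hypothetical placeholders for, respectively, the overall driver, the panel factorization and the trailing update performed at iteration k (the PF/TU notation is re-used in the next section):

    void FLA_DMF( int n, FLA_Obj A, int b )
    {
      for ( k = 0; k < n / b; k++ ) {
        PF( k );        // Panel factorization of the k-th panel of b columns
        TU( k );        // Trailing update with respect to that panel
      }
    }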
4 Static Look-ahead and Mixed Parallelism

The introduction of a static look-ahead [Str98] aims to overcome the strict dependencies in the DMF. For this purpose, the following modifications are introduced into the conventional factorization algorithm:

  • The trailing update is broken into two panels/suboperations/tasks only, TU_k → (TU_k^L | TU_k^R), where TU_k^L contains the leftmost b columns of TU_k, which exactly overlap with those of PF_{k+1}.

  • The algorithm is then (manually) re-organized, applying a sort of software pipelining, in order to perform the panel factorization PF_{k+1} in the same iteration as the update (TU_k^L | TU_k^R).

These changes allow us to overlap the sequential factorization of the “next” panel with the highly parallel update of the “current” trailing submatrix in the same iteration; see the re-organized version of the DMF with look-ahead in Listing 2. There, we assume that the k-th left trailing update TU_k^L and the (k+1)-th panel factorization PF_{k+1} are both performed inside routine PU(k+1) (for panel update), and the k-th right trailing update TU_k^R occurs inside routine TU_right(k).

void FLA_DMF_la( int n, FLA_Obj A, int b )
{
  PF( 0 );                // First panel factorization
  for ( k = 0; k < n / b; k++ ) {
    /*-----------------------------------------------------------*/
    // Operations
        PU( k+1 );        // Panel update: PF + TU (left)
        TU_right( k );    // Trailing update (right)
    /*-----------------------------------------------------------*/
  }
}
Listing 2: Simplified routine for a DMF with look-ahead.

4.1 Parallelization with the OpenMP API

The goal of our “mixed” strategy, exposed next, is to exploit a combination of task-level and loop-level parallelism in the static look-ahead variant, extracting coarse-grain task-level parallelism between the independent tasks PU_{k+1} and TU_k^R at each iteration, while leveraging the fine-grain loop-parallelism within the latter using a cache-aware multi-threaded implementation of the BLAS.

Let us assume that, for an architecture with t hardware cores, we want to spawn one OpenMP thread per core, with a single thread dedicated to the panel update PU_{k+1} and the remaining t_mm = t-1 threads dedicated to the right trailing update TU_k^R. (This mapping of tasks to threads aims to match the reduced and ample degrees of parallelism of the panel factorization, inside the panel update, and of the trailing update, respectively.) To attain this objective, we can then use the OpenMP parallel sections directive to parallelize the operations in the loop body of the algorithm for the DMF as follows:

    // Fragment of FLA_DMF_la: Reference code in Listing 2
    /*-----------------------------------------------------------*/
    // Operations
    tMM = t-1;
    #pragma omp parallel sections num_threads(2)
    {
      #pragma omp section
        PU( k+1 );        // Panel update: PF + TU (left)
      #pragma omp section
        TU_right( k );    // Trailing update (right)
    }
    /*-----------------------------------------------------------*/

Here we map the panel update and the trailing update to one thread each. Then, the invocation of a loop-parallel instance of the BLAS from the trailing update (but of a sequential one from the panel update) yields the desired nested-mixed parallelism (NMP), with the OpenMP parallel sections directive at the “outer” level and a loop-parallelization of the BLAS (invoked from the right trailing update) using OpenMP parallel for directives at the “inner” level; see Section 2.2.

4.2 Workload balancing via malleable BLAS

Extracting parallelism within the iterations via a static look-ahead using the OpenMP parallel sections directive implicitly sets a synchronization point at the end of each iteration. In consequence, a performance bottleneck may appear if the practical costs (i.e., execution times) of PU_{k+1} (= TU_k^L + PF_{k+1}) and TU_k^R are unbalanced.

A higher cost of PU_{k+1} is, in principle, due to the use of a value for b that is too large, and occurs when the number of cores is relatively large with respect to the problem dimension. This can be alleviated by adjusting, on-the-fly, the block dimension via an auto-tuning technique referred to as early termination [catalan17].
Here we focus on the more challenging opposite case, in which TU_k^R is the most expensive operation. This scenario is tackled in [catalan17] by developing a malleable thread-level (MTL) implementation of the BLAS so that, when the thread in charge of PU_{k+1} completes this task, it joins the remaining t_mm threads that are executing TU_k^R. Note that this is only possible because the instance of BLAS that we are using is open source and, in consequence, we can modify the code to achieve the desired behavior. In comparison, standard multi-threaded instances of BLAS, such as those in Intel MKL, OpenBLAS or GotoBLAS, allow the user to run a BLAS kernel with a certain number of threads, but this number cannot be varied during the execution of the kernel (that is, on-the-fly).

Coming back to our OpenMP-based solution, we can attain the malleability effect as follows:

    // Fragment of FLA_DMF_la: Reference code in Listing 2
    /*-----------------------------------------------------------*/
    // Operations
    tMM = t-1;
    #pragma omp parallel sections num_threads(2)
    {
      #pragma omp section
      {
        PU( k+1 );        // Panel update: PF + TU (left)
        tMM = t;
      }
      #pragma omp section
        TU_right( k );    // Trailing update (calls GEMM)
    }
    /*-----------------------------------------------------------*/

For simplicity, let us assume that the right trailing update boils down to a single call to gemm. Setting variable tMM = t right after the call to PU(k+1) in the first section ensures that, provided this change is visible inside gemm, the next time the OpenMP parallel for directive around Loop 4 of gemm is encountered (i.e., in the next iteration of Loop 3; see Listing 1), this loop will be executed by all t threads. The change in the number of threads also affects the parallelism degree of the packing routine for A_c.
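One simple way to make this change visible inside gemm (an assumption of ours; the malleable BLIS of [catalan17] may realize it differently) is to keep tMM in a variable shared by the factorization code and the BLAS, which gemm re-reads every time the parallel for around Loop 4 is reached:

    // Hypothetical shared thread-count variable, updated by the DMF code
    // and read by Gemm each time Loop 4 is encountered.
    volatile int tMM;

    // Fragment of Gemm: Reference code in Listing 1
        #pragma omp parallel for num_threads(tMM)    // tMM re-evaluated here
        for ( jr = 0; jr < nc; jr += nr )            // Loop 4
          // Loop 5 and GEMM micro-kernel (omitted)
          // ...

Since the section executing PU(k+1) writes tMM while the section executing TU_right(k) reads it, a production code should protect this access (e.g., with an atomic update); the worst outcome of a stale read, however, is simply that one more encounter of Loop 4 runs with t-1 threads.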
5 Re-visiting Nested Mixed Parallelism

Exploiting data locality is crucial on current architectures. This is the case for many scientific applications and, especially, for DMFs when the goal is to squeeze the last drops of performance out of an algorithm-architecture pair. To attain this, a tight control of the data placement/movement and of the threading activity may be necessary. Unfortunately, the use of a high-level programming model such as OpenMP abstracts these mappings, making this task more difficult.

5.1 Conventional OS threads

Nested parallelism may potentially yield a performance issue due to the thread management realized by the underlying OpenMP runtime. In particular, when the first parallel directive is found, a team of threads is created and the following region is executed in parallel. Now, if a second parallel directive is encountered inside the region (nested parallelism), a new team of threads is created for each thread encountering it. This runtime policy may spawn more threads than physical cores, adding a relevant overhead due to oversubscription, as current OpenMP releases are implemented on top of “heavy” Pthreads, which are controlled by the operating system (OS).

In the DMF algorithms, we encounter nested parallelism because of the nested invocation of a parallel for (from a BLAS kernel) inside a parallel sections directive (encountered in the DMF routine). To tackle this problem, we can restrict the number of threads for the sections to only two and, on an architecture with t physical cores, set the number of threads in the parallel for to tMM = t-1, for a total of t threads. Unfortunately, with the addition of malleability, the thread that executes the panel factorization, upon completing this computation, will remain “alive” (either in a busy wait or blocked) while a new thread is spawned for the next iteration of Loop 3 in the trailing update, yielding a total of t+1 threads and the undesired oversubscription problem.

We will explore the practical effects of oversubscription for classical OpenMP runtimes that leverage OS threads in Section 6, where we consider the differences between the OpenMP runtimes underlying the GNU gcc and Intel icc compilers, and describe how to avoid the negative consequences for the latter.

5.2 LWT in Argobots

In the remainder of this section we introduce an alternative to deal with the oversubscription problem, using the implementation of LWTs in Argobots [argobots]. Compared with OS threads, LWTs (also known as user-level threads or ULTs) run in user space, providing a lower-cost threading mechanism (in terms of context-switch, suspend, cancel, etc.) than Pthreads [stein1992]. Furthermore, LWT instances follow a two-level hierarchical implementation, where the bottom level (closer to the hardware) comprises the OS threads, which are bound to cores following a 1:1 relationship. In contrast, the top level corresponds to the ULTs, which contain the code that will be executed concurrently by the OS threads. With this strategy, the number of OS threads never exceeds the number of cores and, therefore, oversubscription is prevented.

5.2.1 LWT parallelization with GLTO

To improve code portability, we utilize the GLTO API [GLTO], which is an OpenMP-compatible implementation built on top of the GLT API [GLTAPI], and rely on Argobots as the underlying threading library. Concretely, our first LWT-based parallelization employs GLTO to extract task-parallelism from the DMF, using the OpenMP parallel sections directive, and loop-parallelism inside the BLAS, using the OpenMP parallel for directive. Therefore, no changes are required to the code for the DMF with static look-ahead, NMP and MTL BLAS. The only difference is that the OpenMP threading library is replaced by GLTO’s (i.e., Argobots’) instance in order to avoid potential oversubscription problems.

Applied to the DMFs, this solution initially spawns one OS thread per core. The master thread first encounters the parallel sections directive, creating two ULT work-units (one per section), and then commences the execution of one of these sections/ULTs/branches. Until the creation of the additional ULTs, the remaining threads cycle in a busy-wait. Once this occurs, one of these threads will commence with the execution of the alternative section (while the remaining ones stay in the busy-wait). The thread in charge of the right trailing update then creates several ULTs inside the BLAS, one per iteration chunk due to the parallel for directive. These ULTs will be executed, when ready, by the OS threads. The MTL technique is easily integrated in this solution, as OS threads execute ULTs independently of which section of the code they “belong to”.
1  void Gemm_Tasklets(int m, int n, int k, double *A, double *B,
2                     double *C) {
3    // Declarations: mc, nc, kc,...
4    // GLT tasklet handlers
5    GLT_tasklet tasklet[tMM];
6    struct L4_args L4args[tMM];
7
8    for ( jc = 0; jc < n; jc += nc ) {                   // Loop 1
9      // Loops 2, 3 and packing of Bc, Ac (omitted for simplicity)
10         for ( th = 0; th < tMM; th++ )                 // Loop 4
11         {
12           L4args[th].arg1 = arg1;
13           L4args[th].arg2 = arg2;
14           // ...
15           // Tasklet creation that invokes the L4 function
16           glt_tasklet_create(L4, L4args[th], &tasklet[th]);
17         }
18
19         glt_yield();
20         // Join the tasklets
21         for ( th = 0; th < tMM; th++ )
22           glt_tasklet_join(&tasklet[th]);
23    }
24  }
Listing 3: High performance implementation of gemm in BLIS on top of GLT using Tasklets.

5.2.2 LWT parallelization with GLTO+GLT

Argobots provides direct access to Tasklets, a type of work-unit that is even lighter than ULTs and can deliver higher performance for computation-only codes [cluster16]. In our particular example, Tasklets can be leveraged to parallelize the BLAS routines, providing an MTL black-box implementation of this library that can be invoked from higher-level operations, such as DMFs. In this alternative LWT-based parallel solution, the potentially higher performance derived from the use of Tasklets comes at the cost of some development effort. The reason is that GLTO does not support Tasklets but relies on ULTs to realize all work-units. Therefore, our implementation of the MTL BLAS has to abandon GLTO, employing the GLT API to introduce the use of Tasklets into the BLAS instance.

In more detail, we implemented a hybrid solution with GLTO and GLT. At the outer level, the parallelization of the DMF employs the parallel sections directive on top of GLTO, the OpenMP runtime and Argobots’ threading mechanism. Internally, the BLAS routines are implemented with GLT_tasklets, as depicted in the example in Listing 3. In the Gemm_Tasklets routine shown there, line 5 first declares the tasklet handlers (one per thread that will execute Loop 4, that is, tMM). The original Loop 4 in Gemm, indexed by jr (see Listing 1), is then replaced by a loop that creates one Tasklet per thread. Lines 12-14 inside this new loop initialize the arguments to function L4, among other parameters defining which iterations of the iteration space of the original loop indexed by jr will be executed as part of the Tasklet indexed by th. Then, line 16 generates a GLT_tasklet that contains the function pointer (L4), the function arguments (L4args) and the tasklet handler. This Tasklet will be responsible for executing the corresponding iteration space of jr, including Loop 5 and the micro-kernel(s). Line 19 allows the current thread to yield and start executing pending work-units (Tasklets). Finally, line 22 checks the Tasklet status to ensure that the work has been completed (synchronization point).

In Section 6, we evaluate the LWT solutions based on GLTO vs GLTO+GLT, and we compare their performance with that of a conventional OpenMP runtime, using the DMF algorithms as the target case study.
6 Performance Evaluation

6.1 Experimental setup

All the experiments in this paper were performed in double-precision real arithmetic, on a server equipped with an 8-core Intel Xeon E5-2630 v3 (“Haswell”) processor, running at 2.4 GHz, and 64 Gbytes of DDR4 RAM. The codes were compiled with Intel icc 17.0.1 or GNU gcc 6.3.0. The LWT implementation is that in Argobots (version from October 2017, available online at http://www.argobots.org). Unless explicitly stated otherwise, we will use Intel’s compiler and OpenMP runtime. The instance of BLAS is a version of BLIS 0.1.8, modified to accommodate malleability, where the cache configuration parameters were set to n_c = 4032, k_c = 256, m_c = 72, n_r = 6 and m_r = 8. These values are optimal for the Intel Haswell architecture.

The matrices employed in the study are all square of order n, with random entries following a uniform distribution. (The specific values can only have a mild impact on the execution time of LUpp, because of the different permutation sequences that they produce.) The algorithmic block size for all algorithms was set to b = 192. This specific value of b is not particularly biased to favor any of the algorithms/implementations, and avoids a very time-consuming optimization of this parameter over the space of tuples DMF/problem dimension/implementation.

In the following two subsections, we employ LUpp to compare the distinct behavior of Intel’s and GNU’s runtimes when dealing with nested parallelism, and the performance differences when using GLTO or GLT to parallelize the BLAS. After identifying the best options with these initial analyses, in the subsequent subsection we perform a global comparison using three DMFs: LUpp, the QR factorization (QR), and a routine for the reduction to band form that is utilized in the computation of the SVD. These DMFs are representative of many linear algebra codes in LAPACK.
6.2 Conventional OS threads: GNU vs Intel

GNU and Intel feature different policies to deal with nested parallelism, which may have relevant consequences on performance. In principle, upon encountering the first (outer) parallel region, say OR (for outer region), both runtimes “spawn” the requested number of threads. For each thread hitting the second (inner) region, say IR1 (inner region 1), they will next “spawn” as many threads as requested in the corresponding directive. The differences appear when, after completing the execution of IR1, a new inner region IR2 is encountered. In this scenario, GNU’s runtime will set the threads that executed IR1 to idle, and a new team of threads will be spawned and put in control of executing IR2. Intel’s runtime behavior differs from this in that it re-utilizes the team that executed IR1 for IR2 (plus/minus the difference in the number of threads requested by the two inner regions). This discussion is important because, in our parallelization of the DMFs, this is exactly the scenario that occurs: OR is the region in the DMF algorithm that employs the parallel sections directive, while IR1, IR2, IR3, ... correspond to each one of the regions annotated with the parallel for directives that are encountered in successive iterations of Loop 3 for the BLAS. It is thus easy to infer that, under these circumstances, GNU will produce considerable oversubscription, due to the overhead of creating new teams, even if the threads are set to a passive mode when no longer needed (or even worse if they actively cycle in a busy-wait).

With Intel, a mild risk of oversubscription still appears with the version of the DMF algorithm that employs a malleable BLAS. In this case, the thread that completes the execution of the panel factorization is set to idle upon finishing this part; and the next time the parallel for inside Loop 3 of the BLAS is encountered, a new thread becomes part of the team executing the trailing update. The outcome is that now we have one thread waiting for the synchronization at the end of the parallel sections and tMM = t threads executing the trailing update, where t denotes the number of cores. Fortunately, we can avoid the negative consequences in this case by controlling the behavior of the idle thread via Intel’s environment variables, as we describe next.

The experiments in this subsection aim to illustrate these effects. Concretely, Figure 2 compares the performance of both conventional runtimes for the LUpp codes (with static look-ahead in all cases), and shows the impact of their thread management mechanisms on performance. For Intel’s runtime, we also provide a more detailed inspection using several fine-grained optimization strategies enforced via environment variables. Each line of the plot corresponds to a different combination of runtime and environment variables, as follows:

  • Base: Basic configuration for both runtimes. Nested parallelism is explicitly enabled by setting OMP_NESTED=true and OMP_MAX_ACTIVE_LEVELS=2. The waiting policy for idle threads is explicitly enforced to be passive for both runtimes via the initialization OMP_WAIT_POLICY=passive. This environment variable defines whether threads spin (active policy) or sleep (passive policy) while they are waiting.

  • Blocktime: Only available for Intel’s runtime. When using a passive waiting policy, we leverage the variable KMP_BLOCKTIME to fix the time that a thread should wait, after completing the execution of a parallel region, before sleeping. In our case, we have empirically determined an optimal waiting time of 1 ms. (In comparison, the default value is 200 ms.)

  • HotTeams: Only available for Intel’s runtime. Hot teams is an extension of OpenMP supported by the Intel runtime that specifies the runtime behavior when the number of threads in a team is reduced. Specifically, when hot teams are active, extra threads are kept in the team in reserve, for faster re-use in subsequent parallel regions, potentially reducing the overhead associated with a full start/stop procedure. This functionality is enabled by setting KMP_HOT_TEAMS_MODE=1 and KMP_HOT_TEAMS_MAX_LEVEL=2.

Figure 2: Performance of LUpp using the conventional OpenMP runtimes on 8 cores of an Intel Xeon E5-2630 v3.

The analysis of the performance in Figure 2 exposes the differences between the Base configurations of Intel’s and GNU’s runtimes, mainly derived from the distinct policies in thread re-use between the two runtimes and the consequent oversubscription problem described above. For Intel’s runtime, the explicit introduction of a passive wait policy (Base line) yields a substantial performance boost compared with GNU; additional performance gains are derived from the use of an optimal blocktime value and hot teams (lines labeled Blocktime and HotTeams, respectively).

6.3 LWT in Argobots: GLTO vs GLTO+GLT

Figure 3 compares the performance of the LUpp codes (with static look-ahead) using the two LWT solutions described in Section 5.2. Here we remind the reader that the simplest variant utilizes GLTO’s OpenMP API on top of Argobots’ runtime (line labeled GLTO in the plot), while the most sophisticated one, in addition, employs Tasklets to parallelize the BLAS (line GLTO+GLT). This experiment shows that using Tasklets compensates the additional effort of developing this specific implementation of the BLAS. This is especially the case as this development is a one-time effort that, once completed, can be seamlessly leveraged multiple times by the users of this specialized instance of the library.

Figure 3: Performance of LUpp using the LWT in Argobots on 8 cores of an Intel Xeon E5-2630 v3.

6.4 Global comparison

The final analysis in this paper compares the five parallel algorithms/implementations listed next. Unless otherwise stated, they all employ Intel’s OpenMP runtime.

  • MTB: Conventional approach that extracts parallelism in the reference DMF routines (without look-ahead) by simply linking them with a multi-threaded instance of the BLAS.

  • RTM: Runtime-assisted parallelization that decomposes the trailing update into multiple tasks and simultaneously executes independent tasks on different cores. Most of the tasks correspond to BLAS kernels, which are executed using a serial (i.e., single-threaded) instance of this library. The tasks are identified using the OpenMP 4.5 task directive, and dependencies are specified via representants for the blocks and the proper in/out clauses.
Figure 2: Performance of LUpp using the conventional OpenMP runtimes on 8 cores of an Intel Xeon E5-2630 v3.

The analysis of the performance in Figure 2 exposes the differences between the Base configurations of Intel's and GNU's runtimes, mainly derived from the distinct thread re-use policies of the two runtimes and the consequent oversubscription problem described above. For Intel's runtime, the explicit introduction of a passive wait policy (Base line) yields a substantial performance boost compared with GNU's, and additional performance gains are derived from the use of an optimal blocktime value and hot teams (lines labeled Blocktime and HotTeams, respectively).

LWT in Argobots: GLTO vs GLTO+GLT

Figure 3 compares the performance of the LUpp codes (with static look-ahead), using the two LWT solutions described earlier. We remind the reader that the simplest variant utilizes GLTO's OpenMP API on top of Argobots' runtime (line labeled GLTO in the plot), while the most sophisticated one, in addition, employs Tasklets to parallelize the BLAS (line GLTO+GLT). This experiment shows that the use of Tasklets compensates for the additional effort of developing this specific implementation of the BLAS. This is especially the case because this development is a one-time effort that, once completed, can be seamlessly leveraged many times by the users of this specialized instance of the library.

Figure 3: Performance of LUpp using the LWT in Argobots on 8 cores of an Intel Xeon E5-2630 v3.

Global comparison

The final analysis in this paper compares the five parallel algorithms/implementations listed next. Unless otherwise stated, they all employ Intel's OpenMP runtime.

- MTB: Conventional approach that extracts parallelism from the reference DMF routines (without look-ahead) by simply linking them with a multi-threaded instance of the BLAS.

- RTM: Runtime-assisted parallelization that decomposes the trailing update into multiple tasks and simultaneously executes independent tasks on different cores. Most of the tasks correspond to BLAS kernels, which are executed using a serial (i.e., single-threaded) instance of this library. The tasks are identified using the OpenMP 4.5 task directive, and the dependencies are specified via representants for the blocks and the proper in/out clauses (see the sketch after this list).

- LA: DMF algorithm that integrates a static look-ahead and exploits NMP, with task-parallelism extracted from the loop-body of the factorization and loop-parallelism from the multi-threaded BLAS.

- LA_MB_S and LA_MB_G: Analogous to LA, but linked with an MTL multi-threaded version of the BLAS. The first implementation (suffix "_S") employs Intel's OpenMP runtime, with the environment variables set as determined in the comparison of the Intel and GNU runtimes above. The second one (suffix "_G") employs GLTO+GLT on top of Argobots' runtime, which the previous experiment identified as the best LWT option.
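To make the RTM description more concrete, the following sketch shows how one step of the trailing update of a blocked right-looking LU factorization can be decomposed into OpenMP 4.5 tasks, using the first entry of each block as its representant in the depend clauses. This is an illustration under stated assumptions, not the code evaluated in the paper: column-major storage with leading dimension n, a block size b that divides n, and the two stub kernels (stand-ins for single-threaded dtrsm/dgemm calls) are all placeholders.

    #include <omp.h>

    /* Stand-ins for single-threaded BLAS kernels operating on b x b blocks. */
    static void serial_trsm(const double *Akk, double *Akj) { (void) Akk; (void) Akj; }
    static void serial_gemm(const double *Aik, const double *Akj, double *Aij)
    { (void) Aik; (void) Akj; (void) Aij; }

    /* Trailing update of step k of a blocked LU, decomposed into tasks. */
    void trailing_update_rtm(double *A, int n, int b, int k) {
      int nb = n / b;                       /* number of block rows/columns          */
      #pragma omp parallel
      #pragma omp single                    /* one thread generates all the tasks    */
      for (int j = k + 1; j < nb; j++) {
        double *Akk = &A[(k*b)*n + k*b];    /* diagonal block, column-major storage  */
        double *Akj = &A[(j*b)*n + k*b];    /* block (k,j) of the trailing submatrix */

        #pragma omp task depend(in: Akk[0]) depend(inout: Akj[0])
        serial_trsm(Akk, Akj);              /* triangular solve on block (k,j)       */

        for (int i = k + 1; i < nb; i++) {
          double *Aik = &A[(k*b)*n + i*b];
          double *Aij = &A[(j*b)*n + i*b];
          #pragma omp task depend(in: Aik[0], Akj[0]) depend(inout: Aij[0])
          serial_gemm(Aik, Akj, Aij);       /* A(i,j) -= A(i,k) * A(k,j)             */
        }
      }                                     /* tasks complete at the implicit barrier */
    }

In a full RTM code, the panel factorization of step k would also be expressed as tasks producing the blocks A(k,k) and A(i,k), so that the runtime orders panel and update tasks automatically through the same depend mechanism.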
For this study, we leverage the following three DMFs:

- LUpp: The LU factorization with partial pivoting, as utilized and described earlier in this work.

- QR: The QR factorization via Householder transformations. The reference implementation is a direct translation into C of routine geqrf in LAPACK. The version with static look-ahead is obtained from this code by re-organizing the operations as explained for the generic DMF earlier in the paper. The runtime-assisted parallelization operates differently, in order to expose a higher degree of parallelism, but, due to the numerical stability of orthogonal transformations, produces the same result. In particular, RTM divides the panel and the trailing submatrix into square blocks, using the same approach proposed in [Buttari200938, Quintana-Orti:2009:PMA:1527286.1527288] and derived from the incremental QR factorization in [Gunter:2005:POC].

- SVD: The reduction to compact band form for the (first stage of the) computation of the SVD, as described in [GROER1999969, DBLP:journals/corr/abs-1709-00302]. This is a right-looking routine that, at each iteration, computes two panel factorizations, using Householder transformations applied from the left- and right-hand sides of the matrix, respectively. These transformations are next applied to update the trailing parts of the matrix via efficient BLAS-3 kernels. The variants that allow the introduction of a static look-ahead were presented in [DBLP:journals/corr/abs-1709-00302]. No runtime-assisted version exists at present for this factorization [DBLP:journals/corr/abs-1709-00302].

The results are compared in terms of GFLOPS, using the standard flop counts for LUpp (2n^3/3) and QR (4n^3/3). For the SVD reduction routine, we employ the theoretical flop count of 8n^3/3 for the full reduction to bidiagonal form; however, the actual number of flops depends on the relation between the target bandwidth w and the problem dimension. In these experiments, w was set to 384. For the SVD, this performance ratio still allows a fair comparison between the different algorithms, as the GFLOPS rate can be viewed as a scaled metric of (the inverse of) the execution time; the metric is summarized after the figures below.

Figure: Performance of LUpp on 8 cores of an Intel Xeon E5-2630 v3.

Figure: Performance of QR on 8 cores of an Intel Xeon E5-2630 v3.

Figure: Performance of SVD on 8 cores of an Intel Xeon E5-2630 v3.
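For clarity, the rate plotted in these figures follows the standard definition, with T the elapsed execution time in seconds and n the problem dimension:

\[
  \mathrm{GFLOPS} \;=\; \frac{\#\mathrm{flops}}{T \cdot 10^{9}},
  \qquad
  \#\mathrm{flops} \;=\;
  \begin{cases}
    2n^{3}/3, & \text{LUpp},\\[2pt]
    4n^{3}/3, & \text{QR},\\[2pt]
    8n^{3}/3, & \text{SVD (theoretical count for the full reduction to bidiagonal form)}.
  \end{cases}
\]

Because the flop count is fixed for a given routine and problem size, the GFLOPS rate is proportional to 1/T, which is what makes the comparison fair even when the actual operation count (as in the SVD band reduction) deviates from the theoretical one.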
