On the Efficacy and High-Performance Implementation of Quaternion Matrix Multiplication
Quaternion symmetry is ubiquitous in the physical sciences. As such, much work has been afforded over the years to the development of efficient schemes to exploit this symmetry using real and complex linear algebra. Recent years have also seen many advances in the formal theoretical development of explicitly quaternion linear algebra with promising applications in image processing and machine learning. Despite these advances, there do not currently exist optimized software implementations of quaternion linear algebra. The leverage of optimized linear algebra software is crucial in the achievement of high levels of performance on modern computing architectures, and thus provides a central tool in the development of high-performance scientific software. In this work, a case will be made for the efficacy of high-performance quaternion linear algebra software for appropriate problems. In this pursuit, an optimized software implementation of quaternion matrix multiplication will be presented and will be shown to outperform a vendor tuned implementation for the analogous complex matrix operation. The results of this work pave the path for further development of high-performance quaternion linear algebra software which will improve the performance of the next generation of applicable scientific applications.
In the ever evolving ecosystem of high–performance computing (HPC), the full exploitation of contemporary computational resources must constitute a central research effort in computationally intensive fields such as scientific computing. However, it is often the case that simply applying conventional algorithms and data structures to existing problems will yield sub-optimal time–to–solution and resource management on modern HPC systems. By exploiting the symmetry of a particular problem and explicitly considering the structure and resources of these computational architectures, one can often develop more optimal computational research pathways. Such development has the potential to enable routine inquiry and simulation of systems which were inaccessible or impractical by existing computational methods.
In this work, we consider the computational benefits of exploiting the scalar and linear algebras generated by the quaternion numbers. The quaternions, also known as Hamilton’s quaternions, are a hyper-complex number system which extends the complex numbers and are isomorphic with the special unitary group, (Hamilton, 1866). Perhaps the most notable feature of the quaternion numbers is the loss of scalar commutivity under multiplication. As such, the algebra which is generated by the quaternions is peculiar in that it more closely resembles that of matrix algebra than that of its real or complex counterpart. Since their inception, the quaternions have seen extensive application both in pure (Cayley, 1845; Frobenius, 1878; Hurwitz, 1922; Hopf, 1931; Chevalley, 1996) and applied mathematics (Grubin, 1970; Deavours, 1973; Sudbery, 1979; Arnol’d, 1995; Mocoroa et al., 2006). Historically, the quaternions have been applied most successfully in the treatment of rigid body mechanics due to their relationship with (and thus the group of spacial rotations, ) (Grubin, 1970; Mocoroa et al., 2006). This application has been widespread in the field of computer graphics to accelerate video animation via algorithms such as Slerp (Shoemake, 1985). From a computational perspective, quaternion arithmetic offers an attractive alternative to complex arithmetic in that it admits a higher arithmetic intensity and smaller memory footprint than that of the the complex matrix algebra it represents. A demonstration of this state of affairs will be given in the body of this work.
In the context of linear algebra, quaternions and quaternion linear algebra naturally manifest in many scientific applications such as quantum chemistry and nuclear physics (Saue et al., 1997; Visscher and Saue, 2000; Saue and Helgaker, 2002; Henriksson et al., 2005; Ekström et al., 2006; Peng et al., 2009; Armbruster, 2017; Nakano et al., 2017; Konecny et al., 2018). Traditionally, quantum mechanics and quantum field theories have been formulated in terms of the complex numbers. However, since the earliest days of the development of quantum mechanics, it has been known that the spinor nature of the fermionic wave function (i.e. electrons, neutrinos, etc) admits a quaternion representation due to its relationship with (Adler, 1995; Horwitz and Biedenharn, 1984). Typically, this representation manifests as a result of time–reversal being a global symmetry of the Hamiltonian (Dongarra et al., 1984; Stuber and Paldus, 2003; Saue et al., 1997). As such, the quaternion algebra has been exploited in the development of efficient real (Dongarra et al., 1984) and complex (Shiozaki, 2017) eigensolvers for time–reversal symmetric Hamiltonians. Despite the success and power of these methods, their exploitation of the quaternion algebra is implicit in that the final computer implementation of these methods is done in either real or complex matrix arithmetic; thus they cannot leverage the full computational potential of the quaternion arithmetic. In this work, a case will be made for explicit exploitation of quaternion arithmetic in high–performance software.
In general, high–performance methods for scientific applications rely on highly tuned numerical linear algebra software libraries which implement the BLAS (Lawson et al., 1979; Dongarra et al., 1988; Dongarra et al., 1990; Blackford et al., 2002) and LAPACK (Anderson et al., 1999) standards for performance on modern HPC systems. Historically, numerical linear algebra has been the archetypal example and motivation for the careful consideration of a computer’s architecture in the development of high–performance software (Agarwal et al., 1994; Whaley and Dongarra, 1998; Whaley et al., 2001; Gunnels et al., 2001; Goto and van de Geijn, 2008). It was realized early on that straightforward implementations of operations such as matrix multiplication will yield sub-optimal performance results and that achieving peak performance on modern architectures requires a drastic departure from conventional implementations. We refer the reader to the work of (Goto and van de Geijn, 2008) for a reasonably contemporary discussion on the optimization of BLAS functions, specifically matrix-matrix multiplication, on modern architectures. BLAS and LAPACK optimization still constitutes a major research effort in the field of numerical linear algebra, and this has led to a number of different approaches which are available in open source (Whaley and Dongarra, 1998; Wang et al., 2013; Xianyi et al., 2012; Van Zee and van de Geijn, 2015; Van Zee et al., 2016; Low et al., 2016; Van Zee and Smith, 2017) and vendor tuned software such as the Intel Math Kernel Library (MKL) and the IBM Engineering Scientific Subroutine Library (ESSL).
We note that over the years, there have been many important theoretical developments in the field of quaternion linear algebra (Zhang, 1997; Rodman, 2014). Many of necessary algorithmic building blocks for ubiquitous operations such as eigenvalue and singular value decompositions, and matrix factorizations have been developed (Bunse-Gerstner et al., 1989; Zhang, 1997; Baker, 1999; Janovská and Opfer, 2003; Loring, 2012; Jia et al., 2018; Li et al., 2019). Such explorations have been key in the development of recent methods for efficient signal and image processing by exploiting the quaternion algebra (Le Bihan and Sangwine, 2003; Le Bihan and Mars, 2004; Ell et al., 2014; Zeng et al., 2016). Despite these successes, the field of quaternion linear algebra is still in its infancy in terms of software adoption by scientists and engineers. This is primarily due to the fact that, relative to its complex counterpart, there do not currently exist highly optimized software implementations of quaternion linear algebra. In order for quaternion linear algebra to become a viable alternative to complex linear algebra, such software must be developed. In this work, it will be demonstrated that optimized implementations of quaternion linear algebra operations hold the potential for leveraging drastic performance improvements relative to analogous complex operations in problems which they may be applied.
A key building block of optimized linear algebra algorithms is that of an optimized implementation of matrix multiplication, which we will refer to as GEMM in this work. Having an optimized GEMM implementation is a necessary (but not necessarily sufficient) condition for optimization more involved linear algebra operations such as eigenvalue decomposition and matrix factorization. As such, this work will focus on the performance and implementation of a quaternion GEMM to demonstrate the efficacy of high-performance quaternion linear algebra.
The remainder of this work will be organized as follows. Section 2 will review the necessary theory for the development of quaternion linear algebra and how one might leverage this algebra as an alternative to complex linear algebra for a special class of complex matrices. Section 3 will briefly review the nature and abstract structure of high–performance GEMM implementations. Section 4 details an implementation of a high-performance GEMM operation using explicitly quaternion arithmetic on a contemporary computing architecture. Finally, performance results for the implementation quaternion GEMM relative to a vendor tuned complex implementation will be presented in Sec. 5, and Sec. 6 will provide an examination of the future research which will be enabled as a result of our findings.
Throughout the remainder of this work, we will adopt the following notation conventions:
Scalars of a ring will be denoted with lower case letters .
will denote the ring of -by- matrices over . Further, for brevity. Matrices will be denoted with capital letters, .
For , will denote the conjugate of , its transpose, and will denote its conjugate transpose. For , we note that .
For , we denote the set of vectors over a ring as , and denote its elements as lower-case letters, .
represents the sub-matrix consisting of the -st to -nd rows of . If is to be understood from the context, we use the abbreviated notation .
represents the sub-matrix consisting of the -st to -nd columns of . If is to be understood from the context, we use the abbreviated notation .
The limiting cases of single row or column vectors will be denoted and .
If the indices and are to be understood from the context, we denote the sub-matrix of a sub-matrix as , and so on for row sub-matrices.
2. The Quaternion Algebra
2.1. Scalar Operations
Fundamental to the development of quaternion linear algebra is the nature of the scalar algebra generated by the quaternions. In this work, the set of quaternion numbers will be denoted and is defined as the set of all that (Hamilton, 1866)
where is referred to as the scalar component of and is referred to as its vector component. are the quaternion basis elements and they generate the skew–symmetric algebra defined by, \cref@addtoresetequationparentequation
where is the Kronecker delta and is the totally anti-symmetric Levi-Civita tensor. As such, the following must hold true \cref@addtoresetequationparentequation
Given Eq. 2, the product of quaternion scalars is given by the Hamilton product
and is thus non-commutative
There are a number remarkable results that arise from the algebra defined by Eqs. 2 and 1 which are important to the construction of quaternion linear algebra. The first is that constitutes a normed division algebra, and is in fact the largest normed division algebra for which multiplication is associative (Frobenius, 1878). As such, we may define a quaternion norm and inverse for every nonzero element of , \cref@addtoresetequationparentequation
where we have defined the quaternion conjugate
We note here that quaternions of unit norm () are known in the literature as versors (Hamilton, 1866). In examining Eqs. 7 and 6, the expressions for norm, inverse and conjugate closely resemble those of the complex numbers, . In fact, is a sub-algebra embedded in , and this relationship is crucial for the development of the relationship between complex and quaternion linear algebra.
To examine the relationship between and , we introduce a common, simplified notation
Consider the subset defined by
Note that . The algebra defined by is exactly that of (this may be easily verified through expansion of Eq. 4). Thus, via the map
with and . It is important to note here that Eq. 11 does not imply nor , simply that there exists a bijection between and . Keeping this in mind, however, it will typically be the case that one can use them interchangeably without ambiguity. As such, whether scalars of the form given in Eq. 10 are treated as complex or quaternion scalars should will be implied from their context in the following discussion.
While one tends to describe quaternions in terms of scalars (and rightly so), the algebra which they generate more closely that of a matrix algebra, specifically that of a Lie group (Hall, 2015). In the development of quaternion linear algebra, it is instructive to examine the isomorphism between the versors and the special unitary group through the mapping of basis elements
where the Pauli matrices are given as
By expanding in terms of the Pauli basis, it may be demonstrated that the algebra of is isomorphic to the algebra generated by via the map
Here we have denoted the complex matrix representation of a quaternion scalar with a subscript .
|Operation||FLOPs in||FLOPs in|
|Addition (Eq. 15a)||4||8|
|Multiplication (Eq. 15b)||16||32|
While the representation given in Eq. 14 may seem inconsequential, it demonstrates the great potential for improving computational performance by exploiting quaternion arithmetic. As their algebras are isomorphic, it is known that \cref@addtoresetequationparentequation
Although the result of both the quaternion and complex arithmetic may be thought of as to represent the same mathematical object (up to an isomorphism), the computational work required for these operations is different for the two arithmetics, respectively. Table 1 summarizes the number of real floating point operations (FLOPs) required for the operations given in Eq. 15. In this work we adopt the convention that the operation
constitutes a single FLOP. From Tab. 1, we can see that there is a 2x reduction in FLOPs for –arithmetic over generic –arithmetic for the same mathematical operation. We may further note that there is also a 2x reduction in memory operations (MOPs) and computational storage requirements between and data structures. This fact will prove important in the following developments of high–performance quaternion linear algebra.
As forms a normed, associative division algebra, it is natural to consider quaternion linear algebra, i.e. the algebra generated by matrices and vectors of quaternion elements, as an extension of the discussion presented in this subsection. In the following subsection, we briefly review the relevant theory to motivate the usage of explicitly quaternion linear algebra for software implementation.
2.2. Matrices of Quaternions and Quaternion Linear Algebra
Consider the space of matrices with quaternion elements denoted and given by the generic element (Zhang, 1997),
In the same manner as , we define quaternion hermiticity as . Further, we may define a scalar and matrix product operations for , and (Zhang, 1997). \cref@addtoresetequationparentequation
However, unlike real and complex matrices, the loss of scalar commutivity in dictates that we must also consider operations of the form \cref@addtoresetequationparentequation
where, in general, and . In addition, generally , however this is in perfect analogy with real and complex linear algebra. As one may intuitively guess, this loss of scalar commutivity greatly complicates proofs and algorithm development in quaternion linear algebra (Zhang, 1997; Rodman, 2014), often requiring researchers to resort to rather complex and abstract mathematical paradigms, such as algebraic topology (Zhang, 1997; Baker, 1999), to obtain the desired outcomes. Despite these complications, it is possible to extend operations which are important to scientific application, such as matrix inversion (Zhang, 1997; Loring, 2012) and eigenvalue decomposition (Zhang, 1997; Bunse-Gerstner et al., 1989; Baker, 1999; Jia et al., 2018; Li et al., 2019), to . In particular, the set of all invertable quaternion matrices forms a group under the matrix product (Zhang, 1997).
Just as admits a close relationship with and , analogous relationships may be developed between , and . Consider the subset ,
We may define an analogous expression to Eq. 8 for via \cref@addtoresetequationparentequation
with . In the same manner as (Eq. 11), via the map
To construct its relationship to , we examine the representation of a quaternion matrix element,
Thus Eq. 24 may be written in terms of a Kronecker product:
where we have denoted the complex matrix representation of the quaternion matrix with a subscript in analogy with Eq. 14.
|Operation||FLOPs in||FLOPs in|
|Addition (Eq. 26a)|
|Multiplication (Eq. 26b)|
In analogy to Eq. 15, the isomorphism between and admits the following relationships \cref@addtoresetequationparentequation
As in Eq. 15, the amount of computational work required to perform the operations in Eq. 26 in quaternion and complex arithmetic are different. As an extension of Tab. 1, Tab. 2 summarizes differences in the the number of FLOPs required for the same algebraic operation in the two arithmetics, respectively. For simplicity and brevity, the summary in Tab. 2 only accounts for and , though completely analogous results hold for the general rectangular case. Just as in Tab. 1, there is a 2x reduction in both FLOPs and MOPs in utilizing explicitly quaternion arithmetic and data structures over the analogous complex operations. However, this comparison is in terms of a ratio of computational work requirements. In terms of raw differences between the two arithmetics, the potential computational savings scale to some power of the dimension the matrix in question. For example, the difference in the number of FLOPs required for quaternion / complex matrix multiplication is . For small this would not make a drastic difference, but for large , this difference becomes significant. Due to the fundamental and central importance of matrix multiplication in numerical linear algebra, similar comparisons could be made for any matrix operation, such as eigenvalue decomposition or matrix factorization, between complex and quaternion arithmetic.
3. High–Performance Matrix–Multiplication
The cornerstone of high–performance numerical linear algebra software is the optimized implementation of general matrix–matrix multiplication (GEMM). Without an optimized GEMM implementation, operations such as eigenvalue decomposition and matrix inversion become impractical for large matrices. Thus, the first step in the development of high-performance quaternion linear algebra is the development of an optimized quaternion GEMM.
Over the past several decades, an enormous amount of research effort in the fields of numerical linear algebra and HPC has been directed towards the development of optimized GEMM operations on various computing architectures. As a result, many different strategies have been developed for high–performance GEMM implementations (Whaley and Dongarra, 1998; Gunnels et al., 2001; Goto and van de Geijn, 2008; Van Zee and van de Geijn, 2015; Wang et al., 2013; Xianyi et al., 2012). Despite their differences, the common motif among these methods is the rejection of a “one–size–fits–all” development strategy for all computing platforms, i.e. one must explicitly consider and optimize for the underlying features of the computer architecture in question to reach optimal performance. In modern HPC, there are effectively three fundamental aspects of computing architectures which must be considered in the development of optimized GEMM operations (Goto and van de Geijn, 2008):
Efficient and effective utilization of various levels of the computational data and instruction caches.
Utilization of microarchitechture specific features such as single instruction multiple data (SIMD) and fused multiply–add (FMA) operations.
Achieving efficient parallelism on modern multi–core and many–core computing architectures.
To demonstrate the efficacy of quaternion GEMM, we will only consider the former two of these features; leaving the treatment of parallelism for future work. Due to its relative simplicity and portability to general architectures, the development of high–performance quaternion GEMM operations in this work will extend the strategy adopted by the BLIS library for real and complex GEMM operations (Van Zee and van de Geijn, 2015; Van Zee et al., 2016; Low et al., 2016; Van Zee and Smith, 2017). In the BLIS strategy, the aspects of the GEMM operation which must be explicitly optimized for a specific architecture are factored into a manageably small set of auxiliary procedures, referred to as kernels, while the general scaffold for the GEMM remains consistent between architectures. Further, the structure and function of the kernels yielded by this strategy are designed in such a way that they may be used in the implementation of other BLAS-3 functionality such as rank- updates (XSYRK), triangular matrix multiplication (XTRMM), etc. In this section, we examine the salient aspects of this strategy which are agnostic to the data representation and arithmetic operations relating to the matrices being multiplied.
3.1. The General Algorithm
Consider the GEMM of two matrices , over a general ring ,
where and . Computationally, and are stored as linear, contiguous data structures of lengths , and , respectively. In this work, we will consider column–major storage of matrices, i.e. and , and and are stored contiguously in memory.
For comparison in the following, LABEL:alg:reference_gemm outlines the simplest implementation of the GEMM operation which was suggested in the earliest developments of the BLAS standard (Dongarra et al., 1990). This method will be referred to as the reference GEMM algorithm. To fully understand the drawbacks of LABEL:alg:reference_gemm and to motivate the development of a more optimal algorithm, we must examine that nature of the memory hierarchy on modern computers. Figure 1 illustrates a simplified model of a representative memory hierarchy on a modern computer (Goto and van de Geijn, 2008). At the top of the hierarchy, the fastest and least abundant memory resource are the registers which physically reside on the processor. It is on the data which resides in the registers that the processor may issue instructions such as arithmetic operations, etc. On architectures which support SIMD instructions, i.e vector processors with instruction sets such as SSE, AVX, AVX2 and AVX-512, each floating point register can hold a small number of floating point numbers at a time, typically between 2 and 16. However, their abundance is very limited: 16 registers on SSE, AVX and AVX2, and 32 on AVX-512. Thus, to fully exploit the speed of the registers, care must be taken to carefully populate the data which resides there to minimize the data movement between the registers and other levels of the hierarchy.
At the bottom of the hierarchy is the slowest and largest memory resource: the random access memory (RAM). It is in the RAM that the matrices which participate in the GEMM operation are typically stored. The RAM is the memory resource that resides furthest from the processor, which allows it to be orders of magnitude larger than any of the other memory resources (on the order of 1GB-1TB). However, the penalty for its size and distance from the processor is a high access latency. The speed at which data can be moved to and from RAM varies drastically between different architectures and manufacturers, and also depends on factors such as how the data being moved is laid out in memory (contiguous, strided, cache aligned, etc). Generally, reading to and from RAM amounts to hundreds of clock cycles on modern processors. Due to its very high latency, data movement in and out of RAM must be kept to a minimum to achieve optimal performance. As such data is rarely read directly from RAM to the registers or visa versa. Instead, it is typically the case that data is read to and from the RAM to a low latency intermediary storage, known as the data cache, which resides closer to the processor and is thus capable of moving data to and from the registers much faster than would be possible from the RAM.
Due to the slow rate at which RAM may be accessed, typical memory access patterns dictate that the RAM should be read in large chunks of contiguous data into the cache. Whenever a read instruction is issued from the processor for a particular memory address in RAM, it first checks if that data resides in cache. If the data resides in cache, what is referred to as a cache hit, it may be read directly from cache and avoid the RAM completely. However, if the data does not reside in cache when the instruction is issued, the data must be moved from RAM to cache and then read into the registers. This process is referred to as a cache miss. Due to the large latency differential between RAM and cache, the penalty for a cache miss can often be quite large. Further, the restricted size of the cache only allows a limited amount of data can be stored there at any point in time. When the cache reaches its capacity, data which resides in the cache must be replaced when new data is read in from the RAM. The process by which this replacement happens is referred to as the cache’s replacement policy. If data is to be often reused in an algorithm, it is important to ensure that it resides in the cache as often as possible to minimize the probability of a cache miss. Thus, knowledge of the replacement policy is paramount in the development of a strategy for cache population to maximally reuse the data the resides there while not ejecting reusable data with data which is to be used less often.
On contemporary architectures, the cache is divided into cascading “levels”: the L1, L2, L3 caches, etc. The capacity and access latencies for the cache levels vary considerably between processor generations and manufacturers; however, the general trend is to lose an order of magnitude on access latency and gain an order of magnitude in capacity between successive cache levels. For example, the Intel(R) Xeon(R) CPU E5-2660 (Sandy Bridge) processor yields cache capacities of 32 kB, 256 kB, 20 MB and access latencies of 4, 12 and 29 clock cycles for the L1, L2 and L3 caches, respectively (Fog, 2012). It is typically the case that the population of the different levels of cache cannot be explicitly programmed; one typically relies on heuristics issued by the CPU, such as data prefetching and cache replacement policies, to perform this population. However, with knowledge of the sizes of the cache levels and replacement policies, one may develop algorithms which aim to populate these caches optimally for data reuse.
From the perspective of effective utilization of the memory hierarchy and the other aforementioned features of computing architectures, there are a number of drawbacks in LABEL:alg:reference_gemm:
All of is loaded into cache for each column of ,
For large , loading potentially ejects from cache, triggering a cache miss on each update of ,
There is no useful caching of ,
In a high-level programming language, this algorithm relies on an optimizing compiler to utilize SIMD, FMA, etc.
Scalable parallelism is non–trivial.
LABEL:alg:reference_gemm is referred to as a memory bound algorithm, i.e. its performance is completely determined by the latency at which data may be moved to and from the RAM. As such, even for relatively small GEMM operations, performance will be sub-optimal (Goto and van de Geijn, 2008). A demonstration of this state of affairs in the context of quaternion GEMM will be given in Sec. 5.
In order to overcome the memory bottle neck, one must develop an algorithm which populates the levels of cache and registers with sub-matrices of and according how their data may be reused throughout the GEMM operation. For a detailed explanation of the extent to which one may reuse different sub-matrices of and , we refer the reader to the work of (Goto and van de Geijn, 2008). In general, the mechanism by which one achieves optimal cache utilization is through a layered approach to the GEMM operation (Whaley and Dongarra, 1998; Gunnels et al., 2001; Goto and van de Geijn, 2008; Van Zee and van de Geijn, 2015). An optimized layered GEMM algorithm may be constructed through the specification of three caching parameters: , two register blocking parameters: , two packing kernels: PACK1, PACK2, and a microkernel, KERN. A representative example of such an algorithm, specifically the algorithm which has been proposed in the development of the BLIS framework (Van Zee and van de Geijn, 2015; Van Zee et al., 2016; Van Zee and Smith, 2017), is outlined in LABEL:alg:opt_gemm. For simplicity in LABEL:alg:opt_gemm, we have assumed and . However, extension of LABEL:alg:opt_gemm without these constraints is straightforward through zero padding in the packing kernels (Van Zee and van de Geijn, 2015). We note for clarity that the scaling by in LABEL:alg_ln:a_pack of LABEL:alg:opt_gemm may instead be performed in LABEL:alg_ln:b_pack for rings which admit scalar commutivity in the sense of Eq. 20b (i.e. and ). Each of these parameters and kernels must be carefully chosen and optimized for each computer architecture of interest. In the following subsection, we examine the nature of each of these moieties and the factors one must consider in their selection.
3.2. Register Blocking and The Microkernel
Consider the expression of a specific sub-matrix in terms of the corresponding sub-matrices and ,
In other words, may be expressed as a sum of rank–1 updates over rows and columns of and , respectively. As this is the fundamental arithmetic operation of the GEMM operation to be performed by the CPU, and must all reside in the registers for the operation to take place. is referred to as the register block of , with dimensions and . To achieve optimal memory performance, and must be chosen such that and may reside in the registers simultaneously in order to avoid data movement between the registers and other levels of the memory hierarchy (Goto and van de Geijn, 2008).
algocf[t] \end@float In LABEL:alg:opt_gemm, the full product, , is constructed by successively updating each of its (disjoint) sub-matrices via partial summation (over elements) of Eq. 28 with the microkernel performing arithmetic operations which amount to the sum over rank-1 updates. As the arithmetic kernel of the GEMM operation, the microkernel is the fundamental operation which is most sensitive to the underlying computer architecture and is a key factor in the performance of the GEMM implementation. It is in the microkernel that one must explicitly consider microarchitechture specific operations such as SIMD and FMA. As such, optimized GEMM implementations typically do not express the microkernel in a high-level language; it is typically expressed directly in assembly language (Whaley and Dongarra, 1998; Goto and van de Geijn, 2008; Van Zee and van de Geijn, 2015) or with use of low–level access paradigms such as vector intrinsics in C++. An abstract template for a generic microkernel implementation is given in LABEL:alg:abstract_kern.
There is a subtle, yet crucial aspect of the loop expressed in LABEL:alg:abstract_kern in relationship to LABEL:alg:opt_gemm: as all of the arithmetic intensity is folded into the rank-1 updates performed from within the microkernel inner-loop, optimality of the GEMM operation is directly related to the amount of time spent in this loop. In other words, the number of operations performed inside of this loop, whether they be FLOPs or MOPs, must be kept to a minimum to achieve optimal performance. The number of FLOPs required to perform the rank-1 update is fixed based on and , thus optimality is generally achieved through minimizing the number of MOPs performed inside this inner loop. To this end, the microkernel utilizes packed representations, and , of sub-matrices, and , produced by the packing kernels, PACK1 and PACK2, respectively. The remainder of this section is dedicated to the design and optimization of the packing kernels and caching parameters to achieve optimal data movement between levels of the memory hierarchy and to minimize the number of MOPs required to be performed from within the microkernel.
3.3. Sub-matrix Packing for Optimal Data Layout and Cache Utilization
Perhaps the most ingenious aspect of the layered GEMM algorithm outlined in LABEL:alg:opt_gemm is the utilization of auxiliary memory and packing kernels to amortize the cost of data manipulation over the movement of data between the levels of the memory hierarchy (Goto and van de Geijn, 2008). This packing strategy has two primary objectives:
To populate the various levels of the cache with sub-matrices of and according to the extent which they will be reused in the GEMM operation as to minimize probability of triggering cache misses,
To ensure optimal, contiguous data layouts of the packed sub-matrices to minimize the number of operations (FLOPs and MOPs) which must be performed from within the inner loop of the microkernel.
In the following, we will examine both of these objectives in turn.
To optimize data movement for cache utilization, one must obtain optimal choices for the caching parameters and for the architecture of interest. Typically, these parameters are chosen such that (Goto and van de Geijn, 2008; Van Zee and van de Geijn, 2015):
Contiguous storage of size may reside in and be addressed from the L3 cache once the data is loaded from RAM (e.g. ) until it is no longer needed.
Contiguous storage of size may reside in and be addressed from the L2 cache once the data is loaded from RAM (e.g. ) until it is no longer needed.
Contiguous storage of size may be moved from the L3 to the L1 cache without triggering a cache miss or cache invalidation (e.g. ).
Clearly, the choice of these parameters are integrally tied to the sizes of the L1, L2 and L3 caches and the size of the data structure which represents . Several methods exist for determining optimal choices for the caching parameters. There has been work in the development of analytical models and formulas which take into account the specifics of and the architecture in question and return optimal values for the caching parameters (Low et al., 2016). Other approaches utilize guided or black–box optimization (Whaley et al., 2001; Xianyi et al., 2012; Wang et al., 2013), to obtain these parameters. Once these parameters have been determined, the task then becomes to develop efficient packing utilities which optimize the data layout for use with the microkernel.
There are a number of desirable features one wishes to express in the data layout of packed matrices, and , to optimize the data movement between the levels of cache and the registers from within the microkernel:
The elements of and should be contiguous, respectively. As vectors, this amounts to ensuring and are contiguous in memory, and similarly for .
The elements of and which contribute to adjacent register blocks of should be contiguous in memory, i.e. and should be contiguous in memory.
For which is represented by a compound datatype of primitive data, e.g. and , the primitive data for contiguous datastrutures which contain elements of type should be arranged into a data layout which allows for a minimum number of MOPs to be performed from within the microkernel, as long as map between the standard and new data layout is space preserving.
To demonstrate what is meant by a space preserving map in this context, consider an complex element, , which is represented by two primitive real numbers which are contiguous in memory, denoted . For a datastructure which contains two contiguous elements , the data layouts and occupy the same space in memory. Thus a map between these two data layouts would be considered space preserving. While the first two aspects of data packing are well explored in the literature, the latter has not to the best of authors’ knowledge. As will be demonstrated in Sec. 4, optimizing the primitive data layout of contiguous quaternion datastructures will prove important in the development of an optimized quaternion GEMM.
algocf[b] \end@float The fact that the rank-1 updates required by LABEL:alg:abstract_kern and 28 involve both row and column vectors, a single packing strategy would not be sufficient to achieve optimal data layout for both and . Thus, the packing kernels PACK1 and PACK2 must be designed separately to optimize the layouts of and , respectively. An abstract templates for these packing kernels are given in LABEL:alg:pack2_abstract and LABEL:alg:pack1_abstract, respectively. We refer the reader to the work of Van Zee, et al (Van Zee and van de Geijn, 2015) for an intuitive graphical illustration of the optimal packing procedure. To account for the rearrangement of primitive data in the packing procedure, we have introduced two additional operations, PACKOP1 and PACKOP2, to perform this operation for the kernels PACK1 and PACK2, respectively. Note that typical implementations for real and complex GEMM would yield both PACKOP1 and PACKOP2 as either the identity or linear scaling operation.
Due to the large access latency difference between RAM and the other levels of the memory hierarchy, operations performed within PACKOP1 and PACKOP2 have little to no impact on the performance of the GEMM implementation. This is due to the fact that that these operations are to be done in the registers, and are thus amortized over the time it takes to access the data from the RAM. For example, the construction of the packed sub-matrix in LABEL:alg_ln:a_pack of LABEL:alg:opt_gemm requires the scaling of the sub-matrix . As there is a two orders of magnitude latency ratio between RAM access (O(100s) of clock cycles) and the FLOP required to scale an element of the matrix (O(4-5) clock cycles), the cost of the scaling operation may be thought of as negligible. The same logic holds true for data rearrangement operations, such as register transpose, which will be explored in the following section.
4. Quaternion Matrix Multiplication: HGEMM
In this section, we develop the details of a high–performance implementation of quaternion GEMM for the AVX microarchitechture. The primary focus of this section is the development of AVX-optimized versions of the kernels described in the previous section for use with quaternion arithmetic and data structures. In practice, there are two primary features of the AVX microarchitechture that one must consider in the development of optimized GEMM kernels:
processors with support for AVX instructions have (at least) 16 256-bit floating point (YMM) registers, and
AVX dictates support for SIMD (but not FMA) arithmetic instructions on these YMM registers.
For the purposes of this work, we will restrict the discussion of kernel development to double precision floating point storage, i.e. each floating point primitive will occupy 64-bits. As such, each YMM register on AVX can hold and perform arithmetic operations on up to 4 double precision floats, simultaneously. In analogy to the DGEMM and ZGEMM naming conventions of real and complex GEMM operations, we will refer to the double precision quaternion GEMM as HGEMM. As an extension of the standard construction of complex datatypes as two contiguous floats, the following developments will describe quaternion datatypes as four contiguous floats, using the notation of Eq. 1. As such, each AVX YMM register can hold one double precision quaternion (or equivalent) at any point in time.
4.1. Batch SIMD Quaternion Multiplication
Critical to the development of an AVX-optimized quaternion microkernel is an efficient strategy for quaternion product using SIMD arithmetic operations. The the product of quaternions given by the Hamilton product in Eq. 4 requires a minimum of 16 FLOPs to complete. As each YMM register in AVX is capable of storing and manipulating 4 floats at once, one could in principle perform some of these FLOPs concurrently if the task is simply to perform a single quaternion product. However, if the task is to perform many quaternion products in a structured manner, as is the case for the rank-1 updates required by Eq. 28, implementations which optimize for a single quaternion product will yield sub-optimal throughput. To leverage the full power of SIMD instructions in this case, one needs to develop a strategy which aims to perform multiple quaternion products simultaneously at the highest throughput possible. As each YMM register is able to manipulate 4 floats, the simplest manner to reach optimal throughput is to perform 4 quaternion products simultaneously.
Consider the batch quaternion product which takes two sets of four quaternions, and , and returns a set of four quaternion products, . For simplicity in the following, we will augment the product operation to perform an update of the result as opposed to an assignment,
where is the Hadamard (entry-wise) product such that and so on. Each quaternion product (Eq. 4) may be separated into 4 sets of 4 FLOPs which update each component of the result, respectively. In the following, we examine the update of the scalar component of the product, , as a representative example. Note that extension and generalization to vector components of is straightforward. The scalar components of each updated quaternion product may be obtained simultaneously by
In the SIMD paradigm, each of these vectors may be represented by a single YMM register. As such, each of these Hadamard products may be performed by the VMULPD vector instruction and each vector addition (subtraction) by the VADDPD (VSUBPD) vector instruction. In this form, the entire batch quaternion multiplication may be completed using 32 vector instructions.
The structure of Eq. 30 requires that each of the sets , and occupy 4 YMM registers, with each register containing a particular quaternion component of each element in the set, respectively. In other words, one YMM register contains all of the scalar components for each element of , one for the scalar components of , and so on for the vector components of these sets and for the components of . For clarity in the following, we will denote the YMM register containing the scalar components of , and as , and , respectively, and so on for the vector parts of these sets with indices , and . Using this notation, we will define the SIMD implementation of Eq. 29 as
For quaternion data structures which store a single quaternion contiguously, such as the one considered in this work, the vector load instruction (VMOVAPD) would populate each register with the 4 components of a single quaternion. As such, one would need to rearrange the quaternion data once it is read into registers in order to utilize Eq. 31. In general, this rearrangement may be achieved by a 4x4 register transpose on each of the quaternion sets. This register transpose will be denoted MM_4x4_TRANSPOSE_PD in the following and is illustrated graphically in Fig. 2. For clarity, we endow MM_4x4_TRANSPOSE_PD with the function signature
where is a YMM register containing the components of , the components of , and so on. Remark that the result of MM_4x4_TRANSPOSE_PD is not invariant to the permutation of its parameters. Further, we note that MM_4x4_TRANSPOSE_PD is an involution. In general, register transpose is a relatively expensive operation due to the high aggregate latency of the vector instructions (VPERM2F128 and VSHUFPD) involved in its implementation. However, it will be shown in the following subsection that the special structure of the rank-1 update will simplify and cheapen the general register transpose through the use of optimal packing layouts in the GEMM operation.
4.2. The Quaternion Microkernel and Amortization of Register Transpose
Given that AVX only supports 16 YMM registers, the largest register block (Eq. 28) which allows for , and to all reside in registers simultaneously is given by . As such, the quaternion microkernel must perform a sum over 2x2 rank-1 updates to update a register block of . A single 2x2 rank-1 update requires 4 product evaluations given by
where we have dropped the super– and subscripts for brevity. Per the discussion of the previous subsection, these product evaluations may be performed simultaneously using SIMD vector instructions given that the register data arrangement adheres to the structure Eq. 31 via Eq. 32. On top of the 32 vector instructions requires to perform the product accumulations, the general scheme for register transpose depicted in Fig. 2 requires an additional 16 register operations: 8 for transposing the components of and , respectively. The operation overhead is further compounded by the fact that the microkernel performs many () rank-1 updates successively, thus this scheme costs additional operations over the execution of the microkernel. However, such a general approach for register transpose would only be required for 4 unique quaternions, whereas the 4 (unique) products required for the evaluation of Eq. 33 only involve 2 sets of 2 unique quaternions. As such, simplifications to the general register transpose scheme of Fig. 2 may be made in this case.
There are two special cases for register transpose which we must consider for Eq. 33, namely those which represent the data ordering of the elements of and , respectively:
Here, and hold the components of and , respectively, and similarly for and for the elements of . The YMM registers and represent the components of and in the order which they were passed, i.e. has the layout while has the layout and so on. The presence of redundancies in the register transpose allows for factorization of MM_4x4_TRANSPOSE_PD into the convolution of two simpler operations:
An illustration of this state of affairs is given in Fig. 3 with Figs. 2(b) and 2(a) depicting the transpose of elements of and , respectively. The first step of Fig. 2(a) demonstrates the effect of ATRANS1 and the second the effect of ATRANS2, and similarly for Fig. 2(b). The most important aspect of the alternative transpose schemes depicted in Fig. 3 is that both ATRANS1 and BTRANS1 are space preserving. As such, they may be factored into the packing scheme as discussed in Sec. 3.3, leading to an amortization of register operations over data movement from RAM. Further, in the case of ATRANS1, not only does this procedure reduce the number of instructions which must be issued from inside the microkernel loop, it does so in a way that the most expensive (highest latency) register operations required for the transpose are amortized in the packing procedure. In the context of LABEL:alg:pack2_abstract and LABEL:alg:pack1_abstract, this factorization may be accounted for by setting and . Utilizing this packing strategy, the operation overhead for performing register transpose from within the microkernel is reduced by a factor of 3/4 ( to ). LABEL:alg:avx_microkernel outlines the general structure for the AVX optimized HGEMM microkernel. The following section demonstrates its performance.
5. Implementation and Performance Results
HGEMM, as described in the previous section, has been implemented in the quaternion BLAS (HBLAS) component of the HAXX library. HAXX (Hamilton’s Quaternion Algebra for CXX) (Willams-Young, 2019) is a modern C++ software infrastructure developed to enable efficient scalar and linear algebra operations using quaternion and mixed–type (quaternion–complex, quaternion–real) arithmetic. As of this work, HAXX provides reference and optimized serial implementations for a representative subset of BLAS-1,2,3 functionality. For the optimized implementation of HGEMM in HAXX, the arithmetic microkernel has been implemented using C++ vector–intrinsics rather than the assembly implementations which have become ubiquitous in high–performance implementation of DGEMM and ZGEMM. This has been done primarily for the fact that vector intrinsics offer a reasonable balance between transparency in the code-base and potential performance from low-level access to assembly instructions, even if this transparency comes at a slight performance degradation. In this section, we provide performance results for the reference and optimized HGEMM implementations in HAXX for the AVX microarchitechture. All timing results were obtained using an Intel(R) Xeon(R) CPU E5-2660 (Sandy Bridge) @ 2.20 GHz (max 3.0 GHz). The E5-2660 processor admits cache sizes of 32 kB, 256 kB and 20 MB for the L1, L2, and L3 caches respectively. The L3 cache is shared among all cores on the CPU. Theoretical (serial) peak performance double precision arithmetic on this CPU is 24 GFLOP/s. HAXX and all benchmark executables were compiled using the Intel(R) C++ compiler with architecture specific optimizations (‘-xHost’) and interprocedural optimization enabled. To obtain the caching parameters, the open–source autotuning software OpenTuner (Ansel et al., 2014) was employed. On this architecture, the optimal caching parameters were found to be and .
Figure 4 illustrates performance comparisons for three GEMM implementations: the reference (HGEMM-Ref) and AVX-optimized (HGEMM-OptAVX) HGEMM implementations provided in the HAXX library, and the (serial) ZGEMM implementation provided by the Intel(R) Math Kernel Library (MKL) (Version 18.0 Update 1). All timings presented are representative of Eq. 27 for square matrices with and . Timings for the HGEMM implementations are for the matrix product operation on while those for ZGEMM are for the analogous product operation on (see Eq. 26b). The comparison between complex and quaternion operations are presented in this manner to demonstrate the efficacy of the quaternion operation over the complex operation for the same arithmetic operation, i.e. the results of these operations represent the same mathematical object (up to an isomorphism). There two primary results which are to be taken from these numerical experiments:
The timing comparisons depicted in Fig. 3(a) illustrate that quaternion arithmetic alone is not sufficient to obtain performance leverage over tuned complex matrix multiplication. The reference HGEMM implementation is significantly less performant than the ZGEMM implementation found in MKL, while the AVX optimized HGEMM implementation outperforms the ZGEMM operation by roughly a factor of 2 (as would be expected from Tab. 2). Further, as was described in Sec. 3, the reference HGEMM implementation performs significantly under the theoretical peak performance (~7 GFLOP/s vs 24 GFLOP/s) due to the algorithm being memory bound.
Despite a slight difference in the FLOP rate in the GEMM implementations depicted in Fig. 3(b) (~22 GFLOP/s for ZGEMM-MKL and ~21 GFLOP/s for HGEMM-OptAVX), the optimized HGEMM implementation consistently outperforms the optimized ZGEMM implementation even for large matrices ().
In this work, we have demonstrated the efficacy and potential of high–performance quaternion linear algebra to leverage performance increases over complex linear algebra for special class of matrices through the efficient implementation of the quaternion matrix product. The software development proposed in this work extends the existing theory of high-performance serial matrix multiplication for use with explicitly quaternion arithmetic, as outlined in Secs. 4 and 3. A series of numerical experiments given in Sec. 5 have illustrated performance comparisons between reference quaternion, optimized quaternion, and vendor tuned complex GEMM implementations. It was shown that exploitation of quaternion arithmetic alone is not sufficient to outperform high-performance implementations of complex GEMM and that analogous implementations of high-performance quaternion GEMM are necessary to leverage such improvements. Further, it was shown that even in the presence of slight difference the FLOP rate comparisons, the optimized implementation of quaternion GEMM outperforms the optimized implementation of complex GEMM for the analogous arithmetic operation. We note for completeness that while Intel(R) Sandy Bridge and the AVX instruction set are not contemporary in and of themselves, they representative of more contemporary architectures such as the Intel(R) Haswell and AMD(R) Excavator architectures which support the AVX2 instruction set. In the context of the GEMM operation, the primary feature introduced in these architectures is FMA arithmetic instructions. With the exception of architectures which support the AVX-512 instruction set (such as Intel(R) Skylake-X and Intel(R) Knight’s Landing), the SIMD vector units on architectures which support either the AVX or AVX2 instruction sets are 256 bits in length. Thus with the exception of the arithmetic kernel (Eq. 31) and the optimal values of the caching parameters, the remainder of the findings in this work would would be invariant between AVX and AVX2. In summary, the optimized implementation of quaternion GEMM provided by the HAXX library was shown to outperform its MKL optimized complex analogue by roughly a factor of 2 (as would be expected from the discussion in Sec. 2.2). As the architecture on which the numerical experiments were performed is a representative example of modern HPC architectures in general, the results presented in this work would translate to other architectures given that one provides optimized versions of the GEMM kernels for the architecture in question.
As the matrix product is the fundamental building block for the development of important operations such as eigendecomposition and matrix factorization, its efficient implementation is a necessary condition for high-performance linear algebra software. In order for quaternion linear algebra to be a viable alternative to complex linear algebra in problems which it may be applied, optimized implementations of quaternion operations which outperform their complex counterparts must be developed. Although the power of the theory of quaternion algebra in the context of scientific theory and computation has been known for some time, prior to this work, no performant implementation of quaternion linear algebra has been available. It is our hope that the software developments presented in this work will aid and spark interest in the future development of high-performance quaternion linear algebra such that the full power of quaternion arithmetic may be leveraged in computationally intensive fields such as scientific computing and image processing.
Acknowledgements.In the development of HAXX, DWY was supported by a fellowship from The Molecular Sciences Software Institute under National Science Foundation grant ACI-1547580. The development of HAXX has also been supported through the development of the open source Chronus Quantum supported by the National Science Foundation (OAC-1663636 to XL). This work has been further supported in part by the U.S. Department of Energy, Office of Science, Basic Energy Sciences, under Award LAB 17-1775, as part of the Computational Chemical Sciences Program. The authors would like to thank Edward Valeev and Benjamin Pritchard for insightful discussions regarding microarchitechture optimizations and high-performance matrix multiplication and Wissam Sid Lakhdar for aid in the tuning of HAXX. Further, the authors would like to thank Benjamin Pritchard, Wissam Sid Lakhdar, Joseph Kasper and the anonymous reviewers for reviewing the content of the manuscript and providing meaningful insight.
- Adler (1995) Stephen L. Adler. 1995. Quaternionic Quantum Mechanics and Quantum Fields. Oxford University Press, Oxford, U.K.
- Agarwal et al. (1994) Ramesh C. Agarwal, Fred G. Gustavson, and Mohammad Zubair. 1994. Exploiting Functional Parallelism of POWER2 to Design High-Performance Numerical Algorithms. IBM Journal of Research and Development 38, 5 (1994), 563–576. https://doi.org/10.1147/rd.385.0563
- Anderson et al. (1999) E. Anderson, Z. Bai, C. Bischof, S. Blackford, J. Demmel, J. Dongarra, J. Du Croz, A. Greenbaum, S. Hammarling, A. McKenney, and D. Sorensen. 1999. LAPACK Users’ Guide (third ed.). Society for Industrial and Applied Mathematics, Philadelphia, PA. https://doi.org/10.1137/1.9780898719604
- Ansel et al. (2014) Jason Ansel, Shoaib Kamil, Kalyan Veeramachaneni, Jonathan Ragan-Kelley, Jeffrey Bosboom, Una-May O’Reilly, and Saman Amarasinghe. 2014. OpenTuner: An Extensible Framework for Program Autotuning. In Proceedings of the 23rd International Conference on Parallel Architectures and Compilation (PACT ’14). ACM, New York, NY, USA, 303–316. https://doi.org/10.1145/2628071.2628092
- Armbruster (2017) Markus K Armbruster. 2017. Quaternionic Formulation of the Two-Component Kohn-Sham Equations and Efficient Exploitation of Point Group Symmetry. J. Chem. Phys. 147, 5 (2017), 054101. https://doi.org/10.1063/1.4995614
- Arnol’d (1995) Vladimir Igorevich Arnol’d. 1995. The Geometry of Spherical Curves and The Algebra of Quaternions. Russian Mathematical Surveys 50, 1 (1995), 1–68. https://doi.org/10.1070/RM1995v050n01ABEH001662
- Baker (1999) Andrew Baker. 1999. Right Eigenvalues for Quaternionic Matrices: A Topological Approach. Linear Algebra and Its Applications 286, 1-3 (1999), 303–309. https://doi.org/10.1016/S0024-3795(98)10181-7
- Blackford et al. (2002) L. Susan Blackford, James Demmel, Jack J. Dongarra, Iain Duff, Sven Hammarlind, Greg Henry, Michael Heroux, Linda Kaufman, Andrew Lumsdaine, Antoine Petitet, Rolan Pozo, Karin Remington, and R. Clint Whaley. 2002. An Updated Set of Basic Linear Algebra Subprograms (BLAS). ACM Trans. Math. Soft. 28, 2 (2002), 135–151. https://doi.org/10.1145/567806.567807
- Bunse-Gerstner et al. (1989) Angelika Bunse-Gerstner, Ralph Byers, and Volker Mehrmann. 1989. A Quaternion QR Algorithm. Numer. Math. 55, 1 (1989), 83–95. https://doi.org/10.1007/BF01395873
- Cayley (1845) Arthur Cayley. 1845. XIII. On Certain Results Relating to Quaternions: To the Editors of the Philosophical Magazine and Journal. The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science 26, 171 (1845), 141–145. https://doi.org/10.1080/14786444508562684
- Chevalley (1996) Claude Chevalley. 1996. The Algebraic Theory of Spinors and Clifford Algebras: Collected Works. Vol. 2. Springer Science & Business Media.
- Deavours (1973) Cipher A. Deavours. 1973. The Quaternion Calculus. The American Mathematical Monthly 80, 9 (1973), 995–1008. https://doi.org/10.2307/2318774
- Dongarra et al. (1990) Jack J. Dongarra, Jermey Du Cruz, Sven Hammarling, and Iain S. Duff. 1990. Algorithm 679: A Set of Level 3 Basic Linear Algebra Subprograms: Model Implementation and Test Programs. ACM Trans. Math. Soft. 16, 1 (1990), 18–28. https://doi.org/10.1145/77626.77627
- Dongarra et al. (1988) Jack J. Dongarra, Jeremy Du Croz, Sven Hammarling, and Richard J. Hanson. 1988. Algorithm 656: An Extended Set of Basic Linear Algebra Subprograms: Model Implementation and Test Programs. ACM Trans. Math. Soft. 14, 1 (1988), 18–32. https://doi.org/10.1145/42288.42292
- Dongarra et al. (1984) Jack J. Dongarra, John R. Gabriel, Dale D. Koelling, and James H. Wilkinson. 1984. Solving the Secular Equation Including Spin Orbit Coupling for Systems with Inversion and Time Reversal Symmetry. J. Comput. Phys. 54, 2 (1984), 278–288. https://doi.org/10.1016/0021-9991(84)90119-0
- Ekström et al. (2006) Ulf Ekström, Patrick Norman, and Vincenzo Carravetta. 2006. Relativistic Four-Component Static-Exchange Approximation for Core-Excitation Processes in Molecules. Phys. Rev. A 73, 2 (2006), 022501. https://doi.org/10.1103/PhysRevA.73.022501
- Ell et al. (2014) Todd A Ell, Nicolas Le Bihan, and Stephen J Sangwine. 2014. Quaternion Fourier Transforms for Signal and Image Processing. John Wiley & Sons.
- Fog (2012) Agner Fog. 2012. The Microarchitecture of Intel, AMD and VIA CPUs: An Optimization Guide for Assembly Programmers and Compiler Makers. Copenhagen University College of Engineering (2012), 2–29.
- Frobenius (1878) Georg Frobenius. 1878. Über lineare Substitutionen und bilineare Formen. Reimer.
- Goto and van de Geijn (2008) Kazushige Goto and Robert A. van de Geijn. 2008. Anatomy of High-Performance Matrix Multiplication. ACM Trans. Math. Soft. 34, 3 (2008), 12. https://doi.org/10.1145/1356052.1356053
- Grubin (1970) Carl Grubin. 1970. Derivation of the Quaternion Scheme via the Euler Axis and Angle. Journal of Spacecraft and Rockets 7, 10 (1970), 1261–1263. https://doi.org/10.2514/3.30149
- Gunnels et al. (2001) John A. Gunnels, Greg M. Henry, and Robert A. van de Geijn. 2001. A Family of High-Performance Matrix Multiplication Algorithms. In International Conference on Computational Science. Springer, 51–60.
- Hall (2015) Brian Hall. 2015. Lie Groups, Lie Algebras, and Representations: An Elementary Introduction. Vol. 222. Springer.
- Hamilton (1866) William Rowan Hamilton. 1866. Elements of Quaternions. Longmans, Green, & Company.
- Henriksson et al. (2005) Johan Henriksson, Patrick Norman, and Hans Jørgen Aa Jensen. 2005. Two-Photon Absorption in the Relativistic Four-Component Hartree–Fock approximation. J. Chem. Phys. 122, 11 (2005), 114106. https://doi.org/10.1063/1.1869469
- Hopf (1931) Heinz Hopf. 1931. Über die Abbildungen der dreidimensionalen Sphäre auf die Kugelfläche. Math. Ann. 104, 1 (1931), 637–665. https://doi.org/10.1007/BF01457962
- Horwitz and Biedenharn (1984) Lawrence P. Horwitz and Lawrence C. Biedenharn. 1984. Quaternion quantum mechanics: Second quantization and gauge fields. Annals of Physics 157, 2 (1984), 432 – 488. https://doi.org/10.1016/0003-4916(84)90068-X
- Hurwitz (1922) Adolf Hurwitz. 1922. Über die Komposition der quadratischen Formen. Math. Ann. 88, 1-2 (1922), 1–25. https://doi.org/10.1007/bf01448439
- Janovská and Opfer (2003) Drahoslava Janovská and Gerhard Opfer. 2003. Givens’ Transformation Applied to Quaternion Valued Vectors. BIT Numerical Mathematics 43, 5 (2003), 991–1002. https://doi.org/10.1023/B:BITN.0000014561.58141.2c
- Jia et al. (2018) Zhigang Jia, Musheng Wei, Mei-Xiang Zhao, and Yong Chen. 2018. A New Real Structure-Preserving Quaternion QR Algorithm. J. Comput. Appl. Math. 343 (2018), 26–48. https://doi.org/10.1016/j.cam.2018.04.019
- Konecny et al. (2018) Lukas Konecny, Marius Kadek, Stanislav Komorovsky, Kenneth Ruud, and Michal Repisky. 2018. Resolution-of-Identity Accelerated Relativistic Two-and Four-Component Electron Dynamics Approach to Chiroptical Spectroscopies. J. Chem. Phys. 149, 20 (2018), 204104. https://doi.org/10.1063/1.5051032
- Lawson et al. (1979) Chuck L. Lawson, Richard J. Hanson, David R. Kincaid, and Fred T. Krogh. 1979. Basic Linear Algebra Subprograms for FORTRAN Usage. ACM Trans. Math. Soft. 5, 3 (1979), 308–323.
- Le Bihan and Mars (2004) Nicolas Le Bihan and Jérôme Mars. 2004. Singular Value Decomposition of Quaternion Matrices: A New Tool for Vector-Sensor Signal Processing. Signal Processing 84, 7 (2004), 1177–1199.
- Le Bihan and Sangwine (2003) Nicolas Le Bihan and Stephen J Sangwine. 2003. Quaternion Principal Component Analysis of Color Images. In ICIP 2003. Proceedings. 2003 International Conference on Image Processing, 2003., Vol. 1. IEEE, I–809.
- Li et al. (2019) Ying Li, Musheng Wei, Fengxia Zhang, and Jianli Zhao. 2019. On The Power Method for Quaternion Right Eigenvalue Problem. J. Comput. Appl. Math. 345 (2019), 59–69. https://doi.org/10.1016/j.cam.2018.06.015
- Loring (2012) Terry A. Loring. 2012. Factorization of Matrices of Quaternions. Expositiones Mathematicae 30, 3 (2012), 250–267. https://doi.org/10.1016/j.exmath.2012.08.006
- Low et al. (2016) Tze Meng Low, Francisco D. Igual, Tyler M. Smith, and Enrique S. Quintana-Ortí. 2016. Analytical Modeling Is Enough for High-Performance BLIS. ACM Trans. Math. Soft. 43, 2 (Aug. 2016), 12:1–12:18. https://doi.org/10.1145/2925987
- Mocoroa et al. (2006) Santiago Arribas Mocoroa, Antonio Elipe, and Manuel Palacios Latasa. 2006. Quaternions and the rotation of a rigid body. Celestial Mechanics and Dynamical Astronomy 96, 3-4 (2006), 239–251. https://doi.org/10.1007/s10569-006-9037-6
- Nakano et al. (2017) Masahiko Nakano, Junji Seino, and Hiromi Nakai. 2017. Development of Spin-Dependent Relativistic Open-Shell Hartree–Fock Theory with Time-Reversal Symmetry (I): The Unrestricted Approach. Int. J. Quant. Chem. 117, 10 (2017), e25356. https://doi.org/10.1002/qua.25356
- Peng et al. (2009) Daoling Peng, Jianyi Ma, and Wenjian Liu. 2009. On the Construction of Kramers Paired Double Group Symmetry Functions. Int. J. Quant. Chem. 109, 10 (2009), 2149–2167. https://doi.org/10.1002/qua.22078
- Rodman (2014) Leiba Rodman. 2014. Topics in Quaternion Linear Algebra. Princeton University Press.
- Saue et al. (1997) Trond Saue, Knut Fægri Jr., Trygve Helgaker, and Odd Gropen. 1997. Principles of Direct 4-Component Relativistic SCF: Application to Caesium Auride. Mol. Phys. 91, 5 (1997), 937–950. https://doi.org/10.1080/002689797171058
- Saue and Helgaker (2002) Trond Saue and Trygve Helgaker. 2002. Four-Component Relativistic Kohn–Sham Theory. J. Comput. Chem. 23, 8 (2002), 814–823. https://doi.org/10.1002/jcc.10066
- Shiozaki (2017) Toru Shiozaki. 2017. An Efficient Solver for Large Structured Eigenvalue Problems in Relativistic Quantum Chemistry. Mol. Phys. 115, 1-2 (2017), 5–12. https://doi.org/10.1080/00268976.2016.1158423
- Shoemake (1985) Ken Shoemake. 1985. Animating Rotation with Quaternion Curves. ACM SIGGRAPH computer graphics 19, 3 (1985), 245–254.
- Stuber and Paldus (2003) Jason L. Stuber and Joesf Paldus. 2003. Symmetry Breaking in the Independent Particle Model. In Fundamental World of Quantum Chemistry: A Tribute to the Memory of Per-Olov Löwdin. Kluwer Academic Publishers, 67–139.
- Sudbery (1979) Anthony Sudbery. 1979. Quaternionic analysis. In Mathematical Proceedings of the Cambridge Philosophical Society, Vol. 85. Cambridge University Press, 199–225. https://doi.org/10.1017/S0305004100055638
- Van Zee and Smith (2017) Field G. Van Zee and Tyler Smith. 2017. Implementing High-performance Complex Matrix Multiplication via the 3m and 4m Methods. ACM Trans. Math. Soft. 44, 1 (July 2017), 7:1–7:36. https://doi.org/10.1145/3086466
- Van Zee et al. (2016) Field G. Van Zee, Tyler Smith, Francisco D. Igual, Mikhail Smelyanskiy, Xianyi Zhang, Michael Kistler, Vernon Austel, John Gunnels, Tze Meng Low, Bryan Marker, Lee Killough, and Robert A. van de Geijn. 2016. The BLIS Framework: Experiments in Portability. ACM Trans. Math. Soft. 42, 2 (June 2016), 12:1–12:19. https://doi.org/10.1145/2755561
- Van Zee and van de Geijn (2015) Field G. Van Zee and Robert A. van de Geijn. 2015. BLIS: A Framework for Rapidly Instantiating BLAS Functionality. ACM Trans. Math. Soft. 41, 3 (June 2015), 14:1–14:33. https://doi.org/10.1145/2764454
- Visscher and Saue (2000) Lucas Visscher and Trond Saue. 2000. Approximate Relativistic Electronic Structure Methods Based on the Quaternion Modified Dirac Equation. J. Chem. Phys. 113 (2000), 3996. https://doi.org/10.1063/1.1288371
- Wang et al. (2013) Qian Wang, Xianyi Zhang, Yunquan Zhang, and Qing Yi. 2013. AUGEM: Automatically Generate High Performance Dense Linear Algebra Kernels on x86 CPUs. In 2013 International Conference for High Performance Computing, Networking, Storage and Analysis (SC). IEEE, 1–12.
- Whaley and Dongarra (1998) R. Clinton Whaley and Jack J. Dongarra. 1998. Automatically Tuned Linear Algebra Software. In IEEE/ACM Conference on Supercomputing, 1998. IEEE, 38–38.
- Whaley et al. (2001) R. Clint Whaley, Antoine Petitet, and Jack J. Dongarra. 2001. Automated Empirical Optimizations of Software and the ATLAS project. Parallel Comput. 27, 1-2 (2001), 3–35. https://doi.org/10.1016/S0167-8191(00)00087-9
- Willams-Young (2019) David B. Willams-Young. 2019. HAXX: Hamilton’s Quaternion Algebra for CXX. https://github.com/wavefunction91/HAXX
- Xianyi et al. (2012) Zhang Xianyi, Wang Qian, and Zhang Yunquan. 2012. Model-Driven Level 3 BLAS Performance Optimization on Loongson 3A Processor. In 2012 IEEE 18th International Conference on Parallel and Distributed Systems. IEEE, 684–691.
- Zeng et al. (2016) Rui Zeng, Jiasong Wu, Zhuhong Shao, Yang Chen, Beijing Chen, Lotfi Senhadji, and Huazhong Shu. 2016. Color Image Classification via Quaternion Principal Component Analysis Network. Neurocomputing 216 (2016), 416–428. https://doi.org/10.1016/j.neucom.2016.08.006
- Zhang (1997) Fuzhen Zhang. 1997. Quaternions and Matrices of Quaternions. Linear Algebra and Its Applications 251 (1997), 21–57. https://doi.org/10.1016/0024-3795(95)00543-9