SIMD Parallel MCMC Sampling with Applications for Big-Data Bayesian Analytics

# SIMD Parallel MCMC Sampling with Applications for Big-Data Bayesian Analytics

## Abstract

Computational intensity and sequential nature of estimation techniques for Bayesian methods in statistics and machine learning, combined with their increasing applications for big data analytics, necessitate both the identification of potential opportunities to parallelize techniques such as Monte Carlo Markov Chain (MCMC) sampling, and the development of general strategies for mapping such parallel algorithms to modern CPUs in order to elicit the performance up the compute-based and/or memory-based hardware limits. Two opportunities for Single-Instruction Multiple-Data (SIMD) parallelization of MCMC sampling for probabilistic graphical models are presented. In exchangeable models with many observations such as Bayesian Generalized Linear Models (GLMs), child-node contributions to the conditional posterior of each node can be calculated concurrently. In undirected graphs with discrete nodes, concurrent sampling of conditionally-independent nodes can be transformed into a SIMD form. High-performance libraries with multi-threading and vectorization capabilities can be readily applied to such SIMD opportunities to gain decent speedup, while a series of high-level source-code and runtime modifications provide further performance boost by reducing parallelization overhead and increasing data locality for Non-Uniform Memory Access architectures. For big-data Bayesian GLM graphs, the end-result is a routine for evaluating the conditional posterior and its gradient vector that is 5 times faster than a naive implementation using (built-in) multi-threaded Intel MKL BLAS, and reaches within the striking distance of the memory-bandwidth-induced hardware limit. Using multi-threading for cache-friendly, fine-grained parallelization can outperform coarse-grained alternatives which are often less cache-friendly, a likely scenario in modern predictive analytics workflow such as Hierarchical Bayesian GLM, variable selection, and ensemble regression and classification. The proposed optimization strategies improve the scaling of performance with number of cores and width of vector units (applicable to many-core SIMD processors such as Intel Xeon Phi and Graphic Processing Units), resulting in cost-effectiveness, energy efficiency (‘green computing’), and higher speed on multi-core x86 processors.

## 1Introduction

Many inference problems in statistics and machine learning are best expressed in the language of probabilistic graphical models, where a probability distribution function (PDF) over a high-dimensional parameter space can be motivated as the product of a collection of terms, each a function of a subset of the parameters. Inference in such models requires summarizing the joint PDF, which can be quite complex and lacking closed-form integrals. Monte Carlo Markov Chain (MCMC) sampling techniques offer a practical way to summarize complex PDFs for which exact sampling algorithms are not available. For many real-world problems, MCMC sampling can be very time-consuming due to a combination of large data sets, high dimensionality of joint PDF, lack of conjugacy between likelihood and prior functions, and poor mixing of the MCMC chain. Fast MCMC sampling is, therefore, important for wider adoption of probabilistic models in real-world applications.

For many years, software developers could rely on faster processors to see improved performance, without any need for code modification. In the past decade, however, chip manufacturers have warned of single-core performance saturation as CPU clock rates and instruction-level parallelism reach their physical limits ([?]). Instead, we are seeing a steady rise in core counts and width of vector units, culminating in the emergence of many-core Single-Instruction-Multiple-Data (SIMD) architectures such as Graphic Processing Units (GPUs) 1 and Intel Xeon Phi 2. Today, a CPU purchased for around \$3000 can offer a theoretical peak double-precision performance of more than 300 GFLOPS 3, and many-core processors selling for nearly the same price have broken the Tera FLOP barrier 4. For scientific computing applications to take full advantage of such enormous performance potential within a single compute node, the entire parallelism potential of an algorithm must be efficiently exposed to the compiler, and eventually to the hardware.

The remainder of this paper is organized as follows. In Section 2 we lay the theoretical foundation for the paper, including a summary of previous research on parallel MCMC and overview of two single-chain parallelization techniques for parallel MCMC of Directed Acyclic Graphs (DAGs). In Section 3 we do a detailed performance analysis and optimization of parallel MCMC for Bayesian GLM (focusing on logistic regression). In Section 4 we discuss several extensions to the ideas developed in Section 3, including Hierarchical Bayesian (HB) GLM, calculation of derivatives, Ising model, batch RNG, distributed and many-core computing, and compile-time loop unrolling. Section 5 contains a summary of our results and pointers for future research.

## 2Parallel MCMC for Graphical Models

### 2.1Previous Work on Parallel MCMC

A straightforward method for parallel MCMC is to run multiple chains ([?]). Since each chain must go through the burn-in period individually, multi-chain parallelization is less suitable for complex models with poor convergence where a big fraction of time might be spent on the burn-in phase. The most common single-chain MCMC parallelization strategy is concurrent sampling of conditionally-independent nodes within the Gibbs sampling framework ([?]). This is also known as graph coloring or chromatic sampling. Example applications include Hierarchical LDA models ([?]) and Markov Random Fields ([?]). In molecular dynamics simulations, assumption of short-range intermolecular forces provides for similar domain decompositions ([?]). This parallelization strategy is reviewed in Section 2.3. In synchronous Gibbs sampling ([?]), all variables are updated simultaneously - rather than sequentially - in each Gibbs cycle, regardless of whether conditional independence holds or not. Despite its theoretical and practical shortcomings, this embarrassingly parallelizable approach remains popular in large-scale MCMC problems such as Latent Dirichlet Allocation (LDA) models for topical analysis of text documents ([?]). Parallel message-passing algorithms are discussed in [?] with application to discrete factor graphs. [?] present CPU and GPU parallelization of multivariate slice sampling ([?]) with application to linear Gaussian processes. [?] propose a parallel ‘look-ahead’ sampler for discrete-valued densely-connected graphical models such as Boltzmann Machines and LDA. This approach takes advantage of the observation that, in densely-connected networks, contribution from any given node to the conditional posterior of a target node is small. A similar strategy, called ‘pre-fetching’ has been proposed ([?]) in the context of one-block Metropolis-Hastings algorithms. [?] propose an approximate parallel Gibbs sampler - with GPU implementation - for DNA motif finding, where parallelization is achieved through replacing sequential random steps by a deterministic, pre-ordered search of the parameter space.

In terms of target platforms, MCMC parallelization work has primarily focused on shared-memory multi-threading ([?]), distributed-memory parallelization using explicit or implicit message-passing ([?]), and more recently, many-core architectures such as GPUs ([?]). This leaves two areas under-explored: efficient use of vector instructions for SIMD parallelization, and NUMA awareness for efficient utilization of memory bandwidth. Addressing these two areas is important in light of trends towards higher core counts and multi-socket processors with NUMA architecture, and increasing width of vector units. Today’s desktop processors are increasingly dual-socket, while quad-socket setups are available for high-end servers. Since each socket has a separate memory controller, a processor’s access to memory attached to other sockets is mediated through a slower path such as Intel QPI 5, thus creating a non-uniform access to memory. On the other hand, vector units have steadily widened over recent generations to reach 128 bits for SSE4, and 256 bits for AVX instruction set extensions. Intel Xeon Phi co-processors have doubled the width yet again to 512 bits. The combined effect of vectorization and NUMA awareness for a dual-socket x86 can reach an order of magnitude: A fully-vectorized code can run up to 4 times faster for double-precision calculations, while minimizing cross-socket communication on a dual-socket processor can double the performance for a memory-bound application. Our paper aims to help better utilize single-server capacities of a modern x86 processor for parallel MCMC sampling. In addition to achieving more efficient execution on x86 processors, our performance optimization techniques pave the way for further speedup on many-core processors since the underlying principles of reducing parallel overhead and increase data locality apply to all architectures. This strategy has been dubbed ‘transforming and tuning’ ([?]).

### 2.2Gibbs Sampling of Graphical Models

A graphical model consists of nodes (or vertices) connected by links (or edges) 6. In a probabilistic graphical model, each node represents a random variable (or a group of random variables), and links express probabilistic relationship between these variables. The graph represents a decomposition of the joint probability distribution over all variables into a product of factors, each consisting of a subset of variables. In DAGs, links have directionality indicated by arrows, and no closed path exists such that one can always travel in the direction of arrows. For a DAG with nodes, the joint distribution is given by the following factorization

where represents the set of parents of node , and . For undirected graphs, the joint distribution is expressed as a product of potential functions over maximal cliques of the graph:

where is the normalization factor, also known as the partition function.

Inference and prediction in graphical models often requires calculating the expected values of functions of with respect to the joint distribution . In many cases, closed forms for integrals do not exist and approximate techniques such as MCMC can be used to approximate the joint distribution with a finite sample. A popular MCMC technique is Gibbs sampling ([?]) which is a special case of the Metropolis-Hastings (MH) algorithm ([?]) with 100% acceptance rate. In Gibbs sampling, we iterate through the stochastic nodes and draw samples from the probability distribution for each node, conditioned on the latest samples drawn for all remaining nodes. Below is an outline of the Gibbs sampling algorithm:

1. Initialize all nodes .

2. For iteration

• Sample .

• Sample .

• Sample .

• Sample .

Gibbs sampling is appealing because it reduces the problem of sampling from a high-dimensional distribution to a series of low-dimensional sampling problems, which are often easier, especially when the resulting low-dimensional distributions have standard sampling algorithms (e.g. multivariate Gaussian, Gamma, etc.). When conditional distributions cannot be directly sampled, a hybrid approach can be used where an MCMC algorithm such as Metropolis ([?]) or slice sampler ([?]) is applied within the Gibbs framework ([?]).

### 2.3SIMD Parallel MCMC for DAGs

Here we present two canonical parallelization opportunities for DAGs to provide a foundation for our analysis of Bayesian GLM. Equivalent ideas for undirected graphs are discussed in Section 4.3 in the context of Ising model and Boltzmann machine.

According to Equation 1, a given node appears in the joint distribution in two ways: a) self-term, i.e. the term that expresses the probability of the node conditioned on its parents, and b) child terms, i.e. terms corresponding to the conditional probability of the node’s children. Therefore, the conditional distributions appearing in Gibbs sampling algorithm can be decomposed as follows:

This decomposition allows us to identify the following two single-chain parallelization modes for Gibbs sampling of DAGs.

0pt Parallel Sampling of Conditionally-Independent Nodes: Consider two nodes and . If their joint conditional distribution is factorizable, i.e. if , then the nodes are called conditionally-independent and can be sampled simultaneously in each iteration of Gibbs sampling. From Equation 2, we can infer the following theorem about conditional independence: [section]

Condition (a) means that neither node appears in the self term of the other node. Condition (b) means that neither node appears in any of the child terms for the other nodes. Therefore, neither node appears in the conditional distribution of the other node. This is sufficient to prove that nodes are conditionally-independent and can be sampled concurrently in Gibbs sampling.

The above theorem can be used to determine if any pair of nodes are conditionally-independent or not. However, constructing the most efficient partitioning of a graph into node blocks, each of which consists of conditionally-independent nodes is non-trivial and an NP-hard problem ([?]). For specific graph structures, such as HB models ([?]), conditionally-independent node blocks can be constructed rather easily. This parallelization mode can be difficult to vectorize, especially for non-conjugate, complex posteriors (such as GLM) where drawing a sample for each node involves multiple function calls including random number generation (RNG) and a sampling routine. As we see for Ising models, however, for simpler distributions, this parallel mode can be partially vectorized using the strategy discussed in Section 3.4, with significant contribution towards overall performance of the sampling routine.

0pt Parallel Calculation of Conditional Posterior: First, consider the conjugate case, where the product of prior and each child-term contribution retains the functional form of the prior. For example, in LDA models a Dirichlet prior is used on the distribution of topics in each document. The likelihood contribution of each token in a document is a categorical distribution, with their collective contribution being a multinomial distribution whose parameter vector is the sum of contributions from all tokens. Similarly, with a conjugate Dirichlet prior on distribution of words in each topic, the contribution from each document (and tokens within each document) takes an additive form towards the posterior. In such cases, calculating the parameters of the posterior distribution can be parallelized, followed by a reduction. Second, consider the case where conjugacy does not exist, and rather than drawing independent samples from the posterior we must create a Markov chain that converges to the posterior. For most MCMC sampling techniques such as Slice sampler ([?]) or Metropolis sampler ([?]), the majority of time during each sampling cycle is spent on evaluating the conditional log-posterior function (e.g. within the Gibbs sampling framework) or its derivatives (Section 4.2). Therefore, any speedup in function evaluation translates, nearly 1-to-1, into MCMC speedup. Similar to the conjugate case, the additive contribution of child terms can be calculated in parallel, followed by a reduction step. In both cases, when the model is exchangeable ([?]) as in the case of i.i.d. observations, the contributions from child nodes are symmetric, resulting in a SIMD computation.

Non-conjugate cases are typically better candidates for SIMD parallelization since 1) they tend to take a larger fraction of each sampling cycle, 2) in conjugate cases, we need a larger number of observations before the cost of calculating the conjugate posterior parameters dominates the cost of generating a random deviate from the posterior. GLM models are good candidates for SIMD Gibbs sampling due to non-conjugacy of prior and likelihood functions, and due to high computational cost of each function evaluation, especially for long and wide data and in the presence of transcendental functions (Section 3.1).

*********

Figure ? summarizes the two single-chain parallelization modes for DAGs discussed above. In relative terms, parallel sampling of conditionally-independent nodes can be considered coarse-grained, while parallel evaluation of log-likelihood can be considered fine-grained. In the remainder of this paper, we use two examples - Bayesian GLM and Ising model - to demonstrate how each of these modes can be parallelized on multi-core x86 processors using multi-threading and vectorization.

## 3SIMD Parallel MCMC for Bayesian GLM on Multi-core x86

### 3.1Setup

#### Running Example: Bayesian Logistic Regression

GLMs ([?]) are the workhorse of statistics and machine learning, with applications such as risk analysis ([?]) and public health ([?]) among others. They can be extended to handle data sparseness and heterogeneity via HB framework ([?]), or to account for repeated measurements and longitudinal data via Generalized Linear Mixed Models ([?]).

For our purposes, GLM models have the following log-likelihood function:

where is the vector of features or explanatory variables for observation and has a length , is the response or target variable, is the coefficient vector of length , and is the number of observation points. Logistic regression is an important member of GLM family, where each binary response variable is assumed to follow a Bernoulli distribution whose probability is . The log-likelihood becomes

with being a binary response variable with possible values . In a Bayesian setting, we impose a prior distribution on such as a multivariate Gaussian with parameters and , leading to the DAG representation of Fig. ?. Unless we are dealing with very small data sizes, calculation of prior contribution is often computationally negligible compared to the likelihood calculation, and hence we focus on the latter throughout this paper.

#### Hardware

We use a dual-socket 2.6GHz Intel Xeon E5-2670 7 with 8 cores per socket (16 total cores), 20MB L3 cache per processor (shared among 8 cores), 256KB of L2 cache per core, and 32KB of L1 cache per core (separately for data and instructions each). Total installed RAM is 128GB, with maximum memory bandwidth of 51.2GB/sec per socket. Since each processor has its own memory controller, the dual-socket setup creates a Non-Uniform Memory Access (NUMA) environment (Section 8.8.2 of Intel Optimization Reference Manual 8). This processor supports the AVX instruction set extensions, using 256-bit wide floating-point registers YMM0-YMM15. Hyper-threading is turned off for our experiments.

#### Software

All code is written in C/C++ 9. (The only C++ feature used is template metaprogramming for compile-time loop unroll. See Section 4.6.) We use the Intel software stack for compiling and micro-benchmarking as well as high-performance mathematical routines:

• Intel Math Kernel Library (MKL) (part of Intel Composer XE 2013): We use MKL for vectorized and multithreaded BLAS calls (dgemv/ddot), vectorized transcendental functions (as part of Vector Math Library or VML 10), and vectorized random number generation (RNG) as part of Vector Statistical Library or VSL 11. Switching between single-threaded and multi-threaded MKL is done via the link option ’-mkl=sequential’ or ’-mkl=parallel’.12

• Intel C++ Compiler (also part of Intel Composer XE 2013): Combined with Intel VML, the compiler allows us to vectorize functions of transcendental functions. All codes tested in this paper were compiled using optimization flag -O2.

• Intel Performance Counter Monitor (PCM)13, an open-source C++ wrapper library for reading micro-benchmarking (core and uncore) events from Intel’s Performance Monitoring Units, including cache hit rate, clock cycles lost to memory stalls, and instructions per cycle.

#### Parallel Programming

Our overall approach is to maintain software portability and developer productivity while achieving good performance. For multi-threading, we use the OpenMP API and Intel’s implementation of it (as part of Intel Composer XE 2013). OpenMP offers a mix of high-level and low-level compiler directives and runtime functions for creating and managing threads, and offers an efficient alternative to working directly with Pthreads. OpenMP is perhaps today’s most-widely used API for multi-threading in HPC applications. Other notable options for multi-threading include Cilk Plus 14 (also handles vectorization) and Threading Building Blocks 15 from Intel, Chapel 16 from Cray, X10 17 from IBM, and Phoenix MapReduce 18 from Stanford. For vectorization, we choose ‘guided auto-vectorization’ 19, i.e. we use pragmas and code modifications to guide the compiler to vectorize the code. We find this option to strike a balance between performance and portability/productivity. Other options include vector intrinsics and assembly code. Another parallelization option is distributed-computing on a server cluster using Message-Passing Interface (MPI) 20, MapReduce 21 or Hadoop 22. Given the fine-grained nature of our SIMD parallelization approach and significant network latency and bandwidth restrictions in distributed computing, however, focusing within a single server is the logical choice. Our parallelization concepts, however, can be combined with coarse-grained options and thus incorporated into distributed computing frameworks. See Section 4.5.

Note that while we are using Intel’s hardware and software, the optimization lessons learned are applicable to other platforms, including open-source software. More importantly, the software itself is portable since we are using standard APIs (C/C++ standard library, OpenMP, BLAS).

### 3.2Baseline Implementation

To calculate the log-likelihood in Equation 3, we can consolidate ’s into rows of a matrix , and use a matrix-vector multiplication routine such as BLAS dgemv to calculate . We refer to this first map operation ([?]) as the Linear Algebra (LA) map. This is followed by an element-wise transformation of involving transcendental functions, and a final reduction step. We call this second map operation the Transcendental (TR) map. The result is the loglike routine in Fig. ?. Since each map operation is finished before the start of the new map, this strategy is a Sequence of Maps (SOM) pattern ([?]). Using an off-the-shelf, optimized BLAS library such as Intel MKL provides automatic vectorization and multithreading of the dgemv routine. Further parallelization can be achieved by multithreading the TR map using the omp parallel for pragma. This same loop can also be vectorized, but it requires access to a vector math library for transcendental functions, as well as compiler support for vectorization of functions of transcendental functions. The Intel Composer suite offers both such capabilities. The simd pragma in Fig. ? directs the Intel C++ compiler to ignore any potential vectorization barriers (such as pointer aliasing between Xbeta and y) and generate vector instructions for the TR map. 23.

--------------------------------------------------------------------------double loglike(double beta[], double X[], double y[], int N, int K) {
double r=0.0;
double *Xbeta = new double[N];

/* Linear Algebra (LA) Map */
cblas_dgemv(CblasRowMajor,CblasNoTrans,N,K,1.0,X,K,beta,1,0.0,Xbeta,1);

/* Transcendental Loop (TR) Map + Reduction */
#pragma omp parallel for reduction(+:r) num_threads(NTHD) schedule(static)
#pragma simd reduction(+:r)
for (int n=0; n

Fig. ? (left) shows the performance of our baseline implementation, measured in ‘Clock Cycles Per Row’ (CPR), defined as total time (measured in CPU clock cycles) for each function evaluation, divided by , the number of rows in 24. Time per function evaluation is an average over many iterations. All iterations share the same and - allowing for caching - but is refreshed from one iteration to the next. This mimics the real-world application in MCMC sampling, where observed data does not change throughout the sampling iterations but the coefficients are refreshed. We see that simply adding the OpenMP pragma to the TR loop leads to a 2.4x speedup (MKL+), while adding the simd pragma to the TR loop provides another 1.6x speedup (MKL++), for a total of 3.9x compared to the automatic parallelization of LA offered by Intel MKL library.

Achieving 4x speedup by adding two lines of code (and spending a few hundred dollars on the Intel C++ compiler) is a great return on investment. But we also see two alarming trends, depicted in Fig. ? (right): 1) For small data sizes (), performance scales poorly as we increase number of threads, 2) As data sizes increase towards , absolute performance decays. It appears that we have both a small-data and a big-data problem. Next we do a deeper performance analysis of the code, paving the way for optimization strategies discussed in Section 3.4. These optimizations will add significant speedup to the baseline performance achieved by the code in Fig. ?.

### 3.3Performance Analysis

In LA map, doubles per row are consumed 25 and 1 double per row is produced. LA involves multiply/add calculations per row. The TR map consumes 1 double per row and produces a scalar after reduction, but involves 2 transcendental functions (exp and log) and a few arithmetic operations; as such, it is expensive but , i.e. it does not scale with . For small ’s (around 10), the TR step dominates computation but when approaches 100, the scale tips towards the LA map.

#### Hardware-Induced Performance Limits for Big Data

We can use hardware specifications 26 of our server to derive memory-based and compute-based upper bounds on performance of our code in the big-data regime. This regime is formally defined as combinations of and such that 1) is much bigger than L3 cache and must therefore be fetched entirely from memory in each iteration, 2) is so large that the LA step dominates computation, making FLOPS ( multiplications and additions) a reasonable approximation of computation performed per row. The compute-based upper-bound on performance (i.e. minimum CPR) is derived as follows:

Similarly, we calculate the memory-based performance ceiling:

Comparing the two bounds, we see that in the big-data limit we are memory-bound nearly by a factor of 10. Achieving this memory-based performance ceiling is non-trivial and requires code modification. We now discuss two key factors that adversely affect performance scaling of our baseline (SOM) code shown in Fig. ?.

#### Memory Stalls

As gets larger, total size of and exceeds memory size at various levels of the hierarchy. As shown in Fig. ? (top right), the L3 transition marks a drastic performance degradation. Micro-benchmarking (Fig. ?, bottom panels) verifies that this transition phase is associated with a drop in L3 cache hit rate, and an increase in clock cycles lost to L3 cache misses. This second performance barrier can be viewed as a ’big-data’ problem. Note that L2 transition has little impact on performance, as seen in speedup and lost-CPU-cycle measures. This suggests that there is sufficient computation per row to hide L2 latency. Furthermore, for data sizes that are small enough to fit into L2 the thread management overhead tends to be the dominating factor. Given these observations, we must focus on improving L3 cache utilization as well as memory bandwidth utilization. We note that there is indeed room for improving memory-bandwidth utilization, given the gap between the actual and best big-data performance, seen in Fig. ? (horizontal line in right panel). This inefficiency is inherited from the MKL-BLAS routine which is responsible for the LA map step in the log-likelihood function. This makes sense because LA map is the primary consumer of data and hence the performance bottleneck in big-data regime. We validated this claim by running dgemv routine alone and measuring the resulting CPR, seeing that the gap with memory-based performance ceiling remains virtually unchanged (data not shown).

### 3.4Performance Optimization

The baseline code in Fig. ? contains two disjoint parallel regions, one inside the dgemv call of LA map (with an implicit nested loop over rows and columns of ) and one in the TR map. As a result, we incur the multi-threading overhead twice, and fusing these two loops should reduce such overhead by half. A straightforward way to fuse the two loops is to move the dgemv call inside the TR loop by converting it to a ddot BLAS level 1 call. Each iteration of the fused loop serves to calculate the contribution from one row of the data towards the log-likelihood function. This routine can be considered a literal implementation of Equation 3, and the outcome shown in Fig. ? is an example of code/loop fusion technique used to transform a SOM into a Map of Sequences (MOS) ([?]). While this version does reduce the multi-threading overhead, yet it can prevent full vectorization of the loop by injecting a function call inside of it. Although it is possible to improve performance through inlining and unrolling the ddot loop (see Section 4.6), the MOS pattern is too inflexible for other cases such as calculating function derivatives Section 4.2 or the Ising model Section 4.3. We thus opt for an alternative, more general approach that reduces the multi-threading overhead of the SOM code in Fig. ? while maximizing vectorization coverage.

--------------------------------------------------------------------------double loglike(double beta[], double X[], double y[], int N, int K) {
double r=0.0;
double xbeta;
#pragma omp parallel for reduction(+:r) num_threads(NTHD) schedule(static)
#pragma simd reduction(+:r)
for (int n=0; n

The solution is presented in Fig. ?, and we refer to it as Partial Loop Fusion (PLF). It works by introducing an outer OpenMP parallel region. Each thread within the parallel region receives N/NTHD rows of data to process. Inside the parallel region, we switch to the single-threaded version of MKL, which remains vectorized. The TR map is explicitly vectorized (but requires no OpenMP parallelization since it is wrapped in a parallel region). We now have a single parallel region, incurring half the parallelization cost of code in Fig. ?, while maintaining full vectorization. For illustrative simplicity, we have assumed that N is divisible by NTHD, but generalizations can be easily handled with a few more lines of code. The PLF approach is a general strategy for combined multi-threading and vectorization of a sequence of maps, and its applications and merits are further discussed in later sections (see Sections Section 4.2 and Section 4.3).

--------------------------------------------------------------------------double loglike(double beta[], double X[], double y[], int N, int K) {
double f=0.0;
int Ny=N/NTHD, Nx=Ny*K;
{
double *Xbeta = new double[Ny];
cblas_dgemv(CblasRowMajor,CblasNoTrans,Ny,K,1.0,X+m*Nx,K,beta,1,0.0,Xbeta,1);

double rBuff=0.0;
#pragma simd reduction(+:rBuff)
for (int n=0; n

#### Reducing Memory Stall

When total data size exceeds L3 cache size, subsets of data that fit in L3 must be moved in and out from memory. Therefore, in big-data regime, efficient memory bandwidth utilization is crucial. In multi-socket servers, each physical processor provides its own memory controller and cross-socket memory transactions are mediated through a slower point-to-point interconnect such as Intel QPI 27 or AMD’s HyperTransport 28. This creates asymmetric memory access latency and bandwidth depending on whether memory transaction happens within-socket or cross-socket. In contrast to Symmetric Multi-Processor Systems (SMPs) that suffer from memory bandwidth contention, NUMA topologies offer total memory bandwidth that scales with number of processors, but to utilize this total bandwidth we must localize data access within each socket as much as possible. To this end, we seek to ‘specialize’ each socket on one half of data (assuming all 16 cores are being used), by taking the following two steps:

1. Split and in half, and allocate each half on the memory banks associated with one processor. The implementation is described below.

2. Utilize a compact thread-core affinity policy, which packs threads 0-7 on the first socket, and threads 8-15 on the second processor. This is done through the following system call:

export KMP_AFFINITY=“granularity=fine,compact”

.

3. Modify the multi-threaded log-likelihood routine to point all threads on each socket to their respective socket-local arrays.

A note on NUMA-aware memory allocation is warranted here. Commodity servers use a virtual memory system to hide the physical RAM address space and present a single linear address space to all applications. This makes memory affinity management (step 1) highly OS-specific. Providing automatic and efficient NUMA-aware memory allocation and task scheduling to applications running on multi-socket processors is an active area of research for operating systems, compilers and libraries ([?]). The Linux header file numa.h 29 along with the libnuma library offers a collection of functions for NUMA-aware memory allocation and thread scheduling. In this paper, we take advantage of the first-touch policy adopted by the Linux kernel 30, where physical pages are allocated to the local memory controller of the thread that first ’touches’ the page, i.e. reads from or writes to a memory address within that page. More specifically, we spawn 2 threads, attach each to one processor, and then use a standard heap management library such as malloc or new to allocate memory for one half of the data and copy the data from the master copy into socket-local arrays. The resulting memory allocation and log-likelihood code is shown in Fig. ?.

--------------------------------------------------------------------------/* NUMA-aware memory allocation */
double *X1, *X2, *y1, *y2;
int N2=N/2, NK2=N*K/2;
{
if (tid==0) { // socket 1 gets first half of data
X1 = new double[NK2];
y1 = new double[N2];
memcpy(X1, X, (NK2)*sizeof(double));
memcpy(y1, y, (N2)*sizeof(double));
} else if (tid==Ncore/2) { // socket 2 gets second half of data
X2 = new double[NK2];
y2 = new double[N2];
memcpy(X2, X+NK2, (NK2)*sizeof(double));
memcpy(y2, y+N2, (N2)*sizeof(double));
}
}
...
/* NUMA-aware log-likelihood function */
double loglike(double beta[], double X1[], double X2[], double y1[], double y2[], int N, int K) {
double f=0.0;
int Ny=N/NTHD, Nx=Ny*K, NTHD2=NTHD/2;
{
double *Xp, *yp;
if (m

*********

Fig. ? shows the results of applying the above two performance tuning techniques, PLF and NUMA adjustments, in succession (labelled as ‘PLF’ and ‘NUMA-PLF’). PLF reduces multi-threading overhead, leading to nearly 2x speedup for small data sizes, while NUMA adjustments also produce a nearly 2x speedup in the big-data regime. Our big-data performance is now within 46% of memory-based hardware limit (top left, horizontal line).

Minimizing cross-socket memory transactions reinforces the assumption that task durations in the parallel region are nearly equal. This allows for a static thread scheduling to remain effective, and we can avoid the extra overhead associated with dynamic thread scheduling (Fig. ?, top left).

#### Cache contention vs. Data Locality

When using all 16 cores, we see clear benefit from increasing data locality by using a compact affinity policy in conjunction with NUMA-aware memory allocation. When using fewer than 16 cores, the picture is more complex, as seen in Fig. ? (bottom right). If total data size is larger than per-socket L3 but smaller than total L3, then spreading out the threads across sockets is advantageous since it allows for better utilization of total L3 (L3 contention regime). In contrast, when all data fits within L3 on each socket, keeping all cores on the same socket reduces inter-thread communication (L3 locality regime). When data is much bigger than total L3, it must be fetched from memory and the NUMA topology favors a compact affinity (NUMA locality regime). This tension between data locality and cache contention in NUMA multi-core systems has been noted in the literature ([?]).

#### Differential Update

The optimization strategies discussed so far have broad applicability, as they were focused on restructuring a given computation to minimize parallelization overhead and maximize data locality. Here we present another optimization technique, ‘differential update’, that specifically applies to MCMC sampling. It seeks to remove unnecessary computation during Gibbs cycles and applies to graphs with continuous as well as discrete nodes. In the context of Bayesian GLM, it is more useful for univariate samplers such as slice sampler, and less useful for multivariate samplers such as Metropolis ([?]) or Hamiltonian Monte Carlo (HMC) ([?]). It is also less useful for samplers that require evaluation of log-likelihood derivatives (Section 4.2). In this section we apply differential update to Bayesian GLM, and in Section 4.3 we discuss the concept for discrete graphs such as Boltzmann Machines and LDA.

Recall that in Gibbs sampling, we cycle through stochastic nodes and draw from the conditional distribution of each node, given the most recent samples for the remaining nodes. Using a univariate sampler such as the slice sampler ([?]), each of the elements of the coefficient vector are updated one at a time. So both inside the sampling algorithm as well as at the end of it, only one is being updated at a time, while the other components remain fixed. The key insight in differential update technique is that we can update changes to following a change to without having to perform the full computation. Instead, we recognize that

where is the result of removing column , i.e. , from . From the above expression, we can see that a change to leads to the following change to :

Our strategy, therefore, is to maintain a copy of throughout the Gibbs sampling process, updating it according to the above formula after proposed or materialized changes to ’s. Fig. ? shows the resulting function. This strategy both reduces the amount of computation needed (in LA), and more importantly, reduces the data that must be read into memory nearly by a factor of . For large , this optimization has very significant impact.

--------------------------------------------------------------------------double loglike(double dbetak, double Xbeta[], double Xt[], double y[], int k, int N, int K) {
double f=0.0;
#pragma omp parallel for num_threads(NTHD) reduction(+:f)
#pragma simd reduction(+:f)
for (int n=0; n

Fig. ? shows results for differential update. The left panel shows significantly better scaling of performance with number of cores for differential update method (‘DiffUpdate’), compared to the default (‘NoDiffUpdate’) method for and . The right panel shows the impact of reduced data movement, where the differential update performance does not deteriorate when total data size exceeds L3 (in contrast to the NoDiffUpdate method). Essentially, DiffUpdate has expanded the range of ’s for which data fits within L3 by a factor of . Fig. ? compares micro-benchmarking results for DiffUpdate and NodiffUpdate for , (with vectorization and multi-threading on all 16 cores). We see that the overall speedup of can be broken down into a x factor due to higher instruction execution rate (less memory stall), while the remaining x is due to fewer instructions, i.e. reduced computation.

**********

Fig. ? summarizes the impact of performance improvement techniques applied to the problem of calculating the log-likelihood for the logistic regression problem, using parameters , . Note that in all cases we are using all 16 cores on the machine, and thus the performance improvement reflects more efficiency rather than simply using more resources.

While our focus was on logistic regression, same techniques are directly applicable to other type of GLM such as Poisson regression. While the detailed steps in the TR map change, the LA map stays the same, and so do the underlying concepts of minimizing parallel overhead and data movement.

## 4Extensions

In this section, we explore several extensions of the performance optimization concepts developed in the previous section in the context of Bayesian GLM.

### 4.1Hierarchical Bayesian GLM

The multi-level framework combines heterogeneous but similar units into a single model, allowing for information sharing without complete homogeneity ([?]). HB allows for consistent treatment of uncertainty and incorporation of prior knowledge in multi-level models, and has been applied to quantitative marketing ([?]), visual information processing ([?]), medicine ([?]) and changepoint analysis ([?]), among others.

HB GLM can be constructed from a GLM DAG such as Fig. ? by replicating it times, with being the number of regression groups. Each unit has a distinct set of explanatory and target variables as well as coefficients. These DAGs are tied together by sharing the parent nodes of the group coefficients. Since ’s are not parents to each other, nor do they have common children (due to groups having distinct observations), then according to Theorem ? they are conditionally independent and can be sampled concurrently within a Gibbs sampling framework. Within each group, as with ordinary GLM, we can calculate the child-node contributions to the log-likelihood of each in parallel. The factorizable log-likelihood functions for each coefficient vector is given by:

Full discussion of HB parallelization is beyond the scope of this paper (and subject of future research). Here we focus on a key question: Given the two parallel modes available (parallel sampling of group coefficients, and parallel calculation of log-likelihood function for each group), how should available hardware resources, i.e. vectorization and multi-core multi-threading, be allocated to each mode?

Vectorization

In Section 3.2 we discussed how to fully vectorize the log-likelihood function by adding the simd pragma before the TR loop in ?, which produced a nearly 3x speedup (exact numbers depend on parameter values). Since the coarser parallelism mode does not lend itself to a SIMD form (due to presence of function calls and complex conditional code e.g. in the slice sampling routine), it cannot benefit from vectorization. We therefore continue to dedicate vectorization to parallel log-likelihood calculation, and see a near-multiplicative speedup as a result.

How should the available cores be allocated to the two parallel modes? For example, should all 16 cores be dedicated to parallel sampling of coefficients, with log-likelihood calculation being single-threaded (but vectorized), or should we sample one group at a time but use a multi-threaded version of log-likelihood evaluator, or somewhere in-between such as 2x8, 4x4, or 8x2 arrangements? From the perspective of coverage and granularity ([?]), mapping coarse-grained parallelism to multi-threading is a better option since the longer duration of parallel tasks leads to better amortization of parallelization overhead. However, data locality considerations suggest a more complicated story, as seen in Fig. ?. We have compared the performance of two extreme resource allocation strategies for HB logistic regression problem with 100 regression groups and 50 covariates in each regression group, as a function of (uniform) group size (). We have looked at two scenarios, one where the log-likelihood function for each group is evaluated only once per iteration (), and one where it is evaluated 10 times (). The latter is more representative of a real-world HB problem where each log-likelihood must be evaluated as often as a few hundred times for a univariate slice sampler. The interplay between Neval and mapping strategy is illuminating. When Neval is 1, data for each regression group is not reused until the next iteration, while when Neval is 10, data is reused and we see that mapping fine-grained parallelism to multi-threading is advantageous - by more than 3x - over a range of data sizes where each group fits in L3 but data size per group times NTHD exceed L3 capacity. The latter is the total amount of data processed by all threads at any given time, under the coarse-parallelism approach. This is supported by the measured L3 cache hit rates (right panel), where we see a much better cache reuse over the middle range of data when using multi-threading for SIMD parallel mode.

The broader lesson here is that, for data sizes and access patterns where cache-friendliness matters, locality considerations may favor using multi-threading for a fine-grained, cache-friendly parallelism over a coarse-grained but less cache-friendly parallelism. Similar situations to HB can arise in other settings, especially in the data analytics lifecycle. Examples are 1) variable selection using forward or backward selection ([?]) where many candidate explanatory variables are added to (or removed from) a base set and coarse-grained parallelism corresponds to parallel execution of the estimation algorithm against all candidate sets, 2) independent regressions on groups as a pre-cursor to HB, 3) execution of multiple models on various subsets of data and using different feature sets in the context of ensemble learning ([?]). In all these cases, the coarse (task) parallelism is not cache-friendly, and a cache-friendly, fine-grained parallelization of the underlying algorithm can outperform task parallelism. This conclusion reinforces the case for continued investment by the HPC community in developing efficient, parallelized implementations of popular statistical and machine learning algorithms.

This discussion can be further generalized to the case where coarse-grained parallelism corresponds to parallel execution of multiple programs in a multi-user environment. In this case, the operating system is responsible for allocating processor cycles against all running applications, and has the unenviable task of striking a balance between maintaining data locality by offering long time slices to each process, and keeping the schedule ‘fair’ by frequently switching context between processes ([?]).

A more thorough analysis of HB parallelization requires attention to other topics such as static-vs-dynamic thread scheduling and NUMA-aware memory allocation. For example, a primary application of HB framework is where some regression groups are so small that they need to borrow information from other, larger groups. Therefore, assuming that all groups have the same size is unrealistic. But non-uniform group sizes means non-uniform task durations, raising the possibility that a dynamic scheduling may be better than static scheduling. This, however, may have negative impact on data locality if cores are assigned to different groups each time. It is possible that the best strategy is to use a proxy for execution time as a function of group size and pre-allocate groups to threads to have even distribution and maintain a static mapping to preserve data locality. It must be noted that such potential complexities and performance hits associated with coarse-grained parallelism strengthen the case for using multi-threading with fine-grained parallelism. Further exploration of such issues is the subject of future research.

### 4.2Calculating Derivatives

In the context of MCMC sampling of probabilistic DAGs, speeding up the conditional log-likelihood function is highly impactful because the majority of time in a method such as Slice Sampler ([?]) or Metropolis ([?]) is spent on function evaluation. There exist other sampling techniques such as Adaptive Rejection ([?]), HMC ([?]) and its variants such as Riemann Manifold HMC ([?]) or No-U-Turn Sampler ([?]), and Hessian-based Metropolis-Hastings ([?]), which use higher derivatives such as the gradient vector or the Hessian matrix. Similarly, in optimization it is advantageous to use the function derivatives when available ([?]). Here we discuss parallel calculation of the gradient vector for GLM. Extension of this work to higher derivatives, namely Hessian, is the subject of future research.

Starting with Equation 3, we use indexed functions absorb the dependence on and differentiate once to get the gradient vector:

where . The expanded form of Equation 4 makes the sum over contributions from each observation explicit, while the compact form of Eq. ? hides this.

Fig. ? shows the baseline implementation of the routine for calculating the log-likelihood function and its gradient, following the SOM pattern. The routine now consists of a sequence of three maps: 1) calculate (first LA map), 2) apply and to each element of (TR map), 3) calculate (second LA map). Note that for the last step we have switched to a column-major version of (referred to as Xt in code) for better data locality and parallelization scaling in the dgemv routine. Fig. ? shows the performance of this baseline implementation. We can apply the same PLF and NUMA optimizations discussed in Section 3.4 here as well, with similar results. However, even after these two adjustments, we are quite off from the hardware limit. For example, Fig. ? shows that for , after PLF and NUMA changes we have a CCR of 28, which is significantly higher than the memory-based minimum value of 10.4. This is somewhat expected since gradient calculation adds to the amount of computation per row. But we can do better, as explained below.

--------------------------------------------------------------------------double loglike_fg(double beta[], double X[], double Xt[], double y[], int N, int K, double g[]) {
double *Xbeta = new double[N];
double *gf = new double[N];

cblas_dgemv(CblasRowMajor, CblasNoTrans, N, K, 1.0, X, K, beta, 1, 0.0, Xbeta, 1);

double f=0.0;
#pragma omp parallel for num_threads(NTHD) reduction(+:f)
#pragma simd reduction(+:f)
for (int n=0; n

#### Cache Fusion

We mentioned the primary advantage of MOS over SOM parallel pattern as one of consolidating the parallel regions and hence reducing thread management overhead. There is a related benefit in terms of minimizing data movement. In SOM, the output of each map stage is likely to be written to memory if array sizes are large, thus forcing the next map stage to reach down to the memory to retrieve the input it needs (assuming that the output of first map is input to second map). By transforming to MOS, we allow all operations to be chained and hence preventing the data from travelling unnecessarily to and from memory, or even cache since they will remain in registers. Our PLF strategy allowed us to minimize thread management overhead, but it did not achieve the same data locality as what would have been achieved in MOS. Cache fusion ([?]) allows us to maintain better cache locality. The idea is to break the map into smaller chunks and execute them sequentially, thus making inputs and outputs of each stage smaller, possibly fitting within L1 or L2. This strategy offers measurable but small benefit for function evaluation (data not shown). But with the addition of derivatives, enhancing cache locality has a more significant impact since there are more steps and more computation. Fig. ? shows the code for function and gradient calculation with all three improvements (PLF, NUMA, Cache Fusion). We refer to the entire setup as Hierarchical Loop Structure (HLS), which can be a useful parallel pattern for other problems as well.

Fig. ? shows the impact of NCHUNK on performance of the code in Fig. ?, for and . For these problem dimensions, we see that: 1) we must choose the chunks small enough to fit within L2 to get the best performance (NCHUNK=50), but making the chunks too small has adverse impact on performance (left panel). The corresponding improvement in L2 hit rate can be seen in the right panel. Further research is needed to better understand the tradeoffs involved in reducing chunk size for L2 locality.

In Fig. ? we have shown the impact of adding cache fusion to the previous optimization techniques, for . For each value of , we chose the largest chunk size that would fit within L2 (256KB) (also ensuring that is it a multiple of for code simplification). We see that the impact is significant for large , e.g. reducing CPR from 28 to 14.5, bringing it within 40% of hardware limit (10.4).

--------------------------------------------------------------------------double loglike_fg(double beta[], double X1[], double X2[], double y1[], double y2[]
, int N, int K, double g[]) {
memset(G, 0, K*sizeof(double));
int Ny=N/NTHD, Nx=Ny*K, NTHD2=NTHD/2;
double f=0.0;

{ // start of parallel omp region
double *Xp, *yp;
if (m

In function evaluation, threads must synchronize while writing to the shared buffer holding the output. This was done through the reduction clause in the omp and simd pragmas in Figs. ? and ?. There is no equivalent construct for performing reduction on arrays (with the exception of FORTRAN OpenMP). We must instead use a omp critical pragma, as seen in Fig. ?. It is interesting to note that our PLF strategy for assigning jobs to threads has enormously reduced the overhead of this critical region, compared to the overhead of a brute-force MOS implementation shown in Fig. ?. For example, we measured the penalty of the critical region for the PLF vs. MOS implementations for and , using 16 threads. For PLF, the critical region imposes an equivalent CPR overhead of 1.3, while for MOS, the same number is 475! Critical regions incur two types of penalty: first, the serialize the code and second, they require lock management by runtime. Both these components are drastically reduced in the PLF approach since the critical region is only entered NTHD times, here being 16. In contrast, in MOS each job execution enters critical region once, amounting to times. As an alternative to a critical region, we can use omp atomic pragma before the addition statement inside the loop surrounded by the critical region construct in Fig. ?. This slightly reduces the first penalty (due to shortening the serial code length) but increases the second penalty since mutex region is entered K times as many. Testing this on the MOS code, we see the penalty increasing from 475 to 1419. This suggests that lock management is the key performance factor in our synchronization operations. In contrast, a simple reduction of a scalar shows a much smaller performance hit when switching from PLF to MOS (0.2 CPR). Also, note that there is no support for vectorization of array reductions, and this is another blow to performance of the MOS implementation.

There is an equally important reason to avoid the MOS pattern of Fig. ? as it relates to the last step (inside the critical region): The updates made to the array g[] by each thread will cause its cache lines to ping pong between cores 31, creating another large performance hit. We validated this in two ways: 1) After removing the omp critical pragma (no synchronization), we measured CPR with and without the update line: g[k]+=gtmp[k];. We saw that this update line creates a 240-CPR penalty. 2) We measured L2/L3 hit rates and saw a drastic increase for both cache levels after removing the update line: 0.03% 71% (L2) and 0.02% 95% (L3). Again, the PLF approach works much better because each core does as much of the calculations privately, and updates the global g[] only once at the end. The cache hit rates for the code in Fig. ? agree with this story (76% and 98% for L2 and L3, respectively).

Finally, note that the differential update strategy will not be as effective when we need to calculate the gradient. This is because 1) calculating Xbeta is a smaller part of the entire computation, and 2) more importantly, differential update does not free us from the need to read the entire into cache during sampling for each , since calculation of cannot be simplified to require only .

--------------------------------------------------------------------------inline double loglike_child_fg(double beta[], double x[], double y, int K
, double g[]) {
double xbeta = cblas_ddot(K, x, 1, beta, 1);
double fbase, gbase, hbase;
double r=1.0+exp(-xbeta);
double fbase = -(log(r)+(1.0-y)*u), gbase = y-1.0/r;
for (int k=0; k

**********

Fig. ? summarizes the optimization techniques applied to parallel evaluation of log-likelihood and its gradient for Bayesian GLM, and the impact of each technique on the small-data and big-data problems. For big-data regime (), we reach within 25% of the memory-bandwidth-induced CPR. This is 5x faster than our baseline implementation of Fig. ? (with or without TR parallelization), and 2.5x faster than the performance of Intel MKL’s first dgemv call in the same code. NUMA adjustments explain this performance gap. Cache fusion provides the remaining 2x speedup, and impacts the chain of map sequences used in the routine by increasing L2 cache locality. Also note that small-data techniques (TR map parallelization and PLF) will be important when embedded in composite problems of larger size, such as HB GLM (Section 4.1).

### 4.3Bayond GLM

So far, our discussion of SIMD parallel MCMC focused on Bayesian GLM. These models have two important characteristics: 1) lack of conjugacy between prior and likelihood for means we have no direct sampling technique available for the conditional posterior and must therefore use MCMC techniques whose core computational work involves repeated function and derivative evaluations, 2) the conditional posterior function is computationally expensive, scaling linearly with number of observations () as well as number of attributes () in the problem. Two implications follow from the above properties: 1) Random number generation does not constitute a significant fraction of sampling time, 2) the sampling routine is too complex for the coarse-grained parallelism (concurrent sampling of conditionally-independent nodes) to be vectorizable.

In this section, we look a different class of graphical models, Markov Random Fields (or undirected graphs) with discrete nodes. We will see that, despite several important differences between MRFs and GLMs, the PLF framework continues to be helpful, with a few new twists to consider.

#### Ising Model

We consider a square-lattice Ising model with binary nodes, with each node interacting with its 4 adjacent nodes. While Ising model has its origins in statistical mechanics ([?]), it has been applied to statistical inference problems such as image reconstruction and denoising ([?]) and neural information processing ([?]). The energy function for Ising model is given by:

where is the state vector, is the interaction weight between neighbors and , and ’s control the bias terms. The sum occurs over pairs of and that are neighbors. In image de-noising, bias terms reflect the pixel intensity for the noisy image, serving to induce the output to represent the input. ’s enhance correlation among neighboring pixels to reduce noise. The configuration probability is given by the Boltzmann distribution:

where the normalizing factor is called the partition function, and is given by . In Gibbs sampling of Ising model, the probability of node is given by:

where and sum is over neighbors of node .

This graph has a 2-color decomposition (Fig. ?), and nodes within each color can be sampled concurrently. This is the coarse parallelism discussed in Section 2.3. On the other hand, since the sum in contains only 4 terms, the fine-grained parallelism opportunity is unattractive. It is clear, therefore, that multi-core multi-threading must be dedicated to the only available parallel mode. But the outstanding question is, can vectorization be applied here as well? Recall that in HB GLM, due to complexity of sampling steps for each and availability of fine-grained parallelism, we did not consider vectorization for coarse-grained parallelism. But in Ising model, the sampling steps for each node are significantly simpler, as described below:

1. Calculate , the contribution from the neighbors. This consists of 4 multiplications and 5 additions.

2. Calculate according to Equation 5. This involves 1 exponentiation, 1 division, and 1 addition.

3. Generating a sample for . This consists of generating a random deviate from uniform distribution, and comparing it to the probability calculated in step ?.

As always, the ideal scenario is to vectorize all steps. However, we face two challenges in doing so:

1. Memory Access Patterns: In step ?, fetching neighbor states of a given node requires a gather memory operation. Handling boundary conditions adds further complexity to the code. Both of these can render vectorization inefficient or impossible.

2. RNG: Making an external function call to a pre-compiled RNG library in step ? can act as a vectorization blocker.

Here again we can take advantage of the PLF strategy for incremental vectorization. Given the memory access pattern challenge of step ? and its small contribution to total computation per sample, we choose to leave it scalar. As for step ?, we propose a batch RNG approach.

Batch RNG Batch generation of uniform random deviates used in step ? offers two advantages. Firstly, it removes RNG call from the sampling loop, allowing it to be vectorized more efficiently. Secondly, it speeds up RNG by reducing the function call overhead and allowing for use of high-performance libraries that utilize vector instructions or other forms of parallelization. Examples include Intel VSL 32, Nvidia’s cuRAND 33, and the SPRNG project 34. Applying the batch RNG strategy is more involved for complex distribution. See Section 4.4.

Data Layout Optimization Vectorization efficiency for step 2 depends on data alignment for (and ). If the entire network occupies a contiguous block of memory, then data for each (colored) graph partition will not be contiguous. Such non-unit-stride memory accesses have a particularly negative impact on vectorization. To overcome this, we can extract each sub-graph and write it to new arrays, perform Gibbs sampling on these contiguous-memory arrays, and transform the results back to the original arrangement at the end (Fig. ?).

Fig. ? shows the impact of 3 optimization techniques on performance of Gibbs sampling for an Ising model used in denoising a 560-by-516 binary image (1000 samples). We followed the specification and parameters in [?] (Chapter 8). Switching to batch RNG provides a 4.9x speedup for the RNG part of the code (84.1 17.1 clock cycles per sample). Vectorization results in a 2.3x speedup in the non-RNG part of the code (75.4 32.5). Finally, switching to unit-stride memory access led to a 1.5x speedup in non-RNG part (32.5 21.7). The cumulative impact of all 3 optimization techniques (on the entire code) is a 4.3x speedup.

#### Boltzmann Machine

A close cousin of Ising model is the Boltzmann machine ([?]), where nodes are densely connected rather than having the lattice connection of Ising model. As a result, 1) there is no conditional independence among nodes and hence the graph cannot be colored for parallel sampling, 2) each node is potentially connected to a large number of nodes, creating a SIMD parallelization opportunity for calculation of ’s. A variant of Boltzmann machine is the Restricted Boltzmann machine (RBM) ([?]), a key ingredient in building deep belief nets ([?]) that are seeing increasing application in supervised and unsupervised learning of generative models for image ([?]), video ([?]) and speech ([?]) data. RBMs have a bipartite graph structure, where visible and hidden nodes are densely connected, but no connection amongst nodes of each layer is allowed. This leads to conditional independence of hidden nodes, allowing for their parallel sampling. In choosing between the fine- and coarse-grained parallel modes to apply vectorization, we must consider these factors: 1) Fine-grained parallel mode consists of a summation over contribution from nodes connected to a given node. Its arithmetic complexity is low, but vectorization is easy. 2) Coarse-grained parallel mode contains the inner, summation loop. Efficient vectorization of this step within the coarse-grained parallelism would require loop unroll (see Section 4.6). 3) Steps ? and ? from the Ising mdoel are repeated here, and we can use strategies discussed before. Note that dense connectivity can alleviate or remove the problem of suboptimal memory access patterns.

#### Differential Update and Atomic Vector Operations

We introduced the differential update strategy in the context of Bayesian GLM models (Section 3.4). This strategy can also be used with discrete-node graphs such as Ising models, Boltzmann machines and LDA models. For example, consider the image de-noising example introduced in this section. We measured transition rates for binary pixels in the model, finding that after the first few iterations less than 10% of the pixels switch value. If we switch to a differential update approach where we maintain intermediate data structures and , and update them each time a node has a state transition, we see a nearly 2x speedup. However, combining this differential update approach with vectorization can be a challenge. If two nodes are updated simultaneously and they share neighbors, they will attempt to update the same elements of and , requiring atomic operations inside vectorized code. Unfortunately, current versions of x86 processors do not support atomic operations ([?]), and it is likely that future support will be rather inefficient. Similar opportunities and challenges exist for RBMs where conditional independence allows hidden nodes to be updated concurrently, but they could attempt to write to same locations in intermediate arrays. In non-collapsed Gibbs sampling of LDA, a differential update strategy is commonly used where matrices representing counts of words by topic and topics by document are maintained and updated after each new topic assignment for a token ([?]). Vectorization of parallel sampling for topic assignments will face the same challenge of needing to perform atomic operations on these count matrices.

### 4.4Batch RNG for Complex Distributions

In Section 4.3 we saw that using a batch RNG for uniform deviates in Gibbs sampling of Ising model led to a nearly 2x speedup (Fig. ?) and paved the way for vectorization of concurrent sampling of nodes of the same color. Distributions such as uniform and normal are easy to handle in batch process due to their linear transformation properties, allowing a deviate with standard parameters to be converted to a deviate of arbitrary parameters:

where is the uniform distribution on the real interval . Similarly, for the normal distribution we have:

where is the normal distribution with mean and standard deviation . If a distribution does not enjoy such properties, we cannot pre-generate its random deviates in advance without knowing the value of parameters needed. The best we can do is to pre-generate the ‘component’ random deviates that are used in the algorithm for sampling from the desired distribution. As a concrete example, consider the Dirichlet distribution that is used in LDA models as a conjugate prior to categorical and multinomial distributions of topics in documents and words in topics ([?]):

where , , and . Dirichlet deviates can easily be constructed from Gamma deviates 35, which in turn require uniform (and possibly normal) deviates. We took the Gamma sampling algorithm described in [?] 36, and adapted it to use a buffer of uniform and normal random deviates, generated in batch. This Gamma sampler was then used in the Dirichlet sampler. For reference, we used a baseline Gamma sampler which generated its uniform and normal deviates one at a time. In both versions, Intel VSL routines were used for uniform and normal deviates. We tested the batch and one-at-a-time versions of Dirichlet sampler on the Enron email data set from UCI repository ([?]), specifically for sampling , the probability of topics (100) for each document (39861). A flat (Dirichlet) Jeffreys prior was used for all documents. One-at-a-time Dirichlet sampler had a performance of 1263 clock cycles per Gamma sample, while the batch version had a performance of 198 clock cycles, showing a speedup of 6.4x. Fig. ? summarizes the speedup offered by batch vs. one-at-a-time RNG for uniform, normal, Gamma, and Dirichlet distributions.

Our results suggest that there is value in RNG libraries offering batch generation capabilities. In the above example, we could not use the Gamma sampler offered by Intel VSL since its interface does not offer an option of passing buffers of uniform or normal deviates, forcing us to write the Gamma sampler from scratch.

### 4.5Beyond Single-Node x86

The focus of this paper has been on maximizing performance of a single, shared-memory x86 compute node for parallel MCMC sampling. Going beyond single-node x86 can be driven by three needs: 1) more memory, 2) more memory bandwidth, and 3) more computing power. In distributed-memory clusters, total memory and compute power scale linearly with node count, and so does total memory bandwidth between each node’s cores and its private memory. However, typically-slower inter-node connections mean inter-node costs of synchronization and data exchange are higher than their intra-code counterparts. As such, the two principles of minimizing parallel overhead, and minimizing data movement must be followed more rigorously in distributed computing. In fact, we can consider our NUMA adjustments, discussed in Section 3.4, as a precursor to distributed computing. Asymmetric treatment of cross-socket vs. within-socket communication, while continuing to take advantage of the convenience of shared-memory programming, can be considered halfway between writing symmetric, shared-memory parallel code and the standard message-passing-based approach to distributed computing on a cluster.

Many-core architectures such as Intel Xeon Phi and GPUs offer another option for adding memory bandwidth and compute power. They can be used as an alternative to, or in conjunction with, distributed computing. They are best suited for floating-point SIMD parallelism since they dedicate more transistors to floating-point operations and less to out-of-order execution, branch prediction, and speculative execution ([?]). Intel Xeon Phi co-processors can be considered an extension of x86 processors with higher core count (up to 60) and wider vector units (512-bit, holding 8 doubles). Fig. ? compares the compute and memory performance of an Intel Xeon Phi 7120P and an Nvidia Tesla K40 against our x86 test system. Based on these numbers, we can say, in broad terms, that compute-bound problems can gain 3.5x speedup while memory-bound problems can gain between 2x-3.5x in speedup by switching from the an x86 processor to a many-core co-processor. As with x86, approaching these hardware limits requires tuning to minimize parallel overhead and maximize data locality.

There are two other factors beyond the raw performance of co-processors that must be considered in the cost/benefit analysis of using co-processors (also referred to as ‘devices’): 1) memory capacity, 2) data transfer and parallelization overhead. As we see in Fig. ?, memory capacity is significantly lower for current generation of co-processors, compared to modern x86 servers. Therefore, as data sizes exceed device memory, the device may become significantly more complex to accommodate the need to move data in and out of device memory. In our GLM example, at , the Tesla K40 card will run out of space for larger than 31.6 million rows, while the same limit on the host system will be reached at data sizes 32 times larger.

Unless the code is ported completely to the device, so that there is no need to send data back and forth between the host and device memory, we will incur data transfer cost, due to non-zero latency and finite bandwidth. On GPU cards, kernel launch incurs an additional cost, while on Xeon Phi where we may launch more than 100 threads, the thread management overhead can be significantly higher than the numbers we discussed for the host system. For example, in Bayesian GLM we may want to use the device only for function evaluation and run the sampler on the host, or there might be other parameters besides that are sampled. As a result, each function evaluation may incur the data transfer overhead associated with 1) sending from host to device, 2) launching the kernel on the device, and 3) sending the log-likelihood result from device back to the host. To measure all these overhead components, we used an empty kernel inside a for loop, as seen in Figure 3 (left). (This test was performed on a Tesla K20 card.) This loop took nearly 20sec to run per iteration, equal to CPU clock cycles. If , this translates into a CPR overhead of 5.2. Recall that, using the differential update approach on the CPU we reached a CPR of less than 2.0 using all 16 cores (Fig. ?). Therefore, even ignoring the memory and compute limits of the device, simply the data transfer and kernel launch remove any incentive to port the code to the device, unless port a sufficient portion of the code to remove the need for such frequent kernel launches and data transfers. Figure 3 (right) shows the expected speedup as a result of device hardware limits as well as kernel launch and data transfer overhead. This analysis can be used in making an informed decision on whether or not to port the code (and how much of it) to the device.

------------------------------------------for (int i=0; i<1000; i++) {
, cudaMemcpyHostToDevice);
doNothing<<<1,1>>>();
cudaMemcpy(&f, fpDev, 1*sizeof(double)
, cudaMemcpyDeviceToHost);
}
------------------------------------------

Finally, development and maintenance costs associated with code complexity must be factored into a total cost of ownership analysis of using co-processors. CUDA 37 is a convenient programming language for Nvidia GPUs, yet it is not open-source, while OpenCL 38 has the opposite pros and cons. Even CUDA code for logistic regression log-likelihood will inevitably be more complex than the CPU code. For example, in CUDA kernels threads within a block can communicate inside a kernel, while blocks cannot. On the other hand, there is an upper limit on number of threads that can be launched within a block. Such limitations may force us to break up the CUDA kernel into two steps, one in which a partial reduction is performed within blocks, and a second kernel where the final sum is calculated. Optimal selection of grid and block geometry is another design decision that may require runtime and/or compile-time code branching.

### 4.6Compile-Time Loop Unroll

In implementing the GLM log-likelihood, we encountered a nested loop in the LA map whose vectorization in the SOM and PLF patterns was delegated to Intel MKL library as part of the dgemv call (see Figs. ? and ?). We also mentioned that, to transform this log-likelihood function to a MOS pattern, we can fuse the outer loop of the LA map with the TR loop, which leaves us with a ddot call inside the fused loop to represent the LA step (Fig. ?). But the presence of this function call prevents a full vectorization of the outer loop, creating bad performance. This can be verified by inspecting the assembly code generated by the Intel compiler, as described below.

The assembly code for the function in Fig. ? (with the fused loop vectorized) consists of three loops, each containing calls to AVX-vectorized transcendental functions __svml_exp4 and __svml_log4: 39 40 The reason for breaking the loop into 3 parts is to take advantage of efficient data movement instructions for aligned data in the main loop, while allowing for scalar data movement instructions to handle misalignment in the first loop, and array lengths that are incomplete multiples of 4 in the last loop (since each AVX vector instruction handles 4 doubles in its 256-bit registers). The evidence for partial vectorization comes from seeing 4 calls to ddot inside the main loop: Since ddot is scalar (i.e. it produces only one element per call as input to the transcendental functions), it must be called 4 times to generate the full input for the vector instructions. As we see in Fig. ?, the resulting performance (‘MOS - ddot’, CPR of 86.7) is much worse than the SOM pattern of Fig. ? (‘SOM’, CPR of 32.3).

Since ddot is a simple routine, we can replace it with a for loop (‘MOS - RegularLoop’), with significant performance improvement. We can further optimize the code by making the loop span a compile-time constant (‘MOS - KnownSpanLoop’). Best result, however, is achieved when we fully unroll the loop (‘MOS - FullyUnrolledLoop’), producing essentially the same performance as the SOM pattern.

Loop unroll does not necessarily require hand-coding by the programmer. Instead, a source-to-source compilation stage can be utilized, or instead we can utilize the template meta-programming facilities of C++ compilers. Fig. ? shows an example code that generates a fully-unrolled version of the ddot BLAS routine for calculating the inner product of two vectors. In the logistic regression example, the only parameter subspace for which we saw a significant advantage from ‘MOS - FullyUnrolledLoop’ over SOM is for short and narrow data and when multi-threading. For example, if we use same parameters as in Fig. ? but 4 threads instead of 1, the CPR numbers for these two methods become 12.0 and 19.4, respectively. In addition to being inflexible in handling external function calls, the MOS pattern faces another significant performance challenge, i.e. excessive synchronization costs in the presence of reduction-to-array operations needed for calculating the derivatives of the log-likelihood function (Section 4.2).

--------------------------------------------------------------------------template
struct inprod {
static inline double apply(const double x[], const double y[]) {
return inprod::apply(x,y)+x[i-1]*y[i-1];
}
};
template<>
struct inprod<1> {
static inline double apply(const double x[], const double y[]) {
return x[0]*y[0];
}
};
inline double ddot_unrolled(double x[], double y[], int K) {
return inprod::apply(x,y);
}
—————————————————————————————————————

Moving many tuning decisions from runtime to compile time is generally a fruitful strategy for HPC. A recent example of this is the Stan compiler for Bayesian DAGs. In contrast to previous compilers such as WINBUGS and JAGS, Stan performs a full compilation of DAG into C++ code. This is likely to be one reason for its superior performance (in addition to using a more efficient sampler in HMC and NUTS). Our manual loop unroll pushes the JIT compiling idea further by requiring the compilation to occur for each new set of values for problem dimensions (specifically here). This is a minor nuisance but can offer performance gains especially for vectorization.

## 5Concluding Remarks

### 5.1Summary

In this paper, we presented two opportunities for SIMD parallel MCMC sampling of probabilistic graphical models: 1) parallel calculation of log-likelihood and its derivatives for exchangeable models with lack of conjugacy, such as GLM, and 2) parallel sampling of conditionally-independent nodes for graphs with discrete nodes and conjugacy, such as Ising model. In addition, we offered strategies for efficient implementation of such opportunities on multi-core x86 processors using a combination of vectorizing compiler and libraries, and a series of high-level changes to the source code and runtime environment. These strategies included Partial Loop Fusion to minimize parallel overhead, NUMA-aware adjustments to minimize cross-socket communication, cache fusion to maximize cache utilization, and batch RNG for efficient random number generation and code vectorization. For Generalized Linear Models, one of the most commonly-used class of models in predictive analytics, our implementation of the log-likelihood function and its gradient vector reached within 25% of the hardware limit induced by total memory bandwidth in the big-data regime, outperforming a fully vectorized and (built-in) multi-threaded - but naive - implementation using Intel MKL by more than 5x.

There are two conceptual extremes with regards to utilizing pre-compiled optimized libraries as components of a HPC application: 1) dropping in the pre-compiled library components as they are, and getting suboptimal aggregate results, or 2) discarding all pre-compiled components and writing everything from scratch, which can significantly raise the development costs for the project. Our research suggests a middle ground, through high-level modifications layered on top of pre-compiled libraries, allowing for significant performance improvement while incurring only modest development costs.

### 5.2Future Directions

There are several important areas for further research.

1. Portable Implementation of Optimization Techniques:

In order to turn research into action, we must identify robust and portable methods for implementing our proposed optimization techniques. For example, our NUMA adjustment relied on the first-touch policy used by Linux. Also, in cache fusion, the optimal number of chunks is tied to the size of L2 cache. Such hardware and OS dependencies must be addressed before a widely-usable library can be released.

2. High-Performance GLM Hessian:

We offered efficient implementation of log-likelihood function and its first derivative for GLM. However, due to their log-concave property [ref], GLMs are suitable for optimization and sampling techniques that utilize the second derivative - Hessian matrix - of the log-likelihood (or log-posterior) function, such as Newton optimization ([?]) (equivalently known as Iterative Re-Weighted Least Squares or IRWLS for GLM models) or RM-HMC sampling ([?]). Developing an equally-efficient implementation of the Hessian for GLMs will be an important contribution to the field of (Bayesian and non-Bayesian) predictive analytics. We expect the Hierachical Loop Structure framework of Section 4.2 to remain effective, but additional opportunities for optimization, such as taking advantage of the symmetry of the Hessian matrix, must be thoroughly studied.

3. Hierarchical Bayesian GLM:

Many Bayesian applications of GLM involves an HB framework. Our research highlights the subtleties of combining coarse- and fine-grained parallel modes available for HB GLMs. These subtleties as well as additional complexities arising from highly-imbalanced regression groups must be carefully studied and encoded into a high-performance library for MCMC sampling of HB GLM models (or generic DAG compilers such as Stan).

4. Batch RNG:

We demonstrated the value of batch RNG for improving efficiency and vectorization of MCMC sampling for models such as Ising, RBM, and LDA. For complex distributions, existing libraries cannot be utilized in batch mode in the context of MCMC sampling since the parameters of distributions changes from one iteration to next. (This is in contrast to purely Monte Carlo - non-Markovian - settings where sampled distributions remain stationary throughout the simulation.) Developing batch RNG libraries, i.e. libraries that generate buffers of random deviates for a set of core distributions such as uniform and normal, and utilize these buffers for generating random deviates from complex distributions - will be an important contribution towards high-performance MCMC sampling of big-data simulations such as automated, semantic analysis of web data.

5. Vectorization and Atomic Operations:

Atomic operations arise in parallel sampling of conditionally-independent nodes (such as in non-collapsed LDA and in RBM), where nodes may simultaneously attempt to update the same intermediate data structures. Before hardware support for vectorized atomic operations become available, software options can be explored. For example, one possibility is to accumulate updates in temporary buffers such that each vector instruction produces non-overlapping updates to the buffer. Higher flexibility offered by vector intrinsics might be required to implement such solutions.

0pt References

### Footnotes

1. http://www.nvidia.com/object/what-is-gpu-computing.html
2. http://tinyurl.com/co9e8hy
3. http://ark.intel.com/products/64595/
4. http://tinyurl.com/oexotqv
5. http://tinyurl.com/7hhc7rj
6. Our introduction to graphical models borrows heavily from Section 8.1 of Bishop’s excellent textbook ([?]).
7. http://ark.intel.com/products/64595/
8. http://tinyurl.com/7ty2lsx
9. During performance measurement, all BLAS calls are made to FORTRAN routines. For presentation brevity, however, we show calls to the C interface of BLAS in this paper.
10. http://tinyurl.com/nc9yjhh
11. http://tinyurl.com/pevz69r
12. For more on how to compile and link using Intel C++ compiler, see http://tinyurl.com/lctboe3.
13. http://tinyurl.com/q4er8qp
14. http://www.cilkplus.org/
16. http://chapel.cray.com/
17. http://x10-lang.org/
18. http://mapreduce.stanford.edu/
19. http://tinyurl.com/qbtpc38
20. http://www.mcs.anl.gov/research/projects/mpi/
23. SIMD constructs have been incorporated into OpenMP 4.0 API: http://www.openmp.org/mp-documents/OpenMP4.0.0.pdf
24. This is inspired by the Clock Cycles Per Element (CPE) metric, defined in [?], Chapter 5.
25. We can read as integer since it is binary, but for simplicity we treat it as double with negligible impact on calculations
26. http://tinyurl.com/p47ytt3
27. http://tinyurl.com/7hhc7rj
28. http://tinyurl.com/qxlzd4g
29. http://man7.org/linux/man-pages/man7/numa.7.html
30. https://software.intel.com/en-us/articles/memory-allocation-and-first-touch
31. Note that this is not ‘false sharing’ since threads do not simply share the same cache line, but are indeed writing to the same memory locations.
32. http://tinyurl.com/pevz69r
33. https://developer.nvidia.com/curand
34. http://www.sprng.org/
35. First, we draw ’s from . Next we normalize: .
36. The algorithm uses the rejection method of [?] for , and a lemma described in [?] to cast an problem into an problem.
37. http://www.nvidia.com/object/cuda_home_new.html
38. https://www.khronos.org/opencl/
39. To focus on vectorization, we compiled the code without the OpenMP pragma.
40. For more on generating vectorized code, see this Web Aside http://csapp.cs.cmu.edu/public/waside/waside-simd.pdf to Bryant and O’Hallaron’s great textbook ([?]).
10006