SIMD Parallel MCMC Sampling with Applications for Big-Data Bayesian Analytics
Abstract
Computational intensity and the sequential nature of estimation techniques for Bayesian methods in statistics and machine learning, combined with their increasing application to big-data analytics, necessitate both the identification of potential opportunities to parallelize techniques such as Monte Carlo Markov Chain (MCMC) sampling, and the development of general strategies for mapping such parallel algorithms to modern CPUs in order to push performance up to the compute-based and/or memory-based hardware limits. Two opportunities for Single-Instruction Multiple-Data (SIMD) parallelization of MCMC sampling for probabilistic graphical models are presented. In exchangeable models with many observations, such as Bayesian Generalized Linear Models (GLMs), child-node contributions to the conditional posterior of each node can be calculated concurrently. In undirected graphs with discrete nodes, concurrent sampling of conditionally-independent nodes can be transformed into a SIMD form. High-performance libraries with multithreading and vectorization capabilities can be readily applied to such SIMD opportunities to gain decent speedup, while a series of high-level source-code and runtime modifications provide further performance boost by reducing parallelization overhead and increasing data locality for Non-Uniform Memory Access (NUMA) architectures. For big-data Bayesian GLM graphs, the end result is a routine for evaluating the conditional posterior and its gradient vector that is 5 times faster than a naive implementation using (built-in) multithreaded Intel MKL BLAS, and comes within striking distance of the memory-bandwidth-induced hardware limit. Using multithreading for cache-friendly, fine-grained parallelization can outperform coarse-grained alternatives, which are often less cache-friendly; this is a likely scenario in modern predictive analytics workflows such as Hierarchical Bayesian GLM, variable selection, and ensemble regression and classification.
The proposed optimization strategies improve the scaling of performance with the number of cores and the width of vector units (and are applicable to many-core SIMD processors such as Intel Xeon Phi and Graphics Processing Units), resulting in cost-effectiveness, energy efficiency ('green computing'), and higher speed on multicore x86 processors.
keywords: GPU, Hierarchical Bayesian, Intel Xeon Phi, logistic regression, OpenMP, vectorization

1 Introduction
Many inference problems in statistics and machine learning are best expressed in the language of probabilistic graphical models, where a probability distribution function (PDF) over a high-dimensional parameter space can be motivated as the product of a collection of terms, each a function of a subset of the parameters. Inference in such models requires summarizing the joint PDF, which can be quite complex and lack closed-form integrals. Monte Carlo Markov Chain (MCMC) sampling techniques offer a practical way to summarize complex PDFs for which exact sampling algorithms are not available. For many real-world problems, MCMC sampling can be very time-consuming due to a combination of large data sets, high dimensionality of the joint PDF, lack of conjugacy between likelihood and prior functions, and poor mixing of the MCMC chain. Fast MCMC sampling is, therefore, important for wider adoption of probabilistic models in real-world applications.
For many years, software developers could rely on faster processors to see improved performance, without any need for code modification. In the past decade, however, chip manufacturers have warned of single-core performance saturation as CPU clock rates and instruction-level parallelism reach their physical limits (Jeffers and Reinders (2013)). Instead, we are seeing a steady rise in core counts and in the width of vector units, culminating in the emergence of many-core Single-Instruction Multiple-Data (SIMD) architectures such as Graphics Processing Units (GPUs) (http://www.nvidia.com/object/whatisgpucomputing.html) and Intel Xeon Phi (http://tinyurl.com/co9e8hy). Today, a CPU purchased for around $3000 can offer a theoretical peak double-precision performance of more than 300 GFLOPS (http://ark.intel.com/products/64595/), and many-core processors selling for nearly the same price have broken the Tera-FLOP barrier (http://tinyurl.com/oexotqv). For scientific computing applications to take full advantage of such enormous performance potential within a single compute node, the entire parallelism potential of an algorithm must be efficiently exposed to the compiler, and eventually to the hardware.
Most efforts on parallelizing MCMC have focused on identifying high-level parallelism opportunities such as concurrent sampling of conditionally-independent nodes (Wilkinson (2010)). Such coarse-grained parallelism is often mapped to a distributed-memory cluster or to multiple cores on a shared-memory node. As such, the vectorization capabilities of the processor are implicitly assumed to be the responsibility of libraries and compilers, resulting in a systemic underutilization of vectorization in scientific computing applications. Furthermore, with increasing data sizes and a widening gap between floating-point performance and memory bandwidth, modern processors have seen an architectural change in memory layout from Symmetric Multi-Processors (SMPs) to Non-Uniform Memory Access (NUMA) designs to better scale total memory bandwidth with the rising core count. The software-level, shared-memory view of this asymmetric memory is a convenient programming feature but can lead to a critical underutilization of memory bandwidth for data-intensive applications. This paper seeks to expand our ability to make efficient use of multicore x86 processors, today's most ubiquitous computing platform, to rapidly generate samples from the posterior distribution of model parameters. Our focus is twofold: 1) identifying opportunities for SIMD parallelism in MCMC sampling of graphical models, and 2) efficiently mapping such SIMD opportunities to the multithreading and vectorization parallel modes of x86 processors. Using examples from directed and undirected graphs, we show that off-the-shelf, multithreaded and vectorized high-performance libraries (along with vectorizing compilers) provide a decent speedup with small programming effort. Additionally, a series of high-level source-code and runtime modifications lead to significant additional speedup, even approaching hardware-induced performance limits.
Vectorization of SIMD parallelism opportunities is often complementary to coarse-grained parallel modes, creating a multiplicative performance boost. Moreover, we illustrate the counterintuitive result that, given a limited number of cores, efficient fine-grained (SIMD) multithreading can outperform coarse-grained multithreading over a range of data sizes where L3 cache utilization is the dominating factor. In the process, we propose a general strategy for optimally combining a sequence of maps to minimize multithreading overhead while maximizing vectorization coverage. This strategy naturally allows for data locality at the memory and cache levels, and significantly reduces the cost of complex synchronizations such as reduction on arrays. Furthermore, a differential update strategy for MCMC sampling of graphical models is proposed which, where applicable, can significantly reduce data movement as well as the amount of computation, and is applicable to graphs with continuous as well as discrete nodes.
The remainder of this paper is organized as follows. In Section 2 we lay the theoretical foundation for the paper, including a summary of previous research on parallel MCMC and an overview of two single-chain parallelization techniques for parallel MCMC of Directed Acyclic Graphs (DAGs). In Section 3 we present a detailed performance analysis and optimization of parallel MCMC for Bayesian GLM (focusing on logistic regression). In Section 4 we discuss several extensions of the ideas developed in Section 3, including Hierarchical Bayesian (HB) GLM, calculation of derivatives, the Ising model, batch RNG, distributed and many-core computing, and compile-time loop unrolling. Section 5 contains a summary of our results and pointers for future research.
2 Parallel MCMC for Graphical Models
2.1 Previous Work on Parallel MCMC
A straightforward method for parallel MCMC is to run multiple chains (Wilkinson (2010)). Since each chain must go through the burn-in period individually, multi-chain parallelization is less suitable for complex models with poor convergence, where a large fraction of time might be spent on the burn-in phase. The most common single-chain MCMC parallelization strategy is concurrent sampling of conditionally-independent nodes within the Gibbs sampling framework (Wilkinson (2010)). This is also known as graph coloring or chromatic sampling. Example applications include hierarchical LDA models (Newman et al. (2009)) and Markov Random Fields (Gonzalez et al. (2011)). In molecular dynamics simulations, the assumption of short-range intermolecular forces provides for similar domain decompositions (Ren and Orkoulas (2007)). This parallelization strategy is reviewed in Section 2.3. In synchronous Gibbs sampling (Geman and Geman (1984)), all variables are updated simultaneously, rather than sequentially, in each Gibbs cycle, regardless of whether conditional independence holds. Despite its theoretical and practical shortcomings, this embarrassingly parallelizable approach remains popular in large-scale MCMC problems such as Latent Dirichlet Allocation (LDA) models for topical analysis of text documents (Newman et al. (2009)). Parallel message-passing algorithms are discussed in Gonzalez et al. (2009, 2011, 2012) with application to discrete factor graphs. Tibbits et al. (2011) present CPU and GPU parallelization of multivariate slice sampling (Neal (2003)) with application to linear Gaussian processes. Xu and Ihler (2011) propose a parallel 'look-ahead' sampler for discrete-valued, densely-connected graphical models such as Boltzmann Machines and LDA. This approach takes advantage of the observation that, in densely-connected networks, the contribution from any given node to the conditional posterior of a target node is small.
A similar strategy, called 'prefetching', has been proposed (Brockwell (2006); Strid (2010)) in the context of one-block Metropolis-Hastings algorithms. Yu and Xu (2009) propose an approximate parallel Gibbs sampler, with a GPU implementation, for DNA motif finding, where parallelization is achieved by replacing sequential random steps with a deterministic, pre-ordered search of the parameter space.
In terms of target platforms, MCMC parallelization work has primarily focused on shared-memory multithreading (Xu and Ihler (2011); Tibbits et al. (2011)), distributed-memory parallelization using explicit or implicit message-passing (Ren and Orkoulas (2007); Ahmed et al. (2012); Low et al. (2012); Gonzalez et al. (2009); Newman et al. (2009)), and more recently, many-core architectures such as GPUs (Yu and Xu (2009); Tibbits et al. (2011); da Silva (2011); Dumont (2011)). This leaves two areas under-explored: efficient use of vector instructions for SIMD parallelization, and NUMA awareness for efficient utilization of memory bandwidth. Addressing these two areas is important in light of trends towards higher core counts, multi-socket processors with NUMA architecture, and increasing width of vector units. Today's desktop processors are increasingly dual-socket, while quad-socket setups are available for high-end servers. Since each socket has a separate memory controller, a processor's access to memory attached to other sockets is mediated through a slower path such as Intel QPI (http://tinyurl.com/7hhc7rj), thus creating non-uniform access to memory. On the other hand, vector units have steadily widened over recent generations, reaching 128 bits for SSE4 and 256 bits for the AVX instruction set extensions. Intel Xeon Phi coprocessors have doubled the width yet again, to 512 bits. The combined effect of vectorization and NUMA awareness on a dual-socket x86 can reach an order of magnitude: a fully-vectorized code can run up to 4 times faster for double-precision calculations, while minimizing cross-socket communication on a dual-socket processor can double the performance of a memory-bound application. Our paper aims to help better utilize the single-server capacities of a modern x86 processor for parallel MCMC sampling.
In addition to achieving more efficient execution on x86 processors, our performance optimization techniques pave the way for further speedup on many-core processors, since the underlying principles of reducing parallel overhead and increasing data locality apply to all architectures. This strategy has been dubbed 'transforming and tuning' (Jeffers and Reinders (2013)).
2.2 Gibbs Sampling of Graphical Models
A graphical model consists of nodes (or vertices) connected by links (or edges). (Our introduction to graphical models borrows heavily from Section 8.1 of Bishop's excellent textbook (Bishop (2006)).) In a probabilistic graphical model, each node represents a random variable (or a group of random variables), and links express probabilistic relationships between these variables. The graph represents a decomposition of the joint probability distribution over all variables into a product of factors, each consisting of a subset of variables. In DAGs, links have directionality indicated by arrows, and no closed path exists along which one always travels in the direction of the arrows. For a DAG with $K$ nodes, the joint distribution is given by the following factorization
$p(\mathbf{x}) = \prod_{k=1}^{K} p(x_k \,|\, \mathrm{pa}_k)$   (1)
where $\mathrm{pa}_k$ represents the set of parents of node $x_k$, and $\mathbf{x} = (x_1, \dots, x_K)$. For undirected graphs, the joint distribution is expressed as a product of potential functions $\psi_C$ over the maximal cliques $C$ of the graph:
$p(\mathbf{x}) = \frac{1}{Z} \prod_{C} \psi_C(\mathbf{x}_C)$   (2)
where $Z$ is the normalization factor, also known as the partition function.
Inference and prediction in graphical models often requires calculating the expected values of functions of $\mathbf{x}$ with respect to the joint distribution $p(\mathbf{x})$. In many cases, closed forms for the integrals do not exist, and approximate techniques such as MCMC can be used to approximate the joint distribution with a finite sample. A popular MCMC technique is Gibbs sampling (Geman and Geman (1984)), which is a special case of the Metropolis-Hastings (MH) algorithm (Hastings (1970)) with 100% acceptance rate. In Gibbs sampling, we iterate through the stochastic nodes and draw samples from the probability distribution for each node, conditioned on the latest samples drawn for all remaining nodes. Below is an outline of the Gibbs sampling algorithm:

1. Initialize all nodes $\{x_k^{(0)} : k = 1, \dots, K\}$.

2. For iteration $\tau = 1, \dots, T$:

   Sample $x_1^{(\tau)} \sim p(x_1 \,|\, x_2^{(\tau-1)}, x_3^{(\tau-1)}, \dots, x_K^{(\tau-1)})$.

   Sample $x_2^{(\tau)} \sim p(x_2 \,|\, x_1^{(\tau)}, x_3^{(\tau-1)}, \dots, x_K^{(\tau-1)})$.

   $\vdots$

   Sample $x_j^{(\tau)} \sim p(x_j \,|\, x_1^{(\tau)}, \dots, x_{j-1}^{(\tau)}, x_{j+1}^{(\tau-1)}, \dots, x_K^{(\tau-1)})$.

   $\vdots$

   Sample $x_K^{(\tau)} \sim p(x_K \,|\, x_1^{(\tau)}, \dots, x_{K-1}^{(\tau)})$.
Gibbs sampling is appealing because it reduces the problem of sampling from a high-dimensional distribution to a series of low-dimensional sampling problems, which are often easier, especially when the resulting low-dimensional distributions have standard sampling algorithms (e.g. multivariate Gaussian, Gamma, etc.). When conditional distributions cannot be directly sampled, a hybrid approach can be used where an MCMC algorithm such as Metropolis (Metropolis and Ulam (1949)) or the slice sampler (Neal (2003)) is applied within the Gibbs framework (Robert and Casella (2004)).
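To make the cycle above concrete, the following minimal sketch (our own toy example; the function `gibbs_bvn` and its parameters are illustrative and not from the paper) applies Gibbs sampling to a bivariate Gaussian, where both full conditionals have closed forms:

```cpp
#include <cassert>
#include <cmath>
#include <random>
#include <utility>
#include <vector>

// Gibbs sampler for a toy bivariate Gaussian with zero means, unit
// variances, and correlation rho. Both full conditionals are univariate
// Gaussians: x1 | x2 ~ N(rho*x2, 1 - rho^2), and symmetrically for x2,
// so each Gibbs cycle is two easy one-dimensional draws.
std::vector<std::pair<double, double>> gibbs_bvn(double rho, int n_iter,
                                                 unsigned seed = 42) {
    std::mt19937 gen(seed);
    std::normal_distribution<double> std_normal(0.0, 1.0);
    const double sd = std::sqrt(1.0 - rho * rho);
    double x1 = 0.0, x2 = 0.0;  // step 1: initialize all nodes
    std::vector<std::pair<double, double>> samples;
    samples.reserve(n_iter);
    for (int t = 0; t < n_iter; ++t) {          // step 2: Gibbs cycles
        x1 = rho * x2 + sd * std_normal(gen);   // sample x1 | x2
        x2 = rho * x1 + sd * std_normal(gen);   // sample x2 | x1
        samples.emplace_back(x1, x2);
    }
    return samples;
}
```

With enough iterations the sample correlation approaches rho, illustrating how the chain targets the joint distribution despite only ever drawing from one-dimensional conditionals.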
2.3 SIMD Parallel MCMC for DAGs
Here we present two canonical parallelization opportunities for DAGs to provide a foundation for our analysis of Bayesian GLM. Equivalent ideas for undirected graphs are discussed in Section 4.3 in the context of the Ising model and the Boltzmann machine.
According to Eq. 1, a given node appears in the joint distribution in two ways: a) the self term, i.e. the term that expresses the probability of the node conditioned on its parents, and b) the child terms, i.e. the terms corresponding to the conditional probabilities of the node's children. Therefore, the conditional distributions appearing in the Gibbs sampling algorithm can be decomposed as follows:
$p(x_j \,|\, \mathbf{x}_{-j}) \propto p(x_j \,|\, \mathrm{pa}_j) \prod_{k \in \mathrm{ch}_j} p(x_k \,|\, \mathrm{pa}_k)$   (3)

where $\mathbf{x}_{-j}$ denotes all nodes other than $x_j$, and $\mathrm{ch}_j$ denotes the set of children of node $x_j$.
This decomposition allows us to identify the following two single-chain parallelization modes for Gibbs sampling of DAGs.
Parallel Sampling of Conditionally-Independent Nodes: Consider two nodes $x_i$ and $x_j$. If their joint conditional distribution is factorizable, i.e. if $p(x_i, x_j \,|\, \mathbf{x}_{-ij}) = p(x_i \,|\, \mathbf{x}_{-ij}) \, p(x_j \,|\, \mathbf{x}_{-ij})$, then the nodes are called conditionally independent and can be sampled simultaneously in each iteration of Gibbs sampling. From Eq. 3, we can infer the following theorem about conditional independence:
Theorem 2.1.
Two nodes are conditionallyindependent (and can be sampled in parallel) if a) neither node is a parent of the other, AND b) nodes have no common children.
Proof.
Condition (a) means that neither node appears in the self term of the other node. Condition (b) means that neither node appears in any of the child terms of the other node. Therefore, neither node appears in the conditional distribution of the other node. This is sufficient to prove that the nodes are conditionally independent and can be sampled concurrently in Gibbs sampling. ∎
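Theorem 2.1 translates directly into a small predicate over the parent lists of a DAG. The sketch below is our own illustration (the helper `cond_independent` and the parent-list encoding are hypothetical, not from the paper):

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <vector>

// Theorem 2.1 as a predicate. The DAG is encoded as a parent list:
// parents[k] holds the indices of the parents of node k. Nodes i and j
// can be Gibbs-sampled in parallel iff (a) neither is a parent of the
// other, and (b) they have no common child.
bool cond_independent(int i, int j,
                      const std::vector<std::set<int>>& parents) {
    if (parents[i].count(j) || parents[j].count(i))
        return false;                           // condition (a) violated
    for (std::size_t k = 0; k < parents.size(); ++k)
        if (parents[k].count(i) && parents[k].count(j))
            return false;                       // condition (b) violated
    return true;
}
```

For the chain 0 → 1 → 2, nodes 0 and 2 are conditionally independent given node 1, whereas the adjacent pair (0, 1), or any two parents of a common child, are not.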
The above theorem can be used to determine whether any pair of nodes is conditionally independent. However, constructing the most efficient partitioning of a graph into node blocks, each consisting of conditionally-independent nodes, is non-trivial and an NP-hard problem (Wilkinson (2010)). For specific graph structures, such as HB models (Rossi et al. (2005)), conditionally-independent node blocks can be constructed rather easily. This parallelization mode can be difficult to vectorize, especially for non-conjugate, complex posteriors (such as GLM) where drawing a sample for each node involves multiple function calls, including random number generation (RNG) and a sampling routine. As we will see for the Ising model, however, for simpler distributions this parallel mode can be partially vectorized using the strategy discussed in Section 3.4, with a significant contribution to the overall performance of the sampling routine.
Parallel Calculation of Conditional Posterior: First, consider the conjugate case, where the product of the prior and each child-term contribution retains the functional form of the prior. For example, in LDA models a Dirichlet prior is used on the distribution of topics in each document. The likelihood contribution of each token in a document is a categorical distribution, with their collective contribution being a multinomial distribution whose parameter vector is the sum of contributions from all tokens. Similarly, with a conjugate Dirichlet prior on the distribution of words in each topic, the contribution from each document (and the tokens within each document) takes an additive form towards the posterior. In such cases, calculating the parameters of the posterior distribution can be parallelized, followed by a reduction. Second, consider the case where conjugacy does not exist and, rather than drawing independent samples from the posterior, we must create a Markov chain that converges to the posterior. For most MCMC sampling techniques, such as the slice sampler (Neal (2003)) or the Metropolis sampler (Metropolis and Ulam (1949)), the majority of time during each sampling cycle is spent on evaluating the conditional log-posterior function (e.g. within the Gibbs sampling framework) or its derivatives (Section 4.2). Therefore, any speedup in function evaluation translates, nearly one-to-one, into MCMC speedup. Similar to the conjugate case, the additive contributions of the child terms can be calculated in parallel, followed by a reduction step. In both cases, when the model is exchangeable (Bernardo (1996)), as in the case of i.i.d. observations, the contributions from child nodes are symmetric, resulting in a SIMD computation.
Non-conjugate cases are typically better candidates for SIMD parallelization since 1) they tend to consume a larger fraction of each sampling cycle, and 2) in conjugate cases, a larger number of observations is needed before the cost of calculating the conjugate posterior parameters dominates the cost of generating a random deviate from the posterior. GLMs are good candidates for SIMD Gibbs sampling due to the non-conjugacy of their prior and likelihood functions, and due to the high computational cost of each function evaluation, especially for long and wide data and in the presence of transcendental functions (Section 3.1).
Figure 1 summarizes the two single-chain parallelization modes for DAGs discussed above. In relative terms, parallel sampling of conditionally-independent nodes can be considered coarse-grained, while parallel evaluation of the log-likelihood can be considered fine-grained. In the remainder of this paper, we use two examples, Bayesian GLM and the Ising model, to demonstrate how each of these modes can be parallelized on multicore x86 processors using multithreading and vectorization.
3 SIMD Parallel MCMC for Bayesian GLM on Multicore x86
3.1 Setup
3.1.1 Running Example: Bayesian Logistic Regression
GLMs (Nelder and Baker (1972)) are the workhorse of statistics and machine learning, with applications such as risk analysis (Sobehart et al. (2000); Fenton and Neil (2004); Antonio and Beirlant (2007)) and public health (Hastie and Tibshirani (1987); Sharabiani et al. (2011)), among others. They can be extended to handle data sparseness and heterogeneity via the HB framework (Rossi et al. (2005); Gelman and Hill (2007)), or to account for repeated measurements and longitudinal data via Generalized Linear Mixed Models (McCulloch (2006)).
For our purposes, GLMs have the following log-likelihood function:
$L(\boldsymbol{\beta}) = \sum_{n=1}^{N} \log p(y_n \,|\, \mathbf{x}_n^{\top} \boldsymbol{\beta})$   (4)
where $\mathbf{x}_n$ is the vector of features or explanatory variables for observation $n$ and has length $K$, $y_n$ is the response or target variable, $\boldsymbol{\beta}$ is the coefficient vector of length $K$, and $N$ is the number of observation points. Logistic regression is an important member of the GLM family, where each binary response variable is assumed to follow a Bernoulli distribution whose probability is $1/(1 + \exp(-\mathbf{x}_n^{\top}\boldsymbol{\beta}))$. The log-likelihood becomes
$L(\boldsymbol{\beta}) = \sum_{n=1}^{N} \left\{ y_n \, \mathbf{x}_n^{\top}\boldsymbol{\beta} - \log\left(1 + \exp(\mathbf{x}_n^{\top}\boldsymbol{\beta})\right) \right\}$   (5)
with $y_n$ being a binary response variable with possible values 0 and 1. In a Bayesian setting, we impose a prior distribution on $\boldsymbol{\beta}$, such as a multivariate Gaussian with parameters $\boldsymbol{\mu}$ and $\boldsymbol{\Sigma}$, leading to the DAG representation of Fig. 2. Unless we are dealing with very small data sizes, the calculation of the prior contribution is often computationally negligible compared to the likelihood calculation, and hence we focus on the latter throughout this paper.
3.1.2 Hardware
We use a dual-socket 2.6 GHz Intel Xeon E5-2670 (http://ark.intel.com/products/64595/) with 8 cores per socket (16 total cores), 20 MB of L3 cache per processor (shared among its 8 cores), 256 KB of L2 cache per core, and 32 KB of L1 cache per core (separately for data and instructions). Total installed RAM is 128 GB, with a maximum memory bandwidth of 51.2 GB/sec per socket. Since each processor has its own memory controller, the dual-socket setup creates a Non-Uniform Memory Access (NUMA) environment (Section 8.8.2 of the Intel Optimization Reference Manual, http://tinyurl.com/7ty2lsx). This processor supports the AVX instruction set extensions, using 256-bit wide floating-point registers YMM0-YMM15. Hyper-threading is turned off for our experiments.
3.1.3 Software
All code is written in C/C++. (During performance measurement, all BLAS calls are made to FORTRAN routines; for presentation brevity, however, we show calls to the C interface of BLAS in this paper. The only C++ feature used is template metaprogramming for compile-time loop unrolling; see Section 4.6.) We use the Intel software stack for compiling and microbenchmarking, as well as for high-performance mathematical routines:

Intel Math Kernel Library (MKL) (part of Intel Composer XE 2013): We use MKL for vectorized and multithreaded BLAS calls (dgemv/ddot), vectorized transcendental functions (as part of the Vector Math Library, or VML; http://tinyurl.com/nc9yjhh), and vectorized random number generation (RNG) as part of the Vector Statistical Library, or VSL (http://tinyurl.com/pevz69r). Switching between single-threaded and multithreaded MKL is done via the link option '-mkl=sequential' or '-mkl=parallel'. (For more on how to compile and link using the Intel C++ compiler, see http://tinyurl.com/lctboe3.)

Intel C++ Compiler (also part of Intel Composer XE 2013): Combined with Intel VML, the compiler allows us to vectorize functions of transcendental functions. All code tested in this paper was compiled using the optimization flag -O2.

Intel Performance Counter Monitor (PCM) (http://tinyurl.com/q4er8qp): an open-source C++ wrapper library for reading microbenchmarking (core and uncore) events from Intel's Performance Monitoring Units, including cache hit rates, clock cycles lost to memory stalls, and instructions per cycle.
3.1.4 Parallel Programming
Our overall approach is to maintain software portability and developer productivity while achieving good performance. For multithreading, we use the OpenMP API and Intel's implementation of it (as part of Intel Composer XE 2013). OpenMP offers a mix of high-level and low-level compiler directives and runtime functions for creating and managing threads, and is an efficient alternative to working directly with Pthreads. OpenMP is perhaps today's most widely used API for multithreading in HPC applications. Other notable options for multithreading include Cilk Plus (http://www.cilkplus.org/; it also handles vectorization) and Threading Building Blocks (https://www.threadingbuildingblocks.org/) from Intel, Chapel (http://chapel.cray.com/) from Cray, X10 (http://x10lang.org/) from IBM, and Phoenix MapReduce (http://mapreduce.stanford.edu/) from Stanford. For vectorization, we choose 'guided auto-vectorization' (http://tinyurl.com/qbtpc38), i.e. we use pragmas and code modifications to guide the compiler to vectorize the code. We find this option strikes a balance between performance and portability/productivity. Other options include vector intrinsics and assembly code. Another parallelization option is distributed computing on a server cluster using the Message-Passing Interface (MPI) (http://www.mcs.anl.gov/research/projects/mpi/), MapReduce (http://research.google.com/archive/mapreduce.html), or Hadoop (http://hadoop.apache.org/). Given the fine-grained nature of our SIMD parallelization approach and the significant network latency and bandwidth restrictions of distributed computing, however, focusing on a single server is the logical choice. Our parallelization concepts can nonetheless be combined with coarse-grained options and thus incorporated into distributed computing frameworks; see Section 4.5.
Note that while we are using Intel's hardware and software, the optimization lessons learned are applicable to other platforms, including open-source software. More importantly, the software itself is portable, since we are using standard APIs (C/C++ standard library, OpenMP, BLAS).
3.2 Baseline Implementation
To calculate the log-likelihood in Eq. 5, we can consolidate the $\mathbf{x}_n$'s into the rows of a matrix $X$, and use a matrix-vector multiplication routine such as BLAS dgemv to calculate $X\boldsymbol{\beta}$. We refer to this first map operation (McCool et al. (2012)) as the Linear Algebra (LA) map. This is followed by an element-wise transformation of $X\boldsymbol{\beta}$ involving transcendental functions, and a final reduction step. We call this second map operation the Transcendental (TR) map. The result is the loglike routine in Fig. 3. Since each map operation finishes before the start of the next map, this strategy is a Sequence of Maps (SOM) pattern (McCool et al. (2012)). Using an off-the-shelf, optimized BLAS library such as Intel MKL provides automatic vectorization and multithreading of the dgemv routine. Further parallelization can be achieved by multithreading the TR map using the omp parallel for pragma. This same loop can also be vectorized, but this requires access to a vector math library for transcendental functions, as well as compiler support for vectorization of functions of transcendental functions. The Intel Composer suite offers both capabilities. The simd pragma in Fig. 3 directs the Intel C++ compiler to ignore any potential vectorization barriers (such as pointer aliasing between Xbeta and y) and generate vector instructions for the TR map. (SIMD constructs have since been incorporated into the OpenMP 4.0 API: http://www.openmp.org/mpdocuments/OpenMP4.0.0.pdf.)
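Since Fig. 3 is not reproduced here, the sketch below gives our reading of the SOM structure it describes, with plain loops standing in for the MKL dgemv call and the vectorized TR map (the function name `loglike_som` and the row-major data layout are our own assumptions):

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Sequence-of-Maps (SOM) log-likelihood for logistic regression (Eq. 5).
// X is N x K in row-major order; y[n] is 0 or 1. The LA map computes
// Xbeta (in the paper, a single multithreaded dgemv call); the TR map
// then applies the transcendental transformation and reduces. The paper
// additionally vectorizes the TR loop with a simd pragma.
double loglike_som(const std::vector<double>& X,
                   const std::vector<double>& y,
                   const std::vector<double>& beta, int N, int K) {
    std::vector<double> Xbeta(N, 0.0);
    // LA map: Xbeta = X * beta (stand-in for cblas_dgemv)
    #pragma omp parallel for schedule(static)
    for (int n = 0; n < N; ++n)
        for (int k = 0; k < K; ++k)
            Xbeta[n] += X[n * K + k] * beta[k];
    // TR map plus reduction over rows
    double f = 0.0;
    #pragma omp parallel for reduction(+:f) schedule(static)
    for (int n = 0; n < N; ++n)
        f += y[n] * Xbeta[n] - std::log(1.0 + std::exp(Xbeta[n]));
    return f;
}
```

Each map runs in its own parallel region, which is exactly the overhead that the loop-fusion discussion of Section 3.4 targets.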
Fig. 4 (left) shows the performance of our baseline implementation, measured in 'Clock Cycles Per Row' (CPR), defined as the total time (measured in CPU clock cycles) for each function evaluation, divided by $N$, the number of rows in $X$. (This metric is inspired by the Clock Cycles Per Element (CPE) metric defined in Bryant and O'Hallaron (2011), Chapter 5.) Time per function evaluation is an average over many iterations. All iterations share the same $X$ and $y$, allowing for caching, but $\boldsymbol{\beta}$ is refreshed from one iteration to the next. This mimics the real-world application in MCMC sampling, where the observed data does not change throughout the sampling iterations but the coefficients are refreshed. We see that simply adding the OpenMP pragma to the TR loop leads to a 2.4x speedup (MKL+), while adding the simd pragma to the TR loop provides another 1.6x speedup (MKL++), for a total of 3.9x compared to the automatic parallelization of the LA map offered by the Intel MKL library.
Achieving a 4x speedup by adding two lines of code (and spending a few hundred dollars on the Intel C++ compiler) is a great return on investment. But we also see two alarming trends, depicted in Fig. 4 (right): 1) for small data sizes (small $N$), performance scales poorly as we increase the number of threads; 2) as data sizes grow (large $N$), absolute performance decays. It appears that we have both a small-data and a big-data problem. Next we perform a deeper performance analysis of the code, paving the way for the optimization strategies discussed in Section 3.4. These optimizations will add significant speedup to the baseline performance achieved by the code in Fig. 3.
3.3 Performance Analysis
In the LA map, $K$ doubles per row are consumed (we could read $y_n$ as an integer since it is binary, but for simplicity we treat it as a double, with negligible impact on the calculations) and 1 double per row is produced. The LA map involves $K$ multiply/add calculations per row. The TR map consumes 1 double per row and produces a scalar after reduction, but involves 2 transcendental functions (exp and log) and a few arithmetic operations; as such, it is expensive but $O(1)$ per row, i.e. it does not scale with $K$. For small $K$ (around 10), the TR step dominates computation, but as $K$ approaches 100, the scale tips towards the LA map.
3.3.1 Hardware-Induced Performance Limits for Big Data
We can use the hardware specifications (http://tinyurl.com/p47ytt3) of our server to derive memory-based and compute-based upper bounds on the performance of our code in the big-data regime. This regime is formally defined as combinations of $N$ and $K$ such that 1) $X$ is much bigger than the L3 cache and must therefore be fetched entirely from memory in each iteration, and 2) $K$ is so large that the LA step dominates computation, making $2K$ FLOPs ($K$ multiplications and $K$ additions) a reasonable approximation of the computation performed per row. The compute-based upper bound on performance (i.e. minimum CPR) is derived as follows:
FLOPs per row $= K$ multiplications $+ K$ additions $= 2K$   (6)

peak FLOP rate $= 16\ \text{cores} \times 8\ \text{DP FLOPs/cycle} = 128\ \text{FLOPs/cycle}$   (7)

$\mathrm{CPR}_{\min}^{\mathrm{compute}} = 2K / 128$   (8)

$= K/64 \approx 0.016\,K\ \text{cycles per row}$   (9)
Similarly, we calculate the memory-based performance ceiling:
bytes per row $= K\ \text{doubles} \times 8\ \text{bytes} = 8K$   (10)

peak memory bandwidth $= 2\ \text{sockets} \times 51.2\ \text{GB/s} = 102.4\ \text{GB/s}$   (11)

$= (102.4\ \text{GB/s}) / (2.6\ \text{GHz}) \approx 39.4\ \text{bytes/cycle}$   (12)

$\mathrm{CPR}_{\min}^{\mathrm{memory}} = 8K / 39.4 \approx 0.2\,K\ \text{cycles per row}$   (13)
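The arithmetic behind both ceilings is compact enough to check in code. The helper below is our own back-of-the-envelope version, assuming the specs quoted in Section 3.1.2 plus 8 double-precision FLOPs per cycle per core for this AVX generation (an assumption on our part; `cpr_bounds` is a hypothetical name):

```cpp
#include <cassert>
#include <cmath>

// Big-data CPR ceilings for the dual-socket Xeon E5-2670 of Section
// 3.1.2, expressed as clock cycles per row of X for a given row width K.
struct Ceilings { double cpr_compute; double cpr_memory; };

Ceilings cpr_bounds(int K) {
    const double ghz = 2.6;                      // clock rate (GHz)
    const double flops_per_cycle = 16.0 * 8.0;   // 16 cores x 8 DP FLOPs
    const double bw = 2.0 * 51.2e9;              // 2 sockets x 51.2 GB/s
    const double flops_per_row = 2.0 * K;        // K multiplies + K adds
    const double bytes_per_row = 8.0 * K;        // K doubles per row
    const double bytes_per_cycle = bw / (ghz * 1e9);
    return { flops_per_row / flops_per_cycle,    // compute-based floor
             bytes_per_row / bytes_per_cycle };  // memory-based floor
}
```

For K = 100, this gives a compute floor of about 1.6 cycles per row against a memory floor of about 20, i.e. the memory bound is roughly an order of magnitude tighter.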
Comparing the two bounds, we see that in the big-data limit we are memory-bound by roughly an order of magnitude. Achieving this memory-based performance ceiling is non-trivial and requires code modification. We now discuss two key factors that adversely affect the performance scaling of our baseline (SOM) code shown in Fig. 3.
3.3.2 Parallelization Overhead
Vectorization produces a relatively consistent improvement across the range of $N$'s plotted in Fig. 4 (right). Multithreading, on the other hand, scales poorly for small $N$. This is expected since, in the absence of vectorization barriers such as conditional code and non-unit-stride memory access, vectorization overhead is small. In Fig. 5 (top left) we have plotted multithreading overhead as a function of thread count. While a constant overhead is sufficient to create an asymptotic upper bound on performance as we increase parallelism (Amdahl's law), an overhead that increases with thread count further accentuates this performance ceiling. Multithreading overhead is amortized over $N$ to produce an equivalent CPR overhead. For example, a 10,000-clock-cycle overhead translates into a CPR overhead of 100 for $N = 100$, dominating any parallelization gain for small $N$, where the amount of computation per row is small. Larger $K$ means more computation per row and hence a smaller relative impact from multithreading overhead. We can therefore characterize multithreading overhead as a 'small-data' problem. However, when embedded in larger frameworks such as HB, single-regression data becomes a subset of the larger data, and good performance scaling of SIMD parallel Gibbs sampling for small-to-midsize data can enhance overall performance for a big-data problem. This is discussed in detail in Section 4.1. We also see that multithreading overhead is larger for dynamic scheduling than for static scheduling. In static scheduling, jobs are pre-assigned to threads before entering the parallel region, while in dynamic scheduling threads receive new job assignments at runtime as they finish previous work. Dynamic scheduling imposes more overhead due to the increased need for synchronization among threads, but can be beneficial if jobs have uneven or unpredictable durations.
In SIMD parallelism, job durations are often even and hence we prefer static scheduling, which is the default in most C/C++ OpenMP implementations but made explicit in the code of Fig. 3. Thread scheduling policy also has data locality implications, particularly for NUMA systems; this is discussed in the next section.
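To make the scheduling trade-off concrete, the sketch below contrasts the two schedule clauses on a row-wise reduction. `row_work` and both driver functions are illustrative stand-ins rather than the paper's code, and the pragmas degrade gracefully to serial execution when OpenMP is not enabled at compile time.

```c
#include <stddef.h>

/* Per-row kernel: rows of equal length K take equal time, which is
   what makes static (pre-assigned) scheduling a good fit. */
double row_work(const double *x, size_t K) {
    double s = 0.0;
    for (size_t k = 0; k < K; k++) s += x[k];
    return s;
}

/* Static scheduling: contiguous blocks of rows are pre-assigned to
   threads before the parallel region starts -- no runtime handshakes. */
double sum_static(const double *X, size_t N, size_t K) {
    double total = 0.0;
    #pragma omp parallel for schedule(static) reduction(+:total)
    for (size_t n = 0; n < N; n++)
        total += row_work(X + n * K, K);
    return total;
}

/* Dynamic scheduling: chunks of 16 rows are handed out at runtime,
   paying extra synchronization per chunk in exchange for load balance. */
double sum_dynamic(const double *X, size_t N, size_t K) {
    double total = 0.0;
    #pragma omp parallel for schedule(dynamic, 16) reduction(+:total)
    for (size_t n = 0; n < N; n++)
        total += row_work(X + n * K, K);
    return total;
}
```

With even per-row work, `schedule(static)` avoids the per-chunk handshakes that `schedule(dynamic)` pays for its load balancing.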
3.3.3 Memory Stalls
As N gets larger, the total size of X and y exceeds memory size at various levels of the hierarchy. As shown in Fig. 5 (top right), the L3 transition marks a drastic performance degradation. Microbenchmarking (Fig. 5, bottom panels) verifies that this transition is associated with a drop in L3 cache hit rate and an increase in clock cycles lost to L3 cache misses. This second performance barrier can be viewed as a 'big-data' problem. Note that the L2 transition has little impact on performance, as seen in the speedup and lost-CPU-cycle measures. This suggests that there is sufficient computation per row to hide L2 latency. Furthermore, for data sizes small enough to fit into L2, thread management overhead tends to be the dominant factor. Given these observations, we must focus on improving L3 cache utilization as well as memory-bandwidth utilization. We note that there is indeed room for improving memory-bandwidth utilization, given the gap between the actual and best big-data performance seen in Fig. 4 (horizontal line in right panel). This inefficiency is inherited from the MKL BLAS routine responsible for the LA map step in the log-likelihood function. This makes sense because the LA map is the primary consumer of data and hence the performance bottleneck in the big-data regime. We validated this claim by running the dgemv routine alone and measuring the resulting CPR, observing that the gap with the memory-based performance ceiling remains virtually unchanged (data not shown).
3.4 Performance Optimization
3.4.1 Reducing Parallelization Overhead
The baseline code in Fig. 3 contains two disjoint parallel regions, one inside the dgemv call of the LA map (with an implicit nested loop over rows and columns of X) and one in the TR map. As a result, we incur the multithreading overhead twice, and fusing these two loops should reduce such overhead by half. A straightforward way to fuse the two loops is to move the dgemv call inside the TR loop by converting it to a ddot BLAS level-1 call. Each iteration of the fused loop calculates the contribution from one row of the data towards the log-likelihood function. This routine can be considered a literal implementation of Eq. 5, and the outcome, shown in Fig. 6, is an example of the loop-fusion technique used to transform an SOM into a Map of Sequences (MOS) (McCool et al. (2012)). While this version does reduce the multithreading overhead, it can prevent full vectorization of the loop by injecting a function call inside it. Although it is possible to improve performance by inlining and unrolling the ddot loop (see Section 4.6), the MOS pattern is too inflexible for other cases such as calculating function derivatives (Section 4.2) or the Ising model (Section 4.3). We thus opt for an alternative, more general approach that reduces the multithreading overhead of the SOM code in Fig. 3 while maximizing vectorization coverage.
The solution is presented in Fig. 7, and we refer to it as Partial Loop Fusion (PLF). It works by introducing an outer OpenMP parallel region. Each thread within the parallel region receives N/NTHD rows of data to process. Inside the parallel region, we switch to the single-threaded version of MKL, which remains vectorized. The TR map is explicitly vectorized (but requires no OpenMP work-sharing pragma since it is already wrapped in a parallel region). We now have a single parallel region, incurring half the parallelization cost of the code in Fig. 3, while maintaining full vectorization. For illustrative simplicity, we have assumed that N is divisible by NTHD; the general case can be handled with a few more lines of code. The PLF approach is a general strategy for combined multithreading and vectorization of a sequence of maps, and its applications and merits are further discussed in later sections (see Sections 4.2 and 4.3).
3.4.2 Reducing Memory Stall
When total data size exceeds L3 cache size, subsets of data that fit in L3 must be moved in and out of memory. Therefore, in the big-data regime, efficient memory-bandwidth utilization is crucial. In multi-socket servers, each physical processor provides its own memory controller, and cross-socket memory transactions are mediated through a slower point-to-point interconnect such as Intel QPI (http://tinyurl.com/7hhc7rj) or AMD's HyperTransport (http://tinyurl.com/qxlzd4g). This creates asymmetric memory access latency and bandwidth, depending on whether a memory transaction happens within-socket or cross-socket. In contrast to Symmetric Multi-Processor (SMP) systems that suffer from memory bandwidth contention, NUMA topologies offer total memory bandwidth that scales with the number of processors, but to utilize this total bandwidth we must localize data access within each socket as much as possible. To this end, we seek to 'specialize' each socket on one half of the data (assuming all 16 cores are being used), by taking the following steps:

Split X and y in half, and allocate each half on the memory banks associated with one processor. The implementation is described below.

Utilize a compact thread-core affinity policy, which packs threads 0-7 on the first socket and threads 8-15 on the second socket. This is done by setting the following environment variable:
export KMP_AFFINITY="granularity=fine,compact".

Modify the multithreaded log-likelihood routine to point all threads on each socket to their respective socket-local arrays.
A note on NUMA-aware memory allocation is warranted here. Commodity servers use a virtual memory system to hide the physical RAM address space and present a single linear address space to all applications. This makes memory affinity management (step 1) highly OS-specific. Providing automatic and efficient NUMA-aware memory allocation and task scheduling to applications running on multi-socket systems is an active area of research for operating systems, compilers and libraries (Diaconescu and Zima (2007); Broquedis et al. (2009); Ribeiro et al. (2010); Majo and Gross (2011); Molina da Cruz et al. (2011)). The Linux header file numa.h (http://man7.org/linux/manpages/man7/numa.7.html), along with the libnuma library, offers a collection of functions for NUMA-aware memory allocation and thread scheduling. In this paper, we take advantage of the first-touch policy adopted by the Linux kernel (https://software.intel.com/enus/articles/memoryallocationandfirsttouch), where physical pages are allocated to the local memory controller of the thread that first 'touches' the page, i.e. reads from or writes to a memory address within that page. More specifically, we spawn 2 threads, attach each to one processor, and then use a standard heap management facility such as malloc or new to allocate memory for one half of the data and copy the data from the master copy into socket-local arrays. The resulting memory allocation and log-likelihood code is shown in Fig. 8.
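A stripped-down sketch of the first-touch idea is shown below; `numa_split_copy` is a hypothetical helper, and pinning one thread per socket (step 2) is assumed to be handled externally, e.g. via KMP_AFFINITY.

```c
#include <stdlib.h>
#include <string.h>
#include <stddef.h>

/* First-touch sketch: under Linux's first-touch policy, each physical
   page lands on the memory node of the thread that first writes to it.
   Letting one thread per socket allocate and copy its half of X makes
   that half socket-local. N is assumed even for simplicity. */
void numa_split_copy(const double *X_master, size_t N, size_t K,
                     double *halves[2]) {
    size_t half_rows = N / 2;
    #pragma omp parallel for num_threads(2) schedule(static)
    for (int s = 0; s < 2; s++) {
        halves[s] = malloc(half_rows * K * sizeof(double));
        /* the first write below is the 'first touch' that binds pages
           to the touching thread's memory controller */
        memcpy(halves[s], X_master + s * half_rows * K,
               half_rows * K * sizeof(double));
    }
}
```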
Fig. 9 shows the results of applying the above two performance tuning techniques, PLF and NUMA adjustments, in succession (labelled 'PLF' and 'NUMA-PLF'). PLF reduces multithreading overhead, leading to nearly 2x speedup for small data sizes, while NUMA adjustments produce a nearly 2x speedup in the big-data regime. Our big-data performance is now within 46% of the memory-based hardware limit (top left, horizontal line).
Minimizing cross-socket memory transactions reinforces the assumption that task durations in the parallel region are nearly equal. This allows static thread scheduling to remain effective, and we avoid the extra overhead associated with dynamic thread scheduling (Fig. 5, top left).
3.4.3 Cache Contention vs. Data Locality
When using all 16 cores, we see a clear benefit from increasing data locality by using a compact affinity policy in conjunction with NUMA-aware memory allocation. When using fewer than 16 cores, the picture is more complex, as seen in Fig. 9 (bottom right). If total data size is larger than per-socket L3 but smaller than total L3, then spreading the threads across sockets is advantageous since it allows better utilization of total L3 (L3 contention regime). In contrast, when all data fits within L3 on each socket, keeping all cores on the same socket reduces inter-thread communication (L3 locality regime). When data is much bigger than total L3, it must be fetched from memory, and the NUMA topology favors a compact affinity (NUMA locality regime). This tension between data locality and cache contention in NUMA multicore systems has been noted in the literature (Majo and Gross (2011)).
3.4.4 Differential Update
The optimization strategies discussed so far have broad applicability, as they focus on restructuring a given computation to minimize parallelization overhead and maximize data locality. Here we present another optimization technique, 'differential update', that specifically applies to MCMC sampling. It removes unnecessary computation during Gibbs cycles and applies to graphs with continuous as well as discrete nodes. In the context of Bayesian GLM, it is more useful for univariate samplers such as the slice sampler, and less useful for multivariate samplers such as Metropolis (Metropolis et al. (1953)) or Hamiltonian Monte Carlo (HMC) (Duane et al. (1987); Neal (1995)). It is also less useful for samplers that require evaluation of log-likelihood derivatives (Section 4.2). In this section we apply differential update to Bayesian GLM, and in Section 4.3 we discuss the concept for discrete graphs such as Boltzmann Machines and LDA.
Recall that in Gibbs sampling, we cycle through stochastic nodes and draw from the conditional distribution of each node, given the most recent samples for the remaining nodes. Using a univariate sampler such as the slice sampler (Neal (2003)), the $K$ elements of the coefficient vector $\beta$ are updated one at a time. So both inside the sampling algorithm and at its end, only one $\beta_k$ is updated at a time, while the other components remain fixed. The key insight of the differential update technique is that we can apply changes to $X\beta$ following a change to $\beta_k$ without performing the full matrix-vector multiplication. Instead, we recognize that
(14)  $X\beta = \sum_{j=1}^{K} x^{(j)} \beta_j$

(15)  $X\beta = X^{(-k)} \beta^{(-k)} + x^{(k)} \beta_k$

where $X^{(-k)}$ is the result of removing column $k$, i.e. $x^{(k)}$, from $X$, and $\beta^{(-k)}$ is $\beta$ with its $k$'th element removed. From the above expression, we can see that a change to $\beta_k$ leads to the following change to $X\beta$:

(16)  $\Delta(X\beta) = x^{(k)} \, \Delta\beta_k$
Our strategy, therefore, is to maintain a copy of $X\beta$ throughout the Gibbs sampling process, updating it according to the above formula after proposed or accepted changes to the $\beta_k$'s. Fig. 10 shows the resulting function. This strategy both reduces the amount of computation needed (in the LA map) and, more importantly, reduces the data that must be read from memory by nearly a factor of $K$. For large $K$, this optimization has a very significant impact.
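In code, Eq. 16 reduces the per-sample refresh of the cached Xbeta vector to a single-column operation; the sketch below uses illustrative names and a row-major X.

```c
#include <stddef.h>

/* Differential update sketch (Eq. 16): after the univariate sampler
   changes beta[k] by dbeta, refresh the cached xbeta = X*beta vector
   by touching only column k of X, instead of recomputing the full
   N-by-K matrix-vector product. */
void diff_update_xbeta(const double *X, double *xbeta,
                       size_t N, size_t K, size_t k, double dbeta) {
    #pragma omp parallel for simd schedule(static)
    for (size_t n = 0; n < N; n++)
        xbeta[n] += X[n * K + k] * dbeta;
}
```

Only one of the K columns of X is streamed through the cache per update, which is the source of the roughly K-fold reduction in data movement.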
Fig. 11 shows results for differential update. The left panel shows significantly better scaling of performance with the number of cores for the differential update method ('DiffUpdate') compared to the default ('NoDiffUpdate') method. The right panel shows the impact of reduced data movement: the differential update performance does not deteriorate when total data size exceeds L3 (in contrast to the NoDiffUpdate method). Essentially, DiffUpdate has expanded the range of N's for which the accessed data fits within L3 by a factor of K. Fig. 12 compares microbenchmarking results for DiffUpdate and NoDiffUpdate (with vectorization and multithreading on all 16 cores). We see that the overall speedup of 8.5x can be broken down into a 2.4x factor due to higher instruction execution rate (less memory stall), while the remaining 3.5x is due to fewer instructions, i.e. reduced computation.
Metric                    NoDiffUpdate   DiffUpdate
Clock Cycles Per Row      14.4           1.7
L3 Hit Rate (%)           16.2           94.4
Instructions Per Clock    0.5            1.2
Fig. 13 summarizes the impact of the performance improvement techniques applied to the problem of calculating the log-likelihood for the logistic regression problem. Note that in all cases we are using all 16 cores on the machine, and thus the performance improvement reflects greater efficiency rather than simply more resources.
Optimization Technique    CPR (Speedup)   Description
Parallelized LA map       116.9 (NA)      Multithreaded MKL
Parallelized TR map       34.3 (3.4x)     Use omp and simd pragmas
NUMA-PLF                  14.4 (8.1x)     Partial loop fusion and NUMA adjustments
Differential Update       1.8 (64.9x)     Maintain and update Xbeta
While our focus was on logistic regression, the same techniques are directly applicable to other types of GLM such as Poisson regression. While the detailed steps in the TR map change, the LA map stays the same, and so do the underlying concepts of minimizing parallel overhead and data movement.
4 Extensions
In this section, we explore several extensions of the performance optimization concepts developed in the previous section in the context of Bayesian GLM.
4.1 Hierarchical Bayesian GLM
The multilevel framework combines heterogeneous but similar units into a single model, allowing for information sharing without complete homogeneity (Gelman and Hill (2007)). HB allows for consistent treatment of uncertainty and incorporation of prior knowledge in multilevel models, and has been applied to quantitative marketing (Rossi et al. (2005)), visual information processing (Lee and Mumford (2003)), medicine (Babapulle et al. (2004)) and changepoint analysis (Carlin et al. (1992)), among others.
An HB GLM can be constructed from a GLM DAG such as Fig. 2 by replicating it once per regression group. Each unit has a distinct set of explanatory and target variables as well as coefficients. These DAGs are tied together by sharing the parent nodes of the group coefficients. Since the group coefficient vectors are not parents to each other, nor do they have common children (as groups have distinct observations), then according to Theorem 2.1 they are conditionally independent and can be sampled concurrently within a Gibbs sampling framework. Within each group, as with ordinary GLM, we can calculate the child-node contributions to the log-likelihood of each coefficient vector in parallel. The factorizable log-likelihood function for each coefficient vector is given by:
(17)  $L_j(\beta_j) = \sum_{n=1}^{N_j} f_{j,n}(\langle x_{j,n}, \beta_j \rangle)$
Full discussion of HB parallelization is beyond the scope of this paper (and the subject of future research). Here we focus on a key question: given the two parallel modes available (parallel sampling of group coefficients, and parallel calculation of the log-likelihood function for each group), how should available hardware resources, i.e. vectorization and multicore multithreading, be allocated to each mode?
Vectorization. In Section 3.2 we discussed how to fully vectorize the log-likelihood function by adding the simd pragma before the TR loop in Fig. 3, which produced a nearly 3x speedup (exact numbers depend on parameter values). Since the coarser parallelism mode does not lend itself to a SIMD form (due to the presence of function calls and complex conditional code, e.g. in the slice sampling routine), it cannot benefit from vectorization. We therefore continue to dedicate vectorization to parallel log-likelihood calculation, and see a near-multiplicative speedup as a result.
Multithreading. How should the available cores be allocated to the two parallel modes? For example, should all 16 cores be dedicated to parallel sampling of coefficients, with log-likelihood calculation being single-threaded (but vectorized)? Or should we sample one group at a time but use a multithreaded log-likelihood evaluator, or something in between such as 2x8, 4x4, or 8x2 arrangements? From the perspective of coverage and granularity (Chandra (2001)), mapping coarse-grained parallelism to multithreading is a better option since the longer duration of parallel tasks leads to better amortization of parallelization overhead. However, data locality considerations suggest a more complicated story, as seen in Fig. 14. We have compared the performance of the two extreme resource allocation strategies for an HB logistic regression problem with 100 regression groups and 50 covariates per group, as a function of (uniform) group size. We have looked at two scenarios, one where the log-likelihood function for each group is evaluated only once per iteration (Neval = 1), and one where it is evaluated 10 times (Neval = 10). The latter is more representative of a real-world HB problem, where each log-likelihood may be evaluated as often as a few hundred times per iteration for a univariate slice sampler. The interplay between Neval and mapping strategy is illuminating. When Neval is 1, data for each regression group is not reused until the next iteration; when Neval is 10, data is reused, and we see that mapping fine-grained parallelism to multithreading is advantageous (by more than 3x) over a range of data sizes where each group fits in L3 but data size per group times NTHD exceeds L3 capacity. The latter is the total amount of data processed by all threads at any given time under the coarse-parallelism approach.
This is supported by the measured L3 cache hit rates (right panel), where we see much better cache reuse over the middle range of data sizes when using multithreading for the SIMD parallel mode.
The broader lesson here is that, for data sizes and access patterns where cache-friendliness matters, locality considerations may favor using multithreading for fine-grained, cache-friendly parallelism over coarse-grained but less cache-friendly parallelism. Similar situations can arise in other settings, especially in the data analytics lifecycle. Examples include: 1) variable selection using forward or backward selection (Sutter and Kalivas (1993)), where many candidate explanatory variables are added to (or removed from) a base set, and coarse-grained parallelism corresponds to parallel execution of the estimation algorithm against all candidate sets; 2) independent regressions on groups as a precursor to HB; 3) execution of multiple models on various subsets of data, using different feature sets, in the context of ensemble learning (Polikar (2006)). In all these cases, the coarse (task) parallelism is not cache-friendly, and a cache-friendly, fine-grained parallelization of the underlying algorithm can outperform task parallelism. This conclusion reinforces the case for continued investment by the HPC community in developing efficient, parallelized implementations of popular statistical and machine learning algorithms.
This discussion can be further generalized to the case where coarse-grained parallelism corresponds to parallel execution of multiple programs in a multi-user environment. In this case, the operating system is responsible for allocating processor cycles across all running applications, and has the unenviable task of striking a balance between maintaining data locality by offering long time slices to each process and keeping the schedule 'fair' by frequently switching context between processes (Bovet and Cesati (2005)).
A more thorough analysis of HB parallelization requires attention to other topics such as static-vs-dynamic thread scheduling and NUMA-aware memory allocation. For example, a primary application of the HB framework is where some regression groups are so small that they need to borrow information from other, larger groups. Therefore, assuming that all groups have the same size is unrealistic. Non-uniform group sizes mean non-uniform task durations, raising the possibility that dynamic scheduling may be better than static scheduling. This, however, may have a negative impact on data locality if cores are assigned to different groups each time. It is possible that the best strategy is to use a proxy for execution time as a function of group size, pre-allocate groups to threads to achieve an even distribution, and maintain a static mapping to preserve data locality. Such potential complexities and performance hits associated with coarse-grained parallelism strengthen the case for using multithreading with fine-grained parallelism. Further exploration of these issues is the subject of future research.
4.2 Calculating Derivatives
In the context of MCMC sampling of probabilistic DAGs, speeding up the conditional log-likelihood function is highly impactful because the majority of time in a method such as the slice sampler (Neal (2003)) or Metropolis (Metropolis and Ulam (1949)) is spent on function evaluations. There exist other sampling techniques, such as Adaptive Rejection (Gilks and Wild (1992)), HMC (Neal (1995)) and its variants such as Riemann Manifold HMC (Girolami and Calderhead (2011)) and the No-U-Turn Sampler (Hoffman and Gelman (2011)), and Hessian-based Metropolis-Hastings (Qi and Minka (2002)), which use higher derivatives such as the gradient vector or the Hessian matrix. Similarly, in optimization it is advantageous to use function derivatives when available (Nocedal and Wright (2006)). Here we discuss parallel calculation of the gradient vector for GLM. Extension of this work to higher derivatives, namely the Hessian, is the subject of future research.
Starting with Eq. 5, we use indexed functions $f_n$ to absorb the dependence on $y_n$, and differentiate once to get the gradient vector:
(18)  $\partial L / \partial \beta_k = \sum_{n=1}^{N} \frac{\partial}{\partial \beta_k} f_n(\langle x_n, \beta \rangle)$

(19)  $\partial L / \partial \beta_k = \sum_{n=1}^{N} g_n(\langle x_n, \beta \rangle) \, x_{nk}$

(20)  $\nabla L(\beta) = X^T g(X\beta)$

where $g_n \equiv \mathrm{d}f_n/\mathrm{d}u$ and $g(X\beta)$ denotes the vector with elements $g_n(\langle x_n, \beta \rangle)$. The expanded form of Eq. 19 makes the sum over contributions from each observation explicit, while the compact form of Eq. 20 hides this.
Fig. 15 shows the baseline implementation of the routine for calculating the log-likelihood function and its gradient, following the SOM pattern. The routine now consists of a sequence of three maps: 1) calculate $X\beta$ (first LA map); 2) apply $f_n$ and $g_n$ to each element of $X\beta$ (TR map); 3) calculate $X^T g(X\beta)$ (second LA map). Note that for the last step we have switched to a column-major version of $X$ (referred to as Xt in the code) for better data locality and parallelization scaling in the dgemv routine. Fig. 17 shows the performance of this baseline implementation. We can apply the same PLF and NUMA optimizations discussed in Section 3.4 here as well, with similar results. However, even after these two adjustments, we remain well above the hardware limit. For example, Fig. 17 shows that after the PLF and NUMA changes we have a CPR of 28, significantly higher than the memory-based minimum of 10.4. This is somewhat expected, since gradient calculation adds to the amount of computation per row. But we can do better, as explained below.
4.2.1 Cache Fusion
We mentioned that the primary advantage of the MOS over the SOM parallel pattern is consolidating the parallel regions and hence reducing thread management overhead. There is a related benefit in terms of minimizing data movement. In SOM, the output of each map stage is likely to be written to memory if array sizes are large, forcing the next map stage to reach down to memory to retrieve the input it needs (assuming the output of the first map is the input to the second map). By transforming to MOS, we allow all operations to be chained, preventing the data from travelling unnecessarily to and from memory, or even to cache, since values can remain in registers. Our PLF strategy allowed us to minimize thread management overhead, but it did not achieve the same data locality as MOS. Cache fusion (McCool et al. (2012)) allows us to maintain better cache locality. The idea is to break the map into smaller chunks and execute them sequentially, thus making the inputs and outputs of each stage smaller, possibly fitting within L1 or L2. This strategy offers a measurable but small benefit for function evaluation (data not shown). But with the addition of derivatives, enhancing cache locality has a more significant impact since there are more steps and more computation. Fig. 18 shows the code for function and gradient calculation with all three improvements (PLF, NUMA, cache fusion). We refer to the entire setup as a Hierarchical Loop Structure (HLS), which can be a useful parallel pattern for other problems as well.
Fig. 16 shows the impact of NCHUNK on the performance of the code in Fig. 18. For these problem dimensions, we must choose chunks small enough to fit within L2 to get the best performance (NCHUNK = 50), but making the chunks too small has an adverse impact on performance (left panel). The corresponding improvement in L2 hit rate can be seen in the right panel. Further research is needed to better understand the trade-offs involved in reducing chunk size for L2 locality.
In Fig. 17 we show the impact of adding cache fusion to the previous optimization techniques. For each value of N, we chose the largest chunk size that would fit within L2 (256KB), also ensuring that each thread's share of rows (N/NTHD) is a multiple of the chunk size, for code simplification. We see that the impact is significant for large N, e.g. reducing CPR from 28 to 14.5, bringing it within 40% of the hardware limit (10.4).
4.2.2 Thread Synchronization
In function evaluation, threads must synchronize while writing to the shared scalar holding the output; this was done through the reduction clause in the omp and simd pragmas of Figs. 3 and 6. For the gradient, however, the output is an array, and (with the exception of Fortran OpenMP) there is no equivalent construct for performing reductions on arrays. We must instead use an omp critical pragma, as seen in Fig. 18. It is interesting to note that our PLF strategy for assigning jobs to threads enormously reduces the overhead of this critical region, compared to the overhead of the brute-force MOS implementation shown in Fig. 19. For example, we measured the penalty of the critical region for the PLF vs. MOS implementations using 16 threads. For PLF, the critical region imposes an equivalent CPR overhead of 1.3, while for MOS the same number is 475! Critical regions incur two types of penalty: first, they serialize the code, and second, they require lock management by the runtime. Both components are drastically reduced in the PLF approach, since the critical region is entered only NTHD times, here 16. In contrast, in MOS each job execution enters the critical region once, amounting to N times. As an alternative to a critical region, we can use an omp atomic pragma before the addition statement inside the loop surrounded by the critical-region construct in Fig. 19. This slightly reduces the first penalty (by shortening the serialized code) but increases the second, since the mutex-protected region is entered K times as often. Testing this on the MOS code, we see the penalty increase from 475 to 1419. This suggests that lock management is the key performance factor in our synchronization operations. In contrast, a simple reduction on a scalar shows a much smaller performance hit when switching from PLF to MOS (0.2 CPR). Also note that there is no support for vectorization of array reductions, a further blow to the performance of the MOS implementation.
There is an equally important reason to avoid the MOS pattern of Fig. 19, relating to its last step (inside the critical region): the updates made to the array g[] by each thread cause its cache lines to ping-pong between cores (note that this is not 'false sharing', since threads do not merely share the same cache line but are indeed writing to the same memory locations), creating another large performance hit. We validated this in two ways: 1) after removing the omp critical pragma (no synchronization), we measured CPR with and without the update line g[k] += gtmp[k];, and saw that this update line creates a 240-CPR penalty; 2) we measured L2/L3 hit rates and saw a drastic increase for both cache levels after removing the update line: 0.03% to 71% (L2) and 0.02% to 95% (L3). Again, the PLF approach works much better because each core does as much of the calculation as possible privately, and updates the global g[] only once at the end. The cache hit rates for the code in Fig. 18 agree with this story (76% and 98% for L2 and L3, respectively).
Finally, note that the differential update strategy is not as effective when we need to calculate the gradient. This is because 1) calculating Xbeta is a smaller part of the entire computation, and 2) more importantly, differential update does not free us from reading the entire $X$ into cache while sampling each $\beta_k$, since calculation of the gradient cannot be simplified to require only the corresponding column $x^{(k)}$.
Fig. 20 summarizes the optimization techniques applied to parallel evaluation of the log-likelihood and its gradient for Bayesian GLM, and the impact of each technique on the small-data and big-data problems. In the big-data regime, we reach within 25% of the memory-bandwidth-induced CPR. This is 5x faster than our baseline implementation of Fig. 15 (with or without TR parallelization), and 2.5x faster than the performance of Intel MKL's first dgemv call in the same code; NUMA adjustments explain this performance gap. Cache fusion provides the remaining 2x speedup, and impacts the chain of map sequences used in the routine by increasing L2 cache locality. Also note that small-data techniques (TR map parallelization and PLF) become important when the problem is embedded in composite problems of larger size, such as HB GLM (Section 4.1).
Optimization Technique    Small-Data Speedup   Big-Data Speedup   Explanation of Impact
TR Map Parallelization    2.4                  1.01               For wide data, the LA map dominates computation, reducing the impact of TR parallelization.
Partial Loop Fusion       2.5                  1.02               Total parallel overhead is independent of N, so its amortized CPR is negligible for big data.
NUMA Adjustments          1.1                  2.5                Small data fits in the L3 of each socket, making memory bandwidth irrelevant.
Cache Fusion              1.0                  2.0                For small data, each core's share fits in its L2, making cache fusion irrelevant.
Overall Speedup           6.7                  5.2                Small-data techniques remain relevant in big-data composite problems such as HB.
4.3 Beyond GLM
So far, our discussion of SIMD parallel MCMC has focused on Bayesian GLM. These models have two important characteristics: 1) lack of conjugacy between prior and likelihood for the coefficients means we have no direct sampling technique available for the conditional posterior, and must therefore use MCMC techniques whose core computational work involves repeated function and derivative evaluations; 2) the conditional posterior function is computationally expensive, scaling linearly with the number of observations (N) as well as the number of attributes (K). Two implications follow from these properties: 1) random number generation does not constitute a significant fraction of sampling time; 2) the sampling routine is too complex for the coarse-grained parallelism (concurrent sampling of conditionally-independent nodes) to be vectorizable.
In this section, we look at a different class of graphical models: Markov Random Fields (undirected graphs) with discrete nodes. We will see that, despite several important differences between MRFs and GLMs, the PLF framework continues to be helpful, with a few new twists to consider.
4.3.1 Ising Model
We consider a square-lattice Ising model with binary nodes, each node interacting with its 4 adjacent nodes. While the Ising model has its origins in statistical mechanics (Brush (1967)), it has been applied to statistical inference problems such as image reconstruction and denoising (Perez et al. (1998); Descombes et al. (1998)) and neural information processing (Hopfield (1982)). The energy function for the Ising model is given by:
(21)  $E(s) = -\sum_{\langle i,j \rangle} J_{ij} s_i s_j - \sum_i b_i s_i$

where $s$ is the state vector, $J_{ij}$ is the interaction weight between neighbors $i$ and $j$, and the $b_i$'s control the bias terms. The first sum occurs over pairs of nodes $i$ and $j$ that are neighbors. In image denoising, bias terms reflect the pixel intensities of the noisy image, serving to induce the output to represent the input. The $J_{ij}$'s enhance correlation among neighboring pixels to reduce noise. The configuration probability is given by the Boltzmann distribution:
\[ p(\mathbf{x}) = \frac{1}{Z} \exp\{-E(\mathbf{x})\} \tag{22} \]
where the normalizing factor \(Z\), called the partition function, is given by \(Z = \sum_{\mathbf{x}} \exp\{-E(\mathbf{x})\}\). In Gibbs sampling of the Ising model, the conditional probability of node \(k\) is given by:
\[ p(x_k = +1 \mid \mathbf{x}_{-k}) = \frac{1}{1 + \exp(-2\Delta_k)} \tag{23} \]
where \(\Delta_k = b_k + \sum_{j \in N(k)} J_{kj}\, x_j\) and the sum is over the neighbors \(N(k)\) of node \(k\).
This graph has a 2-color decomposition (Fig. 21), and nodes within each color can be sampled concurrently. This is the coarse-grained parallelism discussed in Section 2.3. On the other hand, since the sum in \(\Delta_k\) contains only 4 terms, the fine-grained parallelism opportunity is unattractive. It is clear, therefore, that multicore multithreading must be dedicated to the only available parallel mode. But the outstanding question is: can vectorization be applied here as well? Recall that in HB GLM, due to the complexity of the sampling steps for each node and the availability of fine-grained parallelism, we did not consider vectorization for coarse-grained parallelism. But in the Ising model, the sampling steps for each node are significantly simpler, as described below:

Calculate \(\Delta_k\), the contribution from the neighbors. This consists of 4 multiplications and 5 additions.

Calculate \(p(x_k = +1 \mid \mathbf{x}_{-k})\) according to Eq. 23. This involves 1 exponentiation, 1 division, and 1 addition.

Generate a sample for \(x_k\). This consists of generating a random deviate from the uniform distribution and comparing it to the probability calculated in step 2.
As always, the ideal scenario is to vectorize all steps. However, we face two challenges in doing so:

Memory Access Patterns: In step 1, fetching neighbor states of a given node requires a gather memory operation. Handling boundary conditions adds further complexity to the code. Both of these can render vectorization inefficient or impossible.

RNG: Making an external function call to a precompiled RNG library in step 3 can act as a vectorization blocker.
Here again we can take advantage of the PLF strategy for incremental vectorization. Given the memory access pattern challenge of step 1 and its small contribution to total computation per sample, we choose to leave it scalar. As for step 3, we propose a batch RNG approach.
Batch RNG Batch generation of the uniform random deviates used in step 3 offers two advantages. First, it removes the RNG call from the sampling loop, allowing the loop to be vectorized more efficiently. Second, it speeds up RNG itself by reducing function call overhead and allowing the use of high-performance libraries that utilize vector instructions or other forms of parallelization; examples include Intel VSL (http://tinyurl.com/pevz69r), Nvidia's cuRAND (https://developer.nvidia.com/curand), and the SPRNG project (http://www.sprng.org/). Applying the batch RNG strategy is more involved for complex distributions; see Section 4.4.
Data Layout Optimization Vectorization efficiency for step 2 depends on data alignment for the state and bias arrays. If the entire network occupies a contiguous block of memory, then the data for each (colored) graph partition will not be contiguous. Such non-unit-stride memory accesses have a particularly negative impact on vectorization. To overcome this, we can extract each subgraph and write it to new arrays, perform Gibbs sampling on these contiguous-memory arrays, and transform the results back to the original arrangement at the end (Fig. 21).
Fig. 22 shows the impact of the 3 optimization techniques on the performance of Gibbs sampling for an Ising model used in denoising a 560-by-516 binary image (1000 samples). We followed the specification and parameters in Bishop (2006) (Chapter 8). Switching to batch RNG provides a 4.9x speedup for the RNG part of the code (84.1 → 17.1 clock cycles per sample). Vectorization results in a 2.3x speedup in the non-RNG part of the code (75.4 → 32.5). Finally, switching to unit-stride memory access leads to a further 1.5x speedup in the non-RNG part (32.5 → 21.7). The cumulative impact of all 3 optimization techniques (on the entire code) is a 4.3x speedup.
4.3.2 Boltzmann Machine
A close cousin of the Ising model is the Boltzmann machine (Ackley et al. (1985)), where nodes are densely connected rather than having the lattice connectivity of the Ising model. As a result, 1) there is no conditional independence among nodes and hence the graph cannot be colored for parallel sampling; 2) each node is potentially connected to a large number of nodes, creating a SIMD parallelization opportunity in the calculation of the \(\Delta_k\)'s. A variant of the Boltzmann machine is the Restricted Boltzmann Machine (RBM) (Smolensky (1986)), a key ingredient in building deep belief nets (Hinton et al. (2006)), which are seeing increasing application in supervised and unsupervised learning of generative models for image (Krizhevsky et al. (2012)), video (Le et al. (2011)) and speech (Mohamed et al. (2011)) data. RBMs have a bipartite graph structure, where visible and hidden nodes are densely connected, but no connections among nodes of the same layer are allowed. This leads to conditional independence of the hidden nodes, allowing for their parallel sampling. In choosing between the fine- and coarse-grained parallel modes for vectorization, we must consider these factors: 1) The fine-grained parallel mode consists of a summation over contributions from nodes connected to a given node; its arithmetic complexity is low, but vectorization is easy. 2) The coarse-grained parallel mode contains the inner summation loop; efficient vectorization of this step within the coarse-grained parallelism would require a loop unroll (see Section 4.6). 3) Steps 2 and 3 from the Ising model are repeated here, and we can use the strategies discussed before. Note that dense connectivity can alleviate or remove the problem of suboptimal memory access patterns.
4.3.3 Differential Update and Atomic Vector Operations
We introduced the differential update strategy in the context of Bayesian GLM models (Section 3.4). This strategy can also be used with discrete-node graphs such as Ising models, Boltzmann machines, and LDA models. For example, consider the image denoising example introduced in this section. We measured transition rates for binary pixels in the model, finding that after the first few iterations fewer than 10% of the pixels switch value. If we switch to a differential update approach, where we maintain intermediate data structures (the local fields \(\Delta_k\)) and update them each time a node has a state transition, we see a nearly 2x speedup. However, combining this differential update approach with vectorization can be a challenge: if two nodes are updated simultaneously and they share neighbors, they will attempt to update the same elements of the intermediate arrays, requiring atomic operations inside vectorized code. Unfortunately, current versions of x86 processors do not support atomic vector operations (Kumar et al. (2008)), and it is likely that future support will be rather inefficient. Similar opportunities and challenges exist for RBMs, where conditional independence allows hidden nodes to be updated concurrently, but they could attempt to write to the same locations in intermediate arrays. In non-collapsed Gibbs sampling of LDA, a differential update strategy is commonly used where matrices representing counts of words by topic and topics by document are maintained and updated after each new topic assignment for a token (Xu and Ihler (2011)). Vectorization of parallel sampling of topic assignments faces the same challenge of needing to perform atomic operations on these count matrices.
4.4 Batch RNG for Complex Distributions
In Section 4.3 we saw that using batch RNG for the uniform deviates in Gibbs sampling of the Ising model led to a nearly 2x speedup (Fig. 22) and paved the way for vectorization of the concurrent sampling of nodes of the same color. Distributions such as the uniform and normal are easy to handle in batch mode due to their linear transformation properties, which allow a deviate with standard parameters to be converted into a deviate with arbitrary parameters:
\[ a + (b-a)\,U(0,1) \sim U(a,b) \tag{24} \]
where \(U(a,b)\) is the uniform distribution on the real interval \([a,b]\). Similarly, for the normal distribution we have:
\[ \mu + \sigma\,N(0,1) \sim N(\mu,\sigma) \tag{25} \]
where \(N(\mu,\sigma)\) is the normal distribution with mean \(\mu\) and standard deviation \(\sigma\). If a distribution does not enjoy such properties, we cannot pre-generate its random deviates without knowing the values of the parameters needed. The best we can do is to pre-generate the 'component' random deviates that are used in the algorithm for sampling from the desired distribution. As a concrete example, consider the Dirichlet distribution, used in LDA models as a conjugate prior to the categorical and multinomial distributions of topics in documents and words in topics (Blei (2012)):
\[ p(\boldsymbol{\theta} \mid \boldsymbol{\alpha}) = \frac{\Gamma\!\left(\sum_{k=1}^K \alpha_k\right)}{\prod_{k=1}^K \Gamma(\alpha_k)} \prod_{k=1}^K \theta_k^{\alpha_k - 1} \tag{26} \]
where \(\boldsymbol{\theta} = (\theta_1,\dots,\theta_K)\), \(\theta_k \geq 0\), and \(\sum_{k=1}^K \theta_k = 1\). Dirichlet deviates can easily be constructed from Gamma deviates (first draw \(y_k \sim \mathrm{Gamma}(\alpha_k, 1)\), then normalize: \(\theta_k = y_k / \sum_j y_j\)), which in turn require uniform (and possibly normal) deviates. We took the Gamma sampling algorithm described in Press (2007), which uses the rejection method of Marsaglia and Tsang (2000) for \(\alpha \geq 1\) and a lemma described in Ripley (2009) to cast an \(\alpha < 1\) problem into an \(\alpha \geq 1\) problem, and adapted it to use a buffer of uniform and normal random deviates, generated in batch. This Gamma sampler was then used in the Dirichlet sampler. For reference, we used a baseline Gamma sampler which generated its uniform and normal deviates one at a time. In both versions, Intel VSL routines were used for the uniform and normal deviates. We tested the batch and one-at-a-time versions of the Dirichlet sampler on the Enron email data set from the UCI repository (Bache and Lichman (2013)), specifically for sampling the probability of topics (100) for each document (39,861). A flat (Dirichlet) Jeffreys prior was used for all documents. The one-at-a-time Dirichlet sampler had a performance of 1263 clock cycles per Gamma sample, while the batch version achieved 198 clock cycles, a speedup of 6.4x. Fig. 23 summarizes the speedup offered by batch vs. one-at-a-time RNG for the uniform, normal, Gamma, and Dirichlet distributions.
| Distribution | One-At-A-Time (CCPS) | Batch (CCPS) | Batch Speedup |
|--------------|----------------------|--------------|---------------|
| Uniform      | 76                   | 18           | 4.2x          |
| Normal       | 949                  | 39           | 24.3x         |
| Gamma        | 1170                 | 162          | 7.2x          |
| Dirichlet    | 1263                 | 198          | 6.4x          |
Our results suggest that there is value in RNG libraries offering batch generation capabilities. In the above example, we could not use the Gamma sampler offered by Intel VSL, since its interface does not offer an option for passing in buffers of uniform or normal deviates, forcing us to write the Gamma sampler from scratch.
4.5 Beyond SingleNode x86
The focus of this paper has been on maximizing the performance of a single, shared-memory x86 compute node for parallel MCMC sampling. Going beyond single-node x86 can be driven by three needs: 1) more memory, 2) more memory bandwidth, and 3) more compute power. In distributed-memory clusters, total memory and compute power scale linearly with node count, and so does the total memory bandwidth between each node's cores and its private memory. However, typically-slower inter-node connections mean the inter-node costs of synchronization and data exchange are higher than their intra-node counterparts. As such, the two principles of minimizing parallel overhead and minimizing data movement must be followed even more rigorously in distributed computing. In fact, we can consider our NUMA adjustments, discussed in Section 3.4, as a precursor to distributed computing. Asymmetric treatment of cross-socket vs. within-socket communication, while continuing to take advantage of the convenience of shared-memory programming, can be considered halfway between writing symmetric, shared-memory parallel code and the standard message-passing approach to distributed computing on a cluster.
Many-core architectures such as the Intel Xeon Phi and GPUs offer another option for adding memory bandwidth and compute power. They can be used as an alternative to, or in conjunction with, distributed computing. They are best suited for floating-point SIMD parallelism, since they dedicate more transistors to floating-point operations and fewer to out-of-order execution, branch prediction, and speculative execution (Kirk and Wenmei (2012); Jeffers and Reinders (2013)). Intel Xeon Phi coprocessors can be considered an extension of x86 processors with a higher core count (up to 60) and wider vector units (512-bit, holding 8 doubles). Fig. 24 compares the compute and memory performance of an Intel Xeon Phi 7120P and an Nvidia Tesla K40 against our x86 test system. Based on these numbers, we can say, in broad terms, that compute-bound problems can gain about 3.5x in speedup while memory-bound problems can gain between 2x and 3.5x by switching from an x86 processor to a many-core coprocessor. As with x86, approaching these hardware limits requires tuning to minimize parallel overhead and maximize data locality.
|                                              | Intel Xeon E5-2670 (dual-socket) | Intel Xeon Phi 7120P | Nvidia Tesla K40 |
|----------------------------------------------|----------------------------------|----------------------|------------------|
| Memory Bandwidth (GB/sec)                    | 102.4                            | 352 (3.44x)          | 288 (2.81x)      |
| Peak Floating-Point Performance (GFLOPS/sec) | 332.8                            | 1208 (3.63x)         | 1430 (4.30x)     |
| Memory (GB)                                  | (up to) 384                      | 16 (1:24)            | 12 (1:32)        |
There are two other factors beyond the raw performance of coprocessors that must be considered in the cost/benefit analysis of using coprocessors (also referred to as 'devices'): 1) memory capacity, and 2) data transfer and parallelization overhead. As we see in Fig. 24, memory capacity is significantly lower for the current generation of coprocessors compared to modern x86 servers. Therefore, as data sizes exceed device memory, the device code may become significantly more complex to accommodate the need to move data in and out of device memory. In our GLM example, the Tesla K40 card will run out of space for data larger than 31.6 million rows, while the same limit on the host system will be reached at data sizes 32 times larger.
Unless the code is ported completely to the device, so that there is no need to send data back and forth between host and device memory, we will incur data transfer costs due to nonzero latency and finite bandwidth. On GPU cards, each kernel launch incurs an additional cost, while on Xeon Phi, where we may launch more than 100 threads, the thread management overhead can be significantly higher than the numbers we discussed for the host system. For example, in Bayesian GLM we may want to use the device only for function evaluation and run the sampler on the host, or there might be other parameters besides the regression coefficients that are sampled. As a result, each function evaluation may incur the overhead associated with 1) sending the coefficients from host to device, 2) launching the kernel on the device, and 3) sending the log-likelihood result from the device back to the host. To measure all these overhead components, we used an empty kernel inside a for loop, as seen in Fig. 25 (left). (This test was performed on a Tesla K20 card.) This loop took nearly 20 microseconds per iteration; for the data sizes considered, this translates into a CPR overhead of 5.2. Recall that, using the differential update approach on the CPU, we reached a CPR of less than 2.0 using all 16 cores (Fig. 13). Therefore, even ignoring the memory and compute limits of the device, the data transfer and kernel launch costs alone remove any incentive to port the code to the device, unless we port a sufficient portion of the code to remove the need for such frequent kernel launches and data transfers. Fig. 25 (right) shows the expected speedup as a function of device hardware limits as well as kernel launch and data transfer overhead. This analysis can be used to make an informed decision on whether (and how much of) the code should be ported to the device.
Finally, development and maintenance costs associated with code complexity must be factored into a total-cost-of-ownership analysis of using coprocessors. CUDA (http://www.nvidia.com/object/cuda_home_new.html) is a convenient programming language for Nvidia GPUs, yet it is not open-source, while OpenCL (https://www.khronos.org/opencl/) has the opposite pros and cons. Even CUDA code for the logistic regression log-likelihood will inevitably be more complex than the CPU code. For example, in CUDA kernels, threads within a block can communicate inside a kernel, while blocks cannot; on the other hand, there is an upper limit on the number of threads that can be launched within a block. Such limitations may force us to break the computation into two kernels: one in which a partial reduction is performed within blocks, and a second in which the final sum is calculated. Optimal selection of grid and block geometry is another design decision that may require runtime and/or compile-time code branching.
4.6 CompileTime Loop Unroll
In implementing the GLM log-likelihood, we encountered a nested loop in the LA map whose vectorization in the SOM and PLF patterns was delegated to the Intel MKL library as part of the dgemv call (see Figs. 3 and 7). We also mentioned that, to transform this log-likelihood function to a MOS pattern, we can fuse the outer loop of the LA map with the TR loop, which leaves us with a ddot call inside the fused loop to represent the LA step (Fig. 6). But the presence of this function call prevents full vectorization of the outer loop, hurting performance. This can be verified by inspecting the assembly code generated by the Intel compiler, as described below.
The assembly code for the function in Fig. 6 (with the fused loop vectorized) consists of three loops, each containing calls to the AVX-vectorized transcendental functions __svml_exp4 and __svml_log4. (To focus on vectorization, we compiled the code without the OpenMP pragma. For more on generating vectorized code, see the Web Aside http://csapp.cs.cmu.edu/public/waside/waside-simd.pdf to Bryant and O'Hallaron's textbook (Bryant and O'Hallaron (2011)).) The reason for breaking the loop into 3 parts is to take advantage of efficient data movement instructions for aligned data in the main loop, while allowing scalar data movement instructions to handle misalignment in the first loop, and array lengths that are incomplete multiples of 4 in the last loop (since each AVX vector instruction handles 4 doubles in its 256-bit registers). The evidence for partial vectorization comes from seeing 4 calls to ddot inside the main loop: since ddot is scalar (i.e., it produces only one element per call as input to the transcendental functions), it must be called 4 times to generate the full input for the vector instructions. As we see in Fig. 26, the resulting performance ('MOS - ddot', CPR of 86.7) is much worse than the SOM pattern of Fig. 3 ('SOM', CPR of 32.3).
Since ddot is a simple routine, we can replace it with a for loop ('MOS - RegularLoop'), with significant performance improvement. We can further optimize the code by making the loop span a compile-time constant ('MOS - KnownSpanLoop'). The best result, however, is achieved when we fully unroll the loop ('MOS - FullyUnrolledLoop'), producing essentially the same performance as the SOM pattern.
Loop unrolling does not necessarily require hand-coding by the programmer. Instead, a source-to-source compilation stage can be utilized, or we can use the template metaprogramming facilities of C++ compilers. Fig. 27 shows example code that generates a fully-unrolled version of the ddot BLAS routine for calculating the inner product of two vectors. In the logistic regression example, the only parameter subspace for which we saw a significant advantage of 'MOS - FullyUnrolledLoop' over SOM is for short and narrow data when multithreading. For example, if we use the same parameters as in Fig. 26 but 4 threads instead of 1, the CPR numbers for these two methods become 12.0 and 19.4, respectively. In addition to being inflexible in handling external function calls, the MOS pattern faces another significant performance challenge: excessive synchronization costs in the presence of reduction-to-array operations needed for calculating the derivatives of the log-likelihood function (Section 4.2).
Moving tuning decisions from runtime to compile time is generally a fruitful strategy for HPC. A recent example is the Stan compiler for Bayesian DAGs. In contrast to previous compilers such as WinBUGS and JAGS, Stan performs a full compilation of the DAG into C++ code. This is likely one reason for its superior performance (in addition to its use of more efficient samplers in HMC and NUTS). Our manual loop unroll pushes this JIT-compilation idea further by requiring compilation to occur for each new set of values of the problem dimensions (specifically, the number of attributes here). This is a minor nuisance but can offer performance gains, especially for vectorization.
5 Concluding Remarks
5.1 Summary
In this paper, we presented two opportunities for SIMD parallel MCMC sampling of probabilistic graphical models: 1) parallel calculation of the log-likelihood and its derivatives for exchangeable models lacking conjugacy, such as GLM; and 2) parallel sampling of conditionally-independent nodes for graphs with discrete nodes and conjugacy, such as the Ising model. In addition, we offered strategies for efficient implementation of these opportunities on multicore x86 processors using a combination of vectorizing compilers and libraries, and a series of high-level changes to the source code and runtime environment. These strategies included Partial Loop Fusion to minimize parallel overhead, NUMA-aware adjustments to minimize cross-socket communication, cache fusion to maximize cache utilization, and batch RNG for efficient random number generation and code vectorization. For Generalized Linear Models, one of the most commonly-used classes of models in predictive analytics, our implementation of the log-likelihood function and its gradient vector reached within 25% of the hardware limit induced by total memory bandwidth in the big-data regime, outperforming a fully vectorized, (built-in) multithreaded, but naive implementation using Intel MKL by more than 5x.
There are two conceptual extremes with regard to utilizing precompiled, optimized libraries as components of an HPC application: 1) dropping in the precompiled library components as they are and accepting suboptimal aggregate results, or 2) discarding all precompiled components and writing everything from scratch, which can significantly raise the development costs of the project. Our research suggests a middle ground: high-level modifications layered on top of precompiled libraries, allowing significant performance improvement while incurring only modest development costs.
5.2 Future Directions
There are several important areas for further research.

Portable Implementation of Optimization Techniques: To turn research into action, we must identify robust and portable methods for implementing our proposed optimization techniques. For example, our NUMA adjustment relied on the first-touch policy used by Linux, and in cache fusion, the optimal number of chunks is tied to the size of the L2 cache. Such hardware and OS dependencies must be addressed before a widely-usable library can be released.

High-Performance GLM Hessian: We offered an efficient implementation of the log-likelihood function and its first derivative for GLM. However, due to their log-concavity [ref], GLMs are suitable for optimization and sampling techniques that utilize the second derivative (Hessian matrix) of the log-likelihood (or log-posterior) function, such as Newton optimization (Nocedal and Wright (2006)) (known for GLM models as Iteratively Reweighted Least Squares, or IRWLS) or RMHMC sampling (Girolami and Calderhead (2011)). Developing an equally-efficient implementation of the Hessian for GLMs would be an important contribution to the field of (Bayesian and non-Bayesian) predictive analytics. We expect the Hierarchical Loop Structure framework of Section 4.2 to remain effective, but additional opportunities for optimization, such as taking advantage of the symmetry of the Hessian matrix, must be thoroughly studied.

Hierarchical Bayesian GLM: Many Bayesian applications of GLM involve an HB framework. Our research highlights the subtleties of combining the coarse- and fine-grained parallel modes available for HB GLMs. These subtleties, as well as additional complexities arising from highly-imbalanced regression groups, must be carefully studied and encoded into a high-performance library for MCMC sampling of HB GLM models (or into generic DAG compilers such as Stan).

Batch RNG: We demonstrated the value of batch RNG for improving the efficiency and vectorization of MCMC sampling for models such as Ising, RBM, and LDA. For complex distributions, existing libraries cannot be utilized in batch mode in the context of MCMC sampling, since the parameters of the distributions change from one iteration to the next. (This is in contrast to purely Monte Carlo, non-Markovian, settings where the sampled distributions remain stationary throughout the simulation.) Developing batch RNG libraries, i.e., libraries that generate buffers of random deviates for a set of core distributions such as uniform and normal and utilize these buffers to generate random deviates from complex distributions, will be an important contribution towards high-performance MCMC sampling in big-data applications such as automated semantic analysis of web data.

Vectorization and Atomic Operations: Atomic operations arise in parallel sampling of conditionally-independent nodes (such as in non-collapsed LDA and in RBM), where nodes may simultaneously attempt to update the same intermediate data structures. Until hardware support for vectorized atomic operations becomes available, software options can be explored. For example, one possibility is to accumulate updates in temporary buffers such that each vector instruction produces non-overlapping updates to the buffer. The higher flexibility offered by vector intrinsics might be required to implement such solutions.
References
 Ackley et al. (1985) Ackley, D. H., Hinton, G. E., Sejnowski, T. J., 1985. A learning algorithm for Boltzmann machines. Cognitive Science 9 (1), 147–169.
 Ahmed et al. (2012) Ahmed, A., Aly, M., Gonzalez, J., Narayanamurthy, S., Smola, A. J., 2012. Scalable inference in latent variable models. In: Proceedings of the fifth ACM international conference on Web search and data mining. ACM, pp. 123–132.
 Antonio and Beirlant (2007) Antonio, K., Beirlant, J., 2007. Actuarial statistics with generalized linear mixed models. Insurance: Mathematics and Economics 40 (1), 58–76.
 Babapulle et al. (2004) Babapulle, M. N., Joseph, L., Bélisle, P., Brophy, J. M., Eisenberg, M. J., 2004. A hierarchical bayesian metaanalysis of randomised clinical trials of drugeluting stents. The Lancet 364 (9434), 583–591.
 Bache and Lichman (2013) Bache, K., Lichman, M., 2013. UCI machine learning repository. URL http://archive.ics.uci.edu/ml
 Bernardo (1996) Bernardo, J. M., 1996. The concept of exchangeability and its applications. Far East Journal of Mathematical Sciences 4, 111–122.
 Bishop (2006) Bishop, C. M., 2006. Pattern recognition and machine learning.
 Blei (2012) Blei, D. M., 2012. Probabilistic topic models. Communications of the ACM 55 (4), 77–84.
 Bovet and Cesati (2005) Bovet, D. P., Cesati, M., 2005. Understanding the Linux kernel. O’Reilly Media, Inc.
 Brockwell (2006) Brockwell, A., 2006. Parallel markov chain monte carlo simulation by prefetching. Journal of Computational and Graphical Statistics 15 (1), 246–261.
 Broquedis et al. (2009) Broquedis, F., Furmento, N., Goglin, B., Namyst, R., Wacrenier, P.A., 2009. Dynamic task and data placement over numa architectures: an openmp runtime perspective. In: Evolving OpenMP in an Age of Extreme Parallelism. Springer, pp. 79–92.
 Brush (1967) Brush, S. G., 1967. History of the lenzising model. Reviews of Modern Physics 39 (4), 883.
 Bryant and O’Hallaron (2011) Bryant, R., O’Hallaron, D. R., 2011. Computer systems: a programmer’s perspective, Second Edition. Prentice Hall.
 Carlin et al. (1992) Carlin, B. P., Gelfand, A. E., Smith, A. F., 1992. Hierarchical bayesian analysis of changepoint problems. Applied statistics, 389–405.
 Chandra (2001) Chandra, R., 2001. Parallel programming in OpenMP. Morgan Kaufmann.
 da Silva (2011) da Silva, A. R. F., 2011. Cudabayesreg: parallel implementation of a bayesian multilevel model for fmri data analysis. Journal of Statistical Software 44 (4), 1–24.
 Descombes et al. (1998) Descombes, X., Kruggel, F., von Cramon, D. Y., 1998. Spatiotemporal fmri analysis using markov random fields. Medical Imaging, IEEE Transactions on 17 (6), 1028–1039.
 Diaconescu and Zima (2007) Diaconescu, R. E., Zima, H. P., 2007. An approach to data distributions in chapel. International Journal of High Performance Computing Applications 21 (3), 313–335.
 Duane et al. (1987) Duane, S., Kennedy, A. D., Pendleton, B. J., Roweth, D., 1987. Hybrid monte carlo. Physics letters B 195 (2), 216–222.
 Dumont (2011) Dumont, M. D., 2011. Markov chain monte carlo on the gpu. Ph.D. thesis, Rochester Institute of Technology.
 Fenton and Neil (2004) Fenton, N., Neil, M., 2004. Combining evidence in risk analysis using bayesian networks. Agena White Paper W 704.
 Gelman and Hill (2007) Gelman, A., Hill, J., 2007. Data analysis using regression and multilevel/hierarchical models. Cambridge University Press.
 Geman and Geman (1984) Geman, S., Geman, D., 1984. Stochastic relaxation, gibbs distributions, and the bayesian restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence (6), 721–741.
 Gilks and Wild (1992) Gilks, W. R., Wild, P., 1992. Adaptive rejection sampling for gibbs sampling. Applied Statistics, 337–348.
 Girolami and Calderhead (2011) Girolami, M., Calderhead, B., 2011. Riemann manifold langevin and hamiltonian monte carlo methods. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 73 (2), 123–214.
 Gonzalez et al. (2011) Gonzalez, J., Low, Y., Gretton, A., Guestrin, C., 2011. Parallel gibbs sampling: From colored fields to thin junction trees. In: International Conference on Artificial Intelligence and Statistics. pp. 324–332.