An efficient numerical method for solving the Boltzmann equation in multidimensions
Abstract
In this paper we deal with the extension of the Fast Kinetic Scheme (FKS) [J. Comput. Phys., Vol. 255, 2013, pp 680698] originally constructed for solving the BGK equation, to the more challenging case of the Boltzmann equation. The scheme combines a robust and fast method for treating the transport part based on an innovative Lagrangian technique supplemented with fast spectral schemes to treat the collisional operator by means of an operator splitting approach. This approach along with several implementation features related to the parallelization of the algorithm permits to construct an efficient simulation tool which is numerically tested against exact and reference solutions on classical problems arising in rarefied gas dynamic. We present results up to the DD case for unsteady flows for the Variable Hard Sphere model which may serve as benchmark for future comparisons between different numerical methods for solving the multidimensional Boltzmann equation. For this reason, we also provide for each problem studied details on the computational cost and memory consumption as well as comparisons with the BGK model or the limit model of compressible Euler equations.
keywords:
Boltzmann equation, kinetic equations, semiLagrangian schemes, spectral schemes, 3D/3D, GPU, CUDA, OpenMP, MPI.Msc:
(82B40, 76P05, 65M70, 65M08, 65Y05, 65Y20Contents
 1 Introduction
 2 The Boltzmann equation
 3 The Fast Kinetic Scheme and the Fast Spectral Scheme
 4 Numerical implementation

5 Numerics
 5.1 Part 1. Numerical results for the space homogeneous case

5.2 Part 2. Numerical results for the one dimensional case in space.
 5.2.1 Test 2.1. Numerical convergence of the Boltzmann equation. The two dimensional in velocity Maxwellian molecules case.
 5.2.2 Test 2.2. Comparisons between the BGK model and the Boltzmann model. The two dimensional in velocity Maxwellian molecules case.
 5.2.3 Test 2.3. Numerical convergence of the Boltzmann equation. The three dimensional in velocity hard sphere molecules case.
 5.2.4 Performances
 5.3 Part 3. Numerical results for the space two dimensional case.
 5.4 Part 4. Numerical results for the space three dimensional case.
 6 Conclusion and perspectives
1 Introduction
Kinetic equations consider a representation of a gas as particles undergoing instantaneous collisions interspersed with ballistic motion cercignani; ACTA. Nowadays, these models appear in a variety of sciences and applications such as astrophysics, aerospace and nuclear engineering, semiconductors, plasmas related to fusion processes as well as biology, medicine or social sciences. The common structure of such equations consists in a combination of a linear transport term with one or more interaction terms which furnishes the time evolution of the distribution of particles in the phase space. For its nature, the unknown distribution lives in a seven dimensional space, three dimensions for the physical space and three dimensions for the velocity space, plus the time. This makes the problem a real challenge from the numerical point of view, since the computational cost becomes immediately prohibitive for realistic multidimensional problems ACTA. Aside from the curse of dimensionality problem, there are many other difficulties which are specific to kinetic equations. We recall two among the most important ones. The computational cost related to the evaluation of the collision operator involving multidimensional integrals which should be solved in each point of the physical space FiMoPa:2006; PR2. The second challenge is represented by the presence of multiple scales which requires the development of adapted numerical schemes to avoid the resolution of the stiff dynamics Dimarco_stiff1; Dimarco_stiff2; Jin_review; Jin; Jin2; BLM; degondrev typically arising when dealing with problems with multiple regimes.
Historically, there exists two different approaches which are generally used to tackle kinetic equations from a numerical point of view: deterministic numerical methods such as finite volume, semiLagrangian and spectral schemes ACTA, and, probabilistic numerical methods such as Direct Simulation Monte Carlo (DSMC) schemes bird; Cf. Both methodologies have strengths and weaknesses. While the first could normally reach high order of accuracy, the second are often faster, especially for solving steady problems but, typically, exhibit lower convergence rate and difficulties in describing non stationary and slow motion flows.
In this work, we deal with deterministic techniques. In particular, we focus on semiLagrangian approaches CrSon; CrSon1; Gu; Shoucri; FilbetRusso; Filbet2 for the transport part coupled with spectral methods canuto:88 for the interaction part. Our main goal is to tackle the challenges related to the high dimensionality of the equations and with the difficulties related to the approximation of the collision integral. More in details, we deal with the extension of our recent works FKS; FKS_HO; FKS_DD; FKS_GPU which were based on an innovative semiLagrangian technique for discretizing the transport part of a kinetic model (FKS method). This technique has been applied solely to the solution of a simple kinetic equation with a relaxation type collision operator, i.e. the BGK (BhatnagarGrossKrook) operator Gross. Here we extend it to the case of the full Boltzmann operator bird; cercignani and we test his performances up to the six dimensional case. The FKS is based on the classical discrete velocity models (DVM) approach bobylev; Pal; Pal1; Mieussens. Successively, in order to overcome the problem of the excessive computational cost, we propose to use a Lagrangian technique which exactly solves the transport step on the entire domain, without reprojecting the solution on the grid at each time step. The FKS approach was shown to be an efficient way to solve linear transport equations, and, it has permitted the simulation of full six dimensions problems on a single processor machine FKS. Unfortunately the solutions computed with this method are limited to a first order in space and time precision. Extension of the method to high order reconstruction is under consideration RFKS. Concerning the discretization of the collision operator we rely on Fourier techniques. For the resolution of the Boltzmann integral, these techniques have been first introduced independently by L. Pareschi and B. Perthame in pareschi:1996 and by A. Bobylev and S. Rjasanow in bobylev_rjasanow97. Since then, this approach has been investigated by a many authors FiMoPa:2006; PR2; gamba:2010; gamba; gambaL; wu2013deterministic; bobylev_rjasanow97; Rjasanow; Villani; filbet:2011Conv; wu2015influence; Aleks; PaRu:stab:00; wu2015fast. In this work we will make use in particular of the fast method described in MoPa:2006; FiMoPa:2006 which has a complexity of the order of where is the number of point in which the velocity space is discretized in one direction and the dimension of the velocity space. The method preserves mass, and approximates with spectral accuracy momentum and energy.
Combining opportunely the FKS method with the fast spectral approach we have developed a method for solving the Boltzmann equation up to the six dimensional case for unsteady flows. In order to obtain such result we have constructed a parallel version of our algorithm taking advantages of GraphicalProcessorUnit (GPU) under CUDA language. The results presented in this work show that we are nowadays ready and able to use kinetic equations to simulate realistic multidimensional flows. Up to our knowledge this is one of the first examples in literature of solution of the full multidimensional Boltzmann equation by means of deterministic schemes. Our main limitation to run extensive simulations remains at the present moment the lack of memory capacity due to the use of shared memory machines. However, in the present paper, we also report some preliminary results about the extension of our method to deal with distributedmemory computers but we postpone for a future research the detailed analysis of the Message Passive Interface (MPI) version of the algorithm as well as the analysis of its performances.
The article is organized as follows. In Section 2 we recall the Boltzmann equation. In Section 3 we present the Fast Kinetic Scheme and the fast spectral scheme. In Section 4 we detail the aspects related to the implementation of the resulting algorithm necessary to realize an efficient parallel numerical tool. Several tests starting from the 0D 2D up to the 3D3D case are studied in details in Section 5. These tests assess the validity of our approach as well as detail all the computational aspects. Conclusions and future works are finally exposed in Section 6.
2 The Boltzmann equation
In this section we briefly recall the Boltzmann equation and its main properties, we also recall some related models, i.e. the BGK model and the compressible Euler equations, which will be used for numerical comparisons. We refer the reader to ACTA; cercignani for an exhaustive description.
In the kinetic theory of rarefied gases, the nonnegative function characterizes the state of the system and it defines the density of particles having velocity in position at time , where is the physical dimension and the dimension of the velocity space. The time evolution of the particle system is obtained through the equation
(1) 
The operator , on the right hand side in equation (1), describes the effects of particle interactions and its form depends on the details of the microscopic dynamic. Independently on the type of microscopic interactions considered, typically the operator is characterized by some conservation properties of the physical system. They are written as
(2) 
where are commonly called the collision invariants. We denote by
the first three moments of the distribution function , namely, , where is the density, the momentum and the energy. Integrating (1) against yields a system of macroscopic conservation laws
(3) 
The above moment system is not closed since the second term involves higher order moments of the distribution function . However, using the additional property of the operator that the functions belonging to its kernel satisfy
(4) 
where the Maxwellian distributions can be expressed in terms of the set of moments by
(5) 
with the temperature, one can get a closed system by replacing with in (3). This corresponds to the set of compressible Euler equations which can be written as
(6) 
with . The simplest operator satisfying (2) and (4) is the relaxation operator Gross
(7) 
where defines the socalled collision frequency. Its values will be specified in the numerical test Section in order for the model to be as close as possible to the Boltzmann model described next. The classical Boltzmann operator reads
(8) 
where is a vector of the unitary sphere . The postcollisional velocities are given by the relations
(9) 
where is the relative velocity. The kernel characterizes the details of the binary interactions, it has the form
(10) 
where the scattering crosssection , in the case of inverse th power forces between particles, can be written as
(11) 
with . The special situation gives the socalled Maxwell pseudomolecules model with
(12) 
For the Maxwell case the collision kernel is independent of the relative velocity. For numerical purposes, a widely used model is the variable hard sphere (VHS) model introduced by Bird bird. The model corresponds to , where is a positive constant, and hence
(13) 
In the numerical test Section we will consider the Maxwell molecules case when dealing with a velocity space of dimension and with the more realistic case of VHS molecules when dealing with the three dimensional case in velocity space: . For comparison purposes we will also consider the simplified BGK model (7) and the compressible Euler system case (6).
3 The Fast Kinetic Scheme and the Fast Spectral Scheme
In this section we detail the numerical scheme. In the first part we discuss the Fast Kinetic Scheme (FKS) and in the second part the Fast Spectral method for the Boltzmann equation. The two solvers are connected by using splitting in time approaches as the ones described in Des.
3.1 The Fast Kinetic Scheme (FKS)
The Fast Kinetic Scheme FKS; FKS_HO belongs to the family of socalled semiLagrangian schemes CrSon; CrSon1; Filbet which are typically applied to a Discrete Velocity Model (DVM) bobylev; Mieussens approximation of the original kinetic equation.
In order to introduce the scheme, let us truncate the velocity space by fixing some given bounds and set a cubic grid in velocity space of points with the grid step which is taken equal in each direction. The continuous distribution function is then replaced by a vector whose components are assumed to be approximations of the distribution function at locations :
(14) 
The discrete velocity kinetic model consists then of a set of evolution equations in the velocity space for , , of the form
(15) 
where is a suitable approximation of the collision operator at location discussed in Section (3.3). Observe that, due to the truncation of the velocity space and to the finite number of points with which is discretized, the moments of the discrete distribution function are such that
(16) 
with the discrete collision invariants, i.e. they are no longer those given by the continuous distribution . This problem concerns all numerical methods based on the discrete velocity models and different strategies can be adopted to restore the correct macroscopic physical quantities Mieussens; Pal; Pal1; gamba. At the same time, the spectral approach does not preserve the energy and the momentum of the system even if it approximates these quantities with spectral accuracy PR2. In order to solve this problem of loss of conservation we adopt an projection technique for the discretized distribution and the discretized collision operator which permits to project the discretized and in the space of the distributions for which the moments are exactly the continuous macroscopic quantities we aim to preserve. We detail this approach in the following paragraph. For lightening the notations we omit the tilde in the rest of the paper, we only use it in the next paragraph in order to introduce the projection procedure. We will then suppose from now on that all the operations preserves the macroscopic moments if not differently stated.
Let us now introduce a Cartesian uniform grid in the three dimensional physical space made of points with a scalar which represents the grid step (the same in each direction) in the physical space. Further we define a time discretization starting at , where is the time step defined by an opportune CFL condition discussed next. The time index varies between and so that the final time is .
Each equation of system (15) is solved by a time splitting procedure. We recall here a first order splitting approach: first a transport step exactly solves the lefthand side, whereas a collision stage solves the righthand side using the solution from the transport step as initial data:
Transport stage  (17)  
Collisions stage  (18) 
Transport step
Let be the pointwise values at time of the distribution , . The idea behind the fast kinetic scheme is to solve the transport stage (17) continuously in space, see Fig. 1 for a sketch in the one dimensional setting. To this aim we define at the initial time the function as a piecewise continuous function for all , where and . Hence starting from data at time index , the exact solution of (17) is simply
(19) 
In other words, the entire function is advected with velocity during unit of time and the superscript indicates that only the transport step has been solved so far. The extension of this procedure to the generic time step gives
(20) 
where now, the key observation is that the discontinuities of the piecewise function do not lie on the interfaces of two different cells. Instead, the positions of the discontinuities depend entirely on the previous advection step and thus they may be located anywhere in the physical space. This means that if only the linear transport equation has to be solved, this approach gives the exact solution to the equation if the initial data is truly a piecewise constant function initially centered on the spacial mesh.
Collision step
The effect of the collisional step is to change the amplitude of . The idea is to solve the collision operator locally on the grid points, and, successively, extend these computed values to the full domain . Thus we need to solve the following ordinary differential equation
(21) 
where , for all velocities of the lattice and grid points . The initial data for solving this equation is furnished by the result of the transport step obtained by (20) at points of the mesh at time , i.e. , for all , and . Then, the solution of (21), locally on the grid points, reads if, for simplicity, a forward Euler scheme is used as
(22) 
where . Many different time integrators can be employed to solve this equation. In particular special care is needed in the case in which the equation becomes stiff, refer to Dimarco_stiff1; Dimarco_stiff2 for alternative strategies. Since the time integration of the collision term is not the issue considered in this paper, we considered the simplest possible scheme, but the FKS technique remains the same when other time integrators are employed. Equation (22) furnishes a new value for the distribution at time only in the cell centers of the spacial cells for each velocity . However, one needs also the value of the distribution in all points of the domain in order to perform the transport step at the next time step. Therefore, we define a new piecewise constant function for each velocity of the lattice as
(23) 
Said differently we make the fundamental assumption that the shape of in space coincides with the one of . Thanks to the above choice one can rewrite the collision step in term of spatially reconstructed functions as
(24) 
This ends one time step of the FKS scheme.
Concerning the transport part of the scheme, the time step is constrained by a CFL like condition of type
(25) 
The time step constraint for the collision step depends on the choice of the operator . Since in the numerical test section
we used both a BGK and a Boltzmann operator, the time step constraint for the interaction part has been chosen as the minimum time step which gives stability in the
solution of the ODE (22) independently on the type of collision operator employed.
As observed in FKS the transport scheme is stable for every choice of the time step, being the solution
for a given fixed reconstruction performed exactly. Nonetheless
the full scheme being based on a time splitting technique, the error is of the
order of in the case of first order splitting or of order for a splitting of order .
This suggests to take the usual CFL condition for the transport part in order to maintain the time error small enough.
To conclude this Section let us observe that time accuracy can be increased by high order time splitting methods, while spacial accuracy can be increased close to the fluid limit to a nominally secondorder accurate scheme by the use of piecewise linear reconstructions of state variables, see the details in FKS_HO. Spacial accuracy for all regime can also be increased by using high order polynomial reconstruction for the distribution . This work is in progress and results are discussed in RFKS.
3.2 Conservation of macroscopic quantities
In order to preserve mass, momentum and energy in the scheme, we employ the strategy proposed in gamba. We consider one space cell, the same renormalization of should be considered for all spatial cells. This step is performed at the beginning, i.e. . for the distribution and after each collision step (22) which causes loss of momentum and energy due to the spectral discretization. Let be the distribution function vector at at the center of the cell and let be the unknown corrected distribution vector which fulfills the conservation of moments. Let
be a matrix of coefficients depending on the discretization parameters and be the vector of conserved quantities, namely density, momentum and energy. Conservation can be imposed solving a constrained optimization formulation:
(26)  
The solution of this minimization problem can be analytically obtained by employing the Lagrange multiplier method. Let be the Lagrange multiplier vector. The corresponding scalar objective function to be minimized is given by
(27) 
Then, by nullifying the derivative of with respect to we get
(28) 
while the Lagrange multipliers are obtained by solving
(29) 
In particular, the above expression says that the value of is uniquely determined by . Back substituting into (28) finally provides
(30) 
Note that matrix is precomputed and stored as being constant for each simulation. If a discretization of the Maxwellian distribution is needed, as for example when using the BGK model, the same procedure should be applied to the function in order to assure conservation of the first three moments at each instant of time at which the Maxwellian function is invoked. In this case by defining with the approximated Maxwellian and by the pointwise value of the equilibrium distribution, we get mimicking (30)
(31) 
This ends the description of the procedure which permits to conserve macroscopic quantities.
3.3 Fast Spectral Scheme (FSS) to discretize the Boltzmann collision operator
The fast spectral discretization of the Boltzmann operator employed in this work is described in this section. To this aim, we focus again on a given cell at a given instant of time . The same computation is repeated for all cells and times . Moreover, since the collision operator acts only on the velocity variable, to lighten the notation in this paragraph, only the dependency on the velocity variable is considered for the distribution function , i.e. .
The first step to construct for our spectral discretization is to truncate the integration domain of the Boltzmann integral (8) as done for the distribution . As a consequence, we suppose the distribution function to have compact support on the ball of radius centered in the origin. Then, since one can prove that , in order to write a spectral approximation which avoid aliasing, it is sufficient that the distribution function is restricted on the cube with . Successively, one should assume on and extend to a periodic function on the set . Let observe that the lower bound for can be improved. For instance, the choice guarantees the absence of intersection between periods where is different from zero. However, since in practice the support of increases with time, we can just minimize the errors due to aliasing canuto:88 with spectral accuracy.
To further simplify the notation, let us take and hence with in the following. We denote by the Boltzmann operator with cutoff. Hereafter, using one index to denote the dimensional sums, we have that the approximate function can be represented as the truncated Fourier series by
(32) 
(33) 
We then obtain a spectral quadrature of our collision operator by projecting (8) on the space of trigonometric polynomials of degree less or equal to , i.e.
(34) 
Finally, by substituting expression (32) in (34) one gets after some computations
(35) 
where are given by
(36) 
with
(37) 
Let us notice that the naive evaluation of (35) requires operations, where . This causes the spectral method to be computationally very expensive, especially in dimension three. In order to reduce the number of operations needed to evaluate the collision integral, the main idea is to use another representation of (8), the socalled Carleman representation Carl:EB:32 which is obtained by using the following identity
This gives in our context for the Boltzmann integral
(38) 
with
(39) 
This transformation yields the following new spectral quadrature formula
(40) 
where are now given by
(41) 
Now, in order to reduce the number of operation needed to evaluate (40), we look for a convolution structure. The aim is to approximate each by a sum
where represents the number of finite possible directions of collisions. This finally gives a sum of discrete convolutions and, consequently, the algorithm can be computed in operations by means of standard FFT technique canuto:88.
In order to get this convolution form, we make the decoupling assumption
(42) 
This assumption is satisfied if is constant. This is the case of Maxwellian molecules in dimension two, and hard spheres in dimension three, the two cases treated in this paper. Indeed, using kernel (10) in (39), one has
so that is constant if , and , .
We start by dealing with dimension and , i.e. Maxwellian molecules. Here we write and in spherical coordinates and to get
Then, denoting for , we have the explicit formula
where . This explicit formula is further plugged in the expression of and using its parity property, this yields
Finally, a regular discretization of equally spaced points, which is spectrally accurate because of the periodicity of the function, gives
(43) 
with
(44) 
where .
Now let us deal with dimension with satisfying the decoupling assumption (42). First we switch to the spherical coordinates for :
Then, integrating first on the intersection of the unit sphere with the plane gives
where
This leads to the following decoupling formula
where denotes the half sphere and
Now, in the particular case where , i.e. the hard sphere model, we can explicitly compute the functions . These are
Taking a spherical parametrization of and uniform grids of respective size and for and (again spectrally accurate because of the periodicity of the function) leads to the following quadrature formula for
where
and for all and
4 Numerical implementation
We discuss in this part some aspects relative to the numerical implementation of the method.
4.1 Algorithm for the transport part
The method described in the previous sections can be resumed into two main actions: transport and interaction. In our implementation we have adopted a particle like interpretation of the FKS scheme which helps in reducing the computational effort and the memory requirement. In this interpretation, each point of the quadrature of the phase space is represented by a particle with a given mass, velocity and position. The transport phase causes the particle to move in the physical space, while the interaction phase causes the mass of each particle to change. In this setting, the distribution function can be expressed as
(45) 
where represents the particle position, the particle velocity, the particle mass. Moreover, the particle velocity corresponds to the quadrature point chosen to discretize the velocity space. The effect of the transport is a simple shift of particles to their new positions according to
(46) 
The collision step acts only locally and changes the velocity distribution. The particle interpretation of this part of the scheme consists in changing the mass of each particle through the formula (22) which reads
(47) 
where corresponds to the approximation of the collision integral in the center of the cell evaluated at location by means of the fast spectral method described before. Let observe that thanks to the uniform grid in velocity and physical space, the number of particles is the same in each cell and remains constant in time. This permits to consider the motion of only a fixed subset of particles which belongs to only one chosen cell. The relative motion of all other particles in their cells being the same FKS. This gives great computational advantage since the number of particle to track in time is greatly reduced. The only information which should be tracked remains the mass of each individual particle which is different for every velocity and position . This reinterpretation of the scheme permits to strongly reduce the computational cost related to the transport part as shown in the numerical test section.
4.2 Profiling and parallelization strategy
The Fast Kinetic scheme for the BGK equation was shown to be very efficient and extremely parallelizable (with a shown acceleration very close to the ideal scaling) in FKS_GPU on mild parallel architecture, namely using OpenMP on maximum threads and on two Graphical Processing Units (GPU) using CUDA framework. These light parallel infrastructures are unfortunately not sufficient when Boltzmann operator is to be simulated. To this aim, let us briefly discuss the profiling of the collision algorithm. As we have already seen in the description of the method in the previous section, the resolution of Boltzmann operator requires several passages from the physical space to the Fourier space, which implies several calls to (inverse) Fourier transforms within nested loops which are due to the convolutive form of the integral. This interaction step covers about of the total cost of a 2D2D simulation on a serial machine, see table 1. The situation is alike in the case of 3D3D simulations, and, thus, results are not reported. From table 1 we deduce that the collision step is the part of the code which demands to be dealt with great attention if any gain in performance is expected.
Main routines  Cost CPU %  Subroutines  Subcost CPU % 

Transport  2 %  —  2 % 
Update moments  0 %  —  0 % 
Collision  98 %  Compute  5 % 
Loop: compute coeff.  5 %  
Loop: compute  87 %  
Update  3 %  
Total  100 %  100% 
More precisely, if we give a closer look to the computational cost related to the computation of the collision term in one fixed spacial cell, the major cost (beyond ) is spent by the Fourier transform routines. In our implementation, the collision is performed without any communication with the local neighborhood. In other words, once the values of the distribution functions are stored then is computed locally. Because is not too large, typically the number of points chosen to discretize the velocity space goes from to , then the memory requirement is not extremely demanding. Finally, as seen in the previous Section the computation of is based on a loop over the number of angles with in dimension two, and, a double loop over the angles and in dimension three with and with a typical number of chosen angles of eight. After a first Fourier transform of the distribution , for each iterate of these loops, the coefficients of the quadrature formula are computed and then three calls to the inverse fast Fourier transform are done. The results of the FFT are gathered into temporary complex variables later arranged to form the solution of . The costs related to the four main routines needed for the evaluation of the operator namely: Fourier transform, coefficients computation, inverse Fourier transform and sum of the obtained values to form , are detailed in table 1.
We discuss now some aspects related to the parallelization strategy. The framework described is extremely well suited for parallelization. The strategy adopted is to divide the space domain in several subdomains because the interaction step, the most expensive one, does not demand any communication between spatial cells. Each cell can be therefore treated independently from the others. Then, the idea employed has been to use the OpenMP on shared memory systems and make each computational core responsible for a subset of a space mesh. This approach requires very little modification of a sequential code and gives strong scaling close to ideal as shown in FKS_GPU for the BGK case. A different possibility is to compute the collision kernel, and all the related Fast Fourier Transforms on a GPU. Even if, this involves substantial amount of slow communications between CPU and GPU, the computational complexity of this part is so large that the communication time does not play a fundamental role. This approach requires more programming effort, compared to the OpenMP parallelization strategy as several routines have to be completely rewritten. In particular, the code has to be adapted to the Single Instruction, Multiple Threads (SIMT) parallel programming mode imposed by the GPU architecture. That is to say, parts of the code that can be run in parallel on large number of threads have to be identified and rewritten as so called CUDA kernels  functions that are executed multiple times in parallel by different CUDA threads. Moreover, the GPUCPU memory data transfer has to be taken into account. The payoff is however higher and, at least for computations which does not require too large memory storage, this is probably the best choice.
Unfortunately, when the number of points needed to perform a simulation grows in the case the Boltzmann operator is used, the above implementation strategies on shared memory systems are not sufficient any more. This is true particularly in the full three dimensional case. Luckily, the method described in this work can be easily adapted for its application in distributed memory systems. The idea adopted in this case is to distribute spatial degrees of freedom over computational nodes, keeping on every node a complete set of velocity space points. The computational domain is then decomposed into slices along the direction (see Fig.2). The slices are successively distributed over different MPI processes. Every node will therefore compute the collision term for a part of computational domain. This can be done on each node using the OpenMP or GPU parallelization strategy, depending on the architecture at disposal. As the update of the primitive variables requires an exchange with neighboring spatial cells, some of the particle masses, which contain all the information related to the distribution function in our implementation, need to be broadcast to other computational nodes. This data exchange, which is typically a bottleneck in the MPI computations, and, usually demands large efforts in order to be minimized the internodal communications, in the case of the Boltzmann equation, takes only a small part of the total runtime, even if, in our first MPI implementation, the domain decomposition is far from being optimal. We postpone to a future work the discussion related to a more efficient MPI parallelization strategy and all details related to this type of implementation.
5 Numerics
In this section, we validate the proposed SFKS (SpectralFKS) method which couples the fast semiLagrangian and the fast spectral scheme. The testing methodology is divided into four parts.

Part 1. Sanity checks. We only consider the space homogeneous Boltzmann equation in two and three dimensions in velocity space. Two problems from PR2 for which the analytical solution is known are simulated. The results show that our implementation of the fast spectral discretization of the Boltzmann operator is correct.

Part 2. Using the SFKS method in one dimension of physical space and two or three dimensions in velocity space, we show on Riemann like problems the differences between the BGK and the Boltzmann models. The reported results justify the use of the more complex and costly collisional Boltzmann operator especially in situations far from the thermodynamical equilibrium. We also report the details of the computational costs as well as the numerical convergence of the method for an increasing number of points in the physical or velocity space.

Part 3. Using the SFKS method in the two dimensional case in space and velocity, we show that using the Boltzmann or the BGK operator leads to notably different results. We consider a regular vortex like problem and a reentry like problem with evolving angle of attack. Performances and profiling study of the SFKS approach supplemented with BGK or Boltzmann collisional operator are provided for refined grids in space and time. Details about the computational costs of the different part of the scheme, scalability with respect to the number of cells are also furnished in order to characterize the behavior of the method as precisely as possible.

Part 4. We test the SFKS method in the full three dimensional case in space and velocity for an unsteady test problem. The interaction of a flying object with the surrounding ambient gas is simulated. We compare the results obtained with the Boltzmann collisional operator with those given by a simpler BGK model. Moreover, we measure the cost of such a 3D3D simulation in terms of CPU, cost per degree of freedom and scalability with respect to the number of cells. Some code profiling is also provided to measure the cost of the main components of the scheme.
Finally, the ability of the whole numerical code to run on several parallel environments is tested: CPU (OpenMP, MPI) and GPU (CUDA) types of parallelism frameworks are considered.
5.1 Part 1. Numerical results for the space homogeneous case
In this part, we validate our spectral discretization implementation of the Boltzmann collisional operator. To this aim, we reemploy the test cases considered in PR2 and FilbetRusso for the homogeneous two dimensional and three dimensional Boltzmann equation. For both situations, exact solutions are available and briefly recalled in the following.
5.1.1 Test 1.1. Convergence to equilibrium for the Maxwell molecules in dimension two.
We consider two dimensional in velocity Maxwellian molecules and the following space homogeneous initial condition
(48) 
The analytical solution (the socalled BKW one) is given for all times by Bobylev:75; KrookWu:1977:
(49) 
with . The final time is set to and the time step is equal to . The test is performed for points and for eight discrete angles . In order to keep the aliasing error smaller than the spectral error, the velocity domain is set to for , for and to for points. This choice helps fighting back the aliasing error behavior since it increases with the number of points. Therefore, to have comparable aliasing errors for the three meshes, we enlarge the size of the velocity space when the number of points increases. We report the discrete and norms of the error for the distribution function in Table 2. Moreover, in Figure 3, we report the distribution function at different times (, and ) when . The solution is plotted versus the exact solution. The results clearly show the convergence of the method towards the exact solution. Last in Figure 3 right panel, we present the time evolution of the error for the three configurations. Consistently the errors decrease by about one order of magnitude. The last recorded results, at corresponds to the Figures reported in table 2.
of points  error  error 

5.1.2 Test 1.2. Space homogeneous comparison between the BGK and the Boltzmann model. Maxwellian molecules.
In Figure 4, we compare the convergence to equilibrium for the BGK and Boltzmann models on the test case described in the previous paragraph using points on a domain . The norm of the difference between the two distribution functions and as a function of time is shown. As expected the differences are increasing at early stages of the relaxation towards the equilibrium to reach a maximum difference around time . Then, the two models slowly converge towards the same equilibrium solution.
The red marked iterations in Figure 4 are further depicted in Figure 5, where we show the details of the difference in time between the two distribution functions. The same azimuthal scale is employed to ease the comparison.
5.1.3 Test 1.3. Convergence to equilibrium in dimension three. Hard sphere molecules.
The following initial condition is considered FilbetRusso
(50) 
where and is . The final time is set to , the time step is , while the velocity domain is discretized with points.
In Figure 6, we report the initial data while in Figure 7, we report the relaxation to the equilibrium for the hard sphere model on a fixed plane
passing through the points of coordinates and of normal ^{1}^{1}1This plane passes through the center of the two spheres
defined by (50). In Figure 8, we report as before the comparison between the Boltzmann and the BGK model
measuring the difference between the two distribution functions in time. Finally, in Figure 9 are presented the details of such differences between the distribution functions
for the red marked iterations of figure 8 at . As for the Maxwellian molecules in two dimensions, at the beginning of the simulation we observe the larger deviations between the two models while at the end they both converge to the same limit solution as expected.
The results of these homogeneous tests (in two and in three dimensions in velocity) are meant to verify and validate our implementation of the spectral discretization of the Boltzmann collisional operator. Moreover, the differences observed between the two models justify the necessity of the Boltzmann operator for space non homogeneous situations. We can now investigate the coupling of the Boltzmann operator with the Fast semiLagrangian method for several space non homogeneous cases.
5.2 Part 2. Numerical results for the one dimensional case in space.
In this part, we focus on solving the one dimensional in space Boltzmann and BGK equations considering a two or three dimensional dimensional velocity space setting. The purposes are twofold. First, we numerically demonstrate that the FKS and the spectral accurate Boltzmann kernel solver may be appropriately coupled and that they provide a valid kinetic solver for Boltzmann equations. Second, we show that BGK and Boltzmann models provide different results, justifying the use of a more sophisticated model. For these tests, a classical Riemann problem with Sod like initial data is considered
with . Initial local thermodynamics equilibrium is considered for all tests: . The velocity space is set to for all cases. Dirichlet boundary conditions are set on the left/right boundaries of . The two kinetic models are solved by employing a time rescaling factor in order to put in evidence the role of the collisions in the solutions. The rescaled equations reads
(51) 
where is the rescaled parameter (the frequency of relaxation) which plays the role of the non dimensional Knudsen number. Smaller is the relaxation frequency, faster is the relaxation of the distribution function towards the equilibrium state. However, the exact rate of convergence is dependent on the type of collision considered either BGK or Boltzmann (Maxwellian molecules or hard spheres).
5.2.1 Test 2.1. Numerical convergence of the Boltzmann equation. The two dimensional in velocity Maxwellian molecules case.
The two dimensional velocity space is first considered leading to a space/velocity mesh of the form with and a varying number of space cells . In Figure 10, we present the space convergence results for the density, the velocity and the temperature when the Boltzmann operator is solved. Successively refined (doubled) spatial meshes are employed, from to up to final time . From these data we can observe that the simulation results seem to converge towards a given numerical solution in both cases (left panels) and (right panels). We can also observe that the increase in mesh resolution is profitable especially for smaller . This is the same behavior observed in FKS. In fact, the scheme precision decreases as the equilibrium state is approached, being virtually exact in non collisional or almost non collisional regimes. The loss of precision observed in fluid dynamic regimes can be recovered with a similar technique as the one proposed in FKS_HO. Here, however we do not consider this possibility. The CFL condition employed is the following
(52) 
where the first term on the right hand side of the above equation comes from the will of keeping the error small enough in the splitting scheme, while the second term is due to the stability restriction in the solution of the space homogeneous problem when Maxwellian molecules are employed. In fact, in this case the loss part of the collision integral can be estimated, giving .
5.2.2 Test 2.2. Comparisons between the BGK model and the Boltzmann model. The two dimensional in velocity Maxwellian molecules case.
Here, the BGK and Boltzmann models are simulated for the same Sodlike problem. In order to have fairest as possible comparisons between the two models, we choose for the BGK model. This choice permits to have the same loss part for the two models since for Maxwellian molecules the loss part is close to as stated in the previous paragraph. We fix spatial cells and cells in velocity space. This permits to consider almost converged results. In Figure 11, we present the results when (left) and