# Fitting Galaxies on GPUs

## Abstract

Structural parameters are normally extracted from observed galaxies by fitting analytic light profiles to the observations. Obtaining accurate fits to high-resolution images is a computationally expensive task, requiring many model evaluations and convolutions with the imaging point spread function. While these algorithms contain high degrees of parallelism, current implementations do not exploit this property. With ever-growing volumes of observational data, an inability to make use of advances in computing power can act as a constraint on scientific outcomes. This is the motivation behind our work, which aims to implement the model-fitting procedure on a graphics processing unit (GPU). We begin by analysing the algorithms involved in model evaluation with respect to their suitability for modern many-core computing architectures like GPUs, finding them to be well-placed to take advantage of the high memory bandwidth offered by this hardware. Following our analysis, we briefly describe a preliminary implementation of the model fitting procedure using freely-available GPU libraries. Early results suggest a speed-up of around over a CPU implementation. We discuss the opportunities such a speed-up could provide, including the ability to use more computationally expensive but better-performing fitting routines to increase the quality and robustness of fits.

## 1Introduction

Recent trends in commodity computing hardware have seen a dramatic shift first from single-core processors to multi-core and then to accelerated platforms like graphics processing units (GPUs). GPUs were originally designed to speed up 3D graphics calculations for video games, but their immense memory bandwidth and arithmetic capabilities have seen them re-purposed for the needs of scientific computing. While unquestionably powerful, their radically different, massively-parallel architectures have shaken up the software community. Astronomy is one of many fields trying to adapt to these changes in computing hardware.

While the area is still in its infancy, GPUs have already been shown to provide significant speed-ups across a range of astronomy problems. These include direct N-body simulation (e.g., [?]), adaptive mesh refinement hydrodynamics (e.g., [?]), galaxy spectral energy density calculations [?], gravitational microlensing [?], correlation for radio telescopes (e.g., [?]) and coherent pulsar dedispersion [?]. The approach taken in each of these cases has, however, been *ad hoc* in nature – the transition to the GPU has been guided largely by hardware-specific documentation, code samples and simple trial and error. While such an approach has proven very successful for these early adopters, it is not clear that it will remain effective when it comes to more complex algorithms. Furthermore, in some cases the cost of re-implementing a code may be too large to gamble on a return (i.e., a speed-up) of unknown magnitude.

In this paper we discuss the potential for accelerating the process of galaxy fitting using GPUs. Rather than tackling the challenge blind, we instead use a generalised method based on algorithm analysis as outlined in [?]. The galaxy fitting process is described in Section 2, which is followed by a full analysis of the problem in Section 3. A preliminary implementation and results are described in Section Section 4 before our summary discussion in Section Section 5.

## 2Galaxy Fitting

A common problem in astronomy is to fit analytic surface brightness profiles to observations of globular clusters or galaxies in order to extract structural parameters such as the effective radius, ellipticity or integrated flux magnitude. The fitting procedure is typically non-linear and of high dimensionality, demanding the use of powerful optimisation routines. Many codes have been developed to perform this task, including, e.g., ishape [?], galfit [?] and galphat [?], which use the downhill simplex, Levenberg Marquardt and Markov-Chain Monte-Carlo methods respectively. While there is a variety of optimisation techniques in common use, most follow a similar pattern:

Evaluate a model on a grid using the current set of parameters (guessed initially).

Convolve the model with the point spread function (PSF) of the observation.

Compare the model and observation.

Adjust the model parameters.

Check finishing criteria and return optimised parameters if complete, otherwise repeat from Step 1.

Step 4 is where the specifics of a particular fitting routine come into play, while steps 1-3 generally remain unchanged between methods. Given the pixel-counts of modern astronomical observations, and the fact that most fitting routines require a very large number of iterations, the optimisation process can be highly computationally intensive. The quantity and quality of science results are thus tied to the available computing power and a code’s ability to take advantage of it.

Computationally-limited problems like galaxy fitting are ideal candidates for acceleration. The fact that steps 1-3 are common to a large number of fitting routines allows us to study the problem with significant generality. Additionally, the image-based nature of the operations immediately suggests suitability for GPUs.

## 3Algorithm Analysis

In order to determine whether galaxy fitting is a suitable application for GPU acceleration, we use an approach based around algorithm analysis as described in [?]. We begin by identifying known algorithms within the steps in the problem outline presented in Section 2:

Evaluation of a model on a grid is an example of a

**transform**algorithm.Convolution with the PSF is best done in Fourier space, requiring the

**fast Fourier transform (FFT)**and regular**transform**algorithms.Comparison of a model with an observation typically involves computation of the “sum of squared differences”, which involves the

**transform**and**reduce**algorithms.

Note that the algorithms required during optimisation of the model parameters in step 4 will depend on the chosen fitting routine.

It is thus seen that the fitting procedure makes use of only the transform, FFT and reduce algorithms, all of which are known to be very efficient on many-core architectures like GPUs [?].

The next step in the analysis is to look at the global characteristics of the computation. Both the transform and reduce algorithms have a work complexity of , indicating that a constant number of operations is performed for every image pixel. The FFT algorithm has a work complexity of , indicating that operations are performed for each of the image pixels. The convolution step is thus expected to consume the majority of the processing time for large images.

The fitting procedure for an image of pixels therefore requires reading values, repeatedly performing arithmetic operations times, and writing out optimised parameter values (where is the number of iterations required to obtain a good fit). In the best case scenario then, the problem has a ratio of memory to compute operations, or *arithmetic intensity*, of .

The ability to achieve this arithmetic intensity depends on the memory access patterns of the component algorithms. Because the FFT algorithm requires an all-to-all communication pattern (i.e., each output value depends on every input value), it is necessary to have the entire image globally accessible during each iteration of the fitting routine. This rules out storing the image data in very small caches (e.g., the *shared memory* on NVIDIA GPUs) between iterations. However, there is generally more than enough main memory on a GPU to hold an entire image. This means that the data may be left on the device for all iterations, with no need to go back to the host’s memory or disk until the final results have been obtained. The limiting factor will instead be the internal memory bandwidth of the device. This is a good result, as current GPUs have significantly more memory bandwidth than CPUs, and one can expect a corresponding speed-up.

## 4Implementation Results

Given the positive results of the algorithm analysis in Section Section 3, a prototype implementation of the galaxy fitting problem was deemed worthwhile. NVIDIA’s CUDA^{1}^{2}^{3}

Given the subtleties of mature codes like galfit [?], performing an accurate comparison with our prototype GPU code is not yet possible. Preliminary results, however, suggest a speed-up in the main computations of around when using a single NVIDIA Tesla C1060 GPU versus an Intel Nehalem CPU. Profiling results also indicate that the GPU hardware is being used efficiently by all of the algorithms in the code. These results support the conclusions of our analysis in the previous section.

## 5Discussion

Many-core architectures like GPUs are now an important part of the computing landscape. While many software challenges remain, a generalised approach to analysing astronomy problems has proven very useful in tackling new GPU codes.

Galaxy fitting looks to be a promising application of GPU technology. Significant speed-ups present the opportunity to perform faster fits, which may be crucial for the next generation of galaxy surveys. Alternatively, the additional processing speed could be fed back into the fitting routine to provide fits of much better quality in the same length of time, helping to overcome common problems such as local minima and unphysical results.

While useful as a profiling tool, our prototype GPU code requires significant further development before it can be considered a viable alternative to other galaxy fitting codes in use by the astronomy community. Future work will address such development.

Given the generality of our analysis, it is likely that other fitting problems in astronomy would also benefit from GPU acceleration. If one allows flexibility in the dimensionality of the problem, procedures such as spectral line or cube fitting become possible. Such problems will also be the subject of future work.