# Multi-level analysis of compiler induced variability and performance tradeoffs

This work was performed under the auspices of the U.S. Department of Energy by LLNL under contract DE-AC52-07NA27344. (LLNL-CONF-759867)

Michael Bentley,  Ian Briggs,  Ganesh Gopalakrishnan
Dong H. Ahn,  Ignacio Laguna,  Gregory L. Lee,  Holger E. Jones
University of Utah
mbentley@cs.utah.edu, ianbriggsutah@gmail.com, ganesh@cs.utah.edu
Lawrence Livermore National Laboratory
{ahn1,lagunaperalt1,lee218,jones19}@llnl.gov
###### Abstract

Floating-point arithmetic is the computational foundation of numerical scientific software. Compiler optimizations that affect floating-point arithmetic can have a significant impact on the integrity, reproducibility, and performance of HPC scientific applications. Unfortunately, the interplay between compiler-induced variability and the runtime performance gained from compiler optimizations is not well understood by programmers, a problem that is aggravated by the lack of analysis tools in this domain.

In this paper, we present a novel set of techniques, as part of a multi-level analysis, that allow programmers to automatically search the space of compiler-induced variability and performance using their own inputs and metrics; these techniques allow programmers to pinpoint the root cause of non-reproducible behavior to function granularity across multiple compilers and platforms using a bisection algorithm. We have demonstrated our methods on real-world code bases. We provide a performance and reproducibility analysis on the MFEM library as well as a study of compiler characterization by attempting to isolate all 1,086 found instances of result variability. The Laghos proxy app is analyzed and a significant divergent floating-point variability is identified in its code base. Our bisect algorithm pinpointed the problematic function with as few as 14 program executions. Furthermore, an evaluation with 4,376 controlled injections of floating-point perturbations on the LULESH proxy application found that our framework is 100% accurate in detecting the file and function location of the injected problem with an average of 15 program executions.

debugging, compiler, reproducibility, performance tuning

## I Introduction

Given the frequent introductions of new machines and compilers, scientists must often port their trusted and working simulations to new environments. Often they must also explore optimization flags that yield higher performance on existing platforms. When these alterations introduce significant result deviations, they must either back off from these optimizations, or attempt to identify the programming construct or module that introduced the observed variability. Unfortunately, with millions of lines of code, thousands of modules, functions, and external libraries and binaries all being common in practice, the problem of identifying the source of variability is quite complex and does not have effective solutions. Today’s solutions include printf-debugging or manually altering the compilation flags and rerunning, but these approaches do not scale to large code bases.

To illustrate the problem, consider the porting of the Community Earth System Model (a large-scale climate simulation) [1], which experienced unacceptable levels of result variability. After weeks of painstaking investigations, the problem turned out to be the introduction of a fused-multiply-add instruction. As problems such as this become increasingly common in scientific software, tools that can automatically isolate the root cause would be extremely useful to scientific programmers.

Formalizing the Problem. Scientific HPC applications can be large and complex, often simulating physical phenomena for which expected outcomes are not known. As a result, there is a particular compilation configuration that is trusted because it has passed the test of time (i.e., it is believed to be correct from the first version), and is considered the baseline compilation configuration.

When applications are ported to a different compiler or to a new version of the same compiler, all acceptably good compilation configurations must deliver answers close to the baseline, in an empirical sense, either based on designer experience, or in a more rigorous mathematical sense, such as meeting an error norm. When results deviate from acceptable levels, support tools must help locate the issue within a short distance of the root cause.

Our Contributions. In this paper, we explore efficient algorithmic solutions and offer a tool-suite that attains requisite levels of search efficiency as well as root-causing efficacy on real-world applications. Our workflow includes measures to proactively explore the space of optimizations to discover lurking non-portability dangers. They also include the ability to seek tradeoffs in the rich space of optimizations and arrive at optimization flag combinations that yield acceptably different answers while significantly boosting performance.

In this paper, we present a multi-level analysis workflow comprising individual analyses that help root-cause compiler-induced result variability down to individual source files and functions. This workflow is realized on top of FLiT, an open-source floating-point litmus testing framework. The work in this paper significantly extends FLiT with several new capabilities.

In Section II, we introduce our bisection algorithm that is used to identify either files or individual functions within a file that cause result variability. We make the assumption of singleton blame sites, which means that a single file/function can, by itself, induce variability. In other words, it is not necessary for two or more files or functions to be jointly affected in order for variability to be manifested. First, this assumption holds frequently in practice. Second, the bisection algorithm has a built-in dynamic verification assertion that is proven to verify against false negatives because of this assumption. In Section II-B, we describe how our search procedure is able to exploit this assumption and simplify the search for variability.

In Section III, we demonstrate and validate our workflow and bisection techniques on three real-world codes. In navigating performance and reproducibility in the MFEM library as a case-study, we found that 14 of 19 examples exhibited the highest speedups with compilations that are bitwise reproducible. Two of those 14 showed bitwise reproducibility across all tested compilations. These results show that we may not necessarily need to sacrifice reproducibility for performance if we search using the application, inputs, and metrics we care about.

In another set of experiments, we demonstrate our bisection algorithm on all variability inducing compilations from MFEM to empirically characterize the proclivity of a compiler to introduce variability. For this code base, we provide the “best average compilation” for each compiler over the set of 19 MFEM examples, along with a rough idea of how often each compiler induces variability. One key demonstration of the utility of our bisect algorithm is its ability to analyze a reproducibility bug reported by developers of the Laghos proxy application. Overall, we were able to find the source of extreme variability with 14 application runs.

To quantify the efficacy of our bisect algorithm even more sharply, we implemented a custom LLVM pass, and using it, injected floating-point perturbations in the LULESH proxy application. We can achieve a precision and recall of 100% at identifying the source of variability, or reporting that the injection was benign and caused no variability. Each injection took only 15 application executions on average during the bisection search to isolate the function exhibiting variability.

### I-A Motivation

At one stage of the development of Laghos, an open-source simulator of compressible gas dynamics [2], the project scientists were seeking higher optimizations provided by the IBM compiler, xlc. After they moved from optimization level -O2 to -O3, the norm of the energy over the mesh of one example run went from 129,664.9 to 144,174.9 in a single iteration, an 11.2% relative difference caused simply by the optimizations! In addition, the density of the simulated gas became negative, a physical impossibility. Even more striking was the runtime difference: from 51.5 seconds to 21.3 seconds for the first iteration, which is a speedup of 2.42.
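The two headline figures above follow directly from the quoted measurements; the arithmetic is spelled out here only as a check on the stated values:

$$
\frac{144{,}174.9 - 129{,}664.9}{129{,}664.9} = \frac{14{,}510.0}{129{,}664.9} \approx 0.112,
\qquad
\frac{51.5\ \text{s}}{21.3\ \text{s}} \approx 2.42 .
$$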

Obviously, tools such as FLiT are needed to help the programmer safely navigate this performance/result integrity space. Even with such a tool, an effective overall strategy must be supported. We present a novel multi-level analysis workflow and tooling consisting of three overall phases. The objective of the first phase is to identify which compiler optimizations cause reproducibility problems. The second phase helps analyze the performance resulting from the optimizations, thus helping the programmer arrive at the most performant of acceptable solutions. The third phase helps characterize which functions within the code exhibit variability under compiler optimizations, sorted by the most influential.

## II Workflow for Multi-Level Analysis

Key to the design of FLiT is a choice of approaches and algorithms that are essential in order to make an impact in today’s HPC contexts. We now present some of these choices and describe the workflow in Figure 1.

For the purpose of discussion, a compilation is defined as a triple (Compiler, Optimization Level, Switches) applied to a subset of files in an application. This triple contains the full configuration of how an application is compiled as far as optimizations and compiler options are concerned. Our work helps hunt down result variability inducing compilations.
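To make the notion of a compilation triple concrete, the following is a minimal sketch of how the search space could be represented and enumerated. The struct, the function names, and the choice of a full Cartesian product are assumptions made for illustration; they are not taken from the FLiT code base.

```cpp
#include <string>
#include <vector>

// One compilation: the (Compiler, Optimization Level, Switches) triple.
// All names in this sketch are hypothetical.
struct Compilation {
    std::string compiler;   // e.g. "g++"
    std::string opt_level;  // e.g. "-O2"
    std::string switches;   // e.g. "-ffast-math", or "" for none
};

// Enumerate the search space as the Cartesian product of the three lists
// (an assumption of this sketch).
std::vector<Compilation> enumerate_compilations(
        const std::vector<std::string>& compilers,
        const std::vector<std::string>& opt_levels,
        const std::vector<std::string>& switch_sets) {
    std::vector<Compilation> out;
    for (const auto& c : compilers)
        for (const auto& o : opt_levels)
            for (const auto& s : switch_sets)
                out.push_back({c, o, s});
    return out;
}

// Render the command-line prefix for one compilation.
std::string to_command_prefix(const Compilation& c) {
    std::string cmd = c.compiler + " " + c.opt_level;
    if (!c.switches.empty()) cmd += " " + c.switches;
    return cmd;
}
```

Under these assumptions, two compilers, two optimization levels, and two switch sets yield eight compilations, each of which can be applied to any subset of the application's files.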

#### Handle vendor-specific as well as general-purpose compilers

Vendor-provided compilers are key to achieving high performance, especially within newly delivered HPC machines. Given this, FLiT cannot rely upon technologies that do not transcend compilers. Some such technologies are binary instrumentation tools such as PIN (these are relevant only when targeting Intel machines) and instrumentation passes based on LLVM (that are not supported by all vendor compilers).

#### Applicability within HPC build systems

Productivity oriented approaches in HPC critically depend on infrastructures such as Kokkos [3] and RAJA [4] that synthesize efficient code, affect loop optimizations in a natural way, and are important for smoothly incorporating new developments in parallelism. Given that codes written within these frameworks employ their own annotations, an approach that heavily relies on static analysis will be burdened with supporting all these different annotations. FLiT avoids this by dealing with compiled object files. It supports not only the linkage of object files emitted under different compilations.

#### Use designer-provided tests and acceptance criteria

A generic tool such as FLiT cannot have pre-built notions of which results are acceptable. It therefore engineers its solutions around C++ features that require a minimal amount of customization. For each test, the user creates a class and defines four methods:

• getInputsPerRun: Simply returns an integer – The number of floating-point values taken by the test as input (between 0 and the maximum value of size_t)

• getDefaultInput: Returns a vector of the input to use for the test. If there are more values here than specified in getInputsPerRun, then the input will be split up and the test will be executed multiple times, thus allowing data-driven testing [5].

• run_impl: The actual test that takes a vector of floating-point values as input and returns a test result. The test result can either be a single floating-point value, or a std::string. The return type of std::string is provided so that more complex structures can be returned, such as arbitrary meshes.

• compare: Takes in the test values from the baseline and testing compilations, and returns a single floating-point value. If the two values are considered equal, then this function should return 0. Otherwise, this function should return a positive value. This function behaves as a metric between the two values, and is the means by which FLiT determines if there is variability in a compilation compared to the baseline.

There are two variants of this compare function, one for long double values and another for std::string values. The user need only implement the associated variant for the return type of their test.
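The four methods above can be illustrated with a small self-contained sketch. The class below is a hypothetical stand-in that mirrors the described method names; it does not derive from FLiT's actual base class, and the exact signatures, the dot-product test, and the `run_all` driver are assumptions made for illustration.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the test interface described above; the real
// FLiT base class and its exact signatures are not reproduced here.
class DotProductTest {
public:
    // Number of floating-point inputs consumed by one run of the test.
    std::size_t getInputsPerRun() const { return 4; }

    // Eight values with four inputs per run: the test executes twice.
    std::vector<double> getDefaultInput() const {
        return {1.0, 2.0, 3.0, 4.0, 0.5, 0.25, 0.125, 0.0625};
    }

    // The computation under test: dot product of the two halves of the input.
    long double run_impl(const std::vector<double>& in) const {
        const std::size_t half = in.size() / 2;
        long double sum = 0.0L;
        for (std::size_t i = 0; i < half; ++i) sum += in[i] * in[half + i];
        return sum;
    }

    // Metric between baseline and test results: 0 means "equal".
    long double compare(long double baseline, long double test) const {
        return std::fabs(baseline - test);
    }
};

// Hypothetical driver: split the default input into chunks of
// getInputsPerRun() values and run the test once per chunk.
std::vector<long double> run_all(const DotProductTest& t) {
    std::vector<long double> results;
    const std::vector<double> input = t.getDefaultInput();
    const std::size_t n = t.getInputsPerRun();
    if (n == 0) {  // a test without inputs runs exactly once
        results.push_back(t.run_impl({}));
        return results;
    }
    for (std::size_t start = 0; start + n <= input.size(); start += n) {
        std::vector<double> chunk(input.begin() + start,
                                  input.begin() + start + n);
        results.push_back(t.run_impl(chunk));
    }
    return results;
}
```

Here `compare` uses the absolute difference, so any nonzero return value flags variability; a tolerance-based metric would instead return 0 whenever the difference falls below a user-chosen threshold.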

FLiT requires deterministic executions, as shown in Figure 1. This means that on a given platform and input, we must be able to rerun an application and obtain the exact same results as measured by the user-provided compare function. There are many deterministic HPC applications, even many MPI and/or OpenMP applications that provide run-to-run reproducibility, and are therefore supported by FLiT. As depicted in Figure 1, if an application is not deterministic, then external methods can be used to make it deterministic. For example, one can identify and fix races with a race detector such as Archer [6], or directly determinize an execution using a capture-playback framework such as ReMPI [7].

Currently, support for GPUs does not exist in FLiT. With GPUs, the manner in which warps are scheduled can cause floating-point reassociations, thus changing execution results, and there is very little external control one can exert on GPU warp schedulers. Given the rapid evolutions in the GPU-space, this is future work.

### II-A Bisect Problem

The bisect problem handled by FLiT is multifaceted: it must help locate variability-inducing compilations while also checking for acceptable execution results. Unfortunately, modern compilers are quite complex, and their internal operation involves many decisions such as link-time library substitutions, the ability (or lack of) to leverage new hardware resources, and many more such options that affect either performance or the execution results. This richness forces us to adopt an approach that is as generic as possible, and consists of compiling different files at different optimizations and drawing a final linked image from this mixture. The granularity of mixing versions in our case is either at a file level, or (by using weak symbols and overriding) at a function level. When we encounter a numerical result difference during our bisection search, we allow existing tools to help with root-causing. Thus FLiT’s task is to isolate the problem down to a file or a function.

An important practical reality is that a large application is comprised of hundreds of functions spread over multiple files. It is possible that the compiler optimization may have affected any subset of these functions to cause the observed variability. The objective of FLiT’s bisect algorithm is to identify and isolate all functions that have contributed to result variability.

In a general sense, one faces the daunting prospect of identifying those functions that are “coupled” in the sense that they must be optimized together in a certain way in order to cause result variability. This would lead to a search algorithm that considers all possible subsets of files or functions, an exponential problem that, if implemented as such, would result in a very slow tool. The singleton blame site assumption alluded to earlier reduces the search space, as discussed in more depth in Section II-D.

### II-B Bisect Algorithm

The bisect algorithm (Algorithm 1) follows a simple divide and conquer approach. It takes two inputs: (1) $items$, which is a set of files/functions in the compilations to be searched over; and (2) a test function Test that maps $items$ to a real value that is either $0$ or greater than $0$. A non-zero output indicates the existence of result variability, and also helps us sort the problematic items (files and functions) in order of the degree of variability they induce by themselves. It also allows us to formulate the bisect biggest algorithm (discussed in Section II-E). A zero output indicates that there is no result variability.

Notice that procedure BisectOne (helper to procedure BisectAll) does not simply return the next found element. It instead returns a pair of two sets. The first set returned is a set over which further searching need not be done. The second is a singleton set, in essence the “found element”. As line 2 of BisectOne indicates, this means that Test($items$) is greater than $0$, i.e., the presence of this singleton set, namely $items$, in a compilation causes result variability. That means we have successfully located one variability-inducing file/function. We now return the pair indicating: (1) that we found $items$, and (2) that we need not include $items$ in future searches. These elements are then removed from the search space in future bisect searches (as seen on line 7 of procedure BisectAll in Algorithm 1). This is not necessary for the algorithm to work correctly, or even for the complexity, but it is simply an optimization that allows us to prune the search space if we happen to find elements which cause the given test to pass. This is a key difference with respect to how Delta debugging [8] works, a point discussed under the heading Assumption 2 of Section II-D.

As a specific example of this strategy, notice what we do on line 9 of BisectOne, which is reached when Test returns $0$ on one half of the split. Then we suppress future testing on that half.

The Test function that is passed to the bisect algorithms is a user defined metric that has the following attributes:

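• Maps a set of items to a non-negative value, $\text{Test}(items) \geq 0$.

• $\text{Test}(items) = 0 \Rightarrow$ there are no variability causing elements

• $\text{Test}(items) > 0 \Rightarrow$ there is at least one variability causing element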

In Figure 2, we can see an example of running Algorithm 1. The ✔ symbol indicates an instance when Test returns $0$ and the ✘ symbol indicates that Test returns a value greater than $0$. Each invocation of BisectOne is indicated by horizontal lines between the steps. The small X’s in Figure 2 refer to the extra set of elements returned by procedure BisectOne indicating a set of elements to discard for future search.

Although it is true that for this example it would be cheaper to do a linear search over the elements, a linear search would always require $O(N)$ Test executions, where $N$ is the total number of elements. The cost of this bisect algorithm is instead governed by $k$, the number of variability causing elements to find, with each element located by repeated halving of the search space. These bounds are discussed in more detail in Section II-D.
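Because Algorithm 1 is only described in prose here, the following is a minimal C++17 sketch of the two procedures as they are characterized above. It is an illustrative reconstruction under stated assumptions, not the authors' implementation: the names `bisect_one` and `bisect_all`, the use of string identifiers for items, and the equality form of the final assertion are all assumptions, and the line numbers cited in the text refer to Algorithm 1 rather than to this listing.

```cpp
#include <cassert>
#include <functional>
#include <set>
#include <string>
#include <utility>
#include <vector>

using Items = std::vector<std::string>;
using TestFn = std::function<double(const Items&)>;

// Returns {elements that need no further searching, the one found element}.
// Precondition: test(items) > 0 and items is non-empty.
std::pair<Items, Items> bisect_one(const TestFn& test, const Items& items) {
    if (items.size() == 1) {
        assert(test(items) > 0.0);  // the singleton itself must show variability
        return {items, items};
    }
    const auto mid = items.begin() + items.size() / 2;
    const Items first(items.begin(), mid), second(mid, items.end());
    if (test(first) > 0.0) return bisect_one(test, first);
    // The first half passed, so it can be discarded along with whatever the
    // recursive search on the second half discards.
    auto [discard, found] = bisect_one(test, second);
    discard.insert(discard.end(), first.begin(), first.end());
    return {discard, found};
}

// Repeatedly finds one variability-inducing element and prunes the search
// space until the remaining elements pass the test.
Items bisect_all(const TestFn& test, const Items& items) {
    Items found, remaining = items;
    while (!remaining.empty() && test(remaining) > 0.0) {
        auto [discard, next] = bisect_one(test, remaining);
        found.insert(found.end(), next.begin(), next.end());
        const std::set<std::string> drop(discard.begin(), discard.end());
        Items kept;
        for (const auto& x : remaining)
            if (drop.count(x) == 0) kept.push_back(x);
        remaining = kept;
    }
    // Dynamic verification: the found elements alone reproduce the full effect.
    assert(test(items) == test(found));
    return found;
}
```

In this sketch each call to `bisect_one` performs one Test execution per halving step plus one for the singleton assertion, which is where the logarithmic cost per found element comes from. Memoizing Test results, as the text suggests, would avoid re-running the Test on the full set during verification.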

### II-C Implementation of Bisect

The bisect search algorithm utilizes a well-known divide and conquer technique, but applying it to find the functions causing variability is nontrivial. Note, the terms function and symbol are used interchangeably, although symbol usually refers to a compiled version of the function. Since the problem is to find all functions causing variability, we could group together all functions of the compiled application and apply the bisect algorithm. But, for anything larger than small applications, the search space becomes too large to search effectively. Instead, akin to how Delta Debugging [8] was extended to work on hierarchical structures [9], we perform this bisect algorithm on a dual-level hierarchy, first by searching for the files where variability is introduced by the compiler, and then searching the functions within each found file. This allows us to reduce the search space considerably, by splitting up the full bisect search into much smaller separate searches.

The file bisect Test function is implemented by mixing and matching the object files generated from the two different compilations, some from the variability-inducing compilation, and the rest from the baseline compilation. The Test function passed into the bisect algorithm is generated from the baseline compilation, the variable compilation, and the full list of source files. When a set of source files is passed into the Test function, those files are compiled with the variable compilation, all others are compiled with the baseline compilation, and then the two sets of object files are linked together. This is expressed in the left half of Figure 3.
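The mixing step of file bisect can be sketched as a function that assembles the link command for one Test invocation. The directory layout (`obj/variable/`, `obj/baseline/`), the object-file naming, and the function name are assumptions made for this illustration only.

```cpp
#include <set>
#include <string>
#include <vector>

// Build the link command for one file-bisect Test invocation. Files in
// `under_test` take their object file from the variable compilation; all
// other files take theirs from the baseline compilation.
// The directory layout and the ".o" naming are assumptions of this sketch.
std::string mixed_link_command(const std::vector<std::string>& all_sources,
                               const std::set<std::string>& under_test,
                               const std::string& linker,
                               const std::string& output) {
    std::string cmd = linker + " -o " + output;
    for (const auto& src : all_sources) {
        const std::string dir =
            under_test.count(src) ? "obj/variable/" : "obj/baseline/";
        cmd += " " + dir + src + ".o";
    }
    return cmd;
}
```

A Test function built on this would run the resulting executable and pass its output, together with the baseline output, to the user-provided compare function.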

It is possible for the baseline and variable compilations to use different compilers, in which case this approach depends heavily on binary compatibility between the two compilers [10, 11]. Since binary compatibility is not guaranteed with the C++ standard library, we force all compilers to use the GCC implementation.

Using this approach, the bisect algorithm finds all compiled object files that contribute to the variability when compiled with the variable compilation. Each compiled object file comes from a single source file, and therefore can indicate the source files that cause variability.

Having finished finding all variability-contributing object files, we move on to finding the variability-inducing symbols within the found object files (i.e. methods and functions). This second pass over symbols, called symbol bisect, is performed individually on all symbols within each found variability-producing object file.

Exploiting Linker Behavior and Objcopy: The method for selecting functions from two different versions of the same object file is done by making use of strong and weak symbols, and is shown in the right half of Figure 3. At link time, if there is more than one strong symbol, the linker reports a duplicate symbol error. If there is more than one weak symbol, then the linker is allowed to choose which one to keep and discards the rest. In the case there is one strong symbol and one or more weak symbols, the linker keeps the strong symbol and discards all weak symbols. It is the last case we utilize to select functions. Using objcopy, we can duplicate an object file, and change a subset of the strong symbols into weak symbols. The other object file is then treated similarly, but marking the complement set of symbols as weak. At this point, both object files can be successfully linked together into the executable.
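The complementary weakening described above can be sketched by generating the two objcopy invocations. The sketch relies on GNU objcopy's `--weaken-symbol` option; the file names, the function names, and the decision to pass one option per symbol are assumptions for illustration and may differ from how FLiT drives objcopy.

```cpp
#include <set>
#include <string>
#include <vector>

// Build an objcopy command that copies `input` to `output` while turning the
// listed symbols into weak symbols (GNU objcopy's --weaken-symbol option).
std::string weaken_command(const std::string& input, const std::string& output,
                           const std::vector<std::string>& symbols_to_weaken) {
    std::string cmd = "objcopy";
    for (const auto& sym : symbols_to_weaken) cmd += " --weaken-symbol=" + sym;
    return cmd + " " + input + " " + output;
}

// Given all exported symbols and the subset to take from the variable
// compilation, return the two commands: the variable object keeps the chosen
// symbols strong, the baseline object keeps the complement strong.
// The object file names are hypothetical.
std::vector<std::string> symbol_mix_commands(
        const std::vector<std::string>& all_symbols,
        const std::set<std::string>& from_variable) {
    std::vector<std::string> weak_in_variable, weak_in_baseline;
    for (const auto& sym : all_symbols) {
        if (from_variable.count(sym)) {
            weak_in_baseline.push_back(sym);
        } else {
            weak_in_variable.push_back(sym);
        }
    }
    return {
        weaken_command("variable.o", "variable_mixed.o", weak_in_variable),
        weaken_command("baseline.o", "baseline_mixed.o", weak_in_baseline),
    };
}
```

After both commands run, every exported symbol is strong in exactly one of the two output files, so the linker resolves each function to the intended version without a duplicate symbol error.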

However, when a compiler generates an object file, it works under the assumption that the object file, also known as a single translation unit, is indivisible [12], and therefore performs many optimizations based on that assumption. In order to replace a function with a different compiled copy, the inlining optimization must be disabled so that copies of the function to be replaced do not remain embedded inside of other functions. This problem of wanting to be able to replace a function with a different implementation has been solved in the domain of shared libraries, with the use of LD_PRELOAD, and is called interposition. In order to successfully replace all instances of one function, it is then required to recompile the object file with -fPIC, thus disabling inlining of functions that are callable from other translation units (i.e. the globally exported symbols). We are limited, therefore, to search within the space of globally exported symbols, since those are the only ones we can guarantee can be replaced fully with the desired version.

There are other potential ways to select individual functions from one compilation and the rest from another compilation. For example, some compilers allow turning on and off compiler optimizations using #pragma statements. This approach would work only for compilers with such a capability, and would not be able to handle the situation of mixing compilations that have two different compilers, such as GCC and the Intel compiler. Another approach is to split the functions into separate source files. However, this approach is non-trivial to implement and has the potential to disable many of the optimizations that cause variability. The final approach we considered was to operate on a compiler intermediate representation, such as LLVM IR. But this approach will work only with the compilers with which we can perform such a pass, at the very least excluding the use of closed source compilers such as the Intel compiler, the IBM compiler, and the PGI compiler.

The Test function for symbol bisect is generated from the two specified compilations, the full set of source files, the one source file to search, and the full list of globally-exported symbol names. It then marks certain symbols as weak from the two versions of the variability-inducing object file (compiled with -fPIC) and links together these two object files with the rest of the object files compiled with the baseline compilation.

### II-D Bisect Analysis

Stated in a general manner (i.e., without our singleton blame site assumption), our objective is to find all minimal sets of functions that cause variability.

###### Definition 1.

is a minimal set if with , then .

In the above definition, is the set of all elements of size . That is to say, a minimal set is the smallest set that causes Test to fail with a non-zero value (uniqueness is not guaranteed).

The goal of bisect is not to find a single minimal set, but to find all elements that cause variability. Elements that do not cause variability do not affect the Test value at all by definition; therefore, only the variability inducing elements (i.e. the elements to find) cause any perturbations from the Test function. However, without any further assumptions about the Test function, the search space is the set of all subsets of the elements, which is exponential in size. In the worst case, it would be required to perform an exhaustive search over this entire search space. Therefore, we form additional assumptions to reduce the search space and make the problem tractable and scalable.

###### Assumption 1.

Each combination of errors is unique (e.g. errors do not exactly cancel). This means if and only if , where is the set of all variability-inducing elements from .

That is to say, the Test value is unique to a nonlinear combination of the effects of each minimal set within the set passed to Test. It is possible for this assumption to not always be true. But without this assumption, we could not do any better than brute-force search or some approximation technique. Because the Test function returns a floating-point value measuring the induced variability, it is reasonable to presume duplicate Test values from different error combinations are unlikely. Although this assumption allows us to do something better than brute-force search, the search space is still much too large for any reasonable application. This brings us to our next simplifying assumption.

###### Assumption 2.

All minimal sets are of size one.

This is the formal equivalent of the singleton blame site assumption discussed thus far. This assumption claims there is no situation where two functions need to be compiled in a certain way together in order to generate a measurable variability. This assumption is strong and not always true. But, for the problem of variability from optimizations on functions, it is frequently true in practice. And this allows us to reduce the search space considerably.

Using this assumption, we could do a simple linear search over the elements to determine which elements constitute minimal sets, which would have complexity of $O(N)$. Instead, we perform a bisection search, resulting in a complexity of $O(k \log N)$ where $k$ is the number of elements that cause variability (as seen in Algorithm 1). This is a much more scalable approach as $N$ gets very large.
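As a rough illustration with assumed numbers (not measurements from this study), suppose a search covers $N = 1024$ elements of which $k = 3$ induce variability. Taking the bisect cost to be on the order of $k \log_2 N$, since each found element is located by repeated halving, and ignoring constant factors and the few extra Test executions used for loop checks and verification, the two strategies compare as:

$$
\underbrace{N}_{\text{linear search}} = 1024
\qquad \text{versus} \qquad
\underbrace{k \log_2 N}_{\text{bisect}} = 3 \times 10 = 30 .
$$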

But what if Assumption 2 is not true? We could generate false negative results. But the assertion found on line 8 of procedure BisectAll in Algorithm 1 directly verifies the veracity of Assumption 2.

###### Theorem 1.

If , and , then

is not necessarily the set of all variable elements, since there could be coupled elements that only show variability together. But, if , then is the set of all variable elements.

###### Proof.

Each element of causes variability, therefore . From Assumption 1, implies that . Therefore

Despite a simple proof, the result is profound. If Assumption 1 holds, and the assertion on line 8 of procedure BisectAll in Algorithm 1 holds, then there are no false negatives, meaning we have found all variability inducing elements. And this dynamic verification requires only two more Test executions (only one more with memoization, since the Test on the full set of elements has previously been performed). However, if the assertion fails, meaning the two Test values differ, then Assumption 1 or Assumption 2 is false, in which case there may be false negative results. When this occurs, the user is notified by our tool that there may be functions not found by the bisect algorithm that contribute to variability.

### II-E The Bisect Biggest Algorithm

Along with the bisect algorithm that finds all variability-inducing files and functions, we developed an algorithm that can search for the $k$ biggest contributors, where the user can choose the value for $k$. This variant is based on Uniform Cost Search and can exit early. Upon finding the largest contributing file, it immediately recurses to find the largest contributing symbols. When a file or symbol is found to have a smaller Test value than the found symbol’s Test value, it exits early. It is not able to dynamically verify assumptions, but can significantly improve performance if only the top few most contributing functions are desired, and there happen to be many more than that to find.

## III Experimental Results

We performed three evaluations of FLiT: MFEM, Laghos, and LULESH. In the first, we apply FLiT to MFEM to view the speed and variability space, then we apply FLiT bisect on all variant compilations. The second is a real-world case study applying FLiT bisect to a codebase with an unknown variability issue. Finally, we use a floating-point modifying pass to evaluate the precision and recall of the bisect algorithm.

### III-A Performance vs Reproducibility Case Study

MFEM is a finite element library poised for use in high performance applications. We used FLiT to compile it under three mainstream compilers to view the tradeoff between reproducibility and speed, as seen in Figures 4 and 5. In Figure 6 we examine the fastest non-variant compilations given by each compiler with the fastest variant overall.

The MFEM library comes with end to end examples of how to use the framework, which is what we used as test cases in FLiT. These examples include the use of MPI, which FLiT now supports. Each example produces a full two dimensional mesh, which we use for our custom comparison function by differencing the meshes and taking the norm of the result.
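A comparison function of the kind described above can be sketched for results returned as std::string. The flat whitespace-separated text format, the choice of the $\ell_2$ norm, and the function names are assumptions of this sketch; the text states only that the meshes are differenced and a norm is taken.

```cpp
#include <cmath>
#include <cstddef>
#include <limits>
#include <sstream>
#include <string>
#include <vector>

// Parse whitespace-separated numbers; this flat text format is assumed.
std::vector<long double> parse_values(const std::string& text) {
    std::istringstream in(text);
    std::vector<long double> values;
    long double v;
    while (in >> v) values.push_back(v);
    return values;
}

// L2 norm of the element-wise difference; 0 means the meshes compare equal.
// Meshes of different sizes are treated as maximally different.
long double compare_meshes(const std::string& baseline,
                           const std::string& test) {
    const auto a = parse_values(baseline);
    const auto b = parse_values(test);
    if (a.size() != b.size())
        return std::numeric_limits<long double>::infinity();
    long double sum_sq = 0.0L;
    for (std::size_t i = 0; i < a.size(); ++i) {
        const long double d = a[i] - b[i];
        sum_sq += d * d;
    }
    return std::sqrt(sum_sq);
}
```

Returning exactly 0 for identical meshes matches the convention that a zero compare value means no variability relative to the baseline.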

Using FLiT we compiled MFEM using the releases of the g++, clang++, and icpc compilers listed in Figure 7. For these compilers we paired a base optimization level, -O0 through -O3, with the same flag combinations used in [13]. Each resulting compilation was run on all 19 test cases. Looking at a single experiment and ordering the compilations from slowest to fastest, we get graphs similar to Figure 4, which represents example 5 from MFEM. In this figure, the points marked with a blue circle compare equal to the result baseline of g++ -O0, and those with a red X exhibit variability. For MFEM example 5, the fastest bitwise equal compilation showed the best speedup of 1.128. This example was not an outlier; similar results are found in 14 of the 19 examples, as seen in Figure 6. This contrasts with Figure 5, which has the variant compilations grouped near the top and showing a significant speedup over the fastest bitwise equal compilation.

While these plots give detail on individual experiments, Figure 6 shows a bigger picture. Each grouping shows the fastest non-variant compilation and the fastest variant compilation for a single experiment. Once again, 14 out of 19 experiments show non-variant compilations to also be the fastest. Only a minority of the groupings show variant compilations being noticeably faster than non-variant compilations.

### III-B Bisect

FLiT found compilations that led to variant results, each of which was explored by FLiT bisect. These searches were over a non-trivial codebase. An overview of the success rate of bisect is available in Figure 8.

The MFEM library contains almost functions which are exported symbols, as seen in Figure 9. While this is daunting for a linear search, the bisect approach used an average of executions including the verification assertion. FLiT was able to isolate the variability to the file level of the time, and of those was able to isolate the variability to the symbol level of the time.
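To illustrate why bisection needs far fewer executions than a linear search, here is a simplified sketch that locates a single blamed item. It assumes exactly one file or symbol is responsible and that a test oracle can report whether a given subset, compiled with the variant compilation, changes the result. The names `Oracle` and `bisect_one` are hypothetical, and this is not FLiT's actual implementation, which also handles multiple blame sites and a final verification step.

```cpp
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

// Test oracle: returns true if building only `subset` with the variant
// compilation (and everything else with the trusted one) changes the result.
using Oracle = std::function<bool(const std::vector<std::string>&)>;

// Finds one item responsible for variability. Preconditions: `items` is
// non-empty, shows variability as a whole, and a single item is to blame.
// `executions` is incremented once per oracle call (one program run each).
std::string bisect_one(std::vector<std::string> items,
                       const Oracle& is_variant, int& executions) {
    while (items.size() > 1) {
        const auto mid =
            items.begin() + static_cast<std::ptrdiff_t>(items.size() / 2);
        std::vector<std::string> left(items.begin(), mid);
        std::vector<std::string> right(mid, items.end());
        ++executions;
        // Under the single-blame assumption, if the left half is clean
        // the culprit must be in the right half.
        items = is_variant(left) ? left : right;
    }
    return items.front();
}
```

Each iteration halves the candidate set, so the number of oracle calls in this sketch grows logarithmically with the number of files or symbols.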

### III-C Characterization of Compilers

From this two-part experiment we can assess each compiler's predilection for speed, variability, and compatibility.

The maximum available speedup for a single example ranges from a speedup to a speedup over a speed baseline of g++ -O2, but this comes with the caveat that each example will have a different compilation triple. Since MFEM is a library, it is better to see which triples lead to the best average speedup across all examples to cover all users. This can be seen in Figure 7, in which g++ comes in first with a speedup of . Note, all three of these fastest average compilations have variability induced on at least one example.

In that same Figure is the percentage of compilations which caused variability. The most invariant compiler is clang++ with of compilations deviating from the baseline. The most variant compiler, producing almost half variable compilations, is the Intel compiler, icpc. Intel’s compiler went from a far second in speed to last in variability.

A closer examination of the bisect results revealed some issues that led to the failure rate of file bisect. When icpc and g++ object files were linked together, the resulting executable would sometimes fail with a segmentation fault. While Intel claims compatibility with the GNU compiler [10], this does not seem to always hold.

### III-D Penetration into Laghos

The issue found by the developers of Laghos manifested when it was compiled with IBM’s xlc++ compiler at -O3. Given the code, bisect first found an issue unrelated to floating point that was already fixed in another branch of the code. After fixing that problem we were able to isolate the original issue down to the function level.

The tool developers trusted the results from both g++ -O2 and xlc++ -O2 when using their own branch of the code. We used a public branch of the code in an attempt to reproduce the results they had. In our runs, all results were the special floating point value . Using bisect we narrowed this down to the two closest visible symbols to the issue. The source code in question was #define xsw(a,b) a^=b^=a^=b, which invokes undefined behavior in C++. Bisect identified these two functions in program executions. The developers were able to confirm the bug, which was already fixed in their own version.
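The source does not show how the developers fixed the macro. One well-defined alternative, offered only as a sketch, is to exchange the values through `std::swap`; the wrapper name `swap_values` is hypothetical.

```cpp
#include <utility>

// Hypothetical replacement for the xsw macro: std::swap exchanges the two
// values through a temporary, with well-defined sequencing of each write.
template <typename T>
void swap_values(T& a, T& b) {
    std::swap(a, b);
}
```

Unlike an XOR-based swap, this also leaves the value intact when both arguments refer to the same object, where the XOR form would zero it out.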

Fixing this issue led to results that agreed with the developer-stated results for both the trusted compilations and the variant xlc++ -O3 compilation. We ran many variants of bisect to evaluate the speed and effectiveness of bisect and bisect biggest, as can be seen in Figure 10. By limiting either the digit sensitivity of our compare function, or the value of bisect biggest ( refers to using the traditional bisect algorithm), the number of runs varies from to , all of which were able to identify the large variability-inducing function. The function pointed to contained an exact comparison to in an if statement. The value being compared against had small variability, but the difference in branching resulted in significant application variability. Changing this to an epsilon-based comparison gave results close to the trusted results, even under xlc++ -O3.
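An epsilon-based comparison of this kind might look like the following sketch. The tolerance values and the name `nearly_equal` are assumptions chosen for illustration; the source does not give the tolerance used in Laghos.

```cpp
#include <algorithm>
#include <cmath>

// Hypothetical tolerance-based equality: true when a and b differ by at
// most abs_tol, or by at most rel_tol relative to the larger magnitude.
// Both default tolerances are illustrative, not taken from the source.
bool nearly_equal(double a, double b, double rel_tol = 1e-12,
                  double abs_tol = 1e-15) {
    const double diff = std::fabs(a - b);
    if (diff <= abs_tol) return true;
    return diff <= rel_tol * std::max(std::fabs(a), std::fabs(b));
}
```

With such a test, a value that differs from the comparison target only by compiler-induced rounding takes the same branch as the exact value, so the small variability no longer flips control flow.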

### III-E Injection Study

We performed controlled injections of floating-point variability at all floating-point code locations to quantify the accuracy of our tool.

Our injection framework is based on the LLVM compiler [14] and introduces an additional floating-point operation in a given floating-point instruction of the LLVM intermediate representation (IR). More formally, given a target floating-point instruction of the form x OP y, where x and y are floating-point operands and OP is a basic floating-point operation (+,-,*,/), we introduce an additional operation x OP’ ε, where OP’ is also a basic floating-point operation and the perturbation value ε is chosen from a uniform distribution between 0 and 1. For example, assuming that the target instruction is

 z=x∗y,

after the injection, the resulting operation is:

 z=(x+1e-100)∗y.

In this example, OP’ is the addition operation and the perturbation value is 1e-100.
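The effect of an injection can be mimicked at the source level, as in the sketch below. It follows the example above in perturbing the first operand; the actual framework operates on LLVM IR instead, and the names `Op`, `apply`, and `injected` are hypothetical.

```cpp
// Basic floating-point operations, usable as OP or as the injected OP'.
enum class Op { Add, Sub, Mul, Div };

double apply(Op op, double lhs, double rhs) {
    switch (op) {
        case Op::Add: return lhs + rhs;
        case Op::Sub: return lhs - rhs;
        case Op::Mul: return lhs * rhs;
        case Op::Div: return lhs / rhs;
    }
    return lhs;  // not reached for valid Op values
}

// Source-level analogue of injecting into z = x OP y:
// the first operand is replaced by (x OP' perturbation).
double injected(Op op, Op op_prime, double x, double y,
                double perturbation) {
    return apply(op, apply(op_prime, x, perturbation), y);
}
```

Note that for an operand such as 2.0, adding 1e-100 rounds back to 2.0 in double precision, so an injection of this size would leave that particular result unchanged, whereas a perturbation of 0.5 would change it.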

Our variability injection framework requires two passes. The first pass identifies potential valid injection locations; an injection location is defined by a file, function, and floating-point instruction tuple in the program. The second pass performs an injection in a user-specified valid location, using a specific perturbation value and operation OP’. The injections are performed at an early stage during the LLVM optimization step. Our goal is to introduce variability before optimizations take place.

For our evaluation we used the benchmark Livermore Unstructured Lagrangian Explicit Shock Hydrodynamics (LULESH). LULESH contains source lines of code, in which there are floating point operations. For each of these operations we did four injection runs, one for each possible OP’.

Under our evaluation criteria, as seen in Figure 11, we deem a find exact when the symbol reported by FLiT bisect is the source function where the injection occurred; this occurred times. We also count indirect finds, in which the source function is not a visible symbol but bisect found the visible symbol that used the injected function. This can occur for a number of reasons, the majority being functions that were inlined or otherwise not exported as a strong symbol. We also count wrong finds and missed finds, which are false positives and false negatives, respectively. Both of these categories yielded no results in our runs. The final category is when the injection was not measurable. A non-measurable result is one in which the injection did not change the output of LULESH, which accounts for of the runs. This can occur when the injection was in code that was not run, for instance a branch that was never taken or dead code that was removed in an optimization step.

## IV Related Work

The general areas of floating-point error analysis and result reproducibility have been receiving a lot of attention [15, 16, 17, 18, 19]. Space limitations prevent a more in-depth survey. There have also been some efforts in understanding performance and reproducibility in the setting of GPUs [20]. The study of deterministic cross-platform floating-point arithmetic was begun on a strong note a decade ago in [21] by Seiler, but appears not to have been continued since. Our initial work on FLiT was in a sense inspired by Seiler’s efforts.

The version of FLiT discussed in this paper is built off an existing open-source tool framework called FLiT [13]. Compared to the existing open-source version, we make the following contributions that have significantly expanded the initial concept to a practically useful tool.

• We now have two real-world case studies, namely Laghos and MFEM, whereas [13] only conducted studies using simple litmus tests;

• The ability to seek higher performance with acceptable result variability was not present in [13];

• The whole idea of bisect search is new. Clearly, all the associated contributions, including the assumptions that help make bisect efficient, searching for the biggest contributors to result variability, and file versus function bisection are new.

• We add fault injection studies using LLVM instrumentation.

In [1], the authors discuss the impact of nonreproducibility in climate codes. The tooling they provide (KGEN) is home-grown, not meant for external use [22]. Their work does not involve any bisect capability. Their special focus is on large-scale Fortran support (and currently FLiT does not handle Fortran; it is a straightforward addition, and is future work for us).

A tool called COSFID [23] was used to take climate codes and analyze them more systematically. Their work realizes file-level bisection search, albeit through a single bash script. This does not match the current FLiT’s engineering in terms of configurability to multiple compilers and platforms, ability to work in many build environments, and a discipline whereby the user can specify the test inputs, the comparison function, etc. Their work does not perform symbol-level bisection to isolate problems down to individual functions, as we do. The assumption that makes our bisection search efficient, namely a singleton blame site, is not exploited in their work.

The issue of designing bitwise reproducible applications is discussed in [24]. Their work focuses on the design of efficient reduction operators, improving on prior work on deterministic addition. It does not support capabilities such as compilations involving different optimizations or bisection search.

## V Concluding Remarks

Given that floating-point arithmetic is the computational foundation of numerical scientific software, compiler induced result variability is a huge impediment to progress in HPC based experimentation. Given the variety of compilers deployed in a typical organization, the plethora of optimization options each compiler supports, and the increasing variety of hardware as well as libraries, a modern HPC researcher trying to cope with compiler-induced variability without any tool support is fighting a labor-intensive and error-prone uphill battle.

This work for the very first time offers a comprehensive and practical tool called FLiT that has made a difference in a state-of-the-art project at Lawrence Livermore National Laboratory, explaining why the Laghos application exhibits an unacceptable degree of result variability. In another realistic project, MFEM, FLiT has demonstrated the degree of reproducibility and performance available in state-of-the-art applications. Without tools such as FLiT, a programmer may end up adopting draconian measures such as prohibiting the project-wide use of optimizations higher than, say, -O2.

We describe how we engineer file- and symbol-level bisection search to locate files or functions that cause result variability. Our algorithms have yielded results with respect to actual projects, as well as in the context of 4,376 controlled injections of floating-point perturbations on the LULESH proxy application where it obtained 100% accuracy in detecting the file and function location of the injected problem with an average of only 15 program executions.

Our future work will address the limitations identified, the key ones being: (1) handling OpenMP and MPI applications, with support for result determinization provided in an easy-to-use manner; (2) supporting GPUs; and (3) properly understanding, characterizing, and providing workarounds for the (inevitable) compiler bugs that occasionally crash the executables generated by our bisection search algorithm. We continue to maintain FLiT as open source, and invite contributions, usage of FLiT in others’ projects, and feedback.

## References

• [1] A. Baker, D. Hammerling, M. Levy, H. Xu, J. Dennis, B. Eaton, J. Edwards, C. Hannay, S. Mickelson, R. Neale, D. Nychka, J. Shollenberger, J. Tribbia, M. Vertenstein, and D. Williamson, “A new ensemble-based consistency test for the community earth system model,” Geoscientific Model Development, no. 8, pp. 2829–2840, 2015, doi:10.5194/gmd-8-2829-2015.
• [2] V. A. Dobrev, T. V. Kolev, and R. N. Rieben, “High-order curvilinear finite element methods for Lagrangian hydrodynamics,” SIAM Journal on Scientific Computing, vol. 34, no. 5, pp. B606–B641, 2012.
• [3] H. C. Edwards, C. R. Trott, and D. Sunderland, “Kokkos: Enabling manycore performance portability through polymorphic memory access patterns,” Journal of Parallel and Distributed Computing, vol. 74, no. 12, pp. 3202–3216, 2014.
• [4] R. D. Hornung and J. A. Keasler, “The RAJA portability layer: overview and status,” Lawrence Livermore National Lab.(LLNL), Livermore, CA (United States), Tech. Rep., 2014.
• [5] P. Baker, Z. R. Dai, J. Grabowski, Ø. Haugen, I. Schieferdecker, and C. Williams, “Data-driven testing,” in Model-Driven Testing.   Springer, 2008, pp. 87–95.
• [6] S. Atzeni, G. Gopalakrishnan, Z. Rakamaric, D. H. Ahn, I. Laguna, M. Schulz, G. L. Lee, J. Protze, and M. S. Müller, “ARCHER: effectively spotting data races in large openmp applications,” in IPDPS 2016, 2016, pp. 53–62.
• [7] K. Sato, D. H. Ahn, I. Laguna, G. L. Lee, and M. Schulz, “Clock delta compression for scalable order-replay of non-deterministic parallel applications,” in Supercomputing (SC), 2015, pp. 62:1–62:12.
• [8] A. Zeller and R. Hildebrandt, “Simplifying and isolating failure-inducing input,” IEEE Transactions on Software Engineering, vol. 28, no. 2, pp. 183–200, 2002.
• [9] G. Misherghi and Z. Su, “HDD: hierarchical delta debugging,” in Proceedings of the 28th international conference on Software engineering.   ACM, 2006, pp. 142–151.
• [10] (2018) GCC Compatibility and Interoperability. [Online]. Available: https://software.intel.com/en-us/cpp-compiler-developer-guide-and-reference-gcc-compatibility-and-interoperability
• [11] Using the GNU Compiler Collection (GCC): Compatibility. [Online]. Available: https://gcc.gnu.org/onlinedocs/gcc/Compatibility.html
• [12] I. Jtc, “Sc22/wg14. iso/iec 9899: 2011,” Information technology – Programming languages – C., 2011. [Online]. Available: http://www.iso.org
• [13] G. Sawaya, M. Bentley, I. Briggs, G. Gopalakrishnan, and D. H. Ahn, “FLiT: Cross-platform floating-point result-consistency tester and workload,” in Workload Characterization (IISWC), 2017 IEEE International Symposium on.   IEEE, 2017, pp. 229–238.
• [14] C. Lattner and V. Adve, “LLVM: A compilation framework for lifelong program analysis & transformation,” in Proceedings of the international symposium on Code generation and optimization: feedback-directed and runtime optimization.   IEEE Computer Society, 2004, p. 75.
• [15] “SC15 BoF on Reproducibility of High Performance Codes and Simulations – Tools, Techniques, Debugging,” organized by Miriam Leeser, Dong H. Ahn and Michela Taufer. [Online]. Available: https://gcl.cis.udel.edu/sc15bof.php
• [16] P. Balaji and D. Kimpe, “On the reproducibility of mpi reduction operations,” in High Performance Computing and Communications & 2013 IEEE International Conference on Embedded and Ubiquitous Computing (HPCC_EUC), 2013 IEEE 10th International Conference on.   IEEE, 2013, pp. 407–414.
• [17] M. Steyer, “Intel® mpi library conditional reproducibility.”
• [18] M. J. Corden and D. Kreitzer, “Consistency of floating-point results using the intel compiler or why doesn’t my application always give the same answer,” Technical report, Intel Corporation, Software Solutions Group, Tech. Rep., 2009, https://software.intel.com/sites/default/files/article/164389/fp-consistency-102511.pdf.
• [19] M. Leeser and M. Taufer, “Panel on reproducibility at sc’16,” 2016, http://sc16.supercomputing.org/presentation/?id=pan109&sess=sess177.
• [20] N. Whitehead and A. Fit-Florea, “Precision & performance: Floating point and ieee 754 compliance for nvidia gpus,” 2012, presented at GTC 2012.
• [21] C. Seiler, 2008, http://christian-seiler.de/projekte/fpmath/.
• [22] Y. Kim, J. Dennis, C. Kerr, R. R. P. Kumar, A. Simha, A. Baker, and S. Mickelson, “KGEN: A python tool for automated fortran kernel generation and verification,” Procedia Computer Science, vol. 80, pp. 1450–1460, 2016.
• [23] R. Li, L. Liu, G. Yang, C. Zhang, and B. Wang, “Bitwise identical compiling setup: prospective for reproducibility and reliability of Earth system modeling,” Geoscientific Model Development, vol. 9, no. 2, pp. 731–748, 2016.
• [24] A. Arteaga, O. Fuhrer, and T. Hoefler, “Designing bit-reproducible portable high-performance applications,” in 2014 IEEE 28th International Parallel and Distributed Processing Symposium, Phoenix, AZ, USA, May 19-23, 2014, 2014, pp. 1235–1244. [Online]. Available: https://doi.org/10.1109/IPDPS.2014.127