Pruning, Pushdown Exception-Flow Analysis

Shuying Liang, University of Utah, liangsy@cs.utah.edu
Weibin Sun, University of Utah, wbsun@cs.utah.edu
Matthew Might, University of Utah, might@cs.utah.edu
Andy Keep, University of Utah, andy.keep@gmail.com
David Van Horn, University of Maryland, dvanhorn@cs.umd.edu

Abstract

Statically reasoning in the presence of exceptions and about the effects of exceptions is challenging: exception-flows are mutually determined by traditional control-flow and points-to analyses. We tackle the challenge of analyzing exception-flows from two angles. First, from the angle of pruning control-flows (both normal and exceptional), we derive a pushdown framework for an object-oriented language with full-featured exceptions. Unlike traditional analyses, it allows precise matching of throwers to catchers. Second, from the angle of pruning points-to information, we generalize abstract garbage collection to object-oriented programs and enhance it with liveness analysis. We then seamlessly weave the techniques into enhanced reachability computation, yielding highly precise exception-flow analysis, without becoming intractable, even for large applications. We evaluate our pruned, pushdown exception-flow analysis, comparing it with an established analysis on large-scale standard Java benchmarks. The results show that our analysis significantly improves analysis precision over traditional analysis within a reasonable analysis time.

I Introduction

Exceptions are not exceptional enough. They pervade the control-flow structure of modern object-oriented programs. An exception indicates an error occurred during program execution. Exceptions are resolved by locating code specified by the programmer for handling the exception (an exception handler) and executing this code.

This language feature is designed to ensure software robustness and reliability. Ironically, Android malware is exploiting it to leak private sensitive information to the Internet through exception handlers [22]. Analyzing the behavior of programs in the presence of exceptions is important to detect such vulnerabilities. However, exception-flow analysis is challenging, because it depends upon control-flow analysis and points-to analysis, which are themselves mutually dependent, as illustrated in Figure 1.

In Figure 1, edge A denotes the mutual dependence between exception-flow analysis and traditional control-flow analysis (CFA). CFA traditionally analyzes which methods can be invoked at each call-site. Exception-flow analysis refers to the control-flow that is introduced when throwing exceptions [6]. Intuitively, throwing an exception behaves like a global goto statement, in that it introduces additional, complex, inter-procedural control flow into the program. This makes it difficult to reason about feasible run-time paths using traditional CFA. Similarly, infeasible call and return flows can cause spurious paths between throw statements and catch blocks. The following simple example demonstrates this:

    try {
      maybeThrow();  // Call 1
    } catch (Exception e) {  // Handler 1
      System.err.println("Got an exception");
    }

    maybeThrow();  // Call 2

Under a monovariant abstraction like 0-CFA [29], where the distinction between different invocations of the same procedure is lost, it will seem as though exceptions thrown from Call 2 can be caught by Handler 1.
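To make the imprecision concrete, the following minimal sketch contrasts the two views for the example above. It is a toy illustration rather than the analysis developed in this paper: the class `HandlerMatching`, the call-site labels, and the `"uncaught"` marker are hypothetical names introduced only for this example.

```java
import java.util.*;

// Toy illustration only: call-site labels and the "uncaught" marker are
// hypothetical, not part of the paper's formalism.
class HandlerMatching {
    // Enclosing handler for each call site of maybeThrow(); null means none.
    static Map<String, String> enclosingHandler() {
        Map<String, String> m = new LinkedHashMap<>();
        m.put("Call 1", "Handler 1");
        m.put("Call 2", null);
        return m;
    }

    // Monovariant view: all invocations of maybeThrow() are merged, so an
    // exception from any call site may reach any handler seen at any site.
    static Set<String> monovariantTargets(Map<String, String> sites) {
        Set<String> targets = new TreeSet<>();
        for (String h : sites.values()) {
            targets.add(h == null ? "uncaught" : h);
        }
        return targets;
    }

    // Stack-matched view: the exception unwinds only to the handler that
    // actually encloses the call site it was raised under.
    static String matchedTarget(Map<String, String> sites, String callSite) {
        String h = sites.get(callSite);
        return h == null ? "uncaught" : h;
    }

    public static void main(String[] args) {
        Map<String, String> sites = enclosingHandler();
        System.out.println(monovariantTargets(sites));
        System.out.println(matchedTarget(sites, "Call 2"));
    }
}
```

In the merged view, Call 2 inherits Handler 1 as a possible target; in the stack-matched view it does not.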

Fig. 1: Relationship among exception-flow analysis, control-flow analysis and points-to analysis.

Edge B in Figure 1 denotes the relationship between exception-flow analysis and points-to analysis. Points-to analysis computes which abstract objects (with respect to allocation sites, calling contexts, etc.) a program variable or register can point to. Points-to analysis affects exception-flow analysis, because the type of the exception at a throw site determines which catch block will be executed. That is to say, exception-flow analysis requires precise points-to analysis. Similarly, exceptional flows affect points-to analysis, since the path taken by the exceptional flow can enable or disable object assignments and bindings.

The mutually recursive relationship of CFA and points-to analysis, denoted by edge C, is obvious: abstract objects (points-to analysis) determine which methods can be resolved in dynamic dispatch (CFA), while control-flow paths affect object assignments and bindings for points-to analysis. In fact, exception-flow analysis is an example of this relationship, which exacerbates the edge C relationship further!

I-A Existing approaches

Existing compilers or analysis frameworks provide a conservative model for exception handling. One approach assigns all exceptions thrown in a program to a single global variable. This variable is then read at an exception catch site. This approach is imprecise since it has no knowledge of which exception propagates to a catch site [13, 20].

The second approach analyzes exceptional control flow only intra-procedurally, computing only local catch clauses for a try block, with no dynamic propagation of exceptions inter-procedurally.

The third approach is co-analysis using both control-flow analysis and points-to analysis (a.k.a. on-the-fly control-flow construction) to handle exceptions, which yields reasonable precision, compared to the aforementioned two approaches, as documented in a past precision study [6]. Unfortunately, even for the best co-analysis, where boosting context-sensitivity improves the analysis of exceptions, it does not improve as much as it does for points-to analysis. It is too easy for exceptions to cross context boundaries and merge. For the previous simple example, we could increase to 1-call-site sensitivity. However, context-sensitivity costs more and is easily confused when calls are wrapped, as in:

    try {
      callsMaybeThrow();  // Call 1
    } catch (Exception e) {  // Handler 1
      System.err.println("Got an exception");
    }

    callsMaybeThrow();  // Call 2

    // …

    void callsMaybeThrow() {
      maybeThrow();
    }

Similarly, values can easily merge with finitized object-sensitivity in points-to analysis. For example, if object-sensitivity uses $k$ levels of object allocation sites (or a mix with receiver objects) to distinguish contexts, objects are merged when the level exceeds $k$. Even worse, the limited $k$-sensitivity does not distinguish live heap objects from dead (garbage) heap objects, the existence of which harms both the precision and performance of the analysis. More detailed related work is described in Section IX.

I-B Our approach

Due to the intrinsic relationships illustrated in Figure 1, we propose a hybrid joint analysis of pushdown exception-flow analysis with abstract garbage collection enhanced with liveness analysis. Specifically, a pushdown system derived from the concrete semantics of a core calculus for an object-oriented language extended with exceptions is used to tackle exceptional control-flow matching between catches and throws, in addition to call and return matches. Abstract garbage collection is adapted to an object-oriented program setting, and it is enhanced with liveness analysis to tackle the points-to aspect of exception-flow analysis. We evaluate an implementation for Dalvik bytecode of the joint analysis technique on a standard set of Java benchmarks. The results show that the pruned, pushdown exception-flow analysis yields higher precision than traditional exception-flow analysis by up to 11 times within a reasonable amount of analysis time.

I-C Organization

The rest of the paper is organized as follows: Section II presents the core calculus of an object-oriented language extended with exceptions. Section III formulates the concrete semantics for the language with the intent of refactoring and abstracting it into a static analyzer. Section IV derives the abstract semantics from the concrete semantics by reformulating the structure of continuations into a list of frames and forms an implicit pushdown system. Section V-A introduces the adaptation of abstract garbage collection in object-oriented languages. Section V-B enhances the adapted abstract garbage collection with liveness analysis for better precision. The reachability algorithm is described in Section VI. Section VII describes the details of our implementation. The evaluation and benchmarks are reported in Section VIII. Section IX reports related work, and Section X concludes.

II A Featherweight Java with Exceptions

For presentation purposes, we start with a variant of Featherweight Java [14] in “A-Normal” form [11] with exceptions. A-Normal Featherweight Java (ANFJ) is identical to ordinary Featherweight Java, except that arguments to a function call must be atomically evaluable, as they are in the A-Normal Form $\lambda$-calculus. For example, the body return f.foo(b.bar()); becomes the sequence of statements

    B b1 = b.bar();
    F f1 = f.foo(b1);
    return f1;

This does not change the expressive power of the language or the nature of the analysis to come, but it does simplify the semantics while preserving the essence of the language.

The following grammar describes A-Normal Featherweight Java extended with exceptions; like regular Java, ANFJ has statement forms:

The grammar ranges over a set of class names, a set of method invocation sites, a set of labels, and a set of variables. The set of variables contains both variable and field names. Every statement has a label. A successor function yields the (semantically) subsequent statement for a statement’s label.

III Machine semantics for Featherweight Java

Fig. 2: Concrete state-space for A-Normal Featherweight Java, including a set of time-stamps.
Fig. 3: Helper functions for the concrete semantics.

In preparation for synthesizing an abstract interpreter, we first construct a small-step abstract machine-based semantics for Featherweight Java. Figure 2 contains the concrete state-space for the small-step Featherweight Java machine. Each machine state has five components: a statement, a frame pointer, a store, a continuation and a timestamp. The encoding of objects abstracts over a low-level implementation: an object is a class plus a base pointer, and field addresses are “offsets” from this base pointer. Given an object, the address of one of its fields is formed from the object’s base pointer and the field name. In the semantics, object allocation creates a single new base object pointer.

The concrete semantics use the helper functions described in Figure 3. The constructor-lookup function yields the field names and the constructor associated with a class name. A constructor takes a newly allocated address to use for fields and a vector of arguments; it returns the change to the store plus the record component of the object that results from running the constructor. The method-lookup function takes a method invocation point and an object to determine which method is actually being called at that point. The concrete semantics are encoded as a small-step transition relation. Each statement and expression type has a transition rule below.

Variable reference

Variable reference computes the address relative to the current frame pointer and retrieves the result from the store:

Return to call

Returning from a function checks if the topmost continuation is a function continuation (as opposed to an exception-handler continuation). If it is, then the machine binds the result and restores the context of the continuation; if not, then the machine skips to the next continuation. If :

Return over handler

If the topmost continuation is a handler, then the machine pops the handler off the stack. So, if :

Field reference

Field reference is similar to variable reference, except that it must find the base object pointer with which to compute the appropriate offset:

Method invocation

Method invocation is a multi-step process: it looks up the object, determines the class of the object and then identifies the appropriate method. When transitioning to the body of the resolved method, a new function continuation is instantiated, which records the caller’s execution context. Finally, the store is updated with the bindings of formal parameters to evaluated values of passed arguments.

Object allocation

Object allocation creates a new base object pointer; it also invokes the constructor helper to initialize the object. (The update operation represents right-biased functional union; wherever a vector is in scope, its components are implicitly in scope.)

Casting

A cast references a variable, replacing the class of the object:

Try

A try statement creates a new handler continuation and then proceeds to the body of the try statement.

Throw to matching handler

When the machine encounters a throw statement, it must check if the topmost continuation is both a handler and a matching handler; if so, then it returns to the context within the continuation: If and and is a :

Throw past non-matching handler

When throwing, if the topmost handler is not a match, the machine looks deeper in the stack for a matching handler. If and but is not a :

Throw past return point

If throwing an exception and the topmost continuation is a function return point, then the machine jumps over this continuation. If :

Popping handlers

When control passes out of a try block, the topmost handler must be popped from the stack. To handle this, the “successor” of the last statement in a try block is actually a special pophandler statement, and the “successor” of that statement is the statement directly following the try block.
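The stack discipline described by the return and throw rules above can be sketched as follows. This is a simplified illustration under stated assumptions, not the formal machine: the names `StackMachine`, `Fun`, `Handle`, `doThrow`, and `doReturn` are hypothetical, frames carry only strings in place of statements and frame pointers, and result binding and the store are omitted.

```java
import java.util.*;

// Sketch of the two continuation kinds; class and field names are
// illustrative assumptions, not the paper's notation.
class StackMachine {
    static abstract class Frame {}

    // Function continuation: remembers where the caller resumes.
    static final class Fun extends Frame {
        final String returnPoint;
        Fun(String returnPoint) { this.returnPoint = returnPoint; }
    }

    // Handler continuation: remembers the caught class and the catch block.
    static final class Handle extends Frame {
        final Class<? extends Throwable> caught;
        final String catchBlock;
        Handle(Class<? extends Throwable> caught, String catchBlock) {
            this.caught = caught;
            this.catchBlock = catchBlock;
        }
    }

    // Throw: pop non-matching handlers and return points until a handler
    // whose class matches the thrown object is found; null if none exists.
    static String doThrow(Deque<Frame> stack, Throwable thrown) {
        while (!stack.isEmpty()) {
            Frame top = stack.pop();
            if (top instanceof Handle && ((Handle) top).caught.isInstance(thrown)) {
                return ((Handle) top).catchBlock;
            }
        }
        return null;
    }

    // Return: pop handlers until a function continuation is found, then
    // resume at its return point; null if the stack is exhausted.
    static String doReturn(Deque<Frame> stack) {
        while (!stack.isEmpty()) {
            Frame top = stack.pop();
            if (top instanceof Fun) {
                return ((Fun) top).returnPoint;
            }
        }
        return null;
    }
}
```

Both operations may pop several frames in one step, which is the behavior the individual "throw past" and "return over" rules spell out one frame at a time.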

IV A pushdown semantics of exceptions

Fig. 4: Abstract state-space for pushdown analysis of A-Normal Featherweight Java, including sets of frame pointers, object pointers, and time-stamps.

With the concrete semantics for A-Normal Featherweight Java with exceptions in place, we are ready to derive the abstract semantics for static analysis. “Abstracting abstract machines” (AAM) proposes a systematic approach to deriving this kind of abstraction, which is equivalent to most conservative static analyses [32]. The idea is to make the analysis finite and terminating by finitizing every component of the state, so that there is no source of infinity. However, when we apply this technique, the precision is not satisfactory for the client security analysis [21], because the over-approximation of the continuation component causes spurious control-flows and return-flows.

Therefore, in this work, we choose to abstract less than the AAM approach does: we leave the stack (represented as a continuation) unbounded in height. In fact, the central idea behind this abstraction is the generalization of two kinds of frames on the stack: function frames and exception-handler frames. In this way, we form the abstract pushdown semantics. The pushdown abstract semantics is then computed as control-state reachability in pushdown systems, building on the work of [26, 9, 10]. However, unlike that work, we improve the algorithm to handle the new behaviors introduced by exceptions. The algorithm is detailed in Section VI. The rest of this section focuses on how we formulate the pushdown semantics.

Abstract semantics are defined on an abstract state-space. To formulate the pushdown abstract state-space, we first reformulate continuations as a list of frames in the concrete semantics:

We have two kinds of frames: function frames and handler frames. As with continuations, the list of frames may grow without bound (the enhanced reachability algorithm in Section VI handles this).

Figure 4 contains the abstract state-space for the pushdown version of the small-step Featherweight Java machine. At this point, we can extract the high-level structure of the pushdown system from the state-space. A configuration in a pushdown system is a control state (from a finite set) paired with a stack (with a finite number of frames that are defined in Figure 4). This can be observed as follows:

Now let us show the detailed abstract transition relations. Because the abstraction so far is structural (every component of the concrete state is abstracted except the stack), the abstract transition relations closely resemble their concrete counterparts. The biggest difference in the abstract semantics is that it performs weak updates using the join operator. For example, for variable reference (weak updates are underlined):

The other difference is that, whenever expressions are evaluated, the results are abstract entities that represent one or more concrete entities. For example, field reference:

The underlined operation shows that more than one abstract object may result from evaluation. These two differences apply to all the other rules. To save space, we show only the abstract rules that involve exceptions. Figure 5 shows how we handle exception-flow and its mix with normal control-flow. The key idea is the “multi-pop” behavior that occurs when a function call returns or an exception is thrown (as in the concrete semantics). This approach substantially simplifies the control-reachability algorithm during summarization, as we shall show in Section VI.
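The two differences from the concrete semantics discussed above, weak updates and set-valued lookups, can be sketched as follows. This is a minimal illustration under simplifying assumptions: the class `AbstractStore` and its method names are hypothetical, and addresses and abstract values are plain strings rather than the structured components of Figure 4.

```java
import java.util.*;

// Sketch of an abstract store with weak updates; the String-typed addresses
// and values are simplifying assumptions.
class AbstractStore {
    private final Map<String, Set<String>> bindings = new HashMap<>();

    // Weak update: join the new values with whatever the address already
    // holds instead of overwriting, since one abstract address may stand
    // for several concrete addresses.
    void weakUpdate(String address, Set<String> values) {
        bindings.computeIfAbsent(address, k -> new TreeSet<>()).addAll(values);
    }

    // Lookup yields a set: possibly several abstract objects, or the empty
    // set (bottom) for an unbound address.
    Set<String> lookup(String address) {
        return bindings.getOrDefault(address, Collections.emptySet());
    }
}
```

Because updates only ever add values, bindings for a reused address accumulate over time; this accumulation is the merging that the pruning techniques of the next section aim to reduce.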

Fig. 5: Abstract transition relations (exception): Try, Throw to matching handler, Throw past non-matching handler, Throw past return point, Return over handler, and Popping handlers.

V Enhanced abstract garbage collection

The previous section formulates a pushdown system to handle complicated control-flows (both normal and exceptional). This section describes how we prune the analysis for exceptions from the angle of points-to analysis with enhanced garbage collection generalized for object-oriented programs.

V-A Abstract garbage collection in an object-oriented setting

The idea of abstract garbage collection was first proposed in the work of Might and Shivers [24] for higher-order programs. As an analog to the concrete garbage collection, abstract garbage collection reallocates unreachable abstract resources. Order-of-magnitude improvements in precision have been reported, even as it drops run-times by cutting away false positives. It is natural to think that this technique can benefit exception-flow analysis for object-oriented languages. In fact, in an object-oriented setting, abstract garbage collection can free the analysis from the context-sensitivity and object-sensitivity limitation, since the “garbage” discarded is ignorant of any form of sensitivity! For example, in the following simple code snippet,

    A a1 = idA(new A());
    A a2 = idA(new A());
    B b1 = idB(a1.makeB());
    B b2 = idB(a2.makeB());

idA and idB are identity functions. Traditionally, with one level of object-sensitivity and one level of context-sensitivity, we are able to distinguish the arguments passed in all four lines. However, it is easy to exceed the $k$-sensitivity (call sites, allocation sites, receiver objects, etc.) in modern software constructs. Abstract garbage collection helps by discarding conservative values and enabling fresh bindings for reused variables (formal parameters). This does not need knowledge about any sensitivity! Thus, it can avoid “merging” of abstract object values (and so indirectly eliminate potentially spurious function calls). For exceptions specifically, abstract garbage collection can help avoid conflating exception objects at various throw sites.

To gain the promised analysis precision and performance, we must conduct a careful and subtle redesign of the abstract garbage collection machinery for object-oriented languages. Specifically, we need to make it work with the abstract semantics defined in Section IV. In addition, the reachability algorithm should also be able to work with abstract garbage collection. Fortunately, the challenge of how to adapt abstract garbage collection into pushdown systems has been resolved in the work of Earl et al. [10]. Here we focus on the enhanced machinery for object-oriented languages.

First, we describe how we adapt abstract garbage collection to analyze object-oriented languages. Abstract garbage collection discards unreachable elements from the store; it modifies the transition relation to conduct a “stop-and-copy” garbage collection before each transition. To do so, we define a garbage collection function on configurations:

where the pipe operation $f|S$ yields the function $f$, but with inputs not in the set $S$ mapped to bottom (the empty set). The reachability function first computes the root set and then the transitive closure of an address-to-address adjacency relation:

where the function finds the root addresses:

The stack-root function finds roots on the stack. However, only function frames have the frame-pointer component needed to construct addresses, so we define a helper function that extracts only the function frames from the stack and skips over all handler frames. The stack-root function is now defined as

and the relation: connects adjacent addresses: such that . The formulated abstract garbage collection semantics constructs the subroutine eagc that is called in Alg. 4, which is the interface to enable abstract garbage collection in the reachability algorithm.
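The collection step described above (compute the roots, take the transitive closure of the adjacency relation, and restrict the store to the result) can be sketched as follows. This is an illustration under simplifying assumptions and not the eagc subroutine itself: the class `AbstractGc` is hypothetical, the root set is supplied by the caller, and the store is flattened so that each address maps directly to the set of addresses adjacent to it.

```java
import java.util.*;

// Sketch of stop-and-copy abstract garbage collection over a store that maps
// addresses to sets of adjacent addresses; this flat encoding is an assumption.
class AbstractGc {
    // Transitive closure of the adjacency relation induced by the store,
    // starting from the root addresses.
    static Set<String> reachable(Set<String> roots, Map<String, Set<String>> store) {
        Set<String> seen = new TreeSet<>(roots);
        Deque<String> work = new ArrayDeque<>(roots);
        while (!work.isEmpty()) {
            String a = work.pop();
            for (String next : store.getOrDefault(a, Collections.emptySet())) {
                if (seen.add(next)) {
                    work.push(next);
                }
            }
        }
        return seen;
    }

    // Restrict the store to reachable addresses; dropped addresses are
    // implicitly mapped to bottom (the empty set).
    static Map<String, Set<String>> collect(Set<String> roots, Map<String, Set<String>> store) {
        Set<String> live = reachable(roots, store);
        Map<String, Set<String>> kept = new TreeMap<>();
        for (Map.Entry<String, Set<String>> e : store.entrySet()) {
            if (live.contains(e.getKey())) {
                kept.put(e.getKey(), e.getValue());
            }
        }
        return kept;
    }
}
```

An address that points into the live heap but is not itself reachable from a root is dropped, so a later binding to that address starts fresh instead of joining with stale values.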

V-B Abstract garbage collection enhanced with liveness analysis

Abstract garbage collection can avoid conflating abstract objects for reused variables or formal parameters, but it cannot discover “garbage” or “dead” abstract objects in the local scope. The following example illustrates this:

    bool foo(A a) {
      B b = B.read(a);
      C p = C.doSomething(b);
      return bar(C.not(p));
    }

Obviously, in the body of foo, b is “dead” after the second line. However, naïve abstract garbage collection has no knowledge of this. In fact, this is also a problem for naïve concrete garbage collection [1]. In the realm of static analysis, the garbage value pointed to by b can pollute the exploration of the entire state space.

In addition, in the register-based byte code that our implementation analyzes, there are obvious cases where the same register is reassigned multiple times at different sites within a method. The direct adaptation of abstract garbage collection to an object-oriented setting in Section V-A cannot collect these registers between uses. In other words, for object-oriented programs, we also want to collect “dead” registers, even though they are reachable under the description in Section V-A. This can be easily achieved by using liveness analysis. Of course, we could also solve it by transforming the byte code into Static Single Assignment (SSA) form. However, as mentioned above, liveness analysis has additional benefits, so we chose to enhance the abstract garbage collection with live variable analysis (LVA).

LVA computes the set of variables that are alive at each statement within a method. The garbage collector can then more precisely collect each frame.

Since LVA is well-defined in the literature [2], we skip the formalization here, but the root-finding function is now modified to collect only live variables of the current statement:

The liveness property is embedded in the overall eagc subroutine in Alg. 4.
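As a rough illustration of how liveness narrows the root set, the sketch below runs a textbook backward liveness pass over straight-line code shaped like the body of foo and then filters the frame's variables. It rests on several assumptions: the class `Liveness`, the `Stmt` encoding, and `liveRoots` are hypothetical names, control flow within a method is ignored, and the choice to use the variables live after a statement (rather than before it) is ours.

```java
import java.util.*;

// Backward liveness for straight-line code; the Stmt encoding (one defined
// variable plus a set of used variables) is a simplifying assumption.
class Liveness {
    static final class Stmt {
        final String def;          // variable written, or null
        final Set<String> uses;    // variables read
        Stmt(String def, String... uses) {
            this.def = def;
            this.uses = new TreeSet<>(Arrays.asList(uses));
        }
    }

    // Returns a list whose i-th element holds the variables still needed
    // after statement i.
    static List<Set<String>> liveOut(List<Stmt> body) {
        List<Set<String>> out = new ArrayList<>();
        Set<String> live = new TreeSet<>();      // nothing is live at exit
        for (int i = body.size() - 1; i >= 0; i--) {
            out.add(0, new TreeSet<>(live));     // record live-out of statement i
            Stmt s = body.get(i);
            if (s.def != null) {
                live.remove(s.def);              // kill the definition
            }
            live.addAll(s.uses);                 // gen the uses
        }
        return out;
    }

    // Root restriction: keep only variables that are live after the
    // current statement, so dead locals stop acting as GC roots.
    static Set<String> liveRoots(Set<String> frameVars, Set<String> liveAfter) {
        Set<String> roots = new TreeSet<>(frameVars);
        roots.retainAll(liveAfter);
        return roots;
    }
}
```

For the foo example, encoding the three statements as (b, uses a), (p, uses b), and (return, uses p) leaves only p live after the second statement, so b no longer contributes a root there.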

VI Extending pushdown reachability analysis for exceptions

Given the formalisms in the previous sections, it is not immediately clear how to convert these rules into a static analyzer, or more importantly, how to handle the unbounded stack without it always visiting new machine configurations. Thus, we need a way to compute a finite summary of the reachable machine configurations.

In abstract interpretation frameworks, the Dyck State Graph synthesis algorithm [9], which is a purely functional version of the Saturation algorithm [26], provides a method for computing reachable pushdown control states. We build our algorithms on the work of Earl et al. [10]. As it turns out, it is not hard to extend the summarization idea to deal with an unbounded stack with exceptions. In the following sections, we present the complete algorithm in a top-down fashion, so that it can be turned easily into working code. The algorithm uses the definitions given in Section IV.

VI-A Analysis setup

Input: a list of program statements (with an initial entry point).
Output: a Dyck State Graph: a triple of a set of control states, a set of edges, and an initial state.
1 initialize an empty store, an initial empty stack frame pointer, and an empty list of contexts
2 call Eval on the initial state built from these components
3 return the resulting Dyck State Graph
Algorithm 1 Analyze

The analysis of a program starts from the Analyze function, shown in Alg. 1. It accepts a program expression (an entry point to a program) and produces a Dyck State Graph (DSG). Formally, a DSG of a pushdown system is the subset of the pushdown system reachable over legal paths. (A path is legal if it never tries to pop a frame when a different frame is on top of the stack.) Note that the context component is designed to accommodate traditional analyses, depending on the actual implementation: for example, the last $k$ call sites, object-allocation labels, or a mix of them. The analysis produces the DSG from the subroutine Eval, which is the fix-point synthesis algorithm.

Alg. 1 uses a composite data structure in the summarization algorithm. It is derived from the idea of an $\epsilon$-closure graph (ECG) in the work of Earl et al. [9], but supports efficient caching of closures along with transitive push frames on the stack. Specifically, it has six components, which can be considered maps:

  • $\epsilon$-predecessors: maps a target node to the source node(s) of its $\epsilon$-edge(s).

  • $\epsilon$-successors: maps a source node to the target node(s) of its $\epsilon$-edge(s).

  • top frames: records the shallow pushed stack frame(s) for a state node.

  • possible stack frames: computes all possible pushed stack frames of a state. It is used for abstract garbage collection.

  • predecessors for push action : , records source state node(s) for a pushed frame and the net-changed state. For example in the legal path: , the entry is in .

  • non-$\epsilon$ predecessors: maps a state node to its non-$\epsilon$ predecessors.

These data structures (and ) have the same definition in the following algorithms.

VI-A1 Fix-point algorithm of the pushdown exception framework

Alg. 2 describes the fix-point computation for the reachability algorithm. It iteratively constructs the reachable portion of the pushdown transition relation (Ln. 5-12) by inserting $\epsilon$-summary edges whenever it finds empty-stack paths (e.g., push a, push b, pop b, pop a) between control states (Ln. 13-20).
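The idea of summary edges can be illustrated with a small saturation loop. The sketch below is a generic textbook-style formulation under stated assumptions, not Alg. 2: the class `Summaries`, the `Edge` encoding, and the state names are hypothetical, the edge set is fixed up front instead of being discovered on the fly, and the multi-pop behavior of exceptions is not modeled.

```java
import java.util.*;

// Toy saturation for epsilon-summary edges; the Edge encoding and state
// names are illustrative assumptions.
class Summaries {
    static final class Edge {
        final String from, action, frame, to;   // action: "push", "pop", or "eps"
        Edge(String from, String action, String frame, String to) {
            this.from = from; this.action = action; this.frame = frame; this.to = to;
        }
    }

    // States reachable from s through zero or more epsilon edges.
    static Set<String> epsClosure(String s, Map<String, Set<String>> eps) {
        Set<String> seen = new TreeSet<>();
        Deque<String> work = new ArrayDeque<>();
        seen.add(s);
        work.push(s);
        while (!work.isEmpty()) {
            for (String t : eps.getOrDefault(work.pop(), Collections.emptySet())) {
                if (seen.add(t)) work.push(t);
            }
        }
        return seen;
    }

    // Adds s -eps-> t whenever s pushes a frame, some epsilon-reachable
    // state pops the same frame, and the pop lands in t; repeats until
    // no new summary edge appears.
    static Map<String, Set<String>> summarize(List<Edge> edges) {
        Map<String, Set<String>> eps = new TreeMap<>();
        for (Edge e : edges) {
            if (e.action.equals("eps")) {
                eps.computeIfAbsent(e.from, k -> new TreeSet<>()).add(e.to);
            }
        }
        boolean changed = true;
        while (changed) {
            changed = false;
            for (Edge push : edges) {
                if (!push.action.equals("push")) continue;
                Set<String> mid = epsClosure(push.to, eps);
                for (Edge pop : edges) {
                    if (pop.action.equals("pop") && pop.frame.equals(push.frame)
                            && mid.contains(pop.from)) {
                        changed |= eps.computeIfAbsent(push.from, k -> new TreeSet<>()).add(pop.to);
                    }
                }
            }
        }
        return eps;
    }
}
```

On the path push a, push b, pop b, pop a over states q0 through q4, the inner pair first yields a summary edge from q1 to q3, which in turn lets the outer pair yield one from q0 to q4.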

Input:
Output:
1 for do
2       for do
3             for