Improving Strategies via SMT Solving

This work was partially funded by the ANR project “ASOPT”.


Abstract

We consider the problem of computing numerical invariants of programs by abstract interpretation. Our method eschews two traditional sources of imprecision: (i) the use of widening operators for enforcing convergence within a finite number of iterations, and (ii) the use of merge operations (often, convex hulls) at the merge points of the control flow graph. It instead computes the least inductive invariant expressible in the domain at a restricted set of program points, and analyzes the rest of the code en bloc. We emphasize that we compute this inductive invariant precisely. For that we extend the strategy improvement algorithm of Gawlitza and Seidl (2007a). If we applied their method directly, we would have to solve an exponentially sized system of abstract semantic equations, resulting in memory exhaustion. Instead, we keep the system implicit and discover strategy improvements using SAT modulo real linear arithmetic (SMT). For evaluating strategies we use linear programming. Our algorithm has low polynomial space complexity and, on contrived examples, performs exponentially many strategy improvement steps in the worst case; this is unsurprising, since we show that the associated abstract reachability problem is $\Pi^p_2$-complete.

1 Introduction

1.1 Motivation

Static program analysis attempts to derive properties about the run-time behavior of a program without running the program. Among interesting properties are the numerical ones: for instance, that a given variable always has a value in a given range when reaching a given program point. An analysis solely based on such interval relations at all program points is known as interval analysis (Cousot and Cousot, 1976). More refined numerical analyses include, for instance, finding for each program point an enclosing polyhedron for the vector of program variables (Cousot and Halbwachs, 1978). In addition to obtaining facts about the values of numerical program variables, numerical analyses are used as building blocks for, e.g., pointer and shape analyses.

However, by Rice’s theorem, only trivial properties can be checked automatically (Rice, 1953). In order to check non-trivial properties we are usually forced to use abstractions. A systematic way for inferring properties automatically w.r.t. a given abstraction is given through the abstract interpretation framework of Cousot and Cousot (1977). This framework safely over-approximates the run-time behavior of a program.

When using the abstract interpretation framework, we usually have two sources of imprecision. The first source of imprecision is the abstraction itself: for instance, if the property to be proved needs a non-convex invariant to be established, and our abstraction can only represent convex sets, then we cannot prove the property. Take for instance the C-code y = 0; if (x <= -1 || x >= 1) { if (x == 0) y = 1; }. No matter what the values of the variables x and y are before the execution of the above C-code, after the execution the value of y is $0$. The invariant in the “then” branch is not convex, and its convex hull includes $x = 0$. Any static analysis method that computes a convex invariant in this branch will thus also include $x = 0$, and therefore cannot rule out the assignment y = 1. In contrast, our method avoids enforcing convexity, except at the heads of loops.

The second source of imprecision is the use of safe but imprecise methods for solving the abstract semantic equations that describe the abstract semantics: such methods safely over-approximate exact solutions, but do not return exact solutions in all cases. The reason is that we are concerned with abstract domains that contain infinite ascending chains, in particular if we are interested in numerical properties: the complete lattice of all $n$-dimensional closed real intervals, used for interval analysis, is an example. The traditional methods are based on Kleene fixpoint iteration which (purely applied) is not guaranteed to terminate in interesting cases. In order to enforce termination (at the price of imprecision) traditional methods make use of the widening/narrowing approach of Cousot and Cousot (1977). Roughly speaking, widening extrapolates the first iterations of a sequence to a possible limit, but can easily overshoot the desired result. In order to avoid this, various tricks are used, including “widening up to” (Halbwachs, 1993, Sec. 3.2), “delayed” widening, or widening with “thresholds” (Blanchet et al., 2003). However, these tricks, although they may help in many practical cases, are easily thwarted. Gopan and Reps (2006) proposed “lookahead widening”, which discovers new feasible paths and adapts widening accordingly; again this method is no panacea. Furthermore, analyses involving widening are non-monotonic: stronger preconditions can lead to weaker invariants being automatically inferred, which is rather counter-intuitive. Since our method does not use widening at all, it avoids these problems.
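To make the overshoot concrete, here is a minimal sketch of interval analysis at the head of a small loop. The loop `x = 0; while (x <= 99) x = x + 1;` is a hypothetical example, not taken from the article, and the names `loop_head`, `widen`, `iterate_with_widening` and `kleene` are illustrative. Plain Kleene iteration terminates here only because the loop is bounded; the standard interval widening stops after a few steps but loses the upper bound.

```python
import math

INF = math.inf

def join(a, b):
    """Least upper bound of two intervals; None encodes the empty interval."""
    if a is None:
        return b
    if b is None:
        return a
    return (min(a[0], b[0]), max(a[1], b[1]))

def loop_head(inv):
    """Abstract transformer at the loop head of the hypothetical program
    x = 0; while (x <= 99) x = x + 1;"""
    init = (0, 0)
    if inv is None or inv[0] > 99:
        return init
    lo, hi = inv[0], min(inv[1], 99)      # effect of the guard x <= 99
    return join(init, (lo + 1, hi + 1))   # effect of the body x = x + 1

def widen(old, new):
    """Standard interval widening: an unstable bound jumps to infinity."""
    if old is None:
        return new
    lo = old[0] if new[0] >= old[0] else -INF
    hi = old[1] if new[1] <= old[1] else INF
    return (lo, hi)

def iterate_with_widening():
    inv = None
    while True:
        nxt = widen(inv, loop_head(inv))
        if nxt == inv:
            return inv
        inv = nxt

def kleene():
    """Plain Kleene iteration (no termination guarantee in general)."""
    inv = None
    while True:
        nxt = join(inv, loop_head(inv))
        if nxt == inv:
            return inv
        inv = nxt
```

In this sketch, `kleene()` reaches the least inductive interval after about a hundred steps, whereas `iterate_with_widening()` stabilizes almost immediately on an interval without upper bound.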

1.2 Our Contribution

We fight both sources of imprecision noted above:

  • In order to improve the precision of the abstraction, we abstract sequences of if-then-else statements without loops en bloc. In the above example, we are then able to conclude that $y = 0$ holds. In other words: we abstract sets of states only at the heads of loops, or, more generally, at a cut-set of the control-flow graph (a cut-set is a set of program points such that removing them would cut all loops).

  • Our main technical contribution consists of a practical method for precisely computing abstract semantics of affine programs w.r.t. the template linear constraint domains of Sankaranarayanan et al. (2005), with sequences of if-then-else statements which do not contain loops abstracted en bloc. Our method is based on a strict generalization of the strategy improvement algorithm of Gawlitza and Seidl (2007b, a, 2010). The latter algorithm could be directly applied to the problem we solve in this article, but the size of its input would be exponential in the size of the program, because we then need to explicitly enumerate all program paths between cut-nodes which do not cross other cut-nodes. In this article, we give an algorithm with low polynomial memory consumption that uses exponential time in the worst case. The basic idea consists in avoiding an explicit enumeration of all paths through sequences of if-then-else-statements which do not contain loops. Instead we use a SAT modulo real linear arithmetic solver for improving the current strategy locally. For evaluating each strategy encountered during the strategy iteration, we use linear programming.

  • As a byproduct of our considerations we show that the corresponding abstract reachability problem is $\Pi^p_2$-complete. In fact, we show that it is $\Pi^p_2$-hard even if the loop invariant being computed consists of a single inequality $x \leq C$, where $x$ is a program variable and $C$ is the parameter of the invariant. Hence, exponential worst-case running-time seems to be unavoidable.

1.3 Related Work

Recently, several alternative approaches for computing numerical invariants (for instance w.r.t. template linear constraints) were developed:

Strategy Iteration

Strategy iteration (also called policy iteration) was introduced by Howard for solving stochastic control problems (Howard, 1960; Puterman, 1994) and is also applied to two-player zero-sum games (Hoffman and Karp, 1966; Puri, 1995; Vöge and Jurdziński, 2000) or min-max-plus systems (Cochet-Terrasson et al., 1999). Costan et al. (2005); Gaubert et al. (2007); Adjé et al. (2010) developed a strategy iteration approach for solving the abstract semantic equations that occur in static program analysis by abstract interpretation. Their approach can be seen as an alternative to the traditional widening/narrowing approach. The goal of their algorithm is to compute least fixpoints of monotone self-maps $f$, where $f(x) = \min \{ \pi(x) \mid \pi \in \Pi \}$ for all $x$ and $\Pi$ is a family of self-maps. The assumption is that one can efficiently compute the least fixpoint $\mu\pi$ of $\pi$ for every $\pi \in \Pi$. The $\pi$’s are the (min-)strategies. Starting with an arbitrary min-strategy $\pi^{(0)}$, the min-strategy is successively improved. The sequence $(\pi^{(k)})_k$ of attained min-strategies results in a decreasing sequence $\mu\pi^{(0)} > \mu\pi^{(1)} > \cdots > \mu\pi^{(k)}$ that stabilizes whenever $\mu\pi^{(k)}$ is a fixpoint of $f$, though not necessarily the least one. However, there are important cases where minimality of the obtained fixpoint can be guaranteed (Adjé et al., 2008). Moreover, an important advantage of their algorithm is that it can be stopped at any time with a safe over-approximation. This is in particular interesting if there are infinitely many min-strategies (Adjé et al., 2010). Costan et al. (2005) showed how to use their framework for performing interval analysis without widening. Gaubert et al. (2007) extended this work to the following relational abstract domains: the zone domain (Miné, 2001a), the octagon domain (Miné, 2001b) and in particular the template linear constraint domains (Sankaranarayanan et al., 2005). Gawlitza and Seidl (2007a) presented a practical (max-)strategy improvement algorithm for computing least solutions of systems of rational equations. Their algorithm enables them to perform a template linear constraint analysis precisely, even if the mappings are not non-expansive. This means that their algorithm always computes least solutions of abstract semantic equations, not just some solutions.

Acceleration Techniques

Gonnord and Halbwachs (2006); Gonnord (2007) investigated an improvement of linear relation analysis that consists in computing, when possible, the exact (abstract) effect of a loop. The technique is fully compatible with the use of widening, and whenever it applies, it improves both the precision and the performance of the analysis. Leroux and Sutre (2007); Gawlitza et al. (2009) studied cases where interval analysis can be done in polynomial time w.r.t. a uniform cost measure, where memory accesses and arithmetic operations are counted for $\mathcal{O}(1)$.

Quantifier Elimination

Recent improvements in SAT/SMT solving techniques have made it possible to perform quantifier elimination on larger formulas (Monniaux, 2008). Monniaux (2009) developed an analysis method based on quantifier elimination in the theory of rational linear arithmetic. This method targets the same domains as the present article; it however produces a richer result. It can not only compute the least invariant inside the abstract domain of a loop, but also express it as a function of the precondition of the loop; the method outputs the source code of the optimal abstract transformer mapping the precondition to the invariant. Its drawback is its high cost, which makes it practical only on small code fragments; thus, its intended application is modular analysis: analyze very precisely small portions of code (functions, modules, nodes of a reactive data-flow program, …), and use the results for analyzing larger portions, perhaps with another method, including the method proposed in this article.

Mathematical Programming

Colón et al. (2003); Sankaranarayanan et al. (2004); Cousot (2005) presented approaches for generating linear invariants that use non-linear constraint solving. Leconte et al. (2009) propose a mathematical programming formulation whose constraints define the space of all post-solutions of the abstract semantic equations. The objective function aims at minimizing the result. For programs that use only affine assignments and affine guards, this yields a mixed integer linear programming formulation for interval analysis. The resulting mathematical programming problems can then be solved to guaranteed global optimality by means of general purpose branch-and-bound type algorithms.

2 Basics

2.1 Notations

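$\mathbb{B} = \{0, 1\}$ denotes the set of Boolean values. The set of real numbers is denoted by $\mathbb{R}$. The complete linearly ordered set $\mathbb{R} \cup \{-\infty, \infty\}$ is denoted by $\overline{\mathbb{R}}$. We call two vectors $x, y \in \overline{\mathbb{R}}^n$ comparable iff $x \leq y$ or $y \leq x$ holds. For $f : X \to \overline{\mathbb{R}}^m$ with $X \subseteq \overline{\mathbb{R}}^n$, we set $\mathrm{dom}(f) := \{ x \in X \mid f(x) \in \mathbb{R}^m \}$ and $\mathrm{fdom}(f) := \mathrm{dom}(f) \cap \mathbb{R}^n$. We denote the $i$-th row (resp. the $j$-th column) of a matrix $A$ by $A_{i\cdot}$ (resp. $A_{\cdot j}$). Accordingly, $A_{i \cdot j}$ denotes the component in the $i$-th row and the $j$-th column. We also use this notation for vectors and mappings $f : X \to Y^k$.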

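Assume that a fixed set $\mathbf{X}$ of variables and a domain $\mathbb{D}$ is given. We consider equations of the form $\mathbf{x} = e$, where $\mathbf{x} \in \mathbf{X}$ is a variable and $e$ is an expression over $\mathbb{D}$. A system $\mathcal{E}$ of (fixpoint) equations is a finite set $\{\mathbf{x}_1 = e_1, \ldots, \mathbf{x}_n = e_n\}$ of equations, where $\mathbf{x}_1, \ldots, \mathbf{x}_n$ are pairwise distinct variables. We denote the set $\{\mathbf{x}_1, \ldots, \mathbf{x}_n\}$ of variables occurring in $\mathcal{E}$ by $\mathbf{X}_{\mathcal{E}}$. We drop the subscript whenever it is clear from the context.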

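For a variable assignment $\rho : \mathbf{X} \to \mathbb{D}$, an expression $e$ is mapped to a value $[\![e]\!]\rho$ by setting $[\![\mathbf{x}]\!]\rho := \rho(\mathbf{x})$ and $[\![f(e_1, \ldots, e_k)]\!]\rho := f([\![e_1]\!]\rho, \ldots, [\![e_k]\!]\rho)$, where $\mathbf{x} \in \mathbf{X}$, $f$ is a $k$-ary operator, for instance $+$, and $e_1, \ldots, e_k$ are expressions. Let $\mathcal{E}$ be a system of equations. We define the unary operator $[\![\mathcal{E}]\!]$ on $\mathbf{X} \to \mathbb{D}$ by setting $([\![\mathcal{E}]\!]\rho)(\mathbf{x}) := [\![e]\!]\rho$ for all $\mathbf{x} = e \in \mathcal{E}$. A solution is a variable assignment $\rho$ such that $\rho = [\![\mathcal{E}]\!]\rho$ holds. The set of solutions is denoted by $\mathbf{Sol}(\mathcal{E})$.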

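Let $\mathbb{D}$ be a complete lattice. We denote the least upper bound and the greatest lower bound of a set $X \subseteq \mathbb{D}$ by $\bigvee X$ and $\bigwedge X$, respectively. The least element $\bigvee \emptyset$ (resp. the greatest element $\bigwedge \emptyset$) is denoted by $\bot$ (resp. $\top$). We define the binary operators $\vee$ and $\wedge$ by $x \vee y := \bigvee \{x, y\}$ and $x \wedge y := \bigwedge \{x, y\}$ for all $x, y \in \mathbb{D}$, respectively. For $\Box \in \{\vee, \wedge\}$, we will also consider $x_1 \,\Box\, \cdots \,\Box\, x_k$ as the application of a $k$-ary operator. This will cause no problems, since the binary operators $\vee$ and $\wedge$ are associative and commutative. An expression $e$ (resp. an equation $\mathbf{x} = e$) is called monotone iff all operators occurring in $e$ are monotone.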

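The set $\mathbf{X} \to \mathbb{D}$ of all variable assignments is a complete lattice. For $\rho, \rho' : \mathbf{X} \to \mathbb{D}$, we write $\rho \lhd \rho'$ (resp. $\rho \rhd \rho'$) iff $\rho(\mathbf{x}) < \rho'(\mathbf{x})$ (resp. $\rho(\mathbf{x}) > \rho'(\mathbf{x})$) holds for all $\mathbf{x} \in \mathbf{X}$. For $d \in \mathbb{D}$, $\underline{d}$ denotes the variable assignment $\{ \mathbf{x} \mapsto d \mid \mathbf{x} \in \mathbf{X} \}$. A variable assignment $\rho$ with $\underline{\bot} \lhd \rho \lhd \underline{\top}$ is called finite. A pre-solution (resp. post-solution) is a variable assignment $\rho$ such that $\rho \leq [\![\mathcal{E}]\!]\rho$ (resp. $\rho \geq [\![\mathcal{E}]\!]\rho$) holds. The set of all pre-solutions (resp. the set of all post-solutions) is denoted by $\mathbf{PreSol}(\mathcal{E})$ (resp. $\mathbf{PostSol}(\mathcal{E})$). The least fixpoint (resp. the greatest fixpoint) of an operator $f : \mathbb{D} \to \mathbb{D}$ is denoted by $\mu f$ (resp. $\nu f$), provided that it exists. Thus, the least solution (resp. the greatest solution) of a system $\mathcal{E}$ of equations is denoted by $\mu[\![\mathcal{E}]\!]$ (resp. $\nu[\![\mathcal{E}]\!]$), provided that it exists. For a pre-solution $\rho$ (resp. for a post-solution $\rho$), $\mu_{\geq \rho}[\![\mathcal{E}]\!]$ (resp. $\nu_{\leq \rho}[\![\mathcal{E}]\!]$) denotes the least solution that is greater than or equal to $\rho$ (resp. the greatest solution that is less than or equal to $\rho$). From Knaster-Tarski’s fixpoint theorem we get: Every system $\mathcal{E}$ of monotone equations over a complete lattice has a least solution $\mu[\![\mathcal{E}]\!]$ and a greatest solution $\nu[\![\mathcal{E}]\!]$. Furthermore, $\mu[\![\mathcal{E}]\!] = \bigwedge \mathbf{PostSol}(\mathcal{E})$ and $\nu[\![\mathcal{E}]\!] = \bigvee \mathbf{PreSol}(\mathcal{E})$.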
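The following sketch illustrates these notions on a toy instance. It assumes a hypothetical system of two monotone equations over the finite chain $\{0, \ldots, 10\}$ (which is a complete lattice with max as join and min as meet); the system and the names `apply_system`, `least_solution` and `meet_of_post_solutions` are illustrative and not part of the article. It computes the least solution by Kleene iteration from the bottom assignment and, by brute force, the greatest lower bound of all post-solutions, which must coincide by the Knaster-Tarski theorem.

```python
from itertools import product

BOTTOM, TOP = 0, 10  # the finite chain {0, ..., 10} is a complete lattice

# Hypothetical monotone system:  x = 1 v (y ^ 4),   y = (x + 2) ^ 10
EQUATIONS = {
    "x": lambda rho: max(1, min(rho["y"], 4)),
    "y": lambda rho: min(rho["x"] + 2, TOP),
}

def apply_system(rho):
    """The operator [[E]]: evaluate every right-hand side under rho."""
    return {var: rhs(rho) for var, rhs in EQUATIONS.items()}

def least_solution():
    """Kleene iteration starting from the bottom assignment."""
    rho = {var: BOTTOM for var in EQUATIONS}
    while True:
        nxt = apply_system(rho)
        if nxt == rho:
            return rho
        rho = nxt

def meet_of_post_solutions():
    """Greatest lower bound of all rho with rho >= [[E]] rho (brute force)."""
    names = sorted(EQUATIONS)
    result = {var: TOP for var in names}
    for values in product(range(BOTTOM, TOP + 1), repeat=len(names)):
        rho = dict(zip(names, values))
        image = apply_system(rho)
        if all(rho[v] >= image[v] for v in names):
            result = {v: min(result[v], rho[v]) for v in names}
    return result
```

Kleene iteration terminates here because the lattice is finite; over the reals with infinite ascending chains this is exactly what fails, which motivates the strategy improvement approach.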

2.2 Linear Programming

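We consider linear programming problems (LP problems for short) of the form $\sup \{ c^\top x \mid x \in \mathbb{R}^n, Ax \leq b \}$, where $A \in \mathbb{R}^{m \times n}$, $b \in \mathbb{R}^m$, and $c \in \mathbb{R}^n$ are the inputs. The convex closed polyhedron $\{ x \in \mathbb{R}^n \mid Ax \leq b \}$ is called the feasible space. The LP problem is called infeasible iff the feasible space is empty. An element of the feasible space is called a feasible solution. A feasible solution $x$ that maximizes $c^\top x$ is called an optimal solution.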

LP problems can be solved in polynomial time through interior point methods (Megiddo, 1987; Schrijver, 1986). Note, however, that the running-time then crucially depends on the sizes of occurring numbers. At the risk of an exponential running-time in contrived cases, we can instead rely on the simplex algorithm: its running-time is uniform, i.e., independent of the sizes of occurring numbers (given that arithmetic operations, comparison, storage and retrieval for numbers are counted for $\mathcal{O}(1)$).

2.3 SAT modulo real linear arithmetic

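The set of SAT modulo real linear arithmetic formulas $\Phi$ is defined through the grammar $e ::= c \mid x \mid e_1 + e_2 \mid c \cdot e'$,   $\Phi ::= a \mid e_1 \leq e_2 \mid \Phi_1 \vee \Phi_2 \mid \Phi_1 \wedge \Phi_2 \mid \neg\Phi'$. Here, $c \in \mathbb{R}$ is a constant, $x$ is a real valued variable, $e, e', e_1, e_2$ are real-valued linear expressions, $a$ is a Boolean variable and $\Phi, \Phi', \Phi_1, \Phi_2$ are formulas. An interpretation $M$ for a formula $\Phi$ is a mapping that assigns a real value to every real-valued variable and a Boolean value to every Boolean variable. We write $M \models \Phi$ for “$M$ is a model of $\Phi$”, i.e., $[\![c]\!]M = c$, $[\![x]\!]M = M(x)$, $[\![e_1 + e_2]\!]M = [\![e_1]\!]M + [\![e_2]\!]M$, $[\![c \cdot e']\!]M = c \cdot [\![e']\!]M$, and:

$$\begin{array}{lcl} M \models a & \text{iff} & M(a) = 1 \\ M \models e_1 \leq e_2 & \text{iff} & [\![e_1]\!]M \leq [\![e_2]\!]M \\ M \models \Phi_1 \vee \Phi_2 & \text{iff} & M \models \Phi_1 \text{ or } M \models \Phi_2 \\ M \models \Phi_1 \wedge \Phi_2 & \text{iff} & M \models \Phi_1 \text{ and } M \models \Phi_2 \\ M \models \neg\Phi' & \text{iff} & M \not\models \Phi' \end{array}$$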

A formula is called satisfiable iff it has a model. The problem of deciding, whether or not a given SAT modulo real linear arithmetic formula is satisfiable, is NP-complete. There nevertheless exist efficient solver implementations for this decision problem Dutertre and de Moura (2006b).

In order to simplify notations we also allow matrices, vectors, the operations $\geq$, $<$, $>$, $=$, $\neq$, and the Boolean constants $0$ and $1$ to occur.

2.4 Collecting and Abstract Semantics

The programs that we consider in this article use real-valued variables $x_1, \ldots, x_n$. Accordingly, we denote by $x = (x_1, \ldots, x_n)^\top$ the vector of all program variables. For simplicity, we only consider elementary statements of the form $x := Ax + b$, and $Ax \leq b$, where $A \in \mathbb{R}^{n \times n}$ (resp. $A \in \mathbb{R}^{k \times n}$), $b \in \mathbb{R}^n$ (resp. $b \in \mathbb{R}^k$), and $x \in \mathbb{R}^n$ denotes the vector of all program variables. Statements of the form $x := Ax + b$ are called (affine) assignments. Statements of the form $Ax \leq b$ are called (affine) guards. Additionally, we allow statements of the form $s_1; s_2$ and $s_1 \mid s_2$, where $s_1, s_2$ are statements. The operator $;$ binds tighter than the operator $\mid$, and we consider $;$ and $\mid$ to be right-associative, i.e., $s_1 \mid s_2 \mid s_3$ stands for $s_1 \mid (s_2 \mid s_3)$, and $s_1; s_2; s_3$ stands for $s_1; (s_2; s_3)$. The set of statements is denoted by $\mathbf{Stmt}$. A statement of the form $s_1 \mid \cdots \mid s_k$, where $s_i$ does not contain the operator $\mid$ for all $i = 1, \ldots, k$, is called merge-simple. A merge-simple statement that does not use the operator $\mid$ at all is called sequential. A statement is called elementary iff it neither contains the operator $\mid$ nor the operator $;$.

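The collecting semantics $[\![s]\!] : 2^{\mathbb{R}^n} \to 2^{\mathbb{R}^n}$ of a statement $s \in \mathbf{Stmt}$ is defined by

$$\begin{array}{rcl} [\![x := Ax + b]\!]X & := & \{ Ax + b \mid x \in X \} \\ [\![Ax \leq b]\!]X & := & \{ x \in X \mid Ax \leq b \} \\ [\![s_1; s_2]\!]X & := & [\![s_2]\!]([\![s_1]\!]X) \\ [\![s_1 \mid s_2]\!]X & := & [\![s_1]\!]X \cup [\![s_2]\!]X \end{array}$$

for $X \subseteq \mathbb{R}^n$. Note that the operators $;$ and $\mid$ are associative, i.e., $[\![(s_1; s_2); s_3]\!] = [\![s_1; (s_2; s_3)]\!]$ and $[\![(s_1 \mid s_2) \mid s_3]\!] = [\![s_1 \mid (s_2 \mid s_3)]\!]$ hold for all statements $s_1, s_2, s_3$.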
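The collecting semantics of statements can be sketched as a small interpreter over finite sets of states. This is an illustration only: the class names `Assign`, `Guard`, `Seq`, `Merge` and the function `collect` are assumptions introduced here, guards and assignments are given as Python functions rather than matrices, and the loop body `BODY` encodes the running C example of this article with integer-valued states and the strict test `x_2 < 0` taken literally.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet, Tuple

State = Tuple[int, ...]

@dataclass(frozen=True)
class Assign:
    update: Callable[[State], State]   # stands for x := A x + b

@dataclass(frozen=True)
class Guard:
    holds: Callable[[State], bool]     # stands for A x <= b

@dataclass(frozen=True)
class Seq:
    first: object
    second: object

@dataclass(frozen=True)
class Merge:
    left: object
    right: object

def collect(stmt, states: FrozenSet[State]) -> FrozenSet[State]:
    """Collecting semantics [[stmt]] applied to a finite set of states."""
    if isinstance(stmt, Assign):
        return frozenset(stmt.update(x) for x in states)
    if isinstance(stmt, Guard):
        return frozenset(x for x in states if stmt.holds(x))
    if isinstance(stmt, Seq):
        return collect(stmt.second, collect(stmt.first, states))
    if isinstance(stmt, Merge):
        return collect(stmt.left, states) | collect(stmt.right, states)
    raise TypeError(f"unknown statement: {stmt!r}")

# Loop body of the running example; a state is the pair (x_1, x_2).
BODY = Seq(
    Guard(lambda x: x[0] <= 1000),
    Seq(
        Assign(lambda x: (x[0], -x[0])),
        Merge(
            Seq(Guard(lambda x: x[1] < 0), Assign(lambda x: (-2 * x[0], x[1]))),
            Seq(Guard(lambda x: x[1] >= 0), Assign(lambda x: (-x[0] + 1, x[1]))),
        ),
    ),
)
```

For instance, applying `collect(BODY, ...)` to the single state `(0, 0)` yields the single state `(1, 0)`, and applying it once more yields `(-2, -1)`.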

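An (affine) program $G$ is a triple $(N, E, \mathbf{st})$, where $N$ is a finite set of program points, $E \subseteq N \times \mathbf{Stmt} \times N$ is a finite set of control-flow edges, and $\mathbf{st} \in N$ is the start program point. As usual, the collecting semantics $V$ of a program $G = (N, E, \mathbf{st})$ is the least solution of the following constraint system:

$$V[\mathbf{st}] \supseteq \mathbb{R}^n \qquad V[v] \supseteq [\![s]\!](V[u]) \quad \text{for all } (u, s, v) \in E$$

Here, the variables $V[v]$, $v \in N$ take values in $2^{\mathbb{R}^n}$. The components of the collecting semantics $V$ are denoted by $V[v]$ for all $v \in N$.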

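Let $\mathbb{D}$ be a complete lattice (for instance the complete lattice of all $n$-dimensional closed real intervals). Let the partial order of $\mathbb{D}$ be denoted by $\leq$. Assume that $\alpha : 2^{\mathbb{R}^n} \to \mathbb{D}$ and $\gamma : \mathbb{D} \to 2^{\mathbb{R}^n}$ form a Galois connection, i.e., for all $X \subseteq \mathbb{R}^n$ and all $d \in \mathbb{D}$, $\alpha(X) \leq d$ iff $X \subseteq \gamma(d)$. The abstract semantics $[\![s]\!]^\sharp : \mathbb{D} \to \mathbb{D}$ of a statement $s$ is defined by $[\![s]\!]^\sharp := \alpha \circ [\![s]\!] \circ \gamma$. The abstract semantics $V^\sharp$ of an affine program $G = (N, E, \mathbf{st})$ is the least solution of the following constraint system:

$$V^\sharp[\mathbf{st}] \geq \alpha(\mathbb{R}^n) \qquad V^\sharp[v] \geq [\![s]\!]^\sharp(V^\sharp[u]) \quad \text{for all } (u, s, v) \in E$$

Here, the variables $V^\sharp[v]$, $v \in N$ take values in $\mathbb{D}$. The components of the abstract semantics $V^\sharp$ are denoted by $V^\sharp[v]$ for all $v \in N$. The abstract semantics $V^\sharp$ safely over-approximates the collecting semantics $V$, i.e., $\gamma(V^\sharp[v]) \supseteq V[v]$ for all $v \in N$.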

2.5 Using Cut-Sets to improve Precision

Usually, only sequential statements (these statements correspond to basic blocks) are allowed in control flow graphs. However, given a cut-set $C$, one can systematically transform any control flow graph $G$ into an equivalent control flow graph $G'$ of our form (up to the fact that $G'$ has fewer program points than $G$) with increased precision of the abstract semantics. For the sake of simplicity, we do not discuss these aspects in detail. Instead, we consider an example:

Figure 1: Panels (a) and (b).

Example 1 (Using Cut-Sets to improve Precision).

As a running example throughout the present article we use the following C-code:

int x_1, x_2; x_1 = 0; while (x_1 <= 1000) { x_2 = -x_1;
  if (x_2 < 0) x_1 = -2 * x_1; else x_1 = -x_1 + 1; }

This C-code is abstracted through the affine program which is shown in Figure 1.(a). However, it is unnecessary to apply abstraction at every program point; it suffices to apply abstraction at a cut-set of . Since all loops contain program point , a cut-set of is . Equivalent to applying abstraction only at program point is to rewrite the control-flow graph w.r.t. the cut-set into a control-flow graph equivalent w.r.t. the collecting semantic. The result of this transformation is drawn in Figure 1.(b). This means: the affine program for the above C-code is , where and

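Let $V_1$ denote the collecting semantics of the program of Figure 1.(a) and $V$ the collecting semantics of the program of Figure 1.(b). The two programs are equivalent in the following sense: $V[v] = V_1[v]$ holds for all program points $v$ of the transformed program. W.r.t. the abstract semantics, the transformed program is, as we will see, strictly more precise than the original one. In general we at least have $V^\sharp[v] \leq V_1^\sharp[v]$ for all program points $v$ of the transformed program. This is independent of the abstract domain.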

2.6 Template Linear Constraints

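In the present article we restrict our considerations to template linear constraint domains (Sankaranarayanan et al., 2005). Assume that we are given a fixed template constraint matrix $T \in \mathbb{R}^{m \times n}$. The template linear constraint domain is $\overline{\mathbb{R}}^m$. As shown by Sankaranarayanan et al. (2005), the concretization $\gamma : \overline{\mathbb{R}}^m \to 2^{\mathbb{R}^n}$ and the abstraction $\alpha : 2^{\mathbb{R}^n} \to \overline{\mathbb{R}}^m$, which are defined by

$$\gamma(d) := \{ x \in \mathbb{R}^n \mid Tx \leq d \} \quad \text{for all } d \in \overline{\mathbb{R}}^m, \qquad \alpha(X) := \bigwedge \{ d \in \overline{\mathbb{R}}^m \mid \gamma(d) \supseteq X \} \quad \text{for all } X \subseteq \mathbb{R}^n,$$

form a Galois connection. The template linear constraint domains contain intervals, zones, and octagons, with appropriate choices of the template constraint matrix (Sankaranarayanan et al., 2005).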
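For a finite set of points, the best abstraction in a template domain is simply the row-wise maximum of the template applied to each point. The sketch below illustrates this; the function names `alpha` and `in_gamma` and the two example templates are assumptions made for illustration, and restricting attention to finite point sets is a simplification (for general sets the maxima become suprema).

```python
import math

def alpha(template, points):
    """Best abstraction of a finite point set: row-wise maximum of T x.
    The empty set is mapped to the vector of -infinity."""
    return [
        max((sum(t * x for t, x in zip(row, p)) for p in points),
            default=-math.inf)
        for row in template
    ]

def in_gamma(template, d, point):
    """Membership test for the concretization: does T x <= d hold?"""
    return all(
        sum(t * x for t, x in zip(row, point)) <= bound
        for row, bound in zip(template, d)
    )

# Interval template for two variables: x1 <= d1, -x1 <= d2, x2 <= d3, -x2 <= d4.
INTERVALS = [(1, 0), (-1, 0), (0, 1), (0, -1)]
# Adding these rows gives an octagon-style template.
OCTAGONS = INTERVALS + [(1, 1), (1, -1), (-1, 1), (-1, -1)]

DIAGONAL = [(0, 0), (1, 1), (2, 2)]
```

With the interval template, the abstraction of `DIAGONAL` is the box $[0,2] \times [0,2]$, whose concretization contains the spurious point `(2, 0)`; with the octagon-style template the row $x_1 - x_2 \leq 0$ excludes it.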

In a first stage we restrict our considerations to sequential and merge-simple statements. Even for these statements we avoid unnecessary imprecision, if we abstract such statements en bloc instead of abstracting each elementary statement separately:

Example 2.

In this example we use the interval domain as abstract domain, i.e., our complete lattice consists of all -dimensional closed real intervals. Our affine program will use variables, i.e., . The complete lattice of all -dimensional closed real intervals can be specified through the template constraint matrix , where denotes the identity matrix. Consider the statements , , and and the abstract value (a -dimensional closed real interval). The interval can w.r.t.  be identified with the abstract value . More generally, w.r.t.  every -dimensional closed real interval can be identified with the abstract value . If we abstract each elementary statement separately, then we in fact use instead of to abstract the collecting semantics of the statement . The following calculation shows that this can be important: The imprecision is caused by the additional abstraction. We lose the information that the values of the program variables and are equal after executing the first statement. ∎
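The effect described in this example can be reproduced with a small computation. The sketch below is an illustration under assumptions: the two assignments `s1` ($x_2 := x_1$) and `s2` ($x_1 := x_1 - x_2$) and the start box $[0,1] \times [0,1]$ are chosen here for illustration and are not claimed to be the exact data of the example. The best interval transformer of an affine assignment on a bounded box is obtained as the bounding box of the images of the box's vertices.

```python
from itertools import product

def bounding_box(points):
    """Smallest box (one (lo, hi) pair per coordinate) containing all points."""
    return tuple((min(c), max(c)) for c in zip(*points))

def vertices(box):
    """All corner points of a bounded box."""
    return list(product(*box))

def abstract_assign(update, box):
    """Best interval transformer of an affine assignment on a bounded box:
    an affine image of a box attains its coordinate extrema at vertices."""
    return bounding_box([update(v) for v in vertices(box)])

def s1(x):
    return (x[0], x[0])            # hypothetical s1: x2 := x1

def s2(x):
    return (x[0] - x[1], x[1])     # hypothetical s2: x1 := x1 - x2

def stepwise(box):
    """Abstract after each elementary statement separately."""
    return abstract_assign(s2, abstract_assign(s1, box))

def en_bloc(box):
    """Abstract the sequential statement s1; s2 as a whole."""
    return abstract_assign(lambda x: s2(s1(x)), box)

START = ((0, 1), (0, 1))
```

Here `en_bloc(START)` keeps the fact that $x_1 = 0$ after `s1; s2`, whereas `stepwise(START)` only yields $x_1 \in [-1, 1]$, because the intermediate box forgets that $x_1 = x_2$.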

Another possibility for avoiding unnecessary imprecision in the above example would consist in adding additional rows to the template constraint matrix. Although this works for the above example, it does not work in general, since still only convex sets can be described, but sometimes non-convex sets are required (cf. the example in the introduction).

Provided that $s$ is a merge-simple statement, $[\![s]\!]^\sharp d$ can be computed in polynomial time through linear programming:

Lemma 3 (Merge-Simple Statements).

Let $s$ be a merge-simple statement and $d \in \overline{\mathbb{R}}^m$. Then $[\![s]\!]^\sharp d$ can be computed in polynomial time through linear programming. ∎

However, the situation for arbitrary statements is significantly more difficult, since, by reducing SAT to the corresponding decision problem, we can show the following:

Lemma 4.

The problem of deciding, whether or not, for a given template constraint matrix $T$, and a given statement $s$, $[\![s]\!]^\sharp \infty > -\infty$ holds, is NP-complete.

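Before proving the above lemma, we introduce $\mid$-strategies for statements as follows:

Definition 1 ($\mid$-Strategies for Statements).

A $\mid$-strategy $\pi$ for a statement $s$ is a function that maps every position of a $\mid$-statement (a statement of the form $s_0 \mid s_1$) within $s$ to $0$ or $1$. The application $s\pi$ of a $\mid$-strategy $\pi$ to a statement $s$ is inductively defined by $s\pi = s$, $(s_0 \mid s_1)\pi = s_{\pi(\mathrm{pos}(s_0 \mid s_1))}\pi$, and $(s_0; s_1)\pi = (s_0\pi; s_1\pi)$, where $s$ is an elementary statement, and $s_0, s_1$ are arbitrary statements. For all occurrences $s'$, $\mathrm{pos}(s')$ denotes the position of $s'$, i.e., $\mathrm{pos}(s')$ identifies the occurrence. ∎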

Proof.

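Firstly, we show containment in NP. Assume $[\![s]\!]^\sharp \infty > -\infty$. There exists some $k$ such that the $k$-th component of $[\![s]\!]^\sharp \infty$ is greater than $-\infty$. We choose $k$ non-deterministically. There exists a $\mid$-strategy $\pi$ for $s$ such that the $k$-th component of $[\![s\pi]\!]^\sharp \infty$ equals the $k$-th component of $[\![s]\!]^\sharp \infty$. We choose such a $\mid$-strategy non-deterministically. By Lemma 3, we can check in polynomial time, whether the $k$-th component of $[\![s\pi]\!]^\sharp \infty$ is greater than $-\infty$. If this is fulfilled, we accept.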

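In order to show NP-hardness, we reduce the NP-hard problem SAT to our problem. Let $\Phi$ be a propositional formula with $n$ variables. W.l.o.g. we assume that $\Phi$ is in normal form, i.e., there are no negated sub-formulas that contain $\wedge$ or $\vee$. We define the statement $s(\Phi)$ that uses the variables of $\Phi$ as program variables inductively by $s(z) := z = 1$, $s(\neg z) := z = 0$, $s(\Phi_1 \wedge \Phi_2) := s(\Phi_1); s(\Phi_2)$, and $s(\Phi_1 \vee \Phi_2) := s(\Phi_1) \mid s(\Phi_2)$, where $z$ is a variable of $\Phi$, and $\Phi_1, \Phi_2$ are formulas. Here, the statement $Ax = b$ is an abbreviation for the statement $Ax \leq b; -Ax \leq -b$. The formula $\Phi$ is satisfiable iff $[\![s(\Phi)]\!]\mathbb{R}^n \neq \emptyset$ holds. Moreover, even if we just use the interval domain, $[\![s(\Phi)]\!]\mathbb{R}^n \neq \emptyset$ holds iff $[\![s(\Phi)]\!]^\sharp \infty > -\infty$ holds. Thus, $\Phi$ is satisfiable iff $[\![s(\Phi)]\!]^\sharp \infty > -\infty$ holds. ∎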
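The application of a strategy to a statement, as introduced in Definition 1, can be sketched as follows. Statements are encoded as nested tuples, and a position is encoded as the path of child indices from the root; both encodings, the function names `apply_strategy` and `is_sequential`, and the sample statement (the loop body of the running C example, with the strict test written as `x2 <= -1`, an assumption valid for integer values) are illustrative choices, not the article's notation.

```python
def apply_strategy(stmt, strategy, pos=()):
    """Resolve every merge on the chosen path according to strategy[pos]
    (0 selects the left alternative, 1 the right one)."""
    kind = stmt[0]
    if kind == "elem":
        return stmt
    if kind == "seq":
        return ("seq",
                apply_strategy(stmt[1], strategy, pos + (0,)),
                apply_strategy(stmt[2], strategy, pos + (1,)))
    if kind == "merge":
        choice = strategy[pos]
        return apply_strategy(stmt[1 + choice], strategy, pos + (choice,))
    raise ValueError(f"unknown statement kind: {kind}")

def is_sequential(stmt):
    """True iff the statement contains no merge operator."""
    if stmt[0] == "elem":
        return True
    if stmt[0] == "seq":
        return is_sequential(stmt[1]) and is_sequential(stmt[2])
    return False

BODY = ("seq",
        ("elem", "x2 := -x1"),
        ("merge",
         ("seq", ("elem", "x2 <= -1"), ("elem", "x1 := -2*x1")),
         ("seq", ("elem", "-x2 <= 0"), ("elem", "x1 := -x1 + 1"))))

THEN_BRANCH = apply_strategy(BODY, {(1,): 0})
ELSE_BRANCH = apply_strategy(BODY, {(1,): 1})
```

Applying a strategy always yields a sequential statement, which by Lemma 3 can then be handled by linear programming; the difficulty lies entirely in choosing the strategy.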

Obviously, $[\![(s_1 \mid s_2); s]\!] = [\![s_1; s \mid s_2; s]\!]$ and $[\![s; (s_1 \mid s_2)]\!] = [\![s; s_1 \mid s; s_2]\!]$ for all statements $s, s_1, s_2$. We can transform any statement into an equivalent merge-simple statement using these rules. We denote the merge-simple statement that is obtained from an arbitrary statement $s$ by applying the above rules in some canonical way by $[s]$. Intuitively, $[s]$ is an explicit enumeration of all paths through the statement $s$.

Lemma 5.

For every statement $s$, $[s]$ is merge-simple, and $[\![s]\!] = [\![[s]]\!]$. The size of $[s]$ is at most exponential in the size of $s$. ∎

However, in the worst case, the size of is exponential in the size of . For the statement , for instance, we get After replacing all statements with it is in principle possible to use the methods of Gawlitza and Seidl (2007a) in order to compute the abstract semantics precisely. Because of the exponential blowup, however, this method would be impractical in most cases. 3
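The blowup is easy to observe on a sketch that enumerates all paths explicitly. The tuple encoding of statements and the family `diamond_chain(k)`, a sequence of $k$ two-way merges of elementary statements, are hypothetical and chosen here only for illustration.

```python
def paths(stmt):
    """All merge-free paths through stmt, each as a list of elementary
    statement labels (an explicit merge-simple enumeration)."""
    kind = stmt[0]
    if kind == "elem":
        return [[stmt[1]]]
    if kind == "seq":
        return [p + q for p in paths(stmt[1]) for q in paths(stmt[2])]
    if kind == "merge":
        return paths(stmt[1]) + paths(stmt[2])
    raise ValueError(f"unknown statement kind: {kind}")

def diamond_chain(k):
    """The statement (a1 | b1); ...; (ak | bk), built right-associatively."""
    stmt = ("merge", ("elem", f"a{k}"), ("elem", f"b{k}"))
    for i in range(k - 1, 0, -1):
        stmt = ("seq", ("merge", ("elem", f"a{i}"), ("elem", f"b{i}")), stmt)
    return stmt
```

The statement `diamond_chain(k)` has size linear in `k`, but `paths(diamond_chain(k))` has `2 ** k` entries, so an explicit enumeration is already out of reach for a few dozen consecutive if-then-else statements.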

Our new method that we are going to present avoids this exponential blowup: instead of enumerating all program paths, we shall visit them only as needed. Guided by a SAT modulo real linear arithmetic solver, our method selects a path through only when it is locally profitable in some sense. In the worst case, an exponential number of paths may be visited (Section 7); but one can hope that this does not happen in many practical cases, in the same way that SAT and SMT solving perform well on many practical cases even though they in principle may visit an exponential number of cases.

2.7 Abstract Semantic Equations

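The first step of our method consists of rewriting our program analysis problem into a system of abstract semantic equations that is interpreted over the reals. For that, let $G = (N, E, \mathbf{st})$ be an affine program and $V^\sharp$ its abstract semantics. We define the system $\mathcal{C}(G)$ of abstract semantic inequalities to be the smallest set of inequalities that fulfills the following constraints:

  • $\mathcal{C}(G)$ contains the inequality $\mathbf{x}_{\mathbf{st},i} \geq \alpha_{i\cdot}(\mathbb{R}^n)$ for every $i \in \{1, \ldots, m\}$.

  • $\mathcal{C}(G)$ contains the inequality $\mathbf{x}_{v,i} \geq [\![s]\!]^\sharp_{i\cdot}(\mathbf{x}_{u,1}, \ldots, \mathbf{x}_{u,m})$ for every control-flow edge $(u, s, v) \in E$ and every $i \in \{1, \ldots, m\}$.

We define the system $\mathcal{E}(G)$ of abstract semantic equations by $\mathcal{E}(G) := \mathcal{E}(\mathcal{C}(G))$. Here, for a system $\mathcal{C} = \{\mathbf{x}_1 \geq e_{1,1}, \ldots, \mathbf{x}_1 \geq e_{1,k_1}, \ldots, \mathbf{x}_n \geq e_{n,1}, \ldots, \mathbf{x}_n \geq e_{n,k_n}\}$ of inequalities, $\mathcal{E}(\mathcal{C})$ is the system $\{\mathbf{x}_1 = e_{1,1} \vee \cdots \vee e_{1,k_1}, \ldots, \mathbf{x}_n = e_{n,1} \vee \cdots \vee e_{n,k_n}\}$ of equations. The system $\mathcal{E}(G)$ of abstract semantic equations captures the abstract semantics $V^\sharp$ of $G$:

Lemma 6.

$(V^\sharp[v])_{i\cdot} = \mu[\![\mathcal{E}(G)]\!](\mathbf{x}_{v,i})$ for all program points $v$, $i \in \{1, \ldots, m\}$. ∎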

Example 7 (Abstract Semantic Equations).

We again consider the program of Example 1. Assume that the template constraint matrix is given by and . Let denote the abstract semantics of . Then . consists of the following abstract semantic equations:

As stated by Lemma 6, we have , and . ∎

3 A Lower Bound on the Complexity

In this section we show that the problem of computing abstract semantics of affine programs w.r.t. the interval domain is $\Pi^p_2$-hard. $\Pi^p_2$-hard problems are conjectured to be harder than both NP-complete and co-NP-complete problems. For further information regarding the polynomial-time hierarchy see e.g. Stockmeyer (1976).

Theorem 8.

The problem of deciding, whether, for a given program $G$, a given template constraint matrix $T$, and a given program point $v$, $V^\sharp[v] > -\infty$ holds, is $\Pi^p_2$-hard.

Proof.

We reduce the $\Pi^p_2$-complete problem of deciding the truth of a $\forall\exists$ propositional formula (Wrathall, 1976) to our problem. Let $\Phi = \forall x_1, \ldots, x_n . \exists y_1, \ldots, y_m . \Phi'$ be a formula without free variables, where $\Phi'$ is a propositional formula. We consider the affine program , with program variables , where , and with

The statement is defined as in the proof of Lemma 4.

In intuitive terms: this program initializes to . Then, it enters a loop: it computes into the binary decomposition of , then it attempts to nondeterministically choose so that is true. If this is possible, it increments by one and loops. Otherwise, it just loops. Thus, there is a terminating computation iff holds.

Then holds iff . For the abstraction, we consider the interval domain. By considering the Kleene-Iteration, it is easy to see that holds iff holds. Thus holds iff holds. ∎∎

4 Determining Improved Strategies

In this section we develop a method for computing local improvements of strategies through solving SAT modulo real linear arithmetic formulas.

In order to decide, whether or not, for a given statement , a given , a given , and a given , holds, we construct the following SAT modulo real linear arithmetic formula (we use existential quantifiers to improve readability):

Here, is a formula that relates every with all elements from the set . It is defined inductively over the structure of as follows:

Here, for every position of a subexpression of , is a Boolean variable. Let denote the set of all positions of -subexpressions of . The set of free variables of the formula is . A valuation for the variables from the set describes a path through . We have:

Lemma 9.

holds iff is satisfiable. ∎

Our next goal is to compute a -strategy for such that holds, provided that holds. Let be a statement, , , and . Assume that holds. By Lemma 9, there exists a model of . We define the -strategy for by for all . By again applying Lemma 9, we get . Summarizing we have:

Lemma 10.

By solving the SAT modulo real linear arithmetic formula that can be obtained from in linear time, we can decide, whether or not holds. From a model of this formula, we can obtain a -strategy for such that holds in linear time. ∎

Figure 2: Formula for Example 11

Example 11.

We continue Examples 1 and 7. We want to know, whether holds. For that we compute a model of the formula which is written down in Figure 2. is a model of the formula