Improving Strategies via SMT Solving
Abstract
We consider the problem of computing numerical invariants of programs by abstract interpretation. Our method eschews two traditional sources of imprecision: (i) the use of widening operators for enforcing convergence within a finite number of iterations, and (ii) the use of merge operations (often, convex hulls) at the merge points of the control flow graph. It instead computes the least inductive invariant expressible in the domain at a restricted set of program points, and analyzes the rest of the code en bloc. We emphasize that we compute this inductive invariant precisely. For that we extend the strategy improvement algorithm of Gawlitza and Seidl (2007a). If we applied their method directly, we would have to solve an exponentially sized system of abstract semantic equations, resulting in memory exhaustion. Instead, we keep the system implicit and discover strategy improvements using SAT modulo real linear arithmetic (SMT). For evaluating strategies we use linear programming. Our algorithm has low polynomial space complexity and performs, on contrived examples, exponentially many strategy improvement steps in the worst case; this is unsurprising, since we show that the associated abstract reachability problem is complete for the second level of the polynomial-time hierarchy.
1 Introduction
1.1 Motivation
Static program analysis attempts to derive properties about the runtime behavior of a program without running it. Among the interesting properties are numerical ones: for instance, that a given variable always has a value within certain bounds when reaching a given program point. An analysis solely based on such interval relations at all program points is known as interval analysis Cousot and Cousot (1976). More refined numerical analyses include, for instance, finding for each program point an enclosing polyhedron for the vector of program variables Cousot and Halbwachs (1978). In addition to obtaining facts about the values of numerical program variables, numerical analyses are used as building blocks for e.g. pointer and shape analyses.
However, by Rice’s theorem, only trivial properties can be checked automatically H. G. Rice (1953). In order to check nontrivial properties we are usually forced to use abstractions. A systematic way for inferring properties automatically w.r.t. a given abstraction is given through the abstract interpretation framework of Cousot and Cousot (1977). This framework safely overapproximates the runtime behavior of a program.
When using the abstract interpretation framework, we usually have two sources of imprecision. The first source of imprecision is the abstraction itself: for instance, if the property to be proved needs a nonconvex invariant to be established, and our abstraction can only represent convex sets, then we cannot prove the property. Take for instance the C code y = 0; if (x <= -1 || x >= 1) { if (x == 0) y = 1; }. No matter what the values of the variables x and y are before the execution of this code, after the execution the value of y is 0. The invariant in the "then" branch (x <= -1 or x >= 1) is not convex, and its convex hull includes x = 0. Any static analysis method that computes a convex invariant in this branch will thus also include the spurious outcome y = 1. In contrast, our method avoids enforcing convexity, except at the heads of loops.
The second source of imprecision is the safe but imprecise methods that are used for solving the abstract semantic equations that describe the abstract semantics: such methods safely overapproximate exact solutions, but do not return exact solutions in all cases. The reason is that we are concerned with abstract domains that contain infinite ascending chains, in particular if we are interested in numerical properties: the complete lattice of all n-dimensional closed real intervals, used for interval analysis, is an example. The traditional methods are based on Kleene fixpoint iteration, which (purely applied) is not guaranteed to terminate in interesting cases. In order to enforce termination (at the price of imprecision), traditional methods make use of the widening/narrowing approach of Cousot and Cousot (1977). Grossly, widening extrapolates the first iterations of a sequence to a possible limit, but can easily overshoot the desired result. In order to avoid this, various tricks are used, including "widening up to" (Halbwachs, 1993, Sec. 3.2), "delayed" widening, or widening with "thresholds" (Blanchet et al., 2003). However, these tricks, although they may help in many practical cases, are easily thwarted. Gopan and Reps (2006) proposed "lookahead widening", which discovers new feasible paths and adapts widening accordingly; again, this method is no panacea. Furthermore, analyses involving widening are non-monotonic: stronger preconditions can lead to weaker invariants being automatically inferred, a rather nonintuitive behaviour. Since our method does not use widening at all, it avoids these problems.
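To illustrate the overshoot, here is a minimal Python sketch of the standard textbook interval widening operator (a generic illustration, not this paper's formalism), applied to the counter of a simple bounded loop:

```python
# A minimal sketch of interval widening, with intervals encoded as
# (lo, hi) pairs and None standing for an infinite bound.

def widen(a, b):
    """Standard interval widening: any unstable bound jumps to infinity."""
    (alo, ahi), (blo, bhi) = a, b
    lo = alo if (alo is not None and blo is not None and blo >= alo) else None
    hi = ahi if (ahi is not None and bhi is not None and bhi <= ahi) else None
    return (lo, hi)

# Analyzing a loop such as `for (i = 0; i < 10; i++)`: after one loop
# iteration the interval for i grows from [0,0] to [0,1]; widening then
# extrapolates the unstable upper bound straight to +infinity, losing
# the bound i <= 10 unless a narrowing pass recovers it.
print(widen((0, 0), (0, 1)))  # -> (0, None), i.e. [0, +inf)
```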
1.2 Our Contribution
We fight both sources of imprecision noted above:

In order to improve the precision of the abstraction, we abstract sequences of if-then-else statements without loops en bloc. In the above example, we are then able to conclude that y = 0 holds. In other words: we abstract sets of states only at the heads of loops or, more generally, at a cut-set of the control-flow graph (a cut-set is a set of program points whose removal cuts all loops).

Our main technical contribution consists of a practical method for precisely computing the abstract semantics of affine programs w.r.t. the template linear constraint domains of Sankaranarayanan et al. (2005), with loop-free sequences of if-then-else statements abstracted en bloc. Our method is based on a strict generalization of the strategy improvement algorithm of Gawlitza and Seidl (2007b, a, 2010). The latter algorithm could be directly applied to the problem we solve in this article, but the size of its input would be exponential in the size of the program, because we would then need to explicitly enumerate all program paths between cut-nodes which do not cross other cut-nodes. In this article, we give an algorithm with low polynomial memory consumption that uses exponential time in the worst case. The basic idea consists in avoiding an explicit enumeration of all paths through loop-free sequences of if-then-else statements. Instead, we use a SAT modulo real linear arithmetic solver for improving the current strategy locally. For evaluating each strategy encountered during the strategy iteration, we use linear programming.

As a byproduct of our considerations we show that the corresponding abstract reachability problem is complete for the second level of the polynomial-time hierarchy. In fact, we show that it is hard even if the loop invariant being computed consists in a single inequality bounding one program variable, the bound being the parameter of the invariant. Hence, exponential worst-case running-time seems to be unavoidable.
1.3 Related Work
Recently, several alternative approaches for computing numerical invariants (for instance w.r.t. template linear constraints) were developed:
Strategy Iteration
Strategy iteration (also called policy iteration) was introduced by Howard for solving stochastic control problems Howard (1960); Puterman (1994) and is also applied to two-player zero-sum games Hoffman and Karp (1966); Puri (1995); Vöge and Jurdziński (2000) or min-max-plus systems Cochet-Terrasson et al. (1999). Costan et al. (2005); Gaubert et al. (2007); Adjé et al. (2010) developed a strategy iteration approach for solving the abstract semantic equations that occur in static program analysis by abstract interpretation. Their approach can be seen as an alternative to the traditional widening/narrowing approach. The goal of their algorithm is to compute the least fixpoint of a monotone self-map that is the pointwise minimum of a family of simpler self-maps; the assumption is that one can efficiently compute the least fixpoint of every member of this family. These members are the (min-)strategies. Starting with an arbitrary min-strategy, the current min-strategy is successively improved. The sequence of attained min-strategies yields a decreasing sequence of fixpoints that stabilizes as soon as a fixpoint of the original self-map is reached (not necessarily the least one). However, there are indeed important cases where minimality of the obtained fixpoint can be guaranteed Adjé et al. (2008). Moreover, an important advantage of their algorithm is that it can be stopped at any time with a safe overapproximation. This is in particular interesting if there are infinitely many min-strategies Adjé et al. (2010). Costan et al. (2005) showed how to use their framework for performing interval analysis without widening. Gaubert et al. (2007) extended this work to the following relational abstract domains: the zone domain Miné (2001a), the octagon domain Miné (2001b) and in particular the template linear constraint domains Sankaranarayanan et al. (2005). Gawlitza and Seidl (2007a) presented a practical (max-)strategy improvement algorithm for computing least solutions of systems of rational equations.
Their algorithm enables them to perform template linear constraint analysis precisely, even if the involved mappings are not non-expansive. This means: their algorithm always computes least solutions of abstract semantic equations, not just some solutions.
Acceleration Techniques
Gonnord and Halbwachs (2006); Gonnord (2007) investigated an improvement of linear relation analysis that consists in computing, when possible, the exact (abstract) effect of a loop. The technique is fully compatible with the use of widening, and whenever it applies, it improves both the precision and the performance of the analysis. Leroux and Sutre (2007); Gawlitza et al. (2009) studied cases where interval analysis can be done in polynomial time w.r.t. a uniform cost measure, where memory accesses and arithmetic operations are counted with unit cost.
Quantifier Elimination
Recent improvements in SAT/SMT solving techniques have made it possible to perform quantifier elimination on larger formulas (Monniaux, 2008). Monniaux (2009) developed an analysis method based on quantifier elimination in the theory of rational linear arithmetic. This method targets the same domains as the present article; it however produces a richer result. It can not only compute the least invariant inside the abstract domain of a loop, but also express it as a function of the precondition of the loop; the method outputs the source code of the optimal abstract transformer mapping the precondition to the invariant. Its drawback is its high cost, which makes it practical only on small code fragments; thus, its intended application is modular analysis: analyze very precisely small portions of code (functions, modules, nodes of a reactive dataflow program, …), and use the results for analyzing larger portions, perhaps with another method, including the method proposed in this article.
Mathematical Programming
Colón et al. (2003); Sankaranarayanan et al. (2004); Cousot (2005) presented approaches for generating linear invariants that use nonlinear constraint solving. Leconte et al. (2009) propose a mathematical programming formulation whose constraints define the space of all postsolutions of the abstract semantic equations. The objective function aims at minimizing the result. For programs that use affine assignments and affine guards only, this yields a mixed integer linear programming formulation for interval analysis. The resulting mathematical programming problems can then be solved to guaranteed global optimality by means of general-purpose branch-and-bound type algorithms.
2 Basics
2.1 Notations
𝔹 denotes the set of Boolean values. The set of real numbers is denoted by ℝ; the complete linearly ordered set ℝ ∪ {−∞, ∞} is denoted by ℝ̄. We call two vectors x, y comparable iff x ≤ y or y ≤ x holds. For with , we set and . We denote the i-th row (resp. the j-th column) of a matrix A by A_{i·} (resp. A_{·j}). Accordingly, A_{i·j} denotes the component in the i-th row and the j-th column. We also use this notation for vectors and mappings.
Assume that a fixed set X of variables and a domain D is given. We consider equations of the form x = e, where x ∈ X is a variable and e is an expression over X. A system E of (fixpoint) equations is a finite set {x₁ = e₁, …, xₙ = eₙ} of equations, where x₁, …, xₙ are pairwise distinct variables. We denote the set of variables occurring in E by X_E. We drop the subscript whenever it is clear from the context.
For a variable assignment ρ, an expression e is mapped to a value ⟦e⟧ρ by setting ⟦x⟧ρ := ρ(x) and ⟦f(e₁, …, e_k)⟧ρ := f(⟦e₁⟧ρ, …, ⟦e_k⟧ρ), where x is a variable, f is a k-ary operator (for instance +), and e₁, …, e_k are expressions. Let E be a system of equations. We define the unary operator ⟦E⟧ on variable assignments by setting (⟦E⟧ρ)(x) := ⟦e⟧ρ for every equation x = e of E. A solution is a variable assignment ρ such that ⟦E⟧ρ = ρ holds. The set of solutions is denoted by Sol(E).
Let D be a complete lattice. We denote the least upper bound and the greatest lower bound of a set X ⊆ D by ⋁X and ⋀X, respectively. The least element ⋁∅ (resp. the greatest element ⋀∅) is denoted by ⊥ (resp. ⊤). We define the binary operators ∨ and ∧ by x ∨ y := ⋁{x, y} and x ∧ y := ⋀{x, y} for all x, y ∈ D, respectively. We will also consider iterated applications of ∨ and ∧ as applications of k-ary operators. This causes no problems, since the binary operators ∨ and ∧ are associative and commutative. An expression e (resp. an equation x = e) is called monotone iff all operators occurring in e are monotone.
The set of all variable assignments is a complete lattice. For variable assignments ρ, ρ′, we write ρ < ρ′ (resp. ρ > ρ′) iff ρ(x) < ρ′(x) (resp. ρ(x) > ρ′(x)) holds for all variables x. For d ∈ D, d also denotes the constant variable assignment that maps every variable to d. A variable assignment ρ with ⊥ < ρ < ⊤ is called finite. A presolution (resp. postsolution) is a variable assignment ρ such that ρ ≤ ⟦E⟧ρ (resp. ρ ≥ ⟦E⟧ρ) holds. The set of all presolutions (resp. the set of all postsolutions) is denoted by PreSol(E) (resp. PostSol(E)). The least fixpoint (resp. the greatest fixpoint) of an operator f is denoted by μf (resp. νf), provided that it exists. Thus, the least solution (resp. the greatest solution) of a system E of equations is denoted by μ⟦E⟧ (resp. ν⟦E⟧), provided that it exists. For a presolution ρ (resp. for a postsolution ρ), μ_{≥ρ}⟦E⟧ (resp. ν_{≤ρ}⟦E⟧) denotes the least solution that is greater than or equal to ρ (resp. the greatest solution that is less than or equal to ρ). From the Knaster–Tarski fixpoint theorem we get: every system E of monotone equations over a complete lattice has a least solution μ⟦E⟧ and a greatest solution ν⟦E⟧. Furthermore, μ⟦E⟧ = ⋀PostSol(E) and ν⟦E⟧ = ⋁PreSol(E).
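As a concrete illustration of least fixpoints, the following Python sketch computes the least fixpoint of a monotone map by Kleene iteration over a finite powerset lattice, where the ascending chain is guaranteed to stabilize (this is a generic textbook construction, not this article's algorithm):

```python
def lfp(f, bottom):
    """Kleene iteration from the least element; terminates only when the
    lattice has no infinite ascending chains, as is the case here."""
    x = bottom
    while True:
        y = f(x)
        if y == x:
            return x
        x = y

# Example: reachable states of a tiny transition system, as the least
# fixpoint of a monotone map over the powerset lattice of {0, 1, 2, 3}.
succ = {0: {1}, 1: {2}, 2: {1}, 3: {3}}
f = lambda s: frozenset({0}) | frozenset(t for u in s for t in succ[u])
print(sorted(lfp(f, frozenset())))  # -> [0, 1, 2]
```

State 3 is not reachable from 0, so the least fixpoint excludes it even though {0, 1, 2, 3} is also a fixpoint; Knaster–Tarski guarantees the least one is the meet of all postsolutions.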
2.2 Linear Programming
We consider linear programming problems (LP problems for short) of the form sup {c⊤x | x ∈ ℝⁿ, Ax ≤ b}, where A ∈ ℝ^{m×n}, b ∈ ℝ^m, and c ∈ ℝⁿ are the inputs. The closed convex polyhedron {x ∈ ℝⁿ | Ax ≤ b} is called the feasible space. The LP problem is called infeasible iff the feasible space is empty. An element of the feasible space is called a feasible solution. A feasible solution that maximizes c⊤x is called an optimal solution.
LP problems can be solved in polynomial time through interior point methods Megiddo (1987); Schrijver (1986). Note, however, that the running-time then crucially depends on the sizes of the occurring numbers. At the danger of an exponential running-time in contrived cases, we can instead rely on the simplex algorithm: its running-time is uniform, i.e., independent of the sizes of the occurring numbers (given that arithmetic operations, comparisons, storage and retrieval of numbers are counted with unit cost).
2.3 SAT modulo real linear arithmetic
The set of SAT modulo real linear arithmetic formulas is defined through a grammar of linear-arithmetic atoms and Boolean connectives. Here, c is a constant, x is a real-valued variable, e ranges over real-valued linear expressions, b is a Boolean variable, and Φ₁, Φ₂ are formulas. An interpretation I for a formula Φ is a mapping that assigns a real value to every real-valued variable and a Boolean value to every Boolean variable. We write I ⊨ Φ for "I is a model of Φ"; the model relation is defined by induction on the structure of Φ in the expected way:
A formula is called satisfiable iff it has a model. The problem of deciding whether or not a given SAT modulo real linear arithmetic formula is satisfiable is NP-complete. There nevertheless exist efficient solver implementations for this decision problem Dutertre and de Moura (2006b).
In order to simplify notations, we also allow matrices, vectors, derived operations and comparisons, and the Boolean constants to occur in formulas.
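The model relation above can be made concrete with a small evaluator; the tagged-tuple encoding of formulas below is our own illustrative choice, not the article's notation:

```python
# A minimal evaluator for the model relation I |= phi: linear atoms
# ('le', coeffs, const, vars) mean sum(c * I[v]) <= const; Boolean
# variables are ('bvar', name); connectives are ('not', f),
# ('and', f, g) and ('or', f, g).

def models(I, phi):
    tag = phi[0]
    if tag == 'le':
        _, coeffs, const, vs = phi
        return sum(c * I[v] for c, v in zip(coeffs, vs)) <= const
    if tag == 'bvar':
        return I[phi[1]]
    if tag == 'not':
        return not models(I, phi[1])
    if tag == 'and':
        return models(I, phi[1]) and models(I, phi[2])
    if tag == 'or':
        return models(I, phi[1]) or models(I, phi[2])
    raise ValueError(tag)

# phi = (x + 2y <= 3) and (b or not(x <= 0))
phi = ('and', ('le', (1, 2), 3, ('x', 'y')),
       ('or', ('bvar', 'b'), ('not', ('le', (1,), 0, ('x',)))))
print(models({'x': 1, 'y': 1, 'b': False}, phi))  # -> True
```

Deciding whether such a formula has *some* model is the NP-complete satisfiability problem; the evaluator above only checks one given interpretation.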
2.4 Collecting and Abstract Semantics
The programs that we consider in this article use real-valued variables. Accordingly, we denote by x the vector of all program variables. For simplicity, we only consider elementary statements of the form x := Ax + b and Ax ≤ b, where A is a matrix and b a vector of matching dimension. Statements of the first form are called (affine) assignments. Statements of the second form are called (affine) guards. Additionally, composite statements are built from two operators: sequential composition and nondeterministic choice. The composition operator binds tighter than the choice operator, and both operators are right-associative. The set of statements is denoted by Stmt. A statement that is a choice between statements none of which contains the choice operator is called merge-simple. A merge-simple statement that does not use the choice operator at all is called sequential. A statement is called elementary iff it contains neither the composition operator nor the choice operator.
The collecting semantics of a statement is defined by
Note that both the composition operator and the choice operator are associative w.r.t. the collecting semantics, i.e., regrouping nested compositions or nested choices does not change the collecting semantics of a statement.
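The collecting semantics of composite statements can be illustrated on finite sets of states; the tagged-tuple encoding below is an assumed illustration, with nondeterministic choice interpreted as the union of both arms:

```python
# Statements as tagged tuples: ('asgn', f), ('guard', p),
# ('seq', s1, s2), ('choice', s1, s2). States are (x, y) pairs.

def sem(stmt, states):
    tag = stmt[0]
    if tag == 'asgn':    # assignment, applied pointwise
        return {stmt[1](s) for s in states}
    if tag == 'guard':   # keep only the states satisfying the guard
        return {s for s in states if stmt[1](s)}
    if tag == 'seq':     # sequential composition
        return sem(stmt[2], sem(stmt[1], states))
    if tag == 'choice':  # nondeterministic branching: union of both arms
        return sem(stmt[1], states) | sem(stmt[2], states)
    raise ValueError(tag)

# The introduction's example (whose guard we read as x <= -1 || x >= 1),
# encoded with an explicit else-branch for each `if`:
prog = ('seq', ('asgn', lambda s: (s[0], 0)),
        ('choice',
         ('seq', ('guard', lambda s: s[0] <= -1 or s[0] >= 1),
          ('choice', ('seq', ('guard', lambda s: s[0] == 0),
                      ('asgn', lambda s: (s[0], 1))),
           ('guard', lambda s: s[0] != 0))),
         ('guard', lambda s: -1 < s[0] < 1)))
print(sorted(sem(prog, {(-2, 9), (0, 9), (2, 9)})))
# -> [(-2, 0), (0, 0), (2, 0)]: y is 0 in every reachable state
```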
An (affine) program is a triple G = (N, E, st), where N is a finite set of program points, E ⊆ N × Stmt × N is a finite set of control-flow edges, and st ∈ N is the start program point. As usual, the collecting semantics V of a program G is the least solution of the following constraint system:
Here, the variables V[v], v ∈ N, take values in the powerset of the state space. The components of the collecting semantics V are denoted by V[v] for all v ∈ N.
Let D be a complete lattice (for instance the complete lattice of all n-dimensional closed real intervals), with partial order ≤. Assume that the abstraction α and the concretization γ form a Galois connection, i.e., for all sets X of states and all abstract values d, α(X) ≤ d iff X ⊆ γ(d). The abstract semantics of a statement s is defined as α composed with the collecting semantics of s composed with γ. The abstract semantics V♯ of an affine program G is the least solution of the following constraint system:
Here, the variables V♯[v], v ∈ N, take values in D. The components of the abstract semantics V♯ are denoted by V♯[v] for all v ∈ N. The abstract semantics V♯ safely overapproximates the collecting semantics V, i.e., γ(V♯[v]) ⊇ V[v] for all v ∈ N.
2.5 Using Cut-Sets to Improve Precision
Usually, only sequential statements (these statements correspond to basic blocks) are allowed in control flow graphs. However, given a cut-set, one can systematically transform any control flow graph into an equivalent control flow graph of our form (one with fewer program points than the original) with increased precision of the abstract semantics. However, for the sake of simplicity, we do not discuss these aspects in detail. Instead, we consider an example:


[Figure 1: (a) the affine program abstracting the running example; (b) the control-flow graph rewritten w.r.t. the cut-set.]
Example 1 (Using Cut-Sets to Improve Precision).
As a running example throughout the present article we use the following C code:
This C code is abstracted through the affine program shown in Figure 1(a). However, it is unnecessary to apply abstraction at every program point; it suffices to apply abstraction at a cut-set. Since all loops pass through one common program point, this program point by itself forms a cut-set. Applying abstraction only at this program point is equivalent to rewriting the control-flow graph w.r.t. the cut-set into a control-flow graph that is equivalent w.r.t. the collecting semantics. The result of this transformation is drawn in Figure 1(b). This means: the affine program for the above C code is the rewritten program, where
Let V denote the collecting semantics of the original program and V′ the collecting semantics of the rewritten program. V and V′ are equivalent in the following sense: they agree at every program point present in both graphs. W.r.t. the abstract semantics, the rewritten program is, as we will see, strictly more precise than the original one. In general, the rewritten program is at least as precise at all remaining program points. This is independent of the abstract domain.
2.6 Template Linear Constraints
In the present article we restrict our considerations to template linear constraint domains (Sankaranarayanan et al., 2005). Assume that we are given a fixed template constraint matrix T. The template linear constraint domain then consists of the vectors of bounds on the template rows, ordered componentwise. As shown by Sankaranarayanan et al. (2005), the concretization γ and the abstraction α, which are defined by γ(d) := {x | Tx ≤ d} and α(X) := ⋀{d | X ⊆ γ(d)},
form a Galois connection. The template linear constraint domains contain intervals, zones, and octagons, with appropriate choices of the template constraint matrix Sankaranarayanan et al. (2005).
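For intuition, here is a small Python sketch of a template linear constraint domain for a hypothetical five-row template over two variables: alpha of a finite point set is the componentwise supremum of the template rows over the set, and gamma(d) is the polyhedron Tx <= d:

```python
# Template rows bound x, -x, y, -y and x - y (an illustrative choice;
# the first four rows alone would give the interval domain).
T = [(1, 0), (-1, 0), (0, 1), (0, -1), (1, -1)]

def alpha(points):
    """Best abstraction of a finite set of points: for each template
    row, the supremum of the row's value over the set."""
    return [max(sum(t * xi for t, xi in zip(row, x)) for x in points)
            for row in T]

def gamma_contains(d, x):
    """Membership test for the concretization gamma(d) = {x | Tx <= d}."""
    return all(sum(t * xi for t, xi in zip(row, x)) <= di
               for row, di in zip(T, d))

d = alpha([(0, 0), (2, 1)])
print(d)                          # -> [2, 0, 1, 0, 1]
print(gamma_contains(d, (1, 1)))  # -> True (the hull is convex)
```

The Galois connection shows in this sketch: any point of the input set satisfies Tx <= alpha(points), and alpha(points) is the componentwise least such bound vector.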
In a first stage we restrict our considerations to sequential and merge-simple statements. Even for these statements we avoid unnecessary imprecision if we abstract them en bloc instead of abstracting each elementary statement separately:
Example 2.
In this example we use the interval domain as abstract domain, i.e., our complete lattice consists of all n-dimensional closed real intervals. The complete lattice of all n-dimensional closed real intervals can be specified through the template constraint matrix obtained by stacking the identity matrix on top of its negation. Consider a sequential composition of an affine assignment followed by an affine guard, and some closed real interval as the initial abstract value. If we abstract each elementary statement separately, then we in fact compose the abstract semantics of the two elementary statements instead of abstracting the composed statement en bloc. The following calculation shows that this can be important: the imprecision is caused by the additional intermediate abstraction. We lose the information that the values of the two program variables are equal after executing the first statement. ∎
Another possibility for avoiding unnecessary imprecision in the above example would consist in adding additional rows to the template constraint matrix. Although this works for the above example, it does not work in general: still only convex sets can be described, but sometimes nonconvex sets are required (cf. the example in the introduction).
Provided that a statement is merge-simple, its abstract semantics can be computed in polynomial time through linear programming:
Lemma 3 (MergeSimple Statements).
Let s be a merge-simple statement and d an abstract value. Then the abstract semantics of s applied to d can be computed in polynomial time through linear programming. ∎
However, the situation for arbitrary statements is significantly more difficult: by reducing SAT to the corresponding decision problem, we can show the following:
Lemma 4.
The problem of deciding whether, for a given template constraint matrix and a given statement, some component of the statement's abstract semantics exceeds a given bound is NP-complete.
Before proving the above lemma, we introduce strategies for statements as follows:
Definition 1 (Strategies for Statements).
A strategy σ for a statement s is a function that maps every position of a choice statement within s to one of the two arms of that choice. The application σ(s) of a strategy σ to a statement s is defined inductively: elementary statements are left unchanged, σ is applied recursively to both parts of a sequential composition, and a choice statement is replaced by the result of applying σ to the arm that σ selects at that position. Here, the position of an occurrence of a substatement uniquely identifies that occurrence. ∎
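Definition 1 can be illustrated on a small statement tree (the encoding and the position scheme below are illustrative assumptions): applying a strategy keeps exactly one arm of every choice, yielding a sequential statement:

```python
# Positions are paths in the statement tree: () is the root, and the
# k-th child of a node at position p sits at p + (k,). A strategy maps
# each choice position to 'L' or 'R'.

def apply_strategy(stmt, sigma, pos=()):
    tag = stmt[0]
    if tag == 'choice':
        keep = 1 if sigma[pos] == 'L' else 2
        return apply_strategy(stmt[keep], sigma, pos + (keep,))
    if tag == 'seq':
        return ('seq', apply_strategy(stmt[1], sigma, pos + (1,)),
                apply_strategy(stmt[2], sigma, pos + (2,)))
    return stmt  # elementary statement: unchanged

stmt = ('seq', ('choice', ('asgn', 'x:=0'), ('asgn', 'x:=1')),
        ('choice', ('asgn', 'y:=0'), ('asgn', 'y:=1')))
sigma = {(1,): 'L', (2,): 'R'}
print(apply_strategy(stmt, sigma))
# -> ('seq', ('asgn', 'x:=0'), ('asgn', 'y:=1'))
```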
Proof.
Firstly, we show containment in NP. Assume that some component of the abstract value exceeds the bound. We choose such a component nondeterministically. There exists a strategy for the statement such that this component of the abstract semantics of the resulting sequential statement equals the corresponding component of the abstract semantics of the original statement. We choose such a strategy nondeterministically. By Lemma 3, we can then check in polynomial time whether the chosen component exceeds the bound. If this is fulfilled, we accept.
In order to show hardness, we reduce the NP-hard problem SAT to our problem. Let φ be a propositional formula with n variables. W.l.o.g. we assume that φ is in negation normal form, i.e., negations are applied only to variables. We define the statement corresponding to φ, which uses the variables of φ as program variables, by induction on φ: literals become guards, conjunctions become sequential compositions, and disjunctions become choices. The formula φ is satisfiable iff the corresponding reachability condition holds. Moreover, even if we just use the interval domain, this condition holds iff the corresponding abstract condition holds. Thus, φ is satisfiable iff the abstract condition holds. ∎
Obviously, sequential composition distributes over choice on both sides, for all statements. Using these rules, we can transform any statement into an equivalent merge-simple statement, applying the rules in some canonical way. Intuitively, the resulting merge-simple statement is an explicit enumeration of all paths through the original statement.
Lemma 5.
For every statement, the statement obtained by the above rewriting is merge-simple and has the same collecting semantics. Its size is at most exponential in the size of the original statement. ∎
However, in the worst case the size of the rewritten statement is indeed exponential in the size of the original statement: a sequence of n binary choices, for instance, yields an enumeration of 2^n paths. After replacing all statements with their merge-simple counterparts, it is in principle possible to use the methods of Gawlitza and Seidl (2007a) in order to compute the abstract semantics precisely. Because of the exponential blowup, however, this method would be impractical in most cases.
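The blowup is easy to reproduce; in the following sketch (using an illustrative tagged-tuple encoding of statements), a sequence of ten binary choices already enumerates 2^10 paths:

```python
def paths(stmt):
    """All choice-free paths through a statement, each represented as a
    tuple of elementary statements; mirrors the distributivity rewriting."""
    tag = stmt[0]
    if tag == 'choice':
        return paths(stmt[1]) + paths(stmt[2])
    if tag == 'seq':
        return [p + q for p in paths(stmt[1]) for q in paths(stmt[2])]
    return [(stmt,)]  # elementary statement: a single one-step path

# A sequence of n = 10 binary choices.
n = 10
choice = ('choice', ('asgn', 'a'), ('asgn', 'b'))
stmt = choice
for _ in range(n - 1):
    stmt = ('seq', stmt, choice)
print(len(paths(stmt)))  # -> 1024, i.e. 2**10
```

The article's method avoids materializing this list and instead asks an SMT solver for one profitable path at a time.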
Our new method that we are going to present avoids this exponential blowup: instead of enumerating all program paths, we shall visit them only as needed. Guided by a SAT modulo real linear arithmetic solver, our method selects a path through the statement only when it is locally profitable in some sense. In the worst case, an exponential number of paths may be visited (Section 7); but one can hope that this does not happen in many practical cases, in the same way that SAT and SMT solvers perform well on many practical instances even though they may, in principle, explore an exponential number of cases.
2.7 Abstract Semantic Equations
The first step of our method consists of rewriting our program analysis problem into a system of abstract semantic equations that is interpreted over the reals. For that, let an affine program and its abstract semantics be given. We define the system of abstract semantic inequalities to be the smallest set of inequalities that fulfills the following constraints:

contains the inequality for every .

contains the inequality for every controlflow edge and every .
We define the system of abstract semantic equations as the system obtained from the system of abstract semantic inequalities by replacing, for every variable, all inequalities with that variable on the left-hand side by a single equation whose right-hand side is the maximum of the corresponding right-hand sides. The system of abstract semantic equations captures the abstract semantics of the program:
Lemma 6.
for all program points , . ∎
3 A Lower Bound on the Complexity
In this section we show that the problem of computing the abstract semantics of affine programs w.r.t. the interval domain is hard for the second level of the polynomial-time hierarchy. Problems hard for this level are conjectured to be harder than both NP-complete and coNP-complete problems. For further information regarding the polynomial-time hierarchy see e.g. Stockmeyer (1976).
Theorem 8.
The problem of deciding whether, for a given program, a given template constraint matrix, and a given program point, a given bound on the abstract semantics at that program point holds, is hard for the second level of the polynomial-time hierarchy.
Proof.
We reduce the problem of deciding the truth of a quantified propositional formula with one quantifier alternation, which is complete for the second level of the polynomial-time hierarchy (Wrathall, 1976), to our problem. Let ψ be such a formula without free variables, with propositional matrix φ. We consider the affine program described below, whose program variables comprise one variable per propositional variable of φ together with a counter, and with
The statement is defined as in the proof of Lemma 4.
In intuitive terms: this program initializes the counter to 0. Then, it enters a loop: it computes the binary decomposition of the counter into one block of variables, then it attempts to nondeterministically choose the remaining variables so that φ is true. If this is possible, it increments the counter by one and loops. Otherwise, it just loops. Thus, there is a terminating computation iff ψ holds.
Then holds iff . For the abstraction, we consider the interval domain. By considering the Kleene iteration, it is easy to see that holds iff holds. Thus holds iff ψ holds. ∎
4 Determining Improved Strategies
In this section we develop a method for computing local improvements of strategies through solving SAT modulo real linear arithmetic formulas.
In order to decide whether or not, for a given statement, a given template constraint matrix, a given abstract value, and a given bound, the corresponding condition holds, we construct the following SAT modulo real linear arithmetic formula (we use existential quantifiers to improve readability):
Here, the auxiliary subformula relates every input state with all states reachable from it through the statement. It is defined inductively over the structure of the statement as follows:
Here, for every position of a subexpression of the statement, the formula uses a dedicated Boolean variable; these Boolean variables are among the free variables of the formula. A valuation for the Boolean variables describes a path through the statement. We have:
Lemma 9.
holds iff is satisfiable. ∎
Our next goal is to compute a strategy σ for the statement such that the improved condition holds for σ, provided that it holds for the statement itself. Assume the latter. By Lemma 9, the constructed formula has a model. We define the strategy σ by selecting, at every choice position, the arm indicated by the corresponding Boolean variable of the model. By again applying Lemma 9, we get that σ has the desired property. Summarizing, we have:
Lemma 10.
By solving a SAT modulo real linear arithmetic formula, which can be obtained from the statement in linear time, we can decide whether or not an improvement is possible. From a model of this formula, we can obtain, again in linear time, a strategy for the statement that realizes the improvement. ∎
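To convey the flavor of the improvement step, the toy sketch below brute-forces the Boolean selection variables (one per choice position) instead of querying an SMT solver, and works on a single concrete value rather than on template bounds; it merely illustrates how a satisfying assignment induces an improving path:

```python
from itertools import product

def run(stmt, x, sel, pos=()):
    """Execute the one path through stmt selected by sel on value x;
    returns None when a guard on the selected path fails."""
    tag = stmt[0]
    if tag == 'asgn':
        return stmt[1](x)
    if tag == 'guard':
        return x if stmt[1](x) else None
    if tag == 'seq':
        y = run(stmt[1], x, sel, pos + (1,))
        return None if y is None else run(stmt[2], y, sel, pos + (2,))
    if tag == 'choice':  # the Boolean at this position selects the arm
        k = 1 if sel[pos] else 2
        return run(stmt[k], x, sel, pos + (k,))

stmt = ('seq',
        ('choice', ('asgn', lambda x: x + 1), ('asgn', lambda x: x - 1)),
        ('choice', ('asgn', lambda x: 2 * x), ('guard', lambda x: x <= 0)))
positions = [(1,), (2,)]  # the two choice positions
current_bound = 0
for bits in product([True, False], repeat=len(positions)):
    sel = dict(zip(positions, bits))
    y = run(stmt, 1, sel)
    if y is not None and y > current_bound:
        print(sel, y)  # -> {(1,): True, (2,): True} 4
        break          # an improving path (hence strategy) was found
```

A real implementation replaces the enumeration by one SMT query, so only satisfying selections are ever materialized.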