# Lexical State Analyzer

###### Abstract

Lexical states provide a powerful mechanism to scan regular expressions in a context sensitive manner. At the same time, lexical states also make it hard to reason about the correctness of the grammar. We first categorize the related correctness issues into two classes: errors and warnings. We then present a context sensitive and a context insensitive analysis to identify errors and warnings in context-free grammars (CFGs). We also present a comparative study of these analyses. We have implemented a standalone tool (LSA) that can identify errors and warnings in JavaCC grammars. The LSA tool outputs a graph that depicts the grammar and the error transitions. It can also generate counter example strings that can be used to establish the errors. We have used LSA to analyze a host of open-source JavaCC grammar files to good effect.

## 1 Introduction

Lexical states provide a convenient mechanism to conditionally activate lexical tokens. For the same input substring, use of lexical states can allow different lexical tokens to be recognized based on prior accepted tokens. For example, when parsing a C program, the parser may put the scanner in a special state (say COMMENT) when it encounters “/*”; when the scanner is in this state the input substring “int” is not recognized as a keyword token but is treated as part of the comment string (just as any input other than “*/” would). In other words the token “int” is not active in the lexical state COMMENT. While lexical states do not make the generated scanners more powerful, they make the specification of the lexical rules simpler.

This simplicity comes at a cost: lexical states make it hard to reason about the grammar. To illustrate the difficulty, Figure 1 shows a snippet of a JavaCC grammar that parses a subset of BibTex files. Note that JavaCC expects the rules for lexical analysis (regular expressions) and parsing (context free grammar) to be present in a single file. An input BibTex file is expected to consist of zero or more citation blocks. Say we have a BibTex file with the following content to parse:

    @inproceedings{Tarjan71,
      author = "Robert Endre Tarjan",
      title = "Depth-first search and linear graph algorithms"
    }

A quick glance at the grammar production rules may lead the programmer to believe that the grammar will parse the above input. Let us see how the input is parsed in the presence of lexical states (see Footnote 1). The scanner starts in the DEFAULT state. Each citation block starts with an “@”; upon reading the “@” symbol the scanner switches its state to ENTRY. In this state, the scanner identifies the INPROC token and switches the state to FIELDS. In this state, the scanner identifies a series of tokens such as LB, IDENTIFIER (to be parsed as Key), COMMA, AUTHOR and EQ. The parser now expects to match the production Data. The scanner first identifies a quote (QT) and switches state to QT_DATA. In this state the scanner matches ETC_IN_QT_DATA multiple times and then identifies QT_IN_QT_DATA. At this point the parser is expecting the token COMMA or RB, but the scanner reads these tokens only in the lexical state FIELDS, which does not match the current lexical state QT_DATA. Thus, the parser will mark the input string as syntactically incorrect.

Footnote 1: In JavaCC, <I1, I2, …In> TOKEN : <X : RegEx> : Os indicates that the scanner can return a token X when it matches the regular expression RegEx only if its current state is I1, or I2, …, or In, and that after scanning the token the state changes to Os. Specifying the in-states (such as I1, I2, …) and the out-state (such as Os) is optional; the default in-state is the special state DEFAULT and the default out-state is the particular in-state in which the token is scanned.
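The walk-through above can be replayed mechanically. The following is a minimal sketch, assuming a token table reconstructed from the prose only (Figure 1 itself is not reproduced here, so the in-states and out-states listed for each token are assumptions, and the helper name `walk` is hypothetical). It feeds the scanner the token sequence the parser would request and reports where the current lexical state stops matching.

```python
# Hypothetical token table inferred from the prose: name -> (in-states, out-state).
# An out-state of None means the scanner stays in its current state.
TOKENS = {
    "AT_OUTSIDE": ({"DEFAULT"}, "ENTRY"),
    "INPROC": ({"ENTRY"}, "FIELDS"),
    "LB": ({"FIELDS"}, None),
    "IDENTIFIER": ({"FIELDS"}, None),
    "COMMA": ({"FIELDS"}, None),
    "AUTHOR": ({"FIELDS"}, None),
    "EQ": ({"FIELDS"}, None),
    "QT": ({"FIELDS"}, "QT_DATA"),
    "ETC_IN_QT_DATA": ({"QT_DATA"}, None),
    "QT_IN_QT_DATA": ({"QT_DATA"}, None),
    "RB": ({"FIELDS"}, None),
}


def walk(token_names, state="DEFAULT"):
    """Return (index, state) of the first inactive token, or (None, final state)."""
    for index, name in enumerate(token_names):
        in_states, out_state = TOKENS[name]
        if state not in in_states:
            return index, state  # token is not active in the current state
        if out_state is not None:
            state = out_state
    return None, state


# Token sequence the parser asks for on the Tarjan71 entry, up to the first COMMA
# that follows the quoted author string.
requested = [
    "AT_OUTSIDE", "INPROC", "LB", "IDENTIFIER", "COMMA", "AUTHOR", "EQ",
    "QT", "ETC_IN_QT_DATA", "QT_IN_QT_DATA", "COMMA",
]
stuck_at, stuck_state = walk(requested)
print(stuck_at, stuck_state)
```

Under these assumptions the walk stops at the second COMMA with the scanner still in QT_DATA, which mirrors the failure described in the text.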

Thus, contrary to the naive conclusion drawn by the grammar designer, the presence of lexical states may render the production rules incorrect. In other words, Block has a dead production rule that will never be matched: we cannot match RB after Entry has been matched. Similarly, parts of the grammar rules given in InputFile (AT_OUTSIDE Block() EOF) and Entry (Key() COMMA Field() COMMA Field()) will never be matched. We call it a definite error in the grammar to have a (sub)production that will never be matched. This grammar contains another definite error: as per the given production, Data can also expand to <LB> BrString(), to parse something like {Robert Endre Tarjan}. However, the parser needs either ETC_IN_BR_DATA or RB_IN_BR_DATA, which can only be identified in the lexical state BR_DATA. Since the state of the scanner is FIELDS, it can never parse data present inside braces.

Just as terminals may have different in- and out-states as declared in the grammar, a non-terminal can also be seen to have in- and out-states. The in-states of a non-terminal are the union of the in-states of the terminals present in the FIRST set [AhoSethiUllman86] of the non-terminal. Analogous to the FIRST set, we can also define the LAST set of a non-terminal $N$: the last terminal of every sentence derived from $N$ is a member of LAST($N$). The out-states of a non-terminal $N$ are the union of the out-states of the terminals present in LAST($N$).

Say there exists a grammar rule $N \to \alpha\,\beta$, where $\alpha$ and $\beta$ are sequences of terminals and non-terminals (of length one or more). Say, while $\beta$ can be derived from some of the out-states of $\alpha$, there also exist out-states of $\alpha$ from which $\beta$ cannot be derived. In such a case, depending on the specific input, after matching $\alpha$ we may reach a state that is not a valid in-state of $\beta$. We term these possible errors in the grammar. The grammar snippet shown in Figure 1 has a few possible errors as well. For example, we may be able to match Block if the input is something like @inproceedings{Tarjan71}. But if the input contains some fields that have to be matched to one or more instances of <COMMA>Field() in Entry, then we cannot match Block. The aim of this paper is to present techniques to identify errors, both definite and possible ones, in context free grammars.

We show that the most naive form of grammar analysis that detects unreachable non-terminals, by marking all the transitively unreachable non-terminals from the start non-terminal, is oblivious of the in- and out-states and is not sufficient to identify definite and possible errors.

Our contributions:
We present two analyses to identify definite errors (marked as errors) and possible errors (marked as warnings). Our first analysis (context insensitive lexical state analysis) computes summary in- and out-states for each non-terminal and does not take into consideration the position (context) in which the non-terminal appears in any production rule. We use these summaries of in- and out-states to conservatively identify the errors and warnings. Our second analysis (context sensitive lexical state analysis) computes the out-states for each non-terminal specific to the context (position and in-state) in which it may be parsed. Based on the precise out-states we compute all the (definite) errors that may occur in a production, for each possible lexical in-state for that production. We present a comparative study of these two analyses (see Section 2).
We have implemented these analyses as a standalone tool (LSA) that can identify errors and warnings in JavaCC grammars. The LSA tool outputs a graph that depicts the grammar and the error transitions. It can generate example strings (counter examples) that can be used to establish the errors (see Section 3).
We have evaluated our LSA tool on a host of open-source JavaCC grammar files to good effect. We find that our techniques help catch errors and warnings that are otherwise not caught by the naive unreachable production detection algorithm (see Section 4).

### 1.1 Related Work

Researchers have designed grammar analyzers for many different purposes. Identifying ambiguity in context free grammars has received a fair amount of attention [Gorn63, amber, BrabrandGigerichMoller10, BastenStorm10, VasudevanTratt12]. Similarly, there have been prior works on verifying [BarthwalNorrish09] and validating parsers [JourdanPottierLeroy12]; these focus on ensuring that the semantics of the parser matches that of the grammar. None of these papers deal with lexical states and the erroneous situations arising in such a context. In contrast, our paper aims at identifying errors in grammars that use lexical states.

The use of context to improve the precision of program analysis is a well known technique. The trade-offs between context-sensitive (improved precision) and context-insensitive (faster) are well studied [Muchnick97, NielsonNielsonHankin99]. In this paper, we use the notion of context-sensitive and context-insensitive analysis to present two analyses that help identify errors and warnings in context free grammars (CFGs) that use tokens with lexical states.

## 2 Lexical State Verifier

In this section, we first discuss the grammar subset over which we illustrate our analysis. Then we present three algorithms to analyze these grammars: the naive useless productions detection algorithm, our context insensitive lexical state analysis, and our context sensitive lexical state analysis. We follow it up with a discussion on the analyses and our counter example derivation process.

### 2.1 Grammar subset

We first discuss a representative scheme for token and grammar specification that we will use to explain our techniques. We will assume that the input grammar follows this specification. Our specification can be used to generate grammars in JavaCC format trivially. Details about the JavaCC syntax can be found in the manual [javacc].

A typical definition of lexical tokens is of the form:

    <I1, I2, …In> TOKEN : {
      <Token1:RegEx1> : Os
    | <Token2:RegEx2>
    }

It defines two tokens Token1 and Token2 corresponding to two regular expressions RegEx1 and RegEx2. Given a string matching RegEx1 (or RegEx2), the scanner returns the token Token1 (or Token2) if its current state is one of I1, I2, …, In. If the scanner returns the token Token2, the scanner remains in its current state. If the scanner returns the token Token1, the scanner switches to state Os. Thus every lexical token has a non-empty set of in-states and a corresponding set of out-states.

We will assume that the input grammar can be derived from the following representative grammar:

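$$
\begin{array}{ll}
N_0 \to N_1 \mid N_2 & \text{// Alternate} \\
N_0 \to N_1\, N_2 & \text{// Sequence} \\
N_0 \to T & \text{// Terminal} \\
N_e \to \epsilon & \text{// Epsilon}
\end{array}
$$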

We use $T$ to denote terminals and $N_i$ to denote non-terminals in the grammar. We expect that $N_e$ is the only non-terminal whose production string is $\epsilon$. We also assume that every non-terminal has a unique production associated with it. Note that our grammar is general enough to derive any LL (and hence JavaCC) grammar, and our actual implementation can deal with the complete JavaCC grammar.

A context-free grammar can be specified as a four-tuple consisting of a set of non-terminals, a set of terminals, a set of productions in the form described above, and the initial non-terminal symbol.

### 2.2 Useless Production Detection

We next present a naive algorithm to identify and eliminate useless productions (UPs) in the grammar. We call a production useless if it cannot be reached from the start non-terminal. Figure 2 presents a sketch of the algorithm. Starting with the start non-terminal, we “visit” the non-terminals and mark the non-terminals used in the corresponding productions. A post-pass collects and returns all the unmarked non-terminals. As can be seen, this algorithm does not take into consideration the lexical states of the terminals in use, and thus its effectiveness is limited.
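The marking procedure described above amounts to a reachability pass over the grammar. Here is a minimal sketch, not the algorithm of Figure 2 itself; the dictionary-based grammar encoding, the function name `find_useless_productions` and the toy grammar are assumptions made for illustration.

```python
from collections import deque


def find_useless_productions(productions, start):
    """Return the non-terminals that cannot be reached from `start`.

    `productions` maps each non-terminal to the list of symbols on the
    right side of its (unique) production; symbols that are not keys of
    the map are treated as terminals.
    """
    marked = {start}
    worklist = deque([start])
    while worklist:
        current = worklist.popleft()
        for symbol in productions[current]:
            if symbol in productions and symbol not in marked:
                marked.add(symbol)
                worklist.append(symbol)
    # Post-pass: everything left unmarked is useless.
    return set(productions) - marked


# Hypothetical toy grammar: C is never used on any right side reachable from S.
toy = {"S": ["A", "B"], "A": ["a"], "B": ["A"], "C": ["c"]}
useless = find_useless_productions(toy, "S")
print(useless)
```

Note that nothing in this pass consults the in- or out-states of the terminals, which is why it cannot detect the errors discussed in Section 1.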

### 2.3 Context Insensitive Lexical State Analysis

We now present our context insensitive lexical state analysis. The analysis populates two different maps, inStates and outStates (Figure 3), for its internal use. For each non-terminal, the inStates and outStates maps store the in-states and out-states, respectively. Each entry of these maps is a set of lexical states, that is, an element of the power set of the set of lexical states. For all the non-terminals, these two maps are initialized to contain empty sets. We assume that the out-state map and the in-state map for terminals are trivially precomputed (code not shown) from the rules given for lexical tokens.

Figure 4 presents a sketch of our context insensitive analysis. The main function Main-CInsensitive takes the grammar as input and first calls Find-Useless-Productions to identify all the useful productions. It follows a worklist based approach to compute the out- and in-states for all the non-terminals. We say that a non-terminal $N_1$ uses a non-terminal $N_2$ if $N_2$ appears on the right side of the production corresponding to $N_1$.

CI-BuildOutStates: The out-states of a non-terminal depend on the exact production corresponding to the non-terminal. If the production is of the form $N_0 \to N_1 \mid N_2$, then the out-states of $N_0$ include the out-states of $N_1$ and $N_2$. If the production is of the form $N_0 \to N_1\, N_2$, then the out-states of $N_0$ include the out-states of $N_2$ and optionally those of $N_1$, if $N_2$ derives the empty string $\epsilon$.

CI-BuildInStates: Similar to the construction of outStates, we update the inStates map for each production depending on its form. One main difference between the two arises when the production is of the form $N_0 \to N_1\, N_2$: the in-states of $N_0$ include the in-states of $N_1$ and optionally those of $N_2$, if $N_1$ derives the empty string $\epsilon$.
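The two rules above can be sketched as a fixed-point computation. This is an illustrative approximation rather than the procedure of Figure 4: it iterates over all productions until nothing changes instead of maintaining a worklist, and the tuple encoding of productions (`"alt"`, `"seq"`, `"term"`, `"eps"`), the function names and the toy tokens are all assumptions.

```python
def nullable_set(prods):
    """Non-terminals that can derive the empty string."""
    nullable, changed = set(), True
    while changed:
        changed = False
        for nt, p in prods.items():
            if nt in nullable:
                continue
            if (p[0] == "eps"
                    or (p[0] == "alt" and (p[1] in nullable or p[2] in nullable))
                    or (p[0] == "seq" and p[1] in nullable and p[2] in nullable)):
                nullable.add(nt)
                changed = True
    return nullable


def ci_states(prods, t_in, t_out):
    """Summary in-states and out-states per non-terminal (context insensitive)."""
    nullable = nullable_set(prods)
    ins = {nt: set() for nt in prods}
    outs = {nt: set() for nt in prods}
    changed = True
    while changed:
        changed = False
        for nt, p in prods.items():
            new_in, new_out = set(), set()
            if p[0] == "term":
                new_in, new_out = set(t_in[p[1]]), set(t_out[p[1]])
            elif p[0] == "alt":
                new_in = ins[p[1]] | ins[p[2]]
                new_out = outs[p[1]] | outs[p[2]]
            elif p[0] == "seq":
                # In-states come from the first symbol (and the second if the first is nullable).
                new_in = ins[p[1]] | (ins[p[2]] if p[1] in nullable else set())
                # Out-states come from the second symbol (and the first if the second is nullable).
                new_out = outs[p[2]] | (outs[p[1]] if p[2] in nullable else set())
            if not new_in <= ins[nt] or not new_out <= outs[nt]:
                ins[nt] |= new_in
                outs[nt] |= new_out
                changed = True
    return ins, outs


# Hypothetical tokens: 'a' is scanned in DEFAULT and switches to LX1; 'b' lives in LX1.
t_in = {"a": {"DEFAULT"}, "b": {"LX1"}}
t_out = {"a": {"LX1"}, "b": {"LX1"}}
prods = {"S": ("seq", "A", "B"), "A": ("term", "a"), "B": ("term", "b")}
ins, outs = ci_states(prods, t_in, t_out)
print(ins["S"], outs["S"])
```

In this toy grammar S inherits its in-states from A and its out-states from B, since neither A nor B derives the empty string.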

CI-Analyze: After the in- and out-states of all the non-terminals are computed, we first check if the start non-terminal can be parsed in the default lexical state (DEFAULT). We then invoke the CI-Analyze method to check whether the set of lexical states in which a non-terminal $N$ can be accessed (say $S$) matches its in-states (inStates[$N$]). If there are no common elements between $S$ and inStates[$N$], then it is flagged as an error. If $S$ includes lexical states that are not part of inStates[$N$], then it is a possible error and hence marked as a warning. A context insensitive error/warning consists of just the non-terminal in which the error/warning is identified.
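The error/warning rule can be written down directly as two set operations. The sketch below covers only the classification step, assuming the set of states in which the non-terminal is accessed has already been computed; the function name `ci_classify` and the state names are hypothetical.

```python
def ci_classify(access_states, in_states):
    """Classify one non-terminal under the context insensitive rule.

    access_states: lexical states in which the non-terminal may be reached.
    in_states:     summary in-states of the non-terminal (inStates[N]).
    """
    if not (access_states & in_states):
        return "error"    # no common state: the non-terminal can never be parsed here
    if access_states - in_states:
        return "warning"  # some, but not all, access states are valid in-states
    return "ok"


print(ci_classify({"LX1"}, {"DEFAULT"}))
print(ci_classify({"DEFAULT", "LX1"}, {"DEFAULT"}))
print(ci_classify({"DEFAULT"}, {"DEFAULT", "LX1"}))
```

The order of the two checks matters: a disjoint pair would also satisfy the second condition, so the error case has to be tested first.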

Example: Figure 5 shows a sample grammar with two lexical states (DEFAULT and LX1). The in- and out-states computed using the context insensitive analysis, along with the identified errors and warnings, are shown in columns 2-4 of Figure 6. For example, it shows that non-terminal E will always lead to an error state.

Complexity: We will use to denote the number of lexical states, to denote the grammar size; in the worst case , but in practise it rarely happens. The complexity of CI-BuildOutStates and CI-BuildInStates functions is . Each of the while loops in Main-CInsensitive is at most invoked times – in each iteration, size of the outStates map of at least one non terminal increases by one.

### 2.4 Context Sensitive Analysis

We now describe our context sensitive analysis. Here the set of lexical states LS contains an additional error state. If a terminal or non-terminal cannot be parsed in a specific lexical state (including the error state), then we consider the resulting lexical state to be the error state. Compared to the context insensitive analysis, the outStates map contains more detailed information. It stores the out-states for each non-terminal for each possible lexical state: outStates: $NT \times LS \to \mathcal{P}(LS)$, where $\mathcal{P}(LS)$ is the power set of LS. For all the non-terminals, for each lexical state, this map is initialized to contain empty sets. For the outStates map, we use a specialized union operator that performs an element-wise union of all the elements of the operands.

Figure 7 presents a sketch of our context sensitive analysis. The main function Main-CSensitive takes the grammar as input and first calls Find-Useless-Productions to identify all the useful productions. It follows a worklist based approach to compute the out-states for all the non-terminals. The CS-BuildOutStates function is similar to that described in the context insensitive analysis (Figure 4). The main variation is that the current version maintains a separate set of out-states for each lexical state. Once the out-states are computed, it calls CS-Analyze to analyze the grammar, starting with the start non-terminal and the default lexical state as the set of in-states ({DEFAULT}).

CS-Analyze: We first check if the current non-terminal has already been analyzed for the given set of in-states. If it has already been analyzed for all the member states of that set, then we return the non-error out-states of the non-terminal over all the in-states. We use a two dimensional boolean array (isAnalyzed), with all elements initialized to false, to check whether a production has already been analyzed or not. For a given lexical state, if the out-states of the non-terminal consist of only the error state, then it is marked as an error. A context sensitive error consists of the non-terminal and the lexical state in which the error is identified. Note that we avoid issuing warnings for a non-terminal and lexical state, because the source of such a warning would in any case be reported as an error; this keeps the number of messages low. If the non-terminal has not been analyzed for a subset of the input states, we recursively analyze the non-terminals that it uses.

Example: For the example program shown in Figure 5, the out-states of each non-terminal for each lexical state computed using the context sensitive analysis, along with the identified errors (note that each error is specific to a non-terminal and a lexical state), are shown in columns 5-7 of Figure 6. For example, it shows that non-terminal D leads to an error state when it is matched in lexical state DEF or LX1. As can be seen, the context sensitive analysis reports all the errors, including those that are not reported by the context insensitive analysis.
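To make the per-state bookkeeping concrete, here is a minimal sketch of a context sensitive computation. It is not the algorithm of Figure 7: it uses a plain fixed-point loop and an explicit stack rather than the worklist and the isAnalyzed array, and the production encoding, the `ERR` marker, the function names and the toy grammar (which is not the grammar of Figure 5) are assumptions.

```python
ERR = "ERR"  # stand-in name for the additional error state


def cs_out_states(prods, tokens, states):
    """out[nt][s] = set of states reachable after parsing nt starting in state s."""
    all_states = list(states) + [ERR]
    out = {nt: {s: set() for s in all_states} for nt in prods}
    changed = True
    while changed:
        changed = False
        for nt, p in prods.items():
            for s in all_states:
                if s == ERR:
                    new = {ERR}
                elif p[0] == "eps":
                    new = {s}
                elif p[0] == "term":
                    in_states, out_state = tokens[p[1]]
                    new = {out_state or s} if s in in_states else {ERR}
                elif p[0] == "alt":
                    new = out[p[1]][s] | out[p[2]][s]
                else:  # "seq": thread every out-state of the first symbol into the second
                    new = set()
                    for mid in out[p[1]][s]:
                        new |= out[p[2]][mid]
                if not new <= out[nt][s]:
                    out[nt][s] |= new
                    changed = True
    return out


def cs_errors(prods, out, start, start_state="DEFAULT"):
    """Collect (non-terminal, in-state) pairs whose only out-state is ERR."""
    errors, seen, stack = set(), set(), [(start, start_state)]
    while stack:
        nt, s = stack.pop()
        if (nt, s) in seen or s == ERR:
            continue
        seen.add((nt, s))
        if out[nt][s] == {ERR}:
            errors.add((nt, s))
        p = prods[nt]
        if p[0] == "alt":
            stack += [(p[1], s), (p[2], s)]
        elif p[0] == "seq":
            stack.append((p[1], s))
            stack += [(p[2], mid) for mid in out[p[1]][s]]
    return errors


# Hypothetical tokens: name -> (in-states, out-state or None to stay put).
tokens = {"a": ({"DEFAULT"}, "LX1"), "b": ({"DEFAULT"}, None)}
prods = {"S": ("seq", "A", "B"), "A": ("term", "a"), "B": ("term", "b")}
out = cs_out_states(prods, tokens, ["DEFAULT", "LX1"])
errors = cs_errors(prods, out, "S")
print(sorted(errors))
```

In this toy grammar, B is only ever reached in LX1 (the out-state of A), where the token b is inactive, so the sketch reports B in LX1 and, as a consequence, S in DEFAULT.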

Complexity: The complexity of the operator is . The complexity of CS-BuildOutStates function is . The while loop in Main-CSensitive is at most invoked times – in each iteration, size of the outStates map for at least one non-terminal for at least one in-state increases by one. The CS-Analyze function can be called at most times and in each invocation the work done is bound by . This leads to an overall complexity of Main-CSensitive as . In practise, size of is a small number and that makes it almost linear.

### 2.5 Generating Examples

We now discuss how we can generate example strings that can be used to establish errors in a grammar. We represent the grammar as a graph, and reduce the problem of generating “error” examples to that of computing an annotated path from the start node to the error node.

Given a context free grammar that uses tokens with lexical states, we represent it as a forest (called the lexical-transition-graph), where each connected component corresponds to a different production (labeled by its non-terminal). To avoid the problem of too many edges, we keep the forest sparse: in the figures shown in this manuscript we omit the edges between the use of a non-terminal and the graph corresponding to its production; such edges depict the parent-child relationship between the use of a non-terminal and its corresponding production. Each connected component can be seen as a graph whose set of nodes consists of all the non-terminals, terminals and special operators present in the production. For the subset of grammar presented in Section 2.1, the special operators are those representing sequencing and choice (see Footnote 2). Such a graph admits a natural parent-child relationship: each terminal and non-terminal on the right side of a production for a non-terminal is marked as its child. Similarly, each special operator acts as a parent for each non-terminal and other special operator contained within it. Each node has an attached set of in-states and out-states. The in-states of an operator node are connected to the corresponding in-states of all its children. Similarly, the out-states of an operator node are connected to the corresponding out-states of its children. The in- and out-states of a token are connected as per the state transitions defined in the grammar. Together, these edges represent the lexical state transitions that take place in the grammar.

Footnote 2: The complete JavaCC grammar syntax also allows repeated and optional sub-productions; thus the set of special operators additionally contains “*”, “+” and “[]”.

Given a particular context sensitive error, that is, a non-terminal together with a lexical state, we find a path from that non-terminal to the root (start non-terminal); this path in reverse ensures that we reach the non-terminal in the given state. Figure 8 presents the algorithm. We recursively visit the parents of the current node till we reach the graph for the start node (root). Next we retrace the path (from the root to the erroneous non-terminal) and at each intermediate node in the path whose production is of the form $N_0 \to T$, we output a part of the example string.

Example: For the grammar shown in Figure 5, Figure 9 shows the generated lexical transition graph for two production rules E and H. The red marked box shows that there are no “out” edges from D thus indicating an error in E. The graph for node H suffers from no such problems. Our counter example generation routine would generate the string bcbcbcbcc as an example that cannot be parsed. Note that, we have deliberately skipped the box corresponding to the“” in the graph for E to avoid clutter of rectangles.

### 2.6 Comparing context sensitive and insensitive analysis

In this section, we compare the precision of the context sensitive and insensitive analyses. Say $E_i \subseteq NT$ and $E_s \subseteq NT \times LS$ are the sets of errors identified by the context insensitive and context sensitive analysis, respectively. Say the set of non-terminals present in $E_s$ is given by $N_{es}$.

###### Theorem 2.1

The context sensitive analysis is more precise than the context insensitive analysis. In other words, $N_{es} \supseteq E_i$.

We present a sketch of the proof in Appendix A. This theorem ensures that the context sensitive analysis identifies all the errors reported by the context insensitive analysis, and possibly more.

### 2.7 Practical limitations of using only the naive algorithm

It can be noted that our context sensitive and insensitive algorithms essentially identify “useless” non-terminals in different productions. Thus it can be argued that we should be able to use the useless production removal procedure discussed earlier (Figure 2) to identify “useless” non-terminals, if the grammar with lexical states can be converted to an equivalent grammar with no lexical states. A grammar with lexical states can be converted to a grammar without lexical states by duplicating terminals and non-terminals such that each one has a unique in-state and a unique out-state. However, such a translation (from a grammar with lexical states to one without) can lead to exponential blow up. One such example is given below:

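$$
\begin{array}{l}
S \to A\,A\,A \ldots A \quad \text{// } w \text{ number of them} \\
A \to A_1 \mid A_2 \mid A_3 \ldots \mid A_n \\
A_1 \to a_1,\; A_2 \to a_2,\; \cdots,\; A_n \to a_n
\end{array}
$$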

Say we have $n$ lexical states ($L_1, L_2, \ldots, L_n$), and each terminal $a_i$ is declared as a TOKEN whose in-states are L1, L2, …Ln. Thus, each token has $n$ in-states and a unique out-state. A translation as suggested above would lead to exponentially many productions, rendering the overall analysis impractical.

## 3 Implementation

We have implemented our LSA tool using JavaCC and Java. LSA uses the JavaCC grammar from Sun Microsystems [javaccRep]. We extend the code generated by JTB [jtb] to generate an annotated tree, where each node contains information required for the analyses. Further, we recreate the parse tree for efficient traversal; we call this tree the operator tree. The intermediate nodes of this tree are operators; the terminals and non-terminals can only appear in the leaf nodes. One of these operators is used to represent grammar productions: its left child is a non-terminal and its right child is a production. The remaining operators, along with terminals and non-terminals, are used to denote different productions. We later use this tree to generate the graph discussed in Section 2.5, where we drop the operators and make non-terminals the intermediate nodes. Unlike our discussed grammar subset (Section 2.1), all these operators can admit any number of operands. Thus our implementation is not limited by the grammar restrictions described in this paper. LSA can take as input any valid LL(k) grammar in JavaCC format. We now discuss some implementation details of LSA.

### 3.1 Graph Generation

Given an input grammar, LSA invokes our analyses to find warnings and errors. Next, as described in Section 2.5, it creates a lexical transition graph for the input grammar (in DOT [dot] format), along with the lexical states. This graph represents the lexical state transitions that are taking place in the grammar. We then highlight the edges which can lead to error states. Figure 10 shows a part of the graph generated for the motivating example shown in Figure 1. It shows that there are no edges from the out-states of Field (BR_DATA and QT_DATA) to in-states of the “*” sub-production (FIELDS). Thus we cannot use this production to parse more than one Field.

### 3.2 Limitations

We briefly discuss some of the limitations of our implementation. We analyze the lexical state transitions only in the BNF productions (not Javacode productions) and only with respect to the TOKEN regular expression specifications. JavaCC constructs such as SKIP, MORE and SPECIAL_TOKEN are not handled, as they do not appear directly in BNF productions. Similarly, we do not handle inlined Java code; JavaCC allows the inlined code to change the scanner state using a specialized function SwitchTo. This function takes an integer argument representing the state to change to. Thus a precise lexical state transition analysis would depend on identifying the values that flow into this argument. Analyzing lexical state transitions involving SwitchTo is left as future work.

## 4 Evaluation

We present the evaluation of our tool on a set of ten open-source JavaCC grammar files downloaded from different websites. The complete compilation of these grammar files can be downloaded from our website: http://www.cse.iitm.ac.in/~krishna/lsa/benchmarks/.

Figure 11 presents the summary of our evaluation. The size of these grammar files varied from approximately 200 lines of code to 3000 lines of code. The number of lexical states varied from one to nine. Following the suggestions of the insightful paper of Georges et al. [GeorgesBuytaertEeckhout07], we report the analysis time as an average over 30 runs (on a personal laptop with an Intel i3 processor). The reported time includes the time taken to read the grammar files and to perform the specific analysis. The figure shows that the running time overhead of our proposed analyses is minimal; all the analyses finish running in less than a second. The context insensitive and sensitive analyses for grammars like PHP, FM and Parser take more time than the UP analysis; this is because of the comparatively heavier use of lexical states in them.

It can be noted that the number of context insensitive errors is less than or equal to the number of context sensitive errors, which agrees with our claim in Section 2.6. We have also generated the graphs for these benchmarks that depict the errors and these can be accessed from the above mentioned URL. We are in the process of writing to the authors of these grammars to understand the challenges in automatic fixing of such grammars.

## 5 Conclusion

We discuss three techniques to identify errors and warnings in context free grammars that use tokens with lexical states: a naive technique to eliminate useless productions, a context insensitive lexical state analysis and a context sensitive lexical state analysis. We have implemented these techniques as a standalone tool (LSA) for grammar files written in JavaCC format. Besides the specific information about the errors and warnings, LSA outputs a graph that helps reason about the errors in a convenient manner. We have used LSA to analyze a few open-source JavaCC grammars to good effect. We are working towards releasing this tool for public use.

Analyzing JavaCC grammars with Javacode productions and inlined Java code is an interesting challenge. Further, (semi)automatic fixing of the identified errors in grammars is another formidable challenge. These challenges are left for future work.

## Appendix A Comparison of context sensitive and insensitive analysis

Given a grammar, we define the sets used below in Figure 12. We will be using these sets and maps, in addition to the ones defined in Figure 3, to state and prove the following theorem:

###### Theorem A.1

The context sensitive analysis is more precise than the context insensitive analysis. In other words, $N_{es} \supseteq E_i$.

###### Proof

Notation: For the subset of grammar considered in this paper (Section 2.1), the only production in which a context insensitive error can be encountered is of the form $N_0 \to N_1\, N_2$. Say $p = (N_0 \to N_1\, N_2)$ is one such production. We will be using as a short form for . We define the following two sets.

$$
\begin{array}{l}
S_1 = O'(N_1, R) \cap I(N_2) \\
S_2 = O'(N_1, L) \cap I(N_2)
\end{array}
$$

Sets $S_1$ and $S_2$ contain the states in which $N_2$ can be parsed after $N_1$, in production $p$, while doing the context sensitive and insensitive analysis, respectively.

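We have,

$$
\begin{array}{lr}
S_1 = \phi \leftrightarrow p \in N_{es} & (1) \\
S_2 = \phi \leftrightarrow p \in E_i & (2) \\
S_1 \subseteq S_2 & (3)
\end{array}
$$

From (3), we have

$$
\begin{array}{ll}
 & S_2 = \phi \rightarrow S_1 = \phi \\
\rightarrow & p \in E_i \rightarrow p \in N_{es} \quad \text{// From (1) and (2)} \\
\leftrightarrow & N_{es} \supseteq E_i
\end{array}
$$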