Earley Parsing for Context-Sensitive Grammars

Daniel M. Roberts

May 4, 2009
1 Introduction
The field of parsing has received considerable attention over the last 40 years, largely because of its applicability to nearly any problem involving the conversion of data. It has proven to be a beautiful example of how theoretical research, in this case automaton theory, and clever algorithms can lead to vast improvements over naive or brute-force implementations. String parsing has applications in data retrieval, compilers, and many other fields. For compilers especially, format description languages such as YACC have been developed to define how code in a text file should be interpreted as a tree structure which the compiler can handle directly. While this is good for the specialized domain, there are cases when one might wish to use a lighter-weight all-in-one approach that also includes a small amount of in-line programmability.

On the flip side of the pursuit of expressiveness are efficiency constraints. If the language is quite restricted, the parsing algorithm can be more tightly tailored to fit the problem, and thus can be more efficient. In practice, a simple language that is a subset of a more complicated language often outperforms the more expressive language when the two are given the same simple-language input. For example, the regular expression engines of Java, Python, Ruby, Perl, and certainly others, which include expressive features such as look-aheads, back-references, and boundary conditions, have worst-case exponential performance even on strictly regular regular expressions [1], whereas an implementation specialized for strict regular expressions can achieve worst-case quadratic time, or linear time if we allow preprocessing.

The goal of the current research is to explore how to go about creating a parsing language that, in addition to being generally efficient, is in fact maximally efficient on restricted sublanguages with known solutions. This work is mainly concerned with three formats for specifying the structure of textual documents: (1) regular expressions, which describe the regular languages, (2) context-free grammars (CFGs) with regular right-hand sides, which describe the context-free languages, and (3) what we shall call the context-sensitive grammars (CSGs)¹, which will be the focus of this paper.

¹ CSGs as defined here are not to be confused with the related but different definition of a context-sensitive grammar used in relation to the Chomsky hierarchy.

Regular expressions and CFGs have long been staples of parsing and are used for everything from web apps and text search to compilers. Matching a string of length m to a regular expression of length n is known to be worst case O(nm) in time if we do not allow preprocessing, and this bound can be achieved with the Thompson algorithm, described below. Parsing a string of length m in accordance with a context-free grammar of "size" n is known to be worst case O(m²n) in time, and an algorithm that achieves this time bound is the Earley algorithm, which is based on the same principle as Thompson. Earley can also be extended to parse strings according to a CSG, but because of the wide range of expressiveness allowed by this format, there is no good worst-case time bound. Nevertheless, one key property of the extension is that when the CSG describes a context-free language, it is no less efficient than traditional Earley parsing. Since both regular expressions and CFGs are subsets of CSGs, this augmented grammar specification format offers more flexibility than either, without sacrificing efficiency in the cases where special features are not used.

I shall begin with an overview of the theory for parsing regular expressions and context-free grammars.
2 Regular Expressions
Regular expressions have the form

    rexp ::= Eps | Char(char) | Alt(rexp,rexp) | Cat(rexp,rexp) | Star(rexp)

Basic regular expression syntax and matching conditions can be described as follows.

1. Eps. A string matches "()" if it is the empty string.

2. Char('c'). A string matches "c" if it consists of the single character "c".

3. Alt(re1, re2). A string matches "re1|re2" if it matches either re1 or re2.

4. Cat(re1, re2). A string matches "re1 · re2" if it can be split into two strings s1 and s2 such that s1 matches re1 and s2 matches re2.

5. Star(re). A string matches "re*" if it can be split into k ≥ 0 strings all of which match re.

In addition, it is conventional to give the operators |, ·, and * binding preferences that are analogous to those of +, ·, and x² in mathematical expressions, thus reducing the need for parentheses. It is also conventional to leave out the explicit symbol ·, just as it is in mathematical expressions. Thus, for example, the regular expression ab|cd* stands for the more explicit (a · b)|(c · (d*)), and matches either the string "ab" or any string of the form "cddd...", i.e. a single c followed by zero or more d's.
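To make the datatype concrete, here is a minimal sketch of it in Python. The paper does not name an implementation language, so this rendering and its names are illustrative only.

    # Hypothetical Python rendering of the rexp datatype above.
    from dataclasses import dataclass

    class Rexp:
        pass

    @dataclass(frozen=True)
    class Eps(Rexp):
        pass

    @dataclass(frozen=True)
    class Char(Rexp):
        c: str

    @dataclass(frozen=True)
    class Alt(Rexp):
        left: Rexp
        right: Rexp

    @dataclass(frozen=True)
    class Cat(Rexp):
        left: Rexp
        right: Rexp

    @dataclass(frozen=True)
    class Star(Rexp):
        inner: Rexp

    # ab|cd* from the example above:
    AB_OR_CDSTAR = Alt(Cat(Char("a"), Char("b")),
                       Cat(Char("c"), Star(Char("d"))))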
The naive way to match a regular expression is to match its parts recursively, more or less as described above. Although this algorithm has been shown to be exponential in the degenerate case, it is by far the most widely used algorithm in practice, because it is conceptually the simplest to implement and it is trivially extensible to include more powerful features such as back-references, which are inherently non-regular in the strict sense. It would be desirable, however, for the mere existence of such language features not to interfere with the computational complexity of parsing against expressions that use only the operations given here. This is the motivation for the present research.

The polynomial-time algorithm alluded to above is attributed to Ken Thompson, 1968. It works by reading in the characters of the string one at a time, keeping track of all possible parses at once. To assist in keeping track of the partial parses, the algorithm first builds a directed graph representation of the regular expression so that each node represents a parse state, and each transition is labeled with the character that needs to be parsed in order for the edge to be followed. Some edges can be unlabeled, in which case no character is needed to pass from one state to the other. This type of graph is often called a Finite State Automaton (FSA). Below is an example of a recursive function in pseudocode that returns the Initial node and the Final node. A parse that begins at the Initial node and ends at the Final node is a successful parse.

    FSA(re):
      I = fresh_state()
      F = fresh_state()
      switch(re):
        Eps:
          return (F, F)
        Char(c):
          build [I --c--> F]
          return (I, F)
        Alt(re1, re2):
          (I1, F1) = FSA(re1)
          (I2, F2) = FSA(re2)
          build [I ---> I1]
          build [I ---> I2]
          build [F1 ---> F]
          build [F2 ---> F]
          return (I, F)
        Cat(re1, re2):
          (I1, F1) = FSA(re1)
          (I2, F2) = FSA(re2)
          build [F1 ---> I2]
          return (I1, F2)
        Star(re1):
          (I1, F1) = FSA(re1)
          build [I ---> I1]
          build [F1 ---> I]
          build [I ---> F]
          return (I, F)

The Thompson algorithm to parse a string str against an FSA keeps track of S_k, the set of states that are reachable after parsing the first k characters, starting with k = 0 up to the length of the string. Specifically, for k ≥ 0, S_{k+1} depends only on S_k and str[k]. Here are the rules for moving ahead with Thompson:

    Initialization:     I ∈ S_0

    Consumption:        u ∈ S_k,  u --str[k]--> u′   ⟹   u′ ∈ S_{k+1}

    Null Propagation:   u ∈ S_k,  u ---> u′          ⟹   u′ ∈ S_k
When F ∈ S_k, the first k characters of the string match the regular expression.

There are a number of additions to the syntax of regular expressions that do not affect the overall complexity of the grammars they describe. In particular we shall use

1. re+, which is equivalent to re re*, and

2. re?, which is equivalent to re|().

It is also convenient, both for ease of expression and for efficiency of parsing, to have character classes. A character class matches any symbol in a given set of symbols: for example, we may wish to have \a match any alphabetic character, have \w match any alphanumeric character, have a period (.) match any character, etc. Since we can often represent a set of characters as a range in the ASCII character set, it is more efficient for a parser to see whether a character falls within the range than to see whether it is equal to one of some list of characters, because the latter approach requires a linear-time search.
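As a concrete illustration, here is a runnable Python sketch of the construction and scan described above, building on the Rexp classes and the AB_OR_CDSTAR example sketched earlier. The state numbering and edge representation are my own choices, not the paper's.

    # Build the FSA: states are integers, edges[u] is a list of
    # (label, dest) pairs, and label None marks a null edge.
    from collections import defaultdict

    def fsa(re, edges, counter):
        """Return (initial, final) states for re, adding edges in place."""
        def fresh():
            counter[0] += 1
            return counter[0]
        if isinstance(re, Eps):
            f = fresh()
            return (f, f)
        if isinstance(re, Char):
            i, f = fresh(), fresh()
            edges[i].append((re.c, f))
            return (i, f)
        if isinstance(re, Alt):
            i, f = fresh(), fresh()
            i1, f1 = fsa(re.left, edges, counter)
            i2, f2 = fsa(re.right, edges, counter)
            for src, dst in ((i, i1), (i, i2), (f1, f), (f2, f)):
                edges[src].append((None, dst))
            return (i, f)
        if isinstance(re, Cat):
            i1, f1 = fsa(re.left, edges, counter)
            i2, f2 = fsa(re.right, edges, counter)
            edges[f1].append((None, i2))
            return (i1, f2)
        if isinstance(re, Star):
            i, f = fresh(), fresh()
            i1, f1 = fsa(re.inner, edges, counter)
            edges[i].append((None, i1))   # enter the loop
            edges[f1].append((None, i))   # repeat
            edges[i].append((None, f))    # or match zero copies
            return (i, f)

    def matches(re, s):
        edges = defaultdict(list)
        init, final = fsa(re, edges, [0])

        def closure(states):              # Null Propagation
            stack, seen = list(states), set(states)
            while stack:
                u = stack.pop()
                for label, v in edges[u]:
                    if label is None and v not in seen:
                        seen.add(v)
                        stack.append(v)
            return seen

        S = closure({init})               # Initialization
        for ch in s:                      # Consumption, one character at a time
            S = closure({v for u in S for (label, v) in edges[u]
                         if label == ch})
        return final in S

    assert matches(AB_OR_CDSTAR, "cddd") and not matches(AB_OR_CDSTAR, "abd")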
3 Context-Free Grammars
The context-free grammars describe a class of languages that is in some sense "infinitely" more complex than regular expressions. Among the language features that CFGs can describe but that regular expressions cannot are

1. matching parentheses,

2. recursive algebraic structures, and

3. trees of arbitrary depth.

The idea of recursion is explicitly built into the definition of a CFG. This makes them perfect for describing many formal languages, such as programming language syntax and data formats. As we will see shortly, the regular expressions syntax can be described as a CFG, and for that matter so can CFG syntax.
3.1 CFG Syntax
For the purposes of this paper, a CFG is a context-free grammar with regular right-hand sides. A CFG is a list of rules of the form

    nont = rexp;

where nont is the nonterminal symbol being defined and rexp is a regular expression over terminals and nonterminals. The first nonterminal in the list of definitions is taken as the start nonterminal, which is to say that a string matches the CFG if it matches the first nonterminal. To distinguish nonterminals from terminals, nonterminals are enclosed in curly braces {nont}. Just as character classes can be simulated by the other operations, namely alternation, so too can regular expressions be simulated by the more basic CFG syntax, which allows only concatenation. Allowing regular right-hand sides not only simplifies the notation for the programmer, but also makes parsing more efficient.

In addition to this baseline language, there are two extra syntactic features that do not express conditions on whether a string is accepted, but instead affect how verbose the resulting parse tree is. By default, all nonterminals processed in the course of matching a string are represented as a node in the tree, and all terminals are not represented at all. To hide a nonterminal and let its subtree be subsumed by its parent, the syntax

    .nont = rexp;

is used. To express the characters that appear in a regular expression, use the syntax $(re). This is especially useful when you want the parse tree to keep track of the actual character used to match a word class: $(.), $(\a), etc.; or to get the result of an alternation: $(a|b|c), etc. For grouping that does not save, brackets are used. Below is an example CFG that defines regular expressions syntax:
    rexp = {rexp1};
    alt  = {rexp2} (\| {rexp2})+;
    cat  = {rexp3} {rexp3}+;
    uny  = {rexp4} $(\*|\+|\?);
    eps  = \(\);
    c    = $(\w|\.|\\.);
    .paren = \({rexp1}\);
    .rexp1 = {rexp2} | {alt};
    .rexp2 = {rexp3} | {cat};
    .rexp3 = {rexp4} | {uny};
    .rexp4 = {eps} | {c} | {paren};

Here the four nonterminals rexp1, rexp2, rexp3, and rexp4 represent four different levels of binding. This ensures that a string such as ab|cd* is interpreted not as, say, the regular expression (a(b|c)d)*, but as (ab)|(c(d*)).
3.2 CFG Transducers
The transducers needed to represent a CFG will not be finite state automata in general, because the expressive power of FSAs is equal to that of regular expressions. We will compile one FSA for each regular right-hand side, and use a new type of graph edge to join them together: the call edge. If some node u in nonterminal A has a transition u --{B}--> u′ (where A may or may not be the same as B), we build a call edge from u to the start state of B's FSA. To use this kind of transducer, we have to maintain a stack of return addresses: when we follow u --call--> I_B, we push u onto the stack. When we reach F_B, we pop the first item off the stack, u, and transition to some u′ where u --{B}--> u′. [3]

Once a transducer has been assembled from a CFG, it can be marshaled and stored as a preprocessed form of the CFG, to be used directly for parsing.
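For concreteness, here is one hypothetical Python encoding of this call-and-return discipline; the representation (and the assumption that a single u′ suffices) is mine, not the paper's.

    # initials[B] is the initial state of B's FSA; nont_edges[(u, B)]
    # is a state u' with u --{B}--> u' (in general there may be several).
    def follow_call(u, callee, stack, initials):
        stack.append(u)              # remember the call site u
        return initials[callee]      # jump to I_B

    def follow_return(B, stack, nont_edges):
        u = stack.pop()              # pop the call site on reaching F_B
        return nont_edges[(u, B)]    # take u --{B}--> u'

As Section 3.3 explains, maintaining this stack explicitly for every alternative parse is exactly what the Earley construction avoids.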
3.3 Earley Parsing
As with FSAs, these transducers are in general nondeterministic; that is, from any given state there may be null edges or multiple edges of the same type. Either of these means that during the parse there will be moments when it is not clear which state we ought to move to next. As with regular expressions, this nondeterminism can in theory be dealt with by recursively trying every possible path until a match is found, but this kind of "backtracking" leads to poor performance that is worst-case exponential. The classic polynomial-time solution to this problem was proposed by Jay Earley [2] and has a similar flavor to the Thompson algorithm for regular expressions. The version described below has been modified to work with the CFG transducers described above, and is largely based on a version described by Trevor Jim and Yitzhak Mandelbaum [3].
For every position 0 ≤ j ≤ m in the string to be parsed, the algorithm constructs an "Earley set." Just as in Thompson, an Earley set is a set of possible "parse states," but in the case of CFGs, a transducer vertex does not fully describe a parse state, because we also need to be able to reconstruct the call stack. One way to do this would be to describe a parse state as a transducer vertex plus the call stack, but this has the downside of being very inefficient, because there is no bound on the length of the call stack, and thus these sets of possible states would be able to grow arbitrarily large. A far more compact way to represent the call stack is with a "return address" i, not to a vertex, but to an Earley set. What this means is that any vertex with a representative in the ith Earley set is a valid return address. Thus, if two parse states in the same Earley set call the same nonterminal, the sub-parse is only done once, rather than once per call. Below are the formal parsing semantics.

Rules carried over from Thompson:

    Initialization:     (I, 0) ∈ S_0

    Consumption:        (u, i) ∈ S_j,  u --str[j]--> v   ⟹   (v, i) ∈ S_{j+1}

    Null Propagation:   (u, i) ∈ S_j,  u ---> v          ⟹   (v, i) ∈ S_j

New rules:

    Call Propagation:   (u, i) ∈ S_j,  u --call--> v     ⟹   (v, j) ∈ S_j

    Return Propagation: (u, i) ∈ S_j,  u ↦ A,  (u′, i′) ∈ S_i,  u′ --A--> v   ⟹   (v, i′) ∈ S_j

Here u ↦ A indicates that u is the final state of A's FSA. The general algorithm is to seed a set S_j with Initialization if j = 0 and Consumption otherwise; to repeat the three Propagation rules until S_j does not change; and to apply this recursively to S_{j+1} if j < m. There are many ways to optimize propagation. One case where optimization can significantly increase efficiency is when a nonterminal is "nullable," that is, when it matches the null string. Nullable nonterminals require applying the rules in several rounds until nothing changes, because if a nonterminal doesn't consume any characters, Return Propagation searches through the items in the same Earley set (i = j above), which is as yet unfinished. This issue, however, does not affect the overall time complexity of the algorithm, which has worst-case time O(m²n) in either case, and this paper does not deal with such optimizations.
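Here is a small runnable Python sketch of these rules. The transducer encoding (dicts of edges plus an owner map from final states to their nonterminals) is illustrative rather than the paper's, and the example grammar at the end is mine.

    from dataclasses import dataclass

    @dataclass
    class Transducer:
        chars: dict   # u -> [(char, v)]
        null:  dict   # u -> [v]
        calls: dict   # u -> [initial state of the callee]
        nont:  dict   # u -> [(A, v)]: return arcs u --A--> v
        owner: dict   # final state of A's FSA -> A   (i.e. u maps to A)

    def earley(t, start, accept, s):
        m = len(s)
        sets = [set() for _ in range(m + 1)]
        sets[0].add((start, 0))                          # Initialization
        for j in range(m + 1):
            changed = True
            while changed:                               # repeat until S_j is stable
                changed = False
                for (u, i) in list(sets[j]):
                    new = [(v, i) for v in t.null.get(u, ())]      # Null
                    new += [(v, j) for v in t.calls.get(u, ())]    # Call
                    A = t.owner.get(u)
                    if A is not None:                              # Return
                        new += [(v, i2) for (u2, i2) in list(sets[i])
                                for (B, v) in t.nont.get(u2, ())
                                if B == A]
                    for item in new:
                        if item not in sets[j]:
                            sets[j].add(item)
                            changed = True
            if j < m:                                    # Consumption
                for (u, i) in sets[j]:
                    for (c, v) in t.chars.get(u, ()):
                        if c == s[j]:
                            sets[j + 1].add((v, i))
        return (accept, 0) in sets[m]

    # Balanced parentheses, S = \({S}\) | (): initial state 0, final state 9.
    t = Transducer(
        chars={1: [("(", 2)], 3: [(")", 4)]},
        null={0: [1, 9], 4: [9]},
        calls={2: [0]},              # 2 --call--> I_S
        nont={2: [("S", 3)]},        # 2 --S--> 3
        owner={9: "S"},
    )
    assert earley(t, 0, 9, "(())") and not earley(t, 0, 9, "(()")

The fixpoint loop mirrors the paper's "repeat the propagation rules until S_j does not change," which also covers the extra rounds that nullable nonterminals require.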
4 Context-Sensitive Grammars
The purpose of the present paper is to explore a grammar format that has context-sensitive features, but that looks formally quite similar to our formulation of CFGs with regular right-hand sides. In formal language theory, context-sensitivity is often formulated by loosening the restriction that rules have only a single nonterminal on the left-hand side. The present formulation is easier to reason with, and it incorporates some familiar features of imperative programming, such as the ability to pass arguments to subroutines, to store values in variables, and to reason with and operate on both values and variables.
4.1 CSG Syntax
The augmentation of CFG syntax to accommodate forms of context-sensitivity can be done entirely by adding new types of expressions to regular expressions, now renamed rhs's due to their lack of regularity in the grammar-theory sense.

    var  = string
    nont = string
    rhs ::= Eps | Char(char) | Alt(rhs,rhs) | Cat(rhs,rhs) | Star(rhs)
          | Nont(nont, exp) | Assert(exp) | Capture(rhs,var) | Set(var,exp)
    exp ::= ...

Here is a summary of the additions:

1. A call to a nonterminal may contain an argument, which the nonterminal may use to guide its parse.

2. An assert statement, for example [len(x) > 3], which matches the empty string if and only if the expression evaluates to true.

3. A capture statement, which, after matching an rhs, stores the matched string in a variable, which may be referenced later. For example, (.. @ x) matches two characters and stores them in the variable x.

4. We may set and reset variables at any time with a command like (x=len(y)).
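In the same hypothetical Python rendering used earlier (the Rexp classes now playing the role of rhs), the new constructors might look like this; the field names are illustrative.

    from dataclasses import dataclass

    @dataclass(frozen=True)
    class Nont(Rexp):       # {B e}: call nonterminal B with argument e
        nont: str
        arg: object

    @dataclass(frozen=True)
    class Assert(Rexp):     # [e]: matches the empty string iff e is true
        cond: object

    @dataclass(frozen=True)
    class Capture(Rexp):    # (rhs @ x): match rhs, bind the matched text to x
        rhs: Rexp
        var: str

    @dataclass(frozen=True)
    class Set(Rexp):        # (x = e): bind x to e, matching the empty string
        var: str
        exp: object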
The expression language exp may be as expressive as one wishes so long as it does not modify the external environment, though a very simple language that includes integer calculation and comparison, strings, characters, atoi, string-length, variables, and equality testing is enough to allow this class of grammars to describe many practically applicable cases of context-sensitivity.
4.2 CSG Transducers
As above, once we have reformulated how to build the right-hand sides, joining them to create the transducer is done as described for CFGs, with the small caveat that call edges are parameterized with the argument to be passed. To build a transducer fragment, we use the method described for regular expressions, also parameterizing nonterminal edges with their argument. An assert edge is labeled with the assertion, and a set edge is labeled with the assignment. Capture(rhs, x) requires its own mechanism, which is to surround rhs's fragment with an incoming "push" arrow and an outgoing "pop x" arrow. The transducer semantics are described below.
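One hypothetical way to extend the earlier edge encoding with these labels; the tags and the helper function below are mine, not the paper's.

    # Edge labels for the CSG transducer, as tagged tuples:
    #   ("char", c, v)       consume character c
    #   ("null", v)          null edge
    #   ("call", B, e, v)    call edge, parameterized with argument e
    #   ("nont", B, e, v)    return arc labeled B(e)
    #   ("assert", e, v)     labeled with the assertion e
    #   ("set", x, e, v)     labeled with the assignment x = e
    #   ("push", v)          incoming arrow of a capture
    #   ("pop", x, v)        outgoing arrow of a capture, binding x

    def capture_fragment(inner_initial, inner_final, fresh, edges, x):
        """Surround an rhs fragment with push/pop arrows for (rhs @ x)."""
        i, f = fresh(), fresh()
        edges[i].append(("push", inner_initial))
        edges[inner_final].append(("pop", x, f))
        return (i, f)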
4.3 Augmented Earley Parsing
This algorithm is a minimal modification of the Earley algorithm to include extra state information, namely variable contexts and the capture stack. Thus, an Earley item for CSG parsing is (u, i, E, α), where E is the context, which has type [(var, exp)], and α is the capture stack, which has type [int].

Rules carried over from Thompson:

    Initialization:     (I, 0, [], []) ∈ S_0

    Consumption:        (u, i, E, α) ∈ S_j,  u --str[j]--> v   ⟹   (v, i, E, α) ∈ S_{j+1}

    Null Propagation:   (u, i, E, α) ∈ S_j,  u ---> v          ⟹   (v, i, E, α) ∈ S_j

Rules carried over from vanilla Earley:

    Call Propagation:   (u, i, E, α) ∈ S_j,  u --call(e)--> v  ⟹   (v, j, [("arg", e)], []) ∈ S_j

    Return Propagation: (u, i, E, α) ∈ S_j,  u ↦ A,  (u′, i′, E′, α′) ∈ S_i,  u′ --A(E["arg"])--> v   ⟹   (v, i′, E′, α′) ∈ S_j

New rules:

    Assert Propagation: (u, i, E, α) ∈ S_j,  u --assert(e)--> v,  eval(e, E) = Bool(true)   ⟹   (v, i, E, α) ∈ S_j

    Set Propagation:    (u, i, E, α) ∈ S_j,  u --set(x,e)--> v   ⟹   (v, i, ((x, e) :: E), α) ∈ S_j

    Push Propagation:   (u, i, E, α) ∈ S_j,  u --push--> v       ⟹   (v, i, E, (j :: α)) ∈ S_j

    Pop Propagation:    (u, i, E, (k :: α)) ∈ S_j,  u --pop(x)--> v   ⟹   (v, i, ((x, str[k:j]) :: E), α) ∈ S_j

Two details deserve attention. First, Call Propagation has been modified to pass on a context that contains the special variable "arg" set to the value of the parameter. Second, Return Propagation has been modified so that a nonterminal arc must match both in nonterminal and in parameter. The capture mechanism works as follows: to start a capture, the input position is pushed onto the capture stack; to finish a capture, pop the start position off the stack, extract the substring of the input that starts there and ends at the current position, and store this string in some variable.
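A direct Python transcription of the four new rules, using the tagged edges sketched in Section 4.2 and items (u, i, E, α) with E and α as tuples so that items stay hashable; eval_exp is the expression evaluator sketched in the next section.

    def step_new(item, edge, s, j):
        """Apply one new propagation rule to an item of S_j.
        Returns the resulting item, or None if the rule does not fire."""
        u, i, E, alpha = item
        tag = edge[0]
        if tag == "assert":                      # Assert Propagation
            _, e, v = edge
            if eval_exp(e, E) is True:
                return (v, i, E, alpha)
        elif tag == "set":                       # Set Propagation
            _, x, e, v = edge
            return (v, i, ((x, e),) + E, alpha)  # (x, e) :: E, per the rule
        elif tag == "push":                      # Push Propagation
            _, v = edge
            return (v, i, E, (j,) + alpha)       # j :: alpha
        elif tag == "pop":                       # Pop Propagation
            _, x, v = edge
            if alpha:
                k, rest = alpha[0], alpha[1:]
                return (v, i, ((x, s[k:j]),) + E, rest)
        return None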
5 The Expression Language
In the current implementation, a simple, untyped expression language is used. Here is a BNF outline:

    type exp ::= Var(var) | Unit | Bool(bool) | Int(int) | Char(char) | Str(string)
               | Not(exp) | Equals(exp, exp) | Less(exp, exp) | Minus(exp)
               | Sum(exp, exp) | Prod(exp, exp)
               | GetChar(exp, exp) | Len(exp) | Atoi(exp) | Fail
    type var = string

For syntactic convenience, the symbols ! for Not, = for Equals, < for Less, + for Sum, and binary - for Sum(e1, Minus(e2)) are used. The syntax str[i] is used for GetChar(str, i), len(str) for Len(str), and int(str) for Atoi(str).
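A minimal sketch of such an evaluator, with expressions as tagged tuples and the context E as a sequence of (var, value) pairs; the encoding is illustrative, not the paper's.

    def eval_exp(e, E):
        tag = e[0]
        if tag == "var":                  # look up the most recent binding
            for (x, v) in E:
                if x == e[1]:
                    return v
            raise KeyError(e[1])
        if tag == "unit":
            return ()
        if tag in ("bool", "int", "char", "str"):
            return e[1]
        if tag == "not":
            return not eval_exp(e[1], E)
        if tag == "equals":
            return eval_exp(e[1], E) == eval_exp(e[2], E)
        if tag == "less":
            return eval_exp(e[1], E) < eval_exp(e[2], E)
        if tag == "minus":
            return -eval_exp(e[1], E)
        if tag == "sum":
            return eval_exp(e[1], E) + eval_exp(e[2], E)
        if tag == "prod":
            return eval_exp(e[1], E) * eval_exp(e[2], E)
        if tag == "getchar":              # str[i]
            return eval_exp(e[1], E)[eval_exp(e[2], E)]
        if tag == "len":                  # len(str)
            return len(eval_exp(e[1], E))
        if tag == "atoi":                 # int(str)
            return int(eval_exp(e[1], E))
        if tag == "fail":
            raise ValueError("Fail")
        raise ValueError("unknown expression tag: " + tag)

    # len("abc") evaluates to 3, so 2 < len("abc") holds:
    assert eval_exp(("less", ("int", 2), ("len", ("str", "abc"))), ()) is True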
6 Parse Tree Building
To return a parse tree from the algorithm outlined above, it suffices to store, for each parse state item, a pointer to the item or items that participated in its creation. The scheme used in the current implementation is as follows. Every item stores one of the following parse annotations:

1. When an item is added by Call Propagation, it stores a PCall tag, with no pointers.

2. When an item is added by Return Propagation, it stores PReturn(u′, u, A, E["arg"], show), where u, u′, A, and E correspond to the variables in Return Propagation as stated above, and show records whether the parse of the nonterminal should be given its own subtree or whether it should be subsumed by the caller's parse; this is indicated syntactically by the omission or inclusion of a period ('.') before the name of the nonterminal in the grammar file.

3. When an item is added by anything else, it stores a simple back pointer to the item, PTransEps(item).

Additionally, in order to have some control over what shows up in our parse tree and what is omitted, we can augment the CSG right-hand sides to include a Show(var) constructor. Syntactically this is written ($var). This generates an arc in the transducer that acts just like a null transition, except that if u --show(var)--> v, then v stores a special parse annotation, PShow(exp), where exp is the value of var, E[var]. This lets the tree generator know to include exp as one of the Leaf nodes of the tree. Note that a Leaf(exp) records the expression as an expression value, Unit, Bool(b), Int(i), Char(c), or Str(s), preserving the type.
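Rendered in the same illustrative Python style as the earlier sketches, the annotations might be:

    from dataclasses import dataclass

    @dataclass
    class PCall:                 # added by Call Propagation; no pointers
        pass

    @dataclass
    class PReturn:               # added by Return Propagation
        u_prime: object          # u'
        u: object                # u
        nonterminal: str         # A
        arg: object              # E["arg"]
        show: bool               # own subtree, or subsumed by the caller

    @dataclass
    class PTransEps:             # everything else: a simple back pointer
        item: object

    @dataclass
    class PShow:                 # generated by a ($var) arc
        value: object            # E[var], recorded as a Leaf of the tree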
7 Examples
To test the efficacy of this system in practice, the following examples have been run on the current implementation.
7.1 IP Addresses
This example matches an IP address, whose format is N.N.N.N, where 0 ≤ N ≤ 255.

    IP = {N255}\.{N255}\.{N255}\.{N255};
    .N255 = (\d+@x)(x=int(x))[x>=0][x<=255];

Each octet is captured into x, converted to an integer, and then bounded by two assertions.

7.3 XML

    xml = \<{word}( _ {setting})* _ \>{text}?({xml}+ {text})*{xml}*\<\/{word}\>;
    setting = $({word}) _ \= _ ($({word})|\"$(((.@cstr)[cstr!="\""])*)\");
    text = $(((. @ cstr)[cstr!="\<"])*);
    .word = (\a | \_)(\w | \_)*;

This little hack represents the basic XML format, that is, a header and parameters in triangle brackets (<word setting ...>), followed by text and other xml tags, followed by a matching end tag (</word>).
7.4 Operator Binding Strength
In an earlier CFG example, we described regular expressions syntax as a CFG, using multiple nonterminals to achieve order-of-operations rules, or binding strength. Parameterized nonterminals give us another, potentially neater way to express these rules. Here is a version of the previous example that uses a few of our simple programming features.

    program = _ {rexp} _;
    .rexp = [arg=()] {rexp 1}
          | [arg ...

2. ... re" syntax, which could be shorthand for "[x=v1]re1 | [x!=v1](··· [x=vn]ren ([x!=vn]re))", but would be more efficient if implemented separately.

3. Guards: "[len(re) < 6]", "[re = exp]", shorthand for "(re @ x)[len(x) ...

...

    xml = \<{word}( _ {setting})* _ \>{inner} \<\/{word}\> : {
      inner = {text}?({xml}+ {text})*{xml}* : {
        text = $(((. @ cstr)[cstr!="\<"])*);
      };
      setting = $({word}) _ \= _ ($({word})|\"$(((.@cstr)[cstr!="\""])*)\");
      .word = (\a | \_)(\w | \_)*;
    };

Unlike some other potential modifications, this would only have the effect of making code easier to read, write, and update. It would not have a significant impact on performance.
9.5 Explicit Tree Construction
In the present system, tree construction is automatic and based on the way the nonterminals are parsed. The only control we have over tree structure from within the language is the somewhat awkward dot-notation to suppress expression of a nonterminal. There may be cases when we want to set aside parts of an rhs as a subtree without explicitly making a nonterminal for it. To add such control over tree construction, we can add the following constructor:
    rhs ::= ... | Label(rhs, string)

This could be written {label: rhs}; nonterminal call syntax could be replaced with <nont>; {nont} could be a shorthand for {nont: <nont>}; and dot-notation could be eliminated. Thus there would be an explicit way to signal subtree creation, as well as a way to show or not show nonterminal subtrees on a per-case basis.

A second addition would change the language quite dramatically, but may be a good way to incorporate the equivalent of YACC's "actions." Essentially, we could replace the constructor above with

    rhs ::= ... | Construct(exp)

So long as the expression language is rich enough, we can explicitly build an arbitrary structure and not rely on the parser's tree representation at all. If we are also allowed to include global type statements for the expression language, we can get something like the following code for building regular expressions:

    type rexp = Eps | Char(string) | Alt([rexp]) | Cat([rexp]) | Star(rexp);

    _ {rexp} _ : rexp =
    rexp_b = [arg ...