The Theory of Parsing, Translation, and Compiling, Volume I: Parsing, by Alfred V. Aho and Jeffrey D. Ullman
Description
This first volume of a comprehensive, self-contained, and up-to-date treatment of compiler theory emphasizes parsing and its theoretical framework. Chapters are devoted to mathematical preliminaries, an overview of compiling, elements of language theory, theory of translation, general parsing methods, one-pass no-backtrack parsing, and limited backtrack parsing algorithms. Also included are valuable appendices on the syntax for an extensible base language, for SNOBOL4 statements, for PL360, and for a PAL syntax-directed translation scheme.
Among the features: • Emphasizes principles that have broad applicability rather than details specific to a given language or machine. • Provides complete coverage of all important parsing algorithms, including the LL, LR, precedence, and Earley's methods. • Presents many examples to illustrate concepts and applications. Exercises at the end of each section (continued on back flap)
THE THEORY OF PARSING, TRANSLATION, AND COMPILING
Prentice-Hall Series in Automatic Computation
George Forsythe, editor
THE THEORY OF PARSING, TRANSLATION, AND COMPILING
VOLUME I: PARSING

ALFRED V. AHO
Bell Telephone Laboratories, Inc.
Murray Hill, N.J.

JEFFREY D. ULLMAN
Department of Electrical Engineering
Princeton University

PRENTICE-HALL, INC., ENGLEWOOD CLIFFS, N.J.
© 1972 by Bell Telephone Laboratories, Incorporated, and J. D. Ullman
All rights reserved. No part of this book may be reproduced in any form or by any means without permission in writing from the publisher.
10 9 8
ISBN: 0-13-914556-7 Library of Congress Catalog Card No. 72-1073
Printed in the United States of America
PRENTICE-HALL INTERNATIONAL, INC., London
PRENTICE-HALL OF AUSTRALIA, PTY. LTD., Sydney
PRENTICE-HALL OF CANADA, LTD., Toronto
PRENTICE-HALL OF INDIA PRIVATE LIMITED, New Delhi
PRENTICE-HALL OF JAPAN, INC., Tokyo
For Adrienne and Holly
PREFACE
This book is intended for a one or two semester course in compiling theory at the senior or graduate level. It is a theoretically oriented treatment of a practical subject. Our motivation for making it so is threefold. (1) In an area as rapidly changing as Computer Science, sound pedagogy demands that courses emphasize ideas, rather than implementation details. It is our hope that the algorithms and concepts presented in this book will survive the next generation of computers and programming languages, and that at least some of them will be applicable to fields other than compiler writing. (2) Compiler writing has progressed to the point where many portions of a compiler can be isolated and subjected to design optimization. It is important that appropriate mathematical tools be available to the person attempting this optimization. (3) Some of the most useful and most efficient compiler algorithms, e.g., LR(k) parsing, require a good deal of mathematical background for full understanding. We expect, therefore, that a good theoretical background will become essential for the compiler designer. While we have not omitted difficult theorems that are relevant to compiling, we have tried to make the book as readable as possible. Numerous examples are given, each based on a small grammar, rather than on the large grammars encountered in practice. It is hoped that these examples are sufficient to illustrate the basic ideas, even in cases where the theoretical developments are difficult to follow in isolation.
Use of the Book The notes from which this book derives were used in courses at Princeton University and Stevens Institute of Technology at both the senior and graduate levels. Both one and two semester courses have been taught from this book. In a one semester course, the course in compilers was preceded by a
course covering finite automata and context-free languages. It was therefore unnecessary to cover Chapters 0, 2, and 8. Most of the remaining chapters were covered in detail. In a two semester sequence, most of Volume I was covered in the first semester and most of Volume II, except for Chapter 8, in the second. In the two semester course more attention was devoted to proofs and proof techniques than in the one semester course. Some sections of the book are clearly more important than others, and we would like to give the reader some brief comments regarding our estimates of the relative importance of various parts of Volume I. As a general comment, it is probably wise to skip most of the proofs. We include proofs of all main results because we believe them to be necessary for maximum understanding of the subject. However, we suspect that many courses in compiling do not get this deeply into many topics, and reasonable understanding can be obtained with only a smattering of proofs. Chapters 0 (mathematical background) and 1 (overview of compiling) are almost all essential material, except possibly for Section 1.3, which covers applications of parsing other than to compilers. We believe that every concept and theorem introduced in Chapter 2 (language theory) finds use somewhere in the remaining nine chapters. However, some of the material can be skipped in a course on compilers. A good candidate for omission is the rather difficult material on regular expression equations in Section 2.2.1. One is then forced to omit some of the material on right linear grammars in Section 2.2.2 (although the equivalence between these and finite automata can be obtained in other ways) and the material on Rosenkrantz's method of achieving Greibach normal form in Section 2.4.5. The concepts of Chapter 3 (translation) are quite essential to the rest of the book. However, Section 3.2.3, on the hierarchy of syntax-directed translations, is rather difficult and can be omitted. We believe that Section 4.1 on backtracking methods of parsing is less vital than the tabular methods of Section 4.2. Most of Chapter 5 (single-pass parsing) is essential. We suggest that LL grammars (Section 5.1), LR grammars (Section 5.2), precedence grammars (Sections 5.3.2 and 5.3.4), and operator precedence grammars (Section 5.4.3) receive maximum priority. Other sections could be omitted if necessary. Chapter 6 (backtracking algorithms) is less essential than most of Chapter 5 or Section 4.2. If given a choice, we would cover Section 6.1 rather than 6.2.
Organization of the Book The entire work The Theory of Parsing, Translation, and Compiling appears in two volumes, Parsing (Chs. 0-6) and Compiling (Chs. 7-11). (The topics covered in the second volume are parser optimization, theory of deterministic
parsing, translation, bookkeeping, and code optimization.) The two volumes form an integrated work, with pages consecutively numbered, and with a bibliography and index for both volumes appearing in Volume II. Problems and bibliographical notes appear at the end of each section (numbered i.j). Except for open problems and research problems, we have used stars to indicate grades of difficulty. Singly starred problems require one significant insight for their solution. Doubly starred exercises require more than one such insight. It is recommended that a course based on this book be accompanied by a programming laboratory in which several compiler parts are designed and implemented. At the end of certain sections of this book appear programming exercises, which can be used as projects in such a programming laboratory.
Acknowledgements Many people have carefully read various parts of this manuscript and have helped us significantly in its preparation. Especially, we would like to thank David Benson, John Bruno, Stephen Chen, Matthew Geller, James Gimpel, Michael Harrison, Ned Horvath, Jean Ichbiah, Brian Kernighan, Douglas McIlroy, Robert Martin, Robert Morris, Howard Siegel, Leah Siegel, Harold Stone, and Thomas Szymanski, as well as referees Thomas Cheatham, Michael Fischer, and William McKeeman. We have also received important comments from many of the students who used these notes, among them Alan Demers, Nahed El Djabri, Matthew Hecht, Peter Henderson, Peter Maika, Thomas Peterson, Ravi Sethi, Kenneth Sills, and Steven Squires. Our thanks are also due for the excellent typing of the manuscript done by Hannah Kresse and Dorothy Luciani. In addition, we acknowledge the support services provided by Bell Telephone Laboratories during the preparation of the manuscript. The use of UNIX, an operating system for the PDP-11 computer designed by Dennis Ritchie and Kenneth Thompson, expedited the preparation of certain parts of this manuscript.

ALFRED V. AHO
JEFFREY D. ULLMAN
CONTENTS
PREFACE
0 MATHEMATICAL PRELIMINARIES 1
  0.1 Concepts from Set Theory 1
    0.1.1 Sets 1
    0.1.2 Operations on Sets 3
    0.1.3 Relations 5
    0.1.4 Closures of Relations 7
    0.1.5 Ordering Relations 9
    0.1.6 Mappings 10
    Exercises 11
  0.2 Sets of Strings 15
    0.2.1 Strings 15
    0.2.2 Languages 16
    0.2.3 Operations on Languages 17
    Exercises 18
  0.3 Concepts from Logic 19
    0.3.1 Proofs 19
    0.3.2 Proof by Induction 20
    0.3.3 Logical Connectives 21
    Exercises 22
    Bibliographic Notes 25
  0.4 Procedures and Algorithms 25
    0.4.1 Procedures 25
    0.4.2 Algorithms 27
    0.4.3 Recursive Functions 28
    0.4.4 Specification of Procedures 29
    0.4.5 Problems 29
    0.4.6 Post's Correspondence Problem 32
    Exercises 33
    Bibliographic Notes 36
  0.5 Concepts from Graph Theory 37
    0.5.1 Directed Graphs 37
    0.5.2 Directed Acyclic Graphs 39
    0.5.3 Trees 40
    0.5.4 Ordered Graphs 41
    0.5.5 Inductive Proofs Involving Dags 43
    0.5.6 Linear Orders from Partial Orders 43
    0.5.7 Representations for Trees 45
    0.5.8 Paths Through a Graph 47
    Exercises 50
    Bibliographic Notes 52

1 AN INTRODUCTION TO COMPILING 53
  1.1 Programming Languages 53
    1.1.1 Specification of Programming Languages 53
    1.1.2 Syntax and Semantics 55
    Bibliographic Notes 57
  1.2 An Overview of Compiling 58
    1.2.1 The Portions of a Compiler 58
    1.2.2 Lexical Analysis 59
    1.2.3 Bookkeeping 62
    1.2.4 Parsing 63
    1.2.5 Code Generation 65
    1.2.6 Code Optimization 70
    1.2.7 Error Analysis and Recovery 72
    1.2.8 Summary 73
    Exercises 75
    Bibliographic Notes 76
  1.3 Other Applications of Parsing and Translating Algorithms 77
    1.3.1 Natural Languages 78
    1.3.2 Structural Description of Patterns 79
    Bibliographic Notes 82

2 ELEMENTS OF LANGUAGE THEORY 83
  2.1 Representations for Languages 83
    2.1.1 Motivation 84
    2.1.2 Grammars 84
    2.1.3 Restricted Grammars 91
    2.1.4 Recognizers 93
    Exercises 96
    Bibliographic Notes 102
  2.2 Regular Sets, Their Generators, and Their Recognizers 103
    2.2.1 Regular Sets and Regular Expressions 103
    2.2.2 Regular Sets and Right-Linear Grammars 110
    2.2.3 Finite Automata 112
    2.2.4 Finite Automata and Regular Sets 118
    2.2.5 Summary 120
    Exercises 121
    Bibliographic Notes 124
  2.3 Properties of Regular Sets 124
    2.3.1 Minimization of Finite Automata 124
    2.3.2 The Pumping Lemma for Regular Sets 128
    2.3.3 Closure Properties of Regular Sets 129
    2.3.4 Decidable Questions About Regular Sets 130
    Exercises 132
    Bibliographic Notes 138
  2.4 Context-Free Languages 138
    2.4.1 Derivation Trees 139
    2.4.2 Transformations on Context-Free Grammars 143
    2.4.3 Chomsky Normal Form 151
    2.4.4 Greibach Normal Form 153
    2.4.5 An Alternative Method of Achieving Greibach Normal Form 159
    Exercises 163
    Bibliographic Notes 166
  2.5 Pushdown Automata 167
    2.5.1 The Basic Definition 167
    2.5.2 Variants of Pushdown Automata 172
    2.5.3 Equivalence of PDA Languages and CFL's 176
    2.5.4 Deterministic Pushdown Automata 184
    Exercises 190
    Bibliographic Notes 192
  2.6 Properties of Context-Free Languages 192
    2.6.1 Ogden's Lemma 192
    2.6.2 Closure Properties of CFL's 196
    2.6.3 Decidability Results 199
    2.6.4 Properties of Deterministic CFL's 201
    2.6.5 Ambiguity 202
    Exercises 207
    Bibliographic Notes 211

3 THEORY OF TRANSLATION 212
  3.1 Formalisms for Translations 213
    3.1.1 Translation and Semantics 213
    3.1.2 Syntax-Directed Translation Schemata 215
    3.1.3 Finite Transducers 223
    3.1.4 Pushdown Transducers 227
    Exercises 233
    Bibliographic Notes 237
  3.2 Properties of Syntax-Directed Translations 238
    3.2.1 Characterizing Languages 238
    3.2.2 Properties of Simple SDT's 243
    3.2.3 A Hierarchy of SDT's 243
    Exercises 250
    Bibliographic Notes 251
  3.3 Lexical Analysis 251
    3.3.1 An Extended Language for Regular Expressions 252
    3.3.2 Indirect Lexical Analysis 254
    3.3.3 Direct Lexical Analysis 258
    3.3.4 Software Simulation of Finite Transducers 260
    Exercises 261
    Bibliographic Notes 263
  3.4 Parsing 263
    3.4.1 Definition of Parsing 263
    3.4.2 Top-Down Parsing 264
    3.4.3 Bottom-Up Parsing 268
    3.4.4 Comparison of Top-Down and Bottom-Up Parsing 271
    3.4.5 Grammatical Covering 275
    Exercises 277
    Bibliographic Notes 280

4 GENERAL PARSING METHODS 281
  4.1 Backtrack Parsing 282
    4.1.1 Simulation of a PDT 282
    4.1.2 Informal Top-Down Parsing 285
    4.1.3 The Top-Down Parsing Algorithm 289
    4.1.4 Time and Space Complexity of the Top-Down Parser 297
    4.1.5 Bottom-Up Parsing 301
    Exercises 307
    Bibliographic Notes 313
  4.2 Tabular Parsing Methods 314
    4.2.1 The Cocke-Younger-Kasami Algorithm 314
    4.2.2 The Parsing Method of Earley 320
    Exercises 330
    Bibliographic Notes 332

5 ONE-PASS NO BACKTRACK PARSING 333
  5.1 LL(k) Grammars 334
    5.1.1 Definition of LL(k) Grammar 334
    5.1.2 Predictive Parsing Algorithms 338
    5.1.3 Implications of the LL(k) Definition 342
    5.1.4 Parsing LL(1) Grammars 345
    5.1.5 Parsing LL(k) Grammars 348
    5.1.6 Testing for the LL(k) Condition 356
    Exercises 361
    Bibliographic Notes 368
  5.2 Deterministic Bottom-Up Parsing 368
    5.2.1 Deterministic Shift-Reduce Parsing 368
    5.2.2 LR(k) Grammars 371
    5.2.3 Implications of the LR(k) Definition 380
    5.2.4 Testing for the LR(k) Condition 391
    5.2.5 Deterministic Right Parsers for LR(k) Grammars 392
    5.2.6 Implementation of LL(k) and LR(k) Parsers 396
    Exercises 396
    Bibliographic Notes 399
  5.3 Precedence Grammars 399
    5.3.1 Formal Shift-Reduce Parsing Algorithms 400
    5.3.2 Simple Precedence Grammars 403
    5.3.3 Extended Precedence Grammars 410
    5.3.4 Weak Precedence Grammars 415
    Exercises 424
    Bibliographic Notes 426
  5.4 Other Classes of Shift-Reduce Parsable Grammars 426
    5.4.1 Bounded-Right-Context Grammars 427
    5.4.2 Mixed Strategy Precedence Grammars 435
    5.4.3 Operator Precedence Grammars 439
    5.4.4 Floyd-Evans Production Language 443
    5.4.5 Chapter Summary 448
    Exercises 450
    Bibliographic Notes 455

6 LIMITED BACKTRACK PARSING ALGORITHMS 456
  6.1 Limited Backtrack Top-Down Parsing 456
    6.1.1 TDPL 457
    6.1.2 TDPL and Deterministic Context-Free Languages 466
    6.1.3 A Generalization of TDPL 469
    6.1.4 Time Complexity of GTDPL Languages 473
    6.1.5 Implementation of GTDPL Programs 476
    Exercises 482
    Bibliographic Notes 485
  6.2 Limited Backtrack Bottom-Up Parsing 485
    6.2.1 Noncanonical Parsing 485
    6.2.2 Two-Stack Parsers 487
    6.2.3 Colmerauer Precedence Relations 490
    6.2.4 Test for Colmerauer Precedence 493
    Exercises 499
    Bibliographic Notes 500

APPENDIX 501
  A.1 Syntax for an Extensible Base Language 501
  A.2 Syntax of SNOBOL4 Statements 505
  A.3 Syntax for PL360 507
  A.4 A Syntax-Directed Translation Scheme for PAL 512

BIBLIOGRAPHY 519

INDEX TO LEMMAS, THEOREMS, AND ALGORITHMS 531

INDEX TO VOLUME I 533
0 MATHEMATICAL PRELIMINARIES
To speak clearly and accurately we need a precise and well-defined language. This chapter describes the language that we shall use to discuss parsing, translation, and the other topics to be covered in this book. This language is primarily elementary set theory with some rudimentary concepts from graph theory and logic included. For readers having background in these areas, Chapter 0 can be easily skimmed and treated as a reference for notation and definitions.
0.1. CONCEPTS FROM SET THEORY

This section will briefly review some of the most basic concepts from set theory: relations, functions, orderings, and the usual operations on sets.

0.1.1. Sets
In what follows, we assume that there are certain objects, referred to as atoms. The term atom will be a rudimentary concept, which is just another way of saying that the term atom will be left undefined, and what we choose to call an atom depends on our domain of discourse. Many times it is convenient to consider integers or letters of an alphabet to be atoms. We also postulate an abstract notion of membership. If a is a member of A, we write a ∈ A. The negation of this statement is written a ∉ A. We assume that if a is an atom, then it has no member; i.e., x ∉ a for all x in the domain of discourse. We shall also use certain primitive objects, called sets, which are not atoms. If A is a set, then its members or elements are those objects a (not
necessarily atoms) such that a ∈ A. Each member of a set is either an atom or another set. We assume each member of a set appears exactly once in that set. If A has a finite number of members, then A is a finite set, and we often write A = {a₁, a₂, ..., aₙ} if a₁, ..., aₙ are all the members of A and aᵢ ≠ aⱼ for i ≠ j. Note that order is unimportant. We could also write A = {aₙ, ..., a₁}, for example. We reserve the symbol ∅ for the empty set, the set which has no members. Note that an atom also has no members, but ∅ is not an atom, and no atom is ∅. The statement #A = n means that set A has n members.

Example 0.1
Let the nonnegative integers be atoms. Then A = {1, {2, 3}, 4} is a set. A's members are 1, {2, 3}, and 4. The member {2, 3} of A is also a set. Its members are 2 and 3. However, the atoms 2 and 3 are not members of A itself. We could equivalently have written A = {4, 1, {3, 2}}. Note that #A = 3. □
A useful way of defining sets is by means of a predicate, a statement involving one or more unknowns which has one of two values, true or false. The set defined by a predicate consists of exactly those elements for which the predicate is true. However, we must be careful what predicate we choose to define a set, or we may attempt to define a set that could not possibly exist.

Example 0.2
The phenomenon alluded to above is known as Russell's paradox. Let P(X) be the predicate "X is not a member of itself"; i.e., X ∉ X. Then we might think that we could define the set Y of all X such that P(X) was true; i.e., Y consists of exactly those sets that are not members of themselves. Since most common sets seem not to be members of themselves, it is tempting to suppose that set Y exists. But if Y exists, we should be able to answer the question, "Is Y a member of itself?" But this leads to an impossible situation. If Y ∈ Y, then P(Y) is false, and Y is not a member of itself, by definition of Y. Hence, it is not possible that Y ∈ Y. Conversely, suppose that Y ∉ Y. Then, by definition of Y again, Y ∈ Y. We see that Y ∈ Y implies Y ∉ Y and that Y ∉ Y implies Y ∈ Y. Since either Y ∈ Y or Y ∉ Y is true, both are true, a situation which we shall assume is impossible. One "way out" is to accept that set Y does not exist. □
The normal way to avoid Russell's paradox is to define sets only by those predicates P(X) of the form "X is in A and P₁(X)," where A is a known set and P₁ is an arbitrary predicate. If the set A is understood, we shall just write P₁(X) for "X is in A and P₁(X)."
If P(X) is a predicate, then we denote the set of objects X for which P(X) is true by {X | P(X)}.

Example 0.3
Let P(X) be the predicate "X is a nonnegative even integer." That is, P(X) is "X is in the set of integers and P₁(X)," where P₁(X) is the predicate "X is even." Then A = {X | P(X)} is the set which is often written {0, 2, 4, ..., 2n, ...}. Colloquially, we can assume that the set of nonnegative integers is understood, and write A = {X | X is even}. □
We have glossed over a great deal of development called axiomatic set theory. The interested reader is referred to Halmos [1960] or Suppes [1960] (see the Bibliography at the end of Volume I) for a more complete treatment of this subject.

DEFINITION
We say that set A is included in set B, written A ⊆ B, if every element of A is also an element of B. Sometimes we say that B includes A, written B ⊇ A, if A ⊆ B. In either case, A is said to be a subset of B, and B a superset of A. If B contains an element not contained in A and A ⊆ B, then we say that A is properly included in B, written A ⊊ B (or B properly includes A, written B ⊋ A). We can also say that A is a proper subset of B or that B is a proper superset of A. Two sets A and B are equal if and only if A ⊆ B and B ⊆ A.
A picture called a Venn diagram is often used to graphically describe set membership and inclusion. Figure 0.1 shows a Venn diagram for the relation A ⊆ B.

Fig. 0.1 Venn diagram of set inclusion: A ⊆ B.
0.1.2. Operations on Sets
There are several basic operations on sets which can be used to construct new sets.
DEFINITION
Let A and B be sets. The union of A and B, written A ∪ B, is the set containing all elements in A together with all elements in B. Formally, A ∪ B = {x | x ∈ A or x ∈ B}.† The intersection of A and B, written A ∩ B, is the set of all elements that are in both A and B. Formally, A ∩ B = {x | x ∈ A and x ∈ B}. The difference of A and B, written A − B, is the set of all elements in A that are not in B. If A = U, the set of all elements under consideration (the universal set, as it is sometimes called), then U − B is often written B̄ and called the complement of B. Note that we have referred to the universal set as the set of all objects "under consideration." We must be careful to be sure that U exists. For example, if we choose U to be "the set of all sets," then we would have Russell's paradox again. Also, note that B̄ is not well defined unless we assume that complementation with respect to some known universe is implied. In general, A − B = A ∩ B̄. Venn diagrams for these set operations are shown in Fig. 0.2.
Fig. 0.2 Venn diagrams of set operations: (a) A ∪ B; (b) A ∩ B; (c) A − B.

If A ∩ B = ∅, then A and B are said to be disjoint.

DEFINITION
If I is some (indexing) set such that Aᵢ is a known set for each i in I, then we write ⋃_{i∈I} Aᵢ for {X | there exists i ∈ I such that X ∈ Aᵢ}. Since I may not
be finite, this definition is an extension of the union of two sets. If I is defined by a predicate P(i), we sometimes write ⋃_{P(i)} Aᵢ for ⋃_{i∈I} Aᵢ. For example, ⋃_{i>2} Aᵢ means A₃ ∪ A₄ ∪ A₅ ∪ ⋯.
†Note that we may not have a set guaranteed to include A ∪ B, so this use of predicate definition appears questionable. In axiomatic set theory, the existence of A ∪ B is taken to be an axiom.
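As a concrete aside (not part of the original text), the following Python sketch illustrates these operations on small finite sets; the particular sets A, B, U, and the indexed family are invented for the example.

```python
# Illustrative sketch: the set operations of Section 0.1.2 on small finite sets.
A = {1, 2, 3}
B = {3, 4}
U = {0, 1, 2, 3, 4, 5}                        # a universal set fixed in advance

print(A | B)                                  # union: {1, 2, 3, 4}
print(A & B)                                  # intersection: {3}
print(A - B)                                  # difference: {1, 2}
print(U - B)                                  # complement of B with respect to U

# An indexed union, here over I = {i | i > 2} for a small family A_i.
family = {3: {3, 6}, 4: {4, 8}, 5: {5, 10}}   # the sets A_3, A_4, A_5
I = {i for i in family if i > 2}
print(set().union(*(family[i] for i in I)))   # A_3 together with A_4 and A_5
```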
DEFINITION
Let A be a set. The power set of A, written ℘(A) or sometimes 2^A, is the set of all subsets of A. That is, ℘(A) = {B | B ⊆ A}.†

Example 0.4
Let A = {1, 2}. Then ℘(A) = {∅, {1}, {2}, {1, 2}}. As another example, ℘(∅) = {∅}. □
In general, if A is a finite set of m members, ℘(A) has 2^m members. The empty set is a member of ℘(A) for every A.
We have observed that the members of a set are considered to be unordered. It is often convenient to have ordered pairs of objects available for discourse. We thus make the following definition.

DEFINITION
Let a and b be objects. Then (a, b) denotes the ordered pair consisting of a and b in that order. We say that (a, b) = (c, d) if and only if a = c and b = d. In contrast, {a, b} = {b, a}.
Ordered pairs can be considered sets if we define (a, b) to be the set {a, {a, b}}. It is left to the Exercises to show that {a, {a, b}} = {c, {c, d}} if and only if a = c and b = d. Thus this definition is consistent with what we regard to be the fundamental property of ordered pairs.
DEFINITION
The Cartesian product of sets A and B, denoted A × B, is {(a, b) | a ∈ A and b ∈ B}.

Example 0.5
Let A = {1, 2} and B = {2, 3, 4}. Then A × B = {(1, 2), (1, 3), (1, 4), (2, 2), (2, 3), (2, 4)}. □
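As an editorial illustration (not in the original), the following Python sketch computes the power set of Example 0.4 and the Cartesian product of Example 0.5; frozenset is used only so that sets can themselves be set members.

```python
# Illustrative sketch: power set (Example 0.4) and Cartesian product (Example 0.5).
from itertools import combinations, product

def power_set(A):
    """Return P(A) as a set of frozensets."""
    elems = list(A)
    return {frozenset(c) for r in range(len(elems) + 1) for c in combinations(elems, r)}

A = {1, 2}
B = {2, 3, 4}
print(power_set(A))        # the four subsets of A (2**2 members)
print(set(product(A, B)))  # A x B, the six ordered pairs listed in Example 0.5
```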
0.1.3. Relations
Many common mathematical concepts, such as membership, set inclusion, and arithmetic "less than" (<), are referred to as relations. We shall give a formal definition of the concept and see how common examples of relations fit the formal definition.
†The existence of the power set of any set is an axiom of set theory. The other set-defining axioms, in addition to the power set axiom and the union axiom previously mentioned, are: (1) If A is a set and P a predicate, then {X | P(X) and X ∈ A} is a set. (2) If X is an atom or set, then {X} is a set. (3) If A is a set, then {X | for some Y, we have X ∈ Y and Y ∈ A} is a set.
DEFINITION Let A and B be sets. A relation from A to B is any subset of A x B. If A = B, we say that the relation is on A. If R is a relation from A to B, we write a R b whenever (a, b) is in R. We call A the domain of R, and B the range of R. Example 0.6
Let A be the set of integers. The relation < is {(a, b) | a is less than b}. We thus write a < b exactly when we would expect to do so. □

DEFINITION
The relation {(b, a) | (a, b) ∈ R} is called the inverse of R and is often denoted R⁻¹.
A relation is a very general concept. Often a relation may possess certain properties to which special names have been given.

DEFINITION
Let A be a set and R a relation on A. We say that R is
(1) Reflexive if a R a for all a in A,
(2) Symmetric if "a R b" implies "b R a" for a, b in A, and
(3) Transitive if "a R b and b R c" implies "a R c" for a, b, c in A.
The elements a, b, and c need not be distinct.
Relations obeying these three properties occur frequently and have additional properties as a consequence. The term equivalence relation is used to describe a relation which is reflexive, symmetric, and transitive. An important property of equivalence relations is that an equivalence relation R on a set A partitions A into disjoint subsets called equivalence classes. For each element a in A we define [a], the equivalence class of a, to be the set {b | a R b}.

Example 0.7
Consider the relation of congruence modulo N on the nonnegative integers. We say that a ≡ b mod N (read "a is congruent to b modulo N") if there is an integer k such that a − b = kN. As a specific case let us take N = 3. Then the set {0, 3, 6, ..., 3n, ...} forms an equivalence class, since 3n ≡ 3m mod 3 for all integer values of m and n. We shall use [0] to denote this class. We could have used [3] or [6] or [3n], since any element of an equivalence class can be used as a representative of that class. The two other equivalence classes under the relation congruence modulo 3 are
[1] = {1, 4, 7, ..., 3n + 1, ...}
[2] = {2, 5, 8, ..., 3n + 2, ...}
The union of the three sets [0], [1], and [2] is the set of all nonnegative integers. Thus we have partitioned the set of all nonnegative integers into the three disjoint equivalence classes [0], [1], and [2] by means of the equivalence relation congruence modulo 3 (Fig. 0.3). □
Fig. 0.3 Equivalence classes for congruence modulo 3.
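The following Python sketch (an illustrative aside, not part of the original text) builds congruence modulo 3 on a small initial segment of the nonnegative integers, checks the three defining properties, and recovers the classes [0], [1], and [2].

```python
# Illustrative sketch: congruence modulo 3 on A = {0, ..., 11} as an equivalence relation.
A = set(range(12))
R = {(a, b) for a in A for b in A if (a - b) % 3 == 0}

reflexive  = all((a, a) in R for a in A)
symmetric  = all((b, a) in R for (a, b) in R)
transitive = all((a, c) in R for (a, b) in R for (b2, c) in R if b == b2)
print(reflexive, symmetric, transitive)            # True True True

def eq_class(a):
    """[a] = {b | a R b}."""
    return frozenset(b for b in A if (a, b) in R)

print({eq_class(a) for a in A})                    # exactly three classes: [0], [1], [2]
```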
The index of an equivalence relation on a set A is the number of equivalence classes into which A is partitioned. The following theorem about equivalence relations is left as an exercise. THEOREM 0.1
Let R be an equivalence relation on A. Then for all a and b in A, either [a] = [b] or [a] and [b] are disjoint.
Proof. Exercise. □

0.1.4. Closures of Relations
Given a relation R, we often need to find another relation R′, which includes R and has certain additional properties, e.g., transitivity. Moreover, we would generally like R′ to be as "small" as possible, that is, to be a subset of every other relation including R which has the desired properties. Of course, the "smallest" relation may not be unique if the additional properties are strange. However, for the common properties mentioned in the previous section, we can often find a unique superset of a given relation with these as additional properties. Some specific cases follow.

DEFINITION
The k-fold product of a relation R (on A), denoted Rᵏ, is defined as follows:
(1) a R¹ b if and only if a R b;
(2) a Rⁱ b if and only if there exists c in A such that a R c and c Rⁱ⁻¹ b, for i > 1.
This is an example of a recursive definition, a method of definition we shall use many times. To examine the recursive aspect of this definition,
suppose that a R⁴ b. Then by (2) there is a c₁ such that a R c₁ and c₁ R³ b. Applying (2) again, there is a c₂ such that c₁ R c₂ and c₂ R² b. One more application of (2) says that there is a c₃ such that c₂ R c₃ and c₃ R¹ b. Now we can apply (1) and say that c₃ R b. Thus, if a R⁴ b, then there exists a sequence of elements c₁, c₂, c₃ in A such that a R c₁, c₁ R c₂, c₂ R c₃, and c₃ R b.
The transitive closure of a relation R on a set A will be denoted R⁺. We define a R⁺ b if and only if a Rⁱ b for some i ≥ 1. We shall see that R⁺ is the smallest transitive relation that includes R. We could have alternatively defined R⁺ by saying that a R⁺ b if there exists a sequence c₁, c₂, ..., cₙ of zero or more elements in A such that a R c₁, c₁ R c₂, ..., cₙ₋₁ R cₙ, cₙ R b. If n = 0, a R b is meant.
The reflexive and transitive closure of a relation R on a set A is denoted R* and is defined as follows:
(1) a R* a for all a in A;
(2) a R* b if a R⁺ b;
(3) Nothing is in R* unless its being there follows from (1) or (2).
If we define R⁰ by saying a R⁰ b if and only if a = b, then a R* b if and only if a Rⁱ b for some i ≥ 0. The only difference between R⁺ and R* is that a R* a is true for all a in A but a R⁺ a may or may not be true. R* is the smallest reflexive and transitive relation that includes R. In Section 0.5.8 we shall examine methods of computing the reflexive and transitive closure of a relation efficiently. We would like to prove here that R⁺ and R* are the smallest supersets of R with the desired properties.
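Since R⁺ and R* figure prominently later in the book (Section 0.5.8 discusses efficient methods), a small Python sketch may help fix the definitions; it is an editorial illustration, and the naive iteration below is not the efficient algorithm referred to there.

```python
# Illustrative sketch: R+ and R* for a relation on a finite set, by repeated composition.
def transitive_closure(R):
    closure = set(R)
    while True:
        new_pairs = {(a, d) for (a, b) in closure for (c, d) in closure if b == c}
        if new_pairs <= closure:
            return closure
        closure |= new_pairs

A = {0, 1, 2}
R = {(0, 1), (1, 2)}                        # the relation of Exercise 0.1.11
R_plus = transitive_closure(R)
R_star = R_plus | {(a, a) for a in A}       # add all reflexive pairs
print(R_plus)                               # {(0, 1), (1, 2), (0, 2)}
print(R_star)                               # R+ together with (0, 0), (1, 1), (2, 2)
```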
THEOREM 0.2
If R⁺ and R* are, respectively, the transitive and reflexive-transitive closure of R as defined above, then
(a) R⁺ is transitive; if R′ is any transitive relation such that R ⊆ R′, then R⁺ ⊆ R′.
(b) R* is reflexive and transitive; if R′ is any reflexive and transitive relation such that R ⊆ R′, then R* ⊆ R′.

Proof. We prove only (a); (b) is left as an exercise. First, to show that R⁺ is transitive, we must show that if a R⁺ b and b R⁺ c, then a R⁺ c. Since a R⁺ b, there exists a sequence of elements d₁, ..., dₙ such that d₁ R d₂, ..., dₙ₋₁ R dₙ, where d₁ = a and dₙ = b. Since b R⁺ c, we can find e₁, ..., eₘ such that e₁ R e₂, ..., eₘ₋₁ R eₘ, where e₁ = b = dₙ and eₘ = c. Applying the definition of R⁺ m + n times, we conclude that a R⁺ c.
Now we shall show that R⁺ is the smallest transitive relation that includes R. Let R′ be any transitive relation such that R ⊆ R′. We must show that R⁺ ⊆ R′. Thus let (a, b) be in R⁺; i.e., a R⁺ b. Then there is a sequence
c₁, ..., cₙ such that a = c₁, b = cₙ, and cᵢ R cᵢ₊₁ for 1 ≤ i < n. Since R ⊆ R′, we have cᵢ R′ cᵢ₊₁ for 1 ≤ i < n. Since R′ is transitive, repeated application of the definition of transitivity yields c₁ R′ cₙ; i.e., a R′ b. Since (a, b) is an arbitrary member of R⁺, we have shown that every member of R⁺ is also a member of R′. Thus, R⁺ ⊆ R′, as desired. □

0.1.5. Ordering Relations
An important class of relations on sets are the ordering relations. In general, an ordering on a set A is any transitive relation on A. In the study of algorithms, a special type of ordering, called a partial order, is particularly important.

DEFINITION
A partial order on a set A is a relation R on A such that
(1) R is transitive, and
(2) For all a in A, a R a is false. (That is, R is irreflexive.)
From properties (1) and (2) of a partial order it follows that if a R b, then b R a is false. This is called the asymmetric property.

Example 0.8
An example of a partial order is proper inclusion of sets. For example, let S = {e₁, ..., eₙ} be a set of n elements and let A = ℘(S). There are 2ⁿ elements in A. Then define a R b if and only if a ⊊ b, for all a, b in A. R is a partial order. If S = {0, 1, 2}, then Fig. 0.4 graphically depicts this partial order. Set S₁ properly includes set S₂ if and only if there is a path downward from S₁ to S₂. □
2}
Fig. 0.4 A partial order.
10
MATHEMATICALPRELIMINARIES
CHAP. 0
DEFINITION
A reflexive partial order on a set A is a relation R such that (1) R is transitive, (2) R is reflexive, and (3) If a R b and b R a, then a = b. This property is called antisymmetry. An example of a reflexive partial order would be (not necessarily proper) inclusion of sets. In Section 0.5 we shall show that every partial order can be graphically displayed in terms of a structure called a directed acyclic graph. An important special case of a partial order is linear order (sometimes called a total order). DEFINITION
A linear order R on a set A is a partial order such that if a and b are in A, then either a R b, b R a, or a = b. If A is a finite set, then one convenient representation for the linear order R is to display the elements of A as a sequence a 1, a 2 , . . . , a, such that at R aj if and only if i < j, where A = [ a l , . . . , a,}. We can also define a reflexive linear order analogously. That is, R is a reflexive linear order on A if R is a reflexive partial order such that for all a and b in A, either a R b or b R a. For example, the relation < (less than) on the nonnegative integers is a linear order. The relation < is a reflexive linear order. 0.1.6.
Mappings
One important kind of relation that we shall be using is known as a mapping. DEFINITION
A mapping (also function or transformation) M from a set A to a set B is a relation from A to B such that if (a, b) and (a, c) are in M, then b = c. If (a, b) is in M, we shall often write M ( a ) = b. We say that M(a) is defined if there exists b in B such that (a, b) is in M. If m(a) is defined for all a in A, we shall say that M is total. If we wish to emphasize that M may not be defined for all a in A, we shall say that M is a partial mapping (function) from A to B. In either case, we write M : A ~ B. We call A and B the domain and range of M, respectively. If M " A ~ B is a mapping having the property that for each b in B there is at most one a in A such that M(a)~- b, then M is an injection (oneto-one mapping) from A into B. If M is a total mapping such that for each b in B there is exactly one a in A such that M(a) = b, then M is a bijection (one-to-one correspondence) between A and B. If M " A ~ B is an injection, then we can find the inverse mapping
EXERCISES
11
M-1 : B ---~ A such that M-l(b) = a if and only if M(a) = b. If there exists b in B for which there is no a in A such that M(a) = b, then M-~ will be a partial function. The notion of a bijection is used to define the cardinality of a set, which, informally speaking, denotes the number of elements the set contains. DEFINITION Two sets A and B are of equal cardinality if there is a bijection M from
AtoB. Example 0.9
{0, 1, 2} and {a, b, c} are of equal cardinality. To prove this, use, for example, the bijection M = {(0, a), (1, b), (2, c)}. The set of integers is equal in cardinality to the set of even integers, even though the latter is a proper subset of the former. A bijection we can use to prove this would be {(i, 2i)1i is an integer}. E] We can now define precisely what we mean by a finite and infinite set.I DEFINITION A set S isfinite if it is equal in cardinality to the set {1, 2 , . . . , n} for some integer n. A set is infinite if it is equal in cardinality to a proper subset of itself. A set is countable if it is equal in cardinality to the set of positive integers. It follows from Example 0.9 that every countable set is infinite. An infinite set that is not countable is called uncountable. Examples of countable sets are (1) The set of all positive and negative integers, (2) The set of even integers, and (3) {(a, b) la and b are integers}. Examples of uncountable sets are (1) The set of real numbers, (2) The set of all mappings from the integers to the integers, and (3) The set of all subsets of the positive integers.
EXERCISES
0.1.1. Write out the sets defined by the following predicates. Assume that A = {0, 1, 2, 3, 4, 5, 6}.
(a) {X | X is in A and X is even}.
(b) {X | X is in A and X is a perfect square}.
(c) {X | X is in A and X ≥ X² + 1}.
0.1.2. Let A = {0, 1, 2} and B = {0, 3, 4}. Write out
(a) A ∪ B. (b) A ∩ B. (c) A − B. (d) ℘(A). (e) A × B.
0.1.3. Show that if A is a set with n elements, then ℘(A) has 2ⁿ elements.
0.1.4. Let A and B be sets and let U be some universal set with respect to which complements are taken. Show that
(a) the complement of A ∪ B is Ā ∩ B̄.
(b) the complement of A ∩ B is Ā ∪ B̄.
These two identities are referred to as De Morgan's laws.
0.1.5. Show that there does not exist a set U such that for all sets A, A ∈ U. Hint: Consider Russell's paradox.
0.1.6. Give an example of a relation on a set which is
(a) Reflexive but not symmetric or transitive.
(b) Symmetric but not reflexive or transitive.
(c) Transitive but not reflexive or symmetric.
In each case specify the set on which the relation is defined.
0.1.7. Give an example of a relation on a set which is
(a) Reflexive and symmetric but not transitive.
(b) Reflexive and transitive but not symmetric.
(c) Symmetric and transitive but not reflexive.
Warning: Do not be misled into believing that a relation which is symmetric and transitive must be reflexive (since a R b and b R a implies a R a).
0.1.8. Show that the following relations are equivalence relations:
(a) {(a, a) | a ∈ A}.
(b) Congruence on the set of triangles.
0.1.9. Let R be an equivalence relation on a set A. Let a and b be in A. Show that
(a) [a] = [b] if and only if a R b.
(b) [a] ∩ [b] = ∅ if and only if a R b is false.†
†By "a R b is false," we mean that (a, b) ∉ R.
0.1.10. Let A be a finite set. What equivalence relations on A induce the largest and smallest number of equivalence classes?
0.1.11. Let A = {0, 1, 2} and R = {(0, 1), (1, 2)}. Find R* and R⁺.
0.1.12. Prove Theorem 0.2(b).
0.1.13. Let R be a relation on A. Show that there is a unique relation Rₑ such that (1) R ⊆ Rₑ, (2) Rₑ is an equivalence relation on A, and (3) if R′ is any equivalence relation on A such that R ⊆ R′, then Rₑ ⊆ R′. Rₑ is called the least equivalence relation containing R.
DEFINITION
A well order on a set A is a reflexive partial order R on A such that for each nonempty subset B ⊆ A there exists b in B such that b R a for all a in B (i.e., each nonempty subset contains a smallest element).
0.1.14. Show that ≤ (less than or equal to) is a well order on the positive integers.
DEFINITION
Let A be a set. Define (1) A¹ = A, and (2) Aⁱ = Aⁱ⁻¹ × A for i > 1. Let A⁺ denote ⋃_{i≥1} Aⁱ.
0.1.15. Let R be a well order on A. Define R̂ on A⁺ by: (a₁, ..., aₘ)† R̂ (b₁, ..., bₙ) if and only if either
(1) for some i ≤ m, aⱼ = bⱼ for 1 ≤ j < i, aᵢ ≠ bᵢ, and aᵢ R bᵢ, or
(2) m ≤ n and aᵢ = bᵢ for all i, 1 ≤ i ≤ m.
Show that R̂ is a well order on A⁺. We call R̂ a lexicographic order on A⁺. (The ordering of words in a dictionary is an example of a lexicographic order.)
†Strictly speaking, (a₁, ..., aₘ) means (((⋯(a₁, a₂), a₃), ⋯), aₘ), according to the definition of Aᵐ.
0.1.16. State whether each of the following are partial orders, reflexive partial orders, linear orders, or reflexive linear orders:
(a) ⊊ on ℘(A).
(b) ⊆ on ℘(A).
(c) The relation R₁ on the set H of human beings defined by a R₁ b if and only if a is the father of b.
(d) The relation R₂ on H given by a R₂ b if and only if a is an ancestor of b.
(e) The relation R₃ on H defined by a R₃ b if and only if a is older than b.
0.1.17. Let R₁ and R₂ be relations. The composition of R₁ and R₂, denoted R₁ ∘ R₂, is {(a, b) | for some c, a R₁ c and c R₂ b}. Show that if R₁ and R₂ are mappings, then R₁ ∘ R₂ is a mapping. Under what conditions will R₁ ∘ R₂ be a total mapping? An injection? A bijection?
0.1.18. Let A be a finite set and let B ⊆ A. Show that if M: A → B is a bijection, then A = B.
0.1.19. Let A and B have m and n elements, respectively. Show that there are nᵐ total functions from A to B. How many (not necessarily total) functions from A to B are there?
*0.1.20. Let A be an arbitrary (not necessarily finite) set. Show that the sets ℘(A) and {M | M is a total function from A to {0, 1}} are of equal cardinality.
0.1.21. Show that the set of all integers is equal in cardinality to
(a) The set of primes.
(b) The set of pairs of integers.
Hint: Define a linear order on the set of pairs of integers by (i₁, j₁) R (i₂, j₂) if and only if i₁ + j₁ < i₂ + j₂, or i₁ + j₁ = i₂ + j₂ and i₁ < i₂.
0.1.22. Set A is "larger than" B if A and B are of different cardinality but B is equal in cardinality to a subset of A. Show that the set of real numbers between 0 and 1, exclusive, is larger than the set of integers. Hint: Represent real numbers by unique decimal expansions. In contradiction, suppose that the two sets in question were of equal cardinality. Then we could find a sequence of real numbers r₁, r₂, ... which included all real numbers r, 0 < r < 1. Can you find a real number r between 0 and 1 which differs in the ith decimal place from rᵢ for all i?
*0.1.23. Let R be a linear order on a finite set A. Show that there exists a unique element a ∈ A such that a R b for all b ∈ A − {a}. Such an element a is called the least element. If A is infinite, does there always exist a least element?
*0.1.24. Show that {a, {a, b}} = {c, {c, d}} if and only if a = c and b = d.
0.1.25. Let R be a partial order on a set A. Show that if a R b, then b R a is false.
*0.1.26. Use the power set and union axioms to help show that if A and B are sets, then A × B is a set.
**0.1.27. Show that every set is either finite or infinite, but not both.
*0.1.28. Show that every countable set is infinite.
*0.1.29. Show that the following sets have the same cardinality:
(1) The set of real numbers between 0 and 1,
(2) The set of all real numbers,
(3) The set of all mappings from the integers to the integers, and
(4) The set of all subsets of the positive integers.
**0.1.30. Show that ℘(A) is always larger than A for any set A.
0.1.31. Show that if R is a partial order on a set A, then the relation R′ given by R′ = R ∪ {(a, a) | a ∈ A} is a reflexive partial order on A.
0.1.32. Show that if R is a reflexive partial order on a set A, then the relation R′ = R − {(a, a) | a ∈ A} is a partial order on A.
0.2. SETS OF STRINGS
In this book we shall be dealing primarily with sets whose elements are strings of symbols. In this section we shall define a number of terms dealing with strings.
0.2.1. Strings
First of all we need the concept of an alphabet. To us an alphabet will be any set of symbols. We assume that the term symbol has a sufficiently clear intuitive meaning that it needs no further explication. An alphabet need not be finite or even countable, but for all practical applications alphabets will be finite. Two examples of alphabets are the set of 26 upper- and 26 lowercase Roman letters (the Roman alphabet) and the set {0, 1}, which is often called the binary alphabet. The terms letter and character will be used synonymously with symbol to denote an element of an alphabet. If we put a sequence of symbols side by side, we have a string of symbols. For example, 01011 is a string over the binary alphabet [0, 1}. The terms sentence and word are often used as synonyms for string. There is one string which arises frequently and which has been given a special denotation. This is the empty string and it will be denoted by the symbol e. The empty string is that string which has no symbols. CONVENTION
We shall ordinarily use capital Greek letters for alphabets. The letters a, b, c, and d will represent symbols, and the letters t, u, v, w, x, y, and z generally represent strings. We shall represent a string of i a's by aⁱ. For example, a¹ = a,† a² = aa, a³ = aaa, and so forth. Then a⁰ is e, the empty string.

DEFINITION
We formally define strings over an alphabet Σ in the following manner:
(1) e is a string over Σ.
(2) If x is a string over Σ and a is in Σ, then xa is a string over Σ.
(3) y is a string over Σ if and only if its being so follows from (1) and (2).
There are several operations on strings for which we shall have use later on. If x and y are strings, then the string xy is called the concatenation of x
†We thus identify the symbol a and the string consisting of a alone.
and y. For example, if x = ab and y = cd, then xy = abcd. For all strings x, xe = ex = x.
-~- e x
=
x.
The reversal of a string x, denoted x R, is the string x written in reverse order; i.e., if x : a~ • .. a,, where each a~ is a symbol, then x R : a , . . . a , . Also, e R : e. Let x, y, and z be arbitrary strings over some alphabet E. We call x a prefix of the string xy and y a suffix of xy. y is a substring of xyz. Both a prefix and suffix of a string x are substrings of x. For example, ba is both a prefix and a substring of the string bac. Notice that the empty string is a substring, a prefix, and a s u ~ x of every string. If x :/: y and x is a prefix (suffix) of some string y, then x is called a proper prefix (suffix) of y. The length of a string is the number of symbols in the string. That is, if x ---- a~ . . . a,, where each at is a symbol, then the length of x is n. We shall denote the length of a string x b y ] x 1. For example, l a a b l = 3 a n d ] e l = 0. All strings which we shall encounter will be of finite length. 0.2.2.
Languages
DEFINITION
A language over an alphabet Z is a set of strings over E. This definition surely encompasses almost everyone's notion of a language. F O R T R A N , A L G O L , PL/I, and even English are included in this definition. Example 0.10
Let us consider some simple examples of languages over an alphabet Z. The empty set ~ is a language. The set [e} which contains only the empty string is a language. Notice that ~ and {~e}are two distinct languages. D DEFINITION We let E* denote the set containing all strings over E including e. For example, if A is the binary alphabet {0, 1], then E* = [e, 0, 1, 00, 01, 10, 11,000, 0 0 1 , . . . ] . Every language over E is a subset of E*. The set of all strings over E but excluding e will be denoted by E+. Example 0.11
Let us consider the language L 1 containing all strings of zero or more
a's. We can denote L 1 by [atl i ~ 0}. It should be clear that L1 = {a}*. [Z]
SEC. 0.2
SETS OF STRINGS
17
CONVENTION
When no confusion results we shall often denote a set consisting of a single element by the element itself. Thus according to this convention a* = {a}*. DEFINITION
A language L such that no string in L is a proper prefix (suffix) of any other string in L is said to have the prefix (suffix) property. For example, a* does not have the prefix property but (a~b[i ~ O} does. 0.2.3.
Operations on Languages
We shall often be concerned with various operations applied to languages. In this section, we shall consider some basic and fundamental operations on languages. Since a language is just a set, the operations of union, intersection, difference, and complementation apply to languages. The operation concatenation can be applied to languages as well as strings. DEFINITION
Let L1 be a language over alphabet El and L z a language over E2. Then L1L2, called the concatenation or product of L a and L z, is the language {xy[x ~ L~ and y ~ L2]. There will be occasions when we wish to concatenate any arbitrary number of strings from a language. This notion is captured in the closure of a language. DEFINITION
The closure of L, denoted L*, is defined as follows" (1) L ° = (e}. (2) L" = LL"-1 for n .~ 1. (3) L * = U L". n>_0
The positive closure of L, denoted L +, is [,_JL". Note that L + = LL* = L*L n~l
and that L* = L + U (e}. We shall also be interested in mappings on languages. A simple type of mapping which occurs frequently when dealing with languages is homomorphism. We can define a homomorphism in the following way. DEFINITION
We shall also be interested in mappings on languages. A simple type of mapping which occurs frequently when dealing with languages is homomorphism. We can define a homomorphism in the following way.

DEFINITION
Let Σ₁ and Σ₂ be alphabets. A homomorphism is a mapping h: Σ₁ → Σ₂*. We extend the domain of the homomorphism h to Σ₁* by letting h(e) = e and h(xa) = h(x)h(a) for all x in Σ₁*, a in Σ₁.
Applying a homomorphism to a language L, we get another language h(L), which is the set of strings {h(w) | w ∈ L}.

Example 0.12
Suppose that we wish to change every instance of 0 in a string to a and every 1 to bb. We can define a homomorphism h such that h(0) = a and h(1) = bb. Then if L is the language {0ⁿ1ⁿ | n ≥ 1}, h(L) = {aⁿb²ⁿ | n ≥ 1}. □
If h" E~ --~ Ez* is a h o m o m o r p h i s m , then the relation h-1. E2* ---~ 6~(E~*), defined below, is called an inverse homomorphism. If y is in E2*, then h-~(y) is the set of strings over E1 which get mapped by h to y. That is, h-~(y) = (xlh(x) = y}. I f L is a language over E2, then h-I(L)is the language over Ex consisting of those strings which get mapped by h into a string in L. Formally, h - ' ( L ) = U h-'(y) = (xlh(x) ~ L]. y EL
Example 0.13
Let h be a h o m o m o r p h i s m such that h(0) = a and h(1) = a. It follows that h-~(a) = {0, 1} and h-~(a *) = [0, 1}*. As a second example, suppose that h is a h o m o m o r p h i s m such that h(0) = a and h(i) = e. Then h- l(e) = 1" and h- l(a) = 1'01". Here 1"01" denotes the language [ 1i01J I i, j ~ 0], which is consistent with our definitions and the convention which identifies a and [a]. [~]
EXERCISES
0.2.1.
Give all the (a) prefixes, (b) suffixes, and (c) substrings of the string abe.
0.2.2.
Prove or disprove: L + = L* -- {el.
0.2.3.
Let h be the homomorphism defined by h(0) = a, h(1) = bb, and h(2) = e. What is h(L), where L = {012]* ?t
0.2.4.
Let h be as in Exercise 0.2.3. What is h-l({ab]*)?
*0.2.5.
0,2,6,
Prove or disprox,e the following" (a) h-l(h(L)).'= L. (b) h(h-~(L)) = L. Can L* or L + ever be ~ ? Under what circumstances are L* and L + finite ?
tNote that {012}* is not {0, 1, 2}*.
SEC. 0.3
CONCEPTSFROMLOGIC
19
*0.2.7. Give well orders on the following languages: (a) (a, b}*. (b) a*b*c*. (c) (w[ w ~ (a, b}* and the number of a's in w equals the number of b's). 0.2.8.
0.3.
Which of the following languages have the prefix (suffix) property? (a)~. (b) {e). (c) (a"b"ln ~ 1~. (d) L*, if L has the prefix property. (e) (wl w ~ {a, b}* and the number of a's in w equals the number of b's~}.
CONCEPTS F R O M
LOGIC
In this book we shall present a number of algorithms which are useful in language-processing applications. For some functions several algorithms are known, and it is desirable to present the algorithms in a common framework in which they can be evaluated and compared. Above all, it is most desirable to know that an algorithm performs the function that it is supposed to perform. For this reason, we shall provide proofs that the various algorithms that we shall present function as advertised. In this section, we shall briefly comment on what is a proof and mention some useful techniques of proof. 0.3.1.
Proofs
A formal mathematical system can be characterized by the following basic components" (1) (2) (3) (4)
Basic symbols, Formation rules, Axioms, and Rules of inference.
The set of basic symbols would include the symbols for constants, operators, and so forth. Statements can then be constructed from these basic symbols according to a set of formation rules. Certain primitive statements can be defined, and the validity of these statements can be accepted without justification. These statements are known as the axioms of the system. Then certain rules can be specified whereby valid statements can be used to infer new valid statements. Such rules are called rules of inference. The objective may be to prove that a certain statement is true in a certain mathematical system. A proof of that statement is a sequence of statements such that (1) Each statement is either an axiom or can be created from one or more of the previous statements by means of a rule of inference.
(2) The last statement in the sequence is the statement to be proved.
A statement for which we can find a proof is called a theorem of that formal system. Obviously, every axiom of a formal system is a theorem. In spirit at least, the proof of any mathematical theorem can be formulated in these terms. However, going to a level of detail in which each statement is either an axiom or follows from previous statements by rudimentary rules of inference makes the proofs of all but the most elementary theorems too long. The task of finding proofs of theorems in this fashion is in itself laborious, even for computers. Consequently, mathematicians invariably employ various shortcuts to reduce the length of a proof. Statements which are previously proved theorems can be inserted into proofs. Also, statements can be omitted when it is (hopefully) clear what is being done. This technique is practiced virtually everywhere, and this book is no exception. It is known to be impossible to provide a universal method for proving theorems. However, in the next sections we shall mention a few of the more commonly used techniques.

0.3.2. Proof by Induction
Suppose that we wish to prove that a statement S(n) about an integer n is true for all integers in a set N. If N is finite, then one method of proof is to show that S(n) is true for each value of n in N. This method of proof is sometimes called proof by perfect induction or proof by exhaustion. If N is an infinite subset of the integers, then we may use simple mathematical induction. Let n₀ be the smallest value in N. To show that S(n) is true for all n in N, we may equivalently show that
(1) S(n₀) is true. (This is called the basis of the induction.)
(2) Assuming that S(m) is true for all m < n in N, show that S(n) is also true. (This is the inductive step.)

Example 0.14
Suppose then that S(n) is the statement

    1 + 3 + 5 + ··· + (2n − 1) = n²

That is, the sum of the first n odd integers is a perfect square. Suppose we wish to show that S(n) is true for all positive integers. Thus N = {1, 2, 3, ...}.

Basis. For n = 1 we have 1 = 1².

Inductive Step. Assuming S(1), ..., S(n) are true [in particular, that S(n)
is true], we have

    1 + 3 + 5 + ··· + (2n − 1) + [2(n + 1) − 1] = n² + 2n + 1 = (n + 1)²

so that S(n + 1) must then also be true. We thus conclude that S(n) is true for all positive integers. □
The reader is referred to Section 0.5.5 for some methods of induction on sets other than integers.
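As a quick mechanical spot check (not part of the original text, and of course not a substitute for the inductive proof), the identity can be verified for the first few values of n:

    # Check 1 + 3 + ... + (2n - 1) = n^2 for n = 1, ..., 7.
    for n in range(1, 8):
        assert sum(2 * k - 1 for k in range(1, n + 1)) == n ** 2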
0.3.3. Logical Connectives
Often a statement (theorem) may read "P if and only if Q" or "P is a necessary and sufficient condition for Q," where P and Q are themselves statements. The terms if, only if, necessary, and sufficient have precise meanings in logic.
A logical connective is a symbol that can be used to create a statement out of simpler statements. For example, and, or, not, implies are logical connectives, not being a unary connective and the others binary connectives. If P and Q are statements, then P and Q, P or Q, not P, and P implies Q are also statements. The symbol ∧ is used to denote and, ∨ to denote or, ~ to denote not, and → to denote implies.
There are well-defined rules governing the truth or falsehood of a statement containing logical connectives. For example, the statement P and Q is true only when both P is true and Q is also true. We can summarize the properties of a logical connective by a table, called a truth table, which displays the value of a composite statement in terms of the values of its components. Figure 0.5 shows the truth table for the logical connectives and, or, not, and implies.
    P   Q   P ∧ Q   P ∨ Q   ~P   P → Q
    F   F     F       F     T      T
    F   T     F       T     T      T
    T   F     F       T     F      F
    T   T     T       T     F      T

Fig. 0.5  Truth tables for and, or, not, and implies.

From the table (Fig. 0.5) we see that P → Q is false only when P is true and Q is false. It may seem a little odd that if P is false, then P implies Q
is always true, regardless of the value of Q. But in logic this is customary; from falsehood as a hypothesis, anything follows.
We can now return to consideration of a statement of the form P if and only if Q. This statement consists of two parts: P if Q and P only if Q. It is more common to state P if Q as if Q then P, which is only another way of saying Q implies P. In fact the following five statements are equivalent:
(1) P implies Q.
(2) If P then Q.
(3) P only if Q.
(4) Q is a necessary condition for P.
(5) P is a sufficient condition for Q.
To show that the statement P if and only if Q is true we must show both that Q implies P and that P implies Q. Thus, P if and only if Q is true exactly when P and Q are either both true or both false.
There are several alternative methods of showing that the statement P implies Q is always true. One method is to show that the statement not Q implies not P† is always true. The reader should verify that not Q implies not P has exactly the same truth table as P implies Q. The statement not Q implies not P is called the contrapositive of P implies Q.
One important technique of proof is proof by contradiction, sometimes called the indirect proof or reductio ad absurdum. Here, to show that P implies Q is true, we show that not Q and P implies falsehood is true. That is to say, we assume that Q is not true, and if assuming that P is true we are able to obtain a statement known to be false, then P implies Q must be true.
The converse of the statement if P then Q is if Q then P. The statement P if and only if Q is often written if P then Q and conversely. Note that a statement and its converse do not have the same truth table.

†We assume "not" takes precedence over "implies." Thus the proper phrasing of the sentence is (not Q) implies (not P). In general, "not" takes precedence over "and," which takes precedence over "or," which takes precedence over "implies."
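These tables can be generated mechanically. The following Python sketch (an illustration, not part of the original text) tabulates the connectives of Fig. 0.5 and, as a byproduct, checks that P implies Q and its contrapositive agree on every row.

    # Enumerate all truth assignments to P and Q and print each connective.
    from itertools import product

    def implies(p, q):
        return (not p) or q

    for p, q in product([False, True], repeat=2):
        assert implies(p, q) == implies(not q, not p)   # contrapositive has the same table
        print(p, q, p and q, p or q, not p, implies(p, q))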
EXERCISES
DEFINITION

Propositional calculus is a good example of a mathematical system. Formally, propositional calculus can be defined as a system S consisting of
(1) A set of primitive symbols,
(2) Rules for generating well-formed statements,
(3) A set of axioms, and
(4) Rules of inference.

(1) The primitive symbols of S are (, ), →, ~, and an infinite set of statement letters a1, a2, a3, .... The symbol → can be thought of as implies and ~ as not.
(2) A well-formed statement is formed by one or more applications of the following rules:
(a) A statement letter is a statement.
(b) If A and B are statements, then so are (~A) and (A → B).
(3) Let A, B, and C be statements. The axioms of S are
    A1: (A → (B → A))
    A2: ((A → (B → C)) → ((A → B) → (A → C)))
    A3: ((~B → ~A) → ((~B → A) → B))
(4) The rule of inference is modus ponens, i.e., from the statements (A → B) and A we can infer the statement B.

We shall leave out parentheses wherever possible. The statement a → a is a theorem of S and has as proof the sequence of statements
(i) (a → ((a → a) → a)) → ((a → (a → a)) → (a → a))   from A2 with A = a, B = (a → a), and C = a.
(ii) a → ((a → a) → a)   from A1.
(iii) (a → (a → a)) → (a → a)   by modus ponens from (i) and (ii).
(iv) a → (a → a)   from A1.
(v) a → a   by modus ponens from (iii) and (iv).

*0.3.1. Prove that (~a → a) → a is a theorem of S.
0.3.2. A tautology is a statement that is true for all possible truth values of the statement variables. Show that every theorem of S is a tautology. Hint: Prove the theorem by induction on the number of steps necessary to obtain the theorem.
**0.3.3. Prove the converse of Exercise 0.3.2, i.e., that every tautology is a theorem. Thus a simple method to determine whether a statement of propositional calculus is a theorem is to determine whether that statement is a tautology.
0.3.4. Give the truth table for the statement if P then if Q then R.

DEFINITION

Boolean algebra can be interpreted as a system for manipulating truth-valued variables using logical connectives informally interpreted as and, or, and not. Formally, a Boolean algebra is a set B together with operations · (and), + (or), and ¯ (not). The axioms of Boolean algebra are the following: For all a, b, and c in B,
(1) a + (b + c) = (a + b) + c and a · (b · c) = (a · b) · c.   (associativity)
(2) a + b = b + a and a · b = b · a.   (commutativity)
(3) a · (b + c) = (a · b) + (a · c) and a + (b · c) = (a + b) · (a + c).   (distributivity)
In addition, there are two distinguished members of B, 0 and 1 (in the most common Boolean algebra, these are the only members of B, representing falsehood and truth, respectively), with the following laws:
(4) a + 0 = a and a · 1 = a.
(5) a + ā = 1 and a · ā = 0.
The rule of inference is substitution of equals for equals.

*0.3.5. Show that the following statements are theorems in any Boolean algebra: (a) 0̄ = 1. (b) a + (b · ā) = a + b. (c) ā̄ = a. What are the informal interpretations of these theorems?
0.3.6. Let A be a set. Show that ℘(A) is a Boolean algebra if +, ·, and ¯ are ∪, ∩, and complementation with respect to the universe A.
**0.3.7. Let B be a Boolean algebra where #B = n. Show that n = 2^m for some integer m.
0.3.8. Prove by induction that

    1 + 2 + ··· + n = n(n + 1)/2

0.3.9. Prove by induction that

    (1 + 2 + ··· + n)² = 1³ + 2³ + ··· + n³

*0.3.10. What is wrong with the following?
THEOREM
All marbles have the same color.
Proof. Let A be any set of n marbles, n ≥ 1. We shall "prove" by induction on n that all marbles in A have the same color.
Basis. If n = 1, all marbles in A are clearly of the same color.
Inductive Step. Assume that if A is any set of n marbles, then all marbles in A are the same color. Let A' be a set of n + 1 marbles, n ≥ 1. Remove one marble from A'. We are then left with a set A" of n marbles, which, by the inductive hypothesis, has marbles all of the same color. Remove from A" a second marble and then add to A" the marble originally removed. We again have a set of n marbles, which by the inductive hypothesis has marbles all of the same color. Thus the two marbles removed must have been the same color, so the set A' must contain marbles all of the same color. Thus, in any set of n marbles, all marbles are the same color. □
"0.3.11.
Let R be a well order on a set A and S(a) a statement about a in A. Assume that if S(b) is true for all b ~ a such that b R a, then S(a) is true.
Show that S(a) is then true for all a in A. Note that this is a generalization of the principle of simple induction.
0.3.12. Show that there are only four unary logical connectives. Give their truth tables.
0.3.13. Show that there are 16 binary logical connectives.
0.3.14. Two logical statements are equivalent if they have the same truth table. Show that (a) ~(P ∧ Q) is equivalent to ~P ∨ ~Q. (b) ~(P ∨ Q) is equivalent to ~P ∧ ~Q.
0.3.15. Show that P → Q is equivalent to ~Q → ~P.
0.3.16. Show that P → Q is equivalent to P ∧ ~Q → false.
*0.3.17. A set of logical connectives is complete if for any logical statement we can find an equivalent statement containing only those logical connectives. Show that {∧, ~} and {∨, ~} are complete sets of logical connectives.
BIBLIOGRAPHIC NOTES
Church [1956] and Mendelson [1968] give good treatments of mathematical logic. Halmos [1963] gives a nice introduction to Boolean algebras.
0.4. PROCEDURES AND ALGORITHMS
The concept of algorithm is central to computing. The definition of algorithm can be approached from a variety of directions. In this section we shall discuss the term algorithm informally and hint at how a more formal definition can be obtained.

0.4.1. Procedures
We shall begin with a slightly more general concept, that of a procedure. Broadly speaking, a procedure consists of a finite set of instructions, each of which can be mechanically executed in a fixed amount of time with a fixed amount of effort. A procedure can have any number of inputs and outputs. To be precise, we should define the terms instruction, input, and output. However, we shall not go into the details of such a definition here, since any "reasonable" definition is adequate for our needs.
A good example of a procedure is a machine language computer program. A program consists of a finite number of machine instructions, and each instruction usually requires a fixed amount of computation. However, procedures in the form of computer programs may often be very difficult to understand, so a more descriptive notation will be used in this book. The
following example is representative of the notation we shall use to describe procedures and algorithms.

Example 0.15
Consider Euclid's algorithm to determine the greatest common divisor of two positive integers p and q.
Procedure 1. Euclidean algorithm.
Input. p and q, positive integers.
Output. g, the greatest common divisor of p and q.
Method.
Step 1: Let r be the remainder of p/q.
Step 2: If r = 0, set g = q and halt. Otherwise set p = q, then q = r, and go to step 1. □

Let us see if procedure 1 qualifies as a procedure under our definition. Procedure 1 certainly consists of a finite set of instructions (each step is considered as one instruction) and has input and output. However, can each instruction be mechanically executed with a fixed amount of effort? Strictly speaking, the answer to this question is no, because if p and q are sufficiently large, the computation of the remainder of p/q may require an amount of effort that is proportional in some way to the size of p and q. However, we could replace step 1 by a sequence of steps whose net effect is to compute the remainder of p/q, although the amount of effort in each step would be fixed and independent of the size of p and q. (Thus the number of times each step is executed is an increasing function of the size of p and q.) These steps, for example, could implement the customary paper-and-pencil method of doing integer division. Thus we shall permit a step of a procedure to be a procedure in itself. So under this liberalized notion of procedure, procedure 1 qualifies as a procedure.
In general, it is convenient to assume that integers are basic entities, and we shall do so. Any integer can be stored in one memory cell, and any integer arithmetic operation can be performed in one step. This is a fair assumption only if the integers are less than 2^k, where k is the number of bits in a computer word, as often happens in practice. However, the reader should bear in mind the additional effort necessary to handle integers of arbitrary size when the elementary steps handle only integers of bounded size.
We must now face perhaps the most important consideration--proving that the procedure does what it is supposed to do. For each pair of integers p and q, does procedure 1 in fact compute g to be the greatest common divisor of p and q? The answer is yes, but we shall leave the proof of this particular assertion to the Exercises.
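As an illustration (not part of the original text), Procedure 1 can be transcribed directly into Python:

    # Repeatedly replace (p, q) by (q, p mod q) until the remainder is zero.
    def gcd(p, q):
        assert p > 0 and q > 0
        while True:
            r = p % q          # step 1: remainder of p/q
            if r == 0:         # step 2: q is the greatest common divisor
                return q
            p, q = q, r        # otherwise continue with the pair (q, r)

    print(gcd(36, 24))         # -> 12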
We might note in passing, however, that one useful technique of proof for showing that procedures work as intended is induction on the number of steps taken.

0.4.2. Algorithms
We shall now place an all-important restriction on a procedure to obtain what is known as an algorithm.

DEFINITION

A procedure halts on a given input if there is a finite number t such that after executing t (not necessarily different) elementary instructions of the procedure, either there is no instruction to be executed next or a "halt" instruction was last executed. A procedure which halts on all inputs is called an algorithm.

Example 0.16
Consider the procedure of Example 0.15. We observe that steps 1 and 2 must be executed alternately. After step 1, step 2 must be executed. After step 2, step 1 may be executed, or there may be no next step; i.e., the procedure halts. We can prove that for every input p and q, the procedure halts after at most 2q steps,† and that thus the procedure is an algorithm. The proof turns on observing that the value r computed in step 1 is less than the value of q and that, hence, successive values of q when step 1 is executed form a monotonically decreasing sequence. Thus, by the qth time step 2 is executed, r, which cannot be negative and is less than the current value of q, must attain the value zero. When r = 0, the procedure halts. □

†In fact, 4 log₂ q is an upper bound on the number of steps executed for q > 1. We leave this as an exercise.

There are several reasons why a procedure may fail to halt on some inputs. It is possible that a procedure can get into an infinite loop under certain conditions. For example, if a procedure contained the instruction

    Step 1: If x = 0, then go to step 1; else halt,

then for x = 0 the procedure would never halt. Variations on this situation are countless.
Our interest will be almost exclusively in algorithms. We shall be interested not only in proving that algorithms are correct, but also in evaluating algorithms. The two main criteria for evaluating how well algorithms perform will be
(1) The number of elementary mechanical operations executed as a function of the size of the input (time complexity), and
(2) How large an auxiliary memory is required to hold intermediate
results that arise during the execution, again as a function of the size of the input (space complexity).

Example 0.17

In Example 0.16 we saw that the number of steps of procedure 1 (Example 0.15) that would be executed with input (p, q) is bounded above by 2q. The amount of memory used is one cell for each of p, q, and r, assuming that any integer can be stored in a single cell. If we assume that the amount of memory needed to store an integer depends on the length of the binary representation of that integer, the amount of memory needed is proportional to log₂ n, where n is the maximum of inputs p and q. □

0.4.3. Recursive Functions
A procedure defines a mapping from the set of all allowable inputs to a set of outputs. The mapping defined by a procedure is called a partial recursive function or recursive function. If the procedure is an algorithm, then the mapping is called a total recursive function.
A procedure can also be used to define a language. We could have a procedure to which we can present an arbitrary string x. After some computation, the procedure would output "yes" when string x is in the language. If x is not in the language, then the procedure may halt and say "no" or the procedure may never halt. This procedure would then define a language L as the set of input strings for which the procedure has output "yes."
The behavior of the procedure on a string not in the language is not acceptable from a practical point of view. If the procedure has not halted after some length of time on an input x, we would not know whether x was in the language but the procedure had not finished computing or whether x was not in the language and the procedure would never terminate. If we had used an algorithm to define a language, then the algorithm would halt on all inputs. Consequently, patience is justified with an algorithm in that we know that if we wait long enough, the algorithm will eventually halt and say either "yes" or "no."
A set which can be defined by a procedure is said to be recursively enumerable. A set which can be defined by an algorithm is called recursive. If we use more precise definitions, then we can rigorously show that there are sets which are not recursively enumerable. We can also show that there are recursively enumerable sets which are not recursive.
We can state this in another way. There are mappings which cannot be specified by any procedure. There are also mappings which can be specified by a procedure but which cannot be specified by an algorithm.
We shall see that these concepts have great underlying significance for
a theory of programming. In Section 0.4.5 we shall give an example of a procedure for which it can be shown that there is no equivalent algorithm.

0.4.4. Specification of Procedures
In the previous section we informally defined what we meant by procedure and algorithm. It is possible to give a rigorous definition of these terms in a variety of formalisms. In fact there are a large number of formal notations for describing procedures. These notations include
(1) Turing machines [Turing, 1936-1937].
(2) Chomsky type 0 grammars [Chomsky, 1959a and 1963].
(3) Markov algorithms [Markov, 1951].
(4) Lambda calculus [Church, 1941].
(5) Post systems [Post, 1943].
(6) Tag systems [Post, 1965].
(7) Most programming languages [Sammet, 1969].
This list can be extended readily. The important point to be made here is that it is possible to simulate a procedure written in one of these notations by means of a procedure written in any other of these notations. In this sense all these notations are equivalent.
Many years ago the logicians Church and Turing hypothesized that any computational process which could be reasonably called a procedure could be simulated by a Turing machine. This hypothesis is known as the Church-Turing thesis and has been generally accepted. Thus the most general class of sets that we would wish to deal with in a practical way would be included in the class of recursively enumerable sets.
Most programming languages, at least in principle, have the capability of specifying any procedure. In Chapter 11 (Volume 2) we shall see what consequences this capability produces when we attempt to optimize programs.
We shall not discuss the details of these formalisms for procedures here, although some of them appear in the exercises. Minsky [1967] gives a very readable introduction to this topic. In our book we shall use the rather informal notation for describing procedures and algorithms that we have already seen.

0.4.5. Problems
We shall use the word problem in a rather specific way in this book.

DEFINITION
A problem (or question) is a statement (predicate) which is either true or false, depending on the value of some number of unknowns of designated
type in the statement. A problem is usually presented as a question, and we say the answer to the problem is "yes" if the statement is true and "no" if the statement is false.

Example 0.18
An example of a problem is "x is less than y, for integers x and y." More colloquially, we can express the statement in question form and delete mention of the type of x and y: "Is x less than y?"

DEFINITION

An instance of a problem is a set of allowable values for its unknowns. For example, the instances of the problem of Example 0.18 are ordered pairs of integers.
A mapping from the set of instances of a problem to {yes, no} is called a solution to the problem. If this mapping can be specified by an algorithm, then the problem is said to be (recursively) decidable or solvable. If no algorithm exists to specify this mapping, then the problem is said to be (recursively) undecidable or unsolvable.
One of the remarkable achievements of twentieth-century mathematics was the discovery of problems that are undecidable. We shall see later that undecidable problems seriously hamper the development of a broadly applicable theory of computation.

Example 0.19
Let us discuss the particular problem "Is procedure P an algorithm?" Its analysis will go a long way toward exhibiting why some problems are undecidable.
First, we must assume that all procedures are specified in some formal system such as those mentioned earlier in this section. It appears that every formal specification language for procedures admits only a countable number of procedures. While we cannot prove this in general, we give one example, the formalism for representing absolute machine language programs, and leave the other mentioned specifications for the Exercises.
Any absolute machine language program is a finite sequence of 0's and 1's (which we imagine are grouped 32, 36, 48, or some number to a machine word). Suppose that we have a string of 0's and 1's representing a machine language program. We can assign an integer to this program by giving its position in some well ordering of all strings of 0's and 1's. One such ordering can be obtained by ordering the strings of 0's and 1's in terms of increasing length and lexicographically ordering strings of equal length by treating each string as a binary number. Since there are only a finite number of strings of any length, every string in {0, 1}* is thus mapped to some integer. The first
few pairs in this bijection are

    Integer    String
       1         e
       2         0
       3         1
       4         00
       5         01
       6         10
       7         11
       8         000
       9         001
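The ordering just tabulated is easy to realize mechanically; the following Python sketch (an illustration, not from the text) converts between a positive integer and the corresponding string.

    # Strings over {0, 1} listed by increasing length, and strings of equal
    # length in numerical order: the i-th string is the binary expansion of i
    # with its leading 1 removed (i = 1 gives the empty string e).
    def string_of(i):
        return bin(i)[3:]

    def index_of(s):
        return int("1" + s, 2)

    for i in range(1, 10):
        print(i, repr(string_of(i)))   # reproduces the table above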
In this fashion we see that for each machine language program we can find a unique integer and that for each integer we can find a certain machine language program. It seems that no matter what formalism for specifying procedures is taken, we shall always be able to find a one-to-one correspondence between procedures and integers. Thus it makes sense to talk about the ith procedure in any given formalism for specifying procedures. Moreover, the correspondence between procedures and integers is sufficiently simple that one can, given an integer i, write out the ith procedure, or, given a procedure, find its corresponding number.
Let us suppose that there is a procedure Pj which is an algorithm and takes as input a specification of a procedure in our formalism and returns the answer "yes" if and only if its input is an algorithm. All known formalisms for procedure specification have the property that procedures can be combined in certain simple ways. In particular, given the hypothetical procedure (algorithm) Pj, we could construct an algorithm Pk to work as follows:

ALGORITHM Pk
Input. Any procedure P which requires one input.
Output.
(1) "No" if (a) P is not an algorithm or (b) P is an algorithm and P(P) = "yes."
(2) "Yes" otherwise.
The notation P(P) means that we are applying procedure P to its own specification as input.
Method.
(1) If Pj(P) = "yes," then go to step (2). Otherwise output "no" and halt.
(2) If the input P is an algorithm and P takes a procedure specification
as input and gives "yes" or "no" as output, Pk applies P to itself (P) as input. (We assume that procedure specifications are such that these questions about input and output forms can be ascertained by inspection. The assumption is true in known cases.)
(3) Pk gives output "no" or "yes" if P gives output "yes" or "no," respectively.
We see that Pk is an algorithm, on the assumption that Pj is an algorithm. Also, Pk requires one input. But what does Pk do when its input is itself? Presumably, Pj determines that Pk is an algorithm [i.e., Pj(Pk) = "yes"]. Pk then simulates itself on itself. But now Pk cannot give an output that is consistent. If Pk determines that this simulation gives "yes" as output, Pk gives "no" as output. But Pk just determined that it gave "yes" when applied to itself. A similar paradox occurs if Pk finds that the simulation gives "no." We must conclude that it is fallacious to assume that the algorithm Pj exists, and thus the question "Is P an algorithm?" is not decidable for any of the known procedure formalisms. □

We should emphasize that a problem is decidable if and only if there is an algorithm which will take as input an arbitrary instance of that problem and give the answer yes or no. Given a specific instance of a problem, we are often able to answer yes or no for that specific instance. This does not necessarily make the problem decidable. We must be able to give a uniform algorithm which will work for all instances of the problem before we can say that the problem is decidable.
As an additional caveat, we should point out that the encoding of the instances of a problem is vitally important. Normally a "standard" encoding (one that can be mapped by means of an algorithm into a Turing machine specification) is assumed. If nonstandard encodings are used, then problems which are normally undecidable can become decidable. In such cases, however, there will be no algorithm to go from the standard encoding to the nonstandard. (See Exercise 0.4.21.)

0.4.6. Post's Correspondence Problem
In this section we shall introduce one of the paradigm undecidable problems, called Post's correspondence problem. Later in the book we shall use this problem to show that other problems are undecidable.

DEFINITION
An instance of Post's correspondence problem over alphabet Σ is a finite set of pairs in Σ⁺ × Σ⁺ (i.e., a set of pairs of nonempty strings over Σ). The problem is to determine if there exists a finite sequence of (not necessarily distinct) pairs (x1, y1), (x2, y2), ..., (xm, ym) such that x1x2 ··· xm =
y1y2 ··· ym. We shall call such a sequence a viable sequence for this instance of Post's correspondence problem. We shall often use x1x2 ··· xm to represent the viable sequence.

Example 0.20
Consider the following instance of Post's correspondence problem over {a, b}:
{(abbb, b), (a, aab), (ba, b)}.
The sequence (a, aab), (a, aab), (ba, b), (abbb, b) is viable, since (a)(a)(ba)(abbb) = (aab)(aab)(b)(b).
The instance {(ab, aba), (aba, baa), (baa, aa)} of Post's correspondence problem has no viable sequences, since any such sequence must begin with the pair (ab, aba), and from that point on, the total number of a's in the first components of the pairs in the sequence will always be less than the number of a's in the second components. □
There is a procedure which is in a sense a "solution" to Post's correspondence problem. Namely, one can linearly order all possible sequences of pairs of strings that can be constructed from a given instance of the problem. One can then proceed to test each sequence to see if that sequence is viable. On encountering the first viable sequence, the procedure would halt and report yes. Otherwise the procedure would continue to operate forever. However, there is no algorithm to solve Post's correspondence problem, for one can show that if there were such an algorithm, then one could solve the halting problem for Turing machines (Exercise 0.4.22)--but the halting problem for Turing machines is undecidable (Exercise 0.4.14).
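The enumerate-and-test procedure just described can be sketched in a few lines of Python (an illustration, not part of the original text). The bound max_len is added only to keep the sketch finite; the true procedure would run forever when no viable sequence exists.

    # Try all sequences of pairs of length 1, 2, 3, ... and stop at the first viable one.
    from itertools import product

    def find_viable(pairs, max_len=6):
        for n in range(1, max_len + 1):
            for seq in product(pairs, repeat=n):
                x = "".join(p[0] for p in seq)
                y = "".join(p[1] for p in seq)
                if x == y:
                    return seq          # first viable sequence found
        return None                     # none of length <= max_len

    print(find_viable([("abbb", "b"), ("a", "aab"), ("ba", "b")]))
    # -> (('a', 'aab'), ('a', 'aab'), ('ba', 'b'), ('abbb', 'b')), as in Example 0.20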
EXERCISES

0.4.1. A perfect number is an integer which is equal to the sum of all its divisors (including 1 but excluding the number itself). For example, 6 = 1 + 2 + 3 and 28 = 1 + 2 + 4 + 7 + 14 are the first two perfect numbers. (The next three are 496, 8128, and 33550336.) Construct a procedure which has input i and output the ith perfect number. (At present it is not known whether there are a finite or infinite number of perfect numbers.)
0.4.2.
Prove that the Euclidean algorithm of Example 0.15 is correct.
0.4.3.
Provide an algorithm to add two n-digit decimal numbers. How much time and space does the algorithm require as a function of n ? (See Winograd [1965] for a discussion of the time complexity of addition.)
0.4.4.
Provide an algorithm to multiply two n-digit decimal numbers. How much time and space does the algorithm require ? (See Winograd [1967]
and Cook and Aanderaa [1969] for a discussion of the time complexity of multiplication.)
0.4.5. Give an algorithm to multiply two integer-valued n by n matrices. Assume that integer arithmetic operations can be done in one step. What is the speed of your algorithm? If it is proportional to n³ steps, see Strassen [1969] for an asymptotically faster one.
0.4.6. Let L ⊆ {a, b}*. The characteristic function for L is a mapping fL: Z → {0, 1}, where Z is the set of nonnegative integers, such that fL(i) = 1 if the ith string in {a, b}* is in L and fL(i) = 0 otherwise. Show that L is recursively enumerable if and only if fL is a partial recursive function.
0.4.7. Show that L is a recursive set if and only if both L and its complement L̄ are recursively enumerable.
0.4.8. Let P be a procedure which defines a recursively enumerable set L ⊆ {a, b}*. From P construct a procedure P' which will generate all and only all the elements of L. That is, the output of P' is to be an infinite string of the form x1 # x2 # x3 # ..., where L = {x1, x2, ...}. Hint: Construct P' to apply i steps of procedure P to the jth string in {a, b}* for all (i, j), in a reasonable order.

DEFINITION

A Turing machine consists of a finite set of states (Q), tape symbols (Γ), and a function δ (the next move function, i.e., program) that maps a subset of Q × Γ to Q × Γ × {L, R}. A subset Σ ⊆ Γ is designated as the set of input symbols, and one symbol in Γ − Σ is designated the blank. One state, q0, is designated the start state.
The Turing machine operates on a tape, one square of which is pointed to by a tape head. All but a finite number of squares hold the blank at any time. A configuration of a Turing machine is a pair (q, α↑β), where q is the state, αβ is the nonblank portion of the tape, and ↑ is a special symbol indicating that the tape head is at the square immediately to its right. (↑ does not occupy a square.)
The next configuration after configuration (q, α↑β) is determined by letting A be the symbol scanned by the tape head (the leftmost symbol of β, or the blank if β = e) and finding δ(q, A). Suppose that δ(q, A) = (p, A', D), where p is a state, A' a tape symbol, and D = L or R. Then the next configuration is (p, α'↑β'), where α'β' is formed from α↑β by replacing the A to the right of ↑ by A' and then moving the symbol ↑ in direction D (left if D = L, right if D = R). It may be necessary to insert a blank at one end in order to move ↑.
The Turing machine can be thought of as a formalism for defining procedures. Its input may be any finite-length string w in Σ*. The procedure is executed by starting with configuration (q0, ↑w) and repeatedly computing next configurations. If the Turing machine halts, i.e., it has reached a configuration for which no move is defined (recall that δ may not be specified for all pairs in Q × Γ), then the output is the nonblank portion of the Turing machine's tape.
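To make the model concrete, here is a small simulator sketch in Python (not part of the original text; the dictionary-based tape and the step bound are conveniences of the illustration, since a Turing machine, being a procedure, need not halt).

    # delta maps (state, symbol) to (state, symbol, direction); unlisted tape
    # squares hold the blank.
    def run_tm(delta, q0, blank, w, max_steps=10000):
        tape = {i: c for i, c in enumerate(w)}
        q, head, steps = q0, 0, 0
        while (q, tape.get(head, blank)) in delta and steps < max_steps:
            p, new_sym, d = delta[(q, tape.get(head, blank))]
            tape[head] = new_sym
            head += 1 if d == "R" else -1
            q, steps = p, steps + 1
        cells = [i for i, c in tape.items() if c != blank]
        # output: the nonblank portion of the tape
        return "".join(tape[i] for i in range(min(cells), max(cells) + 1)) if cells else ""

    # A tiny machine that replaces every 0 by a and every 1 by b, then halts.
    delta = {("q0", "0"): ("q0", "a", "R"), ("q0", "1"): ("q0", "b", "R")}
    print(run_tm(delta, "q0", " ", "0110"))   # -> "abba"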
*0.4.9.
Exhibit a Turing machine that, given an input w in {0, 1}*, will write YES on its tape if w is a palindrome (i.e., w = w^R) and write NO otherwise, halting in either case.
*0.4.10. Assume that all Turing machines use a finite subset of some countable set of symbols, a1, a2, ..., for their states and tape symbols. Show that there is a one-to-one correspondence between the integers and Turing machines.
**0.4.11. Show that there is no Turing machine which halts on all inputs (i.e., algorithm) and determines, given integer i written in binary on its tape, whether the ith Turing machine halts. (See Exercise 0.4.10.)
*0.4.12. Let a1, a2, ... be a countable set of symbols. Show that the set of finite-length strings over these symbols is countable.
*0.4.13. Informally describe a Turing machine which takes a pair of integers i and j as input, and halts if and only if the ith Turing machine halts when given the jth string (as in Exercise 0.4.12) as input. Such a Turing machine is called universal.
**0.4.14. Show that there exists no Turing machine which always halts, takes as input a pair of integers (i, j), and prints YES on its tape if Turing machine i halts with input j and NO otherwise. Hint: Assume such a Turing machine existed, and derive a contradiction as in Example 0.19. The existence of a universal Turing machine is useful in many proofs.
**0.4.15. Show that there is no Turing machine (not necessarily an algorithm) which determines whether an arbitrary Turing machine is an algorithm. Note that this statement is stronger than that of Exercise 0.4.14, where we essentially showed that no such Turing machine which always halts exists.
"0.4.16.
Show that it is undecidable whether a given Turing machine halts when started with blank tape.
0.4.17.
Show that the problem of determining whether a statement is a theorem in propositional calculus is decidable. Hint: See Exercises 0.3.2 and 0.3.3.
0.4.18.
Show that the problem of deciding whether a string is in a particular recursive set is decidable.
0.4.19. Does Post's correspondence problem have a viable sequence in the following instances? (a) (01, 011), (10, 000), (00, 0). (b) (1, 11), (11, 101), (101, 011), (011, 1011). How do you reconcile being able to answer this exercise with the fact that Post's correspondence problem is undecidable?
0.4.20. Show that Post's correspondence problem with strings restricted to be over the alphabet {a} is decidable. How do you reconcile this result with the undecidability of Post's correspondence problem?
"0.4.21.
CHAP. 0
Let P1, Pz . . . . be an enumeration of procedures in some formalism. Define a new enumeration P~ , P' z , . . . as follows (1) Let P~t-1 be the ith of P1, Pz . . . . which is not an algorithm. (2) Let Pat be the ith of Pi, Pz,. • • which is an algorithm. Then there is a simple algorithm to determine, given j, whether P~, is an algorithm--just see whether j is even or odd. Moreover, each of P1, Pz . . . . is P~ for some j. How do you reconcile the existence of this one-to-one correspondence between integers and procedures with the claims of Example 0.19 ?
**0.4.22. Show that Post's correspondence problem is undecidable. Hint: Given a Turing machine, construct an instance of Post's problem which has a viable sequence if and only if the Turing machine halts when started with blank tape.
*0.4.23. Show that the Euclidean algorithm in Example 0.15 halts after at most 4 log₂ q steps when started with inputs p and q, where q > 1.

DEFINITION

A variant of Post's correspondence problem is the partial correspondence problem over alphabet Σ. An instance of the partial correspondence problem is to determine, given a finite set of pairs in Σ⁺ × Σ⁺, whether there exists for each m > 0 a sequence of not necessarily distinct pairs (x1, y1), (x2, y2), ..., (xm, ym) such that the first m symbols of the string x1x2 ··· xm coincide with the first m symbols of y1y2 ··· ym.
**0.4.24. Prove that the partial correspondence problem is undecidable.
BIBLIOGRAPHIC NOTES

Davis [1965] is a good anthology of many early papers in the study of procedures and algorithms. Turing's paper [Turing, 1936-1937], in which Turing machines first appear, makes particularly interesting reading if one bears in mind that the paper was written before modern electronic computers were conceived.
The study of recursive and partial recursive functions is part of a now well-developed branch of mathematics called recursive function theory. Rogers [1967], Kleene [1952], and Davis [1958] are good references in this subject.
Post's correspondence problem first appeared in Post [1947]. The partial correspondence problem preceding Exercise 0.4.24 is from Knuth [1965].
Computational complexity is the study of algorithms from the point of view of measuring the number of primitive operations (time complexity) or the amount of auxiliary storage (space complexity) required to compute a given function. Borodin [1970] and Hartmanis and Hopcroft [1971] give readable surveys of this topic, and Irland and Fischer [1970] have compiled a bibliography on this subject.
Solutions to many of the *'d exercises in this section can be found in Minsky [1967] and Hopcroft and Ullman [1969].
0.5. CONCEPTS FROM GRAPH THEORY
Graphs and trees provide convenient descriptions of many structures that are useful in performing computations. In this section we shall examine a number of concepts from graph theory which we shall use throughout the remainder of our book.

0.5.1. Directed Graphs
Graphs can be directed or undirected and ordered or unordered. Our primary interest will be in ordered and unordered directed graphs.

DEFINITION

An unordered directed graph G is a pair (A, R), where A is a set of elements called nodes (or vertices) and R is a relation on A. Unless stated otherwise, the term graph will mean directed graph.

Example 0.21
Let G = (A, R), where A = {1, 2, 3, 4} and R = {(1, 1), (1, 2), (2, 3), (2, 4), (3, 4), (4, 1), (4, 3)}. We can draw a picture of the graph G by numbering four points 1, 2, 3, 4 and drawing an arrow from point a to point b if (a, b) is in R. Figure 0.6 shows a picture of this directed graph. □
Fig. 0.6
Example of a directed graph.
A pair (a, b) in R is called an edge or arc of G. This edge is said to leave node a and enter node b (or be incident upon node b). For example, (1, 2) is an edge in the example graph. If (a, b) is an edge, we say that a is a predecessor of b and b is a successor of a.
Loosely speaking, two graphs are the same if we can draw them to look the same, independently of what names we give to the nodes. Formally, we define equality of unordered directed graphs as follows.

DEFINITION
Let G1 = (A1, R1) and G2 = (A2, R2) be graphs. We say G1 and G2 are equal (or the same) if there is a bijection f: A1 → A2 such that a R1 b if and
only if f(a) R2 f(b). That is, there is an edge between nodes a and b in G1 if and only if there is an edge between their corresponding nodes in G2.
It is common to have certain information attached to the nodes and/or edges of a graph. We call such information a labeling.

DEFINITION
Let (A, R) be a graph. A labeling of the graph is a pair of functions f and g, where f, the node labeling, maps A to some set, and g, the edge labeling, maps R to some (possibly distinct) set.
Let G1 = (A1, R1) and G2 = (A2, R2) be labeled graphs, with labelings (f1, g1) and (f2, g2), respectively. Then G1 and G2 are equal labeled graphs if there is a bijection h: A1 → A2 such that
(1) a R1 b if and only if h(a) R2 h(b) (i.e., G1 and G2 are equal as unlabeled graphs),
(2) f1(a) = f2(h(a)) (i.e., corresponding nodes have the same labels), and
(3) g1((a, b)) = g2((h(a), h(b))) (i.e., corresponding edges have the same labels).
In many cases, only the nodes or only the edges are labeled. These situations correspond, respectively, to f or g having a single element for its range. In these cases, condition (2) or (3), respectively, is trivially satisfied.

Example 0.22
Let G1 = ({a, b, c}, {(a, b), (b, c), (c, a)}) and G2 = ({0, 1, 2}, {(1, 0), (2, 1), (0, 2)}). Let the labeling of G1 be defined by f1(a) = f1(b) = X, f1(c) = Y, g1((a, b)) = g1((b, c)) = α, and g1((c, a)) = β. Let the labeling of G2 be f2(0) = f2(2) = X, f2(1) = Y, g2((0, 2)) = g2((2, 1)) = α, and g2((1, 0)) = β. G1 and G2 are shown in Fig. 0.7. G1 and G2 are equal. The correspondence is h(a) = 0, h(b) = 2, h(c) = 1.

Fig. 0.7  Equal labelled graphs.
DEFINITION
A sequence of nodes (a0, a1, ..., an), n ≥ 1, is a path of length n from node a0 to node an if there is an edge which leaves node a_{i−1} and enters node a_i for 1 ≤ i ≤ n. For example, (1, 2, 4, 3) is a path in Fig. 0.6. If there is a path from node a0 to node an, we say that an is accessible from a0. A cycle (or circuit) is a path (a0, a1, ..., an) in which a0 = an. In Fig. 0.6, (1, 1) is a cycle of length 1.
A directed graph is strongly connected if there is a path from a to b for every pair of distinct nodes a and b.
Finally we introduce the concept of the degree of a node. The in-degree of a node a is the number of edges entering a, and the out-degree of a is the number of edges leaving a.

0.5.2. Directed Acyclic Graphs
DEFINITION
A dag (short for directed acyclic graph) is a directed graph that has no cycles. Figure 0.8 shows an example of a dag.
A node having in-degree 0 will be called a base node. One having out-degree 0 is called a leaf. In Fig. 0.8, nodes 1, 2, 3, and 4 are base nodes and nodes 2, 4, 7, 8, and 9 are leaves.
Fig. 0.8
Example of a dag.
If (a, b) is an edge in a dag, a is called a direct ancestor of b, and b a direct
descendant of a. If there is a path from node a to node b, then a is said to be an ancestor of b and b a descendant of a. In Fig. 0.8, node 9 is a descendant of node 1 ; node 1 is an ancestor of node 9. Note that if R is a partial order on a set A, then (A, R) is a dag. Moreover, if we have a dag (A, R) and let R' be the relation "is a descendant of" on A, then R' is a partial order on A.
0.5.3. Trees
A tree is a special type of dag and has many important applications in compiler theory.

DEFINITION
An (oriented) tree T is a directed graph G = (A, R) with a specified node r in A called the root such that
(1) r has in-degree 0,
(2) All other nodes of T have in-degree 1, and
(3) Every node in A is accessible from r.
Figure 0.9 provides an example of a tree with six nodes. The root is numbered 1.
Fig. 0.9
Example of a tree.
We shall follow the convention of drawing trees with the root on top and having all arcs directed downward. Adopting this convention, we can omit the arrowheads.

THEOREM 0.3

A tree T has the following properties:
(1) T is acyclic.
(2) For each node in a tree there is a unique path from the root to that node.
Proof. Exercise. □
DEFINITION
A subtree of a tree T = (A, R) is any tree T' = (A', R') such that
(1) A' is nonempty and contained in A,
(2) R' = (A' × A') ∩ R, and
(3) No node of A − A' is a descendant of a node in A'.
For example,
is a subtree of the tree in Fig. 0.9. We say that the root of a subtree dominates the subtree.

0.5.4. Ordered Graphs
DEFINITION
An ordered directed graph is a pair (A, R), where A is a set of vertices as before and R is a set of linearly ordered lists of edges such that each element of R is of the form ((a, b1), (a, b2), ..., (a, bn)), where a is a distinct member of A. This element would indicate that, for vertex a, there are n arcs leaving a, the first entering vertex b1, the second entering vertex b2, and so forth.

Example 0.23
Figure 0.10 shows a picture of an ordered directed graph. The linear ordering on the arcs leaving a vertex is indicated by numbering the arcs leaving a vertex by 1, 2, ..., n, where n is the out-degree of that vertex.
Fig. 0.10
Ordered directed graph.
The formal specification for Fig. 0.10 is (A, R), where A = {a, b, c} and R = {((a, c), (a, b), (a, b), (a, a)), ((b, c))}. □
Notice that Fig. 0.10 is not a directed graph according to our definition, since there are two arcs leaving node a and entering node b. (Recall that in a set there is only one instance of each element.) As for unordered graphs, we define the notions of labeling and equality of ordered graphs.
DEFINITION

A labeling of an ordered graph G = (A, R) is a pair of mappings f and g such that
(1) f: A → S for some set S (f labels the nodes), and
(2) g maps R to sequences of symbols from some set T such that g maps ((a, b1), ..., (a, bn)) to a sequence of n symbols of T (g labels the edges).
Labeled graphs G1 = (A1, R1) and G2 = (A2, R2) with labelings (f1, g1) and (f2, g2), respectively, are equal if there exists a bijection h: A1 → A2 such that
(1) R1 contains ((a, b1), ..., (a, bn)) if and only if R2 contains ((h(a), h(b1)), ..., (h(a), h(bn))),
(2) f1(a) = f2(h(a)) for all a in A1, and
(3) g1(((a, b1), ..., (a, bn))) = g2(((h(a), h(b1)), ..., (h(a), h(bn)))).
Informally, two labeled ordered graphs are equal if there is a one-to-one correspondence between nodes that preserves the node and edge labels. If the labeling functions all have a range with one element, then the graph is essentially unlabeled, and only condition (1) needs to be shown. Similarly, only the node labeling or only the edge labeling may map to a single element, and condition (2) or (3) will become trivial.
For each ordered graph (A, R), there is an underlying unordered graph (A, R') formed by allowing R' to be the set of pairs (a, b) such that there is a list ((a, b1), ..., (a, bn)) in R and b = bi for some i, 1 ≤ i ≤ n.
An ordered dag is an ordered graph whose underlying graph is a dag. An ordered tree is an ordered graph (A, R) whose underlying graph is a tree, and such that if ((a, b1), ..., (a, bn)) is in R, then bi ≠ bj if i ≠ j. Unless otherwise stated, we shall assume that the direct descendants of a node of an ordered dag or tree are always linearly ordered from left to right in a diagram.
There is a great distinction between ordered graphs and unordered graphs from the point of view of when two graphs are the same.
Fig. 0.11  Two trees.
For example, the two trees T1 and T2 in Fig. 0.11 are equivalent if T1 and T2 are unordered. But if T1 and T2 are ordered, then T1 and T2 are not the same.

0.5.5. Inductive Proofs Involving Dags
Many theorems about dags, and especially trees, can be proved by induction, but it is often not clear on what to base the induction. Theorems which yield to this kind of proof are often of the form that something is true for all, or a certain subset of, the nodes of the tree. Thus we must prove something about nodes of the tree, and we need some parameter of nodes such that the inductive step can be proved. Two such parameters are the depth of a node, the minimum path length (or in the case of a tree, the path length) from a base node (root in the case of a tree) to the given node, and the height (or level) of a node, the maximum path length from the node to a leaf. Another approach to inductions on finite ordered trees is to order the nodes in some way and perform the induction on the position of the node in that sequence. Two common orderings are defined below. DEFINITION
Let T be a finite ordered tree. A preorder of the nodes of T is obtained by applying step 1 recursively, beginning at the root of T.
Step 1: Let this application of step 1 be to node a. If a is a leaf, list node a and halt. If a is not a leaf, let its direct descendants be a1, a2, ..., an. Then list a and subsequently apply step 1 to a1, a2, ..., an in that order.
A postorder of T is formed by changing the last sentence of step 1 to read "Apply step 1 to a1, a2, ..., an in that order and then list a."

Example 0.24
Consider the ordered tree of Fig. 0.12. The preorder of the nodes is 123456789. The postorder is 342789651. □
Sometimes it is possible to perform an induction on the place that a node has in some order, such as pre- or postorder. Examples of these forms of induction appear throughout the book.
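As an illustration (not part of the original text), the two orders can be computed by a direct transcription of step 1; the dictionary below encodes the tree of Fig. 0.12, whose shape is determined by the orders just listed.

    # Each node maps to the ordered list of its direct descendants.
    tree = {1: [2, 5], 2: [3, 4], 5: [6], 6: [7, 8, 9]}

    def preorder(a):
        # list a, then visit the direct descendants of a from left to right
        return [a] + [x for b in tree.get(a, []) for x in preorder(b)]

    def postorder(a):
        # visit the direct descendants first, then list a
        return [x for b in tree.get(a, []) for x in postorder(b)] + [a]

    print(preorder(1))    # -> [1, 2, 3, 4, 5, 6, 7, 8, 9]
    print(postorder(1))   # -> [3, 4, 2, 7, 8, 9, 6, 5, 1]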
0.5.6. Linear Orders from Partial Orders
If we have a partial order R on a set A, often we wish to find a linear order which is a superset of the partial order. This problem of embedding a partial order in a linear order is called topological sorting.
Intuitively, topological sorting corresponds to taking a dag, which is in effect a partial order, and squeezing the dag into a single column of nodes
Fig. 0.12
Ordered tree.
such that all edges point downward. The linear order is given by the position of nodes in the column. For example, under this type of transformation the dag of Fig. 0.8 could look as shown in Fig. 0.13.
Fig. 0.13  Linear order from the dag of Fig. 0.8.

Formally, we say that R' is a linear order that embeds a partial order R on a set A if R' is a linear order and R ⊆ R', i.e., a R b implies that a R' b for all a, b in A. Given a partial order R, there are many linear orders that embed R (Exercise 0.5.5). The following algorithm finds one such linear order.
ALGORITHM 0.1
Topological sort.
Input. A partial order R on a finite set A.
Output. A linear order R' on A such that R ⊆ R'.
Method. Since A is a finite set, we can represent the linear order R' on A as a list a1, a2, ..., an such that ai R' aj if i < j, and A = {a1, ..., an}. The following steps construct this sequence of elements:
(1) Let i = 1, A1 = A, and R1 = R.
(2) If Ai is empty, halt; a1, a2, ..., a_{i−1} is the desired linear order. Otherwise, let ai be an element in Ai such that a Ri ai is false for all a ∈ Ai.
(3) Let A_{i+1} be Ai − {ai} and R_{i+1} be Ri ∩ (A_{i+1} × A_{i+1}). Then let i be i + 1 and repeat step (2). □

If we represent a partial order as a dag, then Algorithm 0.1 has a particularly simple interpretation. At each step (Ai, Ri) is a dag and ai is a base node of (Ai, Ri). The dag (A_{i+1}, R_{i+1}) is formed from (Ai, Ri) by deleting node ai and all edges leaving ai.

Example 0.25
Let A = {a, b, c, d} and R = {(a, b), (a, c), (b, d), (c, d)}. Since a is the only node in A such that a' R a is false for all a' ∈ A, we must choose a1 = a. Then A2 = {b, c, d} and R2 = {(b, d), (c, d)}; we now choose either b or c for a2. Let us choose a2 = b. Then A3 = {c, d} and R3 = {(c, d)}. Continuing, we find a3 = c and a4 = d. The complete linear order R' is {(a, b), (b, c), (c, d), (a, c), (b, d), (a, d)}.
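A direct transcription of Algorithm 0.1 into Python (an illustration, not part of the original text):

    # Repeatedly pick an element that nothing remaining precedes, output it,
    # and delete it from the set and from the relation.
    def topological_sort(A, R):
        A, R = set(A), set(R)
        order = []
        while A:
            a = next(x for x in A if not any((y, x) in R for y in A))
            order.append(a)
            A.remove(a)
            R = {(y, z) for (y, z) in R if y in A and z in A}
        return order

    print(topological_sort({"a", "b", "c", "d"},
                           {("a", "b"), ("a", "c"), ("b", "d"), ("c", "d")}))
    # -> ['a', 'b', 'c', 'd'] or ['a', 'c', 'b', 'd']; both embed R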
THEOREM 0.4
Algorithm 0.1 produces a linear order R' which embeds the given partial order R.
Proof. A simple inductive exercise. □

0.5.7. Representations for Trees
A tree is a two-dimensional structure, but in many situations it is convenient to use only one-dimensional data structures. Consequently we are interested in having one-dimensional representations for trees which have all the information contained in the two-dimensional picture. What we mean by this is that the two-dimensional picture can be recovered from the one-dimensional representation. Obviously, one one-dimensional representation of a tree T = (A, R) would be the sets A and R themselves.
But there are also other representations. For example, we can use nested brackets to indicate the nodes at each depth of a tree. Recall that the depth of a node in a tree is the length of the path from the root to that node. For example, in Fig. 0.9, node 1 is at depth 0, node 3 is at depth 1, and node 6 is at depth 2. The depth of a tree is the length of the longest path. The tree of Fig. 0.9 has depth 2. Using brackets to indicate depth, the tree of Fig. 0.9 could be represented as 1(2, 3(4, 5, 6)). We shall call this the left-bracketed representation, since a subtree is represented by the expression appearing inside a balanced pair of parentheses and the node which is the root of that subtree appears immediately to the left of the left parenthesis.

DEFINITION

In general, the left-bracketed representation of a tree T can be obtained by applying the following recursive rules to T. The string lrep(T) denotes the left-bracketed representation of tree T.
(1) If T has a root numbered a with subtrees T1, T2, ..., Tk in order, then lrep(T) = a(lrep(T1), lrep(T2), ..., lrep(Tk)).
(2) If T has a root numbered a with no direct descendants, then lrep(T) = a.
If we delete the parentheses from a left-bracketed representation of a tree, we are left with a preorder of the nodes.
We can also obtain a right-bracketed representation for a tree T, rrep(T), as follows:
(1) If T has a root numbered a with subtrees T1, T2, ..., Tk, then rrep(T) = (rrep(T1), rrep(T2), ..., rrep(Tk))a.
(2) If T has a root numbered a with no direct descendants, then rrep(T) = a.
Thus rrep(T) for the tree of Fig. 0.12 would be ((3, 4)2, ((7, 8, 9)6)5)1. In this representation, the direct ancestor is immediately to the right of the first right parenthesis enclosing that node. Also, note that if we delete the parentheses we are left with a postorder of the nodes.
Another representation of a tree is to list the direct ancestors of nodes 1, 2, ..., n of a tree in that order. The root would be recognized by letting its ancestor be 0.
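As an illustration (not part of the original text), lrep and rrep can be computed by a direct transcription of the recursive rules; the tree of Fig. 0.12 is again written as a node-to-descendants dictionary, as in the earlier sketch.

    tree = {1: [2, 5], 2: [3, 4], 5: [6], 6: [7, 8, 9]}

    def lrep(a):
        kids = tree.get(a, [])
        if not kids:
            return str(a)
        return str(a) + "(" + ", ".join(lrep(b) for b in kids) + ")"

    def rrep(a):
        kids = tree.get(a, [])
        if not kids:
            return str(a)
        return "(" + ", ".join(rrep(b) for b in kids) + ")" + str(a)

    print(lrep(1))   # -> 1(2(3, 4), 5(6(7, 8, 9)))
    print(rrep(1))   # -> ((3, 4)2, ((7, 8, 9)6)5)1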
Example 0.26

The tree shown in Fig. 0.14 would be represented by 0122441777. Here the 0 in position 1 indicates that node 1 has "node 0" as its direct ancestor (i.e., node 1 is the root). The 1 in position 7 indicates that node 7 has direct ancestor 1.
Fig. 0.14  A tree.
0.5.8. Paths Through a Graph
In this section we shall outline a computationally efficient method of computing the transitive closure of a relation R on a set A. If we view the relation as an unordered graph (A, R), then the transitive closure of R is equivalent to the set of pairs of nodes (a, b) such that there is a path from node a to node b. Another possible interpretation is to view the relation (or the unordered graph) as a (square) Boolean matrix (that is, a matrix of O's and l's) called an adjacency matrix, in which the entry in row i, column j is 1 if and only if the element corresponding to row i is R-related to the element corresponding to column j. Figure 0.15 shows the Boolean matrix M corresponding to the
Fig. 0.15 Boolean matrix for Fig. 0.6.
graph of Fig. 0.6. If M is a Boolean matrix, then M⁺ = M¹ + M² + M³ + ⋯ (where Mⁿ represents M Boolean multiplied† by itself n times) represents the transitive
closure of the relation represented by M. Thus the algorithm could also be used as a method of computing M⁺. For Fig. 0.15, M⁺ would be a matrix of all 1's.

†That is, use the usual formula for matrix multiplication with the Boolean operations · and + for multiplication and addition, respectively.

Actually we shall give a slightly more general algorithm here. We assume that we have an unordered directed graph in which there is a nonnegative cost c_ij associated with an edge from node i to node j. (If there is no edge from node i to node j, c_ij is infinite.) The algorithm will compute the minimum cost of a path between any pair of nodes. The case in which we wish to compute only the transitive closure of a relation R over {a_1, ..., a_n} is expressed by letting c_ij = 0 if a_i R a_j and c_ij = ∞ otherwise.

ALGORITHM 0.2

Minimum cost of paths through a graph.

Input. A graph with n nodes numbered 1, 2, ..., n and a cost function c_ij for 1 ≤ i, j ≤ n, with c_ij ≥ 0 for all i and j.

Output. An n × n matrix M = [m_ij], with m_ij the lowest cost of any path from node i to node j, for all i and j.

Method.
(1) Set m_ij = c_ij for all i and j such that 1 ≤ i, j ≤ n.
(2) Set k = 1.
(3) For all i and j, if m_ij > m_ik + m_kj, set m_ij to m_ik + m_kj.
(4) If k < n, increase k by 1 and go to step (3). If k = n, halt. □
The heart of Algorithm 0.2 is step (3), in which we deduce whether the current cost of going from node i to node j can be made smaller by first going from node i to node k and then from node k to node j.
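(A sketch, not part of the original text.) Algorithm 0.2 translates almost line for line into the following Python fragment; float('inf') stands for an infinite cost, nodes are indexed from 0 rather than 1, and the sample relation is the one of Exercise 0.5.10, R = {(1, 2), (2, 3), (4, 1), (4, 3)} on {1, 2, 3, 4}. The function name min_cost_paths is ours.

    def min_cost_paths(c):
        n = len(c)
        m = [row[:] for row in c]              # step (1): m[i][j] = c[i][j]
        for k in range(n):                     # steps (2) and (4): k runs over all nodes
            for i in range(n):
                for j in range(n):             # step (3): try passing through node k
                    if m[i][j] > m[i][k] + m[k][j]:
                        m[i][j] = m[i][k] + m[k][j]
        return m

    # Transitive closure: cost 0 where i R j, infinity otherwise; then i R+ j
    # exactly when the resulting minimum cost is finite.
    inf = float('inf')
    c = [[inf, 0, inf, inf],
         [inf, inf, 0, inf],
         [inf, inf, inf, inf],
         [0, inf, 0, inf]]
    closure = min_cost_paths(c)
    print([[x < inf for x in row] for row in closure])
    # [[False, True, True, False], [False, False, True, False],
    #  [False, False, False, False], [True, True, True, False]]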
Since step (3) is executed once for all possible values of i, j, and k, Algorithm 0.2 is n³ in time complexity. It is not immediately clear that Algorithm 0.2 does produce the minimum cost of any path from node i to j. Thus we should prove that Algorithm 0.2 does what it claims.

THEOREM 0.5
When Algorithm 0.2 terminates, m_ij is the smallest value expressible as c_{v1 v2} + ⋯ + c_{v(m-1) vm} such that v_1 = i and v_m = j. (This sum is the cost of the path v_1, v_2, ..., v_m from node i to node j.)
Proof. To prove the theorem we shall prove the following statement by induction on l, the value of k in step (3) of the algorithm.

Statement (0.5.1). After step (3) is executed with k = l, m_ij has the smallest value expressible as a sum of the form c_{v1 v2} + ⋯ + c_{v(m-1) vm}, where v_1 = i, v_m = j, and none of v_2, ..., v(m-1) is greater than l. We shall call this minimum value the correct value of m_ij with k = l. This value is the cost of a cheapest path from node i to node j which does not pass through a node whose number is higher than l.

Basis. Let us consider the initial condition, which we can represent by letting l = 0. [If you like, we can think of step (1) as step (3) with k = 0.] When l = 0, m = 2, so m_ij = c_ij, which is the correct initial value.

Inductive Step. Assume that statement (0.5.1) is true for all l < l_0. Let us consider the value of m_ij after step (3) has been executed with k = l_0. Suppose that the minimum sum c_{v1 v2} + ⋯ + c_{v(m-1) vm} for m_ij with k = l_0 is such that no v_p, 2 ≤ p ≤ m - 1, is equal to l_0. From the inductive hypothesis c_{v1 v2} + ⋯ + c_{v(m-1) vm} is the correct value of m_ij with k = l_0 - 1, so c_{v1 v2} + ⋯ + c_{v(m-1) vm} is also the correct value of m_ij with k = l_0.

Now suppose that the minimum sum s_ij = c_{v1 v2} + ⋯ + c_{v(m-1) vm} for m_ij with k = l_0 is such that v_p = l_0 for some 2 ≤ p ≤ m - 1. That is, s_ij is the cost of the path v_1, v_2, ..., v_m. We can assume that there is no node v_q on this path, q ≠ p, such that v_q is l_0. Otherwise the path v_1, v_2, ..., v_m contains a cycle, and we can delete at least one term from the sum c_{v1 v2} + ⋯ + c_{v(m-1) vm} without increasing the value of the sum s_ij. Thus we can always find a sum for s_ij in which v_p = l_0 for only one value of p, 2 ≤ p ≤ m - 1.

Let us assume that 2 < p < m - 1. The cases p = 2 and p = m - 1 are left to the reader. Let us consider the sums s_{i vp} = c_{v1 v2} + ⋯ + c_{v(p-1) vp} and s_{vp j} = c_{vp v(p+1)} + ⋯ + c_{v(m-1) vm} (the costs of the paths from node i to node v_p and from node v_p to node j in the sum s_ij). From the inductive hypothesis we can assume that s_{i vp} is the correct value for m_{i vp} with k = l_0 - 1 and that s_{vp j} is the correct value for m_{vp j} with k = l_0 - 1. Thus when step (3) is executed with k = l_0, m_ij is correctly given the value m_{i vp} + m_{vp j}.

We have thus shown that statement (0.5.1) is true for all l. When l = n, statement (0.5.1) states that at the end of Algorithm 0.2, m_ij has the lowest possible value. □
A common special case of finding minimum cost paths through a graph occurs when we want to determine the set of nodes which are accessible from a given node. Equivalently, given a relation R on set A, with a in A, we want to find the set of b in A such that a R⁺ b, where R⁺ is the transitive closure of R. For this purpose we can use the following algorithm of quadratic time complexity.
ALGORITHM 0.3
Finding the set of nodes accessible from a given node of a directed graph.
Input. Graph (A, R), with A a finite set and a in A.

Output. The set of nodes b in A such that a R* b.

Method. We form a list L and update it repeatedly. We shall also mark members of A during the course of the algorithm. Initially, all members of A are unmarked. The nodes marked will be those accessible from a.
(1) Set L = a and mark a.
(2) If L is empty, halt. Otherwise, let b be the first element on list L. Delete b from L.
(3) For all c in A such that b R c and c is unmarked, add c to the bottom of list L, mark c, and go to step (2). □

We leave a proof that Algorithm 0.3 works correctly to the Exercises.
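(A sketch, not part of the original text.) Algorithm 0.3 can be written out as follows, with the relation R stored as a dictionary from each node to the list of its successors; the function name accessible is ours.

    def accessible(R, a):
        L = [a]                      # step (1): put a on the list L and mark it
        marked = {a}
        while L:                     # step (2): halt when L is empty
            b = L.pop(0)             # let b be the first element of L; delete it
            for c in R.get(b, []):   # step (3): every c with b R c ...
                if c not in marked:  # ... that is still unmarked
                    marked.add(c)
                    L.append(c)      # goes to the bottom of L and is marked
        return marked

    R = {1: [2], 2: [3], 4: [1, 3]}
    print(accessible(R, 4))          # {1, 2, 3, 4}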
EXERCISES
0.5.1. What is the maximum number of edges a dag with n nodes can have?

0.5.2. Prove Theorem 0.3.

0.5.3. Give the pre- and postorders for the tree of Fig. 0.14. Give left- and right-bracketed representations for the tree.

*0.5.4. (a) Design an algorithm that will map a left-bracketed representation of a tree into a right-bracketed representation.
(b) Design an algorithm that will map a right-bracketed representation of a tree into a left-bracketed representation.

0.5.5. How many linear orders embed the partial order of the dag of Fig. 0.8?

0.5.6. Complete the proof of Theorem 0.5.

0.5.7. Give upper bounds on the time and space necessary to implement Algorithm 0.1. Assume that one memory cell is needed to store any node name or integer, and that one elementary step is needed for each of a reasonable set of primitive operations, including the arithmetic operations and examination or alteration of a cell in an array indexed by a known integer.

0.5.8. Let A = {a, b, c, d} and R = {(a, b), (b, c), (a, c), (b, d)}. Find a linear order R' such that R ⊆ R'. How many such linear orders are there?

DEFINITION
An undirected graph G is a triple (A, E, f), where A is a set of nodes, E is a set of edge names, and f is a mapping from E to the set of unordered pairs of nodes. If f(e) = {a, b}, then we mean that edge e connects nodes a and b. A path in an undirected graph is a sequence of
nodes a_0, a_1, a_2, ..., a_n such that there is an edge connecting a_{i-1} and a_i for 1 ≤ i ≤ n. An undirected graph is connected if there is a path between every pair of distinct nodes.

DEFINITION

An undirected tree can be defined recursively as follows. An undirected tree is a set of one or more nodes with one distinguished node r called the root of the tree. The remaining nodes can be partitioned into zero or more sets T_1, ..., T_k, each of which forms a tree. The trees T_1, ..., T_k are called the subtrees of the root, and an undirected edge connects r with all of and only the subtree roots. A spanning tree for a connected undirected graph G is a tree which contains all the nodes of G.
0.5.9. Provide an algorithm to construct a spanning tree for a connected undirected graph.
0.5.10. Let (A, R) be an unordered graph such that A = {1, 2, 3, 4} and R = {(1, 2), (2, 3), (4, 1), (4, 3)}. Find R⁺, the transitive closure of R. Let the adjacency matrix for R be M. Compute M⁺ and show that M⁺ is the adjacency matrix for (A, R⁺).

0.5.11. Show that Algorithm 0.2 takes time proportional to n³ in basic steps similar to those mentioned in Exercise 0.5.7.

0.5.12. Prove that Algorithm 0.3 marks node b if and only if a R⁺ b.

0.5.13. Show that Algorithm 0.3 takes time proportional to the maximum of #A and #R.

0.5.14. The following are three unordered directed graphs. Which two are the same?
G1 = ({a, b, c}, {(a, b), (b, c), (c, a)})
G2 = ({a, b, c}, {(b, a), (a, c), (b, c)})
G3 = ({a, b, c}, {(c, b), (c, a), (b, a)})

0.5.15. The following are three ordered directed graphs with only nodes labeled. Which two are the same?
G1 = ({a, b, c}, {((a, b), (a, c)), ((b, a), (b, c)), ((c, b))}) with labeling l1(a) = X, l1(b) = Z, and l1(c) = Y.
G2 = ({a, b, c}, {((a, c)), ((b, c), (b, a)), ((c, b), (c, a))}) with labeling l2(a) = Y, l2(b) = X, and l2(c) = Z.
G3 = ({a, b, c}, {((a, c), (a, b)), ((b, c)), ((c, a), (c, b))}) with labeling l3(a) = Y, l3(b) = X, and l3(c) = Z.
0.5.16. Complete the proof of Theorem 0.4.

0.5.17. Provide an algorithm to determine whether an undirected graph is connected.

*0.5.18. Provide an algorithm to determine whether two graphs are equal.

*0.5.19. Provide an efficient algorithm to determine whether two nodes of a tree are on the same path. Hint: Consider preordering the nodes.

**0.5.20. Provide an efficient algorithm to determine the first common ancestor of two nodes of a tree.

Programming Exercises

0.5.21. Write a program that will construct an adjacency matrix from a linked list representation of a graph.

0.5.22. Write a program that will construct a linked list representation of a graph from an adjacency matrix.

0.5.23. Write programs to implement Algorithms 0.1, 0.2, and 0.3.
BIBLIOGRAPHIC NOTES
Graphs are an ancient and honorable part of mathematics. Harary [1969], Ore [1962], and Berge [1958] discuss the theory of graphs. Knuth [1968] is a good source for techniques for manipulating graphs and trees inside computers. Algorithm 0.2 is Warshall's algorithm as given in Floyd [1962a]. One interesting result on computing the transitive closure of a relation is found in Munro [1971], where it is shown that the transitive closure of a relation can be computed in the time required to compute the product of two matrices over a Boolean ring. Thus, using Strassen [1969], the time complexity of transitive closure is no greater than n^2.81, not n³, as Algorithm 0.2 takes.
1
AN INTRODUCTION TO COMPILING
This book considers the problems involved in mapping one representation of a procedure into another. The most common occurrence of this mapping is during the compilation of a source program, written in a high-level programming language, into object code for a particular digital computer. We shall discuss algorithm-translating techniques which are applicable to the design of compilers and other language-processing devices. To put these techniques into perspective, in this chapter we shall summarize some of the salient aspects of the compiling process and mention certain other areas in which parsing or translation plays a major role. As in the previous chapter, those who have prior familiarity with the material, in this case compilers, will find the discussion quite elementary. These readers can skip this chapter or merely skim it for terminology.
1.1. PROGRAMMING LANGUAGES
In this section we shall briefly discuss the notion of a programming language. We shall then touch on the problems inherent in the specification of a programming language and in the design of a translator for such a language.

1.1.1. Specification of Programming Languages
The basic machine language operations of a digital computer are invariably very primitive compared with the complex functions that occur in mathematics, engineering, and other disciplines. Although any function that can be specified as a procedure can be implemented as a sequence of exceedingly simple machine language instructions, for most applications it is much
preferable to use a higher-level language whose primitive instructions approximate the type of operations that occur in the application. For example, if matrix operations are being performed, it is more convenient to write an instruction of the form

    A = B * C

to represent the fact that A is a matrix obtained by multiplying matrices B and C together rather than a long sequence of machine language operations whose intent is the same. Programming languages can alleviate much of the drudgery of programming in machine language, but they also introduce a number of new problems of their own. Of course, since computers can still "understand" only machine language, a program written in a high-level language must be ultimately translated into machine language. The device performing this translation has become known as a compiler.

Another problem concerned with programming languages is the specification of the language itself. In a minimal specification of a programming language we need to define
(1) The set of symbols which can be used in valid programs,
(2) The set of valid programs, and
(3) The "meaning" of each valid program.

Defining the permissible set of symbols is easy. One should bear in mind, however, that in some languages such as SNOBOL or FORTRAN, the beginning and/or end of a card has significance and thus should be considered a "symbol." Blank is also considered a symbol in some cases.

Defining the set of programs which are "valid" is a much more difficult task. In many cases it is very hard just to decide whether a given program should be considered valid. In the specification of programming languages it has become customary to define the class of permissible programs by a set of grammatical rules which allow some programs of questionable validity to be constructed. For example, many FORTRAN specifications permit a statement of the form

    L    GOTO L
within a "valid" F O R T R A N program. However, the specification of a superset of the truly valid programs is often much simpler than the specification of all and only those programs which we would consider valid in the narrowest sense of the word. The third and most difficult aspect of language specification is defining the meaning of each valid program. Several approaches to this problem have been taken. One method is to define a mapping which associates with each valid program a sentence in a language whose meaning we understand. For
example, we could use functional calculus or lambda calculus as the "well-understood" language. Then we can define the meaning of a program in any programming language in terms of an equivalent "program" in functional calculus or lambda calculus. By equivalent program, we mean one which defines the same function. Another method of giving meaning to programs is to define an idealized machine. The meaning of a program can then be specified in terms of its effect on this machine started off in some predetermined initial configuration. In this scheme the abstract machine becomes an interpreter for the language. A third approach is to ignore deep questions of "meaning" altogether, and this is the approach we shall take here. For us, the "meaning" of a source program is simply the output of the compiler when applied to the source program.

In this book we shall assume that we have the specification of a compiler as a set of pairs (x, y), where x is a source language program and y is a target language program into which x is to be translated. We shall assume that we know what this set of pairs is beforehand, and that our main concern is the construction of an efficient device which when given x as input will produce y as output. We shall refer to the set of pairs (x, y) as a translation. If each x is a string over alphabet Σ and y is a string over Δ, then a translation is merely a mapping from Σ* to Δ*.

1.1.2. Syntax and Semantics
It is often more convenient in specifying and implementing translations to treat a translation as the composition of two simpler mappings. The first of these relations, known as the syntactic mapping, associates with each input (program in the source language) some structure which is the domain for the second relation, the semantic mapping. It is not immediately apparent that there should be any structure which will aid in the translation process, but almost without exception, a labeled tree turns out to be a very useful structure to place on the input. Without delving into the philosophy of why this should be so, much of this book will be devoted to algorithms for the efficient construction of the proper trees for input programs. As a natural example of how tree structures are built on strings, every English sentence can be broken down into syntactic categories which are related by grammatical rules. For example, the sentence "The pig is in the pen" has a grammatical structure which is indicated by the tree of Fig. 1.1, whose nodes are labeled by syntactic categories and whose leaves are labeled by the terminal symbols, which, in this case, are English words. Likewise, a program written in a programming language can be broken
Fig. 1.1 Tree structure for English sentence.

down into syntactic components which are related by syntactic rules governing the language. For example, the string

    a + b * c

may have a syntactic structure given by the tree of Fig. 1.2.† The term parsing or syntactic analysis is given to the process of finding the syntactic structure
Fig. 1.2 Tree from arithmetic expression.

†The use of three syntactic categories, ⟨expression⟩, ⟨term⟩, and ⟨factor⟩, rather than just ⟨expression⟩, is forced on us by our desire that the structure of an arithmetic expression be unique. The reader should bear this in mind, lest our subsequent examples of the syntactic analysis of arithmetic expressions appear unnecessarily complicated.
associated with an input sentence. The syntactic structure of a sentence is useful in helping to understand the relationships among the various symbols of a sentence. The term syntax of a language will refer to a relation which associates with each sentence of a language a syntactic structure. We can then define a valid sentence of a language as a string of symbols which has the overall syntactic structure of (sentence). In the next chapter we shall discuss several methods of rigorously defining the syntax of a language. The second part of the translation is called the semantic mapping, in which the structured input is mapped to an output, normally a machine language program. The term semantics of a language will refer to a mapping which associates with the syntactic structure of each input a string in some language (possibly the same language) which we consider the "meaning" of the original sentence. The specification of the semantics of a language is a very difficult matter which has not yet been fully resolved, particularly for natural languages, e.g., English. Even the specification of the syntax and semantics of a programming language is a nontrivial task. Although there are no universally applicable methods, there are two concepts from language theory which can be used to make up part of the description. The first of these is the concept of a context-free grammar. Most of the rules for describing syntactic structure can be formalized as a context-free grammar. Moreover, a context-free grammar provides a description which is sufficiently precise to be used as part of the specification of the compiler itself. In Chapter 2 we shall present the relevant concepts from the theory of context-free languages. The second concept is the syntax-directed translation schema, which can be used to specify mappings from one language to another. We shall study syntax-directed translation schemata in some detail in Chapters 3 and 9. In this book an attempt has been made to present those aspects of language theory and other formal theories which bear on the design of programming languages and their compilers. In some cases the impact of the theory is only to provide a framework in which to talk about problems that occur in compiling. In other cases the theory will provide uniform and practicable solutions to some of the design problems that occur in compiling.
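(An illustrative sketch, not part of the original text.) The tree of Fig. 1.2 can be held in a program as a nested structure whose leaves, read from left to right, give back the original string. The exact shape shown below, built from ⟨expression⟩, ⟨term⟩, and ⟨factor⟩, and the helper names node and frontier, are our assumptions.

    def node(label, *children):
        return (label, list(children))

    tree = node("<expression>",
                node("<expression>",
                     node("<term>", node("<factor>", node("a")))),
                node("+"),
                node("<term>",
                     node("<term>", node("<factor>", node("b"))),
                     node("*"),
                     node("<factor>", node("c"))))

    def frontier(t):
        # reading the leaves from left to right recovers the original string
        label, children = t
        return label if not children else "".join(frontier(c) for c in children)

    print(frontier(tree))   # a+b*c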
BIBLIOGRAPHIC NOTES
High-level programming languages evolved in the early 1950's. At that time computers lacked floating-point arithmetic operations, so the first programming languages were representations for floating-point arithmetic. The first major programming language was FORTRAN, which was developed in the mid-1950's.
Several other algebraic languages were also developed at that time, but FORTRAN emerged as the most widely used language. Since that time hundreds of high-level programming languages have been developed. Sammet [1969] gives an account of many of the languages in existence in the mid-1960's.

Much of the theory of programming languages and compilers has lagged behind the practical development. A great stimulus to the theory of formal languages was the use of what is now known as Backus Naur Form (BNF) in the syntactic definition of ALGOL 60 (Naur [1963]). This report, together with the early work of Chomsky [1959a, 1963], stimulated the vigorous development of the theory of formal languages during the 1960's. Much of this book presents results from language theory which have relevance to the design and understanding of language translators. Most of the early work on language theory was concerned with the syntactic definition of languages. The semantic definition of languages, a much more difficult question, received less attention and even at the time of the writing of this book was not a fully resolved matter. Two good anthologies on the formal specification of semantics are Steel [1966] and Engeler [1971]. The IBM Vienna laboratory definition of PL/I [Lucas and Walk, 1969] is one example of a totally formal approach to the specification of a major programming language.

One of the more interesting developments in programming languages has been the creation of extensible languages--languages whose syntax and semantics can be changed within a program. One of the earliest and most commonly proposed schemes for language extension is the macro definition. See McIlroy [1960], Leavenworth [1966], and Cheatham [1966], for example. Galler and Perlis [1967] have suggested an extension scheme whereby new data types and new operators can be introduced into ALGOL. Later developments in extensible languages are contained in Christensen and Shaw [1969] and Wegbreit [1970]. ALGOL 68 is an example of a major programming language with language extension facilities [Van Wijngaarden, 1969].
1.2. AN OVERVIEW OF COMPILING
We shall discuss techniques and algorithms which are applicable to the design of compilers and other language-processing devices. To put these algorithms into perspective, in this section we shall take a global picture of the compiling process.

1.2.1. The Portions of a Compiler
Many compilers for many languages have certain processes in common. We shall attempt to abstract the essence of some of these processes. In doing so we shall attempt to remove from these processes as many machine-dependent and operating-system-dependent considerations as possible. Although implementation considerations are important (a bad implementation can destroy a good algorithm), we feel that understanding the fundamental
nature of a problem is essential and will make the techniques for solution of that problem applicable to other basically similar problems. A source program in a programming language is nothing more than a string of characters. A compiler ultimately converts this string of characters into a string of bits, the object code. In this process, subprocesses with the following names can often be identified:
(1) Lexical analysis.
(2) Bookkeeping, or symbol table operations.
(3) Parsing, or syntax analysis.
(4) Code generation, or translation to intermediate code (e.g., assembly language).
(5) Code optimization.
(6) Object code generation (e.g., assembly).
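(A schematic sketch, not part of the original text.) Viewed abstractly, these processes compose into a single mapping from source text to object code; every function name below is a placeholder rather than an actual implementation.

    def compile_source(source, lex, parse, generate, optimize, assemble):
        tokens = lex(source)                   # (1) lexical analysis; (2) bookkeeping
        tree = parse(tokens)                   # (3) parsing, or syntax analysis
        intermediate = generate(tree)          # (4) translation to intermediate code
        intermediate = optimize(intermediate)  # (5) code optimization
        return assemble(intermediate)          # (6) object code generation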
In any given compiler, the order of the processes may be slightly different from that shown, and several of the processes may be combined into a single phase. Moreover, a compiler should not be shattered by any input it receives; it must be capable of responding to any input string. For those input strings which do not represent syntactically valid programs, appropriate diagnostic messages must be given.

We shall describe the first five phases of compilation briefly. These phases do not necessarily occur separately in an actual compiler. However, it is often convenient to conceptually partition a compiler into these phases in order to isolate the problems that are unique to that part of the compilation process.

1.2.2. Lexical Analysis
The lexical analysis phase comes first. The input to the compiler and hence the lexical analyzer is a string of symbols from an alphabet of characters. In the reference version of PL/I, for example, the terminal symbol alphabet contains the 60 symbols

    A B C ... Z  $ @ #
    0 1 2 ... 9
    blank
    = + - * / ( ) , . ' % ; : ¬ & | > < ? _
In a program, certain combinations of symbols are often treated as a single entity. Some typical examples of this would include the following:
(1) In languages such as PL/I a string of one or more blanks is normally treated as a single blank.
(2) Certain languages have keywords such as BEGIN, END, GOTO, DO, INTEGER, and so forth which are treated as single entities.
(3) Strings representing numerical constants are treated as single items.
(4) Identifiers used as names for variables, functions, procedures, labels, and the like are another example of a single lexical unit in a programming language.

It is the job of the lexical analyzer to group together certain terminal characters into single syntactic entities, called tokens. What constitutes a token is implied by the specification of the programming language. A token is a string of terminal symbols, with which we associate a lexical structure consisting of a pair of the form (token type, data). The first component is a syntactic category such as "constant" or "identifier," and the second component is a pointer to data that have been accumulated about this particular token. For a given language the number of token types will be presumed finite. We shall call the pair (token type, data) a "token" also, when there is no source of confusion. Thus the lexical analyzer is a translator whose input is the string of symbols representing the source program and whose output is a stream of tokens. This output forms the input to the syntactic analyzer.

Example 1.1
Consider the following assignment statement from a FORTRAN-like language:

    COST = (PRICE + TAX) * 0.98

The lexical analysis phase would find COST, PRICE, and TAX to be tokens of type identifier and 0.98 to be a token of type constant. The characters =, (, +, ), and * are tokens by themselves.

Let us assume that all constants and identifiers are to be mapped into tokens of the type ⟨id⟩. We assume that the data component of a token is a pointer, an entry to a table containing the actual name of the identifier together with other data we have collected about that particular identifier. The first component of a token is used by the syntactic analyzer for parsing. The second component is used by the code generation phase to produce appropriate machine code.

Thus the output of the lexical analyzer operating directly on our input string would be the following sequence of tokens:

    ⟨id⟩1 = ( ⟨id⟩2 + ⟨id⟩3 ) * ⟨id⟩4

Here we have indicated the data pointer of a token by means of a subscript. The symbols =, (, +, ), and * are to be construed as tokens whose token type is represented by themselves. They have no associated data, and hence we indicate no data pointer for them. □
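(A sketch, not part of the original text.) A lexical analyzer for statements like the one in Example 1.1 can be written in a few lines; the regular expression, the table layout, and the token format used here are our simplifying assumptions, not the book's.

    import re

    def lex(text, table):
        tokens = []
        for lexeme in re.findall(r"[A-Z][A-Z0-9]*|\d+\.\d+|[=()+*]", text):
            if lexeme in "=()+*":
                tokens.append((lexeme, None))      # no data pointer
            else:
                if lexeme not in table:
                    table.append(lexeme)           # bookkeeping: a new entry
                tokens.append(("<id>", table.index(lexeme) + 1))
        return tokens

    table = []
    print(lex("COST = (PRICE + TAX) * 0.98", table))
    # [('<id>', 1), ('=', None), ('(', None), ('<id>', 2), ('+', None),
    #  ('<id>', 3), (')', None), ('*', None), ('<id>', 4)]
    print(table)   # ['COST', 'PRICE', 'TAX', '0.98']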
Lexical analysis is easy if tokens of more than one character are isolated by characters which are tokens themselves. In the example above, =, (, +, and * cannot appear as part of an identifier, so COST, PRICE, and TAX can be readily distinguished as tokens. However, lexical analysis may not be so easy in general. For example, consider the following valid FORTRAN statements:

(1)
DO 10 I = 1.15
(2)
DO 10 I = 1,15
In statement (1), DO10I is a variable† and 1.15 a constant. In statement (2), DO is a keyword, 10 a constant, and I a variable; 1 and 15 are constants. If a lexical analyzer were implemented as a coroutine [Gentleman, 1971; McIlroy, 1968] and were to start at the beginning of one of these statements, with a command such as "find the next token," it could not determine whether that token was DO or DO10I until it had reached the comma or decimal point. Thus a lexical analyzer may need to look ahead of the token it is actually interested in.

A worse example occurs in PL/I, where keywords may also be variables. Upon seeing an input string of the form

    DECLARE(X1, X2, ..., Xn)
the lexical analyzer would have no way of telling whether DECLARE was a function identifier and X1, X2, ..., Xn were its arguments or whether DECLARE was a keyword causing the identifiers X1, X2, ..., Xn to have the attribute (or attributes) immediately following the right parenthesis. Here the distinction would have to be made on what follows the right parenthesis. But since n can be arbitrarily large,‡ the PL/I lexical analyzer might have to look ahead an arbitrary distance. However, there is another approach to lexical analysis that is less convenient but avoids the problem of arbitrary lookahead.

We shall define two extreme approaches to lexical analysis. Most techniques in use fall into one or the other of these categories, and some are a combination of the two:
(1) A lexical analyzer is said to operate directly if, given a string of input text and a pointer into that text, the analyzer will determine the token immediately to the right of the place pointed to and move the pointer to the right of the portion of text forming the token.
(2) A lexical analyzer is said to operate indirectly if, given a string of text, a pointer into that text, and a token type, it will determine if input

†Recall that in FORTRAN blanks are ignored.
‡The language specification does not impose an upper limit on n. However, a given PL/I compiler will.
characters appearing immediately to the right of the pointer form a token of that type. If so, the pointer is moved to the right of the portion of text forming that token. Example 1.2
Consider the FORTRAN text

    DO 10 I = 1,15

with the pointer currently at the left end. An indirect lexical analyzer would respond "yes" if asked for a token of type DO or a token of type ⟨identifier⟩. In the former case, the pointer would be moved two symbols to the right, and in the latter case, five symbols to the right. A direct lexical analyzer would examine the text up to the comma and conclude that the next token was of type DO. The pointer would then move two symbols to the right, although many more symbols would be scanned in the process. □

Generally, we shall describe parsing algorithms under the assumption that lexical analysis is direct. The backtrack or "nondeterministic" parsing algorithms can be used with indirect lexical analysis. We shall include a discussion of this type of parsing in Chapters 4 and 6.
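(A sketch, not part of the original text.) The lookahead needed by a direct lexical analyzer on the two statements of Example 1.2 can be mimicked as follows; the function first_token and its return convention are our assumptions.

    def first_token(stmt):
        s = stmt.replace(" ", "")          # FORTRAN: blanks are not significant
        head, _, tail = s.partition("=")   # text before and after the = sign
        if s.startswith("DO") and "," in tail:
            return ("DO", 2)               # "1,15": a DO statement, so DO is a keyword
        return ("<id>", len(head))         # "1.15": an assignment to the variable DO10I

    print(first_token("DO 10 I = 1,15"))   # ('DO', 2)
    print(first_token("DO 10 I = 1.15"))   # ('<id>', 5)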
1.2.3. Bookkeeping
As tokens are uncovered in lexical analysis, information about certain tokens is collected and stored in one or more tables. What this information is depends on the language. In the F O R T R A N example we would want to know that COST, PRICE, and TAX were floating-point variables and 0.98 a floating-point constant. Assuming that COST, PRICE, and TAX have not been declared in a type statement, this information about these variables can be gleaned from the fact that COST, PRICE, and TAX begin with letters other than I, J, K, L, M, or N. As another example about collecting information about variables, consider a F O R T R A N dimension statement of the form D I M E N S I O N A(10,20) On encountering this statement, we would have to store information that A is an identifier which is the name of a two-dimensional array whose size is 10 by 20. In complex languages such as PL/I, the number of facts which might be stored about a given variable is quite large--on the order of a dozen or so. Let us consider a somewhat simplified example of a table in which infor-
mation about identifiers is stored. Such a table is often called a symbol table. The table will list all identifiers together with the relevant information concerning each identifier. Suppose that we encounter the statement

    COST = (PRICE + TAX) * 0.98

After this statement, the table might appear as follows:
    Entry    Identifier    Information
    1        COST          Floating-point variable
    2        PRICE         Floating-point variable
    3        TAX           Floating-point variable
    4        0.98          Floating-point constant
On encountering a future identifier in the input stream, this table would be consulted to see whether that identifier has already appeared. If it has, then the data portion of the token for that identifier is made equal to the entry number of the original occurrence of the variable with that name. For example, if a succeeding statement in the FORTRAN program contained the variable COST, then the token for the second occurrence of COST would be ⟨id⟩1, the same as the token for the first occurrence of COST. Thus such a table must simultaneously allow for
(1) The rapid addition of new identifiers and new items of information, and
(2) The rapid retrieval of information for a given identifier.
The method of data storage usually used is the hash (or scatter) table, which will be discussed in Chapter 10 (Volume II).
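(A sketch, not part of the original text.) A bookkeeping table with the two properties above can be based on a hash table, here a Python dictionary; the class name SymbolTable and its methods are ours.

    class SymbolTable:
        def __init__(self):
            self.entries = []    # entry number -> (identifier, information)
            self.index = {}      # identifier   -> entry number (hashed lookup)

        def enter(self, name, info):
            # return the existing entry number, or add a new entry
            if name not in self.index:
                self.entries.append((name, info))
                self.index[name] = len(self.entries)   # entries numbered from 1
            return self.index[name]

    table = SymbolTable()
    for name, info in [("COST", "floating-point variable"),
                       ("PRICE", "floating-point variable"),
                       ("TAX", "floating-point variable"),
                       ("0.98", "floating-point constant")]:
        table.enter(name, info)

    print(table.enter("COST", "floating-point variable"))   # 1: the same entry as before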
1.2.4. Parsing
As we mentioned earlier, the output of the lexical analyzer is a string of tokens. This string of tokens forms the input to the syntactic analyzer, which examines only the first components of the tokens (the token types). The information about each token (second component) is used later in the compiling process to generate the machine code.

Parsing, or syntax analysis, as it is sometimes known, is a process in which the string of tokens is examined to determine whether the string obeys certain structural conventions explicit in the syntactic definition of the language. It is also essential in the code generation process to know what the syntactic structure of a given string is. For example, the syntactic structure of the expression A + B * C must reflect the fact that B and C are first multi-
plied and that then the result is added to A. No other ordering of the operations will produce the desired calculation. Parsing is one of the best-understood phases of compilation. From a set of syntactic rules it is possible to automatically construct parsers which will make sure that a source program obeys the syntactic structure defined by these syntactic rules. In Chapters 4-7 we shall study several different parsing techniques and algorithms for generating a parser from a given grammar. The output from the parser is a tree which represents the syntactic structure inherent in the source program. In many ways this tree structure is closely related to the parsing diagrams we used to make for English sentences in elementary school. Example 1.3
Suppose that the output of the lexical analyzer is the string of tokens

(1.2.1)        ⟨id⟩1 = ( ⟨id⟩2 + ⟨id⟩3 ) * ⟨id⟩4
This string conveys the information that the following three operations are to be performed in exactly the following way:
(1) ⟨id⟩3 is to be added to ⟨id⟩2,
(2) The result of (1) is to be multiplied by ⟨id⟩4, and
(3) The result of (2) is to be stored in the location reserved for ⟨id⟩1.
This sequence of steps can be pictorially represented in terms of a labeled tree, as shown in Fig. 1.3. That is, the interior nodes of the tree represent
Fig. 1.3 Tree structure.
actions which must be taken. The direct descendants of each node either represent values to which the action is to be applied (if the node is labeled by an identifier or is an interior node) or help to determine what the action should be. (In particular, the =, +, and * signs do this.) Note that the parentheses in (1.2.1) do not explicitly appear in the tree, although we might want to show them as direct descendants of n_1. The role of the parentheses is only to influence the order of computation. If they did not appear in (1.2.1), then the usual convention that multiplication "takes precedence" over addition would apply, and the first step would be to multiply ⟨id⟩3 and ⟨id⟩4. □
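(A sketch, not part of the original text, and not one of the parsing algorithms studied in Chapters 4-7.) A small recursive-descent routine suffices to turn the token string (1.2.1) into a tree of the shape of Fig. 1.3; the grammar it follows, expression/term/factor with + and *, mirrors the footnote of Section 1.1.2, and all function names are ours.

    def parse_assignment(tokens):
        pos = [0]

        def peek():
            return tokens[pos[0]][0] if pos[0] < len(tokens) else None

        def take(expected=None):
            tok = tokens[pos[0]]
            assert expected is None or tok[0] == expected
            pos[0] += 1
            return tok

        def expression():                  # expression -> term { + term }
            t = term()
            while peek() == "+":
                take("+")
                t = ("+", t, term())
            return t

        def term():                        # term -> factor { * factor }
            f = factor()
            while peek() == "*":
                take("*")
                f = ("*", f, factor())
            return f

        def factor():                      # factor -> ( expression ) | <id>
            if peek() == "(":
                take("(")
                e = expression()
                take(")")
                return e
            return take("<id>")

        left = take("<id>")
        take("=")
        return ("=", left, expression())

    tokens = [("<id>", 1), ("=", None), ("(", None), ("<id>", 2), ("+", None),
              ("<id>", 3), (")", None), ("*", None), ("<id>", 4)]
    print(parse_assignment(tokens))
    # ('=', ('<id>', 1), ('*', ('+', ('<id>', 2), ('<id>', 3)), ('<id>', 4)))

Note that the parentheses of (1.2.1) influence the shape of the result but do not themselves appear in it, exactly as described above.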
1.2.5. Code Generation
The tree built by the parser is used to generate a translation of the input program. This translation could be a program in machine language, but more often it is in an intermediate language such as assembly language or "three address code." (The latter is a sequence of simple statements; each involves no more than three identifiers, e.g., A = B, A = B + C, or GOTO A.) If the compiler is to do extensive code optimization, then code of the three address type is preferred. Since three address code does not pin computations to specific computer registers, it is easier to use registers to advantage when optimizing. If little or no optimization is to be done, then assembly or even machine code is preferred as an intermediate language. We shall give a running example of a translation into an assembly type language to illustrate the salient points of the translation process.

For this discussion let us assume that we have a computer with one working register (the accumulator) and assembly language instructions of the form

    Instruction        Effect
    LOAD m             c(m) → accumulator
    ADD† m             c(accumulator) + c(m) → accumulator
    MPY m              c(accumulator) * c(m) → accumulator
    STORE m            c(accumulator) → m
    LOAD =m            m → accumulator
    ADD =m             c(accumulator) + m → accumulator
    MPY =m             c(accumulator) * m → accumulator

†Let us assume that ADD and MPY refer to floating-point operations.

Here the notation c(m) → accumulator, for example, means the contents of memory location m are to be placed in the accumulator. The expression =m denotes the numerical value m. With these comments, the effects of the seven instructions should be obvious.

The output of the parser is a tree (or some representation of one) which represents the syntactic structure inherent in the string of tokens coming out
of the lexical analyzer. From this tree, and the information stored in the symbol table, it is possible to construct the object code. In practice, tree construction and code generation are often carried out simultaneously, but conceptually it is easier to think of these two processes as occurring serially. There are several methods for specifying how the intermediate code is to be constructed from the syntax tree. One method which is particularly elegant and effective is the syntax-directed translation. Here we associate with each node n a string C(n) of intermediate code. The code for node n is constructed by concatenating the code strings associated with the descendants of n and other fixed strings in a fixed order. Thus translation proceeds from the bottom up (i.e., from the leaves toward the root). The fixed strings and fixed order are determined by the algorithm used. More will be said about this in Chapters 3 and 9.

An important problem which arises is how to select the code C(n) for each node n such that C(n) at the root is the desired code for the entire statement. In general, some interpretation must be placed on C(n) such that the interpretation can be uniformly applied to all situations in which node n can appear. For arithmetic assignment statements, the desired interpretation is fairly natural and will be explained in the following paragraphs. In general, the interpretation must be specified by the compiler designer if the method of syntax-directed translation is to be used. This task may be easy or hard, and in difficult cases, the detailed structure of the tree may have to be adjusted to aid in the translation process.

For a specific example, we shall describe a syntax-directed translation of simple arithmetic expressions. We notice that in Fig. 1.3, there are three types of interior nodes, depending on whether their middle descendant is labeled =, +, or *. These three types of nodes are shown in Fig. 1.4, where
Fig. 1.4 Types of interior nodes.
represents an arbitrary subtree (possibly a single node). We observe that for any arithmetic assignment statement involving only arithmetic operators + and *, we can construct a tree with one node of type (a) (the root) and other interior nodes of types (b) and (c) only. The code associated with a node n will be subject to the following interpretation:
(1) If n is a node of type (a), then C(n) will be code which computes the value of the expression on the right and stores it in the location reserved for the identifier labeling the left descendant.
(2) If n is a node of type (b) or (c), then C(n) is code which, when preceded by the operation code LOAD, brings to the accumulator the value of the subtree dominated by n.
Thus, in Fig. 1.3, when preceded by LOAD, C(n_1) brings to the accumulator the value of ⟨id⟩2 + ⟨id⟩3, and C(n_2) brings to the accumulator the value of (⟨id⟩2 + ⟨id⟩3) * ⟨id⟩4. C(n_3) is code which brings the latter value to the accumulator and stores it in the location of ⟨id⟩1.

We must consider how to build C(n) from the code for n's descendants. In what follows, we assume that assembly language statements are to be generated in one string, with a semicolon or a new line separating the statements. Also, we assume that assigned to each node n of the tree is a level number l(n), which denotes the maximum length of a path from that node to a leaf. Thus, l(n) = 0 if n is a leaf, and if n has descendants n_1, ..., n_k, then l(n) = max(l(n_1), ..., l(n_k)) + 1.
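(A sketch, not part of the original text; the book's own construction of C(n) continues beyond this point.) One simple way to realize the interpretation above is a bottom-up walk over the tree of Fig. 1.3 that emits the seven instructions listed earlier, parking intermediate results in temporary locations $1, $2, ...; no optimization is attempted, so some LOADs are redundant. The function names, the use of explicit temporaries, and the =0.98 literal form for the constant are our assumptions.

    import itertools

    def generate(tree, names):
        code = []
        temp = itertools.count(1)

        def gen(node):
            # returns an operand m such that "LOAD m" brings the value of the
            # subtree rooted at this node to the accumulator
            if node[0] == "<id>":
                return names[node[1]]
            op, left, right = node
            if op == "=":
                code.append("LOAD " + gen(right))
                code.append("STORE " + names[left[1]])
                return names[left[1]]
            l, r = gen(left), gen(right)
            t = "$%d" % next(temp)                 # a fresh temporary location
            code.extend(["LOAD " + l,
                         ("ADD " if op == "+" else "MPY ") + r,
                         "STORE " + t])
            return t

        gen(tree)
        return code

    names = {1: "COST", 2: "PRICE", 3: "TAX", 4: "=0.98"}
    tree = ("=", ("<id>", 1),
            ("*", ("+", ("<id>", 2), ("<id>", 3)), ("<id>", 4)))
    print("\n".join(generate(tree, names)))
    # LOAD PRICE; ADD TAX; STORE $1; LOAD $1; MPY =0.98; STORE $2;
    # LOAD $2; STORE COST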
We shall also use the following conventions to represent various symbols and strings concerned with a grammar:
(1) a, b, c, and d represent terminals, as do the digits 0, 1, ..., 9.
(2) A, B, C, D, and S represent nonterminals; S represents the start symbol.
(3) U, V, ..., Z represent either nonterminals or terminals.
(4) α, β, ... represent strings of nonterminals and terminals.
(5) u, v, ..., z represent strings of terminals only.
Subscripts and superscripts do not change these conventions. When a symbol obeys these conventions, we shall often omit mention of the convention.

We can thus specify a grammar by merely listing its productions if all terminals and nonterminals obey conventions (1) and (2). Thus grammar G1 can be specified simply as

    S  → 0A1
    0A → 00A1
    A  → e
No mention of the nonterminal or terminal sets or the start symbol is necessary. We now give further examples of grammars.
Example 2.3
Let G = ({⟨digit⟩}, {0, 1, ..., 9}, {⟨digit⟩ → 0 | 1 | ⋯ | 9}, ⟨digit⟩). Here ⟨digit⟩ is treated as a single nonterminal symbol. L(G) is clearly the set of the ten decimal digits. Notice that L(G) is a finite set.
Example 2.4

Let G0 = ({E, T, F}, {a, +, *, (, )}, P, E), where P consists of the productions

    E → E + T | T
    T → T * F | F
    F → (E) | a

An example of a derivation in this grammar would be

    E ⇒ E + T
      ⇒ T + T
      ⇒ F + T
      ⇒ a + T
      ⇒ a + T * F
      ⇒ a + F * F
      ⇒ a + a * F
      ⇒ a + a * a

L(G0) is the set of arithmetic expressions that can be built up using the symbols a, +, *, (, and ). □
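(A sketch, not part of the original text.) The derivation just displayed can be reproduced mechanically by replacing, at each step, the leftmost occurrence of a nonterminal by one of its right sides; the representation of the productions of G0 and the helper name apply are ours.

    productions = {"E": ["E+T", "T"], "T": ["T*F", "F"], "F": ["(E)", "a"]}

    def apply(form, nonterminal, rhs):
        # replace the leftmost occurrence of the nonterminal by the right side
        assert rhs in productions[nonterminal]
        position = form.index(nonterminal)
        return form[:position] + rhs + form[position + 1:]

    form = "E"
    for nonterminal, rhs in [("E", "E+T"), ("E", "T"), ("T", "F"), ("F", "a"),
                             ("T", "T*F"), ("T", "F"), ("F", "a"), ("F", "a")]:
        form = apply(form, nonterminal, rhs)
        print(form)
    # E+T, T+T, F+T, a+T, a+T*F, a+F*F, a+a*F, a+a*a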
The grammar in Example 2.4 will be used repeatedly in the book and is always referred to as G0.

Example 2.5
Let G be defined by

    S  → aSBC | abC
    CB → BC
    bB → bb
    bC → bc
    cC → cc
We have the following derivation in G:

    S ⇒ aSBC
      ⇒ aabCBC
      ⇒ aabBCC
      ⇒ aabbCC
      ⇒ aabbcC
      ⇒ aabbcc

The language generated by G is {aⁿbⁿcⁿ | n ≥ 1}. □
Example 2.6
Let G be the grammar with productions

    S  → CD
    C  → aCA
    C  → bCB
    C  → e
    AD → aD
    BD → bD
    D  → e
    Aa → aA
    Ab → bA
    Ba → aB
    Bb → bB
An example of a derivation in G is

    S ⇒ CD
      ⇒ aCAD
      ⇒ abCBAD
      ⇒ abBAD
      ⇒ abBaD
      ⇒ abaBD
      ⇒ ababD
      ⇒ abab
We shall show that L(G) = {ww | w ∈ {a, b}*}. That is, L(G) consists of strings of a's and b's of even length such that the first half of each string is the same as the second half. Since L(G) is a set, the easiest way to show that L(G) = {ww | w ∈ {a, b}*} is to show that {ww | w ∈ {a, b}*} ⊆ L(G) and that L(G) ⊆ {ww | w ∈ {a, b}*}.
To show that {ww | w ∈ {a, b}*} ⊆ L(G) we must show that every string of the form ww can be derived from S. By a simple inductive proof we can show that the following derivations are possible in G:

(1) S ⇒ CD.

(2) For n ≥ 0,

    C ⇒ⁿ c1c2 ⋯ cn C XnXn-1 ⋯ X1 ⇒ c1c2 ⋯ cn XnXn-1 ⋯ X1

where, for 1 ≤ i ≤ n, ci = a if and only if Xi = A, and ci = b if and only if Xi = B.

(3)

    XnXn-1 ⋯ X2X1D ⇒ Xn ⋯ X2c1D
                   ⇒ⁿ⁻¹ c1Xn ⋯ X2D ⇒ c1Xn ⋯ X3c2D
                   ⇒ⁿ⁻² c1c2Xn ⋯ X3D
                   ⋯
                   ⇒ c1c2 ⋯ cn-1XnD ⇒ c1c2 ⋯ cn-1cnD
                   ⇒ c1c2 ⋯ cn-1cn
S~¢ic2
..•
¢n¢iC2 • • •
Cn
where c, E {a, b} for 1 < i < n. Thus [ w w l w ~ {a, b}*} ~ L(G). We would now like to show that L(G) ~ [wwl w ~ [a, b}*}. To do this we must show that S derives terminal strings only of the form ww. In general, to show that a grammar generates only strings of a certain form is a much more difficult matter than showing that it generates certain strings. At this point it is convenient to define two homomorphisms g and h such that
SEC. 2.1
REPRESENTATIONS FOR LANGUAGES
g(a) = a,
g(b) = b,
91
g(A) = g ( B ) = e
and h(a) = h(b) = e,
h(A) = A,
h(B) = B m
For this grammar G we can prove by induction on m ~ 1 that if S ~ then a¢ can be written as a string clcz . . . c, U f l V such that
~,
(1) Each ct is either a or b; (2) U is either C or e; (3) fl is a string in [a, b, A, B}" such that g ( f l ) = c ~ c 2 . . . c ~ , h(,O
=
X.X._,
. . . X,+,,
and Xj is A or B as cj is a or b, i < j < n; and (4) V is either D or e. The details of this induction will be omitted. We now observe that the sentential forms of G which consist entirely of terminal symbols are all of the form e l e z . . , e , e ~ c z . . . e , , where each c; ~ [a, b]. Thus, L(G) ~ {ww]w ~ [a, b}*}. We can now conclude that L(G) = ( w w l w ~ [a, b}*}. 2.1.3.
D
Restricted Grammars
Grammars can be classified according to the format of their productions. Let G = (N, 2:, P, S) be a grammar. DEFINITION
G is said to be (1) Right-linear if each production in P is of the form A ~ x B or A ~ x, where A and B are in N and x is in Z*. (2) Context-free if each production in P is of the form A ~ ~, where A is in N and tx is in (N u Z)*. (3) Context-sensitive if each production in P is of the form ~ ---~ fl, where
I~I_. OS[1St e
This grammar generates the language [0, 1}*.
92
ELEMENTSOF LANGUAGE THEORY
CHAP. 2
The grammar of Example 2.4 is an important example of a context-flee grammar. Notice that according to our definition, every right-linear grammar is also a context-free grammar. The grammar of Example 2.5 is clearly a context-sensitive grammar. We should emphasize that the definition of context-sensitive grammar does not permit a production of the form A ~ e, commonly known as an e-production. Thus a context-free grammar having e-productions would not be a context-sensitive grammar. The reason for not permitting e-productions in context-sensitive grammars is to ensure that the language generated by a context-sensitive grammar i~ recursive. That is to say, we want to be able to give an algorithm which, presented with an arbitrary context-sensitive grammar G and input string w, will determine whether or not w is in L(G). (See Exercise 2.1.18.) Even if we permitted just one e-production in a context-sensitive grammar (without imposing additional conditions on the grammar), then the expanded class of grammars would be capable of defining any recursively enumerable set (see Exercise 2.1.20). The grammar in Example 2.6 is unrestricted. Note that it is not right-linear, context-free, or context-sensitive. CONVENTION
If a language L can be generated by a type x grammar, then L is said to be a type x language, for all the "type x" 's that we have defined or shall define. Thus L(G) of Example 2.3 is a right-linear language, L(Go) in Example 2.4 is a context-free language, and L(G) of Example 2.5 is a paradigm context-sensitive language. The language generated by the grammar in Example 2.6 is an unrestricted language, although {wwlw ~ (a, b}* and w ~ e] also happens to be a context-sensitive language. The four types of grammars and languages we have defined are often referred to as the Chomsky hierarchy. CONVENTION
We shall hereafter abbreviate context-free grammar and language by C F G and CFL, respectively, Likewise, CSG and CSL stand for contextsensitive grammar and context-sensitive language. Every right-linear language is a CFL, and there are CFL's, such as [0"l"]n ~> 1}, that are not right-linear. The CFL's which do not contain the empty string likewise form a proper subset of the context-sensitive languages. These in turn are a proper subset of the recursive sets, which are in turn a proper subset of the recursively enumerable sets. The (unrestricted) grammars define exactly the recursively enumerable sets. These matters are left for the Exercises.
REPRESENTATIONSFOR LANGUAGES
SEC. 2.1
93
Often, the context-sensitive languages are defined to be the languages that we have defined plus all those languages L U [e}, where L is a contextsensitive language as defined here. In that case, we may call the CFL's a proper subset of the CSL's. We should emphasize the fact that although we may be given a certain type of grammar, the language generated by that grammar might be generated by a less powerful grammar. As a simple example, the context-free grammar
ASIe
S
>
A
>011
generates the language {0, 1}*, which, as we have seen, can also be generated by a right-linear grammar. We should also mention that there are a number of grammatical models that have been recently introduced outside the Chomsky hierarchy. Some of the motivation in introducing new grammatical models is to find a generative device that can better represent all the syntax and/or semantics of programming languages. Some of these models are introduced in the Exercises.
Recognizers
2.1.4.
A second common method of providing a finite specification for a language is to define a recognizer for the language. In essence a recognizer is merely a highly stylized procedure for defining a set. A recognizer can be pictured as shown in Fig. 2.1. There are three parts to a recognizer an input tape, a finite state control, and an auxiliary memory.
!
ao
at [ a2 t
~
Input head
an
Input tape
Finite state control
Fig. 2.1 A recognizer.
94
ELEMENTS OF LANGUAGE THEORY
CHAP. 2
The input tape can be considered to be divided into a linear sequence of tape squares, each tape square containing exactly one input symbol from a finite input alphabet. Both the leftmost and rightmost tape squares may be occupied by unique endmarkers, or there may be a right endmarker and no left endmarker, or there may be no endmarkers on either end of the input tape. There is an input head, which can read one input square at a given instant of time. In a move by a recognizer, the input head can move one square to the left, remain stationary, or move one square to the right. A recognizer which can never move its input head left is called a one-way recognizer. Normally, the input tape is assumed to be a read-only tape, meaning that once the input tape is set no symbols can be changed. However, it is possible to define recognizers which utilize a read-write input tape. The memory of a recognizer can be any type of data store. We assume that there is a finite memory alphabet and that the memory contains only symbols from this finite memory alphabet in some data organization. We also assume that at any instant of time we can finitely describe the contents and structure of the memory, although as time goes on, the memory may become arbitrarily large. An important example of an auxiliary memory is the pushdown list, which can be abstractly represented as a string of memory symbols, e.g., Z1 Z2 . . . Zn, where each Zt is assumed to be from some finite memory alphabet IF', and Z 1 is assumed to be on top. The behavior of the auxiliary memory for a class of recognizers is characterized by two functions--a store function and a fetch function. It is assumed that the fetch function is a mapping from the set of possible memory configurations to a finite set of information symbols, which could be the same as the memory alphabet. For example, the only information that can be accessed from a pushdown list is the topmost symbol. Thus a fetch function f for a pushdown list would be a mapping from r + to IF' such that f(Za . . . Z , ) = Za. The store function is a mapping which describes how memory may be altered. It maps memory and a control string to memory. If we assume that a store operation for a pushdown list replaces the topmost symbol on the pushdown list by a finite length string of memory symbols, then the store function g could be represented as g: F + × IF'* ~ F*, such that
g ( Z , Z ~ . . . Zn, r, . . . r , ) =
Y, . . . Y k Z ~ . . . Zn.
If we replace the topmost symbol Z 1 on a pushdown list by the empty string, then the s y m b o l Z 2 becomes the topmost symbol and can then be accessed by a fetch operation. Generally speaking, it is the type of memory which determines the name of a recognizer. For example a recognizer having a pushdown list for a mem-
src. 2.1
REPRESENTATIONS FOR LANGUAGES
95
ory would be called a pushdown recognizer (or more usually, pushdown automaton). The heart of a recognizer is the finite state control, which can be thought of as a program which dictates the behavior of the recognizer. The control can be represented as a finite set of states together with a mapping which describes how the states change in accordance with the current input symbol (i.e., the one under the input head) and the current information fetched from the memory. The control also determines in which direction the input head is to be shifted and what information is to be stored in the memory. A recognizer operates by making a sequence of moves. At the start of a move, the current input symbol is read, and the memory is probed by means of the fetch function, The current input symbol and the information fetched from the memory, together with the current state of the control, determine what the move is to be. The move itself consists of (1) Shifting the input head one square left, one square right, or keeping the input head stationary; (2) Storing information into the memory; and (3) Changing the state of the control. The behavior of a recognizer can be conveniently described in terms of configurations of the recognizer. A configuration is a picture of the recognizer describing (1) Thestate of the finite control; (2) The contents of the input tape, together with the location of the input head; and (3) The contents of the memory. We should mention here that the finite control of a recognizer can be deterministic or nondeterministic. If the control is nondeterministic, then in each configuration there is a finite set of possible moves that the recognizer can make. The control is said to be deterministic if in each configuration there is at most one possible move. Nondeterministic recognizers are a convenient mathematical abstraction, but, unfortunately, they are often difficult to simulate in practice. We shall give several examples and applications of nondeterministic recognizers in the sections that follow. The initial configuration of a recognizer is one in which the finite controI is in a specified initial state, the input head is scanning the leftmost symbol on the input tape, and the memory has a specified initial content. A final configuration is one in which the finite control is in one of a specified set of final states and the input head is scanning the right endmarker or, if there is no right endmarker, has moved off the right end of the input tape. Often, the memory must also satisfy certain conditions if the configuration is to be considered final.
We say that a recognizer accepts an input string w if, starting from the initial configuration with w on the input tape, the recognizer can make a sequence of moves and end in a final configuration. We should point out that a nondeterministic recognizer may be able to make many different sequences of moves from an initial configuration. However, if at least one of these sequences ends in a final configuration, then the initial input string will be accepted. The language defined by a recognizer is the set of input strings it accepts. For each class of grammars in the Chomsky hierarchy there is a natural class of recognizers that defines the same class of languages. These recognizers are finite automata, pushdown automata, linear bounded automata, and Turing machines. Specifically, the following characterizations of the Chomsky languages exist:
(1) A language L is right-linear if and only if L is defined by a (one-way deterministic) finite automaton.
(2) A language L is context-free if and only if L is defined by a (one-way nondeterministic) pushdown automaton.
(3) A language L is context-sensitive if and only if L is defined by a (two-way nondeterministic) linear bounded automaton.
(4) A language L is recursively enumerable if and only if L is defined by a Turing machine.
The precise definition of these recognizers will be found in the Exercises and later sections. Finite automata and pushdown automata are important in the theory of compiling and will be studied in some detail in this chapter.
EXERCISES
2.1.1. Construct right-linear grammars for
(a) Identifiers which can be of arbitrary length but must start with a letter (as in ALGOL).
(b) Identifiers which can be one to six symbols in length and must start with I, J, K, L, M, or N (as for FORTRAN integer variables).
(c) Real constants as in PL/I or FORTRAN, e.g., -10.8, 3.14159, 2., 6.625E-27.
(d) All strings of 0's and 1's having both an odd number of 0's and an odd number of 1's.
2.1.2. Construct context-free grammars that generate
(a) All strings of 0's and 1's having equal numbers of 0's and 1's.
(b) {a1 a2 ... an an ... a2 a1 | ai in {0, 1}, 1 ≤ i ≤ n}.
(c) Well-formed statements in propositional calculus.
(d) {0^i 1^j | i ≠ j and i, j ≥ 0}.
(e) All possible sequences of balanced parentheses.
"2.1.3.
Describe the language generated by the productions S----~ b S S l a . Observe that it is not always easy to describe what language a grammar generates.
"2.1.4.
Construct context-sensitive grammars that generate (a) [an'In > 1). (b) {ww I w ~ [a, b}+}. (c) [ w l w ~ [a, b, c}+ and the number of a's in w equals the number of b's which equals the number of c's]. (d) [ambnambnl m, n > 1}.
Hint: Think of the set of productions in a context-sensitive grammar as a program. You can use special nonterminal symbols as a combination "input head" and terminal symbol. "2.1.5.
A "true" context-sensitive grammar G is a grammar (N, ~, P, S) in which each production in P is of the form
where • and fl are a production can be only in the context can be generated by
in (N U E)*, ~' ~ (N u ZZ)+, and A ~ N. Such interpreted to mean that A can be replaced by 0~_fl. Show that every context-sensitive language a "true" context-sensitive grammar.
• "2.1.6.
What class of languages can be generated by grammars with only left context, that is, grammars in which each production is of the form ~zA ~ ~zfl, ~z in (N U X~)*, fl in (N U E)+ ?
2.1.7.
Show that every context-free language can be generated by a grammar G = (N, Σ, P, S) in which each production is either of the form A → α, with α in N*, or A → w, with w in Σ*.
2.1.8.
Show that every context-sensitive language can be generated by a grammar G = (N, Σ, P, S) in which each production is either of the form α → β, where α and β are in N+, or A → w, where A ∈ N and w ∈ Σ+.
2.1.9.
Prove that L(G) = {a^n b^n c^n | n ≥ 1}, where G is the grammar in Example 2.5.
"2.1.10.
Can you describe the set of context-free grammars by means of a context-free grammar ?
*2.1.11.
Show that every recursively enumerable set can be generated by a grammar with at most two nonterminal symbols. Can you generate every recursively enumerable set with a grammar having only one nonterminal symbol?
2.1.12.
Show that if G = (N, Σ, P, S) is a grammar such that #N = n, and Σ does not contain any of the symbols A1, A2, ..., then there is an equivalent grammar G' = (N', Σ, P', A1) such that N' = {A1, A2, ..., An}.
2.1.13.
Prove that the grammar G1 of Example 2.1 generates {0^n 1^n | n ≥ 1}. Hint: Observe that each sentential form has at most one nonterminal. Thus productions can be applied in only one place in a string.
DEFINITION
In an unrestricted grammar G there are many ways of deriving a given sentence that are essentially the same, differing only in the order in which productions are applied. If G is context-free, then we can represent these essentially similar derivations by means of a derivation tree. However, if G is context-sensitive or unrestricted, we can define equivalence classes of derivations in the following manner. Let G = (N, Σ, P, S) be an unrestricted grammar. Let D be the set of all derivations of the form S ⇒* w. That is, elements of D are sequences of the form (α0, α1, ..., αn) such that α0 = S, αn ∈ Σ*, and αi-1 ⇒ αi, 1 ≤ i ≤ n. Define a relation R0 on D by (α0, α1, ..., αn) R0 (β0, β1, ..., βn) if and only if there is some i between 1 and n - 1 such that
(1) αj = βj for all 1 ≤ j ≤ n such that j ≠ i.
(2) We can write αi-1 = γ1γ2γ3γ4γ5 and αi+1 = γ1δγ3εγ5 such that γ2 → δ and γ4 → ε are in P, and either αi = γ1δγ3γ4γ5 and βi = γ1γ2γ3εγ5, or conversely.
Let R be the least equivalence relation containing R0. Each equivalence class of R represents the essentially similar derivations of a given sentence.
**2.1.14.
What is the maximum size of an equivalence class of R (as a function of n and |αn|) if G is
(a) Right-linear.
(b) Context-free.
(c) Such that every production is of the form α → β and |α| ≤ |β|.
*2.1.15.
Let G be defined by
S → A0B | B1A
A → BB | 0
B → AA | 1
What is the size of the equivalence class under R which contains the derivation
S ⇒ A0B ⇒ BB0B ⇒ 1B0B ⇒ 1AA0B ⇒ 10A0B ⇒ 1000B ⇒ 10001
DEFINITION
A grammar G is said to be unambiguous if each w in L(G) appears as the last component of a derivation in one and only one equivalence class under R, as defined above. For example,
S → abC | aB
B → bc
bC → bc
is an ambiguous grammar since the sequences (S, abC, abc) and (S, aB, abc) are in two distinct equivalence classes.
*2.1.16.
Show that every right-linear language has an unambiguous right-linear grammar.
"2.1.17.
Let G = (N, ~, P, S) be a context-sensitive grammar, and let N u X~ n
have m members. Let w be a word in L(G). Show that S ==~ w, where n < (m + 1)IwI.
G
2.1.18.
Show that every context-sensitive language is recursive. Hint: Use the result of Exercise 2.1.17 to construct an algorithm to determine if w is in L(G) for arbitrary word w and context-sensitive grammar G.
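One way to carry out the algorithm suggested by the hint is sketched below in Python. It assumes single-character vocabulary symbols, nonempty left-hand sides, a grammar given simply as a list of string pairs, and a nonempty target word; all of these are illustrative conventions rather than requirements of the exercise.

from collections import deque

def csg_member(productions, start, w):
    # Decide whether w is generated by a context-sensitive grammar.
    # productions: list of (alpha, beta) string pairs with len(alpha) <= len(beta).
    # Because sentential forms never shrink, only forms of length <= len(w)
    # need to be explored (cf. Exercise 2.1.17).
    target_len = len(w)
    seen = {start}
    queue = deque([start])
    while queue:
        form = queue.popleft()
        if form == w:
            return True
        for alpha, beta in productions:
            i = form.find(alpha)
            while i != -1:
                new = form[:i] + beta + form[i + len(alpha):]
                if len(new) <= target_len and new not in seen:
                    seen.add(new)
                    queue.append(new)
                i = form.find(alpha, i + 1)
    return False

# Hypothetical usage: with prods holding the productions of a grammar for
# {a^n b^n c^n | n >= 1}, csg_member(prods, "S", "aabbcc") would return True.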
2.1.19.
Show that every CFL is recursive. Hint: Use Exercise 2.1.18, but be careful about the empty word.
"2.1.20.
Show that if G = (N, ~, P, S) is an unrestricted grammar, then there is a context-sensitive grammar G ' = (N', X~ U {c}, P', S') such that w is in L(G) if and only if wc~is in L(G') for some i > 0
Hint: Fill out every noncontext-sensitive production of G with c's. Then add productions to allow the c's to be shifted to the right end of any sentential form. 2.1.21.
Show that if L = L(G) for any arbitrary grammar G, then there is a context-sensitive language Li and a homomorphism h such that L = h(L1).
2.1.22.
Let {A1, A2, ...} be a countable set of nonterminal symbols, not including the symbols 0 and 1. Show that every context-sensitive language L ⊆ {0, 1}* has a CSG G = (N, {0, 1}, P, A1), where N = {A1, A2, ..., Ai} for some i. We call such a context-sensitive grammar normalized.
"2.1.23.
Show that the set of normalized context-sensitive grammars as defined above is countable.
"2.1.24.
Show that there is a recursive set contained in {0, 1}* which is not a context-sensitive language. Hint: Order the normalized context-sensitive grammars so that one may talk about the ith grammar. Likewise, lexicographically order {0, 1}* so that we may talk about the ith string in {0, 1}*. Then define L = {wi | wi is not in L(Gi)} and show that L is recursive but not context-sensitive.
*'2.1.25.
Show that a language is defined by a grammar if and only if it is recognized by a Turing machine. (A Turing machine is defined in the Exercises of Section 0.4 on page 34.)
2.1.26.
Define a nondeterministic recognizer whose memory is an initially blank Turing machine tape, which is not permitted to grow longer than the input. Show that a language is defined by such a recognizer if and only if it is a CSL. This recognizer is called a linear bounded automaton (LBA, for short). DEFINITION
An indexed grammar is a 5-tuple G = (N, Σ, Δ, P, S), where N, Σ, and Δ are finite sets of nonterminals, terminals, and intermediates, respectively. S in N is the start symbol and P is a finite set of productions of the forms
A → X1ψ1 X2ψ2 ... Xnψn,   n ≥ 0
and
Af → X1ψ1 X2ψ2 ... Xnψn,   n ≥ 0
where A is in N, the X's are in N ∪ Σ, f is in Δ, and the ψ's are in Δ*, such that if Xi is in Σ, then ψi = e. Let α and β be strings in (NΔ* ∪ Σ)*, A ∈ N, θ ∈ Δ*, and let A → X1ψ1 ... Xnψn be in P. Then we write
αAθβ ⇒ αX1ψ1θ1 X2ψ2θ2 ... Xnψnθn β
where θi = θ if Xi ∈ N and θi = e if Xi ∈ Σ. That is, the string of intermediates following a nonterminal distributes over the nonterminals, but not over the terminals, which can never be followed by an intermediate. If Af → X1ψ1 ... Xnψn is in P, then
αAfθβ ⇒ αX1ψ1θ1 ... Xnψnθn β,
as above. Such a step "consumes" the intermediate following A, but otherwise is the same as the first type of step. Let ⇒* be the reflexive, transitive closure of ⇒, and define
L(G) = {w | w in Σ* and S ⇒* w}.
E x a m p l e 2.7
Let G = ({S, T, A, B, C}, {a, b, c}, {f, g}, P, S), where P consists of
S → Tg
T → Tf
T → ABC
Af → aA
Bf → bB
Cf → cC
Ag → a
Bg → b
Cg → c
Then L(G) = {a^n b^n c^n | n ≥ 1}. For example, aabbcc has the derivation
S ⇒ Tg ⇒ Tfg ⇒ AfgBfgCfg ⇒ aAgBfgCfg ⇒ aaBfgCfg ⇒ aabBgCfg ⇒ aabbCfg ⇒ aabbcCg ⇒ aabbcc
"2.1.27.
[-]
Give indexed grammars for the following languages: (a) {wwl w ~ {a, b]*]. (b) {a"b"21n ~ 1).
*'2.1.28.
Show that every indexed language is context-sensitive.
2.1.29.
Show that every CFL is an indexed language.
2.1.30.
Let us postulate a recognizer whose memory is a single integer (written in binary if you will). Suppose that the memory control strings as described in Section 2.1.4 are only X and Y. Which of the following could be memory fetch functions for the above recognizer?
(a) f(i) = 0 if i is even; 1 if i is odd.
(b) f(i) = a if i is even; b if i is odd.
(c) f(i) = 0 if i is even and the input symbol under the input head is a; 1 otherwise.
2.1.31.
Which of the following could be memory store functions for the recognizer in Exercise 2.1.30?
(a) g(i, X) = 0; g(i, Y) = i + 1.
(b) g(i, X) = 0; g(i, Y) = i + 1 if the previous store instruction was X, and i + 2 if the previous store instruction was Y.
102
CHAP. 2
ELEMENTSOF LANGUAGE THEORY
DEFINITION
A tag system consists of two finite alphabets N and Σ and a finite set of rules of the form (α, β), where α and β are in (N ∪ Σ)*. If γ is an arbitrary string in (N ∪ Σ)* and (α, β) is a rule, then we write αγ ⊢ γβ. That is, the prefix α may be removed from the front of any string provided β is then placed at the end of the string. Let ⊢* be the reflexive, transitive closure of ⊢. For any string γ in (N ∪ Σ)*, Lγ is {w | w is in Σ* and γ ⊢* w}.
**2.1.32.
Show that L r is always defined by some grammar. Hint: Use Exercise 2.1.25 or see Minsky [1967].
• "2.1.33.
Show that for any grammar G, L(G) is defined by a tag system in the manner described above. The hint of Exercise 2.1.32 again applies.
Open Problems
2.1.34.
Is the complement of a context-sensitive language always context-sensitive? The recognizer of Exercise 2.1.26 is called a linear bounded automaton (LBA). If we make it deterministic, we have a deterministic LBA (DLBA).
2.1.35.
Is every context-sensitive language recognized by a DLBA ?
2.1.36.
Is every indexed language recognized by a DLBA ? By Exercise 2.1.28, a positive answer to Exercise 2.1.35 implies a positive answer to Exercise 2.1.36.
BIBLIOGRAPHIC NOTES
Formal language theory was greatly stimulated by the work of Chomsky in the late 1950's [Chomsky, 1956, 1957, 1959a, 1959b]. Good references to early work on generative systems are Chomsky [1963] and Bar-Hillel [1964]. The Chomsky hierarchy of grammars and languages has been extensively studied. Many of the major results concerning the Chomsky hierarchy are given in the Exercises. Most of these results are proved in detail in Hopcroft and Ullman [1969] or Ginsburg [1966]. Since Chomsky introduced phrase structure grammars, many other models of grammars have also appeared in the literature. Some of these models use specialized forms of productions. Indexed grammars [Aho, 1968], macro grammars [Fischer, 1968], and scattered context grammars [Greibach and Hopcroft, 1969] are examples of such grammars. Other grammatical models impose restrictions on the order in which productions can be applied. Programmed grammars [Rosenkrantz, 1968] are a prime example. Recognizers for languages have also been extensively studied. Turing machines were defined by A. Turing in 1936. Somewhat later, the concept of a finite state
machine appeared in McCulloch and Pitts [1943]. The study of recognizers was stimulated by the work of Moore [1956] and Rabin and Scott [1959]. A significant amount of effort in language theory has been expended in determining the algebraic properties of classes of languages and in determining decidability results for classes of grammars and recognizers. For each of the four classes of grammars in the Chomsky hierarchy there is a class of recognizers which defines precisely those languages generated by that class of grammars. These observations have led to a study of abstract families of languages and recognizers in which classes of languages are defined in terms of algebraic properties. Certain algebraic properties in a class of languages are necessary and sufficient to guarantee the existence of a class of recognizers for those languages. Work in this area was pioneered by Ginsburg and Greibach [1969] and Hopcroft and Ullman [1967]. Book [1970] gives a good survey of language theory circa 1970. Haines [1970] claims that the left context grammars in Exercise 2.1.6 generate exactly the context-sensitive languages. Exercise 2.1.28 is from Aho [1968].
2.2.
REGULAR SETS, THEIR GENERATORS, AND THEIR RECOGNIZERS
The regular sets are a class of languages central to much of language theory. In this section we shall study several methods of specifying languages, all of which define exactly the regular sets. These methods include regular expressions, right-linear grammars, deterministic finite automata, and nondeterministic finite automata. 2.2.1.
Regular Sets and Regular Expressions
DEFINITION
Let Σ be a finite alphabet. We define a regular set over Σ recursively in the following manner:
(1) ∅ (the empty set) is a regular set over Σ.
(2) {e} is a regular set over Σ.
(3) {a} is a regular set over Σ for all a in Σ.
(4) If P and Q are regular sets over Σ, then so are
(a) P ∪ Q.
(b) PQ.
(c) P*.
(5) Nothing else is a regular set.
Thus a subset of Σ* is regular if and only if it is ∅, {e}, or {a} for some a in Σ, or can be obtained from these by a finite number of applications of the operations union, concatenation, and closure. We shall define a convenient method for denoting regular sets over a finite alphabet Σ.
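The closure operations in this definition can be made concrete by a small Python sketch that builds a length-bounded approximation of a regular set from the basis sets; the bound k and the helper names are purely illustrative devices, since the sets themselves are in general infinite.

def union(p, q):
    # P ∪ Q
    return p | q

def concat(p, q, k):
    # PQ, truncated to strings of length <= k
    return {x + y for x in p for y in q if len(x + y) <= k}

def closure(p, k):
    # P*, truncated to strings of length <= k
    result = {""}
    frontier = {""}
    while frontier:
        frontier = {x + y for x in frontier for y in p
                    if len(x + y) <= k} - result
        result |= frontier
    return result

# The regular set (0 + 1)* restricted to strings of length <= 3,
# built from the basis sets {0} and {1}:
print(sorted(closure(union({"0"}, {"1"}), 3)))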
DEFINITION
Regular expressions over Σ and the regular sets they denote are defined recursively, as follows:
(1) ∅ is a regular expression denoting the regular set ∅.
(2) e is a regular expression denoting the regular set {e}.
(3) a in Σ is a regular expression denoting the regular set {a}.
(4) If p and q are regular expressions denoting the regular sets P and Q, respectively, then
(a) (p + q) is a regular expression denoting P ∪ Q.
(b) (pq) is a regular expression denoting PQ.
(c) (p)* is a regular expression denoting P*.
(5) Nothing else is a regular expression.
We shall use the shorthand notation p+ to denote the regular expression pp*. Also, we shall remove redundant parentheses from regular expressions whenever no ambiguity can arise. In this regard, we assume that * has the highest precedence, then concatenation, and then +. Thus, 0 + 10* means (0 + (1(0*))).
Example 2.8
Some examples of regular expressions are
(1) 01, denoting {01}.
(2) 0*, denoting {0}*.
(3) (0 + 1)*, denoting {0, 1}*.
(4) (0 + 1)*011, denoting the set of all strings of 0's and 1's ending in 011.
(5) (a + b)(a + b + 0 + 1)*, denoting the set of all strings in {0, 1, a, b}* beginning with a or b.
(6) (00 + 11)*((01 + 10)(00 + 11)*(01 + 10)(00 + 11)*)*, denoting the set of all strings of 0's and 1's containing both an even number of 0's and an even number of 1's.
It should be quite clear that for each regular set we can find at least one regular expression denoting that regular set. Also, for each regular expression we can construct the regular set denoted by that regular expression. Unfortunately, for each regular set there is an infinity of regular expressions denoting that set. We shall say two regular expressions are equal (=) if they denote the same set. Some basic algebraic properties of regular expressions are stated in the following lemma.
LEMMA 2.1
Let α, β, and γ be regular expressions. Then
(1) α + β = β + α
(2) ∅* = e
(3) α + (β + γ) = (α + β) + γ
(4) α(βγ) = (αβ)γ
(5) α(β + γ) = αβ + αγ
(6) (α + β)γ = αγ + βγ
(7) αe = eα = α
(8) ∅α = α∅ = ∅
(9) α* = α + α*
(10) (α*)* = α*
(11) α + α = α
(12) α + ∅ = α
Proof. (1) Let α and β denote the sets L1 and L2, respectively. Then α + β denotes L1 ∪ L2 and β + α denotes L2 ∪ L1. But L1 ∪ L2 = L2 ∪ L1 from the definition of union. Hence, α + β = β + α. The remaining parts are left for the Exercises.
In what follows, we shall not distinguish between a regular expression and the set it denotes unless confusion will arise. For example, under this convention the symbol a will represent the set {a}. When dealing with languages it is often convenient to use equations whose indeterminates and coefficients represent sets. Here, we shall consider sets of equations whose coefficients are regular expressions and shall call such equations regular expression equations. For example, consider the regular expression equation
(2.2.1)
X = aX + b
where a and b are regular expressions. We can easily verify by direct substitution that X = a*b is a solution to Eq. (2.2.1). That is to say, when we substitute the set represented by a*b in both sides of Eq. (2.2.1), then each side of the equation represents the same set. We can also have sets of equations that define languages. For example, consider the pair of equations (2.2.2)
X = a1X + a2Y + a3
Y = b1X + b2Y + b3
where each ai and bi is a regular expression. We shall show how we can solve this pair of simultaneous equations to obtain the solutions
X = (a1 + a2b2*b1)*(a3 + a2b2*b3)
Y = (b2 + b1a1*a2)*(b3 + b1a1*a3)
However, we should first mention that not all regular expression equations have unique solutions. For example, if
(2.2.3)
X = ~X + fl
is a regular expression equation and α denotes a set which contains the empty string, then X = α*(β + γ) is also a solution to (2.2.3) for all γ. (γ does not even have to be regular. See Exercise 2.2.7.) Thus Eq. (2.2.3) has an infinity
of solutions. In situations of this nature we shall use the smallest solution, which we call the minimal fixed point. The minimal fixed point for Eq. (2.2.3) is X = α*β.
DEFINITION
A set of regular expression equations is said to be in standard form over a set of indeterminates A = {X1, X2, ..., Xn} if for each Xi in A there is an equation of the form
Xi = αi0 + αi1X1 + αi2X2 + ... + αinXn
with each αij a regular expression over some alphabet disjoint from A. The α's are the coefficients. Note that if αij = ∅, a possible regular expression, then effectively there is no term for Xj in the equation for Xi. Also, if αij = e, then effectively the term for Xj in the equation for Xi is just Xj. That is, ∅ plays the role of coefficient 0, and e the role of coefficient 1 in ordinary linear equations.
ALGORITHM 2.1
Solving a set of regular expression equations in standard form.
Input. A set Q of regular expression equations in standard form over A, whose coefficients are regular expressions over alphabet Σ. Let A be the set {X1, ..., Xn}.
Output. A set of solutions of the form Xi = αi, 1 ≤ i ≤ n, where αi is a regular expression over Σ.
Method. The method is reminiscent of solving linear equations using Gaussian elimination.
Step 1: Let i = 1.
Step 2: If i = n, go to step 4. Otherwise, using the identities of Lemma 2.1, write the equation for Xi as Xi = αXi + β, where α is a regular expression over Σ and β is a regular expression of the form β0 + βi+1Xi+1 + ... + βnXn, with each βj a regular expression over Σ. We shall see that this will always be possible. Then in the equations for Xi+1, ..., Xn, we replace Xi on the right by the regular expression α*β.
Step 3: Increase i by 1 and return to step 2.
Step 4: After executing step 2 the equation for Xi will have only symbols in Σ and Xi, ..., Xn on the right. In particular, the equation for Xn will have only Xn and symbols in Σ on the right. At this point i = n and we now go to step 5.
Step 5: The equation for Xi is of the form Xi = αXi + β, where α and β are regular expressions over Σ. Emit the statement Xi = α*β and substitute α*β for Xi in the remaining equations.
Step 6: If i = 1, end. Otherwise, decrease i by 1 and return to step 5.
[-7
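A rough Python rendering of this elimination scheme is sketched below. It treats coefficients simply as strings, encodes the coefficient ∅ as None, applies no simplifications beyond the identities for ∅ and e, and therefore may produce expressions that differ in form from (but denote the same sets as) the ones obtained by hand; the function names and encoding are illustrative only.

def wrap(r):
    # parenthesize a non-atomic coefficient
    return r if len(r) == 1 else "(" + r + ")"

def plus(r, s):
    # union of two coefficients; None is the empty set
    if r is None:
        return s
    if s is None:
        return r
    return r + " + " + s

def cat(r, s):
    # concatenation of two coefficients; "e" is the empty string
    if r is None or s is None:
        return None
    if r == "e":
        return s
    if s == "e":
        return r
    return wrap(r) + wrap(s)

def star(r):
    # closure; by Lemma 2.1, (empty set)* = e, and e* = e
    if r is None or r == "e":
        return "e"
    return wrap(r) + "*"

def solve(consts, coefs):
    # consts[i] is alpha_{i0}; coefs[i][j] is alpha_{ij} (None meaning the empty set).
    # Forward: rewrite X_i = aX_i + b as X_i = a*b and substitute into later equations;
    # backward: substitute the solutions for X_n, ..., X_{i+1} into the equation for X_i.
    n = len(consts)
    for i in range(n):
        a = star(coefs[i][i])
        consts[i] = cat(a, consts[i])
        coefs[i][i] = None
        for j in range(i + 1, n):
            coefs[i][j] = cat(a, coefs[i][j])
        for k in range(i + 1, n):
            c = coefs[k][i]
            if c is None:
                continue
            coefs[k][i] = None
            consts[k] = plus(consts[k], cat(c, consts[i]))
            for j in range(i + 1, n):
                coefs[k][j] = plus(coefs[k][j], cat(c, coefs[i][j]))
    sol = [None] * n
    for i in range(n - 1, -1, -1):
        r = consts[i]
        for j in range(i + 1, n):
            r = plus(r, cat(coefs[i][j], sol[j]))
        sol[i] = r
    return sol

# Example 2.9:  X1 = 0X2 + 1X1 + e,  X2 = 0X3 + 1X2,  X3 = 0X1 + 1X3
# solve(["e", None, None],
#       [["1", "0", None], [None, "1", "0"], ["0", None, "1"]])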
Example 2.9
Let A = {X~, X2, X3}, and let the set of equations be (2.2.4)
X1 = 0X2 + 1X1 + e
(2.2.5)
X2 = 0X3 + 1X2
(2.2.6)
X3 = 0X1 + 1X3
From (2.2.4) we obtain X1 = 1X1 + (0X2 + e). We then replace X1 by 1*(0X2 + e) in the remaining equations. Equation (2.2.6) becomes X3 = 01*(0X2 + e) + 1X3, which can be written, using Lemma 2.1, as
(2.2.7)
X3 = 01*0X2 + 1X3 + 01*
If we now work on (2.2.5), which was not changed by the previous step, we replace X2 by 1*0X3 in (2.2.7) and obtain
(2.2.8)
X3 = (01*01*0 + 1)X3 + 01*
We now reach step 5 of Algorithm 2.1. From Eq. (2.2.8) we obtain the solution for X3" (2.2.9)
X3 = (01*01*0 + 1)*01*
We substitute (2.2.9) in (2.2.5), to yield (2.2.10)
X2 = 0(01*01*0 + 1)*01* + 1X2
Since X 3 does not appear in (2.2.4), that equation is not modified. We then solve (2.2.10), obtaining (2.2.11)
X2 = 1*0(01*01*0 + 1)*01*
Substituting (2.2.11) into (2.2.4), we obtain (2.2.12)
X1 = 01*0(01*01*0 + 1)*01* + 1X1 + e
The solution to (2.2.12) is (2.2.13)
X1 = 1*(01*0(01*01*0 + 1)*01* + e)
The output of Algorithm 2.1 is the set of Eq. (2.2.9), (2.2.11), and (2.2.13).
D
We must show that the output of Algorithm 2.1 is truly a solution to the equations, in the sense that when the solutions are substituted for the indeterminates, the sets denoted by both sides of each equation are the same. As we have pointed out, the solution to a set of standard form equations is not always unique. However, when a set of equations does not have a unique solution we shall see that Algorithm 2.1 yields the minimal fixed point. DEFINITION
Let Q be a set of standard form equations over A with coefficients over Σ. We say that a mapping f from A to languages in Σ* is a solution to Q if upon substituting f(X) for X in each equation, for all X in A, the equations become set equalities. We say that f: A → P(Σ*) is a minimal fixed point of Q if f is a solution and, if g is any other solution, f(X) ⊆ g(X) for all X in A. The following two lemmas provide useful information about minimal fixed points.
LEMMA 2.2
Every set Q of standard form equations over A has a unique minimal fixed point.
Proof. Let f(X) = {w | for all solutions g to Q, w is in g(X)}, for all X in A. It is straightforward to show that f is a solution and that f(X) ⊆ g(X) for all solutions g. Thus, f is the unique minimal fixed point of Q.
We shall now characterize the minimal fixed point of a set of equations.
LEMMA 2.3
Let Q be a set of standard form equations over A, where A = {X1, ..., Xn} and the equation for each Xi is
Xi = αi0 + αi1X1 + ... + αinXn
Then the minimal fixed point of Q is f, where
f(Xi) = {w1 ... wm | for some sequence of integers j1, ..., jm, m ≥ 1, wm is in αjm,0, and wk is in αjk,jk+1, 1 ≤ k < m, where j1 = i}.
Proof. It is straightforward to show that the following set equations are valid:
f(Xi) = αi0 ∪ αi1 f(X1) ∪ ... ∪ αin f(Xn) for all i.
Thus, f is a solution. To show that f is the minimal fixed point, suppose that g is a solution and that for some i there exists a string w in f(Xi) - g(Xi). Since w is in f(Xi), we can write w = w1 ... wm, where for some sequence of integers j1, ..., jm, we have wm in αjm,0, wk in αjk,jk+1 for 1 ≤ k < m, and j1 = i.
Since g is a solution, we have g(Xj) = αj0 ∪ αj1 g(X1) ∪ ... ∪ αjn g(Xn) for all j. In particular, αj0 ⊆ g(Xj) and αjk g(Xk) ⊆ g(Xj) for all j and k. Thus, wm is in g(Xjm), wm-1wm is in g(Xjm-1), and so forth. Finally, w = w1w2 ... wm is in g(Xj1) = g(Xi). But then we have a contradiction, since we supposed that w was not in g(Xi). Thus we can conclude that f(Xi) ⊆ g(Xi) for all i. It immediately follows that f is the minimal fixed point of Q.
LEMMA 2.4
Let Q1 and Q2 be the sets of equations before and after a single application of step 2 of Algorithm 2.1. Then Q1 and Q2 have the same minimal fixed point.
Proof. Suppose that in step 2 the equation
Ai = αi0 + αiiAi + αi,i+1Ai+1 + ... + αinAn
is under consideration. (Note: The coefficient of Ah is ∅ for 1 ≤ h < i.) In Q1 and Q2 the equations for Ah, h ≤ i, are the same. Suppose that
(2.2.14)    Aj = αj0 + αj1A1 + αj2A2 + ... + αjnAn
is the equation for Aj, j > i, in Q1. In Q2 the equation for Aj becomes
(2.2.15)    Aj = β0 + βi+1Ai+1 + ... + βnAn
where
β0 = αj0 + αji(αii)*αi0
βk = αjk + αji(αii)*αik,   for i < k ≤ n
We can use Lemma 2.3 to express the minimal fixed points of Q1 and Q2, which we shall denote by f1 and f2, respectively. From the form of Eq. (2.2.15), every string in f2(Aj) is in f1(Aj). This follows from the fact that any string w which is in the set denoted by αji(αii)*αik can be expressed as w1w2 ... wm, where w1 is in αji, wm is in αik, and w2, ..., wm-1 are in αii. Thus, w is the concatenation of a sequence of strings in the sets denoted by coefficients of Q1, for which the subscripts satisfy the condition of Lemma 2.3. A similar observation holds for strings in αji(αii)*αi0. Thus it can be shown that f2(Aj) ⊆ f1(Aj). Conversely, suppose w is in f1(Aj). Then by Lemma 2.3 we may write w = w1 ... wm, for some sequence of nonzero subscripts l1, ..., lm such that wm is in αlm,0, wp is in αlp,lp+1, 1 ≤ p < m, and l1 = j. We can group the wp's uniquely, such that we can write w = y1 ... yr, where yp = ws ... wt, and
(1) If lt ≠ i, then s = t + 1.
(2) If lt > i, then s is chosen such that ls+1, ..., lt are all i and lt+1 ≠ i.
It follows that in either case yp is in the coefficient of Ajp+1 in the equation of Q2 for Ajp, and hence w is in f2(Aj). We conclude that f1(Aj) = f2(Aj) for all j.
LEMMA 2.5
Let Q1 and Q2 be the sets of equations before and after a single application of step 5 in Algorithm 2.1. Then Q1 and Q2 have the same minimal fixed points.
Proof. Exercise, similar to Lemma 2.4. THEOREM 2.1 Algorithm 2.1 correctly determines the minimal fixed point of a set of standard form equations.
Proof. After step 5 has been applied for all i, the equations are all of the form Ai = αi, where αi is a regular expression over Σ. The minimal fixed point of such a set is clearly f(Ai) = αi.
2.2.2.
Regular Sets and Right-Linear Grammars
We shall show that a language is defined by a right-linear grammar if and only if it is a regular set. A few observations are needed to show that every regular set has a right-linear grammar. Let Σ be a finite alphabet.
LEMMA 2.6
(i) ∅, (ii) {e}, and (iii) {a} for all a in Σ are right-linear languages.
Proof. (i) G = ({S}, Σ, ∅, S) is a right-linear grammar such that L(G) = ∅. (ii) G = ({S}, Σ, {S → e}, S) is a right-linear grammar for which L(G) = {e}. (iii) Ga = ({S}, Σ, {S → a}, S) is a right-linear grammar for which L(Ga) = {a}.
LEMMA 2.7
If L1 and L2 are right-linear languages, then so are (i) L1 ∪ L2, (ii) L1L2, and (iii) L1*.
Proof. Since L1 and L2 are right-linear, we can assume that there exist right-linear grammars G1 = (N1, Σ, P1, S1) and G2 = (N2, Σ, P2, S2) such that L(G1) = L1 and L(G2) = L2. We shall also assume that N1 and N2 are disjoint. That is, we know we can rename nonterminals arbitrarily, so this assumption is without loss of generality.
(i) Let G3 be the right-linear grammar
(N1 ∪ N2 ∪ {S3}, Σ, P1 ∪ P2 ∪ {S3 → S1 | S2}, S3),
where S3 is a new nonterminal symbol not in N1 or N2. It should be clear that L(G3) = L(G1) ∪ L(G2) because for each derivation S3 ⇒+ w in G3 there is either a derivation S1 ⇒+ w in G1 or S2 ⇒+ w in G2, and conversely. Since G3 is a right-linear grammar, L(G3) is a right-linear language.
(ii) Let G4 be the right-linear grammar (N1 ∪ N2, Σ, P4, S1) in which P4 is defined as follows:
(1) If A → xB is in P1, then A → xB is in P4.
(2) If A → x is in P1, then A → xS2 is in P4.
(3) All productions in P2 are in P4.
Note that if S1 ⇒+ w in G1, then S1 ⇒+ wS2 in G4, and that if S2 ⇒+ x in G2, then S2 ⇒+ x in G4. Thus, L(G1)L(G2) ⊆ L(G4). Now suppose that S1 ⇒+ w in G4. Since there are no productions of the form A → x in P4 that "came out of" P1, we can write the derivation in the form S1 ⇒+ xS2 ⇒+ xy in G4, where w = xy and all productions used in the derivation S1 ⇒+ xS2 arose from rules (1) and (2) of the construction of P4. Thus we must have the derivations S1 ⇒+ x in G1 and S2 ⇒+ y in G2. Hence, L(G4) ⊆ L(G1)L(G2). It thus follows that L(G4) = L(G1)L(G2).
(iii) Let G5 = (N1 ∪ {S5}, Σ, P5, S5) such that S5 is not in N1 and P5 is constructed as follows:
(1) If A → xB is in P1, then A → xB is in P5.
(2) If A → x is in P1, then A → xS5 and A → x are in P5.
(3) S5 → S1 | e are in P5.
A proof that S5 ⇒+ x1S5 ⇒+ x1x2S5 ⇒+ ... ⇒+ x1x2 ... xn-1S5 ⇒+ x1x2 ... xn in G5 if and only if S1 ⇒+ x1, S1 ⇒+ x2, ..., S1 ⇒+ xn in G1 is left for the Exercises. From the above, we have L(G5) = (L(G1))*.
We can now equate the class of right-linear languages with the class of regular sets. THEOREM 2.2 A language is a regular set if and only if it is a right-linear language.
Proof. Only if: This portion follows from Lemmas 2.6 and 2.7 and induction on the number of applications of the definition of regular set necessary to show a particular regular set to be one.
If: Let G = (N, Σ, P, S) be a right-linear grammar with N = {A1, ..., An}. We can construct a set of regular expression equations in standard form with the nonterminals in N as indeterminates. The equation for Ai is
Ai = αi0 + αi1A1 + ... + αinAn, where
(1) αi0 = w1 + ... + wk, where Ai → w1 | ... | wk are all productions with Ai on the left and only terminals on the right. If k = 0, take αi0 to be ∅.
(2) αij, j > 0, is x1 + ... + xm, where Ai → x1Aj | ... | xmAj are all productions with Ai on the left and a right side ending in Aj. Again, if m = 0, then αij = ∅.
Using Lemma 2.3, it is straightforward to show that L(G) is f(S), where f is the minimal fixed point of the constructed set of equations. This portion of the proof is left for the Exercises. But f(S) is a language with a regular expression, as constructed by Algorithm 2.1. Thus, L(G) is a regular set.
Example 2.10
Let G be defined by the productions
S → 0A | 1S | e
A → 0B | 1A
B → 0S | 1B
Then the set of equations generated is that of Example 2.9, with S, A, and B, respectively, identified with X1, X2, and X 3. In fact, L(G) is the set of strings whose number of O's is divisible by 3. It is not hard to show that this set is denoted by the regular expression of (2.2.13). [~] 2.2.3.
Finite Automata
We have seen three ways of defining the class of regular sets: (1) The class of regular sets is the least class of languages containing ~ , [e}, and {a} for all symbols a and closed under union, concatenation, and *. (2) The regular sets are those sets defined by regular expressions. (3) The regular sets are the languages generated by right-linear grammars. We shall now consider a fourth way, as the sets defined by finite automata. A finite automaton is one of the simplest recognizers. Its "infinite" memory is null. Ordinarily, the finite automaton consists only of an input tape and a finite control. Here, we shall allow the finite control to be nondeterministic, but restrict the input head to be one way. In fact, we require that the input head shift right on every move.'t The two-way finite automaton is considered in the Exercises. tRecall that, by definition, a one-way recognizer does not shift its input head left but may keep it stationary during a move. Allowing a finite automaton to keep its input head stationary does not permit the finite automaton to recognize any language not recognizable by a conventional finite automaton.
We specify a finite automaton by defining its finite set of control states, the allowable input symbols, the initial state, and the set of final states, i.e., the states which indicate acceptance of the input. There is a state transition function which, given the "current" state and "current" input symbol, gives all possible next states. It should be emphasized that the device is nondeterministic in an automaton-theoretic sense. That is, the device goes to all its next states, if you will, replicating itself in such a way that one instance of itself exists for each of its possible next states. The device accepts if any of its parallel existences reaches an accepting state. The nondeterminism of the finite automaton should not be confused with "randomness," in which the automaton could randomly choose a next state according to fixed probabilities but had a single existence. Such an automaton is called "probabilistic" and will not be studied here. We now give a formal definition of nondeterministic finite automaton. DEFINITION
A nondeterministic finite automaton is a 5-tuple M = (Q, Σ, δ, q0, F), where
(1) Q is a finite set of states;
(2) Σ is a finite set of permissible input symbols;
(3) δ is a mapping from Q × Σ to P(Q) which dictates the behavior of the finite state control; δ is sometimes called the state transition function;
(4) q0 in Q is the initial state of the finite state control; and
(5) F ⊆ Q is the set of final states.
A finite automaton operates by making a sequence of moves. A move is determined by the current state of the finite control and the input symbol currently scanned by the input head. A move itself consists of the control changing state and the input head shifting one square to the right. To determine the future behavior of a finite automaton, all we need to know are
(1) The current state of the finite control and
(2) The string of symbols on the input tape consisting of the symbol under the input head followed by all symbols to the right of this symbol.
These two items of information provide an instantaneous description of the finite automaton, which we shall call a configuration.
DEFINITION
If M = (Q, Σ, δ, q0, F) is a finite automaton, then a pair (q, w) in Q × Σ* is a configuration of M. A configuration of the form (q0, w) is called an initial configuration, and one of the form (q, e), where q is in F, is called a final (or accepting) configuration. A move by M is represented by a binary relation ⊢M (or ⊢, where M is
understood) on configurations. If δ(q, a) contains q', then (q, aw) ⊢ (q', w) for all w in Σ*. This says that if M is in state q and the input head is scanning the input symbol a, then M may make a move in which it goes into state q' and shifts the input head one square to the right. Since M is in general nondeterministic, there may be states other than q' which it could also enter on one move. We say that C ⊢^0 C' if and only if C = C'. We say that C0 ⊢^k Ck, k ≥ 1, if and only if there exist configurations C1, ..., Ck-1 such that Ci ⊢ Ci+1 for all i, 0 ≤ i < k. C ⊢+ C' means that C ⊢^k C' for some k ≥ 1, and C ⊢* C' means that C ⊢^k C' for some k ≥ 0. Thus, ⊢+ and ⊢* are, respectively, the transitive and reflexive-transitive closures of ⊢. We shall drop the subscript M if no ambiguity arises. We shall say that an input string w is accepted by M if (q0, w) ⊢* (q, e) for some q in F. The language defined by M, denoted L(M), is the set of input strings accepted by M, that is,
{w | w ∈ Σ* and (q0, w) ⊢* (q, e) for some q in F}.
We shall now give two examples of finite automata. The first is a simple "deterministic" automaton; the second shows the use of nondeterminism.
Example 2.11
Let M = ({p, q, r}, {0, 1}, δ, p, {r}) be a finite automaton, where δ is specified as follows:
                Input
State             0      1
  p              {q}    {p}
  q              {r}    {p}
  r              {r}    {r}
(p, ool) (q, 01)
SEC. 2.2
REGULAR SETS, THEIR GENERATORS, AND THEIR RECOGNIZERS
1 15
~- (r, 1) [- (r, e) Thus, 01001 is in L(M).
E]
Example 2.12
Let us design a nondeterministic finite automaton to accept the set of strings in {1, 2, 3} such that the last symbol in the input string also appears previously in the string. That is, 121 is accepted, but 31312 is not. We shall have a state q0, which represents the idea that no attempt has been made to recognize anything. In this state the automaton (or that existence of it, anyway) is "coasting in neutral." We shall have states q t, q2, and q3, which represent the idea that a "guess" has been made that the last symbol of the string is the subscript of the state. We have one final state, qs" In addition to remaining in qo, the automaton can go to state q= if a is the next input. If (an existence of) the automaton is in qa, it can go to qs if it sees another a. The automaton goes no place from qs, since the question of acceptance must be decided anew as each symbol on its input tape becomes the "last." We specify M formally as M = ([qo, qa, q2, q3, qs}, [1, 2, 3}, ~, qo, {qt'}) where ~ is given by the following table" Input 1
State
q0
ql q2 q3 qf
(qo,ql} {ql,q:} [q2} {q3}
2
3
{qo,q2} [qo,q3} (qa} {ql] {q2, qf} {q2} {q3} {q3,qf}
On input 12321, the proliferation of configurations is shown in the tree in Fig. 2.2.
~.q3, 21)--(q3, 1)---(q3,e) tq2,321)--(q2, 21)x-(q2, 1)--(q2, e) "\qf , 1) k q x , 2321)~(q1,321)~(qx, 21)~(q'a, 1)X--(ql,e) \"t.qf, e) Fig. 2.2
Configurations of M.
1 16
ELEMENTSOF LANGUAGE THEORY
CHAP. 2
Since (q0, 12321)l-~-- (qs, e), the string 12321 is in L(M). Note that certain configurations are repeated in Fig. 2.2, and for this reason a directed acyclic graph might be a more suitable representation for the configurations entered b y M . E] It is often convenient to use a pictorial representation of finite automata. DEFINITION
Let M = (Q, E, ~, q0, F) be a nondeterministic finite automaton. The
transition graph for M is an unordered labeled graph where 'the nodes of the graph are labeled by the names of the states and there is an edge (p, q) if there exists an a ~ E such that ~(p, a) contains q. Further, we label the edge (p, q) by the list of a such that q ~ ~(p, a). Thus the transition graphs for the automata of Examples 2.11 and 2.12 are shown in Fig. 2.3. We have
1
0
0
_~0,1
Start (a) Example 2.11 1, 2, 3
1,2, l
1
1
Start -
)
(b) Example 2.12
Fig. 2.3 Transitiongraphs. indicated the start state by pointing to it with an arrow labeled "start," and final states have been circled. We shall define a deterministic finite automaton as a special case of the nondeterministic variety. DEFINITION
Let M = (Q, E, ~, q0, F) be a nondeterministic finite automaton. We say that M is deterministic if ~(q, a) has no more than one member for any q in
SEC. 2.2
REGULAR SETS, THEIR GENERATORS, AND THEIR RECOGNIZERS
1 17
Q and a in E. If O(q, a) always has exactly one member, we say that M is
completely specified. Thus the automaton of Example 2.11 is a completely specified deterministic finite automaton. We shall hereafter reserve the term finite automaton for a completely specified deterministic finite automaton. One of the most important results from the theory of finite automata is that the classes of languages defined by nondeterministic finite automata and completely specified deterministic finite automata are identical. We shall prove the result now. CONVENTION Since we shall be dealing primarily with deterministic finite automata, we shall write "6(q, a ) = p" instead of "6(q, a ) = {p}" when the automaton with transition function 6 is deterministic. If O(q, a ) = ~ , we shall often say that 6(q, a) is "undefined." THEOREM 2.3 If L - - L ( M )
for some nondeterministic finite automaton
M, then
L -- L(M') for some finite automaton M'. Proof Let M - - (Q, E, 6, qo, F). We construct M ' = (Q', E, 6', q0, F'), as follows" (1) Q' -- (p(Q). Thus the states of M' are sets of states of M. (2) qo = {q0}. (3) F ' consists of all subsets S of Q such that S ~ F =/= ~ . (4) For all S ~ Q, 6'(S, a ) = s', where S ' = {pl0(q, a) contains p for some q in S}. It is left for the Exercises to prove the following statement by induction on/' (2.2.16)
(S, w)l-~ (S', e) if and only if S' = {Pl(q, w) l~- (P, e) for some q in S}
As a special case of (2.2.16), ({qo}, w)I-~ (S', e) for some S' in F' if and only if (q0 ' w)I .A-. (p ' e) for some p in F. Thus, L(M') = L(M) M Example 2.13
Let us construct a finite automaton M ' = (Q, {1, 2, 3}, ,6', (q0}, F) accepting the language of M in Example 2.12. Since M has 5 states, it seems that M ' has 32 states. However, not all of these are accessible from the initial state. That is, we call a state p accessible if there is a w such that (q0, w)I* (P, e), where q0 is the initial state. Here, we shall construct only the accessible states. We begin by observing that {q0} is accessible. 6'((q0}, a ) = (q0, q~} for a = 1, 2, and 3. Let us consider the state {q0, ql}. We have 6'((q0, qx}, 1) =
1 18
ELEMENTSOF LANGUAGE THEORY
CHAP. 2 Input
State
A = {q0} B = { q o , qx} C={qo,q2] D = [qo, q3} E = [qo, q l , q f } F = {qo, q I , q 2 } G = { q o , qx,q3} H = [q0, q2, qf} 1 = [qo,qz, q3} J = {qo, q 3, qf} K = {qo, q l , q z , qy} L = [qo, q l , q 2 , q 3 } M = {qo, q l , q 3 , qy} N = { q o , q 2 , q 3 , qf} P = {qo,ql,q2, q3,qf} Fig. 2.4
1
2
3
B E F G E K M F L G K P M L P
C F H I F K L H N I K P L N P
D G I J G L M I N J L P M N P
Transition function of M ' .
{qo, ql, qr}" Proceeding in this way, we find that a set of states of M is accessible if and only if(1) It contains qo, and (2) If it contains @, then it also contains ql, q~, or q3. The complete set of accessible states, together with the 8' function, is given in Fig. 2.4. The initial state of M' is A, and the set of final states consists of E, H, .1,
K, M, N, and P. 2.2.4.
Finite A u t o m a t a and Regular Sets
We shall show that a language is t regular set if and only if it is defined by a finite automaton. The method i; first to show that a finite automaton language is defined by a right-linear ~ :ammar. Then we show that the finite automaton languages include ~ , (e} {a} for all symbols a, and are closed under union, concatenation, and *. Ti us every regular set is a finite automaton language. The following sequence of lemmas proves these assertions. LEMMA 2.8
If L = L(M) for finite automaton M, then L = L(G) for some rightlinear grammar G.
Proof. Let M = (Q, E, 8, qo, F) (M is deterministic, of course). We let G' = (Q, ~, P, qo), where P is defined as follows: (1) If d~(q, a) = r, then P contains the production q ~ at. (2) If p is in F, then p ---~ e is a production in P.
SEC. 2.2
REGULAR SETS, THEIR GENERATORS, AND THEIR RECOGNIZERS
'l 19
We can show that each step of a derivation in G mimics a move by M. We shall prove by induction on i that i+1
(2.2.17)
For q in Q, q ~
w if and only if (q, w) [_L_(r, e) for some r in F
Basis. For i = 0, clearly q =~ e if and only if (q, e) [.-.~-(q, e) for q in F. Inductive Step. Assume that (2.2.17) is true for i and let w = ax, where i+1
[xl = i. Then q ~
i
w if and only ifq ==~as ~
ax for some s 6 Q. But q =~ as i
if and only if O(q, a) = s. From the inductive hypothesis, s ==~ x if and only i+1
if (s, x) I'- 1 (r, e) for some r ~ F. Therefore, q ~ w if and only if (q, w) l-L-(r, e) for some r ~ F. Thus Eq. (2.2.17) is true for all i > 0. +
We now have q0 =~ w if and only if (q0, w)I .-~- (r, e) for some r ~ F. Thus, L(G) = L(M). E] LEMMA 2.9
Let X be a finite alphabet. (i) ~ , (ii) {e}, and (iii) {a} for a ~ X are finite automaton languages.
Proof. (i) Any finite automaton with an empty set of final states accepts ~ . (ii) Let M = ({q0}, X, ~5, qo, [q0}), where 6(qo, a) is undefined for all a in X. Then L ( M ) = {e}. (iii) Let M = ({q0, ql}, X, 6, qo, [ql}), where ~(qo, a) = ql and 6 is undefined otherwise. Then L ( M ) = [a]. E] LEMMA 2.10 Let L1 = L(Mi) and L2 = L(M2) for finite automata Mt and M2. Then (i) L~ U L2, (ii) L~L2, and (iii) Lt* are finite automaton languages. Proof Let M~ = (Qi, E, 61, qt, El) and M2 = (Q2, E, c52,q2, F2). We assume without loss of generality that Q1 (q Q2 = ~ , since states can be renamed at will.
(i) Let automaton, (1) (2)
M = (Q1 u Q2 u [q0}, E, c5, q0, F) be a nondeterministic finite where qo is a new state, F=F1 UF2ifeisnotinLlorLzandF=F 1UF2U{qo}ife is in L1 or L2, and (3) (a) 8(q0, a) = ~5(ql, a) u ~5(q2, a) for all a in X, (b) 8(q, a) = 6i(q, a) for all q in Q1, a in X, and (c) cS(q, a) = ~2(q, a) for all q in Q2, a in X. Thus, M guesses whether to simulate M1 or M2. Since M is nondeterministic, it actually does both. It is straightforward to show by induction on i > 1 that (q0, w) l.-~-(q, e) if and only i f q i s in Qi and (ql, w) [~-~(q, e) or qis in Q2
120
ELEMENTSOF LANGUAGE THEORY
CHAP. 2
and (q2, w)1~7 (q, e). This result, together with the definition of F, yields
L(M) = L(M~) U L(M2). (ii) To construct a finite automaton M to recognize M = (Q1 t.j Q z, E, t~, q~, F), where ~ is defined by (1) ~(q, a) = ~l(q, a) for all q in Q~ - F~, (2) 5(q, a) = ill(q, a) U t~2(q2, a) for all q in F1, and (3) $(q, a) = ~2(q, a) for all q in Q2. Let
F = J Fz', IF1 U F2,
L1L 2 let
if q2 ~ Fz if q2 ~ F2
That is, M begins by simulating M1. When M reaches a final state of M1, it may, nondeterministically, imagine that it is in the initial state of M2 by rule (2). M will then simulate M2. Let x be in L 1 and y in L 2. Then (q~, xy)[.~- (q, y) for some q in F~. If x = e, then q = q l. If y ~ e, then, using one rule from (2) and zero or more from (3), (q, y)1.~- (r, e) for some r ~ F2. If y = e, then q is in F, since q2 ~ F2. Thus, xy is in L(M). Suppose that w is in L(M). Then (q l, w)I-~ (q, e) for some q ~ F. There are two cases to consider depending on whether q ~ F2 or q ~ F1. Suppose that q ~ /72. Then we can write w = xay for some a in E such that O
(ql, xay) l~- (r, ay) I~- (s, y) IM*~(q, e), where r ~ Fi, s ~ Q2, and $2(q, a) contains s. Then x ~ L 1 and ay ~ L 2. Suppose that q ~ F 1. Then q2 ~ F2 and e is in L 2. Thus, w ~ L 1. We conclude that L(M) = L~L z. (iii) We construct M = (Q1 u [q'~}, ~, $, q', F 1 u {q'}), where q' is a new state not in Q1, to accept L~* as follows. ~ is defined by (1) O(q, a) = Ol(q, a) if q is in Q 1 - F1 and a ~ E, (2) t~(q, a) = ~1 (q, a) u ~ ~(ql, a) if q is in F 1 and a ~ E, and (3) $(q', a) = ~1 (q l, a) for all a in E. Thus, whenever M enters a final state of M1, it has the option of continuing to simulate M1 or to begin simulating M1 anew from the initial state. A proof that L ( M ) = L~* is similar to the proof of part (ii). Note that since q' is a final state, e ~ L(M). D THEOREM 2 . 4
A language is accepted by a finite automaton if and only if it is a regular set.
Proof Immediate from Theorem 2.2 and Lemmas 2.8, 2.9, and 2.10. D 2.2.5.
Summary
The results of Section 2.2 can be summarized in the following theorem.
EXERCISES
1 21
THEOREM 2.5 The following statements are equivalent: (1) L is a regular set. (2) L is a right-linear language. (3) L is a finite a u t o m a t o n language. (4) L is a nondeterministic finite a u t o m a t o n language. (5) L is d e n o t e d by a regular expression. [ ]
EXERCISES
2.2.1.
Which of the following are regular sets? Give regular expressions for those which are. (a) The set of words with an equal number of 0's and l's. (b) The set of words in {0, 1}* with an even number of O's and an odd number of l's. (c) The set of words in ~* whose length is divisible by 3. (d) The set of words in {0, 1]* with no substring 101.
2.2.2.
Show that the set of regular expressions over 1~ is a CFL.
2.2.3.
Show that if L is any regular set, then there is an infinity of regular expressions denoting L.
2.2.4.
Let L be a regular set. Prove directly from the definition of a regular set that L R is a regular set. Hint: Induction on the number of applications of the definition of regular set used to show L to be regular.
2.2.5.
Show the folilowing identities for regular expressions 0~, fl, and ~:
(a) ~(fl + 7) = ~P + ~7. (b) ~ + (p + ~) = (~ + p) + 7.
(g) (~ + P)r = ~r + Pr. (h) ~ * = e . (i) 0C* -t- 0C -- ~;*.
(j) (~*)* = ~*. (k) (~ + p)* = (~*p*)*. (1) ~ + ~ = ~ .
(d) 0ce -- e0~ = 0~. (e) ~ = ~ =~. (f) ~ + 0 ~ = ~ . 2.2.6.
Solve the following set of regular expression equations: A1 = (01" + 1)A1 + A2 A2 = 11 + 1A1 + 00A3 A3 = e + A 1
2.2.7.
÷A2
Consider the single equation (2.2.18)
X = 0CX + fl
where 0~ and fl are regular expressions over ~ and X q~ E. Show that (a) If e is not in 0c, then X = 0~*fl is the unique solution to (2.2.18). (b) If e is in ~, then 0c*fl is the minimal fixed point of (2.2.18), but there are an infinity of solutions.
122
ELEMENTSOF LANGUAGE THEORY
CHAP. 2
(c) In either case, every solution to (2.2.18) is a set of the form • * ( f l u L) for some (not necessarily regular) language L.
2.2.8.
Solve the following general pair of standard form equations" X = ~1 X -t- 0~2 Y -b ~3 Y = PI X ~ 1~2 Y ~ ~3
2.2.9.
Complete the proof of Lemma 2.4.
2.2.10.
Prove L e m m a 2.5.
2.2.11.
Find right-linear grammars for those sets in Exercise 2.2.1 which are regular sets. DEFINITION
A grammar G = (N, E, P, S) is left-linear if every production in P is of the form A ---~ Bw or A --, w.
2.2.12.
Show that a language is a regular set if and only if it has a left-linear grammar. Hint: Use Exercise 2.2.4. DEFINITION A right-linear grammar G = (N, Z, P, S) is called a regular grammar when (1) All productions with the possible exception of S ~ e are of the form A ~ aB or A ----~ a, where A and B are in N and a is in Z. (2) If S---~ e is in P, then S does not appear on the right of any production.
2.2.13.
Show that every regular set has a regular grammar. Hint: There are several ways to do this. One way is to apply a sequence of transformations to a right-linear grammar G which will map G into an equivalent regular grammar. Another way is to construct a regular grammar directly from a finite automaton.
2.2.14.
Construct a regular grammar for the regular set generated by the right-linear grammar A----~ BIC B---~ OBI1BIOll C----~ O D I 1 C l e D ~
0Cl 1D
2.2.15.
Provide an algorithm which, given a regular grammar G and a string w, determines whether w is in L(G).
2.2.16.
Prove line (2.2.16) in Theorem 2.3.
2.2.17.
Complete the proof of L e m m a 2.7(iii).
EXERCISES
123
DEFINITION A production A ~
0c of right-linear grammar G = (N, E, P, S) is
useless if there do not exist strings w and x in ~* such that S =:~ wA ==~ woc ~
wx.
2.2.18.
Give an algorithm to convert a right-linear grammar to an equivalent one with no useless productions.
"2.2.19.
Let G = (N, Z, P, S) be a right-linear grammar. Let N = [A1 . . . . . A,}, and define ~xtj = xl + x2 + . - - + xm, where At ~ x t At . . . . . At ~ xmA# are all the productions of the form A t - - , y A # . Also define OClo = x t + . . . + Xm, where At ~ xt . . . . . At ---~ xm are all productions of the form At ---~ y. Let Q be the set of standard form equations At = ~zt0 + tgtlA1 + tZtzA2 + . . . + tZt~A,. Show that the minimal fixed point of Q is L(G). H i n t : Use Lemma 2.3.
2.2.20.
Show that L ( G ) o f Example 2.10 is the set of strings in [0, 1}*, whose length is divisible by 3.
2.2.21.
Find deterministic and nondeterministic finite automata for those sets of Exercise 2.2.1 which are regular.
2.2.22.
Show that the finite automaton of Example 2.11 accepts the language (0 + i)*00(0 + 1)*.
2.2.23.
Prove that the finite automaton of Example 2.12 accepts the language [wa[a in [1, 2, 3} and w has an instance of a}.
2.2.24.
Complete the proof of Lemma 2.10(iii).
*2.2.25.
A two-way finite automaton is a (nondeterministic) finite control with an input head that can move either left or right or remain stationary. Show that a language is accepted by a two-way finite automaton if and only if it is a regular set. Hint: Construct a deterministic one-way finite automaton which, after reading input w-~ e, has in its finite control a finite table which tells, for each state q of the two-way automaton in what state, if any, it would move off the right end of w, when started in state q at the rightmost symbol of w.
*2.2.26.
Show that allowing a one-way finite automaton to keep its input head stationary does not increase the class of languages defined by the device.
**2.2.27.
For arbitrary n, show that there is a regular set which can be recognized by an n-state nondeterministic finite automaton but requires 2" states in any deterministic finite automaton recognizing it.
2.2.28.
Show that every language accepted by an n-state two-way finite automaton is accepted by a 2"~"+1)-state finite automaton.
**2.2.29.
How many different languages over [0, 1} are defined by two-state (a) Nondeterministic finite automata ? (b) Deterministic finite automata ? (c) Finite automata?
124
ELEMENTSOF LANGUAGE THEORY
CHAP. 2
DEFINITION
A set S of integers forms an arithmetic progression if we can write S = [ c , c + p , c ÷ 2 p . . . . . c + i p . . . . ]. For any language L, let S(L) = {/[for some w in L, I wl = i3. **2.2.30. Show that for every regular language L, S(L) is the union of a finite number of arithmetic progressions.
Open Problem 2.2.31.
How close to the bound of Exercise 2.2.28 for converting n-state twoway nondeterministic finite automata to k-state finite automata is it actually possible to come ?
BIBLIOGRAPHIC
NOTES
Regular expressions were defined by Kleene [1956]. McNaughton and Yamada [1960] and Brzozowski [1962] cover regular expressions in more detail. Salomaa [1966] describes two axiom systems for regular expressions. The equivalence of regular languages and regular sets is given by Chomsky and Miller [1958]. The equivalence of deterministic and nondeterministic finite automata is given by Rabin and Scott [1959]. Exercise 2.2.25 is from there and from Shepherdson [1959].
2.3.
PROPERTIES OF REGULAR SETS
In this section we shall derive a number of useful facts about finite automata and regular sets. A particularly important result is that for every regular set there is an essentially unique minimum-state finite automaton that defines that set.

2.3.1. Minimization of Finite Automata
Given a finite automaton M, we can find the smallest finite automaton equivalent to M by eliminating all inaccessible states in M and then merging all redundant states in M. The redundant states are determined by partitioning the set of all accessible states into equivalence classes such that each equivalence class contains indistinguishable states and is as large as possible. We then choose one representative from each equivalence class as a state for the reduced automaton. Thus we can reduce the size of M if M contains inaccessible states or two or more indistinguishable states. We shall show that this reduced machine is the smallest finite automaton that recognizes the regular set defined by the original machine M. DEFINITION
Let M = (Q, Σ, δ, q0, F) be a finite automaton, and let q1 and q2 be distinct states. We say that x in Σ* distinguishes q1 from q2 if (q1, x) ⊢* (q3, e),
(q2, x) ⊢* (q4, e), and exactly one of q3 and q4 is in F. We say that q1 and q2 are k-indistinguishable, written q1 ≡^k q2, if and only if there is no x with |x| ≤ k which distinguishes q1 from q2. We say that two states q1 and q2 are indistinguishable, written q1 ≡ q2, if and only if they are k-indistinguishable for all k ≥ 0. A state q ∈ Q is said to be inaccessible if there is no input string x such that (q0, x) ⊢* (q, e). M is said to be reduced if no state in Q is inaccessible and no two distinct states of Q are indistinguishable.

Example 2.14
Consider the finite automaton M whose transition graph is shown in Fig. 2.5.
Fig. 2.5 Transition graph of M.

To reduce M we first notice that states F and G are inaccessible from the start state A and thus can be removed. We shall see in the next algorithm that the equivalence classes under ≡ are {A}, {B, D}, and {C, E}. Thus we can represent these sets by the states p, q, and r, respectively, to obtain the finite automaton of Fig. 2.6, which is the reduced automaton for M. □

LEMMA 2.11
Let M = (Q, Σ, δ, q0, F) be a finite automaton with n states. States q1 and q2 are indistinguishable if and only if they are (n - 2)-indistinguishable.
Fig. 2.6 Reduced machine.
Proof. The "only if" portion is trivial. The "if" portion is trivial if F has 0 or n states. Therefore, assume the contrary. We shall show that the k-indistinguishability relations form a chain

≡^0, ≡^1, ≡^2, ..., ≡^{n-3}, ≡^{n-2}

in which each relation refines the one before it, and that ≡ = ≡^{n-2}.

To see this, we observe that for q1 and q2 in Q,

(1) q1 ≡^0 q2 if and only if both q1 and q2 are either in F or not in F.
(2) q1 ≡^k q2 if and only if q1 ≡^{k-1} q2 and, for all a in Σ, δ(q1, a) ≡^{k-1} δ(q2, a).

The equivalence relation ≡^0 is the coarsest and partitions Q into two equivalence classes, F and Q - F. If ≡^{k+1} ≠ ≡^k, then ≡^{k+1} is a strict refinement of ≡^k; that is, ≡^{k+1} contains at least one more equivalence class than ≡^k. Since there are at most n - 1 elements in either F or Q - F, we can have at most n - 2 successive refinements of ≡^0. If for some k, ≡^k = ≡^{k+1}, then ≡^{k+1} = ≡^{k+2} by (2). Thus the first relation ≡^k with ≡^k = ≡^{k+1} has k ≤ n - 2, and ≡ = ≡^k = ≡^{n-2}. □
Lemma 2.11 has the interesting interpretation that if two states can be distinguished, they can be distinguished by an input sequence of length less than the number of states in the finite automaton. The following algorithm gives the details of how to minimize the number of states in a finite automaton.

ALGORITHM 2.2
Construction of the canonical finite automaton.
Input. A finite automaton M = (Q, Σ, δ, q0, F).
Output. A reduced equivalent finite automaton M'.
Method.
Step 1: Use Algorithm 0.3 on the transition graph of M to find those states which are inaccessible from q0. Delete all inaccessible states.
Step 2: Construct the equivalence relations ≡^0, ≡^1, ..., as outlined in Lemma 2.11. Continue until ≡^{k+1} = ≡^k. Choose ≡ to be ≡^k.
Step 3: Construct the finite automaton M' = (Q', Σ, δ', q0', F'), where
(a) Q' is the set of equivalence classes under ≡. Let [p] be the equivalence class of state p under ≡.
(b) δ'([p], a) = [q] if δ(p, a) = q.
(c) q0' is [q0].
(d) F' = {[q] | q ∈ F}. □
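The refinement idea of Algorithm 2.2 can be sketched in a few lines of Python. The encoding of the automaton (dictionaries keyed by state-symbol pairs) and the helper names are assumptions made for this illustration only, and the sketch assumes a completely specified deterministic automaton.

# A sketch of Algorithm 2.2: minimize a completely specified DFA by
# refining the k-indistinguishability relation until it stabilizes.
# The DFA encoding used here is an illustrative assumption.

def minimize(states, alphabet, delta, start, finals):
    # Step 1: drop states inaccessible from the start state.
    reachable, frontier = {start}, [start]
    while frontier:
        q = frontier.pop()
        for a in alphabet:
            r = delta[(q, a)]
            if r not in reachable:
                reachable.add(r)
                frontier.append(r)

    # Step 2: refine the partition {F, Q - F} until no block splits
    # (Lemma 2.11 guarantees stabilization within n - 2 rounds).
    def block_of(partition, q):
        return next(i for i, b in enumerate(partition) if q in b)

    partition = [b for b in (reachable & finals, reachable - finals) if b]
    while True:
        new_partition = []
        for block in partition:
            groups = {}
            for q in block:
                sig = tuple(block_of(partition, delta[(q, a)]) for a in alphabet)
                groups.setdefault(sig, set()).add(q)
            new_partition.extend(groups.values())
        if len(new_partition) == len(partition):
            break
        partition = new_partition

    # Step 3: one representative (here, a block index) per equivalence class.
    rep = {q: block_of(partition, q) for q in reachable}
    new_delta = {(rep[q], a): rep[delta[(q, a)]]
                 for q in reachable for a in alphabet}
    new_finals = {rep[q] for q in reachable & finals}
    return set(rep.values()), new_delta, rep[start], new_finals

Applied to the automaton of Example 2.15 below, the refinement stabilizes after two rounds, yielding the three classes {A, F}, {B, E}, and {C, D}.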
It is straightforward to show that step 3(b) is consistent; i.e., whatever member of [p] we choose, we get the same equivalence class for δ'([p], a). A proof that L(M') = L(M) is also straightforward and left for the Exercises. We prove that no automaton with fewer states than M' accepts L(M).

THEOREM 2.6
M' of Algorithm 2.2 has the smallest number of states of any finite automaton accepting L(M).
Proof. Suppose that M'' had fewer states than M' and that L(M'') = L(M). Each equivalence class under ≡ is nonempty, so each state of M' is accessible. Since M'' has fewer states than M', there exist strings w and x such that (q0'', w) ⊢* (q, e) and (q0'', x) ⊢* (q, e), where q0'' is the initial state of M'', but w and x take M' to different states. Hence, w and x take M to different states, say p and r, which are distinguishable. That is, there is some y such that exactly one of wy and xy is in L(M). But wy and xy must take M'' to the same state, namely, that state s such that (q, y) ⊢* (s, e). Thus it is not possible that exactly one of wy and xy is in L(M''), as supposed. □

Example 2.15
Let us find a reduced finite automaton for the finite automaton M whose transition graph is shown in Fig. 2.7. The equivalence classes for ≡^k, k ≥ 0, are as follows:

For ≡^0:  {A, F}, {B, C, D, E}
For ≡^1:  {A, F}, {B, E}, {C, D}
For ≡^2:  {A, F}, {B, E}, {C, D}.
Fig. 2.7 Transition graph of M.
Since ≡^2 = ≡^1, we have ≡ = ≡^1. The reduced machine M' is ({[A], [B], [C]}, {a, b}, δ', [A], {[A]}), where δ' is defined as
        a      b
[A]    [A]    [B]
[B]    [B]    [C]
[C]    [C]    [A]
Here we have chosen [A] to represent the equivalence class {A, F}, [B] to represent {B, E}, and [C] to represent {C, D}. □

2.3.2. The Pumping Lemma for Regular Sets
We shall now derive a characterization of regular sets that will be useful in proving certain languages not to be regular. The next theorem is referred to as a "pumping" lemma, because it says, in effect, that given any regular set and any sufficiently long sentence in that set, we can find a nonempty substring of that sentence which can be repeated as often as we like (i.e., "pumped"), and the new strings so formed will all be in the same regular set. It is often possible to thus derive a contradiction of the hypothesis that the set is regular.

THEOREM 2.7
The pumping lemma for regular sets: Let L be a regular set. There exists a constant p such that if a string w is in L and |w| ≥ p, then w can be written as xyz, where 0 < |y| ≤ p and xy^i z ∈ L for all i ≥ 0.
Proof. Let M = (Q, Σ, δ, q0, F) be a finite automaton with n states such that L(M) = L. Let p = n. If w ∈ L and |w| ≥ n, then consider the sequence of configurations entered by M in accepting w. Since there are at least n + 1 configurations in the sequence, there must be two with the same state among the first n + 1 configurations. Thus we have a sequence of moves such that

(q0, xyz) ⊢* (q1, yz) ⊢^k (q1, z) ⊢* (q2, e)

for some q1 and 0 < k ≤ n, where w = xyz and 0 < |y| ≤ n. But then

(q0, xy^i z) ⊢* (q1, y^i z) ⊢^k (q1, y^{i-1} z) ⊢* ... ⊢* (q1, yz) ⊢^k (q1, z) ⊢* (q2, e)
must be a valid sequence of moves for all i ≥ 0. Since w = xyz is in L, xy^i z is in L for all i ≥ 1. The case i = 0 is handled similarly. □

Example 2.16
We shall use the pumping lemma to show that L = {0^n 1^n | n ≥ 1} is not a regular set. Suppose that L is regular. Then for a sufficiently large n, 0^n 1^n can be written as xyz such that y ≠ e and xy^i z ∈ L for all i ≥ 0. If y ∈ 0+ or y ∈ 1+, then xz = xy^0 z ∉ L. If y ∈ 0+1+, then xyyz ∉ L. We have a contradiction, so L cannot be regular. □
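The construction in the proof of Theorem 2.7 can also be made concrete. The sketch below, under the same illustrative encoding assumptions as before, finds a decomposition w = xyz with the pumping property by locating the first repeated state.

# A sketch of the construction in the proof of Theorem 2.7: locate the
# first repeated state among the configurations and split the accepted
# word w so that x y^i z is accepted for every i >= 0.
# The DFA encoding is an illustrative assumption.

def pump_decomposition(delta, start, w):
    seen = {start: 0}          # state -> position at which it was first entered
    q = start
    for i, a in enumerate(w):
        q = delta[(q, a)]
        if q in seen:          # a state repeats within the first |Q| + 1 steps
            j = seen[q]
            return w[:j], w[j:i + 1], w[i + 1:]   # x, y, z with y nonempty
        seen[q] = i + 1
    return None                # no repetition occurred, so |w| < number of states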
2.3.3. Closure Properties of Regular Sets
We say that a set A is closed under the n-ary operation θ if θ(a1, a2, ..., an) is in A whenever ai is in A for 1 ≤ i ≤ n. For example, the set of integers is closed under the binary operation addition. In this section we shall examine certain operations under which the class of regular sets is closed. We can then use these closure properties to help determine whether certain languages are regular. We already know that if L1 and L2 are regular sets, then L1 ∪ L2, L1L2, and L1* are regular.
A class of sets is a Boolean algebra of sets if it is closed under union, intersection, and complementation. THEOREM 2.8
The class of regular sets included in Σ* is a Boolean algebra of sets for any alphabet Σ.
Proof. We shall show closure under complementation. We already have closure under union, and closure under intersection follows from the set-theoretic law A ∩ B = ¬(¬A ∪ ¬B) (Exercise 0.1.4). Let M = (Q, Δ, δ, q0, F) be any finite automaton with Δ ⊆ Σ. It is easy to show that every regular set L ⊆ Σ* has such a finite automaton. Then the finite automaton M' = (Q, Δ, δ, q0, Q - F) accepts Δ* - L(M). Note that the fact that M is completely specified is needed here. Now the complement of L(M) with respect to Σ* can be expressed as Σ* - L(M) = L(M') ∪ Σ*(Σ - Δ)Σ*. Since Σ*(Σ - Δ)Σ* is regular, the regularity of the complement follows from the closure of regular sets under union. □
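A sketch of the complementation step, under the assumption that the automaton is deterministic and completely specified over the alphabet of interest; the encoding is again illustrative rather than the book's notation.

# A sketch of the construction in Theorem 2.8: complement a completely
# specified deterministic finite automaton by swapping final and
# nonfinal states.  Complete specification is essential here.

def complement(states, alphabet, delta, start, finals):
    return states, alphabet, delta, start, states - finals

# Intersection then follows from De Morgan's law:
# A ∩ B = complement(complement(A) ∪ complement(B)).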
The class of regular sets is closed under reversal.
Proof. Let M = (Q, Σ, δ, q0, F) be a finite automaton defining the regular set L. To define L^R we "run M backward." That is, let M' be the nondeterministic finite automaton (Q ∪ {q0'}, Σ, δ', q0', F'), where F' = {q0} if e ∉ L and F' = {q0, q0'} if e ∈ L.
(1) δ'(q0', a) contains q if δ(q, a) ∈ F.
(2) For all q' in Q and a in Σ, δ'(q', a) contains q if δ(q, a) = q'.
It is easy to show that (q0, w) ⊢* (q, e), where q ∈ F, if and only if (q0', w^R) ⊢* (q0, e). Thus, L(M') = (L(M))^R = L^R. □
The class of regular sets is closed under most common language-theoretic operations. More of these closure properties are explored in the Exercises.
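The reversal construction of Theorem 2.9 is easily rendered as a sketch. The NFA encoding (a transition map returning sets of states) and the fresh start-state name are assumptions of this illustration.

# A sketch of Theorem 2.9: build a nondeterministic automaton for L^R by
# running M backward from a new start state.  The NFA encoding (a dict
# mapping (state, symbol) to a set of states) and the name of the new
# start state are illustrative assumptions.

def reverse(states, alphabet, delta, start, finals, new_start="q0'"):
    ndelta = {(q, a): set() for q in states | {new_start} for a in alphabet}
    for q in states:
        for a in alphabet:
            r = delta[(q, a)]
            ndelta[(r, a)].add(q)              # rule (2): reverse every arc
            if r in finals:
                ndelta[(new_start, a)].add(q)  # rule (1): start where M would finish
    new_finals = {start, new_start} if start in finals else {start}
    return states | {new_start}, ndelta, new_start, new_finals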
2.3.4. Decidable Questions About Regular Sets
We have seen certain specifications for regular sets, such as regular expressions and finite automata. There are certain natural questions concerning these representations that come up. Three questions with which we shall be concerned here are the following:

The membership problem: "Given a specification of known type and a string w, is w in the language so specified?"
The emptiness problem: "Given a specification of known type, does it specify the empty set?"
The equivalence problem: "Given two specifications of the same known type, do they specify the same language?"

The specifications for regular sets that we shall consider are (1) regular expressions, (2) right-linear grammars, and (3) finite automata. We shall first give algorithms to decide the three problems when the specification is a finite automaton.

ALGORITHM 2.3
Decision of the membership problem for finite automata.
Input. A finite automaton M = (Q, Σ, δ, q0, F) and a word w in Σ*.
Output. "YES" if w ∈ L(M), "NO" otherwise.
Method. Let w = a1a2...an. Successively find the states q1 = δ(q0, a1), q2 = δ(q1, a2), ..., qn = δ(q_{n-1}, an). If qn is in F, say "YES"; if not, say "NO." □
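Algorithm 2.3 amounts to running the automaton over w; a sketch, under the same illustrative encoding assumptions as before:

# A sketch of Algorithm 2.3: decide membership by simulating the
# deterministic automaton on w.  Time is linear in |w| once the
# transition table is in hand.

def accepts(delta, start, finals, w):
    q = start
    for a in w:
        q = delta[(q, a)]
    return q in finals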
The correctness of Algorithm 2.3 is too obvious to discuss. However, it is worth discussing the time and space complexity of the algorithm. A natural measure of these complexities is the number of steps and memory cells needed to execute the algorithm on a random access computer in which each memory cell can store an integer of arbitrary size. (Actually there is a bound on the size of integers for real machines, but this bound is so large that we would undoubtedly never come against it for finite automata which we might
reasonably consider. Thus the assumption of unbounded integers is a reasonable mathematical simplification here.) It is easy to see that the time taken is a linear function of the length of w. However, it is not so clear whether or not the "size" of M affects the time taken. We must assume that the actual specification for M is a string of symbols chosen from some finite alphabet. Thus we might suppose that states are named q0, q1, ..., qi, ..., where the integer subscripts are binary numbers. Likewise, the input symbols might be called a1, a2, .... Assuming a normal kind of computer, one could take the pairs in the relation δ and construct a two-dimensional array that in cell (i, j) gives δ(qi, aj). Thus the total time of the algorithm would be an amount proportional to the length of the specification of M to construct the table, plus an amount proportional to |w| to execute the algorithm. The space required is primarily the space required by the table, which is seen to be proportional to the length of M's specification. (Recall that δ is really a set of pairs, one for each pair of a state and input symbol.) We shall now give algorithms to decide the emptiness and equivalence problems when the method of specification is a finite automaton.

ALGORITHM 2.4
Decision of emptiness problem for finite automata.
Input. A finite automaton M = (Q, Σ, δ, q0, F).
Output. "YES" if L(M) ≠ ∅, "NO" otherwise.
Method. Compute the set of states accessible from q0. If this set contains a final state, say "YES"; otherwise, say "NO." □

ALGORITHM 2.5
Decision of the equivalence problem for finite automata.
Input. Two finite automata M1 = (Q1, Σ1, δ1, q1, F1) and M2 = (Q2, Σ2, δ2, q2, F2) such that Q1 ∩ Q2 = ∅.
Output. "YES" if L(M1) = L(M2), "NO" otherwise.
Method. Construct the finite automaton M = (Q1 ∪ Q2, Σ1 ∪ Σ2, δ1 ∪ δ2, q1, F1 ∪ F2).
Using Lemma 2.11, determine whether q1 ≡ q2. If so, say "YES"; otherwise, say "NO." □ We point out that we could also use Algorithm 2.4 to solve the equivalence problem, since L(M1) = L(M2) if and only if
(L(M1) ∩ ¬L(M2)) ∪ (¬L(M1) ∩ L(M2)) = ∅.
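Sketches of the two decision procedures follow. Emptiness is a reachability test, as in Algorithm 2.4; for equivalence, the sketch searches a product automaton for a distinguishing pair of states rather than using the literal construction of Algorithm 2.5, and it assumes both automata are completely specified over the same alphabet. Encodings are illustrative.

# A sketch of Algorithm 2.4: L(M) is nonempty iff a final state is
# reachable from the start state.

def nonempty(alphabet, delta, start, finals):
    reachable, frontier = {start}, [start]
    while frontier:
        q = frontier.pop()
        for a in alphabet:
            r = delta[(q, a)]
            if r not in reachable:
                reachable.add(r)
                frontier.append(r)
    return bool(reachable & finals)

# A product-automaton variant of the equivalence test: a reachable pair
# in which exactly one component is final witnesses a difference.

def equivalent(alphabet, m1, m2):
    (d1, s1, f1), (d2, s2, f2) = m1, m2
    seen, frontier = {(s1, s2)}, [(s1, s2)]
    while frontier:
        p, q = frontier.pop()
        if (p in f1) != (q in f2):
            return False
        for a in alphabet:
            nxt = (d1[(p, a)], d2[(q, a)])
            if nxt not in seen:
                seen.add(nxt)
                frontier.append(nxt)
    return True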
We now turn to the decidability of the membership, emptiness, and equivalence problems for two other representations of regular sets--the regular expressions and right-linear grammars. It is simple to show that for these representations the three problems are also decidable. A regular expression can be converted, by an algorithm which is implicit in Lemmas 2.9 and 2.10, into a finite automaton. The appropriate one of Algorithms 2.3-2.5 can then be applied. A right-linear grammar can be converted into a regular expression by Algorithm 2.1 and the algorithm implicit in Theorem 2.2. Obviously, these algorithms are too indirect to be practical. Direct, fastworking algorithms will be the subject of several of the Exercises. We can summarize these results by the following theorem. THEOREM 2.10 If the method of specification is finite automata, regular expressions, or right-linear grammars, then the membership, emptiness, and equivalence problems are decidable for regular sets. [Z It should be emphasized that these three problems are not decidable for every representation of regular sets. In particular, consider the following example. Example 2.17
We can enumerate the Turing machines. (See the Exercises in Section 0.4.) Let M1, M2, ... be such an enumeration. We can define the integers to be a representation of the regular sets as follows:
(1) If Mi accepts a regular set, then let integer i represent that regular set.
(2) If Mi does not accept a regular set, then let integer i represent {e}.
Each integer thus represents a regular set, and each regular set is represented by at least one integer. It is known that for the representation of Turing machines used here the emptiness problem is undecidable (Exercise 0.4.16). Suppose that it were decidable whether integer i represented ∅. Then it is easy to see that Mi accepts ∅ if and only if i represents ∅. Thus the emptiness problem is undecidable for regular sets when regular sets are specified in this manner. □
EXERCISES
2.3.1. Given a finite automaton with n accessible states, what is the smallest number of states the reduced machine can have ?
2.3.2.
Find the minimum-state finite automaton for the language specified by the finite automaton M = ({A, B, C, D, E, F}, {0, 1}, δ, A, {E, F}), where δ is given by the table below (rows are states, columns are the inputs 0 and 1):
A
B
C
B
E A F D D
F A E F E
C D E F

2.3.3. Show that for all n there is an n-state finite automaton such that ≡^{n-2} ≠ ≡^{n-3}.

2.3.4.
Prove that L(M') = L(M) in Algorithm 2.2.

DEFINITION
We say that a relation R on Σ* is right-invariant if x R y implies xz R yz for all x, y, z in Σ*.
2.3.5.
Show that L is a regular set if and only if L is the union of some of the equivalence classes of a right-invariant equivalence relation R of finite index. Hint: Only if." Let R be the relation x R y if and only if (q0, x)1~ (p, e), (q0, Y) [~ (q, e), and p = q. (That is, x and y take a finite automaton defining L to the same state.) Show that R is a right-invariant equivalence relation of finite index. If: Construct a finite automaton for L using the equivalence classes of R for states. DEFINITION We say that E is the coarsest right-invariant equivalence relation for a language L ~ E* if x E y if and only if for all z ~ E* we find x z ~ L exactly when y z ~ L. The following exercise states that every right-invariant equivalence relation defining a language is always contained in E.
2.3.6.
Let L be the union of some of the equivalence classes of a right-invariant equivalence relation R on Σ*. Let E be the coarsest right-invariant equivalence relation for L. Show that E ⊇ R.
*2.3.7.
Show that the coarsest right invariant equivalence relation for a language is of finite index if and only if that language is a regular set.
2.3.8.
Let M = (Q, Σ, δ, q0, F) be a reduced finite automaton. Define the relation E on Σ* as follows: x E y if and only if (q0, x) ⊢* (p, e), (q0, y) ⊢* (q, e), and p = q. Show that E is the coarsest right-invariant equivalence relation for L(M).
DEFINITION
An equivalence relation R on Σ* is a congruence relation if R is both left- and right-invariant (i.e., if x R y, then wxz R wyz for all w, x, y, z in Σ*).

2.3.9.
Show that L is a regular set if and only if L is the union of some of the equivalence classes of a congruence relation of finite index.
2.3.10.
Show that if M1 and Mz are two reduced finite automata such that L ( M i ) = L(M2), then the transition graphs of M1 and M2 are the same.
"2.3.11.
Show that Algorithm 2.2 is of time complexity n^2. (That is, show that there exists a finite automaton M with n states such that Algorithm 2.2 requires n^2 operations to find the reduced automaton for M.) What is the expected time complexity of Algorithm 2.2?

It is possible to find an algorithm for minimizing the states in a finite automaton which always runs in time no greater than n log n, where n is the number of states in the finite automaton to be reduced. The most time-consuming part of Algorithm 2.2 is the determination of the equivalence classes under ≡ in step 2 using the method suggested in Lemma 2.11. However, we can use the following algorithm in step 2 to reduce the time complexity of Algorithm 2.2 to n log n. This new algorithm refines partitions on the set of states in a manner somewhat different from that suggested by Lemma 2.11. Initially, the states are partitioned into final and nonfinal states. Then, suppose that we have the partition consisting of the set of blocks {π1, π2, ..., π_{k-1}}. A block πi in this partition and an input symbol a are selected and used to refine the partition. Each block πj such that δ(q, a) ∈ πi for some q in πj is split into two blocks πj' and πj'' such that πj' = {q | q ∈ πj and δ(q, a) ∈ πi} and πj'' = πj - πj'. Thus, in contrast with the method in Lemma 2.11, here blocks are refined when the successor states on a given input have previously been shown inequivalent.

ALGORITHM 2.6
Determining the equivalence classes of a finite automaton.
Input. A finite automaton M = (Q, Σ, δ, q0, F).
Output. The indistinguishability classes under ≡.
Method.
(1) Define δ^{-1}(q, a) = {p | δ(p, a) = q} for all q ∈ Q and a ∈ Σ. For each block πi and each a ∈ Σ, let π_{i,a} = {q | q ∈ πi and δ^{-1}(q, a) ≠ ∅}.
(2) Let π1 = F and π2 = Q - F.
(3) For all a ∈ Σ, define the index set
I(a) = {1} if #π_{1,a} ≤ #π_{2,a}, and {2} otherwise.
(4) Set k = 3.
(5) Select a ∈ Σ and i ∈ I(a). [If I(a) = ∅ for all a ∈ Σ, halt; the output is the set {π1, π2, ..., π_{k-1}}.]
(6) Delete i from I(a).
(7) For all j < k such that there is a state q ∈ πj with δ(q, a) ∈ πi, do steps 7(a)-7(d):
(a) Let πj' = {q | δ(q, a) ∈ πi and q ∈ πj} and let πj'' = πj - πj'.
(b) Replace πj by πj' and let πk = πj''. Construct new π_{j,a} and π_{k,a} for all a ∈ Σ.
(c) For all a ∈ Σ, modify I(a) as follows:
I(a) = I(a) ∪ {j} if j ∉ I(a) and 0 < #π_{j,a} ≤ #π_{k,a}; otherwise I(a) = I(a) ∪ {k}.
(d) Set k = k + 1.
(8) Go to step (5).
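A simplified sketch of the splitter-driven refinement follows. It captures the idea of Algorithm 2.6 but deliberately omits the index-set bookkeeping that selects the smaller half, so it does not by itself achieve the n log n bound; the encoding is an illustrative assumption.

# A simplified, worklist-based sketch of splitter-driven refinement in
# the spirit of Algorithm 2.6.  Each split pushes both halves back onto
# the worklist, which is correct but not the n log n strategy.

def refine(states, alphabet, delta, finals):
    partition = [b for b in (set(finals), set(states) - set(finals)) if b]
    worklist = [(frozenset(b), a) for b in partition for a in alphabet]
    while worklist:
        splitter, a = worklist.pop()
        preimage = {q for q in states if delta[(q, a)] in splitter}
        new_partition = []
        for block in partition:
            inside, outside = block & preimage, block - preimage
            if inside and outside:
                new_partition.extend([inside, outside])
                for c in alphabet:
                    worklist.append((frozenset(inside), c))
                    worklist.append((frozenset(outside), c))
            else:
                new_partition.append(block)
        partition = new_partition
    return partition      # the indistinguishability classes under ≡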
Apply Algorithm 2.6 to the finite automata in Example 2.15 and Exercise 2.3.2.
2.3.13.
Prove that Algorithm 2.6 correctly determines the indistinguishability classes of a finite automaton.
*'2.3.14.
Show that Algorithm 2.6 can be implemented in time n log n.
2.3.15.
Show that the following are not regular sets:
(a) {0^n 1 0^n | n ≥ 1}.
(b) {ww | w is in {0, 1}*}.
(c) L(G), where G is defined by the productions S → aSbS | e.
(d) {a^{n^2} | n ≥ 1}.
(e) {a^p | p is a prime}.
(f) {w | w is in {0, 1}* and w has an equal number of 0's and 1's}.
2.3.16.
Let f(m) be a monotonically increasing function such that for all n there exists m such that f(m + 1) > f(m) + n. Show that {a^{f(m)} | m ≥ 1} is not regular.

DEFINITION
Let L1 and L2 be languages. We define the following operations:
(1) L1/L2 = {w | for some x ∈ L2, wx is in L1}.
(2) INIT(L1) = {w | for some x, wx is in L1}.
(3) FIN(L1) = {w | for some x, xw is in L1}.
(4) SUB(L1) = {w | for some x and y, xwy is in L1}.
(5) MIN(L1) = {w | w ∈ L1 and for no proper prefix x of w is x ∈ L1}.
(6) MAX(L1) = {w | w ∈ L1 and for no x ≠ e is wx ∈ L1}.
Example 2.18
Let L1 = {0^n 1^n 0^m | n, m ≥ 1} and L2 = 1*0*. Then
L1/L2 = L1 ∪ {0^i 1^j | i ≥ 1, j ≤ i}.
L2/L1 = ∅.
INIT(L1) = L1 ∪ {0^i 1^j | i ≥ 1, j ≤ i} ∪ 0*.
FIN(L1) = {0^i 1^j 0^k | k ≥ 1, j ≥ 1, i ≤ j}.
MAX(L1) = ∅. □

*2.3.17. Let L1 and L2 be regular. Show that the following are regular:
(a) L1/L2.
(b) INIT(L1).
(c) FIN(L1).
(d) SUB(L1).
(e) MIN(L1).
(f) MAX(L1).
"2.3,18,
Let L1 be a regular set and L2 an arbitrary language. Show that L1/L2 is regular. Does there exist an algorithm to find a finite automaton for L1/L2 given one for L1?

DEFINITION
The derivative D_x α of a regular expression α with respect to x ∈ Σ* can be defined recursively as follows:
(1) D_e α = α.
(2) For a ∈ Σ,
(a) D_a ∅ = ∅.
(b) D_a e = ∅.
(c) D_a b = ∅ if a ≠ b, and D_a b = e if a = b.
(d) D_a(α + β) = D_a α + D_a β.
(e) D_a(αβ) = (D_a α)β if e ∉ α, and D_a(αβ) = (D_a α)β + D_a β if e ∈ α.
(f) D_a α* = (D_a α)α*.
(3) For a ∈ Σ and x ∈ Σ*, D_{ax} α = D_x(D_a α).
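The derivative definition above translates directly into a recursive procedure. The tuple representation of regular expressions used below is an assumption made for the illustration; nullable(r) tests whether e is in the set denoted by r.

# A sketch of the derivative definition.  Regular expressions are
# nested tuples, e.g. ("cat", ("sym", "1"), ("star", ("sym", "0")));
# this encoding is an illustrative assumption.

EMPTY, EPS = ("empty",), ("eps",)

def nullable(r):
    tag = r[0]
    if tag == "eps" or tag == "star":
        return True
    if tag in ("empty", "sym"):
        return False
    if tag == "alt":
        return nullable(r[1]) or nullable(r[2])
    return nullable(r[1]) and nullable(r[2])      # "cat"

def deriv(a, r):
    tag = r[0]
    if tag in ("empty", "eps"):
        return EMPTY
    if tag == "sym":
        return EPS if r[1] == a else EMPTY
    if tag == "alt":
        return ("alt", deriv(a, r[1]), deriv(a, r[2]))
    if tag == "cat":
        first = ("cat", deriv(a, r[1]), r[2])
        return ("alt", first, deriv(a, r[2])) if nullable(r[1]) else first
    return ("cat", deriv(a, r[1]), r)             # "star": D_a(r*) = (D_a r) r*

def matches(r, w):
    # D_x extended to strings, as in part (3) of the definition.
    for a in w:
        r = deriv(a, r)
    return nullable(r)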
2.3.19.
Show that if α = 10*1, then
(a) D_e α = 10*1.
(b) D_0 α = ∅.
(c) D_1 α = 0*1.
Show that if ~ is a regular expression that denotes the regular set R, then Dx0c denotes x \ R = (w[ xw ~ R}.
*'2.3.21.
Let L be a regular set. Show that { x l x y ~ L for some y such that Ixl = [y I} is regular. A generalization of Exercise 2.3.21 is the following.
**2.3.22.
Let L be a regular set and f ( x ) a polynomial in x with nonnegative integer coefficients. Show that
EXERCISES
137
[w[ wy ~ L for some y such that [y[ = f(I w 1)] is regular. *2.3.23.
Let L be a regular set and h a homomorphism. Show that h(L) and h-I(L) are regular sets.
2.3.24.
Prove the correctness of Algorithms 2.4-2.5.
2.3.25.
Discuss the time and space complexity of Algorithms 2.4 and 2.5.
2.3.26.
Give a formal proof of Theorem 2.9. N o t e that it is not sufficient to show simply that, say, for every regular expression there is a finite automaton accepting the set denoted thereby. One must show that there is an algorithm to construct the automaton from the regular expression. See Example 2.17 in this connection.
*2.3.27.
Give an efficient algorithm to minimize the number of states in an incompletely specified deterministic finite automaton.
2.3.28.
Give efficient algorithms to solve the membership, emptiness, and equivalence problems for (a) Regular expressions. (b) Right-linear grammars. (c) Nondeterministic finite automata.
**2.3.29.
Show that the membership and equivalence problems are undecidable for the representation of regular sets given in Example 2.17.
*2.3.30.
Show that the question "Is L(M) infinite?" is decidable for finite automata. Hint: Show that L(M) is infinite, for an n-state finite automaton M, if and only if L(M) contains a word w such that n ≤ |w| < 2n.
where S' is a new symbol, and let N' = N ∪ {S'}. Otherwise, let N' = N and S' = S.
(3) Let G' = (N', Σ, P', S'). □
Example 2.23
Consider the grammar of Example 2.19 with productions

S → aSbS | bSaS | e

Applying Algorithm 2.10 to this grammar, we would obtain the grammar with the following productions:

S' → S | e
S → aSbS | bSaS | aSb | abS | ab | bSa | baS | ba
Proof
By inspection, G' of Algorithm 2.10 is e-free. To prove that
SEC. 2.4
CONTEXT-FREE LANGUAGES
149
L(G) -- L(G'), we can prove the following statement by induction on the length of w" (2.4.3)
A~
w if and only if w =/= e and A ~ G'
w G
The proof of (2.4.3) is left for the Exercises. Substituting S for A in (2.4.3), we see that for w =/= e, w ~ L(G) if and only if w ~ L(G'). The fact that e ~ L(G) if and only if e ~ L(G') is evident. Thus, L(G) -- L(G'). [--7 Another transformation on grammars which we find useful is the removal of productions of the form A --. B, which we shall call single productions. ALGORITHM 2.11 Removal of single productions.
Input. An e-free CFG G = (N, Σ, P, S).
Output. An equivalent e-free CFG G' with no single productions.
Method.
(1) Construct for each A in N the set N_A = {B | A ⇒* B} as follows:
(a) Let N_0 = {A} and set i = 1.
(b) Let N_i = {C | B → C is in P and B ∈ N_{i-1}} ∪ N_{i-1}.
(c) If N_i ≠ N_{i-1}, set i = i + 1 and repeat step (b). Otherwise, let N_A = N_i.
(2) Construct P' as follows: If B → α is in P and is not a single production, place A → α in P' for all A such that B ∈ N_A.
(3) Let G' = (N, Σ, P', S). □
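A sketch of Algorithm 2.11 follows. The grammar encoding (a map from each nonterminal to a set of right-side tuples) is an assumption of this illustration.

# A sketch of Algorithm 2.11: remove single productions A -> B from an
# e-free grammar.  The encoding of productions is an illustrative
# assumption.

def remove_single_productions(nonterminals, productions):
    # Step (1): for each A, compute N_A = {B | A =>* B using single productions}.
    chain = {}
    for A in nonterminals:
        closure, frontier = {A}, [A]
        while frontier:
            B = frontier.pop()
            for rhs in productions[B]:
                if len(rhs) == 1 and rhs[0] in nonterminals and rhs[0] not in closure:
                    closure.add(rhs[0])
                    frontier.append(rhs[0])
        chain[A] = closure

    # Step (2): add A -> alpha for every non-single production B -> alpha
    # with B in N_A.
    new_productions = {A: set() for A in nonterminals}
    for A in nonterminals:
        for B in chain[A]:
            for rhs in productions[B]:
                is_single = len(rhs) == 1 and rhs[0] in nonterminals
                if not is_single:
                    new_productions[A].add(rhs)
    return new_productions

# For the grammar G0 of Example 2.24 below, chain[E] = {E, T, F}, and the
# E-productions become E -> E+T | T*F | (E) | a, as in that example.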
Let us apply Algorithm 2.11 to the grammar G O with productions
E--+E+
TIT
T
~T,FIF
F-
~. ( E ) l a
In step (1), N~ = (E, T, F}, N r = {T, F}, N~ = {F]. After step (2), P' becomes
E
, E-t- T I T , F I ( E ) I a
T
,T,FI(E)la
F
, (E)la
[Z]
THEOREM 2.15 In Algorithm 2.11, G' has no single productions, and L(G) = L(G').
-
?
:
-
150
ELEMENTSOF LANGUAGE THEORY
CHAP. 2
Proof By inspection, G' has no single productions. We shall that L(G') c__L(G). Let w be in L(G'). Then there exists in G' a S - 0% =~ 0¢1 =~ . . . => ~ , - - w. If the p r o d u c t i o n applied going ~,.+~ is A - - ~ fl, then there is some B in N (possibly, A - B)
first show derivation from ~; to such that
A =~ B and B = ~ ft. Thus, A =~ fl and ~z,. =~ ~,.+ 1. It follows that S =~ w, and O
G
O
G
O
w is in L(G). Thus, L(G') ~ L(G). To show that L(G') --L(G), we must show that L(G) c__L(G'). Thus let w be in L(G) and S -- ~0 =~ ~ =~ " " =~ ~, -- w be a leftmost derivation Im
lm
lm
o f w in G. We can find a sequence of subscripts il, iz,. • •, ik consisting of exactly those j such that ~j_ 1 =~ ~j by an application of a p r o d u c t i o n other lm
than a single production. In particular, since the derivation of a terminal string cannot end with a single production, i k -~ n. Since the derivation is leftmost, consecutive uses of single productions replace the symbol at the same position in the left sentential forms involved. Thus we see that S ==~ ~;, z > 0c;~=~ . . . = ~ 0~;~-- w. Thus, w is in L(G'). G"
O'
G'
G'
We conclude that L(G') -- L(G). DEFINITION
A C F G G -- (N, Z, P, S) is said to be cycle-free if there is no derivation +
of the form A =~ A for any A in N. G is said to be proper if it is cycle-free, is e-free, a n d has no useless symbols. G r a m m a r s w h i c h have cycles or e-productions are sometimes more difficult to parse than g r a m m a r s which are cycle-free and e-free. In addition, in any practical situation useless symbols increase the size o f a parser unnecessarily. T h r o u g h o u t this b o o k we shall a s s u m e a g r a m m a r has no useless symbols. F o r some of the parsing algorithms to be discussed in this b o o k we shall insist that the g r a m m a r at h a n d be proper. The following t h e o r e m shows that this requirement still allows us to consider all context-free languages. THEOREM 2.16 If L is a CFL, then L = L(G) for some proper C F G G.
Proof Use Algorithms 2.8-2.11.
[~]
DEFINITION
An A-production in a C F G is a p r o d u c t i o n of the form A ~ 0~ for some 0c. ( D o not confuse an " A - p r o d u c t i o n " with an "e-production," which is one o f the form B ~ e.) Next we introduce a transformation which can be used to eliminate from a g r a m m a r a p r o d u c t i o n of the form A ~ o~Bfl. To eliminate this p r o d u c t i o n we must add to the g r a m m a r a set of new productions formed by replacing the nonterminal B by all right sides of B-productions.
SEC. 2.4
CONTEXT-FREE LANGUAGES
151
LEMMA 2.14 Let G = (N, Z, P, S) be a C F G and A ---~ ~Bfl be in P for some B ~ N and 0~ and fl in (N U Z)*. Let B ---~ ~'1 I?z[ "'" [~'k be all the B-productions in P. Let G ' = (N, Z , P ' , S ) w h e r e
Then L(G) = L(G'). Proof. Exercise. Example 2.25
Let us replace the production A ---~ aAA in the grammar G having the two productions A ~ aAA[b. Applying Lemma 2.14, assuming that ~ = a, B = A, and fl = A, we would obtain G' having productions A ---~ a a A A A l a b A l b .
Derivation trees corresponding to the derivations of aabbb in G and G' are shown in Fig. 2.12(a) and (b). Note that the effect of the transformation is to "merge" the root of the tree in Fig. 2.12(a) with its second direct descendant. [~
a/!~A~b
LI
b
b
(a)
(b)
In G
In G' Fig. 2.12
2.4.3.
I ! I
b
b
b
Derivation trees in G and G'.
Chomsky Normal Form
DEFINITION
A C F G G = (N, Z, P, S) is said to be in Chomsky normal f o r m (CNF) if each production in P is of one of the forms (1) A ~ B C with A, B, and C in N, or (2) A ~ a w i t h a E Z, or
152
ELEMENTSOF LANGUAGE THEORY
CHAP. 2
(3) If e ~ L(G), then S--~ e is a production, and S does not appear on the right side of any production. We shall show that every context-free language has a Chomsky normal form grammar. This result is useful in simplifying the notation needed to represent a context-free language. ALGORITHM 2.12 Conversion to Chomsky normal form. Input. A proper C F G G = (N, 2;, P, S) with no single productions. Output. A C F G G' in CNF, with L(G) -- L(G'). Method. From G we shall construct an equivalent C N F grammar G' as follows. Let P' be the following set of productions:
(1) Add each production of the form A --~ a in P (2) Add each production of the form A --~ B C in (3) If S --, e is in P, add S --, e to P'. (4) For each production of the form A ---, Xi . . add to P' the following set of productions. We let X', N, and let X',. be a new nonterminal if X; is in E.
to P'. P to P'. X k in P, where k > 2, stand for X~. if X,. is in
A
> X ' i ( X z . . . Xk>
< X , . . . X,>
>x~
> x;_~ > X~_lX;
where each ( X i .-- Xk) is a new nonterminal symbol. (5) For each production of the form A --> X 1 X 2, where either X1 or X 2 or both are in E, add to P' the production A --> X'iX'z. (6) For each nonterminal of the form a' introduced in steps (4) and (5), add to P' the production a' ---~ a. Finally, let N' be N together with all new nonterminals introduced in the construction of P'. Then our desired grammar is G' -- (N', E, P', S). [~ THEOREM 2.17 Let L be a CFL. Then L -- L(G) for some C F G G in Chomsky normal form.
Proof. By Theorem 2.16, L has a proper grammar. The grammar G' of Algorithm 2.12 is clearly in CNF. It suffices to show that in Algorithm 2.12, L ( G ) - L(G'). This statement follows by an application of Lemma 2.14 to
SEC. 2.4
CONTEXT-FREE LANGUAGES
15:3
each production of G' with a nonterminal a', and then to each production with a nonterminal of the form (Xt - . . Xj). The resulting grammar will b e G . [Z] Example 2.26
Let G be the proper C F G defined by S
> aABIBA
A .
> B B B [a
B
> ASIb
We construct P' in Algorithm 2.12 by retaining the productions S ---~ BA, A ~ a, B ~ A S , and B ~ b. We replace S ~ a A B by S ~ a ' ( A B ) and ( A B ) ~ AB. A ~ B B B is replaced by A ---~ B ( B B ) and ( B B ) ~ BB. Finally, we add a'--~ a. The resulting grammar is G ' = (N', {a, b}, P', S), where N ' = [S, A, B, ( A B ) , ( B B ) , a'} and P' consists of S-
> a ' ( A B ) I BA
A
>
B
>ASIb
(AB>
, AB
, nB > a [Z]
a' 2.4.4.
B < B B ) Ia
Greibach Normal Form
We next show that it is possible to find for each CFL a grammar in which. every production has a right side beginning with a terminal. Central to the construction is the idea of left recursion and its elimination. DEFINITION
A nonterminal A in C F G G = (N, ~, P, S) is said to be recursive if +
~Afl for some ~ and ft. If ~ = e, then A is said to be left-recursive. Similarly, if fl = e, then A is right-recursive. A grammar with at least one left(right-) recursive nonterminal is said to be left- (right-) recursive. A grammar in which all nonterminals, except possibly the start symbol, are recursive is said to be recursive. A ~
Certain of the parsing algorithms which we shall discuss do not work with left-recursive grammars. We shall show that every context-free language has at least one non-left-recursive grammar. We begin by showing how to eliminate immediate left recursion from a CFG.
154
ELEMENTSOF LANGUAGE THEORY
CHAP. 2
LEMMA 2.15 Let G = (N, Z, P, S) be a C F G in which
a ~ A ~ IA~2i . . . I A ~ IP~ I/~ I''' I/L are all the A-productions in P and no fl~ begins with A. Let G ' = (N U [A'}, Z, P', S), where P ' is P with these productions replaced by
A A'
>P~I/~I'" IP, lP~A'I/~A'I... I/LA' >~1~!"" I~l~a'l~a'l,., IO~mA'
A' is a new nonterminal not in N.t Then L(G') = L(G).
Proof In G, the strings which can be derived leftmost from A using only A-productions are seen to be exactly those strings in the regular set (Pl -q- P2 + " ' " "2U P,)(O~I + ~2 + " ' ' -3U ~m)*" These are exactly the strings which can be derived rightmost from A using one A-production and some number of A'-productions of G'. (The resulting derivation is no longer leftmost.) All steps of the derivation in G that do not use an A-production can be done directly in G', since the non-A-productions of G and G' are the same. We conclude that w is in L(G) and that L(G') ~ L(G). For the converse, essentially the same argument is used. The derivation in G' is taken to be rightmost, and sequences of one A-production and any number of A'-productions are considered. Thus, L ( G ) = L(G'). D The effect of the transformation in Lemma 2.15 on derivation trees is shown in Fig. 2.13. Example 2.27
Let Go be our usual grammar with productions
E
>E+TIT
T
>T , F I F
F
> (E) la
The grammar G' with productions
E
> TITE'
E'
>+ TI+TE'
T
> FIFT'
tNote that the A ~ p{s are in the initial and final sets of A-productions.
CONTEXT-FREE LANGUAGES
SEC. 2.4
A
/\ /\ A
.,4
A
/\ A' /\ /\ /\.
Oql
°q2
~ik
A'
Oqk- I
A"
/\ /
A
155
A"
*°
A'
aik
/\ oq2
A'
\ oq 1
(a) Portion of tree in G
(b) Corresponding portion in G'
Fig. 2.13 Portions of trees.
T'-
> ,F[,FT'
F
> (E)la
is equivalent to Go and is the one obtained by applying the construction in Lemma 2.15 with A -- E and then A -- T. [-] We are now ready to give an algorithm to eliminate left recursion from a proper CFG. This algorithm is similar in spirit to the algorithm we used to solve regular expression equations. i
ALGORITHM 2.13 Elimination of left recursion.
Input. A proper C F G G -- (N, X, P, S). Output. A C F G G' with no left recursion. Method. (1) Let N = [ A I , . . . , A,}. We shall first transform G so that if At ---~ is a production, then a begins either with a terminal or some Aj such that j > i. For this purpose, set i = 1. (2) Let the Arproductions be A~--~ Atal 1 " " I AtO~m fl~ I " "lflp, where no ,81 begins with A k if k < i. (It will always be possible to do this.) Replace these Arproductions by
A,
> l~,i... IB~IP, A',I"" llbA~
156
ELEMENTS OF L A N G U A G E THEORY
CHAP. 2
where A'~ is a new variable. All the At-productions now begin with a terminal or Ak for some k > i. (3) If i = n, let G' be the resulting grammar, and halt. Otherwise, set i=i+ 1 and j = 1. (4) Replace each production of the form A t ---~ A~0~, by the productions A,---> fll0~! " " [ flm0~, where Aj ---~ fl~ ] . . . [tim are all the Approductions. It will now be the case that all Afproductions begin with a terminal or A k, for k > j, so all At-productions will then also have that property. (5) If j = i - 1, go to step (2). Otherwise, setj =j-Jr- 1 and go to step (4).
D THEOREM 2.18 Every CFL has a hon-left-recursive grammar. Proof. Let G be a proper grammar for C F L L. If we apply Algorithm 2.13, the only transformations used are those of Lemmas 2.14 and 2.15. Thus the resulting G' generates L.
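The following sketch renders Algorithm 2.13 in the usual textbook form: productions Ai → Aj γ with j < i are expanded, and immediate left recursion is then removed as in Lemma 2.15. The grammar encoding, and the convention of naming the new nonterminal by appending a prime, are assumptions of this illustration, not the book's notation.

# A sketch of left-recursion elimination in the style of Algorithm 2.13,
# applied to a proper grammar.  Nonterminals are strings; productions
# map each nonterminal to a list of right sides (lists of symbols).

def eliminate_left_recursion(order, productions):
    prods = {A: [list(rhs) for rhs in productions[A]] for A in order}
    for i, Ai in enumerate(order):
        # Expand Ai -> Aj gamma for every j < i.
        for Aj in order[:i]:
            expanded = []
            for rhs in prods[Ai]:
                if rhs and rhs[0] == Aj:
                    expanded.extend([beta + rhs[1:] for beta in prods[Aj]])
                else:
                    expanded.append(rhs)
            prods[Ai] = expanded
        # Remove immediate left recursion on Ai as in Lemma 2.15.
        recursive = [rhs[1:] for rhs in prods[Ai] if rhs and rhs[0] == Ai]
        others = [rhs for rhs in prods[Ai] if not rhs or rhs[0] != Ai]
        if recursive:
            Ai_prime = Ai + "'"      # assumed-fresh nonterminal name
            prods[Ai] = others + [beta + [Ai_prime] for beta in others]
            prods[Ai_prime] = recursive + [alpha + [Ai_prime] for alpha in recursive]
    return prods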
We must show that G' is free of left recursion. The following two statements are proved by induction on a quantity which we shall subsequently define: (2.4.4)
After step (2) is executed for i, all At-productions begin with a terminal or A k, for k > i
(2.4.5)
After step (4) is executed for i and j, all At-productions begin with a terminal or A u, for k > j
We define, the score of an instance of (2.4.4) to be ni. The score of an instance of (2.4.5) is ni + j. We prove (2.4.4) and (2.4.5) by induction on the score of an instance of these statements. Basis (score of n)" Here i = 1 and j = 0. The only instance is (2.4.4) with i = 1. None of ill,- • . , ft, in step (2) can begin with A1, so (2.4.4) is immediate if i = 1. Induction. Assume (2.4.4) and (2.4.5) for scores less than s, and let i and j be such that 0 < j < i ~ n and ni + j = s. We shall prove this instance of (2.4.5). By inductive hypothesis (2.4.4), all Aj-productions begin with a terminal or A k, for k > j. [This follows because, if j > 1, the instance of (2.4.5) with parameters i and j -- 1 has a score lower than s. The case j = 1 follows from (2.4.4).] Statement (2.4.5) with parameters i and j is thus immediate from the form of the new productions. An inductive proof of (2.4.4) with score s (i.e., ni = s, j = 0) is left for the Exercises. It follows from (2.4.4) that none of A 1. . . . , A, could be left-recursive. +
Indeed, if A t ==> At0c for some 0~, there would have to be Aj and A k with lm
k < j such that A~ =-~ Ajfl ==~ Ak? =-~ Aloe. We must now show that no A't lm
lm
lm
CONTEXT-FREE LANGUAGES
SEC. 2,4
157
introduced in step (2) can be left-recursive. This follows immediately from the fact that if A~ ~ A)~, is a production created in step (2), then j < i since A~ is introduced after A). Example 2.28
Let G be
A
~ BCIa
B----> CA IAb C
> AB[ CC[ a
We take A t = A, A z = B, and A 3 = C. The grammar after each application of step (2) or step (4) of Algorithm 2.13 is shown below. At each step we show only the new productions for nonterminals whose productions change. Step (2) with i -- 1: no change. Step (4) with i - - 2, j - -
1: B ----, CA]BCb]ab
Step (2) with i = 2 : B --~ CA[ab[CAB'IabB' B' --~ CbB' l Cb Step (4) with i = 3 , j = 1: C ~
BCB]aB]CC]a
Step (4) with i = 3, j = 2: C ~ CA CBI abCB[ CAB' CBI abB'CB[aB[ CCI a Step (2) with i = 3: C ---~ abCBI abB' CBI aBI a l abCBC' [abB'CBC' IaBC'I aC'
c' ~
A CBC' IAB' CBC' I CC' IA CB IAB' CB I C
An interesting special case of non-left recursiveness is Greibach no[mal form. DEFINITION
A C F G G = (N, E, P, S) is said. to be in Greibach normal form ( G N F ) if G is e-free and each non-e-production in P is of the form A --~ aa with a ~ E a n d a ~ N*. If a grammar is not left-recursive, then we can find a natural partial order on the nonterminals. This partial order can be embedded in a linear order which is useful in putting a grammar into Greibach normal form. LEMMA 2.16 Let G = (N, ~, P, S) be a non-left-recursive grammar. Then there is a linear order < on N such that if A --~ Ba is in P, then A < B. +
Proof. Let R be the relation A R B if and only if A =~ Ba for some a.
158
ELEMENTSOF LANGUAGE THEORY
CHAP. 2
By definition of left recursion, R is a partial order. (Transitivity is easy to show.) By Algorithm 0.1, R can be extended to a linear order < with the desired property. D
ALGORITHM 2.14 Conversion to Greibach normal form.
Input. A non-left-recursive proper C F G G = (N, E, P, S). Output. A grammar G' in G N F such that L(G) = L(G'). Method. (1) Construct, by Lemma 2.16, a linear order < on N such that every A-production begins either with a terminal or some nonterminal B such that A < B. Let N -- [ A ~ , . . . , A,}, so that A~ < A2 . . . < A,. (2) Set i = n - 1. (3) If i -- 0, go to step (5). Otherwise, replace each production of the form A, ~ Ajoc, where j > i, by A~ ---, fl~0~[ . . . [fl,,~z, where Aj ---, ,8~ ] . . . l fl,, are all the Alproductions. It will be true that each of , 8 ~ , . . . , tim begins with a terminal. (4) Set i = i - 1 and return to step (3). (5) At this point all productions (except possibly S----. e) begin with a terminal. For each production, say A - - . a X e . . . Xk, replace those Xj which are terminals by X'j, a new nonterminal. (6) For all X~. introduced in step (5), add the production X~ --~ Xj. [-] THnOREM 2.19 If L is a CFL, then L -- L(G) for some G in GNF.
Proof A straightforward induction on n -- i (that is, backwards, starting at i - - n 1 and finishing at i = 1) shows that after applying step (3) of Algorithm 2.13 for i all At-productions begin with a terminal. The property of the linear order < is crucial here. Step (5) puts the grammar into GNF, and by Lemma 2.14 does not change the language generated. [--] Example 2.29 Consider the grammar G with productions
E
>TITE'
E'
> +T] + T E '
T
>FIFT'
T'
> ,F] ,FT'
F
> (E) ia
Take E' < E < T' < T < F as the linear order on nonterminals.
SEC. 2.4
CONTEXT-FREE LANGUAGES
159
All F-productions begin with a terminal, as they must, since F is highest in the order. The next highest symbol, T, has productions T---, F[ FT', so we substitute for F in both to obtain T ~ ( E ) l a [(E)T' [aT'. Proceeding to T', we find no change necessary. We then replace the E-productions by E---, ( E ) I a I ( E ) T ' I a T ' I ( E ) E ' I a E ' I ( E ) T ' E ' I a T ' E ' . No change for E' is necessary. Steps ( 5 ) a n d (6) introduce a new nonterminal )' and a production )'--~ ). All instances of ) are replaced by )', in the previous productions. Thus the resulting G N F grammar has the productions g E' T T'
r
>(E)'IaI(E)'T'IaT'I(E)'E'IaE'I(E)' T'E']aT'E' ~ +TI +TE' ~(E)'IaI(E)'T'IaT' ~ , F I,FT'
~ (E)' Ia
)' ---~) One undesirable aspect of using this technique to put a grammar into G N F is the large number of new productions created. The following technique can be used to find a G N F grammar without introducing too many new productions. However, this new method may introduce more nonterminals. 2.4.5.
An Alternative Method of Achieving Greibach Normal Form
There is another way to obtain a grammar in which each production is of the form A ~ a~. This technique requires the grammar to be rewritten only once. Let G = (N, Z, P, A) be a C F G which contains no e-productions (not even A - ~ e) and no single productions. Instead of describing the method in terms of the set of productions we shall use a set of defining equations, of the type introduced in Section 2.2.2, to represent the productions. For example, the set of productions A
>AaBIBB[b
B
~ aA [ B A a t B d l c
can be represented by the equations A = A aB q-- BB -+- b
(2.4.6)
B = a A -Jr-B A a -at- B d -+- c
where A and B are now indeterminates representing sets.
160
CHAP. 2
ELEMENTSOF LANGUAGE THEORY
DEFINITION
Let A and X be two disjoint alphabets. A set of defining equations over and A is a set of equations of the form A . = o~ + ~2 + "'" + OCk,where A ~ A and each 0~ is a string in (A u X)*. If k = 0, the equation is taken to be A = ~ . There is one equation for each A in A. A solution to the set of defining equations is a function f from A to 6'(Z*) such that i l l ( A ) is substituted everywhere for A, for each A ~ A, then the equations become set equalities. We say that solution f is a minimalfixed point ill(A) ~ g(A) for all .4 ~ A and solutions g. We define a C F G corresponding to a set of defining equations by creating the productions A --~ 0~ [0c2[ . . . 10Okfor each equation A = ~i + " " + ~zk. The nonterminals are the symbols in A. Obviously, the correspondence is one-to-one. We shall state some results about defining equations that are generalizations of the results proved for standard form regular expression equations (which are a special case of defining equations). The proofs are left for the Exercises. LEMMA 2 . 1 7
The minimal fixed point of a set of defining equations over A and X is unique and is given by f ( A ) = [wlA ==>w with w ~ X*}, where G is the G
corresponding CFG.
Proof. Exercise.
[Z]
We shall employ a matrix notation to represent defining equations. Let us assume that A = [A 1, A 2 , . . . , A,}. The matrix equation
d = d R + _B represents n equations. Here _A is the row vector [A 1, A2, .. •, A,], R is an n × n matrix whose entries are regular expressions, and _B is a row vector consisting of n regular expressions. We take "scalar" multiplication to be concatenation, and scalar addition to be + (i.e., union). Matrix and vector addition and multiplication are defined as in the usual (integer, real, etc.) case. We let the regular expression in row j, column i of R be ~ + . . . + ~k, if A j ~ i , . . . , Aimk are all terms with leading symbol A~ in the equation for A~. We let the jth component of B be those terms in the equation for A j which begin with a symbol of X. Thus, Bj and Rio are those expressions such that the productions for A~ can be written as
A j - - A t R l j -+- A2R2s + . . . + AtRtj + . . . -+- A,,R,j + B i where Bj is a sum of expressions beginning with terminals. Thus the defining equations (2.4.6) would be written as
SEC. 2.4
(2.4.7)
CONTEXT-FREE LANGUAGES
[A, B] = [A, B] IaBB
161
Aa f~+ d l + [ b , a A + c ]
We shall now find an equivalent set of defining equations for d =
d R + _B such that the new set of defining equations corresponds to a set of productions all of whose right sides begin with a terminal symbol. The transformation turns on the following observation. LEMMA 2.18 Let A = d R ÷ B be a set of defining equations. Then the minimal fixed point is d = BR*, where R* = I + R ÷ R 2 + R 3 + . . . . I is an identity matrix (e along the diagonal and ~ elsewhere), R 2 = RR, R a = RRR, and so forth.
Proof. Exercise. If we let R ÷ = RR*, then we can write the minimal fixed point of the equations _A = dR + _B as _A = _B(R+ + I ) = _BR+ -+- BI = _BR+ + _B. Unfortunately, we cannot find a corresponding grammar for these equations; they are not defining equations, as the elements of R ÷ may be infinite sets of terms. However, we can replace R ÷ by a new matrix of "unknowns." That is, we can replace R ÷ by a matrix Q with qt~ as a new symbol in row i, column j. We can then obtain equations for the q~j's by observing that R + = RR + + R. Thus, Q = RQ ÷ R is a set of defining equations for the q~j's. Note that there are n 2 equations if Q and R are n × n matrices. The following lemma relates the two sets of equations. LEMMA 2.19 Let d = dR + _Bbe a set of defining equations over A and X. Let Q be a matrix of the size of R such that each component of Q is a unique new symbol. Then the system of defining equations represented by d = _BQ + _B and Q = R Q + R has a minimal fixed point which agrees on A with that of
d = dR + _B. Proof. Exercise.
E] We now give another algorithm to convert a proper grammar to GNF.
ALCORITHM 2.15 Conversion to Greibach normal form.
Input. A proper grammar G = (N, X, P, S) such that S ~ e is not in P. Output. A grammar G' ---- (N', X, P', S) in G N F . Method. (1) F r o m G, write the corresponding set of defining equations A = _AR + _B over N and X.
162
ELEMENTS OF LANGUAGE THEORY
CHAP. 2
(2) Let Q be an n x n matrix of new symbols, where @N = n. Construct the new set of defining equations d = _BQ-Jr _B, Q = RQ q-R, and let G 1 be the corresponding grammar. Since every term in B begins with a terminal, all A-productions of G 1, for A ~ N, will begin with terminals. (3) Since G is proper, e is not a coefficient in R. Thus each q-production of G1, where q is a component of Q, begins with a symbol in N u Z. Replace each leading nonterminal A in these productions by all the right sides of the A-productions. The resulting grammar has only productions whose right sides begin with terminals. (4) For each terminal a appearing in a production as other than the first symbol on the right side, replace it by new nonterminals of the form a' and add production a' --~ a. Call the resulting grammar G'. [Z THEOREM 2.20 Algorithm 2.15 yields a grammar G' in GNF, and L(G) = L(G').
Proof. That G' is in G N F follows from the properness of G. That is, no component of _B or R is e. That L(G') = L(G) follows from Lemmas 2.14, 2.17, and 2.19. D Example 2.30 Let us consider the grammar whose corresponding defining equations are (2.4.7), that is,
A
~ AaBIBBIb
B
~ aA IBAaIBd[ c
We rewrite these equations according to step (2) of Algorithm 2.15 as
We then add the equations
(2.4.9) I~ ZXI=fa: • A a Z~ q--dlI~
zX] -q-
EaBBA a q - d l
The grammar corresponding t o (2.4.8) and (2.4.9) is
A
> b WI an YI c YI b
B
>b X l a A Z l c Z l a A l e
W X
~ aBWla11 > aBX
Y
>B W [ A a Y [ d Y [ B
z
~nXlAaZldZlAa[d
EXERCISES
163
Note that X is a useless symbol. In step (3), the productions Y - - , B W I Aa YI B and Z --~ B X I A a Z t Aa are replaced by substituting for the leading A's and B's. We omit this transformation, as well as that o f step (4), which should now be familiar to the reader. [~
EXERCISES
2.4.1.
Let G be defined by
S---->AB A ---> AalbB B
~alSb
Give derivation trees for the following sentential forms: (a) baabaab. (b) bBABb. (c) baSb. 2.4.2.
Give a leftmost and rightmost derivation of the string baabaab in the grammar of Exercise 2.4.1.
2.4.3.
Give all cuts of the tree of Fig. 2.14. nl
n4
n5 n7
2.4.4.
Fig. 2.14 Unlabelled derivation tree.
Show that the following are equivalent statements about a CFG G and sentence w: (a) w is the frontier of two distinct derivation trees of G. (b) w has two distinct leftmost derivations in G. (c) w has two distinct rightmost derivations in G.
**2.4.5. 2.4.6.
What is the largest number of different derivations that are representable by the same derivation tree of n nodes ? Convert the grammar
S
>AIB
A
> aBlbSlb
B ----~. ABIBa C---->. AS[b to an equivalent CFG with no useless symbols.
164
ELEMENTS OF LANGUAGE THEORY
CHAP. 2
2.4.7.
Prove that Algorithm 2.8 correctly removes inaccessible symbols.
2.4.8.
Complete the proof of Theorem 2.13.
2.4.9.
Discuss the time and space complexity of Algorithm 2.8. Use a random access computer model.
2.4.10.
Give an algorithm to compute for a CFG G = (N, E, P, S) the set of . A ~ N such that A :=~ e. How fast is your algorithm ?
2.4.11.
Find an e-free grammar equivalent to the following" S
>ABC
A -----~ BBIe
B
> CCIa
C
>AAIb
2.4.12.
Complete the proof of Theorem 2.14.
2.4.13.
Find a proper grammar equivalent to the following" S-----> A I B A-----> C I D B----~ D I E C---->. S l a l e 19
> Slb
E----> S l c l e
2.4.14.
Prove Theorem 2.16.
2.4.15.
Prove Lemma 2.14.
2.4.16.
Put the following grammars in Chomsky normal form" (a) S ~ 0S1 t01. (b) S ---~ aBIbA A ---~ aS IbAAla B ---, bSlaBBIb.
2.4.17.
If G = (N, ~, P, S) is in CNF, S :=~ w, I w[ = n, and w is in ~*, what
k
G
is k? 2.4.18.
Give a detailed proof of Theorem 2.17.
2.4.19.
Put the grammar S
into G N F (a) Using Algorithm 2.14. (b) Using Algorithm 2.15.
~ BalAb
A ~
Sa[AAb[a
B~
SblBBalb
EXERCISES
*2.4.20.
Give a fast algorithm to test if a C F G G is left-recursive.
2.4.21.
Give an algorithm to eliminate right recursion from a C F G .
2.4.22.
Complete the proof of L e m m a 2.15.
*2.4.23.
165
Prove L e m m a s 2.17-2.19.
2.4.24.
Complete Example 2.30 to yield a proper g r a m m a r in G N F .
2.4.25.
Discuss the relative merits of Algorithms 2.14 and 2.15, especially with regard to the size of the resulting grammar.
*2.4.26.
Show that every C F L without e has a g r a m m a r where all productions are of the forms A ~ a B C , A ---~ aB, and A ----~ a. DEFINITION A C F G is an operator g r a m m a r if no production has a right side with two adjacent nonterminals.
*2.4.27.
Show that every C F L has an operator grammar. H i n t : Begin with a G N F grammar.
*2.4.28.
Show that every C F L is generated by a g r a m m a r in which each production is of one of the forms A ~
aBbC,
A ~
aBb,
A ~
aB,
or
A~
a
If e ~ L(G), then S ---~ e is also in P. **2.4.29.
Consider the g r a m m a r with the two productions S---~ S S [ a . Show that the n u m b e r of distinct leftmost derivations of a n is given by X~ = ~ t + j = , ~X~, where X1 = 1. Show that i~0 j~0
Xn+l = n + 1
n)
(These are the Catalan numbers.) *2.4.30.
Show that if L is a C F L containing no sentence of length less than 2, then L has a g r a m m a r with all productions of the form A ---~ aocb.
2.4.31.
Show that every C F L has a g r a m m a r in which if X1 X2 . . . Xk is the right side of a production, then XI . . . . . Xk are all distinct. DEFINITION A C F G G = (N, E, P, S) is linear if every production is of the form w B x or A ---~ w for w and x in Z* and B in N.
A ~
2.4.32.
Show that every linear language without e has a g r a m m a r in which each production is of one of the forms A ~ aB, A ~ Ba, or A ~ a.
*2.4.33.
Show that every C F L has a g r a m m a r G ---- (N, E, P, S) such that if A ,
is in N -- [S], then [ w l A :=~ w and w is in Z*} is infinite. 2.4.34.
Show that every C F L has a recursive grammar. H i n t : Use L e m m a 2.14 and Exercise 2.4.33.
166
ELEMENTSOF LANGUAGE THEORY
*2.4.35.
CHAP. 2
Let us call a C F G G = (N, ~, P, S)quasi-linear if for every production A ---, Xa . . . Xk there is at most one X~ which generates an infinite set of terminal strings. Show that every quasi-linear grammar generates a linear language. DEFINITION
The graph of a C F G G = (N, Z, P, S) is a directed unordered graph (N u ~ u {e}, R) such that A R X if and only if A ----~ocXfl is a production in P for some 0¢ and ft. 2.4.36.
Show that if a grammar has no useless symbols, then all nodes are accessible from S. Is the converse of this statement true ?
2.4.37.
Let T be the transformation on context-free grammars defined in Lemma 2.14. That is, if G and G' are the grammars in the statement of Lemma 2.14, then T maps G into G'. Show that Algorithms 2.10 and 2.11 can be implemented by means of repeated applications of this transformation T.
Programming Exercises 2.4.38.
Construct a program that eliminates all useless symbols from a CFG.
2.4.39.
Write a program that maps a CFG into an equivalent proper CFG.
2.4.40.
Construct a program that removes all left recursion from a CFG.
2.4.41.
Write a program that decides whether a given derivation tree is a valid derivation tree for a CFG.
BIBLIOGRAPHIC
NOTES
A derivation tree is also called a variety of other names including generation tree, parsing diagram, parse tree, syntax tree, phrase marker, and p-marker. The representation of a derivation in terms of a derivation tree has been a familiar concept in linguistics. The concept of leftmost derivation appeared in Evey [1963]. Many of the algorithms in this chapter have been known since the early 1960's, although many did not appear in the literature until considerably later. Theorem 2.17 (Chomsky normal form) was first presented by Chomsky [1959a]. Theorem 2.18 (Greibach normal form) was presented by Greibach [1965]. The alternative method of achieving G N F (Algorithm 2.15) and the result stated in Exercise 2.4.30 were presented by Rosenkrantz [1967]. Algorithm 2.14 for Greibach normal form has been attributed to M. Paull. Chomsky [1963], Chomsky and Schutzenberger [1963], and Ginsburg and Rice [1962] have used equations to represent the productions of a context-free grammar. Operator grammars were first considered by Floyd [1963]. The normal forms given in Exercises 2.4.26-2.4.28 were derived by Greibach [1965].
2.5. PUSHDOWN AUTOMATA
We now introduce the pushdown automaton, a recognizer that is a natural model for syntactic analyzers of context-free languages. The pushdown automaton is a one-way nondeterministic recognizer whose infinite storage consists of one pushdown list, as shown in Fig. 2.15.
• .-
a2
art
Read only input tape
! [ Filit state
control
Zi Z2
Pushdown list
Zm Fig. 2.15
Pushdown a u t o m a t o n .
We shall prove a fundamental result regarding pushdown automata--that a language is context-free if and only if it is accepted by a nondeterministic pushdown automaton. We shall also consider a subclass of the context-free languages which is of prime importance when parsability is considered. These, called the deterministic CFL's, are those CFL's which can be recognized by a deterministic pushdown automaton.

2.5.1. The Basic Definition
We shall represent a pushdown list as a string of symbols with the topmost symbol written either on the left or on the right, depending on which convention is most convenient for the situation at hand. For the time being we shall assume that the top symbol on the pushdown list is the leftmost symbol of the string representing the pushdown list.

DEFINITION
A pushdown automaton (PDA for short) is a 7-tuple

    P = (Q, Σ, Γ, δ, q0, Z0, F),
where
(1) Q is a finite set of state symbols representing the possible states of the finite state control,
(2) Σ is a finite input alphabet,
(3) Γ is a finite alphabet of pushdown list symbols,
(4) δ is a mapping from Q × (Σ ∪ {e}) × Γ to the finite subsets of Q × Γ*,
(5) q0 ∈ Q is the initial state of the finite control,
(6) Z0 ∈ Γ is the symbol that appears initially on the pushdown list (the start symbol), and
(7) F ⊆ Q is the set of final states.
A configuration of P is a triple (q, w, α) in Q × Σ* × Γ*, where
(1) q represents the current state of the finite control,
(2) w represents the unused portion of the input. The first symbol of w is under the input head. If w = e, then it is assumed that all of the input tape has been read.
(3) α represents the contents of the pushdown list. The leftmost symbol of α is the topmost pushdown symbol. If α = e, then the pushdown list is assumed to be empty.
A move by P will be represented by the binary relation ⊢_P (or ⊢ whenever P is understood) on configurations. We write

(2.5.1)    (q, aw, Zα) ⊢ (q', w, γα)

if δ(q, a, Z) contains (q', γ) for any q ∈ Q, a ∈ Σ ∪ {e}, w ∈ Σ*, Z ∈ Γ, and α ∈ Γ*. If a ≠ e, Eq. (2.5.1) states that if P is in a configuration such that the finite control is in state q, the current input symbol is a, and the symbol on top of the pushdown list is Z, then P may go into a configuration in which the finite control is in state q', the input head has been shifted one square to the right, and the topmost symbol on the pushdown list has been replaced by the string γ of pushdown list symbols. If γ = e, we say that the pushdown list has been popped. If a = e, then the move is called an e-move. In an e-move the current input symbol is not taken into consideration, and the input head is not moved. However, the state of the finite control can be changed, and the contents of the memory can be adjusted. Note that an e-move can occur even if all of the input has been read. No move is possible if the pushdown list is empty. We can define the relations ⊢^i, for i ≥ 0, ⊢⁺, and ⊢* in the customary fashion. Thus, ⊢* and ⊢⁺ are, respectively, the reflexive-transitive and transitive closures of ⊢.
An initial configuration of P is one of the form (q0, w, Z0) for some w in Σ*. That is, the finite state control is in the initial state, the input contains the string to be recognized, and the pushdown list contains only the symbol Z0. A final configuration is one of the form (q, e, α), where q is in F and α is in Γ*. We say that a string w is accepted by P if (q0, w, Z0) ⊢* (q, e, α) for some q in F and α in Γ*. The language defined by P, denoted L(P), is the set of strings accepted by P. L(P) will be called a pushdown automaton language.

Example 2.31
Let us give a pushdown automaton for the language L = {0^n1^n | n ≥ 0}. Let P = ({q0, q1, q2}, {0, 1}, {Z, 0}, δ, q0, Z, {q0}), where

    δ(q0, 0, Z) = {(q1, 0Z)}
    δ(q1, 0, 0) = {(q1, 00)}
    δ(q1, 1, 0) = {(q2, e)}
    δ(q2, 1, 0) = {(q2, e)}
    δ(q2, e, Z) = {(q0, e)}

P operates by copying the initial string of 0's from its input tape onto its pushdown list and then popping one 0 from the pushdown list for each 1 that is seen on the input. Moreover, the state transitions ensure that all 0's must precede the 1's. For example, with the input string 0011, P would make the following sequence of moves:

    (q0, 0011, Z) ⊢ (q1, 011, 0Z) ⊢ (q1, 11, 00Z) ⊢ (q2, 1, 0Z) ⊢ (q2, e, Z) ⊢ (q0, e, e)

In general we can show that

    (q0, 0, Z) ⊢ (q1, e, 0Z)
    (q1, 0^i, 0Z) ⊢^i (q1, e, 0^(i+1)Z)
    (q1, 1, 0^(i+1)Z) ⊢ (q2, e, 0^iZ)
    (q2, 1^i, 0^iZ) ⊢^i (q2, e, Z)
    (q2, e, Z) ⊢ (q0, e, e)
Stringing all of this together, we have the following sequence of moves by P:

    (q0, 0^n1^n, Z) ⊢^(2n+1) (q0, e, e)    for n ≥ 1

and (q0, e, Z) ⊢^0 (q0, e, Z). Thus, L ⊆ L(P).

Now we need to show that L(P) ⊆ L; that is, P accepts only strings of the form 0^n1^n. This is the hard part. It is generally easy to show that a recognizer accepts certain strings. As with grammars, it is invariably much more difficult to show that a recognizer accepts only strings of a certain form. Here we notice that if P accepts an input string other than e, it must cycle through the sequence of states q0, q1, q2, q0. We notice that if (q0, w, Z) ⊢^i (q1, e, α), i ≥ 1, then w = 0^i and α = 0^iZ. Likewise, if (q2, w, α) ⊢^i (q2, e, β), then w = 1^i and α = 0^iβ. Also, (q1, w, α) ⊢ (q2, e, β) only if w = 1 and α = 0β, and (q2, w, Z) ⊢ (q0, e, e) only if w = e. Thus, if (q0, w, Z) ⊢^i (q0, e, α) for some i ≥ 0, either w = e and i = 0, or w = 0^n1^n, i = 2n + 1, and α = e. Hence L(P) ⊆ L. □

We emphasize that a pushdown automaton, as we have defined it, can make moves even though it has scanned all of its input. However, a pushdown automaton cannot make a move if its pushdown list is empty.
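Because δ offers only a finite set of choices in each configuration, the definitions above translate directly into a search over configurations. The following Python sketch is ours, not part of the text: it tests acceptance by final state, writing e as the empty string and keeping the pushdown list as a string whose leftmost symbol is the top. It terminates whenever the set of configurations reachable from the initial one is finite, as is the case for the machines of this section; the transcription of Example 2.31 at the end is only an illustration.

    from collections import deque

    def accepts(pda, w):
        """Does the PDA accept w by final state?  `pda` has keys 'delta', 'q0',
        'Z0', 'F'; 'delta' maps (state, input symbol or '', top symbol) to a set
        of (state, string pushed in place of the top) pairs."""
        start = (pda['q0'], w, pda['Z0'])
        seen, frontier = {start}, deque([start])
        while frontier:
            q, rest, stack = frontier.popleft()
            if not rest and q in pda['F']:
                return True                  # a final configuration is reached
            successors = []
            if stack:                        # no move on an empty pushdown list
                Z, below = stack[0], stack[1:]
                if rest:                     # moves that shift the input head
                    successors += [(r, rest[1:], g + below)
                                   for r, g in pda['delta'].get((q, rest[0], Z), ())]
                successors += [(r, rest, g + below)   # e-moves
                               for r, g in pda['delta'].get((q, '', Z), ())]
            for c in successors:
                if c not in seen:
                    seen.add(c)
                    frontier.append(c)
        return False

    # δ of Example 2.31, with e written as ''
    delta = {
        ('q0', '0', 'Z'): {('q1', '0Z')},
        ('q1', '0', '0'): {('q1', '00')},
        ('q1', '1', '0'): {('q2', '')},
        ('q2', '1', '0'): {('q2', '')},
        ('q2', '', 'Z'): {('q0', '')},
    }
    P = {'delta': delta, 'q0': 'q0', 'Z0': 'Z', 'F': {'q0'}}
    print(accepts(P, '0011'))   # True
    print(accepts(P, '0111'))   # False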
Example 2.32

Let us design a pushdown automaton for the language L = {ww^R | w ∈ {a, b}⁺}. Let P = ({q0, q1, q2}, {a, b}, {Z, a, b}, δ, q0, Z, {q2}), where

    (1) δ(q0, a, Z) = {(q0, aZ)}
    (2) δ(q0, b, Z) = {(q0, bZ)}
    (3) δ(q0, a, a) = {(q0, aa), (q1, e)}
    (4) δ(q0, a, b) = {(q0, ab)}
    (5) δ(q0, b, a) = {(q0, ba)}
    (6) δ(q0, b, b) = {(q0, bb), (q1, e)}
    (7) δ(q1, a, a) = {(q1, e)}
    (8) δ(q1, b, b) = {(q1, e)}
    (9) δ(q1, e, Z) = {(q2, e)}
P initially copies some of its input onto its pushdown list, by rules (1), (2), (4), and (5) and the first alternatives of rules (3) and (6). However, P is nondeterministic. Anytime it wishes, as long as its current input symbol matches the top of the pushdown list, it may enter state q1 and begin matching its
pushdown list against the input. The second alternatives of rules (3) and (6) represent this choice, and the matching continues by rules (7) and (8). Note that if P ever fails to find a match, then this instance of P "dies." However, since P is nondeterministic, it makes all possible moves. If any choice causes P to expose the Z on its pushdown list, then by rule (9) that Z is erased and state q2 entered. Thus P accepts if and only if all matches are made. For example, with the input string abba, P can make the following sequences of moves, among others:

    (1) (q0, abba, Z) ⊢ (q0, bba, aZ) ⊢ (q0, ba, baZ) ⊢ (q0, a, bbaZ) ⊢ (q0, e, abbaZ)
    (2) (q0, abba, Z) ⊢ (q0, bba, aZ) ⊢ (q0, ba, baZ) ⊢ (q1, a, aZ) ⊢ (q1, e, Z) ⊢ (q2, e, e)

Since the sequence (2) ends in final state q2, P accepts the input string abba. Again it is relatively easy to show that if w = c1c2 ⋯ cn cn cn-1 ⋯ c1, each ci in {a, b}, 1 ≤ i ≤ n, then

    (q0, w, Z) ⊢^n (q0, cn cn-1 ⋯ c1, cn cn-1 ⋯ c1 Z)
               ⊢ (q1, cn-1 ⋯ c1, cn-1 ⋯ c1 Z)
               ⊢^(n-1) (q1, e, Z)
               ⊢ (q2, e, e)

Thus, L ⊆ L(P). It is not quite as easy to show that if (q0, w, Z) ⊢* (q2, e, α) for some α ∈ Γ*, then w is of the form xx^R for some x in (a + b)⁺ and α = e. This proof is left for the Exercises. We can then conclude that L(P) = L.
The pushdown automaton of Example 2.32 quite clearly brings out the nondeterministic nature of a PDA. From any configuration of the form (qo, aw, a~) it is possible for P to make one of two moves---either push another a on the pushdown list or pop the a from the top of the pushdown list. We should emphasize that although a nondeterministic pushdown automaton may provide a convenient abstract definition for a language, the device must be deterministically simulated to be realized in practice. In Chapter 4 we shall discuss systematic methods for simulating nondeterministic pushdown automata.
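For instance, the machine of Example 2.32 can be fed to the accepts sketch given earlier; the transcription of its transition function below is only an illustration.

    # δ of Example 2.32, with e written as ''
    delta = {
        ('q0', 'a', 'Z'): {('q0', 'aZ')},
        ('q0', 'b', 'Z'): {('q0', 'bZ')},
        ('q0', 'a', 'a'): {('q0', 'aa'), ('q1', '')},
        ('q0', 'a', 'b'): {('q0', 'ab')},
        ('q0', 'b', 'a'): {('q0', 'ba')},
        ('q0', 'b', 'b'): {('q0', 'bb'), ('q1', '')},
        ('q1', 'a', 'a'): {('q1', '')},
        ('q1', 'b', 'b'): {('q1', '')},
        ('q1', '', 'Z'): {('q2', '')},
    }
    P = {'delta': delta, 'q0': 'q0', 'Z0': 'Z', 'F': {'q2'}}
    print(accepts(P, 'abba'))    # True:  abba = ww^R with w = ab
    print(accepts(P, 'abab'))    # False: not of the form ww^R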
2.5.2. Variants of Pushdown Automata
In this section we shall define some variants of PDA's and relate the languages defined to the original PDA languages. First we would like to bring out a fundamental aspect of the behavior of a PDA which should be quite intuitive. This can be stated as "What transpires on top of the pushdown list is independent of what is under the top of the pushdown list."

LEMMA 2.20
Let P = (Q, Σ, Γ, δ, q0, Z0, F) be a PDA. If (q, w, A) ⊢^n (q', e, e), then (q, w, Aα) ⊢^n (q', e, α) for all A ∈ Γ and α ∈ Γ*.
Proof. A proof by induction on n is quite elementary. For n = 1, the lemma is certainly true. Assuming that it is true for all 1 ≤ n < n', let (q, w, A) ⊢^n' (q', e, e). Such a sequence of moves must be of the form

    (q, w, A) ⊢ (q1, w1, X1 ⋯ Xk) ⊢^(n1) (q2, w2, X2 ⋯ Xk) ⋯ ⊢^(nk-1) (qk, wk, Xk) ⊢^(nk) (q', e, e)

where k ≥ 1 and ni < n' for 1 ≤ i ≤ k.† Then the following sequence of moves must also be possible for any α ∈ Γ*:

    (q, w, Aα) ⊢ (q1, w1, X1 ⋯ Xkα) ⊢^(n1) (q2, w2, X2 ⋯ Xkα) ⋯ ⊢^(nk-1) (qk, wk, Xkα) ⊢^(nk) (q', e, α)

Except for the first move, we invoke the inductive hypothesis. □

†This is another of those "obvious" statements which may require some thought. Imagine the PDA running through the indicated sequence of configurations. Eventually the length of the pushdown list becomes k - 1 for the first time. Since none of X2, ..., Xk has ever been the top symbol, they must still be there, so let n1 be the number of elapsed moves. Then wait until the length of the list first becomes k - 2 and let n2 be the number of additional moves made. Proceed in this way until the list becomes empty.
Next, we would like to extend the definition of a PDA slightly to permit the PDA to replace a finite-length string of symbols on top of the pushdown list by some other finite-length string in a single move. Recall that our original version of a PDA could replace only the topmost symbol of the pushdown list on a given move.

DEFINITION
Let an extended PDA be a 7-tuple P = (Q, Σ, Γ, δ, q0, Z0, F), where δ is a mapping from a finite subset of Q × (Σ ∪ {e}) × Γ* to the finite subsets of Q × Γ*, and all other symbols have the same meaning as before. A configuration is as before, and we write (q, aw, αγ) ⊢ (q', w, βγ) if δ(q, a, α) contains (q', β), for q in Q, a in Σ ∪ {e}, and α in Γ*. In this move the string α is replaced by the string β on top of the pushdown list. As before, the language defined by P, denoted L(P), is {w | (q0, w, Z0) ⊢* (q, e, α) for some q in F and α in Γ*}. Notice that, unlike a conventional PDA, an extended pushdown automaton is capable of making moves when its pushdown list is empty.

Example 2.33
Let us define an extended PDA P to recognize L = {ww^R | w ∈ {a, b}*}. Let P = ({q, p}, {a, b}, {a, b, S, Z}, δ, q, Z, {p}), where

    (1) δ(q, a, e) = {(q, a)}
    (2) δ(q, b, e) = {(q, b)}
    (3) δ(q, e, e) = {(q, S)}
    (4) δ(q, e, aSa) = {(q, S)}
    (5) δ(q, e, bSb) = {(q, S)}
    (6) δ(q, e, SZ) = {(p, e)}

With input aabbaa, P can make the following sequence of moves:

    (q, aabbaa, Z) ⊢ (q, abbaa, aZ)
                   ⊢ (q, bbaa, aaZ)
                   ⊢ (q, baa, baaZ)
                   ⊢ (q, baa, SbaaZ)
                   ⊢ (q, aa, bSbaaZ)
                   ⊢ (q, aa, SaaZ)
                   ⊢ (q, a, aSaaZ)
                   ⊢ (q, a, SaZ)
                   ⊢ (q, e, aSaZ)
                   ⊢ (q, e, SZ)
                   ⊢ (p, e, e)
P operates by first storing a prefix of the input on the pushdown list. Then a centermarker S is placed on top of the pushdown list. P then places the next input symbol on the pushdown list and replaces aSa or bSb by S on the list. P continues in this fashion until all of the input is used. If SZ then remains on the pushdown list, P erases SZ and enters the final state. □

We would now like to show that L is a PDA language if and only if L is an extended PDA language. The "only if" part of this statement is clearly true. The "if" part is the following lemma.

LEMMA 2.21
Let P = (Q, Σ, Γ, δ, q0, Z0, F) be an extended PDA. Then there is a PDA P1 such that L(P1) = L(P).
Proof. Let m = max{|α| : δ(q, a, α) is nonempty for some q ∈ Q and a ∈ Σ ∪ {e}}. We shall construct a PDA P1 to simulate P by storing the top m symbols that appear on P's pushdown list in a "buffer" of length m located in the finite state control of P1. In this way P1 can tell at the start of each move what the top m symbols of P's pushdown list are. If, in a move, P replaces the top k symbols on the pushdown list by a string of l symbols, then P1 will replace the first k symbols in the buffer by the string of length l. If l < k, then P1 will make k - l bookkeeping e-moves in which k - l symbols are transferred from the top of the pushdown list to the buffer in the finite control. The buffer will then be full and P1 ready to simulate another move of P. If l > k, symbols are transferred from the buffer to the pushdown list.
Formally, let P1 = (Q1, Σ, Γ1, δ1, q1, Z1, F1), where
(1) Q1 = {[q, α] : q ∈ Q, α ∈ Γ1*, and 0 ≤ |α| ≤ m}.
(2) Γ1 = Γ ∪ {Z1}.
(3) δ1 is defined as follows:
    (a) Suppose that δ(q, a, X1 ⋯ Xk) contains (r, Y1 ⋯ Yl).
        (i) If l ≥ k, then for all Z ∈ Γ1 and α ∈ Γ1* such that |α| = m - k, δ1([q, X1 ⋯ Xkα], a, Z) contains ([r, β], γZ), where βγ = Y1 ⋯ Ylα and |β| = m.
        (ii) If l < k, then for all Z ∈ Γ1 and α ∈ Γ1* such that |α| = m - k, δ1([q, X1 ⋯ Xkα], a, Z) contains ([r, Y1 ⋯ YlαZ], e).
    (b) For all q ∈ Q, Z ∈ Γ1, and α ∈ Γ1* such that |α| < m, δ1([q, α], e, Z) = {([q, αZ], e)}.
These rules cause the buffer in the finite control to fill up (i.e., contain m symbols).
(4) q1 = [q0, Z0 Z1^(m-1)]. The buffer initially contains Z0 on top and m - 1 Z1's below. Z1's are used as a special marker for the bottom of the pushdown list.
(5) F1 = {[q, α] : q ∈ F, α ∈ Γ1*}.
It is not difficult to show that

    (q, aw, X1 ⋯ Xk Xk+1 ⋯ Xn) ⊢_P (r, w, Y1 ⋯ Yl Xk+1 ⋯ Xn)

if and only if ([q, α], aw, β) ⊢⁺_P1 ([r, α'], w, β'), where
(1) αβ = X1 ⋯ Xn Z1^m,
(2) α'β' = Y1 ⋯ Yl Xk+1 ⋯ Xn Z1^m,
(3) |α| = |α'| = m, and
(4) between the two configurations of P1 shown there is none whose state has a second component (buffer) of length m.
Direct examination of the rules of P1 is sufficient. Thus, (q0, w, Z0) ⊢*_P (q, e, α) for some q in F and α in Γ* if and only if ([q0, Z0 Z1^(m-1)], w, Z1) ⊢*_P1 ([q, β], e, γ), where |β| = m and βγ = αZ1^m. Thus, L(P1) = L(P). □
Let us now examine those inputs to a PDA which cause the pushdown list to become empty.

DEFINITION
Let P = (Q, Σ, Γ, δ, q0, Z0, F) be a PDA, or an extended PDA. We say that a string w ∈ Σ* is accepted by P by empty pushdown list whenever (q0, w, Z0) ⊢* (q, e, e) for some q ∈ Q. Let Le(P) be the set of strings accepted by P by empty pushdown list.

LEMMA 2.22
Let L be L(P) for some PDA P = (Q, Σ, Γ, δ, q0, Z0, F). We can construct a PDA P' such that Le(P') = L.
Proof. We shall let P' simulate P. Anytime P enters a final state, P' will have a choice of continuing to simulate P or entering a special state qe which causes the pushdown list to be emptied. However, there is one complication. P may make a sequence of moves on an input string w which causes its pushdown list to become empty without the finite control being in a final state. Thus, to prevent P' from accepting w when it should not, we add to P' a special bottom marker for the pushdown list which can be removed only by P' in state qe. Formally, let P' be (Q ∪ {qe, q'}, Σ, Γ ∪ {Z'}, δ', q', Z', ∅),†

†We shall usually make the set of final states ∅ if the PDA is to accept by empty pushdown list. Obviously, the set of final states could be anything we wished.
where δ' is defined as follows:
(1) If δ(q, a, Z) contains (r, γ), then δ'(q, a, Z) contains (r, γ) for all q ∈ Q, a ∈ Σ ∪ {e}, and Z ∈ Γ.
(2) δ'(q', e, Z') = {(q0, Z0Z')}. The first move of P' is to write Z0Z' on the pushdown list and enter the initial state of P. Z' will act as the special marker for the bottom of the pushdown list.
(3) For all q ∈ F and Z ∈ Γ ∪ {Z'}, δ'(q, e, Z) contains (qe, e).
(4) For all Z ∈ Γ ∪ {Z'}, δ'(qe, e, Z) = {(qe, e)}.
We can clearly see that

    (q', w, Z') ⊢ (q0, w, Z0Z') ⊢* (q, e, Y1 ⋯ Yr) ⊢ (qe, e, Y2 ⋯ Yr) ⊢* (qe, e, e),

where Yr = Z', if and only if (q0, w, Z0) ⊢* (q, e, Y1 ⋯ Yr-1) for q ∈ F and Y1 ⋯ Yr-1 ∈ Γ*. Hence, Le(P') = L(P). □
The converse of Lemma 2.22 is also true.

LEMMA 2.23
Let P = (Q, Σ, Γ, δ, q0, Z0, ∅) be a PDA. We can construct a PDA P' such that L(P') = Le(P).

Proof. P' will simulate P but have a special symbol Z' on the bottom of its pushdown list. As soon as P' can read Z', P' will enter a new final state qf. A formal construction is left for the Exercises. □

2.5.3. Equivalence of PDA Languages and CFL's
We can now use these results to show that the PDA languages are exactly the context-free languages. In the following lemma we construct the natural (nondeterministic) "top-down" parser for a context-free grammar.

LEMMA 2.24
Let G = (N, Σ, P, S) be a CFG. From G we can construct a PDA R such that Le(R) = L(G).

Proof. We shall construct R to simulate all leftmost derivations in G. Let R = ({q}, Σ, N ∪ Σ, δ, q, S, ∅), where δ is defined as follows:
(1) If A → α is in P, then δ(q, e, A) contains (q, α).
(2) δ(q, a, a) = {(q, e)} for all a in Σ.
We now want to show that
(2.5.2)    A ⇒^m w for some m ≥ 1 if and only if (q, w, A) ⊢^n (q, e, e) for some n ≥ 1

Only if: We shall prove this part by induction on m. Suppose that A ⇒^m w. If m = 1 and w = a1 ⋯ ak, k ≥ 0, then

    (q, a1 ⋯ ak, A) ⊢ (q, a1 ⋯ ak, a1 ⋯ ak) ⊢^k (q, e, e)

Now suppose that A ⇒^m w for some m > 1. The first step of this derivation must be of the form A → X1X2 ⋯ Xk, where Xi ⇒^(mi) xi for some mi < m, 1 ≤ i ≤ k, and where x1x2 ⋯ xk = w. Then

    (q, w, A) ⊢ (q, w, X1X2 ⋯ Xk)

If Xi is in N, then (q, xi, Xi) ⊢⁺ (q, e, e) by the inductive hypothesis. If Xi = xi is in Σ, then (q, xi, Xi) ⊢ (q, e, e). Putting this sequence of moves together, we have (q, w, A) ⊢⁺ (q, e, e).

If: We shall now show by induction on n that if (q, w, A) ⊢^n (q, e, e), then A ⇒⁺ w. For n = 1, w = e and A → e is in P. Let us assume that this statement is true for all n' < n. Then the first move made by R must be of the form

    (q, w, A) ⊢ (q, w, X1 ⋯ Xk)

and (q, xi, Xi) ⊢⁺ (q, e, e) for 1 ≤ i ≤ k, where w = x1x2 ⋯ xk (Lemma 2.20). Then A → X1 ⋯ Xk is a production in P, and Xi ⇒⁺ xi from the inductive hypothesis if Xi ∈ N. If Xi is in Σ, then Xi ⇒^0 xi. Thus

    A ⇒ X1 ⋯ Xk ⇒* x1X2 ⋯ Xk ⇒* x1x2 ⋯ Xk-1Xk ⇒* x1x2 ⋯ xk-1xk = w

is a derivation of w from A in G.
As a special case of (2.5.2), we have the derivation S ⇒⁺ w if and only if (q, w, S) ⊢⁺ (q, e, e). Thus, Le(R) = L(G). □
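The construction of Lemma 2.24 is mechanical enough to program directly. In the sketch below (ours, and not in the book's notation), R has a single state, so a configuration is simply the unread input together with the pushdown list; rule (1) becomes an expansion of the top nonterminal, rule (2) a match of the top terminal, and acceptance is by empty pushdown list. The demonstration grammar S → aSb | ab is chosen instead of G0 because its expansions cannot go on indefinitely without consuming input, so the search terminates.

    from collections import deque

    def topdown_recognize(productions, start, w):
        """Nondeterministic top-down recognition in the style of Lemma 2.24."""
        nonterminals = {A for A, _ in productions}
        start_conf = (w, (start,))                 # (unread input, pushdown list)
        seen, frontier = {start_conf}, deque([start_conf])
        while frontier:
            rest, stack = frontier.popleft()
            if not rest and not stack:
                return True                        # accepted by empty pushdown list
            successors = []
            if stack:
                top = stack[0]
                if top in nonterminals:            # rule (1): expand A by A -> alpha
                    successors = [(rest, tuple(rhs) + stack[1:])
                                  for A, rhs in productions if A == top]
                elif rest and rest[0] == top:      # rule (2): match a terminal
                    successors = [(rest[1:], stack[1:])]
            for c in successors:
                if c not in seen:
                    seen.add(c)
                    frontier.append(c)
        return False

    prods = [('S', ('a', 'S', 'b')), ('S', ('a', 'b'))]   # S -> aSb | ab
    print(topdown_recognize(prods, 'S', 'aaabbb'))        # True
    print(topdown_recognize(prods, 'S', 'aabbb'))         # False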
Example 2.34

Let us construct a PDA P such that Le(P) = L(G0), where G0 is our usual grammar for arithmetic expressions. Let P = ({q}, Σ, Γ, δ, q, E, ∅), where δ is defined as follows:

    (1) δ(q, e, E) = {(q, E + T), (q, T)}
    (2) δ(q, e, T) = {(q, T * F), (q, F)}
    (3) δ(q, e, F) = {(q, (E)), (q, a)}
    (4) δ(q, b, b) = {(q, e)} for all b ∈ {a, +, *, (, )}
With input a + a * a, P can make the following moves, among others:

    (q, a + a * a, E) ⊢ (q, a + a * a, E + T)
                      ⊢ (q, a + a * a, T + T)
                      ⊢ (q, a + a * a, F + T)
                      ⊢ (q, a + a * a, a + T)
                      ⊢ (q, + a * a, + T)
                      ⊢ (q, a * a, T)
                      ⊢ (q, a * a, T * F)
                      ⊢ (q, a * a, F * F)
                      ⊢ (q, a * a, a * F)
                      ⊢ (q, * a, * F)
                      ⊢ (q, a, F)
                      ⊢ (q, a, a)
                      ⊢ (q, e, e)

Notice that in this sequence of moves P has used the rules in a sequence that corresponds to a leftmost derivation of a + a * a from E in G0. □

This type of analysis is called "top-down parsing," or "predictive analysis," because we are in effect constructing a derivation tree starting from the top (at the root) and working down. We shall discuss top-down parsing in greater detail in Chapters 3, 4, and 5.

We can construct an extended PDA that acts as a "bottom-up parser" by simulating rightmost derivations in reverse in a CFG G. Let us consider the sentence a + a * a in L(G0). The sequence
    E ⇒ E + T ⇒ E + T * F ⇒ E + T * a ⇒ E + F * a ⇒ E + a * a ⇒ T + a * a ⇒ F + a * a ⇒ a + a * a

of right-sentential forms represents a rightmost derivation of a + a * a from E in G0.
Now suppose that we write this derivation reversed. If we consider that in going from the string a + a * a to the string F + a * a we have applied the production F → a in reverse, then we can say that the string a + a * a has been "left-reduced" to the string F + a * a. Moreover, this represents the only leftmost reduction that is possible. Similarly, the right-sentential form F + a * a can be left-reduced to T + a * a by means of the production T → F, and so forth. We can formally define the process of left reduction as follows.

DEFINITION
Let G = (N, Σ, P, S) be a CFG, and suppose that

    S ⇒*_rm αAw ⇒_rm αβw ⇒*_rm xw

is a rightmost derivation. Then we say that the right-sentential form αβw can be left-reduced under the production A → β to the right-sentential form αAw. Furthermore, we call the substring β at the explicitly shown position a handle of αβw. Thus a handle of a right-sentential form is any substring which is the right side of some production and which can be replaced by the left side of that production so that the resulting string is also a right-sentential form.

Example 2.35
Consider the grammar with the following productions:

    S → Ac | Bd
    A → aAb | ab
    B → aBbb | abb
This grammar generates the language {a^n b^n c | n ≥ 1} ∪ {a^n b^(2n) d | n ≥ 1}. Consider the right-sentential form aabbbbd. The only handle of this string is abb, since aBbbd is a right-sentential form. Note that although ab is the right side of the production A → ab, ab is not a handle of aabbbbd, since aAbbbd is not a right-sentential form. □

Another way of defining the handle of a right-sentential form is to say that the handle is the frontier of the leftmost complete subtree of depth 1 (i.e., a node all of whose direct descendants are leaves, together with these leaves) of some derivation tree for that right-sentential form. In the grammar G0, the derivation tree for a + a * a is shown in Fig. 2.16(a). The leftmost complete subtree has the leftmost node labeled F as root and frontier a. If we delete the leaves of the leftmost complete subtree, we are left with the derivation tree of Fig. 2.16(b). The frontier of this tree is F + a * a, and
    Fig. 2.16 Handle pruning. [(a) Derivation tree for a + a * a in G0; (b) and (c) the trees obtained by successive handle pruning.]
this string is precisely the result of left-reducing a + a * a. The handle of this tree is the frontier F of the subtree with root labeled T. Again removing the handle, we are left with Fig. 2.16(c). The process of reducing trees in this manner is called handle pruning.

From a CFG G we can construct an equivalent extended PDA P which operates by handle pruning. At this point it is convenient to represent a pushdown list as a string such that the rightmost symbol of the pushdown list, rather than the leftmost, is at the top. Using this convention, if P = (Q, Σ, Γ, δ, q0, Z0, F) is a PDA, its configurations are exactly as before. However, the ⊢ relation is defined slightly differently. If δ(q, a, α) contains (p, β), then we write (q, aw, γα) ⊢ (p, w, γβ) for all w ∈ Σ* and γ ∈ Γ*. Thus a notation such as "δ(q, a, YZ) contains (p, VWX)" means different things depending on whether the (extended) PDA has the top of its pushdown list at the left or right. If at the left, Y and V are the top symbols before and after the move. If at the right, Z and X are the top symbols. Given a PDA
with the top at the left, one can create a PDA doing exactly the same things, but with the pushdown top at the right, by reversing all strings in Γ*. For example, (p, VWX) ∈ δ(q, a, YZ) becomes (p, XWV) ∈ δ(q, a, ZY). Of course, one must specify the fact that the top is now at the right. Conversely, a PDA with top at the right can easily be converted to one with the top at the left. We see that the 7-tuple notation for PDA's can be interpreted as two different PDA's, depending on whether the top is taken at the right or left. We feel that the notational convenience which results from having these two conventions outweighs any initial confusion. As the "default condition," unless it is specified otherwise, ordinary PDA's have their pushdown tops on the left and extended PDA's have their pushdown tops on the right.

LEMMA 2.25
Let G = (N, Σ, P, S) be a CFG. From G we can construct an extended PDA R such that L(R) = L(G).† R can "reasonably" be said to operate by handle pruning.

†Obviously, Lemma 2.25 is implied by Lemmas 2.23 and 2.24. It is the construction that is of interest here.
Proof. Let R = ({q, r}, Σ, N ∪ Σ ∪ {$}, δ, q, $, {r}) be an extended PDA,‡ in which δ is defined as follows:
(1) δ(q, a, e) = {(q, a)} for all a ∈ Σ. These moves cause input symbols to be shifted onto the top of the pushdown list.
(2) If A → α is in P, then δ(q, e, α) contains (q, A).
(3) δ(q, e, $S) = {(r, e)}.
We shall show that R operates by computing right-sentential forms of G, starting with a string of all terminals (on R's input) and ending with the string S. The inductive hypothesis, which will be proved by induction on n, is

(2.5.3)    S ⇒*_rm αAy ⇒^n_rm xy implies (q, xy, $) ⊢* (q, y, $αA)

‡Our convention puts pushdown tops on the right.
The basis, n = 0, is trivial; no moves of R are involved. Let us assume (2.5.3) for values of n smaller than the value we now choose for n. We can write αAy ⇒_rm αβy ⇒^(n-1)_rm xy. Suppose that αβ consists solely of terminals. Then αβ = x, and (q, xy, $) ⊢* (q, y, $αβ) ⊢ (q, y, $αA). If αβ is not in Σ*, then we can write αβ = γBz, where B is the rightmost
nonterminal. By (2.5.3), S ⇒*_rm γBzy ⇒^(n-1)_rm xy implies (q, xy, $) ⊢* (q, zy, $γB). Also, (q, zy, $γB) ⊢* (q, y, $γBz) ⊢ (q, y, $αA) is a valid sequence of moves. We conclude that (2.5.3) is true. Since (q, e, $S) ⊢ (r, e, e), we have L(G) ⊆ L(R).

We must now show the following, in order to conclude that L(R) ⊆ L(G), and hence L(G) = L(R):

(2.5.4)    If (q, xy, $) ⊢^n (q, y, $αA), then αAy ⇒*_rm xy
(q, xy, $) ~
(q, y, $afl) t-- (q, y, $aA),
where A --, fl is in P. If 0~fl has a nonterminal, then by inductive hypothesis (2.5.4), o~fly =-~ xy. Thus, ocAy ==~ocfly ~
xy, as contended.
As a special case of (2.5.4), (q, w, $ ) ~ (q, e, $S) implies that S ~ w. Since R only accepts w if (q, w, $) t---(q, e, $S) ~ (r, e, e), it follows that L(R) _~ L(G). Thus, L(R) -- L(G). [~ Notice that R stores a right-sentential form of the type aAx with aA on the pushdown list and x remaining on the input tape immediately after a reduction. Then R can proceed to shift symbols of x on the pushdown list until the handle is on top of the pushdown list. Then R can make another reduction. This type of syntactic analysis is called "bottom-up parsing" or "reductions analysis." Example 2.36
Let us construct a bottom-up analyzer R for Go. Let R be the extended PDA ({q, r}, Z, I', 5, q, $, [r]), where 5 is as follows: (1) O(q, b, e) = [(q, b)} for all b in {a, + , . , (,)}. (2) 5(q, e, E + T) = [(q, E)} 5(q, e, T) = [(q, E)} 5(q, e, T . F) = [(q, T)} 6(q, e, F) = {(q, T)] ~(q, e, (e)) = {(q, F)} 5(q, e, a) = [(q, F)}. (3) 5(q, e, $ E ) = {(r, e)~. With input a + a • a, R can make the following sequence of moves"
PUSHDOWN AUTOMATA
SEC. 2.5
183
(q, a + a , a, $) 1-- (q, + a , a, $a) 1- (q, + a • a, $F)
[-(q, + a , a , $ T ) (q, + a , a, $E) (q, a • a, $E + ) (q, , a, $E + a) (q, • a, $E + F) ~--- (q, • a, $E + T) ~- (q, a, $E + T ,) 1-- (q, e, SE + T • a) [- (q, e, SE + T , F) 1- (q, e, $E + T) 1-- (q, e, $E)
y- (r, e, e) Notice that R can make a great number of different sequences of moves with input a + a • a. This sequence, however, is the only one that goes from an initial configuration to a final configuration. 5 We shall now demonstrate that a language defined by a P D A is a contextfree language. LEMMA 2.26 Let R = (Q, X, F, 8, q0, Z0, F) be a PDA. We can construct a C F G such that L(G) = L,(R).
Proof. We shall construct G so that a leftmost derivation of w in G directly corresponds to a sequence of moves made by R in processing w. We shall use nonterminal symbols of the form [qZr] with q and r in Q and +
Z ~ F. We shall then show that [qZr] =~ w if and only if (q, w, Z) t----(r, e, e). Formally, let G = (N, Z, P, S), where (1) N = ([qZr][q, r (2) The productions (a) If 8(q, a, Z) productions
~ O , Z ~ F} u (S}. in P are constructed as follows" contains (r, X1 " " Xk), t k _> 1, then add to P all of the form
[qZsk] ~
a[rXlsl][siX2s2] "'" [Sk-lXkS~]
tR has its pushdown list top on the left, since we did not state otherwise.
184
CHAP. 2
ELEMENTSOF LANGUAGE THEORY
for every sequence s 1, s 2 , . . . , se of states in Q. (b) If 6(q, a,Z) contains (r, e), then add the production [qZr]---, a to P. (c) Add to P, S --~ [qoZoq] for each q E Q. It is straightforward to show by induction on m and n that for all q, r ~ Q, m
and Z ~ F, [qZr] =~ w if and only if (q, w, Z) Fz- (r, e, e). We leave the proof +
for the Exercises. Then, S =~ [qoZoq] =~ w if and only if (q0, w, Z0) ~-- (q, e, e) for q in Q. Thus, Le(R) -- L(G). D We can summarize these results in the following theorem. THEOREM 2.21 The following statements are equivalent: (1) (2) (3) (4)
L L L L
is L(G) for a CFG G. is L(P) for a PDA P. is Le(P) for a PDA P. is L(P) for an extended PDA P.
Proof ( 3 ) ~ (1) by Lemma 2.26. (1)----~ (3) by Lemma 2.24. (4)--~ (2) by Lemma 2.21, and (2) ~ (4) is trivial. (2) ~ (3) by Lemma 2.22 and (3)----, (2) by Lemma 2.23. r-] 2.5.4.
Deterministic Pushdown Automata
We have seen that for every context-flee grammar G we can construct a PDA to recognize L(G). The PDA constructed was nondeterministic, however. For practical applications we are more interested in deterministic pushdown automata--PDA's which can make at most one move in any configuration. In this section we shall study deterministic PDA's and later on we shall see that, unfortunately, deterministic PDA's are not as powerful in their recognitive capability as nondeterministic PDA's. There are context-free languages which cannot be defined by any deterministic PDA. A language which is defined by a deterministic pushdown automaton will be called a deterministic CFL. In Chapter 5 we shall define a subclass of the context-free grammars called LR(k) grammars. In Chapter 8 we shall show that every LR(k) grammar generates a deterministic CFL and that every deterministic CFL has an LR(1) grammar. DEFINITION
A PDA P -= (Q, Z, F, ~, q0, Zo, F) is said to be deterministic (a D P D A for short) if for each q ~ Q and Z ~ I" either (1) ~(q,a,Z) contains at most one element for each a in E and ~(q, e, Z)-= ~ or
SEC. 2.5
PUSHDOWN AUTOMATA
185
(2) O(q, a, Z ) - - ~ for all a ~ Z and 6(q, e, Z) contains at most one element. These two restrictions imply that a D P D A has at most one choice of move in any configuration. Thus in practice it is much easier to simulate a deterministic PDA than a nondeterministic PDA. For this reason the deterministic CFL's are an important class of languages for practical applications. CONVENTION Since 6(q, a, Z) contains at most one element for a DPDA, we shall write 6(q, a, Z) = (r, ~) instead of 6(q, a, Z) = {(r, ?)}. Example 2.37
Let us construct a D P D A for the language L = {wcw~lw ~ [a, b}+]. Let P = ([qo, q l, q2}, {a, b, c}, {Z, a, b}, ,6, qo, Z, {q2]), where the rules of t~ are
6(qo, X, Y) = (qo, X Y)
for all X ~ (a, b}
Y ~ [Z,a,b} 6(qo, c, Y ) = (q~, Y) O(q~, X, X) = (q~, e)
for all Y E [a, b} for all X ~ [a, b}
di(q t , e, Z ) = (q2, e) Until P sees the centermarker c, it stores its input on the pushdown list. When the e is reached, P goes to state q l and proceeds to match its subsequent input against the pushdown list. A proof that L(P) -- L is left for the Exercises. D The definition of a D P D A can be naturally widened to include the extended PDA's which we would naturally consider deterministic. DEFINITION An extended PDA P -- (Q, E, r', ~, q0, Zo, F) is an (extended) determin-
istic PDA if the following conditions hold: (1) For n o q ~ Q , a ~ E u { e } a n d ~ , ~ F* is z~O(q, a, 7 ) > 1. (2) If O(q, a, 0c) ~ ~ , 6(q, a, fl) ~ ~ , and ~ ~ fl, then neither of ~ and fl is a suffix of the other.? (3) If 6(q, a, oc) ~ ~ , and O(q, e,/3) :~ ~ , then neither of 0c and fl is a suffix of the other. We see that in the special case in which the extended PDA is an ordinary PDA, the two definitions agree. Also, if the construction of Lemma 2.21 is tlf the extended PDA has its pushdown list top at the left, replace "suffix" by "prefix."
186
ELEMENTSOF LANGUAGE THEORY
CHAP. 2
applied to an extended PDA P, the result will be a D P D A if and only if P is an extended DPDA. When modeling a syntactic analyzer, it is desirable to use a D P D A P that reads all of its input, even when the input is not in L(P). We shall show that it is possible to always find such a DPDA. We first modify a D P D A so that in any configuration with input remaining there is a next move. The next lemma shows how. LEMMA 2.27 Let P = (Q, E, r', ~, q0, z0, F) be a DPDA. We can construct an equivalent D P D A P' = (Q', Z, F', ~', q~, Z~, F') such that (1) For all a ~ E, q ~ Q' and Z ~ F', either (a) 8'(q, a, Z ) c o n t a i n s exactly one element and t~'(q, e , Z ) = ~ or (b) tS'(q, a, Z) = ~ and O'(q, e, Z) contains exactly one element. (2) If $(q, a, Z~,) = (r, ~,) for some a in ~ u {e}, then ~, = aZ~ for some a~F*.
Proof Z~ will act as an endmarker on the pushdown list to prevent the pushdown list from becoming completely empty. Let F ' = F u {Zg}, and. let Q' = {q~, q,} u Q. 6' is defined thus" (1) dV(q~, e, Zg) = (q0, ZoZ~) • (2) For all q E Q, a ~ Z u {e}, and Z ~ F such that t~(q, a , Z ) - ~ ~ , ~'(q, a, Z) = ~(q, a, Z). (3) If fi(q, e , Z ) = ~ and $(q, a , Z ) = ~ for some a ~ Z and Z ~ F, let $'(q, a, Z) = (q,, Z). (4) For all Z ~ r ' and a ~ ~, dV(q,, a, Z ) = (q,, Z). The first rule allows P ' to simulate P by having P' write Z0 on top of Z~ on the pushdown list and enter state qo. The rules in (2) permit P' to simulate P until no next move is possible. In such a situation P' will go into a nonfinal state q,, by rule (3), and remain there without altering the pushdown list, while consuming any remaining input. A proof that L(P') = L(P) is left for the Exercises. Q It is possible for a D P D A to make an infinite number of e-moves from some configurations without ever using an input symbol. We call these configurations looping. DEFINITION
Configuration (q, w, a) of D P D A P is looping if for all integers i there exists a configuration (p,, w, l?,) such that ]fl, I>--I~t and (q, w, 00 ~- (P~, w, fl~) ~ (P2, w, 1/2) F - ' " ". Thus a configuration is looping if P can make an infinite number of
SEC. 2.5
PUSHDOWN AUTOMATA
187
e-moves without creating a shorter pushdown list; that list might grow indefinitely or cycle between several different strings. Note that there are nonlooping configurations which after popping part of their list using e-moves enter a looping configuration. We shall show that it is impossible to make an infinite number of e-moves from a configuration unless a looping configuration is entered after a finite, calculable number of moves. If P enters a looping configuration in the middle of the input string, then P will not use any more input, even though P might satisfy Lemma 2.27. Given a D P D A P, we want to modify P to form an equivalent D P D A P' such that P' can never enter a looping configuration. ALGORITHM 2.16
Detection of looping configurations.
Input. D P D A P = (Q, E, I', ~, q0, Z0, F). Output. (1) C a = {(q, A)](q, e, A) is a looping configuration and there is no r in F such that (q, e, A) 1----(r, e, 00 for any 0~ ~ F*}, and (2) C2 = {(q,A)l(q, e,A) is a looping configuration and (q, e, A)~---(r, e, a) for some r ~ F and a ~ F*}.
Method. Let ~ Q = n t, # F = n 2, and let l be the length of the longest string written on the pushdown list by P in a single move. Let n 3 = n~(n~'"'~- n2)/(n z -- 1), where n3 = na if n2 = 1. n3 is the maximum number of e-moves P can make without looping. (1) For each q ~ Q and A ~ F determine whether (q, e, A) ~ (r, e, a) for some r e Q and ~ ~ F +. Direct simulation of P is used. If so, (q, e, A) is a looping configuration, for then we shall see that there must be a pair (q', A'), with q' ~ Q and A' E F, such that
(q, e, A) t--- (q', e, A'fl) (q', e, Ira(j- 1,) (q,, e, A' rJfl) where m > 0 and j > 0. Note that ~, can be e. (2) If (q, e, A) is a looping configuration, determine whether there is an r in F such that (q, e, A ) ~ (r, e, 00 for some 0 _< j < n 3. Again, direct simulation is used. If so, add (q, A) to C z. Otherwise, add (q, A) to C~. We claim that if P can reach a final configuration from (q, e, A), it must do so in n 3 or fewer moves. THEOREM 2.22 Algorithm 2.16 correctly determines C~ and C2.
Proof We first prove that step (1) correctly determines C1 U C2. If (q, A) is in C1 U C2, then, obviously, (q, e, A ) ~ (r, e, ~). Conversely, suppose that (q, e, A) ~ (r, e, ~). Case 1" There exists fl ~ r * , with I//I > nan21 and (q, e , A ) ~ (p, e, fl) ~--(r, e, ~ ) f o r some p ~ Q. If we consider, for j = 1, 2 , . . . , nan2l + 1, the configurations which P entered in the sequence of moves (q, e, A) (p, e, fl) the last time the pushdown list had length j, then we see that there must exist q' and A' such that at two of those times the state of P was q' and A' was on top of the list. In other words, we can write (q, e, A) l-- (q', e, A'6) (q', e, A'76 ) ~ (p, e, fl). Thus, (q, e, A) ~ (q', e, A'$) ~ (q', e, A'rJ~) for allj ~ 0 by Lemma 2.20. Here, m > 0, so an infinity of e-moves can be made from configuration (q, e, A), and (q, A) is in C~ U C 2. Case 2: Suppose that the opposite of case 1 is true, namely that for all fl such that (q, e, A) ~ (p, e, fl) ]--- (r, e, ~) we have I/~1 _< nln2l. Since there are n 3 -+- 1 different fl's, na possible states, and n2 -k- n~ -+- n~ -+- . . . -q- n~'"a = (n~~"~- n2)/(n 2 --1) possible pushdown lists of length at most nlnJ, there must be some repeated configuration. It is immediate that (q, A) is in Cx U C 2. The proof that step (2) correctly apportions C a U C2 between C a and C~ is left for the Exercises. E] DEFINITION
A DPDA P = (Q, Σ, Γ, δ, q0, Z0, F) is continuing if for all w ∈ Σ* there exist p ∈ Q and α ∈ Γ* such that (q0, w, Z0) ⊢* (p, e, α). Intuitively, a continuing DPDA is one which is capable of reading all of its input string.

LEMMA 2.28
Let P = (Q, Σ, Γ, δ, q0, Z0, F) be a DPDA. Then there is an equivalent continuing DPDA P'.
Proof. Let us assume by Lemma 2.27 that P always has a next move. Let P' = (Q ∪ {p, r}, Σ, Γ, δ', q0, Z0, F ∪ {p}), where p and r are new states. δ' is defined as follows:
(1) For all q ∈ Q, a ∈ Σ, and Z ∈ Γ, let δ'(q, a, Z) = δ(q, a, Z).
(2) For all q ∈ Q and Z ∈ Γ such that (q, e, Z) is not a looping configuration, let δ'(q, e, Z) = δ(q, e, Z).
(3) For all (q, Z) in the set C1 of Algorithm 2.16, let δ'(q, e, Z) = (r, Z).
(4) For all (q, Z) in the set C2 of Algorithm 2.16, let δ'(q, e, Z) = (p, Z).
(5) For all a ∈ Σ and Z ∈ Γ, δ'(p, a, Z) = (r, Z) and δ'(r, a, Z) = (r, Z).
Thus, P ' simulates P. If P enters a looping configuration, then P' will enter on the next move either state p or r depending on whether the loop of configurations contains or does not contain a final state. Then, under all inputs, P'
enters state r from p and stays in state r without altering the pushdown list. Thus, L(P') = L(P). It is necessary to show that P ' is continuing. Rules (3), (4), and (5) assure us that no violation of the "continuing" condition occurs if P enters a looping configuration. It is necessary to observe only that if P is in a configuration which is not looping, then within a finite number of moves it must either (1) Make a non-e-move or (2) Enter a configuration which has a shorter pushdown list. Moreover, (2) cannot occur indefinitely, because the pushdown list is initially of finite length. Thus either (1) must eventually occur or P enters a looping configuration after some instance of (2). We may conclude that P ' is continuing. [~] We can now prove an important property of DPDA's, namely that their languages are closed under complementation. We shall see in the next section that this is not true for the class of all CFL's. THEOREM 2.23
If L = L(P) for a DPDA P, then the complement of L is L(P') for some DPDA P'.
Proof. We may, by Lemma 2.28, assume that P is continuing. We shall construct P' to simulate P and record, between two shifts of its input head, whether or not P has entered an accepting state. Since P' must accept the complement of L(P), P' accepts an input if P has not accepted it and is about to shift its input head (so P could not subsequently accept that input). Formally, let P = (Q, Σ, Γ, δ, q0, Z0, F) and P' = (Q', Σ, Γ, δ', q0', Z0, F'), where
(1) Q' = {[q, i] : q ∈ Q, i ∈ {0, 1, 2}},
(2) q0' = [q0, 0] if q0 ∉ F and q0' = [q0, 1] if q0 ∈ F, and
(3) F' = {[q, 2] : q ∈ Q}.
The states [q, 0] are intended to mean that P has not been in a final state since it last made a non-e-move; [q, 1] states indicate that P has entered a final state in that time; [q, 2] states are used only as final states. If P' is in a [q, 0] state and P (in simulation) is about to make a non-e-move, then P' first enters state [q, 2] and then simulates P. Thus, P' accepts if and only if P does not accept. The fact that P is continuing assures us that P' will always get a chance to accept an input if P does not. The formal definition of δ' follows:
(i) If q ∈ Q, a ∈ Σ, and Z ∈ Γ, then δ'([q, 1], a, Z) = δ'([q, 2], a, Z) = ([p, i], γ), where δ(q, a, Z) = (p, γ), i = 0 if p ∉ F, and i = 1 if p ∈ F.
(ii) If q ∈ Q, Z ∈ Γ, and δ(q, e, Z) = (p, γ), then δ'([q, 1], e, Z) = ([p, 1], γ) and δ'([q, 0], e, Z) = ([p, i], γ), where i = 0 if p ∉ F and i = 1 if p ∈ F.
(iii) If δ(q, e, Z) = ∅, then δ'([q, 0], e, Z) = ([q, 2], Z).
Rule (i) handles non-e-moves. The second component of the state is set to 0 or 1 properly. Rule (ii) handles e-moves; again the second component of the state is handled as intended. Rule (iii) allows P' to accept an input exactly when P does not. A formal proof that L(P') is the complement of L(P) will be omitted. □

There are a number of other important properties of deterministic CFL's. We shall defer the discussion of these to the Exercises and the next section.
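A DPDA, by contrast, needs no search at all, since at most one move applies in any configuration. The sketch below is ours; the move bound is only a crude safeguard against looping configurations (compare Algorithm 2.16), and the transition table is a transcription of Example 2.37.

    def dpda_accepts(delta, q0, Z0, F, w, max_moves=10000):
        """Run a DPDA: delta maps (state, input symbol or '', top symbol) to a
        single (state, string pushed in place of the top) pair; top at the left."""
        q, stack, i = q0, Z0, 0
        for _ in range(max_moves):
            if i == len(w) and q in F:
                return True                    # a final configuration is reached
            if not stack:
                return False                   # no move on an empty pushdown list
            Z = stack[0]
            if i < len(w) and (q, w[i], Z) in delta:
                q, gamma = delta[(q, w[i], Z)]
                stack, i = gamma + stack[1:], i + 1
            elif (q, '', Z) in delta:
                q, gamma = delta[(q, '', Z)]
                stack = gamma + stack[1:]
            else:
                return False
        return False

    # δ of Example 2.37 for {w c w^R | w in {a, b}+}
    delta = {('q0', x, y): ('q0', x + y) for x in 'ab' for y in 'Zab'}
    delta.update({('q0', 'c', y): ('q1', y) for y in 'ab'})
    delta.update({('q1', x, x): ('q1', '') for x in 'ab'})
    delta[('q1', '', 'Z')] = ('q2', '')
    print(dpda_accepts(delta, 'q0', 'Z', {'q2'}, 'abcba'))   # True
    print(dpda_accepts(delta, 'q0', 'Z', {'q2'}, 'abcab'))   # False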
EXERCISES
2.5.1.
Construct PDA's accepting the complements (with respect to {a, b}*) of the following languages:
(a) {a^n b^n a^n | n ≥ 1}.
(b) {ww^R | w ∈ {a, b}*}.
(c) {a^m b^n a^m b^n | m, n ≥ 1}.
(d) {ww | w ∈ {a, b}*}.
Hint: Have the nondeterministic PDA "guess" why its input is not in the language and check that its guess is correct.
2.5.2.
Prove that the PDA of Example 2.31 accepts [wwR[ W ~ (a, b}+}.
2.5.3.
Show that every CFL is accepted by a PDA which never increases the length of its pushdown list by more than one on a single move.
2.5.4.
Show that every CFL is accepted by a PDA P = (Q, Σ, Γ, δ, q0, Z0, F) such that if (p, γ) is in δ(q, a, Z), then either γ = e, γ = Z, or γ = YZ for some Y ∈ Γ. Hint: Consider the construction of Lemma 2.21.
2.5.5.
Show that every CFL is accepted by a PDA which makes no e-moves. Hint: Recall that every CFL has a grammar in Greibach normal form.
2.5.6.
Show that every CFL is L(P) for some two-state PDA P.
2.5.7.
Complete the proof of Lemma 2.23.
2.5.8.
Find bottom-up and top-down recognizers (PDA's) for the following grammars: (a) S ---, aSb [e. (b) S - - ~ AS[b A - ~ SAla.
(c) s ~ , SSlA A--~OAliSIO1. 2.5.9.
2.5.9.
Find a grammar generating L(P), where P = ({q0, q1, q2}, {a, b}, {Z0, A}, δ, q0, Z0, {q2})
and δ is given by

    δ(q0, a, Z0) = (q1, AZ0)
    δ(q0, a, A) = (q1, AA)
    δ(q1, a, A) = (q0, AA)
    δ(q1, e, A) = (q2, A)
    δ(q2, b, A) = (q2, e)

Hint: It is not necessary to construct the productions for useless nonterminals.
*2.5.10.
Show that if P = (Q, Σ, Γ, δ, q0, Z0, F) is a PDA, then the set of strings which can appear on the pushdown list is a regular set. That is, show that {α | (q0, w, Z0) ⊢* (q, x, α) for some q, w, and x} is regular.
2.5.11.
Complete the proof of Lemma 2.26.
2.5.12.
Let P be a P D A for which there is a constant k such that P can never have more than k symbols on its pushdown list at any time. Show that L(P) is a regular set.
2.5.13.
Give D P D A ' s accepting the following languages: (a) [0~1J IJ ~ i}. (b) [wl w consists of an equal number of a's and b's}. (c) L(Go), where Go is the usual grammar for rudimentary arithmetic expressions.
2.5.14.
Show that the D P D A of Example 2.36 accepts {wewRI w E {a, b}+}.
2.5.15.
Show that if the construction of Lemma 2.21 is applied to an extended DPDA, then the result is a DPDA.
2.5.16.
Prove that P and P' in Lemma 2.27 accept the same language.
2.5.17.
Prove that step (2) of Algorithm 2.16 correctly distinguishes C1 from Cz.
2.5.18.
Complete the proof of Theorem 2.23.
2.5.19.
The PDA's we have defined make a move independent of their input unless they move their input head. We could relax this restriction and allow the input symbol scanned to influence the move even when the input head remains stationary. Show that this extension still accepts only the CFL's.
*2.5.20.
We could further augment the PDA by allowing it to move two ways on the input. Also, let the device have endmarkers on the input. We call such an automaton a 2PDA, and if it is deterministic, a 2DPDA. Show that the following languages can be recognized by 2DPDA's:
(a) {a^n b^n c^n | n ≥ 1}.
(b) {ww | w ∈ {a, b}*}.
(c) {a^(2^n) | n ≥ 1}.
2.5.21.
Show that a 2PDA can recognize {wxw | w and x are in {0, 1}⁺}.
Open Questions
2.5.22.
Does there exist a language accepted by a 2PDA that is not accepted by a 2DPDA ?
2.5.23.
Does there exist a CFL which is not accepted by any 2DPDA ?
Programming Exercises
2.5.24.
Write a program that simulates a deterministic PDA.
*2.5.25.
Devise a programming language that can be used to specify pushdown automata. Construct a compiler for your programming language. A source program in the language is to define a PDA P. The object program is to be a recognizer which given an input string w simulates the behavior of P on w in some reasonable sense.
2.5.26.
Write a program that takes as input CFG G and constructs a nondeterministic top-down (or bottom-up) recognizer for G.
BIBLIOGRAPHIC NOTES
The importance of pushdown lists, or stacks, as they are also known, in language processing was recognized by the early 1950's. Oettinger [1961] and Schutzenberger [1963] were the first to formalize the concept of a pushdown automaton. The equivalence of pushdown automaton languages and context-free languages was demonstrated by Chomsky [1962] and Evey [1963]. Two-way pushdown automata have been studied by Hartmanis et al. [1965], Gray et al. [1967], Aho et al. [1968], and Cook [1971].
2.6. PROPERTIES OF CONTEXT-FREE LANGUAGES
In this section we shall examine some of the basic properties of context-free languages. The results mentioned here are actually a small sampling of the great wealth of knowledge about context-free languages. In particular, we shall discuss some operations under which CFL's are closed, some decidability results, and matters of ambiguous context-free grammars and languages.

2.6.1. Ogden's Lemma
We begin by proving a theorem (Ogden's lemma) about context-free grammars from which we can derive a number of results about context-free languages. From this theorem we can derive a "pumping lemma" for context-free languages.
DEFINITION
A position in a string of length k is an integer i such that 1 ≤ i ≤ k. We say that symbol a occurs at position i of string w if w = w1aw2 and |w1| = i - 1. For example, the symbol a occurs at the third position of the string baacc.
THEOREM 2.24
For each CFG G = (N, Σ, P, S), there is an integer k ≥ 1 such that if z is in L(G), |z| ≥ k, and any k or more distinct positions in z are designated as being "distinguished," then z can be written as uvwxy such that
(1) w contains at least one of the distinguished positions.
(2) Either u and v both contain distinguished positions, or x and y both contain distinguished positions.
(3) vwx has at most k distinguished positions.
(4) There is a nonterminal A such that
S ~
+
uAy ~ G
for
+
uvAxy G
all integers i (including +
+
:- . . . ~ G
+
uv~Axty ~ G
uv~wxiy G
i = 0, in which case the derivation
is
+
S ==~ u A y ==~ uwy). o
G
Proof. Let m = ~ N
and l be the length of the longest right side of a production in P. Choose k = l 2"+3, and consider a derivation tree T for some sentence z in L(G), where Jz [ ~ k and at least k positions of z are designated distinguished. Note that T must contain at least one path of length at least 2m -k- 3. We can distinguish those leaves of T which, in the frontier z of T, fill the distinguished positions. Let us call node n of T a branch node if n has at least two direct descendants, say n 1 and n 2, such that nl and n 2 both have distinguished leaves as descendants. We construct a path nl, n 2 , . . , in T as follows" (1) nl is the root of T. (2) If we have found n i and only one of nt's direct descendants has distinguished leaves among its descendants (i.e., nt is not a branch node), then let n~+~ be that direct descendant of n~. (3) If n~ is a branch node, choose n~+~ to be that direct descendant of nt with the largest number of distinguished leaves for descendants. If there is a tie, choose the rightmost (this choice is arbitrary). (4) If ni is a leaf, terminate the path. Let nl, n 2 , . . . ,np be the path so constructed. A simple induction on i shows that if n i , . . . , n~ have r branch nodes among them, then ni+l has at least 12m+3-r distinguished descendants. The basis, i = 0, is trivial; r = 0,
194
ELEMENTSOF LANGUAGE THEORY
CHAP. 2
and nl has at least k = 12m+3 distinguished descendants. For the induction, observe that if n~ is not a branch node, then n~ and n~+ 1 have the same number of distinguished descendants, and that if n~ is a branch node, n~+ 1 has at least 1/Ith as many. Since n~ has l 2m+3 distinguished descendants, the path n l , . . . , np has at least 2m -k 3 branch nodes. Moreover, np is a leaf and so is not a branch node. Thus, p > 2m -q- 3. Let ba,b2,..., b2m+3 be the last 2m ÷ 3 branch nodes in the path n l , . . . ,n~. We call bt a left branch node if a direct descendant of b~ not on the path has a distinguished descendant to the left of np and a right branch node otherwise. We assume that at least m -t- 2 of b l , . . . , b2m+3 are left branch nodes. The case in which at least m -k 2 are right branch nodes is handled analogously. Let 1 1 , . . . , lm+z be the last m - q - 2 left branch nodes in the sequence b~,..., b2m+3.Since ~ N = m, we can find two nodes among lz, . . . , lm+2, say Is and 18, such that f < g, and the labels of Ir and lg are the same, say A. This situation is depicted in Fig. 2.17. The double line represents the path n 1, . . . , np; . ' s represent distinguished leaves, but there may be others. If we delete all of ls's descendants, we have a derivation tree with frontier uAy, where u represents those leaves to the left of If and y represents those +
to the right. Thus, S ==~ uAy. If we consider the subtree dominated by 1r +
with the descendants of lg deleted, we see that A ==~ vAx, where v and x are the frontiers from the descendant leaves of lr to the left and right, respectively, of Ig. Finally, let w be the frontier of the subtree dominated by Ig. +
Then A ~
w. We observe that z = uvwxy.
o Fig. 2.17
w D e r i v a t i o n tree T.
x
y
SEC. 2.6
P R O P E R T I E S OF C O N T E X T - F R E E L A N G U A G E S
4-
195
+
Putting all these derivations together, we have S =-~ u A y ==~ uwy, and +
+
+
+
+
+
for all i > 1, S =-~ u A y ==~ u v A x y =-~ uv2AxZy ==~. . . ==~ uvtAxty ==~ uvtwxty. Thus condition (4) is satisfied. Moreover, u has at least one distinguished position, the descendant of some direct descendant of 1~. v likewise has at least one distinguished position, descending from 1r. Thus condition (2) is satisfied. Condition (1) is satisfied, since w has a distinguished position, namely np. To see that condition (3), that v w x has no more than k distinguished positions, is satisfied, we observe that b 1, being the 2m -t-- 3rd branch node from the end of path n 1, . . . , np, has no more than k distinguished positions. Since 1r is a descendant of b 1, our desired result is immediate. We should also consider the alternative case in which at least m -+ 2 of b l, . . . , b2,,+ 3 are right branch nodes. However, this case is handled symmetrically, and we shall find condition (2) satisfied because x and y each have distinguished positions. [~ An important corollary of Ogden's lemma is what is usually referred to as the pumping lemma for context-free languages. COROLLARY
Let L be a CFL. Then there exists a constant k such that if lzl >_ k and z ~ L, then we can write z = u v w x y such that v x ~ e, l v w x [ < k, and for all i, uvtwxty is in L. P r o o f In Theorem 2.24, choose any C F G for L and let all positions of each sentence be distinguished. [~]
It is the corollary to Theorem 2.24 that we most often use when proving certain languages not to be context-free. Theorem 2.24 itself will be used when we talk about inherent ambiguity of CFL's in Section 2.6.5. Example 2.38
Let us use the pumping lemma to show that L = [a"'[n > 1} is not a CFL. If L were a CFL, then we would have an integer k such that if n 2 > k, then a"' = uvwxy, where v and x are not both e and i vwxl < k. In particular, let n be k itself. Certainly k 2 > k. Then uv2wxZy is supposedly in L. But since Ivwxl_< k, we have 1 _ 1} is not a context-free language. Thus the context-free languages are not closed under intersection. We can also conclude that the context-free languages are not closed under complement. This follows from the fact that any class of languages closed under union and complement must also be closed under intersection, using De Morgan's law. The CFL's are closed under union by the corollary to Theorem 2.25. [Z]
There are many other operations under which the context-free languages are closed. Some of these operations will be discussed in the Exercises. We shall conclude this section by providing a few applications of closure properties in showing that certain sets are not context-free languages. Example 2.41 L = {ww[w ~ {a, b} +} is not a context-free language. Suppose that L were context-free. Then L' = L A a+b +a+b + = {amb"amb" l m, n > 1} would also be context-free by Theorem 2.26. But from Exercise 2.6.3(e), we know that L' is not a context-free language. [Z]
Example 2.42 L = { w w l w ~ {c,f} +} is not a context-free language. Let h be the homomorphism h ( c ) = a and h ( f ) = b. Then h ( L ) = { w w l w ~ {a, b}+}, which by the previous example is not a context-free language. Since the CFL's are closed under homomorphism (corollary to Theorem 2.25), we conclude that L is not a CFL. [2]
Example 2.43
A L G O L is not a context-free language. Consider the following class of A L G O L programs: L = {begin integer w; w : =
1;
endtw is
any string in {c,f}+}.
Let LA be the set of all valid A L G O L programs. Let R be the regular set denoted by the regular expression begin integer (c + f ) + ; (c ÷ f ) +
: = 1; end
Then L = LA ~ R. Finally let h be the homomorphism such that h(c) = c, h ( f ) = f , and h ( X ) = e otherwise. Then h(L) = {ww [w e {c, f}+}. Consequently, if LA is context-free, then h(LA ~ R) must also be contextfree. However, we know that h(LA ~ R) is not context-free so we must conclude that L A, the set of all valid A L G O L programs, is not a context-free language.
SEC. 2.6
PROPERTIES OF CONTEXT-FREE LANGUAGES
199
Example 2.43 shows that a programming language requiring declaration of identifiers which can be arbitrarily Iong is not context-free. In a compiler, however, identifiers are usually handled by the lexical analyzer and reduced to single tokens before reaching the syntactic analyzer. Thus the language that is to be recognized by the syntactic analyzer usually can be considered to be a context-free language. There are other non-context-free aspects of A L G O L and many other languages. For example, each procedure takes the same number of arguments each time it is mentioned. It is thus possible to show that the language which the syntactic analyzer sees is not context-free by mapping programs with three calls of the same procedure to {0"10"10" In > 0}, which is not a CFL. Normally, however, some process outside of syntactic analysis is used to check that the number of arguments to a procedure is consistent with the definition of the procedure. 2.6.3.
Decidability Results
We have already seen that the emptiness problem is decidable for contextfree grammars. Algorithm 2.7 will accept any context-free grammar G as input and determine whether or not L(G) is empty. Let us consider the membership problem for CFG's. We must find an algorithm which given a context-free grammar G = (N, E, P, S) and a word w in E*, will determine whether or not w is in L(G). Obtaining an efficient algorithm for this problem will provide much of the subject matter of Chapters 4-7. However, from a purely theoretical point of view we can immediately conclude that the membership problem is solvable for CFG's, since we can always transform G into an equivalent proper contextfree grammar G' using the transformations of Section 2.4.2. Neglecting the empty word, a proper context-free grammar is a context-sensitive grammar, so we can apply the brute force algorithm for deciding the membership problem for context-sensitive grammars to G'. (See Exercise 2.1.19.) Let us consider the equivalence problem for context-free grammars. Unfortunately, here we encounter a problem which is not decidable. We shall prove that there is no algorithm which, given any two CFG's G1 and G2, can determine whether L(G1) = L(G2). In fact, we shall show that even given a C F G G1 and a right-linear grammar G2 there is no algorithm to determine whether L(Gi) = L(G2). As with most undecidable problems, we shall show that if we can solve the equivalence problem for CFG's, then we can solve Post's correspondence problem. We can construct from an instance of Post's correspondence problem two naturally related context-free languages. DEFINITION
Let C = (x 1, y l ) , . . . , (x,, y,) be an instance of Post's problem over alphabet E. Let I = {1, 2 , . . . , n}, assume that I ~ E = .•, and let Lc
200
ELEMENTSOF LANGUAGE THEORY
be { x ~ , x , . . . X~,.i,,,ir,,-t''" i 1 1 i ~ , ' ' ' , im a r e in L m ~> 1}. Let {YtlY, " " Yi,.i,,,ir,,-t "'" it lit, . . . ,im a r e i n / , m > 1}.
CHAP. 2
Mc
be
LEMMA 2.29 Let C - - ( x t , Y t ) , . . . , (x,, y,) be an instance of Post's correspondence problem over X, where X rq {1, 2 . . . . , n} = ~ . Then (1) We can find extended D P D A ' s accepting Lc and M o (2) Lc n Mc -- ~ if and only if C has no solution. Proof (1) It is straightforward to construct an extended D P D A (with pushdown top on the right) which stores all symbols in ~: on its pushdown list. When symbols from { 1 , . . . , n} appear on the input, it pops x~ from the top of its list if integer i appears on the input. If x~ is not at the top of the list, the D P D A halts. The D P D A also checks with its finite control that its input is in X + { 1 , . . . , n} + and accepts when all symbols from X are removed from the pushdown list. Thus, L c is accepted. We may find an extended D P D A for M c similarly. (2) If L c ~ Mc contains the sentence wire "'" it, where w is in X +, then w is clearly a viable sequence. If xil "-" x~ = yi~ - . . y~, = w, then wim . . . i t will be in Lc ~ M o D Let us return to the equivalence problem for CFG's. We need two additional languages related t o an instance of Post's correspondence problem. DEFINITION
Let C - - ( x ~ , Y t ) , . . . , (x,, y,) be an instance of Post's correspondence problem over X and let I = ~ 1 , . . . , n}. Assume that X (3 I = ~ . Define Q____c= {w4pwR!] w is in x+I+}, where # is not in ]g or L Define Pc = Lc:#:Mg. LEMMA 2.30 Let C be as above. Then (1) We can find extended D P D A ' s accepting Qc and Pc, and (2) Qc ~ Pc = ~ if and only if C has no solution. Proof. (1) A D P D A accepting Qc can be constructed easily. For Pc, we know by L e m m a 2.29 that there exists a D P D A , say M~, accepting L o To find a D P D A M2 that accepts M g is not much harder; one stores integers and checks them against the portion of input that is in X+. Thus we can construct D P D A M3 to simulate Mr, check for :#, and then simulate M2. (2) If u v ~ w x is in Qc ~ Pc, where u and x are in X + and v and w in 1 +, then u = x ~, and v = wR, because u v @ w x is in Qo Because u v ~ w x is in Pc, u is a viable sequence. Thus, C has a solution. Conversely, if we have x i , ' " xi,. = Yii "'" Yt,., then x~, . . . Xt,,i m ' ' ' it ~ il "'" i,,,x~.,.., xi, is in Qc A P o
SEC. 2.6
PROPERTIES OF CONTEXT-FREE LANGUAGES
201
LEMMA 2.31
Let C be as above. Then (1) We can find a C F G for Qc u Pc, and (2) Qc u Pc = (E u I)* if and only if C has no solution.
Proof. (1) From the closure of deterministic CFL's under complement (Theorem 2.23), we can find DPDA's for Qc and Pc. From the equivalence of CFL's and PDA languages (Lemma 2.26), we can find CFG's for these languages. From closure of CFL's under union, we can find a C F G for Qc u Pc. (2) Immediate from Lemma 2.30(2) and De Morgan's law. D We can now show that it is undecidable whether two CFG's generate the same language. In fact, we can prove somethingstronger" It is still undecidable even if one of the grammars is right-linear. THEOREM 2.28 It is undecidable for a C F G G~ and a right-linear grammar G 2 whether
L(G~) = L(G~). Proof. If not, then we could decide Post's problem as follows" (1) Given an instance C, construct, by Lemma 2.31, a C F G G~ generating
Qc L9 Pc, and construct a right-linear grammar G z generating the regular set (~ U I)*, where C is over ~, C has lists of length n, and I = {1 . . . . , n}. Again, some renaming of symbols may first be necessary, but the existence or nonexistence of a solution is left intact. (2) Apply the hypothetical algorithm to determine if L(G~)= L(G2). By Lemma 2.31(2), this equality holds if and only if C has no solution. [Z] Since there are algorithms to convert a C F G to a PDA and vice versa, Theorem 2.28 also implies that it is undecidable whether two PDA's, or a PDA and a finite automaton, recognize the same language, whether a PDA recognizes the set denoted by a regular expression, and so forth. 2.6.4.
Properties of Deterministic CFL's
The deterministic context-flee languages are closed under remarkably few of the operations under which the entire class of context-free languages is closed. We already know that the deterministic CFL's are closed under complement. Since La = {a~b~cJ[i,j ~ 1} and L2 ---- {a~bJcJ[i,j ~ 1} are both deterministic CFL's and L~ ~ Lz = {a"b"c"[n ~ 1} is a language which is not context-free (Example 2.39), we have the following two nonclosure properties. THEOREM 2.29 The class of deterministic CFL's is not closed under intersection or union.
202
ELEMENTS OF LANGUAGE THEORY
CHAP. 2
Proof. Nonclosure under intersection is immediate from the above. Nonclosure under union follows from De Morgan's law and closure under complement. The deterministic CFL's form a proper subset of the CFL's, as we see from the following example.
Example 2.44 We can easily show that the complement of L : {a"bnc"ln ~ 1} is a CFL. The sentence w ~ L if and only if one or more of the following hold" (1) w is not in a+b÷c÷. (2) w : aib@k, and i ~ j. (3) w : atbJck, and j ~ k. The set satisfying (1) is regular, and the sets satisfying (2) and (3) are each context-free, as the reader can easily show by constructing nondeterministic PDA's recognizing them. Since the CFL's are closed under union, L is a CFL. But if L were a deterministic CFL, then L would be likewise, by Theorem 2.23. But L is not even a CFL. [Z] The deterministic CFL's have the same positive decidability results as the CFL. That is, given a D P D A P, we can determine whether L ( P ) = ~, and given an input string w, we can easily determine whether w is in L(P). Moreover, given a deterministic D P D A P and a regular set R, we can determine whether L(P)= R, since L(P)= R if and only if we have (L(P)n R) U (L(P) n R) = ~. (L(P) n /~) U (L(P) N R) is easily seen to be a CFL. Other decidability results appear in the Exercises. 2.6.5.
Ambiguity
Recall that a context-free g r a m m a r G = (N, E, P, S) is ambiguous if there is a sentence w in L(G) with two or more distinct derivation trees. Equivalently, G is ambiguous if there exists a sentence w with two distinct leftmost (or rightmost) derivations. When we are using a g r a m m a r to help define a programming language we would like that g r a m m a r to be unambiguous. Otherwise, a programmer and a compiler may have differing opinions as to the meaning of some sentences.
Example 2.45 Perhaps the most famous example of ambiguity in a programming language is the dangling else. Consider the grammar G with productions S
> if b then S else S[if b then S la
SEC. 2.6
PROPERTIES OF CONTEXT-FREE LANGUAGES
203
G is ambiguous since the sentence if b then if b then a else a
has two derivation trees as shown in Fig. 2.18. The derivation tree in Fig. 2.18(a) imposes the interpretation if b then (if b then a) else a
while the tree in Fig. 2.18(b) gives if b then (if b then a else a)
S if
(a) S if
b
then if
S b
then
S
else
i
a
(b) Fig. 2.18 Two derivation trees. We might like to have an algorithm to determine whether an arbitrary C F G is unambiguous. Unfortunately, such an algorithm does not exist. THEOREM 2.30 It is undecidable whether a C F G G is ambiguous.
Proof. Let C = (x 1, Yl), • • •, (x,, y,,) be an instance of Post's correspondence problem over X. Let G be the C F G (IS, A, B}, X U / , P, S), where I = {1, 2 , . . . , hi, and P contains the productions
--
7
204
CHAP. 2
ELEMENTS OF LANGUAGE THEORY
S
>AIB
A
> x~Ailxti,
B~
yiBilyti,
for 1 < i < n for 1 < i < n
The nonterminals A and B generate the languages Lc and Mc, respectively, defined on p. 199. It is easy to see that no sentence has more than one distinct leftmost derivation from A, or from B. Thus, if there exists a sentence with two leftmost derivations from S, one must begin with S ~ A, and Im
the other with S ~
B. But by Lemma 2.29, there is a sentence derived from
Im
both A and B if and only if instance C of Post's problem has a solution. Thus, G is ambiguous if and only if C has a solution. It is then a straightforward matter to show that if there were an algorithm to decide the ambiguity of an arbitrary CFG, then we could decide Post's correspondence problem. [~] Ambiguity is a function of the grammar rather than the language. Certain ambiguous grammars may have equivalent unambiguous ones. Example 2.46
Let us consider the grammar and language of the previous example. The reason that grammar G is ambiguous is that an else can be associated with two different then's. For this reason, programming languages which allow both if-then--else and if-then statements can be ambiguous. This ambiguity can be removed if we arbitrarily decide that an else should be attached to the last preceding then, as in Fig. 2.18(b). We can revise the grammar of Example 2.45 to have two nonterminals, S~ and $2. We insist that $2 generate if-then-else, while S~ is free to generate either kind of statement. The rules of the new grammar are S1
> if b then S~lif b then $2 else Sxla
$2
> if b then $2 else S2la
The fact that only $2 precedes else ensures that between the then-else pair generated by any one production must appear either the single symbol a or another else. Thus the structure of Fig. 2.18(a) cannot occur. In Chapter 5 we shall develop deterministic parsing methods for various grammars, including the current one, and shall be able at that time to prove our new grammar to be unambiguous. [Z] Although there is no general algorithm which can be used to determine if a grammar is ambiguous, it is possible to isolate certain constructs in productions which lead to ambiguous grammars. Since ambiguous grammars are
SEC. 2.6
PROPERTIES OF CONTEXT-FREE LANGUAGES
205
often harder to parse than unambiguous ones, we shall mention some of the more common constructs of this nature here so that they can be recognized in practice. A proper grammar containing the productions A--~ AAI;o~ will be ambiguous because the substring A A A has two parsesA
A
A
/\ A
/\ A
A
A
This ambiguity disappears if instead we use the productions
A
>AB1B
B
>~
A
>BAIB
B
>¢
or the productions
Another example of an ambiguous production is A ~ AocA. The pair of productions A --~ ocA ]A fl introduces ambiguity since A ~=~ ocA ~ ~A fl and A ~> Aft ~=~ ~Afl imply two distinct leftmost derivations of ocAfl. A slightly more elaborate pair of productions which gives rise to an ambiguous grammar is A - - ~ ocA ]ocArlA. Other exampIes of ambiguous grammars can be found in the Exercises. We shall call a CFL inherently ambiguous if it has no unambiguous CFG. It is not at first obvious that there is such a thing as an inherently ambiguous CFL, but we shall present one in the next example. In fact, it is undecidable whether a given C F G generates an inherently ambiguous language (i.e., whether there exists an equivalent unambiguous CFG). However, there are large subclasses of the CFL's known not to be inherently ambiguous and no inherently ambiguous programming languages have been devised yet. Most important, every deterministic CFL has an unambiguous grammar, as we shall see in Chapter 8. Example 2.47
Let L - - [ a i b : c 1 ] i - - j or j ~ l~. L is an inherently ambiguous CFL. Intuitively, the reason is that the words with i - - j must be generated by a set of productions different from those generating the words with j - - I. At least some of the words with i -- j -- l must be generated by both mechanisms.
206
ELEMENTSOF LANGUAGE THEORY
CHAP. 2
One C F G for L is S: ~ > A B I D C A
> aAle
B
> bBcle
C
> cCl e
D
> aDb[e
Clearly the above grammar is ambiguous. We can use Ogden's lemma to prove that L is inherently ambiguous. Let G be an arbitrary grammar for L, and let k be the constant associated with G in Theorem 2.24. If that constant is less than 3, let k = 3. Consider the word z = akbkc k+k', where the a's are all distinguished. We can write z = uvwxy. Since w has distinguished positions, u and v consist only of a's. If x consists of two different symbols, then uvZwx~y is surely not in L, so x is either in a*, b*, or c*. If x is in a*, uv2wx2y would be ak+Pbkc k+k t for some p, 1 ~ p ~ k, which is not in L. If x is in c*, uv2wx2y would be ak+P'bkc k÷kt÷~'', where 1 ~ pt ~ k. This word likewise is not in L. In the second case, where x is in b*, we have uv2wx2y = ak+P'bk+P'c k÷k:, where 1 ~ p~ ~ k. If this word is in L, then either p~ = P2 or p~ :¢: P2 and p~ = k! In the latter case, uv3wx3y = a k + 2 P l b k + Z r ' C k+kt is surely not in L. So we conclude that Pl = P2- Observe that pl = Iv[ and pz ----[x]. By Theorem 2.24, there is a derivation +
(2.6.1)
S ~
+
+
uAy ~
uvmAx'y
~
btvmwxmy
for all m :> 0
In particular, let m = k ! / p 1. Since 1 ~ pl ~ k, we know that m is an integer. Then uv'nwxmy : ak+k:bk+ktC k+k:. A symmetric argument starting with the word ak+k~bkc k shows that there exist u !, v !, W ! , x', y', where only u' has an a, v' is in b*, and there is a nonterminal B such that +
S ~
(2.6.2)
+
u'By' ~
+
u'(v')m'B(x')m'y ' ~
U'(v')m'w(x')m'y
= ak+k!bk+ktck+kt
If we can show that the two derivations of ak+k~be+ktCk÷kt have different derivation trees, then we shall have shown that L is inherently ambiguous, since G was chosen without restriction and has been shown ambiguous. Suppose that the two derivations (2.6.1) and (2.6.2) have the same derivation tree. Since A generates a's and b's and B generates b's and c's, neither A nor B could appear as a label of a descendant of a node labeled by the
EXERCISES
207
other. Thus there exists a sentential f o r m tiAt2Bt3, where the t's are terminal strings. F o r all i a n d j, tlv~wxq2(v')iw'(x')it3 would presumably be in L. But I vl = ]x[ a n d [v'l = Ix' [. Also, x a n d v' consist exclusively of b's, v consists of a's, a n d x' consists of c's. Thus choosing i a n d j equal a n d sufficiently large will ensure that the above w o r d has m o r e b's than a's or o's. W e m a y thus conclude that G is a m b i g u o u s and that L is inherently ambiguous. [ ]
EXERCISES
2.6.1.
Let L be a context-free language and R a regular set. Show that the following languages are context-free: (a) INIT(L). (b) FIN(L). (c) SUB(L). (d) L/R. (e) L ~ R. The definitions of these operations are found in the Exercises of Section 2.3 on p. 135.
2.6.2.
Show that if L is a CFL and h a homomorphism, then h-a(L) is a CFL. Hint: Let P be a PDA accepting L. Construct P' to apply h to each of its input symbols in turn, store the result in a buffer (in the finite control), and simulate P on the symbols in the buffer. Be sure that your buffer is of finite length.
2.6,3.
Show that the following are not CFL's: (a) [aibic j [j < i]. (b) {a~bJckl i < j < k}. (c) The set of strings with an equal number of a's, b's, and c's. (d) [a'bJaJb' IJ ~ i]. (e) [amb"a'~b ~ [m, n ~ 1]. (f) {a~biek[none of i, j, and k are equal}. (g) {nHa ~ [n is a decimal integer > 1}. (This construct is representative of F O R T R A N Hollerith fields.)
**2.6.4.
Show that every CFL over a one-symbol alphabet is regular. Hint: Use the pumping lemma.
**2.6.$.
Show that the following are not always CFL's when L is a CFL: (a) MAX(L). (b) MIN(L). (c) L 1/2 = [x[ for some y, xy is in L and [xl = [y[}.
*2.6.6.
2.6.7. '2.6.8.
Show the following pumping lemma for linear languages. If L is a linear language, there is a constant k such that if z ~ L and Iz] > k, then z = uvwxy, where[uvxy] < k, vx ~ e and for all i, uvtwx~y is in L. Show that [a"b"amb m In, m > 1} is not a linear language. A one-turn PDA is one which in any sequence of moves first writes
208
CHAP. 2
ELEMENTS OF LANGUAGE THEORY
symbols on the pushdown list and then pops symbols from the pushdown list. Once it starts popping symbols from the pushdown list, it can then never write on its pushdown list. Show that a C F L is linear if and only if it can be recognized by a one-turn PDA. *2.6.9.
Let G = (N, 2~, P, S) be a CFG, Show that the following are CFL's" (a) {tg[S ==~ tg}. Im
(b) { ~ l s ~ ~}. rm
(c) (~ls=~ ~]. 2.6.10.
Give details of the proof of the corollary to Theorem 2.25.
2.6.11.
Complete the proof of Theorem 2.26.
2.6.12.
Give formal constructions of the D P D A ' s used in the proofs of Lemmas 2.29( 1) and 2.30(1 ).
"2.6.13.
Show that the language Qc n Pc of Section 2.6.3 is a C F L if and only if it is empty.
2.6.14.
Show that it is undecidable for C F G G whether (a) L(G) is a CFL. (b) L(G) is regular. (c) L(G) is a deterministic CFL. Hint: Use Exercise 2.6.13 and consider a C F G for Qc u Pc.
2.6.15.
Show that it is undecidable whether context-sensitive grammar G generates a CFL.
2.6.16.
Let G1 and G2 be CFG's.
Show that it is undecidable whether
L(G1) n L(Gz) = ~ . "2.6.17.
Let G1 be a C F G and Gz a right-linear grammar. Show that (a) It is undecidable whether L(Gz) ~ L(G1). (b) It is decidable whether L(Gi) ~ L(G2).
"2.6.18.
Let (a) (b) (c) (d)
P1 and Pz be DPDA's. Show that it is undecidable whether
L(Pi) U L(Pz) is a deterministic CFL. L(Pa)L(P2) is a deterministic CFL. L(P1) ~ L(Pz). L(P1)* is a deterministic CFL.
*'2.6.19,
Show that it is decidable, for D P D A P, whether L(P) is regular. Contrast Exercise 2.6.14(b).
**2.6.20.
Let L be a deterministic C F L and R a regular set. Show that the following are deterministic CFL's" (a) LR. (b) L/R. (c) L u R. (d) MAX(L). (e) MIN(L). (f) L N R. Hint: F o r (a, b, e, f ) , let P be a D P D A for L and M a finite automaton for some regular set R. We must show that there is a D P D A P ' which
EXERCISES
209
simulates P but keeps on each cell of its pushdown list the information, " F o r what states p of M and q of P does there exist w that will take M from state p to a final state and cause P to accept if started in state q with this cell the top of the pushdown list ?" We must show that there is but a finite amount of information for each cell and that P ' can keep track of it as the pushdown list grows and shrinks. Once we know how to construct P', the four desired D P D A ' s are relatively easy to construct. 2.6.21.
Show that for deterministic C F L L and regular set R, the following may not be deterministic C F L ' s : (a) R L . (b) [ x l x R ~ L]. (c) {xi for some y ~ R, we have y x ~ L}. (d) h(L), for h o m o m o r p h i s m h.
2.6.22.
Show that h-I(L) is a deterministic C F L if L is.
**2.6.23.
Show that Qc u Pc is an inherently ambiguous C F L whenever it is not empty.
**2.6.24.
Show that it is undecidable whether a C F G G generates an inherently ambiguous language.
*2.6.25.
Show that the grammar of Example 2.46 is unambiguous.
**2.6.26.
Show that the language LI U Lz, where L1 = {a'b'ambmlm, n > 1} and L2 = {a'bma~b" i m, n > 1}, is inherently ambiguous.
**2.6.27.
Shove that the C F G with productions S----~ a S b S c l a S b l b S c l d ambiguous. Is the language inherently ambiguous ?
*2.6.28.
is
Show that it is decidable for a D P D A P whether L ( P ) has the prefix property. Is the prefix property decidable for an arbitrary C F L ? DEFINITION A D y c k language is a C F L generated by a grammar G = ({S}, Z, P, S), where E = {al . . . . . a~, bl . . . . . bk} for some k _~ 1 and P consists of the productions S --~ S S l a i S b l l a 2 S b z l " " l a k S b k l e .
**2.6.29.
Show that given alphabet Z, we can find an alphabet Z', a Dyck language LD ~ ~ ' , and a homorphism h from Z'* to Z* such that for any C F L L ~ Z* there is a regular set R such that h(LD ~ R ) = L.
*2.6.30.
Let L be a C F L and S ( L ) = [ilfor some w ~ L, we have Iwl = i}. Show that S ( L ) is a finite union of arithmetic progressions. DEFINITION
An n-vector is an n-tuple of nonnegative integers. If v i = (a l , . . . , a,) and v~ = (bl . . . . . b,) are n-vectors and c a nonnegative integer, then v l -~- vz = (a l ÷ b l . . . . . a, + b,) and cv l = (ca l , . . ., ca,). A set S of n-vectors is linear if there are n-vectors v0 . . . . . Vk such that S=[vlv = v 0 + ClVl + . . . + CkVk, for some nonnegative integers
21 0
ELEMENTS OF LANGUAGE THEORY
CHAP. 2
Cl . . . . . ck]. A set of n-vectors is semilinear if it is the union of a finite n u m b e r of linear sets.
*'2.6.31.
Let E - - a ~ , a2, . . . . a,. Let ~b(x) be the n u m b e r of instances of b in the string x. Show that [(#al(w), @as(w) . . . . . :~a.(w))l w ~ L) is a semilinear set for each C F L L _ E*. DEFINITION The index o f a derivation in a C F G G is the m a x i m u m number of nonterminals in any sentential form of that derivation, l(w), the index o f a sentence w, is the smallest index of any derivation of w in G. I(G), the index o f G, is max I(w) taken over all w in L(G). The index o f a CFL L is min I(G) taken over all G such that L(G) = L.
**2.6.32.
Show that the index of the g r a m m a r G with productions
S ~
SSIOSI le
is infinite. Show that the index of L(G) is infinite. +
*2.6.33.
A C F G G -- (N, Z, P, S) is self-embedding if A =~ uAv for some u and v in Z +. (Neither u nor v can be e.) Show that a C F L L is not regular if and only if all grammars that generate L are self-embedding. DEFINITION
Let ~ be a class of languages with L1 ~ E~ and L2 ~ Ez* in £ . Let a and b be new symbols not in Z1 U Z2. ~ is closed under
(1) Marked union if aL 1 U bL2 is in £ , (2) M a r k e d concatenation if LlaL2 is in L, and (3) Marked • if (aLl)* is in £ . 2.6.34.
Show that the deterministic C F L ' s are closed under marked union, marked concatenation, and marked ..
*2.6.35.
Let G be a (not necessarily context-free) g r a m m a r (N, Z, P, S), where each production in P is of the form x A y ~ x~y, x and y are in Z*, A E N, and ~' E (N u E)*. Show that L(G) is a CFL.
**2.6.36.
Let G1 -- (N1, E l , P1, $1) and G2 -- (N2, Z2, Pz, $2) be two C F G ' s . Show
it
is undecidable
whether [~1Si ~ Gi Im
whether
0~} = [fllS2 ~
[~1S1 - > ~ } = [ f l l S z : = ~ f l } Gl
G2
and
fl].
G~ Im
Open Problem 2.6.37.
Is it decidable, for D P D A ' s PI and P2, whether L(P1) -- L(P2)?
Research Problems 2.6.38.
Develop methods for proving certain grammars to be unambiguous. By Theorem 2.30 it is impossible to find a method that will work for
BIBLIOGRAPHIC NOTES
211
an arbitrary unambiguous grammar. However, it would be nice to have techniques that could be applied to large classes of context-free grammars. 2.6.39.
A related research area is to find large classes of CFL's which are known to have at least one unambiguous CFG. The reader should be aware that in Chapter 8 we shall prove the deterministic CFL's to be such a class.
2.6.40.
Find transformations which can be used to make classes of ambiguous grammars unambiguous.
BIBLIOGRAPHIC
NOTES
We shall not attempt to reference here all the numerous papers that have been written on context-free languages. The works by Hopcroft and Ullman [1969], Ginsburg [1966], Gross and Lentin [1970], and Book [1970] contain many of the references on the theoretical developments of context-free languages. Theorem 2.24, Ogden's lemma, is from Ogden [1968]. Bar-Hillel et al. [1961] give several of the basic theorems about closure properties and decidability results of CFL's. Ginsburg and Greibach [1966] give many of the basic properties of deterministic CFL's. Cantor [1962], Floyd [1962a], and Chomsky and Schutzenberger [19631 independently discovered that it is undecidable whether a CFG is ambiguous. The existence of inherently ambiguous CFL's was noted by Parikh [1966]. Inherently ambiguous CFL's are treated in detail by Ginsburg [1966] and Hopcroft and Ullman [19691. The Exercises contain many results that appear in the literature. Exercise 2.6.19 is from Stearns [1967]. The constructions hinted at in Exercise 2.6.20 are given in detail by Hopcroft and Ullman [1969]. Exercise 2.6.29 is proved by Ginsburg [1966]. Exercise 2.6.31 is known as Parikh's theorem and was first given by Parikh [1966]. Exercise 2.6.32 is from Salomaa [1969b]. Exercise 2.6.33 is from Chomsky [1959a]. Exercise 2.6.36 is from Blattner [19721.
THEORY OF TRANSLATION
A translation is a set of pairs of strings. A compiler defines a translation in which the pairs are (source program, object program). If we consider a compiler consisting of the three phases lexical analysis, syntactic analysis, and code generation, then each of these phases itself defines a translation. As we mentioned in Chapter 1, lexical analysis can be considered as a translation in which strings representing source programs are mapped into sti'ings of tokens. The syntactic analyzer maps strings of tokens into strings representing trees. The code generator then takes these strings into machine or assembly language. In this chapter we shall present some elementary methods for defining translations. We shall also present devices which can be used to implement these translations and algorithms which can be used to automatically construct these devices from the specification of a translation. We shall first explore translations from an abstract point of view and then consider the applicability of the translation models to lexical analysis and syntactic analysis. For the most part, we defer treatment of code generation, which is the principal application of translation theory, to Chapter 9. In general, when designing a large system, such as a compiler, one should partition the overall system into components whose behavior and properties can be understood and precisely defined. Then it is possible to compare algorithms which can be used to implement the function to be performed by that component and to select the most appropriate algorithm for that component. Once the components have been isolated and specified, it should then also be possible to establish performance standards for each component and tests by which a given component can be evaluated. We must therefore
212
SEC. 3.1
FORMALISMS FOR TRANSLATIONS
213
understand the specification and implementation of translations before we can apply engineering design criteria to compilers.
3.1.
FORMALISMS FOR TRANSLATIONS
In this section two fundamental methods of defining translations are presented. One of these is the "translation scheme," which is a grammar with a mechanism for producing an output for each sentence generated. The other method is the "transducer," a recognizer which can emit a finite-length string of output symbols on each move. First we shall consider translation schemes based on context-free grammars. We shall then consider finite transducers and pushdown transducers. 3.1.1.
Translation and Semantics
In Chapter 2 we considered only the syntactic aspects of languages. There we saw several methods for defining the well-formed sentences of a language. We now wish to investigate techniques for associating with each sentence of a language another string which is to be the output for that sentence. The term "semantics" is sometimes used to denote this association of outputs with sentences when the output string defines the "meaning" of the input sentence. DEFINITION
Suppose that Z is an input alphabet and A an output alphabet. We define a translation f r o m a language L 1 ~ E* to a language L 2 ~ A* as a relation
T from Z* to A* such that the domain of T is L 1 and the range of T is L 2. A sentence y such that (x, y) is in T is called an output for x. Note that, in general, in a translation a given input can have more than one output. However, any translation describing a programming language should be a function (i.e., there exists at most one output for each input). There are many examples of translations. Perhaps the most rudimentary type of translation is that which can be specified by a homomorphism. Example 3.1
Suppose that we wish to change every Greek letter in a sentence in E* into its corresponding English name. We can use the homomorphism h, where (1) h(a) -- a if a is a member of Z minus the Greek letters and (2) h(a) is defined in the following table if a is a Greek letter:
214
CHAP. 3
THEORY OF TRANSLATION
Greek Letter A B F A E Z H O I K A M
a fl y 5 e ( 1/ 0 z x 2 g
h alpha beta gamma delta epsilon zeta eta theta iota kappa lambda mu
Greek Letter N E, O II P Z T T • X ~F ~
v ~ o r~ p a z o ~b Z V co
h nu xi omicron pi rho sigma tau upsilon phi chi psi omega
For example, the sentence a = nr 2 would have the translation a = pi r 2.
[Z]
Another example of a translation, one which is useful in describing a process that often occurs in compilation, is mapping arithmetic expressions in infix notation into equivalent expressions in Polish notation. DEFINITION There is a useful way of representing ordinary (or infix) arithmetic expressions without using parentheses. This notation is referred to as Polish notation.t Let 19 be a set of binary operators (e.g., [ + , ,]), and let Z be a set of operands. The two forms of Polish notation, prefix Polish and postfix Polish, are defined recursively as follows: (1) Polish (2) E2 are
If an infix expression E is a single operand a, in Z, then both the prefix and postfix Polish representation of E is a. If E1 0 E2 is an infix expression, where 0 is an operator, and Et and infix expressions, the operands of 0, then (a) 0 E'tE~ is the prefix Polish representation of El 0 E2, where E'i and E~ are the prefix Polish representations of E~ and Ez, respectively, and (b) ,_.~r"0r"' ~ z is the postfix Polish representation of Ei 0 Ez, where E~' and E~ are the postfix Polish representations of E~ and Ez, respectively. (3) If (E) is an infix expression, then (a) The prefix Polish representation of (E) is the prefix Polish representation of E, and (b) The postfix Polish representation of (E) is the postfix Polish representation of E.
]'The term "Polish" is used, as this notation was first described by the Polish mathematician Lukasiewicz, whose name is significantly harder to pronounce than is "Polish."
SEC. 3.1
FORMALISMS FOR TRANSLATIONS
215
Example 3.2
Consider the infix expression (a + b) • c. This expression is of the form E1 * E2, where E1 = (a q- b) and E2 = c. Thus the prefix and postfix Polish expressions for E2 are both c. The prefix expression for E~ is the same as that for a-q-b, which is q-ab. Thus the prefix expression for (a ÷ b ) , c is
,+abe. Similarly, the postfix expression for a ÷ b is ab ÷, so the postfix expression for (a + b) • c is a b ÷ c , . D It is not at all obvious that a prefix or postfix expression can be uniquely returned to an infix expression. The observations leading to a proof of this fact are found in Exercises 3.1.16 and 3.1.17. We can use trees to conveniently represent arithmetic expressions. For example, (a ÷ b ) , c has the tree representation shown in Fig. 3.1. In the
+
Fig. 3.1 Tree representation for
(a + b)*c. tree representation each interior node is labeled by an operator from O and each leaf by an operand from ~. The prefix Polish representation is merely the left-bracketed representation of the tree with all parentheses deleted. Similarly, the postfix Polish representation is the right-bracketed representation of the tree, with parentheses again deleted. Two important examples of translations are the sets of pairs ((x, y)lx is an infix expression and y is the prefix (or, alternatively, postfix) Polish representation of x]. These translations cannot be specified by a homomorphism. We need translation specifiers with more power and shall now turn our attention to formalisms which allow these and other translations to be conveniently specified. 3.1.2.
Syntax-Directed Translation Schemata
The problem of finitely specifying an infinite translation is similar to the problem of specifying an infinite language. There are several possible approaches toward the specification of translations. Analogous to a language generator, such as a grammar, we can have a system which generates the pairs in the translation. We can also use a recognizer with two tapes to recognize
216
THEORY OF TRANSLATION
CHAP. 3
those pairs in the translation. Or we could define an automaton which takes a string x as input and emits (nondeterministically if necessary) all y such that y is a translation of x. While this list does not exhaust all possibilities, it does cover the models in common use. Let us call a device which given an input string x, calculates an output string y such that (x, y) is in a given translation T, a translator for T. There are several features which are desirable in the definition of a translation. Two of these features are (1) The definition of the translation should be readable. That is to say, it should be easy to determine what pairs are in the translation. (2) It should be possible to mechanically construct an efficient translator for that translation directly from the definition. Features which are desirable in translators are (1) Efficient operation. For an input string w of length n, the amount of time required to process w should be linearly proportional to n. (2) Small size. (3) Correctness. It would be desirable to have a small finite test such that if the translator passed this test, this would be a guarantee that the translator works correctly on all inputs. One formalism for defining translations is the syntax-directed translation schema. Intuitively, a syntax-directed translation schema is simply a grammar in which translation elements are attached to each production. Whenever a production is used in the derivation of an input sentence, the translation element is used to help compute a portion of the output sentence associated with the portion of the input sentence generated by that production. Example 3.3
Consider the following translation schema which defines the translation {(x, XR)[X ~ {0, 1}*}. That is, for each input x, the output is x reversed. The rules defining this translation are Production
Translation Element
(1) S---* OS (2) S ~ I S (3) S---. e
S=S1
S = SO S-- e
An input-output pair in the translation defined by this schema can be obtained by generating a sequence of pairs of strings (0c, fl) called translation f o r m s , where ~ is an input sentential form and fl an output sentential form. We begin with the translation form (S, S). We can then apply the first rule
sac. 3.1
FORMALISMS FOR TRANSLATIONS
217
to this form. To do so, we expand the first S using the production S ~ 0S. Then we replace the output sentential form S by SO in accordance with the translation element S ---- SO. For the time being, we can think of the translation element simply as a production S --~ SO. Thus we obtain the translation form (0S, SO). We can expand each S in this new translation form by using rule (1) again to obtain (00S, SO0). If we then apply rule (2), we obtain (001S, S100). If we then apply rule (3), we obtain (001,100). No further rules can be applied to this translation form and thus (001,100) is in the translation defined by this translation schema. V-] A translation schema T defines some translation z(T). We can build a translator for T(T) from the translation schema that works as follows. Given an input string x, the translator finds (if possible) some derivation of x from S using the productions in the translation schema. Suppose that S = ~0 =* ~ =* ~z =~ "'" =* ~, = x is such a derivation. Then the translator creates a derivation of translation forms
(~0,/~0) --~- ( ~ , P,) . . . . .
>
(~., p.)
such that (~0, fl0) = (S, S), (~,, ft.) = (x, y), and each fl, is obtained by applying to fl~-i the translation element corresponding to the production used in going from t~_ 1 to ~t at the "corresponding" place. The string y is an output for x. Often the output sentential forms can be created at the time the input is being parsed.
Example 3.4 Consider the following translation scheme which maps arithmetic expressions of L ( G o ) to postfix Polish" Production
Translation Element
E----~ E + T E--~T T--* T* F T---~F F ~ (E) F---~a
E = ET-+E-'T T= TF* T=F F = E F=a
The production E ~ E q- T is associated with the translation element This translation element says that the translation associated with E on the left of the production is the translation associated with E on the right of the production followed by the translation of T followed by + . E = ET +.
21 8
THEORYOF TRANSLATION
CHAP. 3
Let us determine the output for the input a + a • a. To do so, let us first find a leftmost derivation of a -+ a • a from E using the productions of the translation scheme. Then we compute the corresponding sequence of translation forms as shown"
(E, E) ~
(E -q- T, E T + )
- - - - > ( T + T, T T + ) ------~ (F q- T, F T + ) ~- (a q- T, aT q-) ----> (a -q- T , F, aTF , +) u ( a -q- F , F, aFF , + ) > (a + a , F, aaF , + ) ~ >(a-+a,a,
aaa,+)
Each output sentential form is computed by replacing the appropriate nonterminal in the previous output sentential form by the right side of translation rule associated with the production used in deriving the corresponding input sentential form. The translation schemata in Examples 3.3 and 3.4 are special cases of an important class of translation schemata called syntax-directed translation schemata. DEFINITION
A syntax-directed translation schema (SDTS for short) is a 5-tuple T - - (N, Z, A, R, S), where (1) N is a finite set of nonterminal symbols. (2) Z is a finite input alphabet. (3) A is a finite output alphabet. (4) R is a finite set of rules of the form A ----~~, fl, where ~ ~ (N U Z)*, fl ~ (N u A)*, and the nonterminals in ,6' are a permutation of the nonterminals in ~. (5) S is a distinguished nonterminal in N, the start symbol. Let A ~ ~, fl be a rule. To each nonterminal of ~ there is associated an identical nonterminal of ft. If a nonterminal B appears only once in and ,8, then the association is obvious. If B appears more than once, we use integer superscripts to indicate the association. This association is an intimate part of the rule. For example, in the rule A ~ B(I~CB .~2~,B(2~B(I~C, the three positions in BcI~CB Cz~ are associated with positions 2, 3, and 1, respectively, in B~2~B~I~C. We define a translation form of T as follows:
SEC. 3.1
FORMALISMS FOR TRANSLATIONS
219
(1) (S, S) is a translation form, and the two S's are said to be associated. (2) If (eArl, e'Afl') is a translation form, in which the two explicit instances of A are associated, and if A ---~ 7, ~" is a rule in R, then (e?fl, e'?'fl') is a translation form. The nonterminals of ? and ~,' are associated in the translation form exactly as they are associated in the rule. The nonterminals of • and fl are associated with those of e' and fl' in the new translation form exactly as in the old. The association will again be indicated by superscripts, when needed, and this association is an essential feature of the form. If the forms (eArl, e'Afl') and (eI, fl, e'~,'fl'), together with their associations, are related as above, then we write (eArl, e'Afl') =-~ (e?fl, e'?'fl'). +
*
T
k
We use =~, ==~, and ==~ to stand for the transitive closure, reflexive-transitive T
T
T
closure, and k-fold product of ==~. As is customary, we shall drop the subT
script T whenever possible. The translation defined by T, denoted z(T), is the set of pairs {(x, y) l (S, S) ~
(x, y), x ~ 12" and y ~ A*}.
Example 3.5
Consider the SDTS T = (IS}, {a, -q-}, {a, + }, R, S), where R has the rules
S
> -k Sc~)S ~2~,
S
> a,
S ~ + S (9) 1"
Construct a simple SDTS which maps infix expressions containing these operators into postfix Polish notation. 3.1.5.
Consider the following SDTS. A string of the form ( x ) is a single nonterminal" (exp)
> sum ( e x p ) C1~ with ( v a r ) ~-- ( e x p ) C2) to (exp)~3~,f begin
local t;
t ~--- 0; for ( v a r ) ~
t ~
( e x p ) ~z~ to ( e x p ) ~3~ do
t 4- (exp){1};
result t
end
(var) ---> (id), (id) (exp)
> (id), (id)
(id) ~
a (id), a (id)
(id) -
~ b(id), b ( i d )
(id)-
> a, a
( i d ) - - - > b, b Give the translation for the sentences (a) sum a a with a ~ b to b b . (b) sum sum a with a a ~ a a a to a a a a with "3.1.6.
b ~-- bb
to
bbb.
Consider the following translation scheme" (statement)
> for ( v a r ) ~
(exp){ 1) to ( e x p ) (z) do ( s t a t e m e n t ) ,
begin ( v a r ) ~---- (exp){1) ; L" if ( v a r ) .~ ( e x p ) {2} then begin ( s t a t e m e n t ) ; (var) ~
( v a r ) + 1;
go to L end end
tNote that this comma separates the two parts of the rule.
EXERCISES
(var) ~ (exp)-
235
(id), ( i d ) ~ (id), ( i d )
(statement) ~ (id) ~ (id)-
( v a r ) ~-- (exp), ( v a r ) ~-- (exp) a(id), a(id) > b(id), b(id)
( i d ) ----> a, a
(id) ~
b, b
Why is this not an SDTS ? What should the output be for the following input sentence: for a ~-- b to aa do baa ~
bba
Hint: Apply Algorithm 3.1 duplicating the nodes labeled ( v a r ) in the output tree. Exercises 3.1.5 and 3..1.6 provide examples of how a language can be extended using syntax macros. Appendix A.1 contains the details of how such an extension mechanism can be incorporated into a language. 3.1.7.
Prove that the domain and range of any syntax-directed translation are context-free languages.
3.1.8.
Let L _q ~* be a C F L and R ~ ~* a regular set. Construct an SDTS T such that ~(T) = {(x, y) lif x ~ L -- R, then y = 0 if x ~ L A R, then y = 1}
"3.1.9.
Construct an SDTS T such that 'r(T) = {(x, y ) l x e {a, b}* and y = d, where i = [ ~ , ( x ) - ~b(x) l, where ~a(x) is the number of d's in x}
"3.1.10.
Show that if Z is a regular set and M is a finite transducer, then M ( L ) and M - I ( L ) are regular.
"3.1.11.
Show that if L is a C F L and M is a finite transducer, then M ( L ) and M - ~ ( L ) are CFL's.
"3.1.12.
Let R be a regular set. Construct a finite transducer M such that M ( L ) = L / R for any language L. With Exercise 3.1.11, this implies that the regular sets and CFL's are closed u n d e r / R .
"3.1.13.
Let R be a regular set. Construct a finite transducer M such that M ( L ) = R/L for any language L.
236
CHAP. 3
THEORY OF TRANSLATION
3.1.14.
A n SDTS T = (N, ~, A, R, S) is right-linear if each rule in R is of the form A ~
xB, yB
or
A-----~x,y
where A, B are in N, x ~ ~*, and y ~ A*. Show that if T is right-linear, then 'r(T) is a regular translation. *'3.1.15.
Show that if T g a* × b* is an SDT, then T can be defined by a finite transducer.
3.1.16.
l e t us consider the class of prefix expressions over operators O and operands ~. If a l . . . a , is a sequence in (® u E)*, compute st, the score at position i, 0 < i < n, as follows" (1) So = 1. (2) If at is an m-ary operator, let & = &_ 1 + m -- 1. (3) If at ~ ~, let st = st-1 -- 1. Prove that a a . . . a , , st > 0 for all i < n.
is a prefix expression if and only if s, = 0 and
"3.1.17.
Let a l . . . a , be a prefix expression in which a l is an m-ary operator. Prove that the unique way to write a l . . . a , as a a w ~ ' . ' W m , where wl . . . . ,Wm are prefix expressions, is to choose wj, 1 < j < m, so that it ends with the first ak such that sk = m - - j .
"3.1.18.
Show that every prefix expression with binary operators comes from a unique infix expression with no redundant parentheses.
3.1.19.
Restate and prove Exercises 3.1.16-3.1.18 for postfix expressions.
3.1.20.
Complete the proof of Theorem 3.1.
"3.1.21.
Prove that the order in which step (2) of Algorithm 3.1 is applied t o nodes does not affect the resulting tree.
3.1.22.
Prove Lemrna 3.1.
3.1.23.
Give pushdown transducers for the simple SDT's defined by the translation schemata of Examples 3.5 and 3.7.
3.1.2,4.
Construct a grammar for SNOBOL4 statements that reflects the associativity and precedence of operators given in Appendix A.2.
3.1.25.
Give an SDTS that defines the (empty store) translation of the following PDT" ({q, p}, {a, b}, [Zo, A, B}, [a, b}, ~, q, Z0, ~ ) where ~ is given by
BIBLIOGRAPHIC NOTES
6(q, a, X) = (q, A X, e),
for all X = Z0, A, B
6(q, b, X) = (q, BX, e),
for all X = Z0, A, B
237
O(q, e, A) -- (p, A, a) 0(p, a, A) = (p, e, b)
6(p, b, B) = (p, e, b) 6(p, e, Zo) = (p, e, a) *'3.1.26.
Consider two pushdown transducers connected in series, so the output of the first forms the input of the second. Show that with such a tandem connection, the set of possible output strings of the second PDT can be any recursively enumerable set.
3.1.27.
Show that T is a regular translation if and only if there is a linear context-free language L such that T = [(x, y)lxcy R ~ L}, where c is a new symbol.
'3.1.28.
Show that it is undecidable for two regular translations T, and /'2 whether Ti = / 2 .
Open Problems 3.1.29.
Is it decidable whether two deterministic finite transducers are equivalent ?
3.1.30.
Is it decidable whether two deterministic pushdown transducers are equivalent ?
Research Problem 3.1.31.
It is known to be undecidable whether two nondeterministic finite transducers are equivalent (Exercise 3.1.28). Thus we cannot "minimize" them in the same sense that we minimized finite automata in Section 3.3.1. However, there are some techniques that can serve to make the number of states smaller. Can you find a useful collection of these ? The same can be attempted for PDT's.
BIBLIOGRAPHIC
NOTES
The concept of syntax-directed translation has occurred to many people. Irons [1961] and Barnett and Futrelle [1962] were among the first to advocate its use. Finite transducers are similar to the generalized sequential machines introduced by Ginsburg [1962]. Our definitions of syntax-directed translation schema and pushdown transducer along with Theorem 3.2 are similar to those of Lewis and Stearns [1968]. Griffiths [1968] shows that the equivalence problem for nondeterministic finite transducers with no e-outputs is also unsolvable.
238
THEORYOF TRANSLATION
3.2.
CHAP. 3
PROPERTIES OF SYNTAX-DIRECTED TRANSLATIONS
In this section we shall examine some of the theoretical properties of syntax-directed translations. We shall also characterize those translations which can be defined as simple syntax-directed translations. 3.2.1.
Characterizing Languages
DEFINITION
We say that language L characterizes a translation T if there exist two homomorphisms h~ and h2 such that T = [(h~ (w), h2(w)) l w E L}. Example 3.14
The translation T = {(a~,aO[n ~ 1} is characterized by 0 ÷, since T = {(ha(w), h2(w))lw ~ 0+}, where ha(0) = h2(0) = a. E] We say that a language Z ~ (X U A')* strongly characterizes a translation T ~ X* × A* if (I) X A A ' = ~ 3 . (2) Z = {(ha(w), h2(w))lw ~ L}, where (a) ha (a) = a for all a in Z and h, (b) = e for all b in A'. (b) hz(a ) = e for all a in X and h z is a one-to-one correspondence between A' and A [i.e., h2(b) e A for all b in A' and h2(b) = h2(b') implies that b = b']. Example 3.15
The translation T = [ ( a " , a " ) [ n > 1} i s strongly characterized by L 1 = {a"b"! n > 1}. It is also strongly characterized by L 2 = {wlw consists of an equal number of a's and b's]. The homomorphisms in each case are ha(a) = a, h i ( b ) = e and h2(a)= e, hz(b)= a. T is not strongly characterized by the language 0 +. [Z We can use the concept of a characterizing language to investigate the classes of translations defined by finite transducers and pushdown transducers. LEMMA 3.4
Let T = (N, E, A, R, S) be an SDTS in which each rule is of the form A~aB, bB or A ---, a, b f o r a ~ X u { e } , beAu[e], andB~N. Then v(T) is a regular translation. Proof Let M be the finite transducer (N U If}, X, A, ~, S, {f}), w h e r e f is a new symbol. Define O(A, a) to contain (B, b) if A ---~ aB, bB is in R, and
SEC. 3.2
PROPERTIES OF SYNTAX-DIRECTED TRANSLATIONS
239
to contain (f, b) if A ~ a, b is in R. Then a straightforward induction on n shows that ?1
(S, x, e ) ~
(A, e, y) if and only if (S, S) ---9. (xA, yA)
It follows that (S, x, e) ~ (f, e, y) if and only if (S, S) ==~ (x, y). The details are left for the Exercises. Thus, z(T) = z(M). E] THEOREM 3.3
T is a regular translation if and only if T is strongly characterized by a regular set. Proof If: Suppose that L G (X U A')* is a regular set and that ht and h2 are homomorphisms such that h ~ ( a ) = a for a ~ X, h i ( a ) = e for a ~ A', hz(a) = e for a ~ X, and h2 is a one-to-one correspondence from A' to A. Let T = {(ha(w), hz(w)lw ~ L], and let G = (N, Z U A', P, S) be a regular grammar such that L(G) = L. Then consider the SDTS U = (N, X, A, R, S), where R is defined as follows:
(1) If A ~ aB is in P, then A --, hl(a)B, h2(a)B is in R. (2) If A ~ a is in P, then A --~ ha(a), h2(a) is in R. /t
An elementary induction shows that (A, A ) = ~ (x, y) if and only if U n
w, h,(w) = x, and h2(w) = y.
A ~ G
+
Thus we can conclude that (S, S) =~ (x, y) if and only if (x, y) is in T. U
Hence z ( U ) = T. By Lemma 3.4, there is a finite transducer M such that , ( M ) = T. Only if: Suppose that T ~ Z* x A* is a regular translation, and that M = (Q, X, A, tS, q0, F) is a finite transducer such that , ( M ) = T. Let A ' = {a'[a ~ A] be an alphabet of new symbols. Let G = (Q, X u A', P, q0) be the right-linear grammar in which P has the following productions"
(1) If 6(q, a) contains (r, y), then q ~ ah(y)r is in P, where h is a homomorphism such that h(a) = a' for all a in A. (2) If q is in F, then q ~ e is in P. Let ha and h2 be the following homomorphisms" ha(a ) = a
for all a in X
ha(b) = e
for all b in A'
h2(a) = e
for all a in X
h2(b' ) = b
for all b' in A'
240
CHAP. 3
THEORYOF TRANSLATION
We can now show by induction on m and n that (q, x, e ) ~
(r, e, y) for
n
some m if and only if q :=~ wr for some n, where h~(w) = x and h2(w) = y. +
Thus, (qo, x, e) ~ (q, e, y), with q in F, if and only if qo => wq ==~ w, where h l ( w ) = x and h2(w ) = y. Hence, T = {(hi(w), h2(w)) I w ~ L ( G ) } . Thus, L ( G ) strongly characterizes T. D COROLLARY
T is a regular translation if and only if T is characterized by a regular set. P r o o f . Strong characterization is a special case of characterization. Thus the "only if" portion is immediate. The "if" portion is a simple generalization of the "if" portion of the theorem.
In much the same fashion we can show an analogous result for simple syntax-directed translations. THEOREM 3.4
T is a simple syntax-directed translation if and only if it is strongly characterized by a context-free language.
?roof. I f : Let T be strongly characterized by the language generated by Ga = (N, Z U A',P, S), where hi and h2 are the two homomorphisms involved. Construct a simple SDTS T1 = (N, X, A, R, S), where R is defined by: For each production A --~ woB awa . . . B k W k in P, let A ~
h~(wo)Baha(w~)... Bkh~(wk), h2(wo)Bah2(wa)""
Bkhz(we)
be a rule in R. A straightforward induction on n shows that Itl
n
(1) If A :=~ w, then (A, A) ==~ (hi(w), hz(w)). Gx
T1 B
B
(2) If (A, A) =~ (x, y), then there is some w such that A =:~ w, hi (w) = x, T1
Gt
and h2(w) = y. Thus, z ( T 1 ) = T. O n l y if: Let T = z(T2), where T2 = (N, E , A , R , S), and let A ' = { a ' [ a ~ A} be an alphabet of new symbols. Construct C F G G2 = (N, X U A', P, S), where P contains production A ~ Xoy~Blxmy'l . . . B k x k y ~ for each rule A ~ x o B l x t . . . B k X k , Y o B l Y l . . . B k Y k in R; Yl is Yt with each symbol a ~ A replaced by a'. Let h 1 and h 2 be the obvious homomorphisms, ha(a) = a for a ~ E, h i ( a ) = e for a ~ A', hz(a) = e for a E E, and h2(a') = a for a ~ A. Again it is elementary to prove by induction that
SEC. 3.2
PROPERTIES OF SYNTAX-DIRECTED TRANSLATIONS n
n
(1) If A ==~ w, then (A, A) ~ G~
(ha(w), h2(w)).
T2 n
(2) If (A, A) ~
n
(x, y), then for some w, we have A ~ w, ha(w) = x,
T2
and h2(w)= y.
241
G~
Eli
COROLLARY
A translation is a simple SDT if and only if it is characterized by a contextfree language. Eli We can use Theorem 3.3 and 3.4 to show that certain translations are not regular translations or not simple SDT's. It is easy to show that the domain and range of every simple SDT is a CFL. But there are simple syntax-directed translations whose domain and range are regular sets but which cannot be specified by any finite transducer or even pushdown transducer. Example 3.16
Consider the simple SDTS T with rules
S----~ OS, SO S
> 1S, S1
S
>e,e
z(T) = {(w, wR.)[W E {0, 1}*}. We shall show that z(T) is not a regular translation. Suppose that x(T) is a regular translation. Then there is some regular language L which strongly characterizes z(T). We can assume without loss of generality that L ~ {0, 1, a, b}*, and that the two homomorphisms involved are hi(0) = 0, hi(l) = 1, ha(a) = hi(b) = e and h2(0) = h2(1) = e, h2(a) = O,
hz(b) = 1. If L is regular, it is accepted by a finite automaton M = (Q, [0, 1, a, b}, d~, qo, F) with s states for some s. There must be some z ~ L such that hi(z) = 0~1 ' and h2(z)= lS0 ". This is because (0Sl ", 1"0 ") ~ z(T). All O's precede all l's in z, and all b's precede all a's. Thus the first s symbols of z are only O's and b's. If we consider the states entered by M when reading the first s symbols of z, we see that these cannot all be different; we can write z = uvw such that (qo, z) ~ (q, vw) t--- (q, w) ~ (p, e), where l uvl _ 1, and p ~ F. Then uvvw is in L. But hl(uvvw ) = 0"+ml" and h2(uvvw) = l'+n0 s, where not both m and n are zero. Thus, (0'÷ml s, l'+n0s) ~ z(T), a contradiction. We conclude that z(T) is not a regular translation. D
242
THEORY OF TRANSLATION
CHAP. 3
Example 3.17
Consider the SDTS T with the rules S-
~ A~I~cA ~2~,A~2~cA~1~
A
> 0A, 0A
A~ A
1A, 1A >e,e
Here 'r(T) = {(ucv, vcu)lu, v ~ [0, 1}*}. We shall show that 'r(T) is not a simple SDT. Suppose that L is a CFL which strongly characterizes z(T). We can suppose that A' = [c', 0', 1'}, L ~ ({0, 1, c} u A3*, and that h 1 and h2 are the obvious homomorphisms. For every u and v in {0, 1}*, there is a word zuv in L such that hl(zu,)= ucv and h2(z,,)= vcu. We consider two cases, depending on whether c precedes or follows c' in certain of the zu,'s. Case 1: For all u there is some v such that c precedes c' in z~v. Let R be the regular set {0, 1, 0', l'}*c[0, 1, 0', l'}*c'{0, 1, 0', 1'}*. Then L n R is a CFL, since the CFL's are closed under intersection with regular sets. Note that L n R is the set of sentences in Z in which c precedes c'. Let M be the finite transducer which, until it reads c, transmits O's and l's, while skipping over primed symbols. After reading c, M does nothing until it reaches c'. Subsequently, M prints 0 for 0' and 1 for 1', skipping over O's and l's. Then M ( L n R) is a CFL, since the CFL's are closed under finite transductions, and in this case M ( L N R) = [uulu ~ [0, 1}*}. The latter is not a C F L by Example 2.41. Case 2: For some u there is no v such that c precedes c' in zu,. Then for every v there is a u such that c' precedes c in z,,. An argument similar to case 1 shows that if L were a CFL, then [vv[v ~ [0, 1}*} would also be a CFL. We leave this argument for the Exercises. We conclude that ~(T) is not strongly characterized by any context-free language and hence is not a simple SDT. [[] Let 3, denote the class of regular translations, 3, the simple SDT's, and 3 the SDT's. From these examples we have the following result. THEOREM 3.5 Z, ~ Z, ~ Z. Proof 3, ~ 3 is by definition. 3, ~ 3s is immediate when one realizes that a finite transducer is a special case of a PDT. Proper inclusion follows from Examples 3.16 and 3.17. [[]
PROPERTIES OF SYNTAX-DIRECTED TRANSLATIONS
SEC. 3.2 3.2.2.
243
Properties of Simple SDT's
Using the idea of a characterizing language, we can prove analogs for many of the normal form theorems of Section 2.6. We shall mention two of them here and leave some others for the Exercises. The first is an analog of Chomsky normal form. THEOREM 3.6
Let T be a simple SDT. Then T = I:(T~), where Ti = (N, ~, A, R, S) is a simple SDTS such that each rule in R is of one of the forms (1) A ~ BC, BC, where A, B, and C are (not necessarily distinct) members of N, or (2) A --~ a, b, where exactly one of a and b is e and the other is in ~ or A, as appropriate. (3) S ---~ e, e if (e, e) is in T and S does not appear on the right of any rule.
Proof. Apply the construction of Theorem 3.4 to a grammar in CNF.
D
The second is an analog of Greibach normal form. THEOREM 3.7
Let T be a simple SDT. Then T = ~(T1), where T1 = (N, E, A, R, S) is a simple SDTS such that each rule in R is of the form A - 4 a~, b~, where is in N*, exactly one of a and b is e, and the other is in ~ or A [with the same exception as case (3) of the previous theorem].
Proof. Apply the construction of Theorem 3.4 to a grammar in GNF.
[Z]
We comment that in the previous two theorems we cannot make both a be in E and b in A at the same time. Then the translation would be length preserving, which is not always the case for an arbitrary SDT. 3.2,3.
A Hierarchy of SDT's
The main result of this section is that there is no analog of Chomsky normal form for arbitrary syntax-directed translations. With one exception, each time we increase the number of nonterminals which we allow on the right side of rules of an SDTS, we can define a strictly larger class of SDT's. Some other interesting properties of SDT's are proved along the way. DEFINITION
Let T = (N, ~, A, R, S) be an SDTS. We say that T is of order k if for no rule A ~ ~, fl in R does 0c (equivalently fl) have more than k instances of nonterminals. We also say that z(T) is of order k. Let 3k be the class of all SDT's of order k.
244
CHAP. 3
T H E O R Y OF T R A N S L A T I O N
Obviously, ~1 ~ ~32 ~ "'" -~ ~3~~ . . . . We shall show that each of these inclusions is proper, except that ~33 = 32. A sequence of preliminary results is needed. LEMMA 3.5
Proof. It is elementary to show that the domain of an SDTS of order 1 is a linear CFL. However, by Theorem 3.6, every simple SDT is of order 2, and every CFL is the domain of some simple SDT (say, the identity translation with that language as domain). Since the linear languages are a proper subset of the CFL's (Exercise 2.6.7), the inclusion of 3~ in ~2 is proper. D There are various normal forms for SDTS's. We claim that it is possible to eliminate useless nonterminals from an SDTS as from a CFG. Also, there is a normal form for SDTS's somewhat similar to CNF. All rules can be put in a form where the right side consists wholly of nonterminals or has no nonterminals. DEFINITION A nonterminal A in an SDTS T = (N, E, A, R, S) is useless if either (1) There exist no x ~ E* and y ~ A* such that (A, A) => (x, y), or (2) For no t~ and t~2 i n (N U E)* and fl~ and flz in (N U A)* does :g
(s, s) LEMMA 3.6 Every SDT of order k is defined by an SDTS of order k with no useless nonterminals.
Proof. Exercise analogous to Theorem 2.13.
[~
LEMMA 3.7 Every SDT T of order k ~ 2 is defined by an SDTS T1 = (N, 1~, A, R, S), where if A ---~ ~, fl is in R, then either (1) ~ and fl are in N*, or (2) ~ is in E* and fl in A*. Moreover, T1 has no useless nonterminals.
Proof. Let T2 = (N', E, A, R', S) be an SDTS with no useless nonterminals such that z ( T z ) = T. We construct R from R' as follows. Let A -----~x o B l x l ' " BkXk, YoCiYl"'" CkYk be a rule in R', with k > 0. Let be the permutation on the set of integers 1 to k such that the nonterminal Bi is associated with the nonterminal C~(i). Introduce new nonterminals A', D ~ , . . . , D k and E 0 , . . . , Ek, and replace the rule by
SEC. 3.2
PROPERTIES OF SYNTAX-DIRECTED TRANSLATIONS
A
245
> Eo A', EoA'
Eo
> Xo, Yo
A '~
> Dr'"
Dt
>B~E i , B iE i
forl~i x;, y..)
for 1 ~ i < k
Dk, D't • • • D~
where D~ = D',~.) for 1 ~ i < k
For example, if the rule is A ---, x o B ~ X l B 2 X 2 B 3 x 3 , y o B 3 y ~ B ~ y 2 B 2 y 3 , then zr -- (2, 3, 1). We would replace this rule by A Eo -
> EoA', EoA' > Xo, Yo
A ' - - - - ~ D 1 D 2 D 3, D 3 D 1 D 2
D~
~ B~E~, B t E~,
E1
> xl, Y2
for i = 1, 2, 3
E2 = > x2, Y3 E3
> x3, Y l
Since each Dt and E~ has only one rule, it is easy to see that the effect of all these new rules is exactly the same as the rule they replace. Rules in R' with no nonterminals on the right are placed directly in R. Let N be N' together with the new nonterminals. Then z ( T 2 ) : z(T~) and T2 satisfies the conditions of the lemma. D LEMMA 3.8 ~2 :
~3"
Proof. It suffices, by Lemma 3.7, to show how a rule of the form A ~ B 1 B 2 B 3, C1C2C 3 can be replaced by two rules with two nonterminals in
each component of the right side. Let ~z be the permutation such that Bt is associated with C~,). There are six possible values for ~z. In each case, we can introduce a new nonterminal D and replace the rule in question by two rules as shown in Fig. 3.6.
n(1)
n(2)
n(3)
1 1 2 2 3 3
2 3 1 3 1 2
3 2 3 1 2 1
Rules A ~ A ~ A ~ A ~ /1 ~ A ~
BtD, BiD, DB3, DB3, B1 D, B1D,
B1D BtD DB3 B3D DB1 DBi
Fig. 3.6 New rules.
D ~ BzB3, B2B3 D ~ BzB3, B3Bz D ~
B1B2, BzBx
D ~ B1Bz, BiBz D ~ D ~
B2B3, BzB3 B2B3, B3Bz
246
THEORYOF TRANSLATION
CHAP. 3
It is straightforward to check that the effect of the new rules is the same as the old in each case. LEMMA 3.9
Every SDT of order k ~ 2 has an SDTS T = (N, E, A, R, S) satisfying Lemma 3.7, and the following" (1) There is no rule of the form A ---~ B, B in R. (2) There is no rule of the form A --~ e, e in R (unless A = S, and then S does not appear on the right side of any rule). Proof. Exercise analogous to Theorems 2.14 and 2.15.
[~]
We shall now define a family of translations Tk for k ~ 4 such that Tk is of order k but not of order k - 1. Subsequent lemmas will prove this. DEFINITION
Let k ~ 4. Define 2~k, for the remainder of this section only, to be { a l , . . . , ak}. Define the permutation nk, for k even, by
~(i)
if i is odd
k+ i _I 2i + 1
if i is even Thus, re4 is [3, 1, 4, 2] and zt6 is [4, 1, 5, 2, 6, 3]. Define ztk for k odd by
~(i) =
rk+l 2 '
if/=
i k - - - ~ - + 1,
if i is even
i--1 2 '
if i is odd and i ¢ 1
1
Thus, zt5 = [3, 5, 1, 4, 2] and zt7 = [4, 7, 1, 6, 2, 5, 3]. Let T k be the one-to-one correspondence which takes a~'a~~ . . . a~* to
.~'~"'.~"~'~' "~'~'~' ~n(1)t~zt(2)".. ten(k)
For example, if as, a2, a3, and a4 are called a, b, c, and d, then T4 = {(a'bJckd z, cea'd~bJ) l i, j, k, l ~ 0} In what follows, we shall assume that k is a fixed integer, k ~ 4, and that there is some SDTS T = (N, ~k, Ek, R, S) of order k -- 1 which defines
SEC. 3.2
PROPERTIES OF SYNTAX-DIREC~D TRANSLATIONS
247
Tk. We assume without loss of generality that T satisfies Lemma 3.9, and hence Lemmas 3.6 and 3.7. We shall prove, by contradiction, that T cannot exist. DEFINITION
Let E be a subset of Ek and A ~ N. (Recall that we are referring to the hypothetical SDTS T.) We say that X is (A, d)-bounded in the domain (alt.
range) if for every (x, y) such that (A, A) =-> (x, y), there is some a E X such T
that x (alt. y) has no more than d occurrences of a. If X is not (A, d)-bounded in the domain (alt. range) for any d, we say that A covers X in the domain (alt. range). LEMMA 3.10 If A covers X in the domain, then it covers X in the range, and conversely.
Proof Suppose that A covers Z in the domain, but that X is (A, d)bounded in the range. By Lemma 3.6, there exist wl, w2, w3, and w4 in X~ such that (S, S)=0. (w~Aw2, w3Aw4). Let m = [WaW, l. Since A covers X in the domain, there exist w5 and w6 in Xk* such that (A, A) ==~ (w~, w6), and for all a ~ Z, ws has more than m + d occurrences of a. However, since X is (A, d)-bounded in the range, there is some b ~ X such that Wn has no more than d occurrences of b. But (wlwsw2, w3w6w4) would be a member of T k under these circumstances, although WlWsW2 has more than m + d occurrences of b and w3w6w4 has no more than m + d occurrences of b. By contradiction, we see that if X is covered by A in the domain, then it is also covered by A in the range. The converse is proved by a symmetric argument. D As a consequence of Lemma 3.10, we are entitled to say that A covers E without mentioning domain or range. LEMMA 3.11 Let A cover Xk" Then there is a rule A ~ B 1 . . . Bin, C 1 . . . Cm in R, and sets 1 9 ~ , . . . , ®m, whose union is Xk, such that Bi covers ®t, for 1 ~ i ~m.
Proof Let do be the largest finite integer such that for some X ~ Xk and Bi, 1 < i _< m, X is (B~, d0)-bounded but not (B~, do -- 1)-bounded. Clearly, do exists. Define d~ = do(k -- 1) + 1. There must exist strings x and y in X~ such that (A, A) ==~ (x, y), and for aU a E Xk, x and y each have at least d~ occurrences of a, for otherwise Xk would be (A, dl)-bounded. Let the first step of the derivation (A, A) ==~ (x, y) be (A, A) ==~ (B~ . . . Bin, C 1 " " C~). Since T is assumed to be of order k - 1, we have m ~ k -- 1. We can write x = x ~ . . . xm so that (B~, Bi) =-~ (xt, yi) for some y~.
248
THEORYOF TRANSLATION
CHAP. 3
If a is an arbitrary element of E k, it is not possible that none of x i has more than do occurrences of a, because then x would have no more than do(k -- 1) = dl -- 1 occurrences of a. Let ®t be the subset of £k such that xt has more than do occurrences of all and only those members of Oi. By the foregoing, O1 U 02 t,A . . . U O m = £k" We claim that Bt covers Ot for each i. For if not, then O~ is (Be, d)-bounded for some d > do. By our choice of do, this is impossible. D DEFINITION
Let at, aj, and at be distinct members of £k. We say that a~ is between a t and al if either (1) i < j < l , or (2) zte(i) < ~k(J) < z~k(l). Thus a symbol is formally between two others if it appears physically between them either in the domain or range of Tk. LEMMA 3.12 Let A cover £k, and let A ---~ B ~ . . . Bin, C 1 " " Cm be a rule satisfying Lemma 3.11. If Bt covers {at} and also covers {a,}, and at is between a t and a,, then Bt covers {a,}, and for no j ~ i does Bj cover {a,}. P r o o f Let us suppose that r < t < s. Suppose that Bj covers {a,}, j ~ i. There are two cases to consider, depending on whether j < i or j > i. Case 1: j < i. Since in the underlying grammar of T, Bt derives a string g
with a, in it and Bj derives a string with a, in it, we have (A, A):=~ (x, y), where x has an instance of at preceding an instance of a t. Then by Lemma 3.6, there exists such a sentence in the domain of Tk, which we know not to be the case. Case 2: j > i. Allow Be to derive a sentence with a, in it, and we can similarly find a sentence in the domain of T k with a, preceding a,. By contradiction, we rule out the possibility that r < t < s. The only other possibility, that nk(r) < nk(t) < nk(S), is handled similarly, reasoning about the range of T k. Thus no Bj, j ~ i, covers {at}. If Bj covers £, where a, e £, then B~ certainly covers {a,}. Thus by Lemma 3.11, Bt covers some set containing a,, and hence covers {a,}. D
LEMMA 3.13 If A covers Ek, k > 4, then there is some rule A --~ B ~ . . . Bin, C ~ . . . C~ and some i, 1 ~ i ~ m, such that Bg covers £k' Proof. We shall do the case in which k is even. The case of odd k is similar and will be left for the Exercises. Let A ~ B ~ . . . Bin, C x "'" Cm be a rule satisfying Lemma 3.11. Since m ~ k -- 1 by hypothesis about T, there must
SEC. 3.2
PROPERTIES OF SYNTAX-DIRECTED TRANSLATIONS
249
be some Bt which covers two members of Ek, say Bt covers [a,, as}, r ~ s. Hence, Bt covers {at} and [as}, and by Lemma 3.12, if at is between ar and as, then Bt covers (at} and no C i, j ~ i, covers {at}. If we consider the range of Tk, we see that, should Bt cover (ak/z] and {akin+ ~}, then it covers (a} for all a ~ E~, and no other Bj covers any {a}. It will follow by Lemma 3.11 that Bt covers Ek. Reasoning further, if Bt covers [am} and {a,}, where m < k/2 and n > k/2, then consideration of the domain assures us that B~ covers {ak/2} and (ak/2+ ~}. Thus, if one of r and s is equal to or less than k/2, while the other is greater than k/2, the desired result is immediate. The other cases are that r < k/2 and s < k/2 or r > k/2, s > k/2. But in the range, any distinct r and s, both equal to or less than k/2, have some at, t > k/2, between them. Likewise, if r > k/2 and s > k/2, we find some at, t < k/2, between them. The lemma thus follows in any case. [~] LEMMA 3.14 Tk is in 3k -- 3e-~, for k > 4.
Proof. Clearly, T k is in ~3k. It suffices to show that T, the hypothetical SDTS of order k -- 1, does not exist. Since S certainly covers Zk, by Lemma 3.13 we can find a sequence of nonterminals A0, A 1 , . . . , A#N in N, where A 0 = S and for 0 < i < ~ N , there is a rule A t ~ stAr+lilt, ?tAt+Id;r Moreover, for all i, At covers Ek. N o t all the A's can be distinct, so we can find i and j, with i < j and A t = Aj. By Lemma 3.6, we can find w t , . . . , w~o so that for all p ~ 0,
(S, S) ---> (w~A~w~, w3A,w~)
(w~w~A,w6w2,W3WTA:sW4)
(w ~w~A,w~w2, w3w~A~wfw,) " " (WlWsW9W6W2~ W3w¢w~0w~w,).
By Lemma 3.9(1) we can assume that not all of ~t, fit, ~'t, and Jt are e, and by Lemma 3.9(2) that not all of ws, w6, WT, and ws are e. For each a ~ Ek, it must be that wsw6 and w7w8 have the same number of occurrences of a, or else there would be a pair in z(T) not in T k. Since A t covers Ek, should ws, w6, w7, and w8 have any symbol but ax or a~, we could easily choose w9 to obtain a pair not in Tk. Hence there is an occurrence of a t or a k in w 7 or w8. Since A t covers E~ again, we could choose wl0 to yield a pair not in T k. We conclude that T does not exist, and that T k is not in
250
THEORY OF TRANSLATION
CHAP. 3
THEOREM 3.8 With the exception of k = 2, 3k is properly contained in ~3k÷1 for k .~ 1.
Proof. The case k = 1 is L e m m a 3.5. The other cases are L e m m a 3.14.
D A n interesting practical consequence of T h e o r e m 3.8 is that while it m a y be attractive to build a compiler writing system that assumes the underlying g r a m m a r to be in C h o m s k y n o r m a l form, such a system is not capable o f p e r f o r m i n g any syntax-directed translation of which a m o r e general system is capable. However, it is likely that a practically motivated S D T would at worst be in ~33 (and hence in ~3z).
EXERCISES
"3.2.1.
Let T be a SDT. Show that there is a constant c such that for each x in the domain of T, there exists y such that (x, y) 6 T and [y I we use case (d). The resulting a u t o m a t o n is M = ([ql, q2, [q,, 1 ] , . . . , [q4, 5]}, Z, 5, ql, [q2, [q,, 1 ] , . . . , [q4, 5]}), where ~5 is defined by (1) c~(ql, ~) = [q2} for all letters ix. (2) c~(q2, 00 = [[q,, 1]} for all 0c in Z. (3) c~([q,, i], ~) = [[q,, i -1- 1]} for all ~ in Z and 1 _~ i < 5. Note that [q7, 1] is inaccessible and has been removed from M. Also, M is deterministic here, although it need not be in general. The transition graph for this machine is shown in Fig. 3.7. [ZI
Start A~
. . . ~ Z~
~0 ..... 9
A,...,Z, 0. . . . . 9
Fig. 3.7 Nondeterministic finite automaton for identifiers. 3.3.3.
Direct Lexical Analysis
When the lexical analysis is direct, one must search for one of a large number of tokens. The most efficient way is generally to search for these in parallel, since the search often narrows quite quickly. Thus the model of a direct lexical analyzer is m a n y finite a u t o m a t a operating in parallel, or to be exact, one finite transducer simulating many a u t o m a t a and emitting a signal as to which of the a u t o m a t a has successfully recognized a string. If we have a set of nondeterministic finite a u t o m a t a to simulate in parallel and their state sets are disjoint, we can merge the state sets and next state functions to create one nondeterministic finite automaton, which may be converted to a deterministic one by Theorem 2.3. (The only nuance is that
SEC. 3.3
LEXICALANALYSIS
259
the initial state of the deterministic automaton is the set of all initial states of the components.) Thus it is more convenient to merge before converting to a deterministic device than the other way round. The combined deterministic automaton can be considered to be a simple kind of finite transducer. It emits the token name and, perhaps, information that will locate the instance of the token. Each state of the combined automaton represents states from various of the component automata. Apparently, when the combined automaton enters a state which contains a final state of one of the component automata, and no other states, it should stop and emit the name of the token for that component automaton. However, matters are often not that simple. For example, if an identifier can be any string of characters except for a keyword, it does not make for good practice to define an identifier by the exact regular set, because it is complicated and requires many states. Instead, one uses a simple definition for identifier (Example 3.18 is one such) and leaves it to the combined automaton to make the right decision. In this case, should the combined automaton enter a state which included a final state for one of the keyword automata and a state of the automaton for identifiers and the next input symbol (perhaps a blank or special sign) indicated the end of the token, the keyword would take priority, and indication that the keyword was found would be emitted. Example 3.21
Let us consider a somewhat abstract example. Suppose that identifiers are composed of any string of the four symbols D, F, I, and O, followed by a blank (b), except for the keywords DO and IF, which need not be followed by a blank, but may not be followed immediately by any of the letters D, F, I, or O. The identifiers are recognized by the finite automaton of Fig. 3.8(a), DO by that of Fig. 3.8(b), and IF by Fig. 3.8(c). (All automata here are deterministic, although that need not be true in general, of course.) The merged automaton is shown in Fig. 3.9. State q2 indicates that an identifier has been found. However, states [q 1, q8} and [q 1, qs} are ambiguous. They might indicate IF or DO, respectively, or they might just indicate the initial portion of some identifier, such as DOOF. To resolve the conflict, the lexical analyzer must look at an additional character. If a D , O , I, or F follows, we had the prefix of an identifier. If anything else, including a blank, follows (assume that there are more characters than the five mentioned), we enter new states, q9 or q l0, and emit a signal to the effect that DO or IF, respectively, was detected, and that it ends one symbol previously. If we enter q2, we emit a signal saying that an identifier has been found, ending one symbol previously. Since it is the output of the device, not the state, that is important, states
260
CHAP. 3
THEORYOF TRANSLATION
F,I,O
Start (a) Identifiers
Start--
Q (b) DO
Start
r Q (c) IF Fig. 3.8 Automata for lexical analysis.
q2, qg, and ql0 can be identified and, in fact, will have no representation at all in the implementation. U 3.3.4.
S o f t w a r e Simulation of Finite Transducers
There are several approaches to the simulation of finite automata or transducers. A slow but compact technique is to encode the next move function of the device and execute the encoding interpretively. Since lexical analysis is a major portion of the activity of a translator, this mode of operation is frequently too slow to be acceptable. However, some computers have single instructions that can recognize the kinds of tokens with which we have been dealing. While these instructions cannot simulate an arbitrary finite automaton, they work very well when tokens are either keywords or identifiers. An alternative approach is to make a piece of program for each state. The function of the program is to determine the next character (a subroutine may be used to locate that character), emit any output required, and transfer to the entry of the program corresponding to the next state. An important design question is the proper method of determining the next character. If the next state function for the current state were such that most different next characters lead to different next states, there is probably nothing better to do than to transfer indirectly through a table based on the next character. This method is as fast as any, but requires a table whose size is proportional to the number of different characters. In the typical lexical analyzer, there will be many states such that all but
EXERCISES
261
not D, F, I, 0
D,F,I,O
D
Start
D,F,I F,O
D,
D,I,O
D,F,I,O
notD, F,I,O
Fig. 3.9 Combined lexical analyzer.t
very few next characters lead to the same state, It may be too expensive of space to allocate a full table for each such state. A reasonable compromise between time and space considerations, for many states, would be to use binary decisions to weed out those few characters that cause a transition to an unusual state.
EXERCISES
3.3.1.
Give regular expressions for the following extended regular expressions: (a) (a+3b+3) .2. (b) (alb)* -- (ab)*. (c) (aal bb) .4 ~ a(ab[ ba)+b. tUnlike Fig. 3.8(a), Fig. 3.9 does not permit the empty string to be an identifier.
262
THEORYOF TRANSLATION
CHAP. 3
3.3.2.
Give a sequence of regular definitions that culminate in the definition of (a) A L G O L identifiers. (b) PL/I identifiers. (c) Complex constants of the form (~, fl), where 0c and fl are real F O R T R A N constants. (d) Comments in PL/I.
3.3.3.
Prove Theorem 3.3.
3.3.4.
Give indirect lexica! analyzers for the three regular sets of Exercise 3.3.2.
3.3.5.
Give a direct lexical analyzer that distinguishes among the following tokens" (1) Identifiers consisting of any sequence of letters and digits, with at least one letter somewhere. [An exception occurs in rule (3).] (2) Constants as in Example 3.19. (3) The keywords IF, IN, and INTEGER, which are not to be considered identifiers.
3.3.6.
Extend the notion of indistinguishable states (Section 2.3) to apply to nondeterministic finite automata. If all indistinguishable states are merged, do we necessarily get a minimum state nondeterministic automaton ?
**3.3.7.
Is direct lexical analysis for F O R T R A N easier if the source program is scanned backward ?
Research Problem 3.3.8.
Give an algorithm to choose an implementation for direct lexical analyzers. Your algorithm should be able to accept some indication of the desired time-space trade off. You may not wish to implement the symbolby-symbol action of a finite automaton, but rather allow for the possibility of other actions. For example, if many of the tokens were arithmetic signs of length 1, and these had to be separated by blanks, as in SNOBOL, it might be wise to separate out these tokens from others as the first move of the lexical analyzer by checking whether the second character was blank.
Programming Exercises 3.3.9.
Construct a lexical analyzer for one of the programming languages given in the Appendix. Give consideration to how the lexical analyzer will recover from lexicai errors, particularly misspellings.
3.3.10.
Devise a programming language based on extended regular expressions. Construct a compiler for this language. The object language program should be an implementation of the lexical analyzer described by the source program.
SEC. 3.4
PARSING
BIBLIOGRAPHIC
263
NOTES
The AED RWORD (Read a WORD) system was the first major system to use finite state machine techniques in the construction of lexical analyzers. Johnson et al. [1968] provide an overview of this system. An algorithm that constructs from a regular expression a machine language program that simulates a corresponding nondeterministic finite automaton is given by Thompson [1968]. This algorithm has been used as a pattern-matching mechanism in a powerful text-editing language called QED. A lexical analyzer should be designed to cope with lexical errors in its input. Some examples of lexical errors are (1) (2) (3) (4)
Substitution of an incorrect symbol for a correct symbol in a token. Insertion of an extra symbol in a token. Deletion of a symbol from a token. Transposition of a pair of adjacent symbols in a token.
Freeman [1964] and Morgan [1970] describe techniques which can be used to detect and recover from errors of this nature. The Bibliographic Notes at the end of Section 1.2 provide additional references to error detection and recovery in compiling. 3.4.
PARSING
The second phase of compiling is normally that of parsing or syntax analysis. In this section, formal definitions of two common types of parsing are given, and their capabilities are briefly compared. We shall also discuss what it means for one grammar to "cover" another grammar. 3.4.1.
Definition of Parsing
We say that a sentence w in L(G) for some C F G G has been parsed when we know one (or perhaps all) of its derivation trees. In a translator, this tree may be "physically" constructed in the computer memory, but it is more likely that its representation is more subtle. One can deduce the parse tree by watching the steps taken by the syntax analyzer, although the connection would hardly be obvious at first. Fortunately, most compilers parse by simulating a PDA which is recognizing the input either top-down or bottom-up (see Section 2.5). We shall see that the ability of a PDA to parse top-down is associated with the ability of a P D T to map input strings to their leftmost derivations. Bottom-up parsing is similarly associated with mapping input strings to the reverse of their rightmost derivations. We shall thus treat the parsing problem as that of mapping strings to either leftmost or rightmost derivations. While there are many other parsing strategies, these two definitions serve as the significant benchmarks.
264
THEORY OF TRANSLATION
CHAP. 3
Some other parsing strategies are mentioned in various parts of the book. In the Exercises at the end of Sections 3.4, 4.1, and 5.1 we shall discuss leftcorner parsing, a parsing method that is both top-down and bottom-up in nature. In Section 6.2.1 of Chapter 6 we shall discuss generalized top-down and bottom-up parsing. DEFINITION
Let G = (N, E, P, S) be a CFG, and suppose that the productions of P are numbered 1, 2 , . . . , p. Let a be in (N U X)*. Then
(1) A left parse of a is a sequence of productions used in a leftmost derivation of a from S. (2) A right parse of a is the reverse of a sequence of productions used in a rightmost derivation of a from S in G. We can represent these parses by a sequence of numbers from 1 to p. Example 3.22
Consider the grammar Go, where the productions are numbered as shown: (1) E ~ (2) E---~ (3) T---~ (4) T ~ (5) F ~ (6) F ~
E + T T T. F F (E) a
The left parse of the sentence a , (a + a) is 23465124646. The right parse of a • (a + a) is 64642641532. We shall use an extension of the =~ notation to describe left and right parses. CONVENTION
Let G = (N, X, P, S) be a CFG, and assume that the productions are numbered from I to p. We write ~z ~=-~ ,8 if a ==~ fl and the production lm
applied is numbered i. Similarly, we write 0~=>~ fl if 0~=~ fl and production rm
i is used. We extend these notations by (1) If a ~'==~ fl and fl "'==~ 7, then a .... ==~ 7. (2) If a ==~"' fl and fl ==~"' t', then a =~ .... 2,'. 3.4.2, Top-Down Parsing
In this section we wish to examine the nature of the left-parsing problem for CFG's. Let rr = i 1 .-- i, be a left parse of a sentence w in L(G), where G is a CFG. Knowing n, we can construct a parse tree for w in the following "top-down" manner. We begin with the root labeled S. Then i l gives the
sEc. 3.4
PARSING
265
production to be used to expand S. Suppose that i l is the number of the production S ---~ X i . • • Xk. We then create k descendants of the node labeled S and label these descendants X1, X 2 , . . •, Xk. If X1, X 2 , . . •, X~_ 1 are terminals, then the first i - I symbols of w must be X 1 . " Xi_l. Production i2 must then be of the form Xt ---~ Y ~ ' " Y~, and we can continue building the parse tree for w by expanding the node labeled X~. We can proceed in this fashion and construct the entire parse tree for w corresponding to the left parse ~r. Now suppose that we are given a C F G G = (N, E, P, S) in which the productions are numbered from 1 through p and a string w ~ E* for which we wish to construct a left parse. One way of looking at this problem is that we know the root and frontier of a parse tree and "all" we need to do is fill in the intermediate nodes. Left parsing suggests that we attempt to fill in the parse starting from the root and then working left to right toward the frontier. It is quite easy to show that there is a simple SDTS which maps strings in L(G) to all their left (or right, if you prefer) parses. We shall define such an SDTS here, although we prefer to examine the PDT which implements the translation, because the latter gives an introduction to the physical execution of its translation. DEFINITION
Let G = (N, Z, P, S) be a numbered from 1 to p. Define SDTS (N, ~, { 1 , . . . ,p}, R, S), that A ~ ~ is production i in P deleted.
C F G in which the productions have been T~, or Tt, where G is understood, to be the where R consists of rules A----~ ~, fl such and fl is i~', where ~' is ~ with the terminals
Example 3.23
Let G O be the usual grammar with productions numbered as in Example 3.22. Then T~ = ([E, T, F}, { + , . , (,), a}, { 1 , . . . , 6}, R, E), where R consists of E
~E+
E
~ T,
T~
~ T * F,
3TF
T
,~ F,
4F
r ~ F
(E), ~ a,
T,
lET 2T
5E 6
The pair of derivation trees in Fig. 3.10 shows the translation defined for
a,(a+a).
D
The following theorem is left for the Exercises.
266
CHAP. 3
THEORY OF TRANSLATION
E
E
2/!
i
L
4/! T
1
F
!
F
a
! I
!
!NNNN~,
! 1/E~T 2/! 4/! 4
/!
F
!
I
6
6
a
(a) input
(b) output Fig. 3.10
Translation T~.
THEOREM 3.10 Let G = (N, X, P, S) be a CFG. Then Ti = [(w, zt) lS ~==~w}.
Proof. We can prove by induction that (A, A) =-~ (w, n) if and only if T~
G
Using a construction similar to that in Lemma 3.2, we can construct for any grammar G a nondeterministic pushdown transducer that acts as a left parser for G. DEFINITION
Let G = (N, X, P, S) be a C F G in which the productions have been numbered from 1 to p. Let M f (or Mz when G is understood) be the nondeterministic pushdown transducer ([q}, X, N U X, [ 1, 2 , . . . , p}, t~, q, S, ~), where is defined as follows" (1) t~(q, e, A) contains (q, a, i) if the ith production in P is A --+ a. (2) 6(q, a, a) = {(q, e, e)} for all a in X. We call M~ the left parser for G. With input w, Mz simulates a leftmost derivation of w from S in G. Using rules in (1), each time Mi expands a nonterminal on top of the pushdown list according to a production in P, Mz will also emit the number of that
SEC. 3.4
PARSING
267
production. If there is a terminal symbol on top of the pushdown list, M~ will use a rule in (2) to ensure that this terminal matches the current input symbol. Thus, M~ can produce only a leftmost derivation for w.
THEOREM3.11 Let G = (N, ~, P, S) be a CFG. Then lr~(M~) = [(w, n)[S ~=-~ w}.
Proof Another elementary inductive exercise. The inductive hypothesis this time is that (q, w, A, e ) ~ (q, e, e, rt) if and only if A "==~ w. E] Note that M~ is almost, but not quite, the PDT that one obtains by Lemma 3.2 from the SDTS Tj°. Example 3.24
Let us construct a left parser for G 0. Here
Mi = ({q}, E, N w E, {1, 2 , . . . , 6}, $,q, E, ~), where 6(q, e, E) 0(q, e, T) $(q, e, F) O(q, b, b)
= = = =
{(q, ((q, {(q, [(q,
E .+ T, 1), (q, T, 2)~} T • F, 3), (q, F, 4)} (E), 5), (q, a, 6)} e, e)} for all b in Z
With the input string a + a , a, M~0 can make the following sequence of moves, among others:
(q,a + a , a , E , e )
(q, a-k- a , a , E +
T, 1)
(q, a + a , a , T +
T, 12)
(q, a + a , a , F + T, 124) F- (q, a + a , a , a -4- T, 1246) ~- (q, + a , a, + T, 1246)
(q, a , a, T, 1246) (q, a • a, T • F, 12463) (q, a • a, F • F, 124634) (q, a • a, a • F, 1246346) (q, • a, • F, 1246346) (q, a, F, 1246346) (q, a, a, 12463466) ~- (q, e, e, 12463466)
[~
268
THEORYOF TRANSLATION
CHAP. 3
The left parser is in general a nondeterministic device. To use it in practice, we must simulate it deterministically. There are some grammars, such as those which are not cycle-free, for which a complete simulation is impossible, in this case because there are an infinity of left parses for some words. Moreover, the natural simulation, which we shall discuss in Chapter 4, fails on a larger class of grammars, those which are left-recursive. An essential requirement for doing top-down parsing is that left recursion be eliminated. There is a natural class of grammars, which we shall call LL (for scanning the input from the !eft producing a !eft parse) and discuss in Section 5.1, for which the left parser can be made deterministic by the simple expedient of allowing it to look some finite number of symbols ahead on the input and to base its move on what it sees. The LL grammars are those which can be parsed "in a natural way" by a deterministic left parser. There is a wider class of grammars for which there is some D P D T which can implement the SDTS Tz. These include all the LL grammars and some others which can be parsed only in an "unnatural" way, i.e., those in which the contents of the pushdown list do not reflect successive steps of a leftmost derivation, as does Mz. Such grammars are of only theoretical interest, insofar as top-down parsing is concerned, but we shall treat them briefly in Section 3.4.4. 3.4,3. Bottom-Up Parsing
Let us now turn our attention to the right-parsing problem. Consider the rightmost derivation of a + a • a from E in G O•
E---~IE+
T
,-3E+T,F ---~6E+ T,a ----~4E+F,a ~6E+a,a ~.z T + a , a .,-4F+a,a ~6aq-a.a Writing in reverse the sequence of productions used in this derivation gives us the right parse 64264631 for a --t- a • a. In general, a right parse for a string w in a grammar G = (N, E, P, S) is a sequence of productions which can be used to reduce w to the sentence symbol S. Viewed in terms of a derivation tree, a right parse for a sentence w represents the sequence of handle prunings in which a derivation tree with
sEc. 3.4
PARSING
269
frontier w is pruned to a single node labeled S. In effect, this is equivalent to starting with only the frontier of a derivation tree for w and then "filling in" the derivation tree from the leaves to the root. Thus the term "bottom-up" parsing is often associated with the generation of a right parse. In analogy with the SDTS Tz which maps words in L(G) to their left parses, we can define T,, an SDTS which maps words to right parses. The translation elements have terminals deleted and the production numbers at the right end. We leave it for the Exercises to show that this SDTS correctly defines the desired translation. As for top-down parsing, we are really interested in a PDT which implements Tr. We shall define an extended PDT in analogy with the extended PDA. DEFINITION
An extended P D T is an 8-tuple P = (Q, 1~, F, A, 6, q0, Z0, F), where all symbols are as before except 6, which is a map from a finite subset of Q x (Z u {e}) x F* to the finite subsets of Q x F* x A*. Configurations are defined as before, but with the pushdown top normally on the right, and we say that (q, aw, fl~, x) ~- (p, w, fly, xy) if and only if J(q, a, ~) contains (p, 7', Y). The extended PDT P is deterministic (1) If for all q ~ Q, a ~ I: u {e}, and ~ ~ F*, @6(q, a, ~) < 1 and, (2) If 6(q, a, ~) ::/:: ;2 and J(q, b, fl) :/: ;2, with b = a or b = e, then neither of ~ and fl is a suffix of the other. DEFINITION
Let G = (N, 1~,P, S) be a CFG. Let M~ be the extended nondeterministic pushdown transducer ([q}, ~E,N U E U {$}, [ 1 , . . . , p}, ~, q, $, ;2). The pushdown top is on the right, and 6 is defined as follows: (1) O(q, e, ~) contains (q, A, i) if production i in P is A ---~ 0~. (2) J(q, a, e) = {(q, a, e)} for all a in E. (3) O(q, e, $S) = {(q, e, e)}. This pushdown transducer embodies the elements of what is known as a shift-reduce parsing algorithm. Under rule (2), M, shifts input symbols onto the top of the pushdown list. Whenever a handle appears on top of the pushdown list, Mr can reduce the handle under rule (1) and emit the number of the production used to reduce the handle. Mr may then shift more input symbols onto the pushdown list, until the next handle appears on top of the pushdown list. The handle can then be reduced and the production number emitted. Mr continues to operate in this fashion until the pushdown list contains only the sentence symbol on top of the end of pushdown list marker. Under rule (3) Mr can then enter a configuration in which the pushdown list is empty.
CHAP. 3
THEORY OF TRANSLATION
270
THEOREM 3.12
Let G = (N, Z, P, S) be a CFG. Then lr,(Mr) = {(w, nR)[ S :::~'~ w}. Proof. The proof is similar to that of Lemma 2.25 and is left for the Exercises. 5 Example 3.25
The right parser for Go would be Mr°0 = ({q}, E, N U • U {$}, {1, 2 , . . . , 6}, ~, q, $, ~), where ~(q, e, E -q- T) = [(q, E, 1)} d~(q, e, T) = {(q, E, 2)} ~(q, e, T • F) = [(q, T, 3)} dr(q, e, F) = {(q, T, 4)} ~(q, e, (E)) = {(q, F, 5)}
,~(q, e, a) = {(q, F, 6)} ~(q, b, e) = {(q, b, e)}
for all b in
~(q, e, SE) = {(q, e, e)} With input a 4- a . a, M~ ° could make the following sequence of moves, among others: (q, a + a . a, $, e) F-- (q, q-a . a, $a, e) ~- (q, --ka • a, $F, 6) (q, q- a • a, ST, 64) ~--. (q, ÷ a . a, $E, 642) (q, a • a, $E q-, 642) t-- (q, * a, $E + a, 642) (q, • a, $E + F, 6426) F- (q, * a, $E + T, 64264) (q, a, $E + T . , 64264) (q, e, $E + T . a, 64264) (q, e, SE + T • F, 642646) (q, e, $E q- T, 6426463) ~- (q, e, $E, 64264631) (q, e, e, 64264631)
SEC. 3.4
PARSING
271
Thus, Mr would produce the right parse 64264631 for the input string a + a.a. [~] We shall discuss deterministic simulation of a nondeterministic right parser in Chapter 4. In Section 5.2 we shall discuss an important subclass of CFG's, the LR (for scanning the input from [eft to right and producing a right parse), for which the PDT can be made to operate deterministically by allowing it to look some finite number of symbols ahead on the input. The LR grammars are thus those which can be parsed naturally bottomup and deterministically. As in left parsing, there are grammars which may be right-parsed deterministically, but not in the natural way. We shall treat these in the next section. 3.4.4.
Comparison of Top-Down and Bottom-Up Parsing
If we consider only nondeterministic parsers, then there is little comparison to be made. By Theorems 3.11 and 3.12, every C F G has both a left and right parser. However, if we consider the important question of whether deterministic parsers exist for a given grammar, things are not so simple. DEFINITION
A CFG G is left-parsable if there exists a D P D T P such that v(P) = {(x$, :n:)[(x, g) ~ T~}. G is right-parsable if there exists a D P D T P with "r(P) = {(x$, ~)](x, ~z) c Tr°}. In both cases we shall permit the DPDT to use an endmarker to delimit the right end of the input string. Note that all grammars are left- and right-parsable in an informal sense, but it is determinism that is reflected in the formal definition. We find that the classes of left- and right-parsable grammars are incommensurate; that is, neither is a subset of the other. This is surprising in view of Section 8.1, where we shall show that the LL grammars, those which can be left-parsed deterministically in a natural way, are a subset of the LR grammars, those which can be right-parsed deterministically in a natural way. The following examples give grammars which are left- (right-) parsable but not right- (left-) parsable. Example 3.26
Let G1 be defined by
272
CHAP. 3
THEORY OF TRANSLATION
(1) S ---, BAb (2) S --~ CAc (3) A ~ BA (4) A--~ a (5) B ~ a (6) C ~ a L ( G t ) - - a a + b ÷ aa+e. We can show that G1 is neither LL nor LR, because we do not know whether the first a in any sentence comes from B or C until we have seen the last symbol of the sentence. However we can "unnaturally" produce a left parse for any input string with a D P D T as follows. Suppose that the input is a"+2b, n ~ O. Then the D P D T can produce the left parse 15(35)"4 by storing all a's on the pushdown list until the b is seen. No output is generated until the b is encountered. Then the D P D T can emit 15(35)"4 by using the a's stored on the pushdown list to count to n. Likewise, if the input is a"+Ze, we can produce 26(35)"4 as output. In either case, the trick is to delay producing any output until b or e is seen. We shall now attempt to convince the reader that there is no D P D T which can produce a valid right parse for all inputs. Suppose that M were a D P D T which produced the right parses 55"43"1 for an+2b and 65"43"2 for a"+2e. We shall give an informal proof that M does not exist. The proof draws heavily on ideas in Ginsburg and Greibach [1966] in which it is shown that {a"b"[n ~ 1} u {a"b2"[n :> 1} is not a deterministic CFL. The reader is referred there for assistance in constructing a formal proof. We can show each of the following" (1) Let a * be input to M. Then the output of M is empty, or else M would emit a 5 or 6, and we could "fool" it by placing c or b, respectively, on the input, causing M to produce an erroneofis output. (2) As a's enter the input of M, they must be stored in some way on the pushdown list. Specifically, we can show that there exist integers j and k, pushdown strings a and fl, and state q such that for all integers p ~ 0, (qo, ak+JP, Zo, e) t--- (q, e, flPoc, e), where q0 and Z0 are the initial state and pushdown symbol of M. (3) If after k + jp a's, one b appears on M ' s input, M cannot emit symbol 4 before erasing its pushdown tape to e. For if it did, we could "fool" it by previously placingj more a's on the input and finding that M emits the same number of 5's as it did previously. (4) After reducing its pushdown list to e, M cannot "remember" how many a's were on the input, because the only thing different about M ' s configurations for different values of p (where k ÷ jp is the number of a's) is now the state. Thus, M does not know how many 3's to emit.
Example 3.27 Let G z be defined by (1) S--~ Ab (3) A ~
(5) B - - . a
AB
(2) S --~ Ac (4) A --~ a
SEC. 3.4
PARSING
273
L(G2) = a+b + a ÷c. It is easy to show that G2 is right-parsable. Using an
argument similar to that in Example 3.26 it can be shown that G2 is not leftparsable. E] THEOREM 3.13 The classes of left- and right-parsable grammars are incommensurate. Proof. By Examples 3.26 and 3.27.
[Z]
Despite the above theorem, as a general rule, bottom-up parsing is more appealing than top-down parsing. For a given programming language is often easier to write down a grammar that is right-parsable than one that is left-parsable. Also, as was mentioned, the LL grammars are included in the LR grammars. In the next chapter, we shall also see that the natural simulation of a nondeterministic PDT works for a class of grammars that is, in a sense to be discussed there, more general when the PDT is a right parser than a left parser. When we look at translation, however, the left parse appears more desirable. We shall show that every simple SDT can be performed by (1) A PDT which produces left parses of words, followed by (2) A DPDT which maps the left parses into output strings of the SDT. Interestingly, there are simple SDT's such that "left parse" cannot be replaced by "right parse" in the above. If a compiler translated by first constructing the entire parse and then converting the parse to object code, the above claim would be sufficient to prove that there are certain translations which require a left parse at the intermediate stage. However, many compilers construct the parse tree node by node and compute the translation at each node when that node is constructed. We claim that if a translation cannot be computed directly from the right parse, then it cannot be computed node by node, if the nodes themselves are constructed in a bottom-up way. These ideas will be discussed in more detail in Chapter 9, and we ask the reader to wait until then for the matter of nodeby-node translation to be formalized. DEFINITION
Let G = (N, X, P, S) be a CFG. We define L~ and L~, the left and right parse languages of G, respectively, by L~ = {n:lS "---~ w for some w in L(G)} and L~ = {nRI S ----'~ W for some w in L(G)}
We can extend the "==~ and ==~ notations to SDT's by saying that
274
THEORYOF TRANSLATION
CHAP. 3
(0~, fl) ~==~ (~,, ~) if and only if (tz, fl) ~ (~,, c~) by a sequence of rules such that the leftmost nonterminal of e is replaced at each step and these rules, with translation elements deleted, form the sequence of productions ~. We define ~-~ for SDT's analogously. DEFINITION
An SDTS is semantically unambiguous if there are no two distinct rules of the form A ---~ tx, fl and A ~ 0¢, ~,. A semantically unambiguous SDTS has exactly one translation element for each production of the underlying grammar. THEOREM 3.14 Let T = (N, E, A, R, S) be a semantically unambiguous simple SDTS. Then there exists a D P D T P such that I:,(P) = {(zt, y) i (S, S) ~ (x, y) for some x ~ X*}.
Proof. Assume N and zX are disjoint. Let P = ({q}, { 1 , . . . , p}, N L) A, A, ,5, q, S, ~), where 1 , . . . , p are the numbers of the productions of the underlying grammar, and ~ is defined by (I) Let A ---, e be production i, and A ~ with A ~ 0c. Then ~(q, i, A) (q, fl, e). (2) For all b in A, ~(q, e, b) = (q, e, b).
e, fl the lone rule beginning
P is deterministic because rule (1) applies only with a nonterminal on top of the pushdown list, and rule (2) applies only with an output symbol on top. The proof that P works correctly follows from an easy inductive hypothesis" (q, zt, A, e)[-- (q, e, e, y) if and only if there exists some x in E* such that (A, A) "=-~ (x, y). We leave the proof for the Exercises. To show a simple SDT not to be executable by any D P D T which maps Lf, where G is the underlying grammar, to the output of the SDT, we need the following lemma. LEMMA 3.15 There is no D P D T P such that v(P) = {(wc, wRcw)lw ~ (a, b}*}.
Proof. Here the symbol c plays the role of a right endmarker. Suppose that with input w, P emitted some non-e-string, say dx, where d = a or b. Let d be the other of a and b, and consider the action of P with wdc as input. It must emit some string, but that string begins with d. Hence, P does not map wdc to dwRcwd, as demanded. Thus, P may not emit any output until the right endmarker c is reached. At that time, it has some string aw on its pushdown list and is in state qw. Informally, a~ must be essentially w, in which case, by erasing a~, P can emit wR. But once P has erased aw, P cannot then "remember" all of w in order
SEe,
3.4
PARSING
275
to print it. A formal proof of the lemma draws upon the ideas outlined in Example 3.26. We shall sketch such a proof here. Consider inputs of the form w - - a ; . Then there are integers j and k, a state q, and strings • and fl such that when the input is aJ+"kc, P will place eft" on its pushdown list and enter state q. Then P must erase the pushdown list down to e at or before the time it emits w%. But since e is independent of w, it is no longer possible to emit w. [--] THEOREM 3.15 There exists a simple SDTS T - - ( N , E, A, R, S) such that there is no D P D T P for which r(P) = {OrR, x) ! (S, S) ~'~ (w, x) for some w}.
Proof. Let T be defined by the rules (1) S --. Sa, aSa (2) S --, Sb, bSb (3) S ~ c, c Then L7 = 3(1 ÷ 2)*, where G is the underlying grammar. If we let 17(1) -- a and h(2) -- b, then the desired z(P) is [(3e, h(e)Rch(e))la ~ [ 1, 2}*}. If P existed, with or without a right endmarker, then we could easily construct a D P D T to define the translation {(wc, wRcw)lw e {a, b}*}, in contradiction of Lemma 3.15. [--I We conclude that both left parsing and right parsing are of interest, and we shall study both in succeeding chapters. Another type of parsing which embodies features of both top-down and bottom-up parsing is left-corner parsing. Left-corner parsing will be treated in the Exercises. 3.4.5.
Grammatical Covering
Let G 1 be a CFG. We can consider a grammar G 2 to be similar from the point of view of the parsing process if L(Gz) -- L(G 1) and we can express the left and/or right parse of a sentence generated by G1 in terms of its parse in Gz. If such is the case, we say that G2 covers G I. There are several uses for covering grammars. For example, if a programming language is expressed in terms of a grammar which is "hard" to parse, then it would be desirable to find a covering grammar which is "easier" to parse. Also, certain parsing algorithms which we shall study work only if a grammar is some normal form, e.g., C N F or non-left-recursive. If G1 is an arbitrary grammar and G 2 a particular normal form of G 1, then it would be desirable if the parses in G i can be simply recovered from those in G z. If this is the case, it is not necessary that we be able to recover parses in G2 from those in G 1. For a formal definition of what it means to "recover" parses in one grammar from those in another, we use the notion of a string homomorphism between the parses. Other, stronger mappings could be used, and some of these are discussed in the Exercises.
276
THEORY OF TRANSLATION
CHAP. 3
DEFINITION
Let G 1 = (N1, ~, P1, $1) and G2 = (N2, ~, P2, $2) be C F G ' s such that L(G1) = L(G2). We say that G 2 left-covers G1 if there is a homomorphism h from P2 to P1 such that (1) If $2 ~==~ w, then $1 ht~)==~ W, and (2) For all rt such that Sx ~==~ w, there exists n' such that $2 *'==~ w and h(n')
=
n.
We say G 2 that right-covers G~ if there is a homomorphism h from P2 to PI such that (I) If S 2 ==~ w, then S1 ==~hC~)W, and (2) For all rt such that S~ ==~ w, there exists zt' such that Sz ==~' w and h(n')
=
n.
Example 3.28
Let G 1 be the grammar (1) S --~ 0S1 (2) S --~ 01 and G2 be the following C N F grammar equivalent to G l: (1) S ~ AB (2) s - ~ A c (3) B ~ SC
(4) A ~ 0 (5) C ~ 1 We see G 2 left-covers G 1 with the homomorphism h(1) = 1, h(2) = 2, and h(3) = h(4) = h(5) = e. For example, S 1432455
> 0011,
h(1432455) = 12,
and
S 12_____~0011
G2
G1
G2 also right-covers G1, and in this case, the same h can be used. For example, S,
> 1352544 0011, G~
h(1352544) = 12,
and
S~
12 0011
G1
G 1 does not left- or right-cover G2. Since both grammars are unambiguous, the mapping between parses is fixed. Thus a homomorphism g showing that G1 was a left cover would have to map 1"2 into (143)"24(5) "+I, which can easily be shown to be impossible. [~ Many of the constructions in Section 2.4, which put grammars into normal forms, can be shown to yield grammars which left- or right-cover the original.
EXERCISES
277
Example 3.29 The key step in the C h o m s k y normal form construction (Algorithm 2.12) is the replacement of a production A --~ X t ' " X,, n > 2, by A ~ X t B a , B t ~ X 2 B z , . . . , B,_z ~ X,_ t X.. The resulting g r a m m a r can be shown to left-cover the original if we m a p production A ~ XaB~ to A --, X~ . . . X. and each of the productions B 1 ~ X z B 2 , . . . , Bn-z - ~ X n - I X n to the empty string. If we wish a right cover instead, we may replace A ~ X 1 . . ' X, by A ~
BIXn, O 1 ~
B2Xn_i,...
, B,,_ 2 ~
X t X 2.
[--]
Other covering results are left for the Exercises.
EXERCISES
3.4.1.
Give an algorithm to construct a derivation tree from a left or right parse.
3.4.2.
Let G be a CFG. Show that L~ is a deterministic CFL.
3.4.3.
Is LrG always a deterministic CFL?
*3.4.4.
Construct a determinstic pushdown transducer P such that * ( P ) = [(g, g')l~: is in L~ and g' is the right parse for the same derivation tree}.
*3.4.5. 3.4.6.
Can you construct a deterministic pushdown transducer P such that '~(P) = {(g, g')[~: is in L~ and g' is the corresponding left parse} ? Give left and right parses in Go for the following words" (a) ((a))
(b) a + (a + a)
(c) 3.4.7.
a,a,a
Let G be the CFG defined by the following numbered productions (1) S ~ if B then S else S
(2)
S--~
s
(3) B-----~ B A B (4) B - - , B V B (5/ B - - ~ b Give SDTS's which define T~ and T,°. 3.4.8.
Give PDT's which define T~ and T~, where G is as in Exercise 3.4.7.
3.4.9.
Prove Theorem 3.10.
3.4.10.
Prove Theorem 3.11.
3.4.11.
Give an appropriate definition for T,°, and prove that for your SDTS, words in L(G) are mapped to their right parses.
3.4.12.
Give an algorithm to convert an extended PDT to an equivalent PDT. Your algorithm should be such that if applied to a deterministic extended PDT, the result is a DPDT. Prove that your algorithm does this.
278
CHAP. 3
THEORYOF TRANSLATION
3.4.13. • 3.4.14.
Prove Theorem 3.12. Give deterministic right parsers for the grammars (a) (1) S ~ S 0 (2) S - - - i S 1 (3) S ~ e (b) (1) (2) (3) (4) (5)
"3.4.15.
S-~
AB
A~
0A1
A--,e B-, B1 B--,e
Give deterministic left parsers for the grammars (a) (1) S - - ~ 0S (2) S---, 1S (3) S ~ e (b) (1) (2) (3) (4)
S ~ 0S1 S ~ A A --,. A1 A--~ e
"3.4.16.
Which of the grammars in Exercise 3.4.14 have deterministic left parsers ? Which in Exercise 3.4.15 have deterministic right parsers ?
"3.4.17.
Give a detailed proof that the grammars in Examples 3.26 and 3.27 are right- (left-) parsable but not left- (right-) parsable.
3.4.18.
Complete the proof of Theorem 3.14.
3.4.19.
Complete the proof of Lemma 3.15.
3.4.20.
Complete the proof of Theorem 3.15. DEFINITION T h e left corner of a non-e-production is the leftmost symbol (terminal or nonterminal) on the right side. A le~-corner p a r s e of a sentence is the sequence of productions used at the interior nodes of a parse tree in which all nodes have been ordered as follows. If a node n has p direct descendants nl, n2, . . . , np, then all nodes in the subtree with root n l precede n. Node n precedes all its other descendants. The descendants of n~ precede those of n3, which precede those of r/a, and so forth. Roughly speaking, in left-corner parsing the left corner of a production is recognized bottom-up and the remainder of the production is recognized top-down.
Example 3.30 Figure 3.11 shows a parse tree for the sentence bbaaab generated by the following grammar: (1) S ~ A S (3) A ~ b A A (5) B ~ b
(2) S ~
BB (4) A --~ a
(6) B - - r e
EXERCISES
279
nl
/N
(~n4 !
®n3 /\
~)n6 (~)n7 (~)n8
/
I
I /
~),n5 Qnl 2 Qnl 3(~)nl4 (~)n9
@nl0
!
~)nll
!
Fig. 3.11
Parse tree.
The ordering of the nodes imposed by the left-corner-parse definition states that node nz and its descendants precede n l, which is then followed by n3 and its descendants. N o d e n4 precedes nz, which precedes ///5, //6, and their descendants. Then //9 precedes ns, which precedes nl0, n l~, and their descendants. Continuing in this fashion we obtain the following ordering of nodes: H4 /'//2 /119 /7/5 ///15 ///10 ///16 Hll /I/12 1/6 H1 ///13 /q/7 t//3 ///14 /18
The left-corner parse is the sequence of productions applied at the interior nodes in this order. Thus the left-corner parse for bbaaab is 334441625. A n o t h e r m e t h o d of defining the left-corner parse of a sentence of a g r a m m a r G is to use the following simple SDTS associated with G. DEFINITION Let G = (N, E , P , S) be a C F G in which the productions are n u m b e r e d 1 to p. Let T~c be the simple SDTS (N, E, (1, 2 . . . . . p), R, S), where R contains a rule for each product i on in P determined as follows: If the ith p r o d u c t i o n in P is A - - , B0~ or A ~ a a or A ~ e, then R contains the rule A - - , B ~ , Bioc" or A --~ a~, i0~' or A ~ e, i, respectively, where ~' is ~ with all terminal symbols removed. Then, if (w, rr) is in "c(T~c), zc is a left-corner parse for w.
Example 3.31 Tic for the g r a m m a r of the previous example is S --~ A S , A 1 S
S ---, BB, B 2 B
A ~
A ---, a, 4
bAA, 3AA
B - - - , b, 5
B - - ~ e, 6
280
THEORY OF TRANSLATION
CHAP. 3
We can confirm that (bbaaab, 334441625) is in z(T~).
D
3.4.21.
Prove that (w, ~) is in z(T~) if and only if n is a left-corner parse for w.
3.4.22.
Show that for each CFG there is a (nondeterministic) PDT which maps the sentences of the language to their left-corner parses.
3.4.23.
Devise algorithms which will map a left-corner parse into (1) the corresponding left parse and (2) the corresponding right parse and conversely.
3.4.24.
Show that if Ga left- (right-) covers G2 and G2 left- (right-) covers G~, then G3 left- (right-) covers Gi.
3.4.25.
Let G a be a cycle-free grammar. Show that G i is left- and right-covered by grammars with no single productions.
3.4.26.
Show that every cycle-free grammar is left- and right-covered by grammars in CNF.
*3.4.27.
Show that not every CFG is covered by an e-free grammar.
3.4.28.
Show that Algorithm 2.9, which eliminates useless symbols, produces a grammar which left- and right-covers the original.
**3.4.29.
Show that not every proper grammar is left- or right-covered by a grammar in GNF. Hint: Consider the grammar S ~ S01 S11011.
**3.4.30.
Show that Exercise 3.4.29 still holds if the homomorphism in the definition of cover is replaced by a finite transduction.
"3.4.31.
Does Exercise 3.4.29 still hold if the homomorphism is replaced by a pushdown transducer mapping ?
Research Problem 3.4.32.
It would be nice if whenever G2 left- or right-covered G i, every SDTS with G1 as underlying grammar were equivalent to an SDTS with G~ as underlying grammar. Unfortunately, this is not so. Can you find the conditions relating G1 and G2 so that the SDT's with underlying grammar G t are a subset of those with underlying grammar Gz ?
BIBLIOGRAPHIC
NOTES
Additional details concerning grammatical covering can be found in Reynolds and Haskell [1970], Gray [1969] and Gray and Harrison [1969]. In some early articles left-corner parsing was called bottom-up parsing. A more extensive treatment of left-corner parsing is contained in Cheatham [1967].
4
GENERAL PARSING METHODS
This chapter is devoted to parsing algorithms that are applicable to the entire class of context-free languages. Not all these algorithms can be used on all context-free grammars, but each context-free language has at least one grammar for which all these methods are applicable. The full backtracking algorithms will be discussed first. These algorithms deterministically simulate nondeterministic parsers. As a function of the length of the string to be parsed, these backtracking methods require linear space but may take exponential time. The algorithms discussed in the second section of this chapter are tabular in nature. These algorithms are the Cocke-Younger-Kasami algorithm and Earley's algorithm. They each take space n 2 and time n 3. Earley's algorithm works for any context-free grammar and requires time n 2 whenever the grammar is unambiguous. The algorithms in this chapter are included in this book primarily to give more insight into the design of parsers. It should be clearly stated at the outset that backtrack parsing algorithms should be shunned in most practical applications. Even the tabular methods, which are asymptotically much faster than the backtracking algorithms, should be avoided if the language at hand has a grammar for which the more efficient parsing algorithms of Chapters 5 and 6 are applicable. It is almost certain that virtually all programming languages have easily parsable grammars for which these algorithms are applicable. The methods of this chapter would be used in applications where the grammars encountered do not possess the special properties that are needed by the algorithms of Chapters 5 and 6. For example, if ambiguous grammars are necessary, and all parses are of interest, as in natural language processing, then some of the methods of this chapter might be considered. 281
282
4.1.
GENERALPARSING METHODS
CHAP. 4
BACKTRACK PARSING
Suppose that we have a nondeterministic pushdown transducer P and an input string w. Suppose further that each sequence of moves that P can make on input w is of bounded length. Then the total number of distinct sequences of moves that P can make is also finite, although possibly an exponential function of the length of w. A crude, but straightforward, way of deterministically simulating P is to linearly order the sequences of moves in some manner and then simulate each sequence of moves in the prescribed order. If we are interested in all outputs for input w, then we would have to simulate all move sequences. If we are interested in only one output for w, then once we have found the first sequence of moves that terminates in a final configuration, we can stop simulating P. Of course, if no sequence of moves terminates in a final configuration, then all move sequences would have to be tried. We can think of backtrack parsing in the following terms. Usually, the sequences of moves are arranged in such an order that it is possible to simulate the next move sequence by retracing (backtracking) the last moves made unti! a configuration is reached in which an untried alternative move is possible. This alternative move would then be taken. In practice, local criteria by which it is possible, without simulating an entire sequence, to determine that the sequence cannot lead to a final configuration, are used to speed up the backtracking process. In this section we shall describe how we can deterministically simulate a nondeterministic pushdown transducer using backtracking. We shall then discuss two special cases. The first will be top-down backtrack parsing in which we produce a left parse for the input. The second case is bottom-up backtrack parsing in which we produce a right parse. 4.1.1.
S i m u l a t i o n of a P D T
Let us consider a P D T P and its underlying PDA M. If we give M an input w, it is convenient to know that while M may nondeterministically try many sequences of moves, each sequence is of bounded length. If so, then these sequences can all be tried in some reasonable order. If there are infinite sequences of moves with input w, it is, in at least one sense, impossible to directly simulate M completely. Thus we make the following definition. DEFINITION
A PDA M = (Q, ~, I', $, q0, Z0, F) is halting if for each w in ~*, there is a constant k , such that if (q0, w, Z 0 ) ~ (q, x, 7), then m < kw. A P D T is halting if its underlying PDA is halting.
sEc. 4.1
BACKTRACK PARSING
283
It is interesting to observe the conditions on a grammar G under which 1 the left or right parser for G is halting. It is left for the Exercises to s h o w that the left parser is halting if and only if G is not left-recursive; the right parser is halting if and only if G is cycle-free and has no e-productions. We shall show subsequently that these conditions are the ones under which our general top-down and bottom-up backtrack parsing algorithms work, although more general algorithms work on a larger class of grammars. We should observe that the condition of cycle freedom plus no e-productions is not really very restrictive. Every CFL without e has such a grammar, and, moreover, any context-free grammar can be made cycle-flee and e-fr~,e by simple transformations (Algorithms 2.10 and 2.11). What is more, if the original grammar is unambiguous, then the modified grammar left and right covers it. Non-left recursion is a more stringent condition in this sense. While every CFL has a non-left-recursive grammar (Theorem 2.18), there may be no non-left-recursive covering grammar. (See Exercise 3.4.29.) As an example of what is involved in backtrack parsing and, in general, simulating a nondeterministic pushdown transducer, let us consider the grammar G with productions (1) S- ~ aSbS ~ (2) S
>aS
(3) S
>c
The following pushdown transducer T is a left parser for G. The moves of T are given by $(q, a, S) = {(q, SbS, 1), (q, S, 2)} iS(q, c, S) = [(q, e, 3)] ~(q, b, b) = {(q, e, e)} Suppose that we wish to parse the input string aacbc. Figure 4.I shows a tree which represents the possible sequences of moves that T can make
with this input.
cJ C ° ~ c3~
=
~c4
c,~C2~c,s
!1 I I I !
C5
C8
C12
C6
I I
C9
Cl 3
C7
Clo
C14
[
C16
Fig. 4.1
?
=
Moves of parser.
284
GENERAL PARSING METHODS
CHAP. 4
Co represents the initial configuration (q, aacbc, S, e). The rules of T show that two next configurations are possible from Co, namely Cx = (q, ache, SbS, 1) and C2 = (q, acbc, S, 2). (The ordering here is arbitrary.) From C~, T can enter configurations C3 = (q, cbc, SbSbS, 11) and C4 = (q, cbc, SbS, 12). From C2, T can enter configurations C~ = (q, cbc, SbS, 21) and C~5 = (q, cbc, S, 22). The remaining configurations are determined uniquely. One way to determine all parses for the given input string is to determine all accepting configurations which are accessible from Co in the tree of configurations. This can be done by tracing out all possible paths which begin at C Oand terminate in a configuration from which no next move is possible. We can assign an order in which the paths are tried by ordering the choices of next moves available to T for each combination of state, input symbol, and symbol on top of the pushdown list. For example, let us choose (q, SbS, 1) as the first choice and (q, S, 2) as the second choice of move whenever the rule t~(q, a, S) is applicable. Let us now consider how all the accepting configurations of T can be determined: by systematically tracing out all possible sequences of moves of T. From Co suppose that we make the first choice of next move to obtain C~. From C~ we again take the first choice to obtain C 3. Continuing in this fashion we follow the sequence of configurations Co, C~, C a, C5, C6, C7. C7 represents the terminal configuration (q, e, bS, 1133), which is not an accepting configuration. To determine if there is another terminal configuration, we can "backtrack" up the tree until we encounter a configuration from which another choice of next move not yet considered is available. Thus we must be able to restore configuration C6 from C7. Going back to C6 from C7 can involve moving the input head back on the input, recovering what was previously on the pushdown list, and deleting any output symbols that were emitted in going from C6 to C7. Having restored C6, we must also have available to us the next choice of moves (if any). Since no alternate choices exist in C6, we continue backtracking to C5, and then C a and C~. From C~ we can then use the second choice of move for ~(q, a, S) and obtain configuration C4. We can then continue through configurations C8 and C9 to obtain C~0 = (q, e, e, 1233), which happens to be an accepting configuration. We can then emit the left parse 1233 as output. If we are interested in obtaining only one parse for the input we can halt at this point. However, if we are interested in all parses, we can proceed to backtrack to configuration Co and then try all configurations accessible from C2. C~4 represents another accepting configuration, (q, e, e, 2133). We would then halt after all possible sequences of moves that T could have made have been considered. If the input string had not been syntactically well formed, then all possible move sequences would have to be considered.
SEC. 4.1
BACKTRACK PARSING
285
After exhausting all choices of moves without finding an accepting configuration we would output the message "error." The above analysis illustrates the salient features of what is sometimes known as a nondeterministic algorithm, one in which choices are allowed at certain steps and all choices must be followed. In effect, we systematically generate all configurations that the data underlying the algorithm can be in until we either encounter a solution or exhaust all possibilities. The notion of a nondeterministic algorithm is thus applicable not only to the simulation of nondeterministic automata, but to many other problems as well. It is interesting to note that something analogous to the halting condition for PDT's always enters into the question of whether a nondeterministic algorithm can be simulated deterministicaUy. Some specific examples of nondeterministic algorithms are found in the Exercises. In syntax analysis a grammar rather than a pushdown transducer will usually be given. For this reason we shall now discuss top-down and bottomup parsing directly in terms of the given grammar rather than in terms of the left or right parser for the grammar. However, the manner in which the algorithms work is identical to the serial simulation of the pushdown parser. Instead of cycling through the possible sequences of moves the parser can make, we shall cycle through all possible derivations that are consistent with the input. 4.1.2.
Informal Top-Down Parsing
The name top-down parsing comes from the idea that we attempt to produce a parse tree for the input string starting from the top (root) and working down to the leaves. We begin by taking the given grammar and numbering in some order the alternates for every nonterminal. That is, if A---~ ~z~10c21"" 10c~ are all the A-productions in the grammar, we assign some ordering to the ¢zt's (the alternates for A). For example, consider the grammar mentioned in the previous section. The S-productions are S
> aSbS[aSIc
and let us use them in the order given. That is, aSbS will be the first alternate for S, aS the second, and c the third. Let us assume that our input string is aacbc. We shall use an input pointer which initially points at the leftmost symbol of the input string. Briefly stated, a top-down parser attempts to generate a derivation tree for the input as follows. We begin with a tree containing one node labeled S. That node is the initial active node. We then perform the following steps recursively"
286
GENERAL PARSING METHODS
CHAP. 4
(1) If the active node is labeled by a nonterminal, say A, then choose the first alternate, say X~ . . . Xk, for A and create k direct descendants for A labeled X1, X2,. • . , Xk. Make X1 the active node. If k = 0, then make the node immediately to the right of A active. (2) If the active node is labeled by a terminal, say a, then compare the current input symbol with a. If they match, then make active the node immediately to the right of a and move the input pointer one symbol to the right. If a does not match the current input symbol, go back to the node where the previous production was applied, adjust the input pointer if necessary, and try the next alternate. If no alternate is possible, go back to the next previous node, and so forth. At all times we attempt to keep the derivation tree consistent with the input string. That is, if x~ is the frontier of the tree generated thus far, where 0~is either e or begins with a nonterminal symbol, then x is a prefix of the input string. In our example we begin with a derivation tree initially having one node labeled S. We then apply the first S-production, extending the tree in a manner that is consistent with the given input string. Here, we would use S ~ a S b S to extend the tree to Fig. 4.2(a). Since the active node of the tree is a at this instant and the first input symbol is a, we advance the input pointer to the second input symbol and make the S immediately to the right of a the new active node. We then expand this S in Fig. 4.2(a), using the first alternate, to obtain Fig. 4.2(b). Since the new active node is a, which matches
a
(a)
I
(b)
i
I
¢
c
(c)
¢ (d)
Fig. 4.2 Partial derivation trees.
SEC. 4.1
BACKTRACKPARSING
287
the second input symbol, we advance the input pointer to the third input symbol. We then expand the leftmost S in Fig. 4.2(b), but this time we cannot use either the first or second alternate because then the resulting left-sentential form would not be consistent with the input string. Thus we must use the third alternate to obtain Fig. 4.2(c). We can now advance the input pointer from the third to the fourth and then to the fifth input symbol, since the next two active symbols in the left-sentential form represented by Fig. 4.2(c) are c and b. We can expand the leftmost S in Fig. 4.2(c) using the third alternate for S to obtain Fig. 4.2(d). (The first two alternates are again inconsistent with the input.) The fifth terminal symbol is c, and thus we can advance the input pointer one symbol to the left. (We assume that there is a marker to denote the end of the input string.) However, there are more symbols generated by Fig. 4.2(d), namely bS, than there are in the input string, so we now know that we are on the wrong track in finding a correct parse for the input. Recalling the pushdown parser of Section 4.1.1, we have at this point gone through the sequence of configurations Co, C1, C3, C5, C6, C7. There is no next move possible from C 7. We must now find some other left-sentential form. We first see if there is another alternate for the production used to obtain the tree of Fig. 4.2(d) from the previous tree. There is none, since we used S ~ c to obtain Fig. 4.2(d) from Fig. 4.2(c). We then return to the tree of Fig. 4.2(c) and reset the input pointer to position 3 on the input. We determine if there is another alternate for the production used to obtain Fig. 4.2(c) from the previous tree. Again there is none, since we used S - 4 c to obtain Fig. 4.2(c) from Fig. 4.2(b). We thus return to Fig. 4.2(b), resetting the input pointer to position 2. We used the first alternate for S to obtain Fig. 4.2(b) from Fig. 4.2(a), so now we try the second alternate and obtain the tree of Fig. 4.3(a). We can now advance the input pointer to position 3, since the a generated matches the a at position 2 in the input string. Now, we may use only the third alternate to expand the leftmost S in Fig. 4.3(a) to obtain Fig. 4.3(b). The input symbols at positions 3 and 4 are now matched, so we can advance the input pointer to position 5. We can apply only the third alternate for S in Fig. 4.3(b), and we obtain Fig. 4.3(c). The final input symbol is matched with the rightmost symbol of Fig. 4.3(c). We thus know that Fig. 4.3(c) is a valid parse for the input. At this point we can backtrack to continue looking for other parses, or terminate. Because our grammar is not left-recursive, we shall eventually exhaust all possibilities by backtracking. That is, we would be at the root, and all alternates for S would have been tried. At this point we can halt, and if we have not found a parse, we can report that the input string is not syntactically well formed.
288
CHAP. 4
GENERAL PARSING METHODS
a
s
S
!
c
(a)
(b)
a .
S
c
I
¢
(c) Fig. 4.3 Further attempts at parsing. There is a major pitfall in this procedure. If the grammar is left-recursive, then this process may never terminate. For example, suppose that Ae is the first alternate for A. We would then apply this production forever whenever A is to be expanded. One might argue that this problem could be avoided by trying the alternate A0c for A last. However, the left recursion might be far more subtle, involving several productions. For example, the first A-production might be A ---~ SC. Then if S ~ A B is the first production for S, we would have A =-~ S C =~ A B C , and this pattern would repeat. Even if a suitable ordering for the productions of all nonterminals is found, on inputs 'which are not syntactically well formed, the left-recursive cycles would occur eventually, since all preceding choices would fail. A second attempt to nullify the effects of left recursion might be to bound the number of nodes in the temporary tree in terms of the length of the input string. If we have a C F G G = (N, X, P, S) with ~ N = k and an input string w of length n - 1, we can show that if w is in L(G), then there is at least one derivation tree for w that has no path of length greater than kn. Thus we could confine our search to derivation trees of depth (maximum path length) no greater than kn. However, the number of derivation trees of depth < d can be an enormous function of d for some grammars. For example, consider the grammar
SEC. 4.1
BACKTRACKPARSING
289
G with productions S ~ S S I e. The number of derivation trees of depth d for this grammar is given by the recurrence D(1) = 1 D(d) = (D(d--
1))2+ 1
Values of D ( d ) for d from 1 to 6 are given in Fig. 4.4.
d
D(d)
1
2 3 4 5 6
1
2 5 26 677 458330
Fig. 4.4
Values of
D(d).
D ( d ) grows very rapidly, faster than 2 2`-` for d > 3. (Also, see Exercise 4.1.4.) This growth is so huge that any grammar in which two productions of this form need to be considered could not possibly be reasonably parsed using this modification of the top-down parsing algorithm. For these reasons the approach generally taken is to apply the top-down parsing algorithm only to grammars that are free of left recursion.
4.1.3.
The Top-Down Parsing Algorithm
We are now ready to describe our top-down backtrack parsing algorithm. The algorithm uses two pushdown lists (L1 and L2) and a counter containing the current position of the input pointer. To describe the algorithm precisely, we shall use a stylized notation similar to that used to describe configurations of a pushdown transducer. ALGORITHM 4.1 Top-down backtrack parsing. Input. A non-left-recursive CFG G = (N, X, P, S) and an input string w = ala 2 . . . a,, n ~ O. We assume that the productions in P are numbered
1 , 2 , . . . ,p. Output. One left parse for w if one exists. The output "error" otherwise. Method.
(i) For each nonterminal A in N, order the alternates for A. Let A t be the index for the ith alternate of A. For example, if A --~ t~litz2[.-. [~k
290
GENERAL PARSING METHODS
CHAP. 4
are all the A-productions in P and we have ordered the alternates as shown, then A 1 is the index for as, A2 is the index for 0~2, and so forth. (2) A 4-tuple (s, i, ~, fl) will be used to denote a configuration of the algorithm" (a) s denotes the state of the algorithm. (b) i represents the location of the input pointer. We assume that the n + 1st "input symbol" is $, the right endmarker. (c) 0~ represents the first pushdown list (L1). (d) fl represents the second pushdown list (L2). The top of 0c will be on the right and the top of fl will be on the left. L2 represents the "current" left-sentential form, the one which our expansion of nonterminals has produced. Referring to our informal description of topdown parsing in Section 4.1.1, the symbol on top of L2 is the symbol labeling the active node of the derivation tree being generated. L1 represents the current history of the choices of alternates made and the input symbols over which the input head has shifted. The algorithm will be in one of three states q, b, or t; q denotes normal operation, b denotes backtracking, and t is the terminating state. (3) The initial configuration of the algorithm is (q, 1, e, S$). (4) There are six types of steps. These steps will be described in terms of their effect on the configuration of the algorithm. The heart of the algorithm is to compute successive configurations defined by a "goes to" relation, t--. The notation (s, i, a, fl) ~ (s', i', a', fl') means that if the current configuration is (s, i, ~, fl), then we are to go next into the configuration (s', i', ~', fl'). Unless otherwise stated, i can be any integer from 1 to n + 1, a a string in (X w I)*, where I is the set of indices for the alternates, and fl a string in (N U X)*. The six types of move are as follows: (a) Tree expansion
(q, i, oc, Aft) F- (q, i, ocA 1, ?lfl) where A ~ 71 is a production in P and ?1 is the first alternate for A. This step corresponds to an expansion of the partial derivation tree using the first alternate for the leftmost nonterminal in the tree. (b) Successful match of input symbols and derived symbol
(q, i, oc, aft) ~- (q, i + 1, oca, fl) provided a~ - - a , i ~ n. If the ith input symbol matches the next terminal symbol derived, we move that terminal symbol from the top of L2 to the top of L1 and increment the input pointer.
SEC. 4.1
BACKTRACK PARSING
291
(c) Successful conclusion
( q , n + 1, a, $) }--- (t, n -+- 1, a, e) We have reached the end of the input and have found a leftsentential form which matches the input. We can recover the left parse from a by applying the following homomorphism h to a: h(a) -- e for all a in E; h(A,) = p, where p is the production number associated with the production A ~ ?, and ? is the ith alternate for A. (d) Unsuccessful match of input symbol and derived symbol
( q, i, oc, aft) ~ (b, i, o~, aft)
if at ~ a
We go into the backtracking mode as soon as the left-sentential form being derived is not consistent with the input.
(e) Backtracking on input (b, i, o~a, fl) F- (b, i -- 1, ~, aft) for all a in E. In the backtracking mode we shift input symbols back from L1 to L2. (f) Try next alternate
(b, i, ocAj, ?jfl) ~(i) (q, i, ~Aj+ 1, ?j+lfl), if )'j+l is the j + 1st alternate for A. (Note that ),~ is replaced by ~,j+t on the top of L2.) (ii) No configuration, if i - - 1 , A = S, and there are only j alternates for S. (This condition indicates that we have exhausted all possible left-sentential forms consistent with the input w without having found a parse for w.) (iii) ( b , i , ~ , A f l ) otherwise. (Here, the alternates for A are exhausted, and we backtrack by removing Aj from L1 and replacing ~,j by A on L2.) The execution of the algorithm is as follows.
Step 1" Starting in the initial configuration, compute successive next configurations Co ~ C1 [-- " " ~ Ci ~ .." until no further configurations can be computed. Step 2: If the last computed configuration is (t, n + 1, ?, e), emit h(?) and halt. h(?) is the first found left parse. Otherwise, emit the error signal. D Algorithm 4.1 is essentially the algorithm we described informally earlier, with a few bookkeeping features added to perform the backtracking.
292
GENERAL PARSING METHODS
CHAP. 4
Example 4.1
Let us consider the operation of Algorithm 4.1 using the grammar G with productions (1) E
> T-4- E
(2) E
>T
(3) T - - - . F . T (4) T
>F
(5) F -
>a
Let E~ be T + E , Ez be T, T~ be F • T, a n d / ' 2 be F. With the input a -4- a, Algorithm 4.1 computes the following sequence of configurations" (q, 1, e, E$) [-- (q, 1, g~, T + E$) I- (q, 1, E 1T 1, F • T + E$)
(q, 1, E 1T1F1, a • T + E$) (q, 2, E~ TtFta, • T + E$) (b, 2, Et T~Fta, * T + E$) [-- (b, 1, E~ T~Ft, a • T + E$) (b, 1, Et Tt, F , T + E$) l--- (q, 1, Et T 2, F + E$) [-- (q, 1, E~ T2F~, a + E$) (q, 2, Et T2Fta, + E$) [-- (q, 3, E~ T2F~a +, E$) l- (q, 3, E, TzF~a + E 1, T + E$) (q, 3, E~ T2F ~a + Et Tt, F • T + E$) (q, 3, E~ T2Fta + E~ TtF 1, a • T + E$) t- (q, 4, E~ T2F~a + E t T~F,a, • T + E$) (b, 4, E t T2Fxa + E~ TiF~a, • T + E$) (b, 3, E~ T2F~a + E t T~F t, a • T + E$) ~- (b, 3, E~ T2F~a + E~ T~, F , T + E$) ~- (q, 3, E~ T2F~a + E, T2, F q- E$) (q, 3, E, TzF~a -k E, T2Ft, a -4- E$) [- (q, 4, E, TzF~a -t-- E1TzF~a, .+- E$) [- (b, 4, E~ TzFta -4- E~ TzF, a, + E$)
SEC. 4.1
BACKTRACK PARSING
293
(b, 3, El TzF~a + E~ TzF a, a + E$) ~- (b, 3, E 1TzF~a + E 1T2, F + E$) ~- (b, 3, Et T2F~a + Ea, T + E$) ~- (q, 3, E 1TzFla -k E2, T$) ]- (q, 3, E 1T2Fla + E2 T 1, F • T$) ]- (q, 3, Ea T2F~a + E2 T~Fa, a • T$) F- (q, 4, gaTzV~a + gzr~.Fta, • T$) ~--- (b, 4, E1T2Fla + EzT1F1 a, * T$) [- (b, 3, Ea T2Faa + E2 TiF1, a • T$) (b, 3, Ea TzFaa + EzT1, F . T$) (q, 3, g~ T2Faa + E2T2, F$) [- (q, 3, EI T2F~a + E2 T2F1, aS)
(q, 4, g~ r~F~a + g~r~F~a, $) (t, 4, E~ T2F~a + E2T2F~a, e) The left parse is h(E~ T2Fxa + E2T2F~a) = 145245.
[Z
We shall now show that Algorithm 4.1 does indeed produce a left parse for w according to G if one exists. DEFINITION
A partial left parse is the sequence of productions used in a leftmost derivation of a left-sententiai form. We say that a partial left parse is consistent with the input string w if the associated left-sentential form is consistent with w. Let G = (N, X, P, S) be the non-left-recursive g r a m m a r of Example 4.1 and let w = a 1 - . . an be the input string. The sequence of consistent partial left parses for w, n0, tel, z~2, . . . . zti, • • • is defined as follows: (1) no is e and represents a derivation of S from S. (z~0 is not strictly a parse.) (2) n1 is the production n u m b e r for S ---~ ~, where ~z is the first alternate for S. (3) ztt is defined as follows: Suppose that S "'-~=~ xAy. Let fl be the lowest n u m b e r e d alternate for A, if it exists, such that we can write xfl?-- xyO, where ~ is either e or begins with a nonterminal and xy is a prefix of w. Then n ~ - - n i _ i A k , where k is the n u m b e r of alternate ft. In this case we call rt~ a continuation of n,._~. If, on the other hand, no such fl exists, or S .... =~ x for some terminal string x, then let j be the largest integer less than i - 1 such that the following conditions hold:
294
GENERALPARSING METHODS
CHAP. 4
(a) Let S ~'=~ xB?, and let nj+ 1 be a continuation of nj, with alternate ek replacing B in the last step of nj+ 1. Then there exists an alternate e,, for B which follows ek in the order of alternates for B. (b) We can write Xem? = xyO, where O is e or begins with a nonterminal; xy is a prefix of w. Then ni -- njB,,, where B m is the number of production B --, e~. In this case, we call nt a modification of ~i- 1(c) 7r~ is undefined if (a) or (b) does not apply. Example 4.2
For the g r a m m a r G of Example 4.1 and the input string a + a, the sequence of consistent partial left parses is e
1 13 14 145 1451 14513 14514 1452 14523 14524 145245 2 23 24
It should be observed that the sequence of consistent partial left parses up to the first correct parse is related to the sequence of strings appearing on L1. Neglecting the terminal symbols on L1, the two sequences are the same, except that L1 will have certain sequences that are not consistent with the input. When such a sequence appears on L1, backtracking immediately occurs.
[~]
It should be obvious that the sequence of consistent partial left parses is unique and includes all the consistent partial left parses in a natural lexicographic order. LEMMA 4.1
Let G = (N, X, P, S) be a non-left-recursive grammar. Then there exists i
a constant c such that if A ~
wBot and Iwl = n, then i < c"+2.t
lm
t i n fact, a stronger result is possible; i is linear in n. However, this result suffices for the time being and will help prove the stronger result.
SEC. 4.1
BACKTRACK PARSIN6
295
Proof. Let ~ N = k, and consider the derivation tree D corresponding i
to the leftmost derivation A =-~ wBt~. Suppose that there exists a path of length more than k(n -I- 2) from the root to a leaf. Let n o be the node labeled by the explicitly shown B in wBe. If the path reaches a leaf to the right of no, then the path to no must be at least as long. This follows because in a leftmost derivation the leftmost nonterminal is always rewritten. Thus the direct ancestor of each node to the right of no is an ancestor of no. The derivation tree D is shown in Fig. 4.5. A
w
Fig. 4.5 Derivation tree D.
no
Thus, if there is a path of length greater than k(n + 2) in D, we can find one such path which reaches n o or a node to its left. Then we can find k 4-- 1 consecutive nodes, say n l , . . . , nk+l, on the path such that each node yields the same portion of wB. All the direct descendants of n,., 1 _~ i < k, that lie to the left of ni+ ~ derive e. We must thus be able to find two of n a , . . . , nk+ ~ with the same label, and this label is easily shown to be a left-recursive nonterminal. We may conclude that D has no path of length greater than k(n --1- 2). Let l be the length of the longest right side of a production. Then D has no i
more than l k~"+2~ interior nodes. We conclude that if A ~ i ~ I kC"+2). Choosing c = l k proves the lemma. [Z]
wBt~, then
COROLLARY
Let G = (N, ~, P, S) be a non-left-recursive grammar. Then there is i
a constant c' such that if S ~
w B e and w ~ e, then [el 2" certainly follows. [-7
This example points out a major problem with top-down backtrack parsing. The number of steps necessary to parse by Algorithm 4.1 can be enormous. There are several techniques that can be used to speed this algorithm somewhat. We shall mention a few of them here. (1) We can order productions so that the most likely alternates are tried first. However, this will not help in those cases in which the input is not syntactically well formed, and all possibilities have to be tried. DEFINITION For a CFG G -- (N, Z, P, S), FIRST k(a) -- [x t~ ::~ lm
xfl and [xl = k or a =:~ x and i xl < k}.
That is, FIRSTk(a ) consists of all terminal prefixes of length k (or less if a derives a terminal string of length less than k) of the terminal strings that can be derived from ~. (2) We can look ahead at the next k input symbols to determine whether a given alternate should be used. For example, we can tabulate, for each alternate a, a lookahead set FIRSTk(a ). If no prefix of the remaining input string is contained in FIRSTk(a), we can immediately reject a and try the next alternate. This technique is very useful both when the given input is in L(G) and when it is not in L(G). In Chapter 5 we shall see that for certain classes of grammars the use of lookahead can entirely eliminate the need for backtracking. (3) We can add bookkeeping features which will allow faster backtracking. For example, if we know that the last m-productions applied have no applicable next alternates, when failure occurs we can skip back directly to the position where there is an applicable alternate. (4) We can restrict the amount of backtracking that can be done. We shall discuss parsing techniques of this nature in Chapter 6. Another severe problem with backtrack parsing is its poor error-locating capability. If an input string is not syntactically well formed, then a compiler
SEC. 4.1
BACKTRACKPARSING
301
should announce which input symbols are in error. Moreover, once one error has been found, the compiler should recover from that error so that parsing can resume in order to detect any additional errors that might occur. If the input string is not syntactically well formed, then the backtracking algorithm as formulated will merely announce error, leaving the input pointer at the first input symbol. To obtain more detailed error information, we can incorporate error productions into the grammar. Error productions are used to generate strings containing common syntactic errors and would make syntactically invalid strings well formed. The production numbers in the output corresponding to these error productions can then be used to signal the location of errors in the input string. However, from a practical point of view, the parsing algorithms presented in Chapter 5 have better error-announcing capabilities than backtracking algorithms with error productions. 4.1.5.
Bottom-Up Parsing
There is a general approach to parsing that is in a sense opposite to that of top-down parsing. The top-down parsing algorithm can be thought of as building the parse tree by trial and error from the root (top) and proceeding downward to the leaves. Its opposite, bottom-up parsing, starts with the leaves (i.e., the input symbols themselves) and attempts to build the tree upwards toward the root. We shall describe a formulation of bottom-up parsing that is called shiftreduce parsing. The parsing proceeds using essentially a right parser cycling through all possible rightmost derivations, in reverse, that are consistent with the input. A move consists of scanning the string on top of the pushdown list to see if there is a right side of a production that matches the symbols on top of the list. If so, a reduction is made, replacing these symbols by the left side of the production. If more than one reduction is possible, we order the possible reductions in some arbitrary manner and apply the first. If no reduction is possible, we shift the next input symbol onto the pushdown list and proceed as before. We shall always attempt to make a reduction before shifting. If we come to the end of the string and no reduction is possible, we backtrack to the last move at which we made a reduction. If another reduction was possible at that point we try that. Let us consider a grammar with productions S - - ~ AB, A - - . ab, and B --~ aba. Let the input string be ababa. We would shift the first a on the pushdown list. Since no reduction is possible, we would then shift the b on the pushdown list. We then replace ab on top of the pushdown list by A. At this point we have the partial tree of Fig. 4.6(a). As the A cannot be further reduced, we shift a onto the pushdown list. Again no reduction is possible, so we shift b onto the pushdown list. We can then reduce ab to A. We now have the partial tree of Fig. 4.6(b).
302
GENERAL PARSING METHODS
a
b
CHAP. 4
a
(a)
/~.b a/A~b (b) A
B
a
b
a
b
a
(c)
Ajs a/ ~b a/!~a (d)
Fig. 4.6 P a r t i a l p a r s e trees in b o t t o m up parse.
We shift a on the pushdown list and find that no reductions are possible. We then backtrack to the last position at which we made a reduction, namely where the pushdown list contained Aab (b is on top here) and we replaced ab by A, i.e., when the partial tree was that of Fig. 4.6(a). Since no other reduction is possible, we now shift instead of reducing. The pushdown list now contains Aaba. We can then reduce aba to B, to obtain Fig. 4.6(c). Next, we replace AB by S and thus have a complete tree, shown in Fig. 4.6(d). This method can be viewed as considering all possible sequences of moves of a nondeterministic right parser for a grammar. However, as with topdown parsing, we must avoid situations in which the number of possible moves is infinite. One such pitfall occurs when a grammar has cycles, that is, derivations +
of the form A ~ A for some nonterminal A. The number of partial trees can be infinite in this case, so we shall rule out grammars with cycles. Also, e-productions cause difficulty, since we can make an arbitrary number of reductions in which the empty string is "reduced" to a nonterminal. Bottomup parsing can be extended to embrace grammars with e-productions, but for simplicity we shall choose to outlaw e-productions here.
sEc. 4.1
BACKTRACKPARSING
303
ALGORITHM 4.2 Bottom-up backtrack parsing. Input. C F G G = (N, E, P, S) with no cycles or e-productions, whose productions are numbered 1 to p, and an input string w = alaz .. • a,, n ~ 1. Output. One right parse for w if one exists. The output "error" otherwise. Method.
(1) Order the productions arbitrarily. (2) We shall couch our algorithm in the 4-tuple configurations similar to those used in Algorithm 4.1. In a configuration (s, i, a, fl) (a) s represents the state of the algorithm. (b) i represents the current location of the input pointer. We assume the n + 1st input symbol is $, the right endmarker. (c) a represents a pushdown list L1 (whose top is on the right). (d) fl represents a pushdown list L2 (whose top is on the left). As before, the algorithm can be in one of three states q, b, or t. L1 will hold a string of terminals and nonterminals that derives the portion of input to the left of the input pointer. L2 will hold a history of the shifts and reductions necessary to obtain the contents of L1 from the input. (3) The initial configuration of the algorithm is (q, 1, $, e). (4) The algorithm itself is as follows. We begin by trying to apply step 1. Step 1" Attempt to reduce
(q, i, ~fl, ?) ~ (q, i, ~A, j~,) provided A --~ fl is the jth production in P and fl is the first right side in the linear ordering in (1) that is a suffix of aft. The production number is written on L2. If step 1 applies, return to step 1. Otherwise go to step 2. Step 2: Shift
(q, i, oc, r) [-- (q, i -k 1, oca~,s?) provided i :/= n -k 1. Go to step 1. If i -- n -+- 1, instead go to step 3. If step 2 is successful, we write the ith input symbol on top of L1, increment the input pointer, and write s on L2, to indicate that a shift has been made. Step 3: Accept
(q, n --Jr-1, $S, ?) ~ (t, n --t- 1, $S, ?) Emit h(?), where h is the homomorphism
304
GENERAL PARSING METHODS
CHAP. 4
h(s) = e h(j) = j
for all production numbers
h(7) is a right parse of w in reverse. Then halt. If step 3 is not applicable, go to step 4. Step 4" Enter backtracking mode
(q,n + 1, ~x, ~) I-- (b,n + 1, o~, 7) provided ~ -~ $S. Go to step 5. Step 5" Backtracking (a)
(b, i, ~A, Jr) ~ (q, i, rE'B, k~,)
if the jth production in P is A ----~fl and the next production in the ordering of (1) whose right side is a suffix of ~fl is B ~ fl', numbered k. Note that t~fl = t~'fl'. Go to step 1. (Here we have backtracked to the previous reduction, and we try the next alternative reduction.)
(b, n + 1, ~A, jy) ~ (b, n -q- 1, t~fl, y)
(b)
if the jth production in P is A ---~ fl and no other alternative reductions of ~fl remain. Go to step 5. (If no alternative reductions exist, "undo" the reduction and continue backtracking when the input pointer is at n q- 1.) (c)
(b, i, ~A, Jr) ~ (q, i --k 1, ~fla, sT)
if i ~ n -t- 1, the jth production in P is A ~ fl, and no other alternative reductions of ~fl remain. Here a = a~ is shifted onto L1, and an s is entered on L2. Go to step 1. Here we have backtracked to the previous reduction. No alternative reductions exist, so we try a shift instead. (d)
(b, i, ~a, sy) ~ (b, i -- 1, t~, r)
if the top entry on L2 is the shift symbol. (Here all alternatives at position i have been exhausted, and the shift action must be undone. The input pointer moves left, the terminal symbol a t is removed from L1 and the symbol s is removed from L2.) D Example 4.4
Let us apply this bottom-up parsing algorithm to the grammar G with productions
sEc.
4.1
BACKTRACK PARSING
305
(1) E - - - ~ E + T (2) E
~T
(3) T
~ T, F
(4) T~
~F
(5) F~ ~ a If E + T appears on top of L1, we shall first try reducing, using E ---~ E + T and then using E ~ T. If T . F appears on top of L1, we shall first try T----, T , F and then T---. F. With input a • a the bottom-up algorithm would go through the following configurations: (q, 1, $, e) ~ (q, 2, $a, s) }-- (q, 2, $F, 5s) (q, 2, ST, 45s) ~- (q, 2, $E, 245s) 1-- (q, 3, $E ,, s245s)
1-.- (q, 4, $E • a, ss245s) (q, 4, $E • F, 5ss245s) t-- (q, 4, S E . T, 45ss245s) I-- (q, 4, $E • E, 245ss245s) t-- (b, 4, $ E , E, 245ss245s) l- (b, 4, SE • T, 45ss245s) }--- (b, 4, $E • F, 5ss245s) t- (b, 4, $E • a, ss245s) 1--- (b, 3, $ E . , s245s)
(b, 2, $E, 245s) t--- (q, 3, $T,,s45s)
t--- (q, 4, $T • a, ss45s) (q, 4, $ T , F, 5ss45s) (q, 4, ST, 35ss45s)
(q, 4, $E, 235ss45s) I--- (t, 4, $E, 235ss45s) E] We can prove the correctness of Algorithm 4.2 in a manner analogous to the way we showed that top-down parsing worked. We shall outline a proof here, leaving most of the details as Exercises.
306
GENERALPARSING METHODS
CHAP. 4
DEFINITION
Let G = (N, E, P, S) be a CFG. We say that z~ is a partial right parse consistent with w if there is some g in (N W E)* and a prefix x of w such that LEMMA 4.5
Let G be a cycle-free CFG with no e-productions. Then there is a constant c such that the number of partial right parses consistent with an input of length n is at most c".
Proof. Exercise.
[~]
THEOREM 4.4
Algorithm 4.2 correctly finds a right parse of w if one exists, and signals an error otherwise. Proof By Lemma 4.5, the number of partial right parses consistent with the input is finite. It is left as an Exercise to show that unless Algorithm 4.2 finds a parse, it cycles through all partial right parses in a natural order. Namely, each partial right parse can be coded by a sequence of production indices and shift symbols (s). Algorithm 4.2 considers each such sequence that is a partial right parse in a lexicographic order. That lexicographic order is determined by an order of the symbols, placing s last, and ordering the production indices as in step 1 of Algorithm 4.2. Note that not every sequence of such symbols is a consistent partial right parse. [-7 Paralleling the analysis for Algorithm 4.1, we can also show that the lengths of the lists in the configurations for Algorithm 4.2 remain linear in the input length. THEOREM 4.5 Let one cell be needed for each symbol on a list in a configuration of Algorithm 4.2, and let the number of elementary operations needed to compute one step of Algorithm 4.2 be bounded. Then for some constants c a and c 2, Algorithm 4.2 requires can space and c~ time, when given input of length n~l.
Proof Exercise.
[~]
There are a number of modifications we can make to the basic bottom-up parsing algorithm in order to speed up its operations" (1) We can add "lookahead" so that if we find that the next k symbols to the right of the input pointer could not possibly follow an A in any rightsentential form, then we do not make a reduction according to any A-production.
EXERCISES
307
(2) We can attempt to order the reductions so that the most likely reductions are made first. (3) We can add information to determine whether certain reductions will lead to success. For example, if the first reduction uses the production A ~ ax . . . ak, where a~ is the first input symbol and we know that there is no y in Z* such that S ~ Ay, then this reduction can be immediately ruled out. In general we want to be sure that if $~ is on L1, then ~ is the prefix of a right-sentential form. While this test is complicated in general, certain notions, such as precedence, discussed in Chapter 5, will make it easy to rule out many ~'s that might appear on L1. (4) We can add features to make backtracking faster. For example, we might store information that will allow us to directly recover the previous configuration at which a reduction was made. Some of these considerations are explored in Exercises 4.1.12-4.1.14 and 4.1.25. The remarks on error detection and recovery with the backtracking top-down algorithm also apply to the bottom-up algorithm.
EXERCISES
4.1.1.
Let G be defined by S----~ ASIa
A ~
bSA[b
What sequence of steps is taken by Algorithm 4.1 if the order of alternates is as shown, and the input is (a) ba? (b) baba ? What are the sequences if the order of alternates is reversed ? 4.1.2.
Let G be the grammar S----> S A I A A ~
aAlb
What sequence of steps are taken by Algorithm 4.2 if the order of reductions is longest first, and the input is (a) ab ? (b) abab ? What if the order of choice is shortest first ? 4.1.3.
"4.1.4.
Show that every cycle-free CFG that does not generate e is rightcovered by one for which Algorithm 4.2 works, but may not be leftcovered by any for which Algorithm 4.1 works. Show that the solution to the recurrence
308
CHAP. 4
GENERAL PARSING METHODS
D(1) = 1
D(d) = ( D ( d -
1) 2) + 1
is D(d) = [k~], where k is a real number and [x] is the greatest integer _< x. Here, k = 1.502837 . . . . 4.1.5.
Complete the proof of Corollary 2 to Lemma 4.4.
4.1.6.
Modify Algorithm 4.1 to refrain from using an alternate if it is impossible to derive the next k input symbols, for fixed k, from the resulting left-sentential form.
4.1.7.
Modify Algorithm 4.1 to work on an arbitrary grammar by putting bounds on the length to which L1 and L2 can grow.
*'4.1.8.
Give a necessary and sufficient condition on the input grammar such that Algorithm 4.1 will never enter the backtrack mode.
4.1.9.
Prove L e m m a 4.5.
4.1.10.
Prove Theorem 4.5.
4.1.11.
Modify Algorithm 4.2 to work for an arbitrary C F G by bounding the length of the lists L1 and L2.
4.1.12.
Modify Algorithm 4.2 to run faster by checking that the partial right parse together with the input to the right of the pointer does not contain any sequence of k symbols that could not be part of a rightsentential form.
4.1.13.
Modify Algorithms 4.1 and 4.2 to backtrack to any specially designated previous configuration using a finite number of reasonably defined elementary operations.
*'4.1.14.
Give a necessary and sufficient condition on a grammar such that Algorithm 4.2 will operate with no backtracking. What if the modification of Exercise 4.1.12 is first m a d e ?
4.1.15.
Find a cycle-free grammar with no e-productions on which Algorithm 4.2 takes an exponential amount of time.
4.1.16.
Improve the bound of Lemma 4.4 if the grammar has no e-productions.
4.1.17.
Show that if a grammar G with no useless symbols has either a cycle or an e-production, then Algorithm 4.2 will not terminate on any sentence not in L(G). DEFINITION We shall outline a programming language in which we can write nondeterministic algorithms. We call the language N D F (nondeterministic F O R T R A N ) , because it consists of F O R T R A N - l i k e statements plus the statement C H O I C E (nl . . . . . nk), where k ~ 2 and nl . . . . . nk are statement numbers. To define the meaning of an N D F program, we postulate the existence of an interpreter capable of executing any finite number of
EXERCISES
309
programs in a round robin fashion (i.e., working on the compiled version of each in turn for a fixed number of machine operations). We assume that the meaning of the usual F O R T R A N statements is understood. However, if the statement CHOICE (nl . . . . . nk) is executed, the interpreter makes k copies of the program and its entire data region. Control is transferred to statement nt in the ith copy of the program for 1 ~ i < k. All output appears on a single printer, and all input is received from a single card reader (so that we had better read all input before executing any CHOICE statement).
Example 4.5 The following N D F program prints the legend NOT A PRIME one or more times if the input is not a prime number, and prints nothing if it is a prime" READ N I=l PICK A VALUE OF I G R E A T E R T H A N 1
1
I=I+l CHOICE (1, 2)
2
IF (I .EQ. N) STOP
F I N D IF I IS A DIVISOR OF N A N D NOT EQUAL TO N IF ( ( N / I ) , I .NE. N) STOP WRITE ("NOT A PRIME") STOP "4.1.38.
Write an N D F program which prints all answers to the "eight queens problem." (Select eight points on an 8 x 8 grid so that no two lie on any row, column, or diagonal line.)
'4.1.19.
Write N D F programs to simulate a left or right parser. It would be nice if there were an algorithm which determined if a given N D F program could run forever on some input. Unfortunately this is not decidable for F O R T R A N , or any other programming language. However, we can make such a determination if we assume that branches (from IF and assigned GOTO statements, not CHOICE statements) are externally controlled by a "demon" who is trying to make the program run forever, rather than by the values of the program variables. We say an N D F program is halting if for each input there is no sequence of branches and nondeterministic choices that cause any copy of the program to run beyond some constant number of executed statements, the constant being a function of the number of
310
GENERAL PARSING METHODS
CHAP. 4
input cards available for data. (Assume that the program halts if it attempts to read and no data is available.) • 4.1.20.
Give an algorithm to determine whether an N D F program is halting under the assumption that no D O loop index is ever decremented.
"4.1.21.
Give an algorithm which takes a halting N D F program and constructs from it an equivalent A L G O L program. By "equivalent program," we have to mean that the A L G O L program does input and output in an order which the N D F program might do it, since no order for the N D F program is known. ALGOL, rather than F O R T R A N , is preferred here, because recursion is very convenient.
4.1.22.
Let G = (N, ~, P, S) be a CFG. From G construct a C F G G' such that L(G') = ~* and if S '~=> w, then S ~==~ w. G
G'
A P D T (with pushdown top on the left) that behaves as a nondeterministic left-corner parser for a grammar can be constructed from the grammar. The parser will use as pushdown symbols nonterminals, terminals, and special symbols of the form [A, B], where A and B are nonterminals. Nonterminals and terminals appearing on the pushdown list are goals to be recognized top-down. In a symbol [A, B], A is the current goal to be recognized and B is the nonterminal which has just been recognized bottom-up. F r o m a C F G G = (N, 2~, P, S) we can construct a P D T M = ([q}, E, N x N W N W E, A, ~, q, S, ~ ) which will be a left-corner parser for G. Here A = [ 1, 2 . . . . , p} is the set of production numbers, and 5 is defined as follows: (1) Suppose that A ~ ot is the ith production in P. (a) If 0t is of the form Bfl, where B E N, then O(q, e, [C, B]) contains (q, fl[C, A], i) for all C ~ N. Here we assume that we have recognized the left-corner B bottom-up so we establish the symbols in fl as goals to be recognized topdown. Once we have recognized fl, we shall have recognized an A. (b) If 0t does not begin with a nonterminal, then O(q, e, C) contains (q, a[C, A], i) for all nonterminals C. Here, once is recognized, the nonterminal A will have been recognized. (2) 6(q, e, [A, A]) contains (q, e, e) for all A ~ N. Here an instance of the goal A which we have been looking for has been recognized. If this instance of A is not a left corner, we remove [A, A] from the pushdown list signifying that this instance of A was the goal we sought. (3) O(q, a, a ) = [(q, e, e)} for all a ~ E. Here the current goal is a terminal symbol which matches the current input symbol. The goal, being satisfied, is removed. M defines the translation [(w, rc)lw ~
L(G)
and ~ is a left-corner parse for w}.
EXERCISES
311
Example 4.6 Consider the C F G G = (N, ~, P, S) with the productions (1) E ~
E q- T
(2) E ~
T
(3) T -
>F t T
(4) T
~F
(5) F -
> (e)
(6) F - - +
a
A nondeterministic left-corner parser for G is the P D T M = ([q}, ~, N x N U N U ~ , { 1 , 2
. . . . ,6}, ~,q, E, Z~)
where ~ is defined as follows for all A ~ N: (1) (a) 5(q, e, [A, E]) contains (q, ÷ T[A, El, 1). (b) O(q, e, [A, T]) contains (q, [A, E], 2). (c) 5(q, e, [A, F]) contains (q, t T[A, T], 3) and (q, [A, T], 4) (d) O(q, e, A) = [(q, (E)[A, F], 5), (q, a[A, F], 6)}. (2) 6(q, e, [A, A]) contains (q, e, e). (3) O(q, a, a) = {(q, e, e)} for all a ~ E. Let us parse the input string a t a + a using M. The derivation tree for this string is shown in Fig. 4.7. Since the P D T has only one state, we shall ignore the state. The P D T starts off in configuration ( a t a Jr- a, E, e) The second rule in (1 d) is applicable (so is the first), so the PDT can go into configuration
(a t a -+- a, a[E, F], 6)
!
I
T
1
a
F
1 I
F a
Fig. 4.7 Derivation tree for a t a -k a.
31 2
GENERALPARSING METHODS
CHAP. 4
Here, the left-corner a has been generated using production 6. The symbol a is then compared with the current input symbol to yield (1" a + a, [E, F], 6) We can then use the first rule in (lc) to obtain
(1" a q- a, "1"T[E, T], 63) Here we are saying that the left corner of the production T ~ F 1" T will now be recognized once we find t and T. We can then enter the following configurations:
(a + a, TIE, T], 63) ~---(a + a, a[T, F][E, T], 636) ( + a, [ T, F] [E, T], 636)
l--- (+ a, [T, T][E, T], 6364) At this point T is the current goal and an instance of T which is not a left corner has been found. Thus using rule (2) to erase the goal, we obtain ( + a, [E, T], 6364) Continuing, we can terminate with the following sequence of configurations: ( + a, [E, El, 63642) ~ (+ a, + T[E, E], 636421) 1--- (a, TIE, El, 636421) (a, a[T, F][E, El, 6364216) (e, [T, F][E, E], 6364216) (e, IT, T][E, El, 63642164) (e, [E, E], 63642164) F- (e, e, 63642164) D "4.1.23. 4.1.24.
Show that the construction above yields a nondeterministic left-corner parser for a CFG. Construct a left-corner backtrack parsing algorithm. Let G = (N, X~,P, S) be a CFG which contains no production with a right side of length 0 or 1. (Every CFL L such that w ~ L implies that [ w I>_ 2 has such a grammar.) A nondeterministic shift-reduce right parser for G can be constructed such that each entry on the pushdown list is a pair of the form (X, Q), where X ~ N u E u {$} ($ is an endmarker for the pushdown list) and Q is the set of productions P with an indication of all possible prefixes of the right side of each production which could have been recognized to this point. That is, Q will be P with dots placed between some of the symbols of the right sides. There will be a dot in front of Xt in the production
BIBLIOGRAPHIC NOTES
313
A----~ X~ Xz .." X, if and only if X1 -.. X,-1 is a suffix of the list of grammar symbols on the pushdown list. A shift move can be made if the current input symbol is the continuation of some production. In particular, a shift move can always be made if A ---~ 0¢ is in P and the current input symbol is in FIRST i(~). A reduce move can be made if the end of the right side of a production has been reached. Suppose that A ~ tz is such a production. To reduce we remove It~f entries from the top of the pushdown list. If (X, Q) is the entry now on top of the pushdown list, we then write (A, Q') on the list, where Q' is computed from Q by assuming that an A has been recognized. That is, Q' is formed from Q by moving all dots that are immediately to the left of an A to the right of A and adding dots at the left end of the right sides if not already there. Example 4.7
Consider the grammar S - - ~ Scl ab and the input string abc. Initially, the parser would have ($, Q0) on the pushdown list, where Q0 is S---~. Scl .ab. We can then shift the first input symbol and write (a, Q1) on the pushdown list, where Q1 is S - - - , . Sc[. a.b. Here, we can be beginning the productions S ~ Scl ab, or we could have seen the first a of production S ~ ab. Shifting the next input symbol b, we would write (b, Qz) on the pushdown list, where Q1 is S - - ~ . Sc] .ab.. We can then reduce using production S ~ ab. The pushdown list would now contain ($, Qo)(S, Q3), where Q3 is S ~ .S.c[.ab. [[] Domolki has suggested implementing this algorithm using a binary matrix M to represent the productions and a binary vector V to store the possible positions in each production. The vector V can be used in place of Q in the algorithm above. Each new vector on the pushdown list can be easily computed from M and the current value of V using simple bit operations. 4.1.25.
Use Domolki's algorithm to help determine possible reductions in Algorithm 4.2.
BIBLIOGRAPHIC
NOTES
Many of the early compiler-compilers and syntax-directed compilers used nondeterministic parsing algorithms. Variants of top-down backtrack parsing methods were used in Brooker and Morris' compiler-compiler [Rosen, 1967b] and in the META compiler writing systems [Schorre, 1964]. The symbolic programming system C O G E N T simulated a nondeterministic top-down parser by carrying along all viable move sequences in parallel [Reynolds, 1965]. Top-down backtracking methods have also been used to parse natural languages [Kuno and Oettinger, 1962].
314
GENERAL PARSING METHODS
CHAP. 4
One of the earliest published parsing algorithms is the essentially left-corner parsing algorithm of Irons [1961]. Surveys of early parsing techniques are given by Floyd [1964b], Cheatham and Sattley [1963], and Griffiths and Petrick [1965]. Unger [1968] describes a top-down algorithm in which the initial and final symbols derivable from a nonterminal are used to reduce backtracking. Nondeterministic algorithms are discussed by Floyd [1967b]. One implementation of Domolki's algorithm is described by Hext and Roberts [1970]. The survey article by Cohen and Gotlieb [1970] describes the use of list structure representations for context-free grammars in backtrack and nonbacktrack parsing algorithms.
4.2,
TABULAR PARSING METHODS
We shall study two parsing methods that work for all context-free grammars, the Cocke-Younger-Kasami algorithm and Earley's algorithm. Each algorithm requires n 3 time and n 2 space, but the latter requires only n z time when the underlying grammar is unambiguous. Moreover, Earley's algorithm can be made to work in linear time and space for most of the grammars which can be parsed in linear time by the methods to be discussed in subsequent chapters. 4.2.1.
The Cocke-Younger-Kasami Algorithm
In the last section we observed that the top-down and bottom-up backtracking methods may take an exponential amount of time to parse according to an arbitrary grammar. In this section, we shall give a method guaranteed to do the job in time proportional to the cube of the input length. It is essentially a "dynamic programming" method and is included here because of its simplicity. It is doubtful, however, that it will find practical use, for three reasons: (1) n 3 time is too much to allow for parsing. (2) The method uses an amount of space proportional to the square of the input length. (3) The method of the next section (Earley's algorithm) does at least as well in all respects as this one, and for many grammars does better. The method works as follows. Let G = (N, Z, P, S) be a Chomsky normal form C F G with no e-production. A simple generalization works for nonC N F grammars as well, but we leave this generalization to the reader. Since a cycle-free C F G can be left- or right-covered by a C F G in Chomsky normal form, the generalization is not too important. Let w = a l a 2 . . . a n be the input string which is to be parsed according to G. We assume that each a~ is in X for 1 < i < n. The essence of the
SEC. 4.2
TABULAR PARSING METHODS
algorithm is the construction of a triangular p a r s e t a b l e we denote t~j for 1 ~ i ~ n and 1 < j < n -- i + 1. Each
T, tij
315
whose elements will have a value +
which is a subset of N. Nonterminal A will be in tij if and only if A =~ a~a;+~ • . . a~+j_~, that is, if A derives the j input symbols beginning at position i. As a special case, the input string w is in L ( G ) if and only if S is in t l,. Thus, to determine whether string w is in L ( G ) , we compute the parse table T for w and look to see if S is in entry t 2,. Then, if we want one (or all) parses of w, we can use the parse table to construct these parses. Algorithm 4.4 can be used for this purpose. We shall first give an algorithm to compute the parse table and then the algorithm to construct the parses from the table. ALGORITHM 4.3 C o c k e - Y o u n g e r - K a s a m i parsing algorithm. I n p u t . A Chomsky normal form C F G G = (N, X, P, S) with no e-production and an input string w = a l a 2 . . . a , in X +. Output.
The parse table T for w such that tij contains A if and only if
+
A ==~ a~a~+ ~ • • • ai+ j_ 1. Method.
(1) Set tit = [A I A ~
a~ is in P} for each i. After this step, if t~1 contains
+
A, then clearly A ==~ a t. (2) Assume that t~j, has been computed for all i, 1 < i < n, and all j ' , 1 < j ' < j . Set t~y = {A if or some k , 1 < k < j , A ~ B is in t~k, and C is in ti+k,j_k}.'l"
BC
is in P,
Since I < k < j, both k and j -- k are less than j. Thus both t~k and ti+k,j_ k are computed before t~j is computed. After this step, if t~j contains A, then +
A ~
BC ~
+
ai • • • ai+k_aC ~
a,. • • • a i + k _ i a i + k • • • ai+j_ 1.
(3) Repeat step (2) until tij is known for all 1 _~ i < n, and 1 < j n--i+l.
Example 4.8
Consider the C N F grammar G with productions tNote that we are not discussing in detail how this is to be done. Obviously, the computation involved can be done by computer. When we discuss the time complexity of Algorithm 4.3, we shall give details of this step that enable it to be done efficiently.
316
cam. 4
GENERAL PARSING METHODS
IASIb
S
> AA
A
> SA IASla
Let abaab be the input string. The parse table T that results from Algorithm 4.3 is shown in Fig. 4.8. F r o m step (1), tit = {A} since A ~ a is in P and at = a. In step (2) we add S to t32, since S ---~ A A is in P and A is in both t3t and t4t. Note that, in general, if the tt~'s are displayed as shown, we can
jt
A,S
A,S
A,S
S
A,S
A,S
A
S
A,S
1
A
S
A
A
S
i~
1
2
3
4
5
Fig. 4.8 Parse table T.
compute ttj, i > 1, by examining the nonterminals in the following pairs of entries" (ttl, t,+i.~-1), (t,2, t,+z,j-2), • • •, (t,.j-1, t,+j-l.,) Then, if B is in tie and C is in t~+k.j-k for some k such that 1 ~ k < j and B C is in P, we add A to ttj. That is, we move up the ith column and down the diagonal extending to the right of cell ttj simultaneously, observing the nonterminals in the pairs of cells as we go. Since S is in t~5, abaab is in L(G). [B A ~
THEOREM 4.6 If Algorithm 4.3 is applied to C N F g r a m m a r G and input string a~ .. • a,, +
then upon termination, A is in t~j if and only if A ==~ a s . . . a~+~_~. Proof. The proof is a straightforward induction on j and is left for the Exercises. The most difficult step occurs in the "if" portion, where one must +
observe that i f j > 1 and A ==~ a s . . . at+i_~, then there exist nonterminals +
B and C and integer k such that A ~
B C is in P, B ~
a~ . . . an+k_ 1, and
+
C ~
at+ k • • • a,+j_ l.
E]
Next, we show that Algorithm 4.3 can be executed on a random access computer in n 3 suitably defined elementary operations. For this purpose, we
SEC. 4.2
TABULAR PARSING METHODS
31 '7
shall assume that we have several integer variables available, one of which is n, the input length. An elementary operation, for the purposes of this discussion, is one of the following' (1) Setting a variable to a constant, to the value held by some variable, or to the sum or difference of the value of two variables or constants; (2) Testing if two variables are equal, (3) Examining and/or altering the value of t~j, if i and j are the current values of two integer variables or constants, or (4) Examining a,, the ith input symbol, if i is the value of some variable. We note that operation (3) is a finite operation if the grammar is known in advance. As the grammar becomes more complex, the amount of space necessary to store t,.j and the amount of time necessary to examine it both increase, in terms of reasonable steps of a more elementary nature. However, here we are interested only in the variation of time with input length. It is left to the reader to define some more elementary steps to replace (3) and find the functional variation of the computation time with the number of nonterminais and productions of the grammar. CONVENTION
We take the notation '~f(n) is 0(g(n))" to mean that there exists a constant k such that for all n ~ 1, f(n) ~ kg(n). Thus, when we say that Algorithm 4.3 operates in time 0(n3), we mean that there exists a constant k for which it never takes more than kn 3 elementary operations on a word of length n. THEOREM 4.7 Algorithm 4.3 requires 0(n 3) elementary operations of the type enumerated above to compute tij for all i and j.
Proof To compute tit for all i merely requires that we set i = 1 [operation(l)], then repeatedly set t~l to ~A]A --~ a~ is in P~} [operations (3) and (4)], test if i = n [operation (2)], and if not, increment i by I [operation (1)]. The total number of elementary operations performed is 0(n). Next, we must perform the following steps to compute t~j" (1) Set j -- 1. (2) Test ifj -- n. If not, increment j by 1 and perform line(j), a procedure to be defined below. (3) Repeat step (2) until j = n. Exclusive of operations required for line(j), this routine involves 2 n - 2 elementary operations. The total number of elementary operations required for Algorithm 4.3 is thus 0(n) plus ~ = 2 l(j), where l(j)is the number of elementary operations used in line(j). We shall show that l(j) is 0(n 2) and thus that the total number of operations is 0(n0.
318
GENERAL PARSING METHODS
CHAP. 4
The procedure line(j) computes all entries t~j such that 1 _< i < n j -I-- 1. It embodies the procedure outlined in Example 4.8 to compute tij. It is defined as follows (we assume that all t~j initially have value N)' (1) (2) (3) (4)
Let i = 1 a n d j ' = n - - j - k 1. L e t k = 1. Let k' -- i q- k and j " = j - k. Examine t;k and tk,i,,. Let t~j --- t~j u { A [ A
(5) (6) (7) (8)
- - , B C is in P, B in t,.k, and C in tk,j,,].
Increment k by 1. If k - - j , go to step (7). Otherwise, go to step (3). If i-----j', halt. Otherwise do step (8). Increment i by 1 and go to step (2).
We observe that the above routine consists of an inner loop, (3)-(6), and an outer loop, (2)-(8). The inner loop is executedj -- 1 times (for values of k from 1 to j - 1) each time it is entered. At the end, t,.j has the value defined in Algorithm 4.3. It consists of seven elementary operations, and so the inner loop uses 0(j) elementary operations each time it is entered. The outer loop is entered n -- j -Jr- 1 times and consists of 0(j) elementary operations each time it is entered. Since j ~ n, each computation of line(j) takes 0(n 2) operations. Since line(j) is computed n times, the total number of elementary operations needed to execute Algorithm 4.3 is thus 0(n3). [--] We shall now describe how to find a left parse from the parse table. The method is given by Algorithm 4.4. ALGORITHM 4.4 Left parse from parse table. Input. A Chomsky normal form C F G G - - ( N , E, P, S) in which the productions in P are numbered from 1 to p, an input string w -- a~a 2 . . . a,, and the parse table T for w constructed by Algorithm 4.3. Output. Method.
A left parse for w or the signal "error." We shall describe a recursive routine gen(i,j, A) to generate +
a left parse corresponding to the derivation A = , a i a ~ + 1 . . . ai+j_ 1. The routine gen(i, j, A) is defined as follows"
lm
(1) I f j -- 1 and the mth production in P is A --, a,., then emit the production number m. (2) If j > 1, k is the smallest integer, 1 < k < j, such that for some B in t;k and C in t~+k,j-k, A --* B C is a production in P, say the mth. (There
SEC. 4.2
TABULAR PARSING METHODS
31 9
may be several choices for A --~ B C here. We can arbitrarily choose the one with the smallest m.) Then emit the production number m and execute gen(i, k, B), followed by gen(i -t-- k, j - k, C). Algorithm 4.4, then, is to execute gen(1, n, S), provided that S is in t~,. If S is not in t,,, emit the message "error." We shall extend the notion of an elementary operation to include the writing of a production number associated with a production. We can then show the following result. THEOREM 4.8 If Algorithm 4.4 is executed with input string a I . . - a,, then it will terminate with some left parse for the input if one exists. The number of elementary steps taken by Algorithm 4.4 is 0(nZ). P r o o f An induction on the order in which gen is called shows that whenever gen(i,j, A) is called, then A is in t;j. It is thus straightforward to show that Algorithm 4.4 produces a left parse. To show that Algorithm 4.4 operates in time O(nZ), we prove by induction on j that for all j a call of gen(i, j, A) takes no more than e l i 2 steps for some constant c 1. The basis, j = 1, is trivial, since step (1) of Algorithm 4.4 applies and uses one elementary operation. For the induction, a call of gen(i,j, A) with j > 1 causes step (2) to be executed. The reader can verify that there is a constant c z such that step (2) takes no more than c z j elementary operations, exclusive of calls. If gen(i, k, B) and gen(i q - k , j k, C) are called, then by the inductive hypothesis, no more than elk z + c1( j -- k) z -? c2j steps are taken by gen(i, j, A). This expression reduces to c~(j z + 2k z -- 2kj) + czj. Since 1 < k < j and j~> 2, we know that 2k z - 2kj < 2 - 2j < --j. Thus, if we chose c, to be c 2 in the inductive hypothesis, we would have elk z q- c ~ ( j - k) 2 -t- czj < e l i z. Since we are free to make this choice of c~, we conclude the theorem. [--] t
Example 4.9 Let G be the grammar with the productions (1) S ~ ~ A A (2) S
~ AS
(3) S
>b
(4) A
~ SA
(5) A
~ AS
(6) A
,a
320
GENERAL PARSING METHODS
CHAP. 4
Let w = abaab be the input string. The parse table for w is given in Example 4.8. Since S is in T 15, w is in L ( G ) . To find a left parse for abaab we call routine gen(1, 5, S). We find A in t~l and in t24 and the production S ---~ A A in the set of productions. Thus we emit 1 (the production number for S---~ A A ) and then call gen(1, 1, A) and gen(2, 4, A). gen(1, 1, A) gives the production number 6. Since S is in t21 and A is in t33 and A --~ S A is the fourth production, gen(2, 4, A) emits 4 and calls gen(2, 1, S) followed by gen(3, 3, A). Continuing in this fashion we obtain the left parse 164356263. Note that G is ambiguous; in fact, abaab has more than one left parse. It is not in general possible to obtain all parses of the input from a parse table in less than exponential time, as there may be an exponential number of left parses for the input. D We should mention that Algorithm 4.4 can be made to run faster if, when we construct the parse table and add a new entry, we place pointers to those entries which cause the new entry to appear (see Exercise 4.2.21). 4.2.2.
The Parsing Method of Earley
In this section we shall present a parsing method which will parse an input string according to an arbitrary C F G using time 0(n 3) and space 0(n2), where n is the length of the input string. Moreover, if the C F G is unambiguous, the time variation is quadratic, and on most grammars for programming languages the algorithm can be modified so that both the time and space variations are linear with respect to input length (Exercise 4.2.18). We shall first give the basic algorithm informally and later show that the computation can be organized in such a manner that the time bounds stated above can be obtained. The central idea of the algorithm is the following. Let G = (N, Z, P, S) be a C F G and let w = a l a 2 . . . a n be an input string in Z*. An object of the form [A--~ X i X 2 . . . X k • Xk+l... Xm, i] is called an item for w if A --~ X'~ . . - X'~ is a production in P and 0 ~ i ~ n. The dot between X k and Xk+~ is a metasymbol not in N or Z. The integer k can be any number including 0 (in which case the • is the first symbol) or m (in which case it is the last).t For each integer j, 0 ~ j ~ n, we shall construct a list of items Ij such that [A --~ t~ • fl, i] is in Ij for 0 ~ i ~ j if and only if for some }, and ~, we have S ~ ~.4~, ~ ~ a~ ponent of the item and the portion of the input the item merely assure
. . . a,, and tx ~ a,.l " " aj. Thus the second comthe number of the list on which it appears bracket derived from the string ~. The other conditions on us of the possibility that the production A --~ t~fl
tIf the production is A ~ e, then the item is [A ~ . ,
i].
SEC. 4.2
TABULAR PARSING METHODS
321
could be used in the way indicated in some input sequence that is consistent with w up to position j. The sequence of lists Io, I1,. • •, I, will be called the parse lists for the input string w. We note that w is in L(G) if and only if there is some item of the form [S ~ ~ . , 0] in I,. We shall now describe an algorithm which, given any g r a m m a r , will generate the parse lists for any input string. ALGORITHM 4.5 Earley's parsing algorithm.
Input. C F G G = (N, E, P, S) and an input string w -- ala 2 • .. a, in Z*. Output. The parse lists Io, I~ . . . . .
I,.
Method. First, we construct Io as follows" (1) If S --~ a is a production in P, add [S ~ • 0~, 0] to Io. N o w perform steps (2) and (3) until no new items can be added to Io. (2) If [B ~ y-, 0] is on Io,t add [A --~ ~B • p, 01 for all [A --- a • Bp, 0] on I o. (3) Suppose that [A ~ ~ . Bfl, 0] is an item in I o. A d d to Io, for all productions in P of the form B ~ y, the item [B ~ • y, 0] (provided this item is not already in Io). We now construct Ij, having constructed I0, I 1 , . . . , Ij_ t(4) For each [B --~ 0~ • aft, i] in Ij_~ such that a -- aj, add [B --. 0~a • fl, i] to I v. N o w perform steps (5) and (6) until no new items can be added. (5) Let [A --~ ~,., i] be an item in Ij. Examine It for items of the form [B ~ 0~ • Ap, k]. F o r each one found, we add [B ~ ~A • fl, k] to Ij. (6) Let [A ~ 0~- Bfl, i] be an item in Ij. For all B ~ 7 in P, we add [B ~ • y, j] to Ij. N o t e that consideration of an item with a terminal to the right of the dot yields no new items in steps (2), (3), (5) and (6). The algorithm, then, is to construct Ij for 0 ~ j _~ n.
[~]
Example 4.10 Let us consider the g r a m m a r G with the productions (1) E
>T + E
(2) E--
>T
(3) T - - - > F , T (4) T
>F
(5) F ~ > (E) (6) F - - - ~ a tNote that ? can be e. This is the way rule (2) becomes applicable initially.
322
GENERAL PARSING METHODS
CHAP. 4
and let (a q - a ) , a be the input string. From step (1) we add new items [E----,. T + E, 0] and [ E - - , . T, 0] to I 0. These items are considered by adding to I0 the items [T ~ • F • T, 0] and [T --, • F, 0] from rule (3). Continuing, we then add [F ~ • (E), 0] and [F--~ • a, 0]. No more items can be added to I 0. We now construct I~. By rule (4) we add I F - - , (. E), 0], since aa -----(. Then rule (6) causes [ E ~ . T + E , 1], [ E ~ . T , 1], [ T - - ~ . F , T , 1], [T--, • F, 1], [F ~ • (E), 1], and [F----, • a, 1] to be added. Now, no more items can be added to I~. To construct 12, we note that a2 = a and that by rule (4) [F----, a . , 1] is to be added to Iz. Then by rule (5), we consider this item by going to I~ and looking for items with F immediately to the right of the dot. We find two, and add [ T - - , F . • T, 1] and [T--~ F . , 1] to 12. Considering the first of these yields nothing, but the second causes us to again examine I~, this time for items with. T in them. Two more items are added to/2, [E ~ T . q- E, 1] and [E--~ T . , 1]. Again the first yields nothing, but the second causes [F ~ (E .), 0] to be added to 12. Now no more items can be added to 12, so I2 is complete. The values of all the lists are given in Fig. 4.9.
[E----> [E~. [T---~ [T--~ [F--~ [F ~
Io • T q - E, 0] T, 0] • F . T, 0] • F, 0] • (E), 01 ° a, 0]
11 I F - - ~ (. E ) , 0 ] [E~. T - t - E , 1] [ E - - ~ • T, 1]
[ T - - ~ . F . T , 1] [T----~ • F, 11 [F ~ • (E), 1] [F ---~ . a , l ]
13 [E ~ T q- • E, 11 [E------~ • T-t- E, 31 [E ~ • T, 31 [T---> • F , T, 3]
[F ~ [T----~ [T ~ [E ~
[T --~ • F, 3]
[E ~ T., 3]
[F---~ • (E), 31 [F ~ • a, 31
[E----* T + E . , 1] [F ~ ( E . ) , 01
16
[T-.F..
Iz [F---. a . , 1] [T~F..T, 1] [ T - - ~ F . , 1] [ E - - ~ T . W E , l] [E ~ T . , 1] [ F - o (E .), 0]
h a., F. F., T.
3] • T, 3] 31 + E, 31
[F ~ [T----~ [T ~ [E ~ [E --~
ls ( E ) . , 0] F . , T, 0] F . , 0] T . + E, 01 T . , 0]
I7 [F---~ a . , 6]
T, 0]
[T---, • F , T, 6]
IT---, F . • T, 61
[T----~ • F, 6] [F ~ • (E), 6] [F ~ • a, 6]
[T----~ F . , 61 [T ~ F , T . , 01 [E ----* T . q- E, 01 [E--~ T.,0] Fig. 4.9
Parse lists for Example 4.10.
Since [E --~ T . , 0] is on the last list, the input is in L ( G ) .
TABULAR PARSING METHODS
SEC. 4.2
323
We shall pursue the following course in analyzing Earley's algorithm. First, we shall show that the informal interpretation of the items mentioned before is correct. Then, we shall show that with a reasonable concept of an elementary operation, if G is unambiguous, then the time of execution is quadratic in input length. Finally, we shall show how to construct a parse from the lists and, in fact, how this may be done in quadratic time. THEOREM 4.9 If parse lists are constructed as in Algorithm 4.5, then [A ---~ t~. fl, i] is on Ij if and only if ~ =~ a~+l "'" aj and, moreover, there are strings y and ,6 such that S => ~,AJ and 7 =~ al " " a~. Proof O n l y if: We shall do this part by induction on the n u m b e r of items which
have been added to I0, 11, . . . . 1i before [A ~ t~- fl, i] is added to Ij. For the basis, which we take to be all of I0, we observe that anything added to I 0 has a ~ e, so S =-~ 7AJ holds with ? = e. For the inductive step, assume that we have constructed I0 and that the hypothesis holds for all items presently on/~, i_~j. Suppose that [A ~ ~ . fl, i] is added to Ij because of rule (4). Then a = oc'aj and [A ---~ t x ' - a j f l , i] is on Ij_ 1. By the inductive hypothesis, o~' ~
a~+l • • " a j_ 1 and there exist
strings ~,' and 6' such that S =-~ ? ' A J ' and y ' = > a 1 • • • a r It then follows that = t~'aj ==~ at+ 1 "'" aj, and the inductive hypothesis is satisfied with ~, = ~,' and ~ = 6'. Next, suppose that [,4 ---. a . fl, i] is added by rule (5). Then 0~ = ogB for some B in N, and for some k , [,4 ~ o~' • B f l , i] is on I k. Also, [B --, 1/., k] . is on Ij for some I/in (N u E)*. By the inductive hypothesis, r / ~ > ak+ 1 "" • aj and 0~'==~ a ~ + l . . . a k .
Thus, tx = o g B = > a ~ + l . . . a
j. Also by hypothesis,
there exist ~,' and 6' such that S ~ ?'A6' and ~,'=> a 1 .-- ai. Again, the rest of the inductive hypothesis is satisfied with y = ~,' and J = 6'. The remaining case, in which [A ---. ct. fl, i] is added by rule (6), has 0c = e and i = j. Its elementary verification is left to the reader, and we conclude the "only if" portion. I f: The "if" portion is the p r o o f of the statement (4.2.1)
If S =~ ~,AO, ~, =~ a 1 . . . a~, A ---, ctfl is in P, and tz = ~ a~+l " ' " aj, then [A --. ~ • fl, i] is on list Ij
We must prove all possible instances of (4.2.1). Any instance can be characterized by specifying the strings 0~, fl, y, and ~, the nonterminal A, and the integers i and j, since S and al . . . a , are fixed. We shall denote such
324
GENERAL PARSING METHODS
CHAP. 4
an instance by [~, fl, 7, $, A , i, j]. The conclusion to be d r a w n f r o m the above instance is that [A ----~ at • fl, i] is o n list I 1. N o t e that 7 a n d J do not figure explicitly into the conclusion. The p r o o f will turn on ranking the various instances a n d proving the result by induction on the rank. The r a n k of the instance ~ = [at, fl, ~, $, A, i,j] is c o m p u t e d as follows" Let %(~) be the length o f a shortest derivation S ==~ ?AJ. Let ~2(~) be the length o f a shortest derivation 7 ==~ a I . . . a~. . Let %(t0 be the length o f a shortest derivation at ==~ at+~ . . . a v. The r a n k o f t~ is defined to be zl(t0 q- 2[.] -+- z2(tt) -q- z~(~)]. W e n o w prove (4.2.1) by induction on the r a n k o f an instance g = [at, fl, 7, t~, A , i,j]. If the r a n k is 0, then ~r~(g) = zz(g) = %(g) = j = 0. W e can conclude that at = 7 = $ = e a n d that A = S. T h e n we need to show that [S ~ • ,0, 0] is on list 0. However, this follows immediately f r o m the first rule for that list, as S ---~ fl must be in P. F o r the inductive step, let ~, as above, be an instance o f (4.2.1) of some r a n k r > 0, a n d assume that (4.2.1) is true for instances o f smaller rank. Three cases arise, depending on whether at ends in a terminal, ends in a nonterminal, or is e. Case 1: ot = at'a for some a in E. Since ~ ==~ at+ 1 . . . a~, we conclude that a = a v. Consider the instance ~' = [at', avfl , ~,, $, A, i , j -- 1]. Since A ~ at'avfl is in P, tt' is an instance o f (4.2.1), a n d its r a n k is easily seen to be r - 2. W e m a y conclude that [A --~ at'. a~fl, i] is on list I v_ 1. By rule (4), [A ----~ at • fl, i] will be placed on list I v. Case 2: at = at'B for some B in N. There is some k, i < k _< j, such that at'=,, a~+ 1 . . . a k a n d B ==~ ak+ 1 . - . ay. F r o m the instance o f lower r a n k ~' = [at', B fl, 7, $, A , i, k] we conclude that [A ~ at'. B fl, i] is on list Ik. Let
B ==~ t / b e the first step in a m i n i m u m length derivation B ==~ ak+ 1 • • • aj. Consider the instance a " = [t/, e, 7at', fl$, B, k , j ] . Since S ==~ 7A$ ==~ 7at'Bfl$, we conclude that zl(a") _< %(a) -Jr 1. Let n 1 be the m i n i m u m num. ber o f steps in a derivation at' ==~ at+ 1 .. • a k a n d n 2 be the m i n i m u m n u m b e r in a derivation B =-~ ak+l " ' " aj. T h e n %(~) = n I -4- n2. Since B ==~ t/=-~ ak+ 1 . . " a v, we conclude that z3(a") = n2 -- 1. It is straightforward to see that % ( t l " ) = % ( ~ ) -b n~. Hence %(~") q-- zs(a") = %(a) -k nl + nz -- 1 = ~:2(a) -Jr- z3(a) -- 1. Thus ~:l(a") + 2[j -q- z2(a") .jr_%(a")] is less than r. By the inductive hypothesis for a " we conclude that [B --~ 1//., k] is on list Iv, a n d with [A ~ at'- B f l , i] on list Ik, conclude by rule (2) or (5) that [A ~ at. fl, i] is on list I v. Case 3: at = e. We m a y conclude that i = j a n d z3(~) = 0 in this case.
TABULAR PARSING METHODS
SEC. 4.2
325
Since r > 0, we may conclude 'that the derivation S =~ yA,~ is of length greater than 0. If it were of length 0, then z l(a) -- 0. Then we would have 7 ----- e, so x2(a) = i -- 0. Since i = j and z3(a) = 0 have been shown in general for this case, we would have r -- 0. We can thus find some B in N and 7', 7", 6', and ~" in (N u X)* such that S =~ y'B6'=:> y'7"A,~"6', where B ---, 7 " A 6 " is in P, 7 = Y'7", ~ = ~"6', , and 7'B6' is the penultimate step in some shortest derivation S = ~ 7A6. Consider the instance a ' - - [ 7 " , A 6 " , 7', 6', B, k, j], where k is an integer such that 7 ' = ~ al . " a k and 7 " = ~ a k + l . . . a j . Let the smallest length of the latter derivations be nl and n2, respectively. Then Zz(a') -- n 1, "r3(a') -- n 2, and ~2(a) -- n~ ÷ n 2. We have already argued that z3(a) -- 0, and B, 7', and 6' were selected so that zl(a') -- ~ ( a ) -- I. It follows that the rank of a' is r -- 1. We may conclude that [B --, 7" • Ad;", k] is on list 1i. By rule (6), or rule (3) for list I0, we place [A --, • fl, j] on list 1i. D Note that as a special case of T h e o r e m 4.9, [S ~
a . , 0] is on list I, if
and only if S --, a is in P and a =~ a I - . . a,; i.e., al " " a, is in L(G) if and only if [S --, a . , 0] is on list I, for some ~. We shall now examine the running time of Algorithm 4.5. We leave it to the reader to show that in general 0(n0 suitably defined elementary steps are sufficient to parse any word of length n according to a k n o w n grammar. We shall concentrate on showing that if the g r a m m a r is unambiguous, 0(n 2) steps are sufficient. LEMMA 4.6 Let G = (N, X, P, S) be an unambiguous g r a m m a r and a 1 . . - a, a string in X*. Then when executing Algorithm 4.5, we attempt to add an item [A ~ tz • fl, i] to list Ij at most once if ct ~ e.
P r o o f This item can be added only in steps (2), (4), or (5). If added in step (4), the last symbol of ~ is a terminal, and if in steps (2) or (5), the last symbol is a nonterminal. In the first case, the result is obvious. In the second case, suppose that [A ---~ tx'B. fl, i] is added to list Ij when two distinct items, [ B - - , 7 ", k] and [B----~ 6 . , l], are considered. Then it must be that [A ~ tz' • Bfl, i] is on both list k and list 1. (The case k = l is not ruled out, however.) Suppose that k ~ 1. By Theorem 4.9, there exist 01, 02, 03, and 04 such that S =-> 01A02 => 01~'Bp02 ~ a l
=> a ~ . " a j l l O , .
" " ajll02 and S => 0 3 A O 4 ~
But in the first derivation, 0 ~ ' ==> a l " ' ' a k ,
03o(BflO 4
and in the
second, 03~' =-~ a l " " a v Then there are two distinct derivation trees for some al " " an, with ~'B deriving ai+l " " aj in two different ways.
326
GENERAL PARSING METHODS
CHAP. 4
Now, suppose that k = l. Then it must be that 7 ~ ~. It is again easy to find two distinct derivation trees for a 1 . . . a,. The details are left for the Exercises. E] We now examine the steps of Algorithm 4.5. We shall leave the definition of "elementary operation" for this algorithm to the reader. The crucial step in showing that Algorithm 4.5 is of quadratic time complexity is not how "elementary operation" is defined--any reasonable set of list-processing primitives will do. The crucial step in the argument concerns "bookkeeping" for the costs involved. We here assume that the g r a m m a r G is fixed, so that any processes concerning its symbols can be considered elementary. As in the previous section, the matter of time variation with the "size" of the grammar is left for the Exercises. For I0, step (1) clearly can be done in a fixed number of elementary operations. Step (3) for I0 and step (6) for the general case can be done in a finite number of elementary operations each time an item is considered, provided we keep track of those items [A ~ ~ . fl, j] which have been added to Ij. Since g r a m m a r G is fixed, this information can be kept in a finite table for each j. If this is done, it is not necessary to scan the entire list I v to see if items are already on the list. For steps (2), (4), and (5), addition of items to 1i is facilitated if we can scan some list I~ such that i < j for all those items having a desired symbol to the right of the dot, the desired symbol being a terminal in step (4) and a nonterminal in steps (2) and (5). Thus we need two links from every item on a list. The first points to the next item on the list. This link allows us to consider each item in turn. The second points to the next item with the same symbol to the right of the dot. It is this link which allows us to scan a list efficiently in steps (2), (4), and (5). The general strategy will be to consider each item on a list once to add new items. However, immediately upon adding an item of the form [A ~ oc. Bfl, i] to I v, we consult the finite table for I v to determine if [B ~ 7 ", J] is on I v for any 7. If so, we also add [A ~ ~ B . fl, i] to I v. We observe that there are a fixed number of strings, say k, that can appear as the first half of an item. Thus at most k ( j ÷ !) items appear on I v. If we can show that Algorithm 4.5 spends a fixed amount of time, say c, for each item on a list, we shall show that the amount of time taken is 0(n2), since
c ~ k( j + 1) = ½ck(n + 1 ) ( n + 2 ) ~ c ' n j=0
2
for some constant c'
The "bookkeeping trick" is as follows. We charge time to an item, under certain circumstances, both when it is considered and when it is entered onto a list. The m a x i m u m amount of time charged in either case is fixed. We also charge a fixed amount of time to the list itself.
SEC. 4.2
TABULAR PARSING METHODS
327
We leave it to the reader to show that I0 can be constructed in a fixed amount of time. We shall consider the items on lists Ij, for j > 0. In step (4) of Algorithm 4.5 for Ij, we examine aj and the previous list. For each entry on Ij_ 1 with a~ to the right of the dot, we add an item to Ij. As we can examine only those items on Ij_ 1 satisfying that condition, we need charge only a finite amount of time to each item added, and a finite amount of time to 1~ for examining a~ and for finding the first item of Ij_ ~ with • aj in it. Now, we consider each item on Ij and charge time to it in order to see if step (5) or step (6) applies. We can accomplish step (6) in a finite amount of time, as we need examine only the table associated with Ij that tells whether all [A ~ • ~, j] have been added for the relevant A. This table can be examined in fixed time, and if necessary, a fixed number of items are added to Ij. This time is all charged to the item considered. If step (5) applies, we must scan some list I~, k < j , for all items having • B in them for some particular B. Each time one is found, an item is added to list lj, and the time is charged to the item added, not the one being considered ! To show that the amount of time charged to any item on any list is bounded above by some finite number, we need observe only that by Lemma 4.6, if the grammar is unambiguous, only one attempt will ever be made to add an item to a list. This observation also ensures that in step (5) we do not have to spend time checking to see if an item already appears on a list. THEOREM 4.10 If the underlying grammar is unambiguous, then Algorithm 4.5 can be executed in 0(n 2) reasonably defined elementary operations when the input is of length n.
Proof. A formalization of the above argument and the notion of an elementary operation is left for the Exercises. D THEOREM 4.11 In all cases, Algorithm 4.5 can be executed in 0(n 3) reasonably defined elementary operations when the input is of length n.
Proof. Exercise. Our last portion of the analysis of Earley's algorithm concerns the method of constructing a parse from the completed lists. For this purpose we give Algorithm 4.6, which generates a right parse from the parse lists. We choose to produce a right parse because the algorithm is slightly simpler. A left parse can also be found with a simple alteration in the algorithm. Also for the sake of simplicity, we shall assume that the grammar at hand has no cycles. If a cycle does exist in the grammar, then it is possible to have
328
GENERAL PARSING METHODS
CHAP. 4
arbitrarily many parses for some input strings. However, Algorithm 4.6 can be modified to accommodate grammars with cycles (Exercise 4.2.23). It should be pointed out that as for Algorithm 4.4, we can make Algorithm 4.6 simpler by placing pointers with each item added to a list in Algorithm 4.5. Those pointers give the one or two items which lead to its placement on its list. ALGORITHM 4.6 Construction of a right parse from the parse lists. Input. A cycle-free C F G G = (N, X, P, S) with the productions in P numbered from 1 to p, an input string w = a 1 . . . a n, and the parse lists I0, I a , . . . , I, for w. Output. 7r, a right parse for w, or an "error" message. Method. If no item of the form [S---~ 0c., 0] is on In, then w is not in L(G), so emit "error" and halt. Otherwise, initialize the parse 7r to e and execute the routine R([S----~ ~z., 0], n) where the routine R is defined as follows: Routine R([A ---+ f l . , i], j ) :
(1) Let n be h followed by the previous value of ~z, where h is the number of production A ---~ ft. (We assume that 7r is a global variable.) (2) If fl = X i X z . . . Xm, set k = m and 1 = j. (3) (a) If Xk ~ X, subtract 1 from both k a n d / . (b) If X k ~ N, find an item [Xk ~ 7' ", r] in I~ for some r such that [A ~ X ~ X z . . . Xk_~ . X k . . . Xm, i] is in I,. Then execute R([Xk ---' 7' ", r], l). Subtract 1 from k and set l = r. (4) Repeat step (3) until k = 0. Halt. E] Algorithm 4.6 works by tracing out a rightmost derivation of the input string using the parse lists to determine the productions to use. The routine R called with arguments [A --~ f l . , i] and j appends to the left end of the current partial parse the number corresponding to the production A ~ ft. If fl = voBlvlB2v 2 . . . B,v,, where B 1 , . . . , Bs are all the nonterminals in fl, then the routine R determines the first production used to expand each Bt, say B,---~ fl,, and the position in the input string w immediately before the first terminal symbol derived from B,. The following recursive calls of R are then made in the order shown:
R([B, --,/~,.,i,],j,) R([B,_, ~ /L-~ ", ~,-~],L-,) °
SEC. 4.2
T A B U L A R P A R S I N G METHODS
329
where (1) j, = j Iv, I and (2) jq -- iq+x --[vq [for 1 ~ q < s. Example 4.11
Let us apply Algorithm 4.6 to the parse lists of Example 4.10 in order to produce a right parse for the input string (a + a) • a. Initially, we can execute R([E----~ T . , 0], 7). In step (1), n gets value 2, the number associated with production E ~ T. We then set k = 1 and 1 = 7 and execute step (3b) of Algorithm 4.6. We find [T ----~ F • T . , 0] on 17 and [E ~ • T, 0] on Io. Thus we execute R([T ~ F , T . , 0], 7), which results in production number 3 being appended to the left of n. Thus, n = 32. Following this call of R, in step (2) we set k = 3 and l = 7. Step (3b) is then executed with k = 3. We find [T---, F . , 6] on I0 and IT ~ F , . T, 0] on /6, so we call R([T----, F . , 6], 7). After completion of this call, we set k = 2 and l = 6. In Step (3a) we consider • and set k = 1 and 1 = 5. We then find [ F - - , ( E ) . , 0] on ls and [T---,. FF, T, 0] on I0, so we call R([F---~ ( E ) . , 0], 5). Continuing in this fashion we obtain the right parse 64642156432. The calls of routine R are shown in Fig. 4.10 superimposed on the derivation tree for (a + a) • a. D
E R([E~ T-,01,7)
!
i
F (
R([F~(E)-,0I, 5)
T R([T~ F., 61,7)
)
i R([T~F-,ll,2)
R([T~F,T.,01,7)
T/
R([,,,,,~,~E~ T+E., 1],4) E R([E~T.,3],4)
F
I
a
R([F~a.,11,2)
I
F
a
+
T
R([T~F.,31,4)
F R([Foa., 31,4) a
Fig. 4.10 Diagram of execution of Algorithm 4.6.
R([F~a -,61,7)
330
GENERAL PARSING METHODS
CHAP. 4
THEOREM 4.12 A l g o r i t h m 4.6 correctly finds a right parse of a 1 . . . a, if one exists, a n d can be m a d e to operate in time 0(n2). P r o o f A straightforward induction on the order of the calls o f routine R shows that a right parse is produced. We leave this portion of the p r o o f for the Exercises. In a m a n n e r analogous to T h e o r e m 4.10, we can show that a call of R([A --~ f t . , i], j ) takes time 0 ( ( j - 0 2) if we can show that step (3b) takes 0 ( j - i) elementary operations. To do so, we must preprocess the lists in such a way that the time taken to examine all the finite n u m b e r o f items on I k whose second c o m p o n e n t is l requires a fixed c o m p u t a t i o n time. That is, for each parse list, we must link the items with a c o m m o n second c o m p o n e n t a n d establish a header pointing to the first entry on that list. This preprocessing can be done in time 0(n 2) in an obvious way. In step (3b), then, we examine the items on list Iz with second c o m p o n e n t r = l, l - 1 , . . . , i until a desired item of the f o r m [X k --, 7 ", r] is found. The verification that we have the desired item takes fixed time, since all items with second c o m p o n e n t i on I, can be f o u n d in finite time. The total a m o u n t of time spent in step (3b) is thus p r o p o r t i o n a l to j i. [--]
EXERCISES
4.2.1.
Let G be defined by S - - , AS! b, A ---, S A l a . Construct the parse tables by Algorithm 4.3 for the following words: (a) bbaab. (b) ababab. (c) aabba.
4.2.2.
Use Algorithm 4.4 to obtain left parses for those words of Exercise 4.2.1 which are in L(G).
4.2.3.
Construct parse lists for the grammar G of Exercise 4.2.1 and the words of that exercise using Algorithm 4.5.
4.2.4.
Use Algorithm 4.6 to construct right parses for those words of Exercise 4.2.1 which are in L(G).
4.2.5.
Let G be given by S - - ~ SS[ a. Use Algorithm 4.5 to construct a few of the parse lists Io, 11 . . . . when the input is aa . . . . How many elementary operations are needed before I; is computed ?
4.2.6.
Prove Theorem 4.6.
4.2.7.
Prove that Algorithm 4.4 correctly produces a left parse.
4.2.8.
Complete the "only if" portion of Theorem 4.9.
4.2.9.
Show that Earley's algorithm operates in time 0(n 3) on any grammar.
EXERCISES
331
4.2.10.
Complete the proof of Lemma 4.6.
4.2.11.
Give a reasonable set of elementary operations for Theorems 4.10-4.12.
4.2.12.
Prove Theorem 4.10.
4.2.13.
Prove Theorem 4.11.
4.2.14.
Show that Algorithm 4.6 correctly produces a right parse.
4.2.15.
Modify Algorithm 4.3 to work on non-CNF grammars. H i n t : Each t~j must hold not only those nonterminals A which derive a t . . . a ~ + j _ , but also certain substrings of right sides of productions which derive ai
• • • ai+j-1.
"4.2.16.
Show that if the underlying grammar is linear, then a modification of Algorithm 4.3 can be made to work in time 0(n2).
"4.2.17.
We can modify Algorithm 4.3 to use "lookahead strings" of length k _~ 0. Given a grammar G and an input string w = a l a 2 . . . a , , we create a parse table T such that t~j contains A if and only if (1) S ==~ o~Ax, +
(2) A ==~ ai . . . at+j-l, and (3) at+jai+j+l . . . ai+~+k-1 -- FIRSTk(X). Thus, A would be placed in entry t~j provided the k input symbols to the right of input symbol a~+j_a can legitimately appear after A in a sentential form. Algorithm 4.3 uses lookahead strings of length 0. Modify Algorithm 4.3 to use lookahead strings of length k ~> 1. What is the time complexity of such an algorithm ? "4.2.18.
We can also modify Earley's algorithm to use lookahead. Here we would use items of the form [A ---, ~ • fl, i, u] where u is a lookahead string of length k. We would not enter this item on list Ij unless there is a derivation S ==~ ? A u v , where ~' ==~ al -.- at, • ==~ a~+~ ..- aj, and FIRSTk(flu) contains a~+l " ' ' a j + k . Complete the details of modifying Earley's algorithm to incorporate lookahead and then examine the time complexity of the algorithm.
4.2.19.
Modify Algorithm 4.4 to produce a right parse.
4.2.20.
Modify Algorithm 4.6 to produce a left parse.
4.2.21.
Show that it is possible to modify Algorithm 4.4 to produce a parse in linear time if, in constructing the parse table, we add pointers with each A in ttj to the B in t~k and C in tt+k.j-k that caused A to be placed in t~j in step (2) of Algorithm 4.3.
4.2.22.
Show that if Algorithm 4.5 is modified to include pointers from an item to the other items which caused it to be placed on a list, then a right (or left)parse can be obtained from the parse lists in linear time.
4.2.23.
Modify Algorithm 4.6 to work on arbitrary CFG's (including those with cycles). H i n t : Include the pointers in the parse lists as in Exercise 4.2.22.
332
CHAP. 4
GENERAL PARSING METHODS
4.2.24.
What is the maximum number of items that can appear in a list Ij in Algorithm 4.5 ?
*4.2.25.
A grammar G is said to be of finite ambiguity if there is a constant k such that if w is in L(G), then w has no more than k distinct left parses. Show that Earley's algorithm takes time 0(n 2) on all grammars of finite ambiguity.
Open Problems There is little known about the actual time necessary to parse an arbitrary context-free grammar. In fact, no good upper bounds are known for the time it takes to recognize sentences in L(G) for arbitrary CFG G, let alone parse it. We therefore propose the following open problems and research areas. 4.2.26.
Does there exist an upper bound lower than 0(n 3) on the time needed to recognize an arbitrary CFL on some reasonable model of a random access computer or a multitape Turing machine ?
4.2.27.
Does there exist an upper bound better than 0(n 2) on the time needed to recognize unambiguous CFL's ?
Research Problems 4.2.28.
Find a CFL which cannot be recognized in time f(n) on a random access computer or Turing machine (the latter would be easier), where f(n) grows faster than n; i.e., lim,_.~ (n/f (n)) = 0. Can you find a CFL which appears to take more than 0(n 2) time for recognition, even if you cannot prove this to be so ?
4.2.29.
Find large classes of CFG's which can be parsed in linear time by Eadey's algorithm. Find large classes of ambiguous CFG's which can be parsed in time 0(n 2) by Earley's algorithm. It should be mentioned that all the deterministic CFL's have grammars in the former class.
Programming Exercises 4.2.30.
Use Earley's algorithm to construct a parser for one of the grammars in the Appendix.
4.2.31.
Construct a program that takes as input any CFG G and produces as output a parser for G that uses Earley's algorithm. BIBLIOGRAPHIC
NOTES
Algorithm 4.3 has been discovered independently by a number of people. Hays [1967] reports a version of it, which he attributes to J. Cocke. Younger [1967] uses Algorithm 4.3 to show that the time complexity of the membership problem for context-free languages is 0(n3). Kasami [1965] also gives a similar algorithm. Algorithm 4.5 is found in Earley's Ph.D. thesis [1968]. An 0(n 2) parsing algorithm for unambiguous CFG's is reported by Kasami and Torii [1969].
5
ONE-PASS N O BACKTRACK PARSING
In Chapter 4 we discussed backtrack techniques that could be used to simulate the nondeterministic left and right parsers for large classes of context-free grammars. However, we saw that in some cases such a simulation could be quite extravagant in terms of time. In this chapter we shall discuss classes of context-free grammars for which we can construct efficient parsers --parsers which make cln operations and use c2n space in processing an input of length n, where cl and c2 are small constants. We shall have to pay a price for this efficiency, as none of the classes of grammars for which we can construct these efficient parsers generate all the context-free languages. However, there is strong evidence that the restricted classes of grammars for which we can construct these efficient parsers are adequate to specify all the syntactic features of programming languages that are normally specified by context-free grammars. The parsing algorithms to be discussed are characterized by the facts that the input string is scanned once from left to right and that the parsing process is completely deterministic. In effect, we are merely restricting the class of CFG's so that we are always able to construct a deterministic left parser or a deterministic right parser for the grammar under consideration. The classes of grammars to be discussed in this chapter include (1) The LL(k) grammars--those for which the left parser can be made to work deterministically if it is allowed to look at k input symbols to the right of its current input position.t (2) The LR(k) grammars--those for which the right parser can be made tThis does not involve an extension of the definition of a DPDT. The k "lookahead symbols" are stored in the finite control. 333
334
ONE-PASSNO BACKTRACK PARSING
CHAP. 5
to work deterministically if it is allowed to look k input symbols beyond its current input position. (3) The precedence grammars--those for which the right parser can find the handle of a right-sentential form by looking only at certain relations between pairs of adjacent symbols of that sentential form. 5.1.
LL(k) GRAMMARS
In this section we shall present the largest "natural" class of left-parsable grammars--the LL(k) grammars. 5.1.1.
Definition of LL(k) Grammar
As an introduction, let G = (N, I£, P, S) be an unambiguous grammar and w = a t a 2 . . . a, a sentence in L ( G ) . Then there exists a unique sequence of left-sentential forms 0%, e l , . - . , em such that S = ~0, ei P'==> ~i+1 for 0 < i < m and e m = w. The left parse for w is PoP1 "'" Pro-l" NOW, suppose that we want to find this left parse by scanning w once from left to right. We might try to do this by constructing Co, e l , . . . , era, the sequence of left-sentential forms. If el = al "" " a i A f l , then at this point we could have read the first j input symbols and compared them with the first j symbols of 0~i. It would be desirable if ei+ i could be determined knowing only al . . . aj (the part of the input we have scanned to this point), the next few input symbols (aj+laj+2 . . . aj+~ for some fixed k), and the nonterminal A. If these three quantities uniquely determine which production is to be used to expand A, we can then precisely determine et+~ from e~ and the k input symbols aj+laj+ z . . . aj+ k. A grammar in which each leftmost derivation has this property is said to be an LL(k) grammar. We shall see that for each LL(k) grammar we can construct a deterministic left parser which operates in linear time. A few definitions are needed before we proceed. DEFINITION
Let e = x f l be a left-sentential form in some grammar G = (N, ~, P, S) such that x is in If* and fl either begins with a nonterminal or is e. We say that x is the closed p o r t i o n of e and that fl is the open p o r t i o n of e. The boundary between x and ,8 will be referred to as the border.
Example 5.1 Let ~ ---- a b a c A a B . The closed portion of ~ is abac; the open portion is A a B . If ~ ---- abc, then abc is its closed portion and e its open portion. Its border is at the right end.
sEc. 5.1
335
LL(k) GRAMMARS
The intuitive idea behind LL(k) grammars is that if we are constructing a leftmost derivation S =~ w and we have already constructed S =~ ~ :=~ ~2 lm
Im
lm
=* . . . =~ et~ such that ~ =* w, then we can construct ~+1, the next step of lm
lm
Ira
the derivation, by observing op,ly the closed portion of ~i and a "little more," the "little more" being the next k input symbols of w. (Note that the closed portion of ~t is a prefix of w.) It is important to observe that if we do not see all of w when ~,+1 is constructed, then we do not really know what terminal string is ultimately derived from S. Thus the LL(k)condition implies that ~;+i is substantially independent (except for the next k terminal symbols) of what is derived from the open portion of ~. Viewed in terms of a derivation tree, we can construct a derivation tree for a sentence w x y in an LL(k) grammar starting from the root and working top-down determfiaistieally. Specifically, if we have constructed the partial derivation tree with frontier wArE, then knowing w and the first k symbols of x y we would know which production to use to expand A. The outline of the complete tree is shown in Fig. 5.1. S
w
x
y
Fig. 5.1 Partial derivation tree for the sentence wxy. Recall that in Chapter 4 we defined for a C F G G -- (N, ~;, P, S) the function FIRST,(a), where k is an integer and 0c is in (N u Ig)*, to be [w in Z * [ e i t h e r [ w [ < k and ct=* w, o r [ w I-- k and ~ = * w x for some x]. G
G
We shall delete the subscript k and/or the superscript G from FIRST whenever no confusion will result. If ~ consists solely of terminals, then FIRSTk(~) is just {w], where w is the first k symbols of ~ if 10c[ ~ k, and w = ~ if l~] < k. We shall write
336
ONE-PASSNO BACKTRACK PARSING
CHAP. 5
FIRSTk(a ) = w, rather than {w}, in this case. It is straightforward to determine FIRST~(00 for particular g r a m m a r G. We shall defer an algorithm to Section 5.1.6. DEFINITION
Let G = (N, Z, P, S) be a CFG. We say that G is L L ( k ) , for some fixed integer k, if whenever there are two leftmost derivations (1) S => w A e = ~ w f l a =-~ w x and lra
Im
(2) S =-~ w A a =-~ w~,a =-~ wy Ira
Ira
such that FIRSTk(x ) = FIRSTk(y), it follows that fl = 7'. Stated less formally, G is LL(k) if given a string w A a in (N U Z)* and the first k terminal symbols (if they exist) to be derived from A a there is at most one production which can be applied to A to yield a derivation of any terminal string beginning with w followed by those k terminals. We say that a grammar is L L if it is LL(k) for some k. Example 5.2
Let G 1 be the grammar S ~ a A S ] b, A ---~ a l b S A . Intuitively, G1 is LL(1) because given C, the leftmost nonterminal in any left-sentential form, and c, the next input symbol, there is at most one production for C capable of deriving a terminal string beginning with c. Going to the definition of an LL(1) grammar, if S = ~ w S a ~ lra
w x and S ~
wfla ~
lm
lm
w S a = ~ wTa ~
lrn
lm
wy
lra
and x and y start with the same symbol, we must have fl = 7'. Specifically, if x and y start with a, then production S ---~ a A S was used and fl = 7' = a A S . S---~ b is not a possible alternative. Conversely, if x and y start with b, S ~ b must be the production used, and fl = 7' = b. Note that x = y = e is impossible, since S does not derive e in G~. A similar argument prevails when we consider two derivations S ~
wAoc ~
lm
lm
wile ~ lm
w x and S ~ tm
wAoc ~ lrn
wT'a =-> wy.
[~
Ira
The grammar in Example 5.2 is an example of what is known as a simple LL(!) grammar. DEFINITION
A context-flee grammar G = (N, E, P, S ) with no e-productions such that for all A ~ N each alternate for A begins with a distinct terminal symbol is called a s i m p l e LL(1) grammar. Thus in a simple LL(1) grammar, given a pair (A, a), where A ~ N and a ~ Z, there is at most one production of the form A - - , aa.
SEC. 5.1
337
LL(k) GRAMMARS
Example 5.3
Let us consider the more complicated case of the grammar G 2 defined by S ---~ e t a b A , A ~ S a a l b . We shall show that G2 is LL(2). To do this, we shall show that if wBtx is any left-sentential form of G2 and w x is a sentence in L(G), then there is at most one production B ~ f l in Gz such that
FIRST2(flt~ ) contains FIRST2(x ). Suppose that S ~ lm
and S ~
Im
lm
wflo~ ==~ w x lm
wy, where the first two symbols of x and y agree if
wSoc :=~ w~,o~ ~
lm
wSt~ ~
lm
they exist. Since G2 is a linear grammar, ~ must be in (a + b)*. In fact, we can say more. Either w = ~ = e, or the last production used in the derivation S ~
wSoc was A ~
Saa. (There is no other way for S to "appear"
lm
in a sentential form.) Thus either ~ = e or ~ begins with aa. Suppose that S ---~ e is used going from wSt~ to wilts. Then fl = e, and x is either e or begins with aa. Likewise, if S ~ e is used going from wSt~ to w},~, then ~ = e and y = e, or y begins with aa. If S ~ abA is used going from wSoc to wflo~, then fl = a b A , and x begins with ab. Likewise, if S ---~ a b A is used going from wStx to wToc, then ~, = a b A , and y begins with ab. There are thus no possibilities other than x = y = e, x and y begin with aa, or both begin with ab. Any other condition on the first two symbols of x and y implies that one or both derivations are impossible. In the first two cases, S ---~ e is used in both derivations, and fl = ? = e. In the third case, S ---~ a b A must be used, and fl = ~, = abA. It is left for the Exercises to prove that the situation in which A is the symbol to the right of the border of the sentential form in question does not yield a contradiction of the LL(2) condition. The reader should also verify that G~ is not LL(1). D Example 5.4
Let us consider the grammar G 3 = ({S, A, B}, {0, 1, a, b}, P3, S), where P3 consists of S --~ A IB, A --~ a A b l O, B ~ aBbb l l. L(G3) is the language (a'Ob"in ~ 0} u {a"lb2"! n ~ 0}. G 3 is not LL(k) for any k. Intuitively, if we begin scanning a string of a's which is arbitrarily long, we do not know whether the production S ---~ A or S ~ B was first used until a 0 or 1 is seen. Referring to the definition of LL(k) grammar, we may take w = a = e, fl = A, ~, = B, x = akOb k, and y = a klb 2k in the derivations 0
S ~
,
S ~ lm
A ~
0
S ~
*
S ~ lm
akOb k
lm
B ~
a k l b 2k
lm
to satisfy conditions (1) and (2) of the definition. Moreover, x and y agree in
338
ONE-PASSNO BACKTRACKPARSING
CHAP. 5
the first k symbols. However, the conclusion that fl -- ~, is false. Since k can be arbitrary here, we may conclude that G 3 is not an LL grammar. In fact, in Chapter 8 we shall show that L(G3) has no LL(k) grammar. F---] 5.1.2.
Predictive Parsing Algorithms
We shall show that we can parse LL(k) grammars very conveniently using what we call a k-predictive parsing algorithm. A k-predictive parsing algorithm for a CFG G = (N, X, P, S) uses an input tape, a pushdown list, and an output tape as shown in Fig. 5.2. The k-predictive parsing algorithm attempts to trace out a leftmost derivation of the string placed on its input tape.
X
I w ,'l /'/
Input tape
Input head
i]
F ,i Output tape
7 Fig. 5.2
Predictive parsing algorithm.
The input tape contains the input string to be parsed. The input tape is read by an input head capable of reading the next k input symbols (whence the k in k-predictive). The string scanned by the input head will be called the lookahead string. In Fig. 5.2 the substring u of the input string wux represents the lookahead string. The pushdown list contains a string X0c$, where X'a is a string of pushdown symbols and $ is a special symbol used as a bottom of the pushdown list marker. The symbol X is on top of the pushdown list. We shall use r' to represent the alphabet of pushdown list symbols (excluding $). The output tape contains a string 7z of production indices. We shall represent the configuration of a predictive parsing algorithm by a triple (x, X~, n), where (1) x represents the unused portion of the original input string. (2) Xa represents the string on the pushdown list (with X on top). (3) n is the string on the output tape. For example, the configuration in Fig. 5.2 is (ux, Xa$, ~r).
The action of a k-predictive parsing algorithm 𝒜 is dictated by a parsing table M, which is a mapping from (Γ ∪ {$}) × Σ*ᵏ to a set containing the following elements:
(1) (β, i), where β is in Γ* and i is a production number. Presumably, β will be either the right side of production i or a representation of it.
(2) pop.
(3) accept.
(4) error.
The parsing algorithm parses an input by making a sequence of moves, each move being very similar to a move of a pushdown transducer. In a move the lookahead string u and the symbol X on top of the pushdown list are determined. Then the entry M(X, u) in the parsing table is consulted to determine the actual move to be made. As would be expected, we shall describe the moves of the parsing algorithm in terms of a relation ⊢ on the set of configurations. Let u be FIRSTₖ(x). We write
(1) (x, Xα, π) ⊢ (x, βα, πi) if M(X, u) = (β, i). Here the top symbol X on the pushdown list is replaced by the string β ∈ Γ*, and the production number i is appended to the output. The input head is not moved.
(2) (x, aα, π) ⊢ (x′, α, π) if M(a, u) = pop and x = ax′. When the symbol on top of the pushdown list matches the current input symbol (the first symbol of the lookahead string), the pushdown list is popped and the input head is moved one symbol to the right.
(3) If the parsing algorithm reaches configuration (e, $, π), then parsing ceases, and the output string π is the parse of the original input string. We shall assume that M($, e) is always accept. Configuration (e, $, π) is called accepting.
(4) If the parsing algorithm reaches configuration (x, Xα, π) and M(X, u) = error, then parsing ceases, and an error is reported. The configuration (x, Xα, π) is called an error configuration.
If w ∈ Σ* is the string to be parsed, then the initial configuration of the parsing algorithm is (w, X₀$, e), where X₀ is a designated initial symbol. If (w, X₀$, e) ⊢* (e, $, π), we write 𝒜(w) = π and call π the output of 𝒜 for input w. If (w, X₀$, e) does not reach an accepting configuration, we say that 𝒜(w) is undefined. The translation defined by 𝒜, denoted τ(𝒜), is the set of pairs {(w, π) | 𝒜(w) = π}. We say that 𝒜 is a valid k-predictive parsing algorithm for a CFG G if
(1) L(G) = {w | 𝒜(w) is defined}, and
(2) if 𝒜(w) = π, then π is a left parse of w.
If a k-predictive parsing algorithm 𝒜 uses a parsing table M and 𝒜 is a valid parsing algorithm for a CFG G, we say that M is a valid parsing table for G.
Example 5.5
Let us construct a 1-predictive parsing algorithm 𝒜 for G₁, the simple LL(1) grammar of Example 5.2. First, let us number the productions of G₁ as follows:

(1) S → aAS
(2) S → b
(3) A → a
(4) A → bSA
A parsing table M for 𝒜 is shown in Fig. 5.3. Rows are labeled by the symbol on top of the pushdown list and columns by the lookahead string.

              a          b          e
    S       aAS, 1     b, 2       error
    A       a, 3       bSA, 4     error
    a       pop        error      error
    b       error      pop        error
    $       error      error      accept

Fig. 5.3  Parsing table for 𝒜.

Using this table, 𝒜 would parse the input string abbab as follows:
(abbab, S$, e) ⊢ (abbab, aAS$, 1)
               ⊢ (bbab, AS$, 1)
               ⊢ (bbab, bSAS$, 14)
               ⊢ (bab, SAS$, 14)
               ⊢ (bab, bAS$, 142)
               ⊢ (ab, AS$, 142)
               ⊢ (ab, aS$, 1423)
               ⊢ (b, S$, 1423)
               ⊢ (b, b$, 14232)
               ⊢ (e, $, 14232)
For the first move M(S, a) = (aAS, 1), so S on top of the pushdown list is replaced by aAS, and production number 1 is written on the output tape. For the next move M(a, a) = pop, so a is removed from the pushdown list and the input head is moved one position to the right. Continuing in this fashion, we obtain the accepting configuration (e, $, 14232). It should be clear that 14232 is a left parse of abbab, and, in fact, 𝒜 is a valid 1-predictive parsing algorithm for G₁. A k-predictive parsing algorithm for a CFG G can be simulated by a deterministic pushdown transducer with an endmarker on the input. Since a pushdown transducer can look at only one input symbol, the lookahead string should be stored as part of the state of the finite control. The rest of the simulation should be obvious.
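The move relation of a 1-predictive parser translates directly into a short table-driven loop. The following sketch is our own illustration (the table encoding and the function name are assumptions, not taken from the text); it encodes the table of Fig. 5.3 for G₁, and running it on abbab reproduces the left parse 14232.

    # A minimal 1-predictive parser.  The table is indexed by
    # (top-of-pushdown-list symbol, one-symbol lookahead); entries are a pair
    # (replacement string, production number), "pop", or "accept"; a missing
    # entry is an error.  Grammar symbols are single characters here.

    def predictive_parse(table, start_symbol, w):
        stack = ["$", start_symbol]          # "$" is the bottom marker; top is at the end
        output = []                          # the left parse being built
        i = 0                                # input position
        while True:
            X = stack[-1]
            u = w[i] if i < len(w) else ""   # lookahead ("" stands for e)
            entry = table.get((X, u))
            if entry == "accept":
                return output                # configuration (e, $, pi)
            if entry == "pop":
                stack.pop(); i += 1          # match terminal, advance input head
            elif isinstance(entry, tuple):
                beta, prod = entry
                stack.pop()
                stack.extend(reversed(beta)) # push beta so its first symbol is on top
                output.append(prod)
            else:
                raise SyntaxError("error at position %d" % i)

    # Parsing table of Fig. 5.3 for G1: S -> aAS | b, A -> a | bSA.
    M = {
        ("S", "a"): ("aAS", 1), ("S", "b"): ("b", 2),
        ("A", "a"): ("a", 3),   ("A", "b"): ("bSA", 4),
        ("a", "a"): "pop",      ("b", "b"): "pop",
        ("$", ""):  "accept",
    }

    print(predictive_parse(M, "S", "abbab"))   # -> [1, 4, 2, 3, 2]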
THEOREM 5.1

Let 𝒜 be a k-predictive parsing algorithm for a CFG G. Then there exists a deterministic pushdown transducer T such that τ(T) = {(w$, π) | 𝒜(w) = π}.
Proof. Exercise.
[[]
COROLLARY
Let 𝒜 be a valid k-predictive parsing algorithm for G. Then there is a deterministic left parser for G.

Example 5.6

Let us construct a deterministic left parser P₁ for the 1-predictive parsing algorithm in Example 5.5. Since the grammar is simple, we can obtain a smaller DPDT if we move the input head one symbol to the right on each move. The left parser will use $ both as a right endmarker on the input tape and as a bottom-of-the-pushdown-list marker. Let P₁ = ({q₀, q, accept}, {a, b, $}, {S, A, a, b, $}, δ, q₀, $, {accept}), where δ is defined as follows:

δ(q₀, e, $) = (q, S$, e)
δ(q, a, S) = (q, AS, 1)
δ(q, b, S) = (q, e, 2)
δ(q, a, A) = (q, e, 3)
δ(q, b, A) = (q, SA, 4)
δ(q, $, $) = (accept, e, e)

It is easy to see that (w$, π) ∈ τ(P₁) if and only if 𝒜(w) = π. □
5.1.3. Implications of the LL(k) Definition
We shall show that for every LL(k) grammar we can mechanically construct a valid k-predictive parsing algorithm. Since the parsing table is the heart of the predictive parsing algorithm, we shall show how a parsing table can be constructed from the grammar. We begin by examining the implications of the LL(k) definition. The LL(k) definition states that, given a left-sentential form wAα, then w and the next k input symbols following w uniquely determine which production is to be used to expand A. Thus at first glance it might appear that we have to remember all of w to determine which production is to be used next. However, this is not the case. The following theorem is fundamental to an understanding of LL(k) grammars.

THEOREM 5.2

Let G = (N, Σ, P, S) be a CFG. Then G is LL(k) if and only if the following condition holds: If A → β and A → γ are distinct productions in P, then FIRSTₖ(βα) ∩ FIRSTₖ(γα) = ∅ for all wAα such that S ⇒* wAα (leftmost).
Proof. Only if: Suppose that there exist w, A, α, β, and γ as above, but FIRSTₖ(βα) ∩ FIRSTₖ(γα) contains x. Then by definition of FIRST, we have leftmost derivations

S ⇒* wAα ⇒ wβα ⇒* wxy    and    S ⇒* wAα ⇒ wγα ⇒* wxz

for some y and z. (Note that here we need the fact that N has no useless nonterminals, as we assume for all grammars.) If |x| < k, then y = z = e. Since β ≠ γ, G is not LL(k).

If: Suppose that G is not LL(k). Then there exist two leftmost derivations S ⇒* wAα ⇒ wβα ⇒* wx and S ⇒* wAα ⇒ wγα ⇒* wy such that x and y agree up to the first k places, but β ≠ γ. Then A → β and A → γ are distinct productions in P, and FIRSTₖ(βα) and FIRSTₖ(γα) each contain the string FIRSTₖ(x), which is also FIRSTₖ(y). □

Let us look at some applications of Theorem 5.2 to LL(1) grammars. Suppose that G = (N, Σ, P, S) is an e-free CFG, and we wish to determine whether G is LL(1). Theorem 5.2 implies that G is LL(1) if and only if for all A in N each set of A-productions A → α₁ | α₂ | ⋯ | αₙ in P is such that FIRST₁(α₁), FIRST₁(α₂), …, FIRST₁(αₙ) are all pairwise disjoint. (Note that e-freedom is essential here.)

Example 5.7
The grammar G having the two productions S → aS | a cannot be LL(1), since FIRST₁(aS) = FIRST₁(a) = a. Intuitively, in parsing a string
beginning with an a and looking only at the first input symbol, we would not know whether to use S → aS or S → a to expand S. On the other hand, G is LL(2). Using the notation in Theorem 5.2, if S ⇒* wAα (leftmost), then A = S and α = e. The only two productions for S are as given, so that β = aS and γ = a. Since FIRST₂(aS) = aa and FIRST₂(a) = a, G is LL(2) by Theorem 5.2. □

Let us consider LL(1) grammars with e-productions. At this point it is convenient to introduce the function FOLLOWₖ.

DEFINITION
Let G = (N, Σ, P, S) be a CFG. We define FOLLOWₖᴳ(β), where k is an integer and β is in (N ∪ Σ)*, to be the set {w | S ⇒* αβγ and w is in FIRSTₖ(γ)}. As is customary, we shall omit k and G whenever they are understood. Thus, FOLLOW₁(A) includes the set of terminal symbols that can occur immediately to the right of A in any sentential form, and if αA is a sentential form, then e is also in FOLLOW₁(A). We can extend the functions FIRST and FOLLOW to domains which are sets of strings rather than single strings, in the obvious manner. That is, let G = (N, Σ, P, S) be a CFG. If L ⊆ (N ∪ Σ)*, then FIRSTₖ(L) = {w | for some α in L, w is in FIRSTₖ(α)} and FOLLOWₖ(L) = {w | for some α in L, w is in FOLLOWₖ(α)}. For LL(1) grammars we can make the following important observation.

THEOREM 5.3

A CFG G = (N, Σ, P, S) is LL(1) if and only if the following condition holds: For each A in N, if A → β and A → γ are distinct productions, then FIRST₁(β FOLLOW₁(A)) ∩ FIRST₁(γ FOLLOW₁(A)) = ∅.

Proof. Exercise.
[Z]
Thus we can show that a grammar G is LL(1) if and only if for each set of A-productions A → α₁ | α₂ | ⋯ | αₙ the following conditions hold:
(1) FIRST₁(α₁), FIRST₁(α₂), …, FIRST₁(αₙ) are all pairwise disjoint.
(2) If αᵢ ⇒* e, then FIRST₁(αⱼ) ∩ FOLLOW₁(A) = ∅ for 1 ≤ j ≤ n, i ≠ j.
These conditions are merely a restatement of Theorem 5.3. We should caution the reader that, appealing as it may seem, Theorem 5.3 does not generalize directly. That is, let G be a CFG such that statement (5.1.1) holds:

(5.1.1)  If A → β and A → γ are distinct A-productions, then
         FIRSTₖ(β FOLLOWₖ(A)) ∩ FIRSTₖ(γ FOLLOWₖ(A)) = ∅.
Such a grammar is called a strong LL(k) grammar. Every LL(1) grammar is strong. However, the next example shows that when k > 1 there are LL(k) grammars that are not strong LL(k) grammars.

Example 5.8
Consider the grammar G defined by

S → aAaa | bAba
A → b | e

Using Theorem 5.2, we can verify that G is an LL(2) grammar. Consider the derivation S ⇒ aAaa. We observe that FIRST₂(baa) ∩ FIRST₂(aa) = ∅. Using the notation of Theorem 5.2 here, α = aa, β = b, and γ = e. Likewise, if S ⇒ bAba, then FIRST₂(bba) ∩ FIRST₂(ba) = ∅. Since all derivations in G are of length 2, we have shown G to be LL(2), by Theorem 5.2. But FOLLOW₂(A) = {aa, ba}, so FIRST₂(b FOLLOW₂(A)) ∩ FIRST₂(FOLLOW₂(A)) = {ba}, violating (5.1.1). Hence G is not a strong LL(2) grammar.
D
One important consequence of the LL(k) definition is that a left-recursive grammar cannot be LL(k) for any k (Exercise 5.1.1). Example 5.9
Consider the grammar G with the two productions S → Sa | b. From Theorem 5.2, consider the derivation S ⇒ⁱ Saⁱ, i ≥ 0, with A = S, α = e, β = Sa, and γ = b. Then for i ≥ k, FIRSTₖ(Saaⁱ) ∩ FIRSTₖ(baⁱ) = baᵏ⁻¹. Thus, G cannot be LL(k) for any k. □

It is also important to observe that every LL(k) grammar is unambiguous (Exercise 5.1.3). Thus, if we are given an ambiguous grammar, we can immediately conclude that it cannot be LL(k) for any k. In Chapter 8 we shall see that many deterministic context-free languages do not have an LL(k) grammar. For example, {aⁿ0bⁿ | n ≥ 1} ∪ {aⁿ1b²ⁿ | n ≥ 1} is such a language. Also, given a CFG G which is not LL(k) for a fixed k, it is undecidable whether G has an equivalent LL(k) grammar. But in spite of these obstacles, there are several situations in which various transformations can be applied to a grammar which is not LL(1) to change the given grammar into an equivalent LL(1) grammar. We shall give two useful examples of such transformations here. The first is the elimination of left recursion. We shall illustrate the technique with an example.

Example 5.10
Let G be the grammar S → Sa | b, which we saw in Example 5.9 was not
LL. We can replace these two productions by the three productions

S → bS'
S' → aS' | e

to obtain an equivalent grammar G'. Using Theorem 5.3, we can readily show G' to be LL(1). □

Another useful transformation is left factoring. We again illustrate the technique through an example.

Example 5.11
Consider the LL(2) grammar G with the two productions S → aS | a. We can "factor" these two productions by writing them as S → a(S | e). That is, we assume that concatenation distributes over alternation (the vertical bar). We can then replace these productions by

S → aA
A → S | e

to obtain an equivalent LL(1) grammar. In general, the process of left factoring involves replacing the productions A → αβ₁ | ⋯ | αβₙ by A → αA' and A' → β₁ | ⋯ | βₙ.
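The left-factoring step just described is mechanical enough to automate. The sketch below is our own illustration (the grammar representation and the function name are assumptions, not from the text); applied to the productions of Example 5.11 it produces S → aA, A → S | e.

    from itertools import takewhile

    def left_factor(prods, A, new_nt):
        """Factor the longest common prefix out of the A-productions:
        A -> alpha beta_1 | ... | alpha beta_n becomes
        A -> alpha A' and A' -> beta_1 | ... | beta_n.

        prods: dict mapping a nonterminal to a list of right sides (tuples of symbols)."""
        rights = prods[A]
        # longest common prefix alpha of all right sides of A
        alpha = tuple(t[0] for t in takewhile(lambda col: len(set(col)) == 1, zip(*rights)))
        if not alpha:
            return dict(prods)                            # nothing to factor
        new = dict(prods)
        new[A] = [alpha + (new_nt,)]
        new[new_nt] = [r[len(alpha):] for r in rights]    # () represents e
        return new

    # Example 5.11: S -> aS | a becomes S -> aA, A -> S | e.
    g = {"S": [("a", "S"), ("a",)]}
    print(left_factor(g, "S", "A"))
    # {'S': [('a', 'A')], 'A': [('S',), ()]}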
5.1.4. Parsing LL(1) Grammars
The heart of a k-predictive parsing algorithm is its parsing table M. In this section and the next we show that every LL(k) grammar G can be left-parsed by a k-predictive parsing algorithm by showing how a valid parsing table can be constructed from G. We shall first consider the important special case where G is an LL(1) grammar.

ALGORITHM 5.1

A parsing table for an LL(1) grammar.

Input. An LL(1) CFG G = (N, Σ, P, S).

Output. M, a valid parsing table for G.

Method. We shall assume that $ is the bottom-of-the-pushdown-list marker. M is defined on (N ∪ Σ ∪ {$}) × (Σ ∪ {e}) as follows:
(1) If A → α is the ith production in P, then M(A, a) = (α, i) for all a in FIRST₁(α), a ≠ e. If e is also in FIRST₁(α), then M(A, b) = (α, i) for all b in FOLLOW₁(A).
(2) M(a, a) = pop for all a in Σ.
(3) M($, e) = accept.
(4) Otherwise M(X, a) = error, for X in N ∪ Σ ∪ {$} and a in Σ ∪ {e}.
D
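Algorithm 5.1 is easy to express as a table-building function. The sketch below is our own rendering (the FIRST₁ and FOLLOW₁ sets are supplied as precomputed arguments rather than computed here, and the data layout is assumed); it is checked against the grammar of Example 5.5 and reproduces the entries of Fig. 5.3.

    def build_ll1_table(prods, first1, follow1, terminals):
        """Algorithm 5.1: build M from an LL(1) grammar.

        prods:   list of (A, alpha) pairs; production i is prods[i-1].
        first1:  dict mapping a right side alpha to FIRST_1(alpha), a set of
                 one-symbol strings ("" stands for e).
        follow1: dict mapping a nonterminal A to FOLLOW_1(A)."""
        M = {}
        for i, (A, alpha) in enumerate(prods, start=1):
            for a in first1[alpha]:
                if a != "":
                    M[(A, a)] = (alpha, i)          # step (1), a in FIRST_1(alpha)
            if "" in first1[alpha]:
                for b in follow1[A]:
                    M[(A, b)] = (alpha, i)          # step (1), e in FIRST_1(alpha)
        for a in terminals:
            M[(a, a)] = "pop"                        # step (2)
        M[("$", "")] = "accept"                      # step (3); missing entries are error
        return M

    # Grammar G1 of Example 5.5: (1) S -> aAS, (2) S -> b, (3) A -> a, (4) A -> bSA.
    prods = [("S", "aAS"), ("S", "b"), ("A", "a"), ("A", "bSA")]
    first1 = {"aAS": {"a"}, "b": {"b"}, "a": {"a"}, "bSA": {"b"}}
    follow1 = {"S": {"", "a", "b"}, "A": {"a", "b"}}
    M = build_ll1_table(prods, first1, follow1, {"a", "b"})
    print(M[("S", "a")], M[("A", "b")])   # ('aAS', 1) ('bSA', 4)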
Before we prove that Algorithm 5.1 does produce a valid parsing table for G, let us consider an example of Algorithm 5.1. Example 5.12
Let us consider producing a parsing table for the grammar G with productions

(1) E → TE'
(2) E' → +TE'
(3) E' → e
(4) T → FT'
(5) T' → *FT'
(6) T' → e
(7) F → (E)
(8) F → a

Using Theorem 5.3 the reader can verify that G is an LL(1) grammar. In fact, the discerning reader will observe that G has been obtained from G₀ using the transformation eliminating left recursion as in Example 5.10. (G₀ is not LL, by the way.) Let us now compute the entries for the E-row using step (1) of Algorithm 5.1.
             a          (          )          +           *          e
    E      TE', 1     TE', 1
    E'                           e, 3       +TE', 2                 e, 3
    T      FT', 4     FT', 4
    T'                           e, 6       e, 6       *FT', 5     e, 6
    F      a, 8       (E), 7
    a      pop
    (                 pop
    )                            pop
    +                                       pop
    *                                                  pop
    $                                                              accept

Fig. 5.4  Parsing table for G.
Here, FIRST₁(TE') = {(, a}, so M(E, () = (TE', 1) and M(E, a) = (TE', 1). All other entries in the E-row are error. Let us now compute the entries for the E'-row. We note FIRST₁(+TE') = +, so M(E', +) = (+TE', 2). Since E' → e is a production, we must compute FOLLOW₁(E') = {e, )}. Thus, M(E', e) = M(E', )) = (e, 3). All other entries for E' are error. Continuing in this fashion, we obtain the parsing table for G shown in Fig. 5.4. Error entries have been left blank. The 1-predictive parsing algorithm using this table would parse the input string (a*a) in the following sequence of moves:
[(a*a), E$, e] ⊢ [(a*a), TE'$, 1]
               ⊢ [(a*a), FT'E'$, 14]
               ⊢ [(a*a), (E)T'E'$, 147]
               ⊢ [a*a), E)T'E'$, 147]
               ⊢ [a*a), TE')T'E'$, 1471]
               ⊢ [a*a), FT'E')T'E'$, 14714]
               ⊢ [a*a), aT'E')T'E'$, 147148]
               ⊢ [*a), T'E')T'E'$, 147148]
               ⊢ [*a), *FT'E')T'E'$, 1471485]
               ⊢ [a), FT'E')T'E'$, 1471485]
               ⊢ [a), aT'E')T'E'$, 14714858]
               ⊢ [), T'E')T'E'$, 14714858]
               ⊢ [), E')T'E'$, 147148586]
               ⊢ [), )T'E'$, 1471485863]
               ⊢ [e, T'E'$, 1471485863]
               ⊢ [e, E'$, 14714858636]
               ⊢ [e, $, 147148586363]

THEOREM 5.4
Algorithm 5.1 produces a valid parsing table for an LL(1) grammar G.
Proof. We first note that if G is an LL(1) grammar, then at most one value is defined in step (1) of Algorithm 5.1 for each entry M(A, a) of the parsing table. This observation is merely a restatement of Theorem 5.3. Next, a straightforward induction on the number of moves executed by a 1-predictive parsing algorithm 𝒜 using the parsing table M shows that if (xy, S$, e) ⊢* (y, α$, π), then S ⇒* xα. Another induction on the number of steps in a leftmost derivation can be used to show the converse, namely
that if S ⇒* xα, where α is the open portion of xα, and FIRST₁(y) is in FIRST₁(α), then (xy, S$, e) ⊢* (y, α$, π). It then follows that (w, S$, e) ⊢* (e, $, π) if and only if S ⇒* w. Thus 𝒜 is a valid parsing algorithm for G, and M a valid parsing table for G. □

5.1.5. Parsing LL(k) Grammars
Let us now consider the construction of a parsing table for an arbitrary LL(k) grammar G = (N, Σ, P, S), where k > 1. If G is a strong LL(k) grammar, then we can use Algorithm 5.1 with lookahead strings of length up to k symbols. However, the situation is somewhat more complicated when G is not a strong LL(k) grammar. In the LL(1)-predictive parsing algorithm we placed only symbols in N ∪ Σ on the pushdown list, and we found that the combination of the nonterminal symbol on top of the pushdown list and the current input symbol was sufficient to uniquely determine the next production to be applied. However, when G is not strong, we find that a nonterminal symbol and the lookahead string are not always sufficient to uniquely determine the next production. For example, consider the LL(2) grammar

S → aAaa | bAba
A → b | e

of Example 5.8. Given the nonterminal A and the lookahead string ba, we do not know whether we should apply production A → b or A → e. We can, however, resolve uncertainties of this nature by associating, with each nonterminal and the portion of a left-sentential form which may appear to its right, a special symbol which we shall call an LL(k) table (not to be confused with the parsing table). The LL(k) table, given a lookahead string, will uniquely specify which production is to be applied next in a leftmost derivation in an LL(k) grammar.

DEFINITION
Let Σ be an alphabet. If L₁ and L₂ are subsets of Σ*, let L₁ ⊕ₖ L₂ = {w | for some x ∈ L₁ and y ∈ L₂, we have w = xy if |xy| ≤ k, and w is the first k symbols of xy otherwise}.

Example 5.13

Let L₁ = {e, abb} and L₂ = {b, bab}. Then L₁ ⊕₂ L₂ = {b, ba, ab}. The ⊕ₖ operator is similar to an infix FIRST operator. □

LEMMA 5.1

For any CFG G = (N, Σ, P, S) and for all α and β in (N ∪ Σ)*, FIRSTₖ(αβ) = FIRSTₖ(α) ⊕ₖ FIRSTₖ(β).
Proof. Exercise. □
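The ⊕ₖ operation is simple to compute. The following sketch is our own illustration (the empty string "" stands for e); it reproduces Example 5.13.

    def oplus(k, L1, L2):
        """L1 (+)_k L2: concatenate each pair and keep only its first k symbols."""
        return {(x + y)[:k] for x in L1 for y in L2}

    # Example 5.13: {e, abb} (+)_2 {b, bab} = {b, ba, ab}.
    print(oplus(2, {"", "abb"}, {"b", "bab"}))   # {'b', 'ba', 'ab'} (in some order)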
DEFINITION
Let G = (N, Z, P, S) be a CFG. For each A in N and L ~ Z *k we define T,4.L, the L L ( k ) table associated with A and L to be a function which given
a lookahead string u in Z *k returns either the symbol error or an A-production and a finite list of subsets of Z *k. Specifically, (1) TA,L(U)= error if there is no production A--~ a in P such that FIRSTk(a ) @e L contains u. (2) TA,L(U) = (A ~ a, (Y~, Y 2 , . . . , Ym)) if A ~ a is the unique production in P such that FIRSTk(a ) @e L contains u. If a = xoBxxtBzx z
. . .
BmX m,
m>0,
where each Bt ~ N a n d x t E Z*, then Yt = FIRSTk(xiB,+ix~+~ "'" Bmxm)@k L" We shall call Y, a local follow set for B,. [If m = 0, TA.L(U) = (A --~ a, ~).] (3) TA,L(U) is undefined if there are two or more productions A --~ ~ , 1 ~ 1 "-" 1~. such that FIRSTk(et) @e L contains u, for 1 _< i < n, n > 2. This situation will not occur if G is an LL(k) grammar. Intuitively, if TA.L(U)= error, then there is no possible derivation +
in G of the form A x = - ~ u v
for any x E L
and v ~ Z*. Whenever
TA,L(U) = (A --~ a, (Y1, Y2, . . . , Ym)), there is exactly one production, A ---~ a, +
which can be used in the first step of a derivation A x ==~ uv for any x ~ L and v ~ Z*. Each set of strings Yt gives all possible prefixes of length up to k of terminal strings which can follow a string derived from B~ when we use the production A ~ a, where a = x o B l x i B 2 x 2 . . . BmXm, in any derivation of the form A x ~
ax ~
uv, with x in L.
lm
By Theorem 5.2, G = (N, Z, P, S) is not LL(k) if and only if there exists in (N U Z)* such that (1) S ==~ wAa, and Im
(2) FIRSTk(fla) ~ F I R S T k ( I , a ) ¢ ~ for some fl ~ ~, such that A ~ and A --~ t' are in P.
fl
By Lemma 5.1 we can rephrase condition (2) as (2') If r = FIRSTk(a ), then (FIRSTe(fl) Ok L) C3 (FIRSTk(?) @k L) ~ ~ . Therefore, if G is LL(k), and we have the derivation S =~ w A a =~ wx, lm
lm
then TA,L(U) will uniquely determine which production is to be used to expand A, where u is FIRSTk(x) and L is FIRSTk(a ).
Example 5.14 Consider the LL(2) grammar
S
> aAaalbAba
A
>ble
Let us compute the LL(2) table Ts, te~, which we shall denote To. Since S ---~aAaa is a production, we compute FIRST2(aAaa ) Q2 {e} = {aa, ab}. Likewise, S---~ bAba is a production, and FIRST2(bAba ) e 2 [e} = {bb}. Thus we find To(aa)= (S---~ aAaa, Y). Y is the local follow set for A; Y = FIRST2(aa)02 [e] = [aa}. The string aa is the string to the right of A in the production S---~ aAaa. Continuing in this fashion, we obtain the table To shown below" Table To
aa ab bb
Production
Sets
S ~ S ~ S ~
{aa} {aa} {ba}
aAaa aAaa bAba
For each u in (a -+ b) .2 not shown, To(u) =
error.
D
We shall now provide an algorithm to compute those LL(k) tables for an LL(k) grammar G which are needed to construct a parsing table for G. It should be noted that if G is an LL(1) grammar, this algorithm might produce ~nore than one table per nonterminal. However, the parsers constructed by Algorithms 5.1 and 5.2 will be quite similar. They act the same way on inputs in the language, of course. On other inputs, the parser of Algorithm 5.2 might detect the error while the parser of Algorithm 5.1 proceeds to make a few more moves. ALGORITHM 5.2 Construction of LL(k) tables.
Input. An LL(k) CFG G = (N, E, P, S). Output. 3, the set of LL(k) tables needed to construct a parsing table for G.
Method. (1) Construct To, the LL(k) table associated with S and {e}. (2) Initially set ~ = IT0}. (3) For each LL(k) table T in 3 with entry r(u) = (a ~
XoaxXxa~x~... a~x~, (r~, r ~ , . . . , r , ) ) ,
add to 3 the LL(k) table Ts,,r,, for 1 < i < m, if Ts,,r, is not already in 3. (4) Repeat step (3) until no new LL(k) tables can be added to 3. Example 5.15
Let us construct the relevant set of LL(2) tables for the grammar
S
~ aAaalbAba
A
>ble
We begin with 3 = [Ts.t,l~}. Since Ts.t,~(aa)= (S---~ aAaa, {aa}), we must add TA.taa~ tO 3. Likewise, since To(bb)= (S ~ bAba, {ha}), we must also add TA,Cbo~to 3. The nonerror entries for the LL(2) tables TA.taa~ and TA.Cbo~ are shown below' Table Ta, taa~
Production ba aa
Sets
A ----~b a----~e
u
Table TA, tbal
Production ba
a ----~ e
bb
A ----~ b
Sets
At this point 3 {Ts,{e], ZA,{aa} , ZA,{ba] ~ and no new entries can be added to 3 in Algorithm 5.2 so that the three LL(2) tables in 3 are the relevant LL(2) table for G. =
From the relevant set of LL(k) tables for an LL(k) grammar G we can use the following algorithm to construct a valid parsing table for G. The kpredictive parsing algorithm using this parsing table will actually use the LL(k) tables themselves as nonterminal symbols on the pushdown list. ALGORITHM 5.3 A parsing table for an LL(k) grammar G = (N, X, P, S).
Input. An LL(k) C F G G = (N, X, P, S) and 3, the set of LL(k) tables for G. Output. M, a valid parsing table for G. Method. M is defined on (3 U X u [$}) × X.k as follows: (1) If A --~ xoBlxlB2x2
...
BmX m
is the ith production in P and TA,L is
in 3, then for all u such that Ta,z(u) = (A ---~ XoBlX~B2x 2 . . . BmXm, aAaa
(2) S
> bAba
(3) A
>b
(4) A
>e
using the relevant set of LL(2) tables constructed in Example 5.15. The parsing table resulting from Algorithm 5.3 is shown in Fig. 5.5. In Fig. 5.5, To = Ts, t,~, T1 = Ta,~aa~, and T2 = TA,Cba~-Blank entries indicate error. aa
ab
a
To
aTlaa, 1
aTlaa, 1
T1
e,4
bb
b
e
b T2 ba, 2
b,3
T2
pop
ba
pop
e,4
b,. 3
pop
pop
pop pop accept
Fig. 5.5
Parsing table.
The 2-predictive parsing algorithm would make the following sequence of moves with input bba" (bba, To$ , e) ~- (bba, bTzba$ , 2)
~- (ba, T~ba$, 2) F- (ba, ba$, 24)
(a, aS, 24) }-(e, $, 24).
[-]
THEOREM 5.5
If G = (N, E, P, S) is an LL(k) grammar, then the parsing table constructed in Algorithm 5.3 is a valid parsing table for G under a k-predictive parsing algorithm.
Proof. The proof is similar to that of Theorem 5.4. If G is LL(k), then no conflicts occur in the construction of the relevant LL(k) tables for G, :¢
since if A ----~fl and A ~
~ are in P and S :=> wAoc, then lm
(FIRSTkffl) @~k FIRSTk(tx)) ~ (FIRSTk(?) @~k FIRSTk(00) = ~ . In the construction of the relevant LL(k) tables for G, we compute a table Ta.z only if for some S, w, and ~, we have S ~ wAot, and L = FIRST~(a). That lm
is, L will be a local follow set for A. Thus, if u is in Z *k, then there is at most one production A ~ fl such that u is in FIRSTk(fl) OkL. Let us define the homomorphism h on 3 u E as foIlows:
h(a) = a
for all a E
h(T) = A
if T is an LL(k) table associated with A and L for some L.
Note that each table in 3 must have at least one entry which is a production index. Thus, A is uniquely determined by T. We shall now prove that
(5.1.2)
S "=~ x0c if and only if there is some ~' in (3 u ~)* such that h(og) = ~, and (xy, To$, e)I--~- (y, oc'$, n) for all y such that 0~~
y. To is the LL(k) table associated with S and {el.
If." From the manner in which the parsing table is constructed, whenever a production number i is emitted corresponding to the ith production A ---~fl, the parsing algorithm replaces a table T such that h(T) = A by a string fl' such that h ( f l ' ) = ft. The "if" portion of statement (5.1.2)can thus be proved by a straightforward induction on the number of moves made by the parsing algorithm. Only if." Here we shall show that
(5.1.3)
If A ===~x, then the parsing algorithm will make the sequence of moves (xy, T, e) ~ (y, e, n) for any LL(k) table T associated with A and L, where L = FIRSTk(0 0 for some ~ such that S ==~wAoc, lm
and y is in L. The proof will proceed by induction on In I. If A ~
a la2 . . . an, then
(ata z . . . any , T, e) ~- (a t . . . any, a 1 . . . a n, i)
since T ( u ) = (A ---~ a l a 2 . . , a n, f~) for all u in FIRSTk(ala2 . . . a,)@k L. Then (al . . . anY, al . . . a n, i)I.-~- (y, e, i). Now suppose that statement (5.I.3) is true for all leftmost derivations of length up to /, and suppose that A t = = ~ x o B t x i B 2 x 2 . . . BmX m and Bj='==~yj, where [rrj] < l. Then (XoY~Xl " ' ' YmX,nY, T, e)I -~- (XoYlX~ " " YmXmY, xoTax~ "'" TmX m, i), since T ( u ) : ( A ~ x o B l X 1 . . . BmX m, < Y t " ' " Ym>) f o r all u included in F I R S T k ( x o B a x a . . " B,nXm) e e L. Each Tj is the LL(k) table associated with Bj and Y~, 1 < j < m, so that the inductive hypothesis holds for each sequence of moves of the form ( y j x y . . . YmXmY, Tj, e)[.-~-- (x i . . . YmXmY, e, ztj)
Putting in the popping moves for the x / s we obtain (xoY~X~y2x2 "'" YmxmY, T, e) ~- ( X o Y l X l y 2 x z " " YmXmY, x o T l x l T z x z
"'" TmXm, i)
~-- ( y l x 2 y 2 x 2
Tmxm, i)
~-- ( x l y z x z
'''
. . . y.,xmY , T l x t T 2 x z YmXmY, x I T z x z
...
" ' " TmXm, iffl)
(YzX2 " ' " ymXmY, T2x2 " ' " TmXm, iffl) ,
[-- (y, e, ilrlrt2 . . . Ztm)
From statement (5.1.3) we have, as a special case, that if S~==~ w, then (w, To $, e)[---- (e, $, n). E] As another example, let us construct the parsing table for the LL(2) grammar G 2 of Example 5.3. Example 5.17
Consider the LL(2) grammar G2 (1) S
>e
(2) S
> abA
(3) A
~ Saa
(4) A
>b
Let us first construct the relevant LL(2) tables for G 2. We begin by constructing To = Ts.t,~"
Table To
Production e
S-----re
ab
S~
abA
Sets m
{e]
From To we obtain T1 = T,~.c,~" Table T1
b aa
ab
Production
Sets
A --, b A---~ Saa A---~ Saa
[aa} {aa}
From Ta we obtain T2 = Ts, ca=~" Table T2
u
Production
Sets
aa ab
S ---~ e S ~ abA
{aa}
m
From Tz we obtain T3 = TA.ta=~" Table T3
aa
ab ba
Production
Sets
A ----~Saa A.--, Saa A --, b
{aa} {aa}
From these LL(2) tables we obtain the parsing table shown in Fig. 5,6. The 2-predictive parsing algorithm using this parsing table would parse the input string abaa by the following sequence of moves" (abaa, To $, e) R (abaa, ab T1 $, 2) 1- (baa, bT1 $, 2) R (aa, Ti $, 2) [-- (aa, T2aa$, 23) i-- (aa, aa$, 231)
[- (a, aS, 231) t- (e, $, 231)
[B
aa
To
ab
ba
bb
b
T2aa, 3
T2aa, 3
T2
e, 1
ab T3 , 2
T3
T2aa, 3
T2aa, 3
pop
pop
e
e, 1
ab T1, 2
Tl
a
a
b,4
b, 4 pop
b
pop
pop
pop
accept
S Fig. 5.6 Parsing table for G2.
We conclude this section by showing that the k-predictive parsing algorithm parses every input string in linear time. THEOREM 5.6 The number of steps executed by a k-predictive parsing algorithm, using the parsing table resulting from Algorithm 5.3 for an LL(k) context-free grammar G = (N, E, P, S), with an input of length n, is a linear function of n.
Proof. If G is an LL(k) grammar, G cannot be left-recursive. From Lemma 4.1, the maximum number of steps in a derivation of the form +
A ==> B~ is less than some constant c. Thus the maximum number of moves lm
that can be made by a k-predictive parsing algorithm t~ before a pop move, which consumes another input symbol, is bounded above by c. Therefore, a can execute at most O(n) moves in processing an input of length n. D 5.1.6.
Testing for the LL(k) Condition
Given a grammar G, there are several questions we might naturally ask about G. First, one might ask whether G is LL(k) for a given value of k. Second, is G an LL grammar? That is, does there exist some value of k such that G is LL(k)? Finally, since the left parsers for LL(1) grammars are particularly straightforward to construct, we might ask, if G is not LL(1), whether there is an LL(1) grammar G' such that L(G')= L(G). Unfortunately we can provide an algorithm to answer only the first question. It can be shown that the second and third questions are undecidable. In this section we shall provide a test to determine whether a grammar is LL(k) for a specified value of k. If k = 1, we can use Theorem 5.3. For arbitrary k, we can use Theorem 5.2. Here we shall give the general case. It is essentially just showing that Algorithm 5.3 succeeds in producing a parsing table only if G is LL(k).
Recall that G = (N, Z, P, S) is not LL(k) if and only if for some ~ in (N u Z)* the following conditions hold" (1) S ==~wAoc, tm
(2) L = FIRSTk(a), and (3) (FIRSTk(fl) Ok L) A (FIRSTk(?) ~
L) ¢ ~ ,
for some fl ~ ~, such that A ---~ fl and .,4 --~ 7 are productions in P. ALGORITHM 5.4 Test for LL(k)-ness.
lnput. A CFG G : (N, Z, P, S) and an integer k. Output. "Yes" if G is LL(k). "No," otherwise. Method. (1) For a nonterminal A in N such that A has two or more alternates, compute a(A) = {L ~ Z'k IS ==~wan and L = FIRSTk(~)}. (We shall prolm
vide an algorithm to do this subsequently.) (2) If A ~ fl and A --~ ? are distinct A-productions, compute, for each L in a(A), f(L) = (FIRSTk(fl) @k L) ~ (FIRSTk(?) @k L). Iff(L) ~ ~ , then halt and return "No." If f ( L ) = ~ for all L in a(A), repeat step (2) for all distinct pairs of A-productions. (3) Repeat steps (1) and (2) for all nonterminals in N. (4) Return "yes" if no violation of the LL(k) condition is found. To implement Algorithm 5.4, we must be able to compute FIRST~(fl) for any fl in (N u Z)* and CFG G = (N, Z, P, S). Second, we must be able to find the sets in a(A) -----{L ~ E*k [there exists ~ such that S =~ wAs, and lm
L - - FIRSTk(~)}. We shall now provide algorithms to compute both these items. AL6ORIrIaM 5.5 Computation of FIRSTk(fl).
Input. A CFG G = (N, Z, P, S) and a string f l = X i X z . . . X , (N W Z)*.
in
Output. FIRST~(fl). Method. We compute FIRSTk(X~) for 1 < i ~ n and observe that by Lemma 5.1 FIRSTk(fl) -- FIRSTk(X1) O1, FIRSTk(X2) @)I, " " Ok FIRSTk(X,) It will thus suffice to show how to find FIRSTk(X ) when X is in N; if X is in Z L) {e}, then obviously FIRSTk(X) = (X}.
We define sets Fᵢ(X) for all X in N ∪ Σ and for increasing values of i, i ≥ 0, as follows:
(1) Fᵢ(a) = {a} for all a in Σ and i ≥ 0.
(2) F₀(A) = {x ∈ Σ*ᵏ | A → xα is in P, where either |x| = k, or |x| < k and α = e}.
(3) Suppose that F₀, F₁, …, Fᵢ₋₁ have been defined for all A in N. Then
    Fᵢ(A) = {x | A → Y₁ ⋯ Yₙ is in P and x is in Fᵢ₋₁(Y₁) ⊕ₖ Fᵢ₋₁(Y₂) ⊕ₖ ⋯ ⊕ₖ Fᵢ₋₁(Yₙ)} ∪ Fᵢ₋₁(A).
(4) Since Fᵢ₋₁(A) ⊆ Fᵢ(A) ⊆ Σ*ᵏ for all A and i, eventually we must reach an i for which Fᵢ₋₁(A) = Fᵢ(A) for all A in N. Let FIRSTₖ(A) = Fᵢ(A) for that value of i. □

Example 5.18
Let us construct the sets Fᵢ(X), assuming that k has the value 1, for the grammar G with productions
S
> BA
A
>-q-BAle
B
>
C
> , DCle
D
> (S)la
DC
Initially, F0(S) = F0(B) =
Fo(A) = [-t-, e} Fo(C) = [*, e} Fo(D) = [(, a} Then F,(B)={(, a} and Fi(X) = Fo(X) for all other X. Then F z ( S ) = [(, a} and F 2 ( X ) = El(X) for all other X. F3(X) = F2(X) for all X, so that FIRST(S) = FIRST(B) = FIRST(D) = [(, a} FIRST(A) = [ + , e} FIRST(C) = {,, e}
D
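Algorithm 5.5 is a straightforward fixed-point computation. The sketch below is our own rendering for k = 1 (the data layout is assumed, and the iteration order differs slightly from the F₀, F₁, … sequence of the algorithm but reaches the same fixed point); run on the grammar of Example 5.18 it yields the FIRST₁ sets computed there.

    def first_sets(k, nonterminals, terminals, prods):
        """Algorithm 5.5: FIRST_k(X) for every symbol X, by iteration to a fixed point.

        prods: list of (A, right_side) pairs, right sides being tuples of symbols.
        Strings of length <= k are ordinary Python strings; "" stands for e."""
        def oplus(L1, L2):                       # the (+)_k operator of Lemma 5.1
            return {(x + y)[:k] for x in L1 for y in L2}

        F = {a: {a} for a in terminals}          # F_i(a) = {a} for terminals, all i
        F.update({A: set() for A in nonterminals})
        changed = True
        while changed:                           # recompute until nothing is added
            changed = False
            for A, rhs in prods:
                new = {""}                       # e, the identity for (+)_k
                for Y in rhs:
                    new = oplus(new, F[Y])
                if not new <= F[A]:
                    F[A] |= new
                    changed = True
        return F                                 # the fixed point of step (4)

    # Example 5.18 (k = 1): S -> BA, A -> +BA | e, B -> DC, C -> *DC | e, D -> (S) | a.
    prods = [("S", ("B", "A")), ("A", ("+", "B", "A")), ("A", ()),
             ("B", ("D", "C")), ("C", ("*", "D", "C")), ("C", ()),
             ("D", ("(", "S", ")")), ("D", ("a",))]
    F = first_sets(1, {"S", "A", "B", "C", "D"}, {"+", "*", "(", ")", "a"}, prods)
    print(F["S"], F["A"], F["C"])   # {'(', 'a'} {'+', ''} {'*', ''}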
THEOREM 5.7 Algorithm 5.5 correctly computes FIRSTk(A ).
Proof. We observe that if for all X in N U ~, Fi_i(X)= Ft(X), then
Ft(X) = Fj(X) for all j > i and all X. Thus we must prove that x is in FIRSTk(A ) if and only if x is in Fj(A) for some j. lf: We show that F / A ) ~ FIRSTk(A) by induction on j. The basis, j = 0, is trivial. Let us consider a fixed value of j and assume that the hypothesis is true for smaller values ofj. If x is in Fj(A), then either it is in Fj_ i(A), in which case the result is immediate, or we can find A ~ Y1 . . . Yn in P, with xp in F~_ l(Yp), where 1 ~ p ~ n, such that x = FIRSTk(x 1 . . . xn). By the inductive hypothesis, xp is in FIRSTk(Yp). Thus there exists, for each p, a derivation Y~ ==~ yp, where xp is FIRSTk(yp). Hence, A =-~ y l . . . y ~ . We must now show that x = FIRSTk(y I . . . y~), and thus conclude that x is in FIRSTk(A ).
Case 1: tx~ . . . xn[ < k. Then yp = xp for each p, and x = y t . . . y ~ . Since Yl "'" Yn is in FIRSTk(A ) in this case, x is in FIRSTk(A ). Case 2." For some s > 0, Ix1 " " x,I < k but Ix1 " . . x~+11>_ k. Then yp = xp for 1 ~ p ~ s, and x is the first k symbols of x~ . . . x,+~. Since x,+~ is a prefix of Ys+l, x is a prefix of y1 . . . y,+l and hence of y~ . . . y~. Thus, x is FIRSTk(A ). /.
Only if: Let x be in FIRSTk(A). Then for some r, A ==~ y and x = FIRSTk(y ). We show by induction on r that x is in F~(A). The basis, r = 1, is trivial, since x is in Fo(A ). (In fact, the hypothesis could have been tightened somewhat, but there is no point in doing so.) Fix r, and assume that the hypothesis is true for smaller r. Then r-1
A~Y1
"'" Y n ~ Y ,
rla
where y = y l . . . y~ and Yp ==~ yp for 1 < p ~ n. Evidently, rp < r. By the inductive hypothesis, xp is in F~_I(Yp) , where xp = FIRSTk(yp). Thus, FIRSTk(x ~ . . . xp), which is x, is in F~(A). [Z In the next algorithm we shall see a method of computing, for a given g r a m m a r G = (N, X, P, S), those sets L ~ X *k such that S ==~ wAo~ and lm
FIRSTk(e ) = L for some w, A, and e. ALGORITHM 5.6
Computation of tr(A).
Input. A C F G G = (N, X, P, S). Output. a(A) = {L1L ~ X*e such that S =-~ wAo~ and Im
FIRSTk(~ ) = L for some w and ~}.
Method. We shall compute, for all A and B in N, sets a(A, B) such that $
a(A,B)={L[LcE*
k, and for some x and a, A = ~ x B a
and L =
lm
FIRST(e)}. We construct sets tri(A, B) for each A and B and for i = 0, 1. . . . as follows" (1) Let a o ( A , B ) = {L ~ Z*k[A---~flBa is in P and L = FIRST(a)}. (2) Assume that tr~_~(A, B) has been computed for all A and B. Define at(A, B) as follows" (a) If L is in at_ i(A, B), place L in tri(A, B). (b) If there is a production A ---~ X1 . " X, in P, place L in at(A, B) if for some j, 1 _~j < n, there is a set L' in trt_~(Xj, B) and L = L' ~ k FIRST(Xj+i) @~k "'" ~ k FIRST(X,). (3) When for some i, an(A, B) = a~_ x(A, B) for all A and B, let a(A, B) = at(A, B). Since for all i, a~_~(A, B ) ~ at(A, B ) ~ (P(Z*k), such an i must exist. (4) The desired set is a(S, A). [~] THEOREM 5.8 In Algorithm 5.6, L is in a(S, A) if and only if for some w ~ Z* and a ~ (N u Z)*, S ==~wAa and L = FIRSTk(a ). lm
Proof The proof is similar to that of the previous theorem and is left for the Exercises. Example 5.19
Let us test the grammar G with productions
S
> AS[e
A
~ aAlb
for the LL(1) condition. We begin by computing F I R S T i ( S ) = [e, a, b] and FIRSTi(A) = {a, b}. We then must compute a(S) = a(S, S) and a(A) = a(S, A). From step (1) of Algorithm 5.4 we have ao(S, S) = {{e}]
ao(S, A) -- {{e, a, b}}
ao(A, S) =
ao(A, A) = {{e}}
F r o m step (2) we find no additions to these sets. For example, since S ~ AS is a production, and ao(A, A) contains {e} we must add to a l(S, A) the set L = {e} t~)l FIRST(S) = {e, a, b] by step (2b). But a~(S, A) already contains {e, a, b}, because ao(S, A) contains this set. Thus, a ( A ) = {(e, a, b}) and a ( S ) = [[el}. To check that G is LL(1), we
have to verify that ( F I R S T ( A S ) ~ , (e}) ~ (FIRST(e)@), ( e ) ) = ~ . [This is for the two S-productions and the lone m e m b e r of a(S, S).] Since FIRST(AS)-- FIRST(A)~,
F I R S T ( S ) - - (a, b}
and F I R S T ( e ) = [e], we indeed verify that [a, b} 5~ [ e ) = F o r the two A-productions, we must show that (FIRST(aA) ~
~.
[e, a, b)) ~ (FIRST(b) @>~ [e, a, b)) -- ~ .
This relation reduces to {a} ∩ {b} = ∅, which is true. Thus, G is LL(1).
[-7
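The test carried out in Example 5.19 can be phrased directly in terms of the two conditions following Theorem 5.3. The sketch below is our own illustration (the FIRST₁ and FOLLOW₁ sets are supplied rather than computed, and the data layout is assumed); it confirms that the grammar S → AS | e, A → aA | b of Example 5.19 satisfies the LL(1) conditions.

    def is_ll1(prods_by_nt, first1, follow1):
        """Check the LL(1) conditions: for each nonterminal A with alternatives
        alpha_1 | ... | alpha_n, (1) the FIRST_1 sets are pairwise disjoint, and
        (2) if some alpha_i derives e, then FIRST_1(alpha_j) is disjoint from
        FOLLOW_1(A) for every j != i.  "" stands for e."""
        for A, alts in prods_by_nt.items():
            firsts = [first1[alpha] for alpha in alts]
            for i in range(len(alts)):
                for j in range(len(alts)):
                    if i < j and firsts[i] & firsts[j]:
                        return False                     # condition (1) violated
                    if i != j and "" in firsts[i] and firsts[j] & follow1[A]:
                        return False                     # condition (2) violated
        return True

    # Example 5.19: S -> AS | e, A -> aA | b.
    prods_by_nt = {"S": ["AS", ""], "A": ["aA", "b"]}
    first1 = {"AS": {"a", "b"}, "": {""}, "aA": {"a"}, "b": {"b"}}
    follow1 = {"S": {""}, "A": {"a", "b", ""}}
    print(is_ll1(prods_by_nt, first1, follow1))   # True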
EXERCISES
5.1.1.
Show that if G is left-recursive, then G is not an LL grammar.
5.1.2.
Show that if G has two productions A - , G cannot be LL(1).
aOc]afl, where ~ ~ fl, then
5.1.3.
Show that every LL grammar is unambiguous.
5.1.4.
Show that every grammar obeying statement (5.1.1) on page 343 is LL(k).
5.1.5.
Show that the grammar with productions S~
aAaBl bAbB
A-
~ alab
B-
~aBla
is LL(3) but not LL(2). 5.1.6.
Construct the LL(3) tables for the grammar in Exercise 5.1.5.
5.1.7.
Construct a deterministic left parser for the grammar in Example 5.17.
5.1.8.
Give an algorithm to compute FOLLOW~(A) for nonterminal A.
"5.1.9.
Show that every regular set has an LL(I) grammar.
5.1.10.
Show that G = (N, Z , P , S) is an LL(1) grammar if and only if for each set of A-productions A ~ 0~1 1(~2[''" ](~n the following conditions hold" (1) FIRST?(0ct) ~ FIRST~(~j) = ~5 for i ~ j. (2) If 0¢; ==~ e, FIRST~(0Cj) ~ FOLLOW~(A) -- ~ for 1 __~j -< n, i :~ j. Note that at most one ct~ can derive e.
*'5.1.11.
Show that it is undecidable whether there exists an integer k such that a CFG is LL(k). [In contrast, if we are given a fixed value for k, we can determine if G is LL(k) for that particular value of k.]
"5.1.12.
Show that it is undecidable whether a CFG generates an LL language.
"5.1.13.
The definition of an LL(k) grammar is often stated in the following manner. Let G - ( N , E, P, S) be a CFG. If S =~ wax, for w and x
in ~* and A ~ N, then for each y in Z *k there is at most one production A ---~ a such that y is in FIRSTk(ax). Show that this definition is equivalent to the one given in Section 5.1.1. 5.1.14.
Complete the proof of Theorem 5.4.
5.1.15.
Prove Lemma 5.1.
5.1.16. *'5.1.17. 5.1.18. "5.1.19. *'5.1.20.
Prove Theorem 5.8. Show that if L is an LL(k) language, then L has an LL(k) grammar in Chomsky normal form. Show that an LL(0) language has at most one member. Show that the grammar G with productions S ~ Find an equivalent LL(1) grammar for L(G).
aaSbbla[e is LL(2).
Show that the language [a"0b"l n > 0} LJ {a"lb2"l n > 0} is not an LL language. Exercises 5.1.21-5.1.24 are done in Chapter 8. The reader may wish to try his hand at them now.
*'5.1.21.
Show that it is decidable for two LL(k) grammars, G1 and G2, whether L ( G , ) = L(Gz).
*'5.1.22.
Show that for all k > 0, there exist languages which are LL(k + 1) but not LL(k).
*'5.1.23.
Show that every LL(k) language has an LL(k + 1) grammar with no e-productions.
*'5.1.24.
Show that every LL(k) language has an LL(k + 1) grammar in Greibach normal form.
"5.1.25. Suppose that A ~ 0eft lay are two productions in a grammar G such that a does not derive e and fl and ~ begin with different symbols. Show that G is not LL(1). Under what conditions will the replacement of these productions by A ------~ 0cA' h ' ---> ]31~,
transform G into an equivalent LL(1) grammar 9. 5.1.26.
Show that if G = (N, Z, P, S) is an LL(k) grammar, then, for all A ~ N, GA is LL(k), where Ga is the grammar obtained by removing all useless productions and symbols from the grammar (N, Z, P, A). Analogous to the LL grammars there is a class of grammars, called the LC grammars, which can be parsed in a left-corner manner with a deterministic pushdown transducer scanning the input from left to right. Intuitively, a grammar G = (N, Z , P , S) is LC(k) if, knowing ,
the leftmost derivation S ~
wAS, we can uniquely determine that the
lm
production to replace A is A ~
X 1 . - . X~ once we have seen the
portion of the input derived from X1 (the symbol X1 is the left corned and the next k input symbols. In the formal definition, should Xx be a terminal, we may then look only at the next k - 1 symbols. This restriction is made for the sake of simplicity in stating an interesting theorem which will be Exercise 5.1.33. In Fig. 5.7, we would recognize production A ~ X1 . . . Xp after seeing wx and the first k symbols (k - 1 if Xa is in E) of y. Note that if G were LL(k), we could recognize the production "sooner," specifically, once we had seen w and FIRSTk(Xy).
w
x
y
Fig. 5.7 Left-corner parsing.
We shall make use of the following type of derivation in the definition of an LC grammar. DEFINITION Let G be a CFG. We say that S = - ~ wAS if S ~ lc
w A S and the lrn
nonterminal A is not the left corner of the production which introduced it into a left-sentential form of the sequence represented by S :=-> wAo~. lm
For example, in Go, E ~
E + T is false, since the E in E + T
le
arises from the left corner of the production E ~
E + T. On the other
,
hand, E ~
a + T is true, since T is not the left corner of the produc-
lc
tion E ~
E + T, which introduced the symbol T in the sequence E==-> E + T:=> a + T.
DEFINITION CFG G----(N, Z , P , S) is L C ( k ) if the following conditions are , satisfied" Suppose that S ~ wAG. Then for each lookahead string u lc
there is at most one production B ~ cx such that A ~ B'F and
.
(1) (a) If a = Cfl, C ~ N, then u ~ FIRSTk(fl},d;), and (b) In addition, if C = A, then u is not in FIRST,(d;); (2) If a does not begin with a nonterminal, then u ~ FIRSTk(aT'~). Condition (la) guarantees that the use of the production B ~ Cfl can be uniquely determined once we have seen w, the terminal string derived from C (the left corner) and FIRSTk(fl)'~) (the lookahead string). Condition (lb) ensures that if the nonterminal A is left-recursive (which is possible in an LC grammar), then we can tell after an instance of A has been found whether that instance is the left corner of the production B----~ AT' or the A in the left-sentential form wAa. Condition (2) states that FIRSTk(0~7't~) uniquely determines that the production B ---~ 0~ is to be used next in a left-corner parse after having seen wB, when ~ does not begin with a nonterminal symbol. Note that 0¢ might be e here. For each LC(k) grammar G we can construct a deterministic left corner parsing algorithm that parses an input string recognizing the left corner of each production used bottom-up, and the remainder of the production top-down. Here, we shall outline how such a parser can be constructed for LC(1) grammars. Let G = ( N , ~ , P , S ) be an L C ( 1 ) g r a m m a r . F r o m G we shall construct a left corner parser ~ such that ~r(~) -{(x, n)lx ~ L(G)and n; is a left-corner parse for x}. ~ uses an input tape, a pushdown list and an output tape as does a k-prodictive parser. The set of pushdown symbols is I" = N U ~ w (N x N) u {$}. Initially, the pushdown list contains S$ (with S on top). A single nonterminai or terminal symbol appearing on top of the pushdown list can be interpreted as the current goal to be recognized. When a pushdown symbol is a pair of nonterminals of the form [A, B], we can think of the first component A as the current goal to be recognized and the second component B as a left comer which has just been recognized. For convenience, we shall construct a left-corner parsing table T which is a mapping from F × (~ u {e}) to (F* × (P u {e}) u {pop, accept, error}. This parsing table is similar to a 1-predictive parsing table for an LL(1) grammar. A configuration of M will be a triple (w, X0~, zt), where w represents the remaining input, X0c represents the pushdown list with X e F on top, and 7t is the output to this point. If T(X, a) = (fl, i), X in N u (N x N), then we write (aw, Xoc, rt) ~ (aw, floe, rti). If T(a, a) = pop, then we write (aw, aoc, 70 ~ (w, oc, 70. We say that 7t is a (left-corner) parse of x if (x, S$, e)I ~ (e, $, 70. Let G = (N, E, P, S) be an LC(1) grammar. T is constructed from G as follows: (1) Suppose that B ----~ tz is the ith production in P. (a) If 0c = Cfl, where C is a nonterminal, then T([A, C], a) = (iliA, B], i) for all A ~ N and a ~ F I R S T i(fl)'~) such
that S ~
wAO and A ~
B~,. Here, 12 recognizes left
lc
corners bottom-up. Note that A is either S or not the left corner of some production, so at some point in the parsing A will be a goal. (b) If 0~ does not begin with a nonterminal, then T(A, a) = (tz[A, B], i) for all A ~ N and a ~ FIRSTi(~}'~) such that S ~
wAO and A =-> BT.
lc
(2) T([A, A], a ) = (e, e) for all A ~ N and a ~ FIRSTi(O) such that S ~
wA~.
le
(3) T(a, a) = pop for all a E E. (4) T($, e) = aecept. (5) T(X, a) = error otherwise. Example 5.20
Consider the following grammar G with productions (1) S - - * S + A (3) A --> A . B (5) B---> ( S )
(2) S--~ A (4) A--> B (6) B ~ a
G is an LC(1) grammar. G is, in fact, Go slightly disguised. A leftcorner parsing table for G is shown in Fig. 5.8. The parser using this left-corner parsing table would make the following sequence of moves on input ~ a , a): ((a • a), S$, e) ~---((a • a), (S)[S, B]$, 5) (a • a), ST[S, B]$, 5)
F- (a • a), a[S, B])[S, B]$, 56) l-- (* a), [S, B])[S, B]$, 56) [- (,a), IS, A])[S, B]$, 564) }- (,a), • B[S, A])[S, B]$, 5643)
t- (a), B[S, A])[S, B]$, 5643)
(a), a[B, B][S, hi)IS, B]$, 56436) ~-- (), [B, B][S, A])[S, B]$, 56436) (), [S, A])[S, B]$, 56436) (), IS, S])[S, B]$, 564362)
1-- (),)[S, B]$, 564362) 1-- (e, [S, B]$, 564362) [--- (e, [S, A]$, 5643624) (e, IS, S]$, 56436242) t- (e, $, 56436242)
              a              (              )             +              *              e
   S       a[S,B], 6     (S)[S,B], 5
   A       a[A,B], 6     (S)[A,B], 5
   B       a[B,B], 6     (S)[B,B], 5
   [S,S]                                e, e         +A[S,S], 1                     e, e
   [S,A]                                [S,S], 2     [S,S], 2      *B[S,A], 3       [S,S], 2
   [S,B]                                [S,A], 4     [S,A], 4      [S,A], 4         [S,A], 4
   [A,A]                                e, e         e, e          *B[A,A], 3       e, e
   [A,B]                                [A,A], 4     [A,A], 4      [A,A], 4         [A,A], 4
   [B,B]                                e, e         e, e          e, e             e, e
   a       pop
   (                     pop
   )                                    pop
   +                                                 pop
   *                                                               pop
   $                                                                                accept

Fig. 5.8  Left-corner parsing table for G. (Rows are pushdown symbols, columns are input symbols; blank entries are error.)
The reader can easily verify that 56436242 is the correct left corner parse for (a • a). "5.1.27.
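The left-corner parser described above is, like the predictive parser, just a table-driven loop. The sketch below is our own illustration (the function name and table encoding are assumptions); it implements the move rules of the left-corner parsing table and encodes only the entries of Fig. 5.8 that are exercised by the input (a*a), reproducing the left-corner parse 56436242 of Example 5.20.

    def lc_parse(table, start, w):
        """Drive an LC(1) parsing table.  Entries are (replacement, production-or-e),
        'pop', or 'accept'; pushdown symbols are single characters or (goal, corner)
        pairs; "" stands for the empty lookahead e."""
        stack = ["$", start]
        out = []
        i = 0
        while True:
            X, a = stack[-1], (w[i] if i < len(w) else "")
            entry = table[(X, a)]                 # a missing entry is an error configuration
            if entry == "accept":
                return "".join(out)
            if entry == "pop":
                stack.pop(); i += 1
            else:
                beta, prod = entry
                stack.pop()
                stack.extend(reversed(beta))      # leftmost symbol of beta ends up on top
                if prod:
                    out.append(prod)

    SB, SA, SS, BB = ("S", "B"), ("S", "A"), ("S", "S"), ("B", "B")
    # Entries of the left-corner table for G (Example 5.20) used on input (a*a).
    T = {
        ("S", "("): (["(", "S", ")", SB], "5"),
        ("S", "a"): (["a", SB], "6"),
        ("B", "a"): (["a", BB], "6"),
        (SB, "*"): ([SA], "4"),  (SB, ""): ([SA], "4"),
        (SA, "*"): (["*", "B", SA], "3"),
        (SA, ")"): ([SS], "2"),  (SA, ""): ([SS], "2"),
        (SS, ")"): ([], ""),     (SS, ""): ([], ""),
        (BB, ")"): ([], ""),
        ("(", "("): "pop", (")", ")"): "pop", ("a", "a"): "pop", ("*", "*"): "pop",
        ("$", ""): "accept",
    }
    print(lc_parse(T, "S", "(a*a)"))   # 56436242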
Show that the grammar with productions
S----+ A [ B A ~ aAblO B - > aBbb l 1 is not LC(k) for any k. "5.1.28.
Show that the following grammar is LC(1)"
E----->.E + T [ T T--->" T . F [ F F----> PI" F I P P----~" (E) l a 5.1.29.
Construct a left-corner parser for the grammar in Exercise 5.1.28.
*'5.1.30.
Provide an algorithm which will test whether an arbitrary grammar is LC(1).
*'5.1.31.
Show that every LL(k) grammar is LC (k).
5.1.32.
Give an example of an LC(1) grammar which is not LL.
*'5.1.33.
Show that ff Algorithm 2.14 is applied to a grammar G to put it in Greibach normal form, then the resulting grammar is LL(k) if and only if G is LC(k). Hence the class of LC languages is identical to the class of LL languages.
"5.1.34.
Provide an algorithm to construct a left-corner parser for an arbitrary LC(k) grammar.
Research Problems 5.1.35.
Find transformations which can be used to convert non-LL(k) grammars into equivalent LL(1) grammars.
Programming Exercises 5.1.36.
Write a program that takes as input an arbitrary C F G G and constructs a 1-predictive parsing table for G if G is LL(i).
5.1.37.
Write a program that takes as input a parsing table and an input string and parses the input string using the given parsing table.
5.1.38.
Transform one of the grammars in the Appendix into an LL(1) grammar. Then construct an LL(1) parser for that grammar.
5.1.39.
Write a program that tests whether a grammar is LL(1). Let M be a parsing table for an LL(1) grammar G. Suppose that we are parsing an input string and the parser has reached the configuration (ax, Xoc, ~z). If M(X, a ) = error, we would like to announce that an error occurred at this input position and transfer to an error recovery routine which modifies the contents of the pushdown list and input tape so that parsing can proceed normally. Some possible error recovery strategies are (1) Delete a and try to continue parsing. (2) Replace a by a symbol b such that M(X, b ) ~ error and continue parsing. (3) Insert a symbol b in front of a on the input such that M(X, b) error and continue parsing. This third technique should be used with care since an infinite loop is easily possible. (4) Scan forward on the input until some designated input symbol b is found. Pop symbols from the pushdown list until a symbol X is found such that X : ~ bfl for some ft. Then resume normal parsing. We also might list for each pair (X, a) such that M(X, a) = error, several possible error recovery methods with the most promising method listed first. It is entirely possible that in some situations inser-
tion of a symbol may be the most reasonable course of action, while in other cases deletion or change would be most likely to succeed. 5.1.40.
Devise an error recovery algorithm for the LL(1) parser constructed in Exercise 5.1.38.
BIBLIOGRAPHIC
NOTES
LL(k) grammars were first defined by Lewis and Stearns [1968]. In an early version of that paper, these grammars were called TD(k) grammars, TD being an acronym for top-down. Simple LL(1) grammars were first investigated by Korenjak and Hopcroft [1966], where they were called s-grammars. The theory of LL(k) grammars was extensively developed by Rosenkrantz and Stearns [1970], and the answers to Exercises 5.1.21-5.1.24 can be found there. LL(k) grammars and other versions of deterministic top-down grammars have been considered by Knuth [1967], Kurki-Suonio [1969], Wood [1969a, 1970], and Culik [1968]. P. M. Lewis, R. E. Stearns, and D. J. Rosenkrantz have designed compilers for ALGOL and FORTRAN whose syntax analysis phase is based on an LL(1) parser. Details of the ALGOL compiler are given by Lewis and Rosenkrantz [1971]. This reference also contains an LL(1) grammar for ALGOL 60. LC(k) grammars were first defined by Rosenkrantz and Lewis [1970]. Clues to Exercises 5.1.27-5.1.34 can be found there.
5.2. DETERMINISTIC BOTTOM-UP PARSING
In the previous section, we saw a class of grammars which could be parsed top-down deterministically, while scanning the input from left to right. There is an analogous class of languages that can be parsed deterministically bottom-up, using a left-to-right input scan. These are called LR grammars, and their development closely parallels the development of the LL grammars in the preceding section.

5.2.1.
Deterministic Shift-Reduce Parsing
In Chapter 4 we indicated that bottom-up parsing can proceed in a shiftreduce fashion employing two pushdown lists. Shift-reduce parsing consists of shifting input symbols onto a pushdown list until a handle appears on top of the pushdown list. The handle is then reduced. If no errors occur, this process is repeated until all of the input string is scanned and only the sentence symbol appears on the pushdown list. In Chapter 4 we provided a backtrack algorithm that worked in essentially this fashion, normally making some initially incorrect choices for some handles, but ultimately making the correct choices. In this section we shall consider a large class of grammars for which
this type of parsing can always be done in a deterministic manner. These are the LR(k) grammars the largest class of grammars which can be "naturally" parsed bottom-up using a deterministic pushdown transducer. The L stands for left-to-right scanning of the input, the R for producing a right parse, and k for the number of input "lookahead" symbols. We shall later consider various subclasses of LR(k) grammars, including precedence grammars and bounded-right-context grammars. Let ~x be a right-sentential form in some grammar, and suppose that is either the empty string or ends with a nonterminal symbol. Then we shall call ~ the open portion of 0~x and x the closed portion of 0cx. The boundary between ~ and x is called the border. These definitions of open and closed portion of a right-sentential form should not be confused with the previous definitions of open and closed portion, which were for a left-sentential form. A "shift-reduce" parsing algorithm can be considered a program for an extended deterministic pushdown transducer which parses bottom-up. Given an input string w, the DPDT simulates a rightmost derivation in reverse. Suppose that S
~ - oCo ------~ o~1------- ~ rm
rm
~. O~m =
. . .
W
rm
is a rightmost derivation of w. Each right-sentential form at is stored by the DPDT with the open portion of ~t on the pushdown list and the closed portion as the unexpended input. For example, if 0~ -- o~Ax, then 0~A would be on the pushdown list (with A on top) and x would be the as yet unseanned portion of the original input string. Suppose that 0ct_~--?Bz and that the production B - - . fly is used in the step ~t-1 ~ ~, where yfl = o~A and y z = x. With 0~A on the pushdown rm
list, the PDT will shift some number (possibly none) of the leading symbols of x onto the pushdown list until the right end of the handle of o~t is found. In this case, the string y is shifted onto the pushdown list. Then the PDT must locate the left end of the handle. Once this has been done, the PDT will replace the handle (here fly), which is on top of the pushdown list, b y the appropriate nonterminal (here B) and emit the number of the production B----~ fly. The PDT now has yB on its pushdown list, and the unexpended input is z. These strings are the open and closed portions, respectively, of the right-sentential form ~_ i. Note that the handle of ocAx can never lie entirely within ~, although it could be wholly within x. That is, ~t-1 could be of the form o~AxlBx2, and a production of the form B ---~y, where x l y x 2 = x could be applied to obtain 0~. Since Xl could be arbitrarily long, many shifts may occur before ~t can be reduced to ~_ 1. To sum up, there are three decisions which a shift-reduce parsing
algorithm must make. The first is to determine before each move whether to shift an input symbol onto the pushdown list or to call for a reduction. This decision is really the determination of where the right end of a handle occurs in a right-sentential form. The second and third decisions occur after the right end of a handle is located. Once the handle is known to lie on top of the pushdown list, the left end of the handle must be located within the pushdown list. Then, when the handle has been thus isolated, we must find the appropriate nonterminal by which it is to be replaced. A grammar in which no two distinct productions have the same right side is said to be uniquely invertible (UI) or, alternatively, backwards deterministic. It is not difficult to show that every context-free language is generated by at least one uniquely invertible context-free grammar. If a grammar is uniquely invertible, then once we have isolated the handle of a right-sentential form, there is exactly one nonterminal by which it can be replaced. However, many useful grammars are not uniquely invertible, So in general we must have some mechanism for knowing with which nonterminal to replace a handle. Example 5.21
Let us consider the grammar G with the productions (1) S
~ SaSb
(2) S
~e
Consider the rightmost derivation: S ~
SaSb ~
SaSaSbb ~
SaSabb ~
Saabb ~
aabb
Let us parse the sentence aabb using a pushdown list and a shift-reduce parsing algorithm. We shall use $ as an endmarker for both the input string and the bottom of the pushdown list. We shall describe the shift-reduce parsing algorithm in terms of configurations consisting of triples of the form (~X, x, n), where (1) ~X represents the contents of the pushdown list, with X on top; (2) x is the unexpended input; and (3) n is the output to this point. We can picture this configuration as the configuration of an extended PDT with the state omitted and the pushdown list preceeding the input. In Section 5.3.1 we shall give a formal description of a shift-reduce parsing algorithm. Initially, the algorithm will be in configuration ($, aabb$, e). The algo-
must then recognize that the handle of the right-sentential form aabb is e, occurring at the left end, and that this handle is to be reduced to S. We defer describing the actual mechanism whereby handle recognition occurs. Thus the algorithm must next enter the configuration ($S, aabb$, 2). It will then shift an input symbol onto the pushdown list to enter configuration ($Sa, abb$, 2). Then it will recognize that the handle e is on top of the pushdown list and make a reduction to enter configuration ($SaS, abb$, 22). Continuing in this fashion, the algorithm would make the following sequence of moves:
($, aabb$, e) ⊢ ($S, aabb$, 2)
              ⊢ ($Sa, abb$, 2)
              ⊢ ($SaS, abb$, 22)
              ⊢ ($SaSa, bb$, 22)
              ⊢ ($SaSaS, bb$, 222)
              ⊢ ($SaSaSb, b$, 222)
              ⊢ ($SaS, b$, 2221)
              ⊢ ($SaSb, $, 2221)
              ⊢ ($S, $, 22211)
              ⊢ accept  □
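Since the example deliberately defers the question of how the shift/reduce decisions are made, it may help to see just the bookkeeping in executable form. The following Python sketch (ours, not the book's) replays the same sequence of configurations for aabb; the decisions are supplied as data, and the code only checks that each reduction replaces the right side of the named production.

# A minimal replay of the shift-reduce configurations of Example 5.21.
# The shift/reduce decision sequence is given; the code only verifies that
# every reduction removes the right side of the named production.
PRODUCTIONS = {1: ("S", "SaSb"), 2: ("S", "")}        # production 2 is S -> e

def replay(word, decisions):
    stack, inp, output = "$", word + "$", []
    for move in decisions:
        if move == "shift":
            stack, inp = stack + inp[0], inp[1:]
        else:                                          # move is a production number
            lhs, rhs = PRODUCTIONS[move]
            assert stack.endswith(rhs), "handle must be on top of the pushdown list"
            stack = stack[:len(stack) - len(rhs)] + lhs
            output.append(move)
        print((stack, inp, "".join(map(str, output))))
    assert (stack, inp) == ("$S", "$")                 # accepting configuration
    return output

# The decision sequence used in the text for the sentence aabb.
replay("aabb", [2, "shift", 2, "shift", 2, "shift", 1, "shift", 1])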
5.2.2. LR(k) Grammars
In this section we shall define a large class of grammars for which we can always construct deterministic right parsers. These grammars are the LR(k) grammars. Informally, we say that a grammar is LR(k) if, given a rightmost derivation S = α₀ ⇒rm α₁ ⇒rm α₂ ⇒rm ··· ⇒rm αₘ = z, we can isolate the handle of each right-sentential form and determine which nonterminal is to replace the handle by scanning αᵢ from left to right, but only going at most k symbols past the right end of the handle of αᵢ. Suppose that αᵢ₋₁ = αAw and αᵢ = αβw, where β is the handle of αᵢ. Suppose further that β = X₁X₂ ··· Xᵣ. If the grammar is LR(k), then we can be sure of the following facts:

(1) Knowing αX₁X₂ ··· Xⱼ and the first k symbols of Xⱼ₊₁ ··· Xᵣw, we can be certain that the right end of the handle has not been reached until j = r.
(2) Knowing αβ and at most the first k symbols of w, we can always determine that β is the handle and that β is to be reduced to A.
(3) When αᵢ₋₁ = S, we can signal with certainty that the input string is to be accepted.

Note that in going through the sequence αₘ, αₘ₋₁, ..., α₀, we begin by looking at only FIRSTk(αₘ) = FIRSTk(w). At each step our lookahead string will consist only of k or fewer terminal symbols. We shall now define the term LR(k) grammar. But before we do so, we first introduce the simple concept of an augmented grammar.

DEFINITION

Let G = (N, Σ, P, S) be a CFG. We define the augmented grammar derived from G as G' = (N ∪ {S'}, Σ, P ∪ {S' → S}, S'). The augmented grammar G' is merely G with a new starting production S' → S, where S' is a new start symbol, not in N. We assume that S' → S is the zeroth production in G' and that the other productions of G are numbered 1, 2, ..., p. We add the starting production so that when a reduction using the zeroth production is called for, we can interpret this "reduction" as a signal to accept. We shall now give the precise definition of an LR(k) grammar.

DEFINITION
Let G = (N, Σ, P, S) be a CFG and let G' = (N', Σ, P', S') be its augmented grammar. We say that G is LR(k), k ≥ 0, if the three conditions

(1) S' ⇒rm* αAw ⇒rm αβw in G',
(2) S' ⇒rm* γBx ⇒rm αβy in G', and
(3) FIRSTk(w) = FIRSTk(y)

imply that αAy = γBx. (That is, α = γ, A = B, and x = y.) A grammar is LR if it is LR(k) for some k.

Intuitively this definition says that if αβw and αβy are right-sentential forms of the augmented grammar with FIRSTk(w) = FIRSTk(y) and if A → β is the last production used to derive αβw in a rightmost derivation, then A → β must also be used to reduce αβy to αAy in a right parse. Since A can derive β independently of w, the LR(k) condition says that there is sufficient information in FIRSTk(w) to determine that αβ was derived from αA. Thus there can never be any confusion about how to reduce any right-sentential form of the augmented grammar. In addition, with an LR(k) grammar we will always know whether we should accept the present input string or continue parsing. If the start symbol does not appear on the right side of any production, we can alternatively define an LR(k) grammar G = (N, Σ, P, S) as one in which the three conditions
(1) S ⇒rm* αAw ⇒rm αβw,
(2) S ⇒rm* γBx ⇒rm αβy, and
(3) FIRSTk(w) = FIRSTk(y)

imply that αAy = γBx. The reason we cannot always use this definition is that if the start symbol appears on the right side of some production we may not be able to determine whether we have reached the end of the input string and should accept or whether we should continue parsing.

Example 5.22
Consider the grammar G with the productions

S → Sa | a

If we ignore the restriction against the start symbol appearing on the right side of a production, i.e., use the alternative definition, G would be an LR(0) grammar. However, using the correct definition, G is not LR(0), since the three conditions

(1) S' ⇒rm* S' ⇒rm S,
(2) S' ⇒rm S ⇒rm Sa, and
(3) FIRST0(e) = FIRST0(a) = e

do not imply that S'a = S. Relating this situation to the definition, we would have α = e, β = S, w = e, γ = e, A = S', B = S, x = e, and y = a. The problem here is that in the right-sentential form Sa of G' we cannot determine whether S is the handle of Sa (i.e., whether to accept the input derived from S) looking zero symbols past the S. Intuitively, G should not be an LR(0) grammar, and it is not if we use the first definition. Throughout this book, we shall use the first definition of LR(k)-ness. □

In this section we show that for each LR(k) grammar G = (N, Σ, P, S) we can construct a deterministic right parser which behaves in the following manner. First of all, the parser will be constructed from the augmented grammar G'. The parser will behave very much like the shift-reduce parser introduced in Example 5.21, except that the LR(k) parser will put special information symbols, called LR(k) tables, on the pushdown list above each grammar symbol on the pushdown list. These LR(k) tables will determine whether a shift move or a reduce move is to be made and, in the case of a reduce move, which production is to be used.
Perhaps the best way to describe the behavior of an LR(k) parser is via a running example. Let us consider the grammar G of Example 5.21, which we can verify is an LR(1) grammar. The augmented grammar G' is

(0) S' → S
(1) S → SaSb
(2) S → e

An LR(1) parser for G is displayed in Fig. 5.9.
               Parsing action              Goto
           a        b        e        S        a        b
  T0       2        X        2        T1       X        X
  T1       S        X        A        X        T2       X
  T2       2        2        X        T3       X        X
  T3       S        S        X        X        T4       T5
  T4       2        2        X        T6       X        X
  T5       1        X        1        X        X        X
  T6       S        S        X        X        T4       T7
  T7       1        1        X        X        X        X

Legend:  i = reduce using production i;  S = shift;  A = accept;  X = error.
Fig. 5.9  LR(1) parser for G.

An LR(k) parser for a CFG G is nothing more than a set of rows in a large table, where each row is called an "LR(k) table." One row, here T0, is distinguished as the initial LR(k) table. Each LR(k) table consists of two functions, a parsing action function f and a goto function g:

(1) A parsing action function f takes a string u in Σ*k as argument (this string is called the lookahead string), and the value of f(u) is either shift, reduce i, error, or accept.
(2) A goto function g takes a symbol X in N ∪ Σ as argument and has as value either the name of another LR(k) table or error.

Admittedly, we have not explained how to construct such a parser at this point. The construction is delayed until Sections 5.2.3 and 5.2.4. The LR parser behaves as a shift-reduce parsing algorithm, using a pushdown list, an input tape, and an output buffer. At the start, the pushdown list contains the initial LR(k) table T0 and nothing else. The input tape contains the word to be parsed, and the output buffer is initially empty. If we assume that the input word to be parsed is aabb, then the parser would initially be in configuration (T0, aabb, e). Parsing then proceeds by performing the following algorithm.

ALGORITHM 5.7

LR(k) parsing algorithm.

Input. A set 𝒯 of LR(k) tables for an LR(k) grammar G = (N, Σ, P, S), with T0 ∈ 𝒯 designated as the initial table, and an input string z ∈ Σ*, which is to be parsed.

Output. If z ∈ L(G), the right parse of z. Otherwise, an error indication.

Method. Perform steps (1) and (2) until acceptance occurs or an error is encountered. If acceptance occurs, the string in the output buffer is the right parse of z.

(1) The lookahead string u, consisting of the next k input symbols, is determined.
(2) The parsing action function f of the table on top of the pushdown list is applied to the lookahead string u.
    (a) If f(u) = shift, then the next input symbol, say a, is removed from the input and shifted onto the pushdown list. The goto function g of the table on top of the pushdown list is applied to a to determine the new table to be placed on top of the pushdown list. We then return to step (1). If there is no next input symbol or g(a) is undefined, halt and declare error.
    (b) If f(u) = reduce i and production i is A → α, then 2|α| symbols† are removed from the top of the pushdown list, and production number i is placed in the output buffer. A new table T' is then exposed as the top table of the pushdown list, and the goto function of T' is applied to A to determine the next table to be placed
on top of the pushdown list. We place A and this new table on top of the pushdown list and return to step (1).
    (c) If f(u) = error, we halt parsing (and, in practice, transfer to an error recovery routine).
    (d) If f(u) = accept, we halt and declare the string in the output buffer to be the right parse of the original input string. □

†If α = X₁ ··· Xᵣ, at this point the top of the pushdown list will be of the form T0X₁T₁X₂T₂ ··· XᵣTᵣ. Removing 2|α| symbols removes the handle from the top of the pushdown list along with any intervening LR tables.

Example 5.23
Let us apply Algorithm 5.7 to the initial configuration (T0, aabb, e) using the LR(1) tables of Fig. 5.9. The lookahead string here is a. The parsing action function of T0 on a is reduce 2, where production 2 is S → e. By step (2b), we are to remove 2|e| = 0 symbols from the pushdown list and emit 2. The table on top of the pushdown list after this process is still T0. Since the goto part of table T0 with argument S is T1, we then place ST1 on top of the pushdown list to obtain the configuration (T0ST1, aabb, 2). Let us go through this cycle once more. The lookahead string is still a. The parsing action of T1 on a is shift, so we remove a from the input and place a on the pushdown list. The goto function of T1 on a is T2, so after this step we have reached the configuration (T0ST1aT2, abb, 2). Continuing in this fashion, the LR parser would make the following sequence of moves:
(T0, aabb, e) ⊢ (T0ST1, aabb, 2)
              ⊢ (T0ST1aT2, abb, 2)
              ⊢ (T0ST1aT2ST3, abb, 22)
              ⊢ (T0ST1aT2ST3aT4, bb, 22)
              ⊢ (T0ST1aT2ST3aT4ST6, bb, 222)
              ⊢ (T0ST1aT2ST3aT4ST6bT7, b, 222)
              ⊢ (T0ST1aT2ST3, b, 2221)
              ⊢ (T0ST1aT2ST3bT5, e, 2221)
              ⊢ (T0ST1, e, 22211)

Note that these steps are essentially the same as those of Example 5.21 and that the LR(1) tables explain the way in which choices were made in that example. □

In this section we shall develop the necessary algorithms to be able to automatically construct an LR parser of this form for each LR grammar. In fact, we shall see that a grammar G is LR(k) if and only if it has an LR(k) parser. But first, let us return to the basic definition of an LR(k) grammar and examine some of its consequences.
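As an illustration, here is a small Python sketch (our own encoding, not part of the text) of the loop of Algorithm 5.7 for k = 1, with the parsing action and goto functions of Fig. 5.9 transcribed into dictionaries; running it on aabb reproduces the right parse 22211 obtained above.

# Algorithm 5.7 for k = 1, driven by the tables of Fig. 5.9.
# "s" = shift, an integer = reduce by that production, "acc" = accept,
# an absent entry = error.  "e" plays the role of the empty lookahead.
ACTION = {
    "T0": {"a": 2,   "e": 2},
    "T1": {"a": "s", "e": "acc"},
    "T2": {"a": 2,   "b": 2},
    "T3": {"a": "s", "b": "s"},
    "T4": {"a": 2,   "b": 2},
    "T5": {"a": 1,   "e": 1},
    "T6": {"a": "s", "b": "s"},
    "T7": {"a": 1,   "b": 1},
}
GOTO = {
    "T0": {"S": "T1"}, "T1": {"a": "T2"}, "T2": {"S": "T3"},
    "T3": {"a": "T4", "b": "T5"}, "T4": {"S": "T6"},
    "T6": {"a": "T4", "b": "T7"},
}
RULES = {1: ("S", 4), 2: ("S", 0)}            # left side and length of right side

def lr1_parse(word):                           # word over {a, b}
    stack, pos, output = ["T0"], 0, []         # stack holds T0 X1 T1 X2 T2 ...
    while True:
        look = word[pos] if pos < len(word) else "e"
        act = ACTION[stack[-1]].get(look)
        if act == "acc":
            return output
        if act == "s":                         # shift the symbol and the next table
            stack += [look, GOTO[stack[-1]][look]]
            pos += 1
        elif isinstance(act, int):             # reduce by production act
            lhs, rlen = RULES[act]
            if rlen:
                del stack[-2 * rlen:]          # pop the handle and interleaved tables
            stack += [lhs, GOTO[stack[-1]][lhs]]
            output.append(act)
        else:
            raise SyntaxError("error at position %d" % pos)

print(lr1_parse("aabb"))                       # -> [2, 2, 2, 1, 1], as in Example 5.23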
A proof that Algorithm 5.7 correctly parses an LR(k) grammar requires considerable development of the theory of LR(k) grammars. Let us first verify that our intuitive notions of what a deterministically right-parsable grammar ought to be are in fact implied by the LR(k) definition. Suppose that we are given a right-sentential form αβw of an augmented LR(k) grammar such that αAw ⇒rm αβw. We shall show that by scanning αβ and FIRSTk(w), there can be no confusion as to

(1) The location of the right end of the handle,
(2) The location of the left end of the handle, or
(3) What reduction to make once the handle has been isolated.

(1) Suppose that there were another right-sentential form αβy such that FIRSTk(y) = FIRSTk(w) but that y can be written as y₁y₂y₃, where B → y₂ is a production and αβy₁By₃ is a right-sentential form such that αβy₁By₃ ⇒rm αβy₁y₂y₃. This case is explicitly ruled out by the LR(k) definition. This becomes evident when we let x = y₃ and γB = αβy₁B in that definition. The right end of the handle might also occur before the end of β. That is, there may be another right-sentential form γ₁Bvy such that B → γ₂ is a production, FIRSTk(y) = FIRSTk(w), and γ₁Bvy ⇒rm γ₁γ₂vy = αβy. This case is also ruled out if we let γB = γ₁B and x = vy in the LR(k) definition.

(2) Now suppose that we know where the right end of the handle of a right-sentential form is but that there is confusion about its left end. That is, suppose that αAw and α'A'y are right-sentential forms such that FIRSTk(w) = FIRSTk(y) and αAw ⇒rm αβw and α'A'y ⇒rm α'β'y = αβy. However, the LR(k) condition stipulates that αA = α'A', so that both β = β' and A = A'. Thus the left end of the handle is uniquely specified.

(3) There can be no confusion of type (3), since A = A' above. Thus the nonterminal which is to replace the handle is always uniquely determined.

Let us now give some examples of LR and non-LR grammars.

Example 5.24
Let G₁ be the right-linear grammar having the productions

S → C | D
C → aC | b
D → aD | c

We shall show G₁ to be LR(1).†

†In fact, G₁ is LR(0).
Every (rightmost) derivation in G₁' (the augmented version of G₁) is either of the form

S' ⇒rm S ⇒rm C ⇒rm* aⁱC ⇒rm aⁱb        for i ≥ 0

or of the form

S' ⇒rm S ⇒rm D ⇒rm* aⁱD ⇒rm aⁱc        for i ≥ 0
Let us refer to the LR(1) definition, and suppose that we have derivations S' ⇒rm* αAw ⇒rm αβw and S' ⇒rm* γBx ⇒rm αβy. Then since G₁' is right-linear, we must have w = x = e. If FIRST1(w) = FIRST1(y), then y = e also. We must now show that αA = γB; i.e., α = γ and A = B. Let B → δ be the production applied in going from γBx to αβy. There are three cases to consider.

Case 1: A = S' (i.e., the derivation S' ⇒rm* αAw is trivial). Then α = e and β = S. By the form of derivations in G₁', there is only one way to derive the right-sentential form S, so γ = e and B = S', as was to be shown.

Case 2: A = C. Then β is either aC or b. In the first case, we must have B = C, for only C and S have productions whose right sides end in C. If B = S, then γ = e by the form of derivations in G₁', and γδ could not be αβ. Thus we may conclude that B = C, δ = aC, and γ = α. In the second case (β = b), we must have B = C, because only C has a production ending in b. The conclusion that γ = α and B = A is again immediate.

Case 3: A = D. This case is symmetric to Case 2.
Note that G₁ is not LL. □
Example 5.25
Let G₂ be the left-linear grammar with productions

S → Ab | Bc
A → Aa | e
B → Ba | e
Note that L(G₂) = L(G₁) for G₁ above. However, G₂ is not LR(k) for any k. Suppose that G₂ is LR(k). Consider the two rightmost derivations in the augmented grammar G₂':

S' ⇒rm S ⇒rm* Aaᵏb ⇒rm aᵏb

and

S' ⇒rm S ⇒rm* Baᵏc ⇒rm aᵏc
These two derivations satisfy the hypotheses of the LR(k) definition with α = e, β = e, w = aᵏb, γ = e, and x = y = aᵏc. Since A ≠ B, G₂ is not LR(k). Moreover, this violation of the LR(k) condition holds for any k, so that G₂ is not LR. □

The grammar in Example 5.25 is not uniquely invertible, and although we know where the handle is in any right-sentential form, we do not always know whether to reduce the first handle, which is the empty string, to A or B if we allow ourselves to scan only a finite number of terminal symbols beyond the right end of the handle.

Example 5.26
A situation in which the location of a handle cannot be uniquely determined is found in the grammar G₃ with the productions

S → AB
A → a
B → CD | aE
C → ab
D → bb
E → bba
G₃ is not LR(1). We can see this by considering the two rightmost derivations in the augmented grammar:

S' ⇒rm S ⇒rm AB ⇒rm ACD ⇒rm ACbb ⇒rm Aabbb

and

S' ⇒rm S ⇒rm AB ⇒rm AaE ⇒rm Aabba
380
ONE-PASS NO BACKTRACK PARSING
CHAP. 5
S
u
A
w
Fig. 5.10
1J
P a r s e tree.
In Example 5.25 we would argue that G 2 is not LR(k) because after seeing the first k a's, we cannot determine whether production A ~ e or B ---~ e is to be used to derive the empty string at the beginning of the input. We cannot tell which production is used until we see the last input symbol, b or c. In Chapter 8 we shall try to make rigorous arguments of this type, but although the notion is intuitively appealing, it is rather difficult to formalize. 5.2.3
Implications of the LR(k) Definition
We shall now develop the theory necessary to construct LR(k) parsers. DEFINITION
Suppose that S ~ rm
~ A w =~ ~flw is a rightmost derivation in grammar G. rm
We say that a string ~, is a viable prefix of G if ~, is a prefix of ~fl. That is, ~, is a string which is a prefix of some right-sentential form but which does not extend past the right end of the handle of that right-sentential form. The heart of the LR(k) parser is a set of tables. These are analogous to the LL tables for LL grammars, which told us, given a lookahead string, what production might be applied next. For an LR(k) grammar the tables are associated with viable prefixes. The table associated with viable prefix ~, will tell us, given a lookahead string consisting of the next k input symbols, whether we have reached the right end of the handle. If so, it tells us what the handle is and which production is to be used to reduce the handle. Several problems arise. Since ~, can be arbitrarily long, it is not clear that any finite set of tables will suffice. The LR(k) condition says that we can uniquely determine the handle of a right-sentential form if we know all of the right-sentential form in front of the handle as well as the next k input symbols. Thus it is not obvious that we can always determine the handle by knowing only a fixed amount of information about the string in front of the handle. Moreover, if S ~ rm
~Aw ~
~flw and the question "Can
rm
txflw be derived rightmost by a sequence of productions ending in production p ?" can be answered reasonably, it may not be possible to calculate the tables
SEC. 5.2
DETERMINISTIC BOTTOM-UP PARSING
381
for aA from those for aft in a way that can be "implemented" on a pushdown transducer (or possibly in any other convenient way). Thus we must consider a table that includes enough information to computethe table corresponding to aA from that for aft if it is decided that aAw ==~ apw for an appropriate w. rm
We thus make the following definitions. DEFINITION
Let G = (N, E, P, S) be a CFG. We say that [A ---~ fll • flz, u] is an LR(k) item (for k and G, but we usually omit reference to these parameters when they are understood) if A ---~ fl~fl2 is a production in P and u is in E *k. We say that LR(k) item [A ----~fll • flz, u] is valid for afl~, a viable prefix of G, if there is a derivation S =-~ a A w = , ocfllflzw such that u = FIRSTk(w ). rnl
rIi1
Note that fix may be e and that every viable prefix has at least one valid LR(k) item.
Example 5.27 Consider grammar G1 of Example 5.24. Item [C ---~ a • C, e] is valid for aaa, since there is a derivation S ~
aaC => aaaC. That is, ~ = aa and
rill
rill
w = e in this example. Note the similarity of our definition of item here to that found in the description of Earley's algorithm. There is an interesting relation between the two when Earley's algorithm is applied to an LR(k) grammar. See Exercise 5.2.16. The LR(k) items associated with the viable prefixes of a grammar are the key to understanding how a deterministic right parser for an LR(k) grammar works. In a sense we are primarily interested in LR(k) items of the form [A ~ fl -, u], where the dot is at the right end of the production. These items indicate which productions can be used to reduce right-sententia! forms. The next definition and the following theorem are at the heart of LR(k) parsing. DEFINITION
We define the e-free first function, EFF~(00 as follows (we shall delete the k and/or G when clear)" (1) If a does not begin with a nonterminal, then EFFk(a) = FIRSTk(00. (2) If a begins with a nonterminal, then *
EFFk(a) = (w I there is a derivation a ~ fl ~ wx, rlTl
rm
where B ~ A w x for any nonterminal A), and w = FIRSTk(WX) Thus, EFFk(a ) captures all members of FIRSTk(a ) whose derivation does
not involve replacing a leading nonterminal by e (equivalently, whose rightmost derivation does not use an e-production at the last step, when g begins with a nonterminal). Example 5.28
Consider the grammar G with the productions S
>AB
A
>Bale
B~
> CbIC
C
>cle
FIRST2(S) = {e, a, b, c, ab, ac, ba, ca, cb} EFF2(S ) = {ca, cb}
D
Recall that in Chapter 4 we considered a bottom-up parsing algorithm which would ~not work on grammars having e-productions. For LR(k) parsing we can permit e-productions in the grammar, but we must be careful when we reduce the empty string to a nonterminal. We shall see that using the E F F function we are able to correctly determine when the empty string is the handle to be reduced to a nonterminal. First, however, we introduce a slight revision of the LR(k) definition. The two derivations involved in that definition really play interchangeable roles, and we can therefore assume without loss of generality that the handle of the second derivation is at least as far right as that of the first. LEMMA 5.2 If G = (N, ~, P, S') is an augmented grammar which is not LR(k), then there exist derivations S' ~
gAw ~
rm
t~j3w and S' ~
rm
rm
?Bx ~
?#x = g/3y,
rill
l r,~t >_ Is,at but ?Bx ~ gay.
where FIRSTk(w ) = FIRSTk(y ) and
Proof. We know by the LR(k) definition that we can find derivations satisfying all conditions, except possibly the condition ]76[ > I~Pl. Thus, assume that Iral < I~Pl. We shall show that there is another counterexample to the LR(k) condition, where ?~ plays the role of t~,a in that condition. Since we are given that ?6x = ~fly and i ral < I~/~ !, we find that for some z in E+, we can write gfl = ?6z. Thus we have the derivations S' ~
?Bx ~ rill
?~x, rill
and S' ~
~Aw ~ rill
~pw = ?d~zw rill
Now z was defined so that x - - z y . Since FIRSTk(w ) = FIRSTk(y), it follows that FIRSTk(x ) = FIRST~(zw). The LR(k) condition, if it held, would say that ocAw = 7Bzw. We would have 7Bz = ~A and 7Bzy = o~Ay, using operations of "cancellation" and concatenation, which preserve equality. But zy = x, so we have shown that e~Ay = 7Bx, which we originally assumed to be false. If we relate the two derivations above to the LR(k) condition, we see that they satisfy the conditions of the lemma, when the proper substitutions of string names are made, of course. LR(k) parsing techniques are based on the following theorem. THEOREM 5.9 A grammar G = (N, 1~, P, S) is LR(k) if and only if the following condition holds for each u in ]~.k. Let ~fl be a viable prefix of a right-sentential form ~]~w of the augmented grammar G'. If LR(k) item [A --~ f l . , u] is valid for ~p, then there is no other LR(k) item [Ax ~ fl~ • P2, v] which is valid for ~fl with u in EFFk(flzv ). (Note that f12 may be e.) Proof. Only if: Suppose that [A --~ p . , u] and [A~ ~ p~ • P2, v] are two distinct items valid for 0eft. That is to say, in the augmented grammar S' ~
ocAw ~ rm
S' ~
0~tA~x ~ rill
with FIRSTk(w ) : u
ocpw rill
~P2x
with FIRSTk(x ) = v
rm
and ~p = 0c~p~. Moreover, P2x ~
uy for some y in a (possibly zero step)
rm
derivation in which a leading nonterminal is never replaced by e. We claim that G cannot be LR(k). To see this, we shall examine three cases depending on whether (1) P2 = e, (2) P2 is in E+, or (3) ,02 has a nonterminal. Case 1: If P2 = e, then u = v, and the two derivations are :o
S' ~
ocAw ~ rill
~pw rill
and S' ~
~,AlX ---->. ~ p ~ x rill
rill
where FIRSTk(W) = FIRSTk(x ) = u ---- v. Since the two items are distinct, either A ~ A~ or fl ~ fl~. In either case we have a violation of the LR(k) definition.
Case 2: If f12 = z for some z in E +, then S' ~
aAw ~ rill
aflw riD.
and S' ~
a~A ~x ~ rill
a~fl~zx rm
where aft = a~fl~ and FIRSTk(zx ) = u. But then G is not LR(k), since a A z x cannot be equal to a~Alx if z ~ E+. Case 3: Suppose that f12 contains at least one nonterminal symbol. Then f12 ==~ u~Bu3 ==~ u~u2u3, where u~u2 ~ e, since a leading nonterminal is not rm
rm
to be replaced by e in this derivation. Thus we would have two derivations S' ~
aAw ~
rm
aflw rill
and
rm
I'm
rill
alfllu~Bu3x ~
rm
alfllulu2u3x
such that alfl~ = aft and uluzu3x = uy. The LR(k) definition requires that OCAUlUzU3X = ~xfllulBu3x. That is, aAu~u2 = a~fl~u~B. Substituting aft for a~fl~, we must have Aulu2 = flu~B. But since uluz ~ e, this is impossible. Note that this is the place where the condition that u is in EFFk(fl2v ) is required. If we had replaced E F F by F I R S T in the statement of the theorem, then u~u2 could be e and aAu~u2u3x could then be equal to a~fl~u~Bu3x (if ulu2 = e and fl = e). If: Suppose that G is not LR(k). Then there are two derivations in the augmented grammar
(5.2.1)
S' ~
aAw ~ rm
aflw rm
and (5.2.2)
S' ~
~Bx - - ~ 7~x = afly rill
rill
such that FIRSTk(w ) = FIRSTk(y ) = u, but a A y ~ ~,Bx. Moreover, we can choose these derivations such that aft is as short as possible. By Lemma 5.2, we may assume that ]~,~[ ~ [aft[. Let alA1yx be the last
right-sentential form in the derivation
S' ~
},Bx rm
such that the length of its open portion is no more than lafll + 1. That is,
I~A~ I _ alAy =--~ alfly i'm
i'm
where alfl = aft. Thus el = a and aAy = o~Bx, contrary to the hypothesis that G is not LR(k). [~] The construction of a deterministic right parser for an LR(k) grammar requires knowing how to find all valid LR(k) items for each viable prefix of a right-sentential form. DEFINITION
Let G be a C F G and y a viable prefix of G. We define V~(~,) to be the set of LR(k) items valid for ~, with respect to k and G. We again delete k and/or G if understood. We define ,~ = { a l a -= V~(~') for some viable prefix ? of G} as the collection of the sets of valid LR(k) items for G. ~ contains all sets of LR(k) items which are valid for some viable prefix of G. We shall next present an algorithm for constructing a set of LR(k) items for any sentential form, followed by an algorithm to construct the collection of the sets of valid items for any grammar G.
ALGORITHM 5.8
Construction of V~(?).
Input. C F G G = (N, X, P, S) and 7 in (N u X)*. Output. V~(r). Method. If 7 = X t X 2 " ' " X,, we construct V~(7)by constructing Vk(e), v~(x~), v~(xxx~), . . ., v~(x,x~
...
x.).
(1) We construct Vk(e) as follows" (a) If S ~ ~ is in P, add [S ----, • ~, e] to Vk(e). (b) If [A --~ • B~, u] is in Ve(e) and B ~ fl is in P, then for each x in FIRSTk(~u) add [ B - - , . ,8, x] to Vk(e), provided it is not already there. (c) Repeat step (b) until no more new items can be added to Vk(e). (2) Suppose that we have constructed V,(X1X2... Xi_x), i ~ n. We construct Vk(XtX2 "'" Xi) as follows" (a) If [A---* ~ - X , fl, u] is in Vk(Xx... Xi_,), add [A--* ~Xt" fl, u] to Vk(X, . . . X,). (b) If [A --* ~ • Bfl, u] has been placed in Vk(Xi "'" X3 and B ---, J is in P, then add [B---.. J, x] to V k ( X , . . . X3 for each x in FIRSTk(flu), provided it is not already there. (c) Repeat step (2b) until no more new items can be added to v~(x,
... x,).
El
DEFINITION
The repeated application of step (lb) or (2b) of Algorithm 5.8 to a set of items is called taking the closure of that set. We shall define a function GOTO on sets of items for a grammar G = (N, Z, P, S). If a is a set of items such that ~ = V~(7), where 7 ~ (N U Z)*, then GOTO(~, X) is that ~t' such that ~ t ' = V~(?X), where X ~ (N u Z). In Algorithm 5.8 step (2) computes V k ( X a X 2 . . . X~) = G O T O ( V k ( X ,
X 2 . . . X t _ ,), Xt).
Note that step (2) is really independent of X1 . " X~_ ,, depending only on the set Vk(Xi ... Xt-,) itself. Example 5.29
Let us construct Va(e), V,(S), and Vx(Sa) for the augmented grammar S'
>S
S
> SaSb
S
>e
(Note, however, that Algorithm 5.8 does not require that the grammar be augmented.) We first compute V(e) using step 1 of Algorithm 5.8. In step (la) we add [S' --, • S, e] to V(e). In step (lb) we add [S ----~ • SaSb, e] and [S ~ . , e] to V(e). Since [S ----~ • SaSb, e] is now in V(e), we must also add IS ---~ • SaSb, x] and [S ---~., x] to V(e) for all x in FIRST(aSb) = a. Thus, V(e) contains the following items" IS'
~ • S, e]
[S --->-. SaSb, e/a] [S~
> ., e/a]
Here we have used the shorthand notation [A ~ ~x. ,8, x ~ / x 2 / " ' / x , ] for the set of items [A ----~~ . ,8, x~], [A ---~ 0~ • fl, x 2 ] , . . . , [A ~ ~ . fl, x,]. To obtain V(S), we compute GOTO(V(e), S). F r o m step (2a) we add the three items [S' ---~ S . , e] and IS ---~ S . aSb, e/a] to V(S). Computing the closure adds no new items to V(S), so V(S) is IS'
[S ~
> S . , e]
S . aSb, e/a]
V(Sa) is computed as GOTO(V(S), a). V(Sa) contains the following six items" [S
> Sa • Sb, e/a]
[s
> • SaSh, a/b]
[S = ~ ., a/b]
[[]
We now show that Algorithm 5.8 correctly computes V~(y). THrORnM 5.10 An LR(k) item is in V~(7) after step (2) of Algorithm 5.8 if and only if that item is valid for ?.
Proof. If: It is left to the reader to show that Algorithm 5.8 terminates and correctly computes Vk(e). We shall show that if all and only the valid items for X1 "-- Xt_l are in Vk(XxX2 . " Xt-1), then all and only the valid items for X1 .." X~ are in V~(Xi . . . Xt). Suppose that [A ~ fl~ • ,82, u] is valid for X1 .." X~. Then there exists a derivation
S ==~ o~Aw ~ rill
ocfli[32w such that
0q81 = X~Xz . . . Xt
and
rill
u = FIRSTk(w ). There are two cases to consider. Suppose ,8, = ,8'~X~. Then [A ---~ ,8~ • X,82, u] is valid for X~ . . . Xt_~
and, by the inductive hypothesis, is in Vk(X~ . . . Xt_x). By step (2a) of Algorithm 5.8, [A ---~ ]3'~X~ • ,132, u] is added to Vk(X~ . . . Xt). Suppose that fl~ = e, in which case ~ = X~ . . . Xt. Since S => ~ A w is rm
a rightmost derivation, there is an intermediate step in this derivation in which the last symbol Xt of 0c is introduced. Thus we can write S ~
ogBy
rm
• 'TXfly =-~ ocAw, where ~'~, = X ~ . . . X~_~, and every step in the deririll
rill
vation ~'TXfly ~
~ A w rewrites a nonterminal to the right of the explicitly
rm
shown Xt. Then [B ~ 7 . Xfl, v], where v = FIRSTk(y), is valid for X~ . . . Xt_ ~, and by the inductive hypothesis is in Vk(X~ "'" X~_ ~). By step (2a) of Algorithm 5.5, [B ~ 7X~. ~, v] is added to V k ( g ~ ' ' " Xt). Since @=:,. Aw, we can find a sequence of nonterminals D~, D2,... , D m and strings rm
0 2 , . . . , Om in (N W E)* such that $ begins with D~, A = Din, and production D~ ----~Dr+ ~0~+~ is in P for 1 ~ i ~ m. By repeated application of step (2b), [A ~ • f12, u] is added to V k ( X 1 . . . X~). The detail necessary to show that u is a valid second component of items containing A ~ • f12 is left to the reader. Only if: Suppose that [A ----~fit • f12, u] is added to Vk(X~ . . . Xt). We show by induction on the number of items previously added to Vk(Xt . . . Xt) that this item is valid for X~ . . . X~. The basis, zero items in Vk(X~ "'" Xt), is straightforward. In this case [A --~ 1/~ • f12, u] must be placed in Vk(X~ "'" X~) in step (2a), so fl~ = fl'~X~
and [A ~
fl'~ • Xtfl~, u] is in Vk(X~ " " X~_ ~). Thus, S ~ rm
txAw =-~ o~fl'~X~flzw rill
and 0~fl'~ = X ~ . . . X~_~. Hence, [A --~ i l l . flz, u] is valid for X ~ . . . Xt. For the inductive step, if [A ---~ fl~ • flz, u] is placed in Ve(X~ . . . X~) at step (2a), the argument is the same as for the basis. If this item is added in step (2b), then fl~ = e, and there is an item [B --~ 7 • A$, v] which has been previously added to Vk(Xt "'" X~), with u in FIRST~($v). By the inductive hypothesis [B---~ ~, • A$, v] is valid for X~ . . . X~, so there is a derivation S ~ rm
rE'By ==~ tx'TAJy, where 0¢'7 = X~ . . . X~. Then rm
S =~ X , . . . X , A @ =~ X , . . . X, A z =~ X , . . . X, fl2z, rm
rm
where u : FiRST~(z). Hence [A ~
rm
• ,8~, u] is valid for X~ . . . X i.
Algorithm 5.8 provides a method for constructing the set of LR(k) items valid for any viable prefix. In the construction of a right parser for an LR(k) grammar G we are interested in the sets of items which are valid for all viable prefixes of G, namely the collection of the sets of valid items for G. Since a grammar contains a finite number of productions, the number of sets of
items is also finite, but often very large. If ? is a viable prefix of a right sententiai form ?w, then we shall see that Vk(?) contains all the information about ? needed to continue parsing yw. The following algorithm provides a systematic method for computing the sets of LR(k) items for G. ALGORITHM 5.9
Collection of sets of valid LR(k) items for G. Input. CFG G = (N, X, P, S) and an integer k. Output. S = [t~l(2 = Vk(?), and ? is a viable prefix of G}. Method. Initially S is empty.
(1) Place Vk(e) in g. The set Vk(e) is initially "unmarked." (2) If a set of items a in $ is unmarked, mark 6t by computing, for each X in N U X, GOTO(~t, X). (Algorithm 5.8 can be used here.) If e t ' = GOTO(a, X) is nonempty and is not already in $, then add a ' to $ as an unmarked set of items. (3) Repeat step (2) until all sets of items in $ are marked. D DEFINITION
If G is a CFG, then the collection of sets of valid LR(k) items for its augmented grammar will be called the canonical collection of sets of LR(k) items for G. Note that it is never necessary to compute GOTO(~, S'), as this set of items will always be empty. Example 5.30
Let us compute the canonical collection of sets of LR(1) items for the grammar G whose augmented grammar contains the productions St
=
> S
S
> SaSb
S
>e
We begin by computing tg0 = V(e). (This was done in Example 5.29.) (go:
IS'
> • S, e]
IS
> • SaSb, e/a]
IS
> . , e/a]
We then compute GOTO((g o, X) for all X ~ (S, a, b}. Let GOTO(ao, S) be a~1.
121" [S' [S
> S.,e] > S . aSb, e/a]
GOTO(120, a) and GOTO(a0, b) are both empty, since neither a nor b are viable prefixes of G. Next we must compute GOTO(al, X) for X ~ {S, a, b}. GOTO(~i, S) and GOTO(~i, b) are empty and a; 2 = GOTO(12~, a) is
122" [S
> Sa • Sb, e/a]
[S
> • SaSb, a/b]
[S
> . , a/b]
Continuing, we obtain the following sets of items" t~3 : 124:
[S
> S a S . b, e/a]
[S
> S.
[S
, S a . Sb, a/b]
[S
> . SaSb, a/b]
[S
> ., a/b]
125" [s 126"
127"
aSb, a/b]
[S
~ S a S h . , e/a] > S a S . b, a/b]
[S
> S . aSb, a/b]
[S
> S a S b ., a/b]
The GOTO function is summarized in the following table:

                        Grammar symbol
    Set of items       S        a        b
        𝒜0            𝒜1       -        -
        𝒜1            -        𝒜2       -
        𝒜2            𝒜3       -        -
        𝒜3            -        𝒜4       𝒜5
        𝒜4            𝒜6       -        -
        𝒜5            -        -        -
        𝒜6            -        𝒜4       𝒜7
        𝒜7            -        -        -
Note that GOTO(𝒜, X) will always be empty if all items in 𝒜 have the dot at the right end of the production. Here, 𝒜5 and 𝒜7 are examples of such sets of items. The reader should note the similarity between the GOTO table above and the GOTO function of the LR(1) parser for G in Fig. 5.9. □
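The following Python sketch (our own encoding, not the book's) carries out Algorithms 5.8 and 5.9 for k = 1 on this augmented grammar: closure, GOTO, and the canonical collection. Running it yields eight sets of items, corresponding to 𝒜0 through 𝒜7 above.

# Algorithms 5.8 and 5.9 for k = 1 on the grammar of Example 5.30.
# An item [A -> b1 . b2, u] is the tuple (A, b1+b2, len(b1), u); "" stands for e.
PRODS = [("S'", "S"), ("S", "SaSb"), ("S", "")]        # production 0 is S' -> S
NONTERMS = {"S'", "S"}

def first1(string, first_nt):
    """FIRST1 of a string of grammar symbols."""
    out = set()
    for sym in string:
        f = first_nt[sym] if sym in NONTERMS else {sym}
        out |= f - {""}
        if "" not in f:
            return out
    return out | {""}

def compute_first_nt():
    first_nt = {nt: set() for nt in NONTERMS}
    changed = True
    while changed:
        changed = False
        for lhs, rhs in PRODS:
            f = first1(rhs, first_nt)
            if not f <= first_nt[lhs]:
                first_nt[lhs] |= f
                changed = True
    return first_nt

FIRST_NT = compute_first_nt()

def closure(items):
    """Steps (1b)/(2b) of Algorithm 5.8: close a set of items."""
    items = set(items)
    changed = True
    while changed:
        changed = False
        for (lhs, rhs, dot, look) in list(items):
            if dot < len(rhs) and rhs[dot] in NONTERMS:
                for x in first1(rhs[dot + 1:] + look, FIRST_NT):
                    for plhs, prhs in PRODS:
                        if plhs == rhs[dot] and (plhs, prhs, 0, x) not in items:
                            items.add((plhs, prhs, 0, x))
                            changed = True
    return frozenset(items)

def goto(items, X):
    """GOTO(A, X): advance the dot over X, then take the closure."""
    moved = {(l, r, d + 1, u) for (l, r, d, u) in items if d < len(r) and r[d] == X}
    return closure(moved) if moved else None

def canonical_collection():
    """Algorithm 5.9: the canonical collection of sets of LR(1) items."""
    start = closure({("S'", "S", 0, "")})
    states, work = [start], [start]
    while work:
        state = work.pop()
        for X in ["S", "a", "b"]:                      # GOTO on S' is never needed
            nxt = goto(state, X)
            if nxt is not None and nxt not in states:
                states.append(nxt)
                work.append(nxt)
    return states

STATES = canonical_collection()
print(len(STATES))                                     # -> 8, the sets A0 through A7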
THEOREM 5.11 Algorithm 5.9 correctly determines ~.
Proof. By Theorem 5.10 it sumces to prove that a set of items (~ is placed in S if and only if there exists a derivation S =* ocAw =~ 6¢flw, where 7 is rm
rm
a prefix of 0eft and 6 - - Vk(7). The "only if" portion is a straightforward induction on the order in which the sets of items are placed in S. The "if" portion is a no less straightforward induction on the length of 7. These are both left for the Exercises. [-] 5.2.4.
Testing for the LR(k) Condition
It may be of interest to know whether a particular grammar is LR(k) for some given value of k. We can provide an algorithm based on Theorem 5.9 and Algorithm 5.9.

DEFINITION

Let G = (N, Σ, P, S) be a CFG and k an integer. A set 𝒜 of LR(k) items for G is said to be consistent if no two distinct members of 𝒜 are of the form [A → β·, u] and [B → β₁ · β₂, v], where u is in EFFk(β₂v). (β₂ may be e.)

ALGORITHM 5.10

Test for LR(k)-ness.

Input. A CFG G = (N, Σ, P, S) and an integer k ≥ 0.

Output. "Yes" if G is LR(k); "no" otherwise.

Method.
(1) Using Algorithm 5.9, compute 𝒮, the canonical collection of the sets of LR(k) items for G.
(2) Examine each set of LR(k) items in 𝒮 and determine whether it is consistent.
(3) If all sets in 𝒮 are consistent, output "yes." Otherwise, declare G not to be LR(k) for this particular value of k. □

The correctness of Algorithm 5.10 is merely a restatement of Theorem 5.9.
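Continuing the earlier sketch, the consistency test of Algorithm 5.10 can be added as follows for k = 1. Here EFF1 is computed through the identity EFF1(Xα) = EFF1(X) ⊕1 FIRST1(α) of Exercise 5.2.19; since the empty string never belongs to EFF1 of a nonterminal, EFF1(Xα) collapses to EFF1(X) when X is a nonterminal. The helper names are ours.

# Algorithm 5.10 for k = 1, reusing PRODS, NONTERMS, and STATES defined above.
def eff1_of_nonterminals():
    eff = {nt: set() for nt in NONTERMS}
    changed = True
    while changed:
        changed = False
        for lhs, rhs in PRODS:
            if rhs == "":
                continue                     # e-productions contribute nothing to EFF
            head = rhs[0]
            add = eff[head] if head in NONTERMS else {head}
            if not add <= eff[lhs]:
                eff[lhs] |= add
                changed = True
    return eff

def eff1_of_string(beta, look, eff):
    if beta == "":                           # EFF1 of the lookahead alone
        return {look}
    return eff[beta[0]] if beta[0] in NONTERMS else {beta[0]}

def is_consistent(state, eff):
    complete = [(l, r, u) for (l, r, d, u) in state if d == len(r)]
    for (l, r, u) in complete:
        for (l2, r2, d2, u2) in state:
            if (l2, r2, d2, u2) == (l, r, len(r), u):
                continue                     # an item does not conflict with itself
            if u in eff1_of_string(r2[d2:], u2, eff):
                return False
    return True

eff = eff1_of_nonterminals()
print(all(is_consistent(state, eff) for state in STATES))   # -> True: G is LR(1)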
Example 5.31

Let us test the grammar in Example 5.30 for LR(1)-ness. We have 𝒮 = {𝒜0, ..., 𝒜7}. The only sets of LR(1) items which need to be tested are those that contain a dot at the right end of a production. These sets of items are 𝒜0, 𝒜1, 𝒜2, 𝒜4, 𝒜5, and 𝒜7. Let us consider 𝒜0. In the items [S' → · S, e] and [S → · SaSb, e/a] in 𝒜0, the sets EFF(S), EFF(SaSb), and EFF(SaSba) are all empty, so no violation of consistency with the items [S → ·, e/a] occurs.
Let us consider a 1. Here EFF(aSb)= EFF(aSba)= a, but a is not a lookahead string of the item [S'----~ S . , e]. Therefore, ~1 is consistent. The sets of items ~2 and ~4 are consistent because EFF(Sbx)and EFF(SaSbx) are both empty for all x. The sets of items ~5 and ~7 are clearly consistent. Thus all sets in S are consistent, so we have shown that G is LR(1). D 5.2.5.
Deterministic Right Parsers for
LR(k) Grammars In this section we shall informally describe how a deterministic extended pushdown transducer with k symbol lookahead can be constructed from an LR(k) grammar to act as a right parser for that grammar. We can view the pushdown transducer described earlier as a shift-reduce parsing algorithm which decides on the basis of its state, the top pushdown list entry, and the lookahead string whether to make a shift or a reduction and, in the latter use, what reduction to make. To help make the decisions, the parser will have in every other pushdown list cell an "LR(k) table," which summarizes the parsing information which can be gleaned from a set of items. In particular, if ~ is a prefix of the pushdown string (top is on the right), then the table attached to the rightmost symbol of ~ comes from the set of items Vk(00. The essence of the construction of the right parser, then, is finding the LR(k) table associated with a set of items. DEFINITION
Let G be a C F G and let S be a collection of sets of LR(k) items for G. T(~), the LR(k) table associated with the set of items ~ in S, is a pair of functions ( f , g~. f is called the parsing action function and g the goto
function. (1) f maps E *k to {error, shift, accept} U {reduce i{ i is the number of a production in P, i _~ 1}, where (a) f ( u ) = shift if [A ~ i l l " f12, v] is in ~, f12 ~ e, and u is in EFFk(fl2v). (b) f(u) = reduce i if [A ~ f l . , u] is in ~ and A ---~ fl is production
iinP, i ~ 1. (c) f(e) = accept if [S' --, S . , e] is in ~. (d) f ( u ) = error otherwise. (2) g, the goto function, determines the next applicable table. Some g will be invoked immediately after each shift and reduction. Formally, g maps N U E to the set of tables or the message error, g(X) is the table associated with GOTO(~, X). If GOTO(~, X) is the empty set, then g(X)= error. We should emphasize that by Theorem 5.9, if G is LR(k) and $ is the
canonical collection of sets of LR(k) items for G, then there can be no conflicts between actions specified by rules (la), (1 b), and (l c) above. We say that the table T(~) is associated with a viable prefix ? of G if a
=
DEFINITION
The canonical set of LR(k) tables for an LR(k) grammar G is the pair (~, To), where 5 the set of LR(k) tables associated with the canonical collection of sets of LR(k) items for G. T o is the LR(k) table associated with V~(e). W e shall usually represent a canonical LR(k) parser as a table, of which each row is an LR(k) table. The LR(k) parsing algorithm given as Algorithm 5.7 using the canonical set of LR(k) tables will be called the canonical LR(k)parsing algorithm or canonical LR(k) parser, for short. W e shall now summarize the process of constructing the canonical set of LR(k) tables from an LR(k) grammar. ALGORITHM 5.11
Construction of the canonical set of LR(k) tables from an LR(k) grammar.
Input. An LR(k) grammar G = (N, Σ, P, S).

Output. The canonical set of LR(k) tables for G.

Method.
(1) Construct the augmented grammar G' = (N ∪ {S'}, Σ, P ∪ {S' → S}, S'). S' → S is to be the zeroth production.
(2) From G' construct 𝒮, the canonical collection of sets of valid LR(k) items for G.
(3) Let 𝒯 be the set of LR(k) tables for G, where 𝒯 = {T | T = T(𝒜) for some 𝒜 ∈ 𝒮}. Let T0 = T(𝒜0), where 𝒜0 = Vk(e). □
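Continuing the same sketch, the table T(𝒜) of each set of items can be read off directly from rules (1a)-(1d) and (2) of the definition above; the numbering T0, T1, ... below simply follows the order in which Algorithm 5.9 happened to produce the sets. For this grammar the output agrees with Fig. 5.9, with error entries left implicit.

# Reading off the LR(1) table T(A) from a set of items A, reusing goto, PRODS,
# STATES, eff, and eff1_of_string from the sketches above.
def make_table(state):
    action, goto_part = {}, {}
    for (lhs, rhs, dot, look) in state:
        if dot < len(rhs):                                   # rule (1a): shift
            for u in eff1_of_string(rhs[dot:], look, eff):
                action[u] = "shift"
        elif (lhs, rhs) == ("S'", "S"):                      # rule (1c): accept
            action[""] = "accept"
        else:                                                # rule (1b): reduce i
            action[look] = "reduce %d" % PRODS.index((lhs, rhs))
    for X in ["S", "a", "b"]:                                # rule (2): goto
        nxt = goto(state, X)
        if nxt is not None:
            goto_part[X] = "T%d" % STATES.index(nxt)
    return action, goto_part                                 # absent entries mean error

for i, state in enumerate(STATES):
    print("T%d" % i, make_table(state))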
Example 5.32

Let us construct the canonical set of LR(1) tables for the grammar G whose augmented grammar is

(0) S' → S
(1) S → SaSb
(2) S → e
The canonical collection $ of sets of LR(1) items for G is given in Example 5.30. From S we shall construct the set of LR(k) tables.
Let us construct To = , the table associated with a~0. Since k = 1, the possible lookahead strings are a, b, and e. Since ~o contains the items [S ~ . , e/a], f o ( e ) = fo(a)= reduce 2. From the remaining items in ~to we determine that fo(b)= error [since EFF(S00 is empty]. To compute the GOTO function go, we note that GOTO(a~o, S ) = ~i and GOTO(~o, X) is empty otherwise. If T1 is the name given to T(~i), then go(S)= Tx and go(X) = e r r o r for all other X. We have now completed the computation of To. We can represent To as follows" fo b To
2
X
2
,Sc
go a
b
T1
X
X
Here, 2 represents reduce using production 2, and X represents error. Let us now compute the entries for T~ = (f~, g~). Since [S'--~ S . , e] is in ~t, we have f~(e)= accept. Since IS ~ S . aSb, e/a] is in ~1, f t ( a ) = shift. Note that the lookahead strings in this item have no relevance here. Then, f j ( b ) = error. Since GOTO(t~, a ) = t~2, we let gx(a)= T2, where T~ = T ( ~ ) .
Continuing in this fashion we obtain the set of LR(1) tables given in Fig. 5.9 on p. 374. [[] In Chapter 7 we shall discuss a number of other methods for producing LR(k) parsers from a grammar. These methods often produce parsers much smaller than the canonical LR(k) parser. However, the canonical LR(k) parser has several outstanding features, and these will be used as a yardstick by which other LR(k) parsers will be evaluated. We will mention several features concerning the behavior of the canonical LR(k) parsing algorithm" (1) A simple induction on the number of moves made shows that each table on the pushdown list is associated with the string of grammar symbols to its left. Thus as soon as the first k input symbols of the remaining input are such that no possible suffix could yield a sentence in L(G), the parser will report error. At all times the string of grammar symbols on the pushdown list must be a viable prefix of the grammar. Thus an LR(k) parser announces error at the first possible opportunity in a left to right scan of the input string. (2) Let Tj = (~, gj). If f~(u)= shift and the parser is in configuration (5.2.4)
(ToXaTaXzT z . . . XiT ~, x, tO
then there is an item [B--, fl~ •/~z, v] which is valid for XaX2 . . . Xj, with u ,
in EFF(flzv). Thus by Theorem 5.9, if S' =~ XxX2 . . . Xjuy for some y in rm
Y*, then the right end of the handle of Xa . . . X~uy must occur somewhere to the right of X~.
(3) If f j ( u ) = reduce i in configuration (5.2.4) and production i is A ---~ YiY2 "'" Yr, then the string Xj_r+iX'j_,+z . . . X 1 on the pushdown list in configuration (5.2.4) must be Y1 "'" Yr, since the set of items from which table T~ is constructed contains the item [A ----~ Y1Y2"'" Y, ", u]. Thus in a reduce move the symbols on top of the pushdown list do not need to be examined. It is only necessary to pop 2r symbols from the pushdown list. (4) If f / u ) = accept, then u = e. The pushdown list at this point is ToST, where T is the LR(k) table associated with the set of items containing
[ S ' - . S ., el. (5) A D P D T with an endmarker can be constructed to implement the canonical LR(k) parsing algorithm. Once we realize that we can store the lookahead string in the finite control of the DPDT, it should be evident how an extended D P D T can be constructed to implement Algorithm 5.7, the LR(k) parsing algorithm. We leave the proofs of these observations for the Exercises. They are essentially restatements of the definitions of valid item and LR(k) table. We thus have the following theorem. THEOREM 5.12 The canonical LR(k) parsing algorithm correctly produces a right parse of its input if there is one, and declares "error" otherwise.
Proof. Based on the above observations, it follows immediately by induction on the number of moves made by the parsing algorithm that if e is the string of grammar symbols on its pushdown list and x the unexpended input including the lookahead string, then otx :=~ w, where w is the original input string and ~ is the current output. As a special case, if it accepts w and emits output nR, then S ==~ w. [Z] A proof of the unambiguity of an LR grammar is a simple application of the LR condition. Given two distinct rightmost derivations S ==~ el => .. • rill
rill
=~ e, ==~ w and S ==~ fl~ =~ . . . ==~ tim :=~ W, consider the smallest i such rm
rm
rm
rm
rm
that OCn-t ~ flm-r A violation of the LR(k) definition for any k is immediate. We leave details for the Exercises. It follows that the canonical LR(k) parsing algorithm for an LR(k) grammar G produces a right parse for an input w if and only if w ~ L(G). It may not be completely obvious at first that the canonical LR(k) parser operates in linear time, even when the elementary operations are taken to be its own steps. That such is the case is the next theorem. THEOREM
5.13
The number of steps executed by the canonical LR(k) parsing algorithm in parsing an input of length n is 0(n).
396
ONE-PASS NO BACKTRACK PARSING
CHAP. 5
P r o o f Let us define a C-configuration of the parser as follows"
(1) An initial configuration is a C-configuration. (2) A configuration immediately after a shift move is a C-configuration. (3) A configuration immediately after a reduction which makes the stack shorter than in the previous C-configuration is a C-configuration. In parsing an input of length n the parser can enter at most 2n C-configurations. Let the characteristic of a C-configuration be the sum of the number of grammar symbols on the pushdown list plus twice the number of remaining input symbols. If C 1 and C z are successive C-configurations, then the characteristic of C1 is at least one more than the characteristic of C2. Since the characteristic of the initial configuration is 2n, the parser can enter at most 2n C-configurations. Now it suffices to show that there is a constant c such that the parser can make at most c moves between successive C-configurations. To prove this let us simulate the LR(k) parser by a D P D A which keeps the pushdown list of the algorithm as its own pushdown list. By Theorem 2.22, if the D P D A does not shift an input or reduce the size of its stack in a constant number of moves, then it is in a loop. Hence, the parsing algorithm is also in a loop. But we have observed that the parsing algorithm detects an error if there is no succeeding input that completes a word in L(G). Thus there is some word in L(G) with arbitrarily long rightmost derivations. The unambiguity of LR(k) grammars is contradicted. We conclude that the parsing algorithm enters no loops and that hence the constant c exists. D 5.2.6.
Implementation of LL(k) and LR(k) Parsers
Both the LL(k) and LR(k) parser implementations seem to require placing large tables on the pushdown list. Actually, we can avoid this situation, as follows: (1) Make one copy of each possible table in memory. Then, on the pushdown list, replace the tables by pointers to the tables. (2) Since both the LL(k) tables and LR(k) tables return the names of other tables, we can use pointers to the tables instead of names. We note that the grammar symbols are actually redundant on the pushdown list and in practice would not be written there.
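A small illustration of this storage discipline, in the spirit of the earlier sketches (the concrete table contents are placeholders, not a complete parser): the pushdown list holds only integer indices into one shared array of tables, and grammar symbols are not stored at all.

# One shared copy of each table in memory; the pushdown list stores indices.
TABLES = [
    {"action": {"a": 2, "e": 2}, "goto": {"S": 1}},          # index 0 (cf. T0 of Fig. 5.9)
    {"action": {"a": "s", "e": "acc"}, "goto": {"a": 2}},    # index 1, and so on
]

stack = [0]                      # indices, not copies of tables, and no grammar symbols
top = TABLES[stack[-1]]
stack.append(top["goto"]["S"])   # a "push" stores just the index returned by goto
print(stack)                     # -> [0, 1]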
EXERCISES
5.2.1.
Determine which of the following grammars are LR(1): (a) Go. (b) S - - , AB, A --. OAlt e, B --~ 1B[ 1. (c) S - - . OSI l A, A --~ 1All. (d) S ----~ S + A IA, A ---~ (S) t a(S) l a.
EXERCISES
397
The last grammar generates parenthesized expressions with operator + and with identifiers, denoted a, possibly singly subscripted. 5.2.2.
Which of the grammars of Exercise 5.2.1 are LR(0)?
5.2.3.
Construct the sets of LR(1) tables for those grammars of Exercise 5.2.1 which are LR(1). Do not forget to augment the grammars first.
5.2.4.
Give the sequence of moves made by the LR(1) right parser for Go with input (a + a) • (a + (a + a ) . a).
• 5.2.5.
5.2.6.
Prove or disprove each of the following: (a) Every right-linear grammar is LL. (b) Every right-linear grammar is LR. (c) Every regular grammar is LL. (d) Every regular grammar is LR. (e) Every regular set has an LL(1) grammar. (f) Every regular set has an LR(1) grammar. (g) Every regular set has an LR(0) grammar. Show that every LR grammar is unambiguous.
*5.2.7.
Let G = (N, E, P, S), and define GR = (N, E, PR, S), where PR is P with all right sides reversed. That is, PR = {A ~ 0~RIA ~ t~ is in P]. Give an example to show that GR need not be LR(k) even though G is.
*5.2.8.
Let G = (N, E, P, S) be an arbitrary CFG. Define R~(i, u), for u in E , e and production number i, to be {ocflulS ~
o~Aw ~
[m
u = FIRSTk(w) and production i is A ~ regular.
txflw, where
rlTl
fl}. Show that R~(i, u) is
5.2.9.
Give an alternate parsing algorithm for LR(k) grammars by keeping track of the states of finite automata that recognize R~(i, u) for the various i and u.
'5.2.10.
Show that G is LR(k) if and only if for all ~, fl, u, and v, ~ in R~(i, u) and t~fl in R~(j, v) implies fl = e and i = j.
'5.2.11.
Show that G is LR(k) if and only if G is unambiguous and for all w, x, y, and z in E*, the four conditions S ~
way, A ~
and FIRSTk(y) = FIRSTk(Z) imply that S ~
wAz.
x, S ~
wxz,
*'5.2.12.
Show that it is undecidable whether a CFG is LR(k) for some k.
*'5.2.13.
Show that it is undecidable whether an LR(k) grammar is an LL grammar.
5.2.14.
Show that it is decidable whether an LR(k) grammar is an LL(k) grammar for the same value of k.
"5.2.15.
Show that every e-free CFL is generated by some e-free uniquely invertible CFG.
'5.2.16.
Let G = (N, E, P, S) be a CFG grammar, and w = al, . . . an a string in E". Suppose that when applying Earley's algorithm to G, we find
398
ONE-PASS NO BACKTRACK PARSING
CHAP. 5
item [A--~ 0c-fl, j] (in the sense of Earley's algorithm) on list Ii. . Show that there is a derivation S ==~ y x such that item [A - ~ 0c • fl, u] rm
(in the LR sense) is valid for y, u = FIRSTk(x), and 7 ~
a l . . . a~.
"5.2.17.
Prove the converse of Exercise 5.2.16.
"5.2.18.
Use Exercise 5.2.16 to show that if G is LR(k), then Earley's algorithm with k symbol lookahead (see Exercise 4.2.17) takes linear time and space.
5.2.19.
Let X be any symbol. Show that EFFk(Xa) = EFF~(X) ~
FIRST~(a).
5.2.20.
Use Exercise 5.2.19 to give an efficient algorithm to compute EFF(t~) for any 0~.
5.2.21.
Give formal details to show that cases 1 and 3 of Theorem 5.9 yield violations of the LR(k) condition.
5.2.22.
Complete the proof of Theorem 5.11.
5.2.23.
Prove the correctness of Algorithm 5.10.
5.2.24.
Prove the correctness of Algorithm 5.11 by showing each of the observations following that algorithm. In Chapter 8 we shall prove various results regarding LR grammars. The reader may wish to try his hand at some of them now (Exercises 5.2.25-5.2.28).
• *5.2.25.
Show that every LL(k) grammar is an LR(k) grammar.
• *5.2.26.
Show that every deterministic C F L has an LR(1) grammar.
• 5.2.27.
Show that there exist grammars which are (deterministically) rightparsable but are not LR.
• 5.2.28.
Show that there exist languages which are LR but not LL.
• *5.2.29.
Show that every LC(k) grammar is an LR(k) grammar.
• *5.2.30.
What is the maximum number of sets of valid items an LR(k) grammar can have as a function of the number of grammar symbols, productions, and the length of the longest production ?
"5.2.31.
Let us call an item essential if it has its dot other than at the left end [i.e., it is added in step (2a) of Algorithm 5.8]. Show that other than for the set of items associated with e, and for reductions of the empty string, the definition of the LR(k) table associated with a set of items could have restricted attention to essential items, with no change in the table constructed.
• 5.2.32.
Show that the action of an LR(1) table on symbol a is shift if and only if a appears immediately to the right of a dot in some item in the set from which the table is constructed.
SEC. 5.3
PRECEDENCE GRAMMARS
399
Programming Exercises 5.2.33,
Write a program to test whether an arbitrary grammar is LR(1). Estimate how much time and space your program will require as a function of the size of the input grammar.
5.2.34.
Write a program that uses an LR(1) parsing table as in Fig. 5.9 to parse input strings.
5.2.35.
Write a program that generates an LR(1) parser for an LR(1) grammar.
5.2.36.
Construct an LR(1) parser for a small grammar.
*5.2.37.
Write a program that tests whether an arbitrary set of LR(1) tables forms a valid parser for a given CFG. Suppose that an LR(1) parser is in the configuration (ocT, ax, 7t) and that the parsing action associated with T and a is error. As in LL parsing, at this point we would like to announce error and transfer to an error recovery routine that modifies the input and/or the pushdown list so that the LR(1) parser can continue. As in the LL case we can delete the input symbol, change it, or insert another input symbol depending on which strategy seems most promising for the situation at hand. Leinius [1970] describes a more elaborate strategy in which LR(1) tabIes stored in the pushdown list are consulted.
5.2.38.
Write an LR(1) grammar for a sma11 language. Devise an error recovery procedure to be used in conjunction with an LR(i) parser for this grammar. Evaluate the efficacy of your procedure.
BIBLIOGRAPHIC
NOTES
LR(k) grammars were first defined by Knuth [1965]. Unfortunately, the method given in this section for producing an LR parser will result in very large parsers for grammars of practical interest. In Chapter 7 we shall investigate techniques developed by De Remer [1969], Korenjak [1969], and Aho and Ullman [i971], which can often be used to construct much smaller LR parsers. The LR(k) concept has also been extended to context-sensitive grammars by Walters [1970]. The answer to Exercises 5.2.8-5.2.10 are given by Hopcroft and Ullman [1969]. Exercise 5.2.12 is from Knuth [1965].
5.3. PRECEDENCE GRAMMARS
The class of shift-reduce parsable grammars includes the LR(k) grammars and various subsets of the class of LR(k) grammars. In this section we shall precisely define a shift-reduce parsing algorithm and consider the class of
precedence grammars, an important class of grammars which can be parsed by an easily implemented shift-reduce parsing algorithm.

5.3.1. Formal Shift-Reduce Parsing Algorithms
DEHNITION Let G = (N, ~, P, S) be a C F G in which the productions have been numbered from I to p. A shift-reduce parsing algorithm for G is a pair of functions (2 = (f, g),tt where f is called the shift-reduce function and g the reduce function. These functions are defined as follows:
(1) f maps V* x (~ U [$})* to {shift, reduce, error, accept], where V = N U ~ u {$}, and $ is a new symbol, the endmarker. (2) g maps V* x (E u [$})* to {1, 2 . . . . , p, error], under the constraint that if g(a, w) = i, then the right side of production i is a suffix of a. A shift-reduce parsing algorithm uses a left-to-right input scan and a pushdown list. The function f decides on the basis of what is on the pushdown list and what remains on the input tape whether to shift the current input symbol onto the pushdown list or call for a reduction. If a reduction is called for, then the function g is invoked to decide what reduction to make. We can view the action of a shift-reduce parsing algorithm in terms of configurations which are triples of the form ($Xi "'" Xm, al " " a . $ , p l "'" P.) where (1) $X1 " " Xm represents the pushdown list, with Xm on top. Each X't is in N W E, and $ acts as a bottom of the pushdown list marker. (2) al . " a, is the remaining portion of the original input, a 1 is the current input symbol, and $ acts as a right endmarker for the input. (3) p l . . . Pr is the string of production numbers used to reduce the original input to X~ . . . Xma~ " " a,. We can describe the action of (2 by two relations, t--a-and !-¢-, on configurations. (The subscript (2 will be dropped whenever possible.) (1) If f ( a , aw) = shift, then (a, aw, r~)p- (aa, w, z~) for all a in V*, w in (E U {$})*, and n in { 1 , . . . , p ] * . (2) If f(afl, w) = reduce, g(o~fl, w) = i, and production i is A ~ fl, then (aft, w, n:) ~ (aA, w, n:i). (3) If f ( a , w) = accept, then (a, w, n ) ~ accept. (4) Otherwise, (a, w, 70 ~ error. •lThese functions are not the functions associated with an LR(k) table.
SEC. 5.3
PRECEDENCE GRAMMARS
401
We define 1- to be the union of ~ and ~--.. We then define ~ and ~ to have their usual meanings. We define ~(w) for w ~ ~* to be n if ($, w$, e) t----($S, $, re) 1--- accept, and ~(w) = error if no such n exists. We say that the shift-reduce parsing algorithm is valid for G if (1) L(G) = [wl ~2(w) ~ error}, and (2) If ~(w) = zc, then ~ is a right parse of w. Example 5.33
Let us construct a shift-reduce parsing algorithm ~ = (f, g) for the grammar G with productions (i) S-~- > SaSb (2) S
re
The shift-reduce function f is specified as follows: For all ~ ~ V* and x ~ (~: u [$})*, (1)
f(o~S, ex)
= shift if c ~ {a, b}.
(2) f(ow, dx) = reduce if e ~ {a, b} and d ~ {a, b}.
(3) f ( $, ax) f($, bx)
(4)
(5) (6) (7) (8)
= reduce. = error. f(~zX, $ ) = error for X ~ IS, a}. f(~b, $) = reduce. f($S, $) = accept. f ( $ , $) = error.
The reduce function g is as follows. For all ~ ~ V* and x ~ (~ U [$})*, (1) (2) (3) (4) (5)
g($, ax) = 2 g(oca,cx) = 2 for c ~ {a, b}. g($SaSb, cx)= 1 for c ~ {a, $}. g(ocaSaSb, cx)= 1 for c ~ {a, b}. Otherwise, g(~, x) = error.
Let us parse the input string aabb using ~. The parsing algorithm starts off in the initial configuration ($, aabb$, e) The first move is determined by f ( $ , aabb$), which we see, from the specification o f f , is reduce. To determine the reduction we consult g($, aabb$), which we find is 2. Thus the first move is the reduction
($, aabb$, e) ~ ($S, aabb$, 2)
402
ONE-PASS NO BACKTRACK PARSING
CrIAP. 5
The next move is determined by f($S, aabb$), which is shift. Thus the next move is
($S, aabb$, 2) ~ ($Sa, abb$, 2) Continuing in this fashion, the shift-reduce parsing algorithm e would make the following sequence of moves:
($, aabb$, e) ~ ($S, aabb$, 2) ~- ( $Sa, abb $, 2) ~-- ($SaS, abb$, 22) ($SaSa, bb$, 22) ($SaSaS, bb $, 222) .L ($SaSaSb, b$, 222) _r_ ($SaS, b$, 2221) ($SaSb, $, 2221) .z_ ($S, $, 22211)
accept Thus, e(aabb) = 22211. Clearly, 22211 is the right parse of aabb.
D
In practice, we do not want to look at the entire string in the pushdown list and all of the remaining input string to determine what the next move of a parsing algorithm should be. Usually, we want the shift-reduce function to depend only on the top few symbols on the pushdown list and the next few input symbols. Likewise, we would like the reduce function to depend on only one or two symbols below the left end of the handle and on only one or two of the next input symbols. In the previous example, in fact, we notice that f depends only on the symbol on top of the pushdown list and the next input symbol. The reduce function g depends only on one symbol below the handle and the next input symbol. The LR(k) parser that we constructed in the previous section can be viewed as a "shift-reduce parsing algorithm" in which the pushdown alphabet is augmented with LR(k) tables. Treating the LR(k) parser as a shift-reduce parsing algorithm, the shift-reduce function depends only on the symbol on top of the pushdown list [the current LR(k) table] and the next k input symbols. The reduce function depends on the table immediately below the handle on the pushdown list and on zero input symbols. However, in our present formalism, we may have to look at the entire contents of the stack in order to determine what the top table is. The algorithms to be discussed
SEC. 5.3
PRECEDENCE GRAMMARS
403
subsequently in this chapter will need only information near the top of the stack. We therefore adopt the following convention. CONVENTION
I f f and g are functions of a shift-reduce parsing algorithm and f(0~, w) is defined, then we assume that f(floc, wx) = f(oc, w) for all fl and x, unless otherwise stated. The analogous statement applies to g. 5.3.2,
Simple Precedence Grammars
The simplest class of shift-reduce algorithms are based on "precedence relations." In a precedence grammar the boundaries of the handle of a rightsentential form can be located by consulting certain (precedence) relations that hold among symbols appearing in right-sentential forms. Precedence-oriented parsing techniques were among the first techniques to be used in the construction of parsers for programming languages and a number of variants of precedence grammars have appeared in the literature. We shall discuss left-to-right deterministic precedence parsing in which a right parse is to be produced. In this discussion we shall introduce the following types of precedence grammars: (1) (2) (3) (4) (5)
Simple precedence. Extended precedence. Weak precedence. Mixed strategy precedence. Operator precedence.
The key to precedence parsing is the definition of a precedence relation 3> between grammar symbols such that scanning from left to right a rightsentential form ocflw, of which fl is the handle, the precedence relation 3> is first found to hold between the last symbol of fl and the first symbol of w. If we use a shift-reduce parsing algorithm, then the decision to reduce will occur whenever the precedence relation 3> holds between what is on top of the pushdown list and the first remaining input symbol. If the relation 3> does not hold, then a shift may be called for. Thus the relation .> is used to locate the right end of a handle in a rightsentential form. Location of the left end of the handle and determination of the exact reduction to be made is done in one of several ways, depending on the type of precedence being used. The so-called "simple precedence" parsing technique uses three precedence relations 4 , -~-, and 3> to isolate the handle in a right-sentential form ocflw. If fl is the handle, then the relation ~ or ~-- is to hold between all pairs of symbols in 0~, is to hold between the last symbol of fl and the first symbol of w.
404
ONE=PASS NO BACKTRACK PARSING
CHAP. 5
Thus the handle of a right-sentential form of a simple precedence grammar can be located by scanning the sentential form from left to right until the precedence relation > is first encountered. The left end of the handle is located by scanning backward until the precedence relation .~ holds. The handle is the string between < and 3>. If we assume that the grammar is uniquely invertible, then the handle can be uniquely reduced. This process can be repeated until the input string is either reduced to the sentence symbol or no further reductions are possible. DEHNITION The Wirth-Weber precedence relations for a C F G G = (N, ~, P, S) are defined on N u ~ as follows: (t) We say that X . ~ Y if there exists A ---~ ~XBfl in P such that B ~ YT'. (2) We say that X-~- Y if there exists A --~ ~XYfl in P. (3) 3> is defined on (N U E ) x ~E, since the symbol immediately to the right of a handle in a right-sentential form is always a terminal. We say that X 3> a if A ---~ ~BYfl is in P, B ~ 7,X, and Y *~ a~. Notice that Y will be a in the case Y ~ ad~. In precedence parsing procedures we shall find it convenient to add a left and right endmarker to the input string. We shall use $ as this endmarker + and we assume that $ .~ X for all X such that S ==~ X~ and Y 3> $ for all Y such that S ~ ~ Y. The calculation of Wirth-Weber precedence relations is not hard. We leave it to the reader to devise an algorithm, or he may use the algorithm to calculate extended precedence relations given in Section 5.3.3. DEFINITION A C F G G = (N, E, P, S) which is proper, t which has no e-productions, and in which at most one Wirth-Weber precedence relation exists between any pair of symbols in N u ~ is called a precedence grammar. A precedence grammar which is uniquely invertible is called a simple precedence
grammar. By our usual convention, we define the language generated by a (simple) precedence grammar to be a (simple) precedence language. Example 5.34
Let G have the productions
S
> aSSblc
?Recall that a CFG G is proper if there is no derivation of the form A ~ A, if G has no useless symbols, and if there are no e-productions except possibly S ~ e in which case S does not appear on the right side of any production.
SEC. 5.3
PRECEDENCE GRAMMARS
405
The precedence relations for G, together with the added precedence relations involving the endmarkers, are shown in the precedence matrix of Fig. 5.11. Each entry gives the precedence relations that hold between the symbol labeling the row and the symbol labeling the column. Blank entries are interpreted as error.
•
a
b
c
•>
.>
.>
.>
on the reals, integers, etc. For example, "-- is not usually an equivalence relation; ~Z and .> are not normally transitive, and they may be symmetric or reflexive. Since there is at most one precedence relation in each entry of Fig. 5.1 i, G is a precedence grammar. Moreover, all productions in G have unique right sides, so that G is a simple precedence grammar, and L(G) is a simple precedence language. Let us consider $accb$, a right-sentential form of G delimited by endmark-
406
CHAP. 5
ONE-PASSNO BACKTRACK PARSING
ers. We have $ < a, a < c, and c 3> c. The handle of accb is the first c, so the precedence relations have isolated this handle. We can often represent the relevant information in an n × n precedence matrix by two vectors of dimension n. We shall discuss such representations of precedence matrices in Section 7.1. The following theorem shows that the precedence relation ~ occurs at the beginning of a handle in a right-sentential form, ~ holds between adjacent symbols of a handle, and .> holds at the right end of a handle. This is true for all grammars with no e-productions, but it is only in a precedence grammar that there is at most one precedence relation between any pair of symbols in a viable prefix of a right-sentential form. First we shall show a consequence of a precedence relation holding between two symbols. LEMMA 5.3
Let G = (N, E, P, S) be a proper C F G with no e-productions. (1) I f X < A o r X " AandA~YaisinP, t h e n X - < Y. (2) If A < a, A " a, or A -> a and A ~ a Yis a production, then Y 3> a.
Proof We leave (1) for the Exercises and prove (2). If A < a, then there is a right side fllABfl2 such that B ~ a? for some 7. Since A ~ 0~Y, Y -> a is immediate. If A " a, there is a right side flIAafl2. As we have a ==~ a and A ~ 0~Y, it follows that Y .2> a again. If A .> a, then there is a right side flxBXfl2, where B =* ?A and X *~ aO for some 7 and ~. Since B ~ ?ix Y, we again have the desired conclusion. [Z] THEOREM 5.14 Let G = (N, E, P, S) be a proper C F G with no e-productions. If
ss$ @
...
x
+
Aa,
...
XpXp_~ . . . Xk+aX~,'" X~a~ . . . aq then (1) (2) (3) (4)
For p < i < k, either X~+~ < X~ or Xt+~ "--- X~; Xk+a i ;> 1, Xt+~ --" X~; and X~ > a~.
Proof. The proof will proceed by induction on n. For n = 0, we have $S$ ~
$Xk . . . Xa$. F r o m the definition of the precedence relations we
SEC. 5.3
PRECEDENCE GRAMMARS
407
have $ < X k , X~+, " X ~ f o r k > i ~ I a n d X , > $. Note that X k . . . X 1 cannot be the empty string, since G is assumed to be free of e-productions. For the inductive step suppose that the statement of the theorem is true for n. Now consider a derivation'
$S$~X,,...Xk+IAa,...aq n
X p . . . Xk+lX1, . . .
Xaal ...
Xp . . .
YiXj_,
Xj+,Y,
...
aq ...
X1a I
. . aq
That is, Xj is replaced by Yr "'" Y1 at the last step. Thus, X j _ i , . . . , X, are terminals; the case j = 1 is not ruled out. By the inductive hypothesis, Xj+ a ~ X j or Xj+ 1 "-- Xj. Thus, Xj+ , relation is found, and then scanning back until a a. (c)
f($S,
$) = accept.l"
(d) f ( X , a) = error otherwise. (These rules can be implemented by consulting the precedence matrix itself.) (3) The reduce function g depends only on the string on top of the pushdown list up to one symbol below the handle. The remaining input does not affect g. Thus we define g only on (N U X t,.) {$})* as follows: (a) g(X~+lXkXk_l "'" XI, e) = i if Xk+~ < Xk, Xj+~ " Xj for k > j > 1, and production i is A ~ X k X k - ~ ' ' " Xi. (Note that the reduce function g is only invoked when X~ 3> a, where a is the current input symbol.) (b) g(~z, e) = error, otherwise. [Z] Example 5.35
Let us construct a shift-reduce parsing algorithm ~ = ( f , g) for the grammar G with productions (1) S (2) S ~
> aSSb c
The precedence relations for G are given in Fig. 5.11 on p. 405. We can use the precedence matrix itself for the shift-reduce function f. The reduce function g is as follows" (1) g(XaSSb) = 1 if X ~ IS, a, $]. (2) g(Xc) = 2 if X ~ [S, a, $}. (3) g(00 = error, otherwise. tNote that this rule may take priority over rules (2a) and (2b) when X = S and a = $.
SEC. 5.3
PRECEDENCE GRAMMARS
409
With input accb, ¢~ would make the following sequence of moves"
($, accb $, e) ~ ($a, ccb $, e) (Sac, cb $, e) _r_ ( $aS, cb $, 2) ($aSc, b$, 2) ._r_ ( $aSS, b $, 22) ($aSSb, $, 22) .z_ ($S, $, 221) In configuration (Sac, cb$, e), for example, we have f(c, b ) = reduce and g(ac, e) = 2. Thus
(Sac, cb $, e) R ( $aS, cb $, 2) Let us examine the behavior of a on aeb, an input not in L(G). With acb as input a would make the following moves:
($, acb $, e) ~ ( $a, cb $, e) ~2_ (Sac, b $, e) [--($aS, b$,2)
12- (SaSh, $, 2) R
error
In configuration ($aSb, $, 2), f(b, $) = reduce. Since $ < a and a-~- S ~ b, we can make a reduction only if aSb is the right side of some production. However, no such production exists, so g(aSb, e) = error. In practice we might keep a list of "error productions." Whenever an error is encountered by the g function, we could then consult the list of error productions to see if a reduction by an error production can be made. Other precedence-oriented error recovery techniques are discussed in the bibliographical notes at the end of this section. D THEOREM 5.15 Algorithm 5.12 constructs a valid shift-reduce parsing algorithm for a simple precedence grammar.
Proof. The proof is a straightforward consequence of Theorem 5.14, the unique invertibility property, and the construction in Algorithm 5.12. The details are left for the Exercises. El
410
ONE-PASSNO BACKTRACK PARSING
CHAP. 5
It is interesting to consider the classes of languages which can be generated by precedence grammars and simple precedence grammars. We discover that every CFL without e has a precedence grammar but that not every CFL without e has a simple precedence grammar. Moreover, for every CFL without e we can find an e-free uniquely invertible CFG. Thus, insisting that grammars be both precedence and uniquely invertible diminishes their language-generating capability. Every simple precedence grammar is an LR(1) grammar, but the LR(1) language (a0'l'l i ~ 1} w {b0tl2'l i ~ 1} has no simple precedence grammar, as we shall see in Section 8.3. 5.3.3.
Extended Precedence Grammars
It is possible to extend the definition of the Wirth-Weber precedence relations to pairs of strings rather than pairs of symbols. We shall give a definition of extended precedence relations that relate strings of m symbols to strings of n symbols. Our definition is designed with shift-reduce parsing in mind. Understanding the motivation of extended precedence requires that we recall the two roles of precedence relations in a shift-reduce parsing algorithm" (1) Let ~X be the m symbols on top of the pushdown list (with X on top) and a w the next n input symbols. If ~X < a w or ~ X " aw, then a is to be shifted on top of the pushdown list. If ~ X 3> aw, then a reduction is to be made. (2) Suppose that X~, . . . X 2 X i is the string on the pushdown list and that a 1 . . . a~ is the remaining input string when a reduction is called for (i.e., Xm "-" Xi 3> a 1 . . . a,). If the handle is X k " ' " X1, we then want
for k > j ~ l
and Xm+k''"
Xk+l < Xk ''" Xlal
""
a._k.'t
Thus parsing according to a uniquely invertible extended precedence grammar is similar to parsing according to a simple Wirth--Weber precedence grammar, except that the precedence relation between a pair of symbols X and Y is determined by ~X and Y f l , where ~ is the m -- 1 symbols to the left of X and fl is the n -- 1 symbols to the right of Y. We keep shifting symbols onto the pushdown list until a 3> relation is encountered between the string on top of the pushdown list and the remaining input. We then scan back into the pushdown list over " relations until "tWo assume that X, X r - t . . . X l a l . . . a,_,. is XrX,-1 . . . X,-n+l if r ~ n.
SEC. 5.3
PRECEDENCE GRAMMARS
4'1 1
the first .~ relation is encountered. The handle lies between the relations. This discussion motivates the following definition. DEFINITION
Let G = (N, E, P, S) be a proper C F G with no e-production. We define the (m, n) precedence relations 4 , ~ , and 3> on (N U E U {$})m × (N U ~ U {$})" as follows" Let $r"s$" ~
XpXp_ 1 ...
Xk+ a A a ~ . . .
aq
rm
X~,X~,_ i " " " X k + l X k " " " X l a i . . . aq
be any rightmost derivation. Then, (1) 0c .~ fl if 0c consists of the last m symbols of XpXp_~ . . . Xk+~, and either (a) fl consists of the first n symbols of X k " " X~a~ . . . aq, o r (b) X k is a terminal and fl is in FIRST,(Xk • • • X~a~ . . . aq). (2) 0c ~- fl for all j, k > j ~ 1, such that 0c consists of the last m symbols of X p X p _ ~ . . . X j + ~, and either (a) p consists of the first n symbols of X)Xj_~ . . . X~a~ . . . aq, or (b) X i is a terminal and fl is in FIRST,(XjXj_~ . . . X ~ a ~ . . . aq). (3) X ~ X m _ ~ . . . X~ 3> a~ . . . a,, We say that G is an (m, n) p r e c e d e n c e g r a m m a r if G is a proper C F G with no e-production and the relations ~ , __~__,and 3> are pairwise disjoint. It should be clear from Lemma 5.3 that G is a precedence grammar if and only if G is a (1, i) precedence grammar. The details concerning endmarkers are easy to handle. Whenever n = 1, conditions (lb) and (2b) yield nothing new. We also comment that the disjointness of the portions of
.> Fig. 5.12 (1, 1) precedence relations.
PRECEDENCEGRAMMARS
SEC. 5.3
413
and $01 and 011 (substrings of $0115 of length 3). Consideration of $$S adds $$0. Consideration of $0S adds $00, 00S, and 001. Consideration of 0S1 adds 111, and consideration of 00S adds 000. These are all the members of $. To construct $ from $S$ and i 1 3> 1 from 0S1. The (2,1)precedence relations for G are shown in Fig. 5.13. Strings of length 2 which are not in the domain of ~-, do not appear. S
0
l
$
->
">
OCflw, then every l'Ill rill
substring of ocflu of length m + n is in $, where u = FIRST,(w). That steps (4)-(6) correctly compute is a straightforward consequence of the definitions of these relations. [--] We may show the following theorem, which is the basis of the shift-reduce parser for uniquely invertible (m, n) precedence grammars analogous to that of Algorithm 5.12. THEOREM 5 . 1 7
Let G - - (N, E, P, S) be an arbitrary proper C F G and let m and n be integers. Let
(5.3.1)
l-m
x , x , _ , . . . x , + xAax
...
a,
l-m> XpXp_ 1 . . . Xk +l Xk . . . X l a 1 . . . aq (1) For j such that p -
m ~j
> k, let 0c be the last m symbols of
X~Xp_~ . . . Xj+~ and let fl be the first n symbols of X j X j _ ~ . . . Xaa~ . . . aq.
If fl ~ (E u {$})*, then either ~ OSAltOA1
A-
>1
The precedence relations for G' are shown in Fig. 5.15. S
A
0
1 ,
S
"=
"= =" I ~
YinG. We have shown that if in G' we have X .~ Y and X 3> Y, then in G, either X ~ Y and X -> Y or X--~ Y and X .> Y. Either situation contradicts
422
ONE-PASSNO BACKTRACK PARSING
CHAP. 5
the assumption that G is a weak precedence grammar. Thus, Sala
would be (1, 0)-BRC without the proviso of an augmented grammar. As in Example 5.22 (p. 373), we cannot determine whether to accept S in the rightsentential form Sa without looking ahead one symbol. Thus we do not want G to be considered (1, 0)-BRC. We shall prove later than every (m, k)-BRC grammar is LR(k). However, not every LR(0) grammar is (m, n)-BRC for any m and n, intuitively because the LR definition allows us to use the entire portion of a right-sentential form to the left of the handle to make our parsing decisions, while the BRC condition fimits the portion t o the left of the handle which we may use to m symbols. Both definitions limit the use of the portion to the right of the handle, of course.
Example 5.41 The grammar G i with productions S
> aAc
A
> Abblb
is a (1, 0)-BRC grammar. The right-sentential forms (other than S' and S) are aAb2"c for all n > 0 and ab2"+le for n > 0. The possible handles are aAe, Abb, and b, and in each right-sentential form the handle can be uniquely determined by scanning the sentential form from left to right until aAe or Abb is encountered or b is encountered with an a to its left. Note that neither b in Abb could possibly be a handle by itself, because A or b appears to its left. On the other hand, the grammar G2 with productions S
> aAc
A
> bAb[b
generates the same language but is not even an LR grammar.
D
SEC. 5.4
OTHER CLASSES OF SHIFT-REDUCE PARSABLE GRAMMARS
429
Example 5.42
The grammar G with productions
S
> aAlbB
A ---+ 0A l 1 B----+ 0BI 1 is an LR(0) grammar, but fails to be BRC, since the handle in either of the right-sentential forms a0"l and b0"l is !, but knowing only a fixed number of symbols immediately to the left of 1 is not sufficient to determine whether A ~ 1 or B ~ 1 is to be used to reduce the handle. Formally, we have derivations
smstsn ~
SmaOmASn~
$maOml $"
smsts" zz~ rm
$mbOmB$" ~
SmbOml $"
rill
and rl'n
Referring to the BRC definition, we note that a -- $ma0m, 0~' --- ~' --- $mb0m, fl - - 6 = 1, and y = w - - x -- $n. Then ~ and ~' end in the same m symbols, 0 m; w and y begin with the same n, $"; and [xl < ]y l, but ~'Ay =/=?Bx. (A and B are themselves in the BRC definition.) The grammar with productions
S
> aA [bA
A ---~ 0A I 1 generates the same language and is (0, 0)-BRC.
[-7
Condition (3) in the definition of BRC may at first seem odd. However, it is this condition that guarantees that if, in a right-sentential form a'fly, fl is the leftmost substring which is the right side of some production A ~ fl and the left and right Context of fl in a'fly is correct, then the string a'Ay which results after the reduction will be a right-sentential form. The BRC grammars are related to some of the classes of grammars we have previously considered in this chapter. As mentioned, they are a subset of the LR grammars. The BRC grammars are extended precedence grammars, and every uniquely invertible (m, n) precedence grammar is a BRC grammar. The (1, 1)-BRC grammars include all uniquely invertible weak precedence grammars. We shall prove this relation first. THEOREM 5.20 If G = (N, E, P, S) is a uniquely invertible weak precedence grammar, then it is a (1, 1)-BRC grammar.
430
ONE-PASSNO BACKTRACK PARSING
CHAP. 5
Proof Suppose that we have a violation of the (1, 1)-BRC condition, i.e., a pair of derivations
$S'$ ~
rm
aAw ~
riD.
aflw
and
$s'$ ~= rBx ~
~,ax = a' fly
where a and a' end in the same symbol; w and y begin with the same symbol; and l xl _< l yl, but ~,Bx ~ a'Ay. Since G is weak precedence, by Theorem 5.14 applied to 75x, we encounter the 3> relation first between 5 and x. Applying Theorem 5.14 to ~flw, we encounter 3> between fl and w, and since w and y begin with the same symbol, we encounter 3> between fl and y. Thus, l a'13l >_ I~'~ I. Since we are given Ixl _< i y l, we must have a'fl = 7di and x = y. If we can show that fl = O, we shall have a' = 7- But by unique invertibility, A = B. We would then contradict the hypothesis that 7Bx ~ ~'Ay. If f l ¢ J, then one is a suffix of the other. We consider cases to show
/ / = ,~.
Case 1:]3 = eX5 for some e and X. X is the last symbol of y, and therefore we have X fl and B --~ d~ are the same, then with x = y we conclude 7Bx = oFAy, contrary to hypothesis. But since one of ~6 or eft is a suffix of the other, we have a violation of (1) if A --, fl and B ~ ~ are distinct.
Only if: Given a violation of (1) or (2), a violation of the (m, n)-BRC condition is easy to construct. We leave this part for the Exercises. [Z We can now give a shift-reduce parsing algorithm for BRC grammars. Since it involves knowing the sets ~ ( A ) and 9Z, we shall first discuss how to compute them. ALGORITHM 5.16
Construction of ~m,,(A) and 9Zm...
Input. A proper grammar G = (N, Z, P, S). Output. The sets ~m,,(A) for A ~ N and 9Z. Method. (1) Let I be the length of the longest right side of a production. Compute $, the set of strings 7 such that (a) ] ) , [ - - - - m - q - n + l , or [ ~ , [ < m - q - n - J r - I and y begins with $m; ( b ) ) ' is a substring of ocflu, where ocflw is a right-sentential form with handle fl and u ----- FIRST,(w); and (c) ~, contains at least one nonterminal. We can use a method similar to the first part of Algorithm 5.13 here. (2) For A ~ N, let 5e(A) be the set of (ct, fl, x) such that there is a string ~,ocAxy in $, A --, fl is in P, [~[ = m, and Ix] = n. (3) Let 9Z be the set of (0~, x) such that there exists ~,By in $, B --~ ~ in P, ctx is a substring of ~,Oy, and ~ is within ?O, exclusive of the last symbol of ~,O. Of course, we also require that I x ] = n and that [~[ -- m ÷ / , where l is the longest right side of a production or ~ begins with $" and [~[ <
m-+- l. F-] THEOREM 5.22
Algorithm 5.6 correctly computes ~ ( A ) and 9Z.
Proof Exercise.
[Z]
OTHER CLASSESOF SHIFT-REDUCE PARSABLE GRAMMARS
SEC. 5.4
4:33
ALGORITHM 5.17
Shift-reduce parsing algorithm for BRC grammars.
Input.
An (m,n)-BRC grammar grammar G' = (N', X, P, S').
Output.
G= (N,X,P,S),
with
augmented
et = ( f , g), a shift-reduce parsing algorithm for G.
Method. (1) Let f ( a , w) = shift if (~, w) is in 9Zm... (2) f(a, w) = reduce if ~ = axa2, and (~1, a2, w) is in 5Cm.,(A) for some A, unless A = S', a~ = $, and a2 = S. (3) f($ss, $") = accept. (4) f ( a , w) = error otherwise. (5) g(a, w) = i if we can write a = a~a2, (aa, ~2, w) is in ~(A), and the ith production is A --~ ~2. (6) g(a, w) = error otherwise. D THEOREM 5.23
The shift-reduce parsing algorithm constructed by Algorithm 5.17 is valid for G.
Proof. By Lemma 5.6, there is never any ambiguity in defining f and g. By definition of ~(A), whenever a reduction is made, the string reduced, t~2, is the handle of some string flocloczwz. If it is the handle of some other string fl'oclo~zwz', a violation of the BRC condition is immediate. The only difficult point is ensuring that condition (3), [xl _< lY l, of the BRC definition is satisfied. It is not hard to show that the derivations of floclet2wz and fl'oc~ctzwz' can be made the first and second derivations in the BRC definition in one order or the other, so condition (3) will hold. [Z] The shift-reduce algorithm of Algorithm 5.17 has f and g functions which clearly depend only on bounded portions of the argument strings, although one must look at substrings of varying lengths at different times. Let us discuss the implementation o f f and g together as a decision tree. First, given on the pushdown list and x on the input, one might branch on the first n symbols of x. For each such sequence of n symbols, one might then scan backwards, at each step making the decision whether to proceed further on 0c or announce an error, a shift, or a reduction. If a reduction is called for, we have enough information to tell exactly which production is used, thus incorporating the g function into the decision tree as well. It is also possible to generalize Domolki's algorithm (See the Exercises of Section 4.1.) to make the decisions.
434
ONE-PASSNO BACKTRACKPARSING
CHAP. 5
Example 5.43
Let us consider the grammar G given by
(0) s '
~s
(1) s
, OA
(2) S
,~ 1S
(3) a ~ (4) A
OA 71
G is (1, 0)-BRC. To compute 5C(A), 5C(S), and ~ , we need the set of strings of length 3 or less that can appear in the viable prefix of a rightsentential form and have a nonterminal. These are $S', $S, $0A, SIS, 00A, 11 S, 10A, and substrings thereof. We calculate
5C,,o(S' ) -~ [($, S, e)} ~,,o(S) = {($, OA, e), ($, 1S, e), (1, OA, e), (1, 1S, e)} ~Cx,o(A) = [(0, OA, e), (0, 1, e)} i~ consists of the pairs (a, e), where a is $, $0, $00, 000, $1, $11, $10, 111,100, or 110. The functions f and g are given in Fig. 5.17. By "ending of a" we mean the shortest suffix of a necessary to determine f ( a , e) and to determine g(a, e) if necessary.
Ending of a
f(a, e)
$0A 10A 00A $IS 11S O1 O0 $0 $10
reduce reduce reduce reduce reduce
110 $1 $11 111 $ $S
reduce shift shift shift shift shift shift shift shift accept
Fig. 5.17 Shift-reduce functions.
g(a, e)
SEC. 5.4
OTHER CLASSES OF SHIFT-REDUCE PARSABLE GRAMMARS
435
A decision tree implementing f and g is shown in Fig. 5.18 on p. 436. We omit nodes with labels A and S below level 1. They would all have the outcome error, of course. [~ 5.4.2,
Mixed Strategy Precedence Grammars
Unfortunately, the shift-reduce algorithm of Algorithm 5.17 is rather expensive to implement because of the storage requirements for the f and g functions. We can define a less costly shift-reduce parsable class of grammars by using precedence to locate the right end of the handle and then using local context to both isolate the left end of the handle and determine which nonterminal is to replace the handle. Example 5.44
Consider the (non-UI) weak precedence grammar G with productions S-
> aA l bB
A
> CAIIC1
B
> DBE1 [DE1
C
>0
D-
>0
E
>1
G generates the language [a0"l"ln ~ 1} U (b0"l 2"In > 1], which we shall show in Chapter 8 not to be a simple precedence language. Precedence relations for G are given in Fig. 5.19 (p. 437). Note that G is not uniquely invertible, because 0 appears as the right side in two productions, C --~ 0 and D ~ 0. However, ~1,0(C) = [(a, 0, e), (C, 0, e)} and ~l,0(D) = {(b, 0, e), (D, 0, e)~ Thus, if we have isolated 0 as the handle of a right-sentential form, then the symbol immediately to the left of 0 will determine whether to reduce the 0 to C or to D. Specifically, we reduce 0 to C if that symbol is a or C and we reduce 0 to D if that symbol is b or D. [~] This example suggests the following definition. DEFINITION
Let G = (N, E, P, S) be a proper C F G with no e-productions. We say that G is a (p, q; m, n) mixed strategy precedence (MSP) grammar if (1) The extended (p, q) precedence relation .> is disjoint from the union of the (p, q) precedence relations < and 2__. (2) If A ~ ~fl and B ~ fl are distinct productions in P, then the following three conditions can never be all true simultaneously:
/
°
-
0
L~
,d o .,,.f 0
a0
•~'-~ -
;,,~
1,1
0
436
SEe. 5.4
OTHER CLASSES OF SHIFT-REDUCE PARSABLE GRAMMARS
S
A
B
C
D
E
a
b
0
1
437
$
.>
.>
.>
.>
•>
.>
.>
->
.>
.>
Plemit s , L2
L1 says that if the string on top of the pushdown list is ~ and the current input symbol is a, then replace ~ by fl, emit the string s, move the input head one symbol to the right (indicated by the presence of ,), and go next to statement L2. Thus the parser would enter the configuration (L2, 7fl, x, ns). The symbol a may be e, in which case the current input symbol is not relevant, although if the • is present, an input symbol will be shifted anyway. If statement L1 did not apply, because the top of the pushdown list did not match ~ or the current input symbol was not a, then the statement immediately following L1 on the list of statements must be applied next. Both labels on a statement are optional (although we assume that each statement has a name for use in configurations). If the symbol --~ is missing, then the pushdown list is not to be changed, and there would be no point in having fl ~ e. If the symbol • is missing, then the input head is not to be moved. If the (next label~ is missing, the next statement on the list is always taken. Other possible actions are accept and error. A blank in the action field indicates that no action is to be taken other than the pattern matching and possible reduction. Initially, the parser is in configuration (L, $, w$, e), where w is the input string to be parsed and L is a designated statement. The statements are then serially checked until an applicable statement is found. The various actions specified by this statement are performed, and then control is transferred to the statement specified by the next label. The parser continues until an error or accept action is encountered. The output is valid only when the accept is executed.
SEC. 5.4
OTHER CLASSES OF S H I F T - R E D U C E PARSABLE GRAMMARS
445
We shall discuss production language in the context of shift-reduce parsing, but the reader should bear in mind that top-down parsing algorithms can also be implemented in production language. There, the presumption is that we can write a : a,a2a3 and fl : a,Aa3 or fl : o~,Aa3a if the • is present. Moreover, A ~ a2 is a production of the grammar we are attempting to parse, and the output s is just the number of production A ---~ ~ z. Floyd-Evans productions can be modified to take "semantic routines" as actions. Then, instead of emitting a parse, the parser would perform a syntaxdirected translation, computing the translation of A in terms of the translations of the various components of a2. Feldman [1966] describes such a system. Example 5.47
We shall construct a production language parser for the grammar Go with the productions (I) E--~ E-1- T (3) r---~ T , F (5) F ~ (E)
(2) E ~ T (4) r---~ F (6) F ---~a
The symbol ~ is used as a match for any symbol. It is presumed to represent the same symbol on both sides of the arrow. L11 is the initial statement. L0:
( #
LI:
a # ---~ y #
emit 6
> T~
emit 3 emit 4
L2:
T, F~
~(#
L3:
F..~
T~
L4:
T•
>T*~
E-t-TO L6: T~ L5:
L7:
E -q- #
,L0
,L0
----~ E ~
emit 1
>E~
emit 2
(E) # ---~ F#
emit 5
L9:
$E$
>
accept
$
>
error
Lll:
L7
,L0
>E+#
L8:
L10:
L4
,L2
,LO
The parser would go through the following configurations under input
(a+ a),a"
446
ONE-PASS NO BACKTRACK PARSING
CHAP. 5
[L11, $, (a -t- a) • aS, e] ~- [LO, $(, a + a) • aS, e] R [LO, $(a, + a) • aS, e] R [L1, $(a, + a) • aS, e] R [L2, $(F +, a) • aS, 6] R [L3, $(F +, a) • aS, 6] R [L4, $(T + , a) • aS, 64] R [L5, $(T + , a) • aS, 64] R [L6, $(T ÷, a) • aS, 64] R [L7, $(E ÷ , a) • aS, 642] F- [L0, $(E ÷ a, ) • aS, 642] R ILl, $(E ÷ a, ) • aS, 642] 1- [L2, $(E + F), • aS, 6426] R [L3, $(E + F), • aS, 6426] R [L4, $(E + T), • aS, 64264] r--- [L5, $(E ÷ T), • aS, 64264] R [L7, $(E),, aS, 642641] R [L8, $(E), • aS, 642641] R [L2, $F,, aS, 6426415] l- [L3, SF ,, aS, 6426415] I-- [L4, ST ,, aS, 64264154] R [L0, ST, a, $, 64264154] l--- [L1, ST, a, $, 64264154] R [L2, ST, F$, e, 642641546] 1-- [L4, $T$, e, 6426415463] 1---[L5, $T$, e, 6426415463] R [L6, $T$, e, 6426415463] t-- [L7, $E$, e, 64264154632] R [L8, $E$, e, 64264154632] t- [L9, $E$, e, 64264154632] R accept [Z It should be observed that a Floyd-Evans parser can be simulated by a DPDT. Thus each Floyd-Evans parser recognizes a deterministic CFL. However, the language recognized may not be the language of the grammar
SEC. 5 . 4
OTHER CLASSES OF SHIFT-REDUCE PARSABLE GRAMMARS
447
for which the Floyd-Evans parser is constructing a parse. This phenomenon occurs because the flow of control in the statements may cause certain reductions to be overlooked. Example 5.48
Let G consist of the productions (1) S:
> aS
(2) S
> bS
(3) S
>a
L(G) = (a -t-- b)*a. The following sequence of statements parses words in b*a according to G, but accepts no other words: #
L0: LI:
a
~#1 >S
, [
L4
emit 3
L2:
b
!
L3:
$
[
error
L4:
aS
>S
l
emit 1
L4
L5:
bS
S
[
emit 2
L4
L6:
$S $
L7:
> $S$[
L0
accept I
•
error
With input ba, the parser makes the following moves" [L0, $, ba$, e] }---[L1, $b, aS, e] [L2, $b, aS, e] }- [L0, $b, a$, e] l- [L1, $ba, $, e] [L4, $bS, $, 3] t--- [L5, $bS, $, 3] F-- [L4, $S, $, 32] t-- [L5, $S, $, 32] [L6, $S, $, 32] The input is accepted at statement L6. However, with input aa, the following sequence of moves is made"
448
ONE-PASS NO BACKTRACK PARSING
CHAP. 5
[L0, $, aa$, e] F- [L1, Sa, aS, e] [L4, $S, aS, 3] [L5, $S, aS, 3] 1-- [L6, $S, aS, 3] [L7, $S, aS, 3] An error is declared at L7, even though aa is in L(G). There is nothing mysterious going on in Example 5.48. Production language programs are not tied to the grammar that they are parsing in the way that the other parsing algorithms of this chapter are tied to their grammars. [Z In Chapter 7 we shall provide an algorithm for mechanically generating a Floyd-Evans production language parser for a uniquely invertible weak precedence grammar. 5.4.5.
Chapter Summary
The diagram in Fig. 5.22 gives the hierarchy of grammars encountered in this chapter. All containments in Fig. 5.22 can be shown to be proper. Those inclusions which have not been proved in this chapter are left for the Exercises. The inclusion of LL grammars in the LR grammars is proved in Chapter 8. Insofar as the classes of languages that are generated by these classes of grammars are concerned, we can demonstrate the following results: (1) The class of languages generated by each of the following classes of grammars is precisely the class of deterministic context-free languages" (a) LR. (b) LR(1). (c) BRC. (d) (1, 1)-BRC. (e) MSP. (f) Simple MSP. (g) Uniquely invertible (2, 1) (h) Floyd-Evans parsable. precedence. (2) The class of LL languages is a proper subset of the deterministic CFL's. (3) The uniquely invertible weak precedence grammars generate exactly the simple precedence languages, which are (a) A proper subset of the deterministic CFL's and (b) Incommensurate with the LL languages. (4) The class of languages generated by operator precedence grammars is the same as that generated by the uniquely invertible operator precedence grammars. This class of languages is properly contained in the class of simple precedence languages.
SEC. 5.4
OTHER CLASSES OF SHIFT-REDUCE PARSABLE GRAMMARS
449
Context-free grammars
~ U ~ m b i g u o u s ~ Floyd,-Evans~ " parsable
CFG's I R BRC
Operator precedence LR(I)
!
I MSP
Simple MSP Uniquely invertible extended precedenc/
Uniquely invertible weak precedence Simple precedence Fig. 5.22
Hierarchy of grammars.
Many of these results on languages will be found in Chapter 8. The reader may well ask which class of grammars is best suited for describing programming languages and which parsing technique is best. There is no clear'cut answer to such a question. The simplest class of grammars may often require manipulating a given grammar in order to make it fall into that class. Often the grammar becomes unnatural and unsuitable for use in a syntax-directed translation scheme. The LL(1) grammars are particularly attractive for practical use. For each LL(1) grammar we can find a parser which is small, fast, and produces a left parse, which is advantageous for translation purposes. However, there are some disadvantages. An LL(1) grammar for a given language can be unnatural and difficult to construct. Moreover, not every deterministic CFL has an LL grammar, let alone an LL(1) grammar, as we shall see in Chapter 8. Operator precedence techniques have been used in several compilers, are
450
ONE-PASS NO B A C K T R A C K PARSING
CHAP. 5
easy to implement and work quite efficiently. The (1, 1)-precedence grammars are also easy to parse, but obtaining a (1, 1)-precedence grammar for a language often requires the addition of many single productions of the form A ---~ X to make the precedence relations disjoint. Also, there are many deterministic CFL's for which no uniquely invertible simple or weak precedence grammar exists. The LR(1) technique presented in this chapter closely follows Knuth's original work. The resulting parsers can be extremely large. However, the techniques to be presented in Chapter 7 produce LR(1) parsers whose size and operating speed are competitive with precedence parsers for a wide variety o f programming languages. See Lalonde et al. [1971] for some empirical results. Since the LR(1) grammars embrace a large class of grammars, LR(1) parsing techniques are also attractive. Finally we should point out that it is often possible to improve the performance of any given parsing technique in a specific application. In Chapter 7 we shall discuss some methods which can be used to reduce the size and increase the speed of parsers.
EXERCISES
5.4.1. Give a shift reduce parsing algorithm based on the (1, 0)-BRC technique for G1 of Example 5.41. 5.4.2. Which of the following grammars are (1, 1)-BRC? (a) S--~ aAIB A ---~ 0A1 la B ----~0B1 lb. (b) S --. aA l bB ,4 ~
0AllOl
B ---, 0Bli 01. (c) E----,E ÷ TI E - TIT T---, T . F! T/F! F F---. (E)I-- EIa. DEFINITION
A proper CFG G = (N, Z, P, S) is an (m, n)-bounded context (BC) grammar if the three conditions (1) $mS'$" ~ t~Alyl ==~ 0tlf1171 in the augmented grammar, (2) $'~S'$" ~ ot2A272 ~ ot2f1272 = ~3fllY3 in the augmented grammar, and (3) The last m symbols of etl and ~t3 agree and the first n symbols of 71 and 73 agree imply that ~3A173 = 0~A2~2.
EXERCISES
5.4.3.
451
Show that every (m, n)-BC grammar is an (m, n)-BRC grammar.
5.4.4. Give a shift-reduce parsing algorithm for BC grammars. 5.4.5.
Give an example of a BRC grammar that is not BC.
5.4.6.
Show that every uniquely invertible extended precedence grammar is BRC. 5.4.7. Show that every BRC grammar is extended precedence (not necessarily uniquely invertible, of course). 5.4.8.
For those grammars of Exercise 5.4.2 which are (1, 1)-BRC, give shiftreduce parsing algorithms and implement them with decision trees.
5.4.9.
Prove the "only if" portion of Lemma 5.6.
5.4.10.
Prove Theorem 5.22.
5.4.11.
Which of the following grammars are simple MSP grammars ? (a) Go. (b) S--~ A [ B A ~ 0All01 B~
2B1 !1.
(c) S - - , A I B A - ~ OAllOl
B--~ OB1 I1. (d) S--~ A I B A ~ 0A1 [01 B --~ 01B1 [01.
5.4.12.
Show that every uniquely invertible weak precedence grammar is a simple MSP grammar.
5.4.13.
Is every (m, n; m, n)-MSP grammar an (m, n)-BRC grammar .9
5.4.14.
Prove Theorem 5.24.
5.4.15.
Are the following grammars operator precedence ? (a) The grammar of Exercise 5.4.2(b). (b) S ~ S ~
if B then S else 5' if B then S
S---~ s B---~ B o r b
B---~ b. (c) 5' --~ if B then St else S S --~ if B then S $1 --~ if B then $1 else $1
S--, s S1----). S B---~ B o r b
B--lb. The intention in (b) and (c)is that the terminal symbols are if, then, else, or, b, and s.
452
ONE-PASSNO BACKTRACK PARSING
CHAP. 5
5.4.16.
Give the skeletal grammars for the grammars of Exercise 5.4.15.
5.4.17.
Give shift-reduce parsing functions for those grammars of Exercise 5.4.15 which are operator precedence.
5.4.18.
Prove Theorem 5.25.
5.4.19.
Show that the skeletal grammar G, is uniquely invertible for every operator grammar G.
*5.4.20.
Show that every operator precedence language has an operator precedence grammar with no single productions.
"5.4.21.
Show that every operator precedence language has a uniquely invertible operator precedence grammar.
5.4.22.
Give production language parsers for the grammars of Exercise 5.4.2.
5.4.23.
Show that every BRC grammar has a Floyd-Evans production language parser.
5.4.24.
Show that every LL(k) grammar has a production language parser (generating left parses).
**5.4.25.
Show that it is undecidable whether a grammar is (a) BRC.
(b) BC. (c) MSP. 5.4.26.
Prove that a grammar G = (N, E, P, S) is simple MSP if and only if it is a weak precedence grammar and if A ---~ ~ and B ~ 0~ are in P, A ~ B, then I(A) ~ I(B) = ~ .
*5.4.27.
Suppose we relax the condition on an operator grammar that it be proper and have no e-productions. Show that under this new definition, L is an operator precedence language if and only if L - [e} is an operator precedence language under our definition.
5.4.28.
Extend Domolki's algorithm as presented in the Exercises of Section 4.1 to carry along information on the pushdown list so that it can be used to parse BRC grammars. DEFINITION
We can generalize the idea of operator precedence to utilize for our parsing a set of symbols including all the terminals and, perhaps, some of the nonterminals as well. Let G = (N, ~, P, S) be a proper C F G with no e-production and T a subset of N u E, with E ~ T. Let V denote N u ~. We say that G is a T-canonical grammar if (1) For each right side of a production, say ocXYfl, if X is not in T, then Y is in E, and (2) If A is in T and A =~ ~, then tz has a symbol of T. Thus a E-canonical grammar is the same as an operator grammar. If G is a T-canonical grammar, we say that T is a token set for G.
EXERCISES
453
If G is a T-canonical grammar, we define T-canonical precedence relations, < , --~-, and 3> on T U {$}, as follows: (1) If there is a production A ---~ ~zX[JY?, X and Y are in T, and /~ is either e or in ( V - T), then X ~ Y. +
(2) If A ----~ ~zXBt~ is in P and B ==~ ? YS, where X and Y are in T and 7 is either e or in V - T, then X < Y. (3) Let A ~
oclBoc2Ztx3 be in P, where ~z2 is either e or a symbol
of V -- T. Suppose that Z ~ * flla[32, where fll is either e or in V -- T, and a is in X. (Note that this derivation must be zero steps if tx2 ~ e, by the T-canonical grammar definition.) Suppose also that there is a derivation B~
YiC1J1 =~" 7172C25251 ~
"'" = ~ 71 "'" ? k - l C k - l J k - 1
"'" 01
=~ 71 "'" 7 k X J k "'" 51,
where the C's are replaced at each step, the J ' s are all in {el u V and X is in T. Then we say that X .> a.
T,
(4) If S ~ aXfl and a is in {e) u V - T , then $ < X . I f f l i s i n [el kJ V - - T, then X - > $. N o t e that if T = Z, we have defined the operator precedence relations, and if T = V, we have the W i r t h - W e b e r relations.
Example 5,49 Consider Go, with token set A = {F, a, (,), + , ,]. We find (..-~), since (E) is a right side, and E is not a token. We have -t- < *, since there is a right side E --t--T, and a derivation T ~ T . F, and T is not a token. Also, -4- .> + , since there is a right side E -4- T and a derivation E ~ E + T, and T is not a token. The A-canonical relations for Go are shown in Fig. 5.23. a
.>
.>
">
.>
1. Let the
462
LIMITED BACKTRACK PARSING ALGORITHMS
CHAP. 6
rule for A be A ---~ B C ] D . Suppose that for i = 1 and 2, A ~ (xt ~"y~, ri) was formed by rule (4) from B ~ (ui ~"% tt) and (possibly) C =g (u~ I v~, t~) and/or D ~ (u~' ~"vt", t'/). Then m 1 < n 1, so the inductive hypothesis applies to give ua = u2, vl = v2, and tl = t2. N o w two cases, depending on the value of t t, have to be considered. tt tt tot/ tt Case 1: t~ t2 = f Then since l~ < na, we have u~ = u2, = %, and t'~' = t~'. Since x~ u~, Yi vi, and r~ = t~ for i = 1 and 2 in this case, the desired result follows. Case 2: t 1 -- t~ = s. Then u~v~ ' ' = v~ for i - - 1 and 2. Since k~ < nl, we may conclude that u'a = u'z, v', = v~, ' and t', = t,. ' If t'l = s, then xi = u~u' ? yt = v~, and r,. = s for i - - 1 and 2. We reach the desired conclusion. If t'l = f, the argument proceeds with u,.~t and v,.~t as in case 1. U] It should also be noted that a T D P L p r o g r a m need not have a response to every input. F o r example, any p r o g r a m having the rule S ~ S S / S , where S is the start symbol, will not recognize any sentence (that is, the ~ relation is empty). The notation we have used for a T D P L to this point was designed for ease of presentation. In practical situations it is desirable to use more general rules. F o r this purpose, we now introduce e x t e n d e d T D P L rules and define their meaning in terms of the basic rules" (1) We take the rule A ~ B C to stand for the pair of rules A ~ B C / D and D ---~f, where D is a new symbol. (2) We take the rule A ---~ B / C to stand for the pair of rules A ---~ B D / C and D ---~ e. (3) We take the rule A --~ B to stand for the rules A ~ B C and C ---~ e. (4) We take the rule A ---~ AtA2 " - A,, n > 2, to stand for the set of rules A ~ A1B1, B1 ~ A z B 2 , . . . , B._3 ~ A . _ 2 B . _ 2 , B . _ z ~ A . _ i A . . (5) We take the rule A ~ 0~,/~2/"'/oc., where the oc's are strings of nonterminals, to stand for the set of rules A ~ B~/C 1, C~ ~ B z / C z . . . . . C,_3 ~ B , _ 2 / C , - 2 , C , - z ---" B , _ i / B , , and B1 ~ 0cl, B2 ~ 0cz. . . . , B, ---, 0c,. If n - - 2 , these rules reduce to A - - , B1/Bz, Ba ~ oct, and Bz--~ ~ . F o r 1 < i ~ n if I~] -- 1, we can let B,. be ~ and eliminate the rule B~ ~ ~ . (6) We take rule A ---, ~ / 0 % / . . . / ~ , , where the ~'s are strings of nonterminals and terminals, to stand for the set of rules A ----, ~'~/0~/.../0c',, and X~ ~ a for each terminal a, where ~'~ is 0~t with each terminal a replaced by
xo. Henceforth we shall allow extended rules of this type in T D P L programs. The definitions above provide a mechanical way of constructing an equivalent T D P L p r o g r a m that meets the original definition. These extended rules have natural meanings. For example, if A has the rule A ~ Y1 Y2 "'" Y,, then A succeeds if and only if Y1 succeeds at the
SEC. 6.1
LIMITED BACKTRACK TOP-DOWN PARSING
463
input position where A is called, Y2 succeeds where Y~ left off, Y3 succeeds where Y2 left off, and so forth. Likewise, if A has the rule A ~ ~ 1 / ~ 2 / " " / ~ , , then A succeeds if and only if ~a succeeds where A is called, or if ~ fails, and ~2 succeeds where A is called, and so forth. Example 6.2
Consider the extended TDPL program P = ({E, F, T}, {a, (,), + , ,}, R, S), where R consists of E
>T+E/T
T
>F,T/F
F
~ (E)/a
'Ihe reader should convince himself that L ( P ) = L(Go), where GO is our standard grammar for arithmetic expressions. To convert P to standard form, we first apply rule (6), introducing nonterminals Xo, X~, X~, X+, and X,. The rules become E
> TX+E/T
T
> FX, T/F
F - - + XcEX>/X,, Xa
>a
x< >( x, >) x+ ---~ + X,
>,
By rule (5), the first rule is replaced by E --~ B i / T and B~ ---~ TX+E. By rule (4), B1 ---~ TX+E is replaced by B~ ---~ TB2 and B2 ---~ X+E. Then, these are replaced by B 1 ---~ TBz/D, Bz ~ X+E/D, and D----,f Rule E ~ B i / T is replaced by E---, B aC/T and C ~ e. The entire set of rules constructed is Xa
>a
D-
>f
X(
>(
E----~.B1C/T
X>
>)
B1
X+
> q-
B2 ~
> TB2/D X+E/D
X,
>•
T~
B3C/F
C
>e
B3 ~
FB4/D
B4
> X,T/D
F
>BsC/X~
B5
> X~B6/D
B6 ~ > EX)/D
464
LIMITED BACKTRACK PARSING ALGORITHMS
CHAP. 6
We have simplified the rules by identifying each nonterminal X that has rule X ---~ e with C and each nonterminal Y that has rule Y ----~f with D. Whenever a TDPL program recognizes a sentence w, we can build a "parse tree" for that sentence top-down by tracing through the sequence of T D P L statements executed in recognizing w. The interior nodes of this parse tree correspond to nonterminals which are called during the execution of the program and which report success. ALGORITHM 6.1 Derivation tree from the execution of a TDPL program.
Input. A TDPL program P = (N, E, R, S) and a sentence w in X* such that S ~ (w r', s). output. A derivation tree for w. Method. The heart of the algorithm is a recursive routine buildtree which takes as argument a statement of the form A ~ (x ~'y, s) and builds a tree whose root is labeled A and whose frontier is x. Routine buildtree is initially called with the statement S =%- (w ~', s) as argument. Routine buildtree: Let A ~ (x r"Y, s) be input to the routine. (1) If the rule for A is A ---, a or A ---~ e, then create a node with label A and one direct descendant, labeled a or e, respectively. Halt. (2) If the rule for A is m~A ~ BC/D and we can write x = x tx2 such that ml B =:~ (x 1 r"x2y, s) and c =:~ (x 21 Y, s), create a node labeled A. Execute routine buildtree with argument B ~ (x 1 I xzY, s), and then with argument mi C =:~ (x2 I Y, s). Attach the resulting trees to the node labeled A, so that the roots of the trees resulting from the first and second calls are the left and right direct descendants of the node. Halt. (3) If the rule for A is A ----~B C / D but (2) does not hold, then it must be m3 that D :=~ (x ~'y, s). Call routine buildtree with this argument and make the root of the resulting tree the lone direct descendant of a created node labeled A. Halt. Note that routine buildtree calls itself recursively only with smaller values of m, so Algorithm 6.1 must terminate. Example 6.3
Let us use Algorithm 6.1 to construct a parse tree generated by the TDPL program P of Example 6.1 for the input sentence aba. We initially call routine buildtree with the statement S ~ (aba I, s) as argument. The rule S ~ A B / C succeeds because A and B each succeed,
LIMITEDBACKTRACKTOP-DOWNPARSING
SEC. 6.1
465
recognizing a and ba, respectively. We then call routine buildtree twice, first with argument A =~. (a I, s) and then with argument B ~ (ha I, s). Thus the tree begins as shown in Fig. 6.I(a). A succeeds directly on a, so the node labeled A is given one descendant labeled a. B succeeds because its rule is B ~ CB/A and C and B succeed on b and a, respectively. Thus the tree grows to that in Fig. 6.1 (b).
A/s
S
i
/\
a
C
(a)
B
(b)
s
I/
A
BIB
C
b
A
I
a
(c) Fig. 6.1 Construction from parse tree in TDPL. C succeeds directly on b, so the node labeled C gets one descendant labeled b. B succeeds on a because A succeeds, so B gets a descendant labeled A, and that node gets a descendant labeled a. The entire tree is shown in Fig. 6.1(c). D We note that if the set of TDPL rules is treated as a parsing program, then whenever a nonterminal succeeds, a translation (which may later have to be "canceled")can be produced for the portion of input that it recognizes, in terms of the translations of its "descendants" in the sense of the parse tree just described. This method of translation is similar to the syntax-directed translation for context-free grammars, and we shall have more to say about this in Chapter 9.
466
LIMITEDBACKTRACK PARSING ALGORITHMS
CHAP. 6
It is impossible to "prove" Algorithm 6.1 correct, since, as we mentioned, it itself serves as the constructive definition of a parse tree for a T D P L program. However, it is straightforward to show that the frontier of the parse tree is the input. One shows by induction on the number of calls of routine buildtree that when called with argument A ~ (x ~"y, s) the result is a tree with frontier x. Some of the successful outcomes will be "canceled." That is, if we have + A ~ (xy ~"z, s) and the rule for A is A --~ BC/D, we may find that B ==~ (x I yz, s) but that C *~ (r' yz, f). Then the relationship B ~ (x F"yz, s) and all the successful recognitions that served to build up B ~ (x ~"yz, s) are not really involved in defining the parse tree for the input (except in a negative way). These Successful recognitions are not reflected in the parse tree, nor is routine buiidtree Called with argument B ~ (x I yz, s). But all those successful recognitions which ultimately contribute to the successful recognition of the input are included, i 6.1.2.
TDPL and Deterministic Context-Free Languages
It can be quite difficult to determine what language is defined by a TDPL program. To get some feel for the power of TDPL programs, we shall prove that every deterministic CFL with an endmarker has a TDPL program recognizing it. Moreover, the parse trees for that TDPL program are closely related to the parse trees from the "natural" CFG constructed by Lemma 2.26 from a DPDA for the language. The following lemma will be used to simplify the representation of a DPDA in this section.
LEMMA 6.2
If L = Le(M1) for DPDA M1, then L = Le(M) for some DPDA M which never increases the length of its pushdown list by more than 1 on any single move.
Proof. A proof was requested in Exercise 2.5.3. Informally, replace a move which rewrites Z as X1 ··· Xk, k > 2, by moves which replace Z by YkXk, Yk by Yk-1Xk-1, ..., Y4 by Y3X3, and Y3 by X1X2. The Y's are new pushdown symbols and the pushdown top is on the left. □
LEMMA 6.3
If L is a deterministic CFL and $ is a new symbol, then L$ = Le(M) for some DPDA M.
Proof. The proof is a simple extension of Lemma 2.22. Let L = L(M1). Then M simulates M1, keeping the next input symbol in its finite control. M erases its pushdown list if M1 enters a final state and $ is the next input symbol. No other moves are possible or needed, so M is deterministic. □
THEOREM 6.1
Let M = (Q, Σ, Γ, δ, q0, Z0, F) be a DPDA with Le(M) = L. Then there exists a TDPL program P such that L = L(P).
Proof. Assume that M satisfies Lemma 6.2. We construct P = (N, Σ, R, S), an extended TDPL program, such that L(P) = Le(M). N consists of
(1) The symbol S;
(2) Symbols of the form [qZp], where q and p are in Q, and Z ∈ Γ; and
(3) Symbols of the form [qZp]a, where q, Z, and p are as in (2), and a ∈ Σ.
The intention is that a call of the nonterminal [qZp] will succeed and recognize string w if and only if (q, w, Z) ⊢+ (p, e, e), and [qZp] will fail under all other conditions, including the case where (q, w, Z) ⊢+ (p', e, e) for some p' ≠ p. A call of the nonterminal [qZp]a recognizes a string w if and only if (q, aw, Z) ⊢+ (p, e, e). The rules of P are defined as follows:
(1) The rule for S is S → [q0Z0q0]/[q0Z0q1]/···/[q0Z0qk], where q0, q1, ..., qk are all the states in Q.
(2) If δ(q, e, Z) = (p, e), then the rule for [qZp] is [qZp] → e, and for all p' ≠ p, [qZp'] → f is a rule.
(3) If δ(q, e, Z) = (p, X), then the rule for [qZr] is [qZr] → [pXr] for all r in Q.
(4) If δ(q, e, Z) = (p, XY), then for each r in Q, the rule for [qZr] is [qZr] → [pXq0][q0Yr]/[pXq1][q1Yr]/···/[pXqk][qkYr], where q0, q1, ..., qk are all the states in Q.
(5) If δ(q, e, Z) is empty, let a1, ..., al be the symbols in Σ for which δ(q, a, Z) ≠ ∅. Then for r ∈ Q, the rule for nonterminal [qZr] is [qZr] → a1[qZr]a1/a2[qZr]a2/···/al[qZr]al. If l = 0, the rule is [qZr] → f.
(6) If δ(q, a, Z) = (p, e) for a ∈ Σ, then we have rule [qZp]a → e, and for p' ≠ p, we have rule [qZp']a → f.
(7) If δ(q, a, Z) = (p, X), then for each r ∈ Q, we have a rule of the form [qZr]a → [pXr].
(8) If δ(q, a, Z) = (p, XY), then for each r ∈ Q, we have the rule [qZr]a → [pXq0][q0Yr]/[pXq1][q1Yr]/···/[pXqk][qkYr].
We observe that because M is deterministic, these definitions are consistent; no member of N has more than one rule. We shall now show the following:
(6.1.1) [qZp] ⇒ (w ↑ x, s), for any x, if and only if (q, wx, Z) ⊢+ (p, x, e)
(6.1.2) If (q, wx, Z) ⊢+ (p, x, e), then for all p' ≠ p, [qZp'] ⇒ (↑ wx, f)
We shall prove (6.1.2) and the "if" portion of (6.1.1) simultaneously by induction on the number of moves made by M going from configuration (q, wx, Z) to (p, x, e). If one move is made, then δ(q, a, Z) = (p, e),
where a is e or the first symbol of wx. By rule (2) or rules (5) and (6), [qZp] ⇒ (a ↑ y, s), where ay = wx, and [qZp'] ⇒ (↑ wx, f) for all p' ≠ p. Thus the basis is proved.
Suppose that the result is true for numbers of moves fewer than the number required to go from configuration (q, wx, Z) to (p, x, e). Let w = aw' for some a ∈ Σ ∪ {e}. There are two cases to consider.
Case 1: The first move is (q, wx, Z) ⊢ (q', w'x, X) for some X in Γ. By the inductive hypothesis, [q'Xp] ⇒ (w' ↑ x, s) and [q'Xp'] ⇒ (↑ w'x, f) for p' ≠ p. Thus by rule (3), or rules (5) and (7), we have [qZp] ⇒ (w ↑ x, s) and [qZp'] ⇒ (↑ wx, f) for all p' ≠ p. The extended rules of P should first be translated into rules of the original type to prove these contentions rigorously.
Case 2: For some X and Y in Γ, we have, assuming that w' = yw'', (q, wx, Z) ⊢ (q', w'x, XY) ⊢* (q'', w''x, Y) ⊢* (p, x, e), where the pushdown list always has at least two symbols between configurations (q', w'x, XY) and (q'', w''x, Y). By the inductive hypothesis, [q'Xq''] ⇒ (y ↑ w''x, s) and [q''Yp] ⇒ (w'' ↑ x, s). Also, if p' ≠ q'', then [q'Xp'] ⇒ (↑ w'x, f). Suppose first that a = e. If we examine rule (4) and use the definition of the extended TDPL statements, we see that every sequence of the form [q'Xp'][p'Yp] fails for p' ≠ q''. However, [q'Xq''][q''Yp] succeeds, and so [qZp] ⇒ (w ↑ x, s) as desired. We further note that if p' ≠ p, then [q''Yp'] ⇒ (↑ w''x, f), so that all terms [q'Xp''][p''Yp'] fail. (If p'' ≠ q'', then [q'Xp''] fails, and if p'' = q'', then [p''Yp'] fails.) Thus, [qZp'] ⇒ (↑ wx, f) for p' ≠ p. The case in which a ≠ e is handled similarly, using rules (5) and (8).
We must now show the "only if" portion of (6.1.1). If [qZp] ⇒ (w ↑ x, s), then [qZp] ⇒^n (w ↑ x, s) for some n.† We prove the result by induction on n. If n = 1, then rule (2) must have been used, and the result is elementary. Suppose that it is true for n < n0, and let [qZp] ⇒^n0 (w ↑ x, s).
Case 1: The rule for [qZp] is [qZp] → [q'Xp]. Then δ(q, e, Z) = (q', X), and [q'Xp] ⇒^n1 (w ↑ x, s), where n1 < n0. By the inductive hypothesis, (q', wx, X) ⊢+ (p, x, e). Thus, (q, wx, Z) ⊢+ (p, x, e).
Case 2: The rule for [qZp] is [qZp] → [q'Xq0][q0Yp]/···/[q'Xqk][qkYp]. Then we can write w = w'w'' such that for some p', [q'Xp'] ⇒^n1 (w' ↑ w''x, s) and [p'Yp] ⇒^n2 (w'' ↑ x, s), where n1 and n2 are less than n0. By hypothesis, (q', w'w''x, XY) ⊢* (p', w''x, Y) ⊢* (p, x, e). By rule (4), δ(q, e, Z) = (q', XY). Thus, (q, wx, Z) ⊢+ (p, x, e).
†The step counting must be performed by converting the extended rules to the original form.
Case 3: The rule for [qZp] is defined by rule (5). That is, δ(q, e, Z) = ∅. Then it is not possible that w = e, so let w = aw'. If the rule for nonterminal [qZp]a is [qZp]a → e, we know that δ(q, a, Z) = (p, e), so w' = e, w = a, and (q, w, Z) ⊢ (p, e, e). The situations in which the rule for [qZp]a is defined by (7) or (8) are handled analogously to cases 1 and 2, respectively. We omit these considerations.
To complete the proof of the theorem, we note that S ⇒ (w ↑, s) if and only if for some p, [q0Z0p] ⇒ (w ↑, s). By (6.1.1), [q0Z0p] ⇒ (w ↑, s) if and only if (q0, w, Z0) ⊢+ (p, e, e). Thus, L(P) = Le(M). □
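The case analysis of Theorem 6.1 is mechanical enough to automate. The following Python sketch is my own illustration, not the book's: the dictionary-based encoding of the DPDA and all names are assumptions, and the extended TDPL rules are simply emitted as strings, writing [qZp]_a for the subscripted nonterminals.

```python
# Sketch of the rule construction in Theorem 6.1.  delta maps (q, a, Z) to
# (p, push), with a == "" for e-moves and push a string over Gamma of length
# at most 2 (the form guaranteed by Lemma 6.2).  Alternatives are joined by "/".

def nt(q, Z, p, a=""):
    return f"[{q}{Z}{p}]" + (f"_{a}" if a else "")

def body(states, delta, q, Z, r, a):
    """Right side for [qZr] (a == '') or [qZr]_a, by cases (2)-(4) and (6)-(8)."""
    p, push = delta[(q, a, Z)]
    if push == "":                                   # pop move: e if r == p, else f
        return "e" if r == p else "f"
    if len(push) == 1:                               # replace the top symbol
        return nt(p, push, r)
    X, Y = push[0], push[1]                          # push two: sum over middle states
    return " / ".join(nt(p, X, s) + nt(s, Y, r) for s in states)

def dpda_to_tdpl(states, sigma, delta, q0, Z0):
    rules = {"S": " / ".join(nt(q0, Z0, p) for p in states)}          # rule (1)
    gammas = {Z for (_, _, Z) in delta}
    for q in states:
        for Z in gammas:
            for r in states:
                if (q, "", Z) in delta:                               # rules (2)-(4)
                    rules[nt(q, Z, r)] = body(states, delta, q, Z, r, "")
                else:                                                 # rule (5)
                    alts = [a + nt(q, Z, r, a) for a in sigma if (q, a, Z) in delta]
                    rules[nt(q, Z, r)] = " / ".join(alts) if alts else "f"
                    for a in sigma:
                        if (q, a, Z) in delta:                        # rules (6)-(8)
                            rules[nt(q, Z, r, a)] = body(states, delta, q, Z, r, a)
    return rules
```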
COROLLARY
If L is a deterministic CFL and $ is a new symbol, then L$ is a TDPL language.
Proof. From Lemma 6.3. □

6.1.3. A Generalization of TDPL
We note that if we have a statement A → BC/D in TDPL, then D is called if either B or C fails. There is no way to cause the flow of control to differ in the cases in which B fails or in which B succeeds and C fails. To overcome this defect we shall define another parsing language, which we call GTDPL (generalized TDPL). A program in GTDPL consists of a sequence of statements of one of the forms
(1) A → B[C, D]
(2) A → a
(3) A → e
(4) A → f
The intuitive meaning of the statement A → B[C, D] is that if A is called, it calls B. If B succeeds, C is called. If B fails, D is called at the point on the input where A was called. The outcome of A is the outcome of C or D, whichever gets called. Note that this arrangement differs from that of the TDPL statement A → BC/D, where D gets called if B succeeds but C fails. Statements of types (2), (3), and (4) have the same meaning as in TDPL. We formalize the meaning of GTDPL programs as follows.
DEFINITION
A GTDPL program is a 4-tuple P = (N, Σ, R, S), where N, Σ, and S are as for a TDPL program and R is a list of rules of the forms A → B[C, D], A → a, or A → f. Here, A, B, C, and D are in N, a is in Σ ∪ {e}, and f is the failure metasymbol, as in the TDPL program. There is at most one rule with any particular A to the left of the arrow. We define relations ⇒^m as for the TDPL program:
(1) If A has rule A → a, for a in Σ ∪ {e}, then A ⇒^1 (a ↑ w, s) for all w ∈ Σ*, and A ⇒^1 (↑ w, f) for all w ∈ Σ* which do not have prefix a.
(2) If A has rule A → f, then A ⇒^1 (↑ w, f) for all w ∈ Σ*.
(3) If A has rule A → B[C, D], then the following hold:
(a) If B ⇒^m1 (w ↑ xy, s) and C ⇒^m2 (x ↑ y, s), then A ⇒^m (wx ↑ y, s).
(b) If B ⇒^m1 (w ↑ x, s) and C ⇒^m2 (↑ x, f), then A ⇒^m (↑ wx, f).
(c) If B ⇒^m1 (↑ wx, f) and D ⇒^m2 (w ↑ x, s), then A ⇒^m (w ↑ x, s).
(d) If B ⇒^m1 (↑ w, f) and D ⇒^m2 (↑ w, f), then A ⇒^m (↑ w, f).
In each case m = m1 + m2 + 1.
We say that A ⇒ (x ↑ y, r) if A ⇒^n (x ↑ y, r) for some n ≥ 1. The language defined by P, denoted L(P), is the set {w | S ⇒ (w ↑, s)}.
Example 6.4
Let P be a GTDPL program with rules
S → A[C, E]
C → S[B, E]
A → a
B → b
E → e
We claim that P recognizes {a^n b^n | n ≥ 0}. We can show by simultaneous induction on n that S ⇒ (a^n b^n ↑ x, s) and C ⇒ (a^n b^(n+1) ↑ x, s) for all x and n. For example, with input aabb, we make the following sequence of observations:
A ⇒ (↑ bb, f)
E ⇒ (↑ bb, s)
S ⇒ (↑ bb, s)
B ⇒ (b ↑ b, s)
C ⇒ (b ↑ b, s)
A ⇒ (a ↑ bb, s)
S ⇒ (ab ↑ b, s)
B ⇒ (b ↑, s)
C ⇒ (abb ↑, s)
A ⇒ (a ↑ abb, s)
S ⇒ (aabb ↑, s)  □
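The A → B[C, D] semantics can be exercised directly. The following Python sketch is an illustration of my own, not the book's; the rule encoding mirrors the TDPL sketch given earlier, and the test strings are arbitrary.

```python
# A short sketch of GTDPL evaluation under the A -> B[C, D] semantics just
# defined (encoding is an assumption):
#   ("cond", B, C, D) for A -> B[C, D];  ("term", a) for A -> a, a == "" for e;
#   ("fail",) for A -> f.

def gcall(rules, w, A, pos):
    """Outcome of calling A with the input pointer at pos: (new_pos, success?)."""
    rule = rules[A]
    if rule[0] == "term":
        a = rule[1]
        if a == "":
            return pos, True
        return (pos + 1, True) if pos < len(w) and w[pos] == a else (pos, False)
    if rule[0] == "fail":
        return pos, False
    _, B, C, D = rule
    p1, ok = gcall(rules, w, B, pos)
    if ok:                                    # B succeeded: the outcome is C's, from p1
        p2, ok2 = gcall(rules, w, C, p1)
        return (p2, True) if ok2 else (pos, False)
    return gcall(rules, w, D, pos)            # B failed: the outcome is D's, from pos

# The program of Example 6.4, recognizing { a^n b^n | n >= 0 }.
rules = {"S": ("cond", "A", "C", "E"), "C": ("cond", "S", "B", "E"),
         "A": ("term", "a"), "B": ("term", "b"), "E": ("term", "")}
for s in ["", "ab", "aabb", "aab"]:
    p, ok = gcall(rules, s, "S", 0)
    print(repr(s), ok and p == len(s))        # True, True, True, False
```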
The next example is a GTDPL program that defines a non-context-free language.
Example 6.5
We construct a GTDPL program to recognize the non-CFL {0^n 1^n 2^n | n ≥ 1}. By Example 6.4, we know how to write rules that check whether the string at a certain point begins with 0^n 1^n or 1^n 2^n. Our strategy will be to first check that the input has a prefix of the form 0^m 1^m 2 for m ≥ 0. If not, we shall arrange it so that acceptance cannot occur. If so, we shall arrange an intermediate failure outcome that causes the input to be reconsidered from the beginning. We shall then check that the input is of the form 0^i 1^j 2^j. Thus both tests will be met if and only if the input is of the form 0^n 1^n 2^n for n ≥ 1.
We shall need nonterminals that recognize a single terminal or cause immediate success or failure; let us list them first:
(1) X → 0
(2) Y → 1
(3) Z → 2
(4) E → e
(5) F → f
We can utilize the program of Example 6.4 to recognize 0^m 1^m 2 by a nonterminal S1. The rules associated with S1 are
(6) S1 → A[Z, Z]
(7) A → X[B, E]
(8) B → A[Y, E]
Rules (7), (8), (1), (2), and (4) correspond to those of Example 6.4 exactly. Rule (6) assures that S1 will recognize what A recognizes (0^m 1^m), followed by 2. Note that A always succeeds, so the rule for S1 could be S1 → A[Z, W] for any W. Next, we must write rules that recognize 0* followed by 1^j 2^j for some j. The following rules suffice:
(9) S2 → X[S2, C]
(10) C → Y[D, E]
(11) D → C[Z, E]
Rules (10), (11), (2), (3), and (4) correspond to Example 6.4, and C recognizes 1^j 2^j. The rule for S2 works as follows. As long as there is a prefix of 0's on the input, S2 recognizes one of them and calls itself further along the input. When X fails, i.e., the input pointer has shifted over the 0's, C is called and recognizes a prefix of the form 1^j 2^j. Note that C always succeeds, so S2 always succeeds.
We must now put the subprograms for S1 and S2 together. We first create a nonterminal S3, which never consumes any input, but succeeds or fails as S1 fails or succeeds. The rule for S3 is
(12) S3 → S1[F, E]
Note that if S1 succeeds, S3 will call F, which must fail and retract the input pointer to the place where S1 was called. If S1 fails, S3 calls E, which succeeds. Thus, S3 uses no input in any case. Now we can let S be the start symbol, with rule
(13) S → S3[F, S2]
If S1 succeeds, then S3 fails and S2 is called at the beginning of the input. Thus, S succeeds whenever S1 and S2 succeed on the same input. If S1 fails, then S3 succeeds, so S fails. If S1 succeeds but S2 fails, then S also fails. Thus the program recognizes {0^n 1^n 2^n | n ≥ 1}, which is the intersection of the sets recognized by S1 and S2. Hence there are languages which are not context-free which can be defined by GTDPL programs. The same is true for TDPL programs, incidentally. (See Exercise 6.1.1.) □
We shall now investigate some properties of GTDPL programs. The following lemma is analogous to Lemma 6.1.
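As a quick check of this construction, the rules above can be run through the gcall sketch given after Example 6.4; the encoding and the particular test strings below are my own choices for illustration.

```python
# Re-using gcall from the earlier sketch on the program of Example 6.5.
rules = {"X": ("term", "0"), "Y": ("term", "1"), "Z": ("term", "2"),
         "E": ("term", ""),  "F": ("fail",),
         "S1": ("cond", "A", "Z", "Z"),  "A": ("cond", "X", "B", "E"),
         "B": ("cond", "A", "Y", "E"),
         "S2": ("cond", "X", "S2", "C"), "C": ("cond", "Y", "D", "E"),
         "D": ("cond", "C", "Z", "E"),
         "S3": ("cond", "S1", "F", "E"), "S": ("cond", "S3", "F", "S2")}
for s in ["012", "001122", "00112", "011222"]:
    p, ok = gcall(rules, s, "S", 0)
    print(repr(s), ok and p == len(s))   # True for the two strings of the form 0^n 1^n 2^n
```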
LEMMA 6.4
Let P = (N, Σ, R, S) be any GTDPL program. If A ⇒ (x ↑ y, r1) and A ⇒ (u ↑ v, r2), where xy = uv, then x = u, y = v, and r1 = r2.
Proof. Exercise. □
We now establish two theorems about GTDPL programs. First, the class of TDPL definable languages is contained in the class of GTDPL definable languages. Second, every language defined by a GTDPL program can be recognized in linear time on a reasonable random access machine.
THEOREM 6.2
Every TDPL definable language is a GTDPL definable language.
Proof. Let L = L(P) for a TDPL program P = (N, Σ, R, S). We define the GTDPL program P' = (N', Σ, R', S), where R' is defined as follows:
(1) If A → e is in R, add A → e to R'.
(2) If A → a is in R, add A → a to R'.
(3) If A → f is in R, add A → f to R'.
(4) Create nonterminals E and F and add rules E → e and F → f to R'. (Note that other nonterminals with the same rules can be identified with these.)
(5) If A → BC/D is in R, add
A → A'[E, D]
A' → B[C, F]
to R', where A' is a new nonterminal.
Let N' be N together with all new nonterminals introduced in the construction of R'. It is elementary to observe that if B and C succeed, then A' succeeds, and that otherwise A' fails. Thus, A succeeds if and only if A' succeeds (i.e., B and C succeed), or A' fails (i.e., B fails, or B succeeds and C fails) and D succeeds. It is also easy to check that B, C, and D are called at the same points on the input by R' that they are called by R. Since R' simulates each rule of R directly, we conclude that S ⇒_P (w ↑, s) if and only if S ⇒_P' (w ↑, s), and L(P) = L(P'). □
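The conversion in the proof is a one-pass rewriting of the rule set. The sketch below is my own illustration of it, re-using the rule encodings and the gcall routine from the earlier sketches; the fresh names "E!" and "F!" are assumptions chosen to avoid clashes.

```python
# Sketch of the construction in Theorem 6.2.

def tdpl_to_gtdpl(tdpl_rules):
    g = {"E!": ("term", ""), "F!": ("fail",)}      # the fresh E and F nonterminals
    for A, rule in tdpl_rules.items():
        if rule[0] != "seq":                       # A -> a, A -> e, A -> f are copied
            g[A] = rule
            continue
        _, B, C, D = rule                          # A -> BC/D becomes:
        g[A] = ("cond", A + "'", "E!", D)          #   A  -> A'[E, D]
        g[A + "'"] = ("cond", B, C, "F!")          #   A' -> B[C, F]
    return g

# Converting the TDPL program of Example 6.1 and re-running the earlier test:
g = tdpl_to_gtdpl({"S": ("seq", "A", "B", "C"), "B": ("seq", "C", "B", "A"),
                   "A": ("term", "a"), "C": ("term", "b")})
print(gcall(g, "aba", "S", 0))                     # (3, True), as before
```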
Seemingly, GTDPL programs do more in the way of recognition than TDPL programs. For example, a GTDPL program can readily be written to simulate a statement of the form
A → BC/(D1, D2)
in which D1 is to be called if B fails and D2 is to be called if B succeeds and C fails. It is open, however, whether the containment of Theorem 6.2 is proper. As with TDPL, we can embellish GTDPL with extended statements (see Exercise 6.1.12). For example, every extended TDPL statement can be regarded as an extended form of GTDPL statement.

6.1.4. Time Complexity of GTDPL Languages
The main result of this section is that we can simulate the successful recognition of an input sentence by a GTDPL program (and hence a TDPL
program) in linear time on a machine that resembles a random access computer. The algorithm recalls both the Cocke-Younger-Kasami algorithm and Earley's algorithm of Section 4.2, and works backward on the input string.
ALGORITHM 6.2
Recognition of GTDPL languages in linear time.
Input. A GTDPL program P = (N, Σ, R, S), with N = {A1, A2, ..., Ak} and S = A1, and an input string w = a1a2···an in Σ*. We assume that a_{n+1} = $, a right endmarker.
Output. A k × (n + 1) matrix [tij]. Each entry is either undefined, an integer m such that 0 ≤ m ≤ n, or the failure symbol f. If tij = m, then Ai ⇒ (aj a_{j+1} ··· a_{j+m-1} ↑ a_{j+m} ··· an, s). If tij = f, then Ai ⇒ (↑ aj ··· an, f). Otherwise, tij is undefined.
Method. We compute the matrix of tij's as follows. Initially, all entries are undefined.
(1) Do steps (2)-(4) for j = n + 1, n, ..., 1.
(2) For each i, 1 ≤ i ≤ k: if Ai → f is in R, set tij = f. If Ai → e is in R, set tij = 0. If Ai → aj is in R, set tij = 1, and if Ai → b is in R, b ≠ aj, set tij = f. (We take a_{n+1} to be a symbol not in Σ, so Ai → a_{n+1} is never in R.)
(3) Do step (4) repeatedly for i = 1, 2, ..., k, until no changes to the tij's occur in a step.
(4) Let the rule for Ai be of the form Ai → Ap[Aq, Ar], and suppose that tij is not yet defined.
(a) If tpj = f and trj = x, then set tij = x, where x is an integer or f.
(b) If tpj = m1 and t_{q, j+m1} = m2, then set tij = m1 + m2; if tpj = m1 and t_{q, j+m1} = f, then set tij = f. □
There is a constant c such that Algorithm 6.2 requires no more than cn elementary steps on an input of length n ≥ 1, where elementary steps are of the type used for Algorithm 4.3.
Proof. The crux of the proof is to observe that in step (3) we cycle through all the nonterminals at most k times for any given j. □
Last, we observe that from the matrix in Algorithm 6.2 it is possible to build a tree-like parse structure for accepted inputs, similar to the structure that was built in Algorithm 6.1 for TDPL programs. Additionally, Algorithm 6.2 can be modified to recognize and "parse" according to a TDPL (rather than GTDPL) program.
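Before turning to the worked example, here is a small Python sketch of the tabular method; it is my own illustration, not the book's, using a dictionary in place of the k × (n + 1) matrix and the rule encoding of the earlier GTDPL sketch.

```python
# Sketch of Algorithm 6.2: t[(A, j)] is the number of symbols A recognizes when
# called at position j (1-based), or "f"; entries never set remain undefined.

def gtdpl_table(rules, w):
    n = len(w)
    t = {}
    for j in range(n + 1, 0, -1):                # work backward on the input
        for A, rule in rules.items():            # step (2): terminal rules
            if rule[0] == "fail":
                t[(A, j)] = "f"
            elif rule[0] == "term":
                a = rule[1]
                if a == "":
                    t[(A, j)] = 0
                else:
                    t[(A, j)] = 1 if j <= n and w[j - 1] == a else "f"
        changed = True
        while changed:                           # steps (3)-(4): cycle to a fixpoint
            changed = False
            for A, rule in rules.items():
                if rule[0] != "cond" or (A, j) in t:
                    continue
                _, B, C, D = rule
                b = t.get((B, j))
                if b == "f" and (D, j) in t:          # case (a)
                    t[(A, j)] = t[(D, j)]; changed = True
                elif isinstance(b, int):              # case (b)
                    c = t.get((C, j + b))
                    if c == "f":
                        t[(A, j)] = "f"; changed = True
                    elif isinstance(c, int):
                        t[(A, j)] = b + c; changed = True
    return t

rules = {"S": ("cond", "A", "C", "E"), "C": ("cond", "S", "B", "E"),
         "A": ("term", "a"), "B": ("term", "b"), "E": ("term", "")}
print(gtdpl_table(rules, "aabb")[("S", 1)])      # 4: S recognizes all of aabb
```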
Example 6.6
Let P = (N, Σ, R1, E), where
N = {E, E+, T, T*, F, F', X, Y, P, M, A, L, R},
Σ = {a, (, ), +, *}, and R1 consists of
(1) E → T[E+, X]
(2) E+ → P[E, Y]
(3) T → F[T*, X]
(4) T* → M[T, Y]
(5) F → L[F', A]
(6) F' → E[R, X]
(7) X → f
(8) Y → e
(9) P → +
(10) M → *
(11) A → a
(12) L → (
(13) R → )
This GTDPL program is intended to recognize arithmetic expressions over + and *, i.e., L(G0). E recognizes an expression consisting of a sequence of terms (T's) separated by +'s. The nonterminal E+ is intended to recognize an expression with the first term deleted. Thus rule (2) says that E+ recognizes a + sign (P) followed by any expression, and if there is no + sign, the empty
string (Y) serves. Then we can interpret statement (1) as saying that an expression is a term followed by something recognized by E+, consisting of either the empty string or an alternating sequence of +'s and terms beginning with + and ending in a term. A similar relation applies to statements (3) and (4). Statements (5) and (6) say that a factor (F) is either ( followed by an expression followed by ), or, if no ( is present, a single symbol a.
Now, suppose that (a + a) * a is the input to Algorithm 6.2. The matrix [tij] constructed by Algorithm 6.2 is shown in Fig. 6.2. Let us compute the entries in the eighth column of the matrix. The entries for P, M, A, L, and R have value f, since they look for input symbols in Σ and the eighth input symbol is the right endmarker. X always yields value f, and Y always yields value 0. Applying step (3) of Algorithm 6.2, we find that in the first cycle through step (4) the values for E+, T*, and F can be filled in and are 0, 0, and f, respectively. On the second cycle, T is given the value f. The values for E and F' can be computed on the third cycle.
       (    a    +    a    )    *    a    $
E      7    3    f    1    f    f    1    f
E+     0    0    2    0    0    0    0    0
T      7    1    f    1    f    f    1    f
T*     0    0    0    0    0    2    0    0
F      5    1    f    1    f    f    1    f
F'     f    4    f    2    f    f    f    f
X      f    f    f    f    f    f    f    f
Y      0    0    0    0    0    0    0    0
P      f    f    1    f    f    f    f    f
M      f    f    f    f    f    1    f    f
A      f    1    f    1    f    f    1    f
L      1    f    f    f    f    f    f    f
R      f    f    f    f    1    f    f    f
Fig. 6.2  Recognition table from Algorithm 6.2.
An example of a less trivial computation occurs in column 3. The bottom seven rows are easily filled in by statement (2). Then, by statement (3), since the P entry in column 3 is 1, we examine the E entry in column 4 (= 3 + 1) and find that this is also 1. Thus the E+ entry in column 3 is 2 (= 1 + 1). □

6.1.5. Implementation of GTDPL Programs
In practice, implementations of GTDPL-like parsing systems do not take the tabular form of Algorithm 6.2. Instead, a trial-and-error method is normally used. In this section we shall construct an automaton that "implements" the recognition portion of a GTDPL program. We shall leave it to
the reader to show how this automaton could be extended to a transducer which would indicate the successful sequence of routine calls from which a "parse" or translation can be constructed. The automaton consists of an input tape with an input pointer, which may be reset; a three-state finite control; and a pushdown list consisting of symbols from a finite alphabet and pointers to the input. The device operates in a way that exactly implements our intuitive idea of routines (nonterminals) calling one another, with the retraction of the input pointer required on each failure. The retraction is to the point at which the input pointer dwelt when the call of the failing routine occurred. DEFINITION
A parsing machine is a 6-tuple M = (Q, Σ, Γ, δ, begin, Z0), where
(1) Q = {success, failure, begin}.
(2) Σ is a finite set of input symbols.
(3) Γ is a finite set of pushdown symbols.
(4) δ is a mapping from Q × (Σ ∪ {e}) × Γ to Q × Γ*, which is restricted as follows:
(a) If q is success or failure, then δ(q, a, Z) is undefined if a ∈ Σ, and δ(q, e, Z) is of the form (begin, Y) for some Y ∈ Γ.
(b) If δ(begin, a, Z) is defined for some a ∈ Σ, then δ(begin, b, Z) is undefined for all b ≠ a in Σ ∪ {e}.
(c) For a in Σ, δ(begin, a, Z) can only be (success, e), if it is defined.
(d) δ(begin, e, Z) can only be of the form (begin, YZ), for some Y in Γ, or of the form (q, e), for q = success or failure.
(5) begin is the initial state.
(6) Z0 in Γ is the initial pushdown symbol.
M resembles a pushdown automaton, but there are several major differences. We can think of the elements of Γ as routines that either call or transfer to each other. The pushdown list is used to record recursive calls and the position of the input head each time a call was made. The state begin normally causes a call of another routine, reflected in that if δ(begin, e, Z) = (begin, YZ), where Y is in Γ and Z is on top of the list, then Y will be placed above Z on a new level of the pushdown list. The states success and failure cause transfers to, rather than calls of, another routine. If, for example, δ(success, e, Z) = (begin, Y), then Y merely replaces Z on top of the list.
We formally define the operation of M as follows. A configuration of M is a triple (q, w ↑ x, γ), where
(1) q is one of success, failure, or begin;
(2) w and x are in Σ*; ↑ is a metasymbol, the input head; and
(3) γ is a pushdown list of the form (Z1, i1) ··· (Zm, im), where Zj ∈ Γ and ij is an integer, for 1 ≤ j ≤ m. The top is at the left. The Z's are "routine" calls; the i's are input pointers.
We define the ⊢_M relation, or ⊢ when M is understood, on configurations as follows:
(1) Let δ(begin, e, Z) = (begin, YZ) for Y in Γ. Then
(begin, w ↑ x, (Z, i)γ) ⊢ (begin, w ↑ x, (Y, j)(Z, i)γ), where j = |w|.
Here, Y is "called," and the position of the input head when Y is called is recorded, along with the entry on the pushdown list for Y.
(2) Let δ(q, e, Z) = (begin, Y), where Y ∈ Γ, and q = success or failure. Then
(q, w ↑ x, (Z, i)γ) ⊢ (begin, w ↑ x, (Y, i)γ).
Here Z "transfers" to Y. The input position associated with Y is the same as that associated with Z.
(3) Let δ(begin, a, Z) = (q, e) for a in Σ ∪ {e}. If q = success, then
(begin, w ↑ ax, (Z, i)γ) ⊢ (success, wa ↑ x, γ).
If a is not a prefix of x or q = failure, then
(begin, w ↑ x, (Z, i)γ) ⊢ (failure, u ↑ v, γ), where uv = wx and |u| = i.
In the latter case the input pointer is retracted to the location given by the pointer on top of the pushdown list. Note that if δ(begin, a, Z) = (success, e), then the next state of the parsing machine is success if the unexpended input string begins with a and failure otherwise.
Let ⊢+ be the transitive closure of ⊢. The language defined by M, denoted L(M), is {w | w is in Σ* and (begin, ↑ w, (Z0, 0)) ⊢+ (success, w ↑, e)}.
Example 6.7
Let M = (Q, {a, b}, {Z0, Y, A, B, E}, δ, begin, Z0), where δ is given by
(1) δ(begin, e, Z0) = (begin, YZ0)
(2) δ(success, e, Z0) = (begin, Z0)
(3) δ(failure, e, Z0) = (begin, E)
(4) δ(begin, e, Y) = (begin, AY)
(5) δ(success, e, Y) = (begin, Y)
(6) δ(failure, e, Y) = (begin, B)
(7) δ(begin, a, A) = (success, e)
(8) δ(begin, b, B) = (success, e)
(9) δ(begin, e, E) = (success, e)
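The machine M just defined (rules (1)-(9) above) can be simulated by a short interpreter. The sketch below is my own illustration, not the book's; the tuple-based encoding of δ and all names are assumptions, and its two printed outcomes agree with the trace of M on abaa given below.

```python
# Sketch of an interpreter for the parsing machine.  delta maps (state, a, Z)
# to (state', push), with a == "" for e-moves and push a tuple of 0, 1, or 2
# pushdown symbols, as the restrictions above require.

def run(delta, Z0, w):
    state, pos, stack = "begin", 0, [(Z0, 0)]          # stack entries: (symbol, pointer)
    while stack:
        Z, i = stack[-1]
        if state == "begin" and ("begin", "", Z) in delta:
            q, push = delta[("begin", "", Z)]
            if len(push) == 2:                          # call: Y above Z, pointer = pos
                stack.append((push[0], pos))
            else:                                       # (q, e) for q = success or failure
                state = q
                stack.pop()
                if q == "failure":
                    pos = i                             # retract the input pointer
        elif state == "begin":                          # a terminal move delta(begin, a, Z)
            a = next((a for (s, a, Y) in delta if s == "begin" and Y == Z and a), None)
            if a is not None and w[pos:pos + len(a)] == a:
                state, pos = "success", pos + len(a)
            else:
                state, pos = "failure", i               # failure retracts to the top pointer
            stack.pop()
        else:                                           # success/failure: transfer, keep i
            q, push = delta[(state, "", Z)]
            state, stack[-1] = q, (push[0], i)
    return state == "success" and pos == len(w)

# The machine M of this example, with Z0 as the initial pushdown symbol.
delta = {("begin", "", "Z0"): ("begin", ("Y", "Z0")),
         ("success", "", "Z0"): ("begin", ("Z0",)),
         ("failure", "", "Z0"): ("begin", ("E",)),
         ("begin", "", "Y"): ("begin", ("A", "Y")),
         ("success", "", "Y"): ("begin", ("Y",)),
         ("failure", "", "Y"): ("begin", ("B",)),
         ("begin", "a", "A"): ("success", ()),
         ("begin", "b", "B"): ("success", ()),
         ("begin", "", "E"): ("success", ())}
print(run(delta, "Z0", "abaa"), run(delta, "Z0", "ab"))  # False True, as traced below
```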
M recognizes e or any string of a's and b's ending in b, but does so in a peculiar way. A and B recognize a and b, respectively. When Y begins, it looks for an a, and if Y finds it, Y "transfers" to itself. Thus the pushdown list remains intact, and a's are consumed on the input. If b or the end of the input is reached, Y in state failure causes the top of the pushdown list to be erased. That is, Y is replaced by B, and, whether B succeeds or fails, that B is eventually erased. Z0 calls Y and transfers to itself in the same way that Y calls A. Thus
any string of a's and b's ending in b will eventually cause Z0 to be erased and state success entered. The action of M on input abaa is given by the following sequence of configurations:
(begin, ↑ abaa, (Z0, 0))
⊢ (begin, ↑ abaa, (Y, 0)(Z0, 0))
⊢ (begin, ↑ abaa, (A, 0)(Y, 0)(Z0, 0))
⊢ (success, a ↑ baa, (Y, 0)(Z0, 0))
⊢ (begin, a ↑ baa, (Y, 0)(Z0, 0))
⊢ (begin, a ↑ baa, (A, 1)(Y, 0)(Z0, 0))
⊢ (failure, a ↑ baa, (Y, 0)(Z0, 0))
⊢ (begin, a ↑ baa, (B, 0)(Z0, 0))
⊢ (success, ab ↑ aa, (Z0, 0))
⊢ (begin, ab ↑ aa, (Z0, 0))
⊢ (begin, ab ↑ aa, (Y, 2)(Z0, 0))
⊢ (begin, ab ↑ aa, (A, 2)(Y, 2)(Z0, 0))
⊢ (success, aba ↑ a, (Y, 2)(Z0, 0))
⊢ (begin, aba ↑ a, (Y, 2)(Z0, 0))
⊢ (begin, aba ↑ a, (A, 3)(Y, 2)(Z0, 0))
⊢ (success, abaa ↑, (Y, 2)(Z0, 0))
⊢ (begin, abaa ↑, (Y, 2)(Z0, 0))
⊢ (begin, abaa ↑, (A, 4)(Y, 2)(Z0, 0))
⊢ (failure, abaa ↑, (Y, 2)(Z0, 0))
⊢ (begin, abaa ↑, (B, 2)(Z0, 0))
⊢ (failure, ab ↑ aa, (Z0, 0))
⊢ (begin, ab ↑ aa, (E, 0))
⊢ (success, ab ↑ aa, e)
Note that abaa is not accepted because the end of the input was not reached at the last step. However, ab alone would be accepted. It is important also to note that in the fourth from last configuration B is not "called" but replaces Y. Thus the number 2, rather than 4, appears on top of the list, and when B fails, the input head backtracks. □
We shall now prove that a language is defined by a parsing machine if and only if it is defined by a GTDPL program.
LEMMA 6.5
If L = L(M) for some parsing machine M = (Q, Σ, Γ, δ, begin, Z0), then L = L(P) for some GTDPL program P.
Proof. Let P = (N, Σ, R, Z0), where N = Γ ∪ {X} with X a new symbol. Define R as follows:
(1) X has no rule. (2) If ~(begin, a, Z) = (q, e), let Z --~ a be the rule for Z if q = success, and Z--~ f be the rule if q = failure. (3) For the other Z's in F define Y~, Y2, and Y3 as follows' (a) If ~(begin, e, Z) (begin, YZ), let Y~ = Y.
(b) If ~(q, e, Z) = (begin, Y), let Y2 = Y if q is success and let Y3 = Y if q is failure. (c) If Yt is not defined by (a) or (b), take Yt to be X for each Yt not otherwise defined. Then the rule for Z is Z ---~ Y~[Y2, Y3]. We shall show that the following statements hold for all Z in r . (6.1.3)
Z ---> (w ~'x, s) if and only if
(begin, I wx, (Z, 0))Ira- (success, w I x, e) (6.1.4)
Z ~
([' w, f ) if and only if (begin, I w, (Z, 0))[-~-- (failure, I w, e)
We prove both simultaneously by induction on the length of a derivation in P or computation of M. Only if: The bases for (6.1.3) and (6.1.4) are both trivial applications of the definition of the l- relation. For the inductive step of (6.1.3), assume that Z ~ (w ~'x, s) and that (6.1.3) and (6.1.4) are true for smaller n. Since we may take n > 1, let the rule for Z be Z ~ Yi[Y2, Y3]. Case 1: w = wlw2, Y1 ~ (wl ~"w2x, s) and Y2 ~ (w2 I x, s). Then n 1 and n2 are less than n, and we have, by the inductive hypothesis (6.1.3),
(6.1.5)
(begin, ~"wx, (Y1, 0))12- (success, w 1 I w2x, e)
and (6.1.6)
(begin, I w2x, (Y2, 0))12- (success, wz I x, e)
We must now observe that if we insert some string, w 1 in particular, to the left of the input head of M, then M will undergo essentially the same action. Thus from (6.1.6) we obtain (6.1.7)
(begin, w 1 ~ w2x , (Y2, 0))12- (success, w I x, e)
This inference requires an inductive proof in its own right, but is left for the Exercises.
SEC. 6.1
LIMITED BACKTRACK T O P - D O W N PARSING
481
From the definition of R we know that ~(begin, e, Z ) = (begin, YtZ) and ~(suceess, e, Z) = (begin, Y2). Thus (6.1.8)
(begin, T'wx, (Z, 0)) ~ (begin, ~' wx, (Y~, 0)(Z, 0))
and (6.1.9)
(success, wl r"w2x, (Z, 0)) ~ (begin, w 1 I w2x, (Y2, 0)).
Putting (6.1.9), (6.1.5), (6.1.8), and (6.1.7) together, we have (begin, [' wx, (Z, 0))l-~-- (success, w r' x, e), as desired.
Case 2: Yl ~ (~ wx, f ) and Y3 ~ (w p x, s). The proof in this case is similar to case 1 and is left for the reader. We now turn to the induction for (6.1.4). We assume that Z ~ (~ w, f).
Case 1 : Y 1 ~ (wl ~ w2, s) and Yz ~ (~ wz, f), where w lw z = w. Then nl, n2 < n, and by (6.1.3) and (6.1.4), we have (6.1.10) (6.1.11)
(begin,p w, (Y1, 0))[--~-(success,w 1 ~"w z, e) (begin, r"w2, (Y2, 0))I~ (failure, ~'wz, e)
If we insert w a to the left of 1"in (6.1.11), we have (6.1.12)
(begin,wl ~"w2, (Yz, 0))I--~- (failure, ~"w 1wz, e)
The truth of this implication is left for the Exercises. One has to observe that when (Yz, 0) is erased, the input head must be set all the way to the left. Otherwise, the presence of w l on the input cannot affect matters, because numbers written on the pushdown list above (Y1, 0) will have 1w~ [ added to them [when constructing the sequence of steps represented by (6.1.12) from (6.1.11)], and thus there is no way to get the input head to move into w x without erasing (Y~, 0). By definition of Y~ and Y2, we have (6.1.13)
(begin, ~'w, (Z, 0)) ~ (begin, ~"w, (Y~, O)(Z, 0))
and (6.1.14)
(success, w l ~"w2, (Z, 0)) F- (begin, w t ["w2, (Y2, 0))
Putting (6.1.13), (6.1.10), (6.1.14), and (6.1.12) together, we have (begin, ["w, (Z, 0))I --~- (failure, 1"w, e).
482
LIMITEDBACKTRACK PARSING ALGORITHMS
CHAP. 6
Case 2 : Y 1 ~ (~ w, f ) and Y3 =%. ([' w, f). This case is similar and left to the reader. lf: The "if" portion of the proof is similar to the foregoing, and we leave the details for the Exercises. As a special case of (6.1.3), Z0 =~ (w l', s) if and only if (begin, [' w, (Z0, 0)) !-~-- (success, w [', e), so L(M) = L(P). D LEMMA 6.6 If L = L(P) for some GTDPL program P, then L = L(M) for a parsing machine M.
Proof. Let P = (N, X, R, S) Define ~ as follows:
and define M = (Q, X, N, ~, begin, S).
(1) If R contains rule A ---~ B[C, D], let ~(begin, e, A) = (begin, BA), $(sueeess, e, A) = (begin, C) and dr(failure, e, A) = (begin, D). (2) (a) If A----~ a is in R, where a is in X u [e}, let $(begin, a, A ) = (success, e). (b) If A ---~ f is in R, let 6(begin, e, A) = (failure, e). A proof that L ( M ) = L(G) is straightforward and left for the Exercises.
D THEOREM 6.5
A language L is L(M) for some parsing machine M if and only if it is L(P) for some GTDPL program P.
Proof Immediate from Lemmas 6.5 and 6.6.
D
EXERCISES
6.1.1. Construct TDPL programs to recognize the following languages"
(a) L(ao).
6.1.2.
(b) The set of strings with an equal number of a's and b's. (c) {WCWR[w ~ (a + b)*}. *(d) {a2"ln > 1}. Hint: Consider S ~ aSa/aa. (e) Some infinite subset of FORTRAN. Construct GTDPL programs to recognize the following languages: (a) The languages in Exercise 6.1.1. (b) The language generated by (with start symbol E)
E----~ E + T[T T >T,FIF F > (E)II I----+ a[ a(L) L - > a[a,L **(c) {a"'ln ~ 1}.
EXERCISES
483
"6.1.3.
Show that for every LL(1) language there is a GTDPL program which recognizes the language with no backtracking; i.e., the parsing machine constructed by Lemrna 6.6 never moves the input pointer t o the left between successive configurations.
"6.1.4.
Show that it is undecidable whether a TDPL program P = (N, X~, R, S) recognizes
(a) ~. (b) X*. 6.1.5.
Show that every TDPL or GTDPL program is equivalent to one in which every nonterminal has a rule. Hint: Show that if A has no rule, you can give it rule A ~ A A / A (or the equivalent in GTDPL) with no change in the language recognized.
"6.1.6.
Give a TDPL program equivalent to the following extended program. What is the language defined ? From a practical point of view, what defects does this program have ? s ~
Zln/C
h ------~ a B -----~ S C A C--+
6.1.7.
Give a formal proof of Lemma 6.3.
6.1.8.
Prove Lemma 6.4,
6.1.9.
b
Give a formal proof that P of Example 6.5 defines {0"l"2"ln > 1}.
6.1.10.
Complete the proof of Theorem 6.2.
6.1.11.
Use Algorithm 6.2 to show that the string ((a)) + a is in L ( P ) , where P is given in Example 6.6.
6.1.12.
GTDPL statements can be extended in much the same manner we extended TDPL statements. For example, we can permit GTDPL statements of the form A
> X~g~Xzg2 ... Xkgk
where each Xt is a terminal or nonterminal and each gt is either e or a pair of the form [czi, fli], where ~zi and fit are strings of symbols. A reports success if and only if each Xtgt succeeds where success is defined recursively, as follows. The string ~.[0¢t, fit] succeeds if and only if (1) ~ succeeds and oct succeeds or (2) Xi fails and fit succeeds. (a) Show how this extended statement can be replaced by an equivalent set of conventional GTDPL statements. (b) Show that every extended TDPL statement can be replaced by an equivalent set of (extended) GTDPL statements.
484
LIMITEDBACKTRACK PARSING ALGORITHMS
CHAP. 6
6.1.13.
Show that there are TDPL (and GTDPL) programs in which the number of statements executed by the parsing machine of Lemma 6.6 is an exponential function of the length of the input string.
6.1.14.
Construct a GTDPL program to simulate the meaning of the rule A ~ BC/(D1, D2) mentioned on p. 473.
6.1.15.
Find a GTDPL program which defines the language L ( M ) , where M is the parsing machine given in Example 6.7.
6.1.16.
Find parsing machines to recognize the languages of Exercise 6.1.2. DEFINITION A TDPL or GTDPL program P = (N, ~Z, R, S) has a partial acceptance failure on w if w = uv such that v -~ e and S ~ (u T"v, s). We say that P is well formed if for every w in E*, either S ~ (1' w, f ) or
s=~ (w, Ls). "6.1.17.
Show that if L is a TDPL language (alt. G T D P L language) and $ is a new symbol, then L$ is defined by a TDPL program (alt. GTDPL program) with no partial acceptance failures.
"6.1.18.
Let L1 be defined by a TDPL (alt. GTDPL) program and L2 by a wellformed TDPL (alt. GTDPL) program. Show that (a) L1 U L2, (b) L,2, (c) L1 ~ L2, and (d) L1 -- .L2 are TDPL (alt. GTDPL) languages.
"6.1.19.
Show that every GTDPL program with no partial acceptance failure is equivalent to a well-formed G T D P L program. Hint: It suffices to look for and eliminate "left recursion." That is, if we have a normal form G T D P L program, create a CFL by replacing rules ,4 --~ B[C, D] by productions A ~ B C and A ~ D. Let A ~ a or A ~ e be productions of the CFL also. The "left recursion" referred to is in the CFL constructed.
*'6.1.20.
Show that it is undecidable for a well-formed TDPL program P = (N, X, R, S) whether L ( P ) = ~ . Note: The natural embedding of Post's correspondence problem proves Exercise 6.1.4(a), but does not always yield a well-formed program.
6.1.21.
Complete the proof of Lemma 6.5.
6.1.22.
Prove Lemma 6.6.
Open P r o b l e m s
6.1.23.
Does there exist a context-free language which is not a GTDPL language ?
6.1.24.
Are the TDPL languages closed under complementation ?
s~c. 6.2
LIMITED BACKTRACK BOTTOM-UP PARSING
485
6.1.25. Is every TDPL program equivalent to a well-formed TDPL program? 6.1.26. Is every GTDPL program equivalent to a TDPL program? Here we conjecture that {a"'ln _~ 1} is a GTDPL language but not a TDPL language.
Programming Exercises 6.1.27.
Design an interpreter for parsing machines. Write a program that takes an extended GTDPL program as input and constructs from it an equivalent parsing machine which the interpreter can then simulate.
6.1.28.
Design a programming language centered around GTDPL (or TDPL) which can be used to specify translators. A source program would be the specification of a translator and the object program would be the actual translator. Construct a compiler for this language.
BIBLIOGRAPHIC
NOTES
TDPL is an abstraction of the parsing language used by McClure [1965] in his compiler-compiler TMG.I" The parsing machine in Section 6.1.5 is similar to the one used by Knuth [1967]. Most of the theoretical results concerning TDPL and GTDPL reported in this section were developed by Birman and Ullman [1970]. The solutions to many of the exercises can be found there. GTDPL is a model of the META family of compiler-compilers [Schorre, 1964] and others.
6.2.
LIMITED BACKTRACK BOTTOM-UP PARSING
We shall discuss possibilities of parsing deterministically and bottom-up in ways that allow more freedom t h a n the shift-reduce methods of Chapter 5. In particular, we allow limited backtracking on the input, and the parse produced need not be a right parse. The principal method to be discussed is that of Colmerauer's precedence-based algorithm. 6.2.1.
Noncanonical Parsing
There are several techniques which might be used to deterministically parse grammars which are not LR. One technique would be to permit arbitrarily long lookahead by allowing the input pointer to migrate forward along the input to resolve some ambiguity. When a decision has been reached, the input pointer finds its way back to the proper place for a reduction. tTMG comes from the word "transmogrify," whose meaning is "to change in appearance or form, especially, strangely or grotesquely.'"
486
LIMITEDBACKTRACK PARSING ALGORITHMS
CHAP. 6
Example 6.8 Consider the grammar G with productions
S
• AalBb
A
> 0A1101
B
> 0Bl11011
G generates the language [O"Ya[n > 1} U {0"12"b ]n > 1}, which is not a deterministic context-free language. However, we can clearly parse G by first moving the input pointer to the end of an input string to see whether the last symbol is a or b and then returning to the beginning of the string and parsing as though the string were 0"1" or 0"12,, as appropriate. [-7 Another parsing technique would be to reduce phrases which may not be handles. DEFINITION
If G = (N, E, P, S) is a CFG, then fl is a phrase of a sentential form ~zfly if there is a derivation S ~ ~zAy:=~ ~zflT. If X t . . . X k and X j . . . X1 are phrases of a sentential form X 1 . . - X,, we say that phrase X ~ . . . Xk is to the left of phrase Xj . - . Xi if i < j or if i = j and k < l. Thus, if a grammar is unambiguous, the handle is the leftmost phrase of a right-sentential form. Example 6.9
Consider the grammar G having productions
S
> OABb[OaBc
A
>a
B~
>Blll
L(G) is the regular set 0al+(b + c), but G is not LR. However, we can parse G bottom-up if we defer the decision of whether a is a phrase in a sentential form until we have scanned the last input symbol. That is, an input string of the form 0al" can be reduced to OaB independently of whether it is followed by b or c. In the former case, OaBb is first reduced to OABb and then to S. In the latter case, OaBc is reduced directly to S. Of course, we shall not produce either a left or right parse. [Z] Let G = (N,Z, P, S) be a C F G in which the productions are numbered from 1 to p and let (6.2.1)
S:
~o ~ ~ 1 - - - - > ' o ~ 2
:-""
:-o~n :
w
sEc. 6.2
LIMITED BACKTRACK BOTTOM-UP PARSING
487
be a derivation of w from S. For 0 < i < n, let at = fl~At6~, suppose that A~ ----~Yt is production p~ and suppose that this production is used to derive a~+l = fl~?i5~ by replacing the explicitly shown A~. We can represent this step of the derivation by the pair of integers (p,, It), where 1, = I fl, I. Thus we can represent the derivation (6.2.1) by the string of n pairs
(p0,/0)(p,, t0-.. (;._,,/._,)
(6.2.2)
If the derivation is leftmost or rightmost, then the second components in (6.2.2), those giving the position of the nonterminal to be expanded in the next step of the derivation, are redundant. DEFINITION
We shall call a string of pairs of the form (6.2.2) a (generalized) top-down parse for w. Clearly, a left parse is a special case of a top-down parse. Likewise, we shall call the reverse of this string, that is,
(P,-1, t,-~)(P,-z, l , - 2 ) " " (Pa, ll)(Po, 1o) a (generalized) bottom-up parse of w. Thus a right parse is a special case of a bottom-up parse. If we relax the restriction of scanning the input string only from left to right, but instead permit backtracking on the input, then we can deterministically parse grammars which cannot be so parsed using only the left-toright scan. 6,2.2.
T w o - S t a c k Parsers
To model some backtracking algorithms, we introduce an automaton with two pushdown lists, the second of which also serves the function of an input tape. The deterministic version of this device is a cousin of the twostack parser used in Algorithms 4. I and 4.2 for general top-down and bottomup parsing. We shall, however, put some restrictions on the device which will make it behave as a bottom-up precedence parser. DEFINITION
A two-stack (bottom-up) parser for grammar G = (N, E, P, S) is a finite set of rules of the form (a, fl) ~ (y, 6), where a, fl, ~,, and ~ are strings of symbols in N U E U [$}; $ is a new symbol, an endmarker. Each rule of the parser (a, fl) ~ (y, 6) must be of one of two forms: either (1) fl = X6 for some X E N U I:, and y = aX, or (2) a = ~,6 for some 6 in (N u E)*, 6 = Aft, and A ~ e is a production in P. In general, a rule (a, fl) ~ (Y, 6) implies that if the string a is on top of
488
LIMITED BACKTRACK PARSING ALGORITHMS
CHAP. 6
the first pushdown list and if the string fl is on top of the second, then we can replace 0c by ~, on the first pushd.own list and fl by y on the second. Rules of type (1) correspond to a shift in a shift-reduce parsing algorithm. Those of type (2) are related to reduce moves; the essential difference is that the symbol A, which is the left side of the production involved, winds up on the top of the second pushdown list rather than the first. This arrangement corresponds to limited backtracking. It is possible to move symbols from the first pushdown list to the second (which acts as the input tape), but only at the time of a reduction. Of course, rules of type (1) allow symbols to move from the second list to the first at any time. A configuration of a two-stack parser T is a triple (0~, fl, n), where tx $(N u E)*, fl ~ (N u E)* $, and rt is a string of pairs consisting of an integer and a production number. Thus, n could be part of a parse of some string in L(G). We say that (0c, fl, zt)]-T (0¢, fl', n') if (1) ~ = txltx2, fl = fl2fll, (t~2, f12) "-~ (r, ~) is a rule of T; (2) ~' = 0c17, fl' = ~flx ; and (3) If ((x2, f12) ~ (Y, d~) is a type 1 rule, then zt' = rt; if a type 2 rule and production i is the applicable production, then ~z'= n(i, j), where j is equal
to I ~ ' 1 - 1.t Note that the first stack has its top at the right and that the second has its at the left. We define I-~-, I~--, and [-~- from I T in the usual manner. The subscript T will be dropped whenever possible. The translation defined by T, denoted z(T), is [(w,n)l($, w$, e)! ~ ($, S$, zt)3. We say that T is valid for G if for every w ~ L(G), there exists a bottom-up parse n of w such that (w, rt) ~ z(T). It is elementary to show that if (w, zt) ~ z(T), then rt is a bottom-up parse of w. T is deterministic if whenever (0~1, ill) ----~(Yl, ~1) and (~2, flz) --~ (Y2, ~2) are rules such that 0~ is a suffix of ~2 or vice versa and fll is a prefix of f12 or vice versa, then ~'1 = ~'2 and di~ = ~2. Thus for each configuration C, there is at most one C' such that C ~- C'. Example 6.10
Consider the grammar G with productions (1) S
> aSA
(2) S
> bSA
(3) S
>b
(4) A -
>a
tThe --1 term is present since ~' includes a left endmarker.
SEC. 6.2
LIMITED BACKTRACK BOTTOM-UP PARSING
489
This grammar generates the nondeterministic CFL [wba"lw ~ (a + b)* and n = [wl}. We can design a (nondeterministic) two-stack transducer which can parse sentences according to G by first putting all of an input string on the first pushdown list and then parsing in essence from right to left. The rules of T are the following: (1) (e, X) ---~ (X, e) for all X ~ {a, b, S, A}. (Any symbol may be shifted from the second pushdown list to the first.) (2) (a, e) ---~ (e, A). (An a may be reduced to A.) (3) (b, e) --~ (e, S). (A b may be reduced to S.) (4) (aSA, e) ---~ (e, S). (5) (bSA, e) ---, (e, S). [The last two rules allow reductions by productions (I) and (2).] Note that T is nondeterministic and that many parses of each input can be achieved. One bottom-up parse of abbaa is traced out in the following sequence of configurations"
($, abbaa$, e) ~ ($a, bbaa$, e) R ($ab, baa$, e) k- ($abb, aa$, e) k- ($ab, Saa$, (3, 2)) k- ($abS, aa$, (3, 2)) ($abSa, aS, (3, 2)) k- ($abSaa, $, (3, 2)) [- ($abSa, AS, (3, 2)(4, 4)) }--- ($abS, AA$, (3, 2)(4, 4)(4, 3)) [--- ($abSA, AS, (3, 2)(4, 4)(4, 3)) ]- ($a, SA$, (3, 2)(4, 4)(4, 3)(2, 1)) R ($aS, AS, (3, 2)(4, 4)(4, 3)(2, 1)) 1---($aSA, $, (3, 2)(4, 4)(4, 3)(2, 1)) ~- ($, S$, (3, 2)(4, 4)(4, 3)(2, 1)(1, 0)) The string (3, 2)(4, 4)(4, 3)(2, 1)(1, 0) is a bottom-up parse of abbaa, corresponding to the derivation
S =~ aSA =~ abSAA =~ abSaA =~ abSaa =:¢,.abbaa.
[-q
The two-stack parser has an anomaly in common with the general shiftreduce parsing algorithms; if a grammar is ambiguous, it may still be pos-
400
LIMITED BACKTRACK PARSING ALGORITHMS
CHAP. 6
sible to find a deterministic two-stack parser for it by ignoring some of the possible parses. Later developments will rule out this problem. Example 6.11
Let G be defined by the productions
S----~'AIB A
> aAla
B
> Bala
G is an ambiguous grammar for a ÷. By ignoring B and its productions, the following set of rules form a deterministic two-stack parser for G' (e, a)
> (a, e)
(a, $) ~
(e, A $)
(a, A)
> (aA, e)
(aA, $)
> (e, AS)
($, A) ----~ ($A, e)
($A, $) ~ 6.2.3.
($, S$) G
Colmerauer Precedence Relations
The two-stack parser can be made to act in a manner somewhat similar to a precedence parser by assuming the existence of three disjoint relations, < , ._~._,and .~, on the symbols of a grammar, letting .~ indicate a reduction, and < and " indicate shifts. When reductions are made, < will indicate the left end of a phrase (not necessarily a handle). It should be emphasized that, at least temporarily, we are not assuming that the relations < , -~--, and ~• bear any connection with the productions of a grammar. Thus, for example, we could have X Y even though X and Y never appear together on the right side of a production. DEFINITION
Let G = (N, Z, P, S) be a CFG, and let < , ~---, and -~ be three disjoint relations on N U Z U [$}, where $ is a new symbol, the endmarker. The two-stack parser induced by the relations < , ---~, and -~ is defined by the following set of rules" (1) (X, Y) ~ (XY, e) if and only if X < Y or 2" " Y. (2) ( X Z 1 . . " Zk, Y ) ~ (X, A Y ) if and only if Zk "~ Y: Zt-=-Zt÷~ for 1 ~ i < k, X < Z1, and A ~ Z~ . . . Zk is a production. We observe that if G is uniquely invertible, then the induced two-stack parser is deterministic, and conversely.
sEc. 6.2
LIMITED BACKTRACK BOTTOM-UP PARSING
491
Example 6.12
Let G be the grammar with productions (1) S
> aSA
(2) S
> bSA
(3) S
>b
(4) A
>a
as in Example 6.i0. Let < , ~--, and 3> be defined by Fig. 6.3. a
b
.>
'
on N U E U {$} which a deterministic two-stack parser which is valid for G.
We call the three relations above Colmerauer precedence relations. Note that condition (3) implies that a Colmerauer grammar must be uniquely invertible. Example 6.13
The relations of Fig. 6.3 are Colmerauer precedence relations, and G of Examples 6.10 and 6.12 is therefore a Colmerauer grammar. D Example 6.14
Every simple precedence grammar is a Colmerauer grammar. Let be the Wirth-Weber precedence relations for the grammar G = (N, ~, P, S). If G is simple precedence, it is by definition proper and unambiguous. The induced two-stack parser acts almost as the shift-reduce precedence parser. However, when a reduction of right-sentential form ocflw to 0~Aw is made, we wind up with $~ on the first stack and Aw$ on the second, whereas in the precedence parsing algorithm, we would have $~A on the pushdown list and w$ on the input. If Xis the last symbol of $~, then either X < A or X ~ A, by Theorem 5.14. Thus the next move of the two-stack parser must shift the A to the first stack. The two-stack parser then acts as the simple precedence parser until the next reduction. Note that if the Colmerauer precedence relations are the Wirth-Weber ones, then the induced two-stack parser yields rightmost parses. In general, however, we cannot always expect this to be the case. D
SEC. 6.2 6.2.4.
LIMITED BACKTRACK BOTTOM-UP PARSING
493
Test for Colmerauer Precedence
We shall give a necessary and sufficient condition for an unambiguous, proper grammar to be a Colmerauer grammar. The condition involves three relations which we shall define below. We should recall, however, that it is undecidable whether a C F G is unambiguous and that, as we saw in Example 6.12, there are ambiguous grammars which have deterministic two-stack parsers. Thus we cannot always determine whether an arbitrary C F G is a Colmerauer grammar unless we know a priori that the grammar is unambiguous. DEFINITION
Let G = (N, X, P, S) be a CFG. We define three new relations 2 (for left), ,u (for mutual or adjacent), and p (for right) on N W X as follows: For all X a n d Yin N U X, A in N, (1) A2Y if A ~ Y0¢ is a production, (2) XtuY if A ~ ocXYfl is a production, and (3) X p A if A ---~ ~zX is a production. o0
As is customary, for each relation R we shall use R ÷ to denote U Rt and i=1 oo
R* to denote U RE Recall that R ÷ and R* can be conveniently computed i=0
using Algorithm 0.2. Note that the Wirth-Weber precedence relations < , ~--, and .> on N W £ can be defined in terms of 2,/t, and p as follows: (1) < = / t 2 +. (2) ±
=
u.
(3) > = p + a 2 * m ( N u X )
x £.
The remainder of this section is devoted to proving that an unambiguous, proper C F G has Colmerauer precedence relations if and only if
(1) p+,u ~ lt2* = ~3, and (2) ,u ~ p*/.z2+ = ~ . Example 6.15
Consider the previous grammar
Here
S
> aSA [bSA
A
>a
Ib
494
LIMITEDBACKTRACK PARSING ALGORITHMS
CHAP. 6
~, = {(S, a), (S, b), (A, a)} /t = {(a, S), (S, A), (b, S)} p = [(A, S), (b, S), (a, A)] p+/, = [(A, A), (b, A), (a, A)}
~u~,* = {(a, S), (S, A), (b, S), (a, a), (a, b), (S, a), (b, a), (b, b)} p*/z~ + -- [(a, a), (a, b), (b, a), (b, b), (S, a), (A, a)} Since p+/t A ,uA* = ~ and p*,uA + ~ ,u = ~ , G has Colmerauer precedence relations, a set of which we saw in Fig. 4.3. We shall now show that if a grammar G contains symbols X and Y such that X l t Y and Xp*lt2+Y, then G cannot be a Colmerauer grammar. Here, X and Y need not be distinct. LEMMA 6.7
Let G = (N, Z, P, S) be a Colmerauer grammar with Colmerauer precedence relations < , ~-, and ->. If XItY, then X ' Y. Proof Since G is proper, there exists a derivation in which a production A - - ~ a X Y f l is used. When parsing a word w in L(G) whose derivation involves that production, at some time aXYfl must appear at the top of stack 1 and be reduced. This can happen only if X_~- Y. [Z] LEMMA 6.8
Let G -- (N, Z, P, S) be a C F G such that for some X and Y in N u Z, X p*/zA + Y and X/z Y. Then G is not a Colmerauer grammar. Proof. Suppose that G is. Let G have Colmerauer relations < , --', and >• , and let T b e the induced two-stack parser. Since G is assumed to be proper, there exist x and y in Z* such that X *=~ x and Y *=~ y. Since X/z Y, there is a production A ~ aXYfl and strings wl, w2, w3, and w4 in I;* such that S *~ wlAw4 ==~ wlocXYflw4 ==~ * wlwzXYw3w 4 ==~ * wlw2xyw3w4. Since X p*,uA + Y, there exists a production B ~ 7ZC5 such that Z ~ ~,'X, C =~YO', and for some zl, z2, z3, and z4, we have S ~ zaBz4 =-~ za?ZCSza zly?'XYO'Sz4 =~ zlz~ XYz3 z4 =~ zlz2xyz3z4. By Lemma 6.7, we may assume that X " Y. Let us watch the processing by T of the two strings u = wlw2xyw 3 w4 and v = zlz2xyz3z 4. In particular, let us concentrate on the strings to which x and y are reduced in each case, and whether these strings appear on stack 1, stack 2, or spread between them. Let 01, 0 2 , . . . be the sequence of strings to which xy is reduced in u and W1, W2,. • • that sequence in v. We know that there is some j such
SEC. 6.2
LIMITED BACKTRACK BOTTOM-UP PARSING
495
that 0j = X Y , because since G is assumed to be unambiguous, X and Y must be reduced together in the reduction of u. If ~F~ = 0~ for 1 _< i < j , then when this Y is reduced in the processing of v, the X to its left will also be reduced, since X " Y. This situation cannot be correct, since C =:~ Y6' for some 8' in the derivation of v.t Therefore, suppose that for some smallest i, 1 < i < j , either 0~ ~ tF t or ~F~ does not exist (because the next reduction of a symbol of ~Ft also involves a symbol outside of W~). We know that if i > 2, then the break point between the stacks when 0~_ 1 and tF~_ i were constructed by a reduction was at the same position in 0~-1 as in tF~_ 1. Therefore, if the break point migrated out of 0t_ ~ before 0t was created, it did so for tF~_x, and it left in the same direction in each case. Taking into account the case i = 2, in which 01 = ~F1 = xy, we know that immediately before the creation of 0~ and ~F~, the break point between the stacks is either (1) Within 0t and ~F~, and at the same position in both cases; i.e., 0,._1 and W~_1 straddle the two stacks; or (2) To the right of both 0~_ 1 and tF~_1; i.e., both are on stack 1. Note that it is impossible for the break point to be left of 0~_ 1 and tF t_ 1 and still have these altered on the next move. Also, the number of moves between the creation of 0~_1 and 0~ may not be the same as the number between ~F~_~ and ~F~. We do not worry about the time that the break point spends outside these substrings; changes of these strings occur only when the break point migrates back into them. It follows that since 0t ~ ~F~, the first reduction which involves a symbol of tF~_ 1 must involve at least one symbol outside of tF~_ 1; i.e., tF t really does not exist, for we know that the reduction of 0~_ 1 to 0t involves only symbols of 0~_1, by definition of the 0's. If the next reduction involving W~-i were wholly within ~t-1, the result would, by (1) and (2) above, have to be that ~ - 1 was reduced to 0t. Let us now consider several cases, depending on whether, in 0~_ 1, x has been reduced to X and/or y has been reduced to Y. Case 1" Both have been reduced. This is impossible because we chose i~j. Case 2: y has not been reduced to Y, but x has been reduced to X. Now the reduction of ~ _ 1 involves symbols of ~ _ i and symbols outside of ~ _ 1. Therefore, the breakpoint is written 0~-1 and ~ - 1 , and a prefix of both is reduced. The parser on input u thus reduces X before Y. Since we have assumed that T is valid, we must conclude that there are two distinct parse
"l'Note that we are using symbols such as X and Y to represent specific instances of that symbol in the derivations, i.e., particular nodes of the derivation tree. We trust that the intended meaning will be clear.
496
LIMITED BACKTRACK PARSING ALGORITHMS
CHAP. 6
trees for u, and thus that G is ambiguous. Since G is unambiguous, we discard this case. Case 3: x has not been reduced to X, but y has been reduced to Y. Then 0~_ ~ = OY for some 0. We must consider the position of the stack break point in two subcases" (a) If the break point is within 0i_ x, and hence Wi_ x, the only way that tF t could be different from 0~ occurs when the reduction of 0~_~ reduces a prefix of 0t_ ~, and the symbol to the left of 0~_ 1is < related to the leftmost symbol of 0~_~. However, for W~-i, the " relation holds, so a different reduction occurs. But then, some symbols of Wt-1 that have yet to be reduced to X are reduced along with some symbols outside of W~_~. We rule out this possibility using the argument of case 2. (b) If the break point is to the right of Ot_ 1, then its prefix 0 can never reach the top of stack 1 without Y being reduced, for the only way to decrease the length of stack 1 is to perform reductions of its top symbols. But then, in the processing of u, by the time x is reduced to X, the Y has been reduced. However, we know that the X and Y must be reduced together in the unique derivation tree for u. We are forced to conclude in this case, too, that either T is not valid ~or G is ambiguous. Case 4: Neither x nor y have been reduced to X or Y. Here, one of the arguments of cases 2 and 3 must apply. We have thus ruled out all possibilities and conclude t h a t / t ~ p*,u~, + must be empty for a Colmerauer grammar. D
We now show that if there are symbols X and Y in a C F G G such that X p+,u Y and X,u2* Y, then G cannot be a Colmerauer grammar. LEMMA 6.9 Let G = (N, X, P, S) be a C F G such that for some X and Y in N u X, X p + g Y and X g 2 * Y . Then G is not a Colmerauer grammar. Proof. The proof is left for the Exercises, and is similar, unfortunately, to Lemma 6.8. Since X p*g Y, we can find A----~ o~ZYfl in P such that Z *~ ~'X. Since X g~+ Y, we can find B ~ yXC8 in P such that C ~ YS'. By the properness of G, we can find words u and v in L(G) such that each derivation of u involves the production A --, o~ZY]3 and the derivation of ~'X from that Z; each derivation o f v involves B ~ 7XC8 and the derivation of YS' from C. In each case, X derives x and Y derives y for some x and y in ~*. As in Lemma 6.8, we watch what happens to xy in u and v. In v, we find that Y must be reduced before X, while in u, either X and Y are reduced at the same time (if Z *~ ~'X is a trivial derivation) or X is reduced before
Y (if Z ⇒⁺ α′X). Using arguments similar to those of the previous lemma, we can prove that as soon as the strings to which xy is reduced in u and v differ, one derivation or the other has gone astray. □ Thus the conditions μ ∩ ρ*μλ⁺ = ∅ and ρ⁺μ ∩ μλ* = ∅ are necessary in a Colmerauer grammar. We shall now proceed to show that, along with unambiguity, properness, and unique invertibility, they are sufficient.
LEMMA 6.10
Let G = (N, Σ, P, S) be any proper grammar. Then if αXYβ is any sentential form of G, we have X ρ*μλ* Y.
Proof. Elementary induction on the length of a derivation of αXYβ. □
LEMMA 6.11
Let G = (N, Σ, P, S) be unambiguous and proper, with μ ∩ ρ*μλ⁺ = ∅ and ρ⁺μ ∩ μλ* = ∅. If αYX1 ··· XkZβ is a sentential form of G, then the conditions X1 μ X2, …, Xk−1 μ Xk, Y ρ*μλ⁺ X1, and Xk ρ⁺μλ* Z imply that X1 ··· Xk is a phrase of αYX1 ··· XkZβ.
Proof. If not, then there is some other phrase of otYX~ ... XkZfl which includes X1. Case 1: Assume that X 2 , . . . , Xk are all included in this other phrase. Then either Y or Z is also included, since the phrase is not X~ . . - X~. Assuming that Y is included, then YgX~. But we know that Y p*,uA + X~, so that ,u n p*,u2 + ~ $3. I f Z is included, then XkltZ. But we also have Xk P+g~* Z. If 2* represents at least one instance of 2, i.e., Xe p+,u2 + Z, then ,u n p*,u2 + ~ ~ . If 2* represents zero instances of 2, then Xkp +ltZ. Since Xk g Z, we have Xk g2* Z, SO p+g N g2* ~ ~. Case 2: X~ is in the phrase, but Xg+~is not for some i such that 1 < i < k . Let the phrase be reduced to A. Then by Lemma 6.10 applied to the sentential form to which we may reduce ttYX1... XkZfl, we have A p*,u2* Xt+I, and hence Xg p+,u2* X~+~. But we already have X~ ,u Xg÷l, so either ,u n p*,u2 + ~ or p+,u n , u 2 * ~ ~ , depending on whether zero or more instances of 2 are represented by 2* in p+/t2*. D LEMMA 6.12 Let G = (N, Z, P, S) be a C F G which is unambiguous, proper, and uniquely invertible and for which ,u n p*,u2 + = ~ and p+,u n ,u2* = ~ . Then G is a Colmerauer grammar.
Proof. We define Colmerauer precedence relations as follows:
(1) X ≐ Y if and only if X μ Y.
(2) X ⋖ Y if and only if X μλ⁺ Y, or X = $ and Y ≠ $.
(3) X ⋗ Y if and only if X ≠ $ and Y = $, or X ρ⁺μλ* Y but X μλ⁺ Y is false.
It is easy to show that these relations are disjoint. If ≐ ∩ ⋖ ≠ ∅ or ≐ ∩ ⋗ ≠ ∅, then μ ∩ ρ*μλ⁺ ≠ ∅ or μ ∩ ρ⁺μ ≠ ∅, in which case μλ* ∩ ρ⁺μ ≠ ∅. If ⋖ ∩ ⋗ ≠ ∅, then X μλ⁺ Y and also X μλ⁺ Y is false, an obvious impossibility. Suppose that T, the induced two-stack parser, has rule (YX1 ··· Xk, Z) → (Y, AZ). Then Y ⋖ X1, so Y = $ or Y μλ⁺ X1. Also, Xi ≐ Xi+1 for 1 ≤ i < k, so Xi μ Xi+1. Finally, Xk ⋗ Z, so Z = $ or Xk ρ⁺μλ* Z. Ignoring the case Y = $ or Z = $ for the moment, Lemma 6.11 assures us that if the string on the two stacks is a sentential form of G, then X1 ··· Xk is a phrase thereof. The cases Y = $ and Z = $ are easily treated, and we can conclude that every reduction performed by T on a sentential form yields a sentential form. It thus suffices to show that when started with w in L(G), T will continue to perform reductions until it reduces to S. By Lemma 6.10, if X and Y are adjacent symbols of any sentential form, then X ρ*μλ* Y. Thus either X μ Y, X ρ⁺μ Y, X μλ⁺ Y, or X ρ⁺μλ⁺ Y. In each case, X and Y are related by one of the Colmerauer precedence relations. A straightforward induction on the number of moves made by T shows that if X and Y are adjacent symbols on stack 1, then X ⋖ Y or X ≐ Y. The argument is, essentially, that the only way X and Y could become adjacent is for Y to be shifted onto stack 1 when X is the top symbol. The rules of T imply that X ⋖ Y or X ≐ Y. Since $ remains on stack 2, there is always some pair of adjacent symbols on stack 2 related by ⋗. Thus, unless configuration ($, S$, π) is reached by T, it will always shift until the tops of stacks 1 and 2 are related by ⋗. At that time, since the ⋖ relation never holds between adjacent symbols on stack 1, a reduction is possible and T proceeds. □
THEOREM 6.6
A grammar is a Colmerauer grammar if and only if it is unambiguous, proper, uniquely invertible, and μ ∩ ρ*μλ⁺ = ρ⁺μ ∩ μλ* = ∅.
Proof. Immediate from Lemmas 6.8, 6.9, and 6.12. □
Example 6.16
We saw in Example 6.15 that the grammar S → aSA | bSA | b, A → a satisfies the desired conditions. Lemma 6.12 suggests that we define Colmerauer precedence relations for this grammar according to Fig. 6.4. □
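The relations ≐, ⋖, and ⋗ of Lemma 6.12 can be computed mechanically from μ, λ, and ρ. The following is a minimal Python sketch of that computation for the grammar of Example 6.16; the encoding of productions, the names mu, lam, rho, and the helper functions are assumptions made for the sketch, not notation from the text.

# Grammar of Example 6.16:  S -> aSA | bSA | b,  A -> a
productions = [("S", ["a", "S", "A"]), ("S", ["b", "S", "A"]),
               ("S", ["b"]), ("A", ["a"])]
symbols = {s for lhs, rhs in productions for s in [lhs] + rhs} | {"$"}

mu  = {(rhs[i], rhs[i + 1]) for _, rhs in productions for i in range(len(rhs) - 1)}
lam = {(lhs, rhs[0])  for lhs, rhs in productions}   # A lam X  iff  A -> X ...
rho = {(rhs[-1], lhs) for lhs, rhs in productions}   # X rho A  iff  A -> ... X

def compose(r, s):
    return {(a, c) for (a, b) in r for (b2, c) in s if b == b2}

def closure(r, reflexive):
    # transitive (or reflexive-transitive) closure over the finite symbol set
    result = set(r) | ({(x, x) for x in symbols} if reflexive else set())
    while True:
        bigger = result | compose(result, result)
        if bigger == result:
            return result
        result = bigger

lam_plus, lam_star, rho_plus = closure(lam, False), closure(lam, True), closure(rho, False)

mu_lam_plus = compose(mu, lam_plus)
equal   = set(mu)                                    # X = Y  iff  X mu Y
less    = mu_lam_plus | {("$", x) for x in symbols if x != "$"}
greater = ({(x, "$") for x in symbols if x != "$"} |
           {p for p in compose(compose(rho_plus, mu), lam_star) if p not in mu_lam_plus})

for name, rel in (("=", equal), ("<", less), (">", greater)):
    print(name, sorted(rel))

Running the sketch prints the three relations as sets of ordered pairs, which can then be compared with the table of Fig. 6.4.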
EXERCISES
= ρ⁺μ ∪ … are Colmerauer precedence relations capable of parsing any sentential form of G.
6.2.13.
Show that every two-stack parser operates in time 0(n) on strings of length n.
"6.2.14.
Show that if L has a Colmerauer grammar, then L R has a Colmerauer grammar.
"6.2.15.
Is every (1, 1)-BRC grammar a Colmerauer grammar?
"6.2.16.
Show that there exists a Colmerauer language L such that neither L nor Lᴿ is a deterministic CFL. Note that the language {wbaⁿ | |w| = n}, which we have been using as an example, is not deterministic but that its reverse is.
• 6.2.17.
We say that a two-stack parser T recognizes the domain of 'r(T) regardless of whether a parse emitted has any relation to the input. Show that every recursively enumerable language is recognized by some deterministic two-stack parser. Hint: It helps to make the underlying grammar ambiguous.
Open Problem 6.2.18.
Characterize the class of CFG's, unambiguous or not, which have valid deterministic two-stack parsers induced by disjoint "precedence" relations.
BIBLIOGRAPHIC
NOTES
Colmerauer grammars and Theorem 6.6 were first given by Colmerauer [1970]. These ideas were related to the token set concept by Gray and Harrison [1969]. Cohen and Culik [1971] consider an LR(k)-based scheme which effectively incorporates backtrack. tThis is a special case of Lemma 6.7 and is included only for completeness.
APPENDIX
The Appendix contains the syntactic descriptions of four programming languages:
(1) A simple base language for an extensible language.
(2) SNOBOL4, a string-manipulating language.
(3) PL360, a high-level machine language for the IBM 360 computers.
(4) PAL, a language combining lambda calculus with assignment statements.
These languages were chosen for their diversity. In addition, the syntactic descriptions of these languages are small enough to be used in some of the programming exercises throughout this book without consuming an excessive amount of time (both human and computer). At the same time the languages are quite sophisticated and will provide a flavor of the problems incurred in implementing the more traditional programming languages such as ALGOL, FORTRAN, and PL/I. Syntactic descriptions of the latter languages can be found in the following references:
(1) ALGOL 60 in Naur [1963].
(2) ALGOL 68 in Van Wijngaarden [1969].
(3) FORTRAN in [ANS X3.9, 1966] (also see ANSI-X3J3).
(4) PL/I in the IBM Vienna Laboratory Technical Report TR 25.096.
A.1. SYNTAX FOR AN EXTENSIBLE BASE LANGUAGE
The following language was proposed by Leavenwortht as a base language which can be extended by the use of syntax macros. We shall give the syntax ?B. M. Leavenworth, "Syntax Macros and Extended Translation" Comm. A C M 9, No. 11 (November 1966), 790-793. Copyright © 1966, Association for Computing Machinery, Inc. The syntax of the base language is reprinted here by permission of the Association for Computing Machinery. 501
of this language in two parts. The first part consists of the high-level productions which define the base language. This base language can be used as a block-structured algebraic language by itself. The second part of the description is the set of productions which defines the extension mechanism. The extension mechanism allows new forms of statements and functions to be declared by means of a syntax macro definition statement using production 37. This production states that an instance of a (statement) can be a (syntax macro definition), which, by productions 39 and 40, can be either a (statement macro definition) or a (function macro definition). In productions 41 and 42 we see that each of these macro definitions involves a (macro structure) and a (definition). The (macro structure) portion defines the form of the new syntactic construct, and the (definition) portion gives the translation that is to be associated with the new syntactic construct. Both the (macro structure) and (definition) can be any string of nonterminal and terminal symbols except that each nonterminal in the (definition) portion must appear in the (macro structure). (This is similar to a rule in an SDTS except that here there is no restriction on how many times one n0nterminal can be used in the translation element.) We have not given the explicit rules for (macro structure) and (definition). In fact, the specification that each nonterminal in the (definition) portion appear in the (macro structure) portion of a syntax macro definition cannot be specified by context-free productions. Production 37 indicates that we can use any instance of a (macro structure) defined in a statement macro definition wherever (sentence) appears in a sentential form. Likewise, production 43 allows us to use any instance of a (macro structure) defined in a function macro definition anywhere (primary) appears in a sentential form. For example, we can define a sum statement by the derivation (statement) ~
(syntax macro definition) ⇒ (statement macro definition) ⇒ smacro (macro structure) define (definition) endmacro
A possible (macro structure) is the following:
    sum (expression)(1) with (variable) ← (expression)(2) to (expression)(3)
We can define a translation for this macro structure by expanding (definition) as
    begin local t; local s; local r;
        t ← 0;
        (variable) ← (expression)(2);
    r:  if (variable) > (expression)(3) then goto s;
        t ← t + (expression)(1);
        (variable) ← (variable) + 1;
        goto r;
    s:  result t
    end
Then if we write the statement
    sum a with b ← c to d
this would first be translated into
    begin local t; local s; local r;
        t ← 0;
        b ← c;
    r:  if b > d then goto s;
        t ← t + a;
        b ← b + 1;
        goto r;
    s:  result t
    end
before being parsed according to the high-level productions. Finally, the nonterminals (identifier), (label), and (constant) are lexical items which we shall leave unspecified. The reader is invited to insert his favorite definitions of these items or to treat them as terminal symbols.
High-Level Productions
1 (program) (block) 2 (block)--~ begin (opt local ids) (statement list) end 3 (opt local ids)--~ (opt local ids) local (identifier); [ e 5 (statement list) (statement) ](statement list); (statement) 7 (statement) --~ (variable) ~-- (expression) [goto (identifier)[ if (expression) then (statement) [(block) [result (expression) 1 (label): (statement)
13 (expression) --~ (arithmetic expression) (relation op) (arithmetic expression) ! (arithmetic expression) 15 (arithmetic expression) --~ (arithmetic expression) (add op) (term) [(term) 17 --~ endmacro 42 (function macro definition) fmacro define (definition) endmacro 43 (primary>--~ (macro structure) Notes
1. (macro structure) and (definition) can be any string of nonterminal or terminal symbols. However, any nonterminal used in (definition) must also appear in the corresponding (macro structure).
2. (constant), (identifier), and (label) are lexical variables which have not been defined here.
A.2. SYNTAX OF SNOBOL4 STATEMENTS
Here we shall define the syntactic structure of SNOBOL4 statements as described by Griswold et al.t The syntactic description is in two parts. The first part contains the context-free productions describing the syntax in terms of lexical variables which are described in the second part using regular definitions of Chapter 3. The division between the syntactic and lexical parts is arbitrary here, and the syntactic description does not reflect the relative precedence or associativity of the operators. All operators associate from left-to-right except --1, !, and **. The precedence of the operators is as follows: 1. & 4.@ 7./ 10. !**
2. I 5.+-8., 11. $.
3. 6.@ 9.% 12.--t ?
High-Level Productions
1 (statement>--~ (assignment statement> i (matching statement>! (replacement statement>l(degenerate statement>l 6 (assignment statement> (optional label> (equal> (object field> (goto field> 7 (matching statement> (subject field> (eos> 8 (replacement statement> --~ (goto field> (eos> 9 (degenerate statement> ---~ (optional label> (goto field> I (optional label> (goto field> (eos> 11 (end statement> --~ END l END (eos> I END END (eos> 14 (label> [e 16 --, (blanks> (element> tR. E. Griswold, J. F. Poage, and I. P. Polonsky, The SNOBOL4 Programming Language (2nd ed.) (Englewood Cliffs, N.J.: Prentice-Hall, Inc., 1971) pp. 198-199.
17 (equal) ---~ (blanks) = 18 (object field) (blanks) (expression) 19 (goto field) (blanks) : (optional blanks)(basic goto)l e 21 (basic goto)----~ (goto) !S (goto) (optional blanks) (optional F goto) I F (goto)(optional blanks)(optional S goto) 24 (goto)---~ ((expression)) [ < (expression) > 26 (optional S goto) ---~ S (goto) Ie 28 (optional F goto) --~ F(goto)le 30 (eos) (optional blanks~ ;i (optional blanks~ (eol~ 32 (pattern field~ (blanks) (expression~ 33 (element) (optional unaries~ (basic element~ 34 (optional unaries~---~ (operator) (optional unaries) 1e 36 ~basic element~ ~identifier~ [~literal~ i (function call) I(reference~ I((expressionS) 41 (function call~ ---~ (identifier~ ((arg list~) 42 (reference~ (identifier) < (arg list) > 43 (arg list~ ---~ (arg list), (expression~ I(expression) 45 (expression~----~ (optional blanks~ (element) (optional blanks) [ (optional blanks~ (operation~ (optional blanks) 1 (optional blanks~ 48 (optional blanks~ (blanks)! e 50 (operation~ (element) (binary) (element) l (element) (binary> (expression~ Regular Definitions for Lexical Syntax
(digit) = 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
(letter) = A | B | C | ··· | Z
(alphanumeric) = (letter) | (digit)
(identifier) = (letter) ((alphanumeric) | . | _)*
(blanks) = (blank character)+
(integer) = (digit)+
(real) = (integer) . (integer) | (integer) .
(operator) = ~ | ? | $ | . | ! | % | * | / | # | + | − | @ | | | &
(binary) = (blanks) | (blanks) (operator) (blanks) | (blanks) ** (blanks)
(sliteral) = ' ((EBCDIC character) − ')* '†
(dliteral) = " ((EBCDIC character) − ")* "
(literal) = (sliteral) | (dliteral) | (integer) | (real)
(label) = (alphanumeric) ((EBCDIC character) − ((blank character) | ;))* − END
Lexical Variables
(blank character)   (EBCDIC character)   (eol)‡
A.3. SYNTAX FOR PL360
This section contains the syntactic description of PL360, a high-level machine language devised by Nicklaus Wirth for the IBM 360 computers. The syntactic description is the precedence grammar given by Wirth [1968].§ tThe minus sign is a metasymbol here and on the following lines. :l:end of line. §Niklaus Wirth, "PL360, a Programming Language for the 360 Computers" J. ACM 15, No. 1 (January, 1968), 37-74. Copyright © 1968, Association for Computing Machinery, Inc. The syntax is reprinted by permission of the Association for Computing Machinery.
High-Level Productions
1 (register)--~ (identifier) 2 (cell identifier) ----~ (identifier) 3 (procedure identifier)----~ (identifier) 4 (function identifier) (identifier) 5 (cell)---~ (cell identifier) ](celll)) l (cell2)) 8 (celll) --, (cell2) (arith op) (number)! (cell3) (number) lO (cell2) (cell3) (register) 11 (cell3)
(cell identifier) ( 12 (unary op)--~ abs [ neg t neg abs
15 (arith opt +i-I,i/!++i-21 (logical op) ---~ and iorlxor
24 (shift op)---~ shla i shra Ishll [shrl 28 (register assignment) --~ (register) := (cell) 1 (register) := (number)l (register) := (string)[ (register) := (register) i (register) "= (unary op) (cell) ! (register) := (unary op)(number)l (register) := (unary op)(register) ! (register) := @ (cell)] (register assignment) (arith op) (cell) l (register assignment) (arith op) (number)[ (register assignment) (arith op) (register) ! (register assignment) (logical op) (cell) ! (register assignment) (logical op) (number)[ (register assignment) (logical op) (register) ! (register assignment) (shift op) (number) I (register assignment) (shift op) (register)
44 (funcl) --~ (func2) (number) I (func2) (register)[ (func2) (cell) [ (func2) (string) 48 (func2)--~ (function identifier) ) ](func 1), 50 (case sequence)----~ case (register) of begin I(case sequence) (statement) ; 52 (simple statement) ---~ (cell) := (register) I(register assignment) Inull [goto (identifier) [ (procedure identifier) ](function identifier) !(func 1) ( ] (case sequence) end I(blockbody) end 61 (relation) --~
1=[~=
67 (not)---~ --1
68 (condition) (register) (relation) (cell) I (register) (relation) (number)[ (register) (relation) (register) [ (register) (relation) (string) ! overflow 1(relation) I (cell) [(not) (cell) 76 (compound condition) (condition) ](comp aor) (condition) 78 (comp aor)--~ (compound condition) and 1(compound condition) or 80 (cond then.)----~ (compound condition) then 81 (true part) ---~ (simple statement) else 82 (while)--~ while 83 (cond do)---~ (compound condition) do 84 (assignment step) ---~ (register assignment) step (number) 85 (limit) until (register) Iuntil (cell) Iuntil (number) 88 (do) do
89 (statement*)
(simple statement)[ if (cond then) (statement*) I if (cond then) (true part) (statement*) I (while) (cond do) (statement*) l for (assignment step) (limit) (do) (statement*) 94 (statement)----~ (statement*) 95 (simple type) short integer [ integer ! logical treal [ long real Ibyte [ character 102 (type) ---~ (simple type) !array (number) (simple type) 104 (decll)--~ (type) (identifier) I(decl2) (identifier) 106 (decl2) (decl7), 107 (decl3) ---~ (decll~ = 108 (decl4) --~ (decl3) (I (decl5), 110 (decl5~ ---o (decl4) (number) I(decl4) (string) 112 (decl6) ---~ (decl3) 113 (dee17) --~ (decll) I (decl6) (number) [ (decl6) (string) I (decl5)) 117 ~function declarationl) function I(function declaration7) 119 (function declaration2) ---~ (function declarationl) (identifier) 120 (function declaration3)---~ (function declaration2) ( 121 (function declaration4) ---~ (function declaration3) (number) 122 (function declaration5) (function declaration4), 123 (function declaration6) (function declaration5) (number) 124 (function declaration7) --~ (function declaration6) ) 125 (synonymous dcl) ----~
(type) (identifier) syn[ (simple type) register (identifier) synl (synonymous dc3) (identifier) syn 128 (synonymous dc2) ----~ (synonymous de l) (cell) I (synonymous dcl) (number) 1 (synonymous dcl) (register) 131 (synonymous dc3) --~ (synonymous dc2), 132 (segment head) ----~ segment 133 (procedure headingl) --~ procedure [(segment head) procedure 135 (procedure heading2) --~ (procedure headingl) (identifier) 136 (procedure heading3) ---~ (procedure heading2) ( 137 (procedure heading4) ---~ (procedure heading3) (register) 138 (procedure heading5) (procedure heading4) ) 139 (procedure heading@ --~ (procedure heading5) ; 140 (declaration) --~ (decl7) I(function declaration7) l (synonymous dc2) l (procedure heading@ (statement*) I(segment head) base (register) 145 (label definition) (identifier): i 46 (blockhead) ---~ begin I(blockhead) (declaration) ; 148 (blockbody) (blockhead) 1 (blockbody) (statement) ;1 (blockbody) (label definition) 151 (program) ---~ . (statement). Lexical Variables
(identifier) (string) (number)
A.4. A SYNTAX-DIRECTED TRANSLATION SCHEME FOR PAL
Here we provide a simple syntax-directed translation scheme for PAL, a programming language devised by J. Wozencraft and A. Evanst embodying lambda calculus and assignment statements. PAL is an acronym for Pedagogic Algorithmic Language. The simple SDTS presented here maps programs in a slightly modified version of PAL into a postfix Polish notation. This SDTS is taken from DeRemer [1969].$ The SDTS is presented in two parts. The first part is the underlying context-free grammar, a simple LR(1) grammar. The second part defines the semantic rule associated with each production of the underlying grammar. We also provide a regular definition description of the lexical variables < relational functor > , < variable > and < constant > used in the SDTS. High-Level Productions
I (program)--~ (definition list)I := (tuple) Igoto (combination) [ res (tuple) I(tuple) (tuple) (T1)I(T1), (tuple) (T1) ---~ (T1) aug (conditional expression) I(conditional expression) (conditional expression) (boolean) --~ (conditional expression) I @onditional expression) [
33 (T2) $ (combination) !(boolean) 35 (boolean)--~ (boolean) or @onjunction) I@onjunction) 37 (conjunction) (conjunction) & (negation) l (negation) 39 (negation) not (relation) I(relation) 41 (relation) --~ (arithmetic expression)(relational functor) (arithmetic expression) [ (arithmetic expression) 43 (arithmetic expression)--~ (arithmetic expression) + (term)l (arithmetic expression) -- (term)! + ( t e r m ) l - (term)[ (term) 48 (term)--~ (term) • (factor) i (term) / (factor) [(factor) 51 (factor) --~ (primary) ** (factor) I(primary) 53 (primary) --~ (primary) 7o (variable) (combination) I(combination) 55 (combination) (combination) (rand) [(rand) 57 (rand) (variable) 1@onstant) I((expression)) I[(expression)] 61 (definition) --~
(inwhich definition) within (definition) l (inwhich definition) (inwhich definition) (inwhich definition) inwhich (simultaneous definition) I (simultaneous definition) (simultaneous definition)---~ (rec definition) and (simultaneous definition) I (rec definition) (rec definition) tee (basic definition) I(basic definition) (basic definition) (variable list)= (expression)[ (variable) (bv part) = (expression) I ((definition)) I[(definition)] (bv part)----~ (bv part) (basic bv) 1(basic bv) (basic bv) (variable) I((variable list)) i ( ) (variable list) (variable), (variable list) [(variable)
Rules Corresponding to the High-Level Productions
1 (program)= (definition list) I(expression) 3 (definition list) = (definition) (definition list) defl (definition) lastdef 5 (expression)= (definition) (expression) let l (bv part) (expression) lambda I (where expression) 8 (where expression) = (valof expression) (rec definition) where I (valof expression) 10 (valof expression) = (command) valof !(command) 12 (command)= (labeled command)(command); I(labeled command) 14 (labeled command) = (variable) (labeled command): I(conditional command) 16 (conditional command) = (boolean) (labeled command) (labeled command) test-true I (boolean) (labeled command) (labeled command) test-false I
(boolean) (labeled command) ift (boolean) (labeled command) unless I (boolean~ (labeled command~ while I (boolean~ (labeled command) until I (basic command~ 23 (basic command~ = (tuple) ~tuple) := l (combination~ goto[ (tuple) res I(tuple~ 27 (tuple)= (T1) I(T1) (tuple), 29 (TI~ = (T1 ~ (conditional expression) aug I(conditional expression~ 31 (conditional expression) = (boolea!3 ~ (conditional expression~ (conditional expression) test-true I (T2) 33 ( T 2 ) = (combination> $ I(boolean) 35 (boolean)= (boolean) (conjuction> or i (conjunction~ 37 (conjunction)= (conjunction) (negation) & l (negation) 39 (negatio@ = (relation~ not I(relation) 41 (relation)= (arithmetic expression) (arithmetic expression~ ..... (relational functor~ (arithmetic expression) 43 (arithmetic expression~ = (arithmetic expression~ (term~ ÷[ (arithmetic expression)(term) --[ (term~ pos !(term~ neg [(term) 48 ( t e r m ) = (term) (factory • [(term) (factor) / [(factor) 51 (factor)= (primary) (factory exp 1(primary) 53 (primary)= (primary) (variable~ (combination) % I(combination) 55 (combination)= (combination~ (rand) gamma l (randy 57 (randy = (variable~ 1(constant) [(expression) I(expression) 61 (definition)=
gj, and (3) Mij = −1 implies that fi < gj.
Assertion (1) is immediate from step (2) of Algorithm 7.1. To prove assertion (2), we note that if Mij = +1, then edge (Fi, Gj) is added to the linearization graph. Hence fi > gj, since the length of a longest path to a leaf from node Fi must be at least one more than the length of a longest path from Gj if the linearization graph is acyclic. Assertion (3) follows similarly.
Only if: Suppose that a linear precedence matrix M has linear precedence functions f and g but that the linearization graph for M has a cycle consisting of the sequence of nodes N1, N2, …, Nk, Nk+1, where Nk+1 = N1 and k > 1. Then by step (3), for all i, 1 ≤ i ≤ k, we can find nodes Hi and Ii+1 such that
(1) Hi and Ii+1 are original F's and G's;
(2) Hi and Ii+1 are represented by Ni and Ni+1, respectively; and
(3) either Hi is Fm, Ii+1 is Gn, and Mmn = +1, or Hi is Gm, Ii+1 is Fn, and Mnm = −1.
We observe by rule (2) that if nodes Fm and Gn are represented by the same Ni, then fm must equal gn if f and g are to be linearizing functions for M. Let f and g be the supposed linearizing functions for M. Let hi be fm if Hi is Fm, and let hi be gm if Hi is Gm. Let h′i be fm if Ii is Fm, and let h′i be gm if Ii is Gm. Then
h1 > h′2 = h2 > h′3 = ··· = hk > h′k+1.
But since Nk+1 is N1, we have h′k+1 = h1. However, we just showed that h1 > h′k+1. Thus, a precedence matrix with a cyclic linearization graph cannot have linear precedence functions. □
COROLLARY
Algorithm 7.1 computes linear precedence functions for M whenever they exist and produces the answer "no" otherwise. □
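As a rough illustration, Algorithm 7.1 can be programmed along the following lines. This Python sketch is ours, not the book's: it merges the F- and G-nodes with a small union-find, builds the linearization graph, and takes longest-path lengths as the function values, signalling failure when the graph is cyclic. The matrix is assumed to hold +1, 0, −1, or None (a blank).

def linear_precedence_functions(M):
    """M[i][j] in {+1, 0, -1, None}; returns (f, g) or raises if none exist."""
    n, m = len(M), len(M[0])
    parent = {("F", i): ("F", i) for i in range(n)}
    parent.update({("G", j): ("G", j) for j in range(m)})
    def find(x):                      # union-find: merge F_i and G_j when M[i][j] == 0
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    for i in range(n):
        for j in range(m):
            if M[i][j] == 0:
                parent[find(("F", i))] = find(("G", j))
    edges = {}                        # F_i -> G_j if +1,  G_j -> F_i if -1
    for i in range(n):
        for j in range(m):
            if M[i][j] == +1:
                edges.setdefault(find(("F", i)), set()).add(find(("G", j)))
            elif M[i][j] == -1:
                edges.setdefault(find(("G", j)), set()).add(find(("F", i)))
    visiting, longest = set(), {}     # longest path from each node; a cycle means "no"
    def depth(v):
        if v in visiting:
            raise ValueError("cyclic linearization graph: no linear functions exist")
        if v not in longest:
            visiting.add(v)
            longest[v] = max((depth(w) + 1 for w in edges.get(v, ())), default=0)
            visiting.discard(v)
        return longest[v]
    f = [depth(find(("F", i))) for i in range(n)]
    g = [depth(find(("G", j))) for j in range(m)]
    return f, g

Given a matrix in this form, linear_precedence_functions(M) returns the pair (f, g) whenever linear precedence functions exist.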
7.1.2. Applications to Operator Precedence Parsing
We can try to find precedence functions for any matrix whose entries have at most three values. The applicability of this technique is not affected by what the entries represent. To illustrate this point, in this section we shall show how precedence functions can be applied to represent operator precedence relations. Example 7.4
Consider our favorite grammar G0 with productions
E → E + T | T
T → T * F | F
F → (E) | a
The matrix giving the operator precedence relations for G0 is shown in Fig. 7.6.
Fig. 7.6 Matrix of operator precedence relations for G0.
Fig. 7.7 Reduced precedence matrix M'.
If we replace ⋖ by −1, ≐ by 0, and ⋗ by +1, we obtain the reduced precedence matrix M′ shown in Fig. 7.7. Here we have combined the rows labeled a and ) and the columns labeled ( and a. The linearization graph for M′ is shown in Fig. 7.8. From this graph we obtain the linear precedence functions f′ = (0, 0, 2, 4, 4) and g′ = (0, 5, 1, 3, 0) for M′. Hence, the linear precedence functions for the original matrix are f = (0, 0, 2, 4, 4, 4) and g = (0, 5, 1, 3, 5, 0). □
Fig. 7.8 Linearization graph for M′.
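Once f and g are in hand, the parser needs no matrix at all; it only compares two integers. The following small Python sketch is ours: it assumes f and g are dictionaries from symbols to the computed values, shows the comparison the parser performs, and checks that the functions reproduce every non-blank entry of a relation matrix (blank entries are necessarily lost).

def decide(f, g, x, a):
    """Return '<', '=', or '>' for stack symbol x and lookahead a."""
    return "<" if f[x] < g[a] else ">" if f[x] > g[a] else "="

def check(f, g, relations):
    """relations: dict mapping (x, a) to '<', '=', or '>', with blanks omitted."""
    return all(decide(f, g, x, a) == r for (x, a), r in relations.items())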
7.1.3. Weak Precedence Functions
As pointed o u t previously, --1, 0, and ÷ 1 of the matrix of Algorithm 7.1 can be identified with the Wirth-Weber precedence relations a. (2) If b ?~ a, then b ? a. (3) If A ?~ a, then either (a) A ? a or (b) For all Z in N U Z such that A ~ e Z is a production in P the relation Z .>~ a is false. (4) If X ?~ A, then either (a) X ? A or (b) For all Z in N U Z such that A ~ Z~¢ is in P the relation X ~ a . In the parsing of yxw, we note that the first symbol of w is not shifted until yx is reduced to flA. Therefore, the parsing of yxa will proceed exactly as that of yxw--until configuration ($IIA, aS, n') is reached. But A ?ca holds while A ? a does not, so the two parsers are not exactly equivalent.
Case 3: Suppose that condition (4) is violated. That is, we have X ?c A a n d some production A----~ Y~ such that X ? A is false and X
(
and error. We then attempt to find linear precedence functions for M, again attempting to represent as many blanks as possible by O's.
7.1.11.
Represent the < and ' relations of Fig. 7.13 with linear precedence functions. Use Theorem 7.2 to locate the essential blanks and attempt to preserve these blanks.
"7.1.12.
Show that under the definition of exact equivalence for weak precedence parsers a blank entry (X, Y) of the matrix of Wirth-Weber precedence relations is an essential blank if and only if one of the following conditions holds" (1) X and Y are in ~ U {$}; or (2) X is in N, Y is in X U {$}, and there is a production X ~ such that Z 3>c Y.
czZ
In the following problems, "equivalent" is used in the sense of Example 7.8. • 7.1.13.
Let ~c and ~ be shift-reduce parsing algorithms for a simple precedence grammar as in Theorem 7.2. Prove that ~c is equivalent to ~ if and only if the following conditions are satisfied" (1) (a) If X < c Y, then X < Y. (b) If X - - ~ Y, then X " Y. (c) If X 3>~ a, then X 3> a. (2) If b ?~ a, then b < a is false. (3) If A ?~a and A < a or A " a, then there is no derivation A ~r m 0~aX~=-~ . . . :=~ ~mXm, m > 1, such that for 1 < i < m, Xt % rm rm ~ a and X t 3> a, and Xm ">c a, or Xm is a terminal and X,n 3> a. (4) I f A ~ < a o r A~ " a f o r s o m e a , then there does not exist a derivation A~ ==~ A2 ==~ . . . ~ Am ==~ B0~, m > 1, a symbol X, and a production B ~ Y f l such that (a) X ? ¢ A ~ b u t X < A t , for2~i. Show that the parsers constructed from M and Mc by Algorithm 5.12 (or 5.14) are equivalent.
"7.1.16.
Consider a shift-reduce parsing algorithm for a simple precedence grammar in which after each reduction a check is made to determine whether the < or ~ relation holds between the symbol that was immediately to the left of the handle and the nonterminal to Which the handle is reduced. Under what conditions will an arbitrary matrix of precedence relations yield a parser that is exactly equivalent (or equivalent) to the parser of this form constructed from the canonical WirthWeber precedence relations ?
*'7.1.17.
Show that every CFL has a precedence grammar (not necessarily uniquely invertible) for which linear precedence functions can be found.
Research Problems
7.1.18.
Give an efficient algorithm to find linear precedence functions for a weak precedence grammar G that yields a parser which is equivalent to the canonical precedence parser for G.
7.1.19.
Devise good error recovery routines to be used in conjunction with precedence functions.
Programming Exercises
7.1.20.
Construct a program that implements Algorithm 7.1.
7.1.21.
Write a program that implements a shift-reduce parsing algorithm using linear precedence functions to implement the f and g functions.
7.1.22.
Write a program that determines whether a CFG is a precedence grammar that has linear precedence functions.
7.1.23.
Write a program that takes as input a simple precedence grammar G that has linear precedence functions and constructs for G a shift-reduce parser utilizing the precedence functions.
BIBLIOGRAPHIC
NOTES
Floyd [1963] used linear precedence functions to represent the matrix of operator precedence relations. Wirth and Weber [1966] suggested their use for representing Wirth-Weber precedence relations. Algorithms to compute linear precedence functions have been given by Floyd [1963], Wirth [1965], Bell [1969], Martin [1972], and Aho and Ullman [1972a]. Theorem 7.2 is from Aho and Ullman [1972b], which also contains answers to Exercises 7.1.13 and 7.1.15. Exercise 7.1.17 is from Martin [1972].
7.2. OPTIMIZATION OF FLOYD-EVANS PARSERS
A shift-reduce parsing algorithm provides a conceptually simple method of parsing. However, when we attempt to implement the two functions of the parser, we are confronted with problems of efficiency. In this section we shall discuss how a shift-reduce parsing algorithm for a uniquely invertible weak precedence grammar can be implemented using the Floyd-Evans production language. The emphasis in the discussion will be on methods by which we can reduce the size o f the resulting Floyd-Evans production language program without changing the behavior of the parser. Although we only consider precedence grammars here, the techniques of this section are also applicable to the implementation of parsers for each of the other classes of grammars discussed in Chapter 5 (Volume I). 7.2.1.
Mechanical Generation of Floyd-Evans Parsers for Weak Precedence Grammars
We begin by showing how a Floyd-Evans production language parser can be mechanically constructed for a uniquely invertible weak precedence grammar. The Floyd-Evans production language is described in Section 5.4.4 of Chapter 5. We shall illustrate the algorithm by means of an example. As expected, we shall use for our example the weak precedence grammar Go with productions (1) E - ~ E + T (2) E ~ T (3) T ~
T, F
(4) T - r E (5) F ~ (E) (6) F---~a The Wirth-Weber precedence relations for Go were shown in Fig. 7.9. (p. 553). From each row of this precedence matrix we shall generate statements of the Floyd-Evans parser. We use four types of statements: shift statements, reduce statements, checking statements, and computed goto statements.? We shall give statements symbolic labels that denote both the type of statement and the top symbol of the pushdown list. In these labels we shall use S for shift, R for reduce, C for checking, and G for goto, followed by the symbol assumed to be on top of the pushdown list. We shall generate the shift statements first and then'the reduce statements. 1"The computed goto statement involves an extension of the production language of Section 5.4.4 in that the next label can be an expression involving the symbol .#, which, as in Section 5.4.4, represents an unknown symbol matching the symbol at a designated position on the stack or in the lookahead. While we do not wish to discuss details of implementation, the reader should observe that such computed gotos as are used here can be easily implemented on his favorite computer.
For weak precedence grammars the precedence relations ⋖ and ≐ indicate a shift and ⋗ indicates a reduction. The E-row of Fig. 7.9 generates the statements

(7.2.1)   SE:     E | )  → E ) |            * S)
                  E | +  → E + |            * S+
                 $E | $  →     |            accept
                  E |    →     |            error
The first statement states that if E is on top of the pushdown list and the current input symbol is ), then we shift ) onto the pushdown list, read the next input symbol, and go to the statement labeled S). If this statement does not apply, we see whether the current input symbol is + . If the second statement does not apply, we next see if the current input symbol is $. The relevant action here would be to go into the halting state accept if the pushdown list contained $E. Otherwise, we report error. Note that no reductions are possible if E is the top stack symbol. Since the first component of the label indicates the symbol on top of the pushdown list, we can in many cases avoid unnecessary checking of the top symbol of the pushdown list if we know what it is. Knowing that E is on top of the pushdown list, we could replace the statements (7.2.1) by SE:
        | )  → ) |            * S)
        | +  → + |            * S+
     $# | $  →   |            accept
        |    →   |            error
Notice that it is important that the error statement appear last. When E is on top of the pushdown list, the current input symbol must be ), +, or $. Otherwise we have an error. By ordering the statements accordingly, we can first check for ), then for +, and then for $, and if none of these is the current input symbol, we report error. The row for T in Fig. 7.9 generates the statements

(7.2.2)   ST:       T | *  → T * |            * S*
          RT:   E + T |    → E   |   emit 1   CT
                    T |    → E   |   emit 2   CT
          CT:         | )  →     |            SE
                      | +  →     |            SE
                      | $  →     |            SE
                      |    →     |            error
Here the precedence relation T ≐ * generates the first statement. The precedence relations T ⋗ ), T ⋗ +, and T ⋗ $ indicate that with T on top of the pushdown list we are to reduce. Since we are dealing with a weak precedence grammar, we always reduce using the longest applicable production, by Lemma 5.4. Thus, we first look to see if E + T appears on top of the pushdown list. If so, we replace E + T by E. Otherwise, we reduce T to E. When the two RT statements are applicable, we know that T is on top of the pushdown list. Thus we could use
RT:   E + # |    → E |   emit 1   CT
          # |    → E |   emit 2   CT
and again avoid the unnecessary checking of the top symbol on the pushdown list. After we perform the reduction, we check to see whether it was legal. That is, we check to see whether the current input symbol is either ), + , or $. The group of checking statements labeled CT is used for this purpose. We report error if the current input symbol is not ), + , or $. Reducing first and then checking to see if we should have made a reduction may not always be desirable, but by performing these actions in this order we shall be able to merge common checking operations. To implement this checking, we shall introduce a computed goto statement of the form
G:        # |    → # |            S#
indicating that the top symbol of the pushdown list is to become the last symbol of the label. Now we can replace the checking statements in (7.2.2) by the following sequence of statements"
CT:         | )  →   |            G
            | +  →   |            G
            | $  →   |            G
            |    →   |            error
G:        # |    → # |            S#
We shall then be able to use these checking statements in other sequences. For example, if a reduction in G Ois accomplished with T on top of the stack, the new top of the stack must be E. Thus, the statements in the CT group could all transfer to SE. However, in general, reductions to several different nonterminals, could be made, and the computed goto is quite useful in establishing the new top of the stack.
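To make the behavior of such statement groups concrete, here is a toy interpreter for them, written in Python. It is our own simplification, not the notation of Section 5.4.4: each statement carries a label, a stack pattern, an optional lookahead, a replacement, a shift flag, an optional production number to emit, and a next label, where the next label S# stands for the computed goto and '#' in a pattern matches any one symbol.

def run(statements, tokens, start_label="S$"):
    by_label = {}
    for st in statements:
        by_label.setdefault(st[0], []).append(st)
    stack, inp, parse, label = ["$"], list(tokens) + ["$"], [], start_label
    while True:
        for _, pat, look, repl, advance, emit, nxt in by_label[label]:
            top = stack[len(stack) - len(pat):] if pat else []
            if len(top) < len(pat) or any(p not in ("#", t) for p, t in zip(pat, top)):
                continue                              # stack pattern does not match
            if look is not None and inp[0] != look:
                continue                              # lookahead does not match
            if nxt == "accept":
                return parse                          # the right parse emitted so far
            if nxt == "error":
                raise SyntaxError("input rejected at " + repr(inp[0]))
            new_top = [top[i] if r == "#" else r for i, r in enumerate(repl)]
            stack[len(stack) - len(pat):] = new_top   # rewrite the top of the stack
            if advance:
                inp.pop(0)                            # the '*' of a shift statement
            if emit is not None:
                parse.append(emit)
            label = "S" + stack[-1] if nxt == "S#" else nxt
            break
        else:
            raise SyntaxError("no statement applies in group " + label)

Driving it with tuples that transcribe the SE, ST/RT/CT, and similar groups constructed in this section yields a right parse of the input, one emitted production number per reduction.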
Finally, for convenience we shall allow statements to have more than one label. The use of this feature, which is not difficult to implement, will become apparent later. We shall now give an algorithm which makes use of the preceding ideas. ALGORITHM 7.2 Floyd-Evans parser from a uniquely invertible weak precedence grammar. lnput. A uniquely invertible weak precedence grammar G = (N, E, P, S). Output. A Floyd-Evans production language parser for G. Method.
(1) Compute the Wirth-Weber precedence relations for G. (2) Linearly order the elements in N U E W [$} as (X1, X2 . . . . , Y m ) , (3) Generate statements for X~, X2 . . . . . Xm as follows. Suppose that X~ is not the start symbol. Suppose further that either X~ ~ a or Xt " a for all a in (al, a 2 , . . . , a:}, and Xt -~ b for all b in (b 1 , . . . , bl}. Also, suppose A 1~ ~lXi, A 2 ~ ~2X~,..., A k ~ ~kXi are the productions having X~ as the last symbol on the right-hand side, arranged in an order such that 0~Xt is not a suffix of 0~qX~for p < q. Moreover, let us assume that Ah ~ ehX~ has number Ph, 1 ~ h ~ k. Then generate the statements
sx, "
l a, l az
~ a, t
, Sa~
> a2 [
* Saz
°
l aj ~ RXi:
0~1~[ 0~2@1
• Sa k
a:l
-
>All
emitpl
cx,
>A21
emitp2
cx,
> Ak I
emit Pk
cx,
,
°
°
0~k~l
c~-
I
1
Ib~
I
error G G
G
[bl
L
error
If j is zero, then the first statement of the RX,. group also has label SX~. If k is zero, the error statement in the RX~ group has label RX~. If X,. is the start symbol, then we do as above and also add the statement
   $# | $  →   |            accept
to the end of the S X i group. S$ is the initial statement of the parser. (4) Append the computed goto statement: G:
        # |    → # |            S#
□
Example 7.9
Consider the grammar G0. From the F row of the precedence matrix we would get the following statements:
SF:
RF:†  T * # |    → T |   emit 3   CF
          # |    → T |   emit 4   CF
            |    →   |            error
CF:         | )  →   |            G
            | +  →   |            G
            | *  →   |            G
            | $  →   |            G
            |    →   |            error
Note that the third statement is useless, as the second statement will always produce a successful match. We could, of course, incorporate a test into Algorithm 7.2 which would cause useless statements not to be produced. From now on we shall assume useless statements are not generated. From the a row we get the following statements:
Sa:
Ra:       # |    → F |   emit 6   Ca
Ca:         | )  →   |            G
            | +  →   |            G
            | *  →   |            G
            | $  →   |            G
            |    →   |            error
Notice that the checking statements for a are identical to those for F. ~Note the use of multiple labels for a location. Here, the SF group is empty.
In the next section we shall outline an algorithm which will merge redundant statements. In fact, the checking statements labeled CT could also be merged with CF if we write
Ca:
CF:         | *  →   |            G
CT:         | )  →   |            G
            | +  →   |            G
            | $  →   |            G
            |    →   |            error
Our merging algorithm will also consider partial mergers of this nature. The row labeled ( in the precedence matrix generates the statements
S(:         | (  → ( |            * S(
            | a  → a |            * Sa
            |    →   |            error
Similar statements are also generated by the rows labeled + , ,, and $.
[--]
We shall leave the verification of the fact that Algorithm 7.2 produces a valid right parser for G for the Exercises.
7.2.2.
Improvement of Floyd-Evans Parsers
In this section we shall consider techniques which can be used to reduce the number of shift and checking statements in the Floyd-Evans parser that results from Algorithm 7.2. Our basic technique will be to merge common shift statements and common checking statements. The procedure may introduce additional statements having the effect of an unconditional transfer, but we shall assume that these branch statements have relatively small cost. We shall treat the merger of shift statements here; the same technique can be used to merge checking statements. Let G = (N, E, P, S) be a uniquely invertible weak precedence grammar. Let M be its matrix of Wirth-Weber precedence relations. The matrix M determines the shift and checking statements that arise in Algorithm 7.2. From the precedence matrix M, we construct a merged shift matrix Ms as follows: (1) Delete all .> entries and replace the ~" entries by
1. For the LR(0) case, the lookahead string will always be the empty string. Therefore, all errors must be caught by the goto entries. Thus, in a set of LR(0) tables not all error entries in the goto field are don't cares. We leave it for the Exercises to deduce which of these entries are don't cares. Example 7.18
Let G be the LR(I) grammar with productions (1) S - - ~ S a S b (2) S---, e The canonical set of LR(k) tables for G is shown in Fig. 7.24(a), and the tables after application of Algorithm 7.6 are shown in Fig. 7.24(b). Note that in Fig. 7.24(b) all error's in the right-hand portions of the tables (goto fields) have been replaced by ~0's. In the action fields, To has been left intact by rule (2a) of Algorithm 7.6. The only shift actions occur in tables T3 and T7; these result in T4, T~, or T7 becoming the governing table. Thus, e r r o r entries in the action fields of these tables are left intact. We have changed e r r o r to ~oelsewhere. [~] THEOREM 7.5 The set of tables 3' constructed by Algorithm 7.6 is (0-inaccessible and equivalent to the canonical set 3. P r o o f The equivalence of :3 and :3' is immediate, since in Algorithm 7.6, the only alterations of 3 are to replace error's by ~0's, and the LR(k) parsing algorithm does not distinguish between error and ~0 in any way.t We shall now show that if a ~0entry is encountered by Algorithm 7.5 using the set of tables 3', then 3' was not properly constructed from 3. Suppose that 3' is not ~0-inaccessible. Then there must be some smallest i such that Co ]'---" C, where Co is an initial configuration and in configuration C Algorithm 7.5 consults a ~0-entry of 3'. Since To is not altered in step (2a), we must have i > 0. Let C = (To X1T1 " " XmTm, w, 7z), where T,, = ( f , g ) and FIRSTk(w ) = u. There are three ways in which a ~0-entry might be encountered.
tThe purpose of the ~0'sis only to mark entries which can be changed.
(a) Canonical Set of LR(1) Tables
(b) φ-Free Set of Tables
Fig. 7.24 A set of LR(1) tables before and after application of Algorithm 7.6.
Case 1: Suppose that f(u) -- qo.Then by step (2aii) and (2aiii) of Algorithm 7.6, the previous move of the parser could not have been shift and must have been reduce. Thus, Co [i-~ C'k- C, where C' = ( T o X i T , . . . Xm_,T,,_, Y, U, . . .
YrU,, x, n'),
and the reduction was by production Xm ~ Y1 "'" Y,. (n is n' followed by the number of this production.) Let us consider the set of items from which the table Ur is constructed. (If r = O, read Tin_1 for Ur). This set of items must include the item
[Xm → Y1 ··· Yr·, u]. Recalling the definition of a valid item, there is some y in Σ* such that the string X1 ··· Xm u y is a right-sentential form. Suppose that the nonterminal Xm is introduced by production A → αXmβ. That is, in the augmented grammar we have the derivation
S′ ⇒*rm γAx ⇒rm γαXmβx
where γα = X1 ··· Xm−1. Since u is in FIRSTk(βx), the item [A → αXm·β, v] must be valid for X1 ··· Xm if v = FIRSTk(x). We may conclude that the parsing action of table Tm on u in the canonical set of tables was not error and thus could not have been changed to φ by Algorithm 7.6, contradicting what we had supposed.
Case 2: Suppose that f ( u ) = shift and that a is the first symbol of u but that g(a)--~0. Since f (u) -- shift, in the set of items associated with table Tm there is an item of the form [A--~ ~.afl, v] such that u is in EFF(aflv) and [A ~ ~.afl, v] is valid for the viable prefix X 1 - . . Arm. (See Exercise 7.3.8.) But it then follows that [A--~ ~a.fl, v] is valid for X ~ . . . Xma and that X ~ . . . Xma is also a viable prefix. Hence, the set of valid items for X~ . . . X,,a is nonempty, and g(a) should not be ~0 as supposed. Case 3: Suppose that f ( u ) -- reduce p, where production p is Xr "'" Xm, T,_~ is ( f ' , g'), and g'(A) = ~0. Then the item [A --~ )Yr "'" Xm., u] is valid for X1 . . . X~, and the item [A ~ • X, . . . Xm, U] is valid for X1 . . . X~_t. In a manner similar to case 1 we claim that there is an item [B ~ ~A. fl, v] which is valid for X, . . . X~_iA, and thus g'(A) should not be ~0. [~ We can also show that Algorithm 7.6 changes as many error entries in the canonical set of tables to ~ as possible. Thus, if we change any e r r o r entry in 3' to (p, the resulting set of tables will no longer be ~0-inaccessible. 7.3.4.
Table Mergers by Compatible Partitions
In this section we shall present an important technique which can be used to reduce the size of a set of LR(k) tables. This technique is to merge two tables into one whenever this can be done without altering the behavior of the LR(k) parsing algorithm using this set of tables. Let us take a to-inaccessible set of tables, and let T1 and Tz be two tables i n t h i s set. Suppose that whenever the action or goto entries of Ti and T2 disagree, one of them is ~0. Then we say that Ti and T2 are compatible, and we can merge T, and 7"2, treating them as one table. In fact, we can do more. Suppose that T~ and 7'2 were almost compatible but disagreed in some goto entry by having 7'3 and 7'4 there. If 7'3 and T4 were themselves compatible, then we could simultaneously merge T3 with
T4 and T~ with T~. And if T3 and T4 missed being compatible only because they had T~ and T2 in corresponding goto entries, we could still do the merger. We shall describe this merger algorithm by defining a compatible partition on a set of tables. We shall then show that all members of each block in the compatible partition may be simultaneously merged into a single table. DEFINITION
Let (3, To) be a set of ¢p-inaccessible LR(k) tables and let II =[$1, • • •, Sp} be a partition on 3. That is, 8, U $2 U -.- U 8p -- 3, and for all i ~ j, ~;, and Sj are disjoint. We say that II is a compatible partition of width p if for all blocks 8~, 1 < i ~ p, whenever ( f l , g , ) and (fz, g , ) are in St, it follows that (1) f~(u) ~f2(u) implies that at least one off,(u) andf2(u) is ~0and that (2) g~(X) ~ g2(X) implies that either (a) At least one of g~(X) and g2(X) is ~, or (b) g~(X) and g2(X) are in the same block of II. We can find compatible partitions of a set of LR(k) tables using techniques reminiscent of those used to find indistinguishable states of an incompletely specified finite automaton. Our goal is to find compatible partitions of least width. The following algorithm shows how we can use a compatible partition of width p on a set of LR(k) tables to find an equivalent set containing p tables. ALGORITHM 7.7 Merger by compatible partitions.
Input. A φ-inaccessible set (𝒯, T0) of LR(k) tables and a compatible partition Π = {𝒮1, …, 𝒮p} on 𝒯.
Output. An equivalent φ-inaccessible set (𝒯′, T0′) of LR(k) tables such that #𝒯′ = p.
Method.
(1) For all i, 1 ≤ i ≤ p, construct the table Ui = ⟨f, g⟩ from the block 𝒮i of Π as follows:
(a) Suppose that ⟨f′, g′⟩ is in 𝒮i and that for lookahead string u, f′(u) ≠ φ. Then let f(u) = f′(u). If there is no such table in 𝒮i, set f(u) = φ.
(b) Suppose that ⟨f′, g′⟩ is in 𝒮i and that g′(X) is in block 𝒮j. Then set g(X) = Uj. If there is no table in 𝒮i with g′(X) in 𝒮j, set g(X) = φ.
(2) T0′ is the table constructed from the block containing T0.
[--]
The definition of compatible partition ensures us that the construction given in Algorithm 7.7 will be consistent.
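A direct encoding of Algorithm 7.7 is short. In the following Python sketch (our own representation, not the book's: a table is a pair of dictionaries, φ is the marker PHI, and a partition is a list of blocks of table names), merge builds one table per block, and compatible restates the pairwise condition of the definition above.

PHI = "phi"

def compatible(t1, t2, same_block):
    """True if t1, t2 could share a block: conflicting entries must involve PHI,
    except that goto conflicts may instead point into one common block."""
    act1, goto1 = t1
    act2, goto2 = t2
    for u in set(act1) | set(act2):
        a, b = act1.get(u, PHI), act2.get(u, PHI)
        if a != b and PHI not in (a, b):
            return False
    for X in set(goto1) | set(goto2):
        a, b = goto1.get(X, PHI), goto2.get(X, PHI)
        if a != b and PHI not in (a, b) and not same_block(a, b):
            return False
    return True

def merge(tables, partition):
    """tables: dict name -> (action, goto); returns one new table per block."""
    block_of = {name: i for i, block in enumerate(partition) for name in block}
    merged = {}
    for i, block in enumerate(partition):
        action, goto = {}, {}
        for name in block:
            act, go = tables[name]
            for u, a in act.items():
                if a != PHI:
                    action[u] = a                   # rule (1a) of Algorithm 7.7
            for X, t in go.items():
                if t != PHI:
                    goto[X] = "U%d" % block_of[t]   # rule (1b): refer to the block
        merged["U%d" % i] = (action, goto)
    return merged

For the partition of Example 7.19, the blocks {T3, T10} and {T14, T20} would be handed to merge along with singleton blocks for the remaining tables.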
Example 7.19
Consider G0, our usual grammar for arithmetic expressions:
(1) E → E + T
(2) E → T
(3) T → T * F
(4) T → F
(5) F → (E)
(6) F → a
The φ-inaccessible set of LR(1) tables for G0 is given in Fig. 7.25. We observe that T3 and T10 are compatible and that T14 and T20 are compatible. Thus, we can construct a compatible partition with {T3, T10} and {T14, T20} as blocks and all other tables in blocks by themselves. If we replace {T3, T10} by U1 and {T14, T20} by U2, the resulting set of tables is as shown in Fig. 7.26. Note that goto entries T3, T10, T14, and T20 have been changed to U1 or U2 as appropriate. The compatible partition above is the best we can find. For example, T16 and T17 are almost compatible, and we would group them in one block of a partition if we could also group T10 and T20 in a block of the same partition. But T10 and T20 disagree irreconcilably in the actions for +, *, and ).
□
We shall now prove that Algorithm 7.7 produces as output an equivalent φ-inaccessible set of LR(k) tables.
THEOREM 7.6
Let 𝒯′ be the set of LR(k) tables constructed from 𝒯 using Algorithm 7.7. Then 𝒯 and 𝒯′ are equivalent and φ-inaccessible.
Proof. Let T′ in 𝒯′ be the table constructed from the block of the compatible partition that contains the table T in 𝒯. Let C0 = (T0, w, e) and C0′ = (T0′, w, e) be initial configurations of the LR(k) parser using 𝒯 and 𝒯′, respectively. We shall show that
(7.3.1)   C0 ⊢^i (T0X1T1 ··· XmTm, x, π) using 𝒯
          if and only if
          C0′ ⊢^i (T0′X1T1′ ··· XmTm′, x, π) using 𝒯′
That is, the only difference between the LR(k) parser using 3 and 3' is that in using 3' the parser replaces table T in 3 by the representative of the block of T in the partition H. We shall prove statement (7.3.1) by induction on i. Let us consider the "only if" portion. The basis, i -- 0, is trivial. For the inductive step, assume that statement (7.3.1) is true for i. Now consider the i + 1st move. Since ;3 is to-inaccessible, the actions of Tm and T" on FIRSTk(x ) are the same.
Fig. 7.25 φ-inaccessible tables for G0.
Suppose that the action is shift, that a is the first symbol of x, and that the goto entry of Tm on a is T. By Algorithm 7.7 and the definition of compatible partition, the goto of T'~ on a is T' if and only if T' is the representative of the block of II containing T.
Fig. 7.26 Merged tables.
If the action is reduce by some production with r symbols on the right, then comparison of Tm-r and T'm-, yields the inductive hypothesis for i + 1. For the "if" portion of (7.3.1), we have only to observe that if T'm has a non-fo action on FIRST(x), then T'~ must agree with Tin, since 3 is ~0inaccessible. D We observe that Algorithm 7.7 preserves equivalence in the strongest sense. The two sets of tables involved always compute next configurations for the same number of steps regardless of whether the input has a parse.
7.3.5. Postponement of Error Checking
Our basic technique in reducing the size of a set of LR(k) tables is to merge tables wherever possible. However, two tables can be merged into one only if they are compatible. In this section we shall discuss a technique which can be used to change essential error entries in certain tables to reduce entries with the hope of increasing the number of compatible pairs of tables in a set of LR(k) tables. As an example let us consider tables T4 and T~ in Fig. 7.25 (p. 595) whose action fields are shown below" action a
+
.
(
)
e
T4:
X
6
6
X
X
6
Taa:
X
6
6
X
6
X
If we changed the action of T4 on lookahead ) from error to reduce 6 and the action of T11 on e from error to reduce 6, then T4 and Tll would be compatible and could be merged into a single table. However, before we make these changes, we would like to be sure that the error detected by Ta o n ) and t h e error detected by Tll on e will be detected on a subsequent move before a shift move is made. In this section we shall derive conditions under which we can postpone error detection in time without affecting the position in the input string at which an LR(k) parser announces error. In particular, it is easy to show that any such change is permissible in the canonical set of tables. Suppose that an LR(k) parser is in configuration (To X~ T~ . . . XmT=, w, zt) and that table Tm has action error on lookahead string u - FIRSTk(w ). Now, suppose that we change this error entry in Tm to reduce p, where p is production A --~ Y1 "'" Yr. There are two ways in which this error could be subsequently detected. In the reduction process 2r symbols are removed f r o m the pushdown list and table Tm-r is exposed. If the goto of Tm-r on A is ~, then error would be announced. We could change this qa to error. However, the goto of Tm-r on A could be some table T. If the action of T on lookahead u is error or qa (which we can change to error to preserve ~-inaccessibiiity), then we would catch the error at this point. We would also maintain error detection if the action of T on u was reduce p' and the process above was repeated. In short, we do not want any of the tables that become governing tables after the reduction by production p to call for a shift on lookahead u (or for acceptance). Note that in order to change error entries to reduce entries, the full
generality of the definition of equivalence of sets of LR(k) tables is needed here. While parsing, the new set of tables may compute next configurations several steps farther than the old set, but no input symbols will be shifted. To describe the conditions under which this alteration can take place, we need to know, for each table appearing on top of the pushdown list during the parse, what tables can appear as the r -+- 1st table from the top of the pushdown list. We begin by defining three functions on tables and strings of grammar symbols. DEFINITION
Let (3, To) be a set of LR(k) tables for grammar G = (N, X, P, S). We extend the GOTO function of Section 5.2.3 to tables and strings of grammar symbols. GOTO maps 5 × (N U X)* to 3 as follows: (1) GOTO(T, e) = T for all T in 3. (2) If T = ( f , g>, GOTO(T, X) --- g ( X ) for all X in N U E and T in 5. (3) GOTO(T, eX) -- GOTO(GOTO(T, e), X) for all 0~ in (N W E)* and T i n :3. We say table that T in (5, To) has height r if GOTO(T0, e) = T implies that [a[ > r. We shall also have occasion to use GOTO-~, the "inverse" of the GOTO function. GOTO -~ maps 5 x (N u Z)* to the subsets of 5. We define GOTO-~(T, t~) = IT' [GOTO(T', 00 = T}. Finally, we define a function NEXT(T, p), where T is in 3 and p is production A ~ Xx . . . X, as follows: (1) If T does not have height r, then NEXT(T, p) is undefined. (2) If T has height r, then NEXT(T, p ) - - [ T ' ] t h e r e exists T" ~ 5 and ~ (N U E) r such that T" ~ GOTO-I(T, e) and T ' - - G O T O ( T " , A)}. Thus, NEXT(T, p) gives all tables which could be the next governing table after T if T is on t o p of the pushdown list and calls for a reduction by production p. Note that there is no requirement that Xx ..- Xr be the top r grammar symbols on the pushdown list. The only requirement is that there be at least r grammar symbols on the list. If the tables are the canonical ones, we can show that only for ~t -- X 1 --. Xr among strings of length r will GOTO -1 (T, e) be nonempty. Certain algebraic properties of the GOTO and NEXT functions are left for the Exercises. The GOTO function for a set of LR(k) tables can be conveniently portrayed in terms of a labeled graph. The nodes of the GOTO graph are labeled by table names, and an edge labeled X is drawn from node T,. to node Tj if GOTO(T,., X) = Tj. Thus, if GOTO(T, X i ) ( 2 . . . X~) = T ' , then there will be a path from node T to node T' whose edge labels spell out the string X1 X2 --..Yr. The height of a table T can then be interpreted as the length of the shortest path from To to T in the GOTO graph.
SEC. 7.3
TRANSFORMATIONS ON SETS OF
LR(k) TABLES
599
T h e N E X T function can be easily computed from the GOTO graph. To determine NEXT(T, i), where production i is A - - . X 1 X 2 . . . Xr, we find all nodes T" in the GOTO graph such that there is a path of length r from T" to T. We then add GOTO(T", A) to NEXT(T, i) for each such T". Example 7.20
The GOTO graph for the set of tables in Fig. 7.25 (p. 595) is shown in Fig. 7.27.
a
F ~
)
a
)j,,,_J +
Fig. 7.27 GOTO graph.
600
TECHNIQUES FOR PARSER OPTIMIZATION
CHAP. 7
From this graph we can deduce that GOTO[T6, (E)] = T15, since GOTO[T6, (] = Ts, GOTO[Ts, E] = Ts, and GOTO[Ts, )] = T15. Table T6 has height 2, so NEXT(T6, 5), where production 5 is F ~ (E), is undefined. Let us now compute NEXT(T1 s, 5). The only tables from which there is a p a t h of length 3 to T15 are To, T6, and TT. Then GOTO(T0, F ) ~ - T 3 , GOTO(T6, F) = T3, and GOTO(TT, F) = T 14, and so NEXT(T1 s, 5) =
{T3, T14}. [~ We shall now give an algorithm whereby a set of q~-inaccessible tables can be modified to allow certain errors to be detected later in time, although not in terms of distance covered on the input. The algorithm we give here is not as general as possible, but it should give an indication of how the more general modifications can be performed. We shall change certain error entries and q~-entries in the action field to r e d u c e entries. For each entry to be changed, we specify the production to be used in the new reduce move. We collect permissible changes into what we call a postponement set. Each element of the postponement set is a triple (T, u, i), where T is a table name, u is a lookahead string, and i is a production number. The element (T, u, i) signifies that we are to change the action of table T on lookahead u to reduce i. DEFINITION
Let (3, To) be a set of LR(k) tables for a grammar G = (N, E, P, S). We call (P, a subset of 3 × E *k × P, a postponement set for (3, To) if the following conditions are satisfied. If (T, u, i) is in 6' with T = ( f , g), then (1) f ( u )
= e r r o r or ~0;
(2) If production i is A --~ e and T = GOTO(To, fl), then e is a suffix
of fl; (3) There is no i' such that (T, u, i') is also in (P; and (4) If T' is in NEXT(T, i) and T' = ( f ' , g'), then f'(u) = error or
~.
Condition (1) states that only error entries and (p-entries are to be changed to reduce entries. Condition (2) ensures that a reduction by production i will occur only if a appears on top of the pushdown list. Condition (3) ensures uniqueness, and condition (4) implies that reductions caused by introducing extra reduce actions will eventually be caught without a shift occurring. Referring to condition (4), note that (T', u, j) may also be in (P. In this case the value o f f ' ( u ) will also be changed from error or ~ to r e d u c e j. Thus, several reductions may be made in sequence before error is announced. Finding a postponement set for a set of LR(k) tables which will maximize the total number of compatible tables in the set is a large combinatorial problem. In one of the examples to follow we shall hint at some heuristic techniques which can be used to find appropriate postponement sets. How-
SEC. 7.3
TRANSFORMATIONS ON SETS OF LR(k) TABLES
601
ever, we shall first show how a postponement set is used to modify a given set of LR(k) tables. ALGORITHM 7.8 Postponement of error checking.
Input. An LR(k) grammar G----(N, ~, P, S), a g-inaccessible set (3, To) of LR(k) tables for G, and a postponement set 6). Output. A a-inaccessible set 3' of LR(k) tables equivalent to 3. Method. (1) For each (T, u, i) in 6', where T = ( f , g), change f(u) to reduce i. (2) Suppose that (T, u, i) is in 6) and that production i is A ----~0c. For all T' = ( f ' , g ' ) such that GOTO(T', ~) = T and g'(A) = ~, change g'(A) to e r r o r . (3) Suppose that (T, u, i) is in 6' and that T' = ( f ' , g'~ is in NEXT(T, i). Iff'(u) = q~, change f'(u) to e r r o r . (4) Let 3' be the resulting set of tables with the original names retained. D E x a m p l e 7.21
Consider the grammar G with productions (1) S ~ A S
(2) S ~ b (3) A ~ # ~ (4) B ~ aB (5) B ~ b
The T-inaccessible set of LR(1) tables for G obtained by using Algorithm 7.6 on the canonical set of LR(1) tables for G is shown in Fig. 7.28. We can choose to replace both error entries in the action field of T4 with reduce 5 and the error entry in T8 with reduce 2. That is, we pick a postponement set 6) = {(T4, a, 5), (T4, b, 5), (Ta, e, 2)}. Production 5 is B ---~ b, and GOTO(T0, b) -~ GOTO(T2, b) ~ T4. Thus, the entries under B in To and T2 must be changed from ~ to error. Similarly, the entries under S for T3 and T7 are changed to e r r o r . Since NEXT(T4, 5) and NEXT(Ts, 2) are empty, no (0's in the action fields need be changed to e r r o r . The resulting set of tables is shown in Fig. 7.29. If we wish, we can now apply Algorithm 7.7 with a compatible partition grouping T4 with Ts, T~ with T~, and T5 with T6. (Other combinations of three pairs are also possible.) The resulting set of tables is given in Fig. 7.30. THEOREM 7.7
Algorithm 7.8 produces a a-inaccessible set of tables 3' which is equivalent to 3.
602
CHAP. 7
TECHNIQUES FOR PARSER OPTIMIZATION
action
goto
a
b
e
S
r0
S
S
X
T1T2
~q
SO
SO
A
SO
r2
S
S
SO
Ts
r3
S
S
X
r4
X
X
2
rs
SO
SO
r6
3
r7 7q r9
B
a
SO
T3 T4
SO
SO
SO
T2
~
T3 T4
~
r6rTT8
SO
SO
SO
SO
SO
1
SO
SO
SO
SO
SO
3
SO
SO
so
so
SO
so
S
S
X
~
Tg r7
T8
5
5
X
SO
so
so
SO
SO
4
4
SO
SO
so
SO
so
SO
Fig. 7.28
A
b
SO
~0-inaccessible tables for G.
action
goto
a
b
e
S
To
S
S
X
TiT2XT3T4
Ti
SO
SO
A
SO
SO
SO
SO
73
S
S
SO
Ts
T2
X
T3 T4
T3
s
s
x
X
,
T6 TT T8
r4
5
5
2
SO
SO
SO
SO
SO
r5
SO
SO
1
SO
SO
SO
SO
SO
r~
3
3
SO
SO
SO
SO
SO
SO
T7
S
S
X
x
~0 Tg T7 Ts
T8
5
5
2
so
so
so
so ¸
so
r9
4
4
SO
so
so
so
so
so
Fig. 7.29
A
B
a
b
Tables after postponement of error checking.
SEC. 7.3
TRANSFORMATIONS ON SETS OF LR(k) TABLES action
603
goto
b
e
S
A
B
a
b
ro
S
S
X
T1T 1
X
T3
T4
75
S
S
A
Ts
Ti
X
T3
T4
75
S
S
X
x
~
Ts
rT
r4
r4
5
5
2
~o
~o
~o
~o
~o
r5
3
3
1
~o
~o
so
~o
~o
T7
S
S
X
X
~o
T9
TT
T4
r9 ¸
4
4
~o
~o
~p
~p
tp
¢
Fig. 7.30
Merged tables.
P r o o f Let Co = (To, w, e) be an initial configuration of the LR(k) parser (using either :3 or 3' - - t h e table names are the same). Let Co k- C1 k- • • • k- C, be the entire sequence of moves made by the parser using ~ and let Co ~- C'1 [--- • • • k- C " be the corresponding sequence for 3'. Since 3' is formed from 3 by replacing only error entries and (0-entries, we must have m > n, and C~ -- C i, for 1 < i ~ n. (That is, the table names are the same, although different tables are represented.) We shall now show that either m = n or if m > n, then no shift or accept moves are made after configuration C', has been entered. If m -- n, the theorem is immediate, and if C, is an accepting configuration, then m = n. Thus, assume that C, declares an error and that m > n. By the definition of a p o s t p o n e m e n t set, since the action in configuration C, is error and the action in C', is not error, the action in configuration C', must be reduce. Thus, let s be the smallest integer greater than n such that the action in configuration C's is shift or accept. (If there is no such s, we have the theorem.) Then there is some r, where n < r _< s, such that the action entry consulted in configuration C'r is one which was present in one of the tables of 5. The case r -- s is not ruled out, as certainly the shift or accept entry of C'r was present in 3. The action entry consulted in configuration C'~_ i was of the form reduce i for some i. By our assumption on r, that entry must have been introduced by Algorithm 7.8. Let T~ and T2 be the governing tables in configurations C'r_ ~ and C',, respectively. Then Tz is in NEXT(T~, i), and condition (4) in the definition of a postponement set is violated. D
We shall now give a rather extensive example in which we illustrate how postponement sets and compatible partitions might be found. There are
604
TECHNIQUESFOR PARSER OPTIMIZATION
CHAP. 7
a number of heuristics used in the example. Since these heuristics will not be delineated elsewhere, the reader is urged to examine this example with care. Example 7.22
Consider the tables for G O shown in Fig. 7.25 (p. 595). Our general strategy will be to use Algorithm 7.8 to replace error actions by reduce actions in order to increase the number of tables with similar parsing action functions. In particular, we shall try to merge into one table all those tables which call for the same reductions. Let us try to arrange to merge tables T~ 5 and T21 , because they reduce according to production 5, and tables T4 and T11, which reduce according to production 6. To merge T~ 5 and T21, we must make the action of T15 on ) be reduce 5 and the action of T2~ on e be reduce 5. Now we must check the actions of NEXT(Tls, 5) = {T3, T14} and NEXT(T2~, 5) = {Tl0, T2o} on ) and e. Since T3 and T14 each have (~ action on), we could change these ~'s to error's and be done with it. However, then T3 and T1 o would no longer be compatible nor would T14 and T20---~so we would be wise to change the actions of T3 and T14 on ) instead to reduce 4 and reduce 3, respectively. We must then cheek NEXT(T3, 4) = NEXT(T14, 3) = [T2, T13}. A similar argument tells us that we should not change the actions of T2 and T13 on ) to error, but rather to reduce 2 and reduce 1, respectively. Further, we see that NEXT(T2, 2) = [Ti } = NEXT(T13, 1). There is nothing wrong with changing the action of Ti on ) to error, so at this point we have taken into account all modifications needed to change the action of T15 on ) to reduce 5. We must now consider what happens if we change the action of T21 on e to reduce 5. NEXT(T21, 4) = ITs0, T20}, but we do not want to change the actions of Tlo and T2o on e to error, because then we could not possibly merge these tables with T3 and T14. We thus change the actions of T10 and T20 on e to reduce 4 and reduce 3, respectively. We find that NEXT(Tlo, 4) = NEXT(T2o, 3) = {Tg, T19}. We do not want to change the actions of T9 and T19 on e to error, so let T9 have action reduce 2 on e and T19 have action reduce 1 on e. We find NEXT(Tg, 2) = NEXT(T19, 1) = {Ts, T18}. These tables will have their actions on e changed to error. We have now made T15 and T21 compatible without disturbing the possible compatibility ofT3 and Tlo, ofT1, and T20, ofT2 and T9, ofT13 and T19, or of T14 and T20. Now let us consider making T4 and T11 compatible by changing the action of T4 on ) to reduce 6 and the action of Tll on e to reduce 6. Since NEXT(T,, 6) = IT3, T14} and NEXT(T11, 6) = {T10, T20}, the changes we have already made to T3 and T14 and to T10 and T20 allow us to
SEC. 7.3
TRANSFORMATIONS ON SETS OF LR(k) TABLES
605
make these changes to T4 and T11 without further ado. The complete postponement set consists of the following elements" [T2, ), 2]
[Tg, e, 2]
[T3, ), 4]
[T~ o, e, 4]
[T4, ), 6]
[T~ ,, e, 6]
[T,3, ), 1]
[T,9, e, 1]
[T,,, ), 3]
[Tzo, e, 3]
[r, 5, ), 5]
[r2,, e, 5]
The result of applying Algorithm 7.8 to the tables of Fig. 7.25 with this postponement set is shown in Fig. 7.31. Note that no error entries are introduced into the goto field. Looking at Fig. 7.3 l, we see that the following pairs of tables are immediately compatible"
T~-Ti0 T~-T~i T~-T~o T~-T~i Moreover, if these pairs form blocks of a compatible partition, then the following pairs may also be grouped"
T~-T, ~
T~'T12 T~'T17 T~-T~ T~-T~ T~-Ti ~ If we apply Algorithm 7.7 with the partition whose blocks are the above pairs and the singletons {To} and {T1}, we obtain the set of tables shown in Fig. 7.32. The smaller index of the paired tables is used as representative in each case. It is interesting to note that the "SLR" method of Section 7.4.1 can construct the particular set of tables shown in Fig. 7.32 directly from Go. To illustrate the effect of error postponement, let us parse the erroneous input string a). Using the tables in Fig. 7.25, the canonical parser would
606
CHAP. 7
TECHNIQUES FOR PARSER OPTIMIZATION
action a
+
•
(
)
e
E
ro
S
X
X
S
X
X
T1 T:
T3 T4
rl
~p
S
~
~p
X
A
r~
,#
2
S
~
2
2
4
4
¢
4
4
r3
T
F
,,goto a +
*
(
~
~
Ts
)
r4
X
6
6
X
6
6
rs
S
X
X
S
X
X
T8
T9
Zlo
Tll
~o
tp
T12
~0
r6
S
X
X
S
X
X
g9
T13
T3
T4
ho
~0
T5
g9
r7
S
X
X
S
X
X
tp
tp
T14
T4
ho
tp
T5
r8
~.
S
,¢
¢
S
X
~0
~
~0
~
T16
~0
~
T15
2
S
,~
2
2
~p
~o
tp
tp
tp
T17
~p
tp
T~8 T9 T10 T~
~o
~
T~2
r9
r~o
,~
4
4
¢
4
4
Tll
X
6
6
X
6
6
r~2
S
X
X
S
X
X
~o
I
S
~o
1
1
r~4
,~
3
3
~o
3
3
rls
X
5
5
X
5
r16
S
X
X
S
r17
S
X
X
S
T18
T19
~p
r~o r~l
X
tp
tp
tp
~0
tp
T7
tp
¢
5
tp
~
~
tp
~o
tp
tp
,¢
X
X
tp
T19 TlO Tll
tp
~p
T12
tp
S
X
X
¢
tp
T2o
Tll
¢
tp
T12
¢
¢
~,
S
X
,
~0
"9
,
T16
¢
:
T21
1
S
~
1
1
~p
tp
ho
9
9
T17
tp
~p
3
3
~o
3
3
5
5
X
5
5
~p
~p
tp
tp
~p
~o
¢
~o
Fig. 7.31
A p p l i c a t i o n of p o s t p o n e m e n t algorithm.
make one move: [To, a), e] F-- [ToaT4, ), e] The action of T4 on ) is error.
607
TRANSFORMATIONS ON SETS OF LR(k) TABLES
SEC. 7.3
goto
action a
+
*
(
)
e
r0
S
X
X
S
X
X
T1
~p
S
~
~p
X
A
T2
~p
2
S
~p
2
2
T3
,#
4
4
~0
4
4
T4
X
6
6
X
6
6
T5
S
X
X
S
X
X
r8 r2 r3 r4
T6
S
X
X
S
X
X
r~3 r3 73
T7
S
X
X
S
X
X
TI4
S
¢
¢
S
X
r8 T13
,p
1
S
~o
1
1
T14
~p
3
3
~
3
3
T15
X
5
5
X
5
5
Fig. 7.32
E
T
F
a
+
•
(
~p
~p
~p
~o
T6
~p
~p
~p
~o
~p
~p
~p
~p
~p
~p
T4
After application of merging algorithm.
However, using the tables in Fig. 7.32 to parse this same input string, the parser would now make the following sequence of moves: [To, a), e] [--- [ToaT4,
), e] [- [ToFT3, ), 6] I-- [ToTT2, ), 64] ~- [ToET1, ), 642]
Here three reductions are made before the announcement of error. 7.3.6.
[~
Elimination of Reductions by Single Productions
We shall now consider an important modification of LR(k) tables that does not preserve equivalence in the sense of the previous sections. This modification will never introduce additional shift actions, nor will it even cause a reduction when the unmodified tables detect error. However, this modification can cause certain reductions to be skipped altogether. As a result, a table appearing on the pushdown list may not be the one associated with the string of grammar symbols appearing below it (although it will be compatible with the table associated with that string). The modification of this
608
TECHNIQUES FOR PARSER OPTIMIZATION
CHAP. 7
section has to do with single productions, and we shall treat it rather more informally than we did previous modifications. A production of the form A ~ B, where A and B are nonterminals, is called a single production. Productions of this nature occur frequently in grammars describing programming languages. For instance, single productions often arise when a context-free grammar is used to describe the precedence levels of operators in programming languages. From example, if a string al-t- az * a3 is to be interpreted as a 1 ÷ (a 2 • a3), then we say the operator • has higher precedence than the oPerator ÷ . Our grammar G Ofor arithmetic expressions m a k e s , of higher precedence than + . The productions in Go are (1) (2) (3) (4) (5) (6)
E ~ E + T E----~ T T---~ T . F T--~ F F ~ (E) F--~ a
We can think of the nonterminals E, T, and F as generating expressions on different precedence levels reflecting the precedence levels of the operators. E generates the first level of expressions. These are strings of T's separated by -t-'s. The operator -q- is on the first precedence level. T generates the second level of expressions consisting of F's separated by • ;s. The third level of expressions are those generated by F, and we can consider these to be the primary expressions. Thus, when we parse the string al -q- a2 * a3 according to G 0, we must first parse a 2 • a3 as a T before combining this Twith al into an expression E. The only function served by the two single productions E ~ T and T ----~F is to permit an expression on a higher precedence level to be trivially reduced to an expression on a lower precedence level. In a compiler the translation rules usually associated with these single productions merely state that the translations for the nonterminal on the left are the same as those for the nonterminal on the right. Under this condition, we may, if we wish, eliminate reductions by the single production. Some programming languages have operators on 12 or more different precedence levels. Thus, if we are parsing according to a grammar which reflects a hierarchy of precedence levels, the parser will often make many sequences of reductions by single productions. We can speed up the parsing process considerably if we can eliminate these sequences of reductions, and in most practical cases we can do so without affecting the translation that is being computed. In this section we shall describe a transformation on a set of LR(k) tables which has the effect of eliminating reductions by single productions wherever desired.
SEC. 7.3
TRANSFORMATIONS ON SETS OF LR(k) TABLES
600
Let (3, To) be a 0-inaccessible set of LR(k) tables for an LR(k) grammar G = (N, l~, P, S), and assume that 3 has as many (p-entries as possible. Suppose that A ~ B is a single production in P. Now, suppose the LR(k) parsing algorithm using this set of tables has the property that whenever a handle Y~ Y2 "'" Y, is reduced to B on lookahead string u, B is then immediately reduced to A. We can often modify this set of tables so that Y~ • • • Y,. is reduced to A in one step. Let us examine the conditions under which this can be done. Let the index of production A ---~ B be p. Suppose that T = ( f , g~ is a table such that f ( u ) = reduce p for some lookahead string u. Let 3 ' = GOTO-~(T, B) and let '11. = {U IU = GOTO(T', A) and T' ~ 3'}. 3' consists of those tables which can appear immediately below B on the pushdown list when T is the table above B. If (3, To) is the canonical set of LR(k) tables, then 'It is NEXT(T, p)(Exercise 7.3.19). To eliminate the reduction by production p, we would like to change the entry g'(B) from T to g'(A) for each ( f ' , g ' ) in 3'. Then instead of making the two moves (yT' Yi U1 Yz Uz . . . Yr Ur, w, n:) ~ (?T'BT, w, n:i)
(?T'A U, w, nip) the parser would just make one move"
(?T' Y1 U1 Y2 U2 . . . Yr U,., w, rt) ~ (?T'A U, w, n:i) We can make this change in g'(B) provided that the entries in T and all tables in q.t are in agreement except for those lookaheads which call for a reduction by production p. That is, let T = ( f , g ) and suppose q'[ -~- ( ( f l, g~), ( f z, gz), . . . , (fro, g,,)}" Then we require that (1) For all u in E,k if f ( u ) is not ~o orreduce p, then f.(u) is either ~o or the same as f ( u ) for 1 < i < m. (2) For all X in N u E, if g(X) is not ~o, then g~(X) is either ~por the same a s g ( X ) for 1 ~ i ~ m . If both these conditions hold, then we modify the tables in 3' and 'It as follows" (3) We let g'(B) (4) For 1 ~ i ~ m (a) for each not ~0 or (b) for each
be g'(A) for all ( f ' , g ' ) in 3'. u ~ E *k change f~(u) to f(u) if f~(u) =~0 and if f ( u ) i s reduce p, and X ~ N U E change g,(X) to g(X) if g(X) -~ ~o.
The modification in rule (3) will make table T inaccessible if it is possible to reach T only via the entries g'(B) in tables ( f ' , g ' ) in 3'. Note that the modified parser can place symbols and tables on the push-
610
TECHNIQUES FOR PARSER OPTIMIZATION
CHAP. 7
down list that were not placed there by the original parser. For example, suppose that the original parser makes a reduction to B and then calls for a shift, as follows"
(?T' Y1 U1 Yz U2 ... YrU,, aw, n:) ~ (?T'BT, aw, zti) F- (TT'BTaT", w, hi) The new parser would make the same sequence of moves, but different symbols would appear on the pushdown list. Here, we would have
(?T' Y1 U1 Y2 U2 ... Y,. Ur, aw, rt) ~ (?T'A U', aw, zti) (TT'A U'aT", w, hi) Suppose that T = ( f , g), T' = ( f ' , g'), and U = ( f . , g,). Then, U is g'(A). Table U' has been constructed from U according to rule (4) above. Thus, if f ( v ) = shift, where v = FIRST(aw), we know that f~(v) is also shift. Moreover, we know that gi(a) will be the same as g(a). Thus, the new parser makes a sequence of moves which is correct except that it ignores the question of whether the reduction by A ~ B was actually made or not. In subsequent moves the grammar symbols on the pushdown list are never consulted. Since the goto entries of U' and T agree, we can be sure that the two parsers will continue to behave identically (except for reductions by single production A ~ B). We can repeat this modification on the new set of tables, attempting to eliminate as many reductions by semantically insignificant single productions as possible. Example 7.23
Let us eliminate reductions by single productions wherever possible in the set of LR(1) tables for G Oin Fig. 7.32 (p. 607). Table T2 calls for reduction by production 2, which is E ~ T. The set of tables which can appear immediately below Tz on the stack is [To, T5}, since GOTO(T0, E ) = T1 and GOTO(Ts, E) -- Ts. We must check that except for the reduce 2 entries, tables T2 and T1 are compatible and T2 and T8 are compatible. The action of T2 on • is shift. The action of T1 and T8 on • is ~o. The goto of T2 on • is T7. The goto of T1 and T8 on • is ~. Therefore, T2 and T1 are compatible and T2 and T8 are compatible. Thus, we can change the goto o f table To on nonterminal T from T2 to T1 and the goto of T5 on T from T2 to Ts. We must also change the action of both T1 and Ts on • from f to shift, since the action of T2 on • is shift. Finally, we change the goto of both T1 and T8 on • from (p to T7 since the goto of T2 on • is Z7. Table T2 is now inaccessible from To and thus can be removed.
SEC. 7.3
TRANSFORMATIONS ON SETS OF LR(k) TABLES
611
Let us now consider the reduce 4 moves in table T3. (Production 4 is T--~ F.)The set of tables which can appear directly below T3 is (To, T~, T6}. Now GOTO(T0, T) = T1, GOTO(Ts, T) = T8, and GOTO(T6, T) = T13. [Before the modification above, GOTO(T0, T) was T2, and GOTO(Ts, T) was Tz.] We must now check that T3 is compatible with each of T1, Ts, and T13. This is clearly the case, since the actions of T3 are either ~oor reduce 4, and the gotos of T3 are all ~o. Thus, we can change the goto of To, T~, and T6 on F to Tt, Ts, and Ti 3, respectively. This makes table T3 inaccessible from To. The resulting set of tables is shown in Fig. 7.33(a). Tables T2 and T3 have been removed. There is one further observation we can make. The goto entries in the columns under E, T, and F all happen to be compatible. Thus, we can merge these three columns into a single column. Let us label this new column by E. The resulting set of tables is shown in Fig. 7.33(b). The only additional change that we need to make in the parsing algorithm is to use E in place of T and F to compute the goto entries. In effect the set of tables in Fig. 7.33(b) is parsing according to the skeletal grammar (1) (3) (5) (6)
E ~ E + E E--> E , E E----~ (E) E--,a
as defined in Section 5.4.3. For example, let us parse the input string (a + a) • a using the tables in Fig. 7.33(b). The LR(1) parser will make the following sequence of moves. [To, (a + a) • a, el I- [To(Ts, a + a) • a, e] [To(T, aT4, + a) • a, e] ]--- [To(T~ET8, + a) • a, 6] [To(TsET8 + T6, a) • a, 6] [- [To(TsET8 + T6aT4, ) • a, 6] 1-- [To(TsET8 + T6ET~3, ) • a, 66] ~- [To(TsET8, ) • a, 661] [To(TsETs)Tls, • a, 661] t-- [ToET~, * a, 6615] [ToET1, TT, a, 6615] [ToET~ ,TTaT+, e, 6615] t-- [ToET~,TTET~4, e, 66156] ~- [ToETi , e, 661563]
612
CHAP. 7
TECHNIQUESFOR PARSER OPTIMIZATION action
goto
a
+
-,
(
)
e
E
T
F
a
+
•
(
To
S
X
X
S
X
X
~q rl
~
r4
~
~
r5
r~
~p
S
S
~p
X
A
T~
X
6
6
X
6
6
T~
S
X
X
S
X
X
T8 r8
T8 T4
~
~
r5
T~
S
X
X
S
X
X
T7
S
X
X
S
X
X
r~4
~
~
r~
T8
~p
S
S
,p
S
X
Tx~
~p
1
S
~p
1
1
rl4
,~
3
3
~p
3
3
X
5 ~ 5
X
5
5
)
~
r4
(a) before column merger
action
got__..o
a
+
*
(
)
e
E
To
S
X
X
S
X
X
r~
~p
S
S
~p
X
A
T4
X
6
6
X
6
6
T5
S
X
X
S
X
r~
S
X
X
S
r7
S
X
X
S
r8
+
*
(
7q r4
~
~
r5
X
r8
T4
~
~
T5
X
X
T~3 r4
~
~
Ts
S
X
X
r14 r4
~
~
r5
S
~
S
X
T~3
~p
1
S
,#
1
1
Ti4
~p
3
3
~p
3
3
TI5
X
5
5
X
5
5
a
~,
,p
T6
T7
~
T~5
~o
~p
~p
tp
~p
~¢
(b) after column merger Fig. 7.33 ~ LR(1) tables after elimination of single productions.
)
~
SEC. 7.3
TRANSFORMATIONS ON SETS OF LR(k) TABLES
613
The last configuration is an accepting configuration. In parsing this same input string the canonical LR(1) parser for G o would have made five additional moves corresponding to reductions by single productions. D The technique for eliminating single productions is summarized in Algorithm 7.9. While we shall not prove it here in detail, this algorithm enables us to remove all reductions by single productions if the grammar has at most one single production for each nonterminal. Even if some nonterminal has more than one single production, this algorithm will still do reasonably well. ALGORITHM 7.9 Elimination of reductions by single productions.
Input. An LR(k) grammar G = (N, Z, P, S) and a a-inaccessible set of tables (3, To) for G. Output. A modified set of tables for G that will be "equivalent" to (~3, To) in the sense of detecting errors just as soon but which may fail to reduce by some single productions. Method. (1) Order the nonterminals so that N ---- [A~ . . . . . A,], and if A~ --~ Aj is a single production, then i < j. The unambiguity of an LR grammar guarantees that this may be done. (2) Do step (3) f o r j = 1, 2 , . . . , n, in turn. (3) Let At--~ A~ be a single production, numbered p, and let T1 be a table which calls for the reduce p action on one or more lookahead strings. Suppose that T2 is in NEXT(T~, p)t and that for all lookaheads either the actions of T~ and T~ are the same or one is ~ or the action of T~ is reduce p. Finally, suppose that the goto entries of T~ and T2 also either agree or one is ~. Then create a new table T3 which agrees with T~ and T2 wherever they are not ~,, except at those lookaheads for which the action of T1 is reduce p. There, the action of T3 is to agree with T2. Then, modify every table T (f,g) such that g(AO = T~ and g(Ai)= T~ by replacing T~ and T2 by T~ in the range'of g. (4) After completing all modifications of step (3), remove all tables which are no longer accessible from the initial table. We shall state a series of results necessary to prove our contention that given an LR(1) grammar, Algorithm 7.9 completely eliminates reductions by single productions if no two having the same left or same right sides, provided that no nonterminal of the grammar derives the empty string only. (Such a tNEXT must always be computed for the current set of tables, incorporating all previous modifications of step (3)
614
TECHNIQUES FOR PARSER OPTIMIZATION
CHAP. 7
nonterminal can easily be removed without affecting the LR(1)-ness of the grammar.) LEMMA 7.1 Let G -
(N,
X, P, S) be an LR(1) grammar such that for each C ~ N,
there is some w =/= e in E* such that C *=* w. Suppose that A =2, B by a sequence of single productions. Let a l and a 2 be the sets of LR(1) items that are valid for viable prefixes ),A and ),B, respectively. Then the following conditions hold" (1) If [ C - * ~, • Xfll, a] is in a 1 and [D--~ ~2 " Yflz, b] is in 002, then X =/= Y. (2) If [C--* el • ill, a] is in 6 1 and b is in EFF(flla),t then there is no item of the form [ D - ~ 0c2. f12, c] in 6 2 such that b ~ EFF(fl2c ) except possibly for [E --~ B . , b], where A *=* E =:> B by a sequence of single productions.
Proof A derivation of a contradiction of the LR(1) condition when condition (1) or.(2) is violated is left for the Exercises. [~ COROLLARY
The LR(1) tables constructed from ~; and 6 2 do not disagree in any action or goto entry unless one of them is cp or the table for t2z calls for a reduction by a single production in that entry.
Proof. By Theorem 7.7, since the two tables are constructed from the sets of valid items for strings ending in a nonterminal, all error entries are don't cares. Lemma 7.1(1) assures that there are no conflicts in the goto entries" part (2) assures that there are no conflicts in the action entries, except for resolvable ones regarding single productions. LEMMA 7.2 During the application of Algorithm 7.9, if no two single productions have the same left or the same right sides, then each table is the result of merging a list (possibly of length 1) of LR(1) tables T ~ , . . . , T~ which were constructed from the sets of valid items for some viable prefixes yA1, y A 2 , . . . , yAh, where A i---~ A i+1 is in P for 1 < i < n.
Proof Exercise.
[~]
THEOREM 7.8 If Algorithm 7.9 is applied to an LR(1) grammar G and its canonical set of LR(1) tables 3, if G has no more than one single production for any nonterminal, and if no nonterminal derives e alone, then the resulting set of tables has no reductions by single productions. ]'Note that f i x c o u l d be e here, in which case etl calls for a reduction on lookahead b. Otherwise, (~1 calls for a shift.
EXERCISES
61 5
Proof. Intuitively, Lemmas 7.1 and 7.2 assure that all pairs T1 and Tz considered in step (3) do in fact meet the conditions of that step. A formal proof is left for the Exercises.
EXERCISES
7.3.1.
Construct the canonical sets of LR(I) tables for the following grammars: (a) S---. A B A C A - - . aD B---, b l c C--, cld
D---~ D010 (b) S ---, a S S [ b (c) S --~ SSa l b (d) E ~ E + T [ T T---, T * F1F
V ~ P? FIP P ---~ (E)1 al a(L) L--~L, EIE
7.3.2.
Use the canonicalset of tables from Exercise 7.3.1(a) and Algorithm 7.5 to parse the input string aObaOOc.
7.3.3.
Show how Algorithm 7.5 can be implemented by (a) a deterministic pushdown transducer, (b) a Floyd-Evans production language parser.
7.3.4.
Construct a Floyd-Evans production language parser for Go from (a) The LR(1) tables in Fig. 7.33(a). (b) The LR(1) tables in Fig. 7.33(b). Compare the resulting parsers with those in Exercise 7.2.11.
7.3.5.
Consider the following grammar G which generates the language L = {anOa~bn[i, n ~ 0) W {Oa"laicn[i, n ~ 0}: S-
+AIOB
A ---+ aAb[O[OC B~
aBc [ I I1C
C - - - ~ aC[a
Construct a simple precedence parser and the canonical LR(1) parser for G. Show that the simple precedence parser will read all the a's following the 1 in the input string a"laib before announcing error. Show that the canonical LR(1) parser will announce error as soon as it reads the I. 7.3.6.
Use Algorithm 7.6 to construct a ~0-inaccessible set of LR(I) tables for each of the grammars in Exercise 7.3.1.
61 6
TECHNIQUESFOR PARSER OPTIMIZATION
CHAP. 7
7.3.7.
Use the techniques of this section to find a smaller equivalent set of LR(1) tables for each of the grammars of Exercise 7.3.1.
7.3.8.
Let ~ be the canonical collection of sets of LR(k) items for an LR(k) grammar G = (N, E, P, S). Let 12 be a set of items in ,~. Show that (a) If item [A ~ tg • fl, u] is in 12, then u ~ FOLLOWk(A). (b) If 12 is not the initial set of items, then 12 contains at least one item of the form [A --~ tzX. fl, u] for some X in N w E. (c) If [B ~ • fl, v] is in 12 and B ~ S', then there is an item of the form [A ~ tz - BT, u] in 12. (d) If [A --~ tg • Bfl, u] is in 12 and EFFi(Bflu) contains a, then there is an item of the form [C ~ • a~', v] in 12 for some ~, and v. (This result provides an easy method for computing the shift entries in an LR(1) parser.)
*7.3.9.
Show that if an error entry is replaced by tp in 3', the set of LR(k) tables constructed by Algorithm 7.6, the resulting set of tables will no longer be ~0-inaccessible.
"7.3.10.
Show that a canonical LR(k) parser will announce error either in the initial configuration or immediately after a shift move.
*'7.3.11.
Let G be an LR(k) grammar. Give upper and lower bounds on the number of tables in the canonical set of LR(k) tables for G. Can you give meaningful upper and lower bounds on the number of tables in an arbitrary valid set of LR(k) tables for G ?
"7.3.12.
Modify Algorithm 7.6 to construct a tp-inaccessible set of LR(0) tables for an LR(0) grammar.
"7.3.13.
Devise an algorithm to find all a-entries in an arbitrary set of LR(k) tables.
"7.3.14.
Devise an algorithm to find all tp-entries in an arbitrary set of LL(k) tables.
*'7.3.15.
Devise an algorithm to find all a-entries in an LC(k) parsing table.
"7.3.16.
Devise a reasonable algorithm to find compatible partitions on a set of LR(k) tables.
7.3.17.
Find compatible partitions for the sets of LR(1) tables in Exercise 7.3.1.
7.3.18.
Show that the relation of compatibility of LR(k) tables is reflexive and symmetric but not transitive.
7.3.19.
Let (3, To) be the canonical set of LR(k) tables for an LR(k) grammar G. Show that GOTO(T0, t~) is not empty if and only if t~ is a viable prefix of G. Is this true for an arbitrary valid set of LR(k) tables for G ?
*7.3.20,
Let (3, To) be the canonical set of LR(k) tables for G. Find an upper bound on the height of any table in 3 (as a function of G).
7.3.21.
Let 3 be the canonical set of LR(k) tables for LR(k) grammar G = (N, E, P, S). Show that for all T e 3, NEXT(T, p) is the set {GOTO(T', A)I T' ~ G O T O - I ( T , t~), where production p is A ~ t~}.
EXERCISES
617
7.3.22.
Give an algorithm to compute NEXT(T,p) for an arbitrary set of LR(k) tables for a grammar G .
*7.3.23.
Let 3 be a canonical set of LR(k) tables. Suppose that for each T ~ 3 having one or more reduce actions, we select one production, p, by which T reduces and replace all error and ~0 actions by reduce p. Show that the resulting set of tables is equivalent to 3.
*7.3.24.
Show that reductions by single productions cannot always be eliminated from a set of LR(k) tables for an LR(k) grammar by Algorithm 7.9.
*7.3.25.
Prove that Algorithm 7.9 results in a set of LR(k) tables which is equivalent to the original set of tables except for reductions by single productions.
7.3.26.
A binary operator 0 associates from left to right if aObOe is to be interpreted as ((aOb)Oe). Construct an LR(1) grammar for expressions over the alphabet [a, (,)} together with the operators { + , - - , . , / , ~}. All operators are binary except + and --, which are both binary and (prefix) unary. All binary operators associate from left to right except $. The binary operators 3- and -- have precedence level 1, • and / have precedence level 2, 1" and the two unary operators have precedence level 3.
**7.3.27.
Develop a technique for automatically constructing an LR(1) parser for expressions when the specification of the expressions is in terms of a set of operators together with their associativities and precedence levels, as in Exercise 7.3.26.
*7.3.28. **7.3.29.
Prove that 3 and ql. in Example 7.17 are equivalent sets of tables. Show that it is decidable whether two sets of LR(k) tables are equivalent. DEFINITION The following productions generate arithmetic expressions in which 0 1 , 0 2 , . . . , 0. represent binary operators on n different precedence levels. 01 has the lowest precedence, and 0. the highest. The operators associate from left to right.
Eo - - ~ Eo$1 E1 i E1 E1 ~
EiOzEz I E2
En-1- > E.-IOnEnIE. E. ----> (E1) l a We can also generate this same set of expressions by the following "tagged LR(1)" grammar" (1) E, --~ E, OjEy (2) Ei ~ (Eo) (3) Ei---~a
0~ i T*F[F
F
~ (E) la
The canonical collection of sets of LR(0) items for G is listed in Fig. 7.36, with the second components, which are all e, omitted. tWe could use FOLLOWk_I(A) here, since fl must generate at least one symbol of any string in EFFk(fl FOLLOWk(A)).
SEC. 7.4
TECHNIQUES FOR CONSTRUCTING LR(k) PARSERS 60:
E'---+ E--~ E----~ T---~ T---~
-E .E + T .T .T* F .F
F---,.(E) F ~
61: 62:
66:
E ~ E T---~ T----~ F----~ F---*
67"
T~T,
.a
E'~
E. E---+E.+T
68:
E-----~ T.
69:
+ -T .T* F .F .(E) -a
.F
F--~
.(E)
F--+
.a
F ----~ (E, ) E--~
T - - + T, , F
E.+T
E---~E+
T.
63:
T - - - , F.
64:
F---~ a.
6ta0:
T - - ~ T , F.
F---~ (.E)
61x :
F--~ (E).
65:
E---~ E---~ T--~ T---~ F ~
623
T-----~ T . , F
.E+ T .T .T, F .F .(E)
F .---~ . a
Fig. 7.36
LR(0) items for Go.
G o is not SLR(0) because, for example, 61 contains the two items [E'--~ E . ] and [E ~ E - ÷ T] and F O L L O W 0 ( E ' ) = {e} = E F F 0 [ + T FOLLOW0(E)].t
However, Go is SLR(1). To check the SLR(1) condition, it suffices to consider sets of items which (1) Have at least two items, and (2) Have an item with the dot at the right-hand end. Thus, we need concern ourselves only with 61, 62, and 69. For 61, we observe that FOLLOWi(E' ) = [ e } and E F F I [ + T FOLLOW i (E)] = { + } . Since [e} A { + ] = ~ , 61 satisfies condition (3) of the SLR(1) definition. 62 and 69 satisfy condition (3) similarly, and so we can conclude that G Ois SLR(1). Now let us attempt to construct a set of LR(1) tables for an SLR(1) grammar G starting from So, the canonical collection of sets of LR(0) items for G. Suppose that some set of items 6 in So has only the items [A --~ e -, e] and [B ~ fl.~,, e]. We can construct an LR(I) table from this set of tNote that for all ~, F I R S T o ( ~ ) -- E F F o ( ~ ) -- F O L L O W o ( ~ ) -- [e].
624
TECHNIQUES FOR PARSER OPTIMIZATION
CHAP. 7
items as follows. The goto entries are constructed in the obvious way, as though the tables were LR(0) tables. But for lookahead a, what should the action be? Should we shift, or should we reduce by production A----~ ct. The answer lies in whether or not a E F O L L O W i (A). If a ~ FOLLOWi(A), then it is impossible that a is in EFFi(?) , by the definition of an SLR(1) grammar. Thus, reduce is the appropriate parsing action. Conversely, if a is not in F O L L O W i (A), then it is impossible that reduce is correct. If a is in EFF(~,), then shift is the action; otherwise, error is correct. This algorithm is summarized below. ALGORITHM 7.10 Construction of a set of LR(k) tables for an SLR(k) grammar.
Input. An SLR(k) grammar G = (N, E, P, S) and So, the canonical collection of sets of LR(0) items for G. Output. (3, To), a set of LR(k) tables for G, which we shall call the SLR(k) set of tables for G. Method. Let ~ be a set of LR(0) items in So. The LR(k) table T associated with ~ is the pair ( f , g), constructed as follows: (1) For all u in Z *k, (a) f(u) = shift if [A --~ ~ • fl, e] is in ~, ,8 ~ e, and u is in the set EFFk(fl FOLLOWk(A)). (b) f ( u ) = reduce i if [d---~ 0c., e] is in ~, d - - ~ o~ is production i in P, and u is in FOLLOWk(A). (c) f(e) = accept if [S' --~ S -, e] is in ~2.'~ (d) f ( u ) = error otherwise. (2) For all X in N t.3 X, g(X) is the table constructed from G O T O ( a , X). To, the initial table, is the one associated with the set of items containing
[s' ~
. S, d.
We can relate Algorithm 7.10 to our original method of constructing a set of tables from a collection of sets of items given in Section 5.2.5. Let ~' be the set of items [A ~ tx • fl, u] such that [A ~ e • fl, e] is in ~ and u is in FOLLOWk(A ). Let S~ be [ ~ ' [ ~ ~ g0}- Then, Algorithm 7.10 yields the same set of tables that would be obtained by applying the construction given in Section 5.2.5 to g~. It should be clear from the definition that each set of items in g; is consistent if and only if G is an SLR(k) grammar. Example 7.25
Let us construct the SLR(1) set of tables from the sets of items of Fig. 7.36 (p. 623). We use the name Tt for the table constructed from ~ r We shall consider the construction of table T2 only. "l'The canonical collection of sets of items is constructed from the augmented grammar.
SEC. 7.4
TECHNIQUES FOR CONSTRUCTING LR(k) PARSERS
625
tt 2 is {[E--+ T .], [T ~ T - • F]}. Let T2 = ( f , g). Since F O L L O W ( E ) is [ + , ), e}, we have f ( + ) = f D ] = f ( e ) = reduce 2. (The usual p r o d u c t i o n n u m b e r i n g is being used.) Since E F F ( , F F O L L O W ( T ) ) = [,], f ( , ) = shift. F o r the other lookaheads, we have f ( a ) = f[(] = error. The only symbol X for which g ( X ) is defined is X = ,. It is easy to see by inspection o f Fig. 7.36 that g ( , ) = TT. The entire set of tables is given in Fig. 7.37.
a
+
action * (
)
e
E
r0
S
X
X
S
X
X
T~r2T3T4X
rl
X
S
X
X
X
A
X
X
X
X
r2 7q
X
2
S
X
2
2
X
X
X
X
4
4
X
4
4
X
X
r4
X
6
6
X
6
6
X
r5
S
X
X
S
X
X
r6
S
X
X
S
X
r7
S
X
X
S
r8
X
S
X
r9
X
1
T10
X X
goto
T
F
a
+
•
(
)
x
T5
X
T6
X
X
X
X
X
T7
X
X
X
X
X
X
X
X
X
X
X
X
X
X
X
Ts
T2
T3
T4
X
XTsX
X
X
Tg
T3
T4
X
X
T5
X
X
X
X
X
TlO Tg
X
T5
X
X
S
X
X
X
X
X
T6
x
XT~
S
X
1
1
x
x
x
x
x
rT
x
x
3
3
X
3
3
X
X
X
X
X
X
X
X
5
5
X
5
5
X
X
X
X
X
X
X
X
Fig. 7.37 SLR(1) tables for Go. Except for names and ~ entries, this set o f tables is exactly the same as that in Fig. 7.32 (p. 607). We shall now prove that A l g o r i t h m 7.10 will always p r o d u c e a valid set of LR(k) tables for an SLR(k) g r a m m a r G. In fact, the set o f tables p r o d u c e d is equivalent to the canonical set o f LR(k) tables for G. THEOREM 7.9
If G is an SLR(k) g r a m m a r , (3, To), the SLR(k) set o f tables constructed by A l g o r i t h m 7.10 for G is equivalent to (3 c, To), the canonical set o f LR(k) tables for G. P r o o f Let gk be the canonical collection o f sets o f LR(k) items for G and let So be the collection o f sets o f LR(0) items for G. Let us define the
626
CHAP. 7
TECHNIQUES FOR PARSER OPTIMIZATION
• core of a set of items R as the set of bracketed first components of the items in that set. For example, the core of [A ~ 0c • fl, u] is [A ~ 0c • fl].t We shall denote the core of a set of items R by CORE(R). Each set of items in So is distinct, but there may be several sets of items in Sk with the same core. However, it can easily be shown that So = {CORE(et) let ~ Sk}. Let us define the function h on tables which corresponds to the function C O R E on sets of items. We let h(T) -- T' if T is the canonical LR(k) table associated with R and T' is the SLR(k) table constructed by Algorithm 7.10 from CORE(R). It is easy to verify that h commutes with the GOTO function. That is, GOTO(h(T), %) = h(GOTO(T, X)). As before, let R'=
{[A---, tx. ,8, ull[A---~ ~ - f l , e] e R
and
u ~ FOLLOWk(A)}.
Let $~ = [R'IR ~ So}. We know that (3, To) is the same set of LR(k) tables as that constructed from S'0 using the method of Section 5.2.5. We shall show that (5, To) can also be obtained by applying a sequence of transformations to (3c, To), the canonical set of LR(k) tables for G. The necessary steps are the following. (1) Let 6' be the postponement set consisting of those triples (T, u, i) such that the action of T on u is error and the action of h(T) on u is reduce i. Use Algorithm 7.8 on 6" and (3 c, Tc) to obtain another set of tables (3'c, T'c). (2) Apply Algorithm 7.7 to merge all pairs of tables T~ and Tz such that h(T1) = h(Tz). The resulting set of tables is (3, To). Let (T, u, i) be an element of 6". To show that 6' satisfies the requirements of being a postponement set for 3c, we must show that if T " = ( f " , g " ) is in NEXT(T, i), then f " ( u ) = error. To this end, suppose that production i is A ---, 0c and T " = GOTO(To, flA) for some viable prefix flA. Then T = GOTO(T0,/~).
In contradiction let us suppose t h a t f " ( u ) ~ error. Then there is some item [B ~ y • ,5, v] valid for flA, where u is in EFF(~v).:I: Every set of items, except the initial set of items, contains an item in which there is at least one symbol to the left of the dot. (See Exercise 7.3.8.) The initial set of items is valid only for e. Thus, we may assume without loss of generality that 7 = y'A for some y'. Then [B ---~ y' • A6, v] is valid for fl, and so is [A ---~ • ~, u]. Thus, [A --~ ~ . , u] is valid for p~, and f ( u ) should not be error as assumed. We conclude that 6' is indeed a legitimate postponement set for ~o. Let ~3~ be the result of applying Algorithm 7.8 to 3c using the postponetWe shall not bother to distinguish between [A ---~ ~ • ,8] and [A ---~ • • ,13,e]. ~Note that this statement is true independent of whether ~ = e or not.
TECHNIQUES FOR CONSTRUCTING LR(k) PARSERS
SEC. 7.4
627
ment set (P. Now suppose that T is a table in 3c associated with the set of items ~ and that T' is the corresponding modified table in 31- Then the only difference between T and T' is that T' may call for a reduction when T announces an error. This will occur whenever u is in FOLLOW(A) and the only items in ~ of the form [A ---, ~ . , v] have v ~ u. This follows from the fact that because of rule (lb) in Algorithm 7.10, T' will call for a reduction on all u such that u is FOLLOW(A) and [A --~ ct •] is an item in CORE(a). We can now define a partition II -- [(B t, ( B 2 , . . . , 6~r} on 31 which groups tables T1 and Tz in the same block if and only if h(T1) --h(T2). The fact that h commutes with GOTO ensures that II will be a compatible partition. Merging all tables in each block of this compatible partition using Algorithm 7.7 then produces 3. Since Algorithms 7.7 and 7.8 each preserve the equivalence of a set of tables, we have shown that 3, the set of LR(k) tables for G, is equivalent to 3c, the canonical set of LR(k) tables for G. [--] Before concluding our discussion of SLR parsing we should point out that the optimization techniques discussed in Section 7.3 also apply to SLR tables. Exercise 7.4.16 states which error entries in an SLR(1) set of tables are don't cares. 7.4.2.
Extending the SLR Concept to non-SLR Grammars
There are two big advantages in attempting to construct a parser for a grammar from So, the canonical collection of sets of LR(0) items. First, the amount of computation needed to produce So for a given grammar is much smaller in general than that required to generate ,~t, the sets of LR(1) items. Second, the number of sets of LR(0) items is generally considerably smaller than t h e number of sets of LR(1) items. However, the following question arises: What should we do if we have a grammar in which the F O L L O W sets are not sufficient to resolve parsing action conflicts resulting from inconsistent sets of LR(0) items ? There are several techniques we should consider before abandoning the LR(0) approach to parser design. One approach would be to try to use local context to resolve ambiguities. If this approach is unsuccessful, we might attempt to split one set of items into several. In each of the pieces the local context might result in unique parsing decisions. The following two examples illustrate each of these approaches. Example 7.26
Consider the LR(1) grammar G with productions (1) (2)
S ~
Aa
S ~
dAb
628
TECHNIQUES FOR PARSER OPTIMIZATION
(3) (4)
S ~
cb
S ~
dca
(5)
A-re
CHAP. 7
Even though L ( G ) consists only of four sentences, G is not an SLR(k) grammar for any k :> 0. The canonical collection of sets of LR(0) items for G is given in Fig. 7.38. The second components of the items have been omitted
~0: s' ~
.s
s ~ . A a [ . d A b [ .cb [ .dea A ---. . c ~x :
S ' ---> S .
¢gz :
S ~
a,3:
S---* d.Ab[d.ea A--+.e
a~4 :
A .a
S ---> c . b
A .--~ c. ¢g5:
S --~ Aa.
a~6 :
S ~
a7:
S --~ dc.a
dA.b
A ---~ cms: ~tg:
~a0:
S - - - ~ cb. S ~
dab.
S ~
dca.
Fig. 7.38 Sets of LR(0) items. and we have used the notation A --, al • fla i a~ • f121 "'" l a , " ft, as shorthand for the n items [ A - - , a l . f l ~ ] , [ A ~ a s - f l 2 ] , . . . , [ A ~ a , . f l , ] . There are two sets of items that are inconsistent, 64 a n d 67. Moreover, since FOLLOW(A) -- {a, b}, Algorithm 7.10 will not produce unique parsing actions from 64 and 67 on the lookaheads b and a, respectively. However, let us examine the GOTO function on the sets of items as graphically shown in Fig. 7.39.? We see that the only way to get to 64 from 6 0 is to have e on the pushdown list. If we reduce e to A, from the productions of the grammar, we see that a is the only symbol that can then follow A. Thus, 7'4, the table constructed from 64, would have the following unique parsing actions" c 4"
reduce 5
shift
d
error I error ]
tNote that this graph is acyclic but that, in general, a goto graph has cycles.
TECHNIQUESFOR CONSTRUCTINGLR(k) PARSERS
SEC. 7.4
S a
629
c
A
Fig. 7.39 GOTO graph. Similarly, from the GOTO graph we see that the only way to get to ~7 from a0 is to have de on the pushdown list. In this context if c is reduced to A, the only symbol that can then legitimately follow A is b. Thus, the parsing actions for TT, the table constructed from a7, would be a T7"[
shJift
b
c
d
reduce5 ] error ] error
The remaining LR(1) tables for G can be constructed using Algorithm 7.10 directly. [-] The grammar in Example 7.26 is not an SLR grammar. However, we were able to use lookahead to resolve all ambiguities in parsing action decisions in the sets of LR(0) items. The class of LR(k) grammars for which we can always construct LR parsers in this fashion is called the class of lookahead L R ( k ) grammars, L A L R ( k ) for short (see Exercise 7.4.1I for a more precise definition). The LALR(k) grammars are the largest natural subclass of the LR(k) grammars for which k symbol lookahead will resolve all parsing action conflicts arising in S0, the canonical collection of sets of LR(0) items. The lookaheads can be computed directly from the GOTO graph for S0 or by merging the sets of LR(k) items with identical cores. LALR grammars include all SLR grammars, but not all LR grammars are LALR grammars. We shall now give an example in which a set of items can be "split" to obtain unique parsing decisions.
630
TECHNIQUESFOR PARSER OPTIMIZATION
CHAP. 7
Example 7.27
Consider the LR(1) grammar G with productions (1) (2) (3) (4)
S ~ S--~ S ~ S -~
Aa dAb Bb dBa
(5) A - ~ c (6) B - - - ~ c This grammar is quite similar to the one in Example 7.26, but it is not an L A L R grammar. The canonical collection of sets of LR(0) items for the augmented grammar is shown in Fig. 7.40. The set of items et s is inconsistent because we do not know whether we should reduce by production A --~ e or B ~ e. Since FOLLOW(A) -- FOLLOW(B) = {a, b}, using these sets as lookaheads will not resolve this ambiguity. Thus G is not SLR(I). 12o:
S" ~
"S
S ~
.Aa
S ~
.dAb
S----~ . B b S ~ .dBa A ---+ . c B ---~ . e 121:
S" ----.~. S .
122:
S ~
A.a
123:
S ~
B.b
124:
S ~
d.Ab
S ~
d.Ba
as"
A - - ~ c. B--re.
126:
S----~Aa.
127 :
S---~Bb.
128 :
S ~
dA.b
129:
S ~
dB.a
121o:
S.--~ dAb.
121i:
S ~
dBa.
A .---~ . c B ---~ . c
Fig. 7.40
LR(O) items.
Examining the productions of the grammar, we know that if we have only e on the pushdown list and if the next input symbol is a, then we should use production A --~ e to reduce c. If the next input symbol is b, we should use production B---, e. However, if de appears on the pushdown list and the next input symbol is a, we should use production B---, c to reduce c. If the next input symbol is b, we should use A ----, c. The GOTO function for the set of items is shown in Fig. 7.41. Unfortunately, ~5 is accessible from a 0 under both ¢ and de. Thus, a 5 does not tell us whether we have e o r de on the pushdown list, and hence G is not LALR(1).
SEC. 7.4
TECHNIQUES FOR CONSTRUCTING LR(k) PARSERS
S
A
631
c
b
B
A
b
Fig. 7.41 GOTO graph. However, we can construct an LR(1) parser for G by replacing ~5 by two identical sets of items ~ and ~ ' such that ~ is accessible only from t~4 and ~ ' is accessible from only t/,0. These new sets of items provide the additional needed information about what has appeared on the pushdown list. From ~ and ~2~' we can construct the tables with unique parsing actions as follows: a
b
c
d
T~" reduce 6
reduce 5
error
error
TJt" reduce 5
reduce 6
error
error
t
The value of the goto functions of T~ and T~' is always error. 7.4.3.
D
Grammar Splitting
In this section we shall discuss another technique for constructing LR parsers. It is not as easy to apply as the SLR approach, but it does work in situations where the SLR approach does not. Here, we partition a grammar G = (N, E , P , S) into several component grammars by treating certain nonterminal symbols as terminal symbols. Let N' ~ N be such a set of "splitting,' nonterminals. For each A in N' we can find GA, the component grammar with start symbol A, using the following algorithm. ALGORITHM 7.11 Grammar splitting.
lnput. A C F G G = (N, Z, P, S) and N', a subset of N.
632
TECHNIQUES FOR PARSER OPTIMIZATION
CHAP. 7
Output. A set of component grammars GA for each A ~ N'. Method. For each A E N', construct GA as follows: (1) On the right-hand side of each production in P, replace every nonterminal B ~ N' by J~. Let ~ be [./~[B ~ N'} and let the resulting set of productions be P. (2) Define G~ = (N N' U {A}, Z U N, P, A). (3) Apply Algorithm 2.9 to eliminate useless nonterminals and productions from G]. Call the resulting reduced grammar Ga. D -
-
Here we consider the building of LR(1) parsers for each component grammar and the merger of these parsers. Alternatives involve the design of different kinds of parsers for the various components. For example, we could use an LL(1) parser for everything but expressions, for which we would use an operator precedence parser. Research extending the techniques of" this section to several types of parsers is clearly needed. Example 7.28
Let Go be the usual grammar and let N' = {E, T} be the splitting nonterminals. Then P consists of
T
> ~'* F I F
F
> (/~) [a
Thus, GE = ([E}, [/~, ~', +}, {E ~ E + ~'l~'}, E), and GT is given by (IT, F}, {~, E, (,), a, ,}, {T --~ 7~ • FI F, F ---} (~)[ a}, T). [~ We shall now describe a method of constructing LR(1) parsers for certain large grammars. The procedure is to initially partition a given grammar into a number of smaller grammars. If a collection of consistent sets of LR(1) items can be found for each component grammar and if certain conditions relating these sets of items are satisfied, then a set of LR(1) items can be constructed for the original grammar by combining the sets of items for the component grammars. The underlying philosophy of this procedure is that much less work is usually involved in building the collections of sets of LR(1) items for smaller grammars and merging them together than in constructing the canonical collection of sets of LR(1) items for one large grammar. Moreover, the resulting LR(1) parser will most likely turn out to be considerably smaller than the canonical parser. In the grammar-splitting algorithm we treat A as the start symbol of its own grammar and use FOLLOW(A) as the set of possible lookahead strings for the initial set of items for the subgrammar GA. The net effect will be to
SEC. 7.4
TECHNIQUES FOR CONSTRUCTING LR(k) PARSERS
633
merge certain sets of items having common cores. The similarity to the SLR algorithm should be apparent. In fact, we shall see that the SLR algorithm is really a grammar-splitting algorithm with N' = N. The complete technique can be summarized as follows. (1) Given a grammar G = (N, E, P, S), we ascertain a suitable splitting set of nonterminals N' ~ N. We include S in N'. This set should be large enough so that the component grammars are small, and we can readily construct sets of LR(1) tables for each component. At the same time, the number of components should not be so large that the method will fail to produce a set of tables. (This comment applies only to non-SLR grammars. If the grammar is SLR, any choice for N ' will work, and choosing N ----N' yields the smallest set of tables.) (2) Having chosen N', we compute the component grammars using Algorithm 7.11. (3) Using Algorithm 7.12, below, we compute the sets of LR(1) items for each component grammar. (4) Then, using Algorithm 7.13, we combine the component sets of items into S, a collection of sets of items for the original grammar. This process may not always yield a collection of consistent sets of items for the original grammar. However, if $ is consistent, we then construct a set of LR(1) tables from 8 in the usual manner. ALGORITHM 7.12 Construction of sets of LR(1) items for the component grammars of a given grammar.
Input. A grammar G = (N, Σ, P, S), a subset N' of N with S ∈ N', and the component grammars G_A for each A ∈ N'.
Output. Sets of LR(1) items for each component grammar.
Method. For notational convenience let N' = {S₁, S₂, ..., S_m}. We shall denote G_{S_i} as G_i. If 𝒜 is a set of LR(1) items, we compute 𝒜', the closure of 𝒜 with respect to G_A, in a manner similar, but not identical, to Algorithm 5.8. 𝒜' is defined as follows:
(1) 𝒜 ⊆ 𝒜'. (That is, all items in 𝒜 are in 𝒜'.)
(2) If [B → α · Cβ, u] is in 𝒜' and C → γ is a production in G_A, then [C → · γ, v] is in 𝒜' for all v in FIRST(β'u), where β' is β with each barred symbol replaced by the original symbol in N.
Thus all lookahead strings are in Σ*, while the first components of the items reflect productions in G_A. For each G_i, we construct 𝒮_i, the collection of sets of LR(1) items for G_i, as follows:
(1) Let 𝒜 be the closure (with respect to G_i) of {[S_i → · α, a] | S_i → α is a production in G_i and a is in FOLLOW(S_i)}. Let 𝒮_i = {𝒜}.
(2) Then repeat step (3) until no new sets of items can be added to 𝒮_i.
(3) If 𝒜 is in 𝒮_i, let 𝒜' be {[A → αX · β, u] | [A → α · Xβ, u] is in 𝒜}. Here X is in N ∪ Σ ∪ N̄. Add 𝒜'', the closure (with respect to G_i) of 𝒜', to 𝒮_i. Thus, 𝒜'' = GOTO(𝒜, X). □

Note that we have chosen to add FOLLOW(A) to the lookahead set of each initial set of items rather than augmenting each component grammar with a zeroth production. The effect will be the same.
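The FOLLOW sets used to seed the initial items are computed from the original grammar in the usual way. The sketch below is one possible (and entirely assumed) realization of that computation and of the seeding step; the symbol "e" plays the role of the empty string and of the end-of-input lookahead, as in the text.

```python
# Hypothetical sketch: FIRST/FOLLOW computation used to seed the lookaheads
# of the initial items of a component grammar, as in step (1) above.

def first_follow(productions, start):
    """productions: dict nonterminal -> list of tuples of symbols.
    Symbols that are not keys are terminals.  Returns (FIRST, FOLLOW)."""
    nonterms = set(productions)
    first = {A: set() for A in nonterms}

    def first_of(seq):
        out = set()
        for X in seq:
            fx = first[X] if X in nonterms else {X}
            out |= fx - {"e"}
            if "e" not in fx:
                return out
        out.add("e")                       # the whole string can vanish
        return out

    changed = True
    while changed:                         # FIRST to a fixed point
        changed = False
        for A, rhss in productions.items():
            for rhs in rhss:
                new = first_of(rhs)
                if not new <= first[A]:
                    first[A] |= new
                    changed = True

    follow = {A: set() for A in nonterms}
    follow[start].add("e")                 # "e" doubles as the end marker
    changed = True
    while changed:                         # FOLLOW to a fixed point
        changed = False
        for A, rhss in productions.items():
            for rhs in rhss:
                for i, X in enumerate(rhs):
                    if X not in nonterms:
                        continue
                    tail = first_of(rhs[i + 1:])
                    add = (tail - {"e"}) | (follow[A] if "e" in tail else set())
                    if not add <= follow[X]:
                        follow[X] |= add
                        changed = True
    return first, follow

G0 = {"E": [("E", "+", "T"), ("T",)],
      "T": [("T", "*", "F"), ("F",)],
      "F": [("(", "E", ")"), ("a",)]}
_, follow = first_follow(G0, "E")
print(sorted(follow["T"]))   # [')', '*', '+', 'e'] -- FOLLOW(T) of Example 7.29

# Seed the initial items of component G_T: [T -> . rhs, b] for b in FOLLOW(T).
seed_T = {("T", rhs, 0, b) for rhs in G0["T"] for b in follow["T"]}
```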
Example 7.29

Let us apply Algorithm 7.12 to G₀, with N' = {E, T}. We find that FOLLOW(E) = {+, ), e} and FOLLOW(T) = {+, *, ), e}. Thus, by step (1), 𝒜₀^E consists of

[E → · Ē + T̄, +/)/e]
[E → · T̄, +/)/e]

Likewise, 𝒜₀^T consists of

[T → · T̄ * F, +/*/)/e]
[T → · F, +/*/)/e]
[F → · (Ē), +/*/)/e]
[F → · a, +/*/)/e]
The complete sets of items generated for G_E are shown in Fig. 7.42, and those for G_T are shown in Fig. 7.43. Note that when 𝒜₁^E is constructed from 𝒜₀^E, for example, the symbol Ē is a terminal, and the closure operation yields no new items. □

We now give an algorithm that takes the sets of items generated by Algorithm 7.12 for the component grammars and combines them to form
[Fig. 7.42. Sets of items for G_E.]
[Fig. 7.43. Sets of items for G_T.]
a set of LR(1) tables for the original grammar, provided that certain conditions hold.

ALGORITHM 7.13
Construction of a set of LR(1) tables from the sets of LR(1) items of component grammars.
Input. A CFG G = (N, Σ, P, S₁), a splitting set N' = {S₁, S₂, ..., S_m}, and a collection {𝒮₁, 𝒮₂, ..., 𝒮_m} of sets of LR(1) items for the component grammars G_i.
Output. A valid set of LR(1) tables for G or an indication that the sets of items will not yield a valid set of tables.
Method.
(1) In the first component of each item, replace each symbol of the form S̄_i by S_i. Each such S_i is in N'. Retain the original name for each set of items so altered.
(2) Let 𝒜₀ = {[S'₁ → · S₁, e]}. Apply the following augmenting operation to 𝒜₀, and call the resulting set of items 𝒜₀. 𝒜₀ will then be the "initial" set of items in the merged collection of sets of items.
Augmenting Operation. If a set of items 𝒜 contains an item with a first component of the form A → α · Bβ and B ⇒* S_i γ for some S_i in N' and γ in (N ∪ Σ)*, then add 𝒜₀^i (the initial set of items for G_i) to 𝒜. Repeat this process until no new sets of items can be added to 𝒜.
(3) We shall now construct 𝒮, the collection of sets of items accessible from 𝒜₀. Initially, let 𝒮 = {𝒜₀}. We then perform step (4) until no new sets of items can be added to 𝒮.
(4) Let 𝒜 be in 𝒮. 𝒜 can be written as 𝒜' ∪ 𝒜₁ ∪ 𝒜₂ ∪ ⋯ ∪ 𝒜_r, where each 𝒜_k is one of the component sets of items and 𝒜' is either the empty set or {[S'₁ → · S₁, e]} or {[S'₁ → S₁ ·, e]}. For each X in N ∪ Σ, let ℬ' = GOTO(𝒜', X) and ℬ_k = GOTO(𝒜_k, X).† Let ℬ be the union of ℬ' and these ℬ_k's. Then apply the augmenting operation to ℬ and call the resulting set of items 𝒜''. Let KGOTO‡ be the function such that KGOTO(𝒜, X) = 𝒜'' if 𝒜, X, and 𝒜'' are related as above. Add 𝒜'' to 𝒮 if it is not already there. Repeat this process until for all 𝒜 in 𝒮 and X in N ∪ Σ, KGOTO(𝒜, X) is in 𝒮.
(5) When no new set of items can be added to 𝒮, construct a set of LR(1) tables from 𝒮 using the methods of Section 5.2.5. If table T = (f, g) is being constructed from the set of items 𝒜, then g(X) is KGOTO(𝒜, X). If any set of items produces parsing action conflicts, report failure. □

†The GOTO function for the relevant component grammar is meant here. However, if X is a splitting nonterminal, then use X̄ in place of X.
‡The K honors A. J. Korenjak, inventor of the method being described.
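The augmenting operation can be implemented by first precomputing, for every nonterminal B, which splitting nonterminals B can derive in the leftmost position. The following sketch of that idea is ours; the item and grammar representations are assumptions, not the book's.

```python
# Hypothetical sketch of the augmenting operation of Algorithm 7.13.  An item
# is a tuple (lhs, rhs, dot, lookahead); initial_sets maps each splitting
# nonterminal S_i to its initial set of items A0_i.

def left_derivable(productions, nonterms):
    """left[B] = set of nonterminals X such that B =>* X gamma, X leftmost."""
    left = {B: {B} for B in nonterms}
    changed = True
    while changed:
        changed = False
        for B, rhss in productions.items():
            for rhs in rhss:
                if rhs and rhs[0] in nonterms and not left[rhs[0]] <= left[B]:
                    left[B] |= left[rhs[0]]
                    changed = True
    return left

def augment(item_set, initial_sets, left):
    """Add A0_i whenever some item has the dot before a symbol B with
    B =>* S_i gamma; repeat until nothing new can be added."""
    result = set(item_set)
    changed = True
    while changed:
        changed = False
        for (lhs, rhs, dot, look) in list(result):
            if dot < len(rhs):
                B = rhs[dot]
                for S_i in left.get(B, set()):
                    if S_i in initial_sets and not initial_sets[S_i] <= result:
                        result |= initial_sets[S_i]
                        changed = True
    return result
```

With this helper, KGOTO is just the componentwise GOTO followed by a call to augment.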
Example 7.30

Let us apply Algorithm 7.13 to the sets of items in Figs. 7.42 and 7.43. The effect of step (1) should be obvious. Step (2) first creates the set of items 𝒜₀ = {[E' → · E, e]}, and after applying the augmenting operation, 𝒜₀ = {[E' → · E, e]} ∪ 𝒜₀^E ∪ 𝒜₀^T. At the beginning of step (3), 𝒮 = {𝒜₀}. Applying step (4), we first compute

𝒜₁ = KGOTO(𝒜₀, E) = {[E' → E ·, e]} ∪ 𝒜₁^E.

That is, GOTO({[E' → · E, e]}, E) = {[E' → E ·, e]} and GOTO(𝒜₀^E, E) = 𝒜₁^E. GOTO(𝒜₀^T, E) is empty. The augmenting operation does not enlarge 𝒜₁. We then compute 𝒜₂ = KGOTO(𝒜₀, T) = 𝒜₂^E ∪ 𝒜₁^T. The augmenting operation does not enlarge 𝒜₂. Continuing in this fashion, we obtain a collection 𝒮 of twelve sets of items 𝒜₀, 𝒜₁, ..., 𝒜₁₁, each a union of sets of items from Figs. 7.42 and 7.43 (together, for 𝒜₀ and 𝒜₁, with the items for the zeroth production).
All sets of items in 𝒮 are consistent, and so from 𝒮 we can construct the set of LR(1) tables shown in Fig. 7.44. Table T_i is constructed from 𝒜_i. This set of tables is identical to that in Fig. 7.37 (p. 625). We shall see that this is not a coincidence.
[Fig. 7.44. Tables for G₀ from Algorithm 7.13: the action and goto fields of tables T₀ through T₁₁.]
We shall now show that this approach yields a set of LR(1) tables that is equivalent to the canonical set of LR(1) tables. We begin by characterizing the merged collection of sets of items generated in step (4) of Algorithm 7.13. DEFINITION
Let KGOTO be the function in step (4) of Algorithm 7.13. We extend KGOTO to strings in the obvious way; i.e., (1) KGOTO(𝒜, e) = 𝒜, and (2) KGOTO(𝒜, αX) = KGOTO(KGOTO(𝒜, α), X).
Let G = (N, Σ, P, S) be a CFG and N' a splitting set. We say that item [A → α · β, a] is quasi-valid for string γ if it is valid (as defined in Section 5.2.3) or if there is a rightmost derivation S' ⇒* δ₁Bx ⇒* δ₁δ₂Ax ⇒ δ₁δ₂αβx in the augmented grammar such that
(1) δ₁δ₂α = γ,
(2) B is in N', and
(3) a is in FOLLOW(B).
Note that if [A → α · β, a] is quasi-valid for γ, then there exists a lookahead b such that [A → α · β, b] is valid for γ. (b might be a.)

LEMMA 7.3
Let G = (N, Σ, P, S) be a CFG as above and let KGOTO(𝒜₀, γ) = 𝒜 for some γ in (N ∪ Σ)*. Then 𝒜 is the set of quasi-valid items for γ.
Proof. We prove the result by induction on |γ|. The basis, γ = e, is omitted, as it follows from observations made during the inductive step. Assume that γ = γ'X and that 𝒜₁, the set of quasi-valid items for γ', is KGOTO(𝒜₀, γ'). Write 𝒜₁ as a union of component sets of items, as in step (4) of Algorithm 7.13. The case where [S' → · S, e] or [S' → S ·, e] is in 𝒜 can be handled easily, and we shall omit these details here. Suppose that [A → α · β, a] is in 𝒜 = KGOTO(𝒜₀, γ'X). There are three ways in which [A → α · β, a] can be added to 𝒜.

Case 1: Suppose that [A → α · β, a] is in GOTO(𝒜'_p, X) for some component set 𝒜'_p of 𝒜₁, that α = α'X, and that [A → α' · Xβ, a] is in 𝒜'_p. Then [A → α' · Xβ, a] is quasi-valid for γ', and it follows that [A → α · β, a] is quasi-valid for γ.

Case 2: Suppose that [A → α · β, a] is in GOTO(𝒜'_p, X) and that α = e. Then there is an item [B → δ₁X · Cδ₂, b] in GOTO(𝒜'_p, X), and C ⇒* Aw, where a is in FIRST(wδ₂b). Then [B → δ₁ · XCδ₂, b] is quasi-valid for γ', and [B → δ₁X · Cδ₂, b] is quasi-valid for γ. If a is the first symbol of w, or if w = e and a comes from δ₂, then [A → α · β, a] is valid for γ. Likewise, if [B → δ₁X · Cδ₂, b] is valid for γ, so is [A → α · β, a]. Thus, suppose that w = e, δ₂ ⇒* e, a = b, and [B → δ₁X · Cδ₂, b] is quasi-valid, but not valid, for γ. Then there is a rightmost derivation

S' ⇒* δ₃Dx ⇒* δ₃δ₄Bx ⇒ δ₃δ₄δ₁XCδ₂x ⇒* δ₃δ₄δ₁XAx,

where δ₃δ₄δ₁X = γ, D is in N', and b is in FOLLOW(D). Thus, item [A → α · β, a] is quasi-valid for γ, since a = b.

Case 3: Suppose that [A → α · β, a] is added to 𝒜 during the augmenting operation in step (4). Then α = e, and there must be some [B → δ₁X · Cδ₂, b] in GOTO(𝒜'_p, X) such that C ⇒* Dw₁ ⇒* Aw₂w₁, D is in N', and a is in FIRST(w₂c) for some c in FOLLOW(D). That is, D is the S_i whose initial set of items is adjoined when [A → α · β, a] is added to 𝒜. The argument now proceeds as in Case 2.

We must now show the converse of the above, that if [A → α · β, a] is
quasi-valid for γ, then it is in 𝒜. We omit the easier case, where the item is actually valid for γ. Let us assume that [A → α · β, a] is quasi-valid, but not valid. Then there is a rightmost derivation

S' ⇒* δ₁Bx ⇒* δ₁δ₂Ax ⇒ δ₁δ₂αβx,

where γ = δ₁δ₂α, B is in N', and a is in FOLLOW(B). If α ≠ e, then we may write α = α'X. Then [A → α' · Xβ, a] is quasi-valid for γ' and is therefore in 𝒜₁. It is immediate that [A → α · β, a] is in 𝒜. Thus, suppose that α = e. We consider two cases, depending on whether δ₂ ≠ e or δ₂ = e.

Case 1: δ₂ ≠ e. Then there is a derivation B ⇒* δ₃C ⇒ δ₃δ₄Xδ₅ and δ₅ ⇒* A. Then [C → δ₄ · Xδ₅, a] is quasi-valid for γ' and hence is in 𝒜₁. It follows that [C → δ₄X · δ₅, a] is in 𝒜. Since δ₅ ⇒* A, it is not hard to show that either in the closure operation or in the augmenting operation [A → α · β, a] is placed in 𝒜.

Case 2: δ₂ = e. Then there is a derivation S' ⇒* δ₃Cy ⇒ δ₃δ₄Xδ₅y, where δ₅y ⇒* Bx. Then [C → δ₄ · Xδ₅, c] is valid for γ', where c is FIRST(y). Hence, [C → δ₄X · δ₅, c] is in 𝒜. Then, since δ₅y ⇒* Bx, in the augmenting operation all items of the form [B → · θ, b] are added to 𝒜, where B → θ is a production and b is in FOLLOW(B). Then, since B ⇒* A, the item [A → α · β, a] is added to 𝒜, either in the modified closure of the set containing [B → · θ, b] or in a subsequent augmenting operation. □
THEOREM 7.10
Let G = (N, Σ, P, S) be a CFG. Let (𝒯, T₀) be the set of LR(1) tables for G generated by Algorithm 7.13, and let (𝒯_c, T₀c) be the canonical set. Then the two sets of tables are equivalent.
Proof. We observe by Lemma 7.3 that the table of 𝒯 associated with a string γ agrees in action with the table of 𝒯_c associated with γ wherever the latter is not error. Thus, if the two sets are inequivalent, we can find an input w such that (T₀, w) ⊢* (T₀X₁T₁ ⋯ X_mT_m, x)† using 𝒯_c, and an error is then declared, while

(T₀, w) ⊢* (T₀X₁T'₁ ⋯ X_mT'_m, x) ⊢* (T₀Y₁U₁ ⋯ Y_nU_n, x) ⊢ (T₀Y₁U₁ ⋯ Y_nU_n aU, x')

using 𝒯.

†We have omitted the output field in these configurations for simplicity.
Suppose that table U_n is constructed from the set of items 𝒜. Then 𝒜 has some member [A → α · β, b] such that β ≠ e and a is in EFF(β). Since [A → α · β, b] is quasi-valid for Y₁ ⋯ Y_n, by Lemma 7.3, there is a derivation S' ⇒* γ'Ay ⇒ γ'αβy for some y, where γ'α = Y₁ ⋯ Y_n. Since we have the derivation Y₁ ⋯ Y_n ⇒* X₁ ⋯ X_m, it follows that there is an item [B → β₁ · β₂, c] valid for X₁ ⋯ X_m, where a is in EFF(β₂c). (The case β₂ = e, where a = c, is not ruled out. In fact, it will occur whenever the sequence of steps

(T₀X₁T'₁ ⋯ X_mT'_m, x) ⊢* (T₀Y₁U₁ ⋯ Y_nU_n, x)

is not null. If that sequence is null, then [A → α · β, b] suffices for [B → β₁ · β₂, c].)

Since a = FIRST(x), the hypothesis that 𝒯_c declares an error in configuration (T₀X₁T₁ ⋯ X_mT_m, x) is false, and we may conclude the theorem. □

Let us compare the grammar-splitting algorithm with the SLR approach. The grammar-splitting algorithm is a generalization of the SLR method in the following sense.

THEOREM 7.11
Let G = (N, Σ, P, S) be a CFG. Then G is SLR(1) if and only if Algorithm 7.13 succeeds with splitting set N. If so, then the sets of tables produced by the two methods are the same.
Proof. If N' = N, then [A → α · β, a] is quasi-valid for γ if and only if [A → α · β, b] is valid for γ for some b and a is in FOLLOW(A). This is a consequence of the fact that if B ⇒* θC, then FOLLOW(B) ⊆ FOLLOW(C).
It follows from Lemma 7.3 that the SLR sets of items are the same as those generated by Algorithm 7.13. □

Theorem 7.9 is thus a corollary of Theorems 7.10 and 7.11.

Algorithm 7.13 brings the full power of the canonical LR(1) parser construction procedure to bear on each component grammar. Thus, from an intuitive point of view, if a grammar is not SLR, we would like to isolate in a single component each aspect of the given grammar that results in its being non-SLR. The following example illustrates this concept.

Example 7.31
Consider the following LR(1) grammar:
(1) S → Aa
(2) S → dAb
(3) S → cb
(4) S → BB
(5) A → c
(6) B → Bc
(7) B → b
This grammar is not SLR because of productions (1), (2), (3), and (5). Using the splitting set {S, B}, these four productions will be together in one component grammar to which the full LR(1) technique will then be applied. With this splitting set Algorithm 7.13 would produce the set of LR(1) tables in Fig. 7.45. □
[Fig. 7.45. LR(1) tables.]
Finally, we observe that neither Algorithm 7.10 nor Algorithm 7.13 makes maximal use of the principles of error postponement and table merger. For example, the SLR(1) grammar in Exercise 7.3.1(a) has a canonical set of 18 LR(1) tables. Algorithm 7.13 will yield a set of LR(1) tables containing at least 14 tables, and Algorithm 7.10 will produce a set of 14 SLR(1) tables.
But a judicious application of error postponement and table merger can result in an equivalent set of 7 LR(1) tables.
EXERCISES
"7.4.1.
Consider the class {G1, G2 . . . . } of LR(0) grammars, where G. has the following productions"
S
~Ai
l~iaiA i
l 0.
*7.4.6. Show that every LL(1) grammar is an SLR(1) grammar. Is every LL(2) grammar an SLR(2) grammar?

7.4.7. Using Algorithm 7.10, construct a parser for each grammar in Exercise 7.3.1.
7.4.8. Let 𝒮_c be the canonical collection of sets of LR(k) items for G. Let 𝒮₀ be the set of LR(0) items for G. Show that 𝒮_c and 𝒮₀ have the same sets of cores. Hint: Let 𝒜 = GOTO(𝒜₀, α), where 𝒜₀ is the initial set of 𝒮₀, and proceed by induction on |α|.
7.4.9. Show that CORE(GOTO(𝒜, α)) = GOTO(CORE(𝒜), α), where 𝒜 ∈ 𝒮_c as above.

DEFINITION
A grammar G = (N, Σ, P, S) is said to be lookahead LR(k) [LALR(k)] if the following algorithm succeeds in producing LR(k) tables:
(1) Construct 𝒮_c, the canonical collection of sets of LR(k) items for G.
(2) For each 𝒜 ∈ 𝒮_c, let 𝒜' be the union of those ℬ ∈ 𝒮_c such that CORE(ℬ) = CORE(𝒜).
(3) Let 𝒮 be the set of those 𝒜' constructed in step (2). Construct a set of LR(k) tables from 𝒮 in the usual manner.
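Step (2) of this definition is just a grouping of item sets by core. A minimal sketch of that merging step, under an assumed encoding of items as (production, dot, lookahead) triples, is:

```python
# Hypothetical sketch of LALR merging: unite canonical LR(k) item sets that
# have the same core.  Conflict checking and table construction are omitted.

def core(item_set):
    return frozenset((prod, dot) for (prod, dot, _look) in item_set)

def lalr_merge(canonical_collection):
    by_core = {}
    for item_set in canonical_collection:
        by_core.setdefault(core(item_set), set()).update(item_set)
    return [frozenset(s) for s in by_core.values()]

# Two canonical LR(1) sets with equal cores collapse into one LALR set:
a = frozenset({(3, 1, "b"), (5, 2, "e")})
b = frozenset({(3, 1, "c"), (5, 2, "e")})
print(lalr_merge([a, b]))   # one set with lookaheads b, c, and e
```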
7.4.10. Show that if G is SLR(k), then it is LALR(k).
7.4.11. Show that the LALR table-constructing algorithm above yields a set of tables equivalent to the canonical set.

7.4.12. (a) Show that the grammar in Example 7.26 is LALR(1).
(b) Show that the grammar in Example 7.27 is not LALR(k) for any k.
7.4.13. Let G be defined by
S → L = R | R
L → *R | a
R → L
Show that G is LALR(1) but not SLR(1).

7.4.14. Show that there are LALR grammars that are not LL, and conversely.
*7.4.15. Let G be an LALR(1) grammar. Let 𝒮₀ be the canonical collection of sets of LR(0) items for G. We say that 𝒮₀ has a shift-reduce conflict if some 𝒜 ∈ 𝒮₀ contains items [A → α ·] and [B → β · aγ] where a ∈ FOLLOW(A). Show that the LALR parser resolves each shift-reduce conflict in the LR(0) items in favor of shifting.
"7.4.16.
Let (3, To) be the set of SLR(1) tables for an SLR(1) grammar G = (N, ~, P, S). Show that: (1) All error entries in the goto field of each table are don't cares. (2) An error entry on input symbol a in the action field of table T is essential (not a don't care) if and only if one of the following conditions holds. (a) T is To, the initial table. (b) There is a table T ' = (f, g) in ~3 such that T = g(b) for some binS. (c) There is some table T' = (f, g) such that T c NEXT (T', i) and f(a) = reduce i.
"7.4.17.
Let G = (N, ~, P, S) be a CFG. Show that G is LR(k) if and only if, for each splitting set N ' ~ N, all component grammars G,~ are LR(k), A ~ N'. [Note: G may not be LR(k) but yet have some splitting set N' such that each Ga is LR(k) for A ~ N'.]
7.4.18. Repeat Exercise 7.4.17 for LL(k) grammars.

7.4.19. Use Algorithms 7.12 and 7.13 to construct a set of LR(1) tables for G₀ using the splitting set {E, T, F}. Compare the set of tables obtained with that in Fig. 7.44 (p. 637).

**7.4.20. Under what conditions will all splitting sets on the nonterminals of a grammar G cause Algorithms 7.12 and 7.13 to produce the same set of LR(1) tables for G?
"7.4.21.
Suppose that the LALR(1) algorithm fails to produce a set of LR(1) tables for a grammar G = (N, ~, P, S) because a set of items containing [A----~ t~., a] and [B ~ f l ' 7 , b] is generated such that ~, ~ e and
644
TECHNIQUES FOR PARSER OPTIMIZATION
CHAP. 7
a ~ EFF(?b). Show that if N' ~ N is a splitting set and A ~ N', then Algorithm 7.13 will also not produce a set of LR(1) tables with unique parsing actions. 7.4.22.
Use Algorithms 7.12 and 7.13 to try to construct a set of LR(1) tables for the grammar in Example 7.27 using the splitting set [S, A}.
7.4.23. Use Algorithms 7.12 and 7.13 to construct a set of LR(1) tables for the grammar in Exercise 7.3.1(a) using the splitting set {S, A}.

*7.4.24. Use error postponement and table merger to find an equivalent set of LR(1) tables for the grammar in Exercise 7.3.1(a) that has seven tables.
*7.4.25. Give an example of a (1, 1)-BRC grammar which is not SLR(1).

*7.4.26. Show that if Algorithm 7.9 is applied to a set of SLR(1) tables for a grammar having at most one single production for any nonterminal, then all reductions by single productions are eliminated.
Research Problems

7.4.27. Find additional ways of constructing small sets of LR(k) tables for LR(k) grammars without resorting to the detailed transformations of Section 7.3. Your methods need not work for all LR(k) grammars but should be applicable to at least some of the practically important grammars, such as those listed in the Appendix of Volume I.
7.4.28. When an LR(k) parser enters an error configuration, in practice we would call an error recovery routine that modifies the input and the pushdown list so that normal parsing can resume. One method of modifying the error configuration of an LR(1) parser is to search forward on the input tape until we find one of certain input symbols. Once such an input symbol a has been found, we look down into the pushdown list for a table T = (f, g) such that T was constructed from a set of items 𝒜 containing an item of the form [A → · α, a], A ≠ S. The error recovery procedure is to delete all input symbols up to a and to remove all symbols and tables above T on the pushdown list. The nonterminal A is then placed on top of the pushdown list and table g(A) is placed on top of A. Because of Exercise 7.3.8(c), g(A) ≠ error. The effect of this error recovery action is to assume that the grammar symbols above T on the pushdown list together with the input symbols up to a form an instance of A. Evaluate the effectiveness of this error recovery procedure, either empirically or theoretically. A reasonable criterion of effectiveness is the likelihood of properly correcting the errors chosen from a set of "likely" programmer errors. (A sketch of the recovery action itself follows.)
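The following sketch, under our own assumed data layout (a stack of (symbol, table) pairs, tables carrying their item set and goto map), illustrates the recovery action just described; it is not a prescribed implementation.

```python
from collections import namedtuple

# A table carries the item set it was constructed from and its goto map.
# Items are triples (lhs, dot, lookahead); dot == 0 marks items [A -> . alpha, a].
Table = namedtuple("Table", "items goto")

def recover(stack, tokens, pos, sync_symbols, start_symbol):
    while pos < len(tokens) and tokens[pos] not in sync_symbols:
        pos += 1                              # delete input up to a sync symbol a
    if pos == len(tokens):
        return None                           # no recovery possible
    a = tokens[pos]
    for depth in range(len(stack) - 1, -1, -1):
        table = stack[depth][1]
        for (lhs, dot, look) in table.items:
            if dot == 0 and look == a and lhs != start_symbol:
                del stack[depth + 1:]         # pop everything above T
                stack.append((lhs, table.goto[lhs]))   # assume an instance of A
                return pos                    # resume normal parsing here
    return None
```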
7.4.29. When a grammar is split, the component grammars can be parsed in different ways. Investigate ways to combine various types of parsers for the components. In particular, is it possible to parse one component bottom up and another top down?
Programming Exercises

7.4.30. Write a program that constructs the SLR(1) set of tables from an SLR(1) grammar.

7.4.31. Write a program that finds all inaccessible error entries in a set of LR(1) tables.

7.4.32. Write a program to construct an LALR(1) parser from an LALR(1) grammar.

7.4.33. Construct an SLR(1) parser with error recovery for one of the grammars in the Appendix of Volume I.
BIBLIOGRAPHIC NOTES

Simple LR(k) grammars and LALR(k) grammars were first studied by DeRemer [1969, 1971]. The technique of constructing the canonical set of LR(0) items for a grammar and then using lookahead to resolve parsing decision ambiguities was also advocated by DeRemer. The grammar-splitting approach to LR parser design was put forward by Korenjak [1969]. Exercise 7.4.1 is from Earley [1968]. The error recovery procedure in Exercise 7.4.28 was suggested by Leinius [1970]. Exercise 7.4.26 is from Aho and Ullman [1972d].
7.5. PARSING AUTOMATA

Instead of looking at an LR parser as a routine which treats the LR tables as data, in this section we shall take the point of view that the LR tables control the parser. Adopting this point of view, we can develop another approach to the simplification of LR parsers. The central idea of this section is that if the LR tables are in control of the parser, then each table can be considered as the state of an automaton which implements the LR parsing algorithm. The automaton can be considered as a finite automaton with "side effects" that manipulate a pushdown list. Minimization of the states of the automaton can then take place in a manner similar to Algorithm 2.2.

7.5.1. Canonical Parsing Automata
An LR parsing algorithm makes its decision to shift or reduce by looking at the next k input symbols and consulting the governing table, the table on top of the pushdown list. If a reduction is made, the new governing table is determined by examining the table on the pushdown list which is exposed during the reduction.
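For contrast with the state-as-program view developed below, here is a schematic driver for the usual table-as-data view just described. The table format (each table is a pair of dicts, action and goto) and the production encoding are our own assumptions; the driver follows the earlier convention in which the governing table sits on top of the pushdown list.

```python
# Hypothetical sketch of a table-driven LR(1) parsing loop.

def lr_parse(tables, start_table, productions, tokens):
    """tokens must end with the endmarker 'e'.  productions[i] = (lhs, rhs_len).
    Returns the sequence of productions used, or None on error."""
    stack = [start_table]                  # only table names are stacked
    pos, output = 0, []
    while True:
        action, goto = tables[stack[-1]]
        move = action.get(tokens[pos], "error")
        if move == "accept":
            return output
        if move == "shift":
            stack.append(goto[tokens[pos]])
            pos += 1
        elif isinstance(move, int):        # reduce by production number `move`
            lhs, rhs_len = productions[move]
            del stack[len(stack) - rhs_len:]       # expose the governing table
            _, exposed_goto = tables[stack[-1]]
            stack.append(exposed_goto[lhs])
            output.append(move)
        else:
            return None                    # error entry
```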
It is entirely possible to imagine that the tables themselves are parts of a program, and that program control lies with the governing table itself. Typically, the program will be written in some easily interpreted language, so the distinction between a set of tables driving a parsing routine and an interpreted program is not significant.

Let G be an LR(0) grammar and (𝒯, T₀) the set of LR(0) tables for G. From the tables we shall construct a parsing automaton for G that mimics the behavior of the LR(0) parsing algorithm for G using the set of tables 𝒯. We notice that if T = (f, g) is an LR(0) table in 𝒯, then f(e) is either shift, reduce, or accept. Thus, we can refer to tables as "shift" tables or "reduce" tables as determined by the parsing action function. We shall initially have one program for each table. We can interpret these programs as states of the parsing automaton for G.

A shift state T = (f, g) does the following:
(1) T, the name of the state, is placed on top of the pushdown list.
(2) The next input symbol, say a, is removed from the input and control passes to the state g(a).
In the previous versions of LR parsing we also placed the input symbol a on the pushdown list on top of the table T. But, as we have pointed out, storing the grammar symbols on the pushdown list is not necessary, and for the remainder of this section we shall not place any grammar symbols on the pushdown list.

A reduce state does the following:
(1) Let A → α be production i according to which the reduction is to be made. The top |α| − 1 symbols are removed (popped) from the pushdown list.† (If α = e, then the controlling state is placed on top of the pushdown list.)
(2) The state name now on top of the pushdown list is determined. Suppose that that state is T = (f, g). Control passes to the state g(A) and the production number i is emitted.
A special case occurs when the "reduce" action really means accept. In that case, the entire process terminates and the automaton enters an (accepting) final state.

It is straightforward to show that the collection of states defined above will do to the pushdown list exactly what the LR(k) parser does (except here we have not written any grammar symbols on the pushdown list) if we identify the state names and the tables from which the states are derived. The only exception is that the LR(k) parsing algorithm places the name of the governing table on top of the pushdown list, while here the name of

†We remove |α| − 1 symbols, rather than |α| symbols, because the table corresponding to the rightmost symbol of α is in control and does not appear on the list.
that table does not appear but is indicated by the fact that program control lies with that table.

We shall now define the parsing automaton which executes these parsing actions directly. There is a state of the automaton for each state (i.e., table) in the above sense. The input symbols for the automaton are the terminals of the grammar and the state names themselves. A shift state makes transitions only on terminals, and a reduce state makes transitions only on state names. In fact, a state T calling for a reduction according to production A → α need have a transition specified for state T' only when T is in GOTO(T', α).

We should remember that this automaton is more than a finite automaton, in that the states have side effects on a pushdown list. That is, each time a state transition occurs, something happens to the pushdown list which is not reflected in the finite automaton model of the system. Nevertheless, we can reduce the number of states of the parsing automaton in a manner similar to Algorithm 2.2. The difference in this case is that we must be sure that all subsequent side effects are the same if two states are to be placed in the same equivalence class. We now give a formal definition of a parsing automaton.

DEFINITION
Let G = (N, Σ, P, S) be an LR(0) grammar and (𝒯, T₀) its set of LR(0) tables. We define an incompletely specified automaton M, called the canonical parsing automaton for G. M is a 5-tuple (𝒯, Σ ∪ 𝒯 ∪ {$}, δ, T₀, {T₁}), where
(1) 𝒯 is the set of states.
(2) Σ ∪ 𝒯 is the set of possible input symbols. The symbols in Σ are on the input tape, and those in 𝒯 are on the pushdown list. Thus, 𝒯 is both the set of states and a subset of the inputs to the parsing automaton.
(3) δ is a mapping from 𝒯 × (Σ ∪ 𝒯) to 𝒯. δ is defined as follows:
(a) If T ∈ 𝒯 is a shift state, δ(T, a) = GOTO(T, a) for all a ∈ Σ.
(b) If T ∈ 𝒯 is a reduce state calling for a reduction by production A → α and if T' is in GOTO⁻¹(T, α) [i.e., T = GOTO(T', α)], then δ(T, T') = GOTO(T', A).
(c) δ(T, X) is undefined otherwise.
The canonical parsing automaton is a finite transducer with side effects on a pushdown list. Its behavior can be described in terms of configurations consisting of 4-tuples of the form (α, T, w, π), where
(1) α represents the contents of the pushdown list (with the topmost symbol on the right).
(2) T is the governing state.
(3) w is the unexpended input.
(4) π is the output string to this point.
Moves can be reflected by a relation ⊢ on configurations. If T is a shift
state and δ(T, a) = T', we write (α, T, aw, π) ⊢ (αT, T', w, π). If T calls for a reduction according to the ith production A → θ, and δ(T, T') = T'', we write (αT'β, T, w, π) ⊢ (αT', T'', w, πi) for all β of length |θ| − 1. If |θ| = 0, then (α, T, w, π) ⊢ (αT, T'', w, πi); in this case, T and T' are the same. Note that if we had included the controlling state symbol as the top symbol of the pushdown list, we would have the configurations of the usual LR parsing algorithm.

We define ⊢⁺, ⊢*, and ⊢ᵏ in the usual fashion. An initial configuration is one of the form (e, T₀, w, e), and an accepting configuration is one of the form (T₀, T₁, e, π). If (e, T₀, w, e) ⊢* (T₀, T₁, e, π), then we say that π is the parse produced by M for w.
Example 7.32

Let us consider the LR(0) grammar G:
(1) S → aA
(2) S → aB
(3) A → bA
(4) A → c
(5) B → bB
(6) B → d
generating the regular set ab*(c + d). The ten LR(0) tables for G are listed in Fig. 7.46.
[Fig. 7.46. LR(0) tables.]
T₀, T₂, and T₅ are shift states. Thus, we have a parsing automaton with the following shift rules:

δ(T₀, a) = T₂
δ(T₂, b) = T₅    δ(T₂, c) = T₆    δ(T₂, d) = T₇
δ(T₅, b) = T₅    δ(T₅, c) = T₆    δ(T₅, d) = T₇

We compute the transition rules for the reduce states T₃, T₄, T₆, T₇, T₈, and T₉. Table T₃ reduces using production S → aA. Since GOTO⁻¹(T₃, aA) = {T₀} and GOTO(T₀, S) = T₁, δ(T₃, T₀) = T₁ is the only rule for T₃. Table T₇ reduces by B → d. Since GOTO⁻¹(T₇, d) = {T₂, T₅}, GOTO(T₂, B) = T₄, and GOTO(T₅, B) = T₉, the rules for T₇ are δ(T₇, T₂) = T₄ and δ(T₇, T₅) = T₉. The reduce rules are summarized below:

δ(T₃, T₀) = T₁    δ(T₄, T₀) = T₁
δ(T₆, T₂) = T₃    δ(T₆, T₅) = T₈
δ(T₇, T₂) = T₄    δ(T₇, T₅) = T₉
δ(T₈, T₂) = T₃    δ(T₈, T₅) = T₈
δ(T₉, T₂) = T₄    δ(T₉, T₅) = T₉

The transition graph of the parsing automaton is shown in Fig. 7.47. With input abc, this canonical automaton would enter the following sequence of configurations:

(e, T₀, abc, e) ⊢ (T₀, T₂, bc, e)
⊢ (T₀T₂, T₅, c, e)
⊢ (T₀T₂T₅, T₆, e, e)
⊢ (T₀T₂T₅, T₈, e, 4)
⊢ (T₀T₂, T₃, e, 43)
⊢ (T₀, T₁, e, 431)

Thus, the parsing automaton produces the parse 431 for the input abc. □
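The automaton of this example is small enough to simulate directly. The sketch below is our own encoding of its shift and reduce rules; running it on abc reproduces the parse 431.

```python
# Hypothetical simulation of the canonical parsing automaton of Example 7.32.
# Shift states consume input and push their own name; reduce states pop
# |alpha| - 1 names, emit a production number, and branch on the exposed name.

SHIFT = {("T0", "a"): "T2",
         ("T2", "b"): "T5", ("T2", "c"): "T6", ("T2", "d"): "T7",
         ("T5", "b"): "T5", ("T5", "c"): "T6", ("T5", "d"): "T7"}

# reduce state -> (production number, |alpha| - 1, {exposed state: next state})
REDUCE = {"T3": (1, 1, {"T0": "T1"}),
          "T4": (2, 1, {"T0": "T1"}),
          "T6": (4, 0, {"T2": "T3", "T5": "T8"}),
          "T7": (6, 0, {"T2": "T4", "T5": "T9"}),
          "T8": (3, 1, {"T2": "T3", "T5": "T8"}),
          "T9": (5, 1, {"T2": "T4", "T5": "T9"})}

def run(word):
    stack, state, parse, pos = [], "T0", [], 0
    while state != "T1":                      # T1 is the accepting state
        if state in REDUCE:
            prod, pops, trans = REDUCE[state]
            for _ in range(pops):
                stack.pop()
            parse.append(prod)
            state = trans[stack[-1]]          # interrogate the exposed name
        else:                                 # a shift state
            stack.append(state)
            state = SHIFT[(state, word[pos])]
            pos += 1
    return parse

print(run("abc"))   # [4, 3, 1], i.e. the parse 431 of the example
```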
[Fig. 7.47. Transition graph of the canonical automaton.]

7.5.2. Splitting the Functions of States
One good feature of the parsing automaton approach to parser design is that we can often split the functions of certain states and attribute them to two separate states, connected in series. If state A is split into two states A₁ and A₂ while B is split into B₁ and B₂, it may be possible to merge, say, A₂ and B₂, while it was not possible to merge A and B. Since the amount of work done by A₁ and A₂ (or B₁ and B₂) exactly equals the amount done by A (or B), no increase in cost occurs if the split is made. However, if state mergers can be made, then improvement is possible. In this section we shall explore ways in which the functions of certain states can be split with the hope of merging common actions.

We shall split every reduce state into a pop state followed by an interrogation state. Suppose that T is a reduce state calling for a reduction by production A → α whose number is i. When we split T, we create a pop state whose sole function is to remove |α| − 1 symbols from the top of the pushdown list. If α = e, the pop state will actually add the name T to the top of the pushdown list. In addition, the pop state will emit the production number i. Control is then transferred to the interrogation state which examines
the state name now on top of the pushdown list, say U, and then transfers control to GOTO(U, A).

In the transition graph we can replace a reduce state T by a pop state, which we shall also call T, and an interrogation state, which we shall call T'. All edges entering the old T still enter the new T, but all edges leaving the old T now leave T'. One unlabeled edge goes from the new T to T'. This transformation is sketched in Fig. 7.48, where production i is A → α. Shift states and the accept state will not be split here.
[Fig. 7.48. Splitting reduce states.]
The automaton constructed from a canonical parsing automaton in the above manner is called a split canonical parsing automaton. Example 7.33
The split parsing automaton from Fig. 7.47 is shown in Fig. 7.49. We show shift states by squares, pop states by triangles, and interrogation and accept states by circles. To compare the behavior of this split automaton with the automaton in Example 7.32, consider the sequence of moves the split automaton makes on input abc:

(e, T₀, abc, e) ⊢ (T₀, T₂, bc, e)
⊢ (T₀T₂, T₅, c, e)
⊢ (T₀T₂T₅, T₆, e, e)
⊢ (T₀T₂T₅, T'₆, e, 4)
⊢ (T₀T₂T₅, T₈, e, 4)
⊢ (T₀T₂, T'₈, e, 43)
⊢ (T₀T₂, T₃, e, 43)
⊢ (T₀, T'₃, e, 431)
⊢ (T₀, T₁, e, 431) □
L I a¸
°
( s
Fig. 7.49 Split canonical automaton. If M1 and Mz are two parsing automata for a grammar G, then we say that Mi and Mz are equivalent if, for each input string w, they both produce the same parse or they both produce an error indication after having read
SEC. 7.5
PARSING AUTOMATA
653
the same number of input symbols. Thus, we are using the same definition of equivalence as for two sets of LR(k) tables. If the canonical and split canonical automata are run side by side, then it is easy to see that the resulting pushdown lists are the same each time the split automaton enters a shift or interrogation state. Thus, it should be evident that these two automata are equivalent. There are two kinds of simplifications that can be made to split parsing automata. The first is to eliminate certain states completely if their actions are not needed. The second is to merge states which are indistinguishable. The first kind of simplification eliminates certain interrogation states. If an interrogation state has out-degree 1, it may be removed. The pop state connected to it will be connected directly, by an unlabeled edge, to the state to which the interrogation state was connected. We call the automaton constructed from a split canonical automaton by applying this simplification a semireduced automaton. Example 7.34
Let us consider the split parsing automaton of Fig. 7.49. T~ and T~ have only one transition, on To. Applying our transformation, these states and the To transitions are eliminated. The resulting semireduced automaton is shown in Fig. 7.50. THEOREM 7.12 A split canonical parsing automaton M1 and its semireduced automaton M~ are equivalent. Proof. An interrogation state does not change the symbols appearing on the pushdown list. Moreover, if an interrogation state T has only one transition, then the state labeling that transition must appear at the top of the pushdown list whenever M1 enters state T. This follows from the definition of the GOTO function and of the canonical automaton. Thus, the first transformation does not affect any sequence of stack, input or output moves made by the automaton. [Z]
We now turn to the problem of minimizing the states of a semireduced automaton by merging states whose side effects (other than placing their own name on the pushdown list) are the same and which transfer on corresponding edges to states that are themselves indistinguishable. The minimization algorithm is similar in spirit to Algorithm 2.2, although modifications are necessary because we must account for the operations on the pushdown list. DEFINITION Let M = (Q, E, ~, q0, {ql}) be a semireduced parsing automaton, where q0 is the initial state and q l the accept state. Note that Q ~ E. Also, we shall
654
CHAP.
TECHNIQUES FOR PARSER OPTIMIZATION
7
start
T2
-
7 d
T5
Fig. 7.50 Semi-reducedautomaton.
use e to "label" transitions hitherto unlabeled. We say that p and q in Q are 0 ~ q, if one of the following conditions is satisfied by the transition diagram for M (the case p = q is not excluded):
O-indistinguishable, written p (1) p (2) p (3) p from the (That is, (4) p
and q are both shift states. and q are both interrogation states. and q are both pop states, and they pop the same number of symbols pushdown list and cause the same production number to be emitted. p and q reduce according to the same production.) = q = q l (the final state).
Otherwise, p and q are 0-distinguishable. In particular, states of different types are always 0-distinguishable. k We say that p and q are k-indistinguishable, written p - ~ q, if they are ( k - 1)-indistinguishable and one of the following holds:
SEC. 7.5
PARSING AUTOMATA
655
(1) p and q are shift states and (a) For every a ~ X u {e}, an edge labeled a leaves either both or neither of p and q. If an edge labeled a leaves each of p and q and enters p' and q', respectively, then p' and q' are ( k - 1)indistinguishable states. (b) There is no interrogation state with transitions on p and q to (k -- !)-distinguishable states. (2) p and q are pop states and the edges leaving them go to ( k - 1)indistinguishable states. (3) p and q are interrogation states, and for all states s either both or neither o f p and q have transitions on s. If both do, then the transitions lead to ( k - 1)-indistinguishable states. Otherwise, p and q are k-distinguishable. We say that p and q are indistinguishable, written p - - q, if they are k-indistinguishable for all k _> 0. Otherwise, p and q are distinguishable. LEMMA 7.4 Let M = (Q, E, 6, qo, {ql]) be a semireduced automaton. Then (1) For a l l k , k IS . an equivalence relation on Q, and k+l k+l k+2 (2) If ~ -- ~ , t h e n ~ -- ~ -- . . . . -
-
Proof Exercise similar to Lemma 2.11. Example 7.35 Consider the semireduced automaton of Fig. 7.50. Recalling the LR(0) tables from which that automaton was constructed (Fig. 7.46 on p. 648), we see that all six pop states reduce according to different productions and hence are 0,distinguishable. The other kinds of states are, by definition, 0 0-indistinguishable from those of the same kind, and so - - has equivalence classes [To, Tz, Ts}, {T1}, {T3], {T4}, [T6}, {TT}, {T8}, {Tg}, [T~, T~, T~, T~]. 1 To compute ~ , we observe that Tz and T5 are 1-distinguishable, because T~ branches to 0,distinguishable states T3 and T8 on T2 and Ts, respectively. Also, To is 1-distinguishable from Tz and Ts, because the former has a transition on a, while the latter do not. T~ and T~ are 1-distinguishable because they branch on Tz to 0-distinguishable states. Likewise, the pairs T~-T~, T~-T~ and T~-T~ are 1-distinguishable. The other pairs which are 0-indistinguishable are also 1-indistinguishable. Thus, the equivalence classes of ~ which have more than one member are {T~, T~} and {T~r, T~}. We find that =2 = ! DEFINITION Let M1 = (Q, E, J, q0, {ql}) be a semireduced automaton. Let Q' be the set of equivalence classes of -- and let [q] stand for the equivalence class
656
CI-IAP. 7
T E C H N I Q U E S FOR P A R S E R O P T I M I Z A T I O N
containing q. The reduced automaton for M1 is M2 -- (Q', E', 6', [q0], {[ql]}), where (1) E ' - - ( Z Q) u Q'; (2) For all q ~ Q and a ~ E U {e} -- Q, 6'([q], a) -- [~(q, a)]; and (3) For all q and p ~ Q, 5'([q], [p]) = [~(q, p)]. Example 7.36
In Example 7.35 we found that T~ ~ T~ and T~ ~ T~. The transition graph of the reduced automaton for Fig. 7.50 is shown in Fig. 7.51. T~ and T~r have been chosen as representatives for the two equivalence classes with more than one member. [~] From the definition of ~ , it follows that the definition of the reduced automaton is consistent; that is, rules (2) and (3) of the definition do not depend on which representative of an equivalence class is chosen. We can also show in a straightforward way that the reduced and semi-
start
-
b a
b d
/ 7
T1 Fig. 7.51 Reduced automaton.
SEC. 7.5
PARSING AUTOMATA
657
reduced automata are equivalent. Essentially, the two automata will always make similar sequences of moves. The reduced automaton enters a state representing the equivalence class of each state entered by the semireduced automaton. We state the correspondence formally as follows. THEOREM 7.13 Let Mi be a semireduced automaton and Mz the corresponding reduced automaton. Then for all i > 0, there exist T , , . . . , T~, T such that
(e, To, w, e)[--~, (TOT, . . . T,,,, T, x, n) if and only if (e, [To], w, e ) I ~ ([T0][T,] . . . [Tm], [T], x, n:), where To is the initial state of Mi and [u] denotes the equivalence class of state u.
Proof. Elementary induction on i.
[~
COROLLARY
M1 and M2 are equivalent. 7.5.3.
[~]
G e n e r a l i z a t i o n s t o LR(k) Parsers
We can also construct a canonical parsing automaton from a set of LR(k) tables for an LR(k) grammar with k > 0. Here we consider the case in which k = 1. The parsing automaton behaves in much the same fashion as the canonical parsing automaton for an LR(0) grammar, except that we now cannot classify each table solely as a shift, reduce, or accept table. As before, there will be a state of the automaton corresponding to each table. If the automaton is in configuration (ToT1 . . . Tin, T, w, n), the automaton behaves as follows: (1) The lookahead string a = FIRST(w) is determined. (2) A decision is made whether to shift or reduce. That is, if T = ( f , g), then f ( a ) is determined. (a) If f ( a ) = shift, then the automaton moves into the configuration (TOT1 "'" TroT, T', w', n), where T' = g(a) and aw' = w. (b) If f ( a ) -= reduce i and production i is A ----~a, where I~1 = r > 0, the automaton enters configuration (ToT1 -.- Tin-r+1, T', w, zti), where T' = g'(A) if Tin_r+ 1 = ( f ' , g ' ) . [If the production is A ---~ e, then the resulting configuration will be (TOT, . . . TroT, T', w, rti), where T' = g(A), assuming that T -= ( f , g).] (c) If f ( a ) = accept or error, then the automaton halts and reports acceptance or rejection of the input. It is possible to split states in various ways, so that particular pushdown
658
CHAP. 7
TECHNIQUES FOR PARSER OPTIMIZATION
list operations can be isolated with the hope of merging common operations. Here, we shall consider the following state-splitting scheme. Let Tbe a state corresponding to table T = ( f , g). We split this state into read, push, pop, and interrogation states as follows" (1) w e create a read state labeled T which reads the next input symbol. Read states are indicated by G. (2) If f(a) = shift, we then create a push state labeled T" and draw an edge with label a from the read state T to this push state. If g(a) = T', we then draw an unlabeled edge from T" to the read state labeled T'. The push state T" has two side effects. The input symbol a is removed from the input, and the table name T is pushed on top of the pushdown list. We indicate push states by ~ . (3) If f(a) = reduce i, then we create a pop state T1 and an interrogation state T2. A n edge labeled a is drawn from T to T1. If production i is A ~ a, then the action of state T~ is to remove l a[ -- 1 symbols from the top of the pushdown list and to emit the production number i. If a = e, then T1 places the original table name T on the pushdown list. Pop states are indicated by ~.. An unlabeled edge is then drawn from T1 to T2. The action of T2 is to examine the symbol now on top of the pushdown list. If GOTO-~(T, a) contains T' and GOTO(T', A) = T", then an edge labeled T' is drawn from T2 to the read state of T". Interrogation states are also indicated by G. The labels on the edges leaving distinguish these circles from read states. Thus, state T would be represented as in Fig. 7.52 if f(a) = shift and f(b) = reduce i.
b
Read State
Push State [
a
3'
7i
Pop State
Interrogation State
Fig. 7.52 Possiblerepresentation for state T.
(4) The accepting state is not split. Example 7.37
Let us consider the LR(1) grammar G
SEC. 7.5
PARSING AUTOMATA
(1) (2) (3) (4) (5)
659
s---, A s A ---~ aAb
A ---, e B---~ b B B----~b
A set of LR(1) tables for G is shown in Fig. 7.53. action
goto
a
b
e
S
A
B
a
b
To
S
3
X
T1
T2
X
T3
X
T~
X
X
A
X
X
X
X
X
T2
X
S
X
X
X
T4
x
r5
T3
S
3
X
x
/'6
X
T3
X
T4
X
X
1
X
X
X
X
X
T5
X
S
5
x
X
rT
X
rs
T6
X
S
X
X
X
X
X
T8
r7
X
X
4
X
X
X
X
X
r8
X
2
X
X
X
X
X
X
Fig. 7.53
L R ( I ) tables.
The parsing automaton which results from splitting states as described above is shown in Fig. 7.54. [-] There are several ways in which the number of states in a split automaton for an LR(1) grammar can be reduced: (1) If an interrogation state has only one edge leaving, the interrogation state can be eliminated. This simplification is exactly like the corresponding LR(0) simplification. (2) Let T be a read state such that in every path into T the last edge entering a pop state is always labeled by the same input symbol or always by e. Then the read state may be eliminated. (One can show that in this case the read state has only one edge leaving and that the edge is labeled by that symbol.) Example 7.38
Let us consider the automaton of Fig. 7.54. There are three interrogation states with out-degree 1, namely T~, T~, and T]. These states and the edges leaving can all be eliminated.
660
CHAP. 7
TECHNIQUESFOR PARSER OPTIMIZATION
Start
r3 T5
T2
T0
Fig. 7.54
Split a u t o m at o n for LR(1) grammar.
Next, let us consider read state T6. The only way T6 can be reached is via the paths from T3 and T~ or T8 and T~. The previous input symbol label is b in either case, meaning that if T3 or 7'8 see b on the input, they transfer to T~ or T~ for a reduction. The b remains on the input until 7'6 examines it and decides to transfer to T b for a shift. Since we know that the b is there, 7'6 is superfluous; T~ can push the state name 7'6 on the pushdown list without looking at the next input symbol, since that input symbol must be b if the automaton has reached state/'6.
SEC. 7.5
PARSINGAUTOMATA
661
Similarly, Tz can be eliminated. The resulting automaton is shown in Fig. 7.55. [[] As with the LR(0) semireduced automaton, we can also merge compatible states without affecting the action of the automaton. We leave these matters to the reader. In Fig. 7.55, the only pair of states which can be merged is T~ and T~.
Start
T3
To
Fig. 7.55 Semireduced automaton.
7.5.4.
Chapter Summary
In this chapter we have seen a large number of techniques that can be used to reduce the size and increase the speed of parsers. In view of all these possibilities, how should one go about constructing a parser for a given grammar? First a decision whether to use a top-down or bottom-up parser must be made. This decision is affected by the types of translations which need to be computed. The matter will be discussed in Chapter 9.
662
TECHNIQUES FOR PARSER OPTIMIZATION
CHAP. 7
If a top-down parser is desired, an LL(1) parser is recommended. To construct such a parser, we need to perform the following steps: (1) The grammar must be first transformed into an LL(1) grammar. Very rarely will the initial grammar be LL(1). Left factoring and elimination of left recursion are the primary tools in attempting t o make a grammar LL(1), b u t t h e r e is no guarantee that these transformations will always succeed. (See Examples 5.10 and 5.11 on p. 345 of Volume I.) (2) However, if we can obtain an equivalent LL(1) grammar G, then using the techniques of Section 5.1 (Algorithm 5.1 in particular), we can readily produce an LL(1) parsing table for G. The entries in this parsing table will in practice be calls to routines that manipulate the pushdown list, produce output, or generate error messages. (3) There are two techniques that can be used to reduce the size of the parsing table" (a) If a production begins with a terminal symbol, then it is not necessary to stack the terminal symbol if we advance the input pointer. (That is, if production A ---~ a~ is to be used and a is the current input symbol, then we put ~ on the stack and move the input head one symbol to the right.) This technique can reduce the number of different symbols that can appear on the pushdown list and hence the number of rows in the LL(1) parsing table. (b) Several nonterminals with similar parsing actions can be combined into a single nonterminal with a "tag" which describes what nonterminal it represents. Nonterminals representing expressions are amenable to this combination. (See Exercises 7.3.28 and 7.3.31.) If a bottom-up parser is desired, then we recommend a deterministic shift-reduce parsing algorithm such as an SLR(1) parser or LALR(1) parser, if necessary. It is easy to describe the syntax of most programming languages by an SLR(1) grammar, so little preliminary modification of the given grammar should be necessary. The size of an SLR(1) or LALR(1)parser can be reduced significantly by a few optimizations. It is usually worthwhile to eliminate reductions by single productions. Further space optimization is possible if we implement an LALR(1) parser in the style of the production language parsers of Section 7.2. The parsing action entries of each LR(1) table could be implemented as a sequence of shift statements, followed by a sequence of reduce statements, followed by one unconditional error statement. If all reduce statements involve the same production, then all these reduce statements and the following error statement could be replaced by one statement which reduces by that production regardless of the input. The error-detecting capability of the parser would not be affected by this optimization. See Exercises 7.3.23 and 7.5.13. The non-g0 goto
EXERCISES
663
entries for each LR(1) table could be stored as a list of pairs (A, T) meaning on nonterminat A place T on top of the stack. The gotos on terminals could be encoded in the shift statements themselves. Note that no p-entries would have to be stored. The optimizations of Section 7.2 merging common sequences of statements would then be applicable. These approaches to parser design have several practical advantages. First, we can mechanically debug the resulting parser by generating input strings that will check the behavior of the parser. F o r example, we can easily construct input strings that cause each useful entry in an LL(1) or LR(1) table to be exercised. Another advantage of LL(1) and LR(1) parsers, especially the former, is that minor changes to the syntax or semantics can be made by simply changing the appropriate entries in a parsing table. Finally, the reader should be aware that certain ambiguous grammars have " L L " o r " L R " parsers that are formed by resolving parsing action conflicts in an apparently arbitrary manner (see Exercise 7.5.14). Design of parsers of this type warrants further research.
EXERCISES
7.5.1.
Construct a parsing automaton M for Go. Construct from M an equivalent split, a semireduced, and a reduced automaton.
7.5.2.
Construct parsing automata for each grammar in Exercise 7.3.1.
7.5.3.
Split states to construct reduced parsing automata from the parsing automata in Exercise 7.5.2.
7.5.4.
Prove Lemma 7.4.
7.5.5.
Prove that the definition of the reduced automaton in Section 7.5.2 is consistent; that is, if p _= q, then d;(p, a) -_- 6(q, a) for all a in E' u {e}.
7.5.6.
Prove Theorem 7.13. DEFINITION
Let G = (N, E,P, S') be an augmented CFG with productions numbered 0, 1, . . . such that the zeroth production is S'--~ S. Let E' = (@ 0, @1 . . . . , @p} be a set of special symbols not in N U E. Let the ith production be A ---~ fl and suppose that S' *~ t~Aw =~. ocflw. Then rm t~fl#~ is called a characteristic string of the righ~-~sentential form t~flw. *7.5.7. 7.5.8.
Show that the set of characteristic strings for a CFG is a regular set. Show that a CFG G is unambiguous if and only if each right-sentential form of G, except S', has a unique characteristic string.
664
7.5.9.
TECHNIQUES FOR PARSER O P T I M I Z A T I O N
CHAP. 7
Show that a C F G is LR(k) if and only if each right-sentential form t~flw such that S" *~ ~Aw ==~ ~flw has a characteristic string which may be rm rm determined from only ~fl and FIRSTk(W).
7.5.10.
Let G = (N, E, P, S) be an LR(0) grammar and (3, To) its canonical set of LR(0) tables. Let M = (3, E U 3, ~, To, {T1}) be the canonical parsing automaton for G. Let M ' = (3 U (qr}, E u E', 6', To, {qs]) be the deterministic finite automaton constructed from M by letting (1) ~'(T, a) -- ~(T, a) for all T in 3 and a in E. (2) O'(T, :~,.) -- qs ifO(T, T') is defined and T' is areduce state calling for a reduction using production i. Show that L(M') is the set of characteristic strings for G.
7.5.11.
Give an algorithm for merging "equivalent" states of the semireduced automaton constructed for an LR(1) grammar, where equivalence is taken to mean that the two states have transitions on the same set of symbols and transitions on each symbol are to equivalent states.t
*%5.12.
Suppose that we modify the definition of "equivalence" in Exercise 7.5.11 to admit the equivalence of states that transfer to equivalent states on symbol a whenever both states have a transition on a. Is the resulting automaton equivalent (in the formal sense, meaning one may not shift if the other declares error) to the semireduced automation ?
7.5.13.
Suppose a read state T of an LR(1) parsing automaton has all its transitions to pop states which reduce by the same production. Show that if we delete T and merge all those pop states to one, the new automaton will make the reduction independent of the lookahead, but will be equivalent to the original automaton,
"7.5.14.
Let G be the ambiguous grammar with productions S -+ if b then SEIa E --~ else S [ e
(a) Show that L(G) is not an LL language. (b) Construct a 1-predictive parser for G assuming that whenever E is on top of the stack and else is the next input symbol, production E ~ else S is to be applied. (c) Construct an LR(1) parser for G by making an analogous assumption.
Research Problems 7.5.15.
Apply the technique used here--breaking a parser into a large number of active components and merging or eliminating some of t h e m - - t o parsers other than LR ones. For example, the technique in Section 7.2
tThis definition can be made precise by defining relations 0=, __1_. . . . as in Lemma 7.2.
BIBLIOGRAPHIC NOTES
665
effectively treated the rows of a precedence matrix as active elements. Develop techniques applicable to LL parsers and various kinds of precedence parsers. 7.5.16.
Certain states of a canonical parsing automaton may recognize only regular sets. Consider splitting the shift states of a parsing automaton into scan states and push states. The scan state might remove input symbols and emit output but would not affect the pushdown list. Then a scan state might transfer control to another scan state or a push state. Thus, a set of scan states cart behave as a finite transducer. A push state would place the name of the current state on the pushdown list. Develop transformations that can be used to optimize parsing automata with scan and push states as well as split reduce states.
7.5.17.
Express the optimizations of Section 7.3 and 7.5 in each other's terms.
Programming Exercises 7.5.18.
Design elementary operations that can be used to implement split canonical parsing automata. Construct an interpreter for these elementary operations.
7.5.19.
Write a program that will take a split canonical parsing automaton and construct from it a sequence of elementary operations that simulates the behavior of the parsing automaton.
7.5.20.
Construct two LR(1) parsers for one of the grammars in the Appendix of Volume 1. One LR(1) parser should be the interpretive LR(1) parser working from a set of LR(1) tables. The other should be a sequence of elementary operations simulating the parsing automaton. Compare the size and speed of the parsers.
BIBLIOGRAPHIC
NOTES
The parsing automaton approach was suggested by DeRemer [1969]. The definition of characteristic string preceding Exercise 7.5.7 is from the same source. Classes of ambiguous grammars that can be parsed by LL or LR means are discussed by Aho, Johnson, and Ullman [1972].
THEORY OF DETERMINISTIC PARSING
In Chapter 5 we were introduced to various classes of grammars for which we can construct efficient deterministic parsers. In that chapter some inclusion relations among these classes of grammars were demonstrated. For example, it was shown that every (m, k)-BRC grammar is an LR(k) grammar. In this chapter we shall complete the hierarchy of relationships among these classes of grammars. One can also ask what class of languages is generated by the grammars in a given class. In this chapter we shall see that most of the classes of grammars in Chapter 5 generate exactly the deterministic context-free languages. Specifically, we shall show that each of the following classes of grammars generates exactly the deterministic context-free languages" (1) (2) (3) (4)
LR(1), (1, 1)-BRC, Uniquely invertible (2, 1)-precedence, and Simple mixed strategy precedence.
In deriving these results we provide algorithms to convert grammars of one kind into another. Thus, for each deterministic context-free language we can find a grammar that can be parsed by a UI (2, 1)-precedence or simple mixed strategy precedence algorithm. However, if these conversion algorithms are used indiscriminately, the resulting grammars will often be too large for practical use. There are three interesting proper subclasses of the deterministic contextfree languages: (1) The simple precedence languages, (2) The operator precedence languages, and (3) The LL languages. 666
SEC. 8.1
THEORY OF LL LANGUAGES
667
The operator precedence languages are a proper subset of the simple precedence languages and incommensurate with the LL languages. In Chapter 5 we saw that the class of UI weak precedence languages is the same as the simple precedence languages. Thus, we have the hierarchy of languages shown in Fig. 8.1.
Deterministic Context-free Languages
Simple Precedence LL Operator Precedence
Fig. 8.1 Hierarchyof deterministic context-free languages. In this chapter we shall derive this hierarchy and mention the most striking features of each class of languages. This chapter is organized into three sections. In the first, we shall discuss LL languages and their properties. In the second, we shall investigate the class of deterministic languages, and in the third section we shall discuss the simple precedence and operator precedence languages. This chapter is the caviar of the book. However, it is not essential to a strict diet of"theory of compiling," and it can be skipped on a first reading.'t
8.1.
T H E O R Y OF LL L A N G U A G E S
We begin by deriving the principal results about LL languages and grammars. In this section we shall bring out the following six results" 'tReaders who dislike caviar can skip it on subsequent readings as well.
668
THEORY OF DETERMINISTIC PARSING
CHAP. 8
(1) Every LL(k) grammar is an LR(k) grammar (Theorem 8.1). (2) For every LL(k) language there is a Greibach normal form LL(k q-- 1) grammar (Theorem 8.5). (3) It is decidable whether two LL grammars are equivalent (Theorem 8.6). (4) An e-free language is LL(k) if and only if it has an e-free LL(k -q- 1) grammar (Theorem 8.7). (5) For k > 0, the LL(k) languages are a proper subset of the LL(k -t- 1) languages (Theorem 8.8). (6) There exist LR languages which are not LL languages (Exercise 8.1.11). 8.1.1.
LL and LR Grammars
Our first task is to prove that every LL(k) grammar is an LR(k) grammar. This result can be intuited by the following argument. Consider the derivation tree sketched in Fig. 8.2. s
w
x
y
Fig. 8.2
S k e t c h o f d e r i v a t i o n tree.
In scanning the input string wxy, the LR(k) condition requires us to recognize the production A ~ ~ knowing wx and FIRSTk(y ). On the other hand, the LL(k) condition requires us to recognize the production A ~ t~ knowing only w and F I R S T k ( x y ). Thus, it would appear that the LL(k) condition is more stringent than the LR(k) condition, so that every LL(k) grammar is LR(k). We shall now formalize this argument. Suppose that we have an LL(k) grammar G a n d two parse trees in G. Moreover, suppose that the frontiers of the two trees agree for the first m symbols. Then to every node of one tree such that no more than m - k leaves labeled with terminals appear to its left, there corresponds an "essentially identical" node in the other tree. This relationship is represented pictorially in Fig. 8.3, where the shaded region represents the "same" nodes. We assume that ] w[ = m -- k and that FIRSTk(xl) = FIRSTk(Xz). We can state this observation more precisely as the following lemma.
SEe. 8.1
THEORY OF LL LANGUAGES
S
S
W
669
W
X1
X2
(a)
(b)
Fig. 8.3 Two parse trees for an LL(k) grammar. LEMMA
8.1
Let G be an LL(k) g r a m m a r and let S ~Im w~ Aa ~lm w~x~ and S ~Ira w2Bfl ~l m w2x 2
be two leftmost derivations such that F I R S T ~ ( w l x l ) = FIRST~(w2xz)for 1 = k + max( I wl 1, I wz !). (That is, w~x~ and w~x2 agree at least k symbols beyond w~ and wz.) (1) If ms = m 2 , t h e n w 1 = w 2 , A = B , a n d a = f l . ~I
~ --~!
(2) If m, < m2, then S ~ira w~Aa ~ira
w2Bfl ==~ *
W z X 2.
Proof. Examining the LL(k) definition, we find that if m 1 ~ mz, each m2
of the first m 1 steps of the derivation S ==~ wzB fi are the same as those of Ira
ml
S ==~ wiAa, since the lookahead strings are the same at each step. (1) and Ira
(2) follow immediately.
[Z]
THEOREM 8.1 Every LL(k) g r a m m a r t is LR(k). Proof. Let G = (N, E, P, S) be an LL(k) g r a m m a r and suppose that it is not LR(k). Then there exist two rightmost derivations in the augmented grammar i
(8.1.1)
S'
(8.1.2)
S' ~i'm ?By ~r ra 7@
~
rm
aAx
I ===~ O~flx 1 rm
such that y@ = aflx2 for some x z for which F I R S T e ( x 2 ) = FIRSTe(xl). Since G is assumed not to be LR(1), we can assume that aAx2 ~ ?By. •~Throughout this book we are assuming that a grammar has no useless productions.
670
THEORY
OF DETERMINISTIC
PARSING
CHAP.
8
In these derivations, we can assume that i a n d j are both greater than zero. Otherwise, we would have, for example,
S'-~
S'---> S
S' ---> B y ---> S fly which would imply that G was left-recursive and hence that G was not LL. Thus, for the remainder of this proof we can assume that we can replace S' by S in derivations (8.1.1) and (8.1.2). Let x,, xp, x v, and x ~ b e terminal strings derived from a, fl, ~,, and t~, respectively, such that x, xpx2 = x~x~y. Consider the leftmost derivations which correspond to the derivations (8.1.3)
S - - r~m
ocAx~ ~r m aflx~
S ~rm
~,By ~r m
*
rm
X~XpX 1
and (8.1.4)
?'6y ==:> xvx~y rm
Specifically, let (8.1.5)
S ~l m
x.Arl ~
x.flrl ~I m
x.xprl ~I m
x.xpx,
and (8.1.6)
S -Y--> x~BO ---> x~OO ---> * x~x~O ::=, * x~x ~y lm Im lm lm
where t/and 0 are the appropriate strings in (N U Z)*. By Lemma 8.1, the sequence of steps in the derivation S *=> x~Arl is lm the initial sequence of steps• in the derivation S *=, xvBO or conversely. We Im assume the former case; the converse can be handled in a symmetric manner. Thus, derivation (8.1.6) can be written as (8.1.7)
S - -l m~ x,Arl ~l m
xJ3rl =Ira: ~ xvBO ~l m
xvt~O - -l m~ xvx~O ~l m
xyx,~y
Let us fix our attention on the parse tree T of derivation (8.1.7). Let nA be the node corresponding to A in x~Arl and nn the node corresponding to B in XyBO. These nodes are shown in Fig. 8.4. Note that nB may be a descendant of nA. There cannot really be overlap between xp and x~. Either they are disjoint or x~ is a subword of xp. We depict them this way merely to imply either case. Let us now consider two rightmost derivations associated with parse tree T. In the first, T is expanded rightmost up to (and including) node nA; in the second the parse tree is expanded up to node nB. The latter derivation can be written as s ~r m ~By ~1-m ~,,~y
SEC. 8.1
THEORY OF LL LANGUAGES
671
S
Xa t
X# , •
13 L
x~
X2 l. T
x6
. JIL
II
J
Y
Fig. 8.4 Parse tree T.
This derivation is in fact derivation (8.1.2). The rightmost derivation up to node nA is (8.1.8)
S ~r r l l o(Ax2 ~r m
~'flX2
for some a'. We shall subsequently prove that ~' = ~. Let us temporarily assume that ~' = ~. We shall then derive a contradiction of the LL(k)-ness of G and thus be able to conclude the theorem. If ~ ' = ~, then ~,8y = oc'flx2 = ~flx2. Thus, the same rightmost derivations can be used to extend derivations (8.1.2) and (8.1.8) to the string of terminals x~,x,sy. But since we assume that nodes nA and nn are distinct, derivations (8.1.2) and (8.1.8) are different, and thus the completed derivations are different. We may conclude that xrx,~y has two different rightmost derivations and that G is ambiguous. By Exercise 5.1.3, no LL grammar is ambiguous. Hence, G cannot be LL, a contradiction of what was assumed. Now we must show that ~' = ~. We note that ~' is the string formed by concatenating the labels from the left of those nodes of T whose direct ancestor is an ancestor of nA. (The reader should verify this property of rightmost derivations.) Now let us again consider the leftmost derivation (8.1.5), which has the same parse tree as the rightmost derivation (8.1.3). Let T' be the parse tree associated with derivation (8.1.3). The steps of derivation (8.1.5) up to x , Atl are the same as those of derivation (8.1.7) up to xo,Aql. Let n] be the node of T' corresponding to the nonterminal replaced in deriva-
672
THEORY OF DETERMINISTIC PARSING
CHAP. 8
tion (8.1.7) at the step x~Arl ==, x~flrl. Let II be the preorderingt of the intelm rior nodes of parse tree T up to node nA and II' the preordering of the interior nodes of T' up to node n]. The ith node in II matches the ith node in II' in that both these nodes have the same label and that corresponding descendants either are matched or are to the right of nA and n], respectively. t The nodes in T' whose direct ancestor is an ancestor of nA have labels which, concatenated from the left, form e. But these nodes are matched with those in T which form e', so that e' -- e. The proof is thus complete. [-] The grammar S
>A[B
A
> aAb[O
B
> aBbb[ 1
is an LR(0) grammar but not an LL grammar (Example 5.4), so the containment of LL grammars in LR grammars is proper. In fact, if we consider the LL and LR grammars along with the classes of grammars that are left- or right-parsable (by a D P D T with an endmarker), we have, by Theorems 5.5, 5.12, and 8.1, the containment relations shown in Fig. 8.5.
LR Left Parsable
Fig. 8.5
LL
Right Parsable
Relations between classes of grammars.
t I I is the sequence of interior nodes cff T in the order in which the nodes are expanded in a leftmost derivation.
THEORY OF LL LANGUAGES
SEC. 8.1
673
We claim that each of the six classes of grammars depicted in Fig. 8.5 is nonempty. We know that there exists an LL grammar, so we must s h o w the following. THEOREM 8.2 There exist grammars which are (1) (2) (3) (4) (5)
LR and left-parsable but not LL. LR but not left-parsable. Left- and right-parsable but not LR. Right-parsable but not LR or left-parsable. Left-parsable but not right-parsable.
Proof Each of the following grammars inhabits the appropriate region.
(1) The grammar Go with productions S
>AIB
A
~ aaA l aa
B
> aaBla
is LR(1) and left-parsable but not LL. (2) The grammar Go with productions S A ~ B
>AbIAc ABla ~a
is LR(1) but not left-parsable. See Example 3.27 (p. 272 of Volume I). (3) The grammar Go with productions S-
~AblBc
A
~ Aala
B
~ Bala
is both left- and right-parsable but is not LR. (4) The grammar Gd with productions S
>AblBc
A
~ACla
B ~ C-
BCIa ~a
is right-parsable but neither LR nor left-parsable. (5) The grammar G, with productions
674
THEORYOF DETERMINISTIC PARSING
S
>BAbICAc
A
> BAIa
B
>a
C
>a
CHAP. 8
is left-parsable but not right-parsable. See Example 3.26 (p. 271, Volume I).
8.1.2.
LL Grammars in Greibach Normal Form
In this section we shall consider transformations which can be applied to LL grammars while preserving the LL property. Our first results involve e-productions. We shall give two algorithms which together convert an LL(k) grammar into an equivalent LL(k-F- 1) grammar without e-productions. The first algorithm modifies a grammar so that each right-hand side is either e or begins with a symbol (possibly a terminal) which does not derive the empty string. The second algorithm converts a grammar satisfying that condition to one without e-productions. Both algorithms preserve the LL-ness of grammars. DEFINITION
Let G = (N, ~, P, S) be a CFG. We say that nonterminal A is nullable if A can derive the empty string. Otherwise, a symbol in N u E is said to be nonnullable. Every terminal is thus nonnullable. We say that G is nonnullable if each production in P is of the form A ~ e or A ---~ X1 " " Xk, where X1 is a nonnullable symbol. ALGORITHM 8.1 Conversion to a nonnullable grammar, lnput. C F G G = (N, E, P, S). Output. A nonnullable context-free grammar G1 = (Na, ~E' P~, S~) such that L(Gi) = L(G) - - [e}. Method.
(1) Let N ' = N U [A I A is a nullable nonterminal in N}. The barred nonterminal A will generate the same strings as A except e. Hence A is nonnullable. (2) If e is in L(G), let S~ = S. Otherwise S~ = S. (3) Each non-e-production in P can be uniquely written in the form A ----~ B i . . . BmX1 " " X , (with m ~ 0, n ~ 0, and m --F-n > 0), where each Bi is a nullable symbol, and if n 3> 0, X1 is a nonnullable symbol. The
SEC. 8.1
675
THEORY OF LL LANGUAGES
remaining Xj, 1 < j ~ n, can be either nullable or nonnullable. Thus, X~ is the leftmost nonnullable symbol on the right-hand side of the production. For each non-e-production A ---~ B1 " " B m X i " • X , , we construct P ' as follows" (a) If m ~ 1, add to P ' the m productions A=
> BIB2 ... BmX 1 ... X.
A
> B 2 B 3 . . . BmX" 1 ' ' ' X. °
.
A:
>BmX~ "'" X .
(b) If n ~ 1 and m ~ 0, add to P ' the production
>X~ ...X~
A
(c) In addition, if A is itself nullable, we also add to P' all the productions in (a) and (b) above with A instead of A on the left. (4) If A ---~ e is in P, we add A ~ e to P'. (5) Let G~ = (Na, Z, Pa, S~) be G ' = (N', Z, P', $1) with all useless symbols and productions removed. D Example 8.1
Let G be the LL(1) grammar with productions S---->AB A
>aAle
B
>bAle
Each of the nonterminals is nullable, so we introduce new nonterminals S-, A, and B; the first of these is the new start symbol. Productions A .--~ a A and B ---~ b A each begin with a nonnullabte symbol, and so their right-hand sides are of the form X 1 X 2 for the purpose of step (3) of Algorithm 8.1. Thus, we retain these productions and also add A ~ a A and B ~ b A to the set of productions. In the production S ~ A B , each symbol on the right is nullable, and so we can write the right-hand side as B I B 2 . This production is replaced by the following four productions" m
D
S
> ABIB
S
> ABIB
676
THEORY OF DETERMINISTIC PARSING
CHAP. 8
Since S is the new start symbol, we find that S is now inaccessible. The final set of productions constructed by Algorithm 8.1 is S
> ABIB
A
> aA
A
> aAle
B
>
B
>bAle
bA
The following theorem shows that Algorithm 8.1 preserves the LL(k)ness of a grammar. THEOREM 8.3 If G1 is constructed from G by Algorithm 8.1, then (1) L ( G , ) = L(G) -- {e}, a n d (2) If G is LL(k), then so is G1. Proof
(1) Straightforward induction on the length of strings shows that for all A in N, (a) A ~
G1
w if and only if A ~
G
w and
(b) A ~ G~ w if and only if w =/= e and A ~ 17 w. The details are left for the Exercises. For part (2), assume the contrary. Then we can find derivations in G1
s, ~ w ~ Gt - - .lm wp~ G~*-~ wx Gt lm lm $1 t71 ~ l m wAoc 17x ~ l m wToc ==~ wy (71 l m
where FIRSTk(x ) -- FIRST~,(y), fl ~ 7, and A is either A or A. We can construct from these two derivations corresponding derivations in G of w x and wy as follows. First define h(A) = h(A) = A for A ~ N and h(a)=afora ~ Z. For each production/~ --, t~ in P1 we can find a production B ---, O'h(O) in P from which it was constructed in step (3) of Algorithm 8.1. That is, B -- h(/t), and 6' is some (possibly empty) string of nullable symbols. Each time production/~ --~ ~ is used in one of the derivations in G~, in the corresponding derivation in G we replace it by B ---, O'h(d;), followed by a leftmost
SEC. 8.1
T H E O R Y OF LL L A N G U A G E S
677
derivation of e from ~' if ~' ~ e. In this manner, we construct the derivations in G" S ~G l m wAh(a) ===~ wfl'h(fl)h(a) G lm
*~ wh(fl)h(a) ~G l m wx
G lm
,
s ~
G lm
wAh(~) ~
G lm
w~'h(~)h(~) ~
G Im
wh(r)h(~) ~
G lm
wy
We can write the steps between wfl'h(fl)h(a) and wh(fl)h(a) as ~ wfl'h(fl)h(~)=wO~ ==~ lm wO~ - -Im
...
lm ~. wO.
= wh(,O)h(oO
and those between wy'h(y)h(a) and wh(y)h(a) as W ~ tth(y)h( 0C) =
WE" I
~lm
we z ~ l m
. . . ~lm
we,, = wh(y)h(a).
If z = F I R S T ( x ) = FIRST(y), then we claim that z is in FIRST(0,) and FIRST(e~) for all i, since fl' and y' consist only of nullable symbols (if fl' and y' are not e). Since G is LL(k), it follows that ~ -- e~ for all i. In particular, fl'h(fl) = y'h(y). Thus, fl and y are formed from the same production by Algorithm 8.1. If fl' ~ y', then in one of the above derivations, a nonterminal derives e, while in the other it does not. We may contradict the LL(k)-ness of G in this way and so conclude that fl' = y'. Since fl and y are assumed to be different, they cannot both be e, and so they each start with the same, nonnullable symbol. It is then possible to conclude that m = n and arrive at the contradiction fl -- ?. [--] Our next transformation will eliminate e-productions entirely from an LL grammar G. We shall assume that e q~ L(G). Our strategy will be to first apply Algorithm 8.1 to an LL(k) grammar, and then to combine nonnullable symbols in derivations with all following nullable symbols, replacing such a string by a single symbol. If the grammar is LL(k), then there is a limit to the number of consecutive nullable symbols which can follow a nonnullable symbol. DEFINITION
Let G = (N, X, P, S) be a CFG. Let VG be the set of symbols of the form [ X B 1 . '. B,] such that (1) X is a nonnullable symbol of G (possibly a terminal). (2) B 1 , . . . , B, are nullable symbols (hence nonterminals). (3) If i ~ j , then Bi ~ Bj. (That is, the list B 1 , . . . , B, contains no repeats.) Define the homomorphism g from VG to (N U E)* by g([a]) = a.
678
THEORYOF DETERMINISTIC PARSING
CHAP.
8
LEMMA 8.2
Let G = (N, Z, P, S) be a nonnullable LL(k) grammar such that every nonterminal derives at least one nonempty terminal string. Let Vo and g be as defined above. Then for each left-sentential form a of G such that a ~ e there is a unique fl in Vo* such that h(fl) = a.
Proof We can write a uniquely as a~a2...am, m > 1, where each a~ is a nonnullable symbol X followed by some string of nullable symbols (because every non-e-sentential form of G begins with a nonnullable symbol). It suffices to show that [a~] is in Vo for each i. If not, then we can write at as XflB?B5, where B is some nullable symbol and fl, 7, and 6 consist only of nullable symbols, i.e., a~ does not satisfy condition (3) of the definition of Vo. Let w be a nonempty string such that B ~ w. Then there are two disG tinct leftmost derivations: -*~ fiB?B5 ~lm B?B5 ~lm w?B5 - - lm
w
and
fiB?B5 ~lm B~ ~Im w5 Ira *-~ w From these, it is straightforward to show that G is ambiguous and hence not LL. [[] The following algorithm can be used to prove that every LL(k) language without e has an LL(k + 1) grammar with no e-productions. ALGORITHM 8.2 Elimination of e-productions from an LL(k) grammar.
Input. An LL(k) grammar Ga = (N~, E, P1, S1)Output. An LL(k + 1) grammar G = (N, Z, P, S) such that L ( G ) = L(G,) -- {e}. Method. (1) First apply Algorithm 8.1 to obtain a nonnullable LL(k) grammar G 2 = (N 2, Z, P2, $2). (2) Eliminate from G2 each nonterminal A that derives only the empty string by deleting A from the right-hand sides of productions in which it appears and then deleting all A-productions. Let the resulting grammar be G 3 : ( N 3, E, P3, $2)" (3) Construct grammar G = (N, Z, P, S) as follows: (a) Let N be the set of symbols [Xa] such that (i) X is a nonnullable symbol of G3, (ii) a is a string of nullable symbols,
SEC. 8.1
THEORY
OF LL LANGUAGES
679
(iii) Xe ~ E (i.e., we do not have X ~ l ~ and 0~ = e simultaneously), (iv) e has no repeating symbols, and (v) Xe actually appears as a substring of some left-sentential form of G 3. (b) S = [Szl. (c) Let g be the homomorphism g([e]) = e for all [e] in N and let g(a) = a for a in E. Since g-~(p)contains at most one member, we use g-~(fl) to stand for l' if 7 is the lone member of g-l(fl). We construct P as follows: (i) Let [Ae] be in N and let A ~ fl be a production in P3. Then [A~]--, g-~(fl~) is a production in P. (ii) Let [aeAfl] be in N, with a ~ Z and A ~ N 3. Let A ---, y be in P3, with ), =/= e. Then [aeAfl] --, ag-x(yfl) is in P. (iii) Let [aocAfl] be in N with a ~ E. Then [aocAfl]--. a is also i n P . [-7 Example 8.2
Let us consider the g r a m m a r of Example 8.1. Algorithm 8.1 has already been applied in that example. Step (2) of Algorithm 8.2 does not affect the grammar. We shall generate the productions of g r a m m a r G as needed to assure that each nonterminal involved appears in some left-sentential form. The start symbol is [S]. There are two S-productions, with right-hand sides A-B and/~. Since A a n d / ~ are nonnullable but B is nullable, g-a(A-B) = [A-B] and g-a(B-) -- [/~]. Thus, by rule (i) of step (3c), we have productions [Sl -~ ~ [ABll[B] Let us consider nonterminal [AB]. A has one production, A---* aA. Since g-~(aAB) -- [aAB], we add production
[AB]
> [aAB]
Consideration of [B] causes us to add s
[B]-- ~ [bA] We now apply rules (ii) and (iii) to the nonterminal [aAB]. There is one non-e-production for A and one for B. Since g - ~ ( a A B ) - - [ a A B ] , we add
[aAB]
> a[aAB]
corresponding to the A-production. Since g-l(bA) -= [bA], we add
[aAB] ~
a[bA]
680
T H E O R Y OF D E T E R M I N I S T I C P A R S I N G
CHAP. 8
corresponding to the B-production. By step (iii), we add
[(tAB],
>a
Similarly, from nonterminal [bA] we get
[bA]
> b[aA] l b
Then, considering the newly introduced nonterminal [aA], we add productions
[aA]
~ a[aA] l a
Thus, we have constructed the productions for all nonterminals introduced and so completed the construction of G. THEOREM
8.4
The grammar G constructed from G 1 in Algorithm 8.2 is such that (1) L(G) ----L(Gi) -- {e}, and (2) If G~ is LL(k), then G is LL(k + 1).
Proof Let g be the homomorphism in step (3c) of Algorithm 8.2. (1) A straightforward inductive argument shows that A *~ fl, where Ga A ~ N and fl ~ (N U Z)*, if and only if g ( A ) ~Gg ( f l )
and fl ~ e. Thus,
[$2] ~O w, for w i n E*, if and only if $2 ==~ w and w ~ e. Hence, L(G) Ga L(G3). That L(G2) ----L(G1) -- (e} is part (1) of Theorem 8.3, and it is easy to see that step (2) of Algorithm 8.2, converting G 2 to G 3, does not change the language generated. (2) Here, the argument is similar to that of the second part of Theorem 8.2. Given a leftmost derivation in G, we find a corresponding one in G3 and show that an LL(k q- 1) conflict in the former implies an LL(k) conflict in the latter. The intuitive reason for the parameter k + 1, instead of k, is that if a production of G constructed in step (3cii) or (iii) is used, the terminal a is still part of the lookahead when we mustdetermine which production to apply for o~Afl. Let us suppose that G is not LL(k + 1). Then there exist derivations S~
G lm
w a s - - ~ wp~ G lm
*
G lm
WX
and
S ~G lm wAoc ~G lm wyt~ ==~ wy G lm where fl ~ 7', but FIRSTk+i(x ) = FIRSTk+I(y ). We construct corresponding derivations in G 3 as follows"
SEC. 8.1
THEORY
OF LL LANGUAGES
681
(a) Each time production [Aa] ~ g-l(fla), which is in P because of rule (3ci), is used, we use production A ~ fl of G 3. (b) Each time production [aaAfl] ~ ag-1(3,fl), which is in P by rule (3cii), is used, we do a leftmost derivation of e from ~, followed by an application of production A ~ 3'. (c) Each time production [aaA/3] ---~ a is used, we do a leftmost derivation of e from aAfl. Note that this derivation will involve one or more steps, since t~Afl ~ e. Thus, corresponding to the derivation S ~
G lm
wAa is a unique derivation
g(S) ~
wg(A)g(a). In the case that A is a bracketed string of symbols Gs I m beginning with a terminal, say a, the bordert of wg(A)g(oO is one symbol to the right of w. Otherwise it is immediately to the right of w. In either case, since x and y agree for k + 1 symbols and Gs is LL(k), the steps in G 3 corresponding to the application of productions A ~ fl and A ~ 3, in G must be the same. Straightforward examination of the three different origins of productions of G and their relation to their corresponding derivations in G 3 suffices to show that we must have fl = 3,, contrary to hypothesis. That is, let A ~- [~]. If 6 begins with a nonterminal, say ~ = C~', case (2a) above must apply in both derivations. There is one production of G3, say C ~ ~", such that
/~
=
~ =
g-~(6,,6,).
If 6 begins with a terminal, say ~ - - a 6 ' , case (2b) or (2c) must apply. The two derivations in G3 replace a certain prefix of ~ by e, followed by the application of a non-e-production in case (2b). It is easy to argue that fl -- 3, in either case. We shall now prove that every LL(k) language has an LL(k + 1) grammar in Greibach normal form (GNF). This theorem has several important applications and will be used as a tool to derive other results. Two preliminary lemmas are needed. LEMMA 8.3 No LL(k) grammar is left-recursive.
Proof Suppose that G = (N, E, P, S) has a left-recursive nonterminal A. + Then there is a derivation A =~ A~. If ~ ~ e, then it is easy to show that G is ambiguous and hence cannot be LL. Thus, assume that ~ *~ v for some v ~ l~+. We can further assume that A *~ u for some u ~ E* and that there exists a derivation
S =L~ wA,~~l m wAock,5 l*-~ wuvkx lm m •l'Border as in Section 5.1.1.
682
THEORY OF DETERMINISTIC PARSING
CHAP. 8
Hence, there is another derivation"
S ~Im . wAc~ - -lm~ wAakc~ ~lm . wAa k+~c~~ lm *
WUV
k+ 1 x
Since FIRSTk(UVkX) = FIRSTk(uv k+lx), we can readily obtain a contradiction of the LL(k) definition from these two derivations, for arbitrary k. D LEMMA 8 . 4
Let G = (N, Z, P, S) be a C F G with A ~ Ba in P, where B ~ N. Let B ~ fla [f12[ "'" [fin be all the B-productions of G and let G~ = (N, E, P~, S) be formed by deleting A ---~ Ba from P and substituting the productions A ~ P ~ I P ~ I " " 1,O~. ThenL(G~) = L(G), and if G is LL(k), then so is G~.
Proof By Lemma 2.14, L(G1) = L(G). To show that Gi is LL(k) when G is, we observe that leftmost derivations in G1 are essentially the same as those of G, except that the successive application of the productions A ~ B~ and B ~ fit in G is done in one step in G 1. Informally, since Ba begins with a nonterminal, the two steps in G are dictated by the same k symbol lookahead string. Thus, when parsing according to G~, that lookahead dictates the use of the production A ~ ilia. A more detailed proof is left for the Exercises. [~] ALGORITHM 8.3
Conversion of an LL(k) grammar to an LL(k + 1) grammar in GNF.
lnput. LL(k) grammar G1 = (N~, Z, P1, $1). Output. LL(k q- 1) grammar G = (N, E, P, S) in Greibach normal form, such that L(G) = L(Gi) -- [e}. Method. (1) Using Algorithm 8.2, construct from G 1 an LL(k + 1) grammar G2 = (N2, E, P2, S) with no e-productions. (2) Number the nonterminals of N2, say N 2 = [ A 1 , . . . , Am}, such that if At ---~ Aja is in P2, then j > i. Since, by Lemma 8.3, G is not left-recursive, we can do this ordering by Lemma 2.16. (3) Successively, for i = m I, m - 2 , . . . , 1, replace all productions A t ---~ Aja by those A~ ~ fla such that Aj --~ fl is currently a production. We shall show that this operation causes all productions to have right-hand sides that begin with terminals. Call the new grammar G 3 = (N3, E, P3, S). (4) For each a in E, let Xa be a new nonterminal symbol. Let N=N
sU[Xala~
~}.
Let P be formed from P3 by replacing each instance of terminal a, which is
SEC. 8.1
THEORY OF LL LANGUAGES
683
not the leftmost symbol of its string, by X, and by adding productions Xa ~ a for each a. Let G = (N, E, P, S). G is in GNF. D THEOREM 8.5
Every LL(k) language has an LL(k -1- !) grammar in GNF. Proof It suffices to show that G constructed in Algorithm 8.3 is LL(k + 1) if G 1 is LL(k). By Lemma 8.4, G3 is LL(k + 1). We claim that the right-hand side of every production of G3 begins with a terminal. Since G1 is not ieftrecursive, the argument is the same as for Algorithm 2.14. It is easy to show that the construction of step (4) preserves the LL(k + 1) property and the language generated. It is also clear that G is in GNF. The proofs are left for the Exercises. E] 8.1.3.
The Equivalence Problem for LL Grammars
We are almost prepared to give an algorithm to test if two LL(k) grammars are equivalent. However, one additional concept is necessary. DEFINITION
Let G = (N, E, P, S) be a CFG. For a in (N u E)*, we define the thickness of ~ in G, denoted THO(a), to be the length of the shortest string w in E* such that a =~ w. We leave for the Exercises the observations that G
Tno(afl) = Tno(a) q- THO(fl) and that if a ~ fl, then WHO(a) ~ THO(fl). We further define THe(a, w), where a is in (N U E)* and w is in 2~*k, to be the length of the shortest string x in ~* such that a ==> x and w = G FIRSTk(X ). If no such x exists, THe(a, w) is undefined. We omit k and G from TH~ or TH ° where obvious. The algorithm to test the equivalence of two LL(k) grammars is based on the following lemma. LEMMA 8.5
Let G 1 = (N1, I2, P,, $1) and G z = (N2, E, P2, $2) be LL(k) grammars in G N F such that L ( G , ) = L(G2). Then there is a constant p, depending on G 1 and G 2, with the following property. Suppose that Si ---> wa ---> wx Gt l m
Gt l m
===> and that $2 G2 ~ l m w fl G2 *I m wy, where a and fl are the open portions of w0~and wfl, and FIRSTk(x) = FIRSTk(y). Then ITHO'(a) -- THO,(fl) ! ~ p.t P r o o f Let t be the maximum of THO'(y) or TH"'(?) such that ? is a righthand side of a production in P1 or P2, respectively. Let p = t(k ÷ 1), and suppose in contradiction that
(8.1.9)
WHO'(a) -- TH°'(fl) > p
•~Absolut¢ value, not length, is meant here by[
I.
684
THEORYOF DETERMINISTIC PARSING
CHAP. 8
We shall show that as a consequence of this assumption, L ( G ~ ) ~ L(Gz). Let z = FIRST(x) = FIRST(y). We claim that THO~(fl, z) ~ THO~(fl) + p, for there is a derivation fl G2 ===-~ y, and hence, since G 2 is in GNF, there is a lm derivation, fl 03 ~ l m z~ for some ~, requiring no m o r e t h a n k steps. It is easy to show that TH°~(6) < THO~(fl) + kt since "at worst" ~ is fl with most of the right-hand sides of k productions appended. We conclude that
(8.1.1o)
TH°*(fl, z) < k + THG*(8) _< THG~(fl) + p
It is trivial to show that THO,(a, z) ~ THO,(a), and so from (8.1.9) and (8.1.10) we have
(8.1.11)
T.H°'(~, z) > THO~(fl, z)
If we let u be a shortest string derivable from g, then TH°,(fl, z ) < l z u I. The string wzu is in L(G2), since S *~ wfl *~ wzg =~ * wzu. But it is impossible G~ G~ G~ that a =~zu, because by (8.1.11) THO,(a, z) > Jzu [. Since G1 is LL(k), if G~ there is any leftmost derivation of wzu in G~, it begins with the derivation $1 Gt - - l~m wa. Thus, wzu is not in L(G1) , contradicting the assumption that L(G1) = L(G2). We conclude that THO,(a) -- THO,(fl) < p = t(k + 1). The case THO~(fl) -- THO,(e) > p is handled symmetrically. ~ LEMMA 8.6
It is decidable, for DPDA P, whether P accepts all strings over its input alphabet. Proof By Theorem 2.23, L(P), the complement of L(P), is a deterministic language and hence a CFL. Moreover, we can effectively construct a CFG G such that L(G) = L(P). Algorithm 2.7 can be used to test if L(G) = ~ . Thus, we can determine whether P accepts all strings over its input alphabet.
We are now ready to describe the algorithm to test the equivalence of two LL(k) grammars. THEOREM 8.6
It is decidable, for two LL(k) grammars (N2, Z2, e2, $2), whether L(G~) = L ( G 2 ) .
G 1 = (N1, El, P1, $1) and
G2 =
Proof. We first construct, by Algorithm 8.3, G N F grammars G'~ and G~ equivalent to G 1 and Gz, respectively (except possibly for the empty string,
SEC. 8.1
THEORY OF LL LANGUAGES
685
which can be handled in an obvious way). We then construct a D P D A P which accepts an input string w in (E~ u E2)* if and only if (1) w is in both L(G~) and L(G2) or (2) w is in neither L(G1) nor L(G2). Thus, L(G~) = L(G2) if and only if L(P) = (E~ U E2)*. We can use Lemma 8.6 for this test. Thus, to complete the proof, all we need to do is show how we construct the D P D A P. P has a pushdown list consisting of two parallel tracks. P processes an input string by simultaneously parsing its input top-down according to G'~ and G'2" Suppose that P's input is of the form wx. After simulating I wl steps of the leftmost derivations in G'i and G~, P will have on its pushdown list the contents of each stack of the k-predictive parser for G'i and G~, as shown in Fig. 8.6. We note from Algorithm 5.3 that the stack contents are in each
i
w
!
i
Parser
open portion for G '1 = a open portion for G ~ = Fig. 8.6 Representation of left sentential forms w~ and wfl. case the open portions of the two current left-sentential forms, together with some extra information appended to the nonterminals to guide the parsing. We can thus think of the stack contents as consisting of symbols of G'i and G~. The extra information is carried along automatically. Note that the two open portions may not take the same amount of space. However, since we can bound from above the difference in their thicknesses, then, whenever L(G1) = L(G2), we know that P can simulate both derivations by reading and writing a fixed distance down its pushdown list. Since G'I and G~ are in GNF, P alternately simulates one step of the derivation in G'i,
686
THEORY OF DETERMINISTIC PARSING
CHAP. 8
one in G~, and then moves its input head one position. If one parse reaches an error condition, the simulation of the parse in the remaining grammar continues until it reaches an error or accepting configuration. It is necessary only to explain how the two open portions can be placed so that they have approximately the same length, on the assumption that L(G1) = L(G2). By Lemma 8.5, there is a constant p such that the thicknesses of the two open portions, resulting from processing the prefix of any input string, do not differ by more than p. For each grammar symbol of thickness t, P will reserve t cells of the appropriate track of its pushdown list, placing the symbol on one of them. Since G'x and G~ are in G N F , there are no nullable symbols in either grammar, and so t > 1 in each case. Since the two strings ~ and fl of Fig. 8.6 differ in thickness by at most p, their representations on P's pushdown list differ in length by at most p cells. To complete the proof, we design P to reject its input if the two open portions on its pushdown list ever have thicknesses differing by more than p symbols. By Lemma 8.5, L ( G ~ ) ~ L(G2) in this case. Also, should the thicknesses never differ by more than p, P accepts its input if and only if it finds a parse of that input in both G'i and G~ or in neither of G'~ and G~. Thus, P accepts all strings over its input alphabet if and only if L(Gi) = L(G2). E] 8.1.4,
The Hierarchy of LL Languages
We shall show that for all k > 0 the LL(k) languages are a proper subset of the LL(k + 1) languages. As we shall see, this situation is in direct contrast to the situation for LR languages, where for each LR language we can find an LR(1) grammar. Consider the sequence of languages L1, L 2 , . . , L k , . . . , where Lk = {a"w l n > 1 and w is in {b, c, bkd}"}.
In this section, we shall show that L k is an LL(k) language but not an LL(k -- I) language, thereby demonstrating that there is an infinite proper hierarchy of LL(k) languages. The following LL(k) grammar generates L k" S
> aT
T
>S A I A
A=
> bB[c
B
> bk-ld[ e
We now show that every LL(k) grammar for L k must contain at least one e-production.
SEC. 8.1
THEORY OF LL LANGUAGES
687
LEMMA 8.7 Lk is not generated by any LL(k) grammar without e-productions. P r o o f Assume the contrary. Then we may, by steps (2)-(4) of Algorithm 8.3, find an LL(k) grammar, G - - - ( N , {a, b, c, d}, P, S), in G N F such that L(G)--L k. We shall now proceed to show that any such grammar must generate sentences not in L k. Consider the sequence of strings ~, i - - 1, 2 , . . . , such that S==~
==~
lm
lm
for some J. Since G is LL(k) and in G N F , ~t is unique for each i. For if not, let ~ = ~j. Then it is easy to show that a~+kbj+k is in L(G), which is contrary to assumptions if i ~ j. Thus, we can find i such that I~,1>_2 k - - 1. Pick a value of i such that 0ct = / 3 B y for some fl and ? in N* and B ~ N such that lPt and 17,1 are at least k ~ 1. Since G is LL(k), the derivation of the sentence a~+k-~b ~+~'-~ is of the form S =~::,.a, ]3B r =~, a, ÷k - l b , ÷k -1. lm
Since G is in G N F and
I/~1>_ k
Im
-- 1, we must have fl *~ a ~-1bi, B *~ bl,
and ~, *~ b m for some j ~ 0, l _~ 1, and m _~ k -- 1, where i+k--
1 =j+l+m.
If we consider the derivation of the sentence at+k-le i+k-~, we can also conclude that B *=~ c" for some n :> 1. Finally, if we consider the derivation of the sentence at+k-abJ+t+k-adb m, we find that S *=~a~flB7 ~ lm
Im
a'+k-lbiB7 *=~a'+k-lbJ+l? *=~at+k-lbJ+l+k-ldb m. lm
Im
The existence of the latter derivation follows from the fact that the sentence at+k-~bJ+Z+k-~db r" agrees with a~+k-lb i+k-I for (i + k -- 1) + (j + 1 + k - - 1) * bk-ldb ~. symbols. Thus, ? =~ Putting these partial derivations together, we can obtain the derivation S ~
aiflBy ~ lm
ai+k-XbJB? -Y-> a~+k-XbJc"? ~ Ira
lm
ai+k-lbJc"bk-Xdb " lm
But the result of this derivation is not in L k, because it contains a substring o f the form ebk-ld. (Every d must have k b's before it.) We conclude that L k has no LL(k) grammar without e-productions. We now show that if a language L is generated by an LL(k) grammar with no e-productions, then L is an LL(k -- 1) language.
688
CHAP. 8
THEORYOF DETERMINISTIC PARSING
THEOREM 8.7
If a language L has an LL(k) grammar without e-productions, k > 2, then L has an LL(k -- I) grammar. P r o o f By the third step of Algorithm 8.3 we can find an LL(k) grammar in G N F for L. Let G -- (N, E, P, S) be such a grammar. From G, we can construct LL(k -- 1) grammar G 1 = (N1, E, P1, S), where
(1) N~ = N u { [ A , a ] [ A
~ N, a ~ E, and A ~ aa is in P for some a}
and (2) P~ = (A --~ a[A, a] l A --~ aoc is in P} u ([A, a] ~ a [A --~ aa is in P}. It is left for the Exercises to prove that G1 is LL(k -- 1). Note that the construction here is an example of left factoring. [~] Example 8.3
Let us consider Gk, the natural LL(k -+- 1) grammar for the language L k of Lemma 8.7, defined by aSAlaA
S
~
A
~ bkdlb[c
We construct an LL(k) grammar for L k by adding the new symbols [S, a], [A, b], and [A, c]. This new grammar G, is defined by productions S
> a[S, a]
A
, b[A, b]lc[A, el
[S, al
"~ S A I A
[A, b]
> bk-ldle
[A, c]
~e
It is left for the Exercises to prove that G k and G~, are, respectively, LL(k + 1) and LL(k). THEOREM 8.8
For all k ~ 1, the class of L L ( k within the L L ( k ) languages.
1) languages is properly included
P r o o f Clearly, the ~LL(0) languages are a proper subset of the LL(1) languages. For k > 1, using Lemma 8.7, the language L k has no LL(k) grammar without e-productions. Hence, by Theorem 8.4 it has no LL(k -- 1) grammar. It does, as we have seen, have an LL(k) grammar.
EXERCISES
689
EXERCISES
8.1.1.
Give additional examples of grammars which are (a) LR and (deterministically) left-parsable but not LL5 (b) Right- and left-parsable but not LR. (c) L R but not left-parsable.
8.1.2.
Convert the following LL(1) grammar to an LL(2) grammar with no e-productions"
E - - - ~ TE" E'-
~ + TE'[e
T - - - ~ FT' T'-
~,FT'[e
F---> al(E) 8.1.3.
Convert the grammar of Exercise 8.1.2 to an LL(2) grammar in G N F .
8.1.4'
Prove part (1) of Theorem 8.3.
8.1.5.
Complete the proof of Lemma 8.4.
8.1.6.
Complete the proof of Theorem 8.5.
8.1.7.
Show that Gi, constructed in the proof of Theorem 8.7, is LL(k -- 1).
8.1.8.
Show that a language is LL(0) if and only if it is ~ or a singleton.
8.1.9.
Show that Gk and G~, in Example 8.3 are, respectively, LL(k + 1) and LL(k).
• 8.1.10.
Prove that each of the grammars in Theorem 8.2 has the properties attributed to it there.
"8.1.11.
Show that the language L = ~a"b"ln ~ 1} u {a"c" In ~ 1} is a deterministic language which is not an LL language. Hint: Assume that L has an LL(k) grammar G in G N F . Show that L(G) must contain strings not in L by considering left-sentential forms of the appearance ai~ for i > 1.
8.1.12.
Show that L = {a"b m I1 ~ m 0 and arbitrary Z
lf: The basis, m = 1, is trivial. In this case, w must be a symbol in X, and [qq'] ----~ w must be a production in P. For the inductive step, assume that (8.2.1) is true for values smaller than m and that (q, w, Z)[--~- (q', e, Z). Then the configuration immediately before (q', e, Z) must be either of the form (p, a, Z) or (p, e, YZ). In the first case, p is a scan state, and (q, w, Z)[m-a (p, a, Z). Hence, (q, w', Z)Ira, 1 (p, e, Z) if w'a = w. By the inductive hypothesis, [qp] *~ w'. By definition of G, [qq'] ---~ [qp]a is in P, so [qq'] ==~ w. In the second case, we must be able to find states r and s and express w as wlw2, so that
(q, WlW~, z) ~ (r, w~, z)
~-- (s, w~, Yz) (p, e, YZ) ~---(q', e , Z ) where m~ < m, m 2 < m, and the sequence (s, w 2, Y Z ) ~ - ( p , e, YZ) never erases the explicitly shown Y. If m~ = 0, then r = q and w 2 = w. It follows that [qq']----, [sp] is in P. It also follows from the form of the moves that (s, w2, Y ) ~ (p, e, Y), and hence, [sp] *=, w 2. Thus, [qq'] =* w. If m~ > 0, then (q, w~, Z) ~ - (r, e, Z), and so by hypothesis, [qr] *=* Wx. As above, we also have [sp] *=~ w2. The construction of G assures that
SEC. 8.2
GRAMMARS GENERATING DETERMINISTIC LANGUAGES
695
[qq'] ~ [qr][sp] is in P, and so [qq'] *~ w. Only if: This portion is another straightforward induction and is left for the Exercises. The special case of (8.2.1) where q -- q0 and q' ----qr yields the theorem.
D COROLLARY 1 If L is a deterministic language with the prefix property and e is not in L,t then L is generated by a canonical grammar.
Proof. The construction of a normal form D P D A for such a language is similar to the construction in Theorem 8.9. [ZI COROLLARY 2 If L ~ E* is a deterministic language and ¢ is not in E, then L¢ has a canonical grammar. [~] 8.2.2.
Simple M S P Grammars and Deterministic Languages
We shall now proceed to prove that every canonical grammar is a simple MSP grammar. Recall that a simple MSP grammar is a (not necessarily UI) weak precedence grammar G = (N, E, P, S) such that if A ~ e and B --~ ct are in P, then l(A) n l(B) = ~ , where l(C) is the set of nonterminal or terminal symbols which may appear immediately to the left of C in a rightsentential form; i.e., I(C) = {XI X < C or X " C}. We begin by showing that a canonical grammar is a (I, 1)-precedence grammar (nOt necessarily UI). LEMMA 8.8 A canonical grammar is proper (i.e., has no useless symbols, e-productions, or cycles).
Proof. The construction of a canonical grammar eliminates useless symbols and e-productions. It suffices to show that there are no cycles, A cycle can occur only if there is a sequence of productions of type 3, say [q~q'i]---~ [qi+lq;+l], 1 _~i Y, we may conclude as in case 2 that q' is an erase state. Thus, a canonical g r a m m a r is a precedence g r a m m a r . ?To prove that a canonical grammar is simple MSP, we need only prove it to b e w e a k precedence, rather than (1, 1)-precedence. However, the additional portion of this lemma is interesting and easy to prove.
SEC. 8.2
GRAMMARS GENERATING DETERMINISTIC LANGUAGES
697
THEOREM 8.11
A c a n o n i c a l g r a m m a r is a simple m i x e d strategy p r e c e d e n c e g r a m m a r .
P r o o f By T h e o r e m 8.10 a n d L e m m a s 8.8 a n d 8.9, it suffices to s h o w t h a t for every canonical g r a m m a r G = (N, E, P, S) (I) If A ~ a X Y f l a n d B ~ Yfl are in P, t h e n X is n o t in I(B); a n d (2) If A --~ a a n d B - - , a are in P, A :/: B, t h e n I(A) ~ I(B) -- ;2. W e have (1) i m m e d i a t e l y , since if X alb B---+AC C
~D
D ---~ a
Then G1 is defined by
Ss ~
AsB~
As >a[b Ba------~ A A C ~ AA -----+ a l b CA
> DA
DA -
>a
In step (2) of Algorithm 8.5, CA ~ DA is replaced by CA ---~ a. DA is then useless. In step (3), we add nonterminals A~, A~, A~, A~., and C~. Then A S, AA, and CA become useless, so the resulting productions of G 3 are
ss,
~ A~B~IA~B~
A~
~a
A~
,b
704
CHAP. 8
THEORY OF DETERMINISTIC PARSING
B ~~ A~ AbA ~ C~:
A~AC~ I AbaCa~ >a >b >a
In step (4), we string out A~, AS, and C,5, and we string out A~ and A]. The resulting productions of Ga are Ss
> A~B.4 ]A ~B.4
OA ~
A~C~[
A~ ~
A.~
A].
b a AACA
> C]
C] -----> a A~ ---~. A~ A]
>b
[~
We shall now prove by a series of lemmas that G4 is a UI (2, 1)-precedence grammar. LEMMA 8.10 In Algorithm 8.5, ( 1 ) L ( G 1 ) = L(G), (2) G~ is a (l, 1)-precedence grammar, and (3) If A ~ ~ and B--~ ~ are in P~ and ~z is not a single terminal, then
A=B. Proof. Assertion (1) is a straightforward induction on lengths of derivations. For (2), we observe that if X and Y are related by in G~, then X' and Y', which are X and Y with subscripts (if any) deleted, are similarly related in G. Thus G~ is a (1, 1)-precedence grammar, since G is. For (3), we have noted that productions of types 2 and 4 are uniquely invertible in G. That is, if A --~ C X and B ~ C X are in P, then A = B. It thus follows that if Ar --~ CyXc and B r ~ C y X c are in P~, then Ar = Br. Similarly, if X is a terminal, and Ay ~ C y X and By ~ C y X are in P~, then Ay = By. We must consider productions in P1 derived from type 3 productions; that is, suppose that we have A x ~ Cx and Bx --~ Cx. Since we have eliminated useless productions in G l, there must be right-hand sides X D and X E :# of productions of G such that D ~G Atx and E ~G Bfl for some ~ and ft. Let X = [qq'], A = [pp'], B = [rr'], and C =[ss']. Then p and r are in the write sequence of q', and s follows both of them. We may conclude that p = r. Since A ---~ C and B ~ C are b o t h type 3 productions, we have
GRAMMARS GENERATING DETERMINISTIC LANGUAGES
SEC. 8.2
705
p' -- r'. That is, let 6 be the transition function of the D P D A underlying G. Then O(p, e, Z ) -- (s, Y Z ) for some Y, and O(s', e, Y) -- (p', e) -- (r', e). Thus, A -- B, and A x -- B x. The case X -- $ is handled similarly, with q0, the start state of the underlying D P D A , playing the role of q' in the above. [~ LEMMA 8.11
The three properties of G1 stated in Lemma 8.10 apply equally to G2 of Algorithm 8.5. P r o o f Again (1) is a simple induction. To prove (2), we observe that step (2) of Algorithm 8.5 introduces no new right-hand sides and hence no new " relationships. Also, every right-sentential form of G2 is a rightsentential form of Gi, and so it is easy to show that no relationships are introduced. For (3), it suffices to show that if A x - - ~ B x is a production in P1, then B x appears on the right of no other production and hence is useless and eliminated in step (2b). We already know that there is no production Cx ----, B x in P1 if C ~ A. The only possibilities are productions Cx --~ Bxa, Cx ~ BxDB, or Cr --~ X r B x . Now A --~ B is a type 3 production, and so if B = [qq'], then q' is an erase state. Thus, Cx ~ B x a and Cx --~ B x D n are impossible, because C ~ Ba and C --, B D are of types 2 and 4, respectively. Let us consider a production Cr ---* XrBx. Let X = [pp'] and A -- [rr']. Then q is the second state in the write sequence of p', because C - - ~ X B is a type (4) production. Also, since A --~ B is a type 3 production, q follows r in any write sequence in which r appears. But since X can appear immediately to the left of A in a right-sentential form of G, r appears in the write sequence of p', and r ~ p'. Thus, q appears twice in the write sequence of p', which is impossible. [~] LEMMA 8.12
The three properties of G1 stated in Lemma 8.10 apply equally to G3 of Algorithm 8.5. P r o o f Again, (1) is straightforward. To prove (2), we note that if X and Y are related in G 3 by 0A1 [01
B~
> 0BII[011
is easily shown to be an LL(2) grammar. Left factoring converts it to an LL(1) grammar. We claim that the language Lz = [0"al" In _~ 1} U {0"b2"ln _~ 1] is not an LL language. (See Exercise 8.3.2.) L~ is a simple precedence language; it has the following simple precedence grammar"
8.3.2.
S-
>A[B
A
> 0Alia
B
~ 0B21b
D
Operator Precedence Languages
Let us now turn our attention to the class of languages generated by the operator precedence grammars. Although there are ambiguous operator
712
THEORY OF DETERMINISTIG PARSING
CHAP. 8
precedence grammars (the grammar S ~ A IB, A ----~ a, B ---~ a is a simple example), we shall discover that operator precedence languages are a proper subset of the simple precedence languages. We begin by showing that every operator precedence language has an operator grammar which has no single productions and which is uniquely invertible. LEMMA 8 . 1 3
Every (not necessarily UI) operator precedence grammar is equivalent to one with no single productions. P r o o f Algorithm 2.11, which removes single productions, is easily seen to preserve the operator precedence relations between terminaIs. E]
We shall now provide an algorithm to make any context-free grammar with no single productions uniquely invertible. ALGORITHM 8.6 Conversion of a CFG with no single productions to an equivalent UI CFG. Input.
A CFG G = (N, X, P, S) with no single productions. An equivalent UI CFG G~ = (N~, X, P1, $I).
Output. Method.
(1) The nonterminals of the new grammar will be nonempty subsets of N. Formally, let N'a = {MIM ~ N and M ~ ~} W {S~}. (2) For each Wo, w ~ , . . . , w k in X* and M i , . . . , Mk in N'~, place in P'~ the production M - - ~ w o M ~ w 1 . . . M k w e , where M = [A [there is a production of the form A ~
woB~w ~ ...
Bkwk,
where Bt ~ Mr, for 1 ~ i ~ k} provided that M ~ ~ . Note that only a finite number of productions are so generated. (3) For all M _ N such that S ~ M, add the production S~ ~ M to P'i. (4) Remove all useless nonterminals and productions, and let N 1 and P1 be the useful portions, of N'i and P'i, respectively. [--] Example 8.5
Consider the grammar S ~
> a laAbS
A
> a laSbA
THEORY OF SIMPLE PRECEDENCE LANGUAGES
SEC. 8.3
713
From step (2) we obtain the productions
IS} - - + a{A}b{S} l a{S, A}b[S} I a[A]b{S, A} {A] - > a[S}b{A} l a{S, A}b{A} I a[S}b[S, A} IS, A]
> a l a{S, A}b[S, A}
From step (3) we obtain the productions S 1~
{S}[~S, A]
From step (4) we discover that all [S]- and (A]-productions are useless, and so the resulting grammar is S1
IS, A}
), [S~ A}
> a la[S, A}b{S, A}
LEMMA 8 . 1 4
The grammar G~ constructed from G in Algorithm 8.6 is UI if G has no single productions and is an operator precedence grammar if G is operator precedence. Moreover, L(G~) = L(G).
Proof. Step (2) certainly enforces unique invertibility, and since G has no single productions, step (3) cannot introduce a nonuniquely invertible right side. A proof that Algorithm 8.6 preserves the operator precedence relations among terminals is straightforward and left for the Exercises. Finally, it is easy to verify by induction on the length of a derivation that i f A e M , t h e n A ~ Gw i f a n d o n l y i f M * ~ w .G1 D We thus have the following normal form for operator precedence grammars. THEOREM 8.21 IfL is an operator precedence language, then L = L(G) for some operator precedence grammar G = (N, Z, P, S) such that (1) G is UI, (2) S appears on the right of no production, and (3) The only single productions in P have S on the left.
Proof Apply Algorithms 2.11 and 8.6 to an arbitrary operator precedence grammar.
[Z
We shall now take an operator precedence grammar G in the normal form given in Theorem 8.21 and change G into a UI weak precedence gram-
714
THEORYOF DETERMINISTIC PARSING
CHAP. 8
mar G1 such that L(Gi) -- (L(G), where ~ is a left endmarker. We can then construct a UI weak precedence grammar G2 such that L(Gz) --L(G). In this way we shall show that the operator precedence languages are a subset of the simple precedence languages. ALGORITHM 8.7
Conversion of an operator precedence grammar to a weak precedence grammar.
Input. A UI operator precedence grammar G -- (N, E, P, S) satisfying the conditions of Theorem 8.21. Output. A UI weak precedence grammar G~ = ( N ~ , E U {¢},P~,S~) such that L(G~) --eL(G), where ¢~ is not in E. Method. (1) Let N'I consist of all symbols [XA] such that X ~ E U {¢} and A E N. (2) Let S~ --[¢S]. (3) Let h be the homomorphism from N'~ w E U {¢} to N U ~ such that (a) h(a) -- a, for a ~ E u (¢~},and (b) h([aA]) -- aA. Then h-~(~) is defined only for strings ~ in (N u E)* which begin with a symbol in E u {¢} and d o n o t have adjacent nonterminals. Moreover, h-~(~) is unique if it is defined. That is, h-1 combines a nonterminal with the terminal appearing to its left. Let P'i consist of all productions [aA]---, h-l(aoO such that A --. ~ is in P and a is in ~: u [¢}. (4) Let N~ and P~ be the useful portions of N'i and P'~, respectively. [---] Example 8.6
Let G be defined by S
~A
A
~ aAbAe[aAd[ a
We shall generate only the useful portion of N'~ and P'i in Algorithm 8.7. We begin with nonterminal [¢~S]. Its production is [¢S] ~ [¢~A]. The productions for [CA] are [tEA]~ ¢[aA][bA]c, [CA]--~ ¢[aA]d, and leA]---, Ca. The productions for [aA] and [bA] are constructed similarly. Thus, G1 is [¢s] ~
[¢,4]
[CA]
> ¢[aA][bA]e I f~[aA]d[f~a
[aA]
~ a[aA][bA]c l a[aA]d [ aa
[bA] -----+ b[aA][bA]e [ b[aA]d [ ba
D
SEC. 8.3
THEORY
OF SIMPLE PRECEDENCE
LANGUAGES
715
LEMMA 8.15 In Algorithm 8.7 the grammar G~ is a UI weak precedence grammar such that L(G~) = eL(G).
Proof. Unique invertibility is easy to show, and we omit the proof. To show that G1 is a weak precedence grammar, we must show that < and " are disjoint from 3> in G~'t. Let us define the homomorphism g by (1) g(a) : a for a ~ 12, (2) g(¢) = $, and (3) g([aA]) : a. It suffices to show that (1) If X g(Y) in G.
Case/'Suppose that X ~ Y in G1, where X ~ ¢ and X ~ $. Then there is a right-hand side in P~, say ocX[aA]fl, such that [aA] ~ Y7 for some 7•
GI
If t~ is not e, then it is easy to show that there is a right-hand side in P with substring h(X[aA]):~ and thus g(X) ~ a in G. But the form of the productions in P~ implies that g(Y) must be a. Thus, g(X) " g(Y). If t~ is e and X ~ E, then the left-hand side associated with right-hand side ~X[aA]fl is of the form [XB] for some B E N. Then there must be a production in P whose right-hand side has substring XB. Moreover, B ~ aAh(fl) is a production in P. Hence, X refer to operator precedence relations in G and Wirth-Weber precedence relations in Gi. Sh is the homomorphism in Algorithm 8.7.
716
THEORY OF DETERMINISTIC PARSING
CHAP. 8
oc[aA]Zfl is a right-hand side of P1. Since G has no single productions except for starting productions, ~, is not e, and so g(X) is derived from A rather than a in the derivation [aA] =~ ?X. Thus, g(X) > g(Z), and g(X) .> g(Y) Ot in G. The case Y - - $ is handled easily. To complete the proof that G1 is a weak precedence grammar, we must show that if A --~ ~,Yfl and B - ~ fl are in P~, then neither X 1} U {bO"l"O"lm, n ~ 1} u ~aO"2"OmIm, n ~ 1} is simple precedence but neither operator precedence nor LL. (4) {a0"l"0mIm, n > 1} U {b0ml"0"l m, n ~> 1} is LL(1) and simple precedence but not operator precedence.
EXERCISES
717
CFL
deterministic CFL LR(1) (1,1)- BRC UI (2, 1) - precedence simple MSP
simple precedence UI weak precedence d
e
LL operator precedence
b
Fig. 8.7 Subclasses of deterministic languages. (5) {a0"l"ln >__ 1} U {b0"lZ"ln >__ 1] is LL(1) but not simple precedence. (6) [aO"l"]n >. 1} w {aO"2Z"ln > I} u {b0"l Z"ln > 1} is deterministic but not LL or simple precedence. (7) {0"t"[n > 1} U {0"lZ"l n ~> 1} is context-free but not deterministic. Proofs that these languages have the ascribed properties are requested in the Exercises.
EXERCISES
8.3.1.
Show that the grammar for L1 given in Theorem 8.20 is LL(2). Find an LL(t) grammar for L 1.
*8.3.2. Prove that the language Lz = [0"al"[n > 1] U {0"b2"ln > 1] is not an LL language. Hint: Assume that L2 has an LL(k) grammar in GNF. *8.3.3.
Show that L2 of Exercise 8.3.2 is an operator precedence language.
8.3.4.
Prove that Algorithm 2.11 (elimination of single productions) preserves the properties of (a) Operator precedence. (b) (m, n)-precedence. (c) (m, n)-BRC.
8.3.5.
Prove Lemma 8.14.
*8.3.6.
Why does Algorithm 8.6 not necessarily convert an arbitrary precedence grammar into a simple precedence grammar ?
8.3.7.
Convert the following operator precedence grammar to an equivalent simple precedence grammar:

S  → if b then S1 else S | if b then S | a
S1 → if b then S1 else S1 | a
8.3.8.
Prove case 2 of Lemma 8.15.
8.3.9.
Give an algorithm to remove the left endmarker from the language generated by a grammar. Show that your algorithm preserves the UI weak precedence property when applied to a grammar constructed by Algorithm 8.7.
8.3.10.
Show that the simple precedence language

L = {a0^n 1^n 0^m | n ≥ 1, m ≥ 1} ∪ {b0^n 1^m 0^m | n ≥ 1, m ≥ 1}

is not an operator precedence language. Hint: Show that any grammar for L must have conflicting operator precedence relations between 1 and 1.
*8.3.11.
Show that L of Exercise 8.3.10 is a simple precedence language.
• 8.3.12.
Prove that the languages given in Section 8.3.3 have the properties ascribed to them.
8.3.13.
Give additional examples of languages which belong to the various regions in Fig. 8.7.
8.3.14.
Generalize Theorem 8.18 to show that L1 of that theorem is not UI (1, k)-precedence for any k.
8.3.15.
Show that Gi of Algorithm 8.7 right-covers G of that algorithm.
8.3.16.
Does G~ constructed in Algorithm 8.6 right-cover G of that algorithm if G is proper ?
• '8.3.17.
Let L be the simple precedence language
(a"Oaib" [i, n ~ 0} u {Oa"laic" l i, n ~ 0}. Show that there is no simple precedence parser for L that will announce error immediately after reading the 1 in an input string of the form a"laib. (See Exercise 7.3.5.)
Open Question 8.3.18.
What is the relationship of the class of UI (1, k)-precedence languages, k > 1, to the classes shown in Fig. 8.7? The reader should be aware of Exercise 8.3.14.
BIBLIOGRAPHIC NOTES
The proper inclusion of the simple precedence languages in the deterministic context-free languages and the proper containment of the operator precedence languages in the simple precedence languages were first proved by Fischer [1969].
9
TRANSLATION AND CODE GENERATION
A compiler designer is usually presented with an incomplete specification of the language for which he is to design a compiler. Much of the syntax of the language can be precisely defined in terms of a context-free grammar. However, the object code that is to be generated for each source program is more difficult to specify. Broadly applicable formal methods for specifying the semantics of a programming language are still a subject of current research. In this chapter we shall present and give examples of declarative formalisms that can be used to specify some of the translations performed within a compiler. Then we shall investigate techniques for mechanically implementing the translations defined by these formalisms.
9.1. THE ROLE OF TRANSLATION IN COMPILING
We recall from Chapter 1 that a compiler is a translator that maps strings into strings. The input to the compiler is a string of symbols that constitutes the source program. The output of the compiler, called the object (or target) program, is also a string of symbols. The object program can be
(1) A sequence of absolute machine instructions,
(2) A sequence of relocatable machine instructions,
(3) An assembly language program, or
(4) A program in some other language.
Let us briefly discuss the characteristics of each of these forms of the object program. (1) Mapping a source program into an absolute machine language pro-
gram that can immediately be executed is one way of achieving very fast compilation. WATFOR is an example of such a compiler. This type of compilation is best suited for small programs that do not use separately compiled subroutines.
(2) A relocatable machine instruction is an instruction that references memory locations relative to some movable origin. An object program in the form of a sequence of relocatable machine instructions is usually called a relocatable object deck. This object deck can be linked together with other object decks such as separately compiled user subprograms, input-output routines, and library functions to produce a single relocatable object deck, often called a load module. The program that produces the load module from a set of binary decks is called a link editor. The load module is then put into memory by a program called a loader that converts the relocatable addresses into absolute addresses. The object program is then ready for execution. Although linking and loading consume time, most commercial compilers produce relocatable binary decks because of the flexibility in being able to use extensive library subroutines and separately compiled subprograms.
(3) Translating a source program into an assembly language program that is then run through an assembler simplifies the design of the compiler. However, the total time now required to produce an executable machine language program is rather high because assembling the output of this type of compiler may take as much time as the compilation itself.
(4) Certain compilers (e.g., SNOBOL) map a source program into another program in a special internal language. This internal program is then executed by simulating the sequence of instructions in the internal program. Such compilers are usually called interpreters. However, we can view the mapping of the source program into the internal language as an instance of compilation in itself.
In this book we shall not assume any fixed format for the output of a compiler, although in many examples we use assembly language as the object code. In this section we shall review compiling and the main processes that map a source program into object code.

9.1.1. Phases of Compilation
In Chapter 1 we saw that a compiler can be partitioned into several subtranslators, each of which participated in translating some representation of the source program toward the object program. We can model each subtranslator by a mapping that defines a phase of the compilation such that the composition of all the phases models the entire compilation. What we choose to call a phase is somewhat arbitrary. However, it is convenient to think of lexical analysis, syntactic analysis, and code generation
as the main phases of compilation. However, in many sophisticated compilers these phases are often subdivided into several subphases, and other phases (e.g., code optimization) may also be present. The input to a compiler is a string of symbols. The lexical analyzer is the first phase of compilation. It maps the input into a string consisting of tokens and symbol table entries. During lexical analysis the original source program is compressed in size somewhat because identifiers are replaced by tokens and unnecessary blanks and comments are removed. After lexical analysis, the original source program is still essentially a string, but certain portions of this string are pointers to information in a symbol table. A finite transducer is a good model for a lexical analyzer. We discussed finite transducers and their application to lexical analysis in Section 3.3. tn Chapter 10 we shall discuss techniques that can be used to insert and retrieve information from symbol tables. The syntax analyzer takes the output of the lexical analyzer and parses it according to some underlying grammar. This grammar is similar to the one used in the specification of the source language. However, the grammar for the source language usually does not specify what constructs are to be treated as lexical items. Keywords and identifiers such as labels, variable names, and constants are some of the constructs that are usually recognized during lexical analysis. But these constructs could also be recognized by the syntax analyzer, and in practice there is no hard and fast rule as to what constructs are to be recognized lexically and what should be left for the syntactic analyzer. . After syntactic analysis, we can visualize that the source program has been transformed into a tree, called the syntax tree. The syntax tree is closely related to the derivation tree for the source program, often being the derivation tree with chains of single productions deleted. In the syntax tree interior nodes generally correspond to operators, and the leaves represent operands consisting of pointers into the symbol table. The structure of the syntax tree reflects the syntactic rules of the programming language in which the source program was written. There are several ways of physically representing the syntax tree, which we shall discuss in the next section. We shall call a representation of the syntax tree an intermediate program. The actual output of the syntactic analyzer can be a sequence of commands to construct the intermediate program, to consult and modify the symbol table, and to produce diagnostic messages where necessary. The model that we have used for a syntactic analyzer in Chapters 4-7 produced a left or right parse for the input. However, it is the nature of syntax trees used in practice that one may easily replace the production numbers in a left or right parse by commands that construct the syntax tree and perform symbol table operations, Thus, it is an appealing simplification to regard the left or right parses produced by the parsers of Chapters 4-7 as the inter-
mediate program itself, and to avoid making a major distinction between a syntax tree and a parse tree. A compiler must also check that certain semantic conventions of the source language have been obeyed. Some common examples of conventions of this nature are (1) Each statement label referenced must actually appear as the label of an appropriate statement in the source program. (2) No identifier can be declared more than once. (3) All variables must be defined before use. (4) The arguments of a function call must be compatible both in number and in attributes with the definition of the function. The checking of these conventions by a compiler is called semantic analysis. Semantic analysis often occurs immediately after syntactic analysis, but it can also be done in some later phases of compilation. For example, checking that variables are defined before use can be done during code optimization if there is a code optimization phase. Checking of operands for correct attributes can be deferred to the code generation phase. In Chapter 10, we shall discuss property grammars, a formalism that can be used to model aspects of syntactic and semantic analysis. After syntactic analysis, the intermediate program is mapped by the code generator into the object program. However, to generate correct object code, the code generator must also have access to the information in the symbol table. For example, the attributes of the operands of a given operator determine the code that is to be generated for the operator. For instance, when A and B are floating-point variables, different object code will be generated for A + B than will be generated when A and B are integer variables. Storage allocation also occurs during code generation (or in some subphase of code generation). Thus, the code generator must know whether a variable represents a scalar, an array, or a structure. Tbis information is contained in the symbol table. Some compilers have an optimizing phase before code generation. In this optimizing phase the intermediate program is subjected to transformations that attempt to put the intermediate program into a form from which a more efficient object language program can be produced. It is often difficult to distinguish some optimizing transformations from good code generation techniques. In Chapter 11 we shall discuss some of the optimizations that can be performed on intermediate programs or while generating intermediate programs. An actual compiler may implement a phase of the compilation process in one or more passes, where a pass consists of reading input from secondary memory and then writing intermediate results into secondary memory. What is implemented in a pass is a function of the size of the machine on
which the compiler is to run, the language for which the compiler is being developed, the number of people engaged in implementing the compiler, and so forth. It is even possible to implement all phases in one pass. What the optimal number of passes to implement a given compiler should be is a topic that is beyond the scope of this book.

9.1.2. Representations of the Intermediate Program
In this section we shall discuss some possible representations for the intermediate program that is produced by the syntactic analyzer. The intermediate program should reflect the syntactic structure of the source program. However, it should also be relatively easy to translate each statement of the intermediate program into machine code. Compilers performing extensive amounts of code optimization create a detailed representation of the intermediate program, explicitly showing the flow of control inherent in the source program. In other compilers the representation of the intermediate program is a simple representation of the syntax tree, such as Polish notation. Other compilers, doing little code optimization, will generate object code as the parse proceeds. In this case, the "intermediate" program appears only figuratively, as a sequence of steps taken by the parser. Some of the more common representations for the intermediate program are
(1) Postfix Polish notation,
(2) Prefix Polish notation,
(3) Linked list structures representing trees,
(4) Multiple address code with named results, and
(5) Multiple address code with implicitly named results.
Let us examine some examples of these representations. We defined Polish notation for arithmetic expressions in Section 3.1.1. For example, the assignment statement

(9.1.1)    A = B + C * -D

with the normal order of precedence for the operators and assignment symbol (=) has the postfix Polish representation†

A B C D - * + =

and the prefix Polish representation

= A + B * C - D

†In this representation, - is a unary operator and *, +, and = are all binary operators.
In postfix Polish notation, the operands appear from left to right in the order in which they are used. The operators appear right after the operands and in the order in which they are used. Postfix Polish notation is often used as an intermediate language by interpreters. The execution phase of the interpreter can evaluate the postfix expression using a pushdown list for an accumulator. (See Example 9.4.) Both types of Polish expressions are linear representations of the syntax tree for expression (9.1.1), shown in Fig. 9.1. This tree reflects the syntactic structure of expression (9.1.1). We can also use this tree itself as the intermediate program, encoding it as a linked list structure.
Fig. 9.1 Syntax tree.
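The pushdown-list evaluation of postfix Polish mentioned above can be made concrete with a small sketch. The following fragment is illustrative only and is not taken from the text; the space-separated token format, the operator set {-, *, +, =}, and the treatment of '=' as assignment into an environment are all assumptions.

```python
# A minimal postfix evaluator using a pushdown list as the accumulator.
# Assumptions: '-' is unary minus, '+' and '*' are binary, '=' assigns the
# value on top of the stack to the variable named beneath it.  (In this toy
# version an assignment target must not already hold a value.)
def eval_postfix(tokens, env):
    stack = []
    for t in tokens.split():
        if t == "-":                        # unary operator: one operand
            stack.append(-stack.pop())
        elif t in ("+", "*"):               # binary operators: two operands
            b, a = stack.pop(), stack.pop()
            stack.append(a + b if t == "+" else a * b)
        elif t == "=":                      # assignment: value on top, name below
            value, name = stack.pop(), stack.pop()
            env[name] = value
        elif t in env:                      # operand with a known value
            stack.append(env[t])
        else:                               # operand used as an assignment target
            stack.append(t)
    return env

# Postfix form of (9.1.1), A = B + C * -D:
print(eval_postfix("A B C D - * + =", {"B": 7, "C": 3, "D": 2}))
# {'B': 7, 'C': 3, 'D': 2, 'A': 1}
```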
Another method of encoding of the syntax tree is to use multiple address code. For example, using multiple address code with named results, expression (9.1.1) could be represented by the following sequence of assignment statements:

T1 ← -D
T2 ← *C T1
T3 ← +B T2
A ← T3†

A statement of the form A ← θB1 ... Br means that r-ary operator θ is to be applied to the current values of variables B1, ..., Br and that the resulting
†Note that the assignment operator must be treated differently from other operators. A simple "optimization" is to replace T3 by A in the third line and to delete the fourth line.
value is to be assigned to variable A. We shall formally define this particular intermediate language in Section 11.1. Multiple address code with named results requires a temporary variable name in each assignment instruction to hold the value of the expression computed by the right-hand side of each statement. We can avoid the use of the temporary variables by using multiple address code with implicitly named results. In this notation we label each assignment statement by a number. We then delete the left-hand side of the statement and reference temporary results by the number assigned to the statement generating the temporary result. For example, expression (9.1.1) would be represented by the sequence

1: -D
2: *C (1)
3: +B (2)
4: =A (3)
Here a parenthesized number refers to the expression labeled by that number. Multiple address code representations are computationally convenient if a code optimization phase occurs after the syntactic analysis phase. During the code optimization phase the representation of the program may change considerably. It is much more difficult to make global changes on a Polish representation than a linked list representation of the intermediate program. As a second example let us consider representations for the statement (9.1.2)
if I = J then S1 else S2
where S1 and S2 represent arbitrary statements. A possible postfix Polish representation for this statement might be

I J EQUAL? L2 JFALSE S1' L JUMP S2'

Here, S1' and S2' are the postfix representations of S1 and S2, respectively; EQUAL? is a Boolean-valued binary operator that has the value true if its two arguments are equal and false otherwise. L2 is a constant which names the beginning of S2'. JFALSE is a binary operator which causes a jump to the location given by its second argument if the value of the first argument is false and has no effect if the first argument is true. L is a constant which is the first instruction following S2'. JUMP is a unary operator that causes a jump to the location given by its argument. A derivation tree for statement (9.1.2) is shown in Fig. 9.2. The important syntactic information in this derivation tree can be represented by the syntax tree shown in Fig. 9.3, where S1' and S2' represent syntax trees for S1 and S2.
Fig. 9.2 Derivation tree.
Fig. 9.3 Syntax tree.
A syntactic analyzer generating an intermediate program would parse expression (9.1.2) tracing out the tree in Fig. 9.2, but its output would be a sequence of commands that would construct the syntax tree in Fig. 9.3. In general, in the syntax tree an interior node represents an operator whose operands are given by its direct descendants. The leaves of the syntax tree correspond to identifiers. The leaves will actually be pointers into the symbol table where the names and attributes of these identifiers are kept. Part of the output of the syntax analyzer will be commands to enter information into the symbol table. For example, a source language statement of the form INTEGER I will be translated into a command that enters the attribute "integer" in the symbol table location reserved for identifier I. There will be no explicit representation for this statement in the intermediate program.
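To make the step from syntax tree to intermediate program concrete, here is a small illustrative sketch, not taken from the text; the Node representation, the gen routine, and the spelling of the operators are assumptions. It walks a syntax tree such as the one in Fig. 9.1 and emits multiple address code with named results; for brevity the assignment operator is treated like any other operator, although, as noted earlier, it really requires special handling.

```python
from itertools import count

class Node:
    def __init__(self, op, *children):
        self.op = op              # operator label, or an identifier at a leaf
        self.children = children  # subtrees (empty for a leaf)

def gen(node, code, temps):
    """Return the name holding node's value; append statements to code."""
    if not node.children:                     # leaf: an identifier such as B
        return node.op
    args = [gen(c, code, temps) for c in node.children]
    name = f"T{next(temps)}"
    code.append(f"{name} <- {node.op} " + " ".join(args))
    return name

# Syntax tree for A = B + C * -D (Fig. 9.1).
tree = Node("=", Node("A"),
            Node("+", Node("B"),
                 Node("*", Node("C"), Node("-", Node("D")))))
code = []
gen(tree, code, count(1))
print("\n".join(code))
# T1 <- - D
# T2 <- * C T1
# T3 <- + B T2
# T4 <- = A T3
```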
9.1.3. Models for Code Generation
Code generation is a mapping from the intermediate program to a string. We shall consider this mapping to be a function defined on the syntax tree and information in the symbol table. The nature of this mapping depends on the source language, the target machine, and the quality of the object code desired. One of the simplest code generation schemes would map each multiple address statement into a sequence of object language instructions independent of the context of the multiple address instructions. For example, an assignment instruction of the form

A ← +BC

might be mapped into the three machine instructions

LOAD  B
ADD   C
STORE A

Here we are assuming a one accumulator machine, where the instruction LOAD B places the value of memory location B in the accumulator, ADD C adds† the value of memory location C to the accumulator, and STORE A places the value in the accumulator into memory location A. The STORE instruction leaves the contents of the accumulator unchanged. However, if the accumulator initially contained the value of memory location B (because, for example, the previous assignment instruction was B ← +DE), then the LOAD B instruction would be unnecessary. Also, if the next assignment instruction is F ← +AG and no other reference is made to A, then the STORE A instruction is not needed. In Sections 11.1 and 11.2 of Chapter 11 we shall consider some techniques for generating code from multiple address statements.
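A sketch of this context-free style of code generation, with the single improvement described above (eliding a LOAD when the accumulator already holds the left operand), might look as follows. This is illustrative only; the statement format and the opcode names ADD and MPY are assumptions.

```python
# Illustrative sketch: a context-free code generator for a one-accumulator
# machine.  Each multiple address statement (result, op, left, right) becomes
# LOAD/ADD (or MPY)/STORE, except that the LOAD is skipped when the
# accumulator already holds the left operand.
OPCODE = {"+": "ADD", "*": "MPY"}

def generate(statements):
    code, acc = [], None          # acc = name whose value the accumulator holds
    for result, op, left, right in statements:
        if acc != left:
            code.append(f"LOAD  {left}")
        code.append(f"{OPCODE[op]}   {right}")
        code.append(f"STORE {result}")   # may itself be redundant; see text
        acc = result
    return code

# A <- +BC followed by F <- +AG: the second LOAD is elided.
print("\n".join(generate([("A", "+", "B", "C"), ("F", "+", "A", "G")])))
```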
EXERCISES
9.1.1.
Draw syntax trees for the following source language statements:
(a) A = (B - C)/(B + C) (as in FORTRAN).
(b) I = LENGTH(C1 || C2) (as in PL/I).
(c) if B > C then if D > E then A := B + C else A := B - C else A := B * C (as in ALGOL).
tLet us assume for simplicity that there is only one type of arithmetic. If more than one, e.g., fixed and floating, is available, then the translation of + will depend on symbol table information about the attributes of B and C.
9.1.2.
Define postfix Polish representations for the programs in Exercise 9.1.1.
9.1.3.
Generate multiple address code with named results for the statements in Exercise 9.1.1.
9.1.4.
Construct a deterministic pushdown transducer that maps prefix Polish notation into postfix Polish notation.
9.1.5.
Show that there is no deterministic pushdown transducer that maps postfix Polish notation into prefix Polish notation. Is there a nondeterministic pushdown transducer that performs this mapping? Hint" see Theorem 3.15.
9.1.6.
Devise an algorithm using a pushdown list that will evaluate a postfix Polish expression.
9.1.7.
Design a pushdown transducer that takes as input an expression w in L(G0) and produces as output a sequence of commands that will build a syntax tree (or multiple address code) for w.
9.1.8.
Generate assembly code for your favorite computer for the programs in Exercise 9.1.1.
• 9.1.9.
Devise algorithms to generate assembly code for your favorite computer for intermediate programs representing arithmetic assignments when the intermediate program is in (a) Postfix Polish notation. (b) Multiple address code with named results. (c) Multiple address code with implicitly named results. (d) The form of a syntax tree.
"9.1.10.
Design an intermediate language that is suitable for the representation of some subset of F O R T R A N (or PL/I or ALGOL) programs. The subset should include assignment statements and some control statements. Subscripted variables should also be allowed.
• "9.1.11.
Design a code generator that will map an intermediate program of Exercise 9.1.10 into machine code for your favorite computer.
BIBLIOGRAPHIC NOTES
Unfortunately, it is impossible to specify the best object code even for common source language constructs. However, there are several papers and books that discuss the translation of various programming languages. Backus et al. [1957] give the details of an early FORTRAN compiler. Randell and Russell [1964] and Grau et al. [1967] discuss the implementation of ALGOL 60. Some details of PL/I implementation are given in IBM [1969]. There are many publications describing techniques that are useful in code generation. Knuth [1968a] discusses and analyzes various storage allocation techniques. Elson and Rake [1970] consider the generation of code from a tree-structured intermediate language. Wilcox [1971] presents some general models for code generation.
9.2. SYNTAX-DIRECTED TRANSLATIONS
In this section we shall consider a compiler model in which syntax analysis and code generation are combined into a single phase. We can view such a model as one in which code generation operations are interspersed with parsing operations. The term syntax-directed compiling is often used to describe this type of compilation. The techniques discussed here can also be used to generate intermediate code instead of object or assembly code, and we give one example of translation to intermediate code. The remainder of our examples are to assembly code. Our starting point is the syntax-directed translation scheme of Chapter 3. We show how a syntax-directed translation can be implemented on a deterministic pushdown transducer that does top-down or bottom-up parsing. Throughout this section we shall assume that a D P D T uses a special symbol $ to delimit the right-hand end of the input string. Then we add various features to make the SDTS more versatile. First, we allow semantic rules that permit more than one translation to be defined at various nodes of the parse tree. We also allow repetition and conditionals in the formulas for these transIations. We then consider translations which are not strings; integers and Boolean variables are useful additional types of translations. Finally, we allow translations to be defined in terms of other translations found not only at the direct descendants of the node in question but at its direct ancestor. First, we shall show that every simple SDTS on an LL grammar can be implemented by a deterministic pushdown transducer. We shall then investigate what simple SDTS's on an LR grammar can be so implemented. We shall discuss an extension of the DPDA, called a pushdown processor, to implement the full class of SDT's whose underlying grammar is LL or LR. Then, the implementation of syntax-directed translations in connection with backtrack parsing algorithms is studied briefly. 9.2.1.
Simple Syntax-Directed Translations
In Chapter 3 we saw that a simple syntax-directed translation scheme can be implemented by a nondeterministic pushdown transducer. In this section we shall discuss deterministic implementations of certain simple syntaxdirected translation schemes. The grammar G underlying an S DTS can determine the translations definable on L(G) and the efficiency with which these translations can be directly implemented. Example 9.1
Suppose that we wish to map expressions generated by the grammar G, below, into prefix Polish expressions, on the assumption that ,'s are to take
precedence over +'s, e.g., a * a + a has prefix expression + * a a a, not * a + a a. G is given by E → a + E | a * E | a. However, there is no syntax-directed translation scheme which uses G as an underlying grammar and which can define this translation. The reason for this is that the output grammar of such an SDTS must be a linear CFG, and it is not difficult to show that the set of prefix Polish expressions over {+, *, a} corresponding to the infix expressions in L(G) is not a linear context-free language. However, this particular translation can be defined using a simple SDTS with G0 [except for production F → (E)] as the underlying grammar. □
In Theorem 3.14 we showed that if (x, y) is an element of a simple syntax-directed translation, then the output y can be generated from the left parse of x using a deterministic pushdown transducer. As a consequence, if a grammar G can be deterministically parsed top-down in the natural manner using a DPDT M, then M can be readily modified to implement any simple SDTS whose underlying grammar is G. If G is an LL(k) grammar, then any simple SDT defined on G can be implemented by a DPDT in the following manner.
THEOREM 9.1
Let T = (N, Σ, Δ, R, S) be a semantically unambiguous simple SDTS with an underlying LL(k) grammar. Then {(x$, y) | (x, y) ∈ τ(T)} can be defined by a deterministic pushdown transducer.
Proof. The proof follows from the methods of Theorems 3.14 and 5.4. If G is the underlying grammar of T, then we can construct a k-predictive parsing algorithm for G using Algorithm 5.3. We can construct a DPDT M with an endmarker to simulate this k-predictive parser and perform the translation as follows. Let Δ' = {a' | a ∈ Δ}, assume Δ' ∩ Σ = ∅, and let h(a) = a' for all a ∈ Δ. The parser of Algorithm 5.3 repeatedly replaces nonterminals by right-hand sides of productions, carrying along LL(k) tables with the nonterminals. M will do essentially the same thing. Suppose that the left parser replaces A by w0B1w1 ... Bmwm [with some LL(k) tables, not shown, appended to the nonterminals]. Let

A → w0B1w1 ... Bmwm,  x0B1x1 ... Bmxm

be a rule of R. Then M will replace A by w0h(x0)B1w1h(x1) ... Bmwmh(xm). As in Algorithm 5.3, whenever a symbol of Σ appears on top of the pushdown list, it is compared with the current input symbol, and if a match occurs, the symbol is deleted from the pushdown list, and the input head moves right one cell. When a symbol a' in Δ' is found on top of the pushdown list, M emits a and removes a' from the top of the pushdown list without moving the input head. M does not emit production indices. A more formal construction of M is left to the reader. □
Example 9.2
Let T be the simple SDTS with the rules

S → aSbSc,  1S2S3
S → d,      4

The underlying grammar is a simple LL(1) grammar. Therefore, no LL tables need be kept on the stack. Let M be the deterministic pushdown transducer (Q, Σ, Δ, Γ, δ, q, S, {accept}), where

Q = {q, accept, error}
Σ = {a, b, c, d, $}
Γ = {S, $} ∪ Σ ∪ Δ
Δ = {1, 2, 3, 4}

Since Σ ∩ Δ = ∅, we shall let Δ' = Δ. δ could be defined as follows:

δ(q, a, S) = (q, Sb2Sc3, 1)†
δ(q, d, S) = (q, e, 4)
δ(q, b, b) = (q, e, e)
δ(q, c, c) = (q, e, e)
δ(q, e, 2) = (q, e, 2)
δ(q, e, 3) = (q, e, 3)
δ(q, $, $) = (accept, e, e)
Otherwise, δ(q, X, Y) = (error, e, e)

†In these rules we have taken the liberty of producing an output symbol and shifting the input as soon as a production has been recognized, rather than doing these actions in separate steps as indicated in the proof of Theorem 9.1.

It is easy to verify that τ(M) = {(x$, y) | (x, y) ∈ τ(T)}. □
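The following fragment sketches, under several simplifying assumptions, how such a DPDT can be simulated for the SDTS of the example above: each expansion pushes the input right-hand side with the output symbols of the translation element interleaved at the positions where they must be emitted, terminals are matched against the input, and interleaved output symbols are emitted without consuming input. It is an illustration, not the book's formal construction; representing the "primed" output symbols as tagged pairs is an assumption.

```python
# Simple SDTS of Example 9.2:  S -> aSbSc, 1S2S3   and   S -> d, 4
RULES = {
    ("S", "a"): ("aSbSc", "1S2S3"),
    ("S", "d"): ("d", "4"),
}

def interleave(rhs, element):
    """Merge the input right side with the output symbols of the translation
    element so that each output symbol sits where it must be emitted
    (nonterminals of rhs and element align, as in a simple SDTS)."""
    pushed, j = [], 0
    for x in rhs:
        if x.isupper():                       # nonterminal: flush outputs first
            while j < len(element) and not element[j].isupper():
                pushed.append(("out", element[j])); j += 1
            pushed.append(x); j += 1          # the matching nonterminal
        else:
            pushed.append(x)
    while j < len(element):                   # trailing output symbols
        pushed.append(("out", element[j])); j += 1
    return pushed

def translate(w):
    stack, out, i = ["S"], [], 0              # top of the pushdown list at the left
    w = w + "$"
    while stack:
        top = stack.pop(0)
        if isinstance(top, tuple):            # "primed" symbol: emit, read nothing
            out.append(top[1])
        elif top.isupper():                   # expand a nonterminal on lookahead
            rhs, element = RULES[(top, w[i])]
            stack = interleave(rhs, element) + stack
        else:                                 # terminal: match against the input
            assert top == w[i]; i += 1
    assert w[i] == "$"
    return "".join(out)

print(translate("adbdc"))   # 14243
```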
Let us now consider a simple SDTS in which the underlying grammar is LR(k). Since the class of LR(k) grammars is larger than the class of LL(k) grammars, it is interesting to investigate what class of simple SDTS's with underlying LR(k) grammars can be implemented by DPDT's. It turns out that there are semantically unambiguous simple SDT's which have an underlying LR(k) grammar but which cannot be performed by any DPDT. Intuitively, the reason for this is that a translation element may require the generation of output long before it can be ascertained that the production to which this translation element is attached is actually used.

Example 9.3
Consider the simple SDTS T with the rules

S → Sa,  aSa
S → Sb,  bSb
S → e,   e

The underlying grammar is LR(1), but by Lemma 3.15 there is no DPDT defining {(x$, y) | (x, y) ∈ τ(T)}. The intuitive reason for this is that the first rule of this SDTS requires that an a be emitted before the use of production S → Sa can be recognized. □

However, if a simple SDTS with an underlying LR(k) grammar does not require the generation of any output until after a production is recognized, then the translation can be implemented on a DPDT.

DEFINITION
An SDTS T = (N, Σ, Δ, R, S) will be called a postfix SDTS if each rule in R is of the form A → α, β, where β is in N*Δ*. That is, each translation element is a string of nonterminals followed by a string of output symbols.

THEOREM 9.2
Let T = (N, Σ, Δ, R, S) be a semantically unambiguous simple postfix SDTS with an underlying LR(k) grammar. Then {(x$, y) | (x, y) ∈ τ(T)} can be defined by a deterministic pushdown transducer.
Proof. Using Algorithm 5.11, we can construct a deterministic right parser M with an endmarker for the underlying LR(k) grammar. However, rather than emitting the production number, M can emit the string of output symbols of the translation element associated with that production. That is, if A → α, βx is a rule in R where β ∈ N* and x ∈ Δ*, then when the production number of A → α is to be emitted by the parser, the string x is emitted by M. In this fashion M defines {(x$, y) | (x, y) ∈ τ(T)}. □
We leave the converse of Theorem 9.2, that every deterministic pushdown transduction can be expressed as a postfix simple SDTS on an LR(1) grammar, for the Exercises. Example 9.4
Postfix translations are more useful than it might appear at first. Here, let us consider an extended D P D T that maps the arithmetic expressions of
L(G0) into machine code for a very convenient machine. The computer for this example has a pushdown stack for an accumulator. The instruction LOAD X puts the value held in location X on top of the stack; all other entries on the stack are pushed down. The instructions ADD and MPY, respectively, add and multiply the top two levels of the stack, removing the two levels but then pushing the result on top of the stack. We shall use semicolons to separate these instructions. The SDTS we have in mind is

E → E + T,  E T 'ADD;'
E → T,      T
T → T * F,  T F 'MPY;'
T → F,      F
F → (E),    E
F → a,      'LOAD a;'
In this example and the ones to follow we shall use the SNOBOL convention of surrounding literal strings in translation rules by quote marks. Quote marks are not part of the output string. With input a + (a * a)$, the DPDT enters the following sequence of configurations; we have deleted the LR(k) tables from the pushdown list. We have also deleted certain obvious configurations from the sequence as well as the states and bottom of stack marker.

[e,            a + (a * a)$,  e]
⊢ [a,          + (a * a)$,    e]
⊢ [F,          + (a * a)$,    LOAD a;]
⊢ [E,          + (a * a)$,    LOAD a;]
⊢ [E + (a,     * a)$,         LOAD a;]
⊢ [E + (F,     * a)$,         LOAD a; LOAD a;]
⊢ [E + (T,     * a)$,         LOAD a; LOAD a;]
⊢ [E + (T * a, )$,            LOAD a; LOAD a;]
⊢ [E + (T * F, )$,            LOAD a; LOAD a; LOAD a;]
⊢ [E + (T,     )$,            LOAD a; LOAD a; LOAD a; MPY;]
⊢ [E + (E,     )$,            LOAD a; LOAD a; LOAD a; MPY;]
⊢ [E + (E),    $,             LOAD a; LOAD a; LOAD a; MPY;]
⊢ [E + T,      $,             LOAD a; LOAD a; LOAD a; MPY;]
⊢ [E,          $,             LOAD a; LOAD a; LOAD a; MPY; ADD;]
Note that if the different a's representing identifiers are indexed, so that the input expression becomes, say, a1 + (a2 * a3), then the output code would be

LOAD a1
LOAD a2
LOAD a3
MPY
ADD

which computes the expression correctly on this machine. □
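The effect of this postfix translation can be reproduced with a short sketch. The fragment below is illustrative only: it uses a recursive-descent parser as a stand-in for the LR-based DPDT, but it emits exactly the translation defined by the SDTS above (one LOAD per a, one MPY per T * F, one ADD per E + T).

```python
def translate(expr):
    toks, pos, out = expr.replace(" ", "") + "$", 0, []

    def peek():
        return toks[pos]

    def eat(t):
        nonlocal pos
        assert toks[pos] == t, f"expected {t!r} at position {pos}"
        pos += 1

    def E():                     # E -> E + T | T : emit 'ADD;' after each E + T
        T()
        while peek() == "+":
            eat("+"); T(); out.append("ADD;")

    def T():                     # T -> T * F | F : emit 'MPY;' after each T * F
        F()
        while peek() == "*":
            eat("*"); F(); out.append("MPY;")

    def F():                     # F -> (E) | a : emit 'LOAD a;' for each a
        if peek() == "(":
            eat("("); E(); eat(")")
        else:
            eat("a"); out.append("LOAD a;")

    E(); eat("$")
    return " ".join(out)

print(translate("a + (a * a)"))   # LOAD a; LOAD a; LOAD a; MPY; ADD;
```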
While the computer model used in Example 9.4 was designed for the purpose of demonstrating a syntax-directed translation of expressions with few of the complexities of generating code for more common machine models, the postfix scheme is capable of defining useful classes of translations. In the remainder of this chapter, we shall show how object code can be generated for other machine models using a pushdown transducer operating on what is in essence a simple postfix SDTS with an underlying LR grammar. Suppose that we have a simple SDTS which has an underlying LR(k) grammar, but which is not postfix. How can such a translation be performed? One possible technique is to use the following multipass translation scheme. This technique illustrates a cascade connection of DPDT's. However, in practice this translation would be implemented in one pass using the techniques of the next section for arbitrary SDTS's. Let T = (N, Σ, Δ, R, S) be a semantically unambiguous simple SDTS with an underlying LR(k) grammar G. We can design a four-stage translation scheme to implement τ(T). The first stage consists of a DPDT. The input to the first stage is the input string w$. The output of the first stage is π, the right parse for w according to the underlying input grammar G. The second stage reverses π to create π^R, the right parse in reverse.† The input to the third stage will be π^R. The output of the third stage will be the translation defined by the simple SDTS T' = (N, Σ', Δ, R', S), where
†Recall that π, the right parse, is the reverse of the sequence of productions used in a rightmost derivation. Thus, π^R begins with the first production used and ends with the last production used in a rightmost derivation. To obtain π^R we can merely read the buffer in which π is stored backward.
R' contains the rule

A → i B_m B_{m-1} ... B_1,  y_m B_m y_{m-1} B_{m-1} ... y_1 B_1 y_0

if and only if

A → x_0 B_1 x_1 ... B_m x_m,  y_0 B_1 y_1 ... B_m y_m

is a rule in R whose underlying production A → x_0 B_1 x_1 ... B_m x_m is the ith production in the underlying LR(k) grammar. It is easy to prove that (π^R, y^R) is in τ(T') if and only if (S, S) ⇒* (w, y) in T. T' is a simple SDTS based on an LL(1) grammar and can thus be implemented on a DPDT. The fourth stage merely reverses the output of the third stage. If the output of the third stage is put on a pushdown list, then the fourth stage merely pops off the symbols from the pushdown list, emitting each symbol as it is popped off. Figure 9.4 summarizes this procedure.
Fig. 9.4 Simple SDT on an LR(k) grammar. (Stage 1 maps the input w to π, the right parse for w; Stage 2 produces π^R; Stage 3 produces y^R; Stage 4 produces the translation y.)
Each of these three stages requires a number of basic operations that is linearly proportional to the length of w. Thus, we can state the following result. THEOREM 9.3 Any simple SDTS with an underlying LR(k) grammar can be implemented in time proportional to its input length. Proof
A formalization of the discussion above.
[-]
9.2.2. A Generalized Transducer
While the pushdown transducer is adequate for defining all simple SDTS's on an LL grammar and for some simple SDTS's on an LR grammar, we need a more versatile model of a translator when doing (1) Nonsimple SDTS's, (2) Nonpostfix simple SDTS's on an LR grammar, (3) Simple SDTS's on a non-LR grammar, and (4) Syntax-directed translation when the parsing is not deterministic one pass, such as the algorithms of Chapters 4 and 6. We shall now define a new device called a pushdown processor (PP) for defining syntax-directed translations that map strings into graphs. A pushdown processor is a PDT whose output is a labeled directed graph, generally a tree or part of a tree which the processor is constructing. The major feature of the PP is that its pushdown list, in addition to pushdown symbols, can hold pointers to nodes in the output graph. Like the extended PDA, the pushdown processor can examine the top k cells of its pushdown list for any finite k and can manipulate the contents of these cells arbitrarily. Unlike the PDA, if these k cells include some pointers to the output graph, the PP can modify the output graph by adding or deleting directed edges connected to the nodes pointed to. The PP can also create new nodes, label them, create pointers to them, and create edges between these nodes and the nodes pointed to by those pointers on the top k cells of the pushdown list. As it is difficult to develop a concise lucid notation for such manipulations, we shall use written descriptions of the moves of the PP. Since each move of the PP can involve only a finite number of pointers, nodes, and edges, such descriptions are, in principle, possible, but we feel that a formal notation would serve to obscure the essential simplicity of the translation algorithms involved. We proceed directly to an example. Example 9.5
Let us design a pushdown processor P to map the arithmetic expressions of L(G o) into syntax trees. In this case, a syntax tree will be a tree in which each interior node is labeled by + or • and leaves are labeled by a. The following table gives the parsing and output actions that the pushdown processor is to take under various combinations of current input symbol and symbol on top of the pushdown list. P has been designed from the SLR(1) parser for G Ogiven in Fig. 7.37 (p. 625). However, here we have eliminated table T4 from Fig. 7.37, treating F ~ a as a single production, and have renamed the tables as follows"
Old name
To T1 T5 T6 T7 T8 T9 T10 Tll
New name
To
T1
(
--~- *
Z2
T3
T4
)
In addition $ is used as a right endmarker on the input. The parsing and output actions of P are given in Figs. 9.5 and 9.6. P uses the LR(1) tables to determine its actions. In addition P will attach a pointer to tables T1, T2, T3, and T4 when these tables are placed on the pushdown list.t These pointers are to the output graph being generated. However, the pointers do not affect the parsing actions. In parsing we shall not place the g r a m m a r symbols on the pushdown list. However, the table names indicate what that g r a m m a r symbol would be. The last column of Fig. 9.5 gives the new LR(1) table to be placed on top of the pushdown list after a reduce move. Blank entries denote error situations. A new input symbol is read only after a shift move. The numbers in the table refer to the actions described in Fig. 9.6. Let us trace the behavior of P on input a 1 • (a 2 + a3)$. We have subscripted the a's for clarity. The sequence of moves m a d e by P is as follows:
Pushdown List (1) (2) (3) (4) (5). (6) (7) (8) (9) (10) (11)
To To[TI,pl] To[Ti,pl] * To[Ti,pl] * ( To[Ti,pa] * ([T2,P2] To[TI,pl] * ([T2,P2] + T o [ T i , p l ] * ([T2,p2] + [T3,P3] To[Ti,pl] * ([T2,P4] To[Ti,pl] * ([T2,p4]) T o [ T t , p l ] * [T4, P4] ,
To[Ti,p 5]
Input al * (a2 --t-a3)$ * (a2 -+- a3)$ (a2 -]- a3)$ a2 -1- a3)$ -q- a3)$ a3)$ )$ )$ $ $ $
Let us examine the interesting moves in this sequence. In going from configuration (1) to configuration (2), P creates a node n I labeled a and sets a p o i n t e r p l to this node. The pointer pl is stored with the LR(1) table T1 on the pushdown list. Similarly, going from configuration (4) to configuration (5), P creates a new node n 2 labeled a 2 and places a pointer P2 to this node on the pushdown list with LR(1) table Tz. In configuration (7) another node n 3 labeled a3 is created and P3 is established as a pointer to n 3.
tin practice, these pointers can be stored immediately below the tables on the pushdown list.
Fig. 9.5 Pushdown processor P. (The table gives, for each LR(1) table on top of the pushdown list and each current input symbol a, +, *, (, ), or $, the parsing action to be taken: shift, reduce, accept, or one of the output actions 1 through 7 of Fig. 9.6, together with the goto entry used after a reduce move.)
(1) Create a new node n labeled a. Push the symbol [T1, p] on top of the pushdown list, where p is a pointer to n. Read a new input symbol. (2) At this point, the top of the pushdown list contains a string of four symbols of the form X[T~, p 1] + [Ti, p z], where p i andp 2 are pointers to nodes n 1and nz, respectively. Create a new node n labeled Jr-. Make nl and nz the left and right direct descendants of n. Replace [Ti, pl] + [Ti, p2] by [T,p], where T = goto(X) and p is a pointer to node n. (3) Same as (2) above with • in place of ÷. (4) Same as (1) with T3 in place of Ti. (5) Same as (1) with T4 in place of T~. (6) Same as (1) with T~ in place of T1. (7) The pushdown list now contains X([T,p]), where p is a pointer t o s o m e node n. Replace ([T, p]) by [T', p], where T ' = goto(X). Fig. 9.6 Processor output actions.
In going to configuration (8), P creates the output graph shown in Fig. 9.7. Here, p4 is a pointer to node n4. After entering configuration (11), the output graph is as shown in Fig. 9.8. Here p5 is a pointer to node n5. In configuration (11), the action of T1 on input $ is accept, and so the tree in Fig. 9.8 is the final output. This tree is the syntax tree for the expression
a1 * (a2 + a3). □
Fig. 9.7 Output after configuration (8).
Fig. 9.8 Final output.
9.2.3. Deterministic One-Pass Bottom-Up Translation
This is the first of three sections showing how various parsing algorithms can be extended to implement an SDTS by the use of a deterministic pushdown processor instead of a pushdown transducer. We begin by giving an algorithm to implement an arbitrary SDTS with an underlying LR grammar. ALGORITHM 9. i SDTS on an LR grammar.
Input. A semantically unambiguous SDTS T = (N, E, A, R, S) with underlying LR(k) grammar G = (N, E, P, S) and an input w in E*.
SYNTAX-DIRECTED TRANSLATIONS
SEC. 9.2
741
Output~ An output tree whose frontier is the output for w. Method. The tree is constructed by a pushdown processor M, which simulates ~, an LR(k) parser for G. M will hold on its pushdown list (top at the right) the symbols in N U ~ and the LR(k) tables exactly as ~ does. In addition, immediately below each nonterminal, M will have a pointer to the output graph being generated. The shift, reduce, and accept actions of M are as follows"
(1) When ~ shifts symbol a onto its stack, M does the same. (2) If ~ reduces X1 .-- Xm to nonterminal A, M does the following" (a) Let A--~ X i . . . Xm, WoBlWl"'" Brwr be a rule of the SDTS, where the B's are in one-to-one correspondence with those of the X's which are nonterminals. (b) M removes X1 " - Xm from the top of its stack, along with intervening pointers, LR(k) tables, and, if X1 is in N, the pointer immediately below X~. (c) M creates a new node n, labels it A, and places a pointer to n below the symbol A on top of its stack. (d) The direct descendants of n have labels reading, from the left, WoB lw~ . . . Brwr. Nodes are created for each of the symbols of the w's. The node for Bi, 1 ~ i ~ r, is the node pointed to by the pointer which was immediately below Xj on the stack of M, where Xj is the nonterminal corresponding to B; in this particular rule of the SDTS. (3) If t~'s input becomes empty (we have reached the right endmarker) and only p S and two LR(k) tables appear on M's pushdown list, then M accepts if ~ accepts; p points to the root of the output tree of M. [~ Example 9.6
Let Algorithm 9.1 be applied to the SDTS

S → aSA,  0AS
S → b,    1
A → bAS,  1SA
A → a,    0
with input abbab. The underlying grammar is SLR(1), and we shall omit discussion of the LR(1) tables, assuming that they are there and guide the parse properly. We shall list the successive stack contents entered by the pushdown processor and then show the tree structure pointed to by each of the pointers. The LR(1) tables on the stack have been omitted.
(e, abbab$)
⊢² (ab, bab$)
⊢ (ap1S, bab$)
⊢² (ap1Sba, b$)
⊢ (ap1Sbp2A, b$)
⊢ (ap1Sbp2Ab, $)
⊢ (ap1Sbp2Ap3S, $)
⊢ (ap1Sp4A, $)
⊢ (p5S, $)
Algorithm 9.1 correctly produces the translation of the input word according to the given SDTS.
Proof Elementary induction on the order in which the pointers are created by Algorithm 9.1. [---] We comment that if we "implement" a pushdown processor on a reasonable random access computer, a single move of the PP can be done in a finite number of steps of the random access machine. Thus, Algorithm 9.1 takes time which is a linear function of the input length. 9.2.4.
Deterministic One-Pass Top-Down Translation
Suppose that we have a predictive (top-down) parser. Converting such a parser to a translator requires a somewhat different approach from the way in which a bottom-up parser was converted into a translator. Let us suppose that we have an SDTS with an underlying LL grammar. The parsing process builds a tree top-down, and at any stage in the process, we can imagine that a partial tree has been constructed. Those leaves in this partial tree labeled by nonterminals correspond to the nonterminals contained on the pushdown list of the predictive parser generated by Algorithm 5.3. The expansion of a nonterminal is tantamount to the creation of descendants for the corresponding leaf of the tree. The translation strategy is to maintain a pointer to each leaf of the "cur-
SEC. 9.2
SYNTAX-DIRECTED TRANSLATIONS
743
I =1
(a)
(b)
1 ,i
(c)
t.,i
(d)
(e)
Fig. 9.9 Translation by pushdown processor. rent" tree having a nonterminal label. While doing ordinary LL parsing, that pointer will be kept on the pushdown list immediately below the nonterminal corresponding to the node pointed to. When a nonterminal is expanded according to some production, new leaves are created for the corresponding translation element, and nodes with nonterminal labels are pointed to by newly created pointers on the pushdown list. The pointer below the expanded nonterminal disappears. It is therefore necessary to keep, outside the pushdown processor, a pointer to the root of the tree being created. For example, if the pushdown list contains Ap, and the production A ~ aBbCc, with translation element 0CIB2, i~ used to expand A, then the pu~hdown processor
744
TRANSLATION AND CODE GENERATION
CHAP. 9
will replace A p by aBpsbCp2c. If p pointed to the leaf labeled A before the expansion, then after the expansion A is the root of the following subtree"
where P l points to node B and P2 points to node C. ALGORITHM 9.2 Top-down implementation of an SDTS on an LL grammar. Input. A semantically unambiguous SDTS T = (N, X, A, R, S) with an underlying LL(k) grammar G = (N, X, P, S). Output. A pushdown processor which produces a tree whose frontier is the output for w for each w in L(~). M e t h o d . We shall construct a pushdown processor M which simulates the LL(k) parser t/, for G. The simulation of ~ by M proceeds as follows. As in Algorithm 9.1, we ignore the handling of tables. It is the same for M as for 6.
(1) Initially, M will have Spr on its pushdown stack (the top is at the left), where Pr is a pointer to the root node nr. (2) If ~ has a terminal on top of its pushdown list and compares it with the current input symbol, deletingboth, M does the same. (3) Suppose that ~ expands a nonterminal A [possibly with an associated LL(k) table] by production A ~ X i ' " Xm, having translation element y o B l y l " • • BrYr, and that the pointer immediately below A (it is easy to show that there will always be one) points to node n. Then M does the following: (a) M creates direct descendant nodes for n labeled, from the left, with the symbols of yo B1Yl • • " B~ yr. (b) On the stack M replaces A and the pointer below by X'~ . . . X,n, with pointers immediately below those of X~ . . . X m which are nonterminals. The pointer below X) points to the node created for B~ if Xj and B~ correspond in the rule A ~
X1 ' "
Xm, Y o B l Y l " "
BrY~
(4) If M's pushdown list becomes empty when it has reached the end of the input sentence, it accepts; the output is the tree which has been constructed with root n~. [~]
SEC. 9.2
SYNTAX-DIRECTED TRANSLATIONS
745
Example 9.7
We shall consider an example drawn from the area of natural language translation. It is a little known fact that an SDTS forms a precise model for the translation of English to another commonly spoken natural language, pig Latin. The following rules informally define the translation of a word in English to the corresponding word in pig Latin" (1) If a word begins with a vowel, add the suffix YAY. (2) If a word begins with a nonempty string of consonants, move all consonants before the first vowel to the back of the word and append suffix AY. (3) One-letter words are not changed. (4) U following a Q is a consonant. (5) Y beginning a word is a vowel if it is not followed by a vowel. We shall give an SDTS that incorporates only rules (1) and (2). It is left for the Exercises to incorporate the remaining rules. The rules of the SDTS are as follows" (word)
→ (consonants)(vowel)(letters), (vowel)(letters)(consonants) 'AY'
(word) → (vowel)(letters), (vowel)(letters) 'YAY'
(consonants) → (consonant)(consonants), (consonant)(consonants)
(consonants) → (consonant), (consonant)
(letters) → (letter)(letters), (letter)(letters)
(letters) → e, e
(vowel) → 'A', 'A'
(vowel) → 'E', 'E'
   ...
(vowel) → 'U', 'U'
(consonant) → 'B', 'B'
(consonant) → 'C', 'C'
   ...
(consonant) → 'Z', 'Z'
(letter) → (vowel), (vowel)
(letter) → (consonant), (consonant)
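A direct implementation of the five informal rules above (not of the SDTS itself) can serve as a check on the translation. The sketch below is illustrative only; the treatment of Y in positions other than the first letter, which the rules leave open, is an assumption.

```python
# Pig Latin by the informal rules (1)-(5); rule numbers appear in the comments.
VOWELS = set("AEIOU")

def pig_latin(word):
    w = word.upper()
    if len(w) == 1:                                   # rule (3)
        return w

    def is_vowel(i):
        c = w[i]
        if c == "U" and i > 0 and w[i - 1] == "Q":    # rule (4)
            return False
        if c == "Y":                                  # rule (5); elsewhere: assumed consonant
            return i == 0 and w[1] not in VOWELS
        return c in VOWELS

    if is_vowel(0):                                   # rule (1)
        return w + "YAY"
    for i in range(1, len(w)):                        # rule (2)
        if is_vowel(i):
            return w[i:] + w[:i] + "AY"
    return w + "AY"                                   # word with no vowel at all

print(pig_latin("THE"))    # ETHAY
print(pig_latin("and"))    # ANDYAY
print(pig_latin("quick"))  # ICKQUAY
```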
746
TRANSLATIONAND CODEGENERATION
CHAP. 9
The underlying grammar is easily seen to be LL(2). Let us compute the output translation corresponding to the input word "THE". As in the previous two examples, we shall first list the configurations entered by the processor and then show the tree at various stages in its construction. The pushdown top is at the left this time.
( 1) (2) (3) (4) (5) (6) (7) (8) (9) (10) (11)
Input
Stack
THE$ THE$ THE$ THE$ HE$ HE$ HE$ E$ E$ $ $
(word)p1 (consonant s)p 2(vowel)p 3(letters)p 4 p 3(letters)p 4 (vowel)p 3(letters)p 4 E(letters)p4 (letters)p4 e
The tree structure after steps 1, 2, 6, and 11 are shown in Fig. 9.10(a)-(d), respectively. D As in the previous section, there is an easy proof that the current algorithm performs the correct translation and that on a suitable random access machine the algorithm can be implemented to run in time which is linear in the input length. For the record, we state the following theorem. THEOREM 9.5 Algorithm 9.2 constructs a pushdown processor which produces as output a tree whose frontier is the translation of the input string.
Proof We can prove by induction that an input string w has the net effect of erasing nonterminal A and pointer p from the pushdown list if and only if (A, A ) ~T* (w, x), where x is the frontier of the subtree whose root is the node pointed to by p (after erasure of A and p and the symbols to which A is expanded). Details are left for the Exercises. 9.2.5.
Translation in a Backtrack Environment
The ideas central to the pushdown processor can be applied to backtrack parsing algorithms as well. To be specific, we shall discuss how the parsing machine of Section 6.1 can be extended to incorporate tree construction. The chief new idea is that the processor must be able to "destroy," if need be, subtrees which it has constructed. That is, certain subtrees can become inaccessible, and while we shall not discuss it here, the memory cells used to
SYNTAX-DIRECTED TRANSLATIONS
SEC. 9.2
747
< word >
l 1 !
(
< v o w e l > | |
< word >
(a)
A
(b)
< word >
[
A
(c) I P4
1
T
< consonant >
< word >
(d)
< vowel >
< letters >
E
e
< consonants >
A
< consonant > < consonants >
i
T
I
< consonant >
I
H Fig. 9.10 Translation to pig Latin. represent the subtree are in practice r e t u r n e d to the available m e m o r y . W e shall m o d i f y the parsing m a c h i n e of Section 6.1 to give it the capability o f placing pointers to nodes of a g r a p h on its p u s h d o w n list. (Recall
748
TRANSLATION AND CODE GENERATION
CHAP. 9
that the parsing machine already has pointers to its input; these pointers are kept on one cell along with an information symbol.) The rules for manipulating these pointers are the same as for the pushdown processor, and we shall not discuss the matter in any more detail. Before giving the translation algorithm associated with the parsing machine, let us discuss how the notion of a syntax-directed translation carries over to the G T D P L programs of Section 6.1. It is reasonable to suppose that we shall associate a translation with a "call" of a nonterminal if and only if that call succeeds. Let P -- (N, Z, R, S) be a G T D P L program. Let us interpret a G T D P L statement A --~ a, where a is in E U {e}, as though it were an attempt to apply production A ~ a in a CFG. Then, in analogy with the SDTS, we would expect to find associated with that rule a string of output symbols. That is, the complete rule would be A --~ a, w, where w is in A*, A being the "output alphabet." The only other G T D P L statement which might yield a translation is A --~ B[C, D], where A, B, C, and D are in N. We can suppose that two C F G productions are implied here, namely A --~ BC and A ~ D. If B and C succeed, we want the translation of A to involve the translations of B and C. Therefore, it appears natural to associate with rule A --~ B[C, D] a translation element of the form wBxCy or wCxBy, where w, x, and y are in A*. (If B -- C, then there must be a correspondence specified between the nonterminals of the rule and those of the translation element.) If B fails, however, we want the translation of A to involve the translation of D. Thus, a second translation element, of the form uDv, must be associated with the rule A --~ B[C, D]. We shall give a formal definition of such a translation-defining method and then discuss how the parsing machine can begeneralized to perform such translations. DEFINITION
A GTDPL program with output is a system P -- (N, E, A, R, S), where N, E, and S are as for any G T D P L program, A is a finite set of output symbols, and R is a set of rules of the following forms: (1) A - ~ f , e
(2) A - - ~ a , y w h e r e a ~ E u [ e } a n d y ~ A * (3) (a) A --~B[C, D], yiBy2Cy 3, y4Dy5 (b) A---, B[C, D], Y l Cy2By3, yaDy5 where y, is in A* There is at most one rule for each nonterminal. We define relations ~ between nonterminals and triples of the form (u r v, y, r), where u and v are in Z*, y E A*, and r is s or f. Here, the first
SYNTAX-DIRECTED TRANSLATIONS
SEC. 9 . 2
749
component is the input string, with input head position indicated; the second component is an output string, and the third is the outcome, either success or failure. (1) If A has rule A ~ a, y, then for all u in E*, A =~ (a l" u, y, s). If v is in E* but does not begin with a, then A =~ ([" v, e, f ) . (2) IfA has rule A --, B[C, D], ylEyzFy3, y4Dys, where E -- B and F -- C or vice versa, then the following hold" 111 112 (a) If B =~ (Ul 1" uzu3, xl, s) and C =~ (u2 I" u3, x2, s), then
A ,~1
(ulu z { u3, YlXlYzXzY3, S)
if E -----B and F - - C, and
A .~___~_~1(u,u z { u3, Y~xzy2x~y3, s) if E -- C and F -- B. In case B -- C, we presume that the correspondence between E and F on the one hand and the positions held by B and C on the other is indicated by superscripts, e.g.,
A ~ B~t~[B~z~, D], ylB~Z~y~BCl~y3, y4DYs. 111
(b) If B ~
11~
(u 1 1" u 2, x, s) and C =~ (1" u 2, e, f ) , then 111+112+ !
A ~ (c) If B ~
(l" ulu2, e, f )
([" utuz, e, f )
~md D ~
nt+n~+ 1
A /Ii
~
(u 1 ~ u2,
(u, { u z, x, s), then
Y4XYs, S)
t12
(d) if B =~ (l" u, e, f ) and O =~ (l" u, e, f ) , then A ,,+,,+1 ( { u , e , f ) . Note that if A =~ (u [" v, y, f ) , then u -- e and y -- e. That is, on failure the input pointer is not moved and no translation is produced. It also should be observed that in case (2b) the translation of B is "canceled" when C fails. + n We let =~ be the union of =~ for n ~> 1. The translation defined by P, denoted z(P), is [(w, x) [S =~ (w ~, x, s)}. Example 9.8
Let us define a G T D P L p r o g r a m with output that performs the pig Latin translation of Example 9.7. Here we shall use lowercase output and the following nonterminals, whose correspondence with the previous example is listed below. Note that X and C3 represent strings of nonterminals.
750
CRAP. 9
TRANSLATION AND CODE GENERATION
Here
Example 9.7
W C1 C2 C3 Li Lz V X
(word) (consonant) (consonants) (consonant)* (letter) (letters~ (vowel) (vowel)(letters)
In addition, we shall use nonterminals S and F with rules S ~ e and F ~ f. Finally, the rules for C~, V, and L must be such that they match any consonant, vowel, or letter, respectively, giving a translation which is the letter matched. These rules involve additional nonterminals and will be omitted. The important rules are
W-
> Cz[X, X],
XC2 'ay',
X'yay'
C2
> Ci[C3, F],
C1C3,
F
C1[C3, S],
C1C3,
S
C3 ~
Lz
~ Li[L2, S],
LiL2,
S
X
> V[L2, F],
VL2,
F
For example, if we consider the input string 'and', we observe that V ~ (a ~ nd, a, s) and that L 2 ~ (nd [~, nd, s). Thus, X ~ (and 1~, and, s). + + Since C~ =~ (1~ and, e, f ) , it follows that C2 =~- (]~ and, e, f ) . Thus, we have W ~ (and I~, andyay, s). [~] Such a translation can be implemented by a modification of the parsing machine of Section 6.1. The action of this modified machine is based on the following observation. If procedure A with rule A --~ B[C, D] is called, then A will produce a translation only if B and C succeed or if B fails and D succeeds. The following algorithm describes the behavior of the modified parsing machine. ALGORITHM 9.3 Implementation of G T D P L programs with output.
lnput. P = (N, E, A, R, S), a G T D P L program with output. Output. A modified parsing machine M such that for each input w, M produces a tree with frontier x if and only if (w, x) is in z(P). Method. If we ignore the translation elements of the rules of P, we have a G T D P L program P' in the sense of Section 6.1. Then we can use Lemma 6.6 to construct M', a parsing machine that recognizes L(P').
SYNTAX-DIRECTED TRANSLATIONS
SEC. 9.2
751
We shall now informally describe the modifications of M' we need to make to obtain the modified parsing machine M. The modified machine M simulates M' and also creates nodes in an output tree, placing pointers to these nodes on its pushdown list. With an input string w, M has initial configuration (begin, ]" w, (S, 0)p,). Here p, is a pointer to a root node nr. (1) (a) Suppose that A ~ B[C, D], ylEy2Fy3, y4Dys is the rule for A. Suppose that M makes the following sequence of moves implementing a call of procedure A under the rule A --~ B[C, D]: (begin, u 1 ~ uzu 3, (A, i)?) ~
(begin, u 1 j" u2u 3, ( B , j ) ( A , i)y)
(success, ulu 2 [" u 3, (A, i)~,) J-- (begin, u lu2 J" u3, (C, i)~,) Then M would make the following sequence of moves corresponding to the moves above made by M ' : (9.2.1)
(begin, u 1 J" u2u 3, (A, i)pA~')
(9.2.2)
~ (begin, u 1 J" u2u 3, (B, j)pB(A, i)pnPAY')
(9.2.3)
[--- (success, ulu2 ]" u 3, (A, i)pBpA?' )
(9.2.4)
~ (begin, ulu2 I" u3, (C, i)pc?') In configuration (9.2.1) M has a pointer PA, to a leaf nA, directly below A. (We shall ignore the pointers to the input in this discussion.) In going to configuration (9.2.2), M creates a new node nn, makes it a direct descendant of nA, and places a pointer PB to nn immediately below and above A. Then M places B on top of the pushdown list. In (9.2.3) M returns to A in state success. In going to configuration (9.2.4), M creates a direct descendant of nA for each symbol of yiEy2Fy3 and orders them from the left. Node nB, which has already been created, becomes the node for E or F, whichever is B. Let the node for the other of E and F be n c. Then in configuration (9.2.4), M has replaced ApBpA by Cpc, where Pc is a pointer to nc. Thus if E = C and F = B, then at configuration (9.2.4), node nA is root of the following subtree" HA
y~
nc
Y2
nB
Y3
752
TRANSLATION AND CODE GENERATION
CHAP. 9
(b) If, on the other hand, B returns failure in rule A---~ B[C, D], M will make the following sequence of moves" (9.2.1)
(begin, u 11" uzu3, (A, i)pAT')
(9.2.2)
I---(begin, u l 1" U2U3, (B, j)pa(A, i)PBPAT')
(9.2.5)
I-- (failure, u I l" U2U3' (A, i)pBPAT')
(9.2.6)
1--- (begin, u l I" u 2u3, (D, i)PDT')
In going from configuration (9.2.5) to (9.2.6), M first deletes n n and all its descendants from the output tree, using the pointer PB below A to locate riB. Then M creates a direct descendant of nA for each symbol of y4Dys in order from the left and replaces ApBPA on top of its pushdown list by DpD, where PD is the pointer to the node created f o r D. (2) If A --~ a, y is a rule a in Z U [e}, then M would make one of the following moves: (a) (begin, u 1" av, (A, i)PAr) ~ (success, ua 1" v, r). In this move a direct descendant o f nA (the node pointed to by PA) would be created for each symbol of y. If y = e, then one node, labeled e, would be created as a direct descendant. (b) (begin, u 1" v, (A, i)PhT) ~ (failure, u' 1" v', 7), where [ u' l = i and v does not begin with an a. (3) If A ~ f, e is a rule, then M would make the move in (2b). (4) If M' reads its entire input w and erases all symbols on its pushdown list, then the tree constructed with root nr will be the translation of w. [~] THEOREM 9 . 6
Algorithm 9.3 defines a modified parsing machine which for an input w produces a tree whose frontier is the translation of w.
Proof This is another straightforward inductive argument on the number of moves made by M. The inductive hypothesis is that if M, started in state begin with A and a pointer p to node n on top of its pushdown list and uv to the right of its input head, uses input u with the net effect of deleting A and p from the stack and ending in state success, then the node pointed to by p will be the root of a subtree with frontier y such that A *~ (u l" v, y, s). Example 9.9
Let us show how a parsing machine would implement the translation of Example 9.8. The usual format will be followed. Configurations will be indicated, followed by the constructed tree. Moves representing recognition by C1, V, and L 1 of consonants, vowels, and letters will not be shown.
EXERCISES
State
Input
753
Pushdown List
begin begin begin
]" and [" and ]" and
ClP3C2P3P2Wp2Pl
failure begin failure begin begin
]" and l" and r and r and r and
Czp3pzWp2pl Fp 4 W e 2 P 1 Wp 2p 1 Xp 5 Vp 6 Xp 6p s
Wp 1 C2p2 Wp2p 1
,
success begin
begin
a ]" nd a l" n d a I" nd
Xp 6 p
s
L 2P 7
Llp8LzpsP7
.
,
success
begin begin
a n I" d an r d
L 2P 8P 7 L2 p 9
an l" d
LlploL2PIop9
°
success
begin begin
and ]" and r and ["
L2 p 10p 9 L2 p 11 LlplzLzplZpll
.
failure
begin success
and I" and [" and I"
L2 p 12p 11 Sp 13 e
We show the tree after every fourth-listed configuration in Fig. 9.1 l ( a ) -
(e). A l t h o u g h we shall not discuss it, the techniques suggested in A l g o r i t h m 9.3 are applicable to the other t o p - d o w n parsing algorithms, such as A l g o r i t h m 4.1 a n d t r a n s l a t i o n built on a T D P L p r o g r a m .
EXERCISES
9.2.1.
Construct a pushdown transducer to implement the following simple SDTS"
(a)
(b)
) l .~3r_~ (c)
(d)
y)
(
(e)
t The empty string is meant here; it is the translation of S
Fig. 9.11 Translation of pig Latin by parsing machine. 754
EXERCISES
E
> E + T, E T ' A D D ; '
E
>T,
T
T
> T , F,
TF 'MPY;'
T
>F,
F
F
> F 1" P,
FP 'EXP;'
F
> P,
P
P
> (E),
E
P - - - + a,
"755
' L O A D a;'
The transducer should be based on an SLR(1) parsing algorithm. Give the sequence of output strings during the processing of
(a) a l " a , ( a + a ) . (b) a -4- a • a 1"a. 9.2.2.
Repeat Exercise 9.2.1 for a pushdown processor parsing in an LL(1) manner. Modify the underlying grammar when necessary.
9.2.3.
Show how a pushdown transducer working as an LL(1) parser would translate the following strings according to the simple SDTS E ~
> TE',
E'
> -t- TE', T 'ADD;' E'
E ' -----> e, T ~
FT',
TE"
e FT"
T" ~
• F T ' , F ' M P Y ; ' T'
Z' ~
e,
e
F----+ (E),
E
F~
' L O A D a ;'
a,
(a) a , (a + a). (b) ((a + a ) , a) -4- a.
9.2.4.
Show how a pushdown processor would translate the word abbaaaa according to the SDTS S ~
aAE+T
Et : E l - l - T 1 E~ = E ~ + T~
E
>T
E1 =T1 E~ = T~
T
>T*F
T1 : T i * F 1 T~ = T~ , F~ + (T~) , F,
T
>F
T~ =F~
r
>(E)
F~ =(E~)
F
> sin(E)
Fl
sin(El)
:
F~ = cos(El) • (E~)
FF
> cos(E) >x
F~
= Cos(E1)
Fz
:
--sin(El)
• (E2)
F~ : x F2=l
F
>0
Fi : 0
F~=0 F
>1
F1:1
F~=0 We leave for the Exercises a proof that if (0~, fl) is in z(T), then fl is the derivative of ~. fl may contain some redundant parentheses. The derivation tree for sin(cos(x)) + x is given in Fig. 9.14. The values of the translation symbols at each of the interior nodes are listed below: Nodes
E l , T1, o r F1
Ez, Tz or Fz
n3, nz nl 2, nl 1, Hi 0 rig, n8, n7 n6, n5, n4 nl
x x
1 1 --sin(x) • (1) cos(cos(x)) • (--sin(x) • (1)) cos(cos(x)) • (--sin(x) • (1)) + 1
COS(X) sin(cos(x)) sin(cos(x)) + x
==
.
.
.
.
762
TRANSLATIONAND CODE GENERATION
CHAP. 9
n2
n3
n7
n8
n9
r/10
nil
nl 2
Fig. 9.14 Derivation tree for sin (cos (x)) + x. The implementation of a GSDTS is not much different from that of an SDTS using Algorithms 9.1 and 9.2. We shall generalize Algorithm 9.1 to produce as output a directed acyclic graph (dag) rather than a tree. It is then possible to "walk" over the dag in such a way that the desired translation can be obtained. ALGORITHM 9.4
Bottom-up execution of a GSDTS.
Input. A semantically unambiguous GSDTS T = (N, X, A, I', R, S), whose underlying grammar G is LR(k), and an input string x ~ X*.
SEC. 9.3
GENERALIZED TRANSLATION SCHEMES
763
Output. A dag from which we can recover the output y such that (x, y) is in T.
Method. Let ~ be an LR(k) parser for G. We shall construct a pushdown processor M with an endmarker, which will simulate ~ and construct a dag. If ~ has nonterminal A on its pushdown list, M will place below A one pointer for each translation symbol A l ~ r'. Thus, corresponding to a node labeled A on the parse tree will be as many nodes of the dag as there are translations of A, i.e., symbols Al in r'. The action of M is as follows: (!) If ~ shifts, M does the same. (2) Suppose that ~ is about to reduce according to production A ---~ 0c, with translation elements A1 = fl 1,- - . , Am = fl~. At this point M will have on top of its pushdown list, and immediately below each nonterminal in there will be pointers toeach of the translations of that nonterminal. When M makes the reduction, M first creates m nodes, one for each translation symbol A r The direct descendants of these nodes are determined by the symbols in i l l , . . . , tim" New nodes for output symbols are created. The node for translation symbol B k ~ F is the node indicated by that pointer below the nonterminal B in ~ which represents the kth translation of B. (As usual, if there is more than one B in ~, the particular instance of B referred to will be indicated in the translation element by a superscript.) In making the reduction, M replaces ~ and its pointers by A and the m pointers to the translations for A. For example, suppose that ~ reduces according to the underlying production of the rule A-
> BaC,
A I = bBiCzBi,
A 2 = CicB z
Then M would have the string PB,Ps,Bapc, Pc iC on top of its pushdown list (C is on top), where the p's are pointers to nodes representing translations. In making the reduction, M would replace this string by pA,PalA, where PAl and p~, are pointers to the first and second translations of A. After the reduction the output dag is as shown in Fig. 9.15, We assume that Px, points to the node for X~.
Fig. 9.15 Portion of output dag.
764
TRANSLATION AND CODE GENERATION
CHAP. 9
(3) If M has reached the end of the input string and its pushdown list contains S and some pointers, then the pointer to the node for $1 is the root of the desired output dag. [Z] We shall delay a complete example until we have discussed the interpretation of the dag. Apparently, each node of the dag "represents" the value of a translation symbol at some node of the parse tree. But how can that value be produced from the dag ? It should be evident from the definition of the translation associated with a dag that the value represented by a node n should be the concatenation, in order from the left, of the values represented by the direct descendants of node n. Note that two or more direct descendants may in fact be the same node. In that case, the value of that node is to be repeated. With the above in mind, it should be clear that the following method of walking through a dag will produce the desired output. ALGORITHM 9.5 Translation from dag. Input. A dag with leaves labeled and a single root. Output. A sequence of the symbols used to label the leaves. Method. We shall use arecursive procedure R(n), where n is a node of the dag. Initially, R(no) is called, where n o is the root. Procedure R(n).
(1) Let al, a 2 , . . . , am be the edges leaving n in this order. Do step (2) for al, az, . . . , am in order. (2) Let a i be the current edge. (a) If ai enters a leaf, emit the label of that leaf. (b) If a~ enters an interior node n', perform R(n'). [~ THEOREM 9.7 If Algorithm 9.5 is applied to the dag produced by Algorithm 9.4, then the output of Algorithm 9.5 is the translation of the input x to Algorithm 9.4. P r o o f Each node n produced by Algorithm 9.4 corresponds in an obvious way to the value of a translation symbol at a particular node of the parse tree for x. A straightforward induction on the height of a node shows that R(n) does produce the value of that translation symbol. Example 9.12
Let us apply Algorithms 9.4 and 9.5 to the GSDTS of Example 9.10 (p. 759), with input aaaa. The sequence of configurations of the processor is [with LR(1) tables omitted, as usual]
SEe. 9.3
GENERALIZED TRANSLATION SCHEMES
Pushdown List (1)
(2) (3) (4) (5) (6) (7)
e a aa aaa aaaa aaap 1p 2S aap3P4S
(8) apsp6S (9) p7psS
765
Input aaaa$ aaa$ aa$ aS $ $ $
$ $
The trees constructed after steps 6, 7, and 9 are shown in Fig. 9.16(a)(c). Nodes on the left correspond to values of S~ and those on the right to values of $2. The application of Algorithm 9.5 to the dag of Fig. 9.16(c) requires many invocations of the procedure R(n). We begin with node n x, since that corresponds to $1. The sequence of calls of R ( n ) and the generations of output a will be listed. A call of R(n~) iS indicated simply as n t. The sequence is nln3nsn7ansansaan6nsaan6nsaaan4n6nsaaanan6nsaaaa.
9.3.2.
[~
Varieties of Translations
Heretofore, we have considered only string-valued translation variables in a translation scheme. The same principles that enabled us to define SDTS's and GSDTS's will allow us to define and implement translation schemes containing arithmetic and Boolean variables, for example, in addition to string variables. The strings produced in the course of code generation can fall into several different classes" (1) Machine or assembly code which is to be output of the compiler. (2) Diagnostic messages; also output of the compiler but not in the same stream as (1). (3) Instructions indicating that certain operations should be performed on data managed by the compiler itself. Under (3), we would include instructions which do bookkeeping operations, arithmetic operations on certain variables used by the compiler other than in the parsing mechanism, and operations that allocate storage and generate new labels for output statements. We defer discussion of when these instructions are executed to the next section. A few examples of types of translations which we might find useful during code generation are (1) Elements of a finite set of modes (real, integer, and so forth) to indicate the mode of an arithmetic expression,
766
CHAP. 9
TRANSLATION AND CODE GENERATION
I.~!
l~:1
k,
k...~
(a)
(b)
l,,o]
n3
n5
(c) Fig. 9.16
Dag constructed by Algorithm 9.4.
SEC. 9.3
GENERALIZED TRANSLATION SCHEMES
767
(2) Strings representing the labels of certain statements when compiling flow of control structures (if-then-else, for example), and (3) Integers indicating the height of a node in the parse tree. We shall generalize the types of formulas that can be used in a syntaxdirected translation scheme to compute translations. Of course, when dealing with numbers or Boolean variables, we shall use Boolean and arithmetic operators to express translations. However, we find it convenient to also use conditional statements of the form if B then E1 else E2, where B is a Boolean expression and E1 and E2 are arbitrary expressions, including conditionals. For example, we might have a production ,4--~ BC, where B has a Boolean translation B~ and C has two string-valued, translations C~ and C2. The formula for the translation of A 1 might be if B~ then C~ else Cz. That is, if the left direct descendant of the node whose translation A~ is being computed has translation B~ ----true, then take C1, the first translation of the right direct descendant, as An; otherwise, take C2. Alternatively, B might have an integer-valued translation B~, and the formula for A2 might be if B2 = 1 then C~ else C2. This statement would cause A s to be the same as C~ unless Bz had the value 1. We observe that it is not hard to generalize Algorithm 9.4, a bottom-up algorithm, to incorporate the possibility that certain translations are numbers, Boolean values, or elements of some finite set. In a bottom-up scheme the formulas for these variables (and the string-valued variables) are evaluated only when all arguments are known. In fact, it will often be the case that all but one of the translations associated with a node are Boolean, integer or elements of a finite set. If the remaining (string-valued) translation can be produced by rules that are essentially postfix simple SDTS rules (with conditionals, perhaps), then we can implement the entire translation on a DPDT which keeps the non, string translations on its pushdown list. These translations can be easily "attached" to the pushdown cells holding the nonterminals, there being an obvious connection between nonterminals on the pushdown list and interior nodes of the parse tree. If some translation is integer-valued, we have gone outside the DPDT formalism, but the extension should pose no problem to the person implementing the translator on a computer. However, generalizing Algorithms 9.2 and 9.3, which are top-down, is not as easy. When only strings are to be constructed, we have allowed the formulas to be expanded implicitly, i.e., with pointers to nodes which will eventually represent the desired string. However, it may not be possible to treat arithmetic or conditional formulas in the same way. Referring to Algorithm 9.2, we can make the following modification. If a nonterminal A on top of the pushdown list is expanded, we leave the pointer immediately below A on the pushdown list. This pointer indicates the node n which this instance of A represents. When the expansion of A is
768
TRANSLATIONAND CODEGENERATION
CHAP. 9
complete, the pointer will again be at the top of the pushdown list, and the translations for all descendants of n will have been computed. (We can show this inductively.) It is then possible to compute the translation of n exactly as if the parse were bottom-up. We leave the details of such a generalization for the Exercises. We conclude this section with several examples of more general translation schemes. Example 9.13
We shall elaborate on Section 1.2.5, wherein we spoke of the generation of code for arithmetic expresgions for a sin.gle accumulator random access machine. Specifically, we assume that the assembly language instructions ADD
~z
MPY
~z
LOAD
~z
STORE are available and have the obvious meanings. We shall base our translation on the grammar G 0. Nonterminals E, T, and F will each have ,two translation elements. The first will produce a string of output characters that will cause the value of the corresponding input expression to be brought to the accumulator. The second translation will be an integer, representing the height of the node in a parse tree. We shall not, however, count the productions E--~ T, T--~ F, or F---~ ( E ) w h e n determining height. Since the only need for this height measure is to determine a safe temporary location name, it is permissible to disregard these productions. Put another way, we shall really measure height in the syntax tree. The six productions and their translation elements are listed below: (1)
E
~.E+T
E1 = Ti ' ; S T O R E $' Ez ';' E, ' ; A D D $' Ez E2 ----max(Ez, T2) + 1
(2)
E
>T
El = Tit E~ = T~
(3)
T
~ T, F
T1 = F1 ' ; S T O R E $'T2 ';' T1 ' ; M P Y $'Tz T2 = max(T2, F2) + 1
t If we were working from the syntax tree, rather than the parse tree, reductions by E ~ T, T ~ F and F ~ (E) would not appear. We would then have no need to implement the trivial translation rules associated with these productions.
GENERALIZED TRANSLATION SCHEM ES
SEC. 9.3
(4)
T
:> F
7 69
T, = F,
T~ = F~
(5)
F~
(E)
F, = E, Fz = E z
(6)
F~
a
F~ = 'LOAD' NAME(a) F~=I
Again we use the SNOBOL convention in the translation elements for the definition of strings. Quotes surround strings that denote themselves. Unquoted symbols denote their current value, which is to be substituted for that symbol. Thus, in rule (1), the translation element for E1 states that the value of E1 is to be the concatenation of (1) The value of T1 followed by a semicolon; (2) The instruction STORE with an address $n, where n is the height of the node for E (on the right-hand side of the production), followed by a semicolon; (3) The value of E1 (the left argument of + ) followed by a semicolon; and (4) The instruction ADD Sn, where n is the same as in (2). Here $n is intended to be a temporary location, and the translation for E1 (on the right-hand side of the translation element) will not use that location because of the way Ez, T2, and Fz are handled. The rule for production (6) uses the function NAME(a), where NAME is a bookkeeping function that retrieves the internal (to the compiler) name of the identifier represented by the token a. Recall that the terminal symbols of all our grammars are intended to be tokens, if a token represents an identifier, the token will contain a pointer to the place in the symbol table where information about the particular identifier is stored. This information tells which identifier the token really represents, as well as giving attributes of that identifier. The parse tree of (a + a) • (a + a), with values of some of the translation symbols, is shown in Fig. 9.17. We assume that the function NAME(a) yields a~, a2, a3, and a 4, respectively, for the four identifiers represented bya. [~] Example 9.14
The code produced by Example 9.13 is by no means optimal. A considerable improvement can be made by observing that if the right operand is a single identifier, we need not load it and store it in a temporary. We shall therefore add a third translation for E, T, and F, which is a Boolean variable
770
TRANSLATION AND CODE GENERATION
CHAP. 9
T 1 = LOAD a 4 • STORE $1" LOAD a 3 • ADD $1" STORE $2; LOAD a 2 • STORE $1" L O A D a I • ADD $ 1 MPY $2 T . = ~ "~
E1 = L O A D a 2 . S T O R E $ t. L O A D a 1 ' ADD $1
t ~ t~ / ~ T ~
E1 = L O A D a 4 . S T O R E $ 1 . L O A D a 3 ' ADD $1
E2=2
/
E2= 2
F2 =1 F1 = L O A D a l y F2= F1= LOADa2 1
Fig. 9.17
~F1
F2== LOADa3~ 1 FI = LlOA~Da4 F2=
P a r s e tree a n d t r a n s l a t i o n s .
with value true if and only if the expression dominated by the node is a single identifier. The first translation of E, T, and F is again code to compute the expression. However, if the expression is a single identifier, then this translation is only the " N A M E " of that identifier. Thus, the translation scheme does not "work" for single identifiers. This should cause little trouble, since the expression grammar is presumably part of a grammar for assignment and the translation for assignments such as A ~ B can be handled at a higher level. The new rules are the following:
SEC. 9.3 (1)
E
GENERALIZED TRANSLATION SCHEMES
> E -q- T
771
E, : if T3 then
if E3 then 'LOAD' E1 ';ADD' T1 #
else Ei ';ADD' T, else if E3 then T, ';STORE $1 ;LOAD' E,
';ADD $1' else T, ';STORE $' E2 ';' E, ';ADD $' E2
Ez : max(E2, T2) -t- i E3 : false (2)
E
>T
E, = T, E2 = T2 E3 = T3
(3)
T~T,F
T, = i f F 3 t h e n if T3 then 'LOAD' T, ';MPY' F1 else T~ ';MPY' F1 else if T3 then F~ ';STORE $1 ;LOAD' T,
';MPY $1' else F, ';STORE $' T2 ';' T~ ';MPY $' T2
Tz : max(T2, Fz) -+- 1 T3 : false (4)
T-
>F
T, : F , T~ :
F~
T~ = FF~
(5)
F
> (E)
F, = E,
F2 : E2 F3 : E~
(6)
F
>a
F, : NAME(a) F2:l
F3 : true In rule (1), the formula for E, checks whether either or both arguments are single identifiers. If the right-hand argument is a single identifier, then the code generated causes the left-hand argument to be computed and the right-hand argument to be added to the accumulator. If the left-hand argument is a single identifier, then the code generated for this argument is the
772
TRANSLATION AND CODE GENERATION
CHAP. 9
identifier name. Thus, 'LOAD' must be prefixed to it. Note that Ez -- 1 in this case, and so $1 can be used as the temporary store rather than '$' Ez (which would be $1 in this case anyway). The code produced for (a + a) • (a + a) is LOAD a3; A D D a4; STORE $2; LOAD al; A D D az; MPY $2 Note that these rules do not assume that + and • are commutative.
[--]
Our next example again deals with arithmetic expressions. It shows how if three address code is chosen as the intermediate language, we can write what is essentially a simple SDTS, implementable by a deterministic pushdown transducer that holds some extra information at the cells of its pushdown list. Example 9.15
Let us translate L(Go) to a sequence of three address statements of the form A ~ + BC and A ~ • BC, meaning that A is to be assigned the sum or product, respectively, of B and C. In this example A will be a string of the form $i, where i is an integer. The principal translations, El, T1, and FI, will be a sequence of three address statements which evaluate the expression dominated by the node in question; E2, T2, and F2 are integers indicating levels, as in the previous examples. E3, T3, and F3 will be the name of a variable which has been assigned the value of the expression by the aforementioned code. This name is a program variable in the case that the expression is a single identifier and a temporary name otherwise. The following is the translation scheme" E
~E+T
Ei -~ EiT1 '$' max(E2, T2) '<
+ ' E3T3 ';'
E~ : max(E2, T2) + 1 E3 = '$' max(E2, Tz) E
>T
Ei
T1
E~ =T~ E3 =T~ T
>T*F
T1 : T1Fi '$' max(T2, F2)'<
T~ : T
.'F
max(T2, F2) + 1
T3 : '$' max(T2, F2) Ti = F , T~ T~ =F~
,' T3F3 ';'
SEC. 9.3
GENERALIZED TRANSLATION SCHEMES
F
>(E)
773
F1 = E l
F~=E~ F3 = E3 F---~ a
Fi = e F2=l F~ = NAME(a)
As an example, the output for the input a, • (a2 -t- a3) is $1 < $2<
-t- a2a3; -,a,$1;
We leave it to the reader to observe that the rules for E,, T,, and F, form a postfix simple SDTS if we assume that the values of second and third translations of E, T, and F are output symbols. A practical method of implementation is to parse Go in an LR(1) manner by a D P D T which keeps the values of the second and third translations on its stack. That is, each pushdown cell holding E will also hold the values of Ez and E3 for the associated node of the parse tree (and similarly for cells holding T and F). The translation is implemented by emitting 'S' max(E2, T2)'<
+' E3T3 ';'
every time a reduction of E -t- T to E is called for, where Ez, E3, Tz, and T3 are the values attached to the pushdown cells involved in the reduction. Reductions by T---, T , F are treated analogously, and nothing is emitted when other reductions are made. We should observe that since the second and third translations of E, T, and F can assume an infinity of values, the device doing the translation is not, strictly speaking, a DPDT. However, the extension is easy to implement in practice on a random access computer. D Example 9.16
We shall generate assembly code for control statements of the if-thenelse form. We presume that the nonterminal S stands for a statement and that one of its productions is S ~ if B then S else S. We suppose that S has a translation S, which is code to execute that statement. Thus, if a production for S other than the one shown is used, we can presume that S, is correctly computed. Let us assume that B stands for a Boolean expression and that it has two translations, B, and B2, which are computed by other rules of the translation
~i i ~i~
i
774
CHAP. 9
TRANSLATION AND CODE GENERATION
system. Specifically, B~ is code that causes a jump to location B z if the Boolean expression has value false. To generate the expected code, S will need another translation S 2, which is the location to be transferred to after execution of S is finished. We assume that our computer has an instruction JUMP ~ which transfers control to the location named a. To generate names for these locations, we shall assume the existence of a function NEWLABEL, which when invoked generates a name for a label that has never been used before. For example, the compiler might keep a global integer i available. Each time NEWLABEL is invoked, it might increment i by 1 and return the name $$i. The function NEWLABEL is not actually invoked for this S-production but will be invoked for some other S-productions and for B-productions. We also make use of a convenient assembler pseudo-operation, the likes of which is found in most assembly languages. This assembly instruction is of the form EQUAL
tz, fl
It generates n o code but causes the assembler to treat the locations named and fl as the same location. The EQUAL instruction is needed because the two instances of S on the right-hand side of the production S ~ if B then S else S each have a name for the instruction they expect to execute next. We must make sure that a location allotted for one serves for the other as well. The translation elements for the production mentioned are S
> if B then S el(L)
L
>E, LIE
CHAP. 9
That is, a statement can be the keyword call followed by an identifier and an argument list (A). The argument list can be the empty string or a parenthesized list (L) of expressions. We assume that each expression E has two translations El and E2, where the translation of E~ is object code that leaves the value of E in the accumulator and the translation of E2 is the name of a temporary location for storing the value of E. The names for the temporaries are created by the function NEWLABEL. The following rules perform the desired translation: S
> call a A
$1 -- A 1 ';CALL' NAME(a) A2
A-
>(L)
A1 --L 1 A z --
A
>e
' ", L2
A 1 --e
A2 :e
L
~E,L
L1 -- E1 ';STORE' Ez ';' L1 "Lz -- 'ARG' Ez ';' L2
L-
>E
L~ = Et ';STORE' Ez L z = 'ARG' Ez
For example, the statement call AB(E (1), E (2)) would, if the temporaries for E (1) and E (2) were $$1 and $$2, respectively, be compiled into code for E _ c ll x V. (3) The range of T is finite.
EXERCISES
783
9.3.4.
Show that the translation [(an, am) l m is the integer part of,v/-~ -} cannot be defined by any GSDTS.
9.3.5.
For Exercise 9.3.1(a)-(c), give the dags produced by Algorithm 9.4 with inputs a 3, a 4, and 011, respectively.
9.3.6.
Give the sequence of nodes visited by Algorithm 9.5 when applied to the three dags of Exercise 9.3.5.
*9.3.7.
Embellish Example 9.16 to include the production S ~ while B do S, with the intended meaning that alternately expression B is to be tested, and then the statement S done, until such time as the value of B becomes false.
9.3.8.
The following grammar generates PL/I-like declarations: D ~
(L)M
L - - - ~ a, L I D , L I a I D
M--->" mt Im2l "-" Imk The terminal a stands for any identifier; m 1 , . . . , mk are the k possible attributes of identifiers in the language. Comma and parentheses are also terminal symbols. L stands for a list of identifiers and declarations; D is a declaration, which consists of a parenthesized list of identifiers and declarations. The intention is that the attribute derived from M is to apply to all those identifiers generated by L in the production D ~ ( L ) M even if the identifier is within a nested declaration. For example, the declaration (aa, (a2, a3)ma)m2 assigns attribute ml to a2 and aa and attribute m2 to ax, a2, and a3. Build a translation scheme, involving any type of translation, that will translate strings generated by D into k lists; list i is to hold exactly those identifiers which are given attribute mr. 9.3.9. 9.3.10.
Modify Example 9.17 to place values of the expressions, rather than pointers to those values, in the ARG instructions. Consider the following translation scheme" N--+
L E(adop)T[ T
786
CHAP. 9
TRANSLATION AND CODE GENERATION
T
~ T(mulop)Fl F
F----~ (E) IX I ~
ala(L)
L----~E,
(adop) ~ (mulop~ ~
LIE
+l-,[/
An example of a statement generated by this grammar would be a(a, a) • = a(a + a, a ,
(a + a)) + a
Here, a is a token representing an identifier. Construct a translation scheme based on this grammar that will generate suitable assembly or multiple address code for assignment statements. 9.3.18. • '9.3.19.
Show that the GSDTS of Example 9.11 correctly produces an expression for the derivative of its input expression. Show that {(al... a,bl. . . b,, a l b i . . . a,b,) l n ~ 1, ai ~ [0, 1} and bi ~ {2, 3} for 1 < i ~ n] is not definable by a GSDTS.
Research Problem
9.3.20.
Translation of arithmetic expressions can become quite complicated if operands can be of many different data types. For example, we could be dealing with identifiers that could be Boolean, string, integer, real, or complex--the last three in single, double, or perhaps higher precisions. Moreover, some identifiers could be in dynamically allocated storage, while others are statically allocated. The number of c9mbinations can easily be large enough to make the translation elements associated with a production such as E ~ E + T quite cumbersome. Given a translation which spells out the desired code for each case, can you develop an automatic way of simplifying the notation ? For instance, in Example 9.19, the then and else portions of the translation of E~ 1~ differ only in the single character A or B at the end of ADD. Thus, almost a factor of 2 in space could be saved if a more versatile defining mechanism were used.
Programming
"9.3.21.
Exercise
Construct a translation scheme that maps a subset of F O R T R A N into intermediate language as in Exercise 9.1.10. Write a program to implement this translation scheme. Implement the code generator designed in Exercise 9.1.11. Combine these programs with a lexical analyzer to
BIBLIOGRAPHIC NOTES
787
produce a compiler for the subset of FORTRAN. Design test strings that will check the correct operation of each program.
BIBLIOGRAPHIC
NOTES
Generalized syntax-directed translation schemes are discussed by Aho and Ullman [1971], and a solution to Exercise 9.3.3 can be found there. Knuth [1968b] defined translation schemes with inherited and synthesized attributes. The GSDT's in Exercises 9.3.10 and 9.3.11 are discussed there. The solution to Exercise 9.3.14 is found in Bruno and Burkhard [1970] and Knuth [1968b]. Exercise 9.3.12 is from Knuth [1968b].
10
BOOKKEEPING
This chapter discusses methods by which information can be quickly stored in tables and accessed from these tables. The primary application of these techniques is the storage of information about tokens during lexical analysis and the retrieval of this information during code generation. We shall discuss two ideas-asimple information retrieval techniques and the formalism of property grammars. The latter is a method of associating attributes and identifiers while ensuring that the proper information about each identifier is available at each node of the parse tree for translation purposes.
10.1.
S Y M B O L TABLES
The term symbol table is given to a table which stores names and information associated with these names. Symbol tables are an integral feature of virtually all compilers. A symbol table is pictured in Fig. 10.1. The entries in the name field are usually identifiers. If names can be of different lengths, then it is more convenient for the entry in the name field to be a pointer to a storage area in which the names are actually stored. The entries in the data field, sometimes called descriptors, provide information that has been collected about each name. In some situations a dozen or more pieces of information are associated with a given name. For example, we might need to know the data type (real, integer, string, and so forth) of an identifier; whether it was perhaps a label, a procedure name, or a formal parameter of a procedure; whether it was to be given statically or dynamically allocated storage; or whether it was an identifier with structure (e.g., an array), and if so, what the structure was (e.g., the dimensions of an array). If the number of pieces of information associated with a given name is vari788
SYMBOL TABLES
SEC. 10.1 NAME
789
DATA
I
INTEGER
LOOP
LABEL
REAL ARRAY
Fig. 10.1
Symbol table.
able, then it is again convenient to store a pointer in the data field to this information. 10.1.1.
Storage of Information About Tokens
A compiler Uses a symbol table to store information about tokens, particularly identifiers. This information is then used for two purposes. First, it is used to check for the semantic correctness of a source program. For example, if a statement of the form GOTO
LOOP
is found in the source program, then the compiler must check that the identifier LOOP appears as the label of an appropriate statement in the program. This information will be found in the symbol table (although not necessarily at the time at which the goto statement is processed). The second use of the information in the symbol table is in generating code. For example, if we have a F O R T R A N statement of the form
A=Bq--C in the source program, then the code that is generated for the operator + depends on the attributes of identifiers B and C (e.g., are they fixed- or floating-point, in or out of "common," and so forth). The lexical analyzer enters names and information into the symbol table. For example, whenever the lexical analyzer discovers an identifier, it consults the symbol table to see whether this token has previously been used. If not, the lexical analyzer inserts the name of this identifier into the symbol table along with any associated information. If the identifier is already present in the symbol table at some location/, then the lexical analyzer produces the token (~identifier), l) as output. Thus, every time the lexical analyzer finds a token, it consults the symbol table. Therefore, to design an efficient compiler, we must, given an instance of an identifier, be able to rapidly determine whether or not a location in the gymbol table hag been regerved for that identifier. If no such entry exists, we m u s t then be able to insert the identifier quickly into the .table.
790
BOOKKEEPING
CHAP. 10
Example 10.1
Let us suppose that we are compiling a F O R T R A N - l i k e language and wish to use a single token type (identifier) for all variable names. When the (direct) lexical analyzer first encounters an identifier, it could enter into a symbol table information as to whether this identifier was fixed- or floatingpoint. The lexical analyzer obtains the information by observing the first letter of the identifier. Of course, a previous declaration of the identifier to be a function or subroutine or not to obey the usual fixed-floating convention would already appear in the symbol table and would overrule the attempt by the lexical analyzer to store its information. [~ Example 10.2
Let us suppose that we are compiling a language in which array declarations are defined by the following productions" (array statement)
> array (array list~
{array list)
> (array definition), (array list) I(array definition)
~array definition)
> (identifier) ((integer))
An example of an array declaration in this language is (10.1.1)
array AB(10), CD(20)
For simplicity, we are assuming that arrays are one-dimensional here. The parse tree for statement (10.1.1) is shown in Fig. 10.2, treating A B and CD identifiers and 10 and 20 as integers.
j
< array statement >
< array list >
array
< array list >
< array definition >
< identifier >l
(
< integer >1
)
< identifier >2
< array definition >
(
Fig. 10.2 Parse tree of array statement.
< integer >2
)
SEC. 10.1
SYMBOLTABLES
791
In this parse tree the tokens (identifier)l, (identifier)2, (integer)l, and (integer)2 represent AB, CD, 10, and 20, respectively. The array statement is, naturally, nonexecutable; it is not compiled into machine code. However, it makes sense to think of its translation as a sequence of bookkeeping steps to be executed immediately by the compiler. That is, if the translation of an identifier is a pointer to the place in the symbol table reserved for it and the translation of an integer is itself, then the syntaxdirected translation of a node labeled (array definition) can be instructions for the bookkeeping mechanism to record that the identifier is an array and that its size is measured by the integer. [~] The ways in which the information stored in the symbol table is used are numerous. As a simple example, every subexpression in an arithmetic expression may need mode information, so that the arithmetic operators can be interpreted as fixed, floating, complex, and so forth. This information is collected from the symbol table for those leaves which have identifier or constant labels and are passed up the tree by rules such as fixed + floating = floating, and floating q- complex = complex. Alternatively, the language, and hence the compiler, may prohibit mixed mode expressions altogether (e.g., as in some versions of FORTRAN). 10.1.2.
Storage Mechanisms
We conclude that there is a need in a compiler for a bookkeeping method which rapidly stores and retrieves information about a large number of different items (e.g., identifiers). Moreover, while the number of items that actually occur is large, say on the order of 100 to 1000 for a typical program, the number of possible items is orders of magnitude larger; most of the possible identifiers do not appear in a given program. Let us briefly consider possible ways of storing information in tables in order to better motivate the use of the hash or scatter storage tables, to be discussed in the next section. Our basic problem can be formulated as follows. We have a large collection of possible items that may occur. Here, an item can be considered to be a name consisting of a string of symbols. We encounter items in an unpredictable fashion, and the exact number of items to be encountered is unknown in advance. As each item is encountered, we wish to check a table to determine whether that item has been previously encountered and if it has not, to enter the name of the item into the table. In addition, there will be other information about items which we wish to store in the table. We might initially consider using a direct access table to store information about items. In such a table a distinct location is reserved for each possible item. Information concerning the item would be stored at that location, and the name of the item need not be entered in the table. If the number of possible items is small and a unique location can readily be assigned to each item,
792
BOOKKEEPING
CHAP. 10
then the direct access table provides a very fast mechanism for storingand retrieving information about items. However, we would quickly discard the idea of using a direct access table for most symbol table applications, since the size of the table would be prohibitive and most of it would never be used. For example, the number of F O R T R A N identifiers (a letter followed by up to five letters or digits) is about 1.33 × 109. Another possible method of storage is to use a pushdown list. If a new item is encountered, its name and a pointer to information concerning that item is pushed onto the pushdown list. Here, the size of the table is proportional to the number of items actually encountered, and new items can be inserted very quickly. However, the retrieval of information about an item requires that we search the list until the item is found. Thus, retrieval on the average requires time proportional to the number of items on the list. This technique is often adequate for small lists. In addition, it has advantages when a block-structured language is being compiled, as a new declaration of a variable can be pushed on top of an old one. When the block ends, all its declarations are popped off the list and the old declarations of the variables are still there. A third method, which is faster than the pushdown list, is to use a binary search tree. In a binary search tree each node can have a left direct descendant and a right direct descendant. We assume that data items can be linearly ordered by some relation < , e.g., the relation "precedes in alphabetical order." Items are stored as the labels of the nodes of the tree. When the first item, say ~1, is encountered, a root is created and labeled ~1. If ct2 is the next item and ~2 < ~1, then a leaf labeled ~2 is added to the tree and this leaf is made the left direct descendant of the root. (If 0cl < ~z, then this leaf would have been made the right direct descendant.) Each new item causes a new leaf to be added to the tree in such a position that at all times the tree will have the following property. Suppose that N is any node in the tree and that N is labeled ft. If node N has a left subtree containing a node labeled ~, then < ft. If node N has a right subtree with a node labeled 7, then fl < 7. N
75
75
SEC. 10.1
SYMBOLTABLES
793
The following algorithm can be used to insert items into a binary search tree. ALGORITHM 10.1 Insertion of items into a binary search tree.
Input. A sequence ¢z~,..., ¢z, of items from a set of items A with a linear order < on A. Output. A binary tree whose nodes are each labeled by one of 0c~. . . . , ~,, with the property that if node N is labeled cz and some descendant N' of N is labeled fl, then fl < 0c if a n d only if N ' is in the left subtree of N. Method. (1) Create a single node (the root) and label it el. (2) Suppose that e l , . . . , ei-1 have been placed in the tree, i > 0. If i = n + 1, halt. Otherwise, insert et in the tree by executing step (3) beginning at the root. (3) Let this step be executed at node N with label ft. (a) If e~ < fl and N has a left direct descendant, Nt, execute step (3) at NI. If N has no left direct descendant, create such a node and label it ei. Return to step (2). (b) If fl < 0~iand N has a right direct descendant, Nr, execute step (3) at Nr. If N has no right direct descendant, create such a node and label it ei. Return to step (2). [~ The method of retrieval of items is essentially the same as the method for insertion, except that one must check at each node encountered in step (3) whether the label of that node is the desired item. Example 10.3
Let the sequence of items input to Algorithm 10. l be XY, M, QB, ACE, and OP. We assume that the ordering is alphabetic. The tree constructed is shown in Fig. 10.3. [~ It can be shown that, after n items have been placed in a binary search tree, the expected number of nodes which must be searched to retrieve one of them is proportional to log n. This cost is acceptable, although hash tables, which we shall discuss next, give a faster expected retrieval time. 10.1.3.
Hash Tables
The most efficient and commonly used method for the bookkeeping necessary in a compiler is the hash table. A hash storage symbol table is shown schematically in Fig. 10.4.
794
BOOKKEEPING
CHAP. 10
Fig. 10.3
NAME
Binary search tree.
POINTER
Ol
Item
•
Ha
h(ot)
Information about ot
function
n-1 Hash Table
Data Storage Table
Fig. 10.4 Hash storage scheme.
T h e h a s h s t o r a g e m e c h a n i s m uses a hashing function, h, a hash table, a n d a data storage table. T h e h a s h t a b l e has n entries, w h e r e n is fixed beforeh a n d . E a c h e n t r y in t h e h a s h t a b l e consists o f t w o f i e l d s - - a n a m e field a n d
SEC. 10.1
SYMBOLTABLES
795
a pointer field. Initially, each entry in the hash table is assumed to be empty.t If an item tz has been encountered, then some location in the hash table, usually h(o¢), contains an entry whose name field contains tz (or possibly a pointer to a location in a name table in which tz is stored) and whose pointer field holds a pointer to a block in the data storage table containing the information associated with tz. The data storage table need not be physically distinct from the hash table. For example, if k words of information are needed for each item, then it is possible to use a hash table of size kn. Each item stored in the hash table would occupy a block of k consecutive words of storage. The appropriate location in the hash table for an item tz can then readily be found by multiplying h(~), the hash address for 0c, by k and using the resulting address as the location of the first word in the block of words for tz. The hashing function h is actually a list of functions h0, h ~ , . . . , hm, each from the set of items to the set of integers [0, 1 , . . . , n -- 1}. We shall call ho the primary hashing function. When a new item ~ is encountered, we can use the following algorithm to compute h(~t), the hash address of tz. If tz has been previously encountered, h(00 is the location in the hash table at which tz is stored. If ~ has not been encountered, then h(~) is an empty location into which ~ can be stored. ALGORITHM 10.2 Computation of a hash table address.
Input. An item ~, a hashing function h consisting of a sequence of functions h 0, h l,..., ., h meach from the set of items to the set of integers {0, 1 , . . . , n -- 1} and a (not necessarily empty) hash table with n locations. Output. The hash address h(00 and an indication of whether tz has been previously encountered. If tz has already been entered into the hash table, h(tz) is the location assigned to tz. If ~z has not been encountered, h(tz) is an empty location into which ~ is to be stored. Method. (1) We compute ho(tZ), h l ( t z ) , . . . , hm(tz) in order using step (2) until no "collision" occurs. If hm(tz) produces a collision, we terminate this algorithm with a failure indication. (2) Compute hi(tz) and do the following: (a) If location h,(ot) in the hash table is empty, let h(tz)= h,(oO, report that ~ has not been encountered, and halt. (b) If location hi(a) is not empty, check the name entry of this locatSometimes it is convenient to put the reserved words and standard functions in the symbol table at the start.
796
BOOKKEEPING
CHAP. 10
tion.t If the name is a, let h(a) = hi(a), report that a has already been entered, and halt. If the name is not a, a collision occurs and we repeat step (2) to compute the next alternate address.
Each time a location hi(a) is examined, we say that a probe of the table is made. When the hash table is sparsely filled, collisions are rare, and for a new item a, h(a) can be computed very quickly, usually just by evaluating the primary hashing function, ho(a). However, as the table fills, it becomes increasingly likely that for each new item a, ho(a) will already contain another item. Thus, collisions become more frequent as more items are inserted into the table, and thus the number of probes required to determine h(a) increases. However, it is possible to design hash tables whose overall performance is much superior to binary search trees. Ideally, for each distinct item encountered we would like the primary hashing function h 0 to yield a distinct location in the hash table. This, of course, is not generally feasible because the total number of possible items is usually much larger than n, the number of locations in the table. In practice, n will be somewhat larger than the number of distinct items expected. However, some course of action must be planned in case the table overflows. To store information about an item a, we first compute h(a). If 0c has not been previously encountered, we store the name a in the name field of location h(a). [If we are using a separate name table, we store a in the next empty location in the name table and put a pointer to this location in the name field of location h(a).] Then we seize the next available block of storage in the data storage table and put a pointer to this block in the pointer field of location h(a). We can then insert the information in this block of data storage table. Likewise, to fetch information about an item a, we can compute h(a), if it exists, by Algorithm 10.2. We can then use the pointer in the pointer field to locate the information in the data storage table associated with item a. Example 10.4
Let us choose n = 10 and let an item consist of any string of capital R o m a n letters. We define CODE(a), where a is a string of letters to be the sum of the "numerical value" of each letter in a, where A has numerical value of 1, B has value 2, and so forth. Let us define hj(a), for 0 < j _~ 9, ?If the name entry contains a pointer to a name table, we need to consult the name table to determine the actual name.
SEC. 10.1
SYMBOLTABLES
797
to be (CODE(00 q--j)mod 10.t Let us insert the items A, W, and EF into the hash table. We find that h 0 ( A ) = (1 mod 1 0 ) = 1, so A is assigned to location 1. Next, W is assigned to location h0(W ) = (23 mod 1 0 ) = 3. Then, EF is encountered. We find h0(EF ) = (5 ÷ 6)mod 10 -~ 1. Since location 1 is already occupied, we try h l ( E F ) = (5 .jr 6 + 1)rood 10-~ 2. Thus, EF is given location 2. The hash storage contents are now as depicted in Fig. 10.5.
Name
Pointer
A
-'----- - " " " " - * "
EF
-~.
W
data for A
data for W
data for EF 7 8 9 Fig. 10.5 Hash table contents. Now, suppose that we wish to determine whether item HX is in the table. We find h0(HX ) = 2. By Algorithm 10.2, we examine location 2 and find that a collision occurs, since location 2 is filled but not with HX. We then examine location hi(HX ) ~- 3 and have another collision. Finally, we compute hz(HX) -~ 4 and find that location 4 is empty. Thus, we can conclude that the item HX is not in the table. [~ 10.1.4.
Hashing Functions
It is desirable to use a primary hashing function h0 that scatters items uniformly throughout the hash table. Functions which do not map onto the entire range of locations or tend to favor certain locations, as well as those which are expensive to compute, should be avoided. Some commonly used primary hashing functions are the following. ta mod b is the remainder when a is divided by b.
798
BOOKKEEPING
CHAP. 10
(1) If an item α is spread across several computer words, we can numerically sum these words (or exclusive-or them) to obtain a single-word representation of α. We can then treat this word as a number, square it, and use the middle log2 n bits of the result as h0(α). Since the middle bits of the square depend on all symbols in the item, distinct items are likely to yield different addresses regardless of whether the items share common prefixes or suffixes.

(2) We can partition the single-word representation of α into sections of some fixed length (log2 n bits is common). These sections can then be summed, and the log2 n low-order bits of the sum determine h0(α).

(3) We can divide the single-word representation of α by the table size n and use the remainder as h0(α). (n should be a prime here.)
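As a rough illustration (not from the original text), the three methods can be sketched as follows; the 32-bit word size, the folding of characters into a word, and the particular table sizes are all assumptions made only for the sketch.

```python
# Sketches of the three primary hashing functions; the 32-bit word size and
# the choices n = 2**10 (methods 1 and 2) and n = 1009, a prime (method 3),
# are assumptions.
WORD = 32

def to_word(item):
    """Fold the bytes of an item into a single 32-bit word by exclusive-or."""
    w = 0
    for i, b in enumerate(item.encode()):
        w ^= b << (8 * (i % 4))
    return w & (2**WORD - 1)

def h0_midsquare(item, n=2**10):
    bits = n.bit_length() - 1                    # log2 n
    sq = to_word(item) ** 2
    return (sq >> (WORD - bits // 2)) & (n - 1)  # middle bits of the square

def h0_folding(item, n=2**10):
    bits = n.bit_length() - 1
    w, total = to_word(item), 0
    while w:
        total += w & (n - 1)                     # sum sections of log2 n bits
        w >>= bits
    return total & (n - 1)                       # low-order log2 n bits

def h0_division(item, n=1009):                   # n should be prime
    return to_word(item) % n
```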
Let us now consider methods of resolving collisions, that is, the design of the alternate functions h1, h2, ..., hm. First we note that hi(α) should be different from hj(α) for all i ≠ j. If hi(α) produces a collision, it would be senseless to then probe h_{i+r}(α) if h_{i+r}(α) = hi(α). Also, in most cases we want m = n - 1, because we always want to find an empty slot in the table if one exists. In general, the method used to resolve collisions can have a significant effect on the overall efficiency of a scatter store system.

The simplest, but as we shall see one of the worst, choices for the functions h1, h2, ..., h_{n-1} is to use

    hi(α) = (h0(α) + i) mod n    for 1 ≤ i ≤ n - 1.
Here we search forward from the primary location h0(α) until no collision occurs. If we reach location n - 1, we proceed to location 0. This method is simple to implement, but clusters tend to occur once several collisions are encountered. For example, given that h0(α) produces a collision, the probability that h1(α) will also produce a collision is greater than average.

A more efficient method of generating alternate addresses is to use hi(α) = (h0(α) + r_i) mod n for 1 ≤ i ≤ n - 1, where r_i is a pseudorandom number. The most rudimentary random number generator that generates every integer between 1 and n - 1 exactly once will often suffice. (See Exercise 10.1.8.) Each time the alternate functions are used, the random number generator is initialized to the same point. Thus, the same sequence r1, r2, ... is generated each time h is invoked, and the "random" number generator is quite predictable.

Other possibilities for the alternate functions are

    hi(α) = [i(h0(α) + 1)] mod n
and
    hi(α) = [h0(α) + ai² + bi] mod n,

where a and b are suitably chosen constants. A somewhat different method of resolving collisions, called chaining, is discussed in the Exercises.
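The probe sequences just described are easy to generate. The sketch below (invented names, not from the text) shows linear, pseudorandom, and quadratic-style alternates; the pseudorandom offsets use the generator of Exercise 10.1.8, which assumes n is a power of 2.

```python
# Probe-sequence generators for one item; h0 is the primary location, n the
# table size.  All function names here are invented for illustration.
def linear_probes(h0, n):
    return [(h0 + i) % n for i in range(n)]

def pseudorandom_probes(h0, n):
    """Offsets r_i from Exercise 10.1.8: R = 1; repeat R = 5R mod 4n, r = R // 4."""
    probes, r = [h0 % n], 1
    for _ in range(n - 1):
        r = (5 * r) % (4 * n)
        probes.append((h0 + r // 4) % n)
    return probes

def quadratic_probes(h0, n, a=1, b=1):
    return [h0 % n] + [(h0 + a * i * i + b * i) % n for i in range(1, n)]

print(linear_probes(7, 8))          # 7, 0, 1, 2, ... successive locations
print(pseudorandom_probes(7, 8))    # every location appears exactly once
print(quadratic_probes(7, 8))       # only about half the locations are distinct
```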
10.1.5. Hash Table Efficiency
We shall now address ourselves to the question, "How long, on the average, does it take to insert or retrieve data from a hash table?" A companion question is, "Given a set of probabilities for the various items, how should the functions h0, ..., h_{n-1} be selected to optimize the performance of the hash table?" Interestingly, there are a number of open questions in this area.

As we have noted, it is foolish to have duplications in the sequence of locations h0(α), ..., h_{n-1}(α) for any α. We shall thus assume that the best system of hashing functions avoids duplication. For example, the sequence h0, ..., h9 of Example 10.4 never causes the same location to be examined twice for the same item.

If n is the size of the hash table, with locations numbered from 0 to n - 1, and h0, ..., h_{n-1} are the hashing functions, we can associate with each item α a permutation Π_α of {0, ..., n - 1}, namely Π_α = [h0(α), ..., h_{n-1}(α)]. Thus, the first component of Π_α gives the primary location to be assigned to α, the second component gives the first alternate location, and so forth. If we know p_α, the probability of item α occurring, for all items α, we can define the probability of permutation Π to be Σ p_α, where the sum is taken over all items α such that Π_α = Π. Hereafter we shall assume that we have been given the probabilities of the permutations.

It should be clear that we can calculate the expected number of locations which must be examined to insert an item into a hash table, or to find an item, knowing only the probabilities of the permutations. It is not necessary to know the actual functions h0, ..., h_{n-1} to evaluate the efficiency of a particular hashing system. In this section we shall study properties of hashing functions as measured by the probabilities of the permutations. We are thus motivated to make the following definition, which abstracts the problem of hash table design to a question of what is a desirable set of probabilities for permutations.

DEFINITION
A hashing system is a number n, the table size, and a probability function p on permutations of the integers 0 through n - 1. We say that a hashing system is random if p(Π) = 1/n! for all permutations Π.
The important functions associated with a hashing system are

(1) p(i1 i2 ··· ik), the probability that some item will have i1 as its primary location, i2 as the first alternate, i3 as the next alternate, and so forth (each ij here is an integer between 0 and n - 1), and

(2) p({i1, i2, ..., ik}), the probability that a sequence of k items will fill exactly the set of locations {i1, ..., ik}.

The following formulas are easy to derive:

(10.1.2)    p(i1 ··· ik) = Σ p(i1 ··· ik i)    if k < n,

where the sum is taken over all i not among i1, ..., ik;

(10.1.3)    p(i1 ··· in) = p(Π),

where Π is the permutation [i1, ..., in];

(10.1.4)    p(S) = Σ_{i in S} p(S - {i}) Σ_w p(wi),

where S is any subset of {0, ..., n - 1} of size k and the rightmost sum is taken over all w such that w is any string of k - 1 or fewer distinct locations in S - {i}. We take p(∅) = 1.

Formula (10.1.2) allows us to compute the probability of any sequence of locations occurring as alternates, starting with the probabilities of the permutations. Formula (10.1.4) allows us to compute the probability that the locations assigned to k items will be exactly those in set S. This probability is the sum of the probability that the first k - 1 items will fill all locations in S except location i and that the last item will fill i, taken over all i in S.

Example 10.5
Let n = 3 and let the probabilities of the six permutations be

    Permutation    Probability
    [0, 1, 2]         .1
    [0, 2, 1]         .2
    [1, 0, 2]         .1
    [1, 2, 0]         .3
    [2, 0, 1]         .2
    [2, 1, 0]         .1
By (10.1.3) the probability that the string of locations 012 is generated by applying the hashing function to some item is just the probability of the permutation [0, 1, 2]. Thus, p(012) = .1. By (10.1.2) the probability that 01 is generated is p(012). Likewise, p(02) is the probability of permutation [0, 2, 1], which is .2. Using formula (10.1.2), p(0) = p(01) + p(02) = .3.
The probability of each string of length 2 or 3 is the probability of the unique permutation of which it is a prefix. The probabilities of strings 1 and 2 are, respectively, p(10) + p(12) = .4 and p(20) + p(21) = .3.

Let us now compute the probabilities that various sets of locations are filled. For sets S of one element, (10.1.4) reduces to p({i}) = p(i). Let us compute p({0, 1}) by (10.1.4). Direct substitution gives

    p({0, 1}) = p({0})[p(1) + p(01)] + p({1})[p(0) + p(10)]
              = .3[.4 + .1] + .4[.3 + .1] = .31

Similarly, we obtain p({0, 2}) = .30 and p({1, 2}) = .39.

To compute p({0, 1, 2}) by (10.1.4), we must evaluate

    p({0, 1})[p(2) + p(02) + p(12) + p(012) + p(102)]
      + p({0, 2})[p(1) + p(01) + p(21) + p(021) + p(201)]
      + p({1, 2})[p(0) + p(10) + p(20) + p(120) + p(210)]

which sums to 1, of course. □
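These calculations mechanize easily. The following is a small sketch (not from the text) that evaluates formulas (10.1.2) through (10.1.4) directly from the permutation probabilities of Example 10.5; all names are invented.

```python
# Sequence and set probabilities computed from the permutation probabilities
# of Example 10.5 (n = 3); a direct transcription of (10.1.2)-(10.1.4).
from itertools import permutations

P = {(0, 1, 2): .1, (0, 2, 1): .2, (1, 0, 2): .1,
     (1, 2, 0): .3, (2, 0, 1): .2, (2, 1, 0): .1}

def p_seq(w):
    """Probability that the probe sequence begins with the locations in w."""
    w = tuple(w)
    return sum(prob for perm, prob in P.items() if perm[:len(w)] == w)

def p_set(S):
    """Probability that a sequence of |S| items fills exactly the set S."""
    S = frozenset(S)
    if not S:
        return 1.0
    total = 0.0
    for i in S:
        rest = S - {i}
        # all strings w of |S| - 1 or fewer distinct locations drawn from rest
        ws = [w for r in range(len(S)) for w in permutations(rest, r)]
        total += p_set(rest) * sum(p_seq(w + (i,)) for w in ws)
    return total

print(round(p_set({0, 1}), 2), round(p_set({0, 2}), 2), round(p_set({1, 2}), 2))
# 0.31 0.3 0.39, as computed above
```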
The figure of merit which we shall use to evaluate a hashing system is the expected number of probes necessary to insert an item α into a table in which k out of n locations are filled. We shall succeed on the first probe if h0(α) = i and location i is one of the n - k empty locations. Thus, the probability that we succeed on the first probe is given by

    Σ_{i=0}^{n-1} p(i) Σ_S p(S)

where the rightmost sum is taken over all sets S of k locations which do not contain i. If h0(α) is in S but h1(α) is not in S, then we shall succeed on the second try. Therefore, the probability that we fail on the first try but succeed on the second is given by

    Σ_{i=0}^{n-1} Σ_{j=0}^{n-1} p(ij) Σ_S p(S)

where the rightmost sum is taken over all sets S of exactly k locations with i in S and j not in S. Note that p(ij) = 0 if i = j. Proceeding in this manner, we arrive at the following formula for E(k, n), the expected number of probes required to insert an item into a table in which
k out of n locations are filled, k < n:

(10.1.5)    E(k, n) = Σ_{m=1}^{k+1} m Σ_w p(w) Σ_S p(S)

where

(1) the middle summation is taken over all w which are strings of distinct locations of length m, and
(2) the rightmost sum is taken over all sets S of k locations such that all but the last symbol of w is in S. (The last symbol of w is not in S.)

The first summation assumes that m steps are required to compute the primary location and its first m - 1 alternates. Note that if k < n, an empty location will always be found after at most k + 1 tries.

Example 10.6
Let us use the statistics of Example 10.5 to compute E(2, 3). Equation (10.1.5) gives

    E(2, 3) = Σ_{m=1}^{3} m Σ_w p(w) Σ_S p(S),

where S ranges over the sets containing exactly two locations such that all but the last symbol of w is in S. Expanding,

    E(2, 3) = p(0)p({1, 2}) + p(1)p({0, 2}) + p(2)p({0, 1})
      + 2[p(01)p({0, 2}) + p(10)p({1, 2}) + p(02)p({0, 1})
          + p(20)p({1, 2}) + p(12)p({0, 1}) + p(21)p({0, 2})]
      + 3[p(012)p({0, 1}) + p(021)p({0, 2}) + p(102)p({0, 1})
          + p(120)p({1, 2}) + p(201)p({0, 2}) + p(210)p({1, 2})]
    = 2.008
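As a sanity check (not part of the original text), a short simulation that draws items according to the permutation probabilities of Example 10.5 reproduces this value; the function names are invented.

```python
# Estimate E(2, 3) by simulation, using the permutation probabilities of
# Example 10.5: fill two locations, then count probes to insert a third item.
import random

PERMS = [(0, 1, 2), (0, 2, 1), (1, 0, 2), (1, 2, 0), (2, 0, 1), (2, 1, 0)]
PROBS = [0.1, 0.2, 0.1, 0.3, 0.2, 0.1]

def insert(table, perm):
    """Probe locations in the order given by perm; return the probe count."""
    for probes, loc in enumerate(perm, start=1):
        if table[loc] is None:
            table[loc] = True
            return probes
    raise RuntimeError("table full")

def estimate_E(k, n, trials=200_000):
    total = 0
    for _ in range(trials):
        table = [None] * n
        for _ in range(k):                                  # fill k locations
            insert(table, random.choices(PERMS, PROBS)[0])
        total += insert(table, random.choices(PERMS, PROBS)[0])
    return total / trials

print(estimate_E(2, 3))   # close to 2.008
```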
Another figure of merit used to evaluate hashing systems is R(k, n), the expected number of probes required to retrieve an item from a table in which k out of n locations are filled. However, this figure of merit can be readily computed from E(k, n). We can assume that each of the k items in the table is equally likely to be retrieved. Thus, the expected retrieval time is equal to the average number of probes that were required to originally insert these k items into the table. That is,
    R(k, n) = (1/k) Σ_{i=0}^{k-1} E(i, n)

For this reason, we shall consider E(k, n) as the exclusive figure of merit.

A natural conjecture is that performance is best when a hashing system is random, on the grounds that any nonrandomness can only make certain
locations more likely to be filled than others, and that these are exactly the locations more likely to be examined when we attempt to insert new items. While this will be seen not to be precisely true, the exact optimum is not known. Random hashing is conjectured to be optimal in the sense of minimum retrieval time, and other common hashing systems do not compare favorably with a random hashing system. We shall therefore calculate E(k, n) for a random hashing system.

LEMMA 10.1

If a hashing system is random, then

(1) for all sequences w of locations such that 1 ≤ |w| ≤ n,
    p(w) = (n - |w|)!/n!

(2) for all subsets S of {0, 1, ..., n - 1} of size k,

    p(S) = 1/C(n, k).

Proof.

(1) Using (10.1.2), this is an elementary induction on n - |w|, starting at |w| = n and ending at |w| = 1.

(2) A simple argument of symmetry assures us that p(S) is the same for all S of size k. Since the number of sets of size k is C(n, k), part (2) is immediate. □
LEMMA 10.2

If n > k, then Σ_{j=0}^{k} C(k, j)/C(n, j) = (n + 1)/(n + 1 - k).

Proof. Exercise. □
THEOREM 10.1

If a hashing system is random, then E(k, n) = (n + 1)/(n + 1 - k).
Proof. Let us suppose that we have a hash table with k out of n locations filled. We wish to insert the (k + 1)st item α. It follows from Lemma 10.1(2) that every set of k locations has the same probability of being filled. Thus E(k, n) is independent of which k locations are actually filled. We can therefore assume without loss of generality that locations 0, 1, 2, ..., k - 1 are filled. To determine the expected number of probes required to insert α, we examine the sequence of addresses obtained by applying h to α. Let this sequence be h0(α), h1(α), ..., h_{n-1}(α). By definition, all such sequences of n locations are equally probable.
Let q_j be the probability that the first j - 1 locations in this sequence are in {0, 1, ..., k - 1} but that the jth is not. Clearly, E(k, n), the expected number of probes to insert α, is Σ_{j=1}^{k+1} j q_j. We observe that

(10.1.6)    Σ_{j=1}^{k+1} j q_j = Σ_{j=1}^{k+1} Σ_{m=j}^{k+1} q_m
But Σ_{m=j}^{k+1} q_m is just the probability that the first j - 1 locations in the sequence h0(α), h1(α), ..., h_{n-1}(α) are between 0 and k - 1, i.e., that at least j probes are required to insert the (k + 1)st item. By Lemma 10.1(1), this quantity is

    k(k - 1)···(k - j + 2) / [n(n - 1)···(n - j + 2)]
      = [k!/(k - j + 1)!]·[(n - j + 1)!/n!]
      = C(k, j - 1)/C(n, j - 1)
Then,
    E(k, n) = Σ_{j=1}^{k+1} C(k, j - 1)/C(n, j - 1) = (n + 1)/(n - k + 1)

using Lemma 10.2.
□
We observe from Theorem 10.1 that for large n and k, the expected time to insert an item depends only on the ratio of k and n and is approximately 1/(1 - ρ), where ρ = k/n. This function is plotted in Fig. 10.6. The ratio k/n is termed the load factor. When the load factor is small, the insertion time increases with k, the number of filled locations, at a slower rate than log k, and hashing is thus superior to a binary search. Of course, if k approaches n, that is, as the table gets filled, insertion becomes very expensive, and at k = n further insertion is impossible unless some mechanism is provided to handle overflows. One method of dealing with overflows is suggested in Exercises 10.1.11 and 10.1.12.

The expected number of trials to insert an item is not the only criterion of goodness of a hashing scheme. One also desires that the computation of the hashing functions be simple. The hashing schemes we have considered compute the alternate functions h1(α), ..., h_{n-1}(α) not from α itself but from h0(α), and this is characteristic of most hashing schemes. This arrangement is efficient because h0(α) is an integer of known length, while α may be arbitrarily long. We shall call such a method hashing on locations. A more restricted case, and one that is even easier to implement, is linear hashing, where hi(α) is given by (h0(α) + i) mod n. That is, successive locations in the table are tried until an empty one is found; if the bottom of the table is reached, we proceed to the top. Example 10.4 (p. 796) is an example of linear hashing.
[Fig. 10.6 Expected insertion time as a function of the load factor k/n for random hashing: the curve 1/(1 - ρ) rises slowly for small ρ and grows without bound as ρ approaches 1.]
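The shape of this curve is easy to tabulate. The following snippet (illustrative only, not from the text) compares the exact value E(k, n) = (n + 1)/(n + 1 - k) with its limiting form 1/(1 - ρ) for n = 1000.

```python
# A quick check of Theorem 10.1: for random hashing E(k, n) = (n + 1)/(n + 1 - k),
# which approaches 1/(1 - rho) as n grows with rho = k/n held fixed.
n = 1000
for rho in (0.2, 0.4, 0.6, 0.8, 0.9):
    k = int(rho * n)
    exact = (n + 1) / (n + 1 - k)
    approx = 1 / (1 - rho)
    print(f"rho = {rho:.1f}   E(k, n) = {exact:.3f}   1/(1 - rho) = {approx:.3f}")
```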
We shall give an example to show that linear hashing can be inferior to random hashing in the expected number of trials for insertion. We shall then discuss hashing on locations and show that at least in one case it, too, is inferior to random hashing.

Example 10.7
Let us consider a linear hashing system with n = 4 and probability 1/4 for each of the four permutations [0123], [1230], [2301], and [3012]. The reader can show that performance is made worse only if these probabilities are made unequal. For random hashing, Theorem 10.1 gives E(2, 4) = 5/3. By (10.1.4), we can calculate the probabilities that a set of two locations is filled. These probabilities are

    p({0, 1}) = p({1, 2}) = p({2, 3}) = p({3, 0}) = 3/16
and
    p({0, 2}) = p({1, 3}) = 1/8

Then, by (10.1.5), we compute E(2, 4) for linear hashing to be 27/16, which is greater than 5/3, the cost for random hashing. □

We shall next compare hashing on locations with random hashing in the special case that the third item is entered into the table. We note that
when we hash on locations there is, for each location i, exactly one permutation, Π_i, that begins with i and has nonzero probability. We can denote the probability of Π_i by p_i. We shall denote the second entry in Π_i, the first alternate of i, by a_i. If p_i = 1/n for each i, we call the system random hashing on locations.
THEOREM 10.2

E(2, n) is smaller for random hashing than for random hashing on locations for all n > 3.

Proof. We know by Theorem 10.1 that E(2, n) for random hashing is (n + 1)/(n - 1). We shall derive a lower bound on E(2, n) for hashing on locations. Let us suppose that the first three items to be entered into the table have permutations Π_i, Π_j, and Π_k, respectively. We shall consider two cases, depending on whether i = j or not.

Case 1: i ≠ j. This occurs with probability (n - 1)/n. The expected number of trials to insert the third item is seen to be

    1 + (2/n) + (2/n)[1/(n - 1)] = (n + 1)/(n - 1).

Case 2: i = j. This occurs with probability 1/n. Then with probability (n - 2)/n, the third item is inserted in one try, that is, k ≠ i and k ≠ a_i, the second location filled. With probability 1/n, k = a_i, and at least two tries are made. Also with probability 1/n, k = i, and three tries must be made. (This follows because we know that the second try is for a_i, which was filled by j.) The expected number of tries in this case is thus at least

    [(n - 2)/n] + (2/n) + (3/n) = (n + 3)/n.

Weighting the two cases according to their likelihoods, we get, for random hashing on locations,

    E(2, n) ≥ [(n + 1)/(n - 1)][(n - 1)/n] + [(n + 3)/n](1/n) = (n² + 2n + 3)/n².

The latter expression exceeds (n + 1)/(n - 1) for n > 3.
□
The point of the previous example and theorem is that many simple hashing schemes do not meet the performance of random hashing. Intuitively, the cause is that when nonrandom schemes are used, there is a tendency for the same location to be tried over and over again. Even if the load factor is small, with high probability there will still be some locations that have been tried many times. If a scheme such as hashing on locations is used, each time a primary location h0(α) is filled, all the alternates of h0(α) which were tried before will be tried again, resulting in poor performance. The foregoing does not imply that one should not use a scheme such
as hashing on locations if there is a compensating saving in time per insertion try. In fact, it is at least possible that random hashing is not the best we can do. The following example shows that E(k, n) does not always attain a minimum when random hashing is used. We conjecture, however, that random hashing does minimize R(k, n), the expected retrieval time.

Example 10.8

Let the permutations [0123] and [1032] have probability .2 and let [2013], [2103], [3012], and [3102] have probability .15, all others having zero probability. We can calculate E(2, 4) directly by (10.1.5), obtaining the value 1.665. This value is smaller than the figure 5/3 for random hashing. □
EXERCISES
10.1.1.
Use Algorithm 10.1 to insert the following sequence of items into a binary search tree" T, D, H, F, A, P, O, Q, W, TO, TH. Assume that the items have alphabetic order.
10.1.2.
Design an algorithm which will take a binary search tree as input and list all elements stored in the tree in order. Apply your algorithm to the tree constructed in Exercise 10.1.1.
"10.1.3.
Show that the expected time to insert (or retrieve) one item in a binary search tree is O(log n), where n is the number of nodes in the tree. What is the maximum amount of time required to insert any one item ?
"10.1.4.
What information about FORTRAN variables and constants is needed in the symbol table for code generation ?
10.1.5.
Describe a symbol table storage mechanism for a block-structured language such as ALGOL in which the scope of a variable X is limited to a given block and all blocks contained in that block in which X is not redeclared.
10.1.6.
Choose a table size and a primary hashing function h0. Compute h0(t~), where 0c is drawn from the set of (a) FORTRAN keywords, (b) ALGOL keywords, and (c) PL/I keywords. What is the maximum number of items with the same primary hash address ? You may wish to do this calculation by computer. Sammet [1969] will provide the needed sets of keywords.
"10.1.7.
Show that R(k, n) for random hashing approximates (-1/ρ) log (1 - ρ), where ρ = k/n. Plot this function. Hint: Approximate (1/k) Σ_{i=0}^{k-1} (n + 1)/(n + 1 - i) by an integral.
*10.1.8.
Consider the following pseudorandom number generator. This generator creates a sequence r1, r2, ..., r_{n-1} of numbers which can be used to compute hi(α) = (h0(α) + r_i) mod n for 1 ≤ i ≤ n - 1. Each time
that a sequence of numbers is to be generated, the integer R is initially set to 1. We assume that n = 2^p. Each time another number is required, the following steps are executed: (1) R = 5·R. (2) R = R mod 4n. (3) Return r = ⌊R/4⌋. Show that for each i, the differences r_{i+k} - r_i are all distinct for k ≥ 1
and i + k ≤ n - 1.

*10.1.9. Show that if hi = (h0 + ai² + bi) mod n for 1 ≤ i ≤ n - 1, then at most half the locations in the sequence h0, h1, h2, ..., h_{n-1} are distinct. Under what conditions will exactly half the locations in this sequence be distinct?

**10.1.10. How many distinct locations are there in the sequence h0, h1, ..., h_{p-1} if, for 1 ≤ i ≤ p/2,
hp_l
hzi-1 = [h0 + iZ]modp
h z ~ = h o _ ~ ( p -2 1) + i t 2 mod P and p is a prime number ? DEFINITION
Another technique for resolving collisions that is more efficient in terms of insertion and retrieval time is chaining. In this method one field is set aside in each entry of the hash table to hold a pointer to additional entries with the same primary hash address. All entries with the same primary address are chained on a linked list starting at that primary location. There are several methods of implementing chaining. One method, called direct chaining, uses the hash table itself to store all items. To insert an item α, we consult location h0(α).

(1) If that location is empty, α is installed there. If h0(α) is filled and is the head of a chain, we find an empty entry in the hash table by any convenient mechanism and place this entry on the chain headed by h0(α).

(2) If h0(α) is filled but not by the head of a chain, we move the current entry β in h0(α) to an empty location in the hash table and insert α in h0(α). [We must recompute h(β) to keep β in the proper chain.]

This movement of entries is the primary disadvantage of direct chaining. However, the method is fast. Another advantage of the technique is that when the table becomes full, additional items can be placed in an overflow table with the same insertion and retrieval strategy.
10.1.11.
Show that if alternate locations are chosen randomly, then R(k, n), the expected retrieval time for direct chaining, is 1 + ρ/2, where ρ = k/n. Compare this function with R(k, n) in Exercise 10.1.7.

Another chaining technique, which does not require items to be moved, uses an index table in front of the hash table. The primary hashing function h0 computes addresses in the index table. The entries in the index table are pointers to the hash table, whose entries are filled in sequence. To insert an item α in this new scheme, we compute h0(α), which is an address in the index table. If location h0(α) is empty, we seize the next available location in the hash table, insert α into that location, and place a pointer to this location in h0(α). If h0(α) already contains a pointer to a location in the hash table, we go to that location and search down the chain headed by that location. Once we reach the end of the chain, we take the next available location in the hash table, insert α into that location, and attach this location to the end of the chain. If we fill the hash table in order beginning from the top, we can find the next available location very quickly. In this scheme no items ever need to be moved, because each entry in the index table always points to the head of a chain. Moreover, overflows can be simply accommodated in this scheme by adding additional space to the end of the hash table.
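The index-table scheme just described is straightforward to realize. The following is a rough sketch (all names invented, not part of the original text), with the index table holding chain heads and the hash table proper filled in sequence.

```python
# A sketch of chaining with an index table; the primary function here simply
# sums character codes, which is an assumption made only for the example.
class ChainedTable:
    def __init__(self, index_size, table_size):
        self.index = [None] * index_size      # index table: heads of chains
        self.names = []                       # hash table proper, filled in sequence
        self.next_in_chain = []               # chain links, parallel to names
        self.capacity = table_size

    def _h0(self, name):
        return sum(ord(c) for c in name) % len(self.index)

    def insert(self, name):
        if len(self.names) >= self.capacity:
            raise OverflowError("extend the end of the hash table to absorb overflow")
        loc = len(self.names)                 # next available location
        self.names.append(name)
        self.next_in_chain.append(None)
        h = self._h0(name)
        if self.index[h] is None:             # empty index entry: start a new chain
            self.index[h] = loc
        else:                                 # walk to the end of the existing chain
            j = self.index[h]
            while self.next_in_chain[j] is not None:
                j = self.next_in_chain[j]
            self.next_in_chain[j] = loc
        return loc

    def lookup(self, name):
        j = self.index[self._h0(name)]
        while j is not None:
            if self.names[j] == name:
                return j
            j = self.next_in_chain[j]
        return None
```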
"10.1.12.
What is the expected retrieval time for a chaining scheme with an index table? Assume that the primary' locations are uniformly distributed.
10.1.13.
Consider a random hashing system with n locations as in Section 10.1.5. Show that if S is a set of k locations and i is not in S, then Σ_w p(wi) = 1/(n - k), where the sum is taken over all w such that w is a string of k or fewer distinct locations in S.
"10.1.14.
Prove the following identities:

(a) Σ_{i=0}^{k} C(n + i, i) = C(n + k + 1, k).

(c) Σ_{i=0}^{k} i·C(n + i, i) = k·C(n + k + 1, k) - C(n + k + 1, k - 1).

*10.1.15.
Suppose that items are strings of from one to six capital R o m a n letters. Let CODE(a) be the function defined in Example 10.4. Suppose that item tx has probability (1/6)26-1~,I. Compute the probabilities of the permutations on {0, 1 , . . . , n -- 1} if (a)
h,(ct) =
(CODE(t~) + i)mod n, 0 < i < n - 1.
810
BOOKKEEPING
CHAP. 10
(b) hi(a) = (i(ho(~) + 1))mod n, 1 .~ i _~ n - 1 where h0(~) = CODE(a) mod n *'10.1.16.
Show that for linear hashing with the primary location h0(00 randomly distributed, the limit of E(k, n) as k and n approach oo with kin : p is given by 1 + [p(1 -- p/2)/(1 - p)Z]. How does this compare with the corresponding function of p for random hashing ? DEFINITION
A hashing system is said to be k-uniform if for each set S ~ { 0 , 1 , 2 . . . . . n - - l ] such that ~ S : k the probability that a sequence of k distinct locations consists of exactly the locations of S is 1/(7,). (Most important, the probability is independent of S.) *10.1.17.
Show that if a hashing system is k-uniform, then
E(k, n) = (n + 1)/(n + 1 -- k), as for random hashing. "10.1.18.
Show that for each hashing system such that there exists k for which E(k, n) < (n + 1)/(n + 1 -- k), there exists k' < k such that
E(k', n) > (n + 1)/(n q-- 1 -- k') Thus, if a given hashing system is better than random for some k, there is a smaller k' for which performance is worse than random. Hint: Show that if a hashing system is (k -- 1)-uniform but not k-uniform, then E(k, n) > (n q- 1)/(n -- k q- 1). "10.1.19.
Give an example of a hashing systemwhich is k-uniform for all k but is not random.
"10.1.20.
Generalize Example 10.7 to the case of unequal probabilities for the cyclic permutations.
"10.1.21.
Strengthen Theorem 10.2 to include systems which hash on locations but do not have equal pt's.
Open Problems 10.1.22.
Is random hashing optimal in the sense of expected retrieval time ? That is, is it true that R(k, n) is always bounded below by (l/k) Z~-~ (n + 1)/(n q- 1 - - i ) ? We conjecture that random hashing is optimal.
10.1.23.
Find the greatest lower bound on E(k, n) for systems which hash on locations. Any lower bound that exceeds (n + 1)/(n ÷ 1 -- k) would be of interest.
10.1.24.
Find the greatest lower bound on E(k, n) for arbitrary hashing systems. We saw in Example 10.8 that (n ÷ 1)/(n + 1 -- k) is not such a bound.
SEC. 10.2
PROPERTY GRAMMARS
811
Research Problem 10.1.25.
In certain uses of a hash table, the items entered are known in advance. Examples are tables of library routines or tables of assembly language operation codes. If we know what the residents of the hash table are, we have the opportunity to select our hashing system to minimize the expected lookup time. Can you provide an algorithm which takes the list of items to be stored and yields a hashing system which is efficient to implement, yet has a low lookup time for this particular loading of the hash table ?
Programming Exercises 10.1.26.
Implement a hashing system that does hashing on locations. Test the behavior of the system on F O R T R A N keywords and common function names.
10.1.27.
Implement a hashing system that uses chaining to resolve collisions. Compare the behavior of this system with that in Exercise 10.1.26.
BIBLIOGRAPHIC
NOTES
Hash tables are also known as scatter storage tables, key transformation tables, randomized tables, and computed entry tables. Hash tables have been used by programmers since the early 1950's. The earliest paper on hash addressing is by Peterson [1957]. Morris [1968] provides a good survey of hashing techniques. The answer to Exercise 10.1.7 can be found there. Methods of computing the alternate functions to reduce the expected number of collisions are discussed by Maurer [1968], Radke [1970], and Bell [1970]. An answer to Exercise 10.1.10 can be found in Radke [1970]. Ullman [1972] discusses k-uniform hashing systems. Knuth [1973] is a good reference on binary search trees and hash tables.
10.2.
PROPERTY G R A M M A R S
An interesting and highly structured method of assigning properties to the identifiers of a p r o g r a m m i n g language is through the formalism of "property grammars." These are context-free grammars with an additional mechanism to record information about identifiers and to handle some of the non-context-free aspects of the syntax of p r o g r a m m i n g languages (such as requiring identifiers to be declared before their use). In this section we provide an introduction to the theory of property grammars a n d show how a property g r a m m a r can be implemented to model a syntactic analyzer that combines parsing with certain aspects of semantic analysis.
10.2.1. Motivation
Let us try first to understand why it is not always sufficient to establish the properties of each identifier as it is declared and to place these properties in some location in a symbol table reserved for that identifier. If we consider a block-structured language such as A L G O L or PL/I, we realize that the properties of an identifier may change many times, as it may be defined in an outer block and redefined in an inner block. When the inner block terminates, the properties return to what they were in the outer block. For example, consider the diagram in Fig. 10.7 indicating the block structure of a program. The letters indicate regions in the program. This block structure can be presented by the tree in Fig. 10.8. begin block 1 A begin block 2 B
begin block 3
,i
c end block 3 end block 2 begin block 4 end block 4 end block 1
Fig. 10.7 Blockstructure of a program.
Suppose that an identifier I is defined in block 1 of this program and is recorded in a symbol table as soon as it is encountered. Suppose that I is then redefined in block 2. In regions B, C, and D of this program, I has the new definition. Thus, on encountering the definition of I in block 2, we must enter the new definition of I in the symbol table. However, we cannot merely replace the definition that I had in block 1 by the new definition in block 2, because once region E is encountered we must revert to the original definition of L One way to handle this problem is to associate two n u m b e r s - - a level number and an index--with each definition of an identifier. The level number is the nesting depth, and each block with the same level number is given a distinct index. For example, identifiers in areas B and D of block 2 would
[Fig. 10.8: block 1 has descendants A, block 2, E, block 4, and G; block 2 has descendants B, block 3, and D; block 3 has descendant C; block 4 has descendant F.]
Fig. 10.8 Tree representation of block structure.
have level number 2 and index 1. Identifiers in area F of block 4 would have level number 2 and index 2. If an identifier in the block with level i and index j is referenced, we look in the symbol table for a definition of that identifier with the same level number and index. However, if that identifier is nowhere defined in the block with level number i, we would then look for a definition of that identifier in the block of level number i -- 1 which contained the block of level number i and index j, and so forth. If we encounter a definition at the desired level but with too small an index, we may delete the definition, as it will never again apply. Thus, a pushdown list is useful for the storing of definitions of each identifier as encountered. The search described is also facilitated if the index Of the currently active block at each level is available. For example, if an identifier K is encountered in region C and no definition of K appeared in region C, we would accept a definition of K appeari n g in region B (or D, if declarations after use are permitted). However, if no definition of K appeared in regions B or D, we would then accept a definition of K in regions A, E, or G. But we would not look in region F for a definition of K. The level-index method of recording definitions can be used for languages with the conventional nesting of definitions, e.g., A L G O L and PL/I. However, in this section we shall discuss a more general formalism, called property grammars, which permits arbitrary conventions regarding scope of definition. Property grammars have an inherent generality and elegance which stem from the uniformity of the treatment of identifiers a n d their properties. Moreover, they can be implemented in an amount of time which is essentially linear in the length of the compiler input. While the constant
of proportionality may be high, we present the concept in the hope that future research may make it a practical compiler technique.
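As an aside (not in the original text), the level-index bookkeeping described in this subsection can be realized directly. The class below is a minimal sketch with invented names, keeping a pushdown list of definitions for each identifier and recording the index of the currently active block at each level.

```python
# A minimal sketch of level-index scoping for a block-structured language;
# the concrete representation (dicts and lists) is an assumption of the sketch.
class ScopedSymbolTable:
    def __init__(self):
        self.defs = {}      # identifier -> pushdown list of (level, index, info)
        self.active = []    # active[level - 1] = index of the open block at that level
        self.counts = []    # counts[level - 1] = number of blocks seen at that level

    def enter_block(self):
        level = len(self.active) + 1
        if len(self.counts) < level:
            self.counts.append(0)
        self.counts[level - 1] += 1          # give this block a fresh index
        self.active.append(self.counts[level - 1])

    def exit_block(self):
        self.active.pop()

    def declare(self, name, info):
        level = len(self.active)
        self.defs.setdefault(name, []).append((level, self.active[level - 1], info))

    def lookup(self, name):
        stack = self.defs.get(name, [])
        # Search the current block first, then each enclosing active block.
        for level in range(len(self.active), 0, -1):
            for (lv, ix, info) in reversed(stack):
                if lv == level and ix == self.active[level - 1]:
                    return info
        return None
```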
10.2.2. Definition of Property Grammar
Informally, a property grammar consists of an underlying C F G to whose nonterminal and terminal symbols we have attached "property tables." We can picture a property table as an abstraction of a symbol table. When we parse bottom-up according to the underlying grammar, the property tables attached to the nonterminals together represent the information available in the symbol table at that point of the parse. A property table T is a mapping from an index set I to a property set V. Here we shall use the set of nonnegative integers as the index set. We can interpret each integer in the index set as a pointer into a symbol table. Thus, if the entry pointed to in the symbol table represents an identifier, we can treat the integer as the name of that identifier. The set V is a set of "properties" or "values" and we shall use a finite set Of integers for V. One integer (usually 0) is distinguished as the "neutral" property. Other integers can be associated with various properties which are of interest. For example, the integer 1 might be associated with the property "this identifier has been referenced," 2 with "declared real," and so forth. In tables associated with terminals all but one index is mapped to the neutral property. The remaining index can be mappeid to any property (including the neutralone). However, if the terminal is the token (identifier), the index which represents the name of the particular token (that is, the data component of the token) would be mapped onto a property such as "this is the identifier mentioned here." For example, if we encounter the declaration real B in a program, this string might be parsed as n)
real
[1"2]
identifier
[1" 1]
where the d a t a c o m p o n e n t of the token (identifier) refers to the string B. With the terminal (identifier) is associated the table [1: 1], which associates the property 1 (perhaps "mentioned") with the index 1 (which now corresponds to B) and the neutral property with all other indices. The terminal token real is associated with a table which maps all indices to the neutral property. Such a table will normally not be explicitly present. With the nonterminal (declaration) in the parse tree we associate a table which would be constructed by merging the tables of its direct descendants according to some rule. Here, the table that is constructed is [1: 2], which
associates property 2 with index 1 and the neutral property with all other indices. This table can then be interpreted as meaning that the identifier associated with index 1 (namely B) has the property "declared real." In general if we have the structure A T
x~ x~ . . . x ~ T~T~ ...T~ in a parse tree, then the property of index i in table T is a function only of the property of index i in tables T1, T2 . . . . , Tk. That is, each index is treated independently of all others. We shall now define a property grammar precisely. DEFINITION
A property grammar is an 8-tuple G -- (N, Z, P, S, V, %, F, g), where
(1) (2) (3) (4) (5)
(N, Z, P, S) is a CFG, called the underlying CFG; V is a finite set of properties; v 0 in V is the neutral property; F ~ V is the set of acceptable properties; v o is always in F; and /t is a mapping from P × V* to V, such that (a) If/t(p, s) is defined, then production p has a right-hand side exactly as long as s, where s is a string of properties; (b) tt(p, %% " " %) is v 0 if the string of v0's is as long as the righthand side of production p, and is undefined otherwise.
The function/t tells how to assign properties in the tables associated with interior nodes of parse trees. Depending on the production used at that node, the property associated with each integer is computed independently of the property of any other integer. Condition (5b) establishes that the table of some interior node can have a nonneutral property for some integer only if one of its direct descendants has a nonneutrai property for that integer. A sententiaIform of G is a string X1T1X2Tz . . . X,T,, where the X's are symbols in N U Z and the T's are property tables. Each table is assumed to be attached to the symbol immediately to its left and represents a mapping from indices to V such that all but a finite number of indices are mapped to v0. We define the relation =~, or =~. if G is understood, on sentential forms G as follows. Let aA Tfl be a sentential form of G and A ~ X1 ..- Xk be production p in P. Then a A T ~ =~ aX~T I .... XkTkfl if for all indices i,
It(P, T , ( i ) T 2 ( i ) . . . Tk(i)) = T(i).
Let =*, or ==~ if G is understood, be the reflexive, transitive closure of =-. The G
G
language defined by G, denoted L ( G ) , is the set of all a~ T~a2T2 . . . a , T , such that for some table T (1) (2) (3) (4)
. . . a,T,; Each aj is in E; For all indices i, T(i) is in F; and For each j, Tj maps all indices, or all but one index, to %.
S T *=, a t T l a z T 2
We should observe that although the definition of a derivation is topdown, the definition of a property grammar lends itself well to bottom-up parsing. If we can determine the tables associated with the terminals, we can construct the tables associated with each node of the parse tree deterministically, since/z is a function of the tables associated with the direct descendants of a node. It should be clear that if G is a property grammar, then the set [alaz . . . a, l a l T l a z T 2 . . . a , T , is in L ( G )
for some sequence of tables T1, T z , . . . , T,} is a context-free language, because any string generated by the underlying C F G can be given all-neutral tables and generated in the property grammar. CONVENTION
We continue to use the convention regarding context-free grammars, a, b, e , . . . are in E and so forth, except v now represents a property and s a string of properties. If T maps all integers to the neutral property, we write X instead of X T . Example 10.9
We shall give a rather lengthy example using property grammars to handle declarations in a block-structured language. We shall also show how, if the underlying C F G is deterministically parsable in a bottom-up way, the tables can be deterministically constructed as the parse proceeds.t Let G = (N, E, P, S, V, O, [0}, ,u) be a property grammar with (i) N = ((block), (statement), (declaration list), (statement list), (variable list)}. The nonterminal (variable list) generates a list of variables used in a statement. We are going to represent a statement by the actual variables used in the statement rather than giving its entire structure. This tOur example grammar happens to be ambiguous but will illustrate the points to be made.
SEC.
10.2
PROPERTY GRAMMARS
817
abstraction brings out the salient features of property grammars without involving us in too much detail. (ii) ~ = [begin, end, declare, label, goto, a}. The terminal declare stands for the declaration of one identifier. We do not show the identifier declared, as this information will be in the property table attached to declare. Likewise, labd stands for the use of an identifier as a statement label. The identifier itself is not explicitly shown. The terminal goto stands for goto(label>, but again we do not show the label explicitly, since it will be in the property table attached to goto. The terminal a represents a variable. (iii) P consists of the following productions.
(1) (block)
> begin (declaration list)(statement list) end
(2)
(statement list)
~ (statement list)(statement)
(3)
(statement list)
> (statement)
(4)
(statement)
~ (block)
(5)
(statement)
~ labd (variable list)
(6)
(statement)
~ (variable list)
(7)
(statement)
~ goto
(8)
(variable list) L ~ (variable list) a
(9)
(variable list)
~e
(10)
(declaration list)
~ declare (declaration list)
(11)
(declaration list)
~e
Informally, production (1) says that a block is a declaration list and a list of statements surrounded by begin and end. Production (4) says that a statement can be a block; productions (5) and (6) say that a statement is a list of the variables used therein, possibly prefixed with a label. Production (7) says that a statement can be a goto statement. Productions (8) and (9) say that a variable list is a string of 0 or more a's, and productions (10) and (11) say that a declaration list is a string of 0 or more deelare's. (iv) V = (0, 1, 2, 3, 4} is a set of properties with the following meanings: 0
Identifier does not appear in the string derived from this node (neutral property).
1 Identifier is declared to be a variable. 2
Identifier is a label of a statement.
3
Identifier is used as a variable but is not (insofar as the descendants of the node in question are concerned) yet declared.
4
Identifier is used as a goto target but has not yet appeared as a label.
818
BOOKKEEPING
CHAP. 10
(v) We define the function ,u on properties with the following ideas in mind" (a) If an invalid use of a variable or label is found, there will be no way to construct further tables, and so the process of table computation will "jam." (We could also have defined an "error" property.) (b) An identifier used in a goto must be the label of some statement of the block in which it is used.t (c) An identifier used as a variable must be declared within its block or a block in which its block is nested. We shall give #(p, w) for each production in turn, with comments as to the motivation. (1)
( b l o c k ) L > begin (declaration list)(statement list)end #(1, s) 0 0 0 0 0
0 1 1 0 0
0 0 3 3 2
0 0 0 0 0
The only possible property for all integers associated with begin and end is 0 and hence the two columns of O's. In the declaration list each identifier will have property 0 ( = not declared) or 1 ( = declared). If an identifier is declared, then within the body of the block, i.e., the statement list, it can be used only as a variable (3) or not used (0). In either case, the identifier is not declared insofar as the program outside the block is concerned, and so we give the identifier the 0 property. Thus, the second and third lines appear as they do. If an identifier is not declared in this block, it may still be used, either as a label or variable. If used as a variable (property 3), this fact must be transmitted outside the block, so we can check that it is declared at some appropriate place, as in line 4. If an identifier is defined as a label within the block, this fact is not transmitted outside the block (line 5), because a label within the block may not be transferred to from outside the block. Since # is not defined for other values of s, the property grammar catches uses of labels not found within the block as well as uses of declared variables as labels within the block. A label used as a variable will be caught at another point. tThis differs from the convention of ALGOL, e.g., in that ALGOL allows transfers to a block which surrounds the current one. We use this convention to make the handling of labels differ from that of identifiers.
SEC. 10.2 (2)
PROPERTY GRAMMARS
(statement list)
819
(statement list) ( s t a t e m e n t ) U(2, s) 0 3 0 3 4 0 4 4 2 0 2
0 3 3 0 4 4 0 2 4 2 0
Lines 2-4 say that an identifier used as a variable in (statement list) or ( s t a t e m e n t ) on the right-hand side of the production is used as a variable insofar as the (statement list) on the left is concerned. Lines 5-7 say the same thing about labels. Lines 8-11 say that a label defined in either (statement list) or ( s t a t e m e n t ) on the right is defined for (statement list) on the left, whether or not that label has been used. At this point, we catch uses of an identifier as both a variable and label within one block. (3)
(statement list)
~ (statement) s
u(3, s)
0 2 3 4
o 2 3 4
Properties are transmitted naturally. Property 1 is impossible for a statement. (4)
(statement)
> (block) s
U(4, s)
0 3
0 3
The philosophy for production (3) also applies for production (4).
820
CHAP. 10
BOOKKEEPING
(5)
(statement>-
> label u(5, s) 0 0 0 3 2 0
The use of an identifier as a label in label or variable in is transmitted to the (statement> on the right. (6)
=
> s
u(6, s)
0 3
0 3
Here, the use of a variable in (variable list> is transmitted to (statement>. (7)
(statement)
~ goto s
U(7,s)
0 4
0 4
The use of a label in goto is transmitted to (statement). (8)
(variable list)
> (variable l i s t ) a
u(8, s) 0 3 0 3
0 0 3 3
Any use of a variable is transmitted to (variable list). (9)
(variable list) ~
e s
g(9, s)
e
0
The one value for w h i c h / t is defined is s = e. The property must be 0 by definition of the neutral property.
PROPERTY GRAMMARS
SEC. 10.2
(10)
821
> declare (declaration list> .u(lO, s) 0 0 1 1
0 1 0 1
Declared identifiers are transmitted to (declaration list). (11)
(declaration list) ~
e s
#(11,s)
e
0
The situation with regard to production (9) also applies here. We shall now give an example of how tables are constructed bottom-up on a parse tree. We shall use the notation [il : vl, i2 : v 2 , . . . , i, :v,] for the table which assigns to index ij the property vj for 1 . ~ j < n and assigns the neutral property to all other indices. Let us consider the input string begin declare [1 : 1] declare [ 2 : 1 ] begin label [ 1 : 2 ] a [2 : 3] goto [1 : 4]. end
a [1 : 3] end
That is, in the outer block, identifiers 1 and 2 are declared as variables by symbols declare[1 : 1] and declare[2 : 1]. Then, in the inner block, identifier 1 is declared and used as a label (which is legitimate) bysymbols label[1 : 2] and goto[l : 4], respectively, and identifier 2 is used as a variable by symbol a[2 : 3]. Returning to the outer block, 1 is used as a variable by symbol a[1:3]. A parse tree with tables associated with each node is given in Fig. 10.9.
822
BOOKKEEPING
CHAP. 10
< block >
li21,
begin
< declaration list >
< statement fist > [1"3 2"3]
end
declare
< declaration fist >
< statement list > [2 3]
< statement > [1 3]
declare [2 • 1]
< declaration fist >
< statement > [2 3]
< variable fist > [1 3]
< block > 3
< variable list >
begin
< declaration fist >
< statement list > [1"2,2"31
< statement list > [1"2,2"3]
I
< statement > [1"2,2'3]
label [1 • 2]
end
e
< statement > [l 41
goto [1'4]
< variable list > [2 3]
< variable list >
a
[2"3]
e
Fig. 10.9
Parse tree with tables.
a [1:31
SEC. 1 0 . 2
PROPERTY
GRAMMARS
823
The notion of a property grammar can be generalized. For example, (1) the mapping μ can be made nondeterministic, that is, μ(p, s) can be a subset of V; (2) the existence of a neutral property need not be assumed; (3) constraints may be placed on the class of tables which may be associated with symbols, e.g., the all-neutral table may not appear. Certain theorems about property grammars in this more general formulation are reserved for the Exercises.
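Before turning to the efficient implementation in the next subsection, the bottom-up construction of property tables is easy to state naively. The sketch below (not part of the original text) treats a table as a mapping from indices to properties, with absent indices understood to carry the neutral property; the tiny μ shown is only an illustration in the spirit of production (10) of Example 10.9, not the book's definition.

```python
# A naive sketch of bottom-up table computation: when a reduction by
# production p is made, the left-hand side's table is computed index by index
# from the right-hand side's tables via mu.  All names here are illustrative.
NEUTRAL = 0

def combine(mu_p, tables):
    """Table for the left-hand side, given the tables of the right-hand side."""
    result = {}
    for index in set().union(*tables):
        props = tuple(t.get(index, NEUTRAL) for t in tables)
        value = mu_p(props)          # an undefined mu signals an invalid program
        if value is None:
            raise ValueError(f"invalid use of identifier {index}")
        if value != NEUTRAL:
            result[index] = value
    return result

# Illustrative mu for a declaration-list production: "declared" (1) on either
# side carries through; any other nonneutral combination is undefined.
def mu_decl_list(props):
    return 1 if 1 in props and all(p in (0, 1) for p in props) else (
        0 if props == (0, 0) else None)

print(combine(mu_decl_list, [{1: 1}, {2: 1}]))   # {1: 1, 2: 1}
```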
10.2.3. Implementation of Property Grammars
We shall discuss the implementation of a property grammar when the underlying CFG can be parsed bottom-up deterministically by a shiftreduce algorithm. When parsing top-down, or using any of the other parsing algorithms discussed in this book, one encounters the same timing problems in the construction of tables as one has timing the construction of translation strings in syntax-directed translation. Since the solutions are essentially those presented in Chapter 9, we shall discuss only the bottom-up case. In our model of a compiler the input to the lexical analyzer does not have tables associated with its terminals. Let us assume that the lexical analyzer will signal whether a token involved is an identifier or not. Thus, each token, when it becomes input to the parser, will have one of two tables, an all-neutral table or a table in which one index has the nonneutral property. We shall thus assume that unused input to the parser will have no tables at all; these are constructed when the symbol is shifted onto the pushdown list. We also assume that the tables do not influence the parse, except to interrupt it when there is an error. Thus, our parsing mechanism will be the normal shift-reduce mechanism, with a pointer to a representation of the property table for each symbol on the pushdown list. Suppose that we have [B, T1][C, T2] on top of the pushdown list and that a reduction according to the production A ~ B C is called for. Our problem is to construct the table associated with A from T1 and T2 quickly and conveniently. Since most indices in a table are mapped to the neutral property, it is desirable to have entries only for those indices which do not have the neutral property. We shall implement the table-handling scheme in such a way that all table-handling operations and inquiries about the property of a given index can be accomplished in time that is virtually linear in the number of operations and inquiries. We make the natural assumption that the number of table inquiries is proportional to the length of the input. The assumption is correct if we are parsing deterministically, as the number of reductions made (and hence the number of table mergers) is proportional to the length of the
~
i
~i ~
824
BOOKKEEPING
CHAP.
10
input. Also, a reasonable translation algorithm would not inquire of properties more than a constant number of times per reduction. In what follows, we shall assume that we have a property g r a m m a r whose underlying C F G is in Chomsky normal form. Generalizations o f the algorithms are left for the Exercises. We shall also assume that each index (identifier) with a nonneutral property in any property table has a location in a hash table and that we may construct data structures out of elementary building blocks called cells. Each cell is of the form
DATUM1
DATUMm
POINTER1
POINTERn
consisting of one or more fields, each of which can contain some data or a pointer to another cell. The cells will be used to construct linked lists. Suppose that there are k properties in V. The property table associated with a g r a m m a r symbol on the pushdown list is represented by a data structure consisting of up to k property lists and an intersection list. Each property list is headed by a property list header cell. The intersection list is headed by an intersection list leader cell. These header cells are linked as shown in Fig. 10.10. The property header cell has three fields:
PROPERTY
COUNT
NEXT HEADER
Property Header Cell
Grammar Symbol Pointer to Property Table
Intersection List Header
~
v Headers for Property Lists
B
I t ] i
Fig. 10.10 A pushdown list entry with structure representing property table.
Pushdown List Entry
SEC. 10.2
PROPERTY GRAMMARS
825
The intersection list header cell contains only a pointer to the first cell on the intersection list. All cells on the property lists and the intersection lists are index cells. An index cell consists of four pointer fields: TOWARD PROPERTY
INTERSECTION LIST
TOWARD HASH TABLE
AWAY F R O M HASH TABLE
Index Cell Suppose that T is a property table associated with a symbol on the pushdown list. Then T will be represented by p property lists, where p is the number of distinct properties in T, and one intersection list. For each index in T having a nonneutrai property j, there is one index cell on the property list headed by the property list header cell for property j. All index cells having the same property are linked into a tree whose root is the header for that property. The first pointer in an index cell is to its direct ancestor in that tree. If an index cell is on the intersection list, then the second pointer in that cell is to the next cell on the intersection list. The pointer is absent if an index cell is not on the intersection list. The last two pointers link index cells which represent the same index but in different property tables. Suppose that tST3~,T2flTx~ represents a string of tables on the pushdown list such that T1, T2, and T3 each contain index i with a nonneutral property and that all tables in t~, fl, 7, or ~ give index i the neutral property. If table T1 is closest to the top of the pushdown list, then C1, the index cell representing i in T1, will have in its third field a pointer to the symbol table location for i. The fourth field in C1 has a pointer to C2, the cell in T2 that represents i. In C2, the third field has a pointer to cell C~ and the fourth field has a pointer to the cell representing i in T3. Thus, an additional structure is placed on the index cells: All index cells representing the same index in all tables on the pushdown list are in a doubly linked list, with the hash table entry for the index at the head. A cell is on the intersection list of a table if and only if some table above it on the pushdown list has an index cell representing the same index. In the example above, the cell C2 above will be on the intersection list for T2. Example 10.10
Suppose that we have a pushdown list containing grammar symbols B and C with C on top. Let the tables associated with these entries be, respectively, T~ = [ 1 : v ~ , 2 : v 2 , 5 : v v S : v ~ , 8 : v z ]
826
CHAP. 10
BOOKKEEPING
and T 2 = [ 2 " v l , 3 " v 1,4"v 1,5"v 1,7"v2,8"%] Then a possible implementation for these tables is shown in Fig. 10.11. Circles Pushdown List B
v2
C
!
v3Hv2H ! ~I
I
/ /f'-",~
/ / ~ / /
J J
/ / / / / / / / / / / /
/
Symbol Table
Fig. 10.11
Implementation of property tables.
indicate index cells. The number inside the circle is the index represented by that cell. Dol~ted lines indicate the links of the intersection list. Note that the intersection list of the topmost table is empty, by definition, and that the intersection list of table T1 consists of index cells for indices 2, 5, and 8. Dashed lines indicate links to the hash table and to ceils representing the same index on other tables. We show these only for indices 2 and 3 to avoid clutter. D
SEC. 10.2
PROPERTYGRAMMARS
827
Suppose that we are parsing and that the parser calls for B C on its stack to be reduced to A. We must compute table T for A from tables Ti and Tz for B and C, respectively. Since C is the top symbol, the intersection list of T1 contains exactly those indices having a nonneutral property on both T~ and Tz (hence the name "intersection list"). These indices will be set aside for later consideration. Those indices which are not on the intersection list of T~ have the neutral property on at least one of T1 or T2. Thus, their property on T is a function only of their one nonneutral property. Neglecting those indices on the intersection list, the data structure representing table T can be constructed by combining various trees of T1 and T~. After doing so, each entry on the intersection list of T1 is treated separately and made to point to the appropriate cell of T. Before formalizing these ideas, we should point out that in practice we would expect that the properties can be partitioned into disjoint subsets such that we can express V as Va × V2 × .-- × Vm for some relatively large m. The various components, V~, V2, • • • would in general be small. For example, V1 might contain two elements designating "real" and "integer". Vz might have two elements "single precision" and "double precision"; V3 might consist of "dynamically allocated" and "statically allocated," and so on. One element of each of V1, Vz, • • • can be considered the default condition and the product of the default elements is the neutral property. Finally, we may expect that the various components of an identifier's property can be determined independently of the others. If this situation pertains, it is possible to create one property header for each nondefault element of V~, one for each nondefault element of V2, and so on. Each index cell is linked to several property headers, but at most one from any V,. If the links to the headers for Vt are made distinct from those to the headers of Vj for i ~ j, then the ideas of this section apply equally well to this situation, and the total number of property headers will approximate the sum Of the sizes of the V,'s rather than their product. We shall now give a formal algorithm for implementing a property grammar. For simplicity in exposition, we restrict our consideration to property grammars whose underlying C F G is in Chomsky normal form. ALGORITHM 10.3 Table handling for property grammar implementation. Input. A property grammar G = (N, E, P, S, V, v 0, F, ,u) whose underlying C F G is in Chomsky normal form. We shall assume that a nonneutral property is never mapped into the neutral property; that is~ if l t ( p , v 1v2) = vo, then vl = v2 = v0. [This condition may be easily assumed, because if we find It(i, vlv2) -- Vo but v~ or v 2 is not v0, we could on the right-hand side replace v o by v~, a new, nonneutral property, and introduce rules that would
828
BOOKKEEPING
CHAP. 10
make v~ "look like" v0.] Also, part of the input to this algorithm is a shiftreduce parsing algorithm for the underlying CFG. Output. A modified shift-reduce parsing algorithm for the underlying CFG, which while parsing computes the tables associated with those nodes o f the parse tree corresponding to the symbols or the pushdown list. Method. Let us suppose that each table has the format of Fig. 10.10, that is, an intersection list and a list of headers, at most one for each property, with a tree of indices with that property attached to each header. The operation of the table mechanism will be described in two parts, depending on whether a terminal or two nonterminals are reduced. (Recall that the underlying grammar is in Chomsky normal form.) Part 1: Suppose that a terminal symbol a is shifted onto the pushdown list and reduced to a nonterminal A. Let the required table for A be [ i ' v ] . To implement this operation, we shall shift A onto the pushdown list directly and create the table [i" v] for A as follows.
(1) In the entry for A at the top of the pushdown list, place a pointer to a single property header cell having property v and count 1. This property header points to an intersection list header with an empty intersection list. (2) Create C, an index cell for i. (3) Place a pointer in the first field of C to the property header cell. (4) Make the second field of C blank. (5) Place a pointer in the third field of C to the hash table entry for i and also make that hash table entry point to C. (6) If there was previously another cell C' which was linked to the hash table entry for i, place C' on the intersection list of its table. (Specifically, make the intersection list header point to C' and make the third field in C' point to the previous first cell of the intersection list if there was one.) (7) Place a pointer in the fourth field of C to C'. (8) Make the pointer in the third field of C' point to C. Part 2: Now, suppose that two nonterminals are reduced to one, say by production A ----~ B D . Let T1 and T2 be the tables associated with B and D, respectively. Then do the following to compute T, the table for A.
(1) Consider each index cell on the intersection list of Tt. (Recall that T2 has no intersection list.) Each such cell represents an entry for some index i on both Tt and T2. Find the properties of this index on these tables by Algorithm 10.4.¢ Let these properties be v~ and v2. Compute v -- It(p, v~v2), tObviously, one can find the property of the index by going from its cells on the two tables to the roots of the trees on which the cells are found. However, in order that the table handling as a whole be virtually linear in time, it is necessary that following the path to the root be done in a special way. This method will be described subsequently in Algorithm 10.4.
SEC. 10.2
PROPERTYGRAMMARS
829
where p is production A ~ BD. Make a list of all index cells on the intersection list along with their new properties and the old contents of the cells on T1 and T2. (2) Consider the property header cells of T1. Change the property of the header cell with property v to the property It(p, vvo). That is, assume that all indices with property v on T1 have the neutral property on T2. (3) Consider the property header cells of Tz. Change the property of the header cell with property v to It(p, roy ). That is, assume that all indices with property v on Tz have the neutral property on Ti. (4) Now, several of the property header cells formerly belonging to T1 and T2 may have the same property. These are merged by the following steps, which combine two trees into one" (a) Change the property header cell with the smaller count (break ties arbitrarily) into a dummy index cell not corresponding to any index. (b) Make the new index cell point to the property header cell with the larger count. (c) Adjust the count of the remaining property header cell to be the sum of the counts of the two headers plus I, so that it reflects the number of index cells in the tree, including dummy cells. (5) Now, consider the list of indices created in step (1). For each such index, (a) Create a new index cell C. (b) Place a pointer in the first field of C to the property header cell with the correct property and adjust the count in that header cell. (c) Place pointers in the third field of C to the hash table location for that index and from this hash table entry to C. (d) Place a pointer in the fourth field of C to the first index cell below (on the pushdown list) having the same index. (e) Now, consider C~ and C2, the two original rcells representing this index on T1 and T2. Make C~ .and C2 into "dummy" cells by preserving the pointers in the first field of C~ and C2 (their links to their ancestors in their trees) but by removing the pointers in the third and fourth fields (their links to the hash table and to cells on other tables having the same index). Thus, the newly created index cell C plays the role of the two cells C~ and C2 that have been made dummy cells. (6) Dummy cells that are leaves can be returned to available storage. [1 Example 10.11
Let us consider the two property tables of Example 10.10 (Fig. 10.11 on p. 826). Suppose that It(p, st) is given by the following table:
830
BOOKKEEPING
CHAP. 10 s
t
u(p, st)
VO
Vl
Vl
VO
V2
V2
VO
V3
V3
Vl
VO
Vl
V2
VO
V2
V2
Vl
V3
V2
V3
V2
In part 2 of Algorithm 10.3 we must first consider the intersection list of T1, which consists of index cells 2, 5, and 8. (See Fig. 10.11.) Inspection of the above table shows that these indices will have properties v3, v3, and v2, respectively, on the new property table T. Then, we consider the property header cells of T, and T2. #(p, VoV~)= v,. and It(p, %%)--vi, so the properties in the header cells all remain the same. Now, we merge the tree for v l on T1 into the tree for v~ on Tz, since the latter is the larger. The resulting tree is shown in Fig. 10.12(a). Then, we merge the tree for v2 on Tz into the tree for v2 on T1. The resulting tree is shown in Fig. 10.12(b). It should be observed that in Fig. 10.12(a) node 5
+ (a)
(b) Fig. 10.12
Merged trees.
has been made a direct descendant of the header, while in Fig. 10.11 it was a direct descendant of the node numbered 3. This is an effect of Algorithm 10.4 and occurred when the intersection list of T1 was examined. Node 2 in Fig. 10.12(b) has been moved for the same reason. In the last step, we consider the indices on the intersection list of Ti. New index cells are created for these indices; the new cells point directly to the appropriate header. All other cells for that index in table T are made dummy cells. Dummy ceils with no descendants are then removed. The resulting table T is shown in Fig. 10.13. The symbol table is not shown. Note that the intersection list of T is empty.
PROPERTY GRAMMARS
see. 10.2
831
A
Pushdown List
I
Fig. 10.13 New table T.
Now suppose that an input symbol is shifted onto the pushdown list and reduced to D[2 : v~]. Then the index cell for 2, which points to Va in Fig. 10.13, is linked to the intersection list of its table. The changes are shown in Fig. 10.14. E] We shall now give the algorithm whereby we inquire about the property of index i on table T. This algorithm is used in Algorithm 10.3 to find the property of indices on the intersection list. A
l
D
? 'r
Ol ~.
[
~ dummy /
//
\\ \ /
Fig. 10.14 Tables after shift.
i
/
832
BOOKKEEPING
CHAP. 10
ALGORITHM 10.4 Finding the property of an index. Input. An index cell on some table T. We assume that tables are structured as in Algorithm 10.3. Output. The property of that index in table T. Method. (1) Follow pointers from the index cell to the root of the tree on which it appears. Make a list of all cells encountered on this path. (2) Make each cell on the path, except the root itself, point to the root directly. (Of course, the cell on the path immediately before the root already does so.) [Z]
Note that it is step (2) of Algorithm 10.4, which is essentially a "side effect" of the algorithm, that is of significance. It is step (2) which guarantees that the aggregate time spent on the table handling will be virtually proportional to input length. 10.2.4.
Analysis of the Table Handling Algorithm
The remainder of this chapter is devoted to the analysis of the time complexity of Algorithms 10.3 and 10.4. To begin, we define two functions F and G which will be used throughout this section. DEFINITION We define F(n) by the recurrence"
F(1)-- 1 F(n) -- 2 F('- I )
The following table shows some values of F(n). n 1
2 3 4 5
F(n) 1
2 4 16 65536
Now let us define G(n) to be the least integer i such that F(i) ~ n. G(n) grows so slowly that it is reasonable to say that G(n) ~ 6 for all n which are representable in a single computer word, even in floating-point notation. Alternatively, we could define G(n) to be the number of times we have to apply log 2 to n in order to produce a number equal to or less than 0.
sEc. 10.2
PROPERTYGRAMMARS
833
The remainder of this section is devoted to proving that Algorithm 10.3 requires O(nG(n)) steps of a random access computer when the input string is of length n. We begin by showing that exclusive of the time spent in Algorithm !0.4, the time complexity of Algorithm 10.3 is 0(n). LEMMA 10.3 Algorithm 10.3, exclusive of calls to Algorithm 10.4, can be implemented to run in time 0(n) on a random access computer, where n is the length of the input string being parsed. P r o o f . Each execution of part 1 requires a fixed amount of time, and there are exactly n such executions. In part 2, we note that steps (1) and (5) take time proportional to the length of the intersection list. (Again recall that we are not counting the time spent in calls of Algorithm 10.4.) But the only way for an index cell to find its way onto an intersection list is in part 1. Each execution of part 1 places at most one index cell on an intersection list. Thus, at most n indices are placed on all the intersection lists of all tables. After execution of part 2, all index cells are removed from the intersection list, and so the aggregate amount of time spent in steps (1) and (5) of part 2 is 0(n). The other steps of part 2 are easily seen to require a constant amount of time per execution of part 2. Since part 2 is executed exactly n - 1 times, we conclude that all time spent in Algorithm 10.3, exclusive of calls to Algorithm 10.4, is 0(n). D
We shall now define an abstract problem and provide a solution, mirroring the ideas of Algorithms 10.3 and 10.4 in terms that are somewhat more abstract but easier to analyze. DEFINITION For the remainder of this section let us define a s e t m e r g i n g p r o b l e m as (1) A collection of o b j e c t s a l, (2) A collection of set n a m e s , (3) A sequence of instructions (a) merge(A, B, C) or (b) find(a), where A, B, and C are set
• • . , an;
including A1, A2 . . . . , An; and I~, 12, •. • , Ira, where each I~ is of the form
names and a is an object.
(Think of the set names as pairs consisting of a table and a property. Think of the objects as index cells.) The instruction merge(A, B, C) creates the union of the sets named A and B and calls the resulting set C. No output is generated. The instruction find(a) prints the name of the set of which a is currently a member. Initially, we assume that each object a~ is in the set A~; that is, A~ = {a~}.
834
BOOKKEEPING
CHAP. ]0
The response to a sequence o f instructions I1, I 2 , . . . , Im is the sequence of outputs generated when each instruction is executed in turn.
Example 10.12 Suppose that we have objects a 1, a 2. . . . tions
, a6 and the sequence of instruc-
merge(AI, A 2, A 2) merge(A 3, A 4, A 4) merge(A 5, A 6, A 6) merge(A 2, A 4, A 4) merge(A4, A 6, A 6) find(a3) After executing the first instruction, A 2 is {a l, az}. After the second instruction, A 4 is [a3, a4}. After the third instruction, A 6 is {as, a6}. Then after the instruction merge(A 2, A 4, A4), A4 becomes {al, az, a3, a4}. After the last merge instruction, A 6 = {al, a z . . . . ,a6}. Then, the instruction find(a3) prints the name A 6, which is the response to this sequence of instructions. We shall now give an algorithm to execute any instruction sequence of length 0(n) on n objects in O(nG(n)) time. We shall make certain assumptions about the way in which objects and sets can be accessed. While it may not initially appear that the property g r a m m a r implementation of Algorithms 10.3 and 10.4 meets these conditions, a little reflection will suffice to see that in fact objects (index cells) are easily accessible at times when we wish to determine their property and that sets (property headers) are accessible when we want to merge them. ALGORITHM 10.5
C o m p u t a t i o n of the response to a sequence of instructions.
Input. A collection of objects [a 1. . . . , a,} and a sequence of instructions 11 . . . .
, Im of the type described above.
Output. The response to the sequence 1, . . . . , Ira. Method. A set will be stored as a tree in which each node represents one element of the set. The root of the tree has a label which gives (1) The name of the set represented by the tree and (2) The n u m b e r of nodes in that tree (count). We assume that it is possible to find the node representing an object or the root of the tree representing a set in a fixed number of steps. One way of
SEC. 10.2
PROPERTY GRAMMARS
835
accomplishing this is to use two vectors OBJECT and SET such that OBJECT(a) is a pointer to the node representing a and SET(A) is a pointer to the root of the tree representing set A. Initially, we construct n nodes, one for each object a t. The node for a l is the root of a one-node tree. Initially, this root is labeled Ai and has a count ofl. (1) To execute the instruction merge(A, B, C), locate the roots of the trees for A and B [via SET(A) and SET(B)]. Compare the counts of the trees named A and B. The root of the smaller tree is made a direct descendant of the larger. (Break ties arbitrarily.) The larger root is given the name C, and its count becomes the sum of the counts of A and B.'t Place a pointer in location SET(C) to the root of C. (2) To execute the instruction find(a), determine the node representing a via OBJECT(a). Then follow the path from that node to the root r of its tree. Print the name found at r. Make all nodes on this path, except r, direct descendants of r.:l: Example 10.13
Let us consider the sequence of instructions i n Example 10.12. After executing the first three merge instructions, we would have three trees, as shown in Fig. 10.15. The roots are labeled with a set name and a count. A2
A4
@
A6
Fig. 10.15 Trees after three merge instructions.
(The count is not shown.) Then executing the instruction merge(A 2, A 4, A4), we obtain the structure of Fig. 10.16. After the final merge instruction merge(A4, A6, A6), we obtain Fig. 10.17. Then, executing the instruction find(a3), we print the name A 6 and make nodes a 3 and a4 direct descendants of the root. (a4 is already a direct descendant.) The final structure is shown in Fig. 10.18. [Z] tThe analogy between this step and the merge procedure of Algorithm 10.3 should be obvious. The discrepancy in the way counts are handled has to do simply with the question of whether the root is counted or not. Here it is; in Algorithm 10.3 it was not. ,The analogy to Algorithm 10.4 should be obvious.
i
836
BOOKKEEPING
CHAP. I0
A4
A6
C Fig. 10.16 Trees after next merge instruction.
a 6 a2
Fig. 10.17 Tree after last merge instruction.
A6
Fig. 10.18 Tree after find instruction. We shall now show that Algorithm 10.5 can be executed in O(nG(n)) time on the reasonable assumption that execution of a merge instruction requires one unit of time and an instruction of the form find(a) requires time proportional to the length of the path from the node representing a to the
SEC. 10.2
PROPERTY GRAMMARS
837
root. All subsequent results are predicated on this assumption. From this point on we shall assume that n, the number of objects, has been fixed, and that the sequence of instructions is of length 0(n). DEFINITION We define the rank of a node on one of the structures created by Algorithm 10.5 as follows. (1) A leaf is of rank 0. (2) If a node N ever has a direct descendant of rank i, then N is of rank at least i + 1. (3) The rank of a rtode is the least integer consistent with (2). It may not be immediately apparent that this definition is consistent. However, if node M is made a direct descendant of node N in Algorithm 10.5, then M will never subsequently be given any more direct descendants. Thus the rank of M may be fixed at that time. For example, in Fig. 10.17 the rank of node a 6 can be fixed at 1 since a~ has one direct descendant of rank 0 and a 6 subsequently acquires no new descendants. The next three lemmas state some properties of the rank of a node. LEMMA 10.4 Let N be a root of rank i created by Algorithm 10.5. Then N has at least 2 i descendants.
Proof. The basis, i = 0, is clear, since a node is trivially its own descendant. For the inductive step, suppose node N is a root of rank i. Then, N must have at some time been given a direct descendant M of rank i - 1. Moreover, M must have been made a direct descendant of N in step (1) of Algorithm 10.5, or else the rank of N would be at least i -q- 1. This implies that M was then a root so that, by the inductive hypothesis, M has at least 2 t-1 descendants at that time, and in step (1) of Algorithm 10.5, N has at least 2 i- 1 descendants at that time. Thus, N has at least 2 t descendants after merger. As long as N remains a root, it cannot lose descendants. LEMMA 10.5 At all times during the execution of Algorithm 10.5, if N has a direct descendant M, then the rank of N is greater than the rank of M.
Proof. Straightforward induction on the number of instructions executed.
[~]
The following lemma gives a bound on the number of nodes of rank i. LEMMA 10.6 There are at most n2 -i nodes of rank i.
Proof. The structure created by Algorithm 10.5 is a collection of trees.
838
BOOKKEEPING
CHAP. 10
Thus, no node is a descendant of two different nodes of rank i. Since there are n nodes in the structure, the result follows immediately from Lemma 10.4. [~ COROLLARY
No node has rank greater than logzn.
[~]
DEFINITION
With the number of objects n fixed, define groups of ranks as follows. We say integer i is in group j if and only if logsj) (n) > i • log~zJ+l) (n), where log~z1~ (n) -- log2 n and log~k+l) (n) -- log~zk) (log 2 (n)). That is, log~zk) is the function which applies the log 2 function k times. For example, log~23) (65536) = log~22~(16) = log 2 (4) = 2. A node of rank r is said to be in rank group j if r is in group j. Since log~k) (F(k)) = 1 and no node has rank greater than log 2 n, we note that no node is in a rank group higher than G(n). For example, if n = 65536, we have the following rank groups. Rank of Node 0 1 2 3,4 5,6 . . . . .
16
Rank G ro u p 5 4 3 2 1
We are now ready to prove that Algorithm 10.5 is of time complexity
O(nG(n)). THEOREM 10.3 The cost of executing any sequence a of 0(n) instructions on n objects is
O(nG(n)). Proof Clearly the total cost of the merge instructions in a is 0(n) units of time. We shall account for the lengths of the paths traversed by the find instructions in a in two ways. In executing a find instruction suppose we move from node M to node N going up a path. If M and N are in different rank groups, then we charge one unit of time to the find instruction itself. We also charge 1 if N is the root. Since there are at most G(n) different rank groups along any path, no find instruction is charged more than G(n) units. If, on the other hand, M and N are in the same rank group, and N is not a root, we charge 1 time unit to node M itself. Note that M must be moved
SEC. ] 0 . 2
PROPERTY GRAMMARS
839
in this case. By Lemma 10.5, the new direct ancestor of M is of higher rank than its previous direct ancestor. Thus, if M is in rank group j, M may be charged at most logsj~ (n) time units before its direct ancestor becomes one of a lower rank group. From that time on, M will never be charged; the cost of moving M will be borne by the find instruction executed, as described in the paragraph above. Clearly the charge to all the find instructions is O(nG(n)). To find an upper bound on the total charge t o a l l the objects we sum over all rank groups the maximum charge to each node in the group times the maximum number of nodes in the group. Let g y be the number of nodes in rank group j and cj the charge to all nodes in group j. Then: log2 O) (n)
(10.2.1)
gj ~
~]
n2 -k
k=log2 (j+1) (n)
by Lemma 10.6. The terms of (10.2.1) form a geometric series with ratio 1/2, so their sum is no greater than twice the first term. Thus g j < 2n 2 -l°g~''" (n), which is 2n/logSj> (n). Now cj is bounded above by gj logsj> (n), so cj .~ 2n. Since j may vary only from 1 to G(n), we see that O(nG(n)) units of time are charged to nodes. It follows that the total cost of executing Algorithm 10.5 is
O(nG(n)).
[~]
Now we apply our abstract result to property grammars. THEOREM 10.4 Suppose that the parsing and table-handling mechanism constructed in Algorithms 10.3 and 10.4 is applied to an input of length n. Also, assume that the number of inquiries regarding properties of an index in tables is 0(n) and that these inquiries are restricted to the top table on the pushdown list for which the index has a nonneutral property.t Then the total time spent in table handling on a random access computer is O(nG(n)).
Proof. First, we must observe that the assumptions of our model apply, namely that the time needed in Algorithm 10.3 to reach any node which we want to manipulate is fixed, independent of n. The property headers (roots of trees) can be reached in fixed time since there are a finite number of them per table and they are linked. The index cells (nodes for objects) are directly accessible either from the hash table when we wish to inquire of their properties (this is why we assume that we inquire only of indices on the top table) or in turn as we proceed down an intersection list. tlf we think of the typical use of these properties, e.g., when a is reduced to F in Go and properties of the particular identifier a are desired, we see that the assumption is quite plausible.
840
BOOKKEEPING
CHAP. 10
It thus suffices to show that each index and header cell we create can be modeled as an object node, that there are 0(n) of them, and that all manipulations can be expressed exactly as some sequence of merge and lind instructions. The following is a complete list of all the cells ever created. (1) 2n "objects,' correspond to the n header cells and n index cells created during the shift operation (part 1 of Algorithm 10.3). We can cause the index cell to point to the header cell by a merge operation. (2) At most n objects correspond to the new index cells created in step (5) of part 2 of Algorithm 10.3. These cells can be made to point to the correct root by an appropriate merge operation. Thus, there are at most 3n objects (and n in Algorithm 10.5 means 3n here). Moreover, the number of instructions needed to manipulate the sets and objects when "simulating" Algorithms 10.3 and 10.4 by Algorithm 10.5 is 0(n). We have commented that at most 3n merge instructions suffice to initialize tables after a shift [(!) above] and to attach new index cells to headers [(2) above]. In addition, 0(n) merge instructions suffice in step (4) of part 2 of Algorithm 10.3 when two sets of indices having the same property are merged. This follows from the fact that the number of distinct properties is fixed and that only n -- 1 reductions can be made. Lemma 10.3 implies that 0(n) find instructions suffice to account for the examination of properties of indices on an intersection list [step (1) of part 2 of Algorithm 10.3]. Finally, we assume in the hypothesis of the theorem that 0(n) additional find instructions are needed to determine properties of indices (presumably for use in translation). If we put all these instructions in the order dictated by the parser and Algorithm 10.3, we have a sequence of 0(n) instructions. Thus, the present theorem follows from Lemma 10.3 and Theorem i0.3. [5]
EXERCISES
10.2.1.
Let G be the CFG with productions E-----~ E + TI E@) TI T T- > (E)ia where a represents an identifier, + represents fixed-point addition, and @ represents floating-point addition. Create from G a property grammar that will do the following: (1) Assume that each a has a table with one nonneutral entry. (2) If node n in a parse tree uses production E ~ E + T, then all the a's whose nodes are dominated by n are said to be "used in a fixed-point addition."
EXERCISES
841
(3) If, as in (2), production E ~ E @ T is used, all a's dominated by n are "used in a floating-point addition." (4) The property grammar must parse according to G, but check that an identifier which is used in a floating-point addition is not subsequently (higher up the parse tree) used in a fixed-point addition. 10.2.2.
Use the property grammar of Example 10.9 and give a parse tree, if one exists, for each of the following input strings" (a) begin
declare[ 1 : 1] declare[ 1 : 1] begin a[1:3]
end label[2: 2]a[ 1 : 3]
begin declare[3:1] a[3: 3]
end goto[2: 4] end (b) begin
declare[1 : 1] a[1:3]
begin declare[2:1] goto[1 : 4] end end DEFINITION
A nondeterministic property grammar is defined exactly as a property grammar, except that (1) The range of/.t is the subsets of V, and (2) There is no requirement as to fhe existence of a neutral property.
"10.2.3.
10.2.4.
Using the conventions of this section, we say that ocATp ocX1 T1 . . . X~T~fl if A ---~ X1 . . . X~ is some production, say p, and for each i, T(i) is a member of It(p, Tl(i) . . . T~(i)). The ==~ relation and the language generated are defined exactly as for the deterministic case. Prove that for every nondeterministic property grammar G there is a property grammar G' (possibly without a neutral property) generating the same language, in which/z is a function. S h o w t h a t if the grammar G of Exercise t0.2.3 has a neutral property (i.e., all but a finite number of integers must have that property in each table of each derivation), then grammar G' is a property grammar with the neutral property.
842
BOOKKEEPING
CHAP.
10
10.2.5.
Generalize Algorithm 10.3 to grammars which are not in CNF. Your generalized algorithm should have the same time complexity as the original.
"10.2.6.
Let us modify our property grammar definition by requiring that no terminal symbol have the all-neutral table. If G is such a property grammar, let L'(G) be (al . . . anlalT1 . . . anTn is in L(G) for some (not all-neutral) tables T1 . . . . . Tn}. Show that L'(G) need not be a CFG. Hint: Show that (aibJckli < j < k} can be so generated.
*'10.2.7.
Show that for "property grammar" G, as modified in Exercise 10.2.6, it is undecidable whether L'(G) : ~ , even if the underlying C F G of G is right-linear.
10.2.8.
Let G be the underlying C F G of Example 10.9. Suppose the terminal deelare associates one of two properties with an index--either "declared real" or "declared integer." Define a property grammar on G such that if the implementation of Algorithm 10.3 is used, the highest table on the pushdown list having a particular index i with nonneutral property (i.e., the one with the cell for i pointed to by the hash table entry for i) will have the currently valid declaration for identifier i as the property for i. Thus, the decision whether an identifier is real or integer can be made as soon as its use is detected on the input stream.
10.2.9.
Find the output of Algorithm 10.5 and the final tree structure when given a set of objects ( a l , . . . , alz} and the following sequence of instructions. Assume that in case of tie counts, the root of At becomes a descendant of the root of Aj if i < j. merge(A 1, Az, A 1) merge(A 3, A 1, A 1) merge(A4, A s, A4) merge(A 6, A 4, ,4 4) merge(A 7, A 8, A 7) merge(Ag, AT, AT) merge(A 1o, A 11, A 1o) merge(A12, AIo, Alo) find(a1) merge(A 1, A 4, A I) merge( A 7, A 1o, A 2) fred(aT) merge(A 1, A z, A 1) find(a3) find(a4)
BIBLIOGRAPHIC NOTES
*'10.2.10.
843
Suppose that Algorithm 10.5 were modified to allow either root to be made a descendant of the other when mergers were made. Show that the revised algorithm is of time complexity at best 0(n log n).
Open Problems 10.2.11.
Is Algorithm 10.5 as stated in this book really O(nG(n)) in complexity, or is it 0(n), or perhaps something in between ?
10.2.12.
Is the revised algorithm of Exercise 10.2.10 of time complexity 0(n log n) ?
Research Problem 10.2.13.
Investigate or characterize the kinds of properties of identifiers which can be handled correctly by property grammars.
BIBLIOGRAPHIC
NOTES
Property grammars were first defined by Stearns and Lewis [1969]. Answers to Exercises 10.2.6 and 10.2.7 can be found there. An n log log n method of implementing property grammars is discussed by Stearns and Rosenkrantz [1969]. To our knowledge, Algorithm 10.5 originated with R. Morris and M. D. Mcllroy, but was not published. The analysis of the algorithm is due to Hopcroft and Ullman [1972a]. Exercise 10.2.10 is from Fischer [1972].
11
CODE OPTIMIZATION
One of the most difficult and least understood problems in the design of compilers is the generation of "good" object code. The two most common criteria by which the goodness of a program is judged are its running time and size. Unfortunately, for a given program it is generally impossible to ascertain the running time of the fastest equivalent program or the length of the shortest equivalent program. As mentioned in Chapter 1, we must be content with code improvement, rather than true optimization when programs have loops. Most code improvement algorithms can be viewed as the application of various transformations on some intermediate representation of the source program in an attempt to manipulate the intermediate program into a form from which more efficient object code can be produced. These code improvement transformations can be applied at any point in the compilation process. One common technique is to apply the transformations to the intermediate language program that occurs after syntactic analysis but before code generation. Code improvement transformations can be classified as being either machine-independent or machine-dependent. An example of machine-independent optimization would be the removal of useless statements from a program, those which do not in any way affect its output.t Such machineindependent transformations would be beneficial in all compilers. " Machine-dependent transformations would attempt to transform a program into a form whereby advantage could be taken of special-purpose tSince such statements should not normally appear in a program, it is likely that an error is present, and thus the compiler ought to inform the user of the uselessness of the statement. 844
SEC. 11.1
OPTIMIZATION OF STRAIGHT-LINE CODE
845
machine instructions. As a consequence, machine-dependent transformations are hard to characterize in general, and for this reason they will not be discussed further here. In this chapter we shall study various machine-independent transformations that can be applied to the intermediate programs occurring within a compiler after syntactic analysis but before code generation. We shall begin by showing how optimal code can be generated for a certain simple but important class of straight-line programs. We shall then extend this class of programs to include loops and examine some of the code improvement techniques that can be applied to these programs.
11.1.
OPTIMIZATION OF STRAIGHT-LINE CODE
We shall first consider a program schema that models a block of code consisting of a sequence of assignment statements, each of which has the form A ~ f ( B i , • • •, Br), where A and B l, • • •, Br are scalar variables and f is a function of r variables for some r. For this restricted class of programs, we develop a set of transformations and show how these transformations can be used to find an optimal program under a rather general cost function. Since the actual cost of a program depends on the nature of the machine code that will eventually be produced, we shall consider what portions of the optimization procedure are machine-independent and how the rest depends on the actual machine model chosen. 11.1.1.
A Model of Straight-Line Code
We shall begin by defining a block. A block models a portion of an intermediate language program that contains only multiple-address assignment statements. DEFINITION
Let X be a countable set of variable names and ® a finite set of operators. We assume that each operator 0 in 19 takes a known fixed number of operands. We also assume that 19 and E are disjoint. A statement is a string of the form A + - - - O B 1 "'" B~
where A, B ~ , . . . , B~ are variables in E and 0 is an r-ary operator in ®. We say that this statement assigns (or sets) A and references B ~ , . . . , B~. A block (B is a triple ( P , / , U), where (1) P i s a list of statements S~; $2; • • • ; S,, where n ~> 0; (2) I is a set of input variables; and (3) U is a set of output variables.
846
CODE OPTIMIZATION
CHAP. 11
We shall assume that if statement Sj references A, then A is either an input variable or assigned by some statement before S i (i.e., by some St such that i < j ) . Thus, in a block all variables referenced are previously defined, either internally, as assigned variables, or externally, as input variables. Similarly, we assume each output variable either is an input variable or is set by some statement. A typical statement is A ~ + B C , which is just the prefix form of the more common assignment "A ~ B + C." If a statement sets variable A, we can associate a "value" with that assignment of A. This value is the formula for A in terms of the (unknown) initial values of the input variables. This formula can be written as a prefix expression involving the input variables and the operators. For the time being we are assuming that the input variables have unknown values and should be treated as algebraic unknowns. Moreover, the meaning of the operators and the set of quantities on which they are defined is not specified, and so a formula, rather than a quantity, is all that we can expect as a value. DEFINITION
To be more precise, let (P,/, U) be a block with P = $ 1 ; . . . ;S,. We define vt(A), the value of variable A immediately after time t, 0 < t < n, to be the following prefix expression" (1) If A is in L then vo(A ) = A. (2) If statement St is A ~ OB1 . . . Br, then (a) vt(A ) -- Ov,_~(B~) . . . V t _ l ( B r ) . (b) vt(C) = v~_~(C) for all C ~ A, provided that v,_~(C) is defined. (3) For all A in Z, v,(A) is undefined unless defined by (1) or (2) above. We observe that since each operator takes a known number of operands, every value expression is either a single symbol of Z or can be uniquely written as OEi . . . E r, where 0 is an r-ary operator and E l , . . . , E~ are value expressions. (See Exercise 3.1.17.) The value of block (B -~ (P,/, U), denoted v((B), is the set {v.(A) IA E U, and n is the number of statements of P}
Two blocks are (topologically)equivalent ( ~ ) i f they have the same value. Note that the strings forming prefix expressions are equal if and only if they are identical. That is, we assume no algebraic identities for the time being. In Section 11.1.6 we shall study equivalence of blocks under certain algebraic laws.
SEC. 11.1
O P T I M I Z A T I O N OF S T R A I G H T - L I N E CODE
847
Example 11.1
Let I = [A, B}, let U = {~F, G}, and let P consist o f the statementst T~
-A+B
S<
-A--B
T+--T,T S<
S,S
F<
T+S
G<
T--S
Initially, vo(A ) = A a n d vo(B ) ---B. After the first statement, v l ( T ) -A + B [in prefix n o t a t i o n vl(T) = + A B ] , Vl(A ) -- A, a n d vl(B) -- B. After the second statement, v2(S ) -- A -- B, and other variables retain their previous values. After the third statement, v3(T) -- (A q- B) • (A + B). The last three statements cause the following values to be c o m p u t e d (other values carry over f r o m the previous step): v , ( S ) = (A -- B) • (A -- B) v,(F) = (A + B) • (A + B) + (A -- B) • (A - - B) v~(C) = (A + B) • (A + B) -- (A -- B) • (A -- B)
Since v6(F) = vs(F), the value o f the block is {(A + B) • (A + B) + (A -- B) • (A -- B), (A + B) * (A + B) -- (A - - B ) • ( A - - B)}.~
N o t e that we are assuming that no algebraic laws pertain. If the usual laws o f algebra applied, then we could write F = 2(A z + B z) and G = 4 A B . [---] To avoid u n d u e complexity, we shall not consider statements involving structured variables (arrays a n d so forth) at this time. One way to handle arrays is the following. If A is an array, treat an assignment such as A ( I ) - - J as if A were a scalar variable assigned some function o f / , J and its f o r m e r value. T h a t is, we write A , - - O A I J , where 0 is an o p e r a t o r symbolizing assignment in an array. Similarly, an assignment such as J = A ( I ) could be expressed as J ,--- ~ A L tin displaying a list of statements we often use a new line in place of a semicolon to separate statements. In examples we shall use infix notation for binary operators. ~In prefix notation the value of the block is { ÷ • -t-AB + AB, --AB--AB, -- • q--AB -k-AB, -- AB--AB }.
848
CODE OPTIMIZATION
CHAP.
11
In addition, we make several other assumptions which make this theory less than generally applicable. For example, we ignore test statements, constants, and assignments of the form A ,--- B. However, changing the assumptions will lead to a similar theory, and we make our assumptions primarily for convenience, in order to give an example of this family of theories. Our principal assumptions are" (1) The important thing about a block is the set of functions of the input variables (variables defined outside the block) computed within the block. The number of times a particular function is computed is not important. This philosophy stems from the view that the blocks of a program pass values from one to another. We assume that it is never necessary for a block to pass two copies of a value to another block. (2) The variable names given to the functions computed are not important. This assumption is not a good one if the block is part of a loop and the function computed is fed back. That is, if 1 ~ 1 + 1 is computed, it would not do to change this computation to J ~ I + 1 and then repeat the block, expecting the computation to be the same. Nevertheless, we make this assumption because it lends a certain symmetry to the solution and is often valid. In the Exercises, the reader is asked to make the modifications necessary to allow certain output values to be given fixed names. (3) We do not include statements of the form X ~ Y. If such a statement occurred, we could substitute Y for X and delete it anyway, provided that assumption (2) holds. Again, this assumption lends symmetry to the model, and the reader is asked to modify the theory to include such statements. 11.1.2.
T r a n s f o r m a t i o n s on Blocks
We observe that given two blocks (B1 and (B2, we can test whether (B1 and (B2 are equivalent by computing their values v((Ba) and v((B2) and determining whether v((B1)= v(6~2). However, there are an infinity of blocks equivalent to any given block. For example, if 6~ = (P,/, U) is a block, X is a variable not mentioned in 6~, A is an input variable, and 0 is any operator, then we can append the statement X ~ O A . . . A to P as many times as we choose without changing the value of (B. Under a reasonable cost function, all equivalent blocks are not equally efficient. Given a block 6~, there are various transformations that we can apply to map 6~ into an equivalent, and possibly more desirable block (B'. Let 3 be the set of all transformations which preserve the equivalence of blocks. We shall show that each transformation in 3 can be implemented by a finite sequence of four primitive transformations on blocks. We shall then characterize those sequences of transformations which lead to a block that is optimal under a reasonable cost criterion.
sEc. 11.1
OPTIMIZATIONOF STRAIGHT-LINECODE
849
DEFINITION Let (B = (P, I, U) be a block with P = St; $2; . . . ;S,. For notational uniformity we shall adopt the convention that all members of the input set I are assigned at a zeroth statement, So, and all members of the output set U are referenced at an n + 1st statement, S,+ 1. Variable A is active immediately after time t, if (1) (2) (3) (4)
A is assigned by some statement S~; A is not assigned by statements S~+1, S ~ + 2 , . . . , $i; A is referenced by statement Sj+t; and 0~i~t~j~n.
I f j above is as large as possible, then the sequence of statements S~+~, S~+2, • • •, Sj+~ is said to be the scope of statement St and the scope of this assignment of variable A. If A is an output variable and not assigned after S~, then j -- n -t- 1, and U is also said to be in the scope of S~. (This follows from the above convention; we state it only for emphasis.) If a block contains a statement S such that the variable assigned in S is not active immediately after this statement, then the scope of S is null, and S is said to be a useless statement. Put another way, S is useless if S sets a variable that is neither an output variable nor subsequently referenced. Example 11.2
Consider the following block, where ~, fl, and 7 are lists o f zero or more statements"
A<
B+C
IJ D+--A,E ? If A is not assigned in the sequence of statements fl or referenced in 7, then the scope of A ,-- B + C includes all of fl and the statement D ~ A • E. If no statement in ? references D, and D is not an output variable, then the statement D ~-- A • E is useless. E] We shall now define four primitive equivalence-preserving transformations on blocks. We assume that CB = (P, I, U) is a block and that P is Si ; $2; • • • ; S,. As before, we assume that So assigns all input variables and that S,+, references all output variables. We shall define our transformations in terms of their effect on this block 6~. Our first transformation is intuitively appealing. We remove from a block any input variable or statement that does not affect an output variable.
850
CODE OPTIMIZATION
CHAP.
11
Ti" Elimination of Useless Assignments If statement S i, 0 ~ i ~ n, assigns A, and A is not active after time i, then (1) If i > 0, S,. can be deleted from P, or (2) If i = 0, A can be deleted from L Example 11.3
Let (g = (P, I, U), where I = {A, B, C}, U = {F, G}, and P consists of
F<
A+A
G N and N accumulators are available for this call (i.e., i = 1), or (2) 1 < N and at least 1 accumulators are available for this call (i.e., i 1 and l 2 ~ N. Then S - In} has at least one direct descendant with label ~ N, so ll _~ N. Thus, n is a major node, and S contains at least r -- 1 major nodes.
sac. 11.2
ARITHMETIC EXPRESSIONS
901
Fig. 11.20 Output of Algorithm 11.4. Case 3" n 2 is in S, b u t n l is not. This case is similar to case 2. Case 4: B o t h n 1 a n d n 2 are in S. Let rl o f the direct d e s c e n d a n t s of S with labels at least N be d e s c e n d a n t s o f nl a n d r2 o f t h e m be d e s c e n d a n t s o f n 2. T h e n r l q- r2 = r. By the inductive hypothesis, the p o r t i o n s o f S in T 1 a n d T2 have, respectively, at least r l -- 1 a n d at least r2 -- 1 m a j o r nodes. If neither r 1 n o r r 2 is zero, then ll _~ N a n d 12 ~ N, a n d so n is major. Thus, S has at least (r 1 -- 1) + (r~ -- 1) -q- 1 = r -- 1 m a j o r nodes. If rl = 0, t h e n r2 = r, a n d so the p o r t i o n o f S in Tz has at least r -- 1 m a j o r nodes. T h e case r2 = 0 is a n a l o g o u s . [~]
THEOREM 11.9 A l g o r i t h m 11.4 p r o d u c e s a tree in [T]~, t h a t has the least cost. Proof. A s t r a i g h t f o r w a r d i n d u c t i o n on the n u m b e r o f n o d e s in a n associative tree A shows t h a t the result o f a p p l y i n g p r o c e d u r e a c o m m u t e to its r o o t
902
CODEOPTIMIZATION
CHAP. 11
is a syntax tree T whose root after applying the labeling Algorithm 11.1 has the, same label as the root of A. No tree in [T]a has a root with label smaller than the label of the root of A, and no tree in [T]a has fewer major or minor nodes. Suppose otherwise. Then let T be a smallest tree violating one of those conditions. Let 0 be the operator at the root of T. Case 1" 0 is neither associative nor commutative. Every associative or commutative transformation on T must take place wholly within the subtree dominated by one of the two direct descendants of the root of T. Thus, whether the violation is on the label, the number of major nodes, or the number of minor nodes, the same violation must occur in one of these subtrees, contradicting the minimality of T. Case 2 : 0 is commutative but not associative. This case is similar to case 1, except that now the commutative transformation may be applied to the root. Since step (1) of procedure aeommute takes full advantage of this transformation, any violation by T again implies a violation in one of its subtrees. Case 3 : 0 is commutative and associative. Let S be the cluster containing the root. We may assume that no violation occurs in any of the subtrees whose roots are the direct descendants of S. Any application of an associative or commutative transformation must take place wholly within one of these subtrees, or wholly within S. Inspection of the result of step (2) of procedure aeommute assures us that the number of minor nodes resulting from cluster S is minimized. By Lemma 11.12, the number of major nodes resulting from S is as small as possible (inspection of the result of procedure aeommute is necessary to see this), and hence the label of the root is as small as possible. Finally we observe that the alterations made by Algorithm 11.4 can always be accomplished by applications of the associative and commutative transformations. IS]
EXERCISES
11,2.1. Assuming that no algebraic laws prevail, find an optimal assembly program with the number of accumulators N = 1, 2, and 3 for each of the following expressions: (a) A -- B , C -- D , (E + F). (b) A + (B -}- (C , (D -+- E / F-k- G) , H)) + ( I - ~ J). (c) (A • (B -- C)) • (D • (E • F)) + ((G + (H 4- I)) 4- (J + (K + L))), Determine the cost of each program found. 11.2.2.
Repeat Exercise 11.2.1 assuming that + and • are commutative.
11.2.3. Repeat Exercise 11.2.1 assuming that + and • are both associative and commutative.
EXERCISES
903
11.2.4.
Let E be a binary expression with k operators. What is the maximum number of parentheses required to express E without using any unnecessary parentheses ?
"11.2.5.
Let T be a binary syntax tree whose root is labeled N > 2 after applying Algorithm 11.1. Show that T contains at least 3 × 2N-2 - - 1 interior nodes.
11.2.6.
Let T be a binary expression tree with k interior nodes. Show that T can have at most k minor nodes.
"11.2.7.
Given N > 2, show that a tree with M major nodes has at least 3(M + 1)2N- 2 _ 1 interior nodes.
'11.2.8.
What is the maximum cost of a binary syntax tree with k nodes ?
"11.2.9.
What is the maximum saving in the cost of a binary syntax tree of k nodes in going from a machine with N accumulators to one with N + 1 accumulators ?
*'11.2.10.
Let ~ be an arbitrary set of algebraic identities. Is it decidable whether two binary syntax trees are equivalent under tZ ?
11.2.11.
For Algorithm 11.2 define the procedure code(n, [il, i2 . . . . . ik]) to compute the value of node n with accumulators Ate, A t , , . . . , Ate, leaving the result in accumulator Ai,. Show that by making step (5) to be code(n, [il, i2 . . . . .
ik]) =
code(n2, [i2, il, i3, • • • , ik])
code(n1, [ i l , 'OP0
i3, i 4 , .
• • , ik])
All, At~, At~'
Algorithm 11.2 can be modified so that assembly language instructions of types (3) and (4) are only of the form OP0 11.2.12.
A,B,A
Let !l
if b > 0
sign(a, b) :
if b = 0 [al
if b < 0
Show that sign is associative but not commutative. Give examples of other operators which are associative but not commutative. "11.2.13.
The instructions LOAD
M, A
STORE
A, M
oPO
A,M,B
904
CODE OPTIMIZATION
CHAP. 11
each use one memory reference. If B is an accumulator, then 0P 0 A, B, C uses none. Find the minimum number of storage references generated by a program that computes a binary syntax tree with k nodes.
"11.2.14.
Show that Algorithm 11.2 produces a program which requires the fewest number of memory references to compute a given expression.
"11.2.15.
Taking Go as the underlying grammar, construct a syntax-directed translation scheme which translates infix expressions into optimal assembly language programs, assuming N accumulators and that (a) No algebraic identities hold, (b) + and • are commutative, and (c) ÷ and • are associative and commutative.
11.2.16.
Give algorithms to implement the procedures code, commute, and acommute in linear time.
"11.2.17.
Certain operators may require extra accumulators (e.g., subroutine calls or multiple precision arithmetic). Modify Algorithms 11.1 and 11.2 to take into account the possible need for extra accumulators by operators.
11.2.18.
Algorithms 11.1-11.4 can also be applied to arbitrary binary dags if we first convert a dag into a tree by duplicating nodes. Show that now Algorithm 11.2 will not always generate an optimal program. Estimate how bad Algorithm 11.2 can be under these conditions.
11.2.19.
Generalize Algorithms 11.1-11.4 to work on expressions involving operators with arbitrary numbers of arguments.
"11.2.20.
Arithmetic expressions can have unary q- and unary -- operators. Construct an algorithm to generate optimal code for arithmetic expressions with unary q-- and --. [Assume that all operands are distinct and that the usual algebraic laws relating unary ÷ and -to the four binary arithmetic operators apply.]
"11.2.21.
Construct an algorithm to generate optimal code for a single arithmetic expression in which each operand is a distinct variable or an integer constant. Assume the associative and commutative laws for ÷ and • as well as the following identities:
(1) a + O : O + a = O . (2) a . 1 = 1 . a =t~. (3) c1 0 cz = c3, where c3 is the integer that is the result of applying the operator/9 to integers c1 and cz. 11.2.22.
Find an algorithm to generate optimal code for a single Boolean expression in which each operand is a distinct variable or a Boolean constant (0 or 1). Assume that Boolean expressions involve the operators and, or, and not and that these operators satisfy the laws of Boolean algebra. (See p. 23 of Volume I.)
EXERCISES
*'11.2.23.
905
In certain situations two or more operations can be executed in parallel. The algorithms in Sections 11.1 and 11.2 assume serial execution. However, if we have a machine capable of performing parallel operations, then we might attempt to arrange the order of execution to create as many simultaneous parallel computations as possible. For example, suppose that we have a four-register machine in which four operators can be simultaneously executed. Then the expression A i + A2 q- A3 -{- A4 nu A5 q- A6 q- A7 nu As can be done as shown in Fig. 11.21. In the first step we would load A a into the first register,
step 4
step 3
step 2
step 1 Fig. 11.21 Tree for parallel computation.
A3 into the second, A5 into the third, and A7 into the fourth. In the second step we would add A2 to register 1, A4 to register 2, A6 to register 3, and A8 to register 4. After this step register 1 would contain A1 + A2, register 2 would contain A3 + A4, and so forth. At the third step we would add register 2 to register 1 and register 4 to register 3. At the fourth step we would add register 2 to register 1. Define an N-register machine in which up to N parallel operations can be executed in one step. Assuming this machine, modify Algorithm 11.1 to generate optimal code (in the sense of fewest steps) for single arithmetic expressions with distinct operands. Research Problem
11.2.24.
Find an efficient algorithm that will generate optimal code of the type mentioned in this section for an arbitrary block.
Programming Exercise
11.2.25.
Write programs to implement Algorithms 11.1-11.4.
906
CHAP. 11
CODE OPTIMIZATION
BIBLIOGRAPHIC
NOTES
Many papers have been written on the generation of good code for arithmetic expressions for a specific machine or class of machines. Floyd [1961a] discusses a number of optimizations involving arithmetic expressions including detection of common subexpressions. He also suggested that the second operand of a noncommutative binary operator be evaluated first. Anderson [1964] gives an algorithm for generating code for a one-register machine that is essentially the same as the code produced by Algorithm 11.1 when N = 1. Nakata [1967] and Meyers [1965] give similar results. The number of registers required to compute an expression tree has been investigated by Nakata [1967], Redziejowski [1969], and Sethi and Ullman [1970]. Algorithms 11.1-11.4 as presented here were developed by Sethi and Ullman [1970]. Exercise 11.2.11 was suggested by P. Stockhausen. Beatty [1972] and Frailey [1970] discuss extensions involving the unary minus operator. An extension of Algorithm 11.2 to certain dags was made by Chen [1972]. There are no known efficient algorithms for generating optimal code for arbitrary expressions. One heuristic technique for making register assignments in a sequence of expression evaluations is to use the following algorithm. Suppose that expression tz is to be computed next and its value stored in a fast register (accumulator). (1) If the value Of ~ is already stored in some register i, then do not recompute ~. Register i is now "in use." (2) If the value of tz is not in any register, store the value of ~ in the next unused register, say register j. Register j is now in use. If there is no unused register available, store the contents of some register k in main memory, and store the value of g in register k. Choose register k to be that register whose value will be unreferenced for the longest time. Belady [1966] has shown that this algorithm is optimal in some situations. However, the model assumed by this algorithm (which was designed for paging) does not exactly model straight line code. In particlar, it assumes the order of computation to be fixed, while as we have seen in Sections 11.1 and 11.2, there is often much advantage to be had by reordering computations. A similar register allocation problem is discussed by Horwitz et al. [1966]. They assume that we are given a sequence of operations which reference and change values. The problem is to assign these values to fast registers so that the number of loads and stores from the fast registers to main memory is minimized. Their solution is to select a least-cost path in a dag of possible solutions. Techniques for reducing the size of the dag are given. Further investigation of register allocation where order of computation is not fixed has been done by Kennedy [1972] and Sethi [1972]. Translating arithmetic expressions into code for parallel computers is discussed by Allard et al. [1964], Hellerman [1966], Stone [1967], and Baer and Bovet [1968].
sac. 11.3
PROGRAMS W I T H LOOPS
907
The general problem of assigning tasks optimally to parallel processors is very difficult. Some interesting aspects of this problem are discussed by Graham [1972].
11,3.
PROGRAMS WITH LOOPS
When we consider programs that contain loops, it becomes literally impossible to mechanically optimize such programs. Most of the difficulty stems from undecidability results. Given two arbitrary programs there is no algorithm to determine whether they are equivalent in any worthwhile sense. As a consequence there is no algorithm which will find an optimal program equivalent to given program under an arbitrary cost criterion. These results are understandable when we realize that, in general, there are arbitrarily many ways to compute the same function. Thus, there is an infinity of algorithms that can be used to implement the function defined by the source program. If we want true optimization, a compiler would have to determine the most efficient algorithm for the function computed by the source program, and then it would have to generate the most efficient code for this algorithm. Needless to say, from both a theoretical and a practical point of view, optimizing compilers of this nature do not exist. However, in many situations there are a number of transformations that can be applied to a program to reduce the size and/or increase the speed of the resulting object language program. In this section we shall investigate several such transformations. Through popular usage, transformations of this nature have become known as "optimizing" transformations. A more accurate term would be "code-improving" transformations. However, we shall bow to tradition and use the more popular, but less accurate term "optimizing transformation" for the remainder of this book. Our primary goal will be to reduce the running time of the object language program. We begin by defining intermediate programs with loops. These programs will be very primitive, so that we can present the essential concepts without going into a tremendous amount of detail. Then we define a flow graph for a program. The flow graph is a two-dimensional representation of a program that displays the flow of control between the basic blocks of a program. A two-dimensional structure usually gives a more accurate representation of a program than a linear sequence of statements. In the remainder of this section we shall describe some important transformations that can be applied to a program in an attempt to reduce the running time of the object program. In the next section we shall look at ways of collecting the information needed to apply some of the transformations presented in this section.
11.3.1. The Program Model
We shall use a representation for programs that is intermediate between source language and assembly language. A program consists of a sequence of statements. Each statement may be labeled by an identifier followed by a colon. There will be five basic types of statements: assignment, goto, conditional, input-output, and halt.

(1) An assignment statement is a string of the form A ← θB1 ⋯ Br, where A is a variable, B1, …, Br are variables or constants, and θ is an r-ary operator. As in the previous sections, we shall usually use infix notation for binary operators. We also allow a statement of the form A ← B in this category.

(2) A goto statement is a string of the form goto ⟨label⟩, where ⟨label⟩ is a string of letters. We shall assume that if a goto statement is used in a program, then the label following the word goto appears as the label of a unique statement in the program.

(3) A conditional statement is of the form if A ⟨relation⟩ B goto ⟨label⟩, where A and B are variables or constants and ⟨relation⟩ is a binary relation such as <, ≤, =, and ≠.

(4) An input-output statement is either a read statement of the form

        read A

where A is a variable, or a write statement of the form

        write B

where B is a variable or a constant. For convenience we shall use the statement

        read A1, A2, …, An

to denote the sequence of statements

        read A1
        read A2
            ⋮
        read An

We shall use a similar convention for write statements.
(5) Finally, a halt statement is the instruction halt.

The intuitive meaning of each type of statement should be evident. For example, a conditional statement of the form if A r B goto L means that if the relation r holds between the current values of A and B, then control is transferred to the statement labeled L. Otherwise, control passes to the following statement.

A definition statement (or definition for short) is a statement of the form read A or of the form A ← θB1 ⋯ Br. Both statements are said to define the variable A.

We shall make some further assumptions about programs. Variables are simple variables, e.g., A, B, C, …, or simple variables indexed by one simple variable or constant, e.g., A(1), A(2), A(I), or A(J). Further, we shall assume that all variables referenced in a program must either be input variables (i.e., appear in a previous read statement) or have been previously defined by an assignment statement. Finally, we shall assume that each program has at least one halt statement and that if a program terminates, then the last statement executed is a halt statement. Execution of a program begins with the first statement of the program and continues until a halt statement is encountered. We suppose that each variable is of known type (e.g., integer, real) and that its value at any time during the execution is either undefined or a quantity of the appropriate type. (It will be assumed that all operators used are appropriate to the types of the variables to which they apply and that conversion of types occurs when appropriate.)

In general the input variables of a program are those variables associated with read statements, and the output variables are the variables associated with write statements. An assignment of a value to each input variable each time it is read is called an input setting. The value of a program under an input setting is the sequence of values written by the output variables during the execution of the program. We say that two programs are equivalent if for each input setting the two programs have the same value.† This definition of equivalence is a generalization of the definition of equivalent blocks used in Section 11.1. To see this, suppose that two blocks ℬ1 = (P1, I1, U1) and ℬ2 = (P2, I2, U2) are equivalent in the sense of Section 11.1. We convert ℬ1 and ℬ2 into programs 𝒫1 and 𝒫2 in the obvious way.
That is, we place read statements for the variables in I1 and I2 in front of P1 and P2, respectively, and place write statements for the variables in U1 and U2 after P1 and P2. Then we append a halt statement to each program. However, we must add the write statements to P1 and P2 in such a fashion that each output variable is printed at least once and that the sequences of values printed will be the same for both 𝒫1 and 𝒫2. Since ℬ1 is equivalent to ℬ2, we can always do this. The programs 𝒫1 and 𝒫2 are easily seen to be equivalent no matter what the space of input settings is and no matter what interpretation is placed on the functions represented by the operators appearing in 𝒫1 and 𝒫2. For example, we could choose the set of prefix expressions for the input space and interpret an application of operator θ to expressions e1, …, er to yield θe1 ⋯ er.

However, if ℬ1 and ℬ2 are not equivalent and 𝒫1 and 𝒫2 are programs that correspond to ℬ1 and ℬ2, respectively, then there will always be a set of data types for the variables and interpretations for the operators that causes 𝒫1 and 𝒫2 to produce different output sequences. In particular, let the variables have prefix expressions as a "type" and let the effect of operator θ on prefix expressions e1, e2, …, ek be the prefix expression θe1e2 ⋯ ek. Of course, we may make assumptions about data types and the algebra connected with the function and relation symbols that will cause 𝒫1 and 𝒫2 to be equivalent. In that case ℬ1 and ℬ2 will be equivalent under the corresponding set of algebraic laws.

†We are assuming that the meaning of each operator and relational symbol, as well as the data type of each variable, is established. Thus, our notion of equivalence differs from that of the schematologists (see, for example, Paterson [1968] or Luckham et al. [1970]), in that they require two programs to give the same value not only for each input setting, but for each data type for the variables and for each set of functions and relations that we substitute for the operators and relational symbols.

Example 11.26
Consider the following program for the Euclidean algorithm described on p. 26 (Volume I). The output is to be the greatest common divisor of two positive integers p and q.

        read p
        read q
loop:   r ← remainder(p, q)
        if r = 0 goto done
        p ← q
        q ← r
        goto loop
done:   write q
        halt

If, for example, we assign the input variables p and q the values 72 and 56, respectively, then the output variable q in the write statement will have value
8 when that statement is executed with the normal interpretation of the operators. Thus, the value of this program for the input setting p = 72, q = 56 is the "sequence" 8, which is generated by the output variable q. If we replace the statement goto loop by if q ≠ 0 goto loop, we have an equivalent program. This follows because the statement goto loop cannot be reached unless the fourth statement finds r ≠ 0. Since q is given the value of r at the sixth statement, it is not possible that q = 0 when the seventh statement is executed. □

It should be observed that the transformations which we may apply are to a large extent determined by the algebraic laws which we assume hold.

Example 11.27
For some types of data we might assume that a * a = 0 if and only if a = 0. If we assume such a law, then the following program is equivalent to the one in Example 11.26:

        read p
        read q
loop:   r ← remainder(p, q)
        t ← r * r
        if t = 0 goto done
        p ← q
        q ← r
        goto loop
done:   write q
        halt

Of course, this program would not be more desirable in any circumstance we can think of. However, without the law stated above, this program and the one in Example 11.26 might not be equivalent. □

Given a program P, our goal is to find an equivalent program P′ such that the expected running time of the machine language version of P′ is less than that of the machine language version of P. A reasonable approximation of this goal is to find an equivalent program P″ such that the expected number of machine language instructions to be executed by P″ is less than the number of instructions executed by P. The latter goal is an approximation in that not every machine instruction requires the same amount of machine time to be executed. For example, an operation such as multiplication or division usually requires more time than an addition or subtraction.
However, initially we shall concentrate on reducing the number of machine language instructions that need to be executed. Most programs contain certain sequences of statements which are executed considerably more often than the remaining statements in the program. Knuth [1971] found that in a large sample of FORTRAN programs, a typical program spent over one-half of its execution time in less than 4% of the program. Thus, in practice it is often sufficient to apply the optimization procedures only to these heavily traveled regions of a program. Part of the optimization may involve moving statements from heavily traveled regions to lightly traveled ones, even though the actual number of statements in the program itself remains the same or even increases. We can often deduce what the most frequently executed parts of a source program will be and pass this information along to the optimizing compiler together with the source program. In other cases it is relatively easy to write a routine that will count the number of times a given statement in a program is executed as the program is run. With these counts we can obtain the "frequency profile" of a program, to determine those parts of the program on which we should concentrate our optimization effort.
11.3.2. Flow Analysis
Our first step in optimizing a program is to determine the flow of control within the program. To do this, we partition a program into groups of statements such that no transfer occurs into a group except to the first statement in that group, and once the first statement is executed, all statements in the group are executed sequentially. We shall call such a group of statements a basic block, or block if no confusion with the term "block" in the sense of Section 11.1 arises.

DEFINITION

A statement S in a program P is a basic block entry if
(1) S is the first statement in P, or
(2) S is labeled by an identifier which appears after goto in a goto or conditional statement, or
(3) S is a statement immediately following a conditional statement.

The basic block belonging to a block entry S consists of S and all statements following S
(1) up to and including a halt statement, or
(2) up to but not including the next block entry.

Notice that the program constructed from a block in the sense of Section 11.1 will be a basic block in the sense of this section.
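A minimal sketch of this partitioning, assuming each statement is represented as a (label, kind) pair and that the set of goto targets has been collected beforehand; the representation is illustrative, not from the text.

    def block_entries(statements, goto_targets):
        # statements: list of (label, kind) pairs, kind in
        # {'assign', 'goto', 'cond', 'read', 'write', 'halt'}
        entries = set()
        for i, (label, kind) in enumerate(statements):
            if i == 0:                               # rule (1): first statement
                entries.add(i)
            if label in goto_targets:                # rule (2): target of a goto or conditional
                entries.add(i)
            if i > 0 and statements[i - 1][1] == 'cond':
                entries.add(i)                       # rule (3): follows a conditional
        return sorted(entries)

    def basic_blocks(statements, goto_targets):
        entries = block_entries(statements, goto_targets)
        blocks = []
        for j, start in enumerate(entries):
            end = entries[j + 1] if j + 1 < len(entries) else len(statements)
            block = []
            for i in range(start, end):
                block.append(i)
                if statements[i][1] == 'halt':       # a halt also ends the block
                    break
            blocks.append(block)
        return blocks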
Example 11.28
Consider the program of Example 11.26. There are four block entries, namely the first statement in the program, the statement labeled loop, the assignment statement p ← q, and the statement labeled done. Thus, there are four basic blocks in the program. These blocks are given below:

Block 1:
        read p
        read q

Block 2:
loop:   r ← remainder(p, q)
        if r = 0 goto done

Block 3:
        p ← q
        q ← r
        goto loop

Block 4:
done:   write q
        halt
□
From the blocks of a program we can construct a graph that resembles the familiar flow chart for the program.

DEFINITION
A flow graph is a labeled directed graph G containing a distinguished node n such that every node in G is accessible from n. Node n is called the begin node. A flow graph of a program is a flow graph in which each node of the graph corresponds to a block of the program. Suppose that nodes i and j of the flow graph correspond to blocks i and j of the program. Then an edge is drawn from node i to node j if
(1) the last statement in block i is not a goto or halt statement and block j follows block i in the program, or
(2) the last statement in block i is goto L or if ⋯ goto L and L is the label of the first statement of block j.
The node corresponding to the block containing the first statement of the program is the begin node.

Clearly, any block that is not accessible from the begin node can be removed from a given program without changing its value. From now on we shall assume that all such blocks have been removed from each program under consideration.
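A minimal sketch of this edge construction, under the assumption that each block records the label of its first statement and the kind and target of its last statement; names are illustrative.

    def flow_graph_edges(blocks):
        # blocks: list of dicts with keys 'label' (label of first statement, or None),
        # 'last_kind' ('goto', 'cond', 'halt', or other), 'last_target' (label or None)
        label_to_index = {b['label']: i for i, b in enumerate(blocks) if b['label']}
        edges = set()
        for i, b in enumerate(blocks):
            # rule (1): fall through to the next block in program order
            if b['last_kind'] not in ('goto', 'halt') and i + 1 < len(blocks):
                edges.add((i, i + 1))
            # rule (2): explicit transfer by a goto or conditional statement
            if b['last_kind'] in ('goto', 'cond'):
                edges.add((i, label_to_index[b['last_target']]))
        return edges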
Example 11.29

The flow graph for the program of Example 11.26 is given in Fig. 11.22. Block 1 is the begin node. □

[Fig. 11.22 Flow graph. Block 1 (read p; read q) → Block 2 (loop: r ← remainder(p, q); if r = 0 goto done); Block 2 → Block 3 (p ← q; q ← r; goto loop) and Block 4 (done: write q; halt); Block 3 → Block 2.]
Many optimizing transformations on programs require knowing the places in a program at which a variable is defined and where that definition is subsequently referenced. These definition-reference relationships are determined by the sequences of blocks that can actually be executed. The first block in such a sequence is the begin node, and each subsequent block must have an edge from the previous block. Sometimes the predicates used in the conditional statements may preclude some paths in the flow graph from being executed. However, there is no algorithm to detect all such situations, and we shall assume that no paths are precluded from execution.

It is also convenient to know for a block ℬ whether there is another block ℬ' such that each time ℬ is executed, ℬ' was previously executed. One application of this knowledge is that if the same value is computed in both ℬ and ℬ', then we can store this value after it is computed in ℬ' and thus avoid recomputing the same value in ℬ. We now develop these ideas formally.

DEFINITION
Let F be a flow graph whose blocks have names chosen from a set A. A sequence of blocks ℬ1 ⋯ ℬn in A* is a (block) computation path of F if
(1) ℬ1 is the begin node of F, and
(2) for 1 < i ≤ n, there is an edge from block ℬi−1 to ℬi.
In other words, a computation path ℬ1 ⋯ ℬn is a path from ℬ1 to ℬn in F such that ℬ1 is the begin node. We say that block ℬ' dominates ℬ if ℬ' ≠ ℬ and every path from the begin node to ℬ contains ℬ'. We say that ℬ' directly dominates ℬ if
(1) ℬ' dominates ℬ, and
(2) if ℬ'' dominates ℬ and ℬ'' ≠ ℬ', then ℬ'' dominates ℬ'.
Thus, block ℬ' directly dominates ℬ if ℬ' is the block "closest" to ℬ which dominates ℬ.

Example 11.30
Referring to Fig. 11.22, the sequence 1 2 3 2 3 2 4 is a computation path. Block 1 directly dominates block 2 and dominates blocks 3 and 4. Block 2 directly dominates blocks 3 and 4. □

Here are some algebraic properties of the dominance relation.

LEMMA 11.13
(1) If ℬ1 dominates ℬ2 and ℬ2 dominates ℬ3, then ℬ1 dominates ℬ3 (transitivity).
(2) If ℬ1 dominates ℬ2, then ℬ2 does not dominate ℬ1 (asymmetry).
(3) If ℬ1 and ℬ2 dominate ℬ3, then either ℬ1 dominates ℬ2 or conversely.
Proof. (1) and (2) are exercises. We shall prove (3). Let 𝒞1 ⋯ 𝒞nℬ3 be any computation path with no cycles (i.e., 𝒞i ≠ ℬ3, and 𝒞i ≠ 𝒞j if i ≠ j). One such path exists, since we assume that all nodes are accessible from the begin node. By hypothesis, 𝒞i = ℬ1 and 𝒞j = ℬ2 for some i and j. Assume without loss of generality that i < j. Then we claim that ℬ1 dominates ℬ2. In proof, suppose that ℬ1 did not dominate ℬ2. Then there is a computation path 𝒟1 ⋯ 𝒟mℬ2 in which none of 𝒟1, …, 𝒟m is ℬ1. It follows that 𝒟1 ⋯ 𝒟mℬ2𝒞j+1 ⋯ 𝒞nℬ3 is also a computation path. But none of the symbols preceding ℬ3 is ℬ1, contradicting the hypothesis that ℬ1 dominates ℬ3. □

LEMMA 11.14
Every block except the begin node (which has no dominators) has a unique direct dominator.
Proof. Let 𝒮 be the set of blocks that dominate some block ℬ. By Lemma 11.13 the dominance relation is a (strict) linear order on 𝒮. Thus, 𝒮 has a minimal element, which must be the direct dominator of ℬ. (See Exercise 0.1.23.) □
We now give an algorithm to compute the direct dominance relation for a flow graph.

ALGORITHM 11.5
Computation of direct dominance.
Input. A flow graph F with A = {ℬ1, ℬ2, …, ℬn}, the set of blocks in F. We assume that ℬ1 is the begin node.
Output. DOM(ℬ), the block that is the direct dominator of block ℬ, for each ℬ in A other than the begin node.

Method. We compute DOM(ℬ) recursively for each ℬ in A − {ℬ1}. At any time, DOM(ℬ) will be the block closest to ℬ found so far to dominate ℬ. Ultimately, DOM(ℬ) will be the direct dominator of ℬ. Initially, DOM(ℬ) is ℬ1 for all ℬ in A − {ℬ1}. For i = 2, 3, …, n, do the following two steps:
(1) Delete block ℬi from F. Using Algorithm 0.3, find each block ℬ which is now inaccessible from the begin node of F. Block ℬi dominates ℬ if and only if ℬ is no longer accessible from the begin node when ℬi is deleted from F. Restore ℬi to F.
(2) Suppose that it has been determined in step (1) that ℬi dominates ℬ. If DOM(ℬ) = DOM(ℬi), set DOM(ℬ) to ℬi. Otherwise, leave DOM(ℬ) unchanged.
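A minimal sketch of Algorithm 11.5, assuming the flow graph is given as an adjacency mapping with block 0 as the begin node; the reachability search stands in for Algorithm 0.3, and all names are illustrative.

    def reachable(succ, start, deleted):
        # blocks reachable from `start` when block `deleted` is removed from the graph
        seen, stack = set(), [start]
        while stack:
            n = stack.pop()
            if n in seen or n == deleted:
                continue
            seen.add(n)
            stack.extend(succ.get(n, []))
        return seen

    def direct_dominators(succ, nodes, begin=0):
        # nodes are processed in the given order, as in steps (1) and (2) of the algorithm
        dom = {b: begin for b in nodes if b != begin}    # initially DOM(B) is the begin node
        for bi in nodes:
            if bi == begin:
                continue
            alive = reachable(succ, begin, deleted=bi)   # step (1): delete bi, test accessibility
            for b in nodes:
                if b in (begin, bi) or b in alive:
                    continue                             # bi does not dominate b
                if dom[b] == dom[bi]:                    # step (2): move DOM(b) down to bi
                    dom[b] = bi
        return dom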
Example 11.31

Let us compute the direct dominators for the flow graph of Fig. 11.22 using Algorithm 11.5. Here A = {ℬ1, ℬ2, ℬ3, ℬ4}. The successive values of DOM(ℬ) after considering ℬi, 2 ≤ i ≤ 4, are given below:

    i          DOM(ℬ2)    DOM(ℬ3)    DOM(ℬ4)
    Initial    ℬ1         ℬ1         ℬ1
    2          ℬ1         ℬ2         ℬ2
    3          ℬ1         ℬ2         ℬ2
    4          ℬ1         ℬ2         ℬ2

Let us compute line 2. Deleting block ℬ2 makes blocks ℬ3 and ℬ4 inaccessible. We have thus determined that ℬ2 dominates ℬ3 and ℬ4. Prior to this point, DOM(ℬ2) = DOM(ℬ3) = ℬ1, and so by step (2) of Algorithm 11.5 we set DOM(ℬ3) to ℬ2. Likewise, DOM(ℬ4) is set to ℬ2. Deleting block ℬ3 or ℬ4 does not make any block inaccessible, so no further changes occur. □

THEOREM 11.10
When Algorithm 11.5 terminates, DOM(ℬ) is the direct dominator of ℬ.
Proof. We first observe that step (1) correctly determines those ℬ's dominated by ℬi, for ℬi dominates ℬ if and only if every path to ℬ from the begin node of F goes through ℬi.

We show by induction on i that after step (2) is executed, DOM(ℬ) is that block ℬh, 1 ≤ h ≤ i, which dominates ℬ but which, in turn, is dominated by all ℬj's, 1 ≤ j ≤ i, which also dominate ℬ. That such a ℬh must exist follows directly from Lemma 11.13.

The basis, i = 2, is trivial. Let us turn to the inductive step. If ℬi+1 does not dominate ℬ, the conclusion is immediate from the inductive hypothesis. If ℬi+1 does dominate ℬ, but there is some ℬj, 1 ≤ j ≤ i, such that ℬj dominates ℬ and ℬi+1 dominates ℬj, then DOM(ℬ) ≠ DOM(ℬi+1). Thus, DOM(ℬ) does not change, which correctly fulfills the inductive hypothesis. If ℬi+1 dominates ℬ but is dominated by all ℬj's which dominate ℬ, 1 ≤ j ≤ i, we claim that prior to this step, DOM(ℬ) = DOM(ℬi+1). For if not, there must be some ℬk, 1 ≤ k ≤ i, which dominates ℬi+1 but not ℬ, which is impossible by Lemma 11.13(1). Thus DOM(ℬ) is correctly set to ℬi+1, completing the induction. □

We observe that if F is constructed from a program, then the number of edges is at most twice the number of blocks. Thus, step (1) of Algorithm 11.5 takes time proportional to the square of the number of blocks. The space taken is proportional to the number of blocks. If ℬ1, ℬ2, …, ℬn are the blocks of a program (except for the begin block), then we can store the direct dominators of these blocks as the sequence 𝒞1, 𝒞2, …, 𝒞n, where 𝒞i is the direct dominator of ℬi, for 1 ≤ i ≤ n. All dominators for block ℬi can be recovered from this sequence easily by finding DOM(ℬi), DOM(DOM(ℬi)), and so forth until we reach the begin block.
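A small sketch of this recovery, assuming dom maps each block to its direct dominator as computed above; names are illustrative.

    def all_dominators(block, dom, begin=0):
        # Follow DOM(B), DOM(DOM(B)), ... back to the begin block.
        result, b = [], block
        while b != begin:
            b = dom[b]
            result.append(b)
        return result     # dominators of `block`, from closest up to the begin block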
11.3.3. Examples of Transformations on Programs
Let us now turn our attention to transformations that can be applied to a program, or its flow graph, in an attempt to reduce the running time of the object language program that is ultimately produced. In this section we shall consider examples of such transformations. Although there does not exist a complete catalog of optimizing transformations for programs with loops, the transformations considered here are useful for a wide class of programs.
1. Removal of Useless Statements

This is a generalization of transformation T1 of Section 11.1. A statement that does not affect the value of a program is unnecessary and can be removed. Basic blocks that are not accessible from the begin node are clearly useless and can be removed when the flow graph is constructed. Statements which compute values that are not ultimately used in computing
an output variable also fall into this category. In Section 11.4, we shall provide a tool for implementing this transformation in a program with loops.
2. Elimination of Redundant Computations

This transformation is a generalization of transformation T2 of Section 11.1. Suppose that we have a program in which block ℬ dominates block ℬ' and that ℬ and ℬ' have statements A ← B + C and A' ← B + C, respectively. If neither B nor C is redefined on any (not necessarily cycle-free) path from ℬ to ℬ' (it is not hard to detect this; see Exercise 11.3.5), then the values computed by the two expressions are the same. We may insert the statement X ← A after A ← B + C in ℬ, where X is a new variable. We then replace A' ← B + C by A' ← X. Moreover, if A is never redefined going from ℬ to ℬ', then we do not need X ← A, and A' ← A serves for A' ← B + C. In this transformation we assume that it is cheaper to do the two assignments X ← A and A' ← X than to evaluate A' ← B + C, an assumption which is realistic for many reasonable machine models.

Example 11.32
Consider the flow graph shown in Fig. 11.23. In this flow graph block ℬ1 dominates blocks ℬ2, ℬ3, and ℬ4. Suppose that all assignment statements involving the variables A, B, C, and D are as shown in Fig. 11.23. Then the expression B + C has the same value when it is computed in blocks ℬ1, ℬ3, and ℬ4. Thus, it is unnecessary to recompute the expression B + C in blocks ℬ3 and ℬ4. In block ℬ1 we can insert the assignment statement X ← A after the statement A ← B + C. Here X is a new variable name. Then in blocks ℬ3 and ℬ4 we can replace the statements A ← B + C and G ← B + C by the simple assignment statements A ← X and G ← X, respectively, without affecting the value of the program. Note that since A is computed in block ℬ2, we cannot use A in place of X. The resulting flow graph is shown in Fig. 11.24. The assignment A ← X now in ℬ3 is redundant and can be eliminated. Also note that if the statement F ← A + G in block ℬ2 is changed to B ← A + G, then we can no longer replace G ← B + C by G ← X in block ℬ4. □

[Fig. 11.23 Flow graph. ℬ1: B ← D + D, C ← D * D, A ← B + C. ℬ2: A ← B * C, F ← A + G. ℬ3: A ← B + C, E ← A * A. ℬ4: G ← B + C. ℬ5: exit.]

[Fig. 11.24 Transformed flow graph: as Fig. 11.23, but with X ← A inserted after A ← B + C in ℬ1, A ← X in place of A ← B + C in ℬ3, and G ← X in place of G ← B + C in ℬ4.]

Eliminating redundant computations (common subexpressions) from a program requires the detection of computations that are common to two or more blocks of a program. While we have shown a redundant computation occurring at a block and one of its dominators, an expression such as A + B might be computed in several blocks, none of which dominates a given
block ℬ (which also needs expression A + B). In general, a computation of A + B is redundant in a block ℬ if
(1) every path from the begin block to ℬ (including those which pass several times through ℬ) passes through a computation of A + B, and
(2) along any such path, no definition of A or B occurs between the last computation of A + B and the use of A + B in ℬ.
In Section 11.4 we shall provide some tools for the detection of this more general situation. We should note that, as in the straight-line case, algebraic laws can increase the number of common subexpressions.
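A much-simplified sketch of the dominator-based case described above, assuming three-address statements [result, op, arg1, arg2] and assuming it has already been verified that neither operand is redefined on any path from the dominating block to the later one; the representation and names are illustrative.

    def eliminate_redundant(dom_block, later_block, fresh_name):
        # dom_block dominates later_block in the flow graph
        new_dom, exprs = [], {}          # exprs: (op, arg1, arg2) -> variable saving its value
        for result, op, a1, a2 in dom_block:
            new_dom.append([result, op, a1, a2])
            key = (op, a1, a2)
            if op != 'copy' and key not in exprs:
                x = fresh_name()         # insert X <- result right after the computation
                new_dom.append([x, 'copy', result, None])
                exprs[key] = x
        new_later, killed = [], set()
        for result, op, a1, a2 in later_block:
            key = (op, a1, a2)
            if key in exprs and a1 not in killed and a2 not in killed:
                new_later.append([result, 'copy', exprs[key], None])   # A' <- X
            else:
                new_later.append([result, op, a1, a2])
            killed.add(result)           # a later use may not rely on a redefined operand
        return new_dom, new_later

A real implementation would save only those expressions known to be recomputed later, rather than introducing a copy after every computation as this sketch does.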
3. Replacing Run Time Computations by Compile Time Computations

It makes sense, if possible, to perform a computation once when a program is being compiled, rather than repeatedly when the object program is being executed. A simple instance of this is constant propagation, the replacement of a variable by a constant when the constant value of that variable is known.
Example 11.33

Suppose that we have the block

        read R
        PI ← 3.14159
        A ← 4/3
        B ← A * PI
        C ← R ↑ 3
        V ← B * C
        write V

We can substitute the value 3.14159 for PI in the fourth statement to obtain the statement B ← A * 3.14159. We can also compute 4/3 and substitute the resulting value in B ← A * 3.14159 to obtain B ← 1.33333 * 3.14159. We can compute 1.33333 * 3.14159 = 4.18878 and substitute 4.18878 in the statement V ← B * C to obtain V ← 4.18878 * C. Finally, we can eliminate the resulting useless statements to obtain the following shorter equivalent program:
        read R
        C ← R ↑ 3
        V ← 4.18878 * C
        write V
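A minimal sketch of constant propagation and folding within a single block, assuming statements (result, op, arg1, arg2) with operands tagged as constants or variables; the representation is illustrative, not from the text.

    def propagate_constants(block):
        # block: list of (result, op, arg1, arg2); operands are ('const', v) or ('var', name),
        # and arg2 may be None for copy statements
        ops = {'+': lambda x, y: x + y, '-': lambda x, y: x - y,
               '*': lambda x, y: x * y, '/': lambda x, y: x / y}
        known, folded = {}, []
        for result, op, a1, a2 in block:
            if a1[0] == 'var' and a1[1] in known:
                a1 = ('const', known[a1[1]])           # propagate a known constant value
            if a2 is not None and a2[0] == 'var' and a2[1] in known:
                a2 = ('const', known[a2[1]])
            if op == 'copy' and a1[0] == 'const':
                known[result] = a1[1]
                folded.append((result, 'copy', a1, None))
            elif op in ops and a1[0] == 'const' and a2 is not None and a2[0] == 'const':
                value = ops[op](a1[1], a2[1])          # fold the computation at compile time
                known[result] = value
                folded.append((result, 'copy', ('const', value), None))
            else:
                known.pop(result, None)                # result is no longer a known constant
                folded.append((result, op, a1, a2))
        return folded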
4. Reduction in Strength
Reduction in strength involves the replacement of one operator, requiring a substantial amount of machine time for execution, by a less costly computation. For example, suppose that a PL/I source program contains the statement

        I = LENGTH(S1 || S2)

where S1 and S2 are strings of variable length. The operator || denotes string concatenation. String concatenation is relatively expensive to implement. However, suppose we replace this statement by the equivalent statement

        I = LENGTH(S1) + LENGTH(S2)
We would now have to perform the length operation twice and perform one addition. But these operations are substantially less expensive than string concatenation. Other examples of this type of optimization are the replacement of certain multiplications by additions and the replacement of certain exponentiations by repeated multiplication. For example, we might replace the statement C ← R ↑ 3 by the sequence

        C ← R * R
        C ← C * R
assuming that it is cheaper to compute R * R * R than to call subroutines to evaluate R ↑ 3 as ANTILOG(3 * LOG(R)). In the next section we shall consider a more interesting form of reduction in strength within loops, where it is possible to replace certain multiplications by additions.
11.3.4. Loop Optimization
Roughly speaking, a loop in a program is a sequence of blocks that can be executed repeatedly. Loops are an integral feature of most programs, and many programs have loops that are executed a large number of times. Many programming languages have constructs whose explicit purpose is
the establishment of a loop. Often substantial improvements in the running time of a program can be made by taking advantage of transformations that only reduce the cost of loops. The general transformations we just discussed--removal of useless statements, elimination of redundant computation, constant propagation, and reduction in strength--are particularly beneficial when applied to loops. However, there are certain transformations that are specifically directed at loops. These are the movement of computations out of loops, the replacement of expensive computations in a loop by cheaper ones, and the unrolling of loops. To apply these transformations, we must first isolate the loops in a given program. In the case of FORTRAN DO loops, or the intermediate code arising from a DO loop, the identification of a loop is easy. However, the concept of a loop in a flow graph is more general than the loops that result from DO statements in FORTRAN. These generalized loops in flow graphs are called "strongly connected regions." Every cycle in a flow graph with a single entry point is an example of a strongly connected region. However, more general loop structures are also strongly connected regions. We define a strongly connected region as follows. DEFINITION
Let F be a flow graph and 𝒮 a subset of the blocks of F. We say that 𝒮 is a strongly connected region (region, for short) of F if
(1) there is a unique block ℬ of 𝒮 (the entry) such that there is a path from the begin node of F to ℬ which does not pass through any other block in 𝒮, and
(2) there is a path (of nonzero length), wholly within 𝒮, from every block in 𝒮 to every other block in 𝒮.

Example 11.34
Consider the abstract flow graph of Fig. 11.25. {2, 3, 4, 5} is a strongly connected region with entry 2. {4} is a strongly connected region with entry 4. {3, 4, 5, 6} is a region with entry 3. {2, 3, 7} is a region with entry 2. Another region with entry 2 is {2, 3, 4, 5, 6, 7}. The latter region is maximal in that every other region with entry 2 is contained in this region. □

The important feature of a strongly connected region that makes it amenable to code improvement is the single identifiable entry block. For example, one optimization that we can perform on flow graphs is to move a computation that is invariant within a region into the predecessors of the entry block of the region (or we may construct a new block preceding the entry, to hold the invariant computations). We can characterize the entry blocks of a flow graph in terms of the dominance relation.
[Fig. 11.25 Flow graph]

THEOREM 11.11
Let F be a flow graph. Block ℬ in F is an entry block of a region if and only if there is some block ℬ' such that there is an edge from ℬ' to ℬ and ℬ either dominates ℬ' or is ℬ'.
Proof. Only if: Suppose that ℬ is the entry block of region 𝒮. If 𝒮 = {ℬ}, the result is trivial. Otherwise, let ℬ' be in 𝒮, ℬ' ≠ ℬ. Then ℬ dominates ℬ', for if not, there is a path from the begin node to ℬ' that does not pass through ℬ, violating the assumption that ℬ is the unique entry block. Thus, the entry block of a region dominates every other block in the region. Since there is a path from every member of 𝒮 to ℬ, there must be at least one ℬ' in 𝒮 − {ℬ} which links directly to ℬ.
If: The case in which ℬ = ℬ' is trivial, and so assume that ℬ ≠ ℬ'. Define 𝒮 to be ℬ together with those blocks ℬ'' such that ℬ dominates ℬ'' and there is a path from ℬ'' to ℬ which passes only through nodes dominated by ℬ. By hypothesis, ℬ and ℬ' are in 𝒮. We must show that 𝒮 is a region with entry ℬ. Clearly, condition (2) of the region definition is satisfied, and so we
must show that there is a path from the begin node to ℬ that does not pass through any other block in 𝒮. Let 𝒞1 ⋯ 𝒞nℬ be a shortest computation path leading to ℬ. If some 𝒞j is in 𝒮, then there is some i, 1 ≤ i < j, such that 𝒞i = ℬ, because ℬ dominates 𝒞j. Then 𝒞1 ⋯ 𝒞i is a shorter computation path leading to ℬ, so 𝒞1 ⋯ 𝒞nℬ is not a shortest computation path leading to ℬ, a contradiction. Thus, condition (1) of the definition of a strongly connected region holds. □

The set 𝒮 constructed in the "if" portion of Theorem 11.11 is clearly the maximal region with entry ℬ. It would be nice if there were a unique region with entry ℬ, but unfortunately this is not always the case. In Example 11.34, there are three regions with entry 2. Nevertheless, Theorem 11.11 is useful in constructing an efficient algorithm to compute maximal regions, which are unique.

Unless a region is maximal, the entry block may dominate blocks not in the region, and blocks in the region may be accessible from these blocks. In Example 11.34, e.g., region {2, 3, 7} can be reached via block 6. We therefore say that a region is single-entry if every edge entering a block of the region, other than the entry block, comes from a block inside the region. In Example 11.34, region {2, 3, 4, 5, 6, 7} is a single-entry region. In what follows, we assume regions to be single-entry, although generalization to all regions is not difficult.
1. Code Motion

There are several transformations in which knowledge of regions can be used to improve code. A principal one is code motion. We can move a region-independent computation outside the region. Let us say that within some single-entry region variables Y and Z are not changed but that the statement X ← Y + Z appears. We may move the computation of Y + Z to a newly created block which links only to the entry block of the region. (The addition of such a block may make the flow graph unconstructable from any program; however, the property of constructability from a program is never used here.) All links from outside the region that formerly went to the entry now go to the new block.
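A much-simplified sketch of this transformation, assuming a statement is treated as region-independent when none of its operands is assigned anywhere in the region and its result is assigned only once there; a real implementation needs further safety checks (for example, that the moved statement would have been executed on every path through the region). Names and representation are illustrative.

    def move_invariant_code(region_blocks, entry, preds, blocks):
        # region_blocks: block ids of a single-entry region with entry block `entry`
        # blocks: dict block id -> list of (result, op, arg1, arg2) statements
        # preds: dict block id -> set of predecessor block ids
        stmts = [s for b in region_blocks for s in blocks[b]]
        def_count = {}
        for result, *rest in stmts:
            def_count[result] = def_count.get(result, 0) + 1
        defined = set(def_count)
        def invariant(stmt):
            result, op, a1, a2 = stmt
            operands = [x for x in (a1, a2) if x is not None]
            # operands never assigned in the region; result assigned only here
            return all(x not in defined for x in operands) and def_count[result] == 1
        preheader = []                                    # the newly created block
        for b in region_blocks:
            preheader += [s for s in blocks[b] if invariant(s)]
            blocks[b] = [s for s in blocks[b] if not invariant(s)]
        # links from outside the region that formerly went to the entry now go to the new block
        outside_preds = set(preds[entry]) - set(region_blocks)
        return preheader, outside_preds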
Example 11.35

It may appear that region-invariant computations would not appear except in the most carelessly written programs. However, let us consider the following inner DO loop of a FORTRAN source program, where J is defined outside the loop:

      K = 0
      DO 3 I = 1,1000
    3 K = J + 1 + I + K
The intermediate program for this portion of the source program might look like this:
        K ← 0
        I ← 1
loop:   T ← J + 1
        S ← T + I
        K ← S + K
        if I = 1000 goto done
        I ← I + 1
        goto loop
done:   halt
The corresponding flow graph is shown in Fig. 11.26.
[Fig. 11.26 Flow graph: ℬ1: K ← 0, I ← 1; ℬ2: T ← J + 1, S ← T + I, K ← S + K, I = 1000?; ℬ3: I ← I + 1; ℬ4: halt.]
We observe that {ℬ2, ℬ3} in Fig. 11.26 is a region with entry ℬ2. The statement T ← J + 1 is invariant in the region, so it may be moved to a new block, as shown in Fig. 11.27. While the number of statements in the flow graphs of Figs. 11.26 and 11.27 is the same, the presumption is that statements in a region will tend to be executed frequently, so that the expected time of execution has been decreased. □

2. Induction Variables
Another useful transformation concerns the elimination of what we shall call induction variables.
[Fig. 11.27 Revised flow graph: ℬ1: K ← 0, I ← 1; ℬ'2 (new block): T ← J + 1; ℬ2: S ← T + I, K ← S + K, I = 1000?; ℬ3: I ← I + 1; ℬ4: halt.]
DEFINITION
Let 𝒮 be a single-entry region with entry ℬ and let X be some variable appearing in a statement of the blocks of 𝒮. Let ℬ1 ⋯ ℬnℬ𝒞1 ⋯ 𝒞m be any computation path such that 𝒞i is in 𝒮, 1 ≤ i ≤ m, and ℬn, if it exists, is not in 𝒮. Define X1, X2, … to be the sequence of values of X each time X is assigned in the sequence ℬ𝒞1 ⋯ 𝒞m. If X1, X2, … forms an arithmetic progression (with positive or negative difference) for arbitrary computation paths as above, then we say that X is an induction variable of 𝒮. We shall also consider X to be an induction variable if it is undefined the first time through ℬ and forms an arithmetic progression otherwise. In this case, it may be necessary to initialize it appropriately on entry to the region from outside, in order that the optimizations to be discussed here may be performed.

Note that it is not trivial to find all the induction variables in a region. In fact, it can be proved that no algorithm to do so exists. Nevertheless, we can detect enough induction variables in common situations to make the concept worth considering.
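A much-simplified sketch that detects one common case: variables whose every assignment in the region has the form X ← X + c or X ← X − c for a constant c. This is only a sufficient condition, and the representation is illustrative.

    def simple_induction_variables(region_blocks, blocks):
        # blocks: dict block id -> list of (result, op, arg1, arg2) statements;
        # constants are ('const', value), variables are ('var', name)
        candidates = {}                  # variable -> True while every assignment qualifies
        for b in region_blocks:
            for result, op, a1, a2 in blocks[b]:
                qualifies = (op in ('+', '-')
                             and a1 == ('var', result)
                             and a2 is not None and a2[0] == 'const')
                if result in candidates:
                    candidates[result] = candidates[result] and qualifies
                else:
                    candidates[result] = qualifies
        return {v for v, ok in candidates.items() if ok}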
Example 11.36

In Fig. 11.27, the region {ℬ2, ℬ3} has entry ℬ2. If ℬ2 is entered from ℬ'2 and the flow of control passes repeatedly from ℬ2 to ℬ3 and then back to ℬ2, the variable I takes on the values 1, 2, 3, …. Thus, I is an induction variable. Less obviously, S is an induction variable, since it takes on the values T + 1, T + 2, T + 3, …. However, K is not an induction variable, because it takes on the values T + 1, 2T + 3, 3T + 6, …. □
The important feature of induction variables is that they are linearly related to each other as long as control is within their region. For example, in Fig. 11.27, the relations S = T + I and I = S − T hold every time we leave ℬ2. If, as in Fig. 11.27, one induction variable is used only to control the region (indicated by the fact that its value is not needed outside the region and that it is always set to the same constant immediately before entering the region), it is possible to eliminate that variable. Even if all induction variables are needed outside the region, we may use only one inside the region and compute the remaining ones when we leave the region.
Example 11.37

Consider Fig. 11.27. We shall eliminate the induction variable I, which qualifies under the criteria listed above. Its role will be played by S. We observe that after executing ℬ2, S has the value T + I, so when control passes from ℬ3 back to ℬ2, the relation S = T + I − 1 must hold. We can thus replace the statement S ← T + I by S ← S + 1. But we must then initialize S correctly in ℬ'2, so that when control goes from ℬ'2 to ℬ2, the value of S after executing the statement S ← S + 1 is T + 1. Clearly, in block ℬ'2, we must introduce the new statement S ← T after the statement T ← J + 1.

We must then revise the test I = 1000? so that it is an equivalent test on S. When the test is executed, S has the value T + I. Consequently, an equivalent test is

        R ← T + 1000
        S = R?

Since R is region-independent, the calculation R ← T + 1000 can be moved to block ℬ'2. It is then possible to dispense with I entirely. The resulting flow graph is shown in Fig. 11.28.

We observe from Fig. 11.28 that block ℬ3 has been entirely eliminated and that the region has been shortened by one statement. Of course, ℬ'2 has been increased in size, but we are presuming that regions are executed more frequently than blocks outside the region. Thus, Fig. 11.28 represents a speeding up of Fig. 11.27.

We observe that the step S ← T in ℬ'2 can be eliminated if we identify S and T. This is possible only because at no time will the values of S and T be different, yet both will be "active" in the sense that they may both be used later in the computation. That is, only S is active in ℬ2, neither is active in ℬ1, and in ℬ'2 both are active only between the statements S ← T and R ← T + 1000. At that time they certainly have the same value. If we replace T by S, the result is Fig. 11.29.

To see the improvement between Figs. 11.26 and 11.29, let us convert
[Fig. 11.28 Further revised flow graph: ℬ1: K ← 0; ℬ'2: T ← J + 1, S ← T, R ← T + 1000; ℬ2: S ← S + 1, K ← S + K, S = R?; ℬ4: halt.]

[Fig. 11.29 Final flow graph: ℬ1: K ← 0; ℬ'2: S ← J + 1, R ← S + 1000; ℬ2: S ← S + 1, K ← S + K, S = R?; ℬ4: halt.]
each into assembly language programs for a crude one-accumulator machine. The operation codes should be transparent. (JZERO stands for "jump if accumulator is zero" and JNZ for "jump if accumulator is not zero.") The two programs are shown in Fig. 11.30.
(a) Program from Fig. 11.26:

        LOAD  =0
        STORE K
        LOAD  =1
        STORE I
LOOP:   LOAD  J
        ADD   =1
        ADD   I
        ADD   K
        STORE K
        LOAD  I
        SUBTR =1000
        JZERO DONE
        LOAD  I
        ADD   =1
        STORE I
        JUMP  LOOP
DONE:   END

(b) Program from Fig. 11.29:

        LOAD  =0
        STORE K
        LOAD  J
        ADD   =1
        STORE S
        ADD   =1000
        STORE R
LOOP:   LOAD  S
        ADD   =1
        STORE S
        ADD   K
        STORE K
        LOAD  S
        SUBTR R
        JNZ   LOOP
        END

Fig. 11.30 Equivalent programs.
Observe that the length of the program in Fig. 11.30(b) is the same as that of Fig. 11.30(a). However, the loop in Fig. 11.30(b) is shorter than that in Fig. 11.30(a) (8 instructions vs. 12), which is the important factor when time is considered. □
3. Reduction in Strength

An interesting form of reduction in strength is possible within regions. If within a region there is a statement of the form A ← B * I, where the value of B is region-independent and the values of I at that statement form an arithmetic progression, we can replace the multiplication by the addition or subtraction of a quantity which is the product of the region-independent value and the difference of the arithmetic progression of the induction variable. It is necessary to initialize properly the quantity computed by the former multiplication statement.

Example 11.38
Consider the following portion of a source program,

      DO 5 J = 1, N
      DO 5 I = 1, M
    5 A(I, J) = B(I, J)

which sets array A equal to array B, assuming that both A and B are M by N arrays. Suppose element A(I, J) is stored in location A + M * (J − 1) + I − 1 for 1 ≤ I ≤ M, 1 ≤ J ≤ N.
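A much-simplified sketch of this idea for the addressing computation above, using hypothetical names; the multiplication M * (J − 1) is replaced by an offset maintained with additions only.

    def copy_b_to_a(A, B, M, N):
        # A, B: flat sequences of length M * N, with A(I, J) at offset M*(J-1) + I-1
        col_offset = 0                      # equals M * (J - 1)
        for J in range(1, N + 1):
            offset = col_offset             # equals M*(J-1) + I-1 with I = 1
            for I in range(1, M + 1):
                A[offset] = B[offset]       # A(I, J) = B(I, J)
                offset += 1                 # advance I: add 1 instead of recomputing
            col_offset += M                 # advance J: add M instead of multiplying
        return A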
"11.4.12.
Let T1 and T2 be two transformations defined on flow graphs as follows:
    T1: If node n has an edge to itself, remove that edge.
    T2: If node n has a single entering edge, from node m, and n is not the begin node, merge m and n by replacing the edge from m to n by edges from m to each node n' which was formerly entered by an edge from n. Then delete n.
Show that if T1 and T2 are applied to a flow graph F until they can no longer be applied, then the result is the limit of F.

**11.4.13. Use the transformations T1 and T2 to give an alternative way of computing the IN function without using interval analysis.
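A minimal sketch of applying T1 and T2 repeatedly, assuming the flow graph is a mapping from every node to its set of successors and that node 0 is the begin node; names are illustrative. The graph that remains when neither transformation applies is the limit referred to in Exercise 11.4.12.

    def reduce_by_t1_t2(succ, begin=0):
        # succ: dict mapping every node to its set of successor nodes
        changed = True
        while changed:
            changed = False
            for n in succ:                               # T1: remove self-loops
                if n in succ[n]:
                    succ[n].discard(n)
                    changed = True
            preds = {}
            for m, outs in succ.items():
                for n in outs:
                    preds.setdefault(n, set()).add(m)
            for n in list(succ):                         # T2: merge n into a unique predecessor
                ps = preds.get(n, set())
                if n != begin and len(ps) == 1:
                    m = next(iter(ps))
                    succ[m].discard(n)
                    succ[m] |= succ.pop(n)
                    changed = True
                    break                                # recompute predecessors after a merge
        return succ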
**11.4.14. Let G be a flow graph with initial node n0. Show that G is irreducible if and only if it has nodes n1, n2, and n3 such that there are paths from n0 to n1, from n1 to n2 and n3, from n2 to n3, and from n3 to n2 (see Fig. 11.48) which do not coincide except at their end points. All of n0, n1, n2, and n3 must be distinct, with the exception that n1 may be n0.

[Fig. 11.48 Pattern in every irreducible flow graph]

11.4.15. Show that every d-chart (see Section 1.3.2) is reducible. Hint: Use Exercise 11.4.14.

11.4.16. Show that every FORTRAN program in which every transfer to a previous statement of the program is caused by a DO loop has a reducible flow graph.
**11.4.17. Show that one can determine in time O(n log n) whether a program flow graph is reducible. Hint: Use Exercise 11.4.12.

*11.4.18. What is the relation between the concepts of intervals and single-entry regions?

*11.4.19. Give an interval-based algorithm that determines for each expression (say A + B) and each block ℬ whether every execution of the program must reach a statement which computes A + B (i.e., there is a statement of the form C ← A + B) and which does not subsequently redefine A or B. Hint: If ℬ is not the begin block, let IN(ℬ) = ∩ OUT(ℬi), where the ℬi's are all the direct predecessors of ℬ. Let OUT(ℬ) be (IN(ℬ) − X) ∪ Y, where X is the set of expressions "killed" in ℬ (we "kill" A + B by redefining A or B) and Y is the set of expressions computed by the block and not killed. For each interval I of the various derived graphs and each exit s of I, compute GEN'(I, s) to be the set of expressions which are computed and not subsequently killed on every path from the entrance of I to exit s. Also, compute TRANS'(I, s) to be the set of expressions which, if killed, are subsequently generated on every such path. Note that GEN'(I, s) ⊆ TRANS'(I, s).
*11.4.20. Give an algorithm, based on interval analysis, that determines for each variable A and each block ℬ whether there is an execution which, after passing through ℬ, will reference A before redefining it.
11.4.21. Let F be an n node flow graph with e edges. Show that the ith derived graph of F has no more than e − i edges.

*11.4.22. Give an example of an n node flow graph with 2n edges whose derived sequence is of length n.

*11.4.23. Show that Algorithm 11.7 and the algorithms of Exercises 11.4.19 and 11.4.20 take at most O(n²) bit vector steps on flow graphs of n nodes and at most 2n edges.

There is another approach to data flow analysis which is tabular in nature. For example, in analogy with Algorithm 11.8, we could compute a table IN(d, ℬ) which has value 1 if definition d is in IN(ℬ) and 0 otherwise. Initially, let IN(d, ℬ) = 1 if and only if there is a node ℬ' with an edge to ℬ and d is in GEN(ℬ'). For each 1 added to the table, say at entry (d, ℬ), place a 1 in entry (d, ℬ'') if there is an edge from ℬ to ℬ'' and ℬ does not kill d.
"11.4.24.
Show that the above algorithm correctly computes IN(63) and that it operates in time O(mn) on an n node flow graph with at most 2n edges and m definitions.
"11.4.25.
Give algorithms similar to the above performing the tasks of Exercises 1 1.4.19 and 11.4.20. How fast do they run?
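A minimal sketch of the tabular approach described before Exercise 11.4.24, assuming GEN and KILL sets are given per block together with a successor mapping; names are illustrative.

    from collections import deque

    def tabular_in(succ, gen, kills):
        # succ: block -> list of successor blocks
        # gen: block -> set of definitions generated (and not later killed) in the block
        # kills: block -> set of definitions killed in the block
        IN = {b: set() for b in succ}
        work = deque()
        for b, outs in succ.items():
            for d in gen[b]:
                for n in outs:                    # initialization: d in GEN(b), edge b -> n
                    if d not in IN[n]:
                        IN[n].add(d)
                        work.append((d, n))
        while work:                               # propagate each 1 added to the table
            d, b = work.popleft()
            if d in kills[b]:
                continue                          # b kills d; do not propagate further
            for n in succ[b]:
                if d not in IN[n]:
                    IN[n].add(d)
                    work.append((d, n))
        return IN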
**11.4.26. Give algorithms requiring O(n log n) bit vector steps to compute the IN functions of Algorithm 11.7 or Exercise 11.4.19 for flow graphs of n nodes.

**11.4.27. Show that a flow graph is reducible if and only if its edge set can be partitioned into two sets E1 and E2, where (1) E1 forms a dag, and (2) if (m, n) is in E2, then m = n or n dominates m.

**11.4.28. Give an O(n log n) algorithm to compute direct dominators for an n node reducible graph.
Research Problems

11.4.29. Suggest some additional data flow information (other than that mentioned in Algorithm 11.7 and Exercises 11.4.19 and 11.4.20) which would be useful for code optimization purposes. Give algorithms to compute these, both for reducible and for irreducible flow graphs.
11.4.30.
Are there techniques to compute the IN function of Algorithm 11.8 or other data flow functions that are superior to node splitting for irreducible graphs? By "superior," we are assuming that bit vector operations are permissible, or else the algorithms of Exercises 11.4.24 and 11.4.25 are clearly optimal.
BIBLIOGRAPHIC NOTES
The interval analysis approach to code optimization was developed by Cocke [1970] and further elaborated by Cocke and Schwartz [1970] and Allen [1970]. Kennedy [1971] discusses a global algorithm that uses interval analysis to recognize active variables in a program (Exercise 11.4.20). The solutions to Exercises 11.4.12-11.4.16 can be found in Hecht and Ullman [1972a]. Exercise 11.4.17 is from Hopcroft and Ullman [1972b]. An answer to Exercise 11.4.19 can be found in Cocke [1970] or Schaefer [1973]. Exercises 11.4.24 and 11.4.26 are from Ullman [1972b]. Exercise 11.4.27 is from Hecht and Ullman [1972b]. Exercise 11.4.28 is from Aho, Hopcroft and Ullman [1972].

There are several papers that discuss the implementation of optimizing compilers. Lowry and Medlock [1969] discuss some optimizations used in the OS/360 FORTRAN H compiler. Busam and Englund [1969] present techniques for the recognition of common subexpressions, the removal of invariant computations from loops, and register allocation in another FORTRAN compiler. Knuth [1971] collected a large sample of FORTRAN programs and analyzed some of their characteristics.
BIBLIOGRAPHY FOR VOLUMES I AND II
AHO, A. V. [1968]
    Indexed grammars--an extension of context-free grammars. J. ACM 15:4, 647-671.
AHO, A. V. (ed.) [1973]
    Currents in the Theory of Computing. Prentice-Hall, Englewood Cliffs, N.J.
AHO, A. V., P. J. DENNING, and J. D. ULLMAN [1972]
    Weak and mixed strategy precedence parsing. J. ACM 19:2, 225-243.
AHO, A. V., J. E. HOPCROFT, and J. D. ULLMAN [1968]
    Time and tape complexity of pushdown automaton languages. Information and Control 13:3, 186-206.
AHO, A. V., J. E. HOPCROFT, and J. D. ULLMAN [1972]
    On finding lowest common ancestors in trees. Proc. Fifth Annual ACM Symposium on Theory of Computing (May, 1973), 253-265.
AHO, A. V., S. C. JOHNSON, and J. D. ULLMAN [1972]
    Deterministic parsing of ambiguous grammars. Unpublished manuscript, Bell Laboratories, Murray Hill, N.J.
AHO, A. V., R. SETHI, and J. D. ULLMAN [1972]
    Code optimization and finite Church-Rosser systems. In Design and Optimization of Compilers (R. Rustin, ed.). Prentice-Hall, Englewood Cliffs, N.J., pp. 89-106.
AHO, A. V., and J. D. ULLMAN [1969a]
    Syntax directed translations and the pushdown assembler. J. Computer and System Sciences 3:1, 37-56.
AHO, A. V., and J. D. ULLMAN [1969b]
    Properties of syntax directed translations. J. Computer and System Sciences 3:3, 319-334.
AHO, A. V., and J. D. ULLMAN [1971]
    Translations on a context-free grammar. Information and Control 19:5, 439-475.
AHO, A. V., and J. D. ULLMAN [1972a]
    Linear precedence functions for weak precedence grammars. International J. Computer Mathematics, Section A, 3, 149-155.
AHO, A. V., and J. D. ULLMAN [1972b]
    Error detection in precedence parsers. Mathematical Systems Theory 7:2 (February 1973), 97-113.
AHO, A. V., and J. D. ULLMAN [1972c]
    Optimization of LR(k) parsers. J. Computer and System Sciences 6:6, 573-602.
AHO, A. V., and J. D. ULLMAN [1972d]
    A technique for speeding up LR(k) parsers. Proc. Fourth Annual ACM Symposium on Theory of Computing, pp. 251-263.
AHO, A. V., and J. D. ULLMAN [1972e]
    Optimization of straight line code. SIAM J. on Computing 1:1, 1-19.
AHO, A. V., and J. D. ULLMAN [1972f]
    Equivalence of programs with structured variables. J. Computer and System Sciences 6:2, 125-137.
AHO, A. V., and J. D. ULLMAN [1972g]
    LR(k) syntax directed translation. Unpublished manuscript, Bell Laboratories, Murray Hill, N.J.
ALLARD, R. W., K. A. WOLF, and R. A. ZEMLIN [1964]
    Some effects of the 6600 computer on language structures. Comm. ACM 7:2, 112-127.
ALLEN, F. E. [1969]
    Program optimization. Annual Review in Automatic Programming, Vol. 5. Pergamon, Elmsford, N.Y.
ALLEN, F. E. [1970]
    Control flow analysis. ACM SIGPLAN Notices 5:7, 1-19.
ALLEN, F. E., and J. COCKE [1972]
    A catalogue of optimizing transformations. In Design and Optimization of Compilers (R. Rustin, ed.). Prentice-Hall, Englewood Cliffs, N.J., pp. 1-30.
ANDERSON, J. P. [1964]
    A note on some compiling algorithms. Comm. ACM 7:3, 149-150.
ANS X.3.9 [1966]
    American National Standard FORTRAN. American National Standards Institute, New York.
ANSI SUBCOMMITTEE X3J3 [1971]
    Clarification of FORTRAN standards--second report. Comm. ACM 14:10, 628-642.
ARBIB, M. A. [1969]
    Theories of Abstract Automata. Prentice-Hall, Englewood Cliffs, N.J.
BACKUS, J. W., et al. [1957]
    The FORTRAN Automatic Coding System. Proc. Western Joint Computer Conference, Vol. 11, pp. 188-198.
BAER, J. L., and D. P. BOVET [1968]
    Compilation of arithmetic expressions for parallel computations. Proc. IFIP Congress 68, B4-B10.
BAGWELL, J. T. [1970]
    Local optimizations. ACM SIGPLAN Notices 5:7, 52-66.
BAR-HILLEL, Y. [1964]
    Language and Information. Addison-Wesley, Reading, Mass.
BAR-HILLEL, Y., M. PERLES, and E. SHAMIR [1961]
    On formal properties of simple phrase structure grammars. Z. Phonetik, Sprachwissenschaft und Kommunikationsforschung 14, 143-172. Also in Bar-Hillel [1964], pp. 116-150.
BARNETT, M. P., and R. P. FUTRELLE [1962]
    Syntactic analysis by digital computer. Comm. ACM 5:10, 515-526.
BAUER, H., S. BECKER, and S. GRAHAM [1968]
    ALGOL W implementation. CS98, Computer Science Department, Stanford University, Stanford, Cal.
BEALS, A. J. [1969]
    The generation of a deterministic parsing algorithm. Report No. 304, Department of Computer Science, University of Illinois, Urbana.
BEALS, A. J., J. E. LAFRANCE, and R. S. NORTHCOTE [1969]
    The automatic generation of Floyd production syntactic analyzers. Report No. 350, Department of Computer Science, University of Illinois, Urbana.
BEATTY, J. C. [1972]
    An axiomatic approach to code optimization for expressions. J. ACM 19:4.
BELADY, L. A. [1966]
    A study of replacement algorithms for a virtual storage computer. IBM Systems J. 5, 78-82.
BELL, J. R. [1969]
    A new method for determining linear precedence functions for precedence grammars. Comm. ACM 12:10, 316-333.
BELL, J. R. [1970]
    The quadratic quotient method: a hash code eliminating secondary clustering. Comm. ACM 13:2, 107-109.
BERGE, C. [1958]
    The Theory of Graphs and Its Applications. Wiley, New York.
BIRMAN, A., and J. D. ULLMAN [1970]
    Parsing algorithms with backtrack. IEEE Conference Record of 11th Annual Symposium on Switching and Automata Theory, pp. 153-174.
BLATTNER, M. [1972]
    The unsolvability of the equality problem for sentential forms of context-free languages. Unpublished memorandum, UCLA, Los Angeles, Calif. To appear in JCSS.
BOBROW, D. G. [1963]
    Syntactic analysis of English by computer--a survey. Proc. AFIPS Fall Joint Computer Conference, Vol. 24. Spartan, New York, pp. 365-387.
BOOK, R. V. [1970]
    Problems in formal language theory. Proc. Fourth Annual Princeton Conference on Information Sciences and Systems, pp. 253-256. Also see Aho [1973].
BOOTH, T. L. [1967]
    Sequential Machines and Automata Theory. Wiley, New York.
BORODIN, A. [1970]
    Computational complexity--a survey. Proc. Fourth Annual Princeton Conference on Information Sciences and Systems, pp. 257-262. Also see Aho [1973].
BRACHA, N. [1972]
    Transformations on loop-free program schemata. Report No. UIUCDCS-R-72-516, Department of Computer Science, University of Illinois, Urbana.
BRAFFORT, P., and D. HIRSCHBERG (eds.) [1963]
    Computer Programming and Formal Systems. North-Holland, Amsterdam.
BREUER, M. A. [1969]
    Generation of optimal code for expressions via factorization. Comm. ACM 12:6, 333-340.
BROOKER, R. A., and D. MORRIS [1963]
    The compiler-compiler. Annual Review in Automatic Programming, Vol. 3. Pergamon, Elmsford, N.Y., pp. 229-275.
BRUNO, J. L., and W. A. BURKHARD [1970]
    A circularity test for interpreted grammars. Technical Report 88, Computer Sciences Laboratory, Department of Electrical Engineering, Princeton University, Princeton, N.J.
BRZOZOWSKI, J. A. [1962]
    A survey of regular expressions and their applications. IRE Trans. on Electronic Computers 11:3, 324-335.
BRZOZOWSKI, J. A. [1964]
    Derivatives of regular expressions. J. ACM 11:4, 481-494.
BUSAM, V. A., and D. E. ENGLUND [1969]
    Optimization of expressions in Fortran. Comm. ACM 12:12, 666-674.
CANTOR, D. G. [1962]
    On the ambiguity problem of Backus systems. J. ACM 9:4, 477-479.
CAVINESS, B. F. [1970]
    On canonical forms and simplification. J. ACM 17:2, 385-396.
CHEATHAM, T. E. [1965]
    The TGS-II translator-generator system. Proc. IFIP Congress 65. Spartan, New York, pp. 592-593.
CHEATHAM, T. E. [1966]
    The introduction of definitional facilities into higher level programming languages. Proc. AFIPS Fall Joint Computer Conference, Vol. 30. Spartan, New York, pp. 623-637.
CHEATHAM, T. E. [1967]
    The Theory and Construction of Compilers (2nd ed.). Computer Associates, Inc., Wakefield, Mass.
CHEATHAM, T. E., and K. SATTLEY [1964]
    Syntax directed compiling. Proc. AFIPS Spring Joint Computer Conference, Vol. 25. Spartan, New York, pp. 31-57.
CHEATHAM, T. E., and T. STANDISH [1970]
    Optimization aspects of compiler-compilers. ACM SIGPLAN Notices 5:10, 10-17.
CHEN, S. [1972]
    On the Sethi-Ullman algorithm. Unpublished memorandum, Bell Laboratories, Holmdel, N.J.
CHOMSKY, N. [1956]
    Three models for the description of language. IRE Trans. on Information Theory 2:3, 113-124.
CHOMSKY, N. [1957]
    Syntactic Structures. Mouton and Co., The Hague.
CHOMSKY, N. [1959a]
    On certain formal properties of grammars. Information and Control 2:2, 137-167.
CHOMSKY, N. [1959b]
    A note on phrase structure grammars. Information and Control 2:4, 393-395.
CHOMSKY, N. [1962]
    Context-free grammars and pushdown storage. Quarterly Progress Report No. 65, Research Laboratory of Electronics, Massachusetts Institute of Technology, Cambridge, Mass.
CHOMSKY, N. [1963]
    Formal properties of grammars. In Handbook of Mathematical Psychology, Vol. 2, R. D. Luce, R. R. Bush, and E. Galanter (eds.). Wiley, New York, pp. 323-418.
CHOMSKY, N. [1965]
    Aspects of the Theory of Syntax. M.I.T. Press, Cambridge, Mass.
CHOMSKY, N., and G. A. MILLER [1958]
    Finite state languages. Information and Control 1:2, 91-112.
CHOMSKY, N., and M. P. SCHUTZENBERGER [1963]
    The algebraic theory of context-free languages. In Braffort and Hirschberg [1963], pp. 118-161.
CHRISTENSEN, C., and J. C. SHAW (eds.) [1969]
    Proc. of the extensible languages symposium. ACM SIGPLAN Notices 4:8.
CHURCH, A. [1941]
    The Calculi of Lambda-Conversion. Annals of Mathematics Studies, Vol. 6. Princeton University Press, Princeton, N.J.
CHURCH, A. [1956]
    Introduction to Mathematical Logic. Princeton University Press, Princeton, N.J.
CLARK, E. R. [1967]
    On the automatic simplification of source language programs. Comm. ACM 10:3, 160-164.
COCKE, J. [1970]
    Global common subexpression elimination. ACM SIGPLAN Notices 5:7, 20-24.
COCKE, J., and J. T. SCHWARTZ [1970]
    Programming Languages and Their Compilers (2nd ed.). Courant Institute of Mathematical Sciences, New York University, New York.
COHEN, D. J., and C. C. GOTLIEB [1970]
    A list structure form of grammars for syntactic analysis. Computing Surveys 2:1, 65-82.
COHEN, R. S., and K. CULIK, II [1971]
    LR-regular grammars--an extension of LR(k) grammars. IEEE Conference Record of 12th Annual Symposium on Switching and Automata Theory, pp. 153-165.
COLMERAUER, A. [1970]
    Total precedence relations. J. ACM 17:1, 14-30.
CONWAY, M. E. [1963]
    Design of a separable transition-diagram compiler. Comm. ACM 6:7, 396-408.
CONWAY, R. W., and W. L. MAXWELL [1963]
    CORC: the Cornell computing language. Comm. ACM 6:6, 317-321.
CONWAY, R. W., and W. L. MAXWELL [1968]
    CUPL--an approach to introductory computing instruction. Technical Report No. 68-4, Department of Computer Science, Cornell University, Ithaca, N.Y.
CONWAY, R. W., et al. [1970]
    PL/C. A high performance subset of PL/I. Technical Report 70-55, Department of Computer Science, Cornell University, Ithaca, N.Y.
COOK, S. A. [1971]
    Linear time simulation of deterministic two-way pushdown automata. Proc. IFIP Congress 71, TA-2. North-Holland, Amsterdam, pp. 174-179.
COOK, S. A., and S. D. AANDERAA [1969]
    On the minimum computation time of functions. Trans. American Math. Soc. 142, 291-314.
CULIK, K., II [1968]
Contribution to deterministic top-down analysis of context-free languages. Kybernetika 4:5, 422-431. CULIK, K., II [1970] n-ary grammars and the description of mapping of languages. Kybernetika 6, 99-117. DAVIS, M. [1958] Computability and Unsolvability. McGraw-Hill, New York. DAVIS, M. (ed.) [1965] The Undecidable. Basic papers in undecidable propositions, unsolvable problems and computable functions. Raven Press, New York. DE BAKKER, J. W. [1971]
Axiom systems for simple assignment statements. In Engeler [1971], pp. 1-22. DEREMER, F. L. [1968] On the generation of parsers for BNF grammars: an algorithm. Report No. 276. Department of Computer Science, University of Illinois, Urbana. DEREMER, F. L. [1969] Practical translators for LR(k) languages. Ph.D. Thesis, Massachusetts Institute of Technology, Cambridge, Mass. DEREMER, F. L. [1971] Simple LR(k) grammars. Comm. ACM 14:7, 453-460. DEWAR, R. B. K., R. R. HOCHSPRUNG, and W. S. WORLEY [1969] The IITRAN programming language. Comm. ACM 12:10, 569-575. EARLEY, J. [1966] Generating a recognizer for a BNF grammar. Computation Center Report, Carnegie-Mellon University, Pittsburgh. EARLEY, J. [1968] An efficient context-free parsing algorithm. Ph.D. Thesis, Carnegie-Mellon University, Pittsburgh. Also see Comm. ACM (February, 1970) 13:2, 94-102. EICKEL, J., M. PAUL, F. L. BAUER, and K. SAMELSON [1963] A syntax-controlled generator of formal language processors. Comm. ACM 6:8, 451-455. ELSON, M., and S. T. RAKE [1970] Code-generation technique for large-language compilers. IBM Systems J. 9:3, 166-188.
ELSPAS, B., M. W. GREEN, and K. N. LEVITT [1971] Software reliability. Computer 1, 21-27. ENGELER, E. (ed.) [1971] Symposium on Semantics of Algorithmic Languages. Lecture Notes in Mathematics. Springer, Berlin. EVANS, A., JR. [1964] An ALGOL 60 compiler. Annual Review in Automatic Programming, Vol. 4. Pergamon, Elmsford, N.Y., pp. 87-124. EVEY, R. J. [1963] Applications of pushdown-store machines. Proc. AFIPS Fall Joint Computer Conference, Vol. 24. Spartan, New York, pp. 215-227. FELDMAN, J. A. [1966] A formal semantics for computer languages and its application in a compiler-compiler. Comm. ACM 9:1, 3-9. FELDMAN, J. A., and D. GRIES [1968] Translator writing systems. Comm. ACM 11:2, 77-113. FISCHER, M. J. [1968] Grammars with macro-like productions. IEEE Conference Record of 9th Annual Symposium on Switching and Automata Theory, pp. 131-142. FISCHER, M. J. [1969] Some properties of precedence languages. Proc. ACM Symposium on Theory of Computing, pp. 181-190. FISCHER, M. J. [1972] Efficiency of equivalence algorithms. Memo No. 256, Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, Mass. FLOYD, R. W. [1961a] An algorithm for coding efficient arithmetic operations. Comm. ACM 4:1, 42-51. FLOYD, R. W. [1961b] A descriptive language for symbol manipulation. J. ACM 8:4, 579-584. FLOYD, R. W. [1962a] Algorithm 97: shortest path. Comm. ACM 5:6, 345.
FLOYD, R. W. [1962b] On ambiguity in phrase structure languages. Comm. A C M 5:10, 526-534. FLOYD, R. W. [1963] Syntactic analysis and operator precedence. J. A C M 10:3, 316-333. FLOYD, R. W. [1964a]
Bounded context syntactic analysis. Comm. ACM 7:2, 62-67. FLOYD, R. W. [1964b]
The syntax of programming languages--a survey. IEEE Trans. on Electronic Computers EC-13:4, 346-353. FLOYD, R. W. [1967a]
Assigning meanings to programs. In Schwartz [1967], pp. 19-32. FLOYD, R. W. [1967b]
Nondeterministic algorithms. J. ACM 14:4, 636-644.
FRAILLY, D. J. [1970]
Expression optimization using unary complement operators. ACM SIGPLAN Notices 5:7, 67-85. FREEMAN, D. N. [1964]
Error correction in CORC, the Cornell computing language. Proc. AFIPS Fall Joint Computer Conference, Vol. 26.
Spartan, New York, pp. 15-34. GALLER, B. A., and A. J. PERLIS [1967]
A proposal for definitions in ALGOL. Comm. ACM 10:4, 204-219. GARWICK, J. V. [1964]
GARGOYLE, a language for compiler writing. Comm. A C M 7: 1, 16-20. GARWICK, J. V. [1968]
GPL, a truly general purpose language. Comm. ACM 11:9, 634-638. GEAR, C. W. [1965]
High speed compilation of efficient object code. Comm. A C M 8:8, 483-487. GENTLEMAN, W. M. [1971]
A portable coroutine system. Proc. IFIP Congress 71, TA-3. North-Holland, Amsterdam, pp. 94-98.
GILL, A. [1962] Introduction to the Theory of Finite State Machines.
McGraw-Hill, New York.
GINSBURG, S. [1962] An Introduction to Mathematical Machine Theory. Addison-Wesley, Reading, Mass. GINSBURG, S. [1966] The Mathematical Theory of Context-Free Languages. McGraw-Hill, New York. GINSBURG, S., and S. A. GREIBACH [1966] Deterministic context-free languages. Information and Control 9:6, 620-648. GINSBURG, S., and S. A. GREIBACH [1969] Abstract families of languages. Memoir American Math. Soc. No. 87, 1-32. GINSBURG, S., and H. G. RICE [1962] Two families of languages related to ALGOL. J. ACM 9:3, 350-371. GINZBURG, A. [1968] Algebraic Theory of Automata. Academic Press, New York. GLENNIE, A. [1960] On the syntax machine and the construction of a universal compiler. Technical Report No. 2. Computation Center, Carnegie-Mellon University, Pittsburgh, Pa.
GRAHAM, R. L. [1972] Bounds on multiprocessing anomalies and related packing algorithms. Proc. AFIPS Spring Joint Computer Conference, Vol. 40. AFIPS Press, Montvale, N.J., pp. 205-217. GRAHAM, R. M. [1964] Bounded context translation. Proc. AFIPS Spring Joint Computer Conference, Vol. 25. Spartan, New York, pp. 17-29.
GRAHAM, S. L. [1970] Extended precedence languages, bounded right context languages and deterministic languages. IEEE Conference Record of 11th Annual Symposium on Switching and Automata Theory, pp. 175-180. GRAU, A. A., U. HILL, and H. LANGMAACK [1967] Translation of ALGOL 60. Springer-Verlag, New York. GRAY, J. N. [1969] Precedence parsers for programming languages. Ph.D. Thesis, Department of Computer Science, University of California, Berkeley.
GRAY, J. N., and M. A. HARRISON [1969] Single pass precedence analysis. IEEE Conference Record of 10th Annual Symposium on Switching and Automata Theory, pp. 106-117. GRAY, J. N., M. A. HARRISON, and O. IBARRA [1967] Two way pushdown automata. Information and Control 11:1, 30-70. GREIBACH, S. A. [1965]
A new normal form theorem for context-free phrase structure grammars. J. ACM 12:1, 42-52. GREIBACH, S. A., and J. E. HOPCROFT [1969] Scattered context grammars. J. Computer and System Sciences 3:3, 233-247. GRIES, D. [1971]
Compiler Construction for Digital Computers. Wiley, New York.
GRIFFITHS, T. V. [1968] The unsolvability of the equivalence problem for Λ-free nondeterministic generalized machines. J. ACM 15:3, 409-413. GRIFFITHS, T. V., and S. R. PETRICK [1965] On the relative efficiencies of context-free grammar recognizers. Comm. ACM 8:5, 289-300. GRISWOLD, R. E., J. F. POAGE, and I. P. POLONSKY [1971]
The SNOBOL4 Programming Language (2nd ed.). Prentice-Hall, Englewood Cliffs, N.J.
GROSS, M., and A. LENTIN [1970] Introduction to Formal Grammars. Springer-Verlag, New York. HAINES, L. H. [1970] Representation theorems for context-sensitive languages. Department of Electrical Engineering and Computer Sciences, University of California, Berkeley. HALMOS, P. R. [1960] Naive Set Theory. Van Nostrand Reinhold, New York. HALMOS, P. R. [1963] Lectures on Boolean Algebras. Van Nostrand Reinhold, New York. HARARY, F. [1969] Graph Theory. Addison-Wesley, Reading, Mass.
HARRISON, M. A. [1965] Introduction to Switching and Automata Theory. McGraw-Hill, New York. HARTMANIS, J., and J. E. HOPCROFT [1970] An overview of the theory of computational complexity. J. ACM 18:3, 444-475. HARTMANIS, J., P. M. LEWIS, II, and R. E. STEARNS [1965] Classifications of computations by time and memory requirements. Proc. IFIP Congress 65. Spartan, New York, pp. 31-35. HAYNES, H. R., and L. J. SCHUTTE [1970] Compilation of optimized syntactic recognizers from Floyd-Evans productions. ACM SIGPLAN Notices 5:7, 38-51. HAYS, D. G. [1967] Introduction to Computational Linguistics. American Elsevier, New York. HECHT, M. S., and J. D. ULLMAN [1972a] Flow graph reducibility. SIAM J. on Computing 1:2, 188-202. HECHT, M. S., and J. D. ULLMAN [1972b] Unpublished memorandum, Department of Electrical Engineering, Princeton University. HELLERMAN, H. [1966]
Parallel processing of algebraic expressions. IEEE Trans. on Electronic Computers EC-15:1, 82-91. HEXT, J. B., and P. S. ROBERTS [1970] Syntax analysis by Domolki's algorithm. Computer J. 13:3, 263-271. HOPCROFT, J. E. [1971] An n log n algorithm for minimizing states in a finite automaton. CS71-190. Computer Science Department, Stanford University, Stanford, Cal. Also in Theory of Machines and Computations, Z. Kohavi and A. Paz (eds.). Academic Press, New York, pp. 189-196. HOPCROFT, J. E., and J. D. ULLMAN [1967] An approach to a unified theory of automata. Bell System Tech. J. 46:8, 1763-1829. HOPCROFT, J. E., and J. D. ULLMAN [1969] Formal Languages and Their Relation to Automata. Addison-Wesley, Reading, Mass. HOPCROFT, J. E., and J. D. ULLMAN [1972a] Set merging algorithms. Unpublished memorandum. Department of Computer Science, Cornell University, Ithaca, N.Y.
HOPCROFT, J. E., and J. D. ULLMAN [1972b] An n log n algorithm to detect reducible graphs. Proc. Sixth Annual Princeton Conference on Information Sciences and Systems, pp. 119-122.
HOPGOOD, F. R. A. [1969] Compiling Techniques. American Elsevier, New York. HOPKINS, M. E. [1971] An optimizing compiler design. Proc. IFIP Congress 71, TA-3. North-Holland, Amsterdam, pp. 69-73. HORWITZ, L. P., R. M. KARP, R. E. MILLER, and S. WINOGRAD [1966] Index register allocation. J. ACM 13:1, 43-61. HUFFMAN, D. A. [1954] The synthesis of sequential switching circuits. J. Franklin Institute 257:3-4, 161-190 and 275-303. HUXTABLE, D. H. R. [1964] On writing an optimizing translator for ALGOL 60. In Introduction to System Programming, Academic Press, New York. IANOV, I. I. [1958] On the equivalence and transformation of program schemes. Translation in Comm. ACM 1:10, 8-11. IBM [1969] System 360 Operating System PL/I (F) Compiler Program Logic Manual. Publ. No. Y286800, IBM, Hursley, Winchester, England. ICHBIAH, J. D., and S. P. MORSE [1970] A technique for generating almost optimal Floyd-Evans productions for precedence grammars. Comm. ACM 13:8, 501-508. IGARISHI, S. [1968] On the equivalence of programs represented by Algol-like statements. Report of the Computer Centre, University of Tokyo 1, pp. 103-118. INGERMAN, P. Z. [1966] A Syntax Oriented Translator. Academic Press, New York. IRLAND, M. I., and P. C. FISCHER [1970] A bibliography on computational complexity. CSRR 2028. Department of Applied Analysis and Computer Science, University of Waterloo, Ontario. IRONS, E. T. [1961] A syntax directed compiler for ALGOL 60. Comm. ACM 4:1, 51-55.
IRONS, E. T. [1963a] An error correcting parse algorithm. Comm. ACM 6:11, 669-673. IRONS, E. T. [1963b] The structure and use of the syntax directed compiler. Annual Review in Automatic Programming, Vol. 3. Pergamon, Elmsford, N.Y., pp. 207-227. IRONS, E. T. [1964] Structural connections in formal languages. Comm. ACM 7:2, 62-67. JOHNSON, W. L., J. H. PORTER, S. I. ACKLEY, and D. T. ROSS [1968] Automatic generation of efficient lexical processors using finite state techniques. Comm. ACM 11:12, 805-813. KAMEDA, T., and P. WEINER [1968]
On the reduction of nondeterministic automata. Proc. Second Annual Princeton Conference on Information Sciences and Systems, pp. 348-352. KAPLAN, D. M. [1970]
Proving things about programs. Proc. 4th Annual Princeton Conference on Information Sciences and Systems, pp. 244-251. KASAMI, T. [1965]
An efficient recognition and syntax analysis algorithm for context-free languages. Scientific Report AFCRL-65-758. Air Force Cambridge Research Laboratory, Bedford, Mass. KASAMI, T., and K. TORII [1969] A syntax analysis procedure for unambiguous context-free grammars. J. A C M 16: 3, 423-431. KENNEDY, K. [1971]
A global flow analysis algorithm. International J. Computer Mathematics 3: 1, 5-16 KENNEDY, K. [1972]
Index register allocation in straight line code and simple loops. In Design and Optimization of Compilers (R. Rustin, ed.). Prentice-Hall, Englewood Cliffs, N.J., pp. 51-64. KLEENE, S. C. [1952]
Introduction to Metamathematics. Van Nostrand Reinhold, New York.
KLEENE, S. C. [1956] Representation of events in nerve nets. In Shannon and McCarthy [1956], pp. 3--40.
KNUTH, D. E. [1965] On the translation of languages from left to right. Information and Control 8:6, 607-639. KNUTH, D. E. [1967] Top-down syntax analysis. Lecture Notes. International Summer School on Computer Programming, Copenhagen. Also in Acta Informatica 1:2 (1971), 79-110. KNUTH, D. E. [1968a] The Art of Computer Programming, Vol. 1: Fundamental Algorithms. Addison-Wesley, Reading, Mass. KNUTH, D. E. [1968b] Semantics of context-free languages. Math. Systems Theory 2:2, 127-146. Also see Math. Systems Theory 5:1, 95-96. KNUTH, D. E. [1971] An empirical study of FORTRAN programs. Software-Practice and Experience 1:2, 105-134. KNUTH, D. E. [1973] The Art of Computer Programming, Vol. 3: Sorting and Searching. Addison-Wesley, Reading, Mass. KORENJAK, A. J. [1969] A practical method for constructing LR(k) processors. Comm. ACM 12:11, 613-623. KORENJAK, A. J., and J. E. HOPCROFT [1966]
Simple deterministic languages. IEEE Conference Record of 7th Annual Symposium on Switching and Automata Theory, pp. 36--46. KOSARAJU, S. R. [1970]
Finite state automata with markers. Proc. Fourth Annual Princeton Conference on Information Sciences and Systems, p. 380. KUNO, S., and A. G. OETTINGER [1962] Multiple-path syntactic analyzer. Information Processing 62 (IFIP Congress), Popplewell (ed.). North-Holland, Amsterdam, pp. 306-311. KURKI-SUONIO, R. [1969] Notes on top-down languages. BIT 9, 225-238. LAFRANCE, J. [1970]
Optimization of error recovery in syntax directed parsing algorithms. A CM SIGPLAN Notices 5: 12, 2-17.
LALONDE, W. R., E. S. LEE, and J. J. HORNING [1971] An LALR(k) parser generator. Proc. IFIP Congress 71, TA-3. North-Holland, Amsterdam, pp. 153-157. LEAVENWORTH, B. M. [1966] Syntax macros and extended translation. Comm. ACM 9:11, 790-793. LEDLEY, R. S., and J. B. WILSON [1960] Automatic programming language translation through syntactical analysis. Comm. ACM 3, 213-214. LEE, J. A. N. [1967] Anatomy of a Compiler.
Van Nostrand Reinhold, New York. LEINIUS, R. P. [1970] Error detection and recovery for syntax directed compiler systems. Ph.D. Thesis, University of Wisconsin, Madison. LEWIS, P. M., II, and D. J. ROSENKRANTZ [1971] An ALGOL compiler designed using automata theory. Proc. Symposium on Computers and Automata, Microwave Research Institute Symposia Series, Vol. 21. Polytechnic Institute of Brooklyn, New York, pp. 75-88. LEWIS, P. M., II, and R. E. STEARNS [1968] Syntax directed transduction. J. ACM 15:3, 464-488. LOECKX, J. [1970] An algorithm for the construction of bounded-context parsers. Comm. ACM 13:5, 297-307. LOWRY, E. S., and C. W. MEDLOCK [1969] Object code optimization. Comm. ACM 12:1, 13-22. LUCAS, P., and K. WALK [1969] On the formal description of PL/I. Annual Review in Automatic Programming, Vol. 6, No. 3.
Pergamon, Elmsford, N.Y., pp. 105-182. LUCKHAM, D. C., D. M. R. PARK, and M. S. PATERSON[1970] On formalized computer programs. J. Computer and System Sciences 4:3, 220-249. MANNA, Z. [1973] Program schemas. In Aho [1973]. MARKOV, A. A. [1951] The theory of algorithms (in Russian). Trudi Mathematicheskova Instituta imeni V. A. Steklova 38, 176-189. (English translation, American Math. Soc. Trans. 2:15 (1960), 1-14.)
MARILL, M. [1962] Computational chains and the size of computer programs. IRE Trans. on Electronic Computers EC-11:2, 173-180. MARTIN, D. F. [1972] A Boolean matrix method for the computation of linear precedence functions. Comm. ACM 15:6, 448-454. MAURER, W. D. [1968] An improved hash code for scatter storage. Comm. ACM 11:1, 35-38. MCCARTHY, J. [1963] A basis for the mathematical theory of computation. In Braffort and Hirschberg [1963], pp. 33-71. MCCARTHY, J., and J. A. PAINTER [1967] Correctness of a compiler for arithmetic expressions. In Schwartz [1967], pp. 33-41. MCCLURE, R. M. [1965] TMG--a syntax directed compiler. Proc. ACM National Conference, Vol. 20, pp. 262-274. MCCULLOCH, W. S., and W. PITTS [1943] A logical calculus of the ideas immanent in nervous activity. Bulletin of Math. Biophysics 5, 115-133. MCILROY, M. D. [1960] Macro instruction extensions of compiler languages. Comm. ACM 3:4, 214-220. MCILROY, M. D. [1968] Coroutines. Unpublished manuscript, Bell Laboratories, Murray Hill, N.J. MCILROY, M. D. [1972] A manual for the TMG compiler writing language. Unpublished memorandum, Bell Laboratories, Murray Hill, N.J. MCKEEMAN, W. M. [1965] Peephole optimization. Comm. ACM 8:7, 443-444. MCKEEMAN, W. M. [1966] An approach to computer language design. CS48. Computer Science Department, Stanford University, Stanford, Cal. MCKEEMAN, W. M., J. J. HORNING, and D. B. WORTMAN [1970] A Compiler Generator.
Prentice-Hall, Englewood Cliffs, N.J. MCNAUGHTON, R. [1967] Parenthesis grammars. J. ACM 14:3, 490-500.
MCNAUGHTON, R., and H. YAMADA [1960] Regular expressions and state graphs for automata. IRE Trans. on Electronic Computers 9:1, 39-47. Reprinted in Moore [1964], pp. 157-174.
MENDELSON, E. [1968] Introduction to Mathematical Logic. Van Nostrand Reinhold, New York.
MEYERS, W. J. [1965] Optimization of computer code. Unpublished memorandum. G. E. Research Center, Schenectady, N.Y.
MILLER, W. F., and A. C. SHAW [1968] Linguistic methods in picture processing--a survey. Proc. AFIPS Fall Joint Computer Conference, Vol. 33. The Thompson Book Co., Washington, D.C., pp. 279-290. MINSKY, M. [1967] Computation: Finite and Infinite Machines. Prentice-Hall, Englewood Cliffs, N.J. MONTANARI, U. G. [1970] Separable graphs, planar graphs and web grammars. Information and Control 16:3, 243-267. MOORE, E. F. [1956] Gedanken experiments on sequential machines. In Shannon and McCarthy [1956], pp. 129-153. MOORE, E. F. [1964] Sequential Machines: Selected Papers. Addison-Wesley, Reading, Mass. MORGAN, H. L. [1970] Spelling correction in systems programs. Comm. ACM 13:2, 90-93. MORRIS, ROBERT [1968] Scatter storage techniques. Comm. ACM 11:1, 35-44. MOULTON, P. G., and M. E. MULLER [1967] A compiler emphasizing diagnostics. Comm. ACM 10:1, 45-52. MUNRO, I. [1971] Efficient determination of the transitive closure of a directed graph. Information Processing Letters 1:2, 56-58. NAKATA, I. [1967] On compiling algorithms for arithmetic expressions. Comm. ACM 12:2, 81-84.
NAUR, P. (ed.) [1963] Revised report on the algorithmic language ALGOL 60. Comm. A C M 6:1, 1-17. NIEVERGELT, J.
[1965]
On the automatic simplification of computer programs. Comm. A CM 8:6, 366-370. OETTINGER, A. [1961]
Automatic syntactic analysis and the pushdown store. In Structure of Language and its Mathematical Concepts, Proc. 12th Symposium on Applied Mathematics. American Mathematical Society, Providence, pp. 104-129. OGDEN, W.
[1968]
A helpful result for proving inherent ambiguity. Mathematical Systems Theory 2: 3, 191-194. ORE, O. [1962] Theory of Graphs. American Mathematical Society Colloquium Publications, Vol. 38, Providence. PAGER, D. [1970] A solution to an open problem by Knuth. Information and Control 17: 5, 462-473. PAINTER, J. A. [1970]
Effectiveness of an optimizing compiler for arithmetic expressions. A C M SIGPLAN Notices 5: 7, 101-126. PAIR, C. [1964]
Trees, pushdown stores and compilation. RFTI--Chiffres 7: 3, 199-216. PARIKH, R. J. [1966] On context-free languages. J. A CM 13: 4, 570-581. PATERSON, M. S. [1968] Program schemata. Machine Intelligence, Vol. 3 (Michie, ed.). Edinburgh University Press, Edinburgh, pp. 19-31. PAUL, M. [1962] A general processor for certain formal languages. Proc. ICC Symposium on Symbolic Language Data Processing. Gordon & Breach, New York, pp. 65-74. PAULL, M. C., and S. H. UNGER [1968a]
Structural equivalence of context-free grammars. J. Computer and System Sciences 2:1,427-463. PAULL, M. C., and S. H. UNGER [1968b]
Structural equivalence and LL-k grammars.
IEEE Conference Record of Ninth Annual Symposium on Switching and Automata Theory, pp. 176-186. PAVLIDIS, T. [1972] Linear and context-free graph grammars. J. ACM 19:1, 11-23. PETERSON, W. W. [1957] Addressing for random access storage. IBM J. Research and Development 1:2, 130-146. PETRONE, L. [1968] Syntax directed mapping of context-free languages. IEEE Conference Record of 9th Annual Symposium on Switching and Automata Theory, pp. 160-175. PFALTZ, J. L., and A. ROSENFELD [1969] Web grammars. Proc. International Joint Conference on Artificial Intelligence, Washington, D.C., pp. 609-619. POST, E. L. [1943] Formal reductions of the general combinatorial decision problem. American J. of Math. 65, 197-215. POST, E. L. [1947] Recursive unsolvability of a problem of Thue. J. Symbolic Logic 12, 1-11. Reprinted in Davis [1965], pp. 292-303. POST, E. L. [1965] Absolutely unsolvable problems and relatively undecidable propositions--account of an anticipation. In Davis [1965], pp. 338-433. PRATHER, R. E. [1969] Minimal solutions of Paull-Unger problems. Math. Systems Theory 3:1, 76-85. PRICE, C. E. [1971] Table lookup techniques. ACM Computing Surveys 3:2, 49-66. PROSSER, R. T. [1959] Applications of Boolean matrices to the analysis of flow diagrams. Proc. Eastern Joint Computer Conference, Spartan Books, N.Y., pp. 133-138. RABIN, M. O. [1967] Mathematical theory of automata. In Schwartz [1967], pp. 173-175. RABIN, M. O., and D. SCOTT [1959] Finite automata and their decision problems. IBM J. Research and Development 3, 114-125. Reprinted in Moore [1964], pp. 63-91.
RADKE, C. E. [1970] The use of quadratic residue search. Comm. A C M 13: 2, 103-109. RANDELL, B., and L. J. RUSSELL [1964] ALGOL 60 Implementation. Academic Press, New York. REDZIEJOWSKI, R. R. [1969] On arithmetic expressions and trees. Comm. A C M 12: 2, 81-84. REYNOLDS, J. C. [1965] An introduction to the COGENT programming system. Proc. A C M National Conference, Vol. 20, p. 422. REYNOLDS, J. C., and R. HASKELL[1970] Grammatical coverings. Unpublished memorandum, Syracuse University. RICHARDSON, D. [1968] Some unsolvable problems involving elementary functions of a real variable. J. Symbolic Logic 33, 514-520.
ROGERS, H., JR. [1967] Theory of Recursive Functions and Effective Computability. McGraw-Hill, New York. ROSEN, S. (ed.) [1967a] Programming Systems and Languages. McGraw-Hill, New York. ROSEN, S. [1967b] A compiler-building system developed by Brooker and Morris. In Rosen [1967a], pp. 306-331. ROSENKRANTZ, D. J. [1967] Matrix equations and normal forms for context-free grammars. J. ACM 14:3, 501-507.
ROSENKRANTZ, D. J. [1968] Programmed grammars and classes of formal languages. J. ACM 16:1, 107-131. ROSENKRANTZ, D. J., and P. M. LEWIS, II [1970] Deterministic left corner parsing. IEEE Conference Record of 11th Annual Symposium on Switching and Automata Theory, pp. 139-152. ROSENKRANTZ, D. J., and R. E. STEARNS [1970] Properties of deterministic top-down grammars. Information and Control 17:3, 226-256. SALOMAA, A. [1966] Two complete axiom systems for the algebra of regular events. J. ACM 13:1, 158-169.
SALOMAA, A. [1969a] Theory of Automata. Pergamon, Elmsford, N.Y. SALOMAA, A. [1969b] On the index of a context-free grammar and language. Information and Control 14:5, 474-477. SAMELSON, K., and F. L. BAUER [1960] Sequential formula translation. Comm. ACM 3:2, 76-83. SAMMET, J. E. [1969] Programming Languages: History and Fundamentals. Prentice-Hall, Englewood Cliffs, N.J. SCHAEFER, M. [1973]
A Mathematical Theory of Global Program Optimization. Prentice-Hall, Englewood Cliffs, N.J., to appear.
SCHORRE, D. V. [1964] META II, a syntax oriented compiler writing language. Proc. ACM National Conference, Vol. 19, pp. D1.3-1-D1.3-11. SCHUTZENBERGER, M. P. [1963] On context-free languages and pushdown automata. Information and Control 6:3, 246-264. SCHWARTZ, J. T. (ed.) [1967] Mathematical Aspects of Computer Science. Proc. Symposia in Applied Mathematics, Vol. 19. American Mathematical Society, Providence. SCOTT, D., and C. STRACHEY [1971]
Towards a mathematical semantics for computer languages. Proc. Symposium on Computers and Automata, Microwave Research Institute Symposia Series, Vol. 21. Polytechnic Institute of Brooklyn, New York, pp. 19-46. SETHI, R. [1973] Validating register allocations for straight line programs. Ph.D. Thesis, Department of Electrical Engineering, Princeton University. SETHI, R., and J. D. ULLMAN[1970]
The generation of optimal code for arithmetic expressions. J. ACM 17:4, 715-728. SHANNON, C. E., and J. MCCARTHY (eds.) [1956]
Automata Studies. Princeton University Press, Princeton, N.J.
SHAW, A. C. [1970] Parsing of graph-representable pictures. J. ACM 17:3, 453-481.
SHEPHERDSON, J. C. [1959] The reduction of two-way automata to one-way automata. I B M J. Research 3, 198-200. Reprinted in Moore [1964], pp. 92-97. STEARNS, R. E. [1967] A regularity test for pushdown machines. Information and Control 11:3, 323-340.
STEARNS, R. E. [1971] Deterministic top-down parsing. Proc. Fifth Annual Princeton Conference on Information Sciences and Systems, pp. 182-188. STEARNS, R. E., and P. M. LEWIS, II [1969] Property grammars and table machines. Information and Control 14:6, 524-549. STEARNS, R. E., and D. J. ROSENKRANTZ [1969] Table machine simulation. IEEE Conference Record of 10th Annual Symposium on Switching and Automata Theory, pp. 118-128. STEEL, T. B. (ed.) [1966] Formal Language Description Languages for Computer Programming. North-Holland, Amsterdam. STONE, H. S. [1967] One-pass compilation of arithmetic expressions for a parallel processor. Comm. ACM 10:4, 220-223. STRASSEN, V. [1969] Gaussian elimination is not optimal. Numerische Mathematik 13, 354-356. SUPPES, P. [1960] Axiomatic Set Theory. Van Nostrand Reinhold, New York. TARJAN, R. [1972] Depth first search and linear graph algorithms. SIAM J. on Computing 1:2, 146-160.
THOMPSON, K. [1968] Regular expression search algorithm. Comm. ACM 11:6, 419-422. TURING, A. M. [1936] On computable numbers, with an application to the Entscheidungsproblem. Proc. London Mathematical Soc. Ser. 2, 42, 230-265. Corrections, Ibid., 43 (1937), 544-546. ULLMAN, J. D. [1972a] A note on hashing functions. J. ACM 19:3, 569-575.
ULLMAN, J. D. [1972b] Fast Algorithms for the Elimination of Common Subexpressions.
Technical Report TR-106, Dept. of Electrical Engineering, Princeton University, Princeton, N.J. UNGER, S. H. [1968]
A global parser for context-free phrase structure grammars. Comm. A C M 11 : 4, 240-246, and 11: 6, 427. VAN WIJNGAARDEN, A. (ed.) [1969]
Report on the algorithmic language ALGOL 68. Numerische Mathematik 14, 79-218. WALTERS, D. A. [1970]
Deterministic context-sensitive languages. Information and Control 17:1, 14-61. WARSHALL, S. [1962]
A theorem on Boolean matrices. J. A C M 9 : 1 , 11-12. WARSHALL, S., and R. M. SHAPIRO[1964]
A general purpose table driven compiler. Proc. AFIPS Spring Joint Computer Conference, Vol. 25. Spartan, New York, pp. 59-65. WEGBREIT, B. [1970]
Studies in extensible programming languages. Ph. D. Thesis, Harvard University, Cambridge, Mass. WILCOX, T. R. [1971]
Generating machine code for high-level programming languages. Technical Report 71-103. Department of Computer Science, Cornell University, Ithaca, N.Y. WINOGRAD, S. [1965]
On the time required to perform addition. J. A C M 12: 2, 277-285. WINOGRAD, S. [1967]
On the time required to perform multiplication. J. ACM 14:4, 793-802. WIRTH, N. [1965] Algorithm 265: Find precedence functions. Comm. ACM 8:10, 604-605. WIRTH, N. [1968] PL 360--a programming language for the 360 computers. J. ACM 15:1, 37-74. WIRTH, N., and H. WEBER [1966] EULER--a generalization of ALGOL and its formal definition, Parts 1 and 2. Comm. ACM 9:1-2, 13-23, and 89-99.
WISE, D. S. [1971] Domolki's algorithm applied to generalized overlap resolvable grammars. Proc. Third Annual ACM Symposium on Theory of Computing, pp. 171-184. WOOD, D. [1969a] The theory of left factored languages. Computer J. 12:4, 349-356, and 13:1, 55-62. WOOD, D. [1969b] A note on top-down deterministic languages. BIT 9:4, 387-399. WOOD, D. [1970] Bibliography 23: Formal language theory and automata theory. Computing Reviews 11:7, 417-430. WOZENCRAFT, J. M., and A. EVANS, JR. [1969] Notes on Programming Languages. Department of Electrical Engineering, Massachusetts Institute of Technology, Cambridge, Mass.
YERSHOV, A. P. [1966] ALPHA--an automatic programming system of high efficiency. J. ACM 13:1, 17-24. YOUNGER, D. H. [1967] Recognition and parsing of context-free languages in time n^3. Information and Control 10:2, 189-208.
INDEX TO LEMMAS, THEOREMS, AND ALGORITHMS

Theorems (Theorem Number: Page)
7.1: 548    7.2: 556    7.3: 573    7.4: 574    7.5: 590    7.6: 594    7.7: 601
7.8: 614    7.9: 625    7.10: 639   7.11: 640   7.12: 653   7.13: 657
8.1: 669    8.2: 672    8.3: 676    8.4: 680    8.5: 683    8.6: 684    8.7: 688
8.8: 688    8.9: 691    8.10: 694   8.11: 697   8.12: 698   8.13: 698   8.14: 700
8.15: 701   8.16: 701   8.17: 705   8.18: 707   8.19: 710   8.20: 711   8.21: 713
8.22: 716
9.1: 731    9.2: 733    9.3: 736    9.4: 742    9.5: 746    9.6: 752    9.7: 764
10.1: 803   10.2: 806   10.3: 838   10.4: 839
11.1: 854   11.2: 856   11.3: 860   11.4: 861   11.5: 862   11.6: 888   11.7: 890
11.8: 894   11.9: 901   11.10: 916  11.11: 923  11.12: 939  11.13: 943  11.14: 952
11.15: 955  11.16: 955

Lemmas (Lemma Number: Page)
7.1: 614    7.2: 614    7.3: 638
8.1: 669    8.2: 678    8.3: 681    8.4: 682    8.5: 683    8.6: 684    8.7: 687
8.8: 695    8.9: 696    8.10: 704   8.11: 705   8.12: 705   8.13: 712   8.14: 713
8.15: 715
10.1: 803   10.2: 803   10.3: 833   10.4: 837   10.5: 837   10.6: 837
11.1: 855   11.2: 856   11.3: 859   11.4: 860   11.5: 862   11.6: 862   11.7: 862
11.8: 887   11.9: 887   11.10: 889  11.11: 889  11.12: 899  11.13: 915  11.14: 915
11.15: 953

Algorithms (Algorithm Number: Page)
7.1: 547    7.2: 567    7.3: 572    7.4: 574    7.5: 584    7.6: 589    7.7: 593
7.8: 601    7.9: 613    7.10: 624   7.11: 631   7.12: 633   7.13: 635
8.1: 674    8.2: 678    8.3: 682    8.4: 698    8.5: 702    8.6: 712    8.7: 714
9.1: 740    9.2: 744    9.3: 750    9.4: 762    9.5: 764
10.1: 793   10.2: 795   10.3: 827   10.4: 832   10.5: 834
11.1: 882   11.2: 882   11.3: 892   11.4: 897   11.5: 916   11.6: 940   11.7: 949
11.8: 953
INDEX TO VOLUMES I AND II
A Aanderaa, S. D., 34 Absolute machine code, 720-721 Acceptable property, 815 Acceptance, 96 (see also Final configuration) Accessible configuration, 583 Accessible state, 117, 125-126 Accumulator, 65 Ackley, S. I., 263 Action function, 374, 392 Active variable, 849, 937 Adjacency matrix, 47-51 Aho, A. V., 102-103, 192, 251, 399, 426, 563, 621, 645, 665, 709, 757, 787, 878, 960 Algebraic law, 846, 867-873, 910-911 ALGOL, 198-199, 234, 281, 489-490, 501, 621 Algorithm, 27-36 Allard, R. W., 906 Allen, F. E., 936, 960 Alphabet, 15 Alternates (for a nonterminal), 285, 457 Ambiguous grammar, 143, 163, 202, 207, 281, 489-490, 662-663, 678, 711-712 (see also Unambiguous grammar) Ancestor, 39
Anderson, J. P., 906 Antisymmetric relation, 10 Arbib, M. A., 138 Arc (see Edge) Arithmetic expression, 86, 768-773,778781, 878-907 Arithmetic progression, 124, 209, 925929 Assembler, 59, 74 Assembly language, 65-70, 721, 863, 879-880 Assignment statement, 65-70 Associative law, 23, 617, 868-873, 876, 891, 894-903 Associative tree, 895-902 Asymmetric relation, 9 Atom, 1-2 Attribute (see Inherited attribute, Synthesized attribute, Translation symbol) Augmented grammar, 372-373, 427-428, 634 Automaton (see Recognizer, Transducer) Available expression, 937 Axiom, 19-20 B
Backtrack parsing, 281-314, 456-500, 746--753
Backus, J. W., 76 Backus-Naur form, 58 (see also Contextfree grammar) Backwards determinism (see Unique invertibility) Baer, J. L., 906 Bar-Hillel, Y., 82, 102, 211 Barnett, M. P., 237 Base node, 39 Basic block (see Straight-line code) Bauer, F. L., 455 Bauer, H., 426 Beals, A. J., 578 Beatty, J. C., 906 Becker, S., 426 Begin block, 913, 938 Belady, L. A., 906 Bell, J. R., 563, 811 Berge, C., 52 Bijection, 10 Binary search tree, 792-793 Birman, A., 485 Blattner, M., 211 Block (see Straight-line code) Block-structured language, bookkeeping for, 792, 812-813, 816-821 BNF (see Backus-Naur form) Bobrow, D. G., 82 Book, R. V., 103, 211 Bookkeeping, 59, 62-63, 74, 255, 722723, 781-782, 788-843 Boolean algebra, 23-24, 129 Booth, T. L., 138 Border (of a sentential form), 334, 369 Borodin, A., 36 Bottom-up parsing, 178-184, 268-271, 301-307, 485-500, 661-662, 740742, 767, 816 (see also Boundedright-context grammar, FloydEvans productions, LR(k) grammar, Precedence grammar) Bounded context grammar/language, 450-452 Bounded-right-context (BRC) grammar/ language, 427-435, 448, 451-452, 666, 699-701, 708, 717 (see also (1, 1)-BRC grammar, (1, 0)BRC grammar) Bovet, D. P., 966 Bracha, N., 878 Breuer, M. A., 878
Brooker, R. A., 77, 313 Bruno, J. L., 787 Brzozowski, J. A., 124, 138 Burkhard, W. A., 787 Busam, V. A., 936, 960 C Canonical collection of sets of valid items, 389-391, 616, 621 Canonical grammar, 692-697, 699-700 Canonical LR (k) parser, 393-396 Canonical parsing automaton, 647-648, 657 Canonical set of LR (k) tables, 393-394, 584, 589-590, 625 Cantor, D. G., 211 Cardinality (of a set), 11, 14 Cartesian product, 5 Catalan number, 165 Caviar, 667 Caviness, B. F., 878 CFG (see Context-free grammar) CFL (see Context-free language) Chaining, 808-809 Characteristic function, 34 Characteristic string (of a right sentential form), 663 Characterizing language, 238-243, 251 Cheatham, T. E., 58, 77, 280, 314, 579 Chen, S. C., 906 Chomsky, N., 29, 58, 82, 102, 124, 166, 192, 211 Chomsky grammar, 29 (see also Grammar) Chomsky hierarchy, 92 Chomsky normal form, 150-153, 243, 276-277, 280, 314, 362, 689, 708, 824 Christensen, C., 58 Church, A., 25, 29 Church-Turing thesis, 29 Circuit (see Cycle) Circular translation scheme, 777-778 Clark, E. R., 936 Closed portion (of a sentential form), 334-369 Closure of a language, 17-18, 197 reflexive and transitive, 8-9 of a set of valid items, 386, 633 P
Closure (cont.) transitive, 8-9, 47-50, 52 Cluster (of a syntax tree), 894-895 CNF (see Chomsky normal form) Cocke, J., 76, 332, 936, 960 Cocke-Younger-Kasami algorithm, 281, 314-320 Code generation, 59, 65-70, 72, 74, 728, 765-766, 781-782, 863-867 (see also Translation) Code motion, 924-925 Code optimization, 59, 70-72, 723-724, 726, 769-772, 844-960 Cohen, R. S., 500 Collision (in a hash table), 795-796, 798-799 Colmerauer, A., 500 Colmerauer grammar, 492, 497-500 Colmerauer precedence relations, 490500 Column merger [in LR(k) parser] 611612 Common subexpression (see Redundant computation) Commutative law, 23, 71, 867, 869-873, 87 6, 891-903 Compatible partition, 593-596, 601, 627 Compiler, 53-57 (see also Bookkeeping, Code generation, Code optimization, Error correction, Lexical analysis, Parsing) Compiler-compiler, 77 Compile time computation, 919-921 Complementation, 4, 189-190, 197, 208, 484, 689 Completely specified finite automaton, 117 Component grammar (see Grammar splitting) Composition (of relations), 13, 250 Computational complexity, 27-28, 208, 210, 297-300, 316-320, 326-328, 356, 395-396, 473-476, 736, 839840, 863, 874, 944, 958-959 Computation path, 914-9 t 5, 944 Computed goto (in Floyd-Evans productions), 564 Concatenation, 15, 17, 197, 208-210, 689 Configuration, 34, 95, 113-114, 168-169, 224, 228, 290, 303, 338, 477, 488, 582
Conflict, precedence (see Precedence conflict) Congruence relation, 134 Consistent set of items, 391 Constant, 254 Constant propagation (see Compile time computation) Context-free grammar/language, 57, 9193, 97, 99, 101, 138-211, 240242, 842 Context-sensitive grammar/language, 9193, 97, 99, 101,208, 399 Continuing pushdown automaton, 188189 Conway, R. W., 77 Cook, S. A., 34, 192 Core [of a set of LR(k) items], 626-627 Correspondence problem (see Post's correspondence problem) Cost criterion, 844, 861-863, 891 Countable set, 11, 14 Cover, grammatical (see Left cover, Right cover) CSG (see Context-sensitive grammar) CSL (see Context-sensitive language) Culik, K. II, 368, 500 Cut (of a parse tree), 140-141 Cycle, 39 Cycle-free grammar, 150, 280, 302-303, 307
Dag, 39-40, 42-45, 116, 547-549, 552, 763-765, 854-863, 865-866, 959 Dangling else, 202-204 Data flow, 937-960 Davis, M., 36 D-chart, 79-82, 958 De Bakker, J. W., 878 Debugging, 662 Decidable problem (see Problem) Defining equations (for context-free languages), 159-163 Definition (statement of a program), 909 De Morgan's laws, 12 Denning, P. J., 426, 709 Depth (in a graph), 43 De Remer, F. L., 399, 512, 578, 645, 665 Derivation, 86, 98
992
INDEX TO VOLUMES I AND II
Derivation tree (sde Parse tree) Derivative (of a regular expression), 136 Derived graph, 9407-941 Descendant, 39 \ Deterministic finite automaton, 116, 255 (see also Finite automaton) Deterministic finite transducer, 226-227 (see also Finite transducer) Deterministic language (see Deterministic pushdown automaton.) Deterministic pushdown automaton, 184192, 201-202, 208-210, 251, 344, 398, 446, 448, 466--469, 684-686, 695, 701, 708-709, 711, 712, 717 Deterministic pushdown transducer, 229, 251, 271-275, 341, 395, 443, 446, 730-736, 756 Deterministic recognizer, 95 (see also Deterministic finite automaton, Deterministic pushdown automaton, Deterministic two-stack parser, Recognizer) Deterministic two-stack parser, 488, 492493, 500 Dewar, R. B. K., 77 Diagnostics (see Error correction) Difference (of sets), 4 Differentiation, 760-763 Dijkstra, E. W., 79 Direct access table, 791-792 Direct chaining, 808 Direct dominator (see Dominator) Direct lexical analysis, 61-62, 258-.281 Directed acyciic graph (see Dag) Directed graph (see Graph, directed) Disjoint sets, 4 Distance (in a graph), 47-50 Distinguishable states, 124-128, 593, 654-657 Distributive law, 23, 868 Domain, 6, 10 Domolki's algorithm, 312-313, 452 Dominator, 915-917, 923-924, 934, 939, 959 Don't care entry [in an LR(k) table], 581-5812, 643 (see also cp-inaccessible ) DPDA (see Deterministic pushdown automaton) DPDT (see Deterministic pushdown transducer)
Dyck language, 209
e (see Empty string) Earley, J., 332, 578, 645 Earley's algorithm, 73, 281,320-31,397398 Edge, 37 e-free first (EFF), 381-382, 392, 398 e-free grammar, 147-149, 280, 302-303, 397 Eickel, J., 455 Eight queens problem, 309 Elementary operation, 317, 319, 326, 395 e-move, 168, 190 Emptiness problem, 130-132, 144-145, 483 Empty set, 2 Empty string, 15 Endmarker, 94, 271, 341, 404, 469, 484, 698, 701,707, 716 Engeler, E., 58 English, structure of, 55-56, 78 Englund, D. E., 936, 960 Entrance (of an interval), 947 Entry (of a block), 912 e-production, 92, 147-149, 362, 674-680, 686--688, 690 Equivalence of LR(k) table sets, 585-588, 590596, 601, 617, 625, 652-683 of parsers, 560, 562-563, 580-581 (see also Equivalence of LR(k) table sets, Exact equivalence) of programs, 909, 936 of straight-line blocks, 848, 891 topological (see Equivalence of straight-line blocks) under algebraic laws, 868-869 Equivalence class (see Equivalence relation) Equivalence problem, 130-132, 201, 237, 362, 684-686, 709, 936 Equivalence relation, 6--7, 12-13, 126, 133-134 Erase state, 691, 693, 708 Error correction/recovery, 59, 72-74, 77, 367, 394, 399, 426, 546, 586, 615, 644, 781, 937
INDEX TO VOLUMESI AND II Error indication, 583 Essential blank (of a precedence matrix), 556 Euclid's algorithm, 26-27, 36, 910 Evans, A., 455, 512 Evey, R. J., 166, 192 Exact equivalence (of parsers), 555-559, 562, 585 Exit (of an interval), 947 Extended precedence grammar, 410-415, 424-425, 429, 451,717 Extended pushdown automaton, 173-175, 185-186 Extended pushdown transducer, 269 Extended regular expression, 253-258 Extensible language, 58, 501-504
Feldman, J. A., 77, 455 Fetch function (of a recognizer), 94 FIN, 135, 207 Final configuration, 95, 113, 169, 175, 224, 228, 339, 583, 648 Final state, t 13, 168, 224 Finite ambiguity, 332 Finite automaton, 112-121, 124-128, 255-261, 397 (see also Canonical parsing automaton) Finite control, 95, 443 (see also State) Finite set, 11, 14 Finite transducer, 223-227, 235, 237240, 242, 250, 252, 254-255, 258, 722 FIRST, 300, 335-336, 357-359 Fischer, M. J., 102, 426, 719, 843 Fischer, P. C., 36 Flipping (of statements), 853-859, 861863 Flow chart, 79-82 (see also Flow graph) Flow graph, 907, 913-914 Floyd, R. W., 52, 77, 166, 211, 314, 426, 455, 563, 878, 906 Floyd-Evans productions, 443-448, 452, 564-579 FOLLOW, 343, 425, 616, 640 Formal Semantic Language, 455 FORTRAN, 252, 501,912, 958 Frailey, D. J., 906 Freeman, D. N., 77, 263
993
Frequency profile, 912 Frontier (of a parse tree), 140 Function, 10, 14 Futrelle, R. P., 237 G Galler, B. A., 58 Gear, C. W., 936 GEN, 945-955 Generalized syntax-directed translation scheme, 758-765, 782-783 Generalized top-down parsing language, 469-485, 748-753 Gentleman, W. M., 61 Gill, A., 138 Ginsburg, S., 102-103, 138, 166, 211, 237 Ginzburg, A., 138 GNF (see Greibach normal form) Gotlieb, C. C., 314 GOTO, 386-390, 392, 598, 616 GOTO -a, 598-600 Goto function, 374, 392 GOTO graph, 599-600 Governing table, 581 Graham, R. L., 907 Graham, R. M., 455 Graham, S. L., 426, 709 Grammar (see Bounded-right-context grammar, Colmerauer grammar, Context-free grammar, Contextsensitive grammar, Indexed grammar, LC(k) grammar, LL(k) grammar, LR(k) grammar, Operator grammar, Precedence grammar, Right-linear grammar, Web grammar) Grammar splitting, 631-645 Graph directed, 37-52 undirected, 51 Graph grammar (see Web grammar) Gray, J. N., 192, 280, 455, 500, 709 Greek letters, 2t4 Greibach, S. A., 102-103, 166, 211 Greibach normal form, 153-162, 243, 280, 362, 668, 681-684, 690, 708 Gries, D., 76--77 Griffiths, T. V., 314 Griswold, R. E., 505
994
INDEXTO VOLUMES I AND II
Gross, M., 211 GTDPL (see Generalized parsing language)
top-down
Haines, L. H., 103 Halmos, P. R., 3, 25 Halting (see Recursive set, Algorithm) Halting problem, 35 Halting pushdown automaton, 282-285 Handle (of a right-sentential form) 179180, 377, 379-380, 403-404, 486 Harary, F., 52 Harrison, M. A., 138, 192, 280, 455, 500, 709 Hartmanis, J., 36, 192 Hashing function, 794-795, 797-798 (see also Hashing on locations, Linear hashing function, Random hashing function, Uniform hashing function ) Hashing on locations, 804-807 Hash table, 63, 793-811 Haskell, R., 280 Haynes, H. R., 578 Hays, D. G., 332 Header (of an interval), 938-939 Hecht, M. S., 960 Height in a graph, 43 of an LR(k) table, 598 Hellermann, H., 906 Hext, J. B., 314 Hochsprung, R. R., 77 Homomorphism, 17-18, 197, 207, 209, 213, 689 Hopcroft, J. E., 36, 102-103, 138, 192, 211, 368, 399, 690, 843, 960 Hopgood, F. R. A., 76, 455 Horning, J. J., 76-77, 450, 465 Horwitz, L. P., 906 Huffman, D. A., 138
Ianov, I. I., 937 Ibarra, O., 192 Ichbiah, J. D., 426, 579 Identifier, 60-63, 252, 254
Igarishi, S., 878 IN, 944-955, 959 Inaccessible state (see Accessible state) Inaccessible symbol (of a context-free grammar), 145-147 Inclusion (of sets), 3, 208 In-degree, 39 Independent nodes (of a dag), 552-555 Index (of an equivalence relation), 7 Indexed grammar, 100-101 Indirect chaining (see Chaining) Indirect lexical analysis, 61-62, 254-258 Indistinguishable states (see Distinguishable states.) Induction (see Proof by induction) Induction variable, 925-929 Infinite set, 11, 14 Infix expression, 214-215 Ingerman, P. Z., 77 Inherent ambiguity, 205-207, 209 Inherited attribute, 777-781,784 INIT, 135, 207 Initial configuration, 95, 113, 169, 583, 648 Initial state, 113, 168, 224 Initial symbol (of a pushdown automaton), 168 Injection, 10 Input head, 94-96 Input symbol, 113, 168, 218, 224 Input tape, 93-96 Input variable, 845 Interior frontier, 140 Intermediate (of an indexed grammar), 100
Intermediate code, 59, 65-70, 722-727, 844-845, 908-909 Interpreter, 55, 721, 725 Interrogation state, 650, 658 Intersection, 4, 197, 201, 208, 484, 689 Intersection list, 824-833, 839-840 Interval analysis, 937-960 Inverse (of a relation), 6, 10-11 Inverse finite transducer mapping, 227 Irland, M. I., 36 Irons, E. T., 77, 237, 314, 455 Irreducible flow graph, 953-955 Irreflexive relation, 9 Item (Earley's algorithm), 320, 331,397398 Item [LR(k)], 381 (see also Valid item)
INDEX TO VOLUMES I AND II
995
Left cover, 275-277, 280, 307, 690 Left factoring, 345 Left-linear grammar, 122 Johnson, S. C., 665 Leftmost derivation, 142-143, 204, 318Johnson, W. L., 263 320 Left-parsable grammar, 271-275, 341, 672-674 Left parse (see Leftmost derivation, TopKameda, T., 138 down parsing) Kaplan, D. M., 937 Left parse language, 273, 277 Karp, R. M., 906 Left parser, 266-268 Kasami, T., 332 Left recursion, 484 (see also Left-recurKennedy, K., 906, 960 sive grammar) Keyword, 59, 259 Left-recursive grammar, 153-158, 287KGOTO, 636-637 288, 294-298, 344-345, 681-682 Kleene, S. C., 36, 124 Left-sentential form, 143 Kleene closure (see Closure, of a lanLeinius, R. P., 399, 426, 645 guage) Knuth, D. E., 36, 58, 368, 399, 485, 690, Length, of a derivation, 86 709, 787, 811, 912, 960 of a string, 16 Korenjak, A. J., 368, 399, 636, 645, 690 Lentin, A., 211 Kosaraju, S. R., 138 Lewis, P. M. II, 192, 237, 368, 621, 757, k-predictive parsing algorithm (see Pre843 dictive parsing algorithm) Lexical analysis, 59-63, 72-74, 251-264, k'-uniform hashing function (see Uniform 721-722, 781, 789, 823 hashing function) Lexicographic order, 13 Kuno, S., 313 Limit flow graph, 941 Kurki-Suonio, R., 368, 690 Linear bounded automaton, 100 (see also Context-sensitive grammar) Linear grammar/language, 165-170, 207-208, 237 Labeled graph, 38, 42, 882, 896-897 Linear hashing function, 804-805 La France, J. E., 578-579 Linearization graph, 547-550 Lalonde, W. R., 450 Linear order 10, 13-14, 43-45, 865 LALR(k) grammar (see Lookahead Linear precedence functions, 543-563 LR(k) grammar) Linear set, 209-210 Lambda calculus, 29 Link editor, 721 Language, 16-17, 83-84, 86, 96, 114, LL(k) grammar/language, 73, 333-368, 169, 816 (see also Recognizer, 397-398, 448, 452, 579, 643, 664, Grammar) 666--690, 709, 711,716-717, 730LBA (see Linear bounded automaton) 732, 742-746 LC(k) grammar/language, 362-367 LL(k) table, 349-351, 354-355 Leaf, 39 LL(1) grammar, 342-349, 483, 662 Leavenworth, B. M., 58, 501 LL(0) grammar, 688-689 Lee, E. S., 450 Loader, 721 Lee, J. A. N., 76 Load module, 721 Left-bracketed representation (for trees), Loeckx, J., 455 46 Logic, 19-25 Left-corner parse, 278-280, 310-312, Logical connective, 21-25 362-367 Lookahead, 300, 306, 331, 334-336, 363, Left-corner parser, 310-312 371
996
INDEX TO VOLUMES I AND II
Lookahead LR(k) grammar, 627-630, 642-643, 662 Looping (in a pushdown automaton), 186-189 Loops (in programs), 907-960 Loop unrolling, 930-93.2 Lowry, E. S., 936, 960 LR(k) grammar/language, 73, 369, 371399, 402, 424, 428, 430, 448, 579665, 666-674, 730, 732-736, 740742 LR(k) table, 37~-376, 392-394, 398, 580-582 LR(1) grammar, 410, 448-450, 690, 701 LR(0) grammar, 590, 642, 646-657, 690, 699, 701,708 Lucas, P., 58 Luckham, D. C., 909, 937 Lukasiewicz, J., 214 M Major node, 889-890, 903 Manna, Z., 937 Mapping (see Function) Marill, M., 936 Marked closure, 2 I0 Marked concatenation, 210 Marked union, 210 Markov, A. A., 29 Markov algorithm, 29 Martin, D. F., 563 Maurer, W. D., 811 MAX, 135, 207, 209 Maxwell, W. L., 77 McCarthy, J., 77 McClure, R. M., 77, 485, 757 McCullough, W. S., 103 Mcllroy, M. D., 58, 61,757, 843 McKeeman, W. M., 76-77, 426, 455,936 McNaughton, R., 124, 690 Medlock, C. W., 936, 960 Membership (relation on sets), 1 Membership problem, 130-132 Memory (of a recognizer), 93-96 Memory references, 903-904 Mendelson, E., 25 META, 485 Meyers, W. J., 906 Miller, G. A., 124
Miller, R. E., 906 Miller, W. F., 82 MIN, 135, 207, 209 Minimal fixed point, 108-110, 121-123, 160-161 Minor node, 889, 903 Minsky, M., 29, 36, 102, 138 Mixed strategy precedence grammar, 435, 437-439, 448, 452, 552 Modus ponens, 23 Montanari, U. G., 82 Moore, E. F., 103, 138 Morgan, H. L., 77, 263 Morris, D., 77, 313 Morris, R., 811, 843 Morse, S. P., 426, 579 Moulton, P. G., 77 Move (of a recognizer), 95 MSP (see Mixed strategy precedence grammar) Muller, M. E., 77 Multiple address code, 65, 724-726 Munro, I., 52 N N akata, I., 906 NAME, 769 Natural language, 78, 281 (see also English ) Naur, P., 58 Neutral property, 814-815, 823, 827, 840-841 NEWLABEL, 774 NEXT, 598-599, 609, 616 Next move function, 168, 224 Nievergelt, J., 936 Node, 37 Nondeterministic algorithm, 285, 308310 Nondeterministic finite automaton, 117 (see also Finite automaton) Nondeterministic FORTRAN, 308-310 Nondeterministic property grammar, 823, 841 Nondeterministic recognizer, 95 (see also Recognizer) Non-left-recursive grammar (see Leftrecursive grammar) Nonnullable symbol (see Nullable symbol)
INDEX TO VOLUMESI AND II Nonterminat, 85, 100, 218, 458 Normal form deterministic pushdown automaton, 690-695 Northcote, R. S., 578 Nullable symbol, 674-680 O Object code, 59, 720 Oettinger, A., 192, 313 Ogden, W., 211 Ogden's lemma, I92-196 (1, 1 )-bounded-right-context grammar, 429-430, 448, 690, 701 ( 1, 0)-bounded-right-context grammar, 690, 699-701,708 One-turn pushdown automaton, 207-208 One-way recognizer, 94 Open block, 856-858 Open portion (of a sentential form), 334, 369 Operator grammar/language, 165, 438 Operator precedence grammar/language, 439-443, 448-450, 452, 550-551, 711-718 Order (of a syntax-directed translation), 243-251 Ordered dag, 42 (see also Dag) Ordered graph, 41-42 Ordered tree, 42-44 (see also Tree) Ore, O., 52 OUT, 944-958 Out degree, 39 Output symbol, 218, 224 Output variable, 844
Pager, D., 621 Painter, J. A., 77 Pair, C., 426 PAL, 512-517 Parallel processing, 905 Parenthesis grammar, 690 Parikh, R. J., 211 Parikh's theorem, 209-211 Park, D. M. R., 909, 937 Parse lists, 321 Parse table, 316, 339, 345-346, 348, 351356, 364-365, 374
997
Parse tree, 139-143, 179-180, 220-222, 273, 379, 464-466 (see also Syntax tree) Parsing, 56, 59, 63-65, 72-74, 263-280, 722, 781 (see also Bottom-up parsing, Shift-reduce parsing, Top-down parsing) Parsing action function (see Action function) Parsing automaton, 645-665 Parsing machine, 477-482, 484, 747-748, 750-753 Partial acceptance failure (of a TDPL or GTDPL program), 484 Partial correspondence problem, 36 Partial function, 10 Partial left parse, 293-296 Partial order, 9-10, 13-15, 43-45, 865 Partial recursive function (see Recursive function) Partial right parse, 306 Pass (of a compiler), 723-724, 782 Paterson, M. S., 909, 937 Path, 39, 51 Pattern recognition, 79-82 Paul, M., 455 Paull, M. C., i66, 690 Pavlidis, T., 82 PDA (see Pushdown automaton) PDT (see Pushdown transducer) Perfect induction, 20 Perles, M., 211 Perlis, A. J., 58 Peterson, W. W., 811 Petrick, S. R., 314 Pfaltz, J. L., 82 Phase (of compilation), 721,781-782 w-inaccessible set of LR(k) tables, 588597, 601,613, 616 Phrase, 486 Pig Latin, 745-747 Pitts, E., 103 PL/I, 501 PL360, 507-511 Poage, J. F., 505 Polish notation (see Prefix expression, Postfix expression) Polonsky, I. P., 505 Pop state, 650, 658 Porter, J. H., 263 Position (in a string), 193
998
INDEXTO VOLUMESI AND II
Post, E. L., 29, 36 Postdominator, 935 Postfix expression, 214-215, 217-218, 229, 512, 724, 733-735 Postfix simple syntax-directed translation, 733, 756 Postorder (of a tree), 43 Postponement set, 600-607, 617, 626 Post's correspondence problem, 32-36, 199-201 Post system, 29 Power set, 5, 12 PP (see Pushdown processor) Prather, R. E., 138 Precedence (of operators) 65, 233-234, 608, 617 Precedence conflict, 419-420 Precedence grammar/language, 399-400, 403, 404, 563, 579 (see also Extended precedence grammar, Mixed strategy precedence grammar, Operator precedence grammar, Simple precedence grammar, T-canonical precedence grammar, (2, 1)-precedence grammar, Weak precedence grammar) Predecessor, 37 Predicate, 2 Predictive parsing algorithm, 338-348, 351-356 Prefix expression, 214,215, 229, 236, 724, 730-731 Prefix property, 17, 19, 209, 690, 708 Preorder (of a tree), 43, 672 Problem, 29-36 Procedure, 25-36 Product of languages, 17 (see also Concatenation) of relations, 7 Production, 85, 100 Production language (see Floyd-Evans productions) Program schemata, 909 Proof, 19-21, 43 Proof by induction, 20-21, 43 Proper grammar, 150, 695 Property grammar, 723, 788, 811-843 Property list, 824-833, 839-840 Propositional calculus, 22-23, 35 Prosser, R. T., 936
Pseudorandom number, 798, 807, 808 Pumping lemma, for context-free languages, 195-196 (see also Ogden's lemma) for regular sets, 128-129 Pushdown automaton, 167-192, 201, 282' (see also Deterministic pushdown automaton) Pushdown list, 94, 168, 734-735 Pushdown processor, 737-746, 763-764 Pushdown symbol, 168 Pushdown transducer, 227-233, 237, 265-268, 282-285 (see also Deterministic pushdown transducer) Push state, 658 Q Quasi-valid LR(k) item, 638-639 Question (see Problem) Quotient (of languages), 135, 207, 209
Rabin, M. O., 103, 124 Radke, C. E., 811 Randell, B., 76 Random hashing function, 799, 803-807, 809-810 Range, 6, 10 Read state (of parsing automaton), 658 Reasonable cost criterion, 862-863 Recognizer, 93-96, 103 (see also Finite automaton, Linear bounded automaton, Parsing machine, Pushdown automaton, Pushdown processor, Turing machine, Twostack parser) Recursive function, 28 Recursive grammar, 153, 163 Recursively enumerable set, 28, 34, 9 2 , 97, 500 Recursive set, 28, 34, 99 Reduced block, 859-86.1 Reduced finite automaton, 125-128 Reduced parsing automaton, 656-657 Reduced precedence matrix, 547 Reduce graph, 574-575 Reduce matrix, 574-575
INDEX TO VOLUMES I AND II Reduce state, 646 Reducible flow graph, 937, 941,944-952, 958-959 Reduction in strength, 921, 929-930 Redundant computation, 851-852, 854859, 861-863, 918-919 Redziejowski, R. R., 906 Reflexive relation, 6 Reflexive-transitive closure (see Closure, reflexive and transitive) Region, 922-924, 926-927, 929 Regular definition, 253-254 Regular expression, 104-110, 1.21-123 Regular expression equations, 105-112, 121-123 Regular grammar, 122, 499 Regular set, 103-138, 191, 197, 208-210, 227, 235, 238-240, 424, 689 Regular translation (see Finite transducer) Relation, 5-15 Relocatable machine code, 721 Renaming of variables, 852, 854-859, 861-863 Reversal, 16, 121, 129-130, 397, 500, 689 Reynolds, J. C., 77, 280, 313 Rice, H. G., 166 Richardson, D., 878 Right-bracketed representation (for trees), 46 Right cover, 275-277, 280, 307, 708, 718 Right-invariant equivalence relation, 133134 Right-linear grammar, 91-92, 99, 110112, 118-121,201 Right-linear syntax-directed translation scheme, 236 Rightmost derivation, 142-143,264, 327330 Right parsable grammar, 271-275, 398, 672-674 Right parse (see Rightmost derivation, Bottom-up parsing) Right parse language, 273 Right parser, 269-27t Right-sentential form, 143 Roberts, P. S., 314 Rogers, H., 36 Root, 40 Rosen, S., 76
999
Rosenfeld, A., 82 Rosenkrantz, D. J., 102, 166, 368, 621, 690, 843 Ross, D. T., 263 Rule of inference, 19-20 Russell, L. J., 76 Russell's paradox, 2
Salomaa, A., 124, 138, 211 Samelson, K., 455 Sammet, J. E., 29, 58, 807 Sattley, K., 312 Scan state, 691, 693, 707 Scatter table (see Hash table) Schaefer, M., 960 Schorre, D. V., 77, 313, 485 Schutte, L. J., 578 Schutzenberger, M. P., 166, 192, 211 Schwartz, J. T., 76, 960 Scope (of a definition), 849 Scott, D., 103, 124 SDT (see Syntax-directed translation) Self-embedding grammar, 210 Self-inverse operator, 868 Semantic analysis, 723 Semantics, 55-58, 213 Semantic unambiguity, 274, 758 Semilinear set, 210 Semireduced automaton, 653, 661 Sentence (see String) Sentence symbol (see Start symbol) Sentential form, 86, 406-407, 414-415, 442, 815 Set, 1-19 Set merging problem, 833-840, 843 Sethi, R., 878, 906 Shamir, E., 211 Shapiro, R. M., 77 Shaw, A. C., 82 Shepherdson, J. C., 124 Shift graph, 570-574 Shift matrix, 569-574 Shift-reduce conflict, 643 Shift-reduce parsing, 269, 301-302, 368371, 392, 400-403, 408, 415, 418419, 433, 438-439, 442, 544, 823 (see also Bounded-right-context grammar, LR (k) grammar, Precedence grammar)
Shift state, 646
Simple LL(1) grammar/language, 336
Simple LR(k) grammar, 605, 626-627, 633, 640-642, 662
Simple mixed strategy precedence grammar/language, 437, 448, 451-452, 642, 666, 690, 695-699
Simple precedence grammar/language, 403-412, 420-424, 492-493, 507, 544-552, 555-559, 615, 666, 709-712, 716-718
Simple syntax-directed translation, 222-223, 230-233, 240-242, 250, 265, 512, 730-736 (see also Postfix simple SDTS)
Single entry region, 924 (see also Region)
Single productions, 149-150, 452, 607-615, 617, 712
Skeletal grammar, 440-442, 452, 611
SLR(k) grammar (see Simple LR(k) grammar)
SNOBOL, 505-507
Solvable problem (see Problem)
Source program, 59, 720
Space complexity (see Computational complexity)
Spanning tree, 51, 571-574
Split canonical parsing automaton, 651-652, 659-660
Splitting, of grammars (see Grammar splitting)
  of LR(k) tables, 629-631
  of states of parsing automaton, 650-652, 657-660
Splitting nonterminal, 631
Standard form (of regular expression equations), 106-110
Standish, T., 77, 280
Start state (see Initial state)
Start symbol, 85, 100, 168, 218, 458 (see also Initial symbol)
State (of a recognizer), 113, 168, 224
State transition function, 113
Stearns, R. E., 192, 211, 237, 368, 621, 690, 757, 843
Steel, T. B., 58
Stockhausen, P., 906
Stone, H. S., 906
Storage allocation, 723
Store function (of a recognizer), 94
Straight line code, 845-879, 909-910, 912
Strassen, V., 34, 52
String, 15, 86
Strong characterization (see Characterizing language)
Strong connectivity, 39
Strong LL(k) grammar, 344, 348
Strongly connected region (see Region)
Structural equivalence, 690
SUB, 135
Subset, 3
Substitution (of languages), 196-197
Successor, 37
Suffix property, 17, 19
Superset, 3
Suppes, P., 3
Symbol table (see Bookkeeping)
Symmetric relation, 6
Syntactic analysis (see Parsing)
Syntax, 55-57
Syntax-directed compiler, 730
Syntax-directed translation, 57, 66-70, 215-251, 445, 730-787, 878 (see also Simple syntax-directed translation)
Syntax macro, 501-503
Syntax tree, 722-727, 881 (see also Parse tree)
Synthesized attribute, 777, 784
T

Tagged grammar, 617-620, 662
Tag system, 29, 102
Tautology, 23
T-canonical precedence grammar, 452-454
TDPL (see Top-down parsing language)
Temporary storage, 67-70
Terminal (of a grammar), 85, 100, 458
Theorem, 20
Thickness (of a string), 683-684, 690
Thompson, K., 138, 263
Three address code, 65 (see also Multiple address code)
Time complexity (see Computational complexity)
TMG, 485
Token, 59-63, 252
Token set, 452
Top-down parsing, 178, 264-268, 285-301, 445, 456-485, 487, 661-662, 742-746, 767-768, 816 (see also LL(k) grammar)
Top-down parsing language, 458-469, 472-473, 484-485
Topological sort, 43-45
Torii, K., 332
Total function, 10, 14
Total recursive function, 28
TRANS, 945-955
Transducer (see Finite transducer, Parsing machine, Pushdown processor, Pushdown transducer)
Transformation, 71-72 (see also Function, Translation)
Transition function (see State transition function, Next move function)
Transition graph, 116, 225
Transitive closure (see Closure, transitive)
Transitive relation, 6
Translation, 55, 212-213, 720-787 (see also Code generation, Syntax-directed translation, Transducer)
Translation element, 758
Translation form (pair of strings), 216-217
Translation symbol, 758
Translator, 216 (see also Transducer)
Traverse (of a PDA), 691, 693-694
Tree, 40-42, 45-47, 55-56, 64-69, 80-82, 283-284, 434-436 (see also Parse tree)
Truth table, 21
T-skeletal grammar, 454
Turing, A. M., 29, 102
Turing machine, 29, 33-36, 100-132
(2, 1)-precedence grammar, 426, 448, 666, 690, 702-707
Two-stack parser, 487-490, 499-500
Two-way finite automaton, 123
Two-way pushdown automaton, 191-192
U

UI (see Unique invertibility)
Ullman, J. D., 36, 102-103, 192, 211, 251, 399, 426, 485, 563, 621, 645, 665, 709, 757, 787, 811, 843, 878, 960
Unambiguous grammar, 98-99, 325-328, 344, 395, 397, 407, 422, 430, 613, 663 (see also Ambiguous grammar)
Undecidable problem (see Problem)
Underlying grammar, 226, 758, 815
Undirected graph (see Graph, undirected)
Undirected tree, 51
Unger, S. H., 314, 690
Uniform hashing function, 810
Union, 4, 197, 201, 208, 484, 689
Unique invertibility, 370, 397, 404, 448, 452, 490, 499, 712-713
Universal set, 4
Universal Turing machine, 35
Unordered graph (see Graph)
Unrestricted grammar/language, 84-92, 97-98, 100, 102
Unsolvable problem (see Problem)
Useless statement, 844, 849-851, 854-859, 861-863, 917-918, 937
Useless symbol, 123, 146-147, 244, 250, 280
V

Valid item, 381, 383-391, 394
Valid parsing algorithm, 339, 347-348
Valid set of LR(k) tables, 584
Value (of a block), 846, 857
Van Wijngaarden, A., 58
Variable (see Nonterminal)
Venn diagram, 3
Vertex (see Node)
Viable prefix (of a right-sentential form), 380, 393, 616
W

Walk, K., 58
Walters, D. A., 399
Warshall, S., 52, 77
Warshall's algorithm, 48-49
WATFOR, 721
Weak precedence function, 551-555, 561
Weak precedence grammar, 415-425, 437, 451-452, 552, 559, 561, 564-579, 667, 714-716
Weber, H., 426, 563
Web grammar, 79-82
Wegbreit, B., 58
Weiner, P., 138
Well-formed (TDPL or GTDPL program), 484
Well order, 13, 19, 24-25
Winograd, S., 33, 906
Wirth, N., 426, 507, 563
Wirth-Weber precedence (see Simple precedence grammar)
Wise, D. S., 455
Wolf, K. A., 906
Wood, D., 368
Word (see String)
Worley, W. S., 77
Wortman, D. B., 76-77, 455, 512
Wozencraft, J. M., 512
Write sequence (for a write state of a PDA), 694
Write state, 691, 693, 708
Y

Yamada, H., 124
Younger, D. H., 332
Z

Zemlin, R. A., 906
(continued from front flap)
• Describes methods for rapid storage of information in symbol tables.
• Describes syntax-directed translations in detail and explains their implementation by computer.
• Provides detailed mathematical solutions and many valuable examples.
ALFRED V. AHO is a member of the Technical Staff of Bell Telephone Laboratories' Computing Science Research Center at Murray Hill, New Jersey, and an Affiliate Associate Professor at Stevens Institute of Technology. The author of numerous published papers on language theory and compiling theory, Dr. Aho received his Ph.D. from Princeton University.

JEFFREY D. ULLMAN is an Associate Professor of Electrical Engineering at Princeton University. He has published numerous papers in the computer science field and was previously co-author of a text on language theory. Professor Ullman received his Ph.D. from Princeton University.
PRENTICE-HALL, Inc., Englewood Cliffs, New Jersey 1176 • Printed in U.S. of America