Compilers

TODO merge in notes from CS164

  • TODO: distinguish grammars from parsers more cleanly in these notes
  • compiler types
    • recursive descent
      • parses LL(k)
      • simplest to hand-code; simply recursive functions that backtrack
      • non-predictve; requires backtracking
      • most embedded parser languages are this (e.g. haskell’s parsec monadic parser), unless the embedded language constructs the data structures fed into a “more separate” parsing
    • predictive grammars
      • use k tokens of lookahead to fully determine which production to use
      • no backtracking needed
    • top-down operator precedence (pratt) parser
      • http://journal.stuffwithstuff.com/2011/03/19/pratt-parsers-expression-parsing-made-easy/
      • basic idea: allow yourself to call into main parse(precedence) from anywhere, and the precedence is always checked to be non-decreasing
      • also controls associativity
      • pseudocode

        def parse(precedence):
          token = next()
          p = prefix_parsers[token.type]
          parsed = p.parse(token)
          while True:
            token = next()
            p = infix_parsers[token.type]
            if p is None or precedence < token.precedence: break
            parsed = p.parse(parsed, token)
          return parsed
  • parsing types
    • shift-reduce parsing: a type of bottom-up parsing
  • grammar types (hierarchy)
    • bottom-up
      • LR
        • LR(0)
        • SLR(1!)
        • LALR(1)
        • LR(1)
        • LR(n)
  • grammar types (details)
    • general
      • first L means left-to-right scanning
      • second L/R means left-to-right/right-to-left derivation (leftmost/rightmost)
        1. = k-token lookahead
    • LL(k) aka top-down: simplest
      • produces leftmost derivation top-down
      • LL(1): need to look ahead 1 token to determine which rule to apply (for infix, etc.)
      • animation: http://www.cs.wm.edu/~noonan/animations/llparse.html
      • ANTLR
      • LL is a subset of LR
    • non-commutative LL(*): most hand-written parsers, combinators
      • memoized non-commutative LL(*): packrat parsers, scala 2.8 parser combinators
      • harder theoretical analysis vs. LR, but probably weaker
    • LR(k)
      • k is the number of tokens to lookahead at the end of a rule
      • produces rightmost derivation in reverse bottom-up TODO
      • don’t get used much; internal tables tend to get too big
      • handles left-recursion
    • LALR(k)
      • YACC, bison
    • SLR(k)
      • LR(0) + followsets TODO
  • grammar forms
    • Backus-Naur Form (BNF): TODO
    • Extended BNF (EBNF): TODO
  • determining ambiguity: uncomputable
    • eg “dangling else” in C/C++/Java; convention is attach to nearest
      • if SLR/LR(1)/LALR LR, rely on generated parser feature of preferring shift over reduce if conflict
    • LR (and thus LL) grammars are always non-ambiguous
    • unsure: LR parsers can identify certain non-ambiguous grammars, but they have this gray area where they can’t tell whether the grammar is ambiguous
    • but at least GLR can notify you of this when it happens
  • parsing expression grammar (PEG)
    • recursive descent (top-down) parsers: no left recursion
    • packrat parsers
      • a memoizing parsers; use much more space TODO how much?
      • linear time (unlike GLR)
      • rewriting extension can handle left recursion
    • no ambiguity because alternative productions are ordered
      • that’s also a downside: can’t tell when you have ambiguous grammar
    • all LL grammars are PEGs but not vice versa
  • code generation
  • superoptimizers: exhaustively search for truly optimal generated code sequence
  • language classes: low to high power
    • regular: regexes; DFAs/NFAs
    • context-free: don’t need symbol table
      • python is context-free but req’s tokenizer that recognizes (de-)indents
    • context-sensitive: need symbol table; eg in C++, is T*x a type decl or a mult? (“most vexing parse”)

Tools

  • parser generators
    • lex/yacc: old and busted; LALR
    • flex/bison: new hotness; LALR
      • generates C
      • does GLR
      • flex includes flex++ for C++; bison++ also exists
    • javacc: most popular for java
    • alex/happy: haskell
    • ANTLR: LL(k) parser generator
  • edsl parsers
    • parsec: haskell
    • scala
  • boost spirit 2: GLR

Grammars

  • terminals: loosely, can be used interchangeably with “token”
  • non-terminals
  • start symbol
  • productions

Derivations

  • sequence of production steps proving that a string belongs in the language
  • 2 choices at each step:
    • which nonterminal to expand? leftmost vs. rightmost
    • which production to use for that nonterminal?
  • e.g.
    • grammar: E -> (E) | -E | id | E+E
    • a derivation: E => -E => -(E) => -(id)
    • leftmost derivation: E => E+E => id+E => id+id
    • rightmost derivation: E => E+E => E+id => id+id

Grammar issues

  • ambiguity
  • left recursion
    • hidden left recursion: even tomita’s GLR can’t handle

Grammar transformations

  • eliminating ambiguity
  • eliminating left recursion
  • left factoring: gets you a grammar suitable for top-down parsing

Top-down parsing

  • recursive descent: requires backtracking
  • LL: predictive, with k tokens of lookahead
    • tables of FIRST and FOLLOW sets
  • top-down parsing during left-to-right scan constructs a leftmost derivation

Bottom-up parsing

  • shift-reduce parsers
  • recursive-ascent: implementation technique for bottom-up (LR) parsers
    • practical benefit: flexible in that it allows for parameterized recursion (eg for error recovery)
    • also, popular bc simple
  • LR: largest class of grammars for which shift-reduce parser can be built
    • shift-reduce conflicts; ambiguous grammar cannot be LR
    • LR parsing is attractive because
      • it can recognize almost all CFGs
      • LR parser can detect a syntactic error ASAP on left-to-right scan
      • most general nonbacktracking shift-reduce parsing method known, but efficient
      • main drawback is difficulty of implementation
  • bottom-up parsing during left-to-right scan constructs a rightmost derivation in reverse
  • LALR: lookahead LR
    • more compact than LR, like SLR
    • more general than SLR
  • Excerpts from http://www.devincook.com/GOLDParser/articles/glossary.htm
    • LALR Parsing, or “Lookahead LR parsing”, is a variant of LR Parsing that combines different “configurations” to limit the size of the parse tables. As a result, the algorithm is slightly less powerful than the LR Parsing. Grammars that can be parsed by a LR parser, might be found to be “ambiguous” by LALR. However, this is very rarely the case and real-world examples are few.The number of states eliminated by LALR are sometimes huge. The C programming language, for instance, has over 10,000 LR states. LALR drops this number to slightly more than 200.
    • LL Parsing, or Left-to-right Left-derivative parsing, refers to a class of parsers that analyze text using a top-down algorithm.

      LL Parsers have the advantage of being very simple in the design - at least conceptually. Unlike LR Parsers, the system does not need to generate tables ahead of time. Instead, only a set of rules and lookahead data is needed.

      However, LL Parsers are not as powerful as LR parsers. Rules cannot have left-recursion since it causes problems with the recursive decent algorithm. In addition, LL Parsers are known as memory intensive and slow. Both of these are due to the “search” nature of the algorithm. Optimization is possible, but the complexity of the LL Parser runtime engine increases dramatically.
    • LR Parsing, or Left-to-right Right-derivative parsing, uses tables to determine when a rule is complete and when additional tokens are needed to be read from the source. Unlike LL Parsing, the LR Parser does very little “thinking” at runtime. All decisions are based on the content of the parse tables. As a result, LR Parsing is faster and more memory efficient than LL Parsing.

      The construction of these tables where all the “thinking” takes place. LR parser generators, such as YACC and GOLD, construct these tables by analyzing the grammar and determining all the possible “states” the system can be in when parsing.

      Each state represents a point in the parse process where a number of tokens have been read from the source and rules are in different states of completion. Each production in a state of completion is called a “configuration” and each state is really a configuration set.

      LR parse tables can be huge, and, as a result, often a variant of LR Parsing is used. For more information, please see theLALR Algorithm page.
    • A Reduce-Reduce Conflict is a caused when a grammar allows two or more different rules to be reduced at the same time, for the same token. When this happens, the grammar becomes ambiguous since a program can be interpreted more than one way.

      For instance, assume you have the following grammar: ::= |

      ::= Identifier ::= Identifier

      When the system reads an identifier, it cannot determine if it has completed or.

      When a LALR parser generator is analyzing a grammar and constructing the parse tables, these conflicts are located immediately.
    • The Shift-Reduce Conflict is the most common type of conflict found in grammars. It is caused when the grammar allows a rule to be reduced for particular token, but, at the same time, allowing another rule to be shifted for that same token.

      As a result, the grammar is ambiguous since a program can be interpreted more than one way.

      This error is often caused by recursive grammar definitions where the system cannot determine when one rule is complete and another is just started. The Builder documentation contains an example of the common if-then-else grammar problem and how to fit it.
    • GLR aka generalized LR aka parallel parser
      • handles nondeterministic and ambiguous LR grammars by returning all possible interpretations
      • tomita’s parsing algo is best
    • GLL aka generalized LL (2009)
      • RD-like but uses tomita-style algo

compiler optimizations

  • http://lambda-the-ultimate.org/node/3674
  • const objects are treated as constants (vs const refs where constness can be cast away)
  • macros/inline unnecessary; compiler tries to inline anyway
  • reduce conditionals; change to arithmetic
  • in-CPU call stack makes regular function calls fast too
  • common subexpression elimination
  • invariant hoisting moves loop-invariant checks out of the loop
  • inlining lets compiler do variable value range analysis
  • auto vectorization: words at a time or 128-bit vector at a time
  • bit shifts and leas
  • tail recursion (gcc)
  • dead code elimination
  • things that don’t matter: pre-/post-inc, data structures/algos (vs. simple cache-friendly arrays)
  • memory hierarchy is the only important optimization today
    • mult instead of shift: 5 cycles
    • branch mispredict: 10 cycles
    • cache miss to main memory: 250 cycles

lexing/regular expression matching

  • 2 approaches: backtracking and parallel
  • backtracking: depth-first search; used in Perl, Python, etc
    • sucks for a?{n}a{n} since it may need to redo the a?{n} many times
    • may take exp-time
    • perl uses memoization to avoid exp’l blow-up, but requires O(nm) space, where n is len of regex and m is len of str
  • parallel aka Thompson NFA: breadth-tracking search; track all possible states simultaneously
    • when building DFAs out of NFAs, this is the approach used
    • sed, awk, grep, tcl, lexers build DFAs
    • sucks (exp-time) for wide-branching cases? can’t think of them right now
  • backreferences trivial in backtracking, but make parallel exp-time
  • http://swtch.com/~rsc/regexp/regexp1.html
  • lexer hack: lexer doesn’t have enough info to determine if symbol is var name or type name, eg (int) *x vs (a) * b; must feed it symbol table info

tracing JITs

  • tracing vs per-method JIT
  • type-specializing vs non-type-specializing JIT
    • not yet any real per-method type-specializing JITs
      • hard to fully specialize bc these only check arg types, making input-specialized versions, but not dealing w internal types (eg getting obj props)
  • tracemonkey: only loops that iterate at least 3x
    • type specialization is biggest fruit; interp is less
    • branch specialization: cap at 32 traces to avoid using too much mem
    • nanojit: the actual compiler
    • 22x on numerical; 2-5x on object manip
    • stages: monitoring (loop counting), recording, compiling, native execution

inline caching

  • speed up dynamic method invocation by caching previous method binding directly at call site
  • type check (‘guard’) still required, usu in callee’s preamble
  • monomorphic: one type cached; else, revert to slow path
  • polymorphic: which of a limited set of methods to invoke; multiple bindings can be recorded at same call site
  • megamorphic: if see too many types; same as uninitialized but never mono-/polymorphic again