PEG like recursive descent Parser for Lua.

PEG stands for "Parsing Expression Grammar" and is one of the simplest parsing grammars since it maps very-closesly to recursive descent -- which is a hand-rolled parser that uses recursion. This library is a pure-LUA recursive-descent parser which exports types and functions to create a PEG-like DSL that is still just recursive descent under the hood.

The benefits of this library are:

Resources

If you are completely new to parsers and especially if you want to write your own language with an AST then I cannot recommend http://www.craftinginterpreters.com enough. It might be a better place to start than this library.

Introduction

A parser is a way to convert text into structured node objects so that the text can be compiled or annotated by a program. For example you might want to convert some source code like:

x = 1 + 2

Into something like:

{'x', '=', {'1', '+', '2', kind='op'}, kind='assign'}

A recursive descent parser does so via hand-rolled functions which typically recurse into eachother. Each function attempts to parse from the current parser position using its spec (which may be composed of calling other parsing functions) and returns either the successfully parsed node or nil (or returns/throws an error if it finds a syntax error).

PEGL is a Lua library for writing the common-cases of a recursive descent parser in a (pure Lua) syntax similar to PEG, while still being able to easily fallback to hand-rolled recursive descent when needed.

Most traditional PEG parsers (as well as other parsers) struggle with complicated syntax such as Lua's [===[raw string syntax]===], python's whitespace denoted syntax or C's lookahead requirements ((U2)*c**h -- recursive descent can solve a lot of these problems relatively easily and performantly. However, recursive descent parsers can be very verbose and sometimes difficult to understand. Below is a comparison of the above example in both PEG, PEGL and a "traditional" (though not very good) recursive descent implementation.

Examples

PEG: most concise but harder to fallback to hand-rolled recursive descent

grammar = [[
num    <- '%d'
name   <- '%w'
setVar <- num '=' name
expr   <- setVar / ... other valid expressions
]]
p:parse(grammar)

PEGL: very concise and easy to fallback to hand-rolled recursive descent. Things like `kind` and `name` make debug printing easier. Simply name your specs, putting them together into a root spec and parsing:

num    = Pat{'%d+', kind='num'} -- kind=num sets the node name
name   = Pat{'%w+', kind='name'}
-- Note: UNPIN and PIN are used for when errors should be raised
setVar = {UNPIN, name, '=', PIN, num, kind='setVar'}
expr   = Or{setVar, ... other valid expressions, name='expr'}
p:parse(expr)
]

["Note: the difference between name and kind is that a node
  with [$kind] set will [*never be unpacked]. For instance,
  to define a [$block; of; items;] you might use [$name] since you want
  [$single;] to not be represented as just [$single] instead of
  [$block { single }].
]

[*Recursive Descent]: not very concise, harder to debug. [{$$ lang=lua}
-- Note: p=parser, an object which tracks the current position
-- in it's `state`

function parseNum(p)
  local num = p:consume('%d+') -- return result and advance position
  if num then -- found
    return {num, kind='num'} end
  end
end

function parseSetVar(p)
  local state = p.state()
  local name = p:consume('%w+')
  if not name then return end
  local eq, num = p:consume('='), parseNum(p)
  if not (eq and num) then
    -- didn't match, reset state and return
    p.setState(state)
    return
  end
  return {{name, kind='name'}, eq, num, kind='setVar'}
end

function expression(p)
  local expr = parseSetVar(p)
  if expr then return expr end
  -- ... other possible expressions
end

expression(p)

See #pegl.Seq for the basic API of parsing specs.

Mod pegl

pegl: peg-like lua parser

Types: Token Config Parser Pat Key Or Many Seq Not common FmtPegl

Functions

Record Token

Fields: Methods

Record Config

The config spec defines custom behavior when parsing. It's attributes can be set to change how the parser skips empty (whitespace) and handles comments.

Fields:

Methods

Record Parser

The parser tracks the current position of parsing in `dat` and has several convienience methods for hand-rolling your own recursive descent functions.

Note: the location is **line/col based** (not position based) because it is designed to work with an application that stores the strings as lines (a text editor)

Fields:

Methods

Record Pat

Pat{'%w+', kind='word'} will create a Token with the span matching the %w+ pattern and the kind of word when matched.

Fields:

Methods

Record Key

The table given to Key forms a Trie which is extremely performant. Key depends strongly on the tokenizer passed to Config.

Example: Key{{'myKeword', ['+']={'+'=true}}, kind='kw'} will match tokens "myKeyword" and "+" followed by "+" (but not "+" not followed by "+").

To also match "+" use ['+']={true, '+'=true}

Note: The `Key` constructor converts all list items into value=true, so {'a', 'b'} is converted to {a=true, b=true}

Fields:

Methods

Record Or

choose one spec

Example: Or{'keyword', OtherSpec, Empty} will match one of the three specs given. Note that Empty will always match (and return pegl.EMPTY). Without Empty this could return nil, causing a parent Or to match a different spec.

Note: Maybe(spec) literally expands to Or{spec, Empty}

Prefer Key if selecting among multiple string or token paths as Or is not performant (O(N) vs Key's O(1))

Fields:

Methods

Record Many

match a Spec multiple times Example: Many{'keyword', OtherSpec, min=1, kind='myMany'} will match the given sequence one or more times (min=0 times by default). The parse result is a list with kind='myMany'.

Fields:

Methods

Record Seq

A Sequence of parsers. Note that Parser treats Seq{'a'} and {'a'} identically (you can use plain tables instead).

Raw strings are treated as keywords (the are parsed literally and have key set to themselves). Functions are called with the parser as the only argument and must return the node/s they parsed or nil for a non-error match.

A sequence is a list of either other parsers (Seq, {}, "keyword", fn(p), Not, Or, etc} and/or plain strings which are treated as keywords and will have kind be set to themselves when parsed.

If the first spec matches but a later one doesn't an error will be thrown (instead of nil returned) unless UNPIN is used. See the PIN/UNPIN docs for details.

PIN/UNPIN: Syntax Error Reporting

PEGL implements syntax error detection ONLY in Sequence specs (table specs i.e. `{...}`) by throwing an error if a "pinned" spec is missing.

Fields:

Methods

Record Not

Fields: Methods

Mod pegl.common

Record FmtPegl

Fields:

Mod pegl.acsyntax

Module for applying asciicolor syntax highlighting to parsed pegl.

Types: Highlighter tokenize

Record Highlighter

Usage:
hl = require'pegl.lua'.highlighter
hl:highlight(lf, fgFile, bgFile)

The highlighter object determines how to parse and highlight a text file's syntax.

Fields:

Methods

Record tokenize

Usage: tokens = tokenize(highlighter, lineFile)
A state that when called (constructed) is a list of tokens.

Fields:

Mod pegl.lua

Lua syntax in PEGL

http://parrot.github.io/parrot-docs0/0.4.7/html/languages/lua/doc/lua51.bnf.html

was used as a reference.

Functions