PEG-like recursive descent parser for Lua.
PEG stands for "Parsing Expression Grammar" and is one of the simplest parsing
grammars since it maps very closely to recursive descent -- a
hand-rolled parser that uses recursion. This library is a pure-Lua
recursive-descent parser which exports types and functions to create a PEG-like
DSL that is still just recursive descent under the hood.
The benefits of this library are:
- vs hand-rolled recursive descent: more concise and readable, while also
providing automatic error messages (i.e. symbol locations in your stack) and
debugging.
- vs PEG: nearly as concise while retaining the ability to hand-roll any
logic needed.
If you are completely new to parsers and especially if you want to write your
own language with an AST then I cannot recommend
http://www.craftinginterpreters.com enough. It might be a better place to
start than this library.
A parser is a way to convert text into structured node objects so that the text
can be compiled or annotated by a program. For example you might want to convert
some source code like:
x = 1 + 2
Into something like:
{'x', '=', {'1', '+', '2', kind='op'}, kind='assign'}
A recursive descent parser does so via hand-rolled functions which typically
recurse into each other. Each function attempts to parse from the current
parser position using its spec (which may be composed of calls to other parsing
functions) and returns either the successfully parsed node or nil (or
returns/throws an error if it finds a syntax error).
PEGL is a Lua library for writing the common-cases of a recursive descent parser
in a (pure Lua) syntax similar to PEG, while still being able to easily fallback
to hand-rolled recursive descent when needed.
Most traditional PEG parsers (as well as other parsers) struggle with
complicated syntax such as Lua's [===[raw string syntax]===], python's
whitespace-denoted syntax, or C's lookahead requirements (e.g. `(U2)*c`
could be either a cast or a multiplication). Recursive
descent can solve a lot of these problems relatively easily and
performantly. However, recursive descent parsers can be very verbose and
sometimes difficult to understand. Below is a comparison of the above example
in PEG, PEGL, and a "traditional" (though not very good) recursive descent
implementation.
Examples
PEG: most concise but harder to fallback to hand-rolled recursive descent
grammar = [[
num <- '%d+'
name <- '%w+'
setVar <- name '=' num
expr <- setVar / ... other valid expressions
]]
p:parse(grammar)
PEGL: very concise and easy to fallback to hand-rolled recursive descent.
Things like `kind` and `name` make debug printing easier. Simply name your
specs, put them together into a root spec, and parse:
num = Pat{'%d+', kind='num'} -- kind='num' sets the node kind
name = Pat{'%w+', kind='name'}
-- Note: UNPIN and PIN are used for when errors should be raised
setVar = {UNPIN, name, '=', PIN, num, kind='setVar'}
expr = Or{setVar, ... other valid expressions, name='expr'}
p:parse(expr)
Note: the difference between `name` and `kind` is that a node
with `kind` set will never be unpacked. For instance,
to define a block of items you might use `name`, since you want
a single item to be represented as just `single` instead of
`block{single}`.
Recursive Descent: not very concise, harder to debug.
-- Note: p=parser, an object which tracks the current position
-- in its `state`
function parseNum(p)
  local num = p:consume('%d+') -- return result and advance position
  if num then -- found
    return {num, kind='num'}
  end
end
function parseSetVar(p)
  local state = p:state()
  local name = p:consume('%w+')
  if not name then return end
  local eq, num = p:consume('='), parseNum(p)
  if not (eq and num) then
    -- didn't match, reset state and return
    p:setState(state)
    return
  end
  return {{name, kind='name'}, eq, num, kind='setVar'}
end
function expression(p)
  local expr = parseSetVar(p)
  if expr then return expr end
  -- ... other possible expressions
end
expression(p)
See pegl.Seq for the basic API of parsing specs.
pegl: peg-like lua parser
Types: Token Config Parser Pat Key Or Many Seq Not common FmtPegl
Functions
- fn encodeSpan(l1, c1, l2, c2)
- fn decodeSpan(s)
- fn firstToken(list) -> t, listWithToken
- fn lastToken(list) -> t, listWithToken
- fn nodeSpan(t)
- fn fmtSpec(s, f)
- fn specToStr(s, fmt)
- fn specTy(name)
Create a parser spec record. These have the fields kind and name
and must define the parse method.
- fn Maybe(spec)
- fn isEmpty(M.EMPTY, t)
- fn notEmpty(M.EMPTY, t)
- fn skipWs1(p)
- fn skipEmpty(p)
- fn skipEmptyMinimal(p)
- fn defaultTokenizer(p)
- fn parseSeq(p, seq)
- fn parse(dat, spec, config) -> list[Node]
Parse a spec, returning the nodes or throwing a syntax error.
config is used to define settings of the parser such as how to skip
comments and whether to use debug mode.
- fn assertParse(t) -> result, node, parser
Parse the dat with the spec, asserting the resulting "string tokens"
are identical to expect.
The input is a table of the form:
{dat, spec, expect, dbg=nil, config=default} --> nil
- fn assertParseError(t)
- fn isKeyword(t)
Fields:
- kind
optional, used for debugging
- style
computed by acsyntax
Methods
- fn:span(dec) -> l, c, l2, c2
- fn encode(T, p, l, c, l2, c2, kind, style)
- fn:decode(self[1])
- fn:lte(r)
compare two tokens, commonly used in sorting algorithms.
The config spec defines custom behavior when parsing. Its attributes
can be set to change how the parser skips empty space (whitespace) and handles comments.
Fields:
- skipEmpty =pegl.skipEmpty
default=skip whitespace
- must be a function that accepts the `Parser`
and advances its `l` and `c` past any empty (white) space. It must also set
`p.line` appropriately when `l` is moved.
- The return value is ignored.
- The default is to skip all whitespace (spaces, newlines, tabs, etc). This
should work for _most_ languages but fails for languages like python.
- Recommendation: If your language has only a few whitespace-aware nodes (i.e.
strings) then hand-roll those as recursive-descent functions and leave
this function alone.
-
function defining the behavior when a comment is found
- tokenizer =pegl.defaultTokenizer
Requires:
- must return one token. The default is to return a single punctuation character
or a whole word (characters matching `[_%w]`)
- Objects like Key can use the single punctuation characters in a Trie-like
performant data structure.
- dbg
if true, prints out huge amounts of debug information while parsing.
- lenient
if set, syntax errors do not cause failure.
Instead, all errors act as if the current block missed but was UNPIN.
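To make the contracts above concrete, here is a sketch of two hypothetical Config overrides. The field accesses on `p` (`p.line`, `p.c`) and the `consume` calls follow this document's descriptions; pegl's real internals may differ.

```lua
-- Hypothetical skipEmpty: skips spaces and tabs but NOT newlines,
-- e.g. for a whitespace-sensitive language. It never moves `l`, so
-- p.line does not need updating; the return value is ignored.
local function skipEmptyNoNewline(p)
  local _, e = p.line:find('^[ \t]+', p.c)
  if e then p.c = e + 1 end
end

-- Hypothetical tokenizer: returns one token per call -- a whole word
-- or a single punctuation character (mirroring the documented default).
local function wordOrPunct(p)
  return p:consume('[_%w]+') or p:consume('%p')
end
```

These would be passed as the `skipEmpty` / `tokenizer` attributes of the config given to `parse`.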
Methods
The parser tracks the current position of parsing in `dat` and has several
convenience methods for hand-rolling your own recursive descent functions.
Note: the location is **line/col based** (not position based) because it
is designed to work with an application that stores the strings as lines
(a text editor).
Fields:
Methods
- fn:assertNode(expect, node, config) -> strTokens
- fn new(T, dat, config)
- fn:parse(spec) -> node
the main entry point and used recursively.
Parses the spec, returning the node, which is a table of nodes that are
eventually tokens.
- fn:consume(pat, plain) -> Token
consume the pattern, advancing the column if found
- fn:peek(pat)
identical to `consume` except it does not advance the column
- fn:sub(t) -> string
- fn:incLine()
- fn:isEof() -> isAtEndOfFile
- fn:skipEmpty()
- fn:state()
get the current parser state {l, c, line}
- fn:setState(st)
restore a previously saved parser state {l, c, line}
- fn:toStrTokens(n)
- fn:makeStrTokens(t) -> t
recursively mutate table converting all Tokens to strings
- fn:tokenStr(t) -> string
- fn:trimTokenStart(list)
- fn:trimTokenLast(list, trimNl)
- fn:checkPin(pin, expect)
- fn:error(msg)
- fn:parseAssert(spec)
- fn:dbgEnter(spec)
- fn:dbgLeave(n)
- fn:dbgMatched(spec)
- fn:dbgMissed(spec, note)
- fn:dbgUnpack(spec, t)
- fn:dbg(fmtstr, ...)
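As a sketch of how these methods combine in a hand-rolled spec function (the hex-literal rule and all names here are illustrative, not part of pegl):

```lua
-- Hand-rolled recursive descent function using the documented Parser
-- methods. Returns a node on match, nil on a (non-error) miss.
local function parseHex(p)
  local st = p:state()              -- save {l, c, line} for backtracking
  local pre = p:consume('0x', true) -- plain (non-pattern) match
  if not pre then return nil end    -- miss: nothing was consumed
  local digits = p:consume('%x+')
  if not digits then
    p:setState(st)                  -- backtrack past the consumed '0x'
    return nil
  end
  return {pre, digits, kind='hex'}
end
```

A function like this can be dropped directly into a sequence spec, since functions in a Seq receive the parser as their only argument.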
Pat{'%w+', kind='word'} will create a Token with the span matching the
`%w+` pattern and the kind `word` when matched.
Fields:
Methods
The table given to Key forms a Trie which is extremely performant. Key depends
strongly on the tokenizer passed to Config.
Example: Key{{'myKeyword', ['+']={['+']=true}}, kind='kw'} will match
tokens "myKeyword" and "+" followed by "+" (but not a lone "+").
To also match a lone "+" use ['+']={true, ['+']=true}.
Note: The `Key` constructor converts all list items into
value=true, so {'a', 'b'} is converted to {a=true, b=true}.
Fields:
Methods
choose one spec
Example: Or{'keyword', OtherSpec, Empty} will match one of the three
specs given. Note that Empty will always match (and return
pegl.EMPTY). Without Empty this could return nil, causing a
parent Or to match a different spec.
Note: Maybe(spec) literally expands to Or{spec, Empty}
Prefer Key if selecting among multiple string or token paths, as Or is
not performant (O(N) vs Key's O(1)).
Fields:
Methods
match a Spec multiple times.
Example: Many{'keyword', OtherSpec, min=1, kind='myMany'} will match the
given sequence one or more times (min=0 times by default). The parse
result is a list with kind='myMany'.
Fields:
Methods
A Sequence of parsers. Note that Parser treats Seq{'a'} and {'a'}
identically (you can use plain tables instead).
Raw strings are treated as keywords (they are parsed literally and have
their kind set to themselves). Functions are called with the parser as the
only argument and must return the node/s they parsed or nil for a
non-error miss.
A sequence is a list of either other parsers
(Seq, {}, "keyword", fn(p), Not, Or, etc.) and/or plain strings which are
treated as keywords and will have kind set to themselves when parsed.
If the first spec matches but a later one doesn't, an error will be thrown
(instead of nil being returned) unless UNPIN is used. See the PIN/UNPIN
docs for details.
PIN/UNPIN: Syntax Error Reporting
PEGL implements syntax error detection ONLY in Sequence specs (table specs, i.e.
`{...}`) by throwing an error if a "pinned" spec is missing.
- By default, no error is raised if the first spec is missing. After the
first spec, pin=true and any missing spec throws an error with context.
- PIN can be used to force pin=true until UNPIN is (optionally)
specified.
- UNPIN can be used to force pin=false until PIN is (optionally)
specified.
- PIN / UNPIN only affect the current sequence (they do not pin
sub-sequences).
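Applied to the `setVar` spec from the Examples section, these rules play out roughly as follows (a sketch of the documented behavior, not verified output):

```lua
-- UNPIN lets the leading `name` (and the '=') miss silently, so an
-- enclosing Or can try another expression; PIN restores error-raising
-- for the rest of the sequence.
setVar = {UNPIN, name, '=', PIN, num, kind='setVar'}
-- parsing "x = 1" -> a node with kind='setVar'
-- parsing "foo()" -> nil ('=' missing while unpinned: no error thrown)
-- parsing "x = @" -> syntax error (`num` is pinned and missing)
```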
Fields:
Methods
Fields:
Methods
Fields:
- kinds =pegl.fmtKinds
kind -> fmtFn
Module for applying asciicolor syntax highlighting to parsed pegl.
Types: Highlighter tokenize
Usage:
hl = require'pegl.lua'.highlighter
hl:highlight(lf, fgFile, bgFile)
The highlighter object determines how to parse and
highlight a text file's syntax.
Fields:
- config
the pegl root configuration.
Typically this should have lenient=true so that parsing
continues past syntax errors.
- spec
pegl root spec to parse, aka the grammar tree.
This should typically be a "lenient" spec with a fallback
which can parse any symbol.
- style
a table of name|kind -> 'style' (asciicolor style)
(name takes precedence).
The value can be either a style literal or a fn(node) -> style.
The Highlighter performs this lookup for each node's name/kind; if no
style is found, the lookup recurses into the node's children.
- builtin
list of builtin names.
- styleColor
table of style -> asciicolor fg/bg
- dir
base dir where highlighting is stored
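A `style` table per the description above might be sketched like this (the style names 'number' and 'var' are assumptions, not documented pegl values):

```lua
-- Hypothetical Highlighter.style table: name|kind -> asciicolor style,
-- where a value is either a style literal or fn(node) -> style.
style = {
  num  = 'number',                        -- literal style, looked up by kind
  name = function(node) return 'var' end, -- computed style (fn gets the node)
}
```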
Methods
Usage:
tokens = tokenize(highlighter, lineFile)
A stateful object that, when called (constructed), produces a list of tokens.
Fields:
- hl
highlighter config
- p
the parser
- root
the root parsed node
- kind
Lua syntax in PEGL
http://parrot.github.io/parrot-docs0/0.4.7/html/languages/lua/doc/lua51.bnf.html
was used as a reference.
Functions
- (t, add) -> t
return t with the key/vals of add inserted