PEG-like recursive descent Parser for Lua.
PEG stands for "Parsing Expression Grammar" and is one of the simplest parsing
grammars since it maps very closely to recursive descent -- a
hand-rolled parser that uses recursion. This library is a pure-Lua
recursive-descent parser which exports types and functions to create a PEG-like
DSL that is still just recursive descent under the hood.
The benefits of this library are:
- vs hand-rolled recursive descent: it is more concise and readable, and
provides automatic error messages (i.e. symbol locations in your stack) and
debugging.
- vs PEG: it is nearly as concise while retaining the ability to hand-roll any
logic needed.
If you are completely new to parsers, and especially if you want to write your
own language with an AST, then I cannot recommend
http://www.craftinginterpreters.com enough. It might be a better place to
start than this library.
A parser is a way to convert text into structured node objects so that the text
can be compiled or annotated by a program. For example you might want to convert
some source code like:
x = 1 + 2
Into something like:
{'x', '=', {'1', '+', '2', kind='op'}, kind='assign'}
A recursive descent parser does so via hand-rolled functions which typically
recurse into each other. Each function attempts to parse from the current
parser position using its spec (which may be composed of calls to other parsing
functions) and returns either the successfully parsed node or nil (or
returns/throws an error if it finds a syntax error).
PEGL is a Lua library for writing the common cases of a recursive descent parser
in a (pure Lua) syntax similar to PEG, while still being able to easily fall back
to hand-rolled recursive descent when needed.
Most traditional PEG parsers (as well as other parsers) struggle with
complicated syntax such as Lua's
[===[raw string syntax]===], Python's
whitespace-denoted syntax or C's lookahead requirements (`(U2)*c**h`
parses differently depending on whether `U2` names a type) --
recursive descent can solve a lot of these problems relatively easily and
performantly. However, recursive descent parsers can be very verbose and
sometimes difficult to understand. Below is a comparison of the above example
in PEG, PEGL and a "traditional" (though not very good) recursive descent
implementation.
Examples
PEG: most concise but harder to fall back to hand-rolled recursive descent
grammar = [[
  num    <- '%d+'
  name   <- '%w+'
  setVar <- name '=' num
  expr   <- setVar / ... other valid expressions
]]
p:parse(grammar)
PEGL: very concise and easy to fall back to hand-rolled recursive descent.
Things like `kind` and `name` make debug printing easier. Simply name your
specs, put them together into a root spec and parse:
num = Pat{'%d+', kind='num'}   -- kind='num' sets the node kind
name = Pat{'%w+', kind='name'}
-- Note: UNPIN and PIN control when errors should be raised
setVar = {UNPIN, name, '=', PIN, num, kind='setVar'}
expr = Or{setVar, --[[ other valid expressions ]] name='expr'}
p:parse(expr)
["Note: the difference between [$name] and [$kind] is that a node
with [$kind] set will [*never be unpacked]. For instance,
to define a [$block; of; items;] you might use [$name] since you want
[$single;] to be represented as just [$single] instead of
[$block { single }].
]
[*Recursive Descent]: not very concise, harder to debug. [{$$ lang=lua}
-- Note: p=parser, an object which tracks the current position
-- in its `state`
function parseNum(p)
  local num = p:consume('%d+') -- return result and advance position
  if num then -- found
    return {num, kind='num'}
  end
end
function parseSetVar(p)
  local state = p:state()
  local name = p:consume('%w+')
  if not name then return end
  local eq, num = p:consume('='), parseNum(p)
  if not (eq and num) then
    -- didn't match, reset state and return
    p:setState(state)
    return
  end
  return {{name, kind='name'}, eq, num, kind='setVar'}
end
function expression(p)
  local expr = parseSetVar(p)
  if expr then return expr end
  -- ... other possible expressions
end
expression(p)
See #pegl.Seq for the basic API of parsing specs.
pegl: peg-like lua parser
Types: Token Config Parser Pat Key Or Many Seq Not common FmtPegl
Functions
- fn encodeSpan(l1, c1, l2, c2)
- fn decodeSpan(s)
- fn firstToken(list) -> t, listWithToken
- fn lastToken(list) -> t, listWithToken
- fn nodeSpan(t)
- fn fmtSpec(s, f)
- fn specToStr(s, fmt)
- fn specTy(name)
Create a parser spec record. These have the fields kind and name
and must define the parse method.
- fn Maybe(spec) -> M.Or{spec, M.Empty}
- fn isEmpty(t) -> mty.eq(M.EMPTY, t)
- fn notEmpty(t) -> not mty.eq(M.EMPTY, t)
- fn skipWs1(p)
- fn skipEmpty(p)
- fn skipEmptyMinimal(p)
- fn defaultTokenizer(p)
- fn parseSeq(p, seq)
- fn parse(dat, spec, config) -> list[Node]
Parse a spec, returning the nodes or throwing a syntax error.
config is used to define settings of the parser such as how to skip
comments and whether to use debug mode.
- fn assertParse(t) -> result, node, parser
Parse the dat with the spec, asserting the resulting "string tokens"
are identical to expect.
The input is a table of the form: {dat, spec, expect, dbg=nil, config=default} --> nil
- fn assertParseError(t)
- fn isKeyword(t) -> #t == 1 and t.kind == t[1]
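Putting the two entry points together, usage might look like the following sketch. The require path, the exact shape of expect, and the table-call form are assumptions based on the signatures above, not verbatim library usage.

```lua
-- Hedged sketch: module path and expect format are assumptions.
local pegl = require'pegl'

local num = pegl.Pat{'%d+', kind='num'}

-- parse: returns the nodes or throws a syntax error
local node = pegl.parse('42', num)

-- assertParse: input is {dat, spec, expect, dbg=nil, config=default};
-- asserts the resulting "string tokens" equal expect
pegl.assertParse{dat='42', spec=num, expect={'42', kind='num'}}
```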
Fields:
- .kind: optional, used for debugging
Methods
- fn span(t, dec) -> M.decodeSpan(t[1])
- fn encode(ty_, p, l, c, l2, c2, kind)
- fn decode(t, dat) -> lines.sub(dat, M.decodeSpan(t[1]))
The config spec defines custom behavior when parsing. Its attributes
can be set to change how the parser skips empty (whitespace) and handles comments.
Fields:
- .skipEmpty = pegl.skipEmpty (default: skip whitespace)
  - must be a function that accepts the `Parser`
    and advances its `l` and `c` past any empty (white) space. It must also set
    `p.line` appropriately when `l` is moved.
  - The return value is ignored.
  - The default is to skip all whitespace (spaces, newlines, tabs, etc). This
    should work for _most_ languages but fails for languages like Python.
  - Recommendation: if your language has only a few whitespace-aware nodes (i.e.
    strings) then hand-roll those as recursive-descent functions and leave
    this function alone.
- fn(p) -> Token for found comment
- .tokenizer = pegl.defaultTokenizer. Requires:
  - must return one token. The default is to return a single punctuation character
    or a whole word (_%w)
  - Objects like Key can use the single punctuation characters in a Trie-like
    performant data structure.
- .dbg: if true, prints out huge amounts of debug information while parsing.
- .lenient: if set, syntax errors do not cause failure.
  Instead, all errors act as if the current block missed but was UNPIN.
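As a concrete illustration of the .skipEmpty contract, here is a sketch of a horizontal-whitespace-only skipper (the kind a Python-like language might want). It assumes only the `p.c` and `p.line` fields described above; the real Parser internals may differ.

```lua
-- Hedged sketch: skips spaces/tabs but never newlines, so that
-- line structure stays visible to the grammar.
local function skipHorizontalWs(p)
  local line = p.line
  if not line then return end          -- nothing to skip at EOF
  -- advance the column past spaces and tabs only; p.l (the line
  -- number) is never moved, so p.line stays correct
  local _, e = line:find('^[ \t]*', p.c)
  if e then p.c = e + 1 end
end

-- used as: config.skipEmpty = skipHorizontalWs
```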
Methods
The parser tracks the current position of parsing in `dat` and has several
convenience methods for hand-rolling your own recursive descent functions.
Note: the location is **line/col based** (not position based) because it
is designed to work with an application that stores the strings as lines
(a text editor).
Fields:
Methods
- fn assertNode(p, expect, node, config)
- fn new(T, dat, config)
- fn parse(p, spec) -> node
the main entry point and used recursively.
Parses the spec, returning the node, which is a table of nodes that are
eventually tokens.
- fn consume(p, pat, plain) -> Token
consume the pattern, advancing the column if found
- fn peek(p, pat)
identical to `consume` except it does not advance the column
- fn sub(p, t)
- fn incLine(p)
- fn isEof(p) -> isAtEndOfFile
- fn skipEmpty(p)
- fn state(p) -> {l=p.l, c=p.c, line=p.line}
get the current parser state {l, c, line}
- fn setState(p, st)
restore a previously captured parser state {l, c, line}
- fn toStrTokens(p, n)
- fn makeStrTokens(p, t) -> t
recursively mutate table converting all Tokens to strings
- fn tokenStr(p, t) -> string
- fn trimTokenStart(p, list)
- fn trimTokenLast(p, list, trimNl)
- fn checkPin(p, pin, expect)
- fn error(p, msg)
- fn parseAssert(p, spec)
- fn dbgEnter(p, spec)
- fn dbgLeave(p, n)
- fn dbgMatched(p, spec)
- fn dbgMissed(p, spec, note)
- fn dbgUnpack(p, spec, t)
- fn dbg(p, fmtstr, ...)
Pat{'%w+', kind='word'} will create a Token with the span matching the
`%w+` pattern and the kind of `word` when matched.
Fields:
Methods
- fn set(name)
- fn get(name)
- fn extend(r, l) -> r
- fn parse(self, p)
The table given to Key forms a Trie, which is extremely performant. Key depends
strongly on the tokenizer passed to Config.
Example:
Key{{'myKeyword', ['+']={['+']=true}}, kind='kw'}