PEG-like recursive descent parser for Lua.
PEG stands for "Parsing Expression Grammar" and is one of the simplest parsing
grammars since it maps very closely to recursive descent -- a
hand-rolled parser that uses recursion. This library is a pure-Lua
recursive-descent parser which exports types and functions to create a PEG-like
DSL that is still just recursive descent under the hood.
The benefits of this library are:
- vs hand-rolled recursive descent: more concise and readable, while also
providing automatic error messages (i.e. symbol locations in your stack) and
debugging.
- vs PEG: nearly as concise while retaining the ability to hand-roll any
logic needed.
If you are completely new to parsers and especially if you want to write your
own language with an AST then I cannot recommend
http://www.craftinginterpreters.com enough. It might be a better place to
start than this library.
A parser is a way to convert text into structured node objects so that the text
can be compiled or annotated by a program. For example you might want to convert
some source code like:
x = 1 + 2
Into something like:
{'x', '=', {'1', '+', '2', kind='op'}, kind='assign'}
A recursive descent parser does so via hand-rolled functions which typically
recurse into each other. Each function attempts to parse from the current
parser position using its spec (which may be composed of calls to other parsing
functions) and returns either the successfully parsed node or nil (or
returns/throws an error if it finds a syntax error).
PEGL is a Lua library for writing the common-cases of a recursive descent parser
in a (pure Lua) syntax similar to PEG, while still being able to easily fallback
to hand-rolled recursive descent when needed.
Most traditional PEG parsers (as well as other parsers) struggle with
complicated syntax such as Lua's [===[raw string syntax]===], python's
whitespace-denoted syntax, or C's lookahead requirements (e.g. `(U2)*c`
could be either a cast or a multiplication). Recursive
descent can solve a lot of these problems relatively easily and
performantly. However, recursive descent parsers can be very verbose and
sometimes difficult to understand. Below is a comparison of the above example
in PEG, PEGL, and a "traditional" (though not very good) recursive descent
implementation.
Examples
PEG: most concise but harder to fallback to hand-rolled recursive descent
grammar = [[
num <- '%d+'
name <- '%w+'
setVar <- name '=' num
expr <- setVar / ... other valid expressions
]]
p:parse(grammar)
PEGL: very concise and easy to fallback to hand-rolled recursive descent.
Things like `kind` and `name` make debug printing easier. Simply name your
specs, put them together into a root spec, and parse:
num = Pat{'%d+', kind='num'} -- kind='num' sets the node kind
name = Pat{'%w+', kind='name'}
-- Note: UNPIN and PIN are used for when errors should be raised
setVar = {UNPIN, name, '=', PIN, num, kind='setVar'}
expr = Or{setVar, ... other valid expressions, name='expr'}
p:parse(expr)
Note: the difference between `name` and `kind` is that a node
with `kind` set will never be unpacked. For instance,
to define a block of items you might use `name`, since you want
a single item to be represented as just `single` instead of
`block{single}`.
Recursive Descent: not very concise, harder to debug.
-- Note: p=parser, an object which tracks the current position
-- in its `state`
function parseNum(p)
  local num = p:consume('%d+') -- return result and advance position
  if num then -- found
    return {num, kind='num'}
  end
end
function parseSetVar(p)
  local state = p:state()
  local name = p:consume('%w+')
  if not name then return end
  local eq, num = p:consume('='), parseNum(p)
  if not (eq and num) then
    -- didn't match, reset state and return
    p:setState(state)
    return
  end
  return {{name, kind='name'}, eq, num, kind='setVar'}
end
function expression(p)
  local expr = parseSetVar(p)
  if expr then return expr end
  -- ... other possible expressions
end
expression(p)
See pegl.Seq for the basic API of parsing specs.
pegl: peg-like lua parser
Types: Token Config Parser Pat Key Or Many Seq Not common FmtPegl
Functions
- fn encodeSpan(l1, c1, l2, c2)
- fn decodeSpan(s)
- fn firstToken(list) -> t, listWithToken
- fn lastToken(list) -> t, listWithToken
- fn nodeSpan(t)
- fn fmtSpec(s, f)
- fn specToStr(s, fmt)
- fn specTy(name)
Create a parser spec record. These have the fields kind and name
and must define the parse method.
- fn Maybe(spec)
- fn isEmpty(M.EMPTY, t)
- fn notEmpty(M.EMPTY, t)
- fn skipWs1(p)
- fn skipEmpty(p)
- fn skipEmptyMinimal(p)
- fn defaultTokenizer(p)
- fn parseSeq(p, seq)
- fn parse(dat, spec, config) -> list[Node]
Parse a spec, returning the nodes or throwing a syntax error.
config is used to define settings of the parser such as how to skip
comments and whether to use debug mode.
- fn assertParse(t) -> result, node, parser
Parse the dat with the spec, asserting the resulting "string tokens"
are identical to expect.
The input is a table of the form:
{dat, spec, expect, dbg=nil, config=default} --> nil
- fn assertParseError(t)
- fn isKeyword(t)
Fields:
- kind
optional, used for debugging
- style
computed by acsyntax
Methods
- fn:span(dec) -> l, c, l2, c2
- fn encode(T, p, l, c, l2, c2, kind, style)
- fn:decode(self[1])
- fn:lte(r)
compare two tokens, commonly used in sorting algorithms.
The config spec defines custom behavior when parsing. Its attributes
can be set to change how the parser skips empty space (whitespace) and handles comments.
Fields:
- skipEmpty =pegl.skipEmpty
default=skip whitespace
- must be a function that accepts the `Parser`
and advances its `l` and `c` past any empty (white) space. It must also set
`p.line` appropriately when `l` is moved.
- The return value is ignored.
- The default is to skip all whitespace (spaces, newlines, tabs, etc). This
should work for _most_ languages but fails for languages like python.
- Recommendation: If your language has only a few whitespace-aware nodes (i.e.
strings) then hand-roll those as recursive-descent functions and leave
this function alone.
-
function defining the behavior when a comment is found
- tokenizer =pegl.defaultTokenizer
Requires:
- must return one token. The default is to return a single punctuation character
or a whole word (characters matching `[_%w]`)
- Objects like Key can use the single punctuation characters in a Trie-like
performant data structure.
- dbg
if true, prints out huge amounts of debug information while parsing.
- lenient
if set, syntax errors do not cause failure.
Instead, all errors act as if the current block missed but was UNPIN.
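To make the contracts above concrete, here is a sketch of two hypothetical Config overrides. The field accesses on `p` (`p.line`, `p.c`) and the `consume` calls follow this document's descriptions; pegl's real internals may differ.

```lua
-- Hypothetical skipEmpty: skips spaces and tabs but NOT newlines,
-- e.g. for a whitespace-sensitive language. It never moves `l`, so
-- p.line does not need updating; the return value is ignored.
local function skipEmptyNoNewline(p)
  local _, e = p.line:find('^[ \t]+', p.c)
  if e then p.c = e + 1 end
end

-- Hypothetical tokenizer: returns one token per call -- a whole word
-- or a single punctuation character (mirroring the documented default).
local function wordOrPunct(p)
  return p:consume('[_%w]+') or p:consume('%p')
end
```

These would be passed as the `skipEmpty` / `tokenizer` attributes of the config given to `parse`.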
Methods
The parser tracks the current position of parsing in `dat` and has several
convenience methods for hand-rolling your own recursive descent functions.
Note: the location is **line/col based** (not position based) because it
is designed to work with an application that stores the strings as lines
(a text editor).
Fields:
Methods
- fn:assertNode(expect, node, config) -> strTokens
- fn new(T, dat, config)
- fn:parse(spec) -> node
the main entry point and used recursively.
Parses the spec, returning the node, which is a table of nodes that are
eventually tokens.
- fn:consume(pat, plain) -> Token
consume the pattern, advancing the column if found
- fn:peek(pat)
identical to `consume` except it does not advance the column
- fn:sub(t) -> string
- fn:incLine()
- fn:isEof() -> isAtEndOfFile
- fn:skipEmpty()
- fn:state()
get the current parser state {l, c, line}
- fn:setState(st)
restore a previously saved parser state {l, c, line}
- fn:toStrTokens(n)
- fn:makeStrTokens(t) -> t
recursively mutate table converting all Tokens to strings
- fn:tokenStr(t) -> string
- fn:trimTokenStart(list)
- fn:trimTokenLast(list, trimNl)
- fn:checkPin(pin, expect)
- fn:error(msg)
- fn:parseAssert(spec)
- fn:dbgEnter(spec)
- fn:dbgLeave(n)
- fn:dbgMatched(spec)
- fn:dbgMissed(spec, note)
- fn:dbgUnpack(spec, t)
- fn:dbg(fmtstr, ...)
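As a sketch of how these methods combine in a hand-rolled spec function (the hex-literal rule and all names here are illustrative, not part of pegl):

```lua
-- Hand-rolled recursive descent function using the documented Parser
-- methods. Returns a node on match, nil on a (non-error) miss.
local function parseHex(p)
  local st = p:state()              -- save {l, c, line} for backtracking
  local pre = p:consume('0x', true) -- plain (non-pattern) match
  if not pre then return nil end    -- miss: nothing was consumed
  local digits = p:consume('%x+')
  if not digits then
    p:setState(st)                  -- backtrack past the consumed '0x'
    return nil
  end
  return {pre, digits, kind='hex'}
end
```

A function like this can be dropped directly into a sequence spec, since functions in a Seq receive the parser as their only argument.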
Pat{'%w+', kind='word'} will create a Token with the span matching the
`%w+` pattern and the kind `word` when matched.
Fields:
Methods
The table given to Key forms a Trie which is extremely performant. Key depends
strongly on the tokenizer passed to Config.
Example: Key{{'myKeyword', ['+']={['+']=true}}, kind='kw'} will match
tokens "myKeyword" and "+" followed by "+" (but not a lone "+").
To also match a lone "+" use ['+']={true, ['+']=true}.
Note: The `Key` constructor converts all list items into
value=true, so {'a', 'b'} is converted to {a=true, b=true}.
Fields:
Methods
choose one spec
Example: Or{'keyword', OtherSpec, Empty} will match one of the three
specs given. Note that Empty will always match (and return
pegl.EMPTY). Without Empty this could return nil, causing a
parent Or to match a different spec.
Note: Maybe(spec) literally expands to Or{spec, Empty}
Prefer Key if selecting among multiple string or token paths, as Or is
not performant (O(N) vs Key's O(1)).
Fields:
Methods
match a Spec multiple times.
Example: Many{'keyword', OtherSpec, min=1, kind='myMany'} will match the
given sequence one or more times (min=0 times by default). The parse
result is a list with kind='myMany'.
Fields:
Methods
A Sequence of parsers. Note that Parser treats Seq{'a'} and {'a'}
identically (you can use plain tables instead).
Raw strings are treated as keywords (they are parsed literally and have
their kind set to themselves). Functions are called with the parser as the
only argument and must return the node/s they parsed or nil for a
non-error miss.
A sequence is a list of either other parsers
(Seq, {}, "keyword", fn(p), Not, Or, etc.) and/or plain strings which are
treated as keywords and will have kind set to themselves when parsed.
If the first spec matches but a later one doesn't, an error will be thrown
(instead of nil being returned) unless UNPIN is used. See the PIN/UNPIN
docs for details.
PIN/UNPIN: Syntax Error Reporting
PEGL implements syntax error detection ONLY in Sequence specs (table specs, i.e.
`{...}`) by throwing an error if a "pinned" spec is missing.
- By default, no error is raised if the first spec is missing. After the
first spec, pin=true and any missing spec throws an error with context.
- PIN can be used to force pin=true until UNPIN is (optionally)
specified.
- UNPIN can be used to force pin=false until PIN is (optionally)
specified.
- PIN / UNPIN only affect the current sequence (they do not pin
sub-sequences).
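Applied to the `setVar` spec from the Examples section, these rules play out roughly as follows (a sketch of the documented behavior, not verified output):

```lua
-- UNPIN lets the leading `name` (and the '=') miss silently, so an
-- enclosing Or can try another expression; PIN restores error-raising
-- for the rest of the sequence.
setVar = {UNPIN, name, '=', PIN, num, kind='setVar'}
-- parsing "x = 1" -> a node with kind='setVar'
-- parsing "foo()" -> nil ('=' missing while unpinned: no error thrown)
-- parsing "x = @" -> syntax error (`num` is pinned and missing)
```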
Fields:
Methods
Fields:
Methods
Fields:
- kinds =pegl.fmtKinds
kind -> fmtFn
Module for applying asciicolor syntax highlighting to parsed pegl.
Types: Highlighter tokenize
Usage:
hl = require'pegl.lua'.highlighter
hl:highlight(lf, fgFile, bgFile)
The highlighter object determines how to parse and
highlight a text file's syntax.
Fields:
- config
the pegl root configuration.
Typically this should have lenient=true so that parsing
continues past syntax errors.
- spec
pegl root spec to parse, aka the grammar tree.
This should typically be a "lenient" spec with a fallback
which can parse any symbol.
- style
a table of name|kind -> 'style' (asciicolor style)
(name takes precedence).
The value can be either a style literal or a fn(node) -> style.
The Highlighter performs this lookup for each node's name/kind; if no
style is found, the lookup recurses into the node's children.
- builtin
list of builtin names.
- styleColor
table of style -> asciicolor fg/bg
- dir
base dir where highlighting is stored
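A `style` table per the description above might be sketched like this (the style names 'number' and 'var' are assumptions, not documented pegl values):

```lua
-- Hypothetical Highlighter.style table: name|kind -> asciicolor style,
-- where a value is either a style literal or fn(node) -> style.
style = {
  num  = 'number',                        -- literal style, looked up by kind
  name = function(node) return 'var' end, -- computed style (fn gets the node)
}
```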
Methods
Usage:
tokens = tokenize(highlighter, lineFile)
A stateful object that, when called (constructed), produces a list of tokens.
Fields:
- hl
highlighter config
- p
the parser
- root
the root parsed node
- kind
Lua syntax in PEGL
http://parrot.github.io/parrot-docs0/0.4.7/html/languages/lua/doc/lua51.bnf.html
was used as a reference.
Functions
- (t, add) -> t
return t with the key/vals of add inserted