lines: view files as list of lines
The lines library (and its submodules) enables line-based processing of
text. The #lines module itself exports an API that works on a list of
strings (aka lines) and can also deal with files as indexed lines (i.e.
#lines.File).
lines is the fundamental building block of civstack's ele editor
and pegl parsers: it is much easier as a developer to think about
human-readable text as a list of lines than as a list of bytes, and this is
especially true for an editor. #lines.File and #lines.EdFile use
a separate index file to make reading and writing files as lines
performant for real-world use cases.
sub / insert / remove semantics
lines.sub(t, l,c, l2,c2) returns the text from span l.c -> l2.c2.
For instance: suppose you have the following text:
1234 6789
abcd fghi
You would get the following values:
| 1.1  1.2  | {'12'} |
| 1.6  1.10 | {'6789', ''} - last char goes to next line |
| 1.6  2.0  | {'6789', ''} - next line zero char the same |
| 1.6  2.2  | {'6789', 'ab'} |
| 1.10 2.2  | {'', 'ab'} |
| 1.10 3.0  | {'', 'abcd fghi'} - EoF does not have new line |
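One way to read the table above is to imagine the lines joined with newlines, where column #line+1 (and column 0 of the next line) both address the newline that ends a line. The sketch below models that reading in plain Lua. It is not the library's implementation of lines.sub, and the helper names split, flatOffset and modelSub are hypothetical.

```lua
-- Sketch only: models the span conventions with a joined string.
-- This is NOT how lines.sub is implemented.

-- Split on newlines, keeping empty trailing pieces.
local function split(s)
  local out = {}
  for line in (s .. '\n'):gmatch('(.-)\n') do
    out[#out + 1] = line
  end
  return out
end

-- Byte offset of l.c in the joined text. Column #line+1 and column 0 of
-- the following line both land on the newline that ends the line.
local function flatOffset(t, l, c)
  local off = 0
  for i = 1, l - 1 do off = off + #(t[i] or '') + 1 end
  return off + c
end

-- Return the span l.c -> l2.c2 (inclusive) as a list of strings.
local function modelSub(t, l, c, l2, c2)
  local text = table.concat(t, '\n')
  return split(text:sub(flatOffset(t, l, c), flatOffset(t, l2, c2)))
end

local t = {'1234 6789', 'abcd fghi'}
local r = modelSub(t, 1, 6, 2, 2)
assert(r[1] == '6789' and r[2] == 'ab')
```

Because the joined text has no trailing newline, a span ending at 3.0 simply stops at the last character, which matches the EoF row.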
The methods span, remove, offset, offsetOf and insert are all designed to
use these conventions to enable reversibility. When you remove a span, it will
modify the lines object in-place, returning the span you removed. If you
re-insert that span in the same place it will return the table to its previous
state. Along with being easy to understand, this architecture is fundamental to
how undo/redo works in the Ele editor.
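The round trip can be illustrated with a small model that treats the lines as one joined string. This is a sketch under that assumption, not the library code: modelRemove and modelInsert are hypothetical names, and the model returns the removed text as a single string with embedded newlines.

```lua
-- Sketch only: a flat-string model of remove/insert reversibility.
local function split(s)
  local out = {}
  for line in (s .. '\n'):gmatch('(.-)\n') do out[#out + 1] = line end
  return out
end

local function flatOffset(t, l, c)
  local off = 0
  for i = 1, l - 1 do off = off + #(t[i] or '') + 1 end
  return off + c
end

-- Replace the contents of t in place with the lines of text.
local function setText(t, text)
  for i = #t, 1, -1 do t[i] = nil end
  for i, line in ipairs(split(text)) do t[i] = line end
end

-- Remove span l.c -> l2.c2 in place and return what was removed.
local function modelRemove(t, l, c, l2, c2)
  local text = table.concat(t, '\n')
  local s, e = flatOffset(t, l, c), flatOffset(t, l2, c2)
  setText(t, text:sub(1, s - 1) .. text:sub(e + 1))
  return text:sub(s, e)
end

-- Insert the string ins so that it starts at l.c.
local function modelInsert(t, ins, l, c)
  local text = table.concat(t, '\n')
  local s = flatOffset(t, l, c)
  setText(t, text:sub(1, s - 1) .. ins .. text:sub(s))
end

local t = {'1234 6789', 'abcd fghi'}
local removed = modelRemove(t, 1, 6, 2, 2) -- t is now {'1234 cd fghi'}
local afterRemove = t[1]
modelInsert(t, removed, 1, 6)              -- t is restored
```

An undo stack in this model would only need to store the removed text plus the l.c position where it started.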
The lines module provides a uniform API for lines-like objects.
You can also call this module directly to get a table of lines
from a string.
Functions
- fn join(t) -> string
Join a table of strings with newlines.
- fn span(l, c, l2, c2) -> (l, c?, l2, c2?)
Enables addressing lines via either (l,l2) or (l,c, l2,c2) span.
- fn bound(t, l,c, tlen, ln) -> l, c
Bound the line/col for the lines table.
- l will be from 1 to #t+1.
- c will be from 0 to #t[l]+1.
tlen is the precomputed #t and ln is the pre-fetched t[l].
This can handle negative integers.
- fn boundSpan(t, l,c, l2,c2, tlen)
Bound a span from l,c -> l2,c2.
- fn insert(t, ins, l,c) -> nil
insert string at l, c
Note: this is NOT performant (O(N)) for large tables.
See: #Gap (or similar) for handling real-world workloads.
- fn sort(...) -> l1, c1, l2, c2
Sort the span
- fn sub(l, ...) -> {str}, l,c
Get the sub-span of the lines.
- fn usub(l, ...) -> {str}, l,c
Get the UTF8 aware sub-span of the lines.
- fn map(lines) -> table
create a table of lineText -> {lineNums}
- fn offset(t, off, l,c) -> l2,c2
Get the l, c with the +/- offset applied
- fn offsetOf(t, l,c, l2,c2) -> int
get the byte offset
- fn find(t, pat, l,c) -> (l, c, c2)
find the pattern starting at l/c
Note: matches are only within a single line.
- fn findBack(t, pat, l,c)
find the pattern (backwards) starting at l/c
- fn remove(t, l,c, l2,c2) -> string|table
remove span (l, c) -> (l2, c2), return what was removed
- fn box(t, l1, c1, l2, c2, fill) -> lines
return the box of the lines.
Outside the box is not returned.
***1------------------------+**
***|l1,c1 = top left        |**
***|       bot right = l2,c2|**
***+------------------------2**
*So no '*' chars are returned.*
- fn getIndent(t, l) -> str?
Get the indentation of line.
- fn autoIndent(t, l) -> string?
Get the autoIndent to use for line.
- fn load(f, close) -> (table?, errstr?)
load lines from file or path. On error return (nil, errstr)
- fn dump(t, f, close, chunk)
write lines t to file f in chunks (default = 16KiB)
if f is a string then it is opened as a file and closed when done
- fn write(t, ...) -> true
Logic to make a table behave like a file:write(...) method.
This is NOT performant, especially for large lines.
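As an illustration of one of the simpler entries above, the map description (lineText -> {lineNums}) could be realized roughly as follows. This is a plain-Lua sketch written from the one-line description only. The name lineMap is hypothetical and the real function may differ in details.

```lua
-- Sketch only: build a table of lineText -> {lineNums}.
local function lineMap(t)
  local m = {}
  for l, text in ipairs(t) do
    local nums = m[text]
    if not nums then
      nums = {}
      m[text] = nums
    end
    nums[#nums + 1] = l
  end
  return m
end

local m = lineMap({'a', 'b', 'a'})
-- m['a'] holds lines 1 and 3, m['b'] holds line 2
```

A table like this makes it cheap to ask which lines are unique, since unique lines are exactly those whose list has length 1.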
Diffing module and command
Cmd Usage:
ldiff 'file/path1.txt' 'file/path2.txt'
Lib Usage:
io.fmt(ldiff.Diff(linesA, linesB))
This library/cmd creates readable diffs using the "patience diff" algorithm.
The code was written from scratch referencing only the algorithm outline
below, but I want to give special thanks to James Coglan for his
excellent blog post.
Fundamentals of patience diff:
- Skip unchanged lines on both top and bottom.
- Find unique lines in both sets and "align" them using "longest increasing
subsequence".
- Repeat for each aligned section.
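The alignment step (the second bullet) can be sketched in plain Lua as below. This is not ldiff's source: uniqueIndex, lis and anchors are hypothetical names, and the sketch skips the unchanged top/bottom trimming and the recursion into each aligned section.

```lua
-- Sketch only: find lines unique in both lists and align them with a
-- longest increasing subsequence (patience sorting).

-- Map each line that occurs exactly once in t to its index.
local function uniqueIndex(t)
  local count, index = {}, {}
  for i, line in ipairs(t) do
    count[line] = (count[line] or 0) + 1
    index[line] = i
  end
  local uniq = {}
  for line, n in pairs(count) do
    if n == 1 then uniq[line] = index[line] end
  end
  return uniq
end

-- Longest increasing subsequence by pair.b. piles[k] is the pair with the
-- smallest b that ends an increasing run of length k.
local function lis(prs)
  local piles = {}
  for _, p in ipairs(prs) do
    local lo, hi = 1, #piles + 1
    while lo < hi do -- first pile whose top has b >= p.b
      local mid = math.floor((lo + hi) / 2)
      if piles[mid].b < p.b then lo = mid + 1 else hi = mid end
    end
    p.prev = piles[lo - 1]
    piles[lo] = p
  end
  local out, p = {}, piles[#piles]
  while p do
    table.insert(out, 1, p)
    p = p.prev
  end
  return out
end

-- Anchors: pairs {a=i, b=j} where a[i] == b[j] is unique in both lists,
-- kept only if they appear in the same relative order in both.
local function anchors(a, b)
  local ua, ub = uniqueIndex(a), uniqueIndex(b)
  local prs = {}
  for i = 1, #a do
    local j = ua[a[i]] and ub[a[i]]
    if j then prs[#prs + 1] = {a = i, b = j} end
  end
  return lis(prs)
end

local r = anchors({'x', 'a', 'b', 'c', 'x'}, {'x', 'c', 'a', 'b', 'x'})
-- r aligns 'a' (2 -> 3) and 'b' (3 -> 4); 'c' moved so it is not an anchor
```

The lines between consecutive anchors are the sections that the last bullet repeats the process on.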
Types: Diff
Datastructure which holds the result of computing the difference
between two lists of lines.
Fields b and c are just the original base/change lines.
noc, rem and add are lists of integers which represent the length
of a block. For instance, if for a given index rem=3 and add=2 it
means that three lines were removed from b and two were added to
c. If noc=10 that means that there is a block of 10 identical lines.
Fields:
- b
base, aka raw original lines
- c
change, aka raw new lines
- len
len of diff blocks (aka len of below fields).
It's not possible to use # for below, since some values
are nil.
- noc
nochange range (in both)
- rem
removed from b
- add
added in c
Methods
- fn:map(nocFn, chgFn)
Iterate through nochange and change blocks, calling the functions for each
- nocFn(baseStart, numUnchanged, changeStart, numUnchanged)
- chgFn(baseStart, numRemoved, changeStart, numAdded)
Note that the num removed/added will be nil if none were added/removed.
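To make the callback arguments concrete, here is a sketch of how such a block walk could work. It is not the library's Diff:map. It assumes (this is an assumption, not stated above) that at a given index a noc block, if present, comes before the rem/add block, and the name mapBlocks is hypothetical.

```lua
-- Sketch only: walk noc/rem/add block lengths and report line positions.
-- d is a plain table with the fields len, noc, rem and add.
local function mapBlocks(d, nocFn, chgFn)
  local bl, cl = 1, 1 -- next line number in base and in change
  for i = 1, d.len do
    local n, r, a = d.noc[i], d.rem[i], d.add[i]
    if n then
      nocFn(bl, n, cl, n)
      bl, cl = bl + n, cl + n
    end
    if r or a then
      chgFn(bl, r, cl, a) -- r or a may be nil
      bl, cl = bl + (r or 0), cl + (a or 0)
    end
  end
end

-- 2 unchanged lines, then 1 removed and 2 added, then 1 unchanged line.
local d = {len = 3, noc = {[1] = 2, [3] = 1}, rem = {[2] = 1}, add = {[2] = 2}}
local log = {}
mapBlocks(d,
  function(b, n, c) log[#log + 1] = ('noc b%d c%d n%d'):format(b, c, n) end,
  function(b, r, c, a)
    log[#log + 1] = ('chg b%d -%d c%d +%d'):format(b, r or 0, c, a or 0)
  end)
```

This also shows why len is stored separately: noc, rem and add have holes, so # is not reliable on them.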
Deprecated: use ds.bytearray instead. This will be removed.
A lines table with a write method and a few other file-like methods.
This is NOT performant, especially for small writes or large lines. It is
useful for tests and cases where simplicity is more important than
performance.
Methods
- fn set(name)
Create a parser spec record. These have the fields kind and name
and must define the parse method.
- fn get(name)
Create a parser spec record. These have the fields kind and name
and must define the parse method.
- fn write(t, ...) -> true
Logic to make a table behave like a file:write(...) method.
This is NOT performant, especially for large lines.
- fn flush()
function that does and returns nothing.
- fn extend(r, l) -> r
This is used by types implementing :extend.
It uses their get and set methods to implement
extend in a for loop.
Types do this if they may yield in their get/set, which
is not allowed across a C boundary like table.move.
- fn icopy(r)
For types implementing :copy() method.
Line-based gap buffer. The buffer is composed of two lists (stacks) of
lines:
- The "bot" (aka bottom) contains line 1 -> curLine.
curLine is at #bot. Data gets added to bot.
- The "top" buffer is used to store data in lines
after "bot" (aka after curLine). If the cursor is
moved to a previous line then data is moved from bot to top.
Gap gives a file-like write API which may not be the most performant
for some workloads (e.g. writing single characters).
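The two-stack idea can be sketched in a few lines of plain Lua. This is not the library's Gap: the SketchGap type and its insertAfter method are hypothetical, and storing top in reverse order is an assumption made here so that moving the gap is a cheap pop and push.

```lua
-- Sketch only: a line-based gap buffer made of two stacks.
-- bot holds lines 1..#bot in order. top holds the remaining lines in
-- REVERSE order (an assumption), so the line after the gap is top[#top].
local SketchGap = {}
SketchGap.__index = SketchGap

local function newGap(lines)
  local g = setmetatable({bot = {}, top = {}}, SketchGap)
  for i, line in ipairs(lines) do g.bot[i] = line end
  return g
end

-- Move the gap so that l == #self.bot.
function SketchGap:setGap(l)
  local bot, top = self.bot, self.top
  while #bot > l do -- gap moves to an earlier line: bot -> top
    top[#top + 1] = bot[#bot]
    bot[#bot] = nil
  end
  while #bot < l and #top > 0 do -- gap moves to a later line: top -> bot
    bot[#bot + 1] = top[#top]
    top[#top] = nil
  end
end

function SketchGap:get(l)
  if l <= #self.bot then return self.bot[l] end
  return self.top[#self.top - (l - #self.bot) + 1]
end

-- Insert a new line after line l. Cheap when the gap is already at l.
function SketchGap:insertAfter(l, line)
  self:setGap(l)
  self.bot[#self.bot + 1] = line
end

local g = newGap({'a', 'b', 'c', 'd'})
g:insertAfter(2, 'X') -- lines are now a, b, X, c, d
```

Repeated edits near the same line only touch the end of bot, which is why this layout suits an editor where changes cluster around the cursor.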
Fields:
- top
array of lines after the gap (after curLine).
- bot
array of lines before the gap (line 1 -> curLine).
- path
the path this was read from or nil.
- readonly
whether to throw errors on write.
Methods
- fn:icopy() -> list
Make a copy of the gap to a lua list.
- fn:reader() -> Gap
- fn load(T, f, close) -> Gap?, err?
Load gap from file, which can be a path.
returns nil, err on error
- fn:get(l) -> string
Get a specific line index.
- fn:set(l, v)
Set a specific line index with the value.
- fn:inset(i, values, rmlen) -> rm?
See ds.inset for documentation.
- fn:extend(lns) -> self
Extend gap with the lines.
- fn:setGap(l)
set the gap to the line number, making l == #g.bot.
- fn:write(...)
- fn dumpf(t, f, close, chunk)
write lines t to file f in chunks (default = 16KiB)
if f is a string then it is opened as a file and closed when done
A file of 3-byte (24-bit) integers. These are commonly
used for indexing lines.
This object supports get/set index operations including appending. Every
operation (except consecutive reads/writes) requires a file seek.
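A fixed-width record file like this can be sketched with string.pack (Lua 5.3 or later). The sketch below is not the IFile implementation. The big-endian byte order and the helper names encode3, decode3 and seekPos are assumptions made for illustration.

```lua
-- Sketch only: store 24-bit unsigned integers as 3-byte records.
-- Requires Lua 5.3+ for string.pack / string.unpack.
-- Big-endian ('>') is an assumption; the real format may differ.
local function encode3(v) return string.pack('>I3', v) end
local function decode3(s) return (string.unpack('>I3', s)) end

-- Byte position of the (1-based) index i within the file.
local function seekPos(i) return (i - 1) * 3 end

local f = assert(io.tmpfile())
for _, v in ipairs({0, 258, 0xFFFFFF}) do
  f:write(encode3(v))
end

-- Random access: seek to the record, then read exactly 3 bytes.
f:seek('set', seekPos(2))
local second = decode3(f:read(3))
f:close()
```

Because every record has the same width, get and set reduce to one seek plus a 3-byte read or write, and appending is a write at position len * 3.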
Fields:
Methods
- fn create(T, ...) -> icreate(T, 3, ...)
- fn:reload() -> IFile?, errmsg?
Reload IFile from path.
- fn load(T, ...) -> iload(T, 3, ...)
- fn:flush()
- fn:close()
- fn:closed() -> bool
- fn:getbytes(i)
get bytes. If index out of bounds return nil.
Panic if there are read errors.
- fn:get(i)
get value at index
- fn:setbytes(i, v)
- fn:set(i, v)
set value at index
- fn:move(to, mvFn) -> self
Move the IFile's path to the path given by to.
mvFn must be of type fn(from, to). If not provided,
civix.mv will be used.
This can be done on both closed and opened files.
The IFile will re-open on the new file regardless of the
previous state.
- fn:reader() -> IFile?, err?
Get a new read-only instance with an independent file-descriptor.
Warning: currently the reader's len will be static, so this should
be mostly used for temporary cases. This might be changed in
the future.
Usage:
File{'path/to/file.txt', mode='r'}
Indexed file of lines supporting modes 'r' and 'a+'.
Use EdFile instead if you need to do non-append edits.
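The idea behind an indexed file of lines can be sketched as follows: record the byte offset where each line starts, then seek straight to a line instead of scanning. This is only an illustration in plain Lua 5.3+ with an in-memory index. buildIndex and getLine are hypothetical names, and the real File keeps its index in a separate file.

```lua
-- Sketch only: index a file by the byte offset of each line start.
-- Requires Lua 5.3+ for the 'L' and 'l' read formats without '*'.
local function buildIndex(f)
  local idx, pos = {}, 0
  f:seek('set', 0)
  for line in f:lines('L') do -- 'L' keeps the trailing newline
    idx[#idx + 1] = pos
    pos = pos + #line
  end
  return idx
end

-- Fetch line i with a single seek, or nil if i is out of range.
local function getLine(f, idx, i)
  if not idx[i] then return nil end
  f:seek('set', idx[i])
  return f:read('l')
end

local f = assert(io.tmpfile())
f:write('one\ntwo\nthree')
local idx = buildIndex(f) -- offsets 0, 4 and 8
local second = getLine(f, idx, 2)
local third = getLine(f, idx, 3)
f:close()
```

Appending a line only appends one offset to the index, which fits the append-only modes described above.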
Fields:
- path
path of this file.
- mode
'r', 'a' or 'a+'
- f
open (normal) file object
- idx
line index of f
- cache
cache of lines
- loadIdxFn
default=lines.futils.loadIdx
Methods
EdFile: an editable line-based file object, optimized for
indexed and consecutive reads and writes.
Usage:
local ed = EdFile(path, mode);
ed:set(1, 'first line')
ed:set(2, 'second line')
ed:set(1, 'changed first line')
ed:close()
Fields:
- lf
indexed append-only file.
- dats
list of Slc | Gap objects.
- lens
rolling sum of dat lengths.
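A rolling sum of lengths makes it possible to find which dat holds a given line without walking every dat. The sketch below shows one way to do that lookup with a binary search. It is an assumption about how lens could be used, not EdFile's actual code, and locate is a hypothetical name.

```lua
-- Sketch only: lens[k] is the total number of lines in dats[1..k].
-- Returns k and the index local to dats[k], or nil if i is out of range.
local function locate(lens, i)
  if i < 1 or #lens == 0 or i > lens[#lens] then return nil end
  local lo, hi = 1, #lens
  while lo < hi do -- first k with lens[k] >= i
    local mid = math.floor((lo + hi) / 2)
    if lens[mid] < i then lo = mid + 1 else hi = mid end
  end
  return lo, i - (lens[lo - 1] or 0)
end

-- Three dats holding 3, 2 and 4 lines give lens = {3, 5, 9}.
local lens = {3, 5, 9}
local k, li = locate(lens, 4) -- line 4 is the 1st line of the 2nd dat
```

The trade-off is that an edit which changes the length of one dat has to update every later entry of lens.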
Methods
- fn:get(i) -> line
Get line at index
- fn:write(...) -> self?, errmsg?
- fn:set(i, v)
Set line at index.
- fn:reader()
Return a read-only view of the EdFile which shares the
associated data structures.
- fn:flush()
Flush the .lf member (which can only be extended).
To write all data to disk you must call :dumpf().
- fn:close()
Note: to write all data to disk you must call :dumpf().
- fn:dumpf(f)
Dump contents to file or path.
- fn:extend(values)
Appends to lf for extend when possible.
- fn:inset(i, values, rmlen) -> rm?
insert into EdFile's dats.
Utilities for file loading of lines. Generally users shouldn't
need to use this file.
Functions
Helper methods for moving a cursor around a lines-like 2D grid.
The notation l.c is used to refer to line, column where
both are 1-indexed.
Functions
- fn decDistance(s, e) -> int
Move s closer to e by 1.
If they are equal do nothing.
- fn lcLe(l, c, l2, c2) -> bool
Return whether l.c is equal to or before l2.c2.
- fn lcGe(l, c, l2, c2) -> bool
Return whether l.c is equal to or after l2.c2
- fn topLeft(l, c, l2, c2) -> (l, c)
Return the top-left (aka the minimum) of two points.
- fn lcWithin(l, c, l1, c1, l2, c2) -> bool
- fn wordKind(ch) -> ws|sym|let
Given a character, return its word-kind:
ws (whitespace), sym (symbol), let (letter).
- fn pathKind(ch) -> ws|sym|path
Given a character, return its path-kind:
ws (whitespace), sym (symbol), path (path)
- fn forword(s, si, getKind) -> int
Get the start of the next word from si (start-index).
- fn backword(s, ei, getKind) -> int
Get the start of the previous word from ei (end-index).
- fn getRange(s, i, getKind) -> si,ei
get the range[si,ei] of whatever is at s[i].
- fn findBack(s, pat, ei, plain) -> int
find backwards from ei (end index).
This searches for the pattern and returns the LAST one found.
This is HORRIBLY non-performant, only use for small amounts of data (like a
line).
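The position comparisons above are small enough to sketch directly. The versions below are plain-Lua re-implementations written from the descriptions of lcLe, lcGe and decDistance only. They are not the library's source and may differ from it in edge cases.

```lua
-- Sketch only: written from the one-line descriptions above.

-- Is l.c equal to or before l2.c2?
local function lcLe(l, c, l2, c2)
  return l < l2 or (l == l2 and c <= c2)
end

-- Is l.c equal to or after l2.c2?
local function lcGe(l, c, l2, c2)
  return l > l2 or (l == l2 and c >= c2)
end

-- Move s closer to e by 1. If they are equal do nothing.
local function decDistance(s, e)
  if s < e then return s + 1 end
  if s > e then return s - 1 end
  return s
end

local before = lcLe(1, 9, 2, 1) -- true: line 1 comes before line 2
local step = decDistance(5, 2)  -- 4
```

Comparing the line first and the column only on a tie is what lets a span be sorted regardless of the direction in which it was selected.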
table: raw table
kev: "Key Equal Value" serialization format.
This is an extremely common format in many Unix utilities and is "good enough"
for a large number of configuration use cases. The format is simple: a file
containing lines of key=value. The input and output are a table of
key,val strings (though tostring is called for to()). Lines which start
with # or don't have = in them are ignored.
Nested data is absolutely not supported. Spaces are treated as literal both
before and after =. If you want a key containing = or key/value
containing newline then use a different format (or write your own).
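The rules above are simple enough to sketch in plain Lua. This is not the kev module's source. fromKev and toKev are hypothetical names, the sketch assumes string keys, and it sorts keys on output only to make the result deterministic.

```lua
-- Sketch only: parse and serialize key=value lines.

-- Lines starting with '#' or without '=' are ignored. The key is
-- everything before the first '=', spaces included.
local function fromKev(lines)
  local t = {}
  for _, line in ipairs(lines) do
    if line:sub(1, 1) ~= '#' then
      local k, v = line:match('^(.-)=(.*)$')
      if k then t[k] = v end
    end
  end
  return t
end

-- Values go through tostring. Assumes keys are strings.
local function toKev(t)
  local keys = {}
  for k in pairs(t) do keys[#keys + 1] = k end
  table.sort(keys)
  local out = {}
  for i, k in ipairs(keys) do
    out[i] = k .. '=' .. tostring(t[k])
  end
  return out
end

local cfg = fromKev({'# comment', 'name=ele', 'ignored line', 'pad = 4 '})
-- cfg.name is 'ele'; the key 'pad ' (with a space) maps to ' 4 '
local out = toKev({port = 8080, host = 'localhost'})
-- out is {'host=localhost', 'port=8080'}
```

Note how 'pad = 4 ' keeps its spaces on both sides of =, which is the literal-space behavior described above.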
Functions