Skip to content

An experimental implementation of parsing expression grammars (a la Janet) in Common Lisp

License

Notifications You must be signed in to change notification settings

ravi-delia/uclp

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

37 Commits
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Unnamed Common Lisp Peg

UCLP is an experimental implementation of PEG parsing in Common Lisp which compiles grammar rules directly to source code at runtime. A parsing expression grammar is a very elegant way of recognizing, parsing, and transforming text- much more powerful than regular expressions without the complexity of a custom-built parser. Note that while it is possible to parse PEGs in guaranteed linear time (at the cost of linear space) with a packrat parser, UCLP does not. Most patterns unless very poorly written will run in linear time anyway, and for ease of use some necessary departures from the strict definition are included.

UCLP patterns are just made of native data structures, so it’s easy to compose patterns and interact with them in code. Unlike regular expressions, PEG syntax is pretty readable even if you aren’t familiar with the specifics. The below example is largely copied from the Janet language documentation:

(defparameter ip-address
  '(grammar
     :dig (range "09")
     :0-4 (range "04")
     :0-5 (range "05")
     :byte (choice
             (sequence "25" :0-5)
             (sequence "2" :0-4 :dig)
             (sequence "1" :dig :dig)
             (between :dig 1 2))
     :main (sequence :byte "." :byte "." :byte "." :byte)))

; uclp:match returns two values, a boolean indicating success/failure and a list of captures
(uclp:match ip-address "0.0.0.0") ; -> t nil
(uclp:match ip-address "elephant") ; -> nil
(uclp:match ip-address "256.0.0.0") ; -> nil
(uclp:match ip-address "0.0.0.0moretext") ; -> t nil

If you follow the link you’ll see that the example is almost exactly copied, with the only differences being those forced by the differences in language syntax. UCLP is a very close reproduction of the semantics of Janet’s PEG module. After experiencing how pleasant text parsing is in Janet you’ll also feel the urge to rewrite it for every language you use.

As of now UCLP is usable. Bugs are to be expected, but almost all of Janet’s patterns are supported and there shouldn’t be any significant footguns. Unfortunately it has only been tested on SBCL, but I plan on at least expanding that to some other implementations since there isn’t much implementation-specific code. Except of course that for it to be performant it needs to be compiled quickly to a quick runtime, ideally machine code.

Usage

The peg syntax used by UCLP is largely a 1:1 reproduction of Janet’s. Differences are noted below, but otherwise you can safely default to using Janet’s documentation. Many tasks that might require a particular function when using regex can be accomplished with the correct pattern. However, for convenience, there are entrypoints for a few common usecases.

(match rule str &optional start &rest args)

This is the primary entrypoint to UCLP. Matches str against rule anchored to start. Returns two values on success, a boolean indicating success or failure and a list of captures, and nil otherwise. Additional arguments are accessable while matching using the argument pattern. Note that if using additional arguments you must specify a start.

(compile-peg rule &key (quiet? t) debug?)

Compiles rule to source ahead of time. Even with SBCL’s lightning quick compilation this can still save a significant amount of time, and for many rules almost all the consing. The result is just a closure, and it can be used in other patterns by including it literally. Does not promise to be nice if used outside its wrapper.

(captured rule str &optional start &rest args)

Takes exactly the same arguments as match but returns in the opposite order- first the list of captures, and success as the second value. Useful to get at captures without multiple-value-bind.

(replace-all rule replace str &optional start &rest args)

Returns the string resulting from replacing every instance of rule in str with replace, which may be a string to substitute or a function taking as an argument the substring matched by rule. Returns a boolean indicating if matches were found as a second value.

(replace-one rule replace str &optional start &rest args)

Behaves like replace-all, but replaces only the first match.

(find-all rule str &optional start &rest args)

Gives the position of each match in str as a list, such that

(defparameter dig '(* "dig:" :d))
(defparameter str "dig:7, dig:8, dig:9")

(every (lambda (position)
         (uclp:match dig str position))
       (uclp:find-all dig str)) ; => t

(find-one rule str &optional start &rest args)

Returns the position of the first match, or nil if there are no matches.

Differences From Janet

Mostly UCLP adheres to the behavior of Janet pegs, enough so that Janet’s documentation is better than anything I’ll have put together for a little while. However, there are some differences which obviously need to be documented somewhere. Some are due to differences in host language, some are due to taste, and some are just features I haven’t implemented yet.

Any difference in behavior from Janet not mentioned below is a bug, and either the behavior or the documentation will need to be changed.

Unimplemented

Janet buffers and strings are simply byte strings, and Janet pegs work on arbitrary strings of bytes. UCLP expects to work with Common Lisp strings, which in general are not byte strings but vectors of type character. As such the patterns intended to work on bytes are not implemented. These are: uint, uint-be, int, and int-be. Because using PEGs to parse binaries is so nice, I plan on at some point implementing some way of compiling PEGs intended to operate on raw bytes. However, specialization will be at compile-time and the above patterns will likely be available only in a byte PEG.

I’m not yet sure if UCLP will ever support a general number pattern. It’s possible I’ll bring in parse-float to make a float as well as a general number pattern.

Implemented

  • While UCLP does not have number, it does have integer, which takes identical arguments and parses an integer using parse-integer.
  • The pattern (split div-pat fill-pat) was contributed to Janet but not yet released. It will match when fill-pat matches each segment of the input when divided by div-pat. However, it is a little more complicated than that. For one, split always matches against the entire input. If you don’t want that, you need to use a sub pattern to contain it. This means that something like (match '(split "," 1) "a,b,c,") will not match! Additionally, fill-pat is implicitly matched in a sub- so if you want to match everything between your seperators, you can use a pattern like (split "," (any 1)). Look at the test cases for examples here.
  • The pattern (split div-pat fill-pat) was contributed to Janet but not yet released. It matches if the input string consists of instances of fill-pat split apart by div-pat. Look at the test cases to get a finer idea of the details, but it is equivalent to the following pattern:
    `(grammar
      :mid (sub (to (+ ,div-pat -1)) ,fill-pat) ; fill-pat matches up to the next div 
      :main (* :mid (any (* (drop ,div-pat) :mid)) -1)) 
        

Changes

  • Anywhere a string literal can go, including those in range or set, a character or list of characters and strings can also go. This is because Common Lisp strings do not have escape codes like Janet strings. So (range ("a" #\Newline)) in UCLP is the same as (range "a\n") in Janet.
  • between, at-least, at-most, and look all have the pattern as the first argument, unlike in Janet where it is the last argument.
  • backmatch requires a tag argument, and will not look up captures on the capture stack
  • replace takes either a string which it captures literally, or a function which it calls. Taking other datatypes literally will probably be in the next version. But unlike Janet, it will never look the matches up.
  • Grammars, represented in Janet by tables or structs, are written in UCLP with the grammar rule, which is followed by alternating keywords naming rules and patterns implementing them.
  • Because the reader doesn’t distinguish between :s and :S, the complement of a built-in pattern is prefixed with !. So :!s instead of :S.
  • The error pattern has significant differences from Janet’s error. See Error below.
  • In addition to the + and * variants of built-in patterns, UCLP has a maybe variant marked by a ?. So :w? denotes (? :w).
  • UCLP includes any (*), some (+), and maybe (?) variants of complement patterns. So :!d+ is (some (if-not :d 1)). See Aliases below

Aliases

UCLP offers aliases, keywords that stand in for larger patterns, similar to the Janet’s built-in patterns. And like built-in patterns, aliases are user extensible. However, there are a number of differences which are important to be aware of. Aliases are not first class citizens of UCLP- rather than full mutually recursive subpatterns, they are simple find-and-replace macros, inserted literally. You can reference other aliases from inside one, but if you create a cycle it’ll just blow out the stack. So just be cautious!

Aliases are stored in the alist *aliases*. You can manipulate *aliases* directly, or call the helper functions register-alias! and register-alias-suite!. Both take the name of the alias as a keyword, and the body as a peg expression, and push the new alias to *aliases*. However, register-alias-suite! will also add the complement, some, maybe, and any variants, like so:

(uclp:register-alias-suite! :v '(set "vV"))
(uclp:match '(* :v (<- :!v+)) "v not a V") ; => t (" not a ")

Error

Because Common Lisp conditions are so different from Janet signals, the error pattern has some subtleties in UCLP. It takes arguments of the form (error &optional pat condition). With 0 arguments, it will raise a peg-error. With one argument, it will raise an error only if pat matches. With two arguments, it will raise a condition so long as pat matches. To specify a particular condition a pattern must be given, but something like 0 will always match.

UCLP special cases conditions inheriting from peg-error, which is exported. A peg-error has slots pat, matched, and caps. If a pattern is given, these will automatically be filled by the pattern itself, the text that pattern matched, and the captures from the pattern respectively. If no pattern is supplied they will all be nil. These slots can be accessed with error-pat, error-matched, and caps. By default peg-error has a report function giving the pattern and, if nonempty, the matching substring. Depending on your choice of pattern, this can be tolerably readable.

About

An experimental implementation of parsing expression grammars (a la Janet) in Common Lisp

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published