1 Regular Definitions
Felix has a DSSL (Domain Specific SubLanguage) for regular expressions, or more precisely, regular definitions. We'll first show example of using it, and then show how he DSSL is constructed.
1.1 Example Definitions
Here is a simple example defining C identifiers:
examples/regex/re1.flx
regdef digit = charset "01234356789"; regdef lower = charset "abcdefghijklmnopqrstuvwxyz"; regdef upper = charset "ABCDEFGHIJKLMNOPQRSTUVWXYZ"; regdef letter = lower | upper; regdef underscore = "-"; // C identifier regdef cidlead = underscore | letter; regdef cidtrail = underscore | letter | digit; regdef cid = cidlead cidtrail*;
The regdef
binder defines an ordinary var
containing a representation
of the specified regular expression. But it does a lot more than that:
it enables a regular definition grammar based on EBNF.
Regular definitions are vastly superior to regular expressions because they allow the expression to be factored into components; in other words, they provide modularity.
In addition, the use of program language syntax is a killer advantage over the usual regular expression string forms because the distinction between operators and data is not only clear .. the syntax is checked by the parser at compile time.
We'll now define something a bit more complicated:
examples/regex/re1.flx
// C int regdef cint = digit+; // C decimal fixed point float regdef ffixed = digit* "." digit+ | digit+ "." digit*; // Scifloat regdef expon = ("E" | "e") ("+"|"-")? digit+; regdef sci = (cint | ffixed) expon; regdef ffloat = ffixed | sci;
These definitions should be quite clear as written: for fixed point float, a sole dot is not a floating point number, but leading or trailing digists can be left out provcided there is a dot, but not both.
For a escientific float, we need either a plain integer of fixed point float, followed by an exponent, which allows but does not require a sign.
1.2 imple Usage
A variable defined by a regdef
has type regex
. However we are using Google RE2
as our regular expression engine, and that requires a compiled regular expression or type RE2
to work. The RE2 compiler compiles the usual kinds of strings, so we need first to convert
a regex
to a string.
So here is our first test case:
examples/regex/re1.flx
var float_s : string = ffloat.Regdef::render; // convert regex to string var float_r : RE2 = float_s.RE2; // compile string to RE2 regex println$ "123.4e-7" in float_r; // if it is a float? println$ float_r;
Now we printed out the actually rendered regular expression encoded as a string and here it is, with some newlines added so it fits on our page:
(?:[01234356789])*\x2E(?:[01234356789])+|(?:[01234356789])+ \x2E(?:[01234356789])*|(?:(?:[01234356789])+|(?:[01234356789])* \x2E(?:[01234356789])+|(?:[01234356789])+\x2E(?:[01234356789])*) (?:E|e)(?:\x2B|\x2D)?(?:[01234356789])+
The chance of getting this very simple regex correct as a string does not seem very high.
An anecdote: there was once a site that performance tested various languages, and one of the tests was a simple phone number regex. I think everyone copied the regex used by the first contribution, except Felix which used regular definitions. Every single language except Felix got it wrong.
1.3 Captures
Ofen we want to capture data from a regex. Although it made a mess and got the maths wrong, Perl had the ability to perform captures. In light of some mathematical analysis new techniques were developed, but in the end Google RE used a simple capture engine and threw out balancing captures to ensure it worked.
Here's s simple example
examples/regex/re1.flx
regdef assignment = " "* group(cid) " "* "=" " "* (group(cint) | group(ffloat)) " "* ";"; var a_s = assignment.Regdef::render; var a_r = a_s.RE2; var result = Match (a_r, "x = 123.45;"); match result with | None => println$ "No match"; | Some v => print$ "Variable is " + v.1; // group 1 if v.2 == "" do // if group 2 didn't match println$ " Initialiser is float " + v.3; // group 3 matched else println$ " Initialiser is int " + v.2; // group 2 matched done endmatch;
The Match
function returns an option type which is None
is there is no match,
otherwise Some v
if there is. The v
is a Felix varray[string]
with slot 0
containing the whole matchingh pattern (which is always the whole input)
and then one slot for each catpture group. The capture groups are numbered by a left
to right depth first recursive descent of the implicit tree structure.
If a group doesn't match anything it is set to the empty string.
If a group is inside a repetition only the last matching substring is retained.
1.4 The Sub bit of the Domain Specific Sub Language
So far, we have shown what is effectively a Domain Specific Language or DSL. Yes, it is nested in Felix and interacts with it, but to truly be a sub language we need to more completely integrate it with Felix code.
The Match
function always matches the whole string. To work around this
we can define this:
examples/regex/re1.flx
var anys = ".*"; regdef anystring = perl(anys);
and stick it at the end. Alternatively we can do this:
examples/regex/re1.flx
var prefix_assignment = a_s + anys;
The important point here is that we can lift a Felix string
which is a traditional Perl regex string into the computation,
and the string can be denoted by any Felix expression of
type string
. You can write any kind of functional rendering of
repex string generating code you like.
1.5 The term tree
There is a second way to generate a regex by using a combinator tree
or type regex
directly. In fact, the Regdef DSSL grammar just provides a convenient
syntax for generating these trees. Here is the important part of the
definition from the library:
class Regdef { variant regex = | Alts of list[regex] | Seqs of list[regex] | Rpt of regex * int * int | Charset of string | String of string | Group of regex | Perl of string ; ...
Although the grammar provides a convenient way to construct some of these trees, it cannot construct all of them. For example suppose you want to syntax colour a programming language by highlighting keywords.
You would of course like to do this:
regdef kw = "proc" | "fun" | "variant";
but alas, this grammar only works with a fixed list of known keywords which have to be cited literally as shown. What if you wanted to load the keyword list from a file?
You mean, like this?
examples/regex/re1.flx
var data = "proc fun variant"; var lines = split(data," "); var kws = unbox$ map (fun (x:string) => Regdef::String x) lines; var kws_r = kws.Regdef::Alts; println$ kws_r.str;
1.6 Inline regex
It is possible to build regex inline like this:
examples/regex/re1.flx
var cid2 = regexp ( cidlead cidtrail*);
which in this case is precisely equivalent to saying
regdef cide = cidlead | cidtrail*;
The parser switched to the regex syntax inside the parens.
Remember, a regex
is just a simple variant that builds a term tree.
The grammar that does all this is defined in the library; that is,
in user space.
2 Pattern matching
It is possible to pattern match with regexps. The ability to do this is quite general built in to the way the compiler works, rather than a feature of the regex DSSL: we enable user defined pattern matches using a feature of Felix pattern matching known as higher order pattern matching.
The way pattern matches work in Felix requires two functions:
- The match checker tests to see if a pattern matches the match argument
- The extractor fetches the argument of the variant constructor matched
The way this works with Felix data types is built in to the compiler, but there is a way to write your own checker and extractor functions:
examples/regex/re1.flx
// the match checker fun _match_ctor_re (r:RE2) (x:string) => x in r; // the extractor fun _ctor_arg_re (r:RE2) (x:string) => match Match (r,x) with | Some y => y // None case shouldn't happen! endmatch ; // test case match "Hello" with | re "H(ell)o".RE2 y => println$ y.1; endmatch;
In this code we invent a new pretend constructor re
which takes two arguments.
The first is the pattern we want to consider, and the second is the data we're
checking against. In our case the pattern is an RE2
and the data is a string;
but the technique is fully general.
The first function has a magic name used for checking is the string is in the regexp, the second provides the way to get data out of it. Notice the extractor is called if, and only if the match checker returns true.
Therefore the word re
followed by a term of type RE2
will be treated
as a pattern with one argument.
Although the higher order pattern matching feature is not specific to regular expressions .. it was in fact added to the compiler specifically to support that use case.
3 Iteration
In Felix we have a concept of an iterator: it corresponds with
the C++ notation of an input iterator. The library contains
an iterator allowing regexps to find matching substrings of a string.
Unlike Match
our iterator scans for the first match, and returns.
If the iterator is called again, it finds the next match.
Here's how to use it:
examples/regex/re1.flx
var kw = kws_r.Regdef::render.RE2; for xxv in iterator (kw, "proc blah fun proc") do println$ xxv.0; done
Here's the implementation
gen iterator (r:RE2, var target:string) () : opt[varray[string]] = { var emptystring = ""; var l = len target; var s = StringPiece target; var p1 = s.data; var p = 0; var n = NumberOfCapturingGroups(r)+1; var v1 = varray[StringPiece] (n.size,StringPiece emptystring); var v2 = varray[string] (n.size,""); again:> var result = Match(r, s, p, UNANCHORED,v1.stl_begin, n); if not result goto endoff; for var i in 0 upto n - 1 do set(v2, i.size, string(v1.i)); done var p2 = v1.0.data; assert(v1.0.len.int > 0); // prevent infinite loop p = (p2 - p1).int+v1.0.len.int; yield Some v2; goto again; endoff:> return None[varray[string]]; }
This uses a whole lot of features of the Google RE2 system
which are lifted into Felix, such as the data type StringPiece
which is roughly a view of a string, and which I am not going
to explain.
Rather the take away is that we can implement high level DSSL features based on a C/C++ library, by lifting the library API into Felix, more or less secretly, and then use them to implement the operations we actually want, with the syntax we actually want.
In many case the core system deliberately has features to support
this kind of modelling technology. In the code above, the key
piece of magic is the yield
operation, which returns a value,
but also saves the current location in the code, so that re-invoking
the generator will restart the code with the same context and at the
same point it previously left off.
Yielding Generators are a common feature of many languages in including Python and Rust. They are in fact a weak form of coroutine.