The Sub bit of the Domain Specific Sub Language

1 Regular Definitions

Felix has a DSSL (Domain Specific SubLanguage) for regular expressions, or more precisely, regular definitions. We'll first show example of using it, and then show how he DSSL is constructed.

1.1 Example Definitions

Here is a simple example defining C identifiers:

examples/regex/re1.flx

  regdef digit = charset "01234356789";
  regdef lower = charset "abcdefghijklmnopqrstuvwxyz"; 
  regdef upper = charset "ABCDEFGHIJKLMNOPQRSTUVWXYZ"; 
  regdef letter = lower | upper;
  regdef underscore = "-";
  
  // C identifier
  regdef cidlead = underscore | letter;
  regdef cidtrail = underscore | letter | digit;
  regdef cid = cidlead cidtrail*;

The regdef binder defines an ordinary var containing a representation of the specified regular expression. But it does a lot more than that: it enables a regular definition grammar based on EBNF.

Regular definitions are vastly superior to regular expressions because they allow the expression to be factored into components; in other words, they provide modularity.

In addition, the use of program language syntax is a killer advantage over the usual regular expression string forms because the distinction between operators and data is not only clear .. the syntax is checked by the parser at compile time.

We'll now define something a bit more complicated:

examples/regex/re1.flx

  // C int
  regdef cint = digit+;
  
  // C decimal fixed point float
  regdef ffixed = digit* "." digit+ | digit+ "." digit*;
  
  // Scifloat
  regdef expon = ("E" | "e") ("+"|"-")? digit+;
  regdef sci = (cint | ffixed) expon;
  
  regdef ffloat = ffixed | sci;

These definitions should be quite clear as written: for fixed point float, a sole dot is not a floating point number, but leading or trailing digists can be left out provcided there is a dot, but not both.

For a escientific float, we need either a plain integer of fixed point float, followed by an exponent, which allows but does not require a sign.

1.2 imple Usage

A variable defined by a regdef has type regex. However we are using Google RE2 as our regular expression engine, and that requires a compiled regular expression or type RE2 to work. The RE2 compiler compiles the usual kinds of strings, so we need first to convert a regex to a string.

So here is our first test case:

examples/regex/re1.flx

  var float_s : string = ffloat.Regdef::render; // convert regex to string
  var float_r : RE2 = float_s.RE2; // compile string to RE2 regex
  println$ "123.4e-7" in float_r;  // if it is a float? 
  println$ float_r;

Now we printed out the actually rendered regular expression encoded as a string and here it is, with some newlines added so it fits on our page:

(?:[01234356789])*\x2E(?:[01234356789])+|(?:[01234356789])+
\x2E(?:[01234356789])*|(?:(?:[01234356789])+|(?:[01234356789])*
\x2E(?:[01234356789])+|(?:[01234356789])+\x2E(?:[01234356789])*)
(?:E|e)(?:\x2B|\x2D)?(?:[01234356789])+

The chance of getting this very simple regex correct as a string does not seem very high.

An anecdote: there was once a site that performance tested various languages, and one of the tests was a simple phone number regex. I think everyone copied the regex used by the first contribution, except Felix which used regular definitions. Every single language except Felix got it wrong.

1.3 Captures

Ofen we want to capture data from a regex. Although it made a mess and got the maths wrong, Perl had the ability to perform captures. In light of some mathematical analysis new techniques were developed, but in the end Google RE used a simple capture engine and threw out balancing captures to ensure it worked.

Here's s simple example

examples/regex/re1.flx

  regdef assignment = " "* group(cid) " "* "=" " "* (group(cint) | group(ffloat)) " "* ";";
  var a_s = assignment.Regdef::render;
  var a_r = a_s.RE2;
  var result = Match (a_r, "x = 123.45;");
  match result with
    | None => println$ "No match";
    | Some v =>
       print$ "Variable is " + v.1;                // group 1
       if v.2 == "" do                             // if group 2 didn't match
         println$ " Initialiser is float " + v.3;  // group 3 matched
       else
         println$ " Initialiser is int " + v.2;    // group 2 matched
      done
  endmatch;

The Match function returns an option type which is None is there is no match, otherwise Some v if there is. The v is a Felix varray[string] with slot 0 containing the whole matchingh pattern (which is always the whole input) and then one slot for each catpture group. The capture groups are numbered by a left to right depth first recursive descent of the implicit tree structure.

If a group doesn't match anything it is set to the empty string.

If a group is inside a repetition only the last matching substring is retained.

1.4 The Sub bit of the Domain Specific Sub Language

So far, we have shown what is effectively a Domain Specific Language or DSL. Yes, it is nested in Felix and interacts with it, but to truly be a sub language we need to more completely integrate it with Felix code.

The Match function always matches the whole string. To work around this we can define this:

examples/regex/re1.flx

  var anys = ".*";
  regdef anystring = perl(anys);

and stick it at the end. Alternatively we can do this:

examples/regex/re1.flx

  var prefix_assignment = a_s + anys;

The important point here is that we can lift a Felix string which is a traditional Perl regex string into the computation, and the string can be denoted by any Felix expression of type string. You can write any kind of functional rendering of repex string generating code you like.

1.5 The term tree

There is a second way to generate a regex by using a combinator tree or type regex directly. In fact, the Regdef DSSL grammar just provides a convenient syntax for generating these trees. Here is the important part of the definition from the library:

  class Regdef {
    variant regex =
    | Alts of list[regex]
    | Seqs of list[regex]
    | Rpt of regex * int * int
    | Charset of string
    | String of string
    | Group of regex
    | Perl of string
    ;
  ...

Although the grammar provides a convenient way to construct some of these trees, it cannot construct all of them. For example suppose you want to syntax colour a programming language by highlighting keywords.

You would of course like to do this:

  regdef kw = "proc" | "fun" | "variant";

but alas, this grammar only works with a fixed list of known keywords which have to be cited literally as shown. What if you wanted to load the keyword list from a file?

You mean, like this?

examples/regex/re1.flx

  var data = "proc fun variant";
  var lines = split(data," ");
  var kws = unbox$ map (fun (x:string) => Regdef::String x) lines;
  var kws_r = kws.Regdef::Alts;
  println$ kws_r.str;

1.6 Inline regex

It is possible to build regex inline like this:

examples/regex/re1.flx

  var cid2 = regexp ( cidlead cidtrail*);

which in this case is precisely equivalent to saying

  regdef cide = cidlead | cidtrail*;

The parser switched to the regex syntax inside the parens. Remember, a regex is just a simple variant that builds a term tree. The grammar that does all this is defined in the library; that is, in user space.

2 Pattern matching

It is possible to pattern match with regexps. The ability to do this is quite general built in to the way the compiler works, rather than a feature of the regex DSSL: we enable user defined pattern matches using a feature of Felix pattern matching known as higher order pattern matching.

The way pattern matches work in Felix requires two functions:

The match checker tests to see if a pattern matches the match argument
The extractor fetches the argument of the variant constructor matched

The way this works with Felix data types is built in to the compiler, but there is a way to write your own checker and extractor functions:

examples/regex/re1.flx

  // the match checker
  fun _match_ctor_re (r:RE2) (x:string) => x in r;
  
  // the extractor
  fun _ctor_arg_re (r:RE2) (x:string) =>
     match Match (r,x) with
     | Some y => y
     // None case shouldn't happen!
     endmatch
  ;
  
  // test case
  match "Hello" with
  | re "H(ell)o".RE2 y =>  println$ y.1;
  endmatch;

In this code we invent a new pretend constructor re which takes two arguments. The first is the pattern we want to consider, and the second is the data we're checking against. In our case the pattern is an RE2 and the data is a string; but the technique is fully general.

The first function has a magic name used for checking is the string is in the regexp, the second provides the way to get data out of it. Notice the extractor is called if, and only if the match checker returns true.

Therefore the word re followed by a term of type RE2 will be treated as a pattern with one argument.

Although the higher order pattern matching feature is not specific to regular expressions .. it was in fact added to the compiler specifically to support that use case.

3 Iteration

In Felix we have a concept of an iterator: it corresponds with the C++ notation of an input iterator. The library contains an iterator allowing regexps to find matching substrings of a string. Unlike Match our iterator scans for the first match, and returns. If the iterator is called again, it finds the next match.

Here's how to use it:

examples/regex/re1.flx

  var kw = kws_r.Regdef::render.RE2;
  for xxv in iterator (kw, "proc blah fun proc") do
    println$ xxv.0;
  done

Here's the implementation

id=line1> gen iterator (r:RE2, var target:string) () : opt[varray[string]] = { var emptystring = ""; var l = len target; var s = StringPiece target; var p1 = s.data; var p = 0; var n = NumberOfCapturingGroups(r)+1; var v1 = varray[StringPiece] (n.size,StringPiece emptystring); var v2 = varray[string] (n.size,""); again:> var result = Match(r, s, p, UNANCHORED,v1.stl_begin, n); if not result goto endoff; for var i in 0 upto n - 1 do set(v2, i.size, string(v1.i)); done var p2 = v1.0.data; assert(v1.0.len.int > 0); // prevent infinite loop p = (p2 - p1).int+v1.0.len.int; yield Some v2; goto again; endoff:> return None[varray[string]]; }

This uses a whole lot of features of the Google RE2 system which are lifted into Felix, such as the data type StringPiece which is roughly a view of a string, and which I am not going to explain.

Rather the take away is that we can implement high level DSSL features based on a C/C++ library, by lifting the library API into Felix, more or less secretly, and then use them to implement the operations we actually want, with the syntax we actually want.

In many case the core system deliberately has features to support this kind of modelling technology. In the code above, the key piece of magic is the yield operation, which returns a value, but also saves the current location in the code, so that re-invoking the generator will restart the code with the same context and at the same point it previously left off.

Yielding Generators are a common feature of many languages in including Python and Rust. They are in fact a weak form of coroutine.