Posts

Showing posts from May, 2023

Producing a Go scanner in 1,219 bytes of code

modernc.org/rec is a regexp to Go code compiler tool. It is still a bit rough around the edges. For example, it converts the regexps to a DFA, but it does not yet support intersecting character classes ending up in the same DFA state. Anyway, rec can already handle some nontrivial tasks, like generating a usable, working Go scanner. Here are those 1,219 bytes - in scanner.sh . The shell script is used in the generate target of the Makefile . Note the Perl Unicode character classes in the regexp for Go indentifiers. The respective EBNF  lexical grammar   part of Go specification is identifier = letter { letter | unicode_digit } . The above production expands and translates to the regexp (\pL|_)(\pL|_|\p{Nd})* .  As mentioned above, constructing DFAs from regexps using character classes is a bit challenging per se when considering Unicode. A similar program, lx(1) , part of libfsm , seems to not support Unicode so far, possibly because facing similar difficulties. The resulting sca