Producing a Go scanner in 1,219 bytes of code
modernc.org/rec is a tool that compiles regexps to Go code. It is still a bit rough around the edges: for example, it converts the regexps to a DFA, but it does not yet support intersecting character classes that end up in the same DFA state. Even so, rec can already handle some nontrivial tasks, like generating a usable, working Go scanner.
Here are those 1,219 bytes - in scanner.sh. The shell script is used in the generate target of the Makefile. Note the Perl Unicode character classes in the regexp for Go identifiers. The corresponding EBNF production from the lexical grammar in the Go specification is
identifier = letter { letter | unicode_digit } .
The above production expands and translates to the regexp (\pL|_)(\pL|_|\p{Nd})*. As mentioned above, constructing DFAs from regexps that use character classes is challenging in itself once Unicode enters the picture. A similar program, lx(1), part of libfsm, seems not to support Unicode so far, possibly because it faces similar difficulties.
The resulting scanner is here. The test of the scanner is in all_test.go.
Comparing the performance of the stdlib's hand-written Go scanner and this mechanically produced one:
~/src/modernc.org/rec$ go test -bench Scanner
goos: linux
goarch: amd64
pkg: modernc.org/rec
cpu: AMD Ryzen 9 3900X 12-Core Processor
BenchmarkScanner/stdlib-24 2 684787817 ns/op 105.51 MB/s 124280648 B/op 5206824 allocs/op
BenchmarkScanner/rec-24 1 1026476793 ns/op 70.39 MB/s 123613264 B/op 5595444 allocs/op
PASS
ok modernc.org/rec 9.462s
~/src/modernc.org/rec$
The generated scanner is slower by about one third. That may well be outweighed by the flexibility of using a generator when designing a new lexical grammar from scratch, while the grammar still changes often. In such cases the right time to manually optimize or rewrite the scanner comes later, if it comes at all. Many realistic programs spend more time parsing and type checking Go source code than turning input files into tokens.