Producing a Go parser from the language specification mechanically, mostly

There's yet another, possibly never-to-be-completed Go front end, modernc.org/gc/v3. This time I'm trying something new with respect to the Go parser. It takes three main steps, initially.

1. Extract the EBNF specification from the language specs.

The unmodified EBNF grammar is not a well-formed PEG grammar:

$ go test -v -run Spec
=== RUN   TestSpecEBNF
    all_test.go:68: left recursive: Expression
    all_test.go:68: left recursive: PrimaryExpr
--- PASS: TestSpecEBNF (0.01s)
PASS
ok  modernc.org/gc/v3/internal/ebnf 0.011s

2. Manually rewrite spec.ebnf to peg.ebnf

The goals are to
  • Remove the left recursion
  • Reorder the terms to obtain the correct parse tree.
  • Rewrite selected parts of the grammar to get the backtracing on a large corpus of Go code to something like, say acceptable* 10% on average.
(* - acceptable as a starting point)

For the last part a PEG, actually in this case an 
EBNF interpreter is needed.

To clarify, a particular PEG grammar can be used to generate a particular as well parser of that grammar in some language X.

But a PEG grammar, or in this case the EBNF grammar, a subset of the former, can be also easily interpreted at runtime. So you have one parser that you supply any grammar at runtime and it produces a parse tree for that supplied grammar and any token stream. Which is the case linked just above.

$ go test -v -run TestEBNFParser -src $HOME/src
=== RUN   TestEBNFParser
=== RUN   TestEBNFParser/src
=== RUN   TestEBNFParser/goroot
=== CONT  TestEBNFParser
    all_test.go:201: TOTAL files 74,346, toks 322,032,687, skip 264, ok 74,082, fail 0
    all_test.go:207: Max backtrack: /home/jnml/src/modernc.org/tcl/lib/tcl_darwin_amd64.go, 427 for 2,902,751 tokens
         /home/jnml/src/modernc.org/tcl/lib/tcl_darwin_amd64.go:62198:9: - /home/jnml/src/modernc.org/tcl/lib/tcl_darwin_amd64.go:62300:56: (etc.go:1289:parseExpression:)
    all_test.go:208: Max backtracks: /home/jnml/src/modernc.org/tcl/lib/tcl_windows_arm64.go, 2,104,073 for 3,328,330 tokens
    all_test.go:209: Max budget used: /home/jnml/src/modernc.org/tcl/lib/tcl_windows_arm64.go, 5,432,402 for 3,328,330 tokens
    all_test.go:210: Max duration: /home/jnml/src/modernc.org/tcl/lib/tcl_linux_riscv64.go, 8.045542843s for 2,861,458 tokens
--- PASS: TestEBNFParser (23.79s)
    --- PASS: TestEBNFParser/src (21.45s)
    --- PASS: TestEBNFParser/goroot (1.12s)
PASS
ok  modernc.org/gc/v3/internal/ebnf 23.848s

Note: The current, linked peg.ebnf went through several rounds of editing. I don't know if the minimal working version that has the smallest diff compared to spec.ebnf was actually committed to the tree. But to get there, it was only about a dozen or two dozens of changed lines, if I recall correctly. What I want to say - it was not very complicated to find a shape that started passing tests. However, such initial version had a lot of backtracking. So much that some early tests at that time were getting OOM killed on a smaller machine.

3. Write an EBNF to Go generator

It does not have to be general, perfect and complete. It needs to handle this one particular grammar only. The unmodified result of the generator is in parser.go.

$ go test -v -run TestParser -src $HOME/src
=== RUN   TestParser
=== RUN   TestParser/src
=== RUN   TestParser/goroot
=== CONT  TestParser
    all_test.go:332: TOTAL files 74,346, toks 322,032,687, skip 264, ok 74,082, fail 0
    all_test.go:338: Max backtrack: /home/jnml/src/modernc.org/tcl/lib/tcl_freebsd_amd64.go, 427 for 2,860,022 tokens
         /home/jnml/src/modernc.org/tcl/lib/tcl_freebsd_amd64.go:49258:9: - /home/jnml/src/modernc.org/tcl/lib/tcl_freebsd_amd64.go:49360:56: (parser.go:2763:forClause:)
    all_test.go:339: Max backtracks: /home/jnml/src/modernc.org/tcl/lib/tcl_windows_amd64.go, 1,125,853 for 3,328,330 tokens
    all_test.go:340: Max budget used: /home/jnml/src/modernc.org/tcl/lib/tcl_windows_amd64.go, 4,454,183 for 3,328,330 tokens
    all_test.go:341: Max duration: /home/jnml/src/modernc.org/tcl/lib/tcl_windows_amd64.go, 9.923364926s for 3,328,330 tokens
--- PASS: TestParser (46.22s)
    --- PASS: TestParser/src (39.08s)
    --- PASS: TestParser/goroot (4.06s)
PASS
ok  modernc.org/gc/v3/internal/ebnf 46.356s

Backtracking is now a bit better because the generated parser also peeks at one more look ahead symbol in some situations.

The next steps

The mechanically produced parser serves as a starting point for a more realistic parser that gets gradual manual tweaks. The goal is to make the parser faster while minimizing memory allocations. Currently, here are the results for parsing all of the stdlib in $GOROOT/src using the semi-mechanical parser vs. the manually written parser in go/parser:

$ make benchmarks
go test -v -run @ -bench . 2>&1 | tee log-benchmarks
goos: linux
goarch: amd64
pkg: modernc.org/gc/v3
cpu: AMD Ryzen 9 3900X 12-Core Processor           
BenchmarkParser
BenchmarkParser-24             1 5008448499 ns/op   14.01 MB/s 1850669848 B/op 21444271 allocs/op
BenchmarkGoParser
BenchmarkGoParser-24           1 4591922743 ns/op   15.28 MB/s 917058152 B/op 21170807 allocs/op
PASS
ok  modernc.org/gc/v3 9.607s
$

The numbers start to look promising. This proof of concept might work.

What about the backtracking?

The manually written Go parser in go/parser has no backtracking. It guarantees linear time parsing - if I understood the code correctly. The generated parser seems to work well on average, even on a quite large corpus of real Go code, mostly consisting of projects on this list.

But the backtracking is evil. A pathological Go source code can be easily crafted, for which the generated parser will backtrack a lot, and I believe exponentially so, in terms of the number of tokens on the input.

So it is important to get rid of any backtracking, eventually. For this task the generated parser can provide on-demand metrics of where exactly the parser backtracks, how many times and how many tokens it rewinds. Parsing the stdlib again:

$ go test -v -run TestParser -report
=== RUN   TestParser
=== RUN   TestParser/src
=== RUN   TestParser/goroot
=== CONT  TestParser
    all_test.go:536: TOTAL files 8,920, toks 12,646,287, skip 159, ok 8,761, fail 0
    all_test.go:542: Max backtrack: /usr/local/go/src/bufio/bufio.go, 51 for 3,944 tokens
         /usr/local/go/src/bufio/bufio.go:299:6: - /usr/local/go/src/bufio/bufio.go:299:102: (parser.go:2253:forClause:)
    all_test.go:543: Max backtracks: /usr/local/go/src/cmd/compile/internal/ssa/rewriteARM.go, 6,210 for 119,072 tokens
    all_test.go:544: Max budget used: /usr/local/go/src/cmd/compile/internal/ssa/rewriteAMD64.go, 216,159 for 213,057 tokens
    all_test.go:545: Max duration: /usr/local/go/src/cmd/compile/internal/ssa/rewriteAMD64.go, 459.852504ms for 213,057 tokens
    all_test.go:547: 
        parser.go:5215:                1                1      1.0
        parser.go:1493:                2                6      3.0
<snip> ...
        parser.go:5006:           15,548           60,796      3.9
        parser.go:2253:           15,769           62,250      3.9
        parser.go:4102:           16,304           52,341      3.2
        parser.go:1486:           16,314           52,456      3.2
        <total>          214,778          391,708      1.8
--- PASS: TestParser (1.32s)
    --- PASS: TestParser/src (0.00s)
    --- PASS: TestParser/goroot (1.26s)
PASS
ok  modernc.org/gc/v3 1.333s

The total line tells us that for the 12,646,287 input tokens the parser backtracked 214,778 times  (1,7%), rewinded a total of 391,708 tokens (3,1%) and on average there were 1.8 tokens rewinded.

Here's the current worst offender. After fully parsing the type production, it may be a part of a conversion when a left paren follows. Otherwise it's a part of some different production. Like for example a composite literal. In a hand written parser you don't need to throw away the already parsed type, you can keep it and pass it around as necessary. One of the future tweaks of this parser will start right at this place.

Conclusions

It is possible rather quickly - and with surprisingly little effort - to have a working Go parser, good enough for most real Go code, produced nearly directly from the EBNF that's present - and available to be mechanically extracted - in the language specification. Probably not in the general case, but likely for some other languages with similar parsing ambiguities, ruling out an easy option to use yacc or similar tools, this may work as well.

The generated parser, and later its accompanying type checker, will be used firstly by the ccgo/v4 linker, replacing gc/v2. The pathological cases are not an issue for that particular task 

And with the help from the built-in reporting tool, there's hope to incrementally eliminate the backtracking once/if a need arises to use the parser for general Go code.

Comments

Popular posts from this blog

ccgo/v4 experiment: Trying the new runtime.Pinner

Producing a Go scanner in 1,219 bytes of code