Producing a Go parser from the language specification mechanically, mostly
There's yet another, possibly never-to-be-completed Go front end, modernc.org/gc/v3. This time I'm trying something new with respect to the Go parser. It takes three main steps, initially.
1. Extract the EBNF grammar from the language specification.
2. Convert the EBNF grammar to a PEG
The unmodified EBNF grammar is not a well-formed PEG grammar:
$ go test -v -run Spec
=== RUN TestSpecEBNF
all_test.go:68: left recursive: Expression
all_test.go:68: left recursive: PrimaryExpr
--- PASS: TestSpecEBNF (0.01s)
PASS
ok modernc.org/gc/v3/internal/ebnf 0.011s
$
- Remove the left recursion.
- Reorder the terms to obtain the correct parse tree.
- Rewrite selected parts of the grammar to get the backtracking on a large corpus of Go code down to something acceptable, say 10% on average.
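Concretely, the first two bullets amount to transformations like this one (a sketch; the actual peg.ebnf may differ). The spec defines Expression in terms of itself in the left position:

```
Expression = UnaryExpr | Expression binary_op Expression .
```

A recursive-descent PEG parser would recurse forever on such a rule; the standard fix replaces the self-reference with iteration:

```
Expression = UnaryExpr { binary_op Expression } .
```

Note that the iterative form parses the operator sequence flat, which is why reordering the terms is then needed to recover the intended parse tree.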
$ go test -v -run TestEBNFParser -src $HOME/src
=== RUN TestEBNFParser
=== RUN TestEBNFParser/src
=== RUN TestEBNFParser/goroot
=== CONT TestEBNFParser
all_test.go:201: TOTAL files 74,346, toks 322,032,687, skip 264, ok 74,082, fail 0
all_test.go:207: Max backtrack: /home/jnml/src/modernc.org/tcl/lib/tcl_darwin_amd64.go, 427 for 2,902,751 tokens
/home/jnml/src/modernc.org/tcl/lib/tcl_darwin_amd64.go:62198:9: - /home/jnml/src/modernc.org/tcl/lib/tcl_darwin_amd64.go:62300:56: (etc.go:1289:parseExpression:)
all_test.go:208: Max backtracks: /home/jnml/src/modernc.org/tcl/lib/tcl_windows_arm64.go, 2,104,073 for 3,328,330 tokens
all_test.go:209: Max budget used: /home/jnml/src/modernc.org/tcl/lib/tcl_windows_arm64.go, 5,432,402 for 3,328,330 tokens
all_test.go:210: Max duration: /home/jnml/src/modernc.org/tcl/lib/tcl_linux_riscv64.go, 8.045542843s for 2,861,458 tokens
--- PASS: TestEBNFParser (23.79s)
--- PASS: TestEBNFParser/src (21.45s)
--- PASS: TestEBNFParser/goroot (1.12s)
PASS
ok modernc.org/gc/v3/internal/ebnf 23.848s
$
Note: The current, linked peg.ebnf went through several rounds of editing. I don't know if the minimal working version, the one with the smallest diff compared to spec.ebnf, was actually committed to the tree. But getting there took only a dozen or two dozen changed lines, if I recall correctly. What I want to say is that it was not very complicated to find a shape that started passing tests. However, that initial version backtracked a lot. So much that some early tests at that time were getting OOM-killed on a smaller machine.
3. Write an EBNF to Go generator
It does not have to be general, perfect, or complete. It needs to handle only this one particular grammar. The unmodified output of the generator is in parser.go.
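The flavor of the generated code can be sketched as follows (a hand-simplified illustration with hypothetical type names, not the actual parser.go output): each production becomes a function that records the input position and rewinds it when an alternative fails.

```go
package main

import "fmt"

// parser is a sketch of the generated parser's core state: a token
// slice and an index that backtracking rewinds.
type parser struct {
	toks []string
	ix   int
}

// accept consumes the next token if it equals tok.
func (p *parser) accept(tok string) bool {
	if p.ix < len(p.toks) && p.toks[p.ix] == tok {
		p.ix++
		return true
	}
	return false
}

// incDecStmt parses IncDecStmt = Expression ( "++" | "--" ) . in PEG style:
// try the first alternative, rewind and try the next one on failure.
func (p *parser) incDecStmt() bool {
	save := p.ix
	if p.expression() && p.accept("++") {
		return true
	}
	p.ix = save // backtrack
	if p.expression() && p.accept("--") {
		return true
	}
	p.ix = save
	return false
}

// expression is a stand-in for the real production; here it accepts a
// single identifier token.
func (p *parser) expression() bool { return p.accept("x") }

func main() {
	p := &parser{toks: []string{"x", "--"}}
	fmt.Println(p.incDecStmt(), p.ix)
}
```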
$ go test -v -run TestParser -src $HOME/src
=== RUN TestParser
=== RUN TestParser/src
=== RUN TestParser/goroot
=== CONT TestParser
all_test.go:332: TOTAL files 74,346, toks 322,032,687, skip 264, ok 74,082, fail 0
all_test.go:338: Max backtrack: /home/jnml/src/modernc.org/tcl/lib/tcl_freebsd_amd64.go, 427 for 2,860,022 tokens
/home/jnml/src/modernc.org/tcl/lib/tcl_freebsd_amd64.go:49258:9: - /home/jnml/src/modernc.org/tcl/lib/tcl_freebsd_amd64.go:49360:56: (parser.go:2763:forClause:)
all_test.go:339: Max backtracks: /home/jnml/src/modernc.org/tcl/lib/tcl_windows_amd64.go, 1,125,853 for 3,328,330 tokens
all_test.go:340: Max budget used: /home/jnml/src/modernc.org/tcl/lib/tcl_windows_amd64.go, 4,454,183 for 3,328,330 tokens
all_test.go:341: Max duration: /home/jnml/src/modernc.org/tcl/lib/tcl_windows_amd64.go, 9.923364926s for 3,328,330 tokens
--- PASS: TestParser (46.22s)
--- PASS: TestParser/src (39.08s)
--- PASS: TestParser/goroot (4.06s)
PASS
ok modernc.org/gc/v3/internal/ebnf 46.356s
$
The next steps
The mechanically produced parser serves as a starting point for a more realistic parser that gets gradual manual tweaks. The goal is to make the parser faster while minimizing memory allocations. Currently, here are the results for parsing all of the stdlib in $GOROOT/src using the semi-mechanical parser vs. the manually written parser in go/parser:
$ make benchmarks
go test -v -run @ -bench . 2>&1 | tee log-benchmarks
goos: linux
goarch: amd64
pkg: modernc.org/gc/v3
cpu: AMD Ryzen 9 3900X 12-Core Processor
BenchmarkParser
BenchmarkParser-24 1 5008448499 ns/op 14.01 MB/s 1850669848 B/op 21444271 allocs/op
BenchmarkGoParser
BenchmarkGoParser-24 1 4591922743 ns/op 15.28 MB/s 917058152 B/op 21170807 allocs/op
PASS
ok modernc.org/gc/v3 9.607s
$
The numbers start to look promising. This proof of concept might work.
What about the backtracking?
The manually written Go parser in go/parser has no backtracking. It guarantees linear-time parsing - if I understood the code correctly. The generated parser seems to work well on average, even on quite a large corpus of real Go code, mostly consisting of projects on this list.
But backtracking is evil. Pathological Go source code can easily be crafted for which the generated parser will backtrack a lot - I believe exponentially so, in terms of the number of tokens on the input.
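For example (my own construction, not taken from the parser's test suite, and whether it triggers the worst case here is an assumption), deeply nested parenthesized expressions are a typical shape for stressing backtracking parsers, because every nesting level may be retried once per failed alternative:

```go
package main

import (
	"fmt"
	"strings"
)

// pathological returns a syntactically valid Go file whose expression
// nests n levels of parentheses - a typical stress shape for
// backtracking parsers.
func pathological(n int) string {
	return "package p\n\nvar x = " +
		strings.Repeat("(", n) + "0" + strings.Repeat(")", n) + "\n"
}

func main() {
	fmt.Print(pathological(3))
}
```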
So it is important to eventually get rid of any backtracking. For this task the generated parser can provide on-demand metrics of where exactly it backtracks, how many times, and how many tokens it rewinds. Parsing the stdlib again:
$ go test -v -run TestParser -report
=== RUN TestParser
=== RUN TestParser/src
=== RUN TestParser/goroot
=== CONT TestParser
all_test.go:536: TOTAL files 8,920, toks 12,646,287, skip 159, ok 8,761, fail 0
all_test.go:542: Max backtrack: /usr/local/go/src/bufio/bufio.go, 51 for 3,944 tokens
/usr/local/go/src/bufio/bufio.go:299:6: - /usr/local/go/src/bufio/bufio.go:299:102: (parser.go:2253:forClause:)
all_test.go:543: Max backtracks: /usr/local/go/src/cmd/compile/internal/ssa/rewriteARM.go, 6,210 for 119,072 tokens
all_test.go:544: Max budget used: /usr/local/go/src/cmd/compile/internal/ssa/rewriteAMD64.go, 216,159 for 213,057 tokens
all_test.go:545: Max duration: /usr/local/go/src/cmd/compile/internal/ssa/rewriteAMD64.go, 459.852504ms for 213,057 tokens
all_test.go:547:
parser.go:5215: 1 1 1.0
parser.go:1493: 2 6 3.0
<snip> ...
parser.go:5006: 15,548 60,796 3.9
parser.go:2253: 15,769 62,250 3.9
parser.go:4102: 16,304 52,341 3.2
parser.go:1486: 16,314 52,456 3.2
<total> 214,778 391,708 1.8
--- PASS: TestParser (1.32s)
--- PASS: TestParser/src (0.00s)
--- PASS: TestParser/goroot (1.26s)
PASS
ok modernc.org/gc/v3 1.333s
$
The total line tells us that for the 12,646,287 input tokens the parser backtracked 214,778 times (1.7%), rewound a total of 391,708 tokens (3.1%), and on average rewound 1.8 tokens per backtrack.
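The reporting itself can be sketched as two counters keyed by the source location of the backtracking production - the two numeric columns of the report above. This is a simplified illustration; the type and field names are hypothetical:

```go
package main

import "fmt"

// backtrackStats accumulates, per parser call site, how many times the
// parser backtracked there and how many tokens it rewound in total.
type backtrackStats struct {
	counts  map[string]int
	rewound map[string]int
}

func newBacktrackStats() *backtrackStats {
	return &backtrackStats{counts: map[string]int{}, rewound: map[string]int{}}
}

// record notes one backtrack at loc that rewound the input index from ix
// back to save.
func (s *backtrackStats) record(loc string, save, ix int) {
	s.counts[loc]++
	s.rewound[loc] += ix - save
}

func main() {
	s := newBacktrackStats()
	s.record("parser.go:2253", 10, 14)
	s.record("parser.go:2253", 20, 22)
	for loc, n := range s.counts {
		avg := float64(s.rewound[loc]) / float64(n)
		fmt.Printf("%s: %d backtracks, %d tokens rewound, %.1f avg\n",
			loc, n, s.rewound[loc], avg)
	}
}
```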
Here's the current worst offender: after fully parsing the Type production, the type may turn out to be part of a Conversion when a left parenthesis follows; otherwise it's part of some other production, for example a composite literal. In a hand-written parser you don't need to throw away the already parsed type - you can keep it and pass it around as necessary. One of the future tweaks of this parser will start right at this place.
Conclusions
It is possible, rather quickly and with surprisingly little effort, to produce a working Go parser, good enough for most real Go code, nearly directly from the EBNF that's present in the language specification and available to be mechanically extracted from it. This probably doesn't hold in the general case, but it may well work for other languages with similar parsing ambiguities that rule out the easy option of using yacc or similar tools.
The generated parser, and later its accompanying type checker, will first be used by the ccgo/v4 linker, replacing gc/v2. The pathological cases are not an issue for that particular task.
And with the help of the built-in reporting tool, there's hope to incrementally eliminate the backtracking once, or if, a need arises to use the parser for general Go code.