Building a High-Performance JSON Parser in Go with Egg

Egg: A Modern LL(1) Parser Generator for Go

Parsing is a staple of computer science, yet finding the right tool for the job in Go often feels like a choice between using a heavy, complex framework and writing a brittle parser by hand.

Enter egg (Expression Grammar Generator).

egg is a new parser generator that prioritizes simplicity, performance, and Go-native conventions. It takes an EBNF grammar—very similar to the one used in the Go language specification—and generates a fast, dependency-light recursive descent parser.

In this post, we’ll build a fully functional JSON parser from scratch to demonstrate how egg works, and we'll dive deep into its most controversial and powerful feature: its flat AST encoding.

The Grammar: Familiar and Concise

One of egg's strengths is that it doesn't invent a complex new syntax. If you've read the Go language spec, you already know how to write an egg grammar. It combines lexical definitions (regex and literals) and syntax definitions in a single file.

Let's look at json.ebnf. This single file defines both the structure (Syntax) and the tokens (Lexical analysis):

# json.ebnf

# --- Syntax ---
JSON   = Value .
Object = '{' [ Member { ',' Member } ] '}' .
Member = str ':' Value .
Array  = '[' [ Value { ',' Value } ] ']' .
Value  = str | number | Object | Array | "true" | "false" | "null" .

# --- Lexical Tokens ---
# Lowercase names define terminals.
# Raw strings (backticks) are regex patterns.

digit          = `[0-9]` .
digits         = `[0-9]+` .
non_zero_digit = `[1-9]` .
hex            = `[0-9a-fA-F]` .
fraction       = '.' digits .
exponent       = ( 'e' | 'E' ) [ '+' | '-' ] digits .

integer        = digit 
               | non_zero_digit digits 
               | '-' digit 
               | '-' non_zero_digit digits .

number         = ( integer 
                 | integer fraction 
                 | integer exponent 
                 | integer fraction exponent ) .

str            = '"' { `[^"\\\x00-\x1F]` | `\\u` hex hex hex hex | `\\["\\\/bfnrt]` } '"' .

# White space handling is built-in if you define this specific production
white_space    = `( |\t|\r|\n)+` .

Notice the lack of a separate "lexer" file. egg handles tokenization automatically based on your lower-case productions (like str, number, white_space).

Generating the Parser

With the grammar in hand, generating the Go code is a one-liner. egg also includes a handy feature to generate a test harness automatically.

$ egg -o parser.go -test parser_test.go -start JSON json.ebnf

This command tells egg to:

  1. Output the parser logic to parser.go.
  2. Create a test file parser_test.go (we'll see why this is cool later).
  3. Use JSON as the entry point (the root node).

After running go fmt, we have a ready-to-use parser package.

The Generated API

The generated Parser struct is straightforward. It exposes methods for your entry points and a generic Parse method:

func (p *Parser) Parse(name string, src []byte) (ast []int32, err error)

Wait, take a look at that return signature: ast []int32.

Where is the tree of structs? Where are the *Node pointers?

The "Flat" AST: Performance vs. Convention

This is where egg deviates from traditional parser generators like ANTLR or yacc. Instead of allocating thousands of small struct objects on the heap for every node in your syntax tree, egg encodes the entire Abstract Syntax Tree (AST) into a single, flat slice of integers ([]int32).

How it works

The slice encodes the tree structure sequentially:

  1. Non-Terminals: Encoded as a header [-SymbolID, Size].
    • The negative number identifies the production (e.g., Object, Array).
    • The Size tells you how many subsequent integers belong to this node.
  2. Terminals: Encoded as a positive integer N.
    • N is an index into the token stream. You can retrieve the actual text or position by calling p.Token(N).

Pros and Cons

Pros:

  • Cache Locality: Traversing a slice is incredibly fast for the CPU pre-fetcher compared to chasing pointers.
  • Garbage Collection: You generate one slice, not 10,000 objects. This drastically reduces GC pressure for high-throughput applications.
  • Serialization: The AST is just an array of numbers; it's trivial to serialize or send over a network.

Cons:

  • Ergonomics: You cannot simply do node.Left.Right. You must write a traversal function that interprets the integer stream.
  • Type Safety: The structure is implicit in the integer sequence, not enforced by Go types.

Traversing the AST

To use the AST, you write a recursive walker. Here is an example of how you might traverse the JSON AST we just generated to print the structure:

func walk(p *Parser, ast []int32, level int) {
    for len(ast) > 0 {
        val := ast[0]
        
        // If negative, it's a Non-Terminal (e.g., Object, Array)
        if val < 0 {
            symbolID := -val
            size := ast[1] // The size of this node's children in the slice
            
            fmt.Printf("%sNode: %s\n", strings.Repeat("  ", level), Symbol(symbolID))
            
            // Recursively walk the children (skipping the header)
            // Header is 2 ints: ID and Size
            children := ast[2 : 2+size]
            walk(p, children, level+1)
            
            // Advance the slice past this node
            ast = ast[2+size:]
        } else {
            // It's a Terminal (Token)
            tok := p.Token(val)
            fmt.Printf("%sToken: %q\n", strings.Repeat("  ", level), tok.Src())
            
            // Advance past the token index
            ast = ast[1:]
        }
    }
}

Built-in Testing and Benchmarking

egg cares about correctness and speed. The -test flag we used earlier generates a parser_test.go file that integrates with standard go test.

It allows you to run your parser against a suite of files using regex flags to define expectations:

  • -yes: Files matching this regex must parse successfully.
  • -no: Files matching this regex must fail.
  • -may: Files matching this regex may fail.

# Run tests against the standard JSONTestSuite
$ go test -v -yes '^y_' -no '^n_' -may '^i_' -- ./JSONTestSuite/test_parsing/*.json
=== RUN   Test
    parser_test.go:61: tested files: 318
--- PASS: Test (0.34s)

It even generates benchmarks automatically. Because egg produces LL(1) parsers (which are deterministic and non-backtracking) and uses the flat AST optimization, the performance is often excellent:

$ go test -bench . -- ./bench_data/*.json
Benchmark/citm_catalog.json-24     74   16133539 ns/op   107.06 MB/s   90 allocs/op

Note the extremely low allocation count (90 allocs/op) for parsing a large file. That is the power of the flat AST.

Summary

egg is an opinionated tool. It forces your grammar to be LL(1)—meaning no ambiguities and no left-recursion—and it gives you a raw, high-performance memory layout instead of a friendly object tree.
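The left-recursion restriction is the standard LL(1) one: a rule that refers to itself on the left must be rewritten into iterative form. For example:

```ebnf
# Left-recursive: not LL(1).
Expr = Expr '+' Term | Term .

# Equivalent repetition form: LL(1)-friendly.
Expr = Term { '+' Term } .
```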

However, for tasks where performance is critical and the grammar is well-defined (like JSON, configuration files, or simple expressions), egg offers a compelling blend of simplicity and raw speed.

Give it a try:

  1. Define grammar in standard EBNF.
  2. egg -o parser.go ...
  3. Enjoy fast, allocation-light parsing in pure Go.
