Building a High-Performance JSON Parser in Go with Egg

Egg: A Modern LL(1) Parser Generator for Go

Parsing is a staple of computer science, yet finding the right tool for the job in Go often feels like a choice between using a heavy, complex framework and writing a brittle parser by hand.

Enter egg (Expression Grammar Generator).

egg is a new parser generator that prioritizes simplicity, performance, and Go-native conventions. It takes an EBNF grammar—very similar to the one used in the Go language specification—and generates a fast, dependency-light recursive descent parser.

In this post, we’ll build a fully functional JSON parser from scratch to demonstrate how egg works, and we'll dive deep into its most controversial and powerful feature: its flat AST encoding.

The Grammar: Familiar and Concise

One of egg's strengths is that it doesn't invent a complex new syntax. If you've read the Go language spec, you already know how to write an egg grammar. It combines lexical definitions (regex and literals) and syntax definitions in a single file.

Let's look at json.ebnf. This single file defines both the structure (Syntax) and the tokens (Lexical analysis):

# json.ebnf

# --- Syntax ---
JSON   = Value .
Object = '{' [ Member { ',' Member } ] '}' .
Member = str ':' Value .
Array  = '[' [ Value { ',' Value } ] ']' .
Value  = str | number | Object | Array | "true" | "false" | "null" .

# --- Lexical Tokens ---
# Lowercase names define terminals.
# Raw strings (backticks) are regex patterns.

digit          = `[0-9]` .
digits         = `[0-9]+` .
non_zero_digit = `[1-9]` .
hex            = `[0-9a-fA-F]` .
fraction       = '.' digits .
exponent       = ( 'e' | 'E' ) [ '+' | '-' ] digits .

integer        = digit 
               | non_zero_digit digits 
               | '-' digit 
               | '-' non_zero_digit digits .

number         = ( integer 
                 | integer fraction 
                 | integer exponent 
                 | integer fraction exponent ) .

str            = '"' { `[^"\\\x00-\x1F]` | `\\u` hex hex hex hex | `\\["\\\/bfnrt]` } '"' .

# White space handling is built-in if you define this specific production
white_space    = `( |\t|\r|\n)+` .

Notice the lack of a separate "lexer" file. egg handles tokenization automatically based on your lower-case productions (like str, number, white_space).

Generating the Parser

With the grammar in hand, generating the Go code is a one-liner. egg also includes a handy feature to generate a test harness automatically.

$ egg -o parser.go -test parser_test.go -start JSON json.ebnf

This command tells egg to:

  1. Output the parser logic to parser.go.
  2. Create a test file parser_test.go (we'll see why this is cool later).
  3. Use JSON as the entry point (the root node).

After running go fmt, we have a ready-to-use parser package.

The Generated API

The generated Parser struct is straightforward. It exposes methods for your entry points and a generic Parse method:

func (p *Parser) Parse(name string, src []byte) (ast []int32, err error)

Wait, take a look at that return signature: ast []int32.

Where is the tree of structs? Where are the *Node pointers?

The "Flat" AST: Performance vs. Convention

This is where egg deviates from traditional parser generators like ANTLR or yacc. Instead of allocating thousands of small struct objects on the heap for every node in your syntax tree, egg encodes the entire Abstract Syntax Tree (AST) into a single, flat slice of integers ([]int32).

How it works

The slice encodes the tree structure sequentially:

  1. Non-Terminals: Encoded as a header [-SymbolID, Size].
    • The negative number identifies the production (e.g., Object, Array).
    • The Size tells you how many subsequent integers belong to this node.
  2. Terminals: Encoded as a positive integer N.
    • N is an index into the token stream. You can retrieve the actual text or position by calling p.Token(N).

Pros and Cons

Pros:

  • Cache Locality: Traversing a slice is incredibly fast for the CPU pre-fetcher compared to chasing pointers.
  • Garbage Collection: You generate one slice, not 10,000 objects. This drastically reduces GC pressure for high-throughput applications.
  • Serialization: The AST is just an array of numbers; it's trivial to serialize or send over a network.

Cons:

  • Ergonomics: You cannot simply do node.Left.Right. You must write a traversal function that interprets the integer stream.
  • Type Safety: The structure is implicit in the integer sequence, not enforced by Go types.

Traversing the AST

To use the AST, you write a recursive walker. Here is an example of how you might traverse the JSON AST we just generated to print the structure:

func walk(p *Parser, ast []int32, level int) {
    for len(ast) > 0 {
        val := ast[0]
        
        // If negative, it's a Non-Terminal (e.g., Object, Array)
        if val < 0 {
            symbolID := -val
            size := ast[1] // The size of this node's children in the slice
            
            fmt.Printf("%sNode: %s\n", strings.Repeat("  ", level), Symbol(symbolID))
            
            // Recursively walk the children (skipping the header)
            // Header is 2 ints: ID and Size
            children := ast[2 : 2+size]
            walk(p, children, level+1)
            
            // Advance the slice past this node
            ast = ast[2+size:]
        } else {
            // It's a Terminal (Token)
            tok := p.Token(val)
            fmt.Printf("%sToken: %q\n", strings.Repeat("  ", level), tok.Src())
            
            // Advance past the token index
            ast = ast[1:]
        }
    }
}

Built-in Testing and Benchmarking

egg cares about correctness and speed. The -test flag we used earlier generates a parser_test.go file that integrates with standard go test.

It allows you to run your parser against a suite of files using regex flags to define expectations:

  • -yes: Files matching this regex must parse successfully.
  • -no: Files matching this regex must fail.
  • -may: Files matching this regex may fail.

# Run tests against the standard JSONTestSuite
$ go test -v -yes '^y_' -no '^n_' -may '^i_' -- ./JSONTestSuite/test_parsing/*.json
=== RUN   Test
    parser_test.go:61: tested files: 318
--- PASS: Test (0.34s)

It even generates benchmarks automatically. Because egg produces LL(1) parsers (which are deterministic and non-backtracking) and uses the flat AST optimization, the performance is often excellent:

$ go test -bench . -- ./bench_data/*.json
Benchmark/citm_catalog.json-24     74   16133539 ns/op   107.06 MB/s   90 allocs/op

Note the extremely low allocation count (90 allocs/op) for parsing a large file. That is the power of the flat AST.

Summary

egg is an opinionated tool. It forces your grammar to be LL(1)—meaning no ambiguities and no left-recursion—and it gives you a raw, high-performance memory layout instead of a friendly object tree.
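The left-recursion restriction is the standard LL(1) one: a rule that refers to itself on the left must be rewritten into iterative form. For example:

```ebnf
# Left-recursive: not LL(1).
Expr = Expr '+' Term | Term .

# Equivalent repetition form: LL(1)-friendly.
Expr = Term { '+' Term } .
```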

However, for tasks where performance is critical and the grammar is well-defined (like JSON, configuration files, or simple expressions), egg offers a compelling blend of simplicity and raw speed.

Give it a try:

  1. Define grammar in standard EBNF.
  2. egg -o parser.go ...
  3. Enjoy fast, allocation-light parsing in pure Go.
