Taming the Regex Monster: Optimizing Massive Literal Alternations
A recent discussion on the Gophers Slack caught my eye: how Go's standard library regexp package performs when handling massive alternations of literal strings.
The user posted the following observation:
"I've been looking at why a certain regexp is slow, and I find that anything with over 500 instructions (maxBacktrackProg) gets run by the NFA engine. However every literal character in the regex consumes an instruction, so this penalizes simple regexps which contain a bunch of long strings."
The "Monster" Regex
The user provided an obfuscated version of the regular expression causing the bottleneck. It is essentially a massive list of pipe-separated literals, wrapped in a dot-star sandwich to search for substrings:
^(?s:.*zQPbMkNO.*|.*NNSPdvMi.*|.*iWuuSoAl.*|...many more...|.*wYtnhdYb.*)$
Because the standard library counts every literal character as an instruction, this regex exceeds the 500-instruction limit (maxBacktrackProg) for the backtracking engine and is executed by the slower NFA engine instead, resulting in poor matching performance.
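You can see this for yourself by parsing and compiling the pattern with the regexp/syntax package and counting the resulting instructions. A minimal sketch, using only a handful of the literals as a stand-in for the full pattern:

```go
package main

import (
	"fmt"
	"regexp/syntax"
)

func main() {
	// Stand-in for the real pattern: only four of the many 8-character literals.
	pattern := `^(?s:.*zQPbMkNO.*|.*NNSPdvMi.*|.*iWuuSoAl.*|.*wYtnhdYb.*)$`

	re, err := syntax.Parse(pattern, syntax.Perl)
	if err != nil {
		panic(err)
	}
	prog, err := syntax.Compile(re.Simplify())
	if err != nil {
		panic(err)
	}
	// Every literal character becomes roughly one instruction, so hundreds of
	// 8-character alternatives easily push this past 500 (maxBacktrackProg).
	fmt.Println("instructions:", len(prog.Inst))
}
```

Scale the alternation up to the real list and the count lands far above the 500-instruction limit.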
The DFA Approach
As the maintainer of modernc.org/regexp, a drop-in replacement for the standard library that partially supports DFA (Deterministic Finite Automaton) compilation, I was curious to see if we could squeeze better performance out of this specific scenario.
There was a catch, however. If we feed the original regex (with .* surrounding every term) into a DFA compiler, the number of states explodes prohibitively. To solve this, I simplified the regex to just the alternations:
zQPbMkNO|NNSPdvMi|iWuuSoAl|...
I then implemented a Proof-of-Concept VM that handles the "search" aspect (the leading and trailing wildcards) manually, feeding the stream into the DFA generated by modernc.org/regexp. This keeps the DFA state count manageable while preserving the matching semantics of the original regex.
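The PoC VM itself lives in the snippet linked below, so I won't reproduce it here, but the core idea can be sketched with the public API alone (assuming the drop-in package mirrors the standard library's interface): a match of the original anchored regex is exactly an unanchored substring search for the bare alternation.

```go
package main

import (
	"fmt"

	"modernc.org/regexp" // drop-in replacement for the standard library
)

func main() {
	// The wildcards never become part of the compiled automaton; matching
	// ^(?s:.*A.*|.*B.*|...)$ is equivalent to an unanchored search for A|B|...
	re := regexp.MustCompile(`zQPbMkNO|NNSPdvMi|iWuuSoAl|wYtnhdYb`)

	haystack := "some long input ... iWuuSoAl ... more input"
	fmt.Println(re.MatchString(haystack)) // true: one of the literals occurs as a substring
}
```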
The Benchmark Results
I ran a benchmark comparing the standard library (regexp) against the modernc DFA implementation. The tests were run on an AMD Ryzen 9 3900X.
The Protocol:
- Compile: Building the regex object.
- NonMatch: Scanning a long string that does not contain the targets.
- Match: Scanning a long string that does contain the target.
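The harness looks roughly like this (a sketch; the names, input sizes, and test corpus are illustrative, the real code is in the snippet linked below):

```go
package bench

import (
	"regexp" // swap for "modernc.org/regexp" to benchmark the DFA engine
	"strings"
	"testing"
)

// Stand-in pattern; the real one contains many more literals.
const pattern = `^(?s:.*zQPbMkNO.*|.*NNSPdvMi.*|.*iWuuSoAl.*|.*wYtnhdYb.*)$`

var (
	nonMatching = strings.Repeat("x", 1<<20) // no target present
	matching    = strings.Repeat("x", 1<<19) + "iWuuSoAl" + strings.Repeat("x", 1<<19)
)

func BenchmarkCompile(b *testing.B) {
	for i := 0; i < b.N; i++ {
		regexp.MustCompile(pattern)
	}
}

func BenchmarkNonMatch(b *testing.B) {
	re := regexp.MustCompile(pattern)
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		re.MatchString(nonMatching)
	}
}

func BenchmarkMatch(b *testing.B) {
	re := regexp.MustCompile(pattern)
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		re.MatchString(matching)
	}
}
```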
Here are the numbers:
| Benchmark | Stdlib (ns/op) | Modernc DFA (ns/op) | Delta |
|---|---|---|---|
| Compile | 78,101 | 15,747,807 | ~200x Slower |
| NonMatch | 2,097,923 | 74,867 | ~28x Faster |
| Match | 1,050,834 | 36,686 | ~28x Faster |
Analysis
The trade-offs here are distinct and fascinating:
- Compilation Cost: The DFA approach is significantly more expensive to compile (16ms vs 78µs). Calculating the deterministic states takes time and memory.
- Execution Speed: Once compiled, the DFA obliterates the standard library's NFA performance for this use case. We see a speedup of roughly 28x for both matching and non-matching scenarios.
Conclusion
If your application compiles regexes on the fly or runs them only once, the standard library is likely the better choice due to its fast compilation times. However, for long-running server applications where a complex regex is compiled once at startup and executed millions of times against incoming data, paying the "compile tax" upfront with modernc.org/regexp can yield massive throughput improvements.
You can find the full benchmark code and the VM implementation in this GitLab snippet: gitlab.com/-/snippets/4931863.
Update: The "Rob Pike" Test
I announced this post on Reddit, and I was honored to receive a reply from Rob Pike himself. He correctly pointed out that for a pure alternation of literal strings, using a regular expression is technically the "wrong tool."
"You can delete the anchors and the .*s. Then you have a pattern that is literally just an alternation of many literal strings... And for that there are much better algorithms than regular expressions [like Aho-Corasick]." — Rob Pike
He is absolutely right. If you are only searching for a fixed list of words, algorithms like Aho-Corasick are vastly superior. But this got me thinking: What happens when we leave the domain of finite sets?
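Before leaving the finite case: the non-regex path Rob is pointing at can be as simple as a multi-literal scan. Here is a trivial sketch; Aho-Corasick is essentially this idea compiled into a trie automaton that needs only a single pass over the input.

```go
package main

import (
	"fmt"
	"strings"
)

// containsAny reports whether any keyword occurs in input. This naive loop is
// O(len(keywords) × len(input)); Aho-Corasick builds an automaton over the
// keywords and does the same job in one pass over the input.
func containsAny(input string, keywords []string) bool {
	for _, kw := range keywords {
		if strings.Contains(input, kw) {
			return true
		}
	}
	return false
}

func main() {
	keywords := []string{"zQPbMkNO", "NNSPdvMi", "iWuuSoAl", "wYtnhdYb"}
	fmt.Println(containsAny("...iWuuSoAl...", keywords)) // true
}
```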
I decided to run a new experiment. I took the same massive list of literals but appended a dynamic, infinite-set component to the end of the regex. This forces the engine to handle logic that standard string-search algorithms (like Aho-Corasick) cannot natively process.
// The new regex: 100+ literals OR a complex pattern
const regex = `...|wYtnhdYb|\pL\p{Nd}*Z?4-?`
Here, \p{Nd}* (zero or more Unicode digits) creates an infinite set of potential matches, effectively disabling the "fixed string set" optimization strategies.
The Results
Even in this "mixed" scenario where advanced string-search algorithms no longer apply, the DFA approach (using modernc.org/regexp) maintained a significant lead over the standard library's NFA implementation.
| Engine | Time per Op (ns) | Speedup |
|---|---|---|
| Standard Library (NFA) | 1,096,599 | 1x |
| Modernc (DFA) | 226,928 | ~4.8x |
While the gap isn't as massive as the ~28x seen in the pure literal case, a 4.8x speedup is substantial.
Conclusion: Rob is right—use Aho-Corasick for pure keywords. But if your pattern mixes massive keyword lists with dynamic regex logic (like \w+ or \d*), a DFA engine may sometimes offer a powerful "middle ground" optimization that the standard library currently misses.
The updated experiment code: gitlab.com/-/snippets/4932009.