Taming the Regex Monster: Optimizing Massive Literal Alternations
A recent discussion on the Gophers Slack caught my eye regarding the performance of Go's standard library regexp package when handling massive alternations of literal strings. The user posted the following observation: "I've been looking at why a certain regexp is slow, and I find that anything with over 500 instructions (maxBacktrackProg) gets run by the NFA engine. However every literal character in the regex consumes an instruction, so this penalizes simple regexps which contain a bunch of long strings." The "Monster" Regex The user provided an obfusctaed version of the regular expression causing the bottleneck. It is essentially a massive list of pipe-separated literals, wrapped in a dot-star sandwich to search for substrings: ^(?s:.*zQPbMkNO.*|.*NNSPdvMi.*|.*iWuuSoAl.*|...many more...|.*wYtnhdYb.*)$ Because the standard library counts every literal character as an instruction, this regex exceeds the threshold for the efficient One-Pass...