Regex Efficiency

Considerations around writing efficient regex

Here are a few ideas to keep in mind when writing regex. If you have not already, please read "Using Regular Expressions (regex) in SiteSpect" first.

Use Character Classes instead of Lazy wilds when possible

Let's say you have

<div class="structure prominent" data-myattr="1234">some content</div>

and you want to capture the class or classes. The following negated character class is more efficient

<div class="([^"]+)"

than using a lazy wild

<div class="(.*?)"

Similarly, you can use something like this to match an unknown number of classes, any other attributes, and the text inside the div tag:

<div class="structure[^"]*"[^>]*>[^<]+</div>

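This will be more efficient than using lazy wilds. A negated character class can consume every character up to the delimiter in one pass, whereas a lazy wild has to stop after each character and test whether the rest of the pattern matches. A negated character class works well when you know the single character that marks where the match should stop.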
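As a quick sanity check outside SiteSpect, the patterns above can be run against the sample div. This is only a minimal sketch using Python's re module, which is an assumption for illustration and is not necessarily the same regex engine SiteSpect uses; the variable names are hypothetical.

```python
import re

# Sample markup from the example above.
html = '<div class="structure prominent" data-myattr="1234">some content</div>'

# Negated character class: consume everything up to the closing quote.
negated = re.compile(r'<div class="([^"]+)"')

# Lazy wild: same capture here, but the engine checks for the quote after every character.
lazy = re.compile(r'<div class="(.*?)"')

negated_classes = negated.search(html).group(1)
lazy_classes = lazy.search(html).group(1)

# Whole-tag pattern: unknown extra classes, other attributes, then the text content.
whole_div = re.compile(r'<div class="structure[^"]*"[^>]*>[^<]+</div>')
whole_match = whole_div.search(html).group(0)

print(negated_classes)
print(lazy_classes)
print(whole_match)
```

Both capture patterns return the same text for this input, so the choice between them is about efficiency rather than the result.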

Don't use Lazy when the last possible match is more than halfway to the end

This comes into play when you want to match from a pattern near the top of the page source, such as

</head>

all the way to a pattern near the bottom, such as

</body>

In that case, use a greedy regex to reduce unnecessary backtracking:

(</head>.*)</body>

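A lazy .*? would also match, but it would cause additional backtracking: the engine extends the match one character at a time and tests for </body> at every step, while the greedy version jumps to the end of the page and only has to back up a short distance.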
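To get a rough feel for the difference outside SiteSpect, the two forms can be timed against a synthetic page. This is a hedged sketch only: it uses Python's re and timeit modules rather than SiteSpect's engine, the page content is made up, and re.DOTALL is an assumption added so that . crosses line breaks in Python. Actual timings will vary by engine and page, so none are quoted here.

```python
import re
import timeit

# Synthetic page (hypothetical content): a short head and a long body.
body = "<p>filler</p>\n" * 5000
page = "<html><head><title>t</title></head>\n<body>\n" + body + "</body></html>"

# re.DOTALL lets "." match newlines in Python; an assumption for this sketch.
greedy = re.compile(r'(</head>.*)</body>', re.DOTALL)
lazy = re.compile(r'(</head>.*?)</body>', re.DOTALL)

# With a single </body> on the page, both forms capture the same text.
greedy_capture = greedy.search(page).group(1)
lazy_capture = lazy.search(page).group(1)

# Time 20 searches of each form; absolute numbers depend on the machine.
greedy_seconds = timeit.timeit(lambda: greedy.search(page), number=20)
lazy_seconds = timeit.timeit(lambda: lazy.search(page), number=20)

print(f"greedy: {greedy_seconds:.4f}s  lazy: {lazy_seconds:.4f}s")
```

Treat the output as a relative comparison at most; the debugging information in a SiteSpect Preview Session remains the place to measure how a Variation actually performs.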

How do I see and compare my regex performance in SiteSpect?

Enable comprehensive debugging information to see the number of milliseconds it takes for each Variation to run while in a Preview Session.

What other tools can help?

External regex tools

Other interesting reading

https://www.loggly.com/blog/five-invaluable-techniques-to-improve-regex-performance/