Regular expressions are broken

Posted by Michał ‘mina86’ Nazarewicz on 28th of February 2021

Quick! What does re.search('foo|foobar', 'foobarbaz').group() produce? Or for those not fluent in Python, how about /foo|foobar/.exec('foobarbaz')? Or to put it into words, what part of string foobarbaz will a foo|foobar regular expression match?

foo foobar

Perhaps it’s just me, but I expected the result to be foobar. That is, for the regular expression to match the longest leftmost substring. Alas, that’s not what is happening. Instead, Python’s and JavaScript’s regex engine will only match foo prefix.

Knowing that, what does re.search('foobar|foo', 'foobarbaz').group() produce (notice the subexpressions in the alternation are swapped). This can be reasoned in two ways: either order of branches in the alternation doesn’t matter — in which case the result should be the same as before, i.e. foo — or it does matter — and now the result will be foobar.

A computer scientist might lean towards the first option but a software engineer will know it’s the second.

I’ve decided to check a whole bunch of tools, languages and libraries to see how they all behave. For those curious, code for all tests is of course available in a source repository. Various UNIX utilities passed the verification so long as they did not use Perl Compatible Regular Expressions (PCRE). Passed here means they produced the same result regardless of the order of the subexpressions in the alternation. Anything claiming to be using POSIX compatible engine (e.g. std::regex::extended form C++) worked correctly as well. The rest didn’t bode so well.

ToolResult
BusyBox 1.29 AWK, grep & sed✓ Pass, with and without -E
GNU AWK 5.1, grep 3.6 & sed 4.7✓ Pass, with and without -E
GNU grep 3.6 with -P (PCRE)✗ Failure
Emacs 28.0.50✗ Failure
C (regex.h, glibc 2.31)✓ Pass, with and without REG_EXTENDED
C++ (libstdc++ 10.2 & libc++ 10.0)✗ Failure when using the default (ECMAScript)
✓ Pass when using other formats
C++ RE2✗ Failure
Java (OpenJDK 11)✗ Failure
Rust (regex 1.4)✗ Failure
JavaScript Carakan (Opera 12.16)✗ Failure
JavaScript Chakra (Edge 90)✗ Failure
JavaScript SpiderMonkey (Firefox 78)✗ Failure
JavaScript V8 (Chromium 88, Opera 74)✗ Failure
CPython 2.7 & 3.9✗ Failure
PyPy 7.2 (Python 2 & 3)✗ Failure
Jython 2.7✗ Failure
PHP 8.0 & 5.6 preg✗ Failure
PHP 8.0 & 5.6 mb_ereg✗ Failure
PHP 5.6 ereg✓ Pass
Ruby 2.7✗ Failure

PHP’s ereg stands out with it passing the checks. That is to be expected since the ereg family of functions is based on regex.h. Sadly it has been removed in PHP 7.0 leaving only PCRE and mb_ereg left. The latter (which is a multibyte variant) does not share its namesake’s behaviour.

Rust’s regex crate is noteworthy for another reason. It includes a shortest_match method which turns out to be a lie. Rather than returning the shortest match it’s an optimisation returning position of the end of the matching substring. It’s a step between answering whether regex matched at all and returning full capture group information. While it may be useful, the name is confusing.

And there you have it. Another one to the ‘regular expressions are weird’ box giving some more credence to the old adage: ‘A man had a problem and decided to use a regex. Now he has two problems.’

Addendum

In March 2023, Andrew Gallant (author of Rust regex crate) pointed out that the described behaviour is intended and corresponds to how POSIX and Perl define regexes. He also mentioned advantages of the behaviour chosen in regex crate. This suggest that labelling it ‘broken’ is perhaps unfair. I maintain my opinion based on alternation being a commutative operation. Just like 3 + 6 and 6 + 3 should produce results which behave the same way, foo|foobar and foobar|foo should behave the same way.