Regular expressions are broken
Posted by Michał ‘mina86’ Nazarewicz on 28th of February 2021
Quick! What does re.search('foo|foobar', 'foobarbaz').group()
produce? Or for those not fluent in Python, how about /foo|foobar/.exec('foobarbaz')
? Or to put it into words, what part of string foobarbaz
will a foo|foobar
regular expression match?
Perhaps it’s just me, but I expected the result to be foobar
. That is, for the regular expression to match the longest leftmost substring. Alas, that’s not what is happening. Instead, Python’s and JavaScript’s regex engine will only match foo
prefix.
Knowing that, what does re.search('foobar|foo', 'foobarbaz').group()
produce (notice the subexpressions in the alternation are swapped). This can be reasoned in two ways: either order of branches in the alternation doesn’t matter — in which case the result should be the same as before, i.e. foo
— or it does matter — and now the result will be foobar
.
A computer scientist might lean towards the first option but a software engineer will know it’s the second.
I’ve decided to check a whole bunch of tools, languages and libraries to see how they all behave. For those curious, code for all tests is of course available in a source repository. Various UNIX utilities passed the verification so long as they did not use Perl Compatible Regular Expressions (PCRE). Passed here means they produced the same result regardless of the order of the subexpressions in the alternation. Anything claiming to be using POSIX compatible engine (e.g. std::regex::extended
form C++) worked correctly as well. The rest didn’t bode so well.
Tool | Result |
---|---|
BusyBox 1.29 AWK, grep & sed | ✓ Pass, with and without -E |
GNU AWK 5.1, grep 3.6 & sed 4.7 | ✓ Pass, with and without -E |
GNU grep 3.6 with -P (PCRE) | ✗ Failure |
Emacs 28.0.50 | ✗ Failure |
C (regex.h , glibc 2.31) | ✓ Pass, with and without REG_EXTENDED |
C++ (libstdc++ 10.2 & libc++ 10.0) | ✗ Failure when using the default (ECMAScript )✓ Pass when using other formats |
C++ RE2 | ✗ Failure |
Java (OpenJDK 11) | ✗ Failure |
Rust (regex 1.4) | ✗ Failure |
JavaScript Carakan (Opera 12.16) | ✗ Failure |
JavaScript Chakra (Edge 90) | ✗ Failure |
JavaScript SpiderMonkey (Firefox 78) | ✗ Failure |
JavaScript V8 (Chromium 88, Opera 74) | ✗ Failure |
CPython 2.7 & 3.9 | ✗ Failure |
PyPy 7.2 (Python 2 & 3) | ✗ Failure |
Jython 2.7 | ✗ Failure |
PHP 8.0 & 5.6 preg | ✗ Failure |
PHP 8.0 & 5.6 mb_ereg | ✗ Failure |
PHP 5.6 ereg | ✓ Pass |
Ruby 2.7 | ✗ Failure |
PHP’s ereg
stands out with it passing the checks. That is to be expected since the ereg
family of functions is based on regex.h
. Sadly it has been removed in PHP 7.0 leaving only PCRE and mb_ereg
left. The latter (which is a multibyte variant) does not share its namesake’s behaviour.
Rust’s regex
crate is noteworthy for another reason. It includes a shortest_match
method which turns out to be a lie. Rather than returning the shortest match it’s an optimisation returning position of the end of the matching substring. It’s a step between answering whether regex matched at all and returning full capture group information. While it may be useful, the name is confusing.
And there you have it. Another one to the ‘regular expressions are weird’ box giving some more credence to the old adage: ‘A man had a problem and decided to use a regex. Now he has two problems.’
Addendum
In March 2023, Andrew Gallant (author of Rust regex crate) pointed out that the described behaviour is intended and corresponds to how POSIX and Perl define regexes. He also mentioned advantages of the behaviour chosen in regex crate. This suggest that labelling it ‘broken’ is perhaps unfair. I maintain my opinion based on alternation being a commutative operation. Just like 3 + 6
and 6 + 3
should produce results which behave the same way, foo|foobar
and foobar|foo
should behave the same way.