Regular expressions are broken
Posted by Michał ‘mina86’ Nazarewicz on 28th of February 2021
Quick! What does
re.search('foo|foobar', 'foobarbaz').group() produce? Or for those not fluent in Python, how about
/foo|foobar/.exec('foobarbaz')? Or to put it into words, what part of string
foobarbaz will a
foo|foobar regular expression match?
Perhaps it’s just me, but I expected the result to be
Knowing that, what does
re.search('foobar|foo', 'foobarbaz').group() produce (notice the subexpressions in the alternation are swapped). This can be reasoned in two ways: either order of branches in the alternation doesn’t matter — in which case the result should be the same as before, i.e.
foo — or it does matter — and now the result will be
A computer scientist might lean towards the first option but a software engineer will know it’s the second.
I’ve decided to check a whole bunch of tools, languages and libraries to see how they all behave. For those curious, code for all tests is of course available in a source repository. Various UNIX utilities passed the verification so long as they did not use Perl Compatible Regular Expressions (PCRE). Passed here means they produced the same result regardless of the order of the subexpressions in the alternation. Anything claiming to be using POSIX compatible engine (e.g.
std::regex::extended form C++) worked correctly as well. The rest didn’t bode so well.
|BusyBox 1.29 AWK, grep & sed||✓ Pass, with and without |
|GNU AWK 5.1, grep 3.6 & sed 4.7||✓ Pass, with and without |
|GNU grep 3.6 with ||✗ Failure|
|Emacs 28.0.50||✗ Failure|
|C (||✓ Pass, with and without |
|C++ (libstdc++ 10.2 & libc++ 10.0)||✗ Failure when using the default (|
✓ Pass when using other formats
|C++ RE2||✗ Failure|
|Java (OpenJDK 11)||✗ Failure|
|Rust (||✗ Failure|
|CPython 2.7 & 3.9||✗ Failure|
|PyPy 7.2 (Python 2 & 3)||✗ Failure|
|Jython 2.7||✗ Failure|
|PHP 8.0 & 5.6 ||✗ Failure|
|PHP 8.0 & 5.6 ||✗ Failure|
|PHP 5.6 ||✓ Pass|
|Ruby 2.7||✗ Failure|
ereg stands out with it passing the checks. That is to be expected since the
ereg family of functions is based on
regex.h. Sadly it has been removed in PHP 7.0 leaving only PCRE and
mb_ereg left. The latter (which is a multibyte variant) does not share its namesake’s behaviour.
regex crate is noteworthy for another reason. It includes a
shortest_match method which turns out to be a lie. Rather than returning the shortest match it’s an optimisation returning position of the end of the matching substring. It’s a step between answering whether regex matched at all and returning full capture group information. While it may be useful, the name is confusing.
And there you have it. Another one to the ‘regular expressions are weird’ box giving some more credence to the old adage: ‘A man had a problem and decided to use a regex. Now he has two problems.’
In March 2023, Andrew Gallant (author of Rust regex crate) pointed out that the described behaviour is intended and corresponds to how POSIX and Perl define regexes. He also mentioned advantages of the behaviour chosen in regex crate. This suggest that labelling it ‘broken’ is perhaps unfair. I maintain my opinion based on alternation being a commutative operation. Just like
3 + 6 and
6 + 3 should produce results which behave the same way,
foobar|foo should behave the same way.