Regular expressions are broken
Posted by Michał ‘mina86’ Nazarewicz on 28th of February 2021
Quick! What does
re.search('foo|foobar', 'foobarbaz').group() produce? Or for those not fluent in Python, how about
/foo|foobar/.exec('foobarbaz')? Or to put it into words, what part of string
foobarbaz will a
foo|foobar regular expression match?
Perhaps it’s just me, but I expected the result to be
Knowing that, what does
re.search('foobar|foo', 'foobarbaz').group() produce (notice the subexpressions in the alternation are swapped). This can be reasoned in two ways: either order of branches in the alternation doesn’t matter — in which case the result should be the same as before, i.e.
foo — or it does matter — and now the result will be
A computer scientist might lean towards the first option but a software engineer will know it’s the second.
I’ve decided to check a whole bunch of tools, languages and libraries to see how they all behave. For those curious, code for all tests is of course available in a source repository. Various UNIX utilities passed the verification so long as they did not use Perl Compatible Regular Expressions (PCRE). Passed here means they produced the same result regardless of the order of the subexpressions in the alternation. Anything claiming to be using POSIX compatible engine (e.g.
std::regex::extended form C++) worked correctly as well. The rest didn’t bode so well.
|BusyBox 1.29 AWK, grep & sed||✓ Pass, with and without |
|GNU AWK 5.1, grep 3.6 & sed 4.7||✓ Pass, with and without |
|GNU grep 3.6 with ||✗ Failure|
|Emacs 28.0.50||✗ Failure|
|C (||✓ Pass, with and without |
|C++ (libstdc++ 10.2 & libc++ 10.0)||✗ Failure when using the default (|
✓ Pass when using other formats
|C++ RE2||✗ Failure|
|Java (OpenJDK 11)||✗ Failure|
|Rust (||✗ Failure|
|CPython 2.7 & 3.9||✗ Failure|
|PyPy 7.2 (Python 2 & 3)||✗ Failure|
|Jython 2.7||✗ Failure|
|PHP 8.0 & 5.6 ||✗ Failure|
|PHP 8.0 & 5.6 ||✗ Failure|
|PHP 5.6 ||✓ Pass|
|Ruby 2.7||✗ Failure|
ereg stands out with it passing the checks. That is to be expected since the
ereg family of functions is based on
regex.h. Sadly it has been removed in PHP 7.0 leaving only PCRE and
mb_ereg left. The latter (which is a multibyte variant) does not share its namesake’s behaviour.
regex crate is noteworthy for another reason. It includes a
shortest_match method which turns out to be a lie. Rather than returning the shortest match it’s just an optimisation returning position of the end of the matching substring. It’s a step between answering whether regex matched at all and returning full capture group information. While it may be useful, the name is confusing.
And there you have it. Another one to the ‘regular expressions are weird’ box giving some more credence to the old adage: ‘A man had a problem and decided to use a regex. Now he has two problems.’