• mina86.com

  • Categories
  • Code
  • Contact
  • Regular expressions are broken

    Posted by Michał ‘mina86’ Nazarewicz on 28th of February 2021

    Quick! What does re.search('foo|foobar', 'foobarbaz').group() produce? Or for those not fluent in Python, how about /foo|foobar/.exec('foobarbaz')? Or to put it into words, what part of string foobarbaz will a foo|foobar regular expression match?

    foo foobar

    Perhaps it’s just me, but I expected the result to be foobar. That is, for the regular expression to match the longest leftmost substring. Alas, that’s not what is happening. Instead, Python’s and JavaScript’s regex engine will only match foo prefix.

    Knowing that, what does re.search('foobar|foo', 'foobarbaz').group() produce (notice the subexpressions in the alternation are swapped). This can be reasoned in two ways: either order of branches in the alternation doesn’t matter — in which case the result should be the same as before, i.e. foo — or it does matter — and now the result will be foobar.

    A computer scientist might lean towards the first option but a software engineer will know it’s the second.

    I’ve decided to check a whole bunch of tools, languages and libraries to see how they all behave. For those curious, code for all tests is of course available in a source repository. Various UNIX utilities passed the verification so long as they did not use Perl Compatible Regular Expressions (PCRE). Passed here means they produced the same result regardless of the order of the subexpressions in the alternation. Anything claiming to be using POSIX compatible engine (e.g. std::regex::extended form C++) worked correctly as well. The rest didn’t bode so well.

    ToolResult
    BusyBox 1.29 AWK, grep & sed✓ Pass, with and without -E
    GNU AWK 5.1, grep 3.6 & sed 4.7✓ Pass, with and without -E
    GNU grep 3.6 with -P (PCRE)✗ Failure
    Emacs 28.0.50✗ Failure
    C (regex.h, glibc 2.31)✓ Pass, with and without REG_EXTENDED
    C++ (libstdc++ 10.2 & libc++ 10.0)✗ Failure when using the default (ECMAScript)
    ✓ Pass when using other formats
    C++ RE2✗ Failure
    Java (OpenJDK 11)✗ Failure
    Rust (regex 1.4)✗ Failure
    JavaScript Carakan (Opera 12.16)✗ Failure
    JavaScript Chakra (Edge 90)✗ Failure
    JavaScript SpiderMonkey (Firefox 78)✗ Failure
    JavaScript V8 (Chromium 88, Opera 74)✗ Failure
    CPython 2.7 & 3.9✗ Failure
    PyPy 7.2 (Python 2 & 3)✗ Failure
    Jython 2.7✗ Failure
    PHP 8.0 & 5.6 preg✗ Failure
    PHP 8.0 & 5.6 mb_ereg✗ Failure
    PHP 5.6 ereg✓ Pass
    Ruby 2.7✗ Failure

    PHP’s ereg stands out with it passing the checks. That is to be expected since the ereg family of functions is based on regex.h. Sadly it has been removed in PHP 7.0 leaving only PCRE and mb_ereg left. The latter (which is a multibyte variant) does not share its namesake’s behaviour.

    Rust’s regex crate is noteworthy for another reason. It includes a shortest_match method which turns out to be a lie. Rather than returning the shortest match it’s an optimisation returning position of the end of the matching substring. It’s a step between answering whether regex matched at all and returning full capture group information. While it may be useful, the name is confusing.

    And there you have it. Another one to the ‘regular expressions are weird’ box giving some more credence to the old adage: ‘A man had a problem and decided to use a regex. Now he has two problems.’

    Addendum

    In March 2023, Andrew Gallant (author of Rust regex crate) pointed out that the described behaviour is intended and corresponds to how POSIX and Perl define regexes. He also mentioned advantages of the behaviour chosen in regex crate. This suggest that labelling it ‘broken’ is perhaps unfair. I maintain my opinion based on alternation being a commutative operation. Just like 3 + 6 and 6 + 3 should produce results which behave the same way, foo|foobar and foobar|foo should behave the same way.