Regular expressions aren’t broken after all
- Posted by Michał ‘mina86’ Nazarewicz on 23rd of February 2025
- Share on Bluesky
- Cite
Four years ago I proclaimed that regular expressions were broken. Two years ago I discussed this with BurntSushi and even though his expertise in the subject could not be denied, he did not manage to change my opinion. But now, two more years after that, I adjusted my stance.
Everything factual I’ve written previously is still accurate, but calling regular expressions broken might have been a bit too much of a hyperbole. There’s definitely something funky going on with regex engines but I’ve realised an analogy which makes it make sense.
Recap
In formal language theory, alternation, i.e. the | operator, is commutative. For two grammars α and β, α|β and β|α define the same language just like 1 + 2 and 2 + 1 equal the same number (there are no two different 3s, depending how they were constructed).
Nevertheless, most regex engines care about the order of arguments in an alternation. As demonstrated in my previous post, when matching the string ‘foobar’ against foo|foobar
regular expression, the regex engines will match ‘foo’ substring but when matching it against foobar|foo
they will match the entire ‘foobar’ string.
This is a bit like saying that 5x should give different results depending on whether x was constructed as x = 1 + 2 or x = 2 + 1. Of course software engineering and maths are different disciplines and things don’t directly translate between the two. Nevertheless, I felt justified in calling such regex engines broken.
Prior Art
Adjustment of my stance on the issue was thanks to other examples where programming practice clashes with its theoretical roots. Below I’ll give a handful of examples culminating with one that really changed my position.
String concatenation
Many languages use a plus symbol as a string concatenation operator, which isn’t commutative. Meanwhile in maths, plus is by convention used for commutative operations only.1 Indeed, some languages opt for using different concatenation operators: D uses ~
, Haskell uses ++
, Perl uses .
(dot), SQL uses ||
and Visual Basic uses &
to name a few examples.2
However, this is a different situation than the case of alternation in regular expressions. Using plus symbol for concatenation may be considered unfortunate, but the operation itself behaves the same way its counterpart in maths does.
Floating point numbers
Another example where maths disagrees with programming are floating point numbers. They pretend to be real numbers but in reality they aren’t even good at being rational numbers. Most notably for this discussion, addition of floating point numbers is not associative and can lead to catastrophic cancellation.3 Plus there are NaN values which infamously do not equal themselves and can really mess up array sorting if not handled properly.
However, this didn’t convince me that regular expression weren’t broken either. After all, I’m perfectly happy to call floating point numbers broken. I don’t mean by this that they are unusable or don’t solve real (no pun intended) problems. Rather this is only to emphasise that there are many details that engineer needs to be vary of when using them. This is the same sense I used the word in regards to regex engines.
Logic operations
In the end what made me more open to the ‘broken’ behaviour of regular expressions were logic operators. In many (most? all?) imperative languages, the logic or operator use a short-circuit evaluation. For example, while puts("foo") || puts("bar")
and puts("bar") || puts("foo")
C expressions evaluate to the same value (one), their behaviours differ — the first one outputs ‘foo’ and the second one outputs ‘bar’ to standard output.
This is analogous to regular expressions. When matching foo|foobar
and foobar|foo
regular expressions, the result (whether the string matches) is the same, but the side effects of the execution may differ.
Conclusion
To be clear, I still have doubts whether foo|foobar
and foobar|foo
behaving differently is the right option. However, it’s also clear to say that it’s not as broken as I used to think; rather, it’s one of the peculiarities of regexes that one needs to be aware of. And specifically, aware of how regex engine they use behave.