• mina86.com

  • Categories
  • Code
  • Contact
  • Regular expressions aren’t broken after all

    Four years ago I proclaimed that regular expressions were broken. Two years ago I discussed this with BurntSushi and even though his expertise in the subject could not be denied, he did not manage to change my opinion. But now, two more years after that, I adjusted my stance.

    Everything factual I’ve written previously is still accurate, but calling regular expressions broken might have been a bit too much of a hyperbole. There’s definitely something funky going on with regex engines but I’ve realised an analogy which makes it make sense.

    Recap

    In formal language theory, alternation, i.e. the | operator, is commutative. For two grammars α and β, α|β and β|α define the same language just like 1 + 2 and 2 + 1 equal the same number (there are no two different 3s, depending how they were constructed).

    Nevertheless, most regex engines care about the order of arguments in an alternation. As demonstrated in my previous post, when matching the string ‘foobar’ against foo|foobar regular expression, the regex engines will match ‘foo’ substring but when matching it against foobar|foo they will match the entire ‘foobar’ string.

    This is a bit like saying that 5x should give different results depending on whether x was constructed as x = 1 + 2 or x = 2 + 1. Of course software engineering and maths are different disciplines and things don’t directly translate between the two. Nevertheless, I felt justified in calling such regex engines broken.

    Prior Art

    Adjustment of my stance on the issue was thanks to other examples where programming practice clashes with its theoretical roots. Below I’ll give a handful of examples culminating with one that really changed my position.

    String concatenation

    Many languages use a plus symbol as a string concatenation operator, which isn’t commutative. Meanwhile in maths, plus is by convention used for commutative operations only.1 Indeed, some languages opt for using different concatenation operators: D uses ~, Haskell uses ++, Perl uses . (dot), SQL uses || and Visual Basic uses & to name a few examples.2

    However, this is a different situation than the case of alternation in regular expressions. Using plus symbol for concatenation may be considered unfortunate, but the operation itself behaves the same way its counterpart in maths does.

    Floating point numbers

    Another example where maths disagrees with programming are floating point numbers. They pretend to be real numbers but in reality they aren’t even good at being rational numbers. Most notably for this discussion, addition of floating point numbers is not associative and can lead to catastrophic cancellation.3 Plus there are NaN values which infamously do not equal themselves and can really mess up array sorting if not handled properly.

    However, this didn’t convince me that regular expression weren’t broken either. After all, I’m perfectly happy to call floating point numbers broken. I don’t mean by this that they are unusable or don’t solve real (no pun intended) problems. Rather this is only to emphasise that there are many details that engineer needs to be vary of when using them. This is the same sense I used the word in regards to regex engines.

    Logic operations

    In the end what made me more open to the ‘broken’ behaviour of regular expressions were logic operators. In many (most? all?) imperative languages, the logic or operator use a short-circuit evaluation. For example, while puts("foo") || puts("bar") and puts("bar") || puts("foo") C expressions evaluate to the same value (one), their behaviours differ — the first one outputs ‘foo’ and the second one outputs ‘bar’ to standard output.

    This is analogous to regular expressions. When matching foo|foobar and foobar|foo regular expressions, the result (whether the string matches) is the same, but the side effects of the execution may differ.

    Conclusion

    To be clear, I still have doubts whether foo|foobar and foobar|foo behaving differently is the right option. However, it’s also clear to say that it’s not as broken as I used to think; rather, it’s one of the peculiarities of regexes that one needs to be aware of. And specifically, aware of how regex engine they use behave.

    1 John B. Fraleigh and Neal E. Brand. 2020. §4 Nonabelian Example. A First Course in Abstract Algebra (8th ed.). Pearson, Hoboken, NJ, USA. ISBN 978-0-13-575816-8. 

    2 Admittedly, at least in some of the cases choice of a different operator may have been influenced by factors other than ‘mathematical purity’. Perl has separate set of operators for strings and numbers (e.g. $x == $y converts operands to numbers if necessary while $x eq $y converts them to strings). Visual Basic has both + and & operators with the latter always converting operands to strings first. In contrast JavaScript is infamous with its type coercion and it would likely benefit from having separate concatenation operator. 

    3 David Goldberg. 1991. What every computer scientist should know about floating-point arithmetic. ACM Computing Surveys, Vol 23, Issue 1 (March 1991), 5–48. doi:10.1145/103162.103163