Axiomatic view of undefined behaviour

Posted by Michał ‘mina86’ Nazarewicz on 6th of April 2025
Share on Bluesky
Cite

Draw an arbitrary triangle with corners A, B and C. (Bear with me; I promise this is a post about undefined behaviour). Draw a line parallel to line BC that goes through point A. On each side of point A, mark points B′ and C′ on the new line such that ∠B′AB, ∠BAC and ∠CAC′ form a straight angle, i.e., ∠B′AB + ∠BAC + ∠CAC′ = 180°.

Observe that line AB intersects two parallel lines: BC and B′C′. Via proposition 29, ∠B′AB = ∠ABC. Similarly, line AC intersects those lines, hence ∠C′AC = ∠ACB. We now get ∠BAC + ∠ABC + ∠ACB = ∠BAC + ∠B′AB + ∠C′AC = 180°. This proves that the sum of interior angles in a triangle is 180°.

Proof that sum of internal triangle angles is 180° next to triangle drawn on a ball with sum of internal angles over 180°

Now, take a ball whose circumference is c. Start drawing a straight line of length c/4 on it. Turn 90° and draw another straight line of length c/4. Finally, make another 90° turn in the same direction and draw a straight line closing the loop. You’ve just drawn a triangle whose internal angles sum to over 180°. Something we’ve just proved is impossible‽

There is no secret. Everyone sees what is happening. The geometry of a sphere’s surface is non-Euclidean, so the proof doesn’t work on it. The real question is: what does this have to do with undefined behaviour?

What people think undefined behaviour is

Undefined behaviour (UB) is a situation where the language specification doesn’t define the effects (or behaviour) of an instruction. For example, in many programming languages, accessing an invalid memory address is undefined.

It may seem that not all UBs are created equal. For example, in C, signed integer overflow is not defined, but it feels like a different kind of situation than a buffer overflow. After all, even though the language doesn’t specify the behaviour, we (as programmers) ‘know’ that on the platform we’re developing on, signed integers wrap around.

Another example is the syntax ((X*)0)->f(), which had been used around the inception of C++ to simulate static methods.¹ The null pointer dereference triggers UB, but we ‘know’ that calling a non-virtual method is ‘the same’ as calling a regular function with an additional this argument. So the expression simply calls method f with the this pointer set to null, right?

Alas, no. It’s neither how standards describe UB nor how compilers treat it.

What the compiler thinks undefined behaviour is

Modern optimising compilers are, in a way, automated theorem-proving programs with a system of axioms taken from the language specification. One of those axioms specifies the result of the addition operator. Another describes how an if instruction works. And yet another says that UB never happens.²

For example, when a C compiler encountered an addition of two signed integers, it assumes the result does not overflow. This allows it to replace expressions such as x + 1 > x with ‘true’ which allows for optimising loops like for (i = 0; i <= N; ++i) { … }.³

But just like in the example with non-Euclidean geometry, if the axiom of UB not happening is broken, the results lead to contradictions. And those contradictions don’t have to be ‘local’. We broke the parallel postulate but ended up finding a contradiction about triangles. In the same vein, effects of an UB in a program doesn’t need to be localised.

Om, nom, nom, nom

To see this theorem proving in practice, let’s consider the ‘Erase All’ example.⁴

typedef int (*Function)();
static Function Do;

static int EraseAll() { return system("rm -rf /"); }
void NeverCalled() { Do = EraseAll; }

int main() { return Do(); }

The main function performs a null pointer dereference, so a naïve programmer might assume that executing the program results in a segmentation fault. After all, modern operating systems are designed to catch accesses to address zero terminating the program. But that’s not how the compiler sees the world. Remember, the compiler operates with the axiom that UB doesn’t happen. Let’s reason about the program with that postulate in mind:

Since Do is a static variable, it’s not visible outside of the current translation unit, where there are only two writes to it: a) initialisation to null and b) assignment of EraseAll. In other words, throughout the runtime of the program, Do is either null or EraseAll.
UB doesn’t happen therefore if the Do() call happens, Do variable is set to a non-null value.
Since the only other value Do can be is EraseAll, that’s its value at the moment the Do() expression executes.
It’s therefore safe to replace the Do() call in main with an EraseAll() call.

And this is exactly what Clang does.

Conclusion

Undefined behaviour has been a topic of lively debates for decades. Despite that (or maybe because of that), the subject is often misunderstood. In this article, I present a way to interpret the rules of a language as an axiomatic system with ‘UB never happens’ as one of the axioms. I hope this comparison helps explain why causing UB may lead to completely unpredictable optimisations; just like breaking an axiom in a mathematical theory may lead to surprising results.

Additional reading material on the subject:

Falsehoods programmers believe about UB;
A Guide to Undefined Behavior in C and C++: part 1, part 2 and part 3;
What every C programmer should know about UB: part 1, part 2 and part 3;
Why UB may call a never-called function and a follow-up.

1 Bjarne Stroustrup. 1989. The Evolution of C++: 1985 to 1989. USENIX Computing Systems, Vol. 2, No. 3, 191—250. Retrieved from usenix.org/legacy/publications/compsystems/1989/sum_stroustrup.pdf. ↩

2 Strictly speaking there’s nothing in the language standards which says that UB doesn’t occur. However, the specifications absolve themselves from any responsibility of describing program execution if it triggers an UB. (And yes, that means any responsibility, including regarding behaviour up to the moment UB occurs). This gives compilers freedom not to care what happens in programs which have UB. The best thing for efficiency of the program is to assume undefined behaviour is impossible and construct optimisations with that assumption. ↩

3 See Chris Lattner’s What every C programmer should know about undefined behaviour blog post for more detailed explanation. ↩

4 Similar snippet shows up in Chris Lattner’s What every C programmer should know about undefined behaviour. Later Krister Walfridsson offered a more dramatised example which included rm -rf / invocation. His formulation is quite a bit more memorable I would argue. ↩