DEV Community

Paul J. Lucas
Paul J. Lucas

Posted on • Updated on

Undefined Behavior in C and C++

Introduction

Programming in either C or C++, you likely have heard the term “undefined behavior” and at least know that it’s “bad.” But what, exactly, is it? What are some examples? Why does it exist?

A Brief History of C

C was created by Dennis Ritchie in the early 1970s at Bell Labs and introduced to the world via the publication of The C Programming Language in 1978.

For those interested in a much more detailed history of C, see The Development of the C Language, Dennis M. Ritchie, April, 1993.

Prior to the first C standard in 1989, there were two “flavors” of C:

  • Unportable C: a step up from assembly used to program operating systems using any technique that worked for the particular hardware.

  • Semi-portable C: use of #ifdef made many programs “semi-portable” so they’d compile and run the same on all the hardware and operating systems considered.

As C’s popularity grew, it was becoming clear that a standard was needed. While The C Programming Language was the definitive description of C for over a decade, it was insufficiently precise to be a standard. For a standard, you ideally want it to specify precisely what happens in every circumstance for every aspect of a language.

Why does Undefined Behavior Exist?

However, by the mid-1980s, there were many unportable and semi-portable programs that worked. The problem was that doing X on computer 1 with compiler 1 yielded result R1 whereas doing X on computer 2 with compiler 2 yielded result R2 and their respective programs relied on those results. A standard would ordinarily have to mandate that only one of either result R1 or R2 was correct. Updated standard-conforming compilers would have broken many programs.

To not do this, the concepts of implementation defined behavior, unspecified behavior, and undefined behavior were invented as part of the standard to serve as “escape hatches” to allow many working programs to continue to work with standard-conforming compilers.

While this may not seem completely satisfying, definitively stating that X is one of implementation defined, unspecified, or undefined is an improvement over stating nothing at all about X.

The Triple Crown of “Bad”

The differences between the three somewhat related “bad” behaviors in C and C++ are explained in the comp.lang.c FAQ, question 11.33:

  • Implementation defined behavior: The implementation must pick some behavior, it must be consistent, and it must be documented.

  • Unspecified behavior: Like implementation defined, except it need not be documented.

  • Undefined behavior: Anything at all can happen. The program may execute incorrectly (either crash or silently generate incorrect results), or it may fortuitously do exactly what the programmer intended.

An example of implementation defined behavior is:

new_offset = ftell( f ) + fread( buf, 1, n, f );
Enter fullscreen mode Exit fullscreen mode

This behavior is implementation defined because the order in which the operands of + are evaluated (left, then right; or right, then left) is not specified by the C standard.

While implementation defined behavior is bad, at least it’s consistent (using the same platform, compiler, and compiler flags).

Undefined behavior is worse in that:

  • Different runs of the same executable can produce different results!
  • The same run of an executable can produce different results at different times!

Undefined Behavior Examples

Both C and C++ (since it inherited them from C) list hundreds of things that result in undefined behavior. Examples of the common ones include:

  • Signed integer overflow and underflow.
  • Object is referred to outside of its lifetime.
  • Using a pointer to an object whose lifetime has ended.
  • Reading from an uninitialized object.
  • Addition/subtraction of pointers of unrelated arrays.
  • Indexing beyond the end of an array.
  • Modifying a const object.
  • Data races.
  • ...
  • Anything not explicitly listed as one of defined behavior, implementation defined behavior, or unspecified behavior is undefined behavior!

Implications & Example

But what are the implications of undefined behavior?

  • The compiler is allowed to assume undefined behavior never happens, hence all programs are valid.

  • This allows the compiler to generate very efficient code, especially in tight loops. (This is the only good thing about undefined behavior.)

A simple example that can result in undefined behavior is:

bool no_overflow( int x ) {
    return x+1 > x;
}
Enter fullscreen mode Exit fullscreen mode

Normally, you’d never write silly code like this; but such code can sometimes result from macro expansion or template instantiation, so the compiler should do a good job of optimizing it.

The compiler will unsurprisingly generate the following optimized x86-64 assembly:

no_overflow:
    movl $1, %eax  ; return true
    ret
Enter fullscreen mode Exit fullscreen mode

because x+1 is always > x — or it is in pure math. However, computer math has limited precision, so there are two possible cases:

  1. x != INT_MAX
: Behavior of + is well-defined; must return true.
  2. x == INT_MAX
: Behavior of + is undefined; can do anything.

The compiler is allowed to assume that case 2 never happens. Why? Because the only reason for considering that case would be if the compiler could check for and do something about it such as rewrite the code as if it were:

    return x != INT_MAX && x+1 > x;
Enter fullscreen mode Exit fullscreen mode

But that would be inserting a check you didn’t ask for; and it would be less efficient for the majority of cases. Developers write in C and C++ typically for performance so inserting such code would be antithetical.

Two Parts to Undefined Behavior

There are actually two parts to undefined behavior:

  1. Actually performing undefined behavior at run-time; examples:

    • Dereferencing a null pointer.
    • Indexing beyond the end of an array.
    • ...
  2. The compiler being allowed to assume undefined behavior never happens (a false premise) allows it to generate sometimes surprising (but very efficient) code. In logic, if you accept a false premise, you can draw any conclusion. For example:

    • If the streets are wet, it has rained recently. (False premise.)
    • The streets are wet.
    • Therefore, it has rained recently. (Conclusion: logically valid, but wrong.)

It’s the second part that causes the most surprise. For another example, consider:

extern int table[4];

bool exists_in_table( int v ) {
    for ( int i = 0; i <= 4; ++i ) {
        if ( table[i] == v )
            return true;
    }
    return false;
}
Enter fullscreen mode Exit fullscreen mode

The compiler will surprisingly generate the following optimized x86-64 assembly:

exists_in_table(int):
    movl $1, %eax  ; return true
    ret
Enter fullscreen mode Exit fullscreen mode

How is that possible? Where did the for loop and if go? The problem stems from the fact that the code has a bug. (Did you notice it?) The bug is the i <= 4 should be i < 4. But even so, how does the compiler generate return true? The “rationale” is:

  1. The first four times through the loop, the function might return true.
  2. If i were 4, the code would perform undefined behavior (by attempting to access an element beyond the end of the array).
  3. The compiler is allowed to assume undefined behavior never happens (all programs are valid); therefore:
    • The variable i can “never” be 4. (False premise.)
    • Implies we must have found a match when i < 4.
    • Therefore, we can always return true. (Conclusion: logically valid, but wrong.)

This is not a compiler bug. Given the choice between assuming the programmer wrote a valid program versus an invalid program, we’ve told the compiler to choose the former — and it optimizes accordingly.

Optimization Can Make Things Even Worse

Consider the following:

void assign_not_null( int *p, int v ) {
    int old_v = *p;
    if ( p == nullptr )
        return;
    *p = v;
}
Enter fullscreen mode Exit fullscreen mode

The line int old_v = *p is dead code. It’s also wrong because it dereferences p before checking whether it’s null. (Perhaps that line was part of some debugging code that was added hastily, then forgotten about and left in by mistake. These kinds of things happen in the real world.)

You might think such dead code would be harmless, but, depending on what optimizations the compiler performs — and in what order — this can cause undefined behavior.

Assume there are at least two optimizations that the compiler does:

  1. Dead Code Elimination: code that isn’t used is eliminated.
  2. Redundant Null Check Elimination: if the compiler can deduce that a particular pointer can’t possibly be null on a given line, it eliminates the if check for null.

Assume that the compiler does the optimizations in the above order. It therefore would:

  1. Eliminate the int old_v = *p because old_v is not used.
  2. Do nothing else since the if ( p == nullptr ) is a necessary check before the *p = v on the next line and so can’t be eliminated.

So far, so good. But what if the compiler does the optimizations in the reverse order? It then would:

  1. Knowing that dereferencing a null pointer is undefined behavior and being allowed to assume that undefined behavior never happens, it means that:
    • If the code gets to the if, then the *p on the previous line must have succeeded.
    • That means p is never null.
    • Therefore, the null check is unnecessary and so the if can be eliminated.
  2. Now it performs dead code elimination and eliminates int old_v = *p.

The resulting code would be:

void assign_not_null( int *p, int v ) {
    *p = v;
}
Enter fullscreen mode Exit fullscreen mode

which is logically valid, but wrong.

In-Depth Example

For an in-depth example of the bizarre bugs undefined behavior can cause (and how to find and fix them), see The Curious Case of the Disappearing “if”.

Undefined Behavior in Other Languages

At this point, you might be wondering whether undefined behavior exists in other languages. Generally, the answer is “no” — with two exceptions:

  1. If a language provides a mechanism to perform “unsafe” operations, those typically can perform undefined behavior.
  2. Data races are always undefined behavior.

Ada has a similar, but weaker concept of “bounded errors.”

But for languages with always-defined behavior, the price paid is in performance:

  • Always initializing variables.
  • Always checking array indices.
  • Garbage collection.
  • ...

Conclusion

Undefined behavior was a least-bad compromise to get standard C. Try to know (and not do!) the common things that result in it.

Epilogue

To drive home the point that undefined behavior means anything is possible, John Woods posted the following in the Usenet newsgroup comp.lang.c:

From: John F. Woods
Newsgroups: comp.lang.c
Date: Feb 25, 1992, 11:51:52 AM

> * Undefined behavior -- behavior, upon use of a nonportable or
> erroneous program construct, ... for which the standard imposes
> no requirements.
 Permissible undefined behavior ranges from
> ignoring the situation completely
 with unpredictable results,
> to having demons fly out of your nose.

In short, you can't use sizeof() on a structure whose elements
haven't been
 defined, and if you do, demons may fly out of your
nose.

OK, OK; so the Standard doesn't *ACTUALLY* mention demons or
noses. Not as
 such, anyway.
Enter fullscreen mode Exit fullscreen mode

Someone else followed up coining the term “nasal demons” that stuck.

References

Top comments (5)

Collapse
 
gustrb profile image
Gustavo Reis Bauer

Great blog post!
It sure was a great read :)

Collapse
 
eztosin profile image
Info Comment hidden by post author - thread only accessible via permalink
Eztosin

Hi Mr Paul

I took the liberty of going through your profile and I came across your insightful post on the undefined behavior of C and C++. Your deep dive into this topic showcases not only your technical expertise but also your commitment to understanding the intricate details of programming languages. I truly appreciate how you shed light on a complex area, making it more accessible to those eager to learn.

Your dedication to sharing your knowledge is evident, and I'm genuinely inspired by your willingness to engage in meaningful discussions. Your perspective has been invaluable in broadening my understanding of these languages.

Your recent critique on my article provided me with valuable insights, and I genuinely respect your perspective. It's clear to me that your expertise goes beyond your posts and extends to fostering a culture of learning and growth.

Given your profound understanding of programming languages and your eagerness to engage with others, I would be honored to learn from you and possibly benefit from your mentorship. I believe your guidance could greatly enrich my journey in the realm of programming.

Thank you for the meaningful contributions you make to our community. Looking forward to the possibility of connecting and learning more from you.

Best regards,

Collapse
 
pauljlucas profile image
Paul J. Lucas

IMHO, if someone doesn’t have the commitment to understanding the intricate details of their programming language of choice, they have no business being a programmer.

My motivation for commenting invariably is to point out mistakes so that the author fixes them thereby reducing the amount of wrong information on the internet. (I realize that this is akin to bailing out the ocean with a teaspoon.) And perhaps also motivate them to do more research to ensure that what they write is actually both correct and complete going forward.

My motivation for writing my own posts is to write a clear, correct, and complete description of whatever it is I'm writing about; and also to have it handy to link to when I point out the errors or omissions in somebody else's writing on the same subject.

Collapse
 
eztosin profile image
Eztosin • Edited

I understand your point sir, I started my programming journey this year enrolled as a student in the ALX Software Engineering program.

I started with The C programming, and I've built projects in C during the course of my learning. As a result I developed a soft spot for C and I disliked the fact that it's becoming more irrelevant in the present programming landscape, hence why I wrote the article.

I am not the type to share false information to readers as I am well aware of the implications, the article was based solely on my understanding of C and research.

Thank you so much for your deep insights and I am happy to connect with you.

Thread Thread
 
pauljlucas profile image
Paul J. Lucas

I never said anybody who writes such blogs shares false information intentionally.

IMHO, novices should generally refrain from writing such blogs because they don't know what they don't know. You need several years of experience under your belt before writing about anything.

Some comments have been hidden by the post's author - find out more