edA‑qa mort‑ora‑y

Posted on Dec 17, 2017 • Originally published at mortoray.com

The uninitialized variable anathema: non-deterministic C++

#computerscience #cpp #programming #coding

A variable with an undefined value is a terrible language failure, especially when programs tend to work anyway. It's a significant fault of C++ to allow this easy-to-create and hard-to-detect situation. I was recently treated to this insanity with my Leaf project. Various uninitialized structure values made my program fail on different platforms. There's no need for this: all variables should have guaranteed initial values.

The problem

Local variables, member variables and certain structure allocations result in uninitialized values. C++ has separate rules for uninitialized, default-initialized, and zero-initialized variables. It's an overly complicated mess.

Given how often a variable is left uninitialized in C++, it seems like it'd be easy to spot. It isn't. While the language states they are uninitialized, a great number of variables nonetheless end up with zero values. Consider the program below.

#include <iostream>

int main() {
    bool a;
    if( a ) {
        std::cout << "True" << std::endl;
    } else {
        std::cout << "False" << std::endl;
    }

    int b;
    std::cout << b << std::endl;
}

For me this always prints False and 0. It prints the same on ideone. Yet according to the standard it could print True and whatever integer value it wants. Both a and b are uninitialized. A tool like valgrind can point out the problem in this case.

The problem is insidious precisely because such programs tend to work. While developing, the error may not show up, since zero happens to be the desired initial value. A test suite is incapable of picking up this error until it's run on a different platform. In some projects, I've included valgrind as part of the regular testing, but I think that is rare, and even then I didn't make it part of the automated test suite (too many false positives).

Compounding the problem is that while all types are default initialized, this means nothing for fundamentals. At least a class will have the default constructor called, resulting in a usable instance. A fundamental's "default initializer" is nothing, rather than the sensible "zero initializer". This dichotomy creates a situation where it's not possible to say whether T a;, for some type T, is an initialized or uninitialized variable. A quick glance at the code will always "look" right, even if it is sometimes wrong.
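To make the dichotomy concrete, here is a minimal sketch. The helper name `make_value_initialized` is made up for illustration. Writing `T a{}` requests value-initialization, which the standard guarantees is zero for fundamentals, whereas the plain `T a;` form only gives a usable value for class types.

```cpp
#include <string>

// Illustrative helper (not from the article): `T a{}` value-initializes.
// For fundamentals this yields zero; for classes with a default
// constructor, that constructor runs.
template <typename T>
T make_value_initialized() {
    T a{};
    return a;
}

// By contrast, `T a;` default-initializes:
//   std::string s;  // fine: the default constructor runs, s is empty
//   int n;          // n holds an indeterminate value
```

The two spellings differ by a pair of braces, which is part of why a quick glance does not reveal which one was intended.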

Why zero

But why does it always tend to be zero? It's a bit ironic not to initialize the memory since the OS will not give a program uninitialized memory. This is a security mechanism. The underlying memory on the system is a shared protected resource. Program A writes to a page, frees it, then program B happens to get allocated the same page. Program B should not be able to read what Program A wrote to that memory. To prevent an information leak the kernel initializes all memory. On Linux it happens to do this with zeros.

There's no reason it has to be done this way. I believe OpenBSD uses a different initial value. And apparently, Arch Linux running inside VirtualBox does something different as well (this is where Leaf failed). It may not even be the OS; the program can also pick up memory that it had previously allocated. In this case, nothing will re-zero this memory since it's within the same program.

Apparently OpenBSD's basic free/malloc will reinitialize the data on each allocation. It's a security feature that mitigates the negative potential of buffer overflows. Curiously it might have prevented the Heartbleed defect, but OpenSSL sort of bypassed that mechanism anyway.

The solution

A language should simply not allow uninitialized variables. I'm not saying that an explicit value must always be given, but the language should guarantee the value. C++ should have stated that all variables are default initialized, which in turn means zero initialized for fundamentals. It should not matter how I created the variable, in which scope it resides, or whether it is a member variable.
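Until the language offers such a guarantee, the closest opt-in approximation for member variables is probably a default member initializer (available since C++11). The sketch below uses hypothetical types `Config` and `RawPoint`. It shows what has to be written by hand today, not the language-level rule argued for above.

```cpp
// Hypothetical struct, for illustration only. Each member carries a
// default member initializer, so every Config starts in a known state
// no matter how or where it is created.
struct Config {
    int    retries = 0;
    bool   verbose = false;
    double scale   = 1.0;
};

// A struct without initializers still needs braces at the point of use.
struct RawPoint {
    int x;
    int y;
};

Config make_config() {
    Config c;        // members take their declared defaults
    return c;
}

RawPoint make_point() {
    RawPoint p{};    // value-initialization: x and y are zero
    return p;
}
```

The burden stays with the programmer: forgetting the initializer on one member, or the braces on one declaration, silently brings the problem back.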

There might be some exceptions for performance, or low-level data manipulation. These are, however, outlying situations. Mostly the optimizer can handle the basic case of unused values and eliminate them. If we want a block of uninitialized memory we can always allocate it on our own, in which case I don't expect the data to be initialized and thus don't get caught in the trap.
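As a rough sketch of "allocating it on our own", the two hypothetical helpers below contrast an explicitly uninitialized buffer with a zeroed one. The function names are assumptions made for this example. `new int[n]` leaves the elements uninitialized, while `std::make_unique<int[]>(n)` (C++14) value-initializes them.

```cpp
#include <cstddef>
#include <memory>

// Deliberately uninitialized allocation: every element is written
// before anything reads it, so no indeterminate value is ever observed.
std::unique_ptr<int[]> make_filled(std::size_t n, int value) {
    std::unique_ptr<int[]> buf(new int[n]);   // elements not initialized
    for (std::size_t i = 0; i < n; ++i) {
        buf[i] = value;
    }
    return buf;
}

// Zeroed allocation: make_unique for arrays value-initializes,
// so every element starts at 0.
std::unique_ptr<int[]> make_zeroed(std::size_t n) {
    return std::make_unique<int[]>(n);
}
```

Here the uninitialized case is visible at the allocation site, which is the point: the programmer asked for it explicitly.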

Just for completeness, a language might offer a special noinit keyword that indicates a variable should not be initialized.

I even think this should be modified in C++ now. Since the values were previously undefined, making them defined now won't change the correctness of any existing programs. It's entirely backwards compatible and would significantly improve the quality of C++.

Top comments (9)

Carlos Ureña • Dec 28 '17 • Edited

Probably the best option would be to design language rules that allow writing to uninitialized variables but not reading them, based on a static, compile-time analysis of the code. I'm not 100% sure, but I remember C# does this. It makes it possible to safely initialize a variable using two different expressions in the two branches of an if-else statement, while avoiding double initialization at run time. The compiler must tag each read access to a variable in the text (as legal or illegal) by computing all the possible paths leading there and making sure each of them includes at least one write access to the variable.
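For comparison, current C++ can approximate the two-branch initialization described in this comment without ever holding an uninitialized variable. This is a minimal sketch using an immediately invoked lambda. The function name `describe` and its strings are invented for the example.

```cpp
#include <string>

// Hypothetical function, for illustration only.
std::string describe(int n) {
    // Each branch returns the value, so `label` is initialized exactly
    // once, from one of two expressions, and can even be const.
    const std::string label = [&] {
        if (n < 0) {
            return std::string("negative");
        } else {
            return std::string("non-negative");
        }
    }();
    return label;
}
```

This is a coding convention rather than a compiler-enforced rule, so it does not replace the static analysis the comment asks for.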

edA‑qa mort‑ora‑y • Dec 28 '17

This is not possible to do. There is no way a compiler can know, via static analysis, whether a program uses an uninitialized variable or not. I believe it was proven to be an unsolvable problem in the general case: it is undecidable, since it reduces to the halting problem.

In limited cases, such as local variables, some basic deductions can be done. But, since it's not possible to do fully, the compiler must err on the side of caution. It will assume all uses are invalid unless it can find a simple path that proves the variable was set.

For this reason you get a lot of false positive warnings about uninitialized variables; I see them when compiling C++ with warnings on. I recall having them in C# as well.

Carlos Ureña • Dec 28 '17

Of course, it cannot be done in the general case, you're right. I was thinking of simple cases where the compiler can verify that the variable is explicitly accessed inside its declaration scope. In fact, you can declare a variable 'v' (uninitialized) and then immediately pass its address as a parameter to a function 'f' whose source is not available to the compiler at that moment (by using something like f(&v) in C/C++ syntax). Thus the compiler cannot tell whether 'f' correctly initializes 'v' or not, or whether 'f' incorrectly reads it. In order to handle this, the language must be extended with in/out tags (and rules) for parameters. If 'p' is the single formal parameter of 'f', then 'p' must be tagged as 'in', 'out' or both in the declaration of 'f'. Then the call 'f(&v)' with an uninitialized variable is correct if 'p' is tagged as 'out', but not if it is tagged as 'in/out' or 'in'. When (independently) compiling 'f', the compiler ensures 'p' is accessed legally according to its tag. This is how C# works. This still does not handle the case where a function directly accesses a global variable, but we all know such side effects should not be used.

Martin Bober • Dec 18 '17

Force-Initializing a variable would increase the cost of a variable definition from 0 instructions to 1 instruction.

edA‑qa mort‑ora‑y • Dec 18 '17

Compared to the cost of lost productivity and potential security defects it seems like a fair trade-off, but...

...the cost trade-off is not entirely true. In a large number of cases, especially for local variables and field initialization, the optimizer can determine whether the initial value is used or not. A lot of the actual zero initialization will not be done in the final machine code.

In the rare case where such a cost did matter there really isn't much of a problem to provide a keyword that says it shouldn't be initialized. Like other unsafe keywords it should be an opt-in though, as it isn't safe.

Martin Bober • Dec 18 '17


extern void sys_fcn(int* handle);

void fcn()
{
  int a;

  sys_fcn(&a);
}

In that case, there is no way for the compiler/optimizer to know if a is really initialized by sys_fcn. Only whole-program optimizers will know, but few toolchains provide them.

Even with a new keyword, you still have to think about variable initialization. And if you have to think about it, you can as well remember that primitives are not initialized and not need a keyword at all. ;-)

C and C++ are designed with higher regard for efficiency than for fool-proofness, much like your sharp kitchen knife. If you do not like that design approach, why not use another language like Java, i.e. your butter knife? ;-)

edA‑qa mort‑ora‑y • Dec 18 '17

Yes, it's easy to find situations where the optimizer cannot optimize code. This doesn't discount the fact that in many cases it can.

I can't imagine a situation where the initialization cost in this type of code would be significant though. The overhead of calling the function and the sub-function is probably greater. And if there's any actual memory access involved, the pipelining of the CPU may render the init negligible.

That code also has the problem of a person being unable to determine whether it is correct. Without looking at the documentation for sys_fcn, you cannot tell if you should have initialized that variable or not.
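One way to sidestep that ambiguity in today's C++ is to hide the out-parameter behind a small wrapper that gives the variable a defined value and returns the result. This is only a sketch: the wrapper name `get_handle` is hypothetical, and the body of `sys_fcn` below is a stand-in that exists just so the example links, since the real function is external and its behaviour is unknown.

```cpp
// Stand-in definition (an assumption for this sketch): the real sys_fcn
// is declared extern in the comment above and may behave differently.
void sys_fcn(int* handle) {
    *handle = 42;
}

// Hypothetical wrapper: callers never see an uninitialized variable,
// whether or not sys_fcn writes through the pointer.
int get_handle() {
    int a = 0;       // defined starting value
    sys_fcn(&a);
    return a;
}
```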

As I said, in the cases where this is truly a cost problem (and they do exist), you could annotate it:

int a = undefined;

Or something like that.

arj • Dec 18 '17

Unfortunately, there are also some compilers that initialize these values to zero (or whatever the default value is) in DEBUG mode, while in RELEASE mode they are filled with random values. The reason might also be reuse of registers and stack values.

edA‑qa mort‑ora‑y • Dec 18 '17

It may not be intentional that they are zero-initialized in DEBUG mode. They just happen to use fresh memory, and memory reuse is less aggressive.

I often compile with optimizations on, and still get lots of zeros in uninitialized areas.