Uman Shahzad

Posted on Feb 5, 2024

The Absolute Minimum Every Software Developer Must Know About Pointers

#c #pointers #memory

The concept of pointers regularly confuses beginner programmers. But pointers are fundamental to understanding how sophisticated memory management works in programs, so they cannot be avoided. In this post, we will use C-like syntax as a tool to help demonstrate pointer concepts.

The examples we shall encounter are simple and, especially the more complex ones, do not necessarily reflect what happens in the real world. A C-like syntax is used only to give a taste of what the code might look like when using pointers. What is much, much more important is understanding what pointers are all about as a general concept. When you understand that, the real-world use cases will start to look and feel obvious and easy.

What really is a pointer?

On 64-bit systems, a pointer is nothing but an 8-byte number that is supposed to be interpreted as an address in memory. (On 32-bit systems the same holds, with a 4-byte number.)

When we say a pointer "points to" something, we mean that the address the pointer contains is the address of that "something" in memory.

A pointer that points to a value can be "dereferenced" to get the actual memory contents that the value holds.

A "dangling" pointer is a pointer that used to point to a memory location that had "valid" user-initiated data, but now that memory location is no longer "valid" for some reason or another. A memory location could become invalid if, for example, the memory gets deallocated by the application and reclaimed by the operating system. And so a pointer pointing to such a location now points to an invalid memory location, and is thus "dangling" unless it is reassigned to a valid location.

A "nil" pointer is a pointer whose address value refers to an address that can never be valid during the lifetime of a program. In most modern languages with a pointer concept, the nil pointer has the value 0, although it can technically be anything as long as the address can never be valid. In the examples in this document, we use the 0 address as if it were usable, but only for illustration. In the real world, the 0 address will generally not be accessible, precisely because it is usually the value that NULL or nil pointers hold.
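The nil and dangling cases above can be sketched in C as follows. This is only an illustration that assumes the standard malloc/free allocator; the helper names read_or and release are made up for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Read through p, or return fallback when p is the nil pointer. */
int read_or(const int *p, int fallback)
{
    return (p != NULL) ? *p : fallback;
}

/* Free the memory that *pp points to, then overwrite the pointer with
 * NULL so that it becomes nil instead of dangling. */
void release(int **pp)
{
    assert(pp != NULL);
    free(*pp);      /* free(NULL) is defined to do nothing */
    *pp = NULL;
}
```

After release(&p), any later read_or(p, ...) takes the fallback path instead of touching memory that is no longer valid. Note that other copies of the old address, if any exist, would still be dangling.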

Pointing to a Value

Let us consider a pointer p that points to v:

int v;
int *p;

v = 5;
p = &v;

An example illustration of how memory would look after this code runs is as follows:

ADDR    NAME    VAL
0       v       +---------+
                |    5    |
                +---------+
4       p       +---------+
                |    0    |
                +---------+
12      N/A     N/A
16      N/A     N/A
20      N/A     N/A
24      N/A     N/A
...     ...     ...

v is the name of a 32-bit signed number whose value is currently 5. It is at the very start of memory, at address 0. Since v takes 4 bytes, that means it occupies addresses 0, 1, 2 and 3 in total. The next value after v will start at address 4.

You will notice that we are incrementing addresses by 1 whenever 8 bits (or 1 byte) have passed in memory; this is how addresses are always counted - we don't increment an address when only 1 bit has passed or only after more than 1 byte has passed. Each address represents a unique byte in memory, where address A-1 is 1 byte before address A and address A+1 is 1 byte ahead of address A.

p is the name of a 64-bit address whose value is currently 0. Since 0 is the address of v, p is said to point to v. Since p is 64 bits long, it takes 8 bytes and therefore 8 unique address locations, specifically the addresses 4, 5, 6, 7, 8, 9, 10 and 11. The next value after p will start at address 12.

If we were to dereference p with *p while it holds address 0, the result of evaluating that expression would give us 5, which is the value found at address 0.
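On a real machine v will not sit at address 0, but the relationship between p and v is the same. A small sketch, with helper names invented for this illustration, showing that dereferencing works for both reading and writing:

```c
#include <assert.h>
#include <stddef.h>

/* Load the int found at the address that p holds. */
int load_through(const int *p)
{
    assert(p != NULL);
    return *p;
}

/* Store value at the address that p holds. */
void store_through(int *p, int value)
{
    assert(p != NULL);
    *p = value;
}
```

With int v = 5; and int *p = &v;, load_through(p) evaluates to 5, and after store_through(p, 9) the variable v itself holds 9, because p and &v are the same address.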

Pointing to a List of Values

If we had 3 values v1, v2 and v3, all of which are consecutive in memory and are the same length and type, we can use a pointer p to point to the first value in this list of values.

int v1, v2, v3;
int *p;

v1 = 51;
v2 = 52;
v3 = 53;

p = &v1;
ASSERT_EQ(*p, 51);
p += 1;
ASSERT_EQ(*p, 52);
p += 1;
ASSERT_EQ(*p, 53);

If the above code ran and v1, v2 and v3 happened to be next to each other in memory in that order, the assertions in the code above would hold true. (Note that in C there is no such guarantee; instead, a C user would use an array type which contains these 3 values, which will tell the compiler to explicitly put the 3 values consecutively in memory. However, we do not consider array types in this section for simplicity.)

Now, when p = &v1 in the above code occurs, this is an example illustration of what memory would look like:

ADDR    NAME    VAL
0       v1      +---------+
                |    51   |
                +---------+
4       v2      +---------+
                |    52   |
                +---------+
8       v3      +---------+
                |    53   |
                +---------+
12      p       +---------+
                |    0    |
                +---------+
20      N/A     N/A
24      N/A     N/A
28      N/A     N/A
32      N/A     N/A
...     ...     ...

When *p is run, this is dereferencing the pointer to get the value at address 0. Since p is specifically an int pointer (note that it was declared as int *p and not as bool_t *p or some other type), it is assumed that the value at address 0 is an int, so *p translates into 51, the int value found at address 0. We will discuss in the "Pointer Aliasing" section what would happen if p was a different pointer type but pointed to an int nonetheless.

When p += 1 occurs, since p was an int pointer and int is 4 bytes long, it is automatically incremented by 4 bytes; the address inside p is updated from 0 to 4.

Note that in C, incrementing any pointer of any type by any amount is always scaled by how long its type is. So if p was a pointer to an 8-byte number instead of a 4-byte one, p += 1 would increment by 8 addresses rather than 4. In Mlang, the operation is never automatically scaled by the type's length, so adding 1 to any pointer type will always explicitly add 1 address.
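The scaling that C performs can be observed by measuring, in bytes, how far the address moves after adding 1. The function names below are hypothetical, and long long stands in for "an 8-byte number" (its size is checked with sizeof rather than assumed):

```c
#include <assert.h>
#include <stddef.h>

/* Bytes between p and p + 1 for an int pointer. */
size_t int_step_bytes(const int *p)
{
    assert(p != NULL);
    const int *next = p + 1;
    return (size_t)((const char *)next - (const char *)p);
}

/* Bytes between p and p + 1 for a long long pointer. */
size_t long_long_step_bytes(const long long *p)
{
    assert(p != NULL);
    const long long *next = p + 1;
    return (size_t)((const char *)next - (const char *)p);
}
```

Both functions return the size of the pointed-to type: sizeof(int) for the first and sizeof(long long) for the second. Casting to char * is what exposes the raw byte distance, since a char is 1 byte long.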

After p is incremented so that the address in it becomes 4, dereferencing it with *p gives us the value found at address 4 which is currently 52. A similar thing happens again in the next increment and dereference.
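In real C code the three values would live in an array, which guarantees that they are consecutive. A minimal sketch of the same dereference-then-increment walk, using a made-up function name:

```c
#include <assert.h>
#include <stddef.h>

/* Add up n consecutive ints, starting at the one that p points to. */
int sum_ints(const int *p, size_t n)
{
    assert(p != NULL || n == 0);
    int total = 0;
    for (size_t i = 0; i < n; i++) {
        total += *p;   /* value at the current address */
        p += 1;        /* move forward by one int, i.e. sizeof(int) bytes */
    }
    return total;
}
```

Given int v[3] = {51, 52, 53};, the call sum_ints(v, 3) visits 51, 52 and 53 in that order and returns 156.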

Aside: Pointer Aliasing

Before continuing, as a thought experiment, consider what would happen if we had another pointer p2 which was declared as bool_t *p2, which is a boolean pointer, as such:

int v = 50;
int *p;
bool_t *p2;

p = &v;
p2 = (bool_t *)&v;

Since a bool_t is only 1 byte long, *p2 would mean to get the boolean value at the address in p2.

This can be problematic, since we did not intend to put a boolean value in v. If it is done nonetheless, then only the first byte of v will be retrieved and interpreted as a boolean value.
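To see what "only the first byte" means, the sketch below reads the byte at the lowest address of an int. It uses unsigned char in place of bool_t (an assumption made here so that any byte value can be read back safely), and the function name is invented for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Return the byte stored at the lowest address of the int v points to. */
unsigned char first_byte(const int *v)
{
    assert(v != NULL);
    const unsigned char *bytes = (const unsigned char *)v;
    return bytes[0];
}
```

Which byte of the number sits at the lowest address depends on the machine's byte order. For int v = 50;, a little-endian machine (such as x64) yields 50, while a big-endian machine yields 0.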

This may seem troubling, but it is possible in C by force if the programmer wishes. This is rarely done in this way exactly, but if it is, the reason for doing so should be well-documented and well-reasoned, and only after considering the use of a language-level facility like union in C, which allows representing multiple different types in the same memory location safely. The details of union are out of scope for this document.

Compilers may have trouble performing certain optimizations if more than one pointer points to the same value but with different types, because fewer assumptions can be made about what's happening. So C introduces a "strict" pointer aliasing rule: a value may only be accessed through a pointer whose type is compatible with the value's actual type, otherwise the program is considered as having undefined behavior. An exception is made for character pointer types such as char *, which can be used to access the bytes of any value without breaking this rule. A void * can also hold the address of any value, but it has to be converted to another pointer type before it can be dereferenced.
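When the bytes of one type really do need to be read as another type, a common way to stay within the aliasing rule in C is to copy the bytes with memcpy instead of casting the pointer. The sketch below is one illustration of that idiom, not something this post otherwise relies on; float_bits is a hypothetical name, and it assumes float is 32 bits wide.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Return the bytes of f reinterpreted as a 32-bit unsigned number. */
uint32_t float_bits(float f)
{
    uint32_t bits;
    assert(sizeof bits == sizeof f);   /* assumption: 32-bit float */
    memcpy(&bits, &f, sizeof bits);    /* copy bytes; no pointer of the wrong type is dereferenced */
    return bits;
}
```

On platforms that use the IEEE 754 format for float, float_bits(1.0f) is 0x3F800000.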

In Mlang, no strict pointer aliasing rule exists and the associated optimizations are not made on purpose; the programmer must perform the relevant optimization manually or make it explicit to the compiler that they desire a particular optimization. In our experience, this makes translation from the language to assembly a lot more predictable. We prefer the user to have more control over the compiler than have the compiler assume it can always do better than the user. The compiler is only a tool.

Pointing to a List of Variable-Sized Values

Instead of numbers, consider trying to store characters in memory consecutively, which can then be interpreted together as representing a human-readable sentence.

char *s = "Hello";

s is a pointer to a character, and when the "Hello" expression is evaluated, it returns a character pointer to the start location of the string. In particular, it returns the address of the "H" in "Hello".

ADDR    NAME    VAL
0       N/A     +---------+
                |    H    |
                |    e    |
                |    l    |
                |    l    |
                |    o    |
                |   \0    |
                +---------+
6       N/A     N/A
7       N/A     N/A
8       s       +---------+
                |    0    |
                +---------+
16      N/A     N/A
20      N/A     N/A
...     ...     ...

The memory for "Hello" is stored starting at address 0, and goes on until address 5. This is one byte more than the five letters would suggest, because in languages like C, strings declared in the source code automatically get a NUL byte (the '\0' character, which has the value 0 in ASCII) appended to them. We adopt strings of this format (often called "C-strings" because of their ubiquity in the C language) for this lesson.

The actual pointer s starts at address 8 (see the section "Aside: Alignment" for why s doesn't start at address 6, which is unused in this example). The value it contains is address 0, which points to the "H" character.

Now, in order to access the remaining characters in the string, we can simply increment s by 1, access the character, and repeat, until we reach the special NUL byte character '\0', which indicates that we have reached the end of the string. Indeed, this is exactly the technique used to get the total length of a C-string, because the length is not stored anywhere else.
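The walk described above can be sketched as a length function. The C standard library already provides this as strlen; the version below, under a made-up name, only spells out the steps.

```c
#include <assert.h>
#include <stddef.h>

/* Count the characters before the terminating NUL byte. */
size_t cstring_length(const char *s)
{
    assert(s != NULL);
    size_t n = 0;
    while (*s != '\0') {   /* stop at the NUL byte */
        s += 1;            /* char is 1 byte, so this adds exactly 1 address */
        n += 1;
    }
    return n;
}
```

For "Hello" this returns 5: the NUL byte marks the end but is not counted as part of the length.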

Now that we understand how to point to strings, it's natural that as our programs have more use cases, they will require multiple strings that are somehow related, and must be stored "together" in some way, but still remain separate strings.

One methodology for doing this is to store all the strings that we feel are related next to each other in memory, just as we did with numbers. Here is one weird trick one can pull to achieve this:

char *s = "Hi\0Bye";

ADDR    NAME    VAL
0       N/A     +---------+
                |    H    |
                |    i    |
                |   \0    |
                +---------+
3       N/A     +---------+
                |    B    |
                |    y    |
                |    e    |
                |   \0    |
                +---------+
7       N/A     N/A
8       s       +---------+
                |    0    |
                +---------+
16      N/A     N/A
20      N/A     N/A
24      N/A     N/A
28      N/A     N/A
...     ...     ...

Here, we manually put a NUL byte between "Hi" and "Bye", making them look like 2 C-strings in memory, and having a pointer s point to "H". s can be used to iterate through the "list of strings" in this way.
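Stepping from one packed string to the next means skipping its characters plus its NUL byte. A minimal sketch, with a hypothetical function name, that assumes the caller already knows another string follows (nothing in memory says how many strings there are):

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Given the start of one C-string in a packed buffer, return the
 * start of the string stored right after it. */
const char *next_packed(const char *s)
{
    assert(s != NULL);
    return s + strlen(s) + 1;   /* skip the characters and the NUL byte */
}
```

With const char *s = "Hi\0Bye";, next_packed(s) is s + 3, the address of the 'B'. The need for the caller to track the number of strings separately is one of the reasons this layout does not scale well.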

This is obviously problematic; as our string quantities and sizes increase, this strategy becomes unmanageable in practice. Even more, what happens if you wanted to dynamically grow the size of one of these strings? For example, what if "Hi" needs to be converted to "Hi, brother"?

So instead, let us consider a more general format. Instead of requiring that the string content itself gets stored next to each other in memory, we instead store a list of character pointers together in memory, and have them each point to one unique string which could be stored anywhere and managed separately.

For example:

char *s1 = "Hi";
char *s2 = "Bye";
char **p = &s1;

ADDR    NAME    VAL
0       N/A     +---------+
                |    H    |
                |    i    |
                |   \0    |
                +---------+
8       s1      +---------+
                |    0    |
                +---------+
16      s2      +---------+
                |   44    |
                +---------+
24      p       +---------+
                |    8    |
                +---------+
32      N/A     N/A
36      N/A     N/A
40      N/A     N/A
44      N/A     +---------+
                |    B    |
                |    y    |
                |    e    |
                |   \0    |
                +---------+
48      N/A     N/A
52      N/A     N/A
...     ...     ...

Here, we assume that s1 and s2 are stored contiguously in memory just as if they were in an array together. p is stored after them. The two strings, "Hi" and "Bye", are stored far apart from each other. s1 and s2 contain the addresses of these two strings, and p points to s1.

What p essentially is now is a "pointer to an array of character pointers". Each of the character pointers (s1 and s2) is itself a "pointer to a character array". Using p, we can access both "Hi" and "Bye":

char *s;

// s will be equal to `s1`.
s = *p;
ASSERT_EQ(*(s+0), 'H');
ASSERT_EQ(*(s+1), 'i');
ASSERT_EQ(*(s+2), '\0');

// s will be equal to `s2`.
s = *(p+1);
ASSERT_EQ(*(s+0), 'B');
ASSERT_EQ(*(s+1), 'y');
ASSERT_EQ(*(s+2), 'e');
ASSERT_EQ(*(s+3), '\0');

s is a character pointer that is being used to hold the addresses stored in s1 and s2 one-by-one. When it is first set to the address in s1, it is dereferenced using offsets of 0, 1 and 2 to assert that the value at each address is 'H', 'i' and '\0' respectively.

The same is then done when s is changed to hold the address in s2.

Note that the syntax of the form *(<ptr>+<number>), such as *(s+1), is the longer form version of what is found in C as s[1]. The expression s[1] actually expands into *(s+1), because it is doing exactly that: 1 is being added to s (since s is a character pointer, the actual address gets increased by 1 only) and then the resulting address is dereferenced to get the value stored at that address, which is a character in one of our strings.
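In real C code, the character pointers would be placed in an array so that they are guaranteed to be contiguous. The sketch below, using an invented function name, reaches a single character through two levels of pointers with the long-form syntax:

```c
#include <assert.h>
#include <stddef.h>

/* Return character j of string i, using explicit pointer arithmetic. */
char char_at(const char **p, size_t i, size_t j)
{
    assert(p != NULL);
    const char *s = *(p + i);   /* same as p[i]: pick one character pointer */
    return *(s + j);            /* same as s[j]: pick one character */
}
```

With const char *strings[] = {"Hi", "Bye"};, the call char_at(strings, 1, 2) gives 'e', which is the same value as strings[1][2].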

Aside: Alignment

Consider having 3 numbers in memory, two 8-bit numbers and one 32-bit number, laid out as follows:

ADDR    NAME    VAL
0       N/A     +---------+
                |    5    |
                +---------+
1       N/A     +---------+
                |    9    |
                +---------+
2       N/A     +---------+
                |   999   |
                +---------+
6       N/A     N/A
8       N/A     N/A
12      N/A     N/A
16      N/A     N/A
...     ...     ...

When retrieving the 8-bit number 5 at address 0, or the 8-bit number 9 at address 1, everything is alright and works as expected when using assembly instructions appropriate for that size of access.

But, depending upon the CPU ISA, trying to access the 32-bit number at address 2 can cause a program to crash, or make the load slower than the other loads, or make it lose atomicity. If the number was "aligned", this would not happen.

A value is "aligned" in memory if it is stored starting at an address that is divisible by some constant, where that constant is usually equal to the size of the value in bytes. If that is indeed the constant in question, we say that the value is "naturally aligned", or has "natural alignment".

So, the reason why accessing the first two 8-bit numbers works is because their size in bytes is 1, and both addresses 0 and 1 are considered divisible by 1. They are therefore naturally aligned. The reason accessing the 32-bit number fails or is slow or loses atomicity is because its size in bytes is 4, and 2 is not divisible by 4, so it is unaligned. Valid addresses would be, for example, 0, 4, 8, 12, 16, etc.

In short, when a value is stored at an address that is divisible by its size in bytes, it is naturally aligned and none of these issues will occur.
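The divisibility test can be written down directly. This is a sketch with made-up function names; converting a pointer to uintptr_t is assumed to yield the plain numeric address, which holds on the flat-memory systems discussed in this post.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* True when addr is divisible by alignment. */
bool is_aligned(uintptr_t addr, size_t alignment)
{
    assert(alignment != 0);
    return addr % alignment == 0;
}

/* Same check for a real pointer, via its numeric address. */
bool pointer_is_aligned(const void *p, size_t alignment)
{
    return is_aligned((uintptr_t)p, alignment);
}
```

For the 32-bit number in the example, is_aligned(2, 4) is false, while is_aligned(0, 4), is_aligned(4, 4) and is_aligned(8, 4) are all true. Any address passes the check for a 1-byte value.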

The reason for this seeming anomaly is a detail of how CPUs and memory chips work, and is out of scope for this document: just understand that it is required at the hardware level, so in software, we make sure to align our values.

Because of this, if we had a structure that looks like this:

struct {
    u8_t a;
    u8_t b;
    u32_t c;
};

The correct memory layout would look like this after taking alignment into account:

ADDR    NAME    VAL
0       a       +---------+ <---+ start of struct
                |    5    |     |
                +---------+     |
1       b       +---------+     |
                |    9    |     |
                +---------+     |
2       N/A     N/A             |
3       N/A     N/A             |
4       c       +---------+     |
                |   999   |     |
                +---------+ <---+ end of struct
8       N/A     N/A
12      N/A     N/A
16      N/A     N/A
...     ...     ...

We say that addresses 2 and 3 (i.e. the two extra bytes between b and c) are "padding" bytes - the compiler will automatically add this padding wherever the structure is used.
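The padding can be measured in C with offsetof and sizeof. In this sketch the structure is given a made-up name, and uint8_t and uint32_t from stdint.h stand in for u8_t and u32_t.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct two_bytes_and_word {
    uint8_t  a;
    uint8_t  b;
    uint32_t c;
};

/* Number of padding bytes the compiler placed between b and c. */
size_t padding_before_c(void)
{
    size_t end_of_b   = offsetof(struct two_bytes_and_word, b) + sizeof(uint8_t);
    size_t start_of_c = offsetof(struct two_bytes_and_word, c);
    assert(start_of_c >= end_of_b);
    return start_of_c - end_of_b;
}
```

On typical platforms where a 32-bit number is aligned to 4 bytes, c starts at offset 4, padding_before_c() returns 2, and sizeof(struct two_bytes_and_word) is 8, matching the diagram.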

On x64 systems, ordinary loads and stores will not crash the program if values are unaligned (a few instructions, such as the aligned vector loads, are the exception), but access can be costlier and atomicity is no longer guaranteed. In some cases, an engineer may make the decision that these costs are worth paying in order to avoid adding padding.

For example, if the engineer is faced with a use case where memory usage is the costliest factor, and losing atomicity and having slightly slower accesses is worth the trade, then the compiler can be told to purposely not add padding to achieve alignment.

Obviously, it is the default to always align values, whether on the heap, stack or anywhere in memory. But say, for example, you have a system that stores millions of the following structures in memory, but accesses them very infrequently and always in a single-threaded fashion:

struct A {
    u8_t a;     // 1 byte
    u64_t b;    // 8 bytes
};              // = 9 bytes total?

struct B {
    u64_t a;    // 8 bytes
    u8_t b;     // 1 byte
};              // = 9 bytes total?

If alignment is required, then both structures will consume 16 bytes of memory; structure A will need an extra 7 bytes between a and b, and structure B will add an extra 7 after b so that, when several instances sit next to each other (as in an array), the a of every instance stays aligned.

If alignment is turned off for both structures, then both take 9 bytes and look in memory exactly as they're written in the code. However, note that values in memory which surround instances of structure A and B will still have to be aligned themselves (unless they too were allowed to be unaligned).
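How padding gets turned off depends on the compiler. The sketch below uses the packed attribute, which is a GCC and Clang extension rather than standard C; the structure and function names are made up, and uint8_t and uint64_t stand in for u8_t and u64_t.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct a_aligned {
    uint8_t  a;
    uint64_t b;
};

/* GCC/Clang extension: lay the fields out with no padding at all. */
struct __attribute__((packed)) a_packed {
    uint8_t  a;
    uint64_t b;
};

/* Bytes saved per structure by dropping the padding. */
size_t bytes_saved_per_struct(void)
{
    assert(sizeof(struct a_aligned) >= sizeof(struct a_packed));
    return sizeof(struct a_aligned) - sizeof(struct a_packed);
}
```

The packed version is 9 bytes with b at offset 1. On a typical 64-bit system the aligned version is 16 bytes, so bytes_saved_per_struct() returns 7.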

The engineer faced with this dilemma in such a memory-intensive system may opt for allowing these structures to be unaligned, thus saving 7 of every 16 bytes per structure, which is a 43.75% memory savings (7 / 16 = 0.4375) - a big amount if we're talking about having gigabytes worth of these structures! If 20 GiB is used by these structures when aligned, the unaligned versions will consume 20 × 9/16 = 11.25 GiB instead.