Paula Gearon

Posted on Aug 18, 2019 • Edited on Apr 22, 2020

A Simple Structure in Binary

Part 3 of this story that is taking too long to get to the point. The previous part can be found here, or you can start at the beginning.

Using C

C introduced many of the concepts that later evolved into what we see today in languages like C++, Java, Javascript, Python and Ruby.

To be more accurate, C was the first language that exposed a majority of programmers to these concepts. The history of C can be traced back through language B (programmers apparently had better things on their mind than what to name their languages) which can trace it's lineage back to BCPL and ALGOL. But I'm just telling you these things to sound smart since I've never actually used any of these older languages. 😊

Just like the Javascript example in the previous post, C can declare a record that holds a name and an age, though with several differences. The most important difference from Javascript is that C needs to specify that the name is a string, and the age is a number.

struct person {
  char* name;
  long age;
};
struct person a_person;
a_person.name = "Mary";
a_person.age = 32;

This looks a bit different, and I should explain some of why that is.

C takes a very literal view of data, laying it out in memory in precisely the way that is described. This provides the programmer with precise knowledge of where everything is in memory. However, it also leaves a lot up to the programmer, making it easier to introduce errors. Modern languages take a lot of the work away from the developer, making the code more flexible and robust, but at the expense of some speed and knowing precisely what the data is doing. Luckily, the cost in speed is insignificant on modern hardware.

The person data structure holds a pointer to a string and a long integer. (The syntax to indicate pointer is the * character). Handling strings is actually awkward in C, since they vary in size, so I want to avoid talking too much about that just yet. For now, the name is not sitting inside of the structure. Instead, the structure holds a pointer to the name string and an integer containing the age.

Unlike other languages, we can actually see the exact layout of this data in memory. The object can be asked for its address, and then this can be treated as a series of bytes (an unsigned char in C) which can be printed in hex:

for (int i = 0; i < sizeof(struct person); i++) {
  printf("%02X ", ((unsigned char*)&a_person)[i]);
}

From a modern programming point of view, this is ghastly. The data structure should be considered a data structure and nothing else. This code is treating it as a series of bytes. It may be informative, but it is also asking for trouble.

But now that we can see the bytes, what do they look like?

21 30 D9 04 9C 7F 00 00 20 00 00 00 00 00 00 00

Let's split that into the parts:

char* name	long age
`21 30 D9 04 9C 7F 00 00`	`20 00 00 00 00 00 00 00`

What are we looking at here? While the name field is referred to as a pointer it is really just a long number. This is the number giving the memory location of the name string (which we know holds "Mary"). The age field holds a number of 32. In hexadecimal, that is represented as 20 (2*16 + 0*1).

age

One unexpected part is that the age has all those zeros after it. It might make more sense if the number were 0000000000000020 but this seems backwards. This has to do with the fact that we're pulling the number apart into individual bytes. The idea here is that as you go to the right, then memory addresses are increasing. The lower addresses have the smaller parts of the number (the 100s, 10s, and 1s), while the higher addresses have the bigger parts (millions, billions, etc). But when we print them from low to high like this, then they look backwards.

This is actually a quirk of Intel CPUs (which Windows and Mac typically run on). They made an explicit choice to represent their numbers this way. Other types of CPUs have chosen to represent numbers the other way around. This was the case with the PowerPC CPUs that Apple used before 2006, as well as many other processors used elsewhere in industry. Computers which store the small part of numbers at lower addresses are called "Little Endian" architectures, while computers which store the large part of the number first are called "Big Endian". This is a silly reference to Gulliver's Travels by Jonathon Swift, where the Lilliputians went to war with the people of Blufuscu over which end of a boiled egg should be broken: the big end or the little end.

We can also see that there are a lot of zeros. These numbers are taking up 8 bytes (64 bits). Is it really necessary to be able to store numbers that are so large they could take up all that space? It would seem not, given that C allows for several smaller sizes of numbers:

int is 4 bytes, or 32 bits.
short is 2 bytes, or 16 bits.
char is 8 bits.

An issue here is that a C compiler for a modern 64-bit CPU ensures that data structures always align on 64-bit/8-byte boundaries. This both makes things faster and ensures that data is always set up to be correctly processed by the CPU since 64-bit numbers can only be loaded into the CPU if they start at an address that is a multiple of 8.

I could have made the age a short, or even a char, but the data would not be any smaller since the structure would need to stretch out to the next 8-byte boundary anyway. The extra bytes would go unused. In fact, I did try this in an earlier iteration, but those unused bytes ended up with random numbers in them, and I thought that would be more confusing for this post.

Have you figured out why most people moved onto more modern languages yet?

name

The number in the name looks like random nonsense, and in a sense it is. In fact, if I run the program again, I get a different number:
21 90 E4 05 98 7F 00 00 20 00 00 00 00 00 00 00

Why is this? It's actually a combination of several things. If a malicious program were running on your computer, and it could guess where your data is going to be, it could read or modify that information maliciously. To mitigate this, some randomness was introduced into memory locations in the early 2000s (this affected kernel addresses more, but hey, I'm being thorough). The more important reason is that most operating systems allow multiple programs to run at once, and in general, they are not supposed to interact with each other. Part of this means that they each get their own view of what memory looks like, and the operating system can decide both where that memory really is, and if it's maybe sitting on the disk somewhere. This is a complex topic called Virtual Memory Management, and not something to go into here. But in case anyone reading this is disappointed, I'll be returning to this topic in future, as it matters a great deal when dealing with memory-mapped file access.

The most important thing to get from this investigation is that the pointer to the string holding the name of the person is actually just a number. Sure, it's a big number, and it's a number that only makes sense to the computer, but it's just a number, and the number specifies a location. I will be returning to this idea soon.

A Linked List

Adapting the Javascript linked list from the previous post, we get this:

struct element {
  struct element* next;
  long value;
};

Since the example is for a list of numbers, I've restricted the values to holding just that type.

The next pointer is to another element structure. It can be used to point to any memory address, including local element objects, or raw blocks of memory that are requested as needed. These are created using the malloc library function, where the required number of bytes for the needed data are allocated:

struct element* list = malloc(sizeof(struct element));

In the above examples, accessing elements of the structure was done with syntax like element.value = 5. However, this time the list variable holds a pointer to an element (a 64-bit number), and not an element. There are 2 different syntaxes to get to the data inside the element:

We can dereference the pointer using the * operator, and then access the field normally: (*list).value
We can use the pointer field access operator ->: list->value

It's usually better to use the pointer field access operator.

In Javascript we used a constructor function that took a newly allocated object and set the values in it. Similarly, we'll use a function that creates the element and sets the values for us, then returns it.

struct element* add_element(struct element* list, long value) {
  struct element* new_head = malloc(sizeof(struct element));
  new_head->next = list;
  new_head->value = value;
  return new_head;
}

So now we can create the list in a similar way:

  struct element* list = add_element(NULL, 34);
  list = add_element(list, 21);
  list = add_element(list, 13);
  list = add_element(list, 8);
  list = add_element(list, 5);
  list = add_element(list, 3);
  list = add_element(list, 2);
  list = add_element(list, 1);
  list = add_element(list, 1);

Note that we use NULL as an invalid address. This is just the number 0, which the CPU knows is an invalid address.

We can see the contents when we print it:

while (list != NULL) {
  printf("%d", list->value);
  list = list->next;
  if (list != NULL) printf(", ");
}
printf("\n");

Why didn't I use recursion to print it? I used recursion in Javascript to build a string, which slowly grew as we proceeded down the list and then printed the final result. This time I'm printing as I go instead of building a string. That's because building strings is awkward in C (I do have code that builds a string, but decided it was too messy for this post). Printing as-you-go could be done in a function that calls itself, but since a value would not be returned then it isn't really what we think of when using recursion.

Running the C program gives us the list:
1, 1, 2, 3, 5, 8, 13, 21, 34

So we know that we have roughly equivalent code to the list in Javascript.

Where to Now?

This was a lot of work just to duplicate something that I already had in Javascript. Why bother?

As we saw with the person structure above, we can see that the structure is laid out as data in memory. Let's print the list as bytes:

while (list != NULL) {
  for (int i = 0; i < sizeof(struct element); i++) {
      printf("%02X ", ((unsigned char*)list)[i]);
  }
  list = list->next;
  printf("\n");
}

struct element* next	long value
`40 D3 64 E1 FF 7F 00 00`	`01 00 00 00 00 00 00 00`
`20 D3 64 E1 FF 7F 00 00`	`01 00 00 00 00 00 00 00`
`00 D3 64 E1 FF 7F 00 00`	`02 00 00 00 00 00 00 00`
`E0 D2 64 E1 FF 7F 00 00`	`03 00 00 00 00 00 00 00`
`C0 D2 64 E1 FF 7F 00 00`	`05 00 00 00 00 00 00 00`
`A0 D2 64 E1 FF 7F 00 00`	`08 00 00 00 00 00 00 00`
`80 D2 64 E1 FF 7F 00 00`	`0D 00 00 00 00 00 00 00`
`60 D2 64 E1 FF 7F 00 00`	`15 00 00 00 00 00 00 00`
`00 00 00 00 00 00 00 00`	`22 00 00 00 00 00 00 00`

An interesting aspect here is how the pointers are working. Going from the end of the list (the first elements that were allocated), we see the second last element (value 21, which is hex 15) pointing to the final element with the number:
60 D2 64 E1 FF 7F 00 00

As a number, we can squish this back together, and put the digits in the correct order:
00007FFFE164D260
We can go through all of these addresses running from the end of the list to the front (ie. in the order allocated):

next
`00007FFFE164D260`
`00007FFFE164D280`
`00007FFFE164D2A0`
`00007FFFE164D2C0`
`00007FFFE164D2E0`
`00007FFFE164D300`
`00007FFFE164D320`
`00007FFFE164D340`

It doesn't include the address of the head of the list, but it follows the same pattern as the others.

Note how each one is exactly 0x20 more than the previous address? That's exactly 32, which happens to be 2 long values more than the size of the structures. Why 2 extra long values? Well, that's up to the memory allocator (in my case, glibc). To be honest, I don't know why this space appears between allocations. I'll confess that I took a peek in there (don't tell anyone: it's illegal to do access memory that you haven't allocated). It looks like it's using 2 longs to hold administrative data. I don't know what type, but some compilers support Run Time Type Identifiers [RTTI] this way, so maybe it's that.

Regardless of how glibc chose to allocate the elements, it could have been packed more tightly together, or much further apart. The only rule is that they have to start on an 8-byte boundary. It's also worth noting that even though the elements were allocated in order, they could have been scattered in random order.

A Different Kind of Memory

In the previous example, each element of the list was allocated using malloc. But that memory could be allocated in another way. What if we allocate a block of memory, and then hand out the allocations ourselves?

Instead of using a malloc at the start of add_element, we will provide the address to use instead. It makes more sense to call this version init_element:

struct element* init_element(struct element* new_head,
                             struct element* list,
                             long value) {
  new_head->next = list;
  new_head->value = value;
  return new_head;
}

Now we just need to start off by allocating a big chunk of memory. In this case, I'll ask for something that is 10 elements in size, and then use each of these in turn. Because the memory is referencing element structs, if we add to the memory address, then the compiler knows to add that many element struct sizes to the address. This lets me just add a number to the pointer.

struct element* memory = malloc(10 * sizeof(struct element));                                                           struct element* list = init_element(memory, NULL, 34);                                                                  list = init_element(memory + 1, list, 21);                                                                              list = init_element(memory + 2, list, 13);                                                                              list = init_element(memory + 3, list, 8);                                                                               list = init_element(memory + 4, list, 5);                                                                               list = init_element(memory + 5, list, 3);                                                                               list = init_element(memory + 6, list, 2);                                                                               list = init_element(memory + 7, list, 1);                                                                               list = init_element(memory + 8, list, 1);                                                                               print_list(list);

This gives exactly the same output as the last time. Let's print the raw data again:

struct element* next	long value
`D0 12 2B F2 FF 7F 00 00`	`01 00 00 00 00 00 00 00`
`C0 12 2B F2 FF 7F 00 00`	`01 00 00 00 00 00 00 00`
`B0 12 2B F2 FF 7F 00 00`	`02 00 00 00 00 00 00 00`
`A0 12 2B F2 FF 7F 00 00`	`03 00 00 00 00 00 00 00`
`90 12 2B F2 FF 7F 00 00`	`05 00 00 00 00 00 00 00`
`80 12 2B F2 FF 7F 00 00`	`08 00 00 00 00 00 00 00`
`70 12 2B F2 FF 7F 00 00`	`0D 00 00 00 00 00 00 00`
`60 12 2B F2 FF 7F 00 00`	`15 00 00 00 00 00 00 00`
`00 00 00 00 00 00 00 00`	`22 00 00 00 00 00 00 00`

Same sort of pattern as we saw when using malloc for each element, but look at the addresses. This time they differ by exactly 0x10 (16 bytes), which is exactly the size of structure. That's what we were hoping for.

The compiler is doing some work here by assuming that the memory pointer is referring to struct element objects. This lets it calculate where objects start, by multiplying any added digits by the size of the structure, and so on. Not every language will necessarily give us this, so we can re-implement it using raw pointers and figuring out where things are on our own. The kind of pointer to use here is void*. This is a special pointer type that the compiler won't check against other pointer types, so it works fine to pass along to the init_element function even though that function expects a pointer to a struct element.

void* memory = malloc(10 * sizeof(struct element));                                                                     struct element* list = init_element(memory, NULL, 34);                                                                  list = init_element(memory + sizeof(struct element), list, 21);                                                         list = init_element(memory + 2 * sizeof(struct element), list, 13);                                                     list = init_element(memory + 3 * sizeof(struct element), list, 8);                                                      list = init_element(memory + 4 * sizeof(struct element), list, 5);                                                      list = init_element(memory + 5 * sizeof(struct element), list, 3);                                                      list = init_element(memory + 6 * sizeof(struct element), list, 2);                                                      list = init_element(memory + 7 * sizeof(struct element), list, 1);                                                      list = init_element(memory + 8 * sizeof(struct element), list, 1);

It's messy, but it's identical.

A Small Tweak

When we are managing our own data like this, then we aren't actually constrained to use memory locations as pointers. For instance, in the above code, there are only 9 objects, and they all fall within the block of memory that the program controls. Since this is the case, then instead of the full memory location, we can instead refer to the offset within the memory. Even better, since everything in that memory is the same size of 16 bytes, then instead of offsets of [0,16,32,48,64,...] we can use [0,1,2,3,4,...] and just multiply them by the size of the objects. One thing to be careful of is that the 0 offset is valid (unlike for pointers), so instead of NULL to indicate the end of the list, we can use a special guard value of -1.

The resulting code needs to tweak every function so far. So let's show the entire program:

Smaller

As a final tweak, since we know the characteristics of the data, we know that we have only a small number of elements, and we know that 64-bit alignment is not necessary, we don't actually need to be using 2 long values at all. We can halve the memory by moving to int values. If we expect small lists that never contain numbers of more than 256, we can even use single-byte unsigned char values. So all references to long can change to unsigned char, and the guard value of -1 will need to be represented as (unsigned char)-1. Here is the new smaller element structure.

struct element {
  unsigned char next;
  unsigned char value;
};

The full version can be seen here.

Recap

What has been shown here is how C uses the struct syntax to represent allocated memory as structured data records. These data records can use pointers to other records, thereby forming more complex data structures, with the provided example being a linked list.

While the allocated memory is usually handled by the system (via malloc) it can also be managed within a single large block of memory. With control of that memory being handled in the program, then the program can use its own identifiers to reference individual objects, rather than memory pointers. This eliminates most of the need and use of memory pointers that C facilitates.

Next Steps

The data in these demonstrations no longer contains pointers. Any references between objects in the data are independent of the memory environment. This allows data structures like linked lists to be internally consistent within that data. This forms the basis of how data can be stored durably, on disk, in the cloud, etc.

While C was used to demonstrate how to make all of this work, now that the need for pointers has been removed, then the pointer manipulation facilities in C are not needed any more. This means that we can return to languages we are more familiar with, and continue from there. We'll start this in part 4.

Latest comments (1)

Patrick Wendo • Jan 1 '23

Computers which store the small part of numbers at lower addresses are called "Little Endian" architectures, while computers which store the large part of the number first are called "Big Endian".

Did not know this. Very interesting.