DEV Community: Zhijun

From reversed linked-list to registers, memory and assembly

Zhijun — Sun, 01 Jun 2025 13:44:49 +0000

Reversing a linked-list is like the "Hello World" from programmers. In this article I'm gonna use it as an example (written below in C) to showcase how C code gets translated into assembly, with a specific focus on stack frames, register usage, memory layout, and the power of the popular debugging tool gdb.

#include <stdio.h>

typedef struct Node {
    int val;
    int id;
    struct Node* next;
} Node;

// Reverses a singly linked list and returns the new head
Node* reverse(Node* head) {
    Node* prev = NULL;
    Node* curr = head;
    Node* next;

    while (curr != NULL) {
        next = curr->next;
        curr->next = prev;
        prev = curr;
        curr = next;
    }

    return prev;
}

// Prints the list
void print_list(Node* head) {
    while (head != NULL) {
        printf("Node ID: %d, Value: %d\n", head->id, head->val);
        head = head->next;
    }
}

int main() {
    // Declare four nodes on the stack and manually connect them
    Node node_4 = {40, 4, NULL};
    Node node_3 = {30, 3, &node_4};
    Node node_2 = {20, 2, &node_3};
    Node node_1 = {10, 1, &node_2};

    printf("Original list:\n");
    print_list(&node_1);

    // Reverse the list
    Node* new_head = reverse(&node_1);

    printf("\nReversed list:\n");
    print_list(new_head);
    return 0;
}

The code above should be pretty straightforward. Before we dig into this piece of code, let's refresh our mind with some basic knowledge on CPU registers, memory and assembly.

Registers and Memory in a 64-bit World

When writing low-level programs like C, the compiler transforms our code into assembly instructions that run on the CPU. Each CPU has a small set of registers, which are fast-access storage units. On a 64-bit machine, these registers are 64 bits wide — i.e., they can hold a number as large as 2⁶⁴.

Computer memory is essentially a giant array of bytes. Every byte in memory has a unique address (think of it like array index), which is also a 64-bit number on modern architectures. So, each register can natually point to a particular memory address.

Notice that the values in the registers don't always have to be memory address. They can also simply be numbers like -5, 42, etc. The CPU doesn't care about the meaning—it just sees binary values. It's up to us (and the compiler) to interpret them correctly. This dual interpretation — value or address — is what makes assembly programming flexible yet confusing to beginners.

Essentially all data inside a machine is just bits. When we see two bytes 0x48 0x69, we can either intepret them as string "Hi", or as decimal number 18537, and the computer itself wouldn't know the difference. Note that the prefix '0x' means this a hex number (0x10 is 16 in decimal). You'll run into hex numbers all the time in the low-level world, because memory addresses are 64-bit and decimal is just too clumsy - you would definitely prefer using 0x401256to 010 000 000 001 001 001 010 110 to denote an address. Hex is compact and maps neatly to binary.

There are 16 general-purpose registers in x86-64 systems:

%rax, %rbx, %rcx, %rdx,
%rsi, %rdi, %rbp, %rsp,
%r8 through %r15

Some registers have conventional purposes:

%rsp: stack pointer, always points to the top of the stack
%rax: return value from a function
%rdi, %rsi, %rdx, %rcx, %r8, %r9: used to pass the first 6 function arguments

Common Assembly Instructions

Assembly instructions manipulate the data between registers and memory. These instructions are where we begin to see how higher-level constructs like linked lists are implemented under the hood.

Here are a few of the most common instructions you'll run into all the time. These might seem overwhelming for now, but don't worry, we will see them work in action later.

mov: Copy value from source to destination (register-to-register, memory-to-register, etc.). For example, say register %rax is now 0x402000
- mov $0x28, %rax updates register %rax to value 0x28.
- mov $0x28, (%rax) puts integer 0x28 into memory starting at address 0x402000
lea: Load effective address. Usually used to perform simple arthmetic operations.
- Say %rax is 0x402000, and %rbx is 0x2. Instruction lea 0x1(%rax, %rbx, 4), %rdx sets %rdx = 0x402000 + 2 * 4 + 0x1 = 0x402009.
call: Push return address onto stack and jump to the function
ret: Pop return address and jump back to where the function call was made
push: Decrease %rsp and store a value at the new top of the stack
pop: Opposite of push: Load value from top of stack and increase %rsp
sub, add: Perform arithmetic (often used to grow/shrink stack frame)
cmp: Compare two values
jmp, je, jne, jg, etc.: Control flow via unconditional or conditional jumps. Usually used together with cmp instruction, for example, cmp $0x3, %rax
- je equal: %rax == 3
- jne not equak: %rax != 3
- jg greater: %rax > 3

Dissecting the main Function

I use the following code to compile the C file and then disassemble it:

gcc -g -no-pie -Og -o linked_list linked_list.c
objdump -d --no-show-raw-insn linked_list > linked_list.s

Let's look at the disassembled version of our C program’s main function. I've added lots of annotations to help reading through. At each line, the leftmost hex number is the address of the instruction.

00000000004011c5 <main>:
  4011c5:   endbr64                    # Control-flow enforcement. Let's ignore it
  4011c9:   push   %rbx                  # Register rbx will be modified in the process. We push it onto the stack so that we can retrieve later 
  4011ca:   sub    $0x50,%rsp          # Move the top of the stack to allocate 80 bytes on stack(0x50 is 80 in decimal) 
  4011ce:   mov    %fs:0x28,%rax       
  4011d7:   mov    %rax,0x48(%rsp)     # Put special value in the stack to detect stack buffer overflows. We can ignore for now.
  4011dc:   xor    %eax,%eax           # Clear %eax

# === Initialize 4 linked list nodes ===
  4011de:   movl   $0x28,(%rsp)        # node4.val = 40
  4011e5:   movl   $0x4,0x4(%rsp)      # node4.id = 4
  4011ed:   movq   $0x0,0x8(%rsp)      # node4.next = NULL

  4011f6:   movl   $0x1e,0x10(%rsp)    # node3.val = 30
  4011fe:   movl   $0x3,0x14(%rsp)     # node3.id = 3
  401206:   mov    %rsp,%rax           # address of node4
  401209:   mov    %rax,0x18(%rsp)     # node3.next = &node4

  40120e:   movl   $0x14,0x20(%rsp)    # node2.val = 20
  401216:   movl   $0x2,0x24(%rsp)     # node2.id = 2
  40121e:   lea    0x10(%rsp),%rax     # address of node3
  401223:   mov    %rax,0x28(%rsp)     # node2.next = &node3

  401228:   movl   $0xa,0x30(%rsp)     # node1.val = 10
  401230:   movl   $0x1,0x34(%rsp)     # node1.id = 1
  401238:   lea    0x20(%rsp),%rax     # address of node2
  40123d:   mov    %rax,0x38(%rsp)     # node1.next = &node2

# === Print Original List ===
  401242:   lea    0xdd3(%rip),%rdi    # Load "Original list:" format string
  401249:   call   401060 <puts@plt>   # Print string to our screen

  40124e:   lea    0x30(%rsp),%rbx     # Set rbp to point to node1
  401253:   mov    %rbx,%rdi           # Set pointer to node1 as the first argument to function call `print_list` so that we're ready for the next line
  401256:   call   401195 <print_list>

# === Reverse the List ===
  40125b:   mov    %rbx,%rdi           # first arg: head node
  40125e:   call   401176 <reverse>    # returns new head
  401263:   mov    %rax,%rbx           # save new head

# === Print Reversed List ===
  401266:   lea    0xdbe(%rip),%rdi    # Load "Reversed list:" string
  40126d:   call   401060 <puts@plt>

  401272:   mov    %rbx,%rdi
  401275:   call   401195 <print_list>

# === Stack Canary Check ===
  40127a:   mov    0x48(%rsp),%rax
  40127f:   sub    %fs:0x28,%rax
  401288:   jne    401295 <main+0xd0>   # if canary changed → stack overflow → crash

# === Return Normally ===
  40128a:   mov    $0x0,%eax           # 0 is the success return code
  40128f:   add    $0x50,%rsp          # Clean up stack frame
  401293:   pop    %rbx                # Retrieve the original value of register rbx
  401294:   ret                        # Return 0
  401295:   call   401070 <__stack_chk_fail@plt>  # stack corrupted → crash

A very important concept is stack frame - a block of memory to use whenever a function is called. It stores everything that the function needs while it runs, such as return address (where to jump back after the function finishes), local variables, function arguments (if passed via the stack) and sometimes, saved registers (like %rbp, %rbx) to restore later. It’s created at the start of a function and destroyed at the end.

The register %rsp is the top of the stack frame, and is extremely important since it marks the boundary of the stack frame. When we say the stack frame is destroyed, we don't mean physically destroying the hardware. Instead, it means the previous chunk of memory is no longered considered part of the current stack frame, and other programs are now allowed to write new values to this memory space to overwrite the existing ones.

Now let’s run GDB (GNU Debugger) to execute the program. add a breakpoint to address 401256 (you can find this number in the disassembled code above), right before calling print_list function.

$ gdb linked_list
GNU gdb (Ubuntu 12.1-0ubuntu1~22.04.2) 12.1
Copyright (C) 2022 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
(gdb) break *0x401256
(gdb) run
Starting program: /home/ubuntu/linked_list 
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".

Now let’s examine the stack frame. The GDB command to examine memory (a very powerful tool!) is x/<count><format><size> <address>.

codes for [format]: x = hex, d = decimal, u = unsigned, c = char, s = string
codes for [size]: b = byte (1 byte), h = halfword (2 bytes), w = word (4 bytes), g = giant (8 bytes)

For example, the following code examines 20 words (4 bytes each) at the current stack pointer:

(gdb) x/20wx $rsp
0x7fffffffe350: 0x00000028      0x00000004      0x00000000      0x00000000
0x7fffffffe360: 0x0000001e      0x00000003      0xffffe350      0x00007fff
0x7fffffffe370: 0x00000014      0x00000002      0xffffe360      0x00007fff
0x7fffffffe380: 0x0000000a      0x00000001      0xffffe370      0x00007fff
0x7fffffffe390: 0x00000000      0x00000000      0xfbeca800      0x9629f2d4

The first 4 lines correspond to the 4 linked-list nodes we just created, and we can clearly see each node’s value, id and the pointer to next node. Each node consumes 16bytes (two integers, 4 bytes for each, and a 8-byte pointer), and that’s why their memory addresses all differ by 0x10.

Here’s how the stack looks during execution.

|------------------------| <- at 0x50(%rsp), 0x7fffffffe3a0
| stack overflow check   | 
|------------------------| 
|                        | 
|------------------------| <- at 0x40(%rsp), 0x7fffffffe390
| node1 (next)           | 
|------------------------| <- at 0x38(%rsp), 0x7fffffffe368
| node1 (val, id)        | 
|------------------------| <- at 0x30(%rsp), 0x7fffffffe370
| node2 (next)           | 
|------------------------| <- at 0x28(%rsp), 0x7fffffffe368
| node2 (val, id)        | 
|------------------------| <- at 0x20(%rsp), 0x7fffffffe370
| node3 (next)           | 
|------------------------| <- at 0x18(%rsp), 0x7fffffffe368
| node3 (val, id)        | 
|------------------------| <- at 0x10(%rsp), 0x7fffffffe360
| node4 (next)           | 
|------------------------| <- at 0x8(%rsp), 0x7fffffffe358
| node4 (val, id)        | 
|------------------------| <- %rsp, 0x7fffffffe350

The stack frame contains exactly 0x50 bytes of memory, as allocated by the instruction 4011ca: sub $0x50,%rsp above. You may notice that the 8 bytes at address 0x40(%rsp) are simply unused, and you may wonder why did we bother to allocate space for it. On x86-64, the stack must be aligned to 16 bytes before a call instruction, and we need those “0x00” bytes for padding.

Dissecting the print_list Function

Now let's resume from we set the breakpoint, and dig into the print_list function.

0000000000401195 <print_list>:
  401195:   endbr64

# === Function Prologue ===
  401199:   push   %rbx            # Save %rbx
  40119a:   mov    %rdi,%rbx       # Move first arg (head pointer) into %rbx

# === Loop Condition ===
  40119d:   jmp    4011be          # Jump to address 4011be for loop condition check

# === Loop Body ===
  40119f:   mov    (%rbx),%ecx     # Load val into %ecx
  4011a1:   mov    0x4(%rbx),%edx  # Load id into %edx
  4011a4:   lea    0xe59(%rip),%rsi # Load the address of format string "Node ID: %d, Value: %d\n" into %rsi
  4011ab:   mov    $0x1,%edi       # First arg to printf (fd = stdout)
  4011b0:   mov    $0x0,%eax       # Clear %eax before variadic call
  4011b5:   call   401080 <__printf_chk@plt>  # secure version of function printf

  4011ba:   mov    0x8(%rbx),%rbx  # Move to next node (follow .next)

# === Loop Condition ===
  4011be:   test   %rbx,%rbx       # Check if current node is NULL
  4011c1:   jne    40119f          # If not NULL, jump to address 40119f to continue loop

# === Function Epilogue ===
  4011c3:   pop    %rbx            # Restore %rbx
  4011c4:   ret                    # Return

We can add a second breakpoint to the function to examine the stack

(gdb) break *0x4011ba
Breakpoint 2 at 0x4011ba: file linked_list.c, line 30.
(gdb) continue
Continuing.
Node ID: 1, Value: 10

Breakpoint 2, print_list (head=head@entry=0x7fffffffe380) at linked_list.c:30
30              head = head->next;
(gdb) x/20wx $rsp
0x7fffffffe340: 0xffffe380      0x00007fff      0x0040125b      0x00000000
0x7fffffffe350: 0x00000028      0x00000004      0x00000000      0x00000000
0x7fffffffe360: 0x0000001e      0x00000003      0xffffe350      0x00007fff

We notice that the stack pointer register %rsp has been changed from 0x7fffffffe350 to 0x7fffffffe340, even though the print_list function doesn't allocate any space to the stack frame. So, why did the stack frame grow by 0x10 (16bytes)?

In the main function, when we execute instruction 401256: call 401195 <print_list>, we push the address of next instruction 40125b: mov %rbx,%rdi onto the stack, so that when function call print_list finishes, we can return to the correct location of main function and continue executing. This return addreses takes 8 bytes.
Inside the print_list function , we run 401199: push %rbx to push the value of register %rbx onto the stack. That takes another 8 bytes.

So, even though register %rsp isn't directly called, it was modified by instructions call and push. Notice that instructions ret and pop are counterparts of them, and will do the opposite - pop data out of the stack.

Dissecting the reverse Function

Now let's move on to the last function reverse.

0000000000401176 <reverse>:
  401176:   endbr64

# === Function Prologue ===
  40117a:   mov    $0x0,%eax       # Set return register %rax = NULL (this will track the new head)

# === Loop Condition (initial jump) ===
  40117f:   jmp    40118f          # Jump to loop condition before first iteration

# === Loop Body ===
  401181:   mov    0x8(%rdi),%rdx  # Load next node: %rdx = current->next
  401185:   mov    %rax,0x8(%rdi)  # Reverse the link: current->next = previous
  401189:   mov    %rdi,%rax       # Update previous: previous = current
  40118c:   mov    %rdx,%rdi       # Move to next: current = next

# === Loop Condition ===
  40118f:   test   %rdi,%rdi       # Check if current (now in %rdi) is NULL
  401192:   jne    401181          # If not, continue the loop

# === Function Epilogue ===
  401194:   ret                    # Return; %rax holds new head of reversed list

This piece of assembly should be rather straightforward now.

One thing interesting here is that in my C function reverse , I declare 3 local variables prev, curr and next, but in the assembly code there isn't any stack space allocated for them. It's because the compiler tries to optimize away unnecessary memory operations, and it's smart enough to realize that we don't actually need any additional memory - simply using registers to update pointers is enough. And this process is just beautiful.

Thanks

I just started learning fundamentals about machines recently, although I've been a developer for many years. This article is my personal attempt to internalize my learnings and share them with others. If you're just getting into the low-level world, I hope you're just as excited as I'm.

One last note: I'm using my M2 Macbook Air for daily task and it doesn't work well with GDB and all those x86-64 stuff. I ended up getting an AWS ECS Ubuntu t2.micro machine.

$ uname -a
Linux ip-172-31-18-227 6.8.0-1024-aws #26~22.04.1-Ubuntu SMP Wed Feb 19 06:54:57 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
$ gcc --version
gcc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0

Distributed Systems | What can we learn from Roblox's 3-day outage?

Zhijun — Sun, 01 Jun 2025 13:41:40 +0000

In October 2021, Roblox suffered the longest outage in its history—73 hours of complete downtime, affecting millions of players worldwide. The root cause? A subtle yet devastating issue buried deep inside an outdated database structure. What makes this case fascinating is that the issue wasn’t an obvious system failure or an external attack—it was a slow-burning technical debt hidden inside the database.

I recently did a deep dive into the post-mortem and gained valuable insights. This is arguably one of the most detailed and thorough outage reports available online.

🌐 Background:

Roblox employs a microservices architecture for its backend, using HashiCorp’s Consul for service discovery, enabling internal services to locate and communicate with each other. On the afternoon of October 28th, a single Consul server experienced high CPU load. The Consul cluster's performance continued to degrade, ultimately bringing down the entire Roblox system since Consul acted as a single point of failure. Engineers from both Roblox and HashiCorp collaborated to diagnose and resolve the issue.

🔧 Debugging Attempts:

Suspected Hardware Failure: The team replaced one of the Consul cluster nodes, but the issue persisted.
Suspected Increased Traffic: The team replaced all Consul cluster nodes with more powerful machines featuring 128 cores (2x increase) and faster NVMe SSD disks. This also did not resolve the issue.
Resetting Consul’s State: The team shut down the entire Consul cluster and restored its state using a snapshot from a few hours before the outage began. Initially, the system appeared stable, but it soon degraded again, returning to an unhealthy state.
Reducing Consul Usage: Roblox services that typically had hundreds of instances running were scaled down to single digits. This approach provided temporary relief for a few hours before Consul again became unhealthy.
Identifying Resource Contention Issues: Upon deeper inspection of debug logs, the team discovered resource contention problems. They reverted to machines similar to those used before the outage. Eventually, they identified the issue: Consul’s new streaming feature. This feature utilized fewer concurrency control elements (Go channels), which led to excessive contention on a single Go channel under high read/write loads. Disabling streaming dramatically improved the Consul cluster’s health.
Leader Election Optimization: The team observed Consul intermittently electing new cluster leaders, which was normal. However, some leaders exhibited the same latency issues. The team worked around this by preventing problematic leaders from staying elected.

✅ Following these measures, the system was finally stable. The team carefully brought Roblox back online by restoring caching systems and gradually allowing randomly selected players to reconnect. After 73 hours, Roblox was fully operational.

🔍 BoltDB’s Freelist: A Silent Culprit

The root cause of the outage stemmed from two key issues: the Consul streaming feature (as mentioned above) and Consul’s underlying database BoltDB exhibiting severe performance degradation, which I find pretty interesting.

Consul uses Raft consensus for leader election to ensure data consistency in a distributed environment. For persistence, it relies on BoltDB, a popular embedded key-value store, to store Raft logs.

Like many databases, BoltDB has a freelist, which tracks free pages—disk space that was previously occupied but is now available for reuse. This mechanism is crucial for efficient database performance, preventing unnecessary disk growth and optimizing read/write operations.

However, BoltDB’s freelist implementation had a critical inefficiency (see source code here). It used an array to store the ID of each free page, meaning that every database read/write operation involved a linear scan of the freelist. As the freelist grew large, the cost of operations increased significantly.

📈 Interestingly, this performance issue was first reported in 2016 (GitHub Issue), but it was never fixed. The author of BoltDB, Ben Johnson, stopped maintaining the project in 2017, stating:

"Maintaining an open-source database requires an immense amount of time and energy. Changes to the code can have unintended and sometimes catastrophic effects, so even simple changes require hours of careful testing and validation.

Unfortunately, I no longer have the time or energy to continue this work. Bolt is in a stable state and has years of successful production use. As such, I feel that leaving it in its current state is the most prudent course of action

Although BoltDB was no longer maintained, the Go community forked it into a new project called bbolt (bbolt GitHub) to continue active maintenance and add new features. Unfortunately, Consul was still using the outdated, unmaintained version of BoltDB.

In 2019, the freelist performance issue was finally resolved in bbolt (Alibaba Cloud Blog). The fix was straightforward: using a hashmap instead of an array, reducing the linear scan overhead and allowing instant lookups. ( 🚀 I love how a simple idea brings huge performance boost!)

Since this fix was committed to bbolt—not BoltDB—Consul did not benefit from the improvement, ultimately leading to the three-day Roblox outage in 2021.

🤔 Unanswered Questions

This post has covered a lot, but there’s only so much that can fit into a single post. As an engineer, I find myself eager to explore further details. Several intriguing questions remain unanswered:

Why didn’t Roblox roll back Consul’s streaming feature sooner?

Given that Consul was clearly the culprit early on and a significant change had just been made to its infrastructure, rolling back should have been one of the first things attempted. What factors delayed this decision?

Why did only some Consul servers experience the BoltDB freelist performance issue?

In theory, all servers should have been in a similar state since the leader is usually ahead of its followers only by a small margin. Yet, only some instances suffered severe degradation. What caused this inconsistency?

Why didn’t restoring Consul’s state using a previous snapshot fix the issue?

My hypothesis is that restoring Consul’s state did not reset BoltDB’s underlying raft.db file on each server, meaning the bloated freelist persisted even after the rollback. If true, this suggests snapshots do not include critical optimizations for internal database structures.

Why did reducing Consul usage work temporarily before failing again?

If the freelist was already too large, reducing usage shouldn’t have provided any relief. Did scaling down slow the growth of the freelist temporarily, delaying the inevitable, or was another factor at play?

Why did the new streaming feature work for a day before the outage occurred?

If the new Consul streaming feature was inherently flawed, why didn’t the system immediately degrade? Was there an initial buffer that temporarily masked the issue, or did a specific traffic pattern trigger the breakdown?

Since the BoltDB freelist performance issue has been around for years, why didn’t Roblox experience such system performance degradation in earlier months?

BoltDB’s freelist inefficiency had been a known issue since 2016. What changed in Roblox’s workload or data structure that made this issue surface now? Did the new Consul streaming feature exacerbate the problem by dramatically increasing write operations to BoltDB?

💡 Endnotes

This post-mortem report offers invaluable lessons, and I highly encourage everyone to check it out! There are also extensive discussions about this outage on Hacker News, where BoltDB's author, Ben Johnson, also participated in the conversations.

As a software engineer, I firmly believe that a key trait of a great engineer is the ability to efficiently navigate large, complex systems and diagnose issues under pressure. I deeply admire and respect the engineers from Roblox and HashiCorp, who worked tirelessly under immense pressure to investigate and resolve the issue. Hats off to them for their resilience and expertise.

Thank you for reading this far! If you found this post useful, I’d truly appreciate it if you could share it with others.