DEV Community: Prachi Jha

My RAM is Basically... Sand?

Prachi Jha — Mon, 15 Jun 2026 20:15:49 +0000

I'm a CSE student who learned what SRAM is today. While trying to figure that out, I accidentally traced a path from U-Boot all the way down to sand. This is that path.

What is an SRAM?

SRAM (Static RAM) is a fast form of memory commonly found inside CPU caches and embedded systems. During boot, a Boot ROM may load a secondary bootloader such as SPL into SRAM before larger DRAM has been initialized.

Okay, so how is it different from Dynamic RAM?

Both are volatile memories. However, other than the size and location, SRAM is way faster and more expensive than DRAM. Why? Because SRAM is made of 6 transistors while a DRAM is made of one transistor and one capacitor, materials that are way cheaper.

Now being a CSE student, transistors were one of those vague words that I keep hearing engineers toss around but had never really looked into what they are or how they actually function.

What is a transistor?

Basically, an electrical switch. Which means based on the electric power that we provide it, we can turn the switch 'on' or 'off'. Well, while that is great, why does one SRAM require six of these electrical switches instead of like... one? They could just turn it on and off going from there.

Turns out, one bit in SRAM is represented by something called a latch. The latch is how the SRAM remembers values like 0 or 1.

The idea is that memory needs to hold a value, not just pass one through. A latch does this by having two inverters argue with each other until one wins.

A latch comprises of two 'NOT' gates, connecting two electronic nodes, A and B. Each NOT gate is connected so that its output feeds the input of the other, creating a feedback loop.

An electronic node is part of an electrical component which has the same voltage. So in the practical sense, A and B are wires.

Since both are connected via NOT gates,there are only two valid states: A=0, B=1 and A=1, B=0.

That brings us to the question: what exactly does A=0 or B=1 mean if they are literally wires?

As it turns out, A=0 or B=1 does not literally mean that the voltage is exactly 0 or 1 volts. Instead, digital circuits define ranges of voltages that are interpreted as logical 0 and logical 1. Whether a node is considered high or low depends on threshold voltages built into the circuit.

But then, wouldn't the two NOT gates need some comparison mechanism to be able to determine if they are at a voltage higher than the threshold?

For that we need to understand how these NOT gates are made.

What makes up a NOT gate?

Transistors.

There are two types of transistors that are used for one NOT gate.

PMOS and NMOS. Both of these are types of MOSFETs.

PMOS is a p-channel MOSFET, while NMOS is a n-channel MOSFET. What they mean will make sense soon enough. Now, take a look at the diagram below. That is how these two transistors comprise of one NOT gate.

When the voltage is low, the PMOS 'switches on' while NMOS remains 'off'. Which means current flows from the voltage source to the output. Thus, B is 1.
When the voltage is high, i.e. A=1, NMOS 'switches on' while PMOS remains off. The voltage from the source, which here is GND, flows to the output. Thus B becomes 0.

This architecture for creating a NOT gate is called a CMOS inverter, where CMOS stands for Complementary Metal Oxide Semiconductor.

That makes 4 transistors, 2 for each NOT gate. The other two are access transistors : they act like controlled doorways for the latch.

I keep saying when voltage is high, or low, but how does any transistor, whether it be PMOS or NMOS, sense it?

To understand this, we need to go a layer deeper and look into the terminals that make up these transistors.

The three terminals of a MOSFET

A terminal is an end point through which an electrical component connects to the rest of the circuit. A MOSFET has three of those.

Source : terminal from which charge carriers enter the channel.
Drain : terminal through which charge carriers leave the channel.
Gate : what generates the electric field

Now, here's the interesting part. The gate is never in contact with the circuit. However, the input, A, is connected to both the gates of the NMOS as well as the PMOS for a particular CMOS inverter.

This gate uses the voltage to generate an electronic field that allows the movement of electrons or holes from one side to the other.

Once this starts happening, the space that was previously occupied by the transistor becomes a conductive region. This region is now called a channel.

Now, why does the gate work the way it does? Why does a higher voltage at the gate cause NMOS to open up, while a lower voltage causes PMOS to open up?

For that we need to understand how NMOS and PMOS transistors are made.

Doped Silicon

That means we take a semiconductor called Silicon and intentionally introduce impurities in it. Depending on the type of impurity introduced, it becomes either n-type or p-type.

N-type is when we add Phosphorus, which has 5 valence electrons, to Silicon which has only 4. So there is one extra electron, that is only weakly bound and can move through the material, increasing conductivity.

P-type is when we add Boron, which has only 3 valence electrons. So only three of four bonds of Silicon are formed, which leaves an absence of a electron, that we call a hole.

A silicon is doped at the correct places with the correct impurities to create an NMOS or PMOS transistor.

An NMOS consists of:
N+ Source
P-type Body
N+ Drain

Which looks like this:

Now, imagine putting a positively charged plate above this silicon.
The gate doesn't touch the silicon, it sits above an insulating oxide layer.

The electrons currently can't flow because there's a p-type body between two N+ source and drain. However, when the gate exerts positive electric field, the electrons get attracted and start gathering at the surface, i.e. above the p-type body.

We started with N+ Source --- P Body --- N+ Drain
Now we have created N+ Source --- N-type channel --- N+ Drain.
That is N-N-N all the way across. Now the electrons have a path.

This layer is called the inversion layer since we successfully turned a p-type body into an n-type channel.

This is also why MOSFETs are Metal Oxide Semiconductor Field Effect Transistors.

Similarly, for a PMOS, this is how the Silicon is doped.

So, coming back to our original question.

How does a NOT gate compare voltages between two wires?

It doesn't.

Each CMOS inverter only observes the voltage at its own input node. If the input is high, the NMOS transistor turns on while the PMOS turns off, pulling the output toward ground. If the input is low, the PMOS turns on while the NMOS turns off, pulling the output toward the supply voltage.

How does it determine this 'high' or 'low' though?

For an NMOS transistor, a conducting channel forms only when the voltage applied to the gate exceeds a certain value known as the threshold voltage (Vth). Below this threshold, the electric field is too weak to attract enough electrons to create a channel, and the transistor remains effectively off. Above it, a conductive path forms between the source and drain, allowing current to flow.

Digital logic is built on top of this analog behavior. Engineers define ranges of voltages that are guaranteed to be interpreted as logical 0 and logical 1. For example, in a system with a 1 V supply, voltages near 0 V may be treated as a logical 0, while voltages near 1 V are treated as a logical 1.

Values in between form a transition region where the circuit's behavior is less certain. The role of a CMOS inverter is to push voltages away from this ambiguous region and toward one of the stable extremes. In this sense, digital computation emerges from analog physics: the silicon only ever sees electric fields and voltages, while the familiar 0s and 1s arise from carefully chosen thresholds and voltage ranges.

Before we move on, there's one more thing worth sitting with: what happens when the latch can't decide?

For example, instead of one node being near 1 V and the other near 0 V, what if both A and B were sitting at roughly 0.5 V?

This turns out to be a real phenomenon called metastability.

In this situation, neither inverter sees a clearly high or clearly low input. Both are hovering around the threshold region where the transistors are only partially on. As a result, the feedback loop has not yet "made up its mind" about which stable state it should settle into.

However, this balanced state is extremely fragile. Even the tiniest disturbance, like a bit of electrical noise, a temperature fluctuation, or a few stray electrons moving around, will make one node slightly higher than the other. Once that happens, the cross-coupled inverters amplify the difference.

If A becomes even a little higher than B, one inverter starts pushing B lower, which causes the other inverter to push A higher, and the gap rapidly grows until the cell settles into either (A = 1, B = 0) or (A = 0, B = 1).

Metastability is therefore best thought of as a temporary state where the circuit has not yet decided which of its two stable states to occupy.

This is the knife-edge that a memory cell balances on. And it's made of Silicon.

It makes you wonder how disastrous it would be if we were to ever run out of Silicon.

Fortunately for us, Silicon is the second most abundant element on Earth, since it is literally found in sand.

This sand is doped, using which we get transistors, which we use to create NOT gates, which is used to create latches, and six transistors and two wires represent one bit of SRAM.

A bunch of these bits together represent our executable machine code.
Before the linker ran, it was relocatable machine code.
Before the assembler ran, it was assembly.
Before the compiler ran, it was the code you actually typed.
Sand, all the way up.

It is difficult not to be impressed by the fact that modern computing ultimately depends on engineers taking purified sand, introducing carefully chosen impurities into regions measured in nanometers, and somehow arranging the result into a machine capable of running an operating system.

Why Copy Something That's About to Die?

Prachi Jha — Wed, 10 Jun 2026 19:44:11 +0000

I know the title could be a metaphor for a billion things in programming, so bear with me for a minute as I build this up to rvalue references in cpp.

How it started

I was writing a recursive palindrome checker and hit an Out Of Memory error while submitting my solution.

Someone (GPT) suggested changing

bool check(string s, int left, int right)

to:

bool check(const string& s, int left, int right)

to avoid copying the string on every recursive call.

Now, while I understood that string& s created a reference to the variable that holds my string, and const ensured that we didn't alter the value later, I didn't really understand how that worked as a function argument.

Why References?

Typically, defining:

void foo(string s){...}

and then calling:

foo(name);

creates a copy.

name --> "Prachi"
s --> "Prachi"

These are two separate strings. However, with:

void foo(string& s){...}

calling:

foo(name);

creates a reference for name. Here, s becomes another name for name.

References don't create a new object at all, and are thus able to avoid copying.

This however brings us to the question, what would happen if I were to write:

foo("Prachi");

Here, "Prachi" is not an object. So what does s refer to now?

As expected,

void foo(string& s){...}

would not work with:

foo("Prachi");

However, quite unexpectedly, if foo is defined as:

void foo(const string& s){...}

foo("Prachi") would totally work.

And the only difference is that of a const.

So bringing us to the obvious question in the next part-

How. Doesn't the Temporary Get Destroyed?

Normally, temporary objects don't stick around for very long.

For example:

string("Prachi");

creates a temporary std::string object that is destroyed at the end of the statement.

This means that if C++ simply allowed references to bind to temporaries, we could end up with a reference pointing to an object that no longer exists.

To avoid that, C++ makes a special exception for const T&.

Consider:

const string& s = string("Prachi");

Without any special rules, the execution might look like this:

Create temporary string("Prachi")
↓
Bind s to it
↓
Destroy temporary
↓
Use s

which would leave s referring to an object that no longer exists.

Instead, C++ extends the lifetime of the temporary so that it survives for as long as the reference does.

Conceptually, the compiler behaves more like:

Create temporary string("Prachi")
↓
Bind s to it
↓
Keep temporary alive
↓
Use s safely
↓
Destroy temporary when s goes out of scope

This rule is known as temporary lifetime extension.

Which brings us to another question:
Why not just do the same for every reference instead of just const T&?

Why such bias with non-constants?

Imagine cpp allowed:

void scream(string& s)
{
    s += "!!!";
}

and then:

scream(string("Prachi"));

by extending a temporary's lifetime.

What would happen?

A temporary gets created, say temp.

temp = "Prachi"

It gets bound to s.

string& s = temp;

Then the function runs, edits s and thus temp, so now temp becomes:

temp = "Prachi!!!"

Then the function exits. The temporary is destroyed.

The weird thing is that the modification was technically valid. the string was successfully modified. But nobody will ever see the result. The object would immediately disappear afterward.

Instead if we did:

string name = "Prachi";
scream(name);

Now, name is a concrete object. And thus the modification would survive.

C++ could have extended temporary lifetimes for all references. Instead, it chose to reserve non-const references for "real" objects whose state can meaningfully be modified and observed later. A temporary object is, by definition, about to disappear, making modification largely pointless.

But what if we did want to do something with the temporary?

Let's consider another example:

string makeName()
{
    string s = "Prachi";
    return s;
}

Here, when s is returned,

string name = makeName();

all of its characters are copied to a new string.

Why do that when s is about to get destroyed anyway?

Move Instead of Copy

In the above example, when s is returned, we have two options.

Option 1. Copy

When s is copied to the return value, the entire buffer is duplicated.

For a small string, that's not a big deal. But imagine a vector containing millions of elements. Copying becomes expensive very quickly.

And here's the thing:

Immediately after the return statement executes, s is destroyed.

We spent time and memory duplicating data from an object that was about to disappear anyway.

Option 2: Move

Instead of duplicating the contents, we transfer ownership of the buffer.

No characters are copied.

No new buffer is allocated.

The return value simply takes ownership of the memory that s was already using.

When s is eventually destroyed, it has nothing left to clean up.

This is the core idea behind move semantics.

Instead of asking:

How do we copy this object efficiently?

C++ asks:

Does this object still need its resources?

If the answer is no, we can simply transfer ownership instead of copying!

Thus for the first time, having temporaries becomes an optimization opportunity. But for this, we would have to correctly identify which are the temporaries that are about to get destroyed.

Basically, how do we know if an object is safe to steal resources from?

Enter Rvalue References

Before this point, I only knew about one kind of reference:

string& ref = name;

which can only bind to a named object.

string name = "Prachi";
string& ref = name;      // Valid

string& ref = string("Prachi"); // Error

C++11 introduced a second kind of reference:

string&& ref = string("Prachi");

Unlike ordinary references, rvalue references specifically allowed to bind to temporary objects: an object that is about to be destroyed and whose resources can potentially be reused.

While Lvalue refers to an object with an identity that can be referred to later, Rvalues are temporaries.

An rvalue reference is simply a reference that is allowed to bind to these temporary objects.

Once the language syntax made this distinction possible, move semantics came into picture.

Move Semantics is the idea that rather than creating a brand-new copy of an object's resources, we "move" those resources from one object to another.

The moved-from object remains valid, but no longer owns the resource it previously held.

Why I Thought Rvalue References Were Redundant

When I first encountered rvalue references, I was thinking exclusively about function arguments.

From that perspective, they seemed almost pointless.

After all, I had already learned that:

const string& s

avoids copies.

It can bind to temporaries.

It extends the temporary's lifetime.

So what exactly was left for:

string&& s

to do?

For nearly an hour, I kept looking at rvalue references through that lens and couldn't understand why the feature existed at all.

The breakthrough came when I stopped thinking about function arguments and started thinking about ownership.

The goal was never to create "another kind of reference."

The goal was to give the language a way to recognize objects that were about to disappear and make use of that fact.

Instead of treating temporary objects as a nuisance, C++ turns them into an optimization opportunity.

Once I realized that, the entire design suddenly clicked.

Rvalue references weren't solving a parameter-passing problem.

They were solving an ownership problem.

And move semantics was the elegant consequence of that solution.

Interestingly, while reading further, I learned that one of the key contributors behind rvalue references and move semantics in C++11 was Howard Hinnant.

So if by some miracle this article ever reaches him:

Thank you.

What started as an Out Of Memory error in a recursive palindrome checker somehow turned into one of my favorite language-design rabbit holes.

What a Switch Tells the Compiler that If-Else Doesn't

Prachi Jha — Mon, 08 Jun 2026 06:46:39 +0000

My first reaction to switch-case in 11th grade was a single word: why.

The if-else conditional had a perfectly easy, perfectly understandable syntax. And as far as my understanding went, they pretty much did the same thing.

It was not until I studied compilers and systems during my undergraduate that I realized switch statements are not merely an alternate syntax for if-else chains. They communicate additional information to the compiler, which can enable optimizations that may not be as obvious from an equivalent if-else sequence.
How so?

For the longest time, I believed that switch-case does exactly what an if-else does. We have a few conditions, we pick one of the cases, execute it, and come back. So it didn't really make sense to me about why we would need a break after every case.

Unless you write it for every case until you reach default, it would execute every code block for every single case written below the case executed too!

For example:

switch (nod){
    case 1:
        cout << "Yes" << endl;
    case 2:
        cout << "No" << endl;
    default:
        cout << "Invalid" << endl;

A selection of nod = 1 would print YesNoInvalid instead of just yes. Which is why we use breaks to exit the conditional block.

switch (nod){
    case 1:
        cout << "Yes" << endl;
        break;
    case 2:
        cout << "No" << endl;
        break;
    default:
        cout << "Invalid" << endl;

Neither is it efficient, nor is it a popular use case. The fallthrough makes more sense once you think of cases as jump destinations, not conditions. My mistake was assuming that switch-case is just if-else with a different syntax, and functionality probably remains the same.

However, a switch exposes a much lower-level control flow model than if-else.

If you squint carefully, the structure of a switch looks remarkably similar to low-level control flow. Rather than describing a sequence of conditions, it describes a dispatch from one value to one of several possible destinations. Thinking about a switch this way makes its fallthrough behavior much easier to understand.

For if-else, you describe the logic. The compiler constructs the control flow. For switch-case, you partially describe the control flow itself. The compiler gets much more structure to work with.

That said, it carries more information to the compiler, in a way, than the typical if-else.

In the above example, for if-else, the compiler knows:
Condition 1: nod == 1
Condition 2: nod == 2
Condition 3: otherwise

To realize this is a dispatch problem, the compiler has to analyze the conditions and notice:

Same variable
Compared to constants
Mutually exclusive

Now look at the switch example. We are basically telling the compiler that there is one controlling expression - 'nod'. The compiler can immediately construct:

Value Destination

1 case1
2 case2
3 case3

without doing any analysis. The source code is already expressing a dispatch relationship between values and destinations.

A switch ensures that there is a single expression compared against constant cases to select one destination. It is essentially carrying metadata about the coder's intent.

A switch statement describes a dispatch problem, which allows the compiler to implement it using various optimization techniques.

While experimenting, I repeatedly ran into a different problem: GCC kept deleting my switch statements entirely. In one example, the compiler noticed that every case returned the same value as the case label itself and rewrote the logic into a much simpler form.

In one experiment, I wrote:

case 1: return 1;
case 2000: return 2000;

expecting to study switch lowering. Instead, GCC noticed that the returned value was identical to the matched case value and generated a much simpler implementation. The switch effectively disappeared.

This was an important lesson: by the time we inspect assembly, we are often looking at the result of many optimization passes rather than a direct translation of the source code.

In fact, a switch statement does not necessarily survive long enough to be lowered at all. If the compiler can prove something stronger about the program, it may simplify or eliminate the switch entirely.

However, when a switch does survive to the lowering stage, the compiler still has several possible ways to implement it. This process, known as switch lowering, involves choosing a concrete control-flow strategy for the multiway branch.

A few common choices used by compilers are:

Compare Chain Here

switch(x) {
case 1:
case 2:
case 3:
}

becomes

cmp x, 1
je case1

cmp x, 2
je case2

cmp x, 3
je case3

jmp default

Good when there are only a few cases, or when the case values are sparse.

Jump Table

For denser values or larger numbers of cases, the compiler may generate a jump table. It however consumes memory.

index = x - 1

table:
0 -> case1
1 -> case2
2 -> case3
3 -> case4
4 -> case5

Decision Tree For many sparse values like 100, 200, 500, 800 - the compiler might emit a decision tree equivalent to a binary search tree. It is good for when there are many sparse values constructing a jump table for that would be huge.

A few other optimizations also include bit test, range checks and special transforms.

Now, how does the compiler know which optimization route to take? Based on heuristics.

Most compilers look at the number of cases, range and density.

For example:
Number of cases = 5
Range = 1..5
Density = 5/5 = 100%

If the density is high, jump table is probably good.

Another example:
case 1
case 1000
case 1000000

Here:
Cases = 3
Range = 1..1000000
Density ≈ 0

Jump tables would waste memory. Comparisons are probably better.

When the code compiles, the compiler often doesn't know several details like how often a branch will be executed, what values the user will enter, etc. But it still needs to decide optimizations, and whether to generate a jump table, a decision tree, etc. So it makes an educated guess.

Each compiler uses their own heuristics. For example, in GCC's source code, depending on the number of cases, their range, density, and even special patterns in the values, GCC may choose a comparison chain, a decision tree, a jump table, or an entirely different transformation. The compiler is not simply translating a switch statement into machine code; it is evaluating multiple possible implementations and selecting the one its heuristics believe will be most efficient.

Here is an example of different assembly code generated for the same decision problem using switch-case and if-else

int decision(int x){
        switch(x){
                case 1: return 10;
                case 100: return 20;
                case 200: return 30;
                case 300: return 40;
                case 400: return 50;
                case 500: return 60;
                case 600: return 70;
                case 700: return 80;
                case 800: return 90;
                case 900: return 100;
                default: return 0;
        }
}

The simplified version of the assembly code generated for this is:

cmp x, 500
je  case500

jg  right_half
jl  left_half

left_half:
    cmp x, 200
    je case200

    cmp x, 300
    ...

Notice how the compiler starts by comparing directly to 500 and then jumping in two different directions on the basis of the result. This is a decision tree.

GCC essentially built

                500
              /     \
          <500       >500
          /             \
       200             800
      /   \           /   \
    ...   ...      ...   ...

Let us try the same with if-else using the following code:

int decision(int x){
        if (x == 1)
                return 10;
        else if (x == 100)
                return 20;
        else if (x == 200)
                return 30;
        else if (x == 300)
                return 40;
        else if (x == 400)
                return 50;
        else if (x == 500)
                return 60;
        else if (x == 600)
                return 70;
        else if (x == 700)
                 return 80;
        else if (x == 800)
                 return 90;
        else if (x == 900)
                return 100;
        else
                return 0;
}

A simplified version of the assembly generated for this is:

cmp x, 1
je case1

cmp x, 100
je case100

cmp x, 200
je case200

cmp x, 300
je case300

...

cmp x, 900
je case900

jmp default

In this particular experiment, GCC generated a decision tree for the switch version and a linear comparison chain for the if-else version.

Switch isn't merely alternate syntax for if-else. It communicates additional semantic information to the compiler: that a single value is being dispatched to one of several destinations. That additional structure allows the compiler to apply optimizations such as jump tables, decision trees, bit tests, and other lowering strategies.

What surprised me most while exploring GCC's implementation was that the question "Is switch faster than if-else?" is actually the wrong question. In many cases, the compiler may generate nearly identical code for both. In other cases, it may completely transform the control flow into a jump table, a decision tree, or something even more specialized. The answer depends on the distribution of case values, the number of branches, target architecture, and compiler heuristics.

The real advantage of a switch statement is not that it is inherently faster. It is that it expresses intent more clearly. By telling the compiler that we are performing a multiway dispatch on a single value, we give it the information needed to choose the most efficient implementation. As it turns out, a switch statement is not just a language feature - it is a hint to the optimizer.

I Baked a Flutter App Into a Car OS. Here's What Broke and What Didn't.

Prachi Jha — Sat, 21 Mar 2026 18:20:23 +0000

This came out of preparing for GSoC 2026 with AGL.

Automotive Grade Linux runs the infotainment systems in production Mazdas and Subarus. It's backed by most major automakers and compiles entirely from source - kernel, C library, every system tool. I spent five days building an AGL image from scratch, wrote a Flutter app, and baked it into the OS. This is what actually happened.

The Scale of the Problem

You can't sudo apt install agl. AGL is built using Yocto, an industry-standard build system for custom embedded Linux distributions. Yocto doesn't download a pre-built OS. It compiles everything from source: the kernel, the C library, every system tool, the Flutter engine, and the app itself.

My laptop had neither the compute nor the disk space. I spun up a GCP VM:

Machine: e2-standard-8 (8 vCPUs, 32 GB RAM)
OS: Ubuntu 22.04 LTS
Disk: 200 GB

First attempt failed overnight at 74%:

WARNING: The free space is running low (0.823GB left)
ERROR: No new tasks can be executed since the disk space monitor action is "STOPTASKS"

Yocto needs more than 200 GB. I hit a quota limit trying to expand the disk in Asia, deleted the VM, recreated it in us-central1 with 400 GB, and started over.

8 hours later, after 12,145 compilation tasks:

Tasks Summary: Attempted 12145 tasks of which 0 didn't need to be rerun and all succeeded.

I booted it in QEMU:

Automotive Grade Linux 21.90.0 qemux86-64 ttyS0
qemux86-64 login: root
root@qemux86-64:~#

AGL 21.90.0. Codename: vimba. A virtual car computer, inside a cloud VM, in Iowa.

Writing the Flutter App

The app reads /etc/os-release at runtime to display the AGL version, which means the same binary shows Ubuntu values during local development and AGL values on the actual image - no build flags, no conditionals. The relevant field is PRETTY_NAME:

Future<void> _loadAglVersion() async {
  final file = File('/etc/os-release');
  final contents = await file.readAsString();
  final lines = contents.split('\n');
  for (final line in lines) {
    if (line.startsWith('PRETTY_NAME')) {
      setState(() {
        _aglVersion = line.split('=')[1].replaceAll('"', '');
      });
      break;
    }
  }
}

Flutter app on local machine. For the image: Levi Ackerman from Attack on Titan. The sound button plays an audio clip I will not describe further.

Flutter app on QUEMU

Baking It In: Yocto Layers and Recipes

Yocto builds from "layers" - folders that each contribute something to the final image. AGL ships with layers for its core system, demo apps, and Flutter engine support. To add my app, I created meta-agl-prachi:

meta-agl-prachi/
├── conf/
│   └── layer.conf
└── recipes-apps/
    └── agl-quiz-app/
        ├── agl-quiz-app.bb
        └── files/
            └── agl_quiz_app.desktop

The .bb file (a "recipe") tells Yocto: where to fetch the source, how to build it, where to install it. Mine pointed to my GitHub repo and used inherit flutter-app, a class provided by meta-flutter that handles all the Flutter-specific build logic.

Then I added my layer to the build and ran:

bitbake agl-ivi-demo-flutter

It finished in minutes. Suspiciously fast, only 5 tasks rerun. Yocto had used cached output from the previous build and skipped my layer entirely.

The Lockfile Problem

I ran bitbake agl-quiz-app in isolation to see what was actually failing:

ERROR: agl-quiz-app-1.0-r0 do_archive_pub_cache:
flutter pub get --enforce-lockfile failed: 1

The error named the failed command. It didn't say where that command came from or how to change it. So I followed the source:

find ~/AGL/master/external/meta-flutter -name "*.bbclass"
# → flutter-app.bbclass
cat flutter-app.bbclass
# → require conf/include/flutter-app.inc
cat flutter-app.inc
# → require conf/include/common.inc
cat common.inc

Three files deep. In common.inc I found it:

if d.getVar("PUBSPEC_IGNORE_LOCKFILE") == "1":
    pubspec_lock = os.path.join(app_root, 'pubspec.lock')
    if os.path.exists(pubspec_lock):
        run_command(d, 'rm -rf pubspec.lock', app_root, env)

# ...later...
run_command(d, 'flutter pub get --enforce-lockfile', app_root, env)

flutter pub get --enforce-lockfile requires the lockfile to exactly match resolved dependencies. My lockfile was generated with a slightly different Dart SDK version than the build VM. The fix was a single line in my recipe:

PUBSPEC_IGNORE_LOCKFILE = "1"

One line. After two days of debugging.

Getting the Display Working

QEMU runs headless by default. To see the AGL UI, I exposed its display over VNC:

runqemu qemux86-64 serialstdio slirp qemuparams="-display vnc=:1"

I opened port 5901 in GCP's firewall and connected with TigerVNC. First connection: the AGL warning screen, rotated 90 degrees. AGL IVI is designed for portrait car dashboards. One more line in weston.ini fixed the backend:

backend=vnc

Getting my app to render required understanding AGL's Wayland setup. The compositor runs as agl-driver (uid 1001). Root cannot access agl-driver's Wayland socket - not a permissions workaround, just how Wayland works. The socket lives at /run/user/1001/wayland-0 and only the user who started the compositor can connect to it.

I found the correct environment by reading the existing service file:

cat /usr/lib/systemd/system/flutter-ics-homescreen.service

Which revealed the exact variables and paths needed. With those:

su agl-driver -s /bin/sh -c '
  WAYLAND_DISPLAY=wayland-0
  XDG_RUNTIME_DIR=/run/user/1001/
  LD_PRELOAD=/usr/lib/librive_text.so
  LIBCAMERA_LOG_LEVELS=*:ERROR
  flutter-auto -b /usr/share/flutter/agl_quiz_app/3.38.3/release --xdg-shell-app-id agl_quiz_app'

The app launched.

The Result

AGL 21.90.0 (vimba). The version string - "Automotive Grade Linux 21.90.0 (vimba)", pulled live from /etc/os-release at runtime. Sound doesn't come through QEMU yet (that requires additional ALSA configuration), but everything else works.

Takeaways

The most useful thing I practiced here wasn't Flutter or Yocto syntax. It was following require statements until I found the line actually doing the thing. common.inc wasn't linked from anywhere in the docs. Three files of reading got me there. When something breaks in Yocto, the error names the failed task, and that task is a readable function somewhere in the layer files. Start there and keep reading.

The structure itself is simpler than it looks: layers are folders, recipes are config files, classes are reusable logic. The surface area is large but not deep.

The full code and Yocto layer are on GitHub: here

I Made My Code Do Nothing. It Got Slower.

Prachi Jha — Sat, 07 Feb 2026 12:22:41 +0000

I stripped my code down to do absolutely nothing. Just count events and move on. It got 8% slower.

This isn't measurement noise. Over 30 seconds of processing 12+ million events across 5 test runs, the "optimized" version was consistently, measurably slower than the version doing expensive kernel symbol lookups and string formatting.

The Setup

I built an eBPF tool that monitors TCP packet drops in the Linux kernel. eBPF (Extended Berkeley Packet Filter) lets you hook into kernel functions without modifying kernel source. When a packet is dropped, my eBPF program captures the event (PID, drop reason, kernel function) and sends it to userspace through a ring buffer.

During stress testing, I flooded my machine with SYN packets:

sudo hping3 -S -p 80 --flood

The kernel started dropping ~400,000 packets per second. My tool reads each drop event from the ring buffer and processes it.

I created four benchmark modes to find the bottleneck:

Benchmark - Just count events. Zero processing.
Busy - Do expensive work (symbol lookups, string formatting), then discard the result.
File - Same expensive work, but write to a file.
Terminal - Same expensive work, print to terminal.

Each mode ran for 30 seconds. Here's what happened:

Mode	Events Read	Throughput	Rank
File	12,548,707	373,828/sec	1
Busy	12,432,234	370,316/sec	2
Benchmark	11,570,132	344,616/sec	3
Terminal	653,848	19,353/sec	4

The mode doing zero work came in third.

The Paradox

Benchmark Mode should have won. It's literally just incrementing a counter:

// Benchmark Mode - the "fast" path
for {
    record, err := ringbuf.Read()
    if err != nil { break }

    metrics.EventsRead.Add(1)
}

No symbol lookups. No string formatting. No I/O. Just atomic increment and repeat.

But it lost by 8% to Busy Mode, which does this:

// Busy Mode
func (p *EventProcessor) ProcessEventBusy(event *monitorEvent) {
    p.metrics.EventsRead.Add(1)

    // Map lookup for drop reason
    reasonStr := p.dropReasons[event.Reason]
    if reasonStr == "" {
        reasonStr = fmt.Sprintf("UNKNOWN(%d)", event.Reason)
    }

    // Binary search through 300k kernel symbols
    symbolName := findNearestSymbol(event.Location)
    if symbolName == "" {
        symbolName = fmt.Sprintf("0x%x", event.Location)
    }

    // String formatting with allocations
    _ = fmt.Sprintf("[%s] Drop | PID: %-6d | Reason: %-18s | Function: %s\n",
        time.Now().Format("15:04:05"),
        event.Pid,
        reasonStr,
        symbolName)

    p.metrics.EventsPrinted.Add(1)
}

How does doing more work make code faster?

Ruling Out the Obvious

Memory Allocation?

Benchmark Mode:

Total allocated: 303 MB
GC runs: 14

Busy Mode:

Total allocated: 2,592 MB (8x more)
GC runs: 75 (5x more)

Busy Mode was doing 5x more garbage collection and still winning. The bottleneck wasn't memory.

The Smoking Gun: CPU Profiling

I added CPU profiling with pprof to both modes. The difference was stark.

Benchmark Mode (344k events/sec)

unix.EpollWait:     11.12s (76%) ← Blocking on syscalls
ringbuf.Read:       13.61s (93%) ← Total time in read

76% of CPU time spent waiting in epoll_wait(), blocked while the kernel writes the next event.

Busy Mode (370k events/sec)

ProcessEventBusy:   11.27s (54%) ← Actually doing work
  ├─ fmt.Sprintf:    6.53s (31%)
  ├─ findNearestSymbol: 2.61s (13%)
  └─ mallocgc:       3.19s (15%)

unix.EpollWait:     5.57s (27%) ← Much less waiting

Only 27% waiting. The rest was productive work.

File Mode (373k events/sec)

ProcessEvent:         11.66s (56%) ← Work
  ├─ fmt.Fprintf:      4.95s (24%) ← Cheaper than Sprintf!
  ├─ findNearestSymbol: 2.68s (13%)
  └─ mallocgc:         2.64s (13%)
unix.EpollWait:        5.87s (28%) ← Same batching as Busy

File and Busy have identical syscall overhead (28% waiting). File edges ahead because fmt.Fprintf() to a buffered file is more efficient than fmt.Sprintf() creating throwaway strings.

Why This Happens: The Syscall Tax

The kernel generates events at ~2.5µs per event (400k/sec).

Benchmark Mode processes each event in ~1µs:

Event arrives → Process (1µs) → ringbuf.Read()
                                    ↓
                            Ring buffer empty!
                                    ↓
                    epoll_wait() blocks (context switch)
                                    ↓
                    Kernel writes next event (1.5µs)
                                    ↓
                        Wake up userspace
                                    ↓
                                 Repeat

Result: ~400,000 context switches per second, constant blocking.

Benchmark Mode was too fast. It kept asking for the next event before the kernel had written it, forcing the process to sleep in epoll_wait() waiting for data.

Busy Mode processes each event in ~4µs:

Event arrives → Process (4µs) → ringbuf.Read()
       ↑                            ↓
       |                    3 events already waiting!
       |                            ↓
       └──────── Kernel wrote more while we were busy

Result: ~150,000 context switches per second, natural batching.

By the time userspace calls Read(), multiple events are already queued in the ring buffer. No blocking needed.

Accidental Batching

Busy Mode's processing time (~4µs per event) accidentally created natural batching. While userspace was busy formatting strings and looking up symbols, the kernel queued multiple events. Each ringbuf.Read() call pulled several events without blocking.

Benchmark Mode outpaced its data source and spent most of its time context-switching between userspace and kernel, waiting for the next event.

The paradox: Removing work made the code too fast, causing it to waste time waiting.

The Real Cost of Performance

Here's what's actually expensive in event-driven programming:

Cheap:

Memory allocation (even 8x more)
Garbage collection (even 5x more)
CPU work (symbol lookups, string formatting)

Expensive:

Context switches (crossing the kernel/userspace boundary)
Blocking syscalls (epoll_wait, read, poll)

When you optimize CPU work to near-zero, you don't eliminate the cost, you just shift it to I/O overhead. If your processing loop is faster than your data source, you end up paying the syscall tax on every single event.

Who Should Care?

This pattern appears in any high-frequency event processing:

Network packet processing (eBPF, DPDK, raw sockets)
Financial trading systems (market data feeds)
Log aggregation (reading from message queues)
Metrics collection (statsd, Prometheus exporters)
Game engines (input event processing)

Rule of thumb: If your event processing is faster than your event arrival rate, you're probably paying unnecessary syscall overhead. Profile your code. If you see >50% time in epoll_wait/poll/select, you're thrashing on syscalls. Batch your reads.

The Actual Fix

The real lesson here isn't "batch your reads better in userspace." The cilium/ebpf library's ringbuf.Read() is already reasonably efficient. You're still bound by the poll/epoll cycle regardless.

The actual fix is to stop sending 400,000 events to userspace in the first place.

This is where eBPF's real power comes in: move the aggregation logic into the kernel.

Instead of:

Kernel: Drop packet → Send event to ring buffer
Userspace: Read event → Process → Count
Result: 400,000 context switches/sec

The proper approach:

Kernel: Drop packet → Increment counter in BPF map (no userspace trip!)
Userspace: Read aggregated counts once per second
Result: 1 context switch/sec

Using eBPF maps, I can aggregate packet drops directly in kernel memory, count drops by kernel function, by IP address, by drop reason; and pull the summary to userspace periodically instead of streaming 400,000 individual events.

Expected improvement: 99.9% reduction in context switches (from 400k/sec to ~1/sec).

This is future work for me. The interesting benchmark results I found were educational. They taught me about syscall overhead and accidental batching, but they also revealed I was solving the problem in the wrong place.

The Takeaway

The fastest code isn't always the code doing the least work, it's the code that minimizes expensive operations.

In this case:

Benchmark Mode: Optimized CPU work, paid 76% overhead in syscalls
Busy Mode: Did 8x more allocation and 5x more GC, reduced syscall overhead to 27%
File Mode: Most efficient I/O primitives, same batching benefits

The real enemy wasn't symbol lookups or string formatting. It was calling the kernel 400,000 times per second instead of 150,000 times.

Three lessons:

Profile before optimizing - My intuition said "remove work." The profiler said "syscalls are the bottleneck."
Batching beats speed - Reading 10 events with 1 syscall is faster than reading 10 events with 10 syscalls.
There's a sweet spot - Too-fast processing just means waiting for I/O.

The bottleneck is rarely where you think it is.

Environment: Ubuntu 24.04, Linux 6.5, Go 1.21 (variance <2% across 5 runs)

Full code: GitHub

Profiling implementation: benchmark/investigation branch

Previous article: My Logs Lied: How I Used eBPF to Find the Truth

My Logs Lied: How I Used eBPF to Find the Truth

Prachi Jha — Sun, 01 Feb 2026 17:34:24 +0000

A while back, I wrote about the time I accidentally DDOSed my own laptop while load testing a Go auction server. 1000 concurrent clients generated connections so fast that the kernel's TCP listen queue overflowed, but silently, dropping packets before they ever reached my application.

The bug itself had a simple fix. But the experience left me with a harder problem: I had no way to see it happening in real time. Application logs showed nothing. netstat -s gave me a system-wide counter. "11,053 listen queue overflows". But not which process, not when, not why.

So I built a tool to see inside the kernel. This is how it works, what I learned, and one genuinely weird thing I found while benchmarking it.

The Problem with Debugging Network Issues

When a distributed system behaves badly, a database replica lags, a microservice times out, a message queue backs up - the first instinct is to look at application logs. But a whole category of problems lives below the application, in the kernel's networking stack, invisible to anything your code can observe directly.

TCP is designed to be resilient. It retransmits. It backs off. It recovers. But when it can't recover, when the listen queue overflows, when a firewall drops a packet, when a checksum fails, it just... drops the packet. No log entry. No error propagated upward. The application sees a timeout, eventually, but the actual cause happened microseconds earlier, deep in kernel code.

The tools most developers reach for don't help here.

tcpdump captures packets on the wire, but it's too heavyweight to run continuously in production. It also shows you what arrived, not what got dropped. netstat -s gives you aggregate counters - "11,053 listen queue overflows" - but nothing about which process, when, or why. You're left guessing.

What I needed was something that could sit right at the point where the kernel drops a packet and report back: who was affected, why did it happen, and exactly where in kernel code did the decision get made.

Enter eBPF

eBPF (extended Berkeley Packet Filter) is a technology that lets you run small, sandboxed programs inside the Linux kernel without modifying kernel source code or loading kernel modules (which is a lot more painful). It's been around for years, used by tools like Cilium and Datadog, but it's surprisingly accessible for individual developers once you understand the basics.

The key insight is that the Linux kernel has built-in attachment points or hooks called tracepoints - stable, documented hooks left by kernel developers for exactly this kind of observability. For packet drops, the relevant tracepoint is skb/kfree_skb. Every time the kernel frees a socket buffer (which is what happens when a packet is dropped), this tracepoint fires.

So the plan was straightforward: hook kfree_skb, capture the information I needed, and get it out to my Go application in real time.

Building the Monitor

The Kernel Side

The eBPF program itself is surprisingly small — about 30 lines of C:

#include "vmlinux.h"
#include <bpf/bpf_helpers.h>

struct event {
    u32 pid;        // Process context when drop occurred
    u32 reason;     // Why the kernel dropped it
    u64 location;   // Instruction pointer — where in kernel code
};

struct {
    __uint(type, BPF_MAP_TYPE_RINGBUF);
    __uint(max_entries, 1 << 16);  // 64KB ring buffer
    __type(value, struct event);
} events SEC(".maps");

SEC("tracepoint/skb/kfree_skb")
int trace_tcp_drop(struct trace_event_raw_kfree_skb *ctx) {
    if (ctx->reason <= 1) return 0;  // Not a real drop, bail immediately

    struct event *e = bpf_ringbuf_reserve(&events, sizeof(*e), 0);
    if (!e) return 0;  // Ring buffer full — skip silently

    e->pid      = bpf_get_current_pid_tgid() >> 32;
    e->reason   = ctx->reason;
    e->location = (u64)ctx->location;
    bpf_ringbuf_submit(e, 0);
    return 0;
}

A few things worth noting here. The filter on line one (reason <= 1) is critical - kfree_skb fires for every packet that gets freed, including ones that completed successfully. Reason 0 (SKB_DROP_REASON_NOT_SPECIFIED) and reason 1 (SKB_DROP_REASON_NO_REASON) are normal lifecycle events. The kernel freed the buffer after it was done with the packet, not because something went wrong. We only care about actual drops, so we bail out immediately for everything else. This keeps the overhead minimal even though we're hooking a very hot kernel path.

The ring buffer is a 64KB circular queue shared between kernel and userspace. When we call bpf_ringbuf_reserve, we claim space in it. When we call bpf_ringbuf_submit, the data becomes visible to userspace. If the buffer is full because userspace isn't reading fast enough, reserve returns NULL and we silently skip that event — no blocking, no spinning. The eBPF verifier enforces this: our program must terminate quickly, no exceptions.

Before any of this runs, the kernel's eBPF verifier statically analyzes the program. It proves there are no infinite loops, no unsafe memory accesses, no calls to unapproved functions. If verification fails, the program doesn't load. This is why eBPF is safe to run in production — you literally cannot get dangerous code past the verifier.

The Userspace Side

The Go side reads from the ring buffer and turns raw kernel events into something useful.

rd, _ := ringbuf.NewReader(objs.Events)

for {
    record, _ := rd.Read()  // Blocks until an event is available
    event := *(*monitorEvent)(unsafe.Pointer(&record.RawSample[0]))

    symbolName := findNearestSymbol(event.Location)
    reasonStr  := dropReasons[event.Reason]

    fmt.Fprintf(bufferedWriter, "[%s] Drop | PID: %-6d | Reason: %-18s | Function: %s\n",
        time.Now().Format("15:04:05"), event.Pid, reasonStr, symbolName)
}

The unsafe.Pointer cast deserves explanation. The ring buffer gives us raw bytes. We know the layout matches our C struct exactly - same fields, same order, same sizes. Rather than parsing the bytes manually (slow, error-prone), we reinterpret them directly as a Go struct. Zero allocation, zero copying. It's the only unsafe usage in the codebase, and it's justified.

Symbol resolution is where the interesting work happens. The kernel gave us a raw instruction pointer, something like 0xffffffff81a2b574. Meaningless to a human. To translate it, we load /proc/kallsyms at startup, around 200,000 kernel symbols, sorted by address. Then for each event, we do a binary search to find the function that contains our address, calculate the offset, and produce output like tcp_v4_syn_recv_sock+0x234. Now you know exactly which kernel function dropped the packet.

The output is written through a 256KB bufio.Writer. This matters more than it might seem, and it connects to something I discovered later.

What It Looks Like in Practice

[15:04:23] Drop | PID: 1234 | Reason: TCP_LISTEN_OVERFLOW | Function: tcp_v4_syn_recv_sock+0x234
[15:04:23] Drop | PID: 1234 | Reason: TCP_LISTEN_OVERFLOW | Function: tcp_v4_syn_recv_sock+0x234
[15:04:23] Drop | PID: 5678 | Reason: NETFILTER_DROP      | Function: nf_hook_slow+0x12a

Each line tells you: when it happened, which process was in context, why the kernel dropped it, and exactly where in kernel code the decision was made.

A Note on PID Accuracy

bpf_get_current_pid_tgid() returns the PID of whichever process the kernel is running when the drop occurs. For TCP_LISTEN_OVERFLOW, this is typically the listening process, the drop happens in its context. But for other drop types, particularly ones that occur during interrupt handling or in kernel threads, the PID might not correspond to the actual owner of the dropped packet.

This is a fundamental limitation of the approach. The kernel doesn't always know which userspace process "owns" a packet at the point it gets dropped. For debugging specific issues like my listen queue overflow, the PID is accurate and useful. For a general-purpose production monitoring tool, you'd want to validate accuracy per drop type before relying on it.

Beyond the Code

This project started as a way to fix a single bug, but it ended up being a masterclass in how much complexity lives just beneath our main() functions.

While you might reach for a platform like Cilium or Datadog for 24/7 production observability, there is something incredibly powerful about writing 30 lines of C that can peer into the heart of the kernel. It turns the "black box" of networking into a transparent stream of events.

The source for this project is open on GitHub.

Presented at Bengaluru Systems Meetup, January 2026. Thanks to the organizers for the welcoming "just show up and talk about what you built" energy, it made all the difference.

My Go Server was so fast it self-DDoS'd my laptop

Prachi Jha — Sun, 28 Dec 2025 11:34:10 +0000

In my fourth semester, my teammate and I built an auction server using sockets in Python. It was a final project for our Computer Networks course - functional enough to pass, inefficient enough to haunt me later.

Fast forward to last month, when polishing my resume, I realized I needed a project that showed I could handle high-concurrency systems. So, I revisited that old auction server and rewrote it in Go, aiming for a 2-3x performance improvement.

Full source code available here

Instead, I ended up with a 80% failure rate at 1,000 concurrent users.
Not because my code was broken. Because it was too fast. My Go implementation was so efficient it overwhelmed the Windows TCP stack and DDoS'd my own laptop.

This is that story.

The Python Baseline

The original Python version used selectors for I/O multiplexing. The standard stuff - The OS would wake up our event loop when clients sent bids, we'd process them, repeat. The architecture was clean and worked fine for the project demo.

However, I later realized that it had limitations. Python's Global Interpreter Lock (GIL) meant only one thread could execute bytecode at a time, no matter how many cores were available. My 18-core laptop was essentially operating on one thread.

Still, for a baseline test with 100 users sending 50 bids each (5,000 total requests), Python performed decently:

These are some solid numbers! Time to see how I went about improving this performance using Go.

The Go Rewrite

Go's concurrency model is fundamentally different from Python's. Every client connection gets its own goroutine, a lightweight thread that costs about 2KB of memory. When a goroutine blocks waiting for network I/O, Go's scheduler parks it and moves on to other work. There is no spinning, polling or wasted cycles.

I made three key architectural changes:

Binary protocol
I replaced string parsing with fixed-size headers. The server now knows exactly how many bytes to read for each message, eliminating guesswork and partial frame errors.
Persistence: Moving from local memory to Redis
I switched from using in-memory dictionaries to Redis with Lua scripts for atomic check-then-set operations. Now, every bid survives a server crash.
Buffered channels
I created a 5,000-slot channel buffer between the network layer and the bid processor. This decoupled "receiving data" from "processing data," allowing the system to handle traffic spikes without blocking.

I ran the same 100-user test and expected around 300k RPM.

I got 480,000 RPM. 100% success rate. With the Redis overhead.

Python was storing everything in local memory, with no external I/O beyond the client connections. In contrast, Go was making a network round-trip to Redis for each bid and still outperformed Python by 3.4x.

The Self-DDoS

I scaled the test to 1,000 users, each sending 50 bids. That's 50,000 total requests.

Python struggled but stayed functional:

Then I ran Go with the exact same parameters.

What.

The throughput was still higher than Python, but 80% of the requests failed. I checked my code looking for race conditions, panics, however I couldn't find anything that was obviously broken.

Then I ran netstat -s -p tcp to check the tcp stats, which revealed the problem: 11,053 Failed Connection Attempts and 31,984 Segments Retransmitted.

My server hadn't crashed. The Windows TCP stack just couldn't keep up.

The Root Cause

Here's what happened:

My Go benchmarker spawned 1,000 goroutines almost simultaneously and each tried to connect to the server immediately. That meant 1,000 SYN packets hit the kernel in a single burst.

The kernel has a finite "waiting room" for new connections (the listen backlog queue). When that queue overflowed, it started silently dropping SYN packets.

The clients, receiving no SYN-ACK response, assumed packet loss and retransmitted. This created a feedback loop: more retransmissions led to more congestion, which caused more drops and retransmissions.

Classic TCP Congestion Collapse.

But then, why did Python succeed?

Python "succeeded" because its GIL accidentally rate-limited the connections. It was too slow to overwhelm the Operating System.

Go exposed a bottleneck that Python never even reached.

The Fix

The solution wasn't in my application code. It lay in working with the physical limits of the operating system.

I implemented connection pacing by introducing a small delay between spawning each client goroutine.

1ms pacing: Success rate jumped from 21% to 82%.

2ms pacing:

3ms pacing:

The result? 100.00% Success Rate at 650,209 RPM.

That was great, but I thought, why stop here?

I experimented with increasing the delay between bids from the same user (from 10ms to 20ms), hoping it would help give the system more breathing room. However, I saw the success rate drop instead.

The problem? Holding 1,000 sockets open while they sat idle put massive pressure on the TCP window. Windows eventually timed them out to reclaim resources.

The sweet spot turned out to be 3ms pacing between connections, 10ms between bids. That's where the OS and Go runtime finally synced up.

Conclusion

The contrast is striking: Python stored all bid data in local memory with no external database or network hops beyond the client connection. Go made a Redis round-trip for every single bid.

Python peaked at ~150k RPM with dropped requests. Go sustained 650k RPM with 100% reliability and full persistence.

This wasn't about Go being "faster" in some abstract sense. It was about Go's runtime, designed to maximize modern hardware until hitting the next bottleneck, which in this case, was the operating system itself.

Python managed 100 users, not because it was exceptionally built, but because it was too slow to hit the limits of the OS that Go found at 1,000 users.

My Go server was finally fast enough to find the one limit I couldn't code my way out of: the physical capacity of the Windows TCP stack.

Moving from Python to Go was more than just changing syntax. It shifted how the application approached concurrency and I/O. By utilizing Go’s M:N scheduler and runtime netpoller instead of Python’s GIL-limited model, I was able to push the system to a point where the operating system became the bottleneck.

Hardware Details

CPU: Intel(R) Core(TM) Ultra 5 125 H (18 cores)
OS: Windows 11 Version 24H2
Redis: 7.x running locally on localhost
Network: All connections over loopback (127.0.0.1)
Go Version: 1.25.5
Python Version: 3.12.6

All tests were conducted on a single machine to eliminate network variability from application-level performance.

Resources

GitHub Repository: View full implementation and benchmarks
Raw benchmark data: Available in /benchmarks directory