In my fourth semester, my teammate and I built an auction server using sockets in Python. It was a final project for our Computer Networks course - functional enough to pass, inefficient enough to haunt me later.
Fast forward to last month, when polishing my resume, I realized I needed a project that showed I could handle high-concurrency systems. So, I revisited that old auction server and rewrote it in Go, aiming for a 2-3x performance improvement.
Full source code available here
Instead, I ended up with an 80% failure rate at 1,000 concurrent users.
Not because my code was broken. Because it was too fast. My Go implementation was so efficient it overwhelmed the Windows TCP stack and DDoS'd my own laptop.
This is that story.
The Python Baseline
The original Python version used selectors for I/O multiplexing. The standard stuff: the OS would wake our event loop when clients sent bids, we'd process them, and repeat. The architecture was clean and worked fine for the project demo.
However, I later realized it had a hard ceiling. Python's Global Interpreter Lock (GIL) means only one thread can execute Python bytecode at a time, no matter how many cores are available. My 18-core laptop was effectively running on a single core.
Still, for a baseline test with 100 users sending 50 bids each (5,000 total requests), Python performed decently:
Solid numbers for a baseline. Time to see how much further Go could push them.
The Go Rewrite
Go's concurrency model is fundamentally different from Python's. Every client connection gets its own goroutine, a lightweight thread that costs about 2KB of memory. When a goroutine blocks on network I/O, Go's scheduler parks it and moves on to other work. No spinning, no polling, no wasted cycles.
I made three key architectural changes:
Binary protocol
I replaced string parsing with fixed-size headers. The server now knows exactly how many bytes to read for each message, eliminating guesswork and partial-frame errors.

Persistence: Moving from local memory to Redis

I switched from in-memory dictionaries to Redis, with Lua scripts for atomic check-then-set operations. Now every bid survives a server crash.

Buffered channels
I created a 5,000-slot channel buffer between the network layer and the bid processor. This decoupled "receiving data" from "processing data," allowing the system to handle traffic spikes without blocking.
I ran the same 100-user test and expected around 300k RPM.
I got 480,000 RPM. 100% success rate. With the Redis overhead.
Python was storing everything in local memory, with no external I/O beyond the client connections. In contrast, Go was making a network round-trip to Redis for each bid and still outperformed Python by 3.4x.
The Self-DDoS
I scaled the test to 1,000 users, each sending 50 bids. That's 50,000 total requests.
Python struggled but stayed functional:
Then I ran Go with the exact same parameters.
What.
The throughput was still higher than Python's, but 80% of the requests failed. I combed through my code for race conditions and panics, yet found nothing obviously broken.
Then I ran netstat -s -p tcp to check the TCP statistics, which revealed the problem: 11,053 Failed Connection Attempts and 31,984 Segments Retransmitted.
My server hadn't crashed. The Windows TCP stack just couldn't keep up.
The Root Cause
Here's what happened:
My Go benchmarker spawned 1,000 goroutines almost simultaneously and each tried to connect to the server immediately. That meant 1,000 SYN packets hit the kernel in a single burst.
The kernel has a finite "waiting room" for new connections (the listen backlog queue). When that queue overflowed, it started silently dropping SYN packets.
The clients, receiving no SYN-ACK response, assumed packet loss and retransmitted. This created a feedback loop: more retransmissions led to more congestion, which caused more drops and retransmissions.
Classic TCP Congestion Collapse.
But then, why did Python succeed?
Python "succeeded" because its GIL accidentally rate-limited the connections. It was simply too slow to overwhelm the operating system.
Go exposed a bottleneck that Python never even reached.
The Fix
The solution wasn't in my application code. It was in working within the physical limits of the operating system.
I implemented connection pacing by introducing a small delay between spawning each client goroutine.
1ms pacing: Success rate jumped from 21% to 82%.
2ms pacing:
3ms pacing:
The result? 100.00% Success Rate at 650,209 RPM.
That was great, but I thought, why stop here?
I experimented with increasing the delay between bids from the same user (from 10ms to 20ms), hoping it would help give the system more breathing room. However, I saw the success rate drop instead.
The problem? Holding 1,000 sockets open while they sat idle put massive pressure on the TCP window. Windows eventually timed them out to reclaim resources.
The sweet spot turned out to be 3ms pacing between connections, 10ms between bids. That's where the OS and Go runtime finally synced up.
Conclusion
The contrast is striking: Python stored all bid data in local memory with no external database or network hops beyond the client connection. Go made a Redis round-trip for every single bid.
Python peaked at ~150k RPM with dropped requests. Go sustained 650k RPM with 100% reliability and full persistence.
This wasn't about Go being "faster" in some abstract sense. It was about Go's runtime being designed to push modern hardware until it hits the next bottleneck; in this case, that bottleneck was the operating system itself.
Python managed 100 users, not because it was exceptionally built, but because it was too slow to hit the limits of the OS that Go found at 1,000 users.
My Go server was finally fast enough to find the one limit I couldn't code my way out of: the physical capacity of the Windows TCP stack.
Moving from Python to Go was more than just changing syntax. It shifted how the application approached concurrency and I/O. By utilizing Go’s M:N scheduler and runtime netpoller instead of Python’s GIL-limited model, I was able to push the system to a point where the operating system became the bottleneck.
Hardware Details
- CPU: Intel(R) Core(TM) Ultra 5 125H (18 cores)
- OS: Windows 11 Version 24H2
- Redis: 7.x running locally on localhost
- Network: All connections over loopback (127.0.0.1)
- Go Version: 1.25.5
- Python Version: 3.12.6
All tests were conducted on a single machine to eliminate network variability from application-level performance.
Resources
- GitHub Repository: View full implementation and benchmarks
- Raw benchmark data: Available in the /benchmarks directory








