TL;DR
- Increasing threads helps only up to a point (50 → 300 worked, 500 didn't)
- Too many threads add overhead (scheduling, context switching, memory, synchronization)
- More CPU/memory doesn't help if the system is waiting on DB/network
- Horizontal scaling is limited by data partitioning
- Batching improves efficiency but doesn't reduce per-event cost
- Key insight: Not all optimizations help; some just move the bottleneck
Most systems don't fail because of bad code. They fail because we assume they'll behave the same at scale.
In the beginning, everything feels fine. You write clean logic, test with small datasets, maybe simulate some load, and the system looks stable. Response times are acceptable, and the architecture feels reasonable. There is a quiet confidence that the system will scale when needed.
Then reality hits.
Under real traffic, the same system starts behaving very differently. Latency appears in places you didn't expect. Operations that felt trivial begin to stack up. Increasing threads or CPU doesn't give the improvement you thought it would. At some point, the system doesn't just slow down; it starts falling behind.
I ran into this while working on a high-throughput processing problem where millions of events had to be handled within a strict time window. It looked like a scaling problem at first. It turned out to be a design problem.
This is not a story about a perfect solution or a finalized architecture. It is about what actually breaks when systems are pushed to their limits, what changes made a real difference, and how your thinking needs to evolve when working with scale.
The Problem Looks Simple Until It Isn't
At a high level, the system followed a familiar pattern. Consume events from a stream, process each event, and write the result to a datastore. It is a clean and intuitive design, and at small scale, it works without much friction.
The initial implementation processed events one by one. Each event went through a series of steps: fetching configuration data, validating existing state, applying business logic, and finally writing the result. Each of these steps was individually efficient, and during early testing, the system behaved as expected.
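Concretely, the per-event flow looked something like the sketch below. The function names are hypothetical stand-ins, not the real system's API:

```python
# A minimal sketch of the per-event pipeline described above.
# fetch_config, load_state, and write_result are hypothetical
# stand-ins for the real external calls (config service, datastore).

def fetch_config(event):          # network call, once per event
    return {"multiplier": 2}

def load_state(event):            # datastore read, once per event
    return {"seen": False}

def apply_logic(event, config, state):
    return event["value"] * config["multiplier"]

def write_result(event, result):  # datastore write, once per event
    return result

def process_event(event):
    config = fetch_config(event)          # external interaction 1
    state = load_state(event)             # external interaction 2
    result = apply_logic(event, config, state)
    return write_result(event, result)    # external interaction 3

results = [process_event({"value": v}) for v in range(5)]
# Three external interactions per event: invisible at small scale,
# dominant at millions of events.
```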
The problem started when the volume increased.
Each event triggered multiple interactions with external systems. Even if each interaction took only a few milliseconds, the total latency per event began to grow. Multiplied across millions of events, the system doesn't degrade gradually; it falls behind quickly.
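A back-of-envelope calculation, with purely illustrative numbers, shows how quickly this compounds:

```python
# Rough math: why "a few milliseconds per event" stops scaling.
# All numbers here are illustrative assumptions, not measurements.
events = 5_000_000
external_calls_per_event = 3
latency_per_call_ms = 4

total_ms = events * external_calls_per_event * latency_per_call_ms
total_hours = total_ms / 1000 / 3600
print(round(total_hours, 1))  # ~16.7 hours of pure waiting,
# sequentially, before any business logic even runs.
```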
Throughput problems are rarely compute problems. They are coordination and dependency problems.
The system was not spending most of its time executing logic. It was spending time waiting. Waiting for data, waiting for responses, and waiting for other systems to keep up. As a result, the throughput of the entire pipeline became limited by the slowest dependency in the chain.
What looked like a simple processing system was actually a tightly coupled pipeline where every step depended on something else. That structure worked at small scale, but under load, it became the primary limitation.
What Didn't Work As Expected
Some approaches that looked correct initially did not hold up under scale.
The first was aggressive concurrency. Increasing threads from around 50 to 300 improved throughput noticeably. But pushing further to 500 did not reduce processing time. Instead, the system spent more time managing threads than doing actual work.
At higher thread counts, overhead becomes dominant. Each thread is no longer just processing data. For every thread, the system also has to:
- schedule threads
- switch between them
- manage memory for each execution
- handle synchronization overhead
With 500 threads competing for limited CPU cores, most are waiting rather than executing. Frequent context switching adds overhead faster than useful work increases, causing throughput to plateau or even degrade.
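A toy model makes the plateau visible. The inputs (16 cores, 9 ms of I/O wait and 1 ms of CPU per event) are assumptions chosen for illustration, not measurements from the real system, and the model deliberately ignores the context-switch overhead that makes real 500-thread runs even worse:

```python
# Toy model of why thread count stops helping.
# Each event spends `wait_ms` blocked on I/O and `cpu_ms` on-CPU.
# With `cores` CPUs, the on-CPU portion caps total throughput
# no matter how many threads exist.

def max_throughput(threads, cores, wait_ms, cpu_ms):
    # One thread can finish at most 1000/(wait+cpu) events/sec.
    per_thread = 1000 / (wait_ms + cpu_ms)
    # All threads together cannot exceed the CPU ceiling.
    cpu_ceiling = cores * 1000 / cpu_ms
    return min(threads * per_thread, cpu_ceiling)

for n in (50, 300, 500):
    print(n, max_throughput(n, cores=16, wait_ms=9, cpu_ms=1))
# 50 threads reach 5000 events/s; 300 hit the 16000/s CPU ceiling;
# 500 stay at the same ceiling while adding scheduling overhead
# the model does not even count.
```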
Adding more infrastructure showed similar limits. Increasing CPU and memory helped slightly, but the system was not compute-bound. Most of the time was spent waiting on network calls and data access, so additional compute did not remove the bottleneck.
Horizontal scaling also plateaued. Increasing instances helped only up to the level of available parallelism in the data stream. Beyond that, each instance faced the same constraints, limiting overall gains.
Batching improved efficiency, but not the core cost. Expensive operations were still happening per event inside each batch, so the impact remained limited.
Not all optimizations improve performance. Some just move the bottleneck.
What Actually Helped
The real improvement came from reducing unnecessary work rather than trying to make existing work faster.
The first step was to remove repeated external lookups from the critical path. Instead of fetching configuration data for every event, the data was loaded once and reused. This eliminated a large number of redundant calls and significantly reduced latency.
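A minimal sketch of this load-once-and-reuse pattern, with `fetch_config_from_service` as a hypothetical stand-in for the remote call:

```python
# Sketch: hoist the per-event config lookup out of the critical path.
# fetch_config_from_service is a hypothetical remote call; the
# counter just proves how rarely it fires.

calls = {"count": 0}

def fetch_config_from_service():
    calls["count"] += 1           # simulate one remote round trip
    return {"multiplier": 2}

_config_cache = None

def get_config():
    global _config_cache
    if _config_cache is None:     # load once, reuse for every event
        _config_cache = fetch_config_from_service()
    return _config_cache

results = [e * get_config()["multiplier"] for e in range(1000)]
# 1000 events processed, but only one remote call made.
```

In a real system the cache would also need an invalidation or refresh policy; that trade-off is omitted here for brevity.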
A similar approach was applied to state validation. Instead of querying the datastore for every event, relevant data was cached in memory or in a fast in-memory store. This allowed the system to make decisions quickly without relying on network-bound operations.
If your system depends on another system for every event, it is not truly scalable.
Batch processing also improved efficiency. Instead of processing events strictly one by one, consuming them in batches reduced overhead in both data fetching and execution. This allowed better utilization of available resources.
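The batching idea can be sketched as follows, assuming hypothetical `consume` and `bulk_write` helpers in place of the real stream consumer and datastore client:

```python
# Sketch: amortize per-call overhead by handling events in batches.
# consume and bulk_write are hypothetical stand-ins for the stream
# consumer and the datastore's bulk API.

def consume(events, batch_size):
    for i in range(0, len(events), batch_size):
        yield events[i:i + batch_size]

written = []

def bulk_write(batch_results):        # one round trip per batch
    written.append(list(batch_results))

events = list(range(10))
for batch in consume(events, batch_size=4):
    bulk_write(v * 2 for v in batch)

# 10 events, but only 3 datastore round trips instead of 10.
```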
Concurrency was still useful, but only within limits. Around 300 threads provided the best balance between parallel execution and system overhead. Beyond that, additional threads increased complexity without improving throughput.
Another important improvement was aligning the number of processing units with how the data was partitioned. The system performed best when the level of parallelism matched the structure of the incoming data stream. This ensured that resources were used effectively without unnecessary contention.
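One way to sketch this alignment, assuming the stream is partitioned by key hash (as Kafka-style streams typically are):

```python
# Sketch: one worker per partition, so parallelism matches the
# data layout. Hash-partitioning by key is an assumption about
# how the incoming stream is organized.

NUM_PARTITIONS = 4

def partition_for(key):
    return hash(key) % NUM_PARTITIONS

# One queue per partition: events with the same key always land on
# the same worker, so per-key state needs no cross-worker locking.
worker_queues = [[] for _ in range(NUM_PARTITIONS)]

for event in ({"key": f"user-{i}", "value": i} for i in range(100)):
    worker_queues[partition_for(event["key"])].append(event)
```

Adding a fifth worker here would gain nothing: with four partitions, the stream offers only four units of parallelism.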
The key shift was simple but powerful: reduce the amount of work happening inside the critical path.
The Mental Model That Changed Everything
The biggest change was not in tools or technologies, but in how the problem was approached.
Instead of asking how to make the system faster, the better question became: where is the time actually going?
In this case, the answer was clear. The system was not compute-heavy. It was wait-heavy.
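One way to get that answer is a per-stage timing breakdown. In this sketch the stages and their `sleep` durations are illustrative stand-ins for real calls, not measurements:

```python
# Sketch: answer "where is the time going?" with per-stage timing.
import time
from collections import defaultdict
from contextlib import contextmanager

timings = defaultdict(float)

@contextmanager
def timed(stage):
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[stage] += time.perf_counter() - start

def process_event(event):
    with timed("fetch_config"):
        time.sleep(0.002)      # stand-in for a network call
    with timed("compute"):
        _ = event * 2          # stand-in for business logic
    with timed("write"):
        time.sleep(0.003)      # stand-in for a datastore write

for e in range(20):
    process_event(e)

# Sorting the breakdown makes a wait-heavy profile obvious:
for stage, total in sorted(timings.items(), key=lambda kv: -kv[1]):
    print(f"{stage:>14}: {total * 1000:6.1f} ms")
```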
That distinction matters.
If a system spends most of its time waiting, increasing concurrency does not solve the problem. It increases contention and overhead. If a system spends most of its time computing, then parallel execution becomes effective.
Concurrency is not a solution by itself. It is a multiplier of the underlying behavior.
This leads to a practical approach to system design:
- first identify the bottleneck
- then reduce or eliminate it from the critical path
- and only then apply concurrency where it makes sense
Another important realization is that every system has limits. These limits could come from thread management, network latency, or how work is partitioned. Once those limits are reached, adding more resources does not improve performance.
Understanding these limits early helps avoid wasted effort on optimizations that do not provide real gains.
Where Things Still Break
Even after applying these improvements, the system still did not meet the required processing window.
This is where scaling becomes significantly more complex.
At this stage, most obvious inefficiencies have already been addressed. Improvements become incremental, and each change provides smaller benefits compared to the previous one.
Some operations remain unavoidable. State updates, validations, and writes to the datastore are essential parts of the system. Even when optimized, they still introduce latency.
There is also a growing coordination cost. As the system scales, managing multiple workers, handling shared state, and ensuring consistency introduces additional overhead. These costs are not always visible at smaller scales but become significant under heavy load.
At this point, scaling is no longer about fixing clear inefficiencies. It becomes a problem of trade-offs. Improving one aspect of the system may negatively impact another. Reducing latency might increase complexity. Simplifying the design might reduce performance.
Many systems reach this stage and stop improving, not because they are poorly designed, but because they have reached the limits of their current architecture.
Key Takeaways
- Scaling is not about applying a single technique or tool. It is about understanding how the system behaves under real conditions.
- Throughput problems are rarely caused by slow computation. They are caused by dependencies and coordination overhead.
- Concurrency improves performance only up to a certain point. Beyond that, it introduces overhead that can reduce efficiency.
- External dependencies become the dominant factor at scale. Reducing reliance on them within the critical path is one of the most effective optimizations.
- Removing unnecessary work is often more impactful than optimizing existing work.
- Finally, scalability is constrained by how work is distributed. Systems can only scale as much as their underlying parallelism allows.
What This Changed for Me
Before working on systems like this, scaling felt like a resource problem. If something was slow, the solution seemed straightforward: add more threads, increase capacity, or distribute the workload.
That perspective changed completely.
What appears to be a performance issue is often a design issue. Systems slow down not because they cannot process data fast enough, but because of how work is structured and where time is spent.
There is no universal solution. Approaches like multithreading, multiprocessing, or distributed systems are only effective when they align with the nature of the workload.
Scaling does not fail because systems are inherently slow. It fails because we misjudge where the real cost lies.
Understanding that changes how you design everything that follows.
🔗 Connect with Me
📖 Blog by Naresh B. A.
👨‍💻 Building AI & ML Systems | Backend-Focused Full Stack
🌐 Portfolio: Naresh B A
📫 Let's connect on LinkedIn | GitHub: Naresh B A
Thanks for spending your precious time reading this. It's my personal take on a tech topic, and I really appreciate you being here. ❤️