Aviad Rozenhek

Posted on May 28

The Bitter Lesson for Backend Engineering

#ai #backend #software #agents

Every senior backend engineer has sat in a war room while smart people argue
over instincts.

"It is the ASGI event loop."
"No, it is the database proxy."
"Just throw more pods at it."

The Core Thesis

We treat optimization as a test of intuition. But in "The Bitter Lesson,"
Richard Sutton's account of 70 years of AI research, the durable winners were
not the systems that best encoded human intuition. They were the general methods
that scaled with computation, especially search and learning. Human knowledge
helped locally and felt satisfying, but it tended to plateau. Compute-scaled
methods kept improving as compute got cheaper.

Backend engineering is facing its own bitter lesson. The future of systems
engineering is not "replacing judgment with agents." It is:

Build the system of work so that more computation can be turned into more
correct engineering progress.

For this backend effort, that means using agent compute to run a systematic
engineering loop. The human sets the goal, picks the risk boundary, and decides
what evidence is good enough. The agents spend parallel compute on the tedious
but decisive work: reading call paths, building harnesses, running experiments,
fitting scaling curves, checking payload equivalence, and keeping a durable log.

The Problem We Unleashed It On

The immediate problem was not "make one slow endpoint faster." It was harsher:
we needed a path to 500-1000 RPS on endpoints we had not even written yet.

At the same time, many existing endpoints were already struggling at
single-digit RPS. Horizontal scaling was ineffective: each additional CPU or pod
contributed too little throughput. There was no single obvious culprit. It
might have been Python CPU, serializer shape, the ORM, database compute,
connection fan-out, external calls, worker slots, event-loop blocking, load
shedding, or some interaction between them.

That made local cleverness the wrong tool. We did not need one more confident
guess. We needed a methodology that could discover what had to be better before
the next generation of endpoints was designed around the same bottlenecks.

The Limits Of The 30 W Brain

A human body runs at roughly 100 W; the brain is often estimated around 20-30 W.
That biological efficiency is remarkable, but it is also a bottleneck. A single
engineer can only hold so many hypotheses, command histories, query plans,
deploy states, and before/after measurements in working memory.

Backend engineers pride themselves on being empirical. We talk in latency
histograms, queue depths, lock contention, p95s, CPU profiles, connection pools,
and query plans. Yet we still fall into vibe-based optimization because real
systems exceed human working memory.

The irony is that backend performance is governed by rigid laws: discrete math,
queuing theory, network physics, storage behavior, CPU scheduling, cache
invalidation, and contention. Unlike product taste or visual design, these
systems eventually answer to measurement. But a 30 W brain cannot accurately
simulate an asymmetric network bottleneck, a contended connection pool, and a
multi-join query plan under load.

It certainly cannot do that while also tracking deploy state, cache behavior,
and error budgets.

Cheap-enough agent compute changes the shape of the work. Three agents can
simultaneously:

design instrumentation;
build harnesses and runbooks;
inspect hot paths;
prove behavior equivalence;
watch readiness, proxy metrics, app metrics, and logs;
maintain shared coordination state.

The rational move is to spend compute to reduce uncertainty. The scarce
resource is no longer typing time. It is experimental discipline.

Without that discipline, we reach for plausible stories. The event loop. The ORM. The GIL. The proxy.
The pods. The cache. The database. Each story may contain truth, but treating
the story as the strategy is exactly the human-centric mistake Sutton described.

The scalable strategy is to turn those stories into experiments and let compute
do the search.

The Ego Cost Is Familiar

Sutton's essay is uncomfortable because it says researchers repeatedly preferred
approaches that embedded human insight, even after scalable computation kept
winning. The pain was not just technical; it was identity. It feels better to
believe progress comes from our clever model of the domain than from a general
method that searches harder, learns more, or evaluates more candidates.

Backend engineering has the same temptation. The scalable engineering method is
to make each plausible story falsifiable and let the measurements decide.

The Case Study

Three specialized agents were turned loose on one hot endpoint — a paginated
list view central to the product — with instructions to find the binding
constraint, not the prettiest fix. A gentle 3 RPS live stress run did the
work: the database hit ~100% CPU while the proxy and its connections sat
flat. The bottleneck was DB compute. Python CPU was falsified separately by
a local sweep (parallel efficiency held at ~93% across four-way parallelism
with negligible variance). Connection fan-out and the serving stack were
not loaded enough at 3 RPS to matter. The stress test, not the code review,
is what named the axis.

With the axis identified, the agents went after the query shapes that were
burning the CPU. The serializer N+1 collapsed by an order of magnitude
(179 → 15 queries at page size 10, byte-identical payload). The paginated
COUNT(*) over a distinct multi-join was the bigger win (~639 ms → ~1.5 ms,
~400× faster, same value). A forced GROUP BY rewritten as a correlated
subquery took the full local view path from about 1590 ms to about 174 ms;
on a warm pod the median dropped from about 2.25 s to about 1.57 s.
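
The ratios behind those numbers are worth computing rather than eyeballing, because "an order of magnitude" covers a wide range. A minimal sketch using only the before/after figures quoted above; the dictionary labels and the speedup helper are illustrative names, not part of any toolkit described in this post.

```python
def speedup(before: float, after: float) -> float:
    """Ratio of the old cost to the new cost (higher is better)."""
    return before / after


# (before, after) pairs as quoted in the case study.
REPORTED = {
    "serializer queries per page": (179, 15),
    "paginated COUNT(*) in ms": (639, 1.5),
    "full local view path in ms": (1590, 174),
    "warm-pod median in s": (2.25, 1.57),
}

if __name__ == "__main__":
    for label, (before, after) in REPORTED.items():
        print(f"{label}: {speedup(before, after):.1f}x")
```

The local wins are roughly 9x to 400x, while the warm-pod median improves by well under 2x. That gap is the first hint that something other than this one query shape is also binding on the deployed path.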

Those wins matter as evidence that the DB CPU axis is reducible — that
the binding constraint stress testing surfaced is also a tractable one.
They are not the lesson. The lesson is what the search exposed when it ran
past one endpoint. The system is not bound by a single bottleneck. It is
bound by five, each with a different blast radius, a different toolchain
to investigate, and a different treatment cost:

The DB CPU wall. The 2-vCPU writer hits ~100% long before connections saturate. The surface fixes compress one query class on one endpoint; underneath, dozens of unrelated endpoints share the same query shapes, with the same low-RPS CPU cost. A hotspot inventory pairing pg_stat_statements with EXPLAIN is where the next ten- and twenty-times wins live.
The serving-layer ceiling. Even with the database idle, an in-cluster harness reproduces a ~1k-RPS wall per instance: four GIL-bound gunicorn workers paying a per-request cost. Adding pods does not buy RPS until that ratio improves.
The middleware tax. A cache-hit response has no DB cost and no business logic, yet still measures in tens of milliseconds. That floor — auth resolution, DRF rendering, observability middleware, session and CSRF on routes that need neither — is what the cache-hit ceiling cares about.
Caching strategy. A cache that is not stampede-protected fans out a miss into exactly the DB compute the surface fixes were trying to avoid. The cache only matters if it is reachable cheaply and protected end-to-end.
3rd-party blocking on the request path. Some endpoints still make synchronous calls to external services during request handling. Those endpoints cannot scale linearly with workers — and offloading them is a behavior change that needs product sign-off rather than a perf-only PR.
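
The caching axis is the easiest of the five to show in miniature. Below is a minimal sketch of stampede protection, assuming an in-process cache shared by threads. SingleFlightCache, its get method, and count_loads_under_burst are hypothetical names for illustration, not the code the team shipped; a Redis-backed version would also need TTLs and a cross-process lock.

```python
import threading
import time
from typing import Any, Callable, Dict


class SingleFlightCache:
    """In-process cache where concurrent misses on one key share one load."""

    def __init__(self) -> None:
        self._values: Dict[str, Any] = {}
        self._key_locks: Dict[str, threading.Lock] = {}
        self._guard = threading.Lock()  # protects both dicts

    def get(self, key: str, loader: Callable[[], Any]) -> Any:
        with self._guard:
            if key in self._values:
                return self._values[key]  # hit: loader is never called
            key_lock = self._key_locks.setdefault(key, threading.Lock())
        # Only one thread per key runs the section below at a time.
        with key_lock:
            with self._guard:
                if key in self._values:
                    return self._values[key]  # filled while we waited
            value = loader()  # the expensive work, run once per miss
            with self._guard:
                self._values[key] = value
            return value


def count_loads_under_burst(n_threads: int) -> int:
    """Fire n_threads simultaneous misses at one key; return loader calls."""
    cache = SingleFlightCache()
    calls = []

    def slow_loader() -> str:
        time.sleep(0.05)  # stand-in for a slow query
        calls.append(1)
        return "payload"

    threads = [
        threading.Thread(target=cache.get, args=("hot-key", slow_loader))
        for _ in range(n_threads)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return len(calls)
```

Without the per-key lock, a burst of eight misses would run the loader up to eight times, which is the fan-out into DB compute described above. With it, the burst costs one load.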

This breakdown matters because each scenario hits a different axis. The 50K
simultaneous-join burst is dominated by one endpoint and probably goes the
caching plus middleware-bypass route. The sustained 1K livestream uses
six endpoints driven by FE polling cadence and is dominated by the DB CPU
wall and 3rd-party blocking axes; middleware optimization buys it less.
"Make the backend faster" is a wish; "make this scenario's binding axis
cheaper" is a ranked list of PRs.

The interesting part is not any single optimization, or even the cascade of
falsifications and confirmations that produced the surface wins. It is the
shape of what was found underneath: not one bottleneck, but a class of
them, with different scenarios bound by different axes. The agents narrowed
the search space with evidence and let the bottleneck move without defending
the previous theory, which is the only way a system of this shape gets mapped
at all.

The Evidence At Scale

One endpoint's hypothesis cascade is a vignette. The harder claim — the
uncomfortable one — is that this kind of work scales: in roughly
thirty-six hours, a small team plus parallel agents did the shape and
volume of investigation that a senior engineer would historically have
spread across a quarter. The most important output is not any single
fix. It is the kinds of work that suddenly become routine.

The Investigation At Scale

The agents read across four production codebases plus a PRD for a system
that does not exist yet — the API backend, the web frontend (to extract
real polling cadences and request fan-out), the AI/livestream service,
and the mobile/edge client paths, alongside the product requirements for
a 50K-person event flow with no working implementation. They produced,
per traffic scenario, the cadences, the at-risk factors, and the
treatments, and turned each into durable artifacts: a capacity model
deriving per-instance RPS from worker count and CPU-per-request; a
DB-CPU hotspot inventory ranking query shapes by cost so the ten- and
twenty-times wins land before the 1.2× ones; a four-scenario event-day
load profile (bursty purchase, 50K simultaneous join, initial-pressure
livestream, sustained livestream — each with a different binding axis);
and a synthesis collapsing dozens of hypotheses into the five axes the
case study named.
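
The capacity model mentioned above can be approximated in a few lines. This is a sketch under stated assumptions, not the team's actual model: it assumes CPU-bound workers that each handle one request at a time, and the headroom default of 0.7 is a hypothetical safety margin. It covers the serving layer only and says nothing about the DB CPU axis.

```python
import math


def per_instance_rps(workers: int, cpu_seconds_per_request: float) -> float:
    """Ceiling for one instance if every worker is busy 100% of the time."""
    return workers / cpu_seconds_per_request


def implied_cpu_ms_per_request(workers: int, observed_rps: float) -> float:
    """Invert the model: per-request CPU cost implied by a measured ceiling."""
    return 1000.0 * workers / observed_rps


def instances_needed(
    target_rps: float,
    workers: int,
    cpu_seconds_per_request: float,
    headroom: float = 0.7,  # assumed fraction of the ceiling we plan to use
) -> int:
    """Instances required to serve target_rps while staying under headroom."""
    usable = per_instance_rps(workers, cpu_seconds_per_request) * headroom
    return math.ceil(target_rps / usable)
```

Under this model, the ~1k-RPS wall with four workers implies roughly 4 ms of serving-layer cost per request (implied_cpu_ms_per_request(4, 1000)). Halving that cost doubles the ceiling per instance, which is why the ratio matters more than the pod count.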

That analysis fed a hypothesis pipeline rather than a stack of one-off
optimizations. Roughly sixty tracking tickets opened under the parent
audit — about twenty-five explicit hypothesis tickets (each naming the
smell, the affected endpoints, and a falsifier), about seven paired
data-verification tickets whose only job is to confirm or kill a sibling
hypothesis, and the rest execution sub-projects (per-PR side backend,
read-replica routing, perf budgets, observability automation, DevOps
unblocks). Several stories were confirmed by measured wins of the kind
the case study chronicles (roughly an order of magnitude on serializer
fan-out and on a forced aggregation, more than two on paginated counts),
extended to more than one endpoint. Several others were
explicitly killed, which matters more than it sounds: the
Python-CPU-as-primary-limiter story (falsified by a local sweep that
scaled near-linearly); the load-source story for the ~1k-RPS ceiling
(the ceiling is real and lives in the worker count × CPU-per-request
budget); the token-cache premise (raw signature verification was too
cheap for the cache to matter); a codec swap assumed to be a meaningful
win that turned out to be noise. About two dozen hypotheses remain open
in triage, each queued behind its own data-verification ticket. The
point is not that the audit is finished. The point is that the queue
itself — falsifiable, prioritized, paired with a verification recipe —
is the durable output, built in parallel with the work that closed the
first batch. A traditional team builds this queue or builds the fixes;
this team built both.

Treatments At Scale

The shipped treatments span the categories the analysis pointed at
rather than clustering on a single favorite. In the same window:
query-shape rewrites (N+1 collapse, cheap-count pagination, GROUP-BY
elimination); per-request-cost reductions (eager-loading on the auth
path, explicit connection budgeting); platform plumbing (a per-PR
side backend so each branch can be benchmarked in isolation, an
in-cluster load harness, an observability-bundle validator, a fixed-rate
stress client); and methodology infrastructure (a clean perf-PR
pattern, an opt-in perf measurement toolkit, a Django caching playbook,
a per-scenario load-profile doc, a stress-test report for review by
humans who were not in the experiment loop).

Each treatment shipped as a small, single-anti-pattern PR off main that
kept response payloads byte-identical, rather than as a bundled lab
branch. That discipline is
itself a result. A senior engineer doing all of this serially does not
get to be that disciplined — they bundle, because bundling is how
working memory preserves context across changes. Parallel agents can
afford the discipline, and the team inherits a reviewable trail instead
of a heroic mega-merge. Beyond the PRs themselves, the cycle produced
roughly fifteen durable documents — capacity models, methodology specs,
optimization playbooks, validity gates, worked examples, runbooks, and
coordination logs — so the reasoning lives in the repository instead of
in chat history.

The honest read of this is not that the agents won. It is that the
parts of senior engineering work that used to be scarce — exhaustive
analysis across codebases, parallel falsification of competing stories,
ranked treatment lists with measured magnitudes, and writeups that
survive past the conversation that produced them — are no longer
rate-limited by typing time or by working memory. They are rate-limited
by whether the team sets the goal clearly, defines the falsifier
honestly, and accepts the evidence when it arrives.

That last one is where the bitter part lives. Some of the killed
hypotheses were stories the team had defended. Some of the confirmed
wins were not in anyone's top three guesses. The method does not care
which story was satisfying; it cares which story survives contact with
measurement. Accepting that — and learning to take pride in defining
goals and falsifiers rather than in being the one who also typed every
fix — is what makes the rest of it possible.

The Meta-Method And Architecture

The work succeeded because it was not run as a pile of disconnected chat turns.
It was run as a long-running goal architecture with a shared objective,
specialized agents, explicit containment, and durable state.

The first phase was deliberately constrained: prepare PRs, tests, hypotheses,
data, scripts, runbooks, and reasoning, but do not deploy and do not touch the
shared database. The agents did not burn the target environment just to feel
productive. They built the lab: local provider stubs, fresh-migrate workarounds,
isolated data paths, baseline controls, instrumentation, load tooling, parsers,
runbooks, and a coordination protocol.

After live execution was approved, the operating model shifted from isolation
to controlled target-environment work. The agents imported target-shaped data
into local Postgres, coordinated deployment windows, recorded deployed SHAs,
watched stop conditions, and pushed the system only until the next useful data
point appeared.

When a 3 RPS run saturated the environment, they treated it not as a crisis or
a reason to push harder, but as evidence: arrival rate times per-request wall
time was already enough to fill the worker slots and tip the system over.
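
That reading is Little's Law applied to worker slots. A rough sketch of the arithmetic follows, with loud assumptions: the 2.25 s wall time is the warm-pod median quoted in the case study for one endpoint before its fix, and the four slots are the per-instance gunicorn worker count quoted earlier. Whether the 3 RPS run landed on a single instance of that shape is an assumption made here for illustration.

```python
def in_flight(arrival_rps: float, wall_seconds: float) -> float:
    """Little's Law: average requests in flight = throughput x latency."""
    return arrival_rps * wall_seconds


def saturating_rps(worker_slots: int, wall_seconds: float) -> float:
    """Arrival rate at which in-flight work equals the available slots."""
    return worker_slots / wall_seconds


if __name__ == "__main__":
    # Illustrative inputs only; see the assumptions in the paragraph above.
    print(in_flight(3.0, 2.25))       # average concurrent requests at 3 RPS
    print(saturating_rps(4, 2.25))    # RPS at which four slots are full
```

With those inputs, 3 RPS keeps about 6.75 requests in flight against four slots, and the slots fill at under 2 RPS. Queueing starts long before the load looks impressive.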

That architecture had five essential parts:

A shared objective:
make 500-1000 RPS plausible with effective horizontal scaling, not just make
one benchmark look better.
A falsifiable model:
use Amdahl's law, Little's Law (in-flight work ≈ throughput × latency),
and the Universal Scalability Law to ask
whether throughput is capped by CPU, DB compute, DB connections, event-loop
blocking, external calls, or queueing.
Isolation before risk:
local Docker, an imported target-environment clone, local provider stubs, and
explicit live-execution gates allowed fast iteration without turning the
shared environment into the lab by default.
Agent specialization:
one agent owned app-code endpoint proofs and synthesis; another owned
harnesses, deployment, observability, and scaling sweeps; another owned
independent review and adjacent endpoint proofs. The taskboard and
append-only channel made collisions visible.
Proof over assertion:
changes were paired with query counts, timings, tests, payload hashes,
isolated worktree comparisons, target-environment samples, and explicit
caveats.
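
The scaling laws named in the falsifiable model are small enough to keep next to the data. A minimal sketch with standard textbook formulas; the function names are chosen here for illustration. The Karp-Flatt helper is an addition not mentioned in the post: it inverts Amdahl's law to estimate a serial fraction from a measured speedup.

```python
def amdahl_speedup(parallel_fraction: float, n: int) -> float:
    """Amdahl's law: S(n) = 1 / ((1 - p) + p / n)."""
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / n)


def usl_throughput(n: int, single_rps: float, alpha: float, beta: float) -> float:
    """Universal Scalability Law.

    X(n) = X(1) * n / (1 + alpha * (n - 1) + beta * n * (n - 1))
    alpha models contention, beta models coherency (crosstalk) cost.
    """
    return single_rps * n / (1.0 + alpha * (n - 1) + beta * n * (n - 1))


def karp_flatt_serial_fraction(speedup: float, n: int) -> float:
    """Serial fraction implied by a measured speedup on n workers."""
    return (1.0 / speedup - 1.0 / n) / (1.0 - 1.0 / n)


if __name__ == "__main__":
    # The local sweep reported ~93% parallel efficiency at four-way.
    measured_speedup = 0.93 * 4
    print(karp_flatt_serial_fraction(measured_speedup, 4))
```

Applied to the ~93% four-way efficiency from the local sweep, the implied serial fraction is about 2.5%. That is the quantitative form of "Python CPU is not the primary limiter": a model with that little serial work cannot explain single-digit RPS.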

Backend optimization is longer than one prompt. It crosses code reading,
experiment design, harness construction, local reproduction, PR preparation,
deploy orchestration, observability capture, measurement correction, endpoint
fixes, and synthesis.

A stateless chat loop tends to forget the global objective or collapse into the
next visible task. Long-running goals kept the agents attached to the durable
outcome and gave them a place to preserve context across hours of asynchronous
code generation, deployment windows, and log analysis.

The operating loop was simple:

State the bottleneck hypothesis in physical terms.
Define the falsifier.
Build the smallest diagnostic.
Measure wall, CPU, DB time, external waits, queueing, errors, and saturation.
Change one lever.
Prove behavior, ideally with byte-identical payloads.
Re-measure the same way.
Update the model when the bottleneck moves.
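
The "prove behavior" and "re-measure the same way" steps can be reduced to a small record per change. A sketch, assuming the harness captures the raw response body, the query count, and the wall time for each sample; Sample, payload_digest, and compare are hypothetical names, not the team's toolkit.

```python
import hashlib
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class Sample:
    body: bytes        # raw response bytes, before any parsing
    query_count: int   # DB queries issued while serving the request
    wall_ms: float     # end-to-end wall time in milliseconds


def payload_digest(body: bytes) -> str:
    """SHA-256 of the raw body; equal digests mean byte-identical payloads."""
    return hashlib.sha256(body).hexdigest()


def compare(before: Sample, after: Sample) -> Dict[str, object]:
    """Summarize one change: did behavior hold, and what moved?"""
    return {
        "payload_identical": payload_digest(before.body) == payload_digest(after.body),
        "queries": (before.query_count, after.query_count),
        "speedup": before.wall_ms / after.wall_ms,
    }
```

Hashing the raw bytes is deliberately strict. A change that reorders JSON keys fails the check even though the parsed data is equal, which forces that difference to be reviewed rather than waved through.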

That last step is the hardest culturally. If the bottleneck moves, that is
success. The point is not to defend the first theory. The point is to make the
next limiter visible.

The Guardrails

Cheap parallel compute is an accelerant, which means bad goal architecture will
amplify engineering errors just as quickly as it amplifies useful work. An
agent running an undisciplined optimization loop will happily DDoS a shared
environment, burn through observability spend, hide a regression behind a
cached response, or hallucinate a benchmarking victory by measuring failures as
throughput.

The human role therefore does not disappear. It moves from "manually perform
every step" to "define the containment boundary." The safety rails are not
optional process decoration; they are what make autonomous exploration usable:
local-first execution, explicit live-execution gates, one load window at a
time, stop conditions, deployed SHA capture, query-count and payload-equivalence
proofs, error-rate thresholds, and append-only coordination logs.

The measurement bug caught in this work is the small version of the same risk:
the first load harness reported achieved RPS in a way that counted failures.
Without experimental guardrails, cheap compute would have made that wrong
number look more convincing. With guardrails, the agents corrected the harness,
reclassified the run, and improved the method.
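
The fix for that class of bug is to report attempted and achieved rates separately and to refuse to call a run valid above an error threshold. A minimal sketch, not the corrected harness itself: RequestResult, summarize_run, and the 1% max_error_rate default are assumptions for illustration, and only 2xx responses are counted as successes here.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class RequestResult:
    status: int       # HTTP status code returned to the load client
    latency_s: float  # wall time for this request in seconds


def summarize_run(
    results: List[RequestResult],
    duration_s: float,
    max_error_rate: float = 0.01,  # assumed threshold, tune per environment
) -> Dict[str, object]:
    """Separate what was sent from what actually succeeded."""
    total = len(results)
    ok = sum(1 for r in results if 200 <= r.status < 300)
    error_rate = (total - ok) / total if total else 1.0
    return {
        "attempted_rps": total / duration_s,
        "achieved_rps": ok / duration_s,  # failures do not count as throughput
        "error_rate": error_rate,
        "valid": total > 0 and error_rate <= max_error_rate,
    }


def example_run() -> Dict[str, object]:
    """Ten seconds of load where fast 503s would inflate a naive RPS figure."""
    results = [RequestResult(200, 0.2)] * 90 + [RequestResult(503, 0.01)] * 10
    return summarize_run(results, duration_s=10.0)
```

In the example, a naive harness would report 10 RPS. The summary reports 9 RPS achieved at a 10% error rate and marks the run invalid, so it cannot be quoted as a benchmark result.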

This method also has boundaries. It is strongest when the system is observable,
the goal is measurable, and improvement can proceed through small,
behavior-preserving changes. It is weaker when the task is green-field
architecture, product judgment, or a novel failure mode with poor telemetry.
There, human architectural taste and risk judgment still lead. The agents can
map the terrain, but they cannot decide what terrain is worth owning.

The Real Lesson

The human contribution is still essential, but it moves up a level. The human
should not spend most of the scarce 30 W on manually remembering every path and
guessing which bottleneck matters. The human should set the objective, enforce
experimental hygiene, choose acceptable risk, and judge the evidence.

The agents should burn compute on the scalable parts:

exhaustive code reading;
independent hypothesis generation;
repeated benchmark runs;
query-plan and metric collection;
before/after equivalence checks;
deployment state verification;
runbook and taskboard maintenance;
synthesis into a capacity model.

This is the backend version of Sutton's lesson: do not over-invest in a human's
favorite explanation of the system. Invest in the meta-process that searches the
system, measures it, and improves as more compute is applied.

The discomfort is the lesson. Mature backend engineering will feel less like
winning arguments about the system and more like building machinery that makes
those arguments unnecessary.

The result may sting. A single senior engineer's intuition is valuable, but it
is not a substitute for three agents running disciplined experiments in
parallel. That is not an insult to engineering judgment. It is what engineering
judgment should choose when compute becomes cheap enough.

Counterfactual
Consider the estimate for a staff engineer with twenty years of wins across many varied fields (data engineering, applied AI, MongoDB at scale, video codecs you've actually heard of), none of which necessarily teaches you Django querysets, Postgres profiling, stampede-protected Redis, gunicorn worker math, or how a Kubernetes liveness probe interacts with a saturated worker pool.

Ramp-up to credible-not-expert across that stack: 2–3 months. Investigation work on top of the ramp-up: another 1–2 months. Call it roughly half a year, assuming nothing else catches fire. They'd ship ~10 heroic merged PRs and keep the rest as oral folklore in their head, because under deadline the writeups are the first thing that gets cut. They'd also, with high probability, miss at least one axis.

Versus ~36 hours with agents in parallel, all five axes named, the writeups extant. Call it a ~100× compression in wall time, plus the writeups exist at all, which on a half-year solo project they wouldn't.

External Reference

Richard Sutton, "The Bitter Lesson", March 13, 2019: http://www.incompleteideas.net/IncIdeas/BitterLesson.html

Top comments (2)

Gilder Miller • May 29

Yes, I like how it reframes engineering choices as a trade between encoded structure vs letting general mechanisms scale over time.
The point about schemas, pipelines, and “human-imposed structure” being the real constraint hits especially hard in modern agent-heavy systems.
Curious where you personally draw the line between necessary structure and over-engineering in production systems?
