The first thing I tried was the obvious thing: stand up a fake OpenAI endpoint that returns a hardcoded response, point the agent at it, and ramp up concurrent users. Mock the expensive dependency, isolate the variable, measure the infrastructure.
The agent entered an infinite loop on every request.
The reason comes down to how LangGraph agents work. Each turn, the agent calls OpenAI and gets back either a tool invocation or text. If it's a tool invocation, the agent runs the tool, appends the result to the message history, then calls OpenAI again, now with that result in context. OpenAI sees the tool result and responds with text. Turn over.
A dumb mock returns the same response every time regardless of what's in the history. So the agent calls a tool, gets back a tool invocation, runs the tool, appends the result, calls the mock again. Same response, same tool invocation. The result is sitting there in the history. The mock doesn't care. It loops forever.
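To make the loop concrete, here is a minimal sketch of the cycle described above. It is a toy stand-in, not LangGraph: `runAgentTurn` and `dumbMock` are hypothetical names, the message shapes only loosely follow OpenAI's chat format, and a `maxSteps` cap is added so the demo terminates where the real agent wouldn't.

```javascript
// Toy model of the agent cycle. Hypothetical helper names; not LangGraph's API.

// A "dumb" mock: ignores the history and always asks for the same tool.
function dumbMock(_messages) {
  return {
    role: 'assistant',
    content: null,
    tool_calls: [
      { id: 'call_1', type: 'function', function: { name: 'search', arguments: '{}' } },
    ],
  };
}

// Runs one user turn: call the model, execute any requested tools, repeat
// until the model answers with plain text or maxSteps is exhausted.
function runAgentTurn(model, userText, maxSteps = 10) {
  const messages = [{ role: 'user', content: userText }];
  for (let step = 1; step <= maxSteps; step++) {
    const reply = model(messages);
    messages.push(reply);
    const toolCalls = reply.tool_calls ?? [];
    if (toolCalls.length === 0) {
      return { finished: true, modelCalls: step };
    }
    for (const call of toolCalls) {
      // Stubbed tool execution; a real agent would run the tool here.
      messages.push({ role: 'tool', tool_call_id: call.id, content: 'stub result' });
    }
  }
  // Only reachable because of the artificial cap.
  return { finished: false, modelCalls: maxSteps };
}

console.log(runAgentTurn(dumbMock, 'hello'));
// → { finished: false, modelCalls: 10 }
```

The tool result is in `messages` on every pass; the mock simply never looks at it, so the only thing that stops the loop is the cap.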
The obvious workaround is turn-counting: return a tool invocation on the first call, text on the second. (Databricks' agent load-testing guide takes the same approach; I only found this after building and running my own.) It works if every conversation is exactly one tool call followed by a response. Mine weren't. Some requests hit no tools at all. Some chained two or three in sequence depending on what the first returned. Turn-counting breaks the moment the real path doesn't match what you hardcoded.
What actually works is simpler: check the incoming message history before deciding what to return. If the last message is a tool result, return text. If it isn't, return a tool invocation. No state, no counter. The mock just reads what the agent sends on each request and responds to what actually happened.
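// Decide from the incoming history alone; no per-conversation state.
const messages = req.body.messages ?? [];
const lastMessage = messages[messages.length - 1];
if (lastMessage?.role === 'tool') {
  // The agent just ran a tool, so end the turn with text.
  return res.json(generateTextResponse());
}
// Otherwise ask for a tool call.
return res.json(generateToolInvocation());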
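The two helpers in that snippet aren't shown. As a rough sketch, assuming the mock imitates OpenAI's Chat Completions response format, they might look something like the following. The tool name `web_search`, its arguments, the ids, the `mock-model` label, and the `wrapChoice` and `chooseResponse` functions are all placeholders for illustration.

```javascript
// Sketch of the response builders, assuming the Chat Completions response shape.
// Tool name, arguments, ids, and model label are placeholders.

function wrapChoice(message, finishReason) {
  return {
    id: 'chatcmpl-mock',
    object: 'chat.completion',
    created: Math.floor(Date.now() / 1000),
    model: 'mock-model',
    choices: [{ index: 0, message, finish_reason: finishReason }],
    usage: { prompt_tokens: 0, completion_tokens: 0, total_tokens: 0 },
  };
}

function generateTextResponse() {
  return wrapChoice({ role: 'assistant', content: 'Canned final answer.' }, 'stop');
}

function generateToolInvocation() {
  const toolCall = {
    id: 'call_mock_1',
    type: 'function',
    // Chat Completions carries arguments as a JSON-encoded string.
    function: { name: 'web_search', arguments: JSON.stringify({ query: 'placeholder' }) },
  };
  return wrapChoice({ role: 'assistant', content: null, tool_calls: [toolCall] }, 'tool_calls');
}

// Same decision as the handler above, as a pure function of the request body.
function chooseResponse(body) {
  const messages = body?.messages ?? [];
  const lastMessage = messages[messages.length - 1];
  return lastMessage?.role === 'tool' ? generateTextResponse() : generateToolInvocation();
}
```

Keeping the decision in a pure function makes it easy to check without starting a server. If the agent requests streaming responses, the mock would need to emit chunks instead, which is outside this sketch.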
With that in place, the first run, all dependencies mocked and ramped to three times production load, was uneventful.
Heap stayed at 47MB throughout. Event loop delay was 4.7ms at production load and crept up to 51ms at three times that. That sounds bad until you remember the response path is measured in seconds, not milliseconds. Nothing leaked, nothing saturated. I filed it away and moved on to the real test.
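For anyone wanting to collect the same two numbers, one option on a Node.js runtime is the built-in `perf_hooks` event loop delay histogram plus `process.memoryUsage()`. This is a sketch of that approach, not necessarily how the figures above were gathered; `startRuntimeSampler` and `readSample` are hypothetical names, and the sampling interval is arbitrary.

```javascript
const { monitorEventLoopDelay } = require('perf_hooks');

// Convert a delay histogram (values in nanoseconds) and current heap usage
// into a small, log-friendly sample.
function readSample(histogram) {
  return {
    heapUsedMb: process.memoryUsage().heapUsed / 1024 / 1024,
    loopDelayMeanMs: histogram.mean / 1e6,
    loopDelayP99Ms: histogram.percentile(99) / 1e6,
  };
}

// Logs one sample per interval and resets the histogram so each sample
// covers only its own window. Returns a function that stops sampling.
function startRuntimeSampler(intervalMs = 5000, log = console.log) {
  const histogram = monitorEventLoopDelay({ resolution: 10 });
  histogram.enable();
  const timer = setInterval(() => {
    log(readSample(histogram));
    histogram.reset();
  }, intervalMs);
  timer.unref(); // don't keep the process alive just for metrics
  return () => {
    clearInterval(timer);
    histogram.disable();
  };
}
```

Calling `const stop = startRuntimeSampler();` at server start and `stop()` at shutdown is enough to line the samples up against load-test stages afterwards.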
The third run was where it got interesting. With the hosting layer already confirmed clean in isolation, I brought in full real dependencies at production load. 52 iterations, zero errors, zero dropped SSE connections. The p95 response time was 102 seconds.
Same as single-user. I'd expected concurrency to show something: some degradation, some sign the load was registering. It hadn't moved at all.
The 102 seconds wasn't a contention effect. OpenAI synthesis was consuming 54 to 78 percent of every turn's duration regardless of concurrency. That's inherent generation time, not load.
The load finding is the rate-limit ceiling. A web research turn generates up to five upstream API calls: query analysis, up to three parallel synthesis passes, and a follow-up classifier. Working from calls per turn and typical session cadence (about one heavy synthesis every two to three minutes), I put the ceiling before hitting our deployment's RPM quota somewhere between 60 and 90 simultaneous sessions. The server is barely registering any of it.
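The arithmetic behind that range can be sketched as below. The deployment's actual RPM quota isn't given here, so the `150` is purely a placeholder, picked only because it reproduces the 60 to 90 range under the worst case of five calls per turn; `maxConcurrentSessions` is a hypothetical helper name.

```javascript
// Back-of-envelope ceiling: how many sessions fit under a requests-per-minute
// quota if each session makes `callsPerTurn` upstream calls once every
// `minutesBetweenTurns` minutes. Assumes calls are spread evenly over time.
function maxConcurrentSessions({ rpmQuota, callsPerTurn, minutesBetweenTurns }) {
  // Multiply before dividing to avoid floating-point drift before the floor.
  return Math.floor((rpmQuota * minutesBetweenTurns) / callsPerTurn);
}

const RPM_QUOTA = 150; // ASSUMPTION: placeholder, not the real deployment quota

console.log(maxConcurrentSessions({ rpmQuota: RPM_QUOTA, callsPerTurn: 5, minutesBetweenTurns: 2 }));
// → 60
console.log(maxConcurrentSessions({ rpmQuota: RPM_QUOTA, callsPerTurn: 5, minutesBetweenTurns: 3 }));
// → 90
```

The even-spread assumption is optimistic: the three synthesis passes fire in parallel, so bursts could hit the limit before the average does.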
None of this shows up in the dashboards. Event loop is green. Memory is flat. Every metric that belongs to me looks healthy. The thing that's binding is in a system I don't instrument.
That's the irony the mocked run sets up. I built a protocol-aware fake of OpenAI specifically so I could measure everything except OpenAI. The runtime was fine. The hosting was fine. Then I added OpenAI back and it was the only thing that mattered.
Databricks' guide and mine share the same premise: mock the LLM to separate infrastructure throughput from model latency. They're using it to find where the platform breaks. I found the platform doesn't break at any load that matters. The ceiling is in the upstream the mock removes.
A dumb mock skips all of this. Either it fails outright (infinite loops) or it passes for the wrong reason (responds instantly, making your infra look faster than it ever will with a real model behind it). Either way you never actually isolate anything, and you miss the only finding worth having.
I spent more time on the mock than on the test scripts. Not because it's hard to write (it isn't), but because I didn't know I needed it until the agent started looping.
Curious whether others load-testing agents hit the same wall, or found the ceiling somewhere else.