Lars Winstand

Posted on Jun 26 • Originally published at standardcompute.com

I thought I needed a better model for 10 agents, but I really needed a queue

#ai #agents #n8n #devops

If you’re running 10+ agents at once, the bottleneck usually isn’t model quality.

It’s shared execution capacity.

Org-level API limits. Browser/runtime contention. Chat-style subscriptions that look fine at 2 conversations and start stalling at 6-8.

The fix is usually boring: queueing, worker isolation, retries, and explicit concurrency control.

I keep seeing teams ask for the best model for agents when their setup starts failing in a very specific way:

one agent pauses mid-task
one thread keeps going while another goes silent
a Telegram topic looks dead until you send “?”
then it suddenly wakes up and continues like nothing happened

That does not look like a model-quality problem.

That looks like a scheduling problem.

The Reddit thread that explains the failure mode perfectly

I ran across a thread on r/openclaw that described this better than most polished architecture posts do:

https://reddit.com/r/openclaw/comments/1ufd864/how_to_run_10_agents_at_the_same_time_while/

The setup was very concrete:

10 topic threads
one per app
running through OpenClaw
inside a Telegram supergroup
on a 16 GB Hetzner VPS

And the symptom was painfully familiar:

As my number of simultaneous conversations increases, I've noticed that sometimes the agent just stops responding entirely in some topics. It won't continue until I send another message (even just a '?'), after which it suddenly picks the conversation back up.

That bug tells you a lot.

Most people see it and blame the model:

maybe Claude got flaky
maybe GPT-5 is overloaded
maybe Qwen would be better
maybe Llama would behave differently

My take: most of the time, that diagnosis is wrong.

The stall is the clue

What makes this interesting is that the visible failure looks like “the model stopped thinking.”

But usually the deeper problem is that too many things are sharing one bottleneck.

A commenter in that same thread said this:

If you're using Claude CLI (ie max sub), you're basically limited to ~6-8 concurrent agents working at the same time. More will stall each other/wait for others to finish.

I wouldn’t treat 6-8 as some universal law.

But I absolutely believe the pattern.

Chat subscriptions are built for humans opening a few conversations.

They are not execution systems.

Once you move into real parallelism, the question stops being:

what’s the best model for agents?

and becomes:

what exactly is sharing capacity with what?

That’s where most agent stacks fall apart.

What is actually being shared?

Usually it’s not one thing. It’s three.

1. Provider-side rate limits

OpenAI rate limits are enforced at the organization and project level, not per user or per conversation. Some model families also share limits.

That matters a lot more than people expect.

If Agent A is hammering GPT-5.4 and Agent B is quietly summarizing logs, those requests can still interfere with each other if they draw from the same org-level bucket.

From the outside, it looks random.

From the inside, it’s just shared quota.

A simple example:

# Agent 1 is doing heavy extraction
# Agent 2 is doing tiny summaries
# Both still hit the same org/project limits

If you don’t have backpressure, one noisy worker can make the rest of the system look flaky.
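One way to picture that backpressure is a single limiter that every agent has to pass through before it calls the provider. The sketch below is a minimal in-process token bucket; the class name, the capacity, and the refill rate are all assumptions for illustration, not anything a provider or OpenClaw ships. A real multi-process setup would need the bucket to live somewhere shared, such as Redis.

```javascript
// Hypothetical shared limiter: every agent calls tryAcquire() before a provider request.
class TokenBucket {
  constructor({ capacity, refillPerSecond, now = () => Date.now() }) {
    this.capacity = capacity;               // maximum burst size
    this.refillPerSecond = refillPerSecond; // sustained request rate
    this.now = now;                         // injectable clock, handy for tests
    this.tokens = capacity;
    this.lastRefill = now();
  }

  // Add the tokens earned since the last call, never exceeding capacity.
  refill() {
    const current = this.now();
    const elapsedSeconds = (current - this.lastRefill) / 1000;
    this.tokens = Math.min(
      this.capacity,
      this.tokens + elapsedSeconds * this.refillPerSecond
    );
    this.lastRefill = current;
  }

  // Spend tokens and return true if the request may go out now; otherwise return false.
  tryAcquire(cost = 1) {
    this.refill();
    if (this.tokens >= cost) {
      this.tokens -= cost;
      return true;
    }
    return false;
  }
}
```

When tryAcquire() returns false, the caller re-queues or delays the job instead of firing the request anyway. That is the whole point: the noisy agent waits, and the quiet one keeps working.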

2. Local runtime contention

The Reddit replies also pointed at the other obvious culprit: the machine itself.

If you’re running OpenClaw with shared Chromium state, long transcripts, tool calls, and multiple active sessions on a 16 GB VPS, you do not need a provider outage to get stalls.

You just need enough:

memory pressure
event loop contention
I/O wait
browser state bloat
session overhead

One commenter asked the right question:

Is every topic a new session? I find the only reason my agents stop is because memory overhead has been reach. Especially on VPS.

That’s not glamorous, but it’s probably closer to the truth than “the model got confused.”

3. Chat-session architecture

This is the sneaky one.

A chat subscription feels like an execution environment because you can open lots of threads.

But visible threads are not the same thing as:

a queue
worker pools
retry policies
dead-letter handling
admission control
explicit concurrency caps

At 2 conversations, the difference barely matters.

At 12, it matters a lot.

Why n8n hits the same wall

This is not just an OpenClaw problem.

It’s an architecture problem.

n8n says it pretty clearly in the docs: if you allow too many concurrent executions in regular mode, you can thrash the event loop and make the instance unresponsive.

That sentence is refreshingly unsexy, and also exactly correct.

What happens in practice:

one workflow gets busy
another webhook comes in
then another
CPU and memory get noisy
the event loop gets hammered
suddenly “AI is unreliable”

No.

Your scheduler is unreliable.

n8n’s answer was not “switch to a smarter model.”

It was concurrency control and queue mode.

For example:

export N8N_CONCURRENCY_PRODUCTION_LIMIT=20

That one env var tells you a lot: production executions beyond the limit wait their turn instead of all starting at once.

Mature workflow systems assume there must be an admission gate.

Because if everything can run immediately, eventually nothing runs well.

The architectural shift: chat threads vs queued work

The clean break is the queue.

In n8n queue mode:

the main instance accepts triggers and webhooks
Redis stores pending executions
worker instances pull jobs when capacity is available

That is a completely different model from:

I opened 10 Telegram conversations and hoped OpenClaw, Chromium, Claude, and my VPS would sort it out.

The config makes the difference obvious:

export EXECUTIONS_MODE=queue

Then run workers with explicit concurrency:

n8n worker --concurrency=10

That’s boring infrastructure.

Which is exactly why it works.

Quick comparison

Approach	What happens under load
Chat subscription workflow	Shared interactive-session limits; weak control over queueing and retries. Simple for 1-2 conversations, but starts stalling under parallel agent load.
Direct API workflow	Explicit RPM/TPM and org/project limits. You can add queues, workers, retries, and backpressure, but token costs rise with usage.
n8n regular mode vs queue mode	Regular mode can become unresponsive under high concurrency. Queue mode separates intake from execution using Redis and workers.

That middle row is where a lot of teams have their “oh” moment.

They think they’re shopping for intelligence.

They’re actually shopping for throughput discipline.

The annoying part: the API architecture is better, but the bill can get ugly

This is where things get real.

Per-token pricing feels fine when you’re testing one agent in a notebook.

It feels very different once you fix concurrency and your workers are actually running all day.

That’s the trap.

You finally build the system correctly, and now your token bill starts acting like a second outage.

So the decision stops being just:

which model is smartest?

and becomes:

what gives me quality?
what gives me stable throughput?
what gives me predictable cost?

That’s why this category is getting interesting.

A lot of teams want API-style control:

OpenAI-compatible endpoints
real queues and workers
retries and backpressure
existing SDK support

But they do not want per-token anxiety every time they add more automations.

That’s exactly the gap Standard Compute is aiming at.

It gives you an OpenAI-compatible API for agent and automation workloads, but with flat monthly pricing instead of metered token billing.

So you can build the architecture you actually want:

API-based execution
explicit concurrency control
long-running automations
predictable cost

That matters a lot if you’re running agents in n8n, Make, Zapier, OpenClaw, or custom worker systems and you’re tired of choosing between flaky chat subscriptions and scary token bills.

More here:

https://standardcompute.com

What I would do for 10+ agents

If you need real concurrency, here’s the setup I’d reach for.

1. Separate intake from execution

Do not let incoming work immediately compete with currently running work.

Use a queue.

Examples:

n8n queue mode
BullMQ
Celery
SQS
RabbitMQ

Example with BullMQ:

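import { Queue, Worker } from 'bullmq';

// One Redis connection config shared by the producer and the worker.
const connection = { host: 'localhost', port: 6379 };

// Intake side: enqueueing is cheap and returns once Redis has the job.
const queue = new Queue('agents', { connection });

await queue.add('run-agent', {
  agentId: 'agent-7',
  task: 'summarize support tickets'
});

// Execution side: at most 8 jobs run at once in this worker process.
const worker = new Worker(
  'agents',
  async job => {
    // call model API here
    console.log(`running ${job.data.agentId}`);
  },
  { concurrency: 8, connection }
);

// Surface failures instead of letting jobs go quiet.
worker.on('failed', (job, err) => {
  console.error(`job ${job?.id} failed: ${err.message}`);
});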

The point is simple: intake should be cheap, execution should be bounded.

2. Put hard caps on concurrency

Not vibes. Numbers.

If your box can safely run 8 workers, set 8.

If your provider quota supports 20 active requests with headroom, cap at 20.

Examples:

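# n8n: cap concurrent production executions
export N8N_CONCURRENCY_PRODUCTION_LIMIT=20

# n8n queue mode: cap concurrent jobs per worker
n8n worker --concurrency=10

// custom worker code: one explicit constant, enforced in code
const MAX_PARALLEL_AGENTS = 8;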

The goal is not “maximum possible parallelism.”

The goal is stable throughput.
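If you are writing your own worker code rather than leaning on n8n or BullMQ, a constant only helps when something enforces it. Here is a minimal sketch of a promise-based limiter; createLimiter and runAgent are hypothetical names, and the cap of 8 is just the number from above, not a recommendation.

```javascript
const MAX_PARALLEL_AGENTS = 8;

// Returns a function that runs async tasks with at most `limit` in flight.
function createLimiter(limit) {
  let active = 0;
  const waiting = []; // FIFO list of tasks that have not started yet

  const next = () => {
    if (active >= limit || waiting.length === 0) return;
    active++;
    const { task, resolve, reject } = waiting.shift();
    Promise.resolve()
      .then(task)            // start the task; sync throws become rejections
      .then(resolve, reject) // hand the outcome back to the caller
      .finally(() => {
        active--;            // free the slot, then start the next waiter
        next();
      });
  };

  return function run(task) {
    return new Promise((resolve, reject) => {
      waiting.push({ task, resolve, reject });
      next();
    });
  };
}

// Every agent call goes through the same gate.
const runAgent = createLimiter(MAX_PARALLEL_AGENTS);
```

Each agent call is then wrapped as runAgent(() => callModel(task)), where callModel stands in for whatever function actually hits your provider. The 9th task simply waits until one of the first 8 finishes.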

3. Isolate heavy sessions

Not every agent belongs in the same lane.

A scraping agent opening 40 tabs in Chromium should not share execution capacity with a tiny summarizer that just needs a few API calls.

Split workloads by resource profile:

browser-heavy
memory-heavy
long-context
lightweight text transforms

That alone fixes a lot of “random” instability.
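As a rough sketch, the split can be as simple as a routing function that picks a lane per task, with each lane getting its own queue and its own cap. The lane names, task fields, and thresholds below are all assumptions made up for illustration; tune them to your own box and quotas.

```javascript
// Hypothetical lanes: each one maps to its own queue and its own concurrency cap.
const LANES = {
  browser:     { queueName: 'agents-browser',      concurrency: 2 },
  memory:      { queueName: 'agents-memory',       concurrency: 2 },
  longContext: { queueName: 'agents-long-context', concurrency: 3 },
  light:       { queueName: 'agents-light',        concurrency: 8 },
};

// Pick a lane from a task's declared resource profile.
// usesBrowser, estimatedMemoryMb, and estimatedTokens are illustrative fields.
function laneFor(task) {
  if (task.usesBrowser) return LANES.browser;
  if (task.estimatedMemoryMb > 1024) return LANES.memory;
  if (task.estimatedTokens > 50000) return LANES.longContext;
  return LANES.light;
}
```

With this in place, the 40-tab scraper can saturate its own lane without touching the slots the summarizer depends on.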

4. Treat retries as a first-class feature

If an agent only resumes after you send “?”, you already have a retry system.

It’s just a bad one, because the retry operator is a human.

Build explicit handling for:

timeouts
retries
stuck executions
dead-letter queues
idempotency

A rough pseudo-pattern:

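// Retry an async task with linear backoff (2 s, 4 s, ...).
// maxRetries is the total number of attempts, including the first.
async function runWithRetry(task, maxRetries = 3) {
  for (let attempt = 1; attempt <= maxRetries; attempt++) {
    try {
      return await task();
    } catch (err) {
      // Out of attempts: surface the error to the caller.
      if (attempt === maxRetries) throw err;
      // Wait longer after each failure before trying again.
      await new Promise(r => setTimeout(r, attempt * 2000));
    }
  }
}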

That is infinitely better than hoping a Telegram poke wakes the agent back up.
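Retries alone do not cover the stalled-agent case, because a stuck call never throws. A slightly fuller sketch adds a timeout and a dead-letter list. Everything here is illustrative: withTimeout, runJob, and the in-memory deadLetters array are hypothetical names, and a real system would persist dead letters somewhere durable. Note that a timed-out call is abandoned, not cancelled, which is exactly why idempotency is on the list.

```javascript
// Reject if `promise` has not settled within `ms` milliseconds.
function withTimeout(promise, ms) {
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(() => reject(new Error(`timed out after ${ms} ms`)), ms);
  });
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}

// In-memory stand-in for a dead-letter queue (an assumption for this sketch).
const deadLetters = [];

// Run handler(job) with a timeout per attempt and linear backoff between attempts.
// Returns the handler's result, or null once the job has been dead-lettered.
async function runJob(job, handler, options = {}) {
  const { maxAttempts = 3, timeoutMs = 30000, baseDelayMs = 2000 } = options;
  let lastError;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await withTimeout(handler(job), timeoutMs);
    } catch (err) {
      lastError = err;
      if (attempt < maxAttempts) {
        await new Promise(r => setTimeout(r, attempt * baseDelayMs));
      }
    }
  }
  // Out of attempts: park the job for inspection instead of losing it.
  deadLetters.push({ job, error: String(lastError), failedAt: new Date().toISOString() });
  return null;
}
```

The difference from the Telegram poke is that the system notices the silence on its own, tries again, and leaves a record when it gives up.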

5. Measure the real bottleneck

If you don’t know what saturates first, you’ll keep blaming the model.

At minimum, track:

queue depth
worker utilization
provider RPM/TPM errors
memory usage
CPU load
transcript length
browser/session count
retry rate
stuck job count

If queue depth is climbing while workers are pinned, that’s a worker-capacity problem.

If workers are idle but requests are failing, that’s probably provider-side limits.

If memory spikes correlate with browser-heavy tasks, that’s local contention.

This stuff is diagnosable if you instrument it.
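Those three rules are simple enough to write down as code. The sketch below is only that, a sketch: the metric field names and every threshold are assumptions picked for illustration, so treat the numbers as placeholders until your own dashboards tell you better ones.

```javascript
// Classify the likely bottleneck from a snapshot of metrics.
// All field names and thresholds here are illustrative assumptions.
function diagnose(m) {
  const findings = [];

  // Queue growing while workers are pinned: not enough worker capacity.
  if (m.queueDepthTrend > 0 && m.workerUtilization >= 0.9) {
    findings.push('worker capacity');
  }

  // Workers mostly idle but requests failing: probably provider-side limits.
  if (m.workerUtilization < 0.5 && m.providerErrorRate > 0.05) {
    findings.push('provider-side limits');
  }

  // Memory near the ceiling while browser sessions are open: local contention.
  if (m.memoryUsedFraction >= 0.85 && m.browserSessions > 0) {
    findings.push('local contention');
  }

  return findings.length > 0 ? findings : ['no obvious bottleneck'];
}
```

Feeding it a snapshot such as { queueDepthTrend: 5, workerUtilization: 0.97, providerErrorRate: 0, memoryUsedFraction: 0.6, browserSessions: 1 } points at worker capacity, which is a much more useful answer than “Claude got flaky.”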

Do subscriptions still have a place?

Definitely.

If you’re one person running one or two long-lived chats, Claude Max or ChatGPT can be great.

That’s real value.

But the breakpoint arrives earlier than people think.

Once you need:

parallelism
retries
isolation
predictable throughput
cost control

…you’re no longer doing chat.

You’re doing distributed work.

Even if the UI still happens to be Telegram, Discord, or a browser tab.

And distributed work punishes wishful thinking.

The uncomfortable truth: a worse model can win

This is the part people hate hearing.

A slightly worse model running behind a clean queue with stable workers will often beat a better model trapped inside a shared, stall-prone chat setup.

Not on benchmark screenshots.

On actual throughput.

On actual reliability.

On actual unattended automation.

That’s the real lesson from the OpenClaw thread.

The user did not primarily have a “which model is smartest?” problem.

They had a concurrency architecture problem wearing a model-shaped mask.

Once you see that, a lot of agent weirdness gets easier to debug.

If your 10th agent makes your 3rd one freeze, stop shopping for magic prompts.

Stop rotating between Claude, GPT-5, Qwen, and Llama hoping one of them will rescue a blocked queue.

Build the queue first.

Then pick the model.

And if you want API-style control without token-billing anxiety, that’s the whole pitch behind Standard Compute: OpenAI-compatible API access for agent workloads, flat monthly pricing, and no need to babysit every token while your automations run.
