Lars Winstand

Posted on • Originally published at standardcompute.com

I thought I needed a better model for 10 agents, but I really needed a queue

If you’re running 10+ agents at once, the bottleneck usually isn’t model quality.

It’s shared execution capacity.

Org-level API limits. Browser/runtime contention. Chat-style subscriptions that look fine at 2 conversations and start stalling at 6-8.

The fix is usually boring: queueing, worker isolation, retries, and explicit concurrency control.

I keep seeing teams ask for the best model for agents when their setup starts failing in a very specific way:

  • one agent pauses mid-task
  • one thread keeps going while another goes silent
  • a Telegram topic looks dead until you send ?
  • then it suddenly wakes up and continues like nothing happened

That does not look like a model-quality problem.

That looks like a scheduling problem.

The Reddit thread that explains the failure mode perfectly

I ran across a thread on r/openclaw that described this better than most polished architecture posts do:

https://reddit.com/r/openclaw/comments/1ufd864/how_to_run_10_agents_at_the_same_time_while/

The setup was very concrete:

  • 10 topic threads
  • one per app
  • running through OpenClaw
  • inside a Telegram supergroup
  • on a 16 GB Hetzner VPS

And the symptom was painfully familiar:

As my number of simultaneous conversations increases, I've noticed that sometimes the agent just stops responding entirely in some topics. It won't continue until I send another message (even just a '?'), after which it suddenly picks the conversation back up.

That bug tells you a lot.

Most people see it and blame the model:

  • maybe Claude got flaky
  • maybe GPT-5 is overloaded
  • maybe Qwen would be better
  • maybe Llama would behave differently

My take: most of the time, that diagnosis is wrong.

The stall is the clue

What makes this interesting is that the visible failure looks like “the model stopped thinking.”

But usually the deeper problem is that too many things are sharing one bottleneck.

A commenter in that same thread said this:

If you're using Claude CLI (ie max sub), you're basically limited to ~6-8 concurrent agents working at the same time. More will stall each other/wait for others to finish.

I wouldn’t treat 6-8 as some universal law.

But I absolutely believe the pattern.

Chat subscriptions are built for humans opening a few conversations.

They are not execution systems.

Once you move into real parallelism, the question stops being:

what’s the best model for agents?

and becomes:

what exactly is sharing capacity with what?

That’s where most agent stacks fall apart.

What is actually being shared?

Usually it’s not one thing. It’s three.

1. Provider-side rate limits

OpenAI rate limits are enforced at the organization and project level, not per chat window. Some model families also share limits.

That matters a lot more than people expect.

If Agent A is hammering GPT-5.4 and Agent B is quietly summarizing logs, those requests can still interfere with each other if they draw from the same org-level bucket.

From the outside, it looks random.

From the inside, it’s just shared quota.

A simple example:

# Agent 1 is doing heavy extraction
# Agent 2 is doing tiny summaries
# Both still hit the same org/project limits

If you don’t have backpressure, one noisy worker can make the rest of the system look flaky.
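Here is a minimal sketch of that shared-quota effect, assuming a simple token bucket stands in for an org/project limit. The `TokenBucket` class, the capacity of 5, and the refill rate are all made up for illustration; real provider limits are expressed in RPM/TPM and behave differently in detail. The clock is injected so the example runs deterministically.

```javascript
// Hypothetical stand-in for a shared org/project quota: one bucket, many agents.
class TokenBucket {
  constructor({ capacity, refillPerSecond, now = () => Date.now() }) {
    this.capacity = capacity;
    this.refillPerSecond = refillPerSecond;
    this.now = now;
    this.tokens = capacity;
    this.lastRefill = now();
  }

  // Top up tokens based on elapsed time, never exceeding capacity.
  refill() {
    const current = this.now();
    const elapsedSeconds = (current - this.lastRefill) / 1000;
    this.tokens = Math.min(
      this.capacity,
      this.tokens + elapsedSeconds * this.refillPerSecond
    );
    this.lastRefill = current;
  }

  // Returns true if the request may go out now, false if the caller should back off.
  tryAcquire(cost = 1) {
    this.refill();
    if (this.tokens < cost) return false;
    this.tokens -= cost;
    return true;
  }
}

// Fake clock (milliseconds) so the demo does not depend on wall time.
let fakeTime = 0;
const orgBucket = new TokenBucket({
  capacity: 5,
  refillPerSecond: 1,
  now: () => fakeTime
});

// The noisy agent burns the whole budget in one burst.
const noisyAgentResults = [1, 2, 3, 4, 5].map(() => orgBucket.tryAcquire());

// The quiet agent sends one tiny request and is refused anyway.
const quietAgentBlocked = orgBucket.tryAcquire();

// Two seconds later the bucket has refilled enough for the quiet agent.
fakeTime += 2000;
const quietAgentRetry = orgBucket.tryAcquire();

console.log({ noisyAgentResults, quietAgentBlocked, quietAgentRetry });
```

The quiet agent did nothing wrong. It just shares a bucket with a noisy neighbor, which is why the failure looks random from the outside.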

2. Local runtime contention

The Reddit replies also pointed at the other obvious culprit: the machine itself.

If you’re running OpenClaw with shared Chromium state, long transcripts, tool calls, and multiple active sessions on a 16 GB VPS, you do not need a provider outage to get stalls.

You just need enough:

  • memory pressure
  • event loop contention
  • I/O wait
  • browser state bloat
  • session overhead

One commenter asked the right question:

Is every topic a new session? I find the only reason my agents stop is because memory overhead has been reach. Especially on VPS.

That’s not glamorous, but it’s probably closer to the truth than “the model got confused.”
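A back-of-the-envelope budget makes the memory ceiling concrete. This is only a sketch: the 16 GB figure comes from the thread, but the reserved headroom and the per-session cost are placeholder assumptions, not measurements. Swap in numbers from your own box.

```javascript
// Rough session budget: how many sessions fit after reserving headroom.
function maxConcurrentSessions({ totalMb, reservedMb, perSessionMb }) {
  if (perSessionMb <= 0) {
    throw new RangeError('perSessionMb must be positive');
  }
  const usableMb = Math.max(0, totalMb - reservedMb);
  return Math.floor(usableMb / perSessionMb);
}

const sessionBudget = maxConcurrentSessions({
  totalMb: 16 * 1024,   // the 16 GB VPS from the thread
  reservedMb: 4 * 1024, // ASSUMED headroom for the OS and supporting services
  perSessionMb: 1536    // ASSUMED cost of one browser-backed session
});

console.log(`session budget: ${sessionBudget}`);
```

With these placeholder inputs the budget comes out to 8 sessions. The exact number matters less than the habit: if you plan to run 10 sessions, check that 10 actually fit.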

3. Chat-session architecture

This is the sneaky one.

A chat subscription feels like an execution environment because you can open lots of threads.

But visible threads are not the same thing as:

  • a queue
  • worker pools
  • retry policies
  • dead-letter handling
  • admission control
  • explicit concurrency caps

At 2 conversations, the difference barely matters.

At 12, it matters a lot.

Why n8n hits the same wall

This is not just an OpenClaw problem.

It’s an architecture problem.

n8n says it pretty clearly in the docs: if you allow too many concurrent executions in regular mode, you can thrash the event loop and make the instance unresponsive.

That sentence is refreshingly unsexy, and also exactly correct.

What happens in practice:

  1. one workflow gets busy
  2. another webhook comes in
  3. then another
  4. CPU and memory get noisy
  5. the event loop gets hammered
  6. suddenly “AI is unreliable”

No.

Your scheduler is unreliable.

n8n’s answer was not “switch to a smarter model.”

It was concurrency control and queue mode.

For example:

export N8N_CONCURRENCY_PRODUCTION_LIMIT=20

That one env var tells you a lot.

Mature workflow systems assume there must be an admission gate.

Because if everything can run immediately, eventually nothing runs well.
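If you are not on n8n, the same idea fits in a few lines. Below is a minimal in-process admission gate, written as a sketch rather than a production limiter: `createAdmissionGate` and `fakeAgent` are hypothetical names, and the 10 ms timer stands in for a model call.

```javascript
// At most `limit` tasks run at once; the rest wait in FIFO order.
function createAdmissionGate(limit) {
  let active = 0;
  const waiting = [];

  return async function run(task) {
    if (active >= limit) {
      // Gate is full: wait until a finishing task hands its slot over.
      await new Promise(resolve => waiting.push(resolve));
    } else {
      active++;
    }
    try {
      return await task();
    } finally {
      const next = waiting.shift();
      if (next) next(); // slot passes directly to the next waiter
      else active--;    // nobody waiting: free the slot
    }
  };
}

// Demo: 5 fake agents behind a gate of 2, tracking peak concurrency.
const gate = createAdmissionGate(2);
let running = 0;
let peakRunning = 0;

const fakeAgent = id => async () => {
  running++;
  peakRunning = Math.max(peakRunning, running);
  await new Promise(r => setTimeout(r, 10)); // stand-in for a model call
  running--;
  return id;
};

const gateDemo = Promise.all(
  [1, 2, 3, 4, 5].map(id => gate(fakeAgent(id)))
).then(results => {
  console.log({ results, peakRunning });
  return { results, peakRunning };
});
```

All five agents finish, but no more than two ever run at the same time. Handing the slot straight to the next waiter, instead of decrementing and re-checking, keeps a late arrival from sneaking past the cap.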

The architectural shift: chat threads vs queued work

The clean break is the queue.

In n8n queue mode:

  • the main instance accepts triggers and webhooks
  • Redis stores pending executions
  • worker instances pull jobs when capacity is available

That is a completely different model from:

I opened 10 Telegram conversations and hoped OpenClaw, Chromium, Claude, and my VPS would sort it out.

The config makes the difference obvious:

export EXECUTIONS_MODE=queue

Then run workers with explicit concurrency:

n8n worker --concurrency=10

That’s boring infrastructure.

Which is exactly why it works.

Quick comparison

| Approach | What happens under load |
| --- | --- |
| Chat subscription workflow | Shared interactive-session limits and weak control over queueing and retries. Simple for 1-2 conversations; starts stalling under parallel agent load. |
| Direct API workflow | Explicit RPM/TPM and org/project limits. You can add queues, workers, retries, and backpressure, but token costs rise with usage. |
| n8n regular mode vs queue mode | Regular mode can become unresponsive under high concurrency. Queue mode separates intake from execution using Redis and workers. |

That middle row is where a lot of teams have their “oh” moment.

They think they’re shopping for intelligence.

They’re actually shopping for throughput discipline.

The annoying part: the API architecture is better, but the bill can get ugly

This is where things get real.

Per-token pricing feels fine when you’re testing one agent in a notebook.

It feels very different once you fix concurrency and your workers are actually running all day.

That’s the trap.

You finally build the system correctly, and now your token bill starts acting like a second outage.

So the decision stops being just:

  • which model is smartest?

and becomes:

  • what gives me quality?
  • what gives me stable throughput?
  • what gives me predictable cost?

That’s why this category is getting interesting.

A lot of teams want API-style control:

  • OpenAI-compatible endpoints
  • real queues and workers
  • retries and backpressure
  • existing SDK support

But they do not want per-token anxiety every time they add more automations.

That’s exactly the gap Standard Compute is aiming at.

It gives you an OpenAI-compatible API for agent and automation workloads, but with flat monthly pricing instead of metered token billing.

So you can build the architecture you actually want:

  • API-based execution
  • explicit concurrency control
  • long-running automations
  • predictable cost

That matters a lot if you’re running agents in n8n, Make, Zapier, OpenClaw, or custom worker systems and you’re tired of choosing between flaky chat subscriptions and scary token bills.

More here:

https://standardcompute.com

What I would do for 10+ agents

If you need real concurrency, here’s the setup I’d reach for.

1. Separate intake from execution

Do not let incoming work immediately compete with currently running work.

Use a queue.

Examples:

  • n8n queue mode
  • BullMQ
  • Celery
  • SQS
  • RabbitMQ

Example with BullMQ:

import { Queue, Worker } from 'bullmq';

// Producer and worker both talk to the same Redis instance.
const connection = { host: 'localhost', port: 6379 };

// Intake side: adding a job only writes it to Redis.
const queue = new Queue('agents', { connection });

await queue.add('run-agent', {
  agentId: 'agent-7',
  task: 'summarize support tickets'
});

// Execution side: this worker processes at most 8 jobs in parallel.
const worker = new Worker(
  'agents',
  async job => {
    // call model API here
    console.log(`running ${job.data.agentId}`);
  },
  { concurrency: 8, connection }
);

The point is simple: intake should be cheap, execution should be bounded.

2. Put hard caps on concurrency

Not vibes. Numbers.

If your box can safely run 8 workers, set 8.

If your provider quota supports 20 active requests with headroom, cap at 20.

Examples:

export N8N_CONCURRENCY_PRODUCTION_LIMIT=20

n8n worker --concurrency=10

const MAX_PARALLEL_AGENTS = 8;

The goal is not “maximum possible parallelism.”

The goal is stable throughput.

3. Isolate heavy sessions

Not every agent belongs in the same lane.

A scraping agent opening 40 tabs in Chromium should not share execution capacity with a tiny summarizer that just needs a few API calls.

Split workloads by resource profile:

  • browser-heavy
  • memory-heavy
  • long-context
  • lightweight text transforms

That alone fixes a lot of “random” instability.
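One way to sketch that split is a small router that assigns each task to a lane with its own cap. Everything here is an assumption for illustration: the lane names, the caps, the task hint fields (`usesBrowser`, `estimatedMemoryMb`, `contextTokens`), and the thresholds.

```javascript
// Hypothetical lanes and caps; tune them against measured capacity.
const LANES = {
  browser: { concurrency: 2 },
  memory: { concurrency: 2 },
  longContext: { concurrency: 3 },
  light: { concurrency: 8 }
};

// Pick a lane from coarse task hints. Thresholds are illustrative only.
function laneFor(task) {
  if (task.usesBrowser) return 'browser';
  if ((task.estimatedMemoryMb ?? 0) > 1024) return 'memory';
  if ((task.contextTokens ?? 0) > 50000) return 'longContext';
  return 'light';
}

const sampleTasks = [
  { id: 'scraper', usesBrowser: true },
  { id: 'pdf-indexer', estimatedMemoryMb: 4096 },
  { id: 'repo-reviewer', contextTokens: 120000 },
  { id: 'summarizer', contextTokens: 2000 }
];

const laneAssignments = sampleTasks.map(task => {
  const lane = laneFor(task);
  return { id: task.id, lane, cap: LANES[lane].concurrency };
});

console.log(laneAssignments);
```

In a real setup each lane would map to its own queue and worker pool, so a pile-up of scrapers can only starve other scrapers.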

4. Treat retries as a first-class feature

If an agent only resumes after you send ?, you already have a retry system.

It’s just a bad one, because the retry operator is a human.

Build explicit handling for:

  • timeouts
  • retries
  • stuck executions
  • dead-letter queues
  • idempotency

A rough pseudo-pattern:

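// Run `task` (an async function) up to `maxRetries` times before giving up.
async function runWithRetry(task, maxRetries = 3) {
  for (let attempt = 1; attempt <= maxRetries; attempt++) {
    try {
      return await task();
    } catch (err) {
      // Out of attempts: surface the last error to the caller.
      if (attempt === maxRetries) throw err;
      // Linear backoff: 2 s after the first failure, 4 s after the second, and so on.
      await new Promise(r => setTimeout(r, attempt * 2000));
    }
  }
}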

That is infinitely better than hoping a Telegram poke wakes the agent back up.
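Retries only help with failures that actually throw. A stalled agent never throws, so it also needs a timeout, and a job that keeps failing needs somewhere to land. Here is a rough sketch of both, assuming an in-memory array stands in for a real dead-letter queue. `withTimeout`, `processJob`, and `deadLetters` are hypothetical names, and backoff between attempts is left out to keep it short.

```javascript
// Reject if `promise` does not settle within `ms` milliseconds.
function withTimeout(promise, ms) {
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(() => reject(new Error(`timed out after ${ms} ms`)), ms);
  });
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}

// In-memory stand-in for a dead-letter queue (an assumption for this sketch).
const deadLetters = [];

async function processJob(job, handler, { maxAttempts = 3, timeoutMs = 30000 } = {}) {
  let lastError;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await withTimeout(handler(job), timeoutMs);
    } catch (err) {
      lastError = err; // remember why this attempt failed, then try again
    }
  }
  // Out of attempts: park the job for inspection instead of losing it silently.
  deadLetters.push({
    job,
    error: lastError?.message ?? 'no attempts made',
    attempts: maxAttempts
  });
  return undefined;
}

// Demo: a handler that never settles, like an agent waiting for a "?".
const stalledAgent = () => new Promise(() => {});

const deadLetterDemo = processJob(
  { id: 'job-1' },
  stalledAgent,
  { maxAttempts: 2, timeoutMs: 20 }
).then(() => {
  console.log(deadLetters);
  return deadLetters;
});
```

One caveat: the timeout stops the wait, not the work. A timed-out handler may still be running in the background, which is why idempotency is on the list above.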

5. Measure the real bottleneck

If you don’t know what saturates first, you’ll keep blaming the model.

At minimum, track:

  • queue depth
  • worker utilization
  • provider RPM/TPM errors
  • memory usage
  • CPU load
  • transcript length
  • browser/session count
  • retry rate
  • stuck job count

If queue depth is climbing while workers are pinned, that’s a worker-capacity problem.

If workers are idle but requests are failing, that’s probably provider-side limits.

If memory spikes correlate with browser-heavy tasks, that’s local contention.

This stuff is diagnosable if you instrument it.
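Those three rules are simple enough to write down as code. The sketch below is a simplification: the metric field names and every threshold are assumptions, and a high-memory reading with open browser sessions is only a crude proxy for real correlation.

```javascript
// Map a metrics snapshot to likely bottlenecks. Thresholds are illustrative only.
function diagnose(metrics) {
  const findings = [];

  // Queue growing while workers are pinned: not enough worker capacity.
  if (metrics.queueDepthTrend > 0 && metrics.workerUtilization >= 0.9) {
    findings.push('worker capacity');
  }

  // Workers mostly idle but rate-limit errors showing up: provider-side limits.
  if (metrics.workerUtilization < 0.5 && metrics.rateLimitErrorsPerMin > 0) {
    findings.push('provider-side limits');
  }

  // Memory nearly full while browser sessions are open: local contention.
  if (metrics.memoryUsedRatio >= 0.85 && metrics.browserSessions > 0) {
    findings.push('local contention');
  }

  return findings.length > 0 ? findings : ['no obvious bottleneck'];
}

const pinnedWorkers = diagnose({
  queueDepthTrend: 12,
  workerUtilization: 0.97,
  rateLimitErrorsPerMin: 0,
  memoryUsedRatio: 0.6,
  browserSessions: 1
});

const throttledByProvider = diagnose({
  queueDepthTrend: 0,
  workerUtilization: 0.2,
  rateLimitErrorsPerMin: 14,
  memoryUsedRatio: 0.4,
  browserSessions: 0
});

console.log({ pinnedWorkers, throttledByProvider });
```

None of these checks mention the model. That is the point.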

Do subscriptions still have a place?

Definitely.

If you’re one person running one or two long-lived chats, Claude Max or ChatGPT can be great.

That’s real value.

But the breakpoint arrives earlier than people think.

Once you need:

  • parallelism
  • retries
  • isolation
  • predictable throughput
  • cost control

…you’re no longer doing chat.

You’re doing distributed work.

Even if the UI still happens to be Telegram, Discord, or a browser tab.

And distributed work punishes wishful thinking.

The uncomfortable truth: a worse model can win

This is the part people hate hearing.

A slightly worse model running behind a clean queue with stable workers will often beat a better model trapped inside a shared, stall-prone chat setup.

Not on benchmark screenshots.

On actual throughput.

On actual reliability.

On actual unattended automation.

That’s the real lesson from the OpenClaw thread.

The user did not primarily have a “which model is smartest?” problem.

They had a concurrency architecture problem wearing a model-shaped mask.

Once you see that, a lot of agent weirdness gets easier to debug.

If your 10th agent makes your 3rd one freeze, stop shopping for magic prompts.

Stop rotating between Claude, GPT-5, Qwen, and Llama hoping one of them will rescue a blocked queue.

Build the queue first.

Then pick the model.

And if you want API-style control without token-billing anxiety, that’s the whole pitch behind Standard Compute: OpenAI-compatible API access for agent workloads, flat monthly pricing, and no need to babysit every token while your automations run.
