Lars Winstand

Originally published at standardcompute.com

I thought we needed another agent framework — turns out we needed a job_id and a boring config folder

A lot of agent engineering advice still sounds like framework shopping.

Should you use OpenClaw or n8n?
Is LiteLLM enough?
Do you need LangGraph, an MCP server, or a custom Rust runtime with a dashboard that looks like Mission Control?

After reading a bunch of real production threads, I think most teams are solving the wrong problem.

They think they need a better framework.

What they actually need is:

  • a shared config layer for prompts, tools, and policies
  • explicit model routing
  • run-level tracing with a stable job_id
  • one place to see what happened across retries, tool calls, fallbacks, and provider swaps

That’s the boring part of agent systems.

It’s also the part that keeps long-running automations from turning into folklore.

The pattern I kept seeing

I kept running into Reddit posts from people who said they wanted an agent framework comparison.

But when you read closely, they were describing operations problems.

One thread on r/openclaw was from someone running OpenClaw in production on a Mac Mini M4 with 16GB RAM, using GPT-5.5 via OAuth, Telegram as the interface, memory, workflow routing, and a side-by-side sandbox for testing a second framework.

The key line was this:

"Building a portable 'brain' layer (prompts, memory, workflows, routing rules) that can eventually work across multiple frameworks"

That is not a framework problem.

That is the adult version of agent engineering.

Another thread described an API gateway with a Rust correlator where every run gets a job_id and that ID follows the run across LLM calls and tool invocations.

That’s the layer most teams are missing.

Not another runtime.

A durable operational spine.

What actually breaks first in long-running agents?

Not intelligence.

Operations.

The first failures are usually boring:

  • runaway loops
  • fallback confusion
  • stale memory
  • duplicated retry logic
  • expensive models handling cheap tasks
  • no way to explain one bad run end-to-end

One OpenClaw user said they burned through tokens their first week because the agent looped on heartbeat checks and cron pings.

That should sound familiar to anyone who has let an automation run overnight.

The fix was not a better prompt.

The fix was routing policy.

They moved routine work to cheaper models and kept stronger reasoning models for the hard parts.

That’s the move.

Not “make the agent smarter.”

Make the default path cheaper and easier to debug.

Cheap defaults beat clever prompts

If your agent is doing background work like this:

  • heartbeat checks
  • cron pings
  • email triage
  • status polling
  • repetitive browser steps
  • simple classification

...then sending every step to Claude Opus or GPT-5 is just expensive laziness.

Use the expensive model when the run has earned it.

A simple routing policy gets you further than another week of prompt tuning:

TASK_TO_MODEL = {
    "heartbeat_check": "fast-cheap",
    "cron_ping": "fast-cheap",
    "email_triage": "fast-cheap",
    "status_poll": "fast-cheap",
    "classification": "mid-tier",
    "browser_exception": "strong-reasoning",
    "complex_reasoning": "strong-reasoning",
}


def pick_model(task_name: str) -> str:
    """Return the model tier for a task; unclassified tasks fall back to mid-tier."""
    return TASK_TO_MODEL.get(task_name, "mid-tier")

If you’re running agents in n8n, Make, Zapier, OpenClaw, or custom workers, this matters a lot more than people admit.

Most runaway cost comes from boring background work nobody classified.
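One way to surface that unclassified work is to count every task name that falls through to the default route. This is a minimal sketch, not a specific framework's API: `KNOWN_ROUTES`, `pick_model_tracked`, and the task names `inbox_sweep` and `pdf_summary` are hypothetical, and the table is a shortened stand-in for the routing table above.

```python
from collections import Counter

# Hypothetical routing table, a shortened stand-in for TASK_TO_MODEL.
KNOWN_ROUTES = {
    "heartbeat_check": "fast-cheap",
    "cron_ping": "fast-cheap",
    "complex_reasoning": "strong-reasoning",
}
DEFAULT_TIER = "mid-tier"

# Counts how often each unclassified task name falls through to the default.
unclassified = Counter()


def pick_model_tracked(task_name: str) -> str:
    """Return the tier for a task and record any task that has no explicit route."""
    if task_name not in KNOWN_ROUTES:
        unclassified[task_name] += 1
        return DEFAULT_TIER
    return KNOWN_ROUTES[task_name]


for task in ["cron_ping", "inbox_sweep", "inbox_sweep", "pdf_summary"]:
    pick_model_tracked(task)

# Most frequent unclassified tasks first: these are the routes to add next.
print(unclassified.most_common())  # [('inbox_sweep', 2), ('pdf_summary', 1)]
```

Reviewing that counter once a week is usually enough to decide which background tasks deserve an explicit cheap route.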

The one thing I’d add before adopting another framework

Before you migrate anything, add a job_id.

Not request IDs, which identify a single API call.

Run IDs, which follow one automation from its first step to its last.

A single long-running automation can touch:

  • GPT-5.4
  • Claude Opus 4.6
  • Grok 4.20
  • browser tools
  • webhooks
  • approval steps
  • retries
  • queues

If your observability stops at request logs, you don’t really have observability.

You have receipts.

What you need is a story for one run.

Here’s the minimum useful pattern:

import uuid


def start_job() -> str:
    return f"job_{uuid.uuid4().hex}"


job_id = start_job()

headers = {
    "x-job-id": job_id,
    "x-agent-name": "support-triage",
}

# pass these headers into every LLM request, tool call, and webhook

Then aggregate by job_id:

  • model used at each step
  • latency
  • retries
  • tool calls
  • fallbacks
  • token usage
  • cost
  • human interventions
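The rollup over those fields can be sketched in a few lines. This assumes each LLM call or tool call is logged as one event carrying the job_id; the `StepEvent` schema and the sample numbers are hypothetical, and human interventions are left out to keep the example short.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass
class StepEvent:
    """One LLM call or tool call, tagged with its run (hypothetical schema)."""
    job_id: str
    step: str
    model: Optional[str]  # None for pure tool calls
    latency_ms: int
    tokens: int
    cost_usd: float
    is_retry: bool = False
    is_fallback: bool = False


def summarize_by_job(events):
    """Roll step-level events up into one summary dict per job_id."""
    summaries = defaultdict(lambda: {
        "steps": 0, "models": [], "latency_ms": 0, "tokens": 0,
        "cost_usd": 0.0, "retries": 0, "fallbacks": 0,
    })
    for e in events:
        s = summaries[e.job_id]
        s["steps"] += 1
        s["latency_ms"] += e.latency_ms
        s["tokens"] += e.tokens
        s["cost_usd"] += e.cost_usd
        s["retries"] += int(e.is_retry)
        s["fallbacks"] += int(e.is_fallback)
        if e.model is not None:
            s["models"].append(e.model)
    return dict(summaries)


events = [
    StepEvent("job_123", "triage", "fast-cheap", 400, 300, 0.001),
    StepEvent("job_123", "browser_exception", "strong-reasoning", 2100, 1800, 0.04),
    StepEvent("job_123", "browser_exception", "mid-tier", 900, 1200, 0.01,
              is_retry=True, is_fallback=True),
    StepEvent("job_123", "send_webhook", None, 120, 0, 0.0),
    StepEvent("job_456", "status_poll", "fast-cheap", 350, 200, 0.001),
]

for job_id, summary in summarize_by_job(events).items():
    print(job_id, summary)
```

In a real system the events would come from a log store or a gateway rather than a list in memory, but the grouping key stays the same.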

Once you do that, incident review gets much easier.

Instead of asking:

Why is the dashboard weird?

You can ask:

What happened in job_123?

That’s a much better question.

The repo shape tells you whether a team gets agent ops

The healthiest setups I’ve seen all converge on the same basic shape.

Keep the durable stuff separate from the replaceable stuff.

agents/
  openclaw-prod/
    .env
    workflows/
    runtime/
  sandbox-framework/
    .env
    workflows/
    runtime/
shared-brain/
  prompts/
  tools/
  policies/
  memory-schema.json
  routing.yaml

That layout says:

  • prompts are portable
  • tool contracts are portable
  • policies are portable
  • memory schema is portable
  • runtimes are disposable

That’s what you want.

Because OpenClaw might change.
Your n8n flow might become a Python worker.
Your memory layer might move to a Cloudflare Worker exposed over MCP.
Your provider mix might change next month.

If your prompts, policies, and memory schema are trapped inside one framework’s opinionated format, every migration becomes painful for no good reason.

A practical routing config beats framework magic

I’d rather have a plain YAML file I can inspect than hidden routing logic buried in a framework abstraction.

For example:

default_model: gpt-5.4-mini
routes:
  heartbeat_check: gpt-5.4-mini
  cron_ping: gpt-5.4-mini
  email_triage: gpt-5.4-mini
  browser_automation: claude-opus-4.6
  research_synthesis: gpt-5.4
  fallback_reasoning: grok-4.20
budgets:
  max_cost_per_job_usd: 0.75
  max_llm_calls_per_job: 40
fallbacks:
  - from: claude-opus-4.6
    to: gpt-5.4
  - from: gpt-5.4
    to: grok-4.20

Now your routing policy is visible.

You can diff it.
You can review it in PRs.
You can compare behavior across frameworks.

That is a lot more useful than another demo of an autonomous agent planning vacation itineraries.
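A runtime could consume a config like that with very little code. The sketch below is one possible reading, not a reference implementation: the `ROUTING` dict stands in for the parsed YAML file (inlined so the example has no third-party dependency), and `model_chain` is a hypothetical helper name.

```python
# Assumption: this dict is what routing.yaml looks like after parsing.
# It is inlined, and shortened, to keep the sketch self-contained.
ROUTING = {
    "default_model": "gpt-5.4-mini",
    "routes": {
        "heartbeat_check": "gpt-5.4-mini",
        "browser_automation": "claude-opus-4.6",
        "research_synthesis": "gpt-5.4",
    },
    "fallbacks": [
        {"from": "claude-opus-4.6", "to": "gpt-5.4"},
        {"from": "gpt-5.4", "to": "grok-4.20"},
    ],
}


def model_chain(task_name: str, config: dict) -> list:
    """Return the primary model for a task followed by its fallbacks, in order."""
    fallback_of = {f["from"]: f["to"] for f in config["fallbacks"]}
    model = config["routes"].get(task_name, config["default_model"])
    chain = [model]
    # Follow the fallback links; stop on a repeat so a cyclic config cannot loop forever.
    while model in fallback_of and fallback_of[model] not in chain:
        model = fallback_of[model]
        chain.append(model)
    return chain


print(model_chain("browser_automation", ROUTING))
# ['claude-opus-4.6', 'gpt-5.4', 'grok-4.20']
print(model_chain("heartbeat_check", ROUTING))
# ['gpt-5.4-mini']
```

Because the chain is computed from data, the same function can run inside any runtime that can read the file, which is what makes the policy comparable across frameworks.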

Framework choice still matters, just less than people think

To be fair: framework choice is not fake.

It matters if you care about:

  • built-in memory models
  • local model support for Qwen or Llama
  • UI ergonomics
  • tool ecosystem
  • workflow authoring style
  • MCP support

But once agents become operationally important, framework choice stops being the center of gravity.

The real questions become:

  • Can I move prompts and policies without rewriting everything?
  • Can I compare Claude, GPT-5, and Grok on the same job type?
  • Can I see cost, latency, retries, and tool calls in one run view?
  • Can I stop silent fallback behavior before it burns budget?
  • Can I swap runtimes without losing my memory schema?

That’s agent ops.

It’s less glamorous than framework demos.

It’s also what survives six months of production use.
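The budget question in that list can be answered with a small guard that lives outside any framework. This is a rough sketch under stated assumptions: `JobBudget` and `BudgetExceeded` are hypothetical names, the limits reuse the example values of 0.75 USD and 40 LLM calls per job, and the per-call cost is invented for the demo.

```python
class BudgetExceeded(Exception):
    """Raised when a job goes past one of its configured limits."""


class JobBudget:
    """Tracks LLM calls, spend, and fallbacks for one job_id (hypothetical helper)."""

    def __init__(self, job_id: str, max_cost_usd: float, max_llm_calls: int):
        self.job_id = job_id
        self.max_cost_usd = max_cost_usd
        self.max_llm_calls = max_llm_calls
        self.calls = 0
        self.cost_usd = 0.0
        self.fallbacks = []  # (from_model, to_model) pairs, so no fallback is silent

    def record_call(self, cost_usd: float) -> None:
        """Count one LLM call and stop the job once a limit is crossed."""
        self.calls += 1
        self.cost_usd += cost_usd
        if self.calls > self.max_llm_calls:
            raise BudgetExceeded(f"{self.job_id}: more than {self.max_llm_calls} LLM calls")
        if self.cost_usd > self.max_cost_usd:
            raise BudgetExceeded(f"{self.job_id}: spent more than ${self.max_cost_usd:.2f}")

    def record_fallback(self, from_model: str, to_model: str) -> None:
        """Record a fallback explicitly so it shows up in the run's history."""
        self.fallbacks.append((from_model, to_model))


budget = JobBudget("job_123", max_cost_usd=0.75, max_llm_calls=40)
budget.record_fallback("claude-opus-4.6", "gpt-5.4")
try:
    # Simulate a runaway loop of cheap calls, like an agent stuck on heartbeat checks.
    for _ in range(100):
        budget.record_call(cost_usd=0.001)
except BudgetExceeded as err:
    print(f"stopped: {err}")
```

In this run the loop is cut off on the 41st call, long before the cost limit is reached, which is the kind of overnight loop a spend cap alone would catch much later.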

The tradeoff, plainly

| Approach | What happens over time |
| --- | --- |
| Framework-centric setup | Fast to start, but prompts, memory, and workflow logic get tightly coupled to one runtime |
| API gateway plus portable config | Better visibility, easier provider swaps, cleaner routing control, but requires discipline around schemas and metadata |
| Direct provider integrations in each workflow | Fine for small projects, but routing, observability, and fallback logic get duplicated everywhere |

If you are a solo builder with one short-lived agent, don’t build a giant control plane.

That’s overkill.

But if you have multiple workflows, long-running jobs, or agents running 24/7, the framework-first setup starts rotting from the edges.

Every workflow invents its own retry logic.
Every prompt drifts.
Every dashboard tells a different partial truth.

That’s usually when teams start looking for an OpenAI API alternative.

And honestly, what they often want is not just lower pricing.

They want one consistent execution layer where routing, budgets, and visibility are not reinvented inside every single agent.

Why this connects directly to cost

This is the part people miss.

Agent ops is cost control.

If you can’t see a run end-to-end, you can’t answer:

  • why one workflow got expensive
  • which model handled each step
  • whether fallback increased cost
  • whether retries multiplied spend
  • whether background tasks should be routed to cheaper models

That’s why flat, predictable pricing for AI compute is interesting for automation teams.

Not because pricing is a nice spreadsheet feature.

Because per-token billing punishes exactly the kind of experimentation and long-running execution that agent systems need.

If you’re building automations that run all day in n8n, Make, Zapier, OpenClaw, or custom workers, token anxiety becomes an architecture problem.

You start avoiding useful checks.
You under-instrument jobs.
You hesitate to add retries.
You route too much logic through one provider because cost modeling is annoying.

That’s backwards.

The infrastructure should make long-running jobs easier to operate, not harder to justify.

This is a big part of why services like Standard Compute are interesting to teams building agents and automations.

You keep the OpenAI-compatible API surface, but you get predictable monthly pricing, dynamic routing across models like GPT-5.4, Claude Opus 4.6, and Grok 4.20, and you stop treating every extra automation step like a billing event you need to babysit.

That changes how people build.

Especially once jobs run 24/7.

My practical recommendation

If your first instinct is to adopt another framework, stop for a minute.

Do these four things first:

1. Add a shared config layer

Put prompts, policies, tool definitions, and memory schema outside the runtime.

2. Add explicit routing rules

Don’t let model selection happen implicitly.

3. Add a job_id

Trace one run across every LLM call, tool call, retry, and fallback.

4. Add budget controls outside the framework

Make spend limits and fallback policy visible and editable without rewriting workflow code.

If you want a tiny starting point, even this is enough:

mkdir -p shared-brain/{prompts,tools,policies}
touch shared-brain/memory-schema.json
touch shared-brain/routing.yaml

Then wire your runtime to read from it.

That one decision will age better than most framework migrations.
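The wiring itself can start small. The sketch below assumes prompts live as `prompts/<name>.txt` and that `memory-schema.json` holds a JSON object; neither convention comes from a particular framework, and `load_brain` is a hypothetical helper. Loading `routing.yaml` is left out because it needs a YAML parser, which is not in the Python standard library.

```python
import json
import tempfile
from pathlib import Path


def load_brain(root: Path) -> dict:
    """Read prompts and the memory schema from a shared-brain folder."""
    prompts = {
        path.stem: path.read_text(encoding="utf-8")
        for path in sorted((root / "prompts").glob("*.txt"))
    }
    schema_path = root / "memory-schema.json"
    schema_text = schema_path.read_text(encoding="utf-8").strip() if schema_path.exists() else ""
    # `touch` creates an empty file, so treat empty as "no schema yet".
    memory_schema = json.loads(schema_text) if schema_text else {}
    return {"prompts": prompts, "memory_schema": memory_schema}


# Demo against a throwaway folder so the example does not touch a real repo.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "shared-brain"
    (root / "prompts").mkdir(parents=True)
    (root / "prompts" / "triage.txt").write_text("Classify the incoming email.", encoding="utf-8")
    (root / "memory-schema.json").touch()
    brain = load_brain(root)

print(brain)
# {'prompts': {'triage': 'Classify the incoming email.'}, 'memory_schema': {}}
```

Any runtime that can call a function like this at startup is reading from the shared folder instead of owning its own copy of the prompts.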

The boring layer is the real product

The cleanest mental model I’ve found is to separate three things:

1. The brain

Prompts, policies, workflow definitions, tool contracts, memory references.

2. The runtime

OpenClaw, n8n, a Python worker, a Rust gateway, a Cloudflare Worker, whatever runs the job today.

3. The ops layer

Routing, budgets, tracing, correlation, failover rules, reporting.

If those are fused together, every change becomes political.

Switching providers feels risky.
Testing a second framework feels expensive.
Debugging a bad run feels like archaeology.

If those layers are separate, your system gets boring in the best possible way.

And boring is exactly what you want when an agent has been running for eight hours, touched email, Telegram, browser automation, and background jobs, and now somebody wants to know why it made one weird decision at 3:14 AM.

My takeaway is simple.

Most teams do not need another agent framework.

They need a shared config folder, explicit routing rules, and a job_id that can explain what their agent did all night.
