
Lars Winstand

Originally published at standardcompute.com

I thought LLM tool calling would kill glue code and then my lights still wouldn’t turn on

LLM tool calling got dramatically better.

The glue code did not disappear.

That’s the part I think a lot of people are still underestimating.

MCP helps. OpenAI adding remote MCP support helps. Home Assistant exposing /api/mcp helps. But if you’ve actually tried to wire together OpenClaw, Home Assistant, Claude Code, n8n, Make, Zapier, or a custom agent stack, you already know the ugly truth:

the protocol is getting standardized faster than the operations are.

And operations are where projects still break.

I kept coming back to one Reddit post while researching this. One user said they spent 3.5 months, 1,300 hours, and nearly $700 before giving up on a fragile setup. Extreme? Sure. But the emotional arc is familiar: the demo works, the architecture diagram looks clean, and then your actual stack turns into auth bugs, proxy weirdness, expired tokens, file handoffs, and permission edge cases.

If you’re building agents that need to run all day without somebody watching token spend like a hawk, this stuff matters even more. The cost of the model is only one part of the pain. The other part is the maintenance tax from too many moving pieces.

The demo is clean. Production is glue.

The happy path always looks amazing.

A model sees a user request, picks a tool, calls it, and your cart updates or your lights dim or your support ticket gets filed.

Something like this:

from openai import OpenAI

client = OpenAI()

# One remote MCP server, one natural-language request. That's the whole demo.
response = client.responses.create(
    model="gpt-4.1",
    tools=[{
        "type": "mcp",
        "server_label": "shopify",
        "server_url": "https://example.com/api/mcp",
    }],
    input="Add the toner pads to my cart",
)

That API is genuinely good.

But now try the real version:

  • the MCP server is behind auth
  • the browser tool needs a session
  • the hosted environment needs a token
  • one tool returns a file another tool can’t access
  • Home Assistant supports HTTP but your client only supports stdio
  • your proxy strips a header
  • your schema changed last week and one agent still has the old assumption baked in

At that point, you’re not “adding AI.”

You’re doing agent ops.

That’s why I don’t think “tool calling” killed glue code. I think it moved glue code into more expensive places.

What MCP actually solved

I’m pro-MCP.

The core idea is right: define a shared protocol so tools don’t need bespoke integrations for every model client.

That matters.

MCP gives you:

  • a common message format
  • JSON-RPC 2.0 semantics
  • standard transports like stdio and Streamable HTTP
  • a better path for tool interoperability than the 2024 free-for-all

That’s real progress.

But it’s important to be precise about what got standardized.

MCP standardizes the protocol layer. It does not standardize deployment, auth, file transfer, secret distribution, proxy behavior, or permission design.

And those are exactly the parts that make or break real systems.

stdio is simple until you have to support it

A lot of MCP setups still run over stdio.

That usually means:

  • spawn a local subprocess
  • send JSON-RPC over stdin/stdout
  • inject credentials through environment variables
  • trust the local machine boundary
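
Concretely, the loop looks something like this. A minimal sketch, with a hypothetical my-mcp-server binary, and with the MCP initialize handshake skipped for brevity:

import json
import os
import subprocess

# Spawn the MCP server as a local subprocess; credentials ride along
# as environment variables, which is the part that gets sticky later.
proc = subprocess.Popen(
    ["my-mcp-server"],
    stdin=subprocess.PIPE,
    stdout=subprocess.PIPE,
    env={**os.environ, "HOME_ASSISTANT_TOKEN": "..."},
    text=True,
)

# JSON-RPC 2.0 over stdin/stdout, one message per line.
# A real client would send "initialize" first; this is the toy version.
request = {"jsonrpc": "2.0", "id": 1, "method": "tools/list"}
proc.stdin.write(json.dumps(request) + "\n")
proc.stdin.flush()

# Blocking read: if the subprocess hangs or buffers oddly, so does your agent.
response = json.loads(proc.stdout.readline())
print(response)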

In a toy example, that’s fine.

In a team environment, it gets messy fast.

You now own questions like:

  • how do secrets get onto the machine?
  • who can read them?
  • how does the subprocess restart?
  • what happens in CI?
  • what happens in Docker?
  • what happens on a hosted dev box?

Example of the kind of setup that starts innocent and gets sticky:

export HOME_ASSISTANT_TOKEN="..."
export MCP_SERVER_URL="http://localhost:8123/api/mcp"
python agent.py

Works on your laptop.

Then someone else runs it in a container.

Then someone tries it in GitHub Actions.

Then someone puts it behind a supervisor.

Then the subprocess hangs and stdout buffering becomes your new hobby.

That is not an MCP failure.

That is the cost of local process orchestration pretending to be architecture.

Remote MCP is better, but it just moves the pain

I generally think remote MCP over HTTP is the better default.

It’s easier to reason about in distributed systems. It fits hosted environments better. It gives you a cleaner auth story than “hope the environment variables are right.”

But remote MCP is not magic either.

Now your checklist becomes:

  • is the endpoint reachable?
  • is TLS configured correctly?
  • is auth discovery working?
  • are tokens scoped correctly?
  • are origins locked down?
  • what happens when the server is down?
  • who rotates credentials?
  • what’s your retry policy?

A lot of teams see this and think: “wow, MCP is still immature.”

I don’t think that’s the right read.

The right read is: distributed systems are still distributed systems even when the payload is AI-shaped.

Home Assistant is the perfect reality check

Home Assistant is one of the best examples here because it’s both impressive and brutally honest.

Home Assistant can now act as:

  • an MCP server
  • an MCP client

That means it can expose lights, switches, shopping lists, and automations to external MCP clients, while also connecting outward to other MCP services like memory or web search.

That’s cool.

It’s also where the hidden tax becomes obvious.

Home Assistant exposes an endpoint at:

/api/mcp

using Streamable HTTP.

Great.

But the docs also note that many clients still only support stdio, so you may need a bridge like mcp-proxy.

And on the other side, if a server only supports stdio, you may need a proxy to expose it over HTTP or SSE.

So the “standard” often still needs a translator.

That is the exact kind of thing that looks minor in docs and becomes permanent in production.
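
To make the tax concrete, here's a toy sketch of the job a bridge does: read JSON-RPC from a stdio-only client and forward it over HTTP. This is not mcp-proxy's real implementation, and it glosses over SSE streaming, sessions, and error handling:

import json
import sys

import requests  # assumption: any HTTP client would do here

MCP_URL = "http://localhost:8123/api/mcp"  # the HTTP side of the bridge
HEADERS = {"Authorization": "Bearer ..."}

# Toy stdio-to-HTTP translator: read JSON-RPC from a stdio-only client,
# forward each message to the remote MCP endpoint, echo the reply back.
for line in sys.stdin:
    message = json.loads(line)
    reply = requests.post(MCP_URL, json=message, headers=HEADERS, timeout=30)
    sys.stdout.write(json.dumps(reply.json()) + "\n")
    sys.stdout.flush()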

One proxy becomes:

  • one more thing to deploy
  • one more thing to observe
  • one more thing to secure
  • one more thing that can fail at 2 AM

The scariest sentence in agent engineering

While reading through Reddit threads, I found a line that perfectly captures where things go sideways.

A user wrote that they had their OpenClaw agent running, but after giving it their Home Assistant long-lived access token, they still couldn’t get it to reliably turn lights on and off or write basic automations.

That sentence contains the entire problem.

The model is usually not the bottleneck now.

GPT-5, Claude, Grok, Qwen, Llama — they can all infer that “turn off the kitchen lights” is an action.

The hard part is everything around the action:

  • is the tool exposed?
  • is the schema clear?
  • is the token valid?
  • is the permission model sane?
  • is the endpoint reachable?
  • is the action idempotent?
  • is the result observable?

The real work is not “can the model decide to call a tool?”

The real work is “can the surrounding system survive that decision repeatedly?”
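
One way to make "survive repeatedly" concrete: key every side-effecting call on its content, so retries collapse into a single effect. A sketch, where execute_tool is a hypothetical callable that performs the real action:

import hashlib
import json
import time

_seen_results = {}  # in production this would be a shared store, not a dict

def call_tool_safely(tool_name, args, execute_tool):
    """Retry a side-effecting tool call without double-firing it."""
    # Same tool + same args -> same key, so retries collapse into one effect.
    key = hashlib.sha256(
        json.dumps({"tool": tool_name, "args": args}, sort_keys=True).encode()
    ).hexdigest()
    if key in _seen_results:
        return _seen_results[key]

    for attempt in range(3):
        try:
            result = execute_tool(tool_name, args)
            _seen_results[key] = result
            return result
        except ConnectionError:
            time.sleep(2 ** attempt)  # crude backoff; tune for your stack
    raise RuntimeError(f"{tool_name} failed after 3 attempts")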

Permissions are still way too blunt

This gets more serious when tools affect the real world.

Turning on a light is one thing.

Opening a garage door, buying groceries, sending a Stripe refund, or triggering a Shopify action is different.

A lot of systems still have coarse permission boundaries. Home Assistant’s docs are refreshingly clear about where auth is practical versus ideal. Token-based access works, but it pushes you into lifecycle management:

  • issuing tokens
  • storing tokens
  • rotating tokens
  • revoking tokens
  • auditing usage

If your security model is “give the agent a long-lived token and hope for the best,” you do not have an agent architecture.

You have a future incident report.

The file handoff problem is still embarrassingly bad

Actions are only half the story.

Then files show up.

Examples:

  • a browser MCP server takes a screenshot
  • an agent generates a CSV
  • Claude Code writes a patch file
  • n8n needs to pass an artifact to GitHub or Discord
  • another agent needs to read that artifact later

This is where a lot of stacks quietly fall apart.

I found another Reddit thread where someone basically said: every option for agent file handoff felt bad. S3 presigned URLs worked, but setup was annoying. Git commits were a joke. Temp folders were fragile.

That matches what I’ve seen.

If you don’t standardize file handoff, every workflow invents its own weird mini-protocol.

You end up rebuilding the same decisions over and over:

  1. where files live
  2. how long they live
  3. who can fetch them
  4. how metadata is attached
  5. how downstream agents discover them

This is one reason so many “works in demo” automations become maintenance magnets.

If you’ve touched n8n, you’ve felt this already

Even outside MCP, the pattern is obvious.

Say you need a community node in n8n inside Docker.

You end up doing something like:

docker exec -it n8n sh
mkdir -p ~/.n8n/nodes
cd ~/.n8n/nodes
npm i n8n-nodes-somepackage

None of this is impossible.

That’s the problem.

The hardest glue problems are rarely impossible. They’re just annoying enough to become permanent. Every tiny workaround becomes part of your architecture whether you meant it to or not.

Which setup is actually easiest to live with?

My opinion: the best architecture is usually the one with the fewest moving connectors.

Not the one with the prettiest protocol diagram.

Here’s how I think about the tradeoffs:

  • MCP stdio: local subprocess management, env-based credentials, machine-specific setup, and debugging local boundaries
  • MCP Streamable HTTP / remote MCP: a better fit for distributed systems, but now you own HTTP auth, origin validation, token lifecycle, retries, and uptime
  • Direct platform-native tools: less connector glue and often simpler auth, but more vendor lock-in and weaker portability

This is why I think many teams are making the same mistake:

they keep adding tools when they should be reducing surface area.

Fewer tools.
Better schemas.
Cleaner auth.
One file handoff path.
One transport preference.

That wins more often than “support every possible connector.”

The failure mode nobody talks about: 3 months of success

The funniest and most dangerous failures are the ones that come after a long calm period.

I saw a Reddit story about an agent that handled grocery ordering fine for months and then ordered 2 kg of garlic instead of 2 heads.

That’s funny until the same class of bug hits:

  • medication
  • replacement parts
  • customer refunds
  • inventory orders
  • home automations with physical side effects

Long periods of apparent reliability hide brittle assumptions:

  • units are underspecified
  • schemas drift
  • sessions expire
  • file URLs expire sooner than expected
  • one model interprets a field differently than another
  • a proxy drops a header one client depended on

This is why “it worked in Claude Desktop” is not enough.

Move the same flow into OpenClaw, n8n, Make, Zapier, or a hosted agent runtime and you often discover that the connector stack was the actual product all along.

What I’d standardize first

If I were setting up an agent stack today across Home Assistant, OpenClaw, Claude Code, n8n, and a few remote MCP services, I would standardize the boring parts before adding one more capability.

1. Pick one preferred transport

Use remote MCP over Streamable HTTP by default.

Use stdio only when local execution is the point.

That means your architecture docs should say this explicitly, not just imply it.

Default: remote MCP over HTTP
Exception: stdio only for local-only tools

2. Pick one auth pattern

Use scoped OAuth-style flows where possible.

Treat long-lived access tokens like hazardous material.

Bad:

export PROD_ADMIN_TOKEN="forever-token-with-broad-access"

Better:

- short-lived access token
- refresh flow
- scoped permissions
- explicit revocation path
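
In code, "better" is roughly a token manager that refreshes early and asks for narrow scopes. A sketch, assuming a hypothetical client-credentials endpoint at auth.example.com:

import time

import requests  # assumption: plain HTTP client

TOKEN_URL = "https://auth.example.com/oauth/token"  # hypothetical endpoint

class TokenManager:
    """Holds a short-lived, scoped access token and refreshes it early."""

    def __init__(self, client_id, client_secret, scope):
        self._creds = {"client_id": client_id, "client_secret": client_secret}
        self._scope = scope  # e.g. "lights:write", never "admin"
        self._token = None
        self._expires_at = 0.0

    def get(self):
        # Refresh 60 seconds early so in-flight requests never carry a dead token.
        if self._token is None or time.time() > self._expires_at - 60:
            resp = requests.post(
                TOKEN_URL,
                data={
                    **self._creds,
                    "grant_type": "client_credentials",
                    "scope": self._scope,
                },
                timeout=10,
            )
            resp.raise_for_status()
            payload = resp.json()
            self._token = payload["access_token"]
            self._expires_at = time.time() + payload["expires_in"]
        return self._token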

3. Pick one file handoff method

If agents need to pass artifacts around, decide once.

A practical baseline:

  • object storage
  • signed URLs
  • fixed TTLs
  • explicit metadata
  • consistent naming

Example metadata shape:

{
  "artifact_id": "run_1842_screenshot_1",
  "content_type": "image/png",
  "expires_at": "2026-05-14T20:00:00Z",
  "source_agent": "browser-runner",
  "purpose": "bug-report-evidence"
}
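
Deciding once can literally mean one helper that everything goes through. A sketch using boto3 against a hypothetical agent-artifacts bucket:

import boto3  # assumption: S3-compatible object storage

s3 = boto3.client("s3")
BUCKET = "agent-artifacts"  # hypothetical bucket name

def publish_artifact(key, body, metadata, ttl_seconds=3600):
    """Upload one artifact with explicit metadata, return a signed URL."""
    s3.put_object(
        Bucket=BUCKET,
        Key=key,
        Body=body,
        Metadata={k: str(v) for k, v in metadata.items()},  # S3 metadata is str-only
    )
    # Fixed TTL: downstream agents know exactly how long the link lives.
    return s3.generate_presigned_url(
        "get_object",
        Params={"Bucket": BUCKET, "Key": key},
        ExpiresIn=ttl_seconds,
    )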

4. Keep the tool surface small

Ten sharp tools beat a hundred vague ones.

Bad tool design:

{
  "name": "do_home_action",
  "description": "Does things in the house"
}

Better tool design:

{
  "name": "turn_light_on",
  "description": "Turn on a specific light entity in Home Assistant",
  "input_schema": {
    "type": "object",
    "properties": {
      "entity_id": {
        "type": "string",
        "description": "Home Assistant entity_id like light.kitchen"
      }
    },
    "required": ["entity_id"]
  }
}
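
The handler behind that schema can stay equally small. A sketch against Home Assistant's REST service endpoint, with host and token pulled from the environment:

import os

import requests

HA_URL = os.environ["HOME_ASSISTANT_URL"]      # e.g. http://homeassistant.local:8123
HA_TOKEN = os.environ["HOME_ASSISTANT_TOKEN"]  # scoped as tightly as HA allows

def turn_light_on(entity_id):
    """Call Home Assistant's light.turn_on service for one entity."""
    resp = requests.post(
        f"{HA_URL}/api/services/light/turn_on",
        headers={"Authorization": f"Bearer {HA_TOKEN}"},
        json={"entity_id": entity_id},
        timeout=10,
    )
    resp.raise_for_status()
    return {"ok": True, "entity_id": entity_id}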

5. Build an eval harness before you trust anything

Every time you change:

  • schema
  • auth flow
  • model
  • proxy
  • timeout settings
  • tool descriptions

run evals.

Especially if the workflow touches:

  • money
  • devices
  • customer data
  • external APIs with side effects

Even a tiny shell-based smoke test is better than vibes:

./run-evals.sh home_assistant_lights
./run-evals.sh grocery_ordering
./run-evals.sh shopify_cart
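
What lives inside one of those scripts can be embarrassingly simple and still catch real regressions. A sketch of the home_assistant_lights check, assuming the same Home Assistant REST API and environment variables as above:

import os

import requests

HA_URL = os.environ["HOME_ASSISTANT_URL"]
HA_TOKEN = os.environ["HOME_ASSISTANT_TOKEN"]
HEADERS = {"Authorization": f"Bearer {HA_TOKEN}"}

def eval_kitchen_light():
    """Smoke test: fire the action, then verify observable state changed."""
    requests.post(
        f"{HA_URL}/api/services/light/turn_on",
        headers=HEADERS,
        json={"entity_id": "light.kitchen"},
        timeout=10,
    ).raise_for_status()
    state = requests.get(
        f"{HA_URL}/api/states/light.kitchen", headers=HEADERS, timeout=10
    ).json()["state"]
    return state == "on"

if __name__ == "__main__":
    assert eval_kitchen_light(), "light.kitchen did not turn on"
    print("home_assistant_lights: PASS")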

Why this matters even more for teams running agents all day

If you’re building serious automations, you don’t just need tools that work once.

You need them to work repeatedly, cheaply, and predictably.

That’s why pricing model and architecture model are connected.

Per-token billing pushes teams into weird defensive behavior:

  • avoid evals because they cost money
  • avoid long-running agents because they might spike usage
  • avoid experimentation because every glue-layer test burns budget
  • babysit workflows instead of letting them run

That’s backwards.

The whole point of agents is to let them operate continuously.

If your team is wiring together OpenAI-compatible SDKs, n8n flows, OpenClaw agents, or custom automations, predictable cost matters because it changes how aggressively you can test and run the stack.

That’s one reason Standard Compute is interesting to this audience. It’s a drop-in OpenAI API replacement with flat monthly pricing, so teams can run agent-heavy workloads without turning every architecture decision into a token-cost debate. If you’re doing evals, retries, tool chains, and long-running automations, that pricing model is a lot closer to how people actually want to build.

And honestly, that’s the right mental model for this whole category:

make the infrastructure boring enough that the agents can be ambitious.

My blunt takeaway

MCP is good.

Tool calling is real.

The demos are not lying.

But the standard did not remove the boring parts.

It just made them more visible.

If your team wants reliable agent workflows, standardize these first:

  • transport
  • auth
  • file handoff
  • permission scope
  • evals
  • observability

Do that before adding your 27th tool.

Because the teams that win with LLM tool calling are usually not the teams with the most integrations.

They’re the teams with the least chaos between integrations.

And if your lights still won’t turn on, it’s probably not because the model is dumb.

It’s because the glue is still your product.
