DEV Community


3 Gotchas I Hit Deploying the Claude Agent SDK to Railway

I deployed a Slack bot built on the Claude Agent SDK to Railway and immediately hit a string of landmines in the SDK itself. Every one of them was the kind where the logs don't tell you what's wrong, and the second in particular ate a lot of my time. Other people are likely to get stuck in the same spots, so I'm writing them down.

This is aimed at junior-to-mid-level devs using @anthropic-ai/claude-agent-sdk (query()) in Node.js.

TL;DR

  • Gotcha 1: In a root container, bypassPermissions isn't allowed, and the child process dies with code 1. Worse, stderr is swallowed, so you can't see why.
  • Gotcha 2: stdio MCP servers don't wait for connection by default, so on turn 1 the tool list is empty — and the model "acts out" tool calls and fabricates the results.
  • Gotcha 3: haiku shows up in your API logs, but that's not the model degrading — it's by design. It's used for internal chores.

Gotcha 1: bypassPermissions doesn't work in a root container

What happened

Code that ran fine locally started dying with code 1 the moment I deployed it to Railway — the agent did nothing and just exited. The entire error message was essentially this:

Error: Claude Code process exited with code 1

That tells you nothing. The only stack trace was from my app; what the child process (the claude binary) actually said before dying was a complete black box.

The cause

query() spawns a claude binary internally. That binary refuses --dangerously-skip-permissions (which the SDK calls permissionMode: "bypassPermissions") when running as root or under sudo. It's a safety measure — skipping all permission checks as root is far too dangerous.

Railway, like many container environments, runs as root by default, so if you've set bypassPermissions you will always hit this. You can't catch it locally if you're running as a normal user.
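
Since this only surfaces when the process runs as root, one option is to check for that combination at startup instead of waiting for the child to die. A minimal sketch, assuming a POSIX container where process.getuid is available; isRunningAsRoot and checkPermissionMode are hypothetical helper names, not part of the SDK:

```javascript
// Hypothetical startup guard; none of this is SDK code.
// process.getuid only exists on POSIX platforms, so check before calling it.
function isRunningAsRoot() {
  return typeof process.getuid === "function" && process.getuid() === 0;
}

// Returns a human-readable problem, or null when the combination is fine.
function checkPermissionMode(permissionMode, asRoot = isRunningAsRoot()) {
  if (asRoot && permissionMode === "bypassPermissions") {
    return "bypassPermissions is rejected when running as root; use allowedTools instead";
  }
  return null;
}

const problem = checkPermissionMode("bypassPermissions");
if (problem) console.warn("[startup]", problem);
```

Run locally as a normal user, this prints nothing; in a root container it warns before any agent call is made.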

Why there are no logs

This is the nasty part. Unless you pass an options.stderr callback, the SDK spawns the child process with its stderr set to "ignore", so whatever the binary prints before exiting is thrown away. In other words, the real error message is hidden unless you opt in.

The first step in debugging is to capture stderr:

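import { query } from "@anthropic-ai/claude-agent-sdk";

const result = query({
  prompt: "...",
  options: {
    // Forward everything the child claude process writes to stderr
    stderr: (data) => {
      console.error("[claude stderr]", data);
    },
    // ...
  },
});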

The moment I added this, I saw the message saying root can't use bypassPermissions, and the cause was finally confirmed. If you hit an unexplained code 1 with the Agent SDK, attach a stderr callback first — it'll save you a lot of time.
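
If logging every chunk is too noisy for production, a small buffer that keeps only the tail of stderr can be dumped when something fails. A minimal sketch, assuming the options.stderr callback receives text chunks as in the snippet above; createStderrCollector is a hypothetical helper, not an SDK function:

```javascript
// Hypothetical helper: keeps the last few stderr chunks so they can be
// logged next to whatever error surfaces later.
function createStderrCollector(maxChunks = 50) {
  const chunks = [];
  return {
    // Pass this function as options.stderr.
    onStderr(data) {
      chunks.push(String(data));
      if (chunks.length > maxChunks) chunks.shift(); // drop the oldest chunk
    },
    // Joined text of everything still buffered.
    dump() {
      return chunks.join("");
    },
  };
}

const collector = createStderrCollector(2);
collector.onStderr("first\n");
collector.onStderr("second\n");
collector.onStderr("third\n");
console.error("[claude stderr tail]", collector.dump());
```

Logging collector.dump() in the catch block around the agent call puts the child's last words right beside the generic "exited with code 1" error.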

The fix

Drop bypassPermissions and explicitly list the tools you need in allowedTools.

const result = query({
  prompt: "...",
  options: {
    // don't set permissionMode (leave it default)
    allowedTools: [
      "Read",
      "Bash",
      "mcp__myserver__get_sales",
      // list the tools you need
    ],
    stderr: (data) => console.error("[claude stderr]", data),
  },
});

Tools listed in allowedTools get auto-approved and run even under the default permission mode. The mindset shift is from "skip everything" to "allow only what you use." This is the healthier setup for production anyway.
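
MCP entries in allowedTools follow the mcp__<server>__<tool> pattern seen in the example above, and they are easy to mistype by hand. A minimal sketch of building them from a list; mcpTools is a hypothetical helper, and get_prices is an invented tool name used only for illustration:

```javascript
// Hypothetical helper: builds allowedTools entries for MCP tools using the
// mcp__<server>__<tool> naming seen in the example above.
function mcpTools(serverName, toolNames) {
  return toolNames.map((tool) => `mcp__${serverName}__${tool}`);
}

const allowedTools = [
  "Read",
  // get_prices is an invented tool name for illustration
  ...mcpTools("myserver", ["get_sales", "get_prices"]),
];
console.log(allowedTools);
```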


Gotcha 2: stdio MCP doesn't wait to connect, so the model fabricates tool results on turn 1 (the hardest one)

This was the one that got me the worst. It's a nasty bug where the agent confidently returns fake numbers — and it looks normal at a glance, even in the logs.

What happened

My Slack bot app connects an MCP server over stdio to fetch sales and prices. But in production, the agent would sometimes make up plausible-looking sales and price figures without ever calling the MCP tool.

And it wasn't every time — it was hard to reproduce locally. The classic "intermittent, environment-dependent" kind of bug.

The cause: losing a startup race

Starting an stdio MCP server is non-blocking by default. That means the SDK moves on without waiting for the MCP connection to complete.

Here's what that causes. If the MCP server is still connecting (pending) at the moment the turn-1 prompt is assembled, the model receives an empty tool list. From the model's point of view, there are no tools available.

Asked to "look up the sales" when no tools exist, the model dutifully "acts out" a tool call as text and writes up plausible-looking results itself. That was the source of the fabrication.

The structure is a race between "spawning a local process" and "making a network API call":

  • MCP server = spawning a local process (CPU-hungry)
  • Assembling the turn-1 prompt = a call to the Anthropic API

In a CPU-constrained environment like Railway, spawning the local process loses every time, so it reproduces reliably. Conversely, on my local machine I could reproduce it by deliberately loading the CPU — which is how I confirmed the root cause.
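
The race can be pictured with a toy model. This is not SDK code: createFakeMcpServer and assembleTurnOne are hypothetical stand-ins that only illustrate why a tool snapshot taken while the server is still pending comes back empty:

```javascript
// Toy model of the race; none of this is SDK code.
function createFakeMcpServer(tools) {
  let status = "pending";
  return {
    get status() {
      return status;
    },
    finishConnecting() {
      status = "connected";
    },
    // Tools are only visible once the connection is up.
    listTools() {
      return status === "connected" ? [...tools] : [];
    },
  };
}

// The turn-1 prompt takes a snapshot of whatever tools exist right now.
function assembleTurnOne(server) {
  return { tools: server.listTools() };
}

const slow = createFakeMcpServer(["get_sales"]);
const lostRace = assembleTurnOne(slow); // snapshot taken while still pending
slow.finishConnecting(); // the connection lands too late for turn 1
const wonRace = assembleTurnOne(slow);

console.log(lostRace.tools.length, wonRace.tools.length); // prints: 0 1
```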

How to tell

Check the mcp_servers status in the init event — it's immediate:

  • connected → tools are present → normal
  • pending → tool list is empty → it'll fabricate

Just checking this tells you whether the current response can be trusted. Worth logging while debugging.
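
A minimal sketch of that check, assuming the init message looks like { type: "system", subtype: "init", mcp_servers: [{ name, status }] }; verify the shape against your SDK version. unconnectedServers is a hypothetical helper name:

```javascript
// Assumed message shape (verify against your SDK version):
// { type: "system", subtype: "init", mcp_servers: [{ name, status }] }
function unconnectedServers(message) {
  if (message?.type !== "system" || message?.subtype !== "init") return [];
  return (message.mcp_servers ?? [])
    .filter((server) => server.status !== "connected")
    .map((server) => server.name);
}

// Sample message for illustration only
const sample = {
  type: "system",
  subtype: "init",
  mcp_servers: [{ name: "myserver", status: "pending" }],
};

const notReady = unconnectedServers(sample);
if (notReady.length > 0) {
  console.warn("[mcp] not connected on turn 1:", notReady.join(", "));
}
```

Calling this on each message as you iterate the query() stream gives you one log line to grep for whenever a response looks suspicious.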

The fix: alwaysLoad: true

Add alwaysLoad: true to each MCP server config. This blocks startup until the connection completes (up to 5 seconds), so the tools are present by turn 1 as long as the server connects within that window.

const result = query({
  prompt: "...",
  options: {
    mcpServers: {
      myserver: {
        command: "node",
        args: ["./mcp-server.js"],
        alwaysLoad: true, // wait until connected
      },
    },
    allowedTools: ["mcp__myserver__get_sales"],
  },
});

For an agent's use case, "wait until the tools are ready before speaking" beats "non-blocking and fast" by a mile. Fabrication is fatal for an agent that handles numbers, so this is a place to block and buy certainty.
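
With more than one MCP server it is easy to forget the flag on one of them. A minimal sketch that applies the alwaysLoad option described above to every entry; withAlwaysLoad is a hypothetical helper, and it only copies config objects without touching the SDK:

```javascript
// Hypothetical helper: returns a copy of the mcpServers map with
// alwaysLoad: true (the option described above) set on every entry.
function withAlwaysLoad(servers) {
  return Object.fromEntries(
    Object.entries(servers).map(([name, config]) => [
      name,
      { ...config, alwaysLoad: true },
    ])
  );
}

const baseServers = {
  myserver: { command: "node", args: ["./mcp-server.js"] },
};
const mcpServers = withAlwaysLoad(baseServers);
console.log(mcpServers.myserver.alwaysLoad); // prints: true
```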


Gotcha 3: haiku in your API logs isn't degradation

What happened

While investigating Gotcha 2, I was staring at my API logs and noticed that, separate from the model I specified for the main work (sonnet), there were requests going to haiku mixed in. And they were tiny — like 16 output tokens.

"Did the model get silently downgraded?" "Is this the cause of the fabrication?" I briefly suspected both, but this is entirely by design and normal.

The cause

The Agent SDK uses a small haiku model for internal chores (summarizing, classifying, short judgments), separate from the main response generation (sonnet). The tiny requests are exactly that. It's a legitimate design, optimized for both cost and speed.

The cause of the fabrication was tool absence (Gotcha 2), not model quality.
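
Tallying requests per model makes the pattern easy to see: a few tiny haiku calls next to the main sonnet calls. A minimal sketch, assuming you can export log entries as { model, outputTokens } objects; that shape, summarizeByModel, and the sample numbers are all made up for illustration:

```javascript
// Hypothetical log entry shape: { model: string, outputTokens: number }.
function summarizeByModel(entries) {
  const summary = {};
  for (const { model, outputTokens } of entries) {
    if (!summary[model]) summary[model] = { requests: 0, outputTokens: 0 };
    summary[model].requests += 1;
    summary[model].outputTokens += outputTokens;
  }
  return summary;
}

// Made-up sample entries, not real log data
const summary = summarizeByModel([
  { model: "sonnet", outputTokens: 812 },
  { model: "haiku", outputTokens: 16 },
  { model: "haiku", outputTokens: 12 },
]);
console.log(summary);
```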

The lesson

It's tempting to reason "output looks off → model is weak → let's bump to Opus," but in this case bumping the model wouldn't fix the fabrication, because the root cause (no tools in the prompt) would be unchanged.

Before you flee to Opus, fix whether the tools are actually being passed.

That was the biggest takeaway. Bumping the model grade is a last resort, after you've confirmed your tools, prompt, and context are wired up correctly.


Wrap-up

All three were bugs where "what the logs look like" lies to you.

| Symptom | Real cause | Fix |
| --- | --- | --- |
| Dies with code 1, no clear reason | root rejects bypassPermissions | Capture stderr, switch to allowedTools |
| Fabricates numbers | Tool list empty because MCP isn't connected yet | alwaysLoad: true to wait for connection |
| haiku shows up in logs | Normal behavior for internal chores | Do nothing (don't misread it) |

The common lesson: the Agent SDK depends heavily on the startup timing of external things, namely a child process and MCP. When behavior looks suspicious, the shortcut is to ask two questions before you suspect the model or the prompt: "what did the child process say before it died?" and "are the tools actually being passed?"

Hope it saves someone the time I burned.
