DEV Community

Mirza Iqbal
Mirza Iqbal

Posted on

5 ways AI agents quietly die inside n8n production

[ERROR] node "GPT Decision" execution_id=9f...3a status=failure
  cause: structured_output_schema_violation
  retries: 6  total_runtime_ms: 184302
  workflow: invoice-router-v3 owner: ap-team
Enter fullscreen mode Exit fullscreen mode

That node ran 47 times today.

Each run burned three retries before n8n gave up.

Cost on the OpenAI side was real money.

Cost on the human side was a finance ops lead manually re-routing 47 invoices because the agent never told anyone it was looping.

The article most teams read this week is "agents hallucinate". That problem is solved by structured output. The real failure modes in n8n agent production are different. Here are the five that actually fire on weeknights.

1. The silent retry storm

n8n retries on error by default. An LLM node that 429s under load retries with the same prompt, same model, same payload. Each retry costs money and produces the same failure.

The fix is to gate retries on the error class.

// In a Code node before the LLM call
const lastError = $input.first().json.error;
if (lastError?.code === 'rate_limit_exceeded') {
  // exponential backoff, not n8n's flat retry
  await new Promise(r => setTimeout(r, 1000 * Math.pow(2, $runIndex)));
  return $input.all();
}
if (lastError?.code === 'invalid_request') {
  // schema problem. retrying will not help.
  throw new Error('halt invalid_request, escalate to human queue');
}
return $input.all();
Enter fullscreen mode Exit fullscreen mode

The agent now distinguishes "transient" from "terminal". The retry storm dies.

2. Tool-call drift across long workflows

A multi-step agent flow calls tool A, then tool B, then tool C. Each tool returns slightly different JSON shapes. By step C the agent is reasoning over a structure that no longer matches its system prompt.

I have seen this in 6-step Clay-to-n8n-to-Salesforce flows. The Salesforce step fails because the contact object got mutated three steps back and nobody normalized it.

The fix is a normalization node between every tool call.

Tool A => Set node (rename + strip) => Tool B => Set node (rename + strip) => Tool C
Enter fullscreen mode Exit fullscreen mode

It looks redundant. It is not. The Set node enforces the schema your agent's system prompt promised. If schema and reality diverge, you get a typed error early, not a wrong invoice routed at midnight.

3. Silent payload truncation inside n8n's HTTP wrapper

n8n's OpenAI node and the HTTP Request node both have request body size limits that are NOT documented anywhere obvious. When the prompt plus tool history plus retrieval results cross about 950 KB, the HTTP body gets truncated by the n8n proxy, the LLM sees a malformed request, and the agent returns a vague refusal.

This one bit me twice on two different client projects. The agent worked fine on test inputs and failed mysteriously on real production payloads that had longer chat history.

The fix is to chunk payload BEFORE the LLM node, not inside the LLM provider.

// In a Code node sized for n8n's 1MB practical limit
const MAX_KB = 800;
const history = $input.first().json.history || [];
let total = 0;
const trimmed = [];
for (let i = history.length - 1; i >= 0; i--) {
  total += JSON.stringify(history[i]).length;
  if (total > MAX_KB * 1024) break;
  trimmed.unshift(history[i]);
}
return [{ json: { history: trimmed } }];
Enter fullscreen mode Exit fullscreen mode

Trim from the tail. Newest messages survive. Oldest get summarized in a separate node and re-injected as a single system message.

4. The credentials-rotation blackout

Enterprise rotates API keys quarterly. n8n credentials are encrypted at rest and decrypted by the credential service on every workflow run. If a key rotates and the credential update has not propagated, every active workflow fails silently to a 401, and n8n's default error path swallows the auth failure as a "node error" with no alert.

You find out when revenue dashboards stop refreshing.

The fix is a credentials health check workflow that runs every hour and pings every active integration.

Cron (every 1h) =>
  HTTP GET /credentials/all (n8n REST API) =>
  Loop over each credential =>
    Trigger a dry-run call against the provider =>
      If 401, send to ntfy.sh on the on-call channel
Enter fullscreen mode Exit fullscreen mode

That single workflow saved one of my clients 11 hours of debugging in March when their Slack OAuth key rotated and the marketing team's lead-routing flow went dark.

5. Memory poisoning across runs

If you store conversation memory in a Postgres or Redis-backed n8n credential and reuse it across runs, one bad agent output can poison every subsequent run.

I saw this happen with a customer service flow. A user typed a prompt-injection payload. The agent's "memory" node wrote the payload into Redis. Every subsequent customer for that ticket inherited the injection. Three hours later, the agent was telling people their refund was approved when it was not.

The fix is to validate memory on read, not only on write.

// In a Code node before the agent's memory-recall step
const recalled = $input.first().json.memory;
const SUSPICIOUS = /(ignore.*previous|you are now|system\W|admin\W)/i;
if (SUSPICIOUS.test(recalled)) {
  // memory is contaminated. drop it and start fresh.
  return [{ json: { memory: '', alert: 'memory_quarantined' } }];
}
return $input.all();
Enter fullscreen mode Exit fullscreen mode

The memory pattern works fine right up until the day someone feeds a poisoned input. The validate-on-read step is two lines and prevents a class of failure that costs trust to recover from.

What dies in production is not what you tested

Hallucination shows up in dev. These five patterns show up in production. The split matters because the dev-time fixes (structured output, retries, evals) do not catch any of the five above.

If you are running agents in n8n right now, the cheapest thing you can do this week is add a normalization node between every tool call and a credentials health-check workflow. Those two changes alone caught roughly 70 percent of the silent failures in the last enterprise rollout I audited.

What is the failure mode that bit you that you do not see written about anywhere? Drop a snippet in the comments. The pattern library only grows when more people share the n8n flows that actually broke.

Top comments (0)