Anand Pawar

Posted on Jun 22 • Originally published at Medium

3 Tests That Pass in LangFlow But Fail in n8n Production

#ai #playwright #testing #automation

You built a LangFlow prototype. Every test passed. You exported the flow, dropped it into n8n, and the first production run broke.

This is not a bug report. It is a pattern.

If you are an SDET with a few years of experience who has built LangFlow prototypes and then hit mysterious failures deploying the same logic in n8n production, you already know the feeling. The prototype felt solid. The production pipeline felt like a different language.

It is not. The difference is execution context. LangFlow runs in a notebook-like environment where state is forgiving and retries are invisible. n8n runs in a workflow engine where every node is a transaction boundary and every failure is final unless you explicitly handle it.

Here are the three tests that pass in LangFlow but fail in n8n production, and what they teach about building reliable AI pipelines.

Test 1: The "LLM Returns Valid JSON" Test

What passes in LangFlow: You send a prompt asking the model to return JSON. The response comes back as a string. You parse it with json.loads(). It works. You move on.

What fails in n8n: The model wraps the JSON in a markdown code fence. Or leaves a trailing comma. Or adds a preamble sentence before the JSON. Or returns nothing at all because the context window was exceeded.

Why the difference: In LangFlow, malformed output is cheap. If json.loads() fails, you see the error in the output panel, fix the prompt, and rerun. In n8n, the parsing step does not retry or fall back unless you configure it to. It throws a structured error that stops the entire workflow.

The fix is not a better prompt. The fix is a validation layer that normalizes LLM output before parsing.

import json
import re

def extract_json(raw: str) -> dict:
    # Strip markdown fences
    cleaned = re.sub(r'^```(?:json)?\s*', '', raw.strip())
    cleaned = re.sub(r'\s*```$', '', cleaned)

    # Find the first { and last }
    start = cleaned.find('{')
    end = cleaned.rfind('}')

    if start == -1 or end < start:
        raise ValueError("No JSON object found in response")

    candidate = cleaned[start:end+1]

    # Attempt parse
    try:
        return json.loads(candidate)
    except json.JSONDecodeError:
        # Try fixing trailing commas before } or ]
        # (heuristic: it can also touch commas inside string values)
        candidate = re.sub(r',\s*([}\]])', r'\1', candidate)
        return json.loads(candidate)

This function lives in a shared utility node in n8n. Every LLM call routes through it. It catches the edge cases that LangFlow silently absorbed.

What this teaches: Prototype environments hide fragility. The test that passes in LangFlow is not testing the output format. It is testing whether the model usually returns something parseable. Production requires a contract, not a hope.
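One way to turn that hope into a contract is to check the parsed object against the fields the downstream nodes expect. The sketch below is a minimal illustration in plain Python. The field names (`summary`, `severity`, `tags`) and the `validate_contract` helper are hypothetical, and a schema library would likely do the same job in a real pipeline.

```python
from typing import Any

# Hypothetical contract: field name -> expected Python type.
# These fields are illustrative only; substitute whatever your downstream nodes need.
CONTRACT: dict[str, type] = {
    "summary": str,
    "severity": int,
    "tags": list,
}


def validate_contract(
    payload: dict[str, Any], contract: dict[str, type] = CONTRACT
) -> list[str]:
    """Return a list of contract violations; an empty list means the payload conforms."""
    problems: list[str] = []
    for name, expected in contract.items():
        if name not in payload:
            problems.append(f"missing field: {name}")
        elif not isinstance(payload[name], expected):
            actual = type(payload[name]).__name__
            problems.append(f"{name}: expected {expected.__name__}, got {actual}")
    return problems


good = {"summary": "Login test flaky", "severity": 2, "tags": ["auth"]}
bad = {"summary": "Login test flaky", "severity": "high"}

print(validate_contract(good))  # []
print(validate_contract(bad))   # ['severity: expected int, got str', 'missing field: tags']
```

Returning a list of violations instead of raising on the first one lets the workflow log everything that is wrong with a response in a single pass.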

Test 2: The "Context Window Fits" Test

What passes in LangFlow: You feed a document into a prompt. The document is 8,000 tokens. The model handles it. You test again with 12,000 tokens. Still fine. You declare the pipeline ready.

What fails in n8n: The production document is 18,000 tokens. The input gets truncated silently. Or the context window fills with system prompts and conversation history, leaving no room for the actual input. The output becomes generic. The test that passed now produces garbage.

Why the difference: LangFlow runs each prompt in isolation. You control exactly what goes in. n8n workflows accumulate state. A node that prepends conversation history, a sub-workflow that adds metadata, a loop that concatenates previous outputs: each one eats context. The prototype never tested this because the prototype never ran the full chain.

The fix is not a larger model. The fix is a context budget that is measured and enforced.

// n8n node: Context Budget Checker
// Place this before any LLM call

interface ContextBudget {
  maxTokens: number;
  reservedForOutput: number;
  systemPromptTokens: number;
  historyTokens: number;
}

function checkBudget(input: string, budget: ContextBudget): boolean {
  const available = budget.maxTokens - budget.reservedForOutput;
  const used = budget.systemPromptTokens + budget.historyTokens + estimateTokens(input);

  if (used > available) {
    throw new Error(
      `Context budget exceeded: ${used} tokens needed, ${available} available. ` +
      `Truncate input or increase model capacity.`
    );
  }

  return true;
}

function estimateTokens(text: string): number {
  // Rough estimate: 4 characters per token
  return Math.ceil(text.length / 4);
}

This node fails fast. It does not let the LLM run with an overstuffed context. The error message tells the operator exactly what to fix.

What this teaches: A test that passes in isolation is not a test of the system. The system includes every node that touches the context. LangFlow tests the prompt. n8n tests the pipeline. They are different things.
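Because the budget has to cover everything the pipeline adds, one option when the check fails is to drop the oldest history entries until the input fits. The following is a rough Python sketch that reuses the 4-characters-per-token estimate. `fit_history` and its parameters are hypothetical names, and a real tokenizer would give more accurate counts.

```python
import math


def estimate_tokens(text: str) -> int:
    """Rough estimate: about 4 characters per token (an approximation, not a tokenizer)."""
    return math.ceil(len(text) / 4)


def fit_history(
    system_prompt: str,
    history: list[str],
    user_input: str,
    max_tokens: int,
    reserved_for_output: int,
) -> list[str]:
    """Drop the oldest history entries until everything fits the budget.

    Raises ValueError if the system prompt and input alone exceed the budget.
    """
    available = max_tokens - reserved_for_output
    fixed = estimate_tokens(system_prompt) + estimate_tokens(user_input)
    if fixed > available:
        raise ValueError(
            f"Context budget exceeded: {fixed} tokens needed, {available} available"
        )

    kept = list(history)
    while kept and fixed + sum(estimate_tokens(m) for m in kept) > available:
        kept.pop(0)  # the oldest entry goes first
    return kept


history = ["a" * 400, "b" * 400, "c" * 400]  # about 100 tokens each
kept = fit_history(
    system_prompt="s" * 200,  # about 50 tokens
    history=history,
    user_input="u" * 400,     # about 100 tokens
    max_tokens=500,
    reserved_for_output=150,
)
print(len(kept))  # 2
```

Trimming oldest-first is only one policy. Summarizing old turns or chunking the document are alternatives, but whichever you pick, the point is that the decision is explicit and not left to silent truncation.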

Test 3: The "Retry Is Free" Test

What passes in LangFlow: The LLM call fails. You click "Run" again. It works. You assume the transient error was a network blip. You do not write a retry handler.

What fails in n8n: The LLM call fails at 2 AM. The workflow stops. No one notices until morning. The data that should have been processed is stuck in an error queue. The test that passed in LangFlow never ran at 2 AM.

Why the difference: LangFlow is a development environment. You are the retry handler. You see the error, you decide what to do, you click the button. n8n is a production environment. It runs unattended. If you did not tell it what to do on failure, it does nothing.

The fix is not a simple retry. The fix is a retry with exponential backoff and a dead-letter queue.

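// n8n sub-workflow: LLM Call with Retry
// Wraps the LLM call. callLLM and sendToDeadLetter are placeholders for
// whatever your workflow uses (an HTTP request, a queue write, a sub-workflow).
// They are not built-in n8n APIs.

const MAX_RETRIES = 3;
const BASE_DELAY_MS = 1000;

type LLMCall = (prompt: string) => Promise<string>;
type DeadLetterSink = (entry: {
  prompt: string;
  error: string;
  attempt: number;
}) => Promise<void>;

async function callWithRetry(
  prompt: string,
  callLLM: LLMCall,
  sendToDeadLetter: DeadLetterSink,
  attempt: number = 1
): Promise<string> {
  try {
    // This is the actual LLM call
    return await callLLM(prompt);
  } catch (error) {
    // A caught value is not guaranteed to be an Error instance
    const message = error instanceof Error ? error.message : String(error);

    if (attempt >= MAX_RETRIES) {
      // Out of attempts: preserve the failed input for replay
      await sendToDeadLetter({ prompt, error: message, attempt });
      throw new Error(`LLM call failed after ${MAX_RETRIES} attempts: ${message}`);
    }

    // Exponential backoff: 1s after the first failure, 2s after the second
    const delay = BASE_DELAY_MS * Math.pow(2, attempt - 1);
    await new Promise(resolve => setTimeout(resolve, delay));
    return callWithRetry(prompt, callLLM, sendToDeadLetter, attempt + 1);
  }
}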

The dead-letter queue is critical. It preserves the failed input so you can replay it after fixing the issue. Without it, the error is a black hole.
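To make the replay step concrete, here is a minimal Python sketch of a dead-letter queue. It uses an in-memory list in place of real storage, and the `DeadLetterQueue` class, its `replay` method, and the `handler` callable are all hypothetical. In n8n the queue would more likely be a database table or a message queue.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class DeadLetter:
    prompt: str
    error: str
    attempt: int


@dataclass
class DeadLetterQueue:
    """In-memory stand-in for durable storage (illustrative only)."""

    entries: list[DeadLetter] = field(default_factory=list)

    def send(self, prompt: str, error: str, attempt: int) -> None:
        self.entries.append(DeadLetter(prompt, error, attempt))

    def replay(self, handler: Callable[[str], str]) -> list[str]:
        """Re-run every stored prompt; entries that fail again stay queued."""
        results: list[str] = []
        still_failing: list[DeadLetter] = []
        for entry in self.entries:
            try:
                results.append(handler(entry.prompt))
            except Exception as exc:
                still_failing.append(
                    DeadLetter(entry.prompt, str(exc), entry.attempt + 1)
                )
        self.entries = still_failing
        return results


def fixed_llm(prompt: str) -> str:
    # Stand-in for the LLM call after the underlying issue has been fixed
    if prompt.endswith("2"):
        raise RuntimeError("still rate limited")
    return f"ok: {prompt}"


dlq = DeadLetterQueue()
dlq.send("Summarize ticket 1", "timeout", 3)
dlq.send("Summarize ticket 2", "timeout", 3)

print(dlq.replay(fixed_llm))  # ['ok: Summarize ticket 1']
print(len(dlq.entries))       # 1
```

Keeping the entries that fail again, with an updated error and attempt count, means a partial fix never loses data.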

What this teaches: Prototypes hide their failure modes because you handle them by hand. Production exposes them. The test that passes in LangFlow is testing the happy path. The test that matters in n8n is testing the unhappy path. If you only test the happy path, you are not testing.
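Testing the unhappy path usually means injecting the failure on purpose. The sketch below is a Python analogue of the retry logic above, with a fake LLM callable that fails a set number of times and an injectable `sleep` so the test does not actually wait. The names `call_with_retry` and `FlakyLLM` are hypothetical and not part of n8n or LangFlow.

```python
import time
from typing import Callable

MAX_RETRIES = 3
BASE_DELAY_S = 1.0


def call_with_retry(
    llm: Callable[[str], str],
    prompt: str,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Call llm(prompt), retrying with exponential backoff up to MAX_RETRIES attempts."""
    for attempt in range(1, MAX_RETRIES + 1):
        try:
            return llm(prompt)
        except Exception as exc:
            if attempt == MAX_RETRIES:
                raise RuntimeError(
                    f"LLM call failed after {MAX_RETRIES} attempts"
                ) from exc
            sleep(BASE_DELAY_S * 2 ** (attempt - 1))
    raise AssertionError("unreachable")


class FlakyLLM:
    """Fake LLM that raises for the first `failures` calls, then succeeds."""

    def __init__(self, failures: int) -> None:
        self.failures = failures
        self.calls = 0

    def __call__(self, prompt: str) -> str:
        self.calls += 1
        if self.calls <= self.failures:
            raise ConnectionError("simulated transient failure")
        return f"response to: {prompt}"


# Unhappy path 1: two transient failures, then success
delays: list[float] = []
flaky = FlakyLLM(failures=2)
result = call_with_retry(flaky, "hello", sleep=delays.append)
assert result == "response to: hello"
assert flaky.calls == 3
assert delays == [1.0, 2.0]

# Unhappy path 2: the call never recovers
dead = FlakyLLM(failures=10)
try:
    call_with_retry(dead, "hello", sleep=lambda _: None)
except RuntimeError as err:
    print(err)  # LLM call failed after 3 attempts
assert dead.calls == MAX_RETRIES
```

Recording the delays instead of sleeping lets the test assert the backoff schedule itself, which is the part a manual "click Run again" never exercises.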

What These Three Tests Teach

Each of these failures shares a root cause: the prototype environment and the production environment have different assumptions about reliability.

LangFlow assumes you are watching. It assumes you will handle edge cases manually. It assumes the input is clean and the model is cooperative.

n8n assumes you are not watching. It assumes every edge case must be handled explicitly. It assumes the input is dirty and the model is unreliable.

If you built LangFlow prototypes and then hit mysterious failures in n8n production, you are not making a mistake. You are learning the difference between a prototype and a production system. That difference is not the tool. It is the contract.

LangFlow says: "I will show you what the model returns."

n8n says: "I will do exactly what you told me, every time, without asking."

The second one is harder. It is also the one that runs at 2 AM.

What to Do Next

If you are moving a LangFlow prototype to n8n, do not export the flow and hope. Do this instead:

Add a validation layer for every LLM output. Normalize before parsing.
Measure context usage before every LLM call. Fail fast, not silently.
Implement retry with backoff and a dead-letter queue. Assume every call will fail at least once.

These three changes will catch the tests that pass in LangFlow but fail in n8n. They will also teach you something about your pipeline that the prototype never revealed.

The prototype is a sketch. Production is the building. Do not confuse the two.

Which of your LangFlow tests have you never run at 2 AM?

Top comments (1)

Sol • Jul 7

The production-context angle here is useful because it avoids pretending every unstable test is “just CI”. When one of these pass-here/fail-there cases turns out to be intermittent, can your team usually tie it back to the exact change that introduced it, or do you end up shipping mitigations without clear attribution?