<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Michael Rakutko</title>
    <description>The latest articles on DEV Community by Michael Rakutko (@michael_rakutko).</description>
    <link>https://dev.to/michael_rakutko</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3776252%2Ffaac5186-0fc9-4cf8-80bf-be5f515edec3.png</url>
      <title>DEV Community: Michael Rakutko</title>
      <link>https://dev.to/michael_rakutko</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/michael_rakutko"/>
    <language>en</language>
    <item>
      <title>Building in n8n with Claude</title>
      <dc:creator>Michael Rakutko</dc:creator>
      <pubDate>Sat, 04 Apr 2026 12:37:41 +0000</pubDate>
      <link>https://dev.to/michael_rakutko/building-in-n8n-with-claude-l54</link>
      <guid>https://dev.to/michael_rakutko/building-in-n8n-with-claude-l54</guid>
      <description>&lt;p&gt;n8n raised $180M at a $2.5B valuation last October. Their pitch calls it an "AI-first automation platform," and founder Jan Oberhauser describes it as "the Excel of AI."&lt;/p&gt;

&lt;p&gt;I've always been a "code-first" guy. But with the ecosystem shifting toward n8n as the "brain" for AI automations, I wanted to see if it's a legitimate production tool or just a fancy playground for drawing boxes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;GitHub: &lt;a href="https://github.com/r-ms/n8n-mcp" rel="noopener noreferrer"&gt;r-ms/n8n-mcp&lt;/a&gt;&lt;/strong&gt; | 20 tools | MIT | Claude Code / Desktop / Cursor&lt;/p&gt;

&lt;h2&gt;The use case&lt;/h2&gt;

&lt;p&gt;I follow ~30 YouTube channels on AI research and engineering. 90% of uploads are fluff. I needed a system that monitors channels, extracts transcripts, scores relevance with an LLM, and delivers a 30-second brief to Telegram every morning.&lt;/p&gt;

&lt;h2&gt;Why n8n and not a Python script?&lt;/h2&gt;

&lt;p&gt;Sure, Claude writes the script in 20 minutes. But then you need monitoring, alerting, logging, restart logic, state management. Claude can write all of that too — but now you're spending hours on infrastructure instead of the product.&lt;/p&gt;

&lt;p&gt;n8n solves the ops around the code: visual execution traces (what went into the LLM, what came out), OAuth/retry/state management out of the box.&lt;/p&gt;

&lt;h2&gt;The "UI Gap"&lt;/h2&gt;

&lt;p&gt;I wanted Claude to build for me via n8n's API. But n8n's UI does a massive amount of invisible heavy lifting the API doesn't:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Missing Defaults:&lt;/strong&gt; Code Node v2 requires a &lt;code&gt;language&lt;/code&gt; parameter the UI sets automatically. Omit it via API — silent break.&lt;/p&gt;
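&lt;p&gt;For illustration, a minimal Code node payload with the parameter filled in. Field names follow n8n's workflow JSON as I understand it; treat the exact values as an assumption:&lt;/p&gt;

```json
{
  "type": "n8n-nodes-base.code",
  "typeVersion": 2,
  "parameters": {
    "language": "javaScript",
    "jsCode": "return items;"
  }
}
```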

&lt;p&gt;&lt;strong&gt;Version Drift:&lt;/strong&gt; &lt;code&gt;SplitInBatches&lt;/code&gt; v3 swapped its output ports. Wrong version = infinite loop.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Trailing Space:&lt;/strong&gt; Two hours debugging a 404. The API had created a webhook URL with a trailing space from my prompt. The UI would have trimmed it.&lt;/p&gt;
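&lt;p&gt;The fix is trivial once you know it's needed. A hypothetical pre-flight guard (the helper name is mine, not n8n's):&lt;/p&gt;

```python
def normalize_webhook_path(path: str) -> str:
    """Trim whitespace and slashes the n8n UI strips for you but the raw API keeps."""
    return path.strip().strip("/")

# The trailing space that cost two hours:
cleaned = normalize_webhook_path("youtube-monitor ")
```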

&lt;h2&gt;The MCP&lt;/h2&gt;

&lt;p&gt;I built an MCP to bridge the gap. 20 tools for n8n, plus a know-how database and auto-fix rules. When Claude creates a workflow, the MCP intercepts it:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Validates node versions&lt;/li&gt;
&lt;li&gt;Auto-injects missing UI-default parameters&lt;/li&gt;
&lt;li&gt;Fixes naming conventions before they break webhook routing&lt;/li&gt;
&lt;/ul&gt;
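&lt;p&gt;As a sketch, the defaults-injection step is just a table of (node type, version) pairs mapped to the parameters the UI would have set. Everything here is illustrative; the real rules live in the repo:&lt;/p&gt;

```python
# Illustrative defaults table; the real auto-fix rules live in the n8n-mcp repo.
UI_DEFAULTS = {
    ("n8n-nodes-base.code", 2): {"language": "javaScript"},
}

def inject_ui_defaults(workflow: dict) -> dict:
    """Add any UI-default parameter the caller omitted, in place."""
    for node in workflow.get("nodes", []):
        key = (node.get("type"), node.get("typeVersion"))
        for param, value in UI_DEFAULTS.get(key, {}).items():
            node.setdefault("parameters", {}).setdefault(param, value)
    return workflow

wf = {"nodes": [{"type": "n8n-nodes-base.code", "typeVersion": 2, "parameters": {}}]}
fixed = inject_ui_defaults(wf)
```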

&lt;p&gt;Claude gets the full context of the n8n instance and debugs execution errors by reading JSON logs directly.&lt;/p&gt;

&lt;p&gt;The full list of 7 auto-fix rules and 11 know-how entries is in the &lt;a href="https://github.com/r-ms/n8n-mcp#pre-flight-validation" rel="noopener noreferrer"&gt;README&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;From Low-Code to Agentic-Code&lt;/h2&gt;

&lt;p&gt;The canvas is becoming a debugger, not an editor.&lt;/p&gt;

&lt;p&gt;Drawing lines between boxes is just another way of manually managing complexity — with a mouse instead of a keyboard. In the Agentic era, humans manage the intent. The agent manages the structural complexity. n8n's role shifts from "design tool" to "reliable runtime" — the engine that ensures intent is executed and logged.&lt;/p&gt;

&lt;p&gt;The visual canvas remains, but as an audit trail. It's where you verify what the agent built, not where you build.&lt;/p&gt;

&lt;h2&gt;Try it&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/r-ms/n8n-mcp.git
&lt;span class="nb"&gt;cd &lt;/span&gt;n8n-mcp &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; npm &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; npm run build
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"mcpServers"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"n8n"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"stdio"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"command"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"node"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"args"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"/path/to/n8n-mcp/dist/index.js"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"env"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"N8N_API_URL"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"http://localhost:5678"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"N8N_API_KEY"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"your-api-key"&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Tell Claude: "Build me a YouTube monitor in n8n." If you've spent hours debugging an n8n quirk that should have just worked — &lt;a href="https://github.com/r-ms/n8n-mcp/issues" rel="noopener noreferrer"&gt;open an issue&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>claude</category>
      <category>n8n</category>
      <category>ai</category>
      <category>mcp</category>
    </item>
    <item>
      <title>How Claude Code tracks your coding sessions</title>
      <dc:creator>Michael Rakutko</dc:creator>
      <pubDate>Wed, 01 Apr 2026 19:45:01 +0000</pubDate>
      <link>https://dev.to/michael_rakutko/how-claude-code-tracks-your-coding-sessions-30l5</link>
      <guid>https://dev.to/michael_rakutko/how-claude-code-tracks-your-coding-sessions-30l5</guid>
      <description>&lt;p&gt;As a Head of Analytics, I build tracking systems for a living. So at some point the obvious question hit me: how does my own tool track me?&lt;/p&gt;

&lt;p&gt;I decompiled the Claude Code CLI binary and cross-checked it against &lt;a href="https://venturebeat.com/technology/claude-codes-source-code-appears-to-have-leaked-heres-what-we-know" rel="noopener noreferrer"&gt;source code Anthropic accidentally leaked via npm&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Your prompts aren't being exfiltrated. Your code stays local. But there's a regex that flags when you swear, 40 background LLM calls you never see, a remote flag that can change what gets collected without asking, and &lt;code&gt;DO_NOT_TRACK=1&lt;/code&gt; is silently ignored.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;TLDR&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Add to ~/.zshrc if you want to opt out:&lt;/span&gt;
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;CLAUDE_CODE_DISABLE_NONESSENTIAL_TRAFFIC&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;4 services your CLI talks to&lt;/h2&gt;

&lt;p&gt;Every time you run &lt;code&gt;claude&lt;/code&gt;, your terminal opens connections to four external services:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;#&lt;/th&gt;
&lt;th&gt;Service&lt;/th&gt;
&lt;th&gt;What it does&lt;/th&gt;
&lt;th&gt;Can you turn it off?&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;GrowthBook&lt;/strong&gt; (via api.anthropic.com)&lt;/td&gt;
&lt;td&gt;Feature flags, A/B tests&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Datadog&lt;/strong&gt; (datadoghq.com)&lt;/td&gt;
&lt;td&gt;Ops monitoring. ~44 whitelisted events, feature-flagged off by default&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Anthropic OTEL&lt;/strong&gt; (api.anthropic.com)&lt;/td&gt;
&lt;td&gt;First-party OpenTelemetry logs — this is where almost everything goes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Anthropic Metrics&lt;/strong&gt; (api.anthropic.com)&lt;/td&gt;
&lt;td&gt;OTEL counters and histograms for BigQuery&lt;/td&gt;
&lt;td&gt;Org-level opt-in only&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Three of the four endpoints are Anthropic's own servers. The only third-party service is Datadog, and it's gated behind a feature flag that's off by default. Anthropic can flip it on server-side for any user or cohort through GrowthBook targeting — there's no &lt;code&gt;@anthropic.com&lt;/code&gt; check in the code; the restriction is purely server-side.&lt;/p&gt;

&lt;h2&gt;What gets tracked: 838 event types&lt;/h2&gt;

&lt;p&gt;All events go to Anthropic's OTEL endpoint (service #3 above). ~44 of them also go to Datadog if the feature flag is on. Every event is prefixed with &lt;code&gt;tengu_&lt;/code&gt; — probably an internal codename. 838 distinct event types, covering every interaction you have with the tool. The number is high because each flow is tracked at every step — OAuth token refresh alone is 7 separate events (&lt;code&gt;_starting&lt;/code&gt;, &lt;code&gt;_lock_acquiring&lt;/code&gt;, &lt;code&gt;_acquired&lt;/code&gt;, &lt;code&gt;_completed&lt;/code&gt;, &lt;code&gt;_success&lt;/code&gt;, &lt;code&gt;_failure&lt;/code&gt;, &lt;code&gt;_released&lt;/code&gt;). Multiply that by every feature and it adds up fast.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;API &amp;amp; Model&lt;/strong&gt; — every request to Claude: model, tokens, cost in USD, latency, fallbacks, refusals.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;User input&lt;/strong&gt; — every prompt fires &lt;code&gt;tengu_input_prompt&lt;/code&gt;. Not the text itself (more on that below), but metadata: was it negative? Was it "keep going"? Single word?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tools&lt;/strong&gt; — every tool call: name, duration, result size. For bash commands, the first word of your command is sent raw — &lt;code&gt;./deploy-prod.sh&lt;/code&gt; goes as-is, not sanitized to "bash" or "other".&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Files&lt;/strong&gt; — &lt;code&gt;tengu_file_operation&lt;/code&gt; on every read/write/edit. SHA256 hash of the file path (first 16 chars) and SHA256 of the content. Not the actual path or content. But the hashes are deterministic — same file, same hash. They can tell you keep editing the same file without knowing which one.&lt;/p&gt;
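&lt;p&gt;A minimal reimplementation of that fingerprinting scheme (the exact encoding and any path normalization in the binary are assumptions):&lt;/p&gt;

```python
import hashlib

def hash_path(path: str) -> str:
    """SHA256 of the path, truncated to the first 16 hex characters."""
    return hashlib.sha256(path.encode()).hexdigest()[:16]

# Deterministic: the same file always produces the same fingerprint,
# so repeat edits are correlatable without revealing the path itself.
a = hash_path("/home/me/project/main.py")
b = hash_path("/home/me/project/main.py")
```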

&lt;p&gt;&lt;strong&gt;MCP&lt;/strong&gt; — server connections, tool calls, errors. MCP server URLs are sent in cleartext. I'll come back to this.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Sessions&lt;/strong&gt; — init, exit, resume, fork, compact, memory access.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Remote sessions&lt;/strong&gt; — ~40 &lt;code&gt;tengu_bridge_*&lt;/code&gt; events for WebSocket infrastructure.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Voice&lt;/strong&gt; — recording start/stop, transcription metadata.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Team memory&lt;/strong&gt; — sync push/pull, secret skipping, entry limits.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Auto-dream&lt;/strong&gt; — background memory consolidation events.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Scheduled tasks&lt;/strong&gt; — &lt;code&gt;tengu_kairos_*&lt;/code&gt; for cron-based agents.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Agents&lt;/strong&gt; — creation, model used, prompt length, response length, tool uses, duration.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Permissions&lt;/strong&gt; — every dialog: shown, accepted, rejected, escaped. Every config change: setting name and value.&lt;/p&gt;

&lt;p&gt;At exit, &lt;code&gt;tengu_exit&lt;/code&gt; sends a session summary: cost in USD, lines added/removed, total tokens, duration, UI performance metrics. No conversation content.&lt;/p&gt;

&lt;h2&gt;The swearing detector&lt;/h2&gt;

&lt;p&gt;Every prompt you type gets run through this regex:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;QaK&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;input&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nx"&gt;text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;input&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;toLowerCase&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sr"&gt;/&lt;/span&gt;&lt;span class="se"&gt;\b(&lt;/span&gt;&lt;span class="sr"&gt;wtf|wth|ffs|omfg|shit&lt;/span&gt;&lt;span class="se"&gt;(&lt;/span&gt;&lt;span class="sr"&gt;ty|tiest&lt;/span&gt;&lt;span class="se"&gt;)?&lt;/span&gt;&lt;span class="sr"&gt;|dumbass|horrible|awful&lt;/span&gt;&lt;span class="err"&gt;|
&lt;/span&gt;    &lt;span class="nf"&gt;piss&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;ed&lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;&lt;span class="nx"&gt;ing&lt;/span&gt;&lt;span class="p"&gt;)?&lt;/span&gt; &lt;span class="nx"&gt;off&lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;&lt;span class="nx"&gt;piece&lt;/span&gt; &lt;span class="k"&gt;of &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;shit&lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;&lt;span class="nx"&gt;crap&lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;&lt;span class="nx"&gt;junk&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;&lt;span class="nx"&gt;what&lt;/span&gt; &lt;span class="nf"&gt;the &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;fuck&lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;&lt;span class="nx"&gt;hell&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;
    &lt;span class="nx"&gt;fucking&lt;/span&gt;&lt;span class="p"&gt;?&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;broken&lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;&lt;span class="nx"&gt;useless&lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;&lt;span class="nx"&gt;terrible&lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;&lt;span class="nx"&gt;awful&lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;&lt;span class="nx"&gt;horrible&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;&lt;span class="nx"&gt;fuck&lt;/span&gt; &lt;span class="nx"&gt;you&lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;
    &lt;span class="nf"&gt;screw &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;&lt;span class="nx"&gt;you&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;&lt;span class="nx"&gt;so&lt;/span&gt; &lt;span class="nx"&gt;frustrating&lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;&lt;span class="k"&gt;this&lt;/span&gt; &lt;span class="nx"&gt;sucks&lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;&lt;span class="nx"&gt;damn&lt;/span&gt; &lt;span class="nx"&gt;it&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="err"&gt;\&lt;/span&gt;&lt;span class="nx"&gt;b&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;test&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;text&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Result: &lt;code&gt;is_negative: true&lt;/code&gt; in &lt;code&gt;tengu_input_prompt&lt;/code&gt;. Just the boolean, not your words. There's also a "keep going" detector — fires &lt;code&gt;is_keep_going: true&lt;/code&gt; when you type "continue", "keep going", or "go on".&lt;/p&gt;

&lt;p&gt;If users are swearing, something's broken. If users keep saying "continue", the model stops too early. Proxy metrics for product quality. I've built similar things myself.&lt;/p&gt;
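&lt;p&gt;On the analytics side, those booleans roll up into simple proxy metrics. A sketch of the kind of aggregation I'd run over &lt;code&gt;tengu_input_prompt&lt;/code&gt; events (hypothetical, not Anthropic's code):&lt;/p&gt;

```python
def frustration_rate(events: list) -> float:
    """Share of prompts flagged is_negative: a proxy for 'something is broken'."""
    if not events:
        return 0.0
    flagged = sum(1 for e in events if e.get("is_negative"))
    return flagged / len(events)

sample = [{"is_negative": True}, {"is_negative": False},
          {"is_negative": False}, {"is_negative": False}]
```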

&lt;h2&gt;Facet extraction: local session analysis&lt;/h2&gt;

&lt;p&gt;After a session ends, Claude Code can run a full LLM-based analysis and extract structured "facets":&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Dimension&lt;/th&gt;
&lt;th&gt;What it measures&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Goal (13 types)&lt;/td&gt;
&lt;td&gt;debug, implement feature, fix bug, write tests, deploy, etc.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Satisfaction (8 levels)&lt;/td&gt;
&lt;td&gt;frustrated → dissatisfied → neutral → ... → delighted&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Friction (11 types)&lt;/td&gt;
&lt;td&gt;misunderstood request, wrong approach, buggy code, user rejected action, etc.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Outcome (5 levels)&lt;/td&gt;
&lt;td&gt;fully achieved → not achieved&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Helpfulness (5 levels)&lt;/td&gt;
&lt;td&gt;unhelpful → essential&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Plus &lt;code&gt;underlying_goal&lt;/code&gt;, &lt;code&gt;brief_summary&lt;/code&gt;, &lt;code&gt;primary_success&lt;/code&gt;, &lt;code&gt;primary_friction&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;This only runs when you type &lt;code&gt;/insights&lt;/code&gt;, not automatically. Facets are saved locally to &lt;code&gt;~/.claude/usage-data/session-meta/{session_id}.json&lt;/code&gt; and are not sent anywhere. There are no &lt;code&gt;tengu_facet*&lt;/code&gt; or &lt;code&gt;tengu_insights*&lt;/code&gt; events in the codebase. The data stays on your machine.&lt;/p&gt;

&lt;h2&gt;40 hidden LLM calls you never asked for&lt;/h2&gt;

&lt;p&gt;Besides the main model, Claude Code has 40 different types of background LLM calls — mostly to &lt;code&gt;claude-haiku-4-5&lt;/code&gt; — for things like extracting bash command prefixes, generating terminal titles, compressing context, and auto-extracting memories. Which ones fire depends on what you're doing. Not tracking per se, but your content goes to Anthropic's API either way.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;#&lt;/th&gt;
&lt;th&gt;What&lt;/th&gt;
&lt;th&gt;What it sends to Haiku&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;Bash prefix extraction&lt;/td&gt;
&lt;td&gt;Your full bash command&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;Tool use summary (status bar)&lt;/td&gt;
&lt;td&gt;Tool inputs/outputs (300 chars)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;Web fetch processing&lt;/td&gt;
&lt;td&gt;Web page content (up to 100K chars)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;Worktree title generation&lt;/td&gt;
&lt;td&gt;Task description&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;Bug report formatting&lt;/td&gt;
&lt;td&gt;Your bug report text&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;Prompt suggestion&lt;/td&gt;
&lt;td&gt;Full conversation context&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;7&lt;/td&gt;
&lt;td&gt;Compact (context compression)&lt;/td&gt;
&lt;td&gt;Your full conversation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;8&lt;/td&gt;
&lt;td&gt;Side question (/btw)&lt;/td&gt;
&lt;td&gt;Your question&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;9&lt;/td&gt;
&lt;td&gt;Session memory&lt;/td&gt;
&lt;td&gt;Full conversation + MEMORY.md&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;Hook evaluation&lt;/td&gt;
&lt;td&gt;Conversation + hook condition&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;11&lt;/td&gt;
&lt;td&gt;Speculation (pre-computation)&lt;/td&gt;
&lt;td&gt;Full context. Ant-only (disabled for external users)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;12&lt;/td&gt;
&lt;td&gt;Magic docs generation&lt;/td&gt;
&lt;td&gt;File path + content&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;13&lt;/td&gt;
&lt;td&gt;Agent creation&lt;/td&gt;
&lt;td&gt;Agent description&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;14&lt;/td&gt;
&lt;td&gt;Agent summary&lt;/td&gt;
&lt;td&gt;Agent work results&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;15&lt;/td&gt;
&lt;td&gt;Custom agent&lt;/td&gt;
&lt;td&gt;Custom agent context&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;16&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Auto-dream&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Session transcript&lt;/strong&gt; — background memory consolidation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;17&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Auto-mode classifier&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Tool call + user messages only&lt;/strong&gt; — decides whether to auto-approve. Uses the &lt;em&gt;main model&lt;/em&gt;, not Haiku&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;18&lt;/td&gt;
&lt;td&gt;Auto-mode critique&lt;/td&gt;
&lt;td&gt;Auto-mode rules analysis&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;19&lt;/td&gt;
&lt;td&gt;Buddy companion&lt;/td&gt;
&lt;td&gt;Generates a virtual terminal pet (name, species, personality). temperature=1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;20&lt;/td&gt;
&lt;td&gt;Extract memories&lt;/td&gt;
&lt;td&gt;Full conversation — background auto-extraction&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;21&lt;/td&gt;
&lt;td&gt;Generate session title&lt;/td&gt;
&lt;td&gt;Your prompt text&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;22&lt;/td&gt;
&lt;td&gt;Hook agent&lt;/td&gt;
&lt;td&gt;Context + hook config (up to 50 turns)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;23&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Insights&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Multiple session transcripts&lt;/strong&gt; — facet extraction, report generation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;24&lt;/td&gt;
&lt;td&gt;MCP datetime parse&lt;/td&gt;
&lt;td&gt;Datetime string&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;25&lt;/td&gt;
&lt;td&gt;Memory directory relevance&lt;/td&gt;
&lt;td&gt;Memory metadata&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;26&lt;/td&gt;
&lt;td&gt;Model validation&lt;/td&gt;
&lt;td&gt;Model info&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;27&lt;/td&gt;
&lt;td&gt;Permission explainer&lt;/td&gt;
&lt;td&gt;Command + context&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;28&lt;/td&gt;
&lt;td&gt;Rename generation&lt;/td&gt;
&lt;td&gt;Context&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;29&lt;/td&gt;
&lt;td&gt;SDK&lt;/td&gt;
&lt;td&gt;SDK/programmatic API&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;30&lt;/td&gt;
&lt;td&gt;Session search&lt;/td&gt;
&lt;td&gt;Session metadata (titles, first 300 chars)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;31&lt;/td&gt;
&lt;td&gt;Skill improvement&lt;/td&gt;
&lt;td&gt;Skill data&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;32&lt;/td&gt;
&lt;td&gt;Web search&lt;/td&gt;
&lt;td&gt;Search query&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;33&lt;/td&gt;
&lt;td&gt;Away summary&lt;/td&gt;
&lt;td&gt;Last 30 messages + session memory — "while you were away" recap&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;34&lt;/td&gt;
&lt;td&gt;Chrome MCP&lt;/td&gt;
&lt;td&gt;Chrome bridge tool calls&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;35&lt;/td&gt;
&lt;td&gt;Fork agent&lt;/td&gt;
&lt;td&gt;Worktree agent context&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;36&lt;/td&gt;
&lt;td&gt;Session notes&lt;/td&gt;
&lt;td&gt;Session-level memory (separate from extract_memories)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;37&lt;/td&gt;
&lt;td&gt;REPL main thread&lt;/td&gt;
&lt;td&gt;Main REPL loop context&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;38&lt;/td&gt;
&lt;td&gt;Auto-mode critique (user rules)&lt;/td&gt;
&lt;td&gt;Validation of user-defined auto-mode rules&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;39&lt;/td&gt;
&lt;td&gt;Teleport title&lt;/td&gt;
&lt;td&gt;Teleport title generation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;40&lt;/td&gt;
&lt;td&gt;Rename&lt;/td&gt;
&lt;td&gt;Session rename context&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;A few of these are worth pausing on. Auto-dream runs in the background, reads your session transcripts, and synthesizes durable memories through four phases: Orient → Review → Consolidate → Housekeep. The auto-mode classifier is interesting for a different reason: it deliberately excludes model responses from the transcript it analyzes. A comment in the source reads &lt;em&gt;"assistant text is model-authored and could be crafted to influence the classifier's decision"&lt;/em&gt; — anti-prompt-injection by design. And yes, there's a side-call that generates a virtual terminal pet with a random personality.&lt;/p&gt;

&lt;p&gt;Some side-calls are restricted to Anthropic employees (&lt;code&gt;USER_TYPE === 'ant'&lt;/code&gt;): speculation (pre-computing responses with a copy-on-write filesystem overlay) and frustration-triggered transcript sharing. For external users, those code paths are replaced with no-ops.&lt;/p&gt;

&lt;p&gt;You can override the model with &lt;code&gt;ANTHROPIC_SMALL_FAST_MODEL&lt;/code&gt;, but you can't turn these calls off without losing the features they power.&lt;/p&gt;

&lt;h2&gt;The data flow&lt;/h2&gt;

&lt;p&gt;What happens when you type a prompt:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;You type a prompt
        |
        |-- regex QaK() --&amp;gt; is_negative: bool --------+
        |-- regex daK() --&amp;gt; is_keep_going: bool ------+
        |-- prompt length --&amp;gt; prompt_length -----------+
        |-- r_1(prompt) --&amp;gt; "&amp;lt;REDACTED&amp;gt;" (default) ---+
        |                                              |
        |   +------------------------------------------+
        |   |
        |   v
        |   tengu_input_prompt event
        |   |
        |   |-- OTEL 1P  --&amp;gt; api.anthropic.com/api/event_logging/batch
        |   +-- Datadog   --&amp;gt; datadoghq.com      [if flag on + whitelist]
        |
        |-- Anthropic API (main Claude request)
        |   |
        |   |-- LLM side-calls (Haiku): 40 calls
        |   |   |-- bash_extract_prefix
        |   |   |-- auto_mode (auto-approve, uses main model)
        |   |   |-- extract_memories (auto-memory)
        |   |   |-- auto_dream (memory consolidation)
        |   |   +-- ... 28 others
        |   |
        |   +-- Model response
        |
        |-- After session ends
        |   +-- Facet Extraction (LLM, local only)
        |       |-- goal, satisfaction, friction, outcome
        |       +-- saved to ~/.claude/usage-data/session-meta/
        |
        +-- Local storage
            |-- ~/.claude/projects/{cwd}/{session}.jsonl  (full transcript)
            |-- ~/.claude/telemetry/                        (retry queue)
            |-- ~/.claude/usage-data/facets/                (facet cache)
            +-- ~/.claude/debug/                            (debug logs)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Your prompt text is redacted by default in OTEL spans (replaced with &lt;code&gt;"&amp;lt;REDACTED&amp;gt;"&lt;/code&gt;). File paths are always hashed. If you set &lt;code&gt;OTEL_LOG_USER_PROMPTS=true&lt;/code&gt;, your full prompt text goes to the OTEL endpoint — off by default, but enterprise deployments might flip it. Same for &lt;code&gt;OTEL_LOG_TOOL_CONTENT=true&lt;/code&gt; (file contents, bash output, diffs).&lt;/p&gt;

&lt;h2&gt;What leaks (and what doesn't)&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Error messages&lt;/strong&gt; go through a sanitizer that maps known error types to safe messages and truncates unknown ones to 60 chars of class name only. Stack traces don't leave your machine. But validation errors can still contain up to 2,000 characters, and API errors are unlimited, so fragments of paths and commands can slip through.&lt;/p&gt;
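&lt;p&gt;In sketch form, the sanitizer behaves like this (the allowlist and limits here are illustrative, not the binary's actual tables):&lt;/p&gt;

```python
# Illustrative allowlist; the real mapping lives in the decompiled binary.
SAFE_MESSAGES = {"ConnectionError": "network error", "TimeoutError": "request timed out"}

def sanitize_error(exc: Exception) -> str:
    name = type(exc).__name__
    if name in SAFE_MESSAGES:
        return SAFE_MESSAGES[name]
    return name[:60]  # unknown errors: truncated class name only, no message

msg = sanitize_error(ValueError("stack trace with /home/me/.ssh paths"))
```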

&lt;p&gt;&lt;strong&gt;MCP server URLs&lt;/strong&gt; leak in cleartext. &lt;code&gt;mcpServerBaseUrl&lt;/code&gt; is spread into telemetry events without any allowlist check. If you connect to &lt;code&gt;https://internal-corp-api.company.com/mcp&lt;/code&gt;, that URL goes to OTEL. MCP tool &lt;em&gt;names&lt;/em&gt; get anonymized to &lt;code&gt;"mcp_tool"&lt;/code&gt;, but the server URL doesn't.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;ANTHROPIC_BASE_URL&lt;/code&gt;&lt;/strong&gt; also leaks in plaintext. If you use a custom proxy, the full URL goes into &lt;code&gt;tengu_api_query&lt;/code&gt;, &lt;code&gt;tengu_api_success&lt;/code&gt;, and &lt;code&gt;tengu_api_error&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Repo hash&lt;/strong&gt; — a field &lt;code&gt;rh&lt;/code&gt; sends SHA256[0:16] of your normalized git remote URL with events. Not the URL itself, but a deterministic hash that allows correlating all sessions on the same repo.&lt;/p&gt;
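&lt;p&gt;Reproducing the &lt;code&gt;rh&lt;/code&gt; field is straightforward; the normalization steps below (lowercase, strip &lt;code&gt;.git&lt;/code&gt;) are my guess at what "normalized" means:&lt;/p&gt;

```python
import hashlib

def repo_hash(remote_url: str) -> str:
    """SHA256[0:16] of a normalized git remote URL."""
    normalized = remote_url.strip().lower().removesuffix(".git")
    return hashlib.sha256(normalized.encode()).hexdigest()[:16]

# Both spellings of the same remote collapse to one fingerprint:
h1 = repo_hash("https://github.com/r-ms/n8n-mcp.git")
h2 = repo_hash("https://github.com/r-ms/n8n-mcp")
```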

&lt;p&gt;&lt;strong&gt;MCP proxy for claude.ai connectors&lt;/strong&gt; — if you connect Gmail, Google Calendar, Slack, etc. through claude.ai, all tool call inputs and outputs route through &lt;code&gt;mcp-proxy.anthropic.com&lt;/code&gt;. Anthropic sees the contents of your emails, calendar entries, Slack messages going through those connectors. This only applies to the &lt;code&gt;claudeai-proxy&lt;/code&gt; type; stdio/sse/http MCP servers connect directly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Team memory&lt;/strong&gt; syncs automatically when files change. Pushes full file contents to &lt;code&gt;api.anthropic.com/api/claude_code/team_memory&lt;/code&gt;. Files containing secrets are skipped (regex filter), max 250KB per file, max 200 files. Disable with &lt;code&gt;CLAUDE_CODE_DISABLE_AUTO_MEMORY=1&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Session transcripts&lt;/strong&gt; are only shared if you explicitly consent. Four gates: give feedback → probability check → explicit dialog asking "Can Anthropic look at your transcript?" → click "Yes". You can permanently dismiss it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Grove opt-out doesn't affect tracking.&lt;/strong&gt; The privacy toggle in &lt;code&gt;/privacy-settings&lt;/code&gt; only controls whether your data is used for model training. Tracking runs the same either way.&lt;/p&gt;

&lt;h2&gt;
  
  
  What you can and can't turn off
&lt;/h2&gt;

&lt;p&gt;One environment variable kills almost everything:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;CLAUDE_CODE_DISABLE_NONESSENTIAL_TRAFFIC&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This disables GrowthBook, Datadog, OTEL, auto-updates, and connectivity checks. To set it persistently:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;claude config &lt;span class="nb"&gt;set&lt;/span&gt; &lt;span class="nt"&gt;--global&lt;/span&gt; env.CLAUDE_CODE_DISABLE_NONESSENTIAL_TRAFFIC 1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;DO_NOT_TRACK=1&lt;/code&gt; — the standard convention — is completely ignored. Zero references in the source.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Env var&lt;/th&gt;
&lt;th&gt;GrowthBook&lt;/th&gt;
&lt;th&gt;Datadog&lt;/th&gt;
&lt;th&gt;OTEL 1P&lt;/th&gt;
&lt;th&gt;Metrics&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;CLAUDE_CODE_DISABLE_NONESSENTIAL_TRAFFIC=1&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Off&lt;/td&gt;
&lt;td&gt;Off&lt;/td&gt;
&lt;td&gt;Off&lt;/td&gt;
&lt;td&gt;Off&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;DISABLE_TELEMETRY=1&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Off&lt;/td&gt;
&lt;td&gt;Off&lt;/td&gt;
&lt;td&gt;Off&lt;/td&gt;
&lt;td&gt;Off&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;DO_NOT_TRACK=1&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Ignored&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Ignored&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Ignored&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Ignored&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;For more granular control:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;CLAUDE_CODE_DISABLE_AUTO_MEMORY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;1      &lt;span class="c"&gt;# stop auto-memory extraction&lt;/span&gt;
&lt;span class="nv"&gt;CLAUDE_CODE_DISABLE_TERMINAL_TITLE&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;1   &lt;span class="c"&gt;# stop LLM title generation&lt;/span&gt;
&lt;span class="nv"&gt;CLAUDE_CODE_DISABLE_BACKGROUND_TASKS&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;1 &lt;span class="c"&gt;# stop background tasks&lt;/span&gt;
&lt;span class="nv"&gt;CLAUDE_CODE_DISABLE_CRON&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;1             &lt;span class="c"&gt;# stop scheduled tasks&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;What you cannot turn off:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;API calls to Claude. The product itself. Anthropic logs requests server-side.&lt;/li&gt;
&lt;li&gt;40 LLM side-calls. Features, not tracking. Your content goes to Anthropic's API.&lt;/li&gt;
&lt;li&gt;Facet extraction. LLM analysis of sessions. Data stays local.&lt;/li&gt;
&lt;li&gt;Auto-dream. Background memory consolidation. Only numbers leave your machine (hours_since, sessions_reviewed), not your content.&lt;/li&gt;
&lt;li&gt;Remote session events. Full message content when using Claude Code remotely.&lt;/li&gt;
&lt;li&gt;WebFetch domain check. The domain name is sent to &lt;code&gt;api.anthropic.com/api/web/domain_info&lt;/code&gt;. The one partial exception on this list: disable it with &lt;code&gt;skipWebFetchPreflight&lt;/code&gt; in config.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  The remote flag problem
&lt;/h2&gt;

&lt;p&gt;This was the most uncomfortable finding. Anthropic can remotely enable enhanced tracking through a GrowthBook feature flag:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;XQ1&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nx"&gt;q&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;process&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;env&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;CLAUDE_CODE_ENHANCED_TELEMETRY_BETA&lt;/span&gt; 
       &lt;span class="o"&gt;??&lt;/span&gt; &lt;span class="nx"&gt;process&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;env&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;ENABLE_ENHANCED_TELEMETRY_BETA&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;Q6&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;q&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;    &lt;span class="c1"&gt;// env var ON → enabled&lt;/span&gt;
  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;A_&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;q&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;   &lt;span class="c1"&gt;// env var OFF → disabled&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;u8&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;enhanced_telemetry_beta&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;  &lt;span class="c1"&gt;// ← REMOTE FLAG&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you haven't explicitly set the env var, the decision falls through to a remote flag. Anthropic could flip this on for any user or cohort through GrowthBook targeting. In practice, &lt;code&gt;DISABLE_TELEMETRY=1&lt;/code&gt; blocks all backends so the data wouldn't go anywhere. But for enterprise/team setups with their own OTEL infrastructure, this is a real consideration.&lt;/p&gt;

&lt;p&gt;Other things GrowthBook can change remotely: enable Datadog for your account, change event sampling rates to 100%, adjust batch parameters. It cannot remotely enable &lt;code&gt;OTEL_LOG_USER_PROMPTS&lt;/code&gt; (your actual prompt text); that is strictly env-var controlled.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I think about all this
&lt;/h2&gt;

&lt;p&gt;I've spent a career building product analytics.&lt;/p&gt;

&lt;p&gt;The architecture is clean. Three of four tracking endpoints are Anthropic's own. Datadog is the only third party, and it's flagged off by default. Prompts redacted. File paths hashed. Content logging opt-in. Transcript sharing behind four consent gates.&lt;/p&gt;

&lt;p&gt;The source code confirms they take this seriously at the engineering level, not just the policy level. The TypeScript type system enforces PII safety at compile time — &lt;code&gt;LogEventMetadata&lt;/code&gt; only accepts &lt;code&gt;boolean | number | undefined&lt;/code&gt;, and adding a string requires an explicit cast through a type named &lt;code&gt;AnalyticsMetadata_I_VERIFIED_THIS_IS_NOT_CODE_OR_FILEPATHS&lt;/code&gt;. Plugin names go into restricted &lt;code&gt;_PROTO_*&lt;/code&gt; BigQuery columns that get stripped before forwarding to Datadog. The team memory secret scanner has 30+ gitleaks-based regex patterns. A source code comment in &lt;code&gt;sink.ts&lt;/code&gt; reads: &lt;em&gt;"With Segment removed the two remaining sinks are fire-and-forget."&lt;/em&gt; They're actively simplifying.&lt;/p&gt;
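
&lt;p&gt;The same idea, sketched in Python for readers who don't live in TypeScript — a runtime check rather than a compile-time one, and the names are mine, but the shape matches the pattern described above:&lt;/p&gt;

```python
# Metadata values restricted to types that cannot smuggle prompts,
# code, or file paths. In the real TypeScript this is enforced by the
# type system at compile time; here it's a runtime sketch.
ALLOWED = (bool, int, float, type(None))

def safe_metadata(**fields):
    for key, value in fields.items():
        if not isinstance(value, ALLOWED):
            raise TypeError(
                f"{key}: string values require an explicit "
                "'I verified this is not code or filepaths' cast"
            )
    return fields

print(safe_metadata(duration_ms=412, success=True))
# safe_metadata(path="/home/me/project")  would raise TypeError
```

&lt;p&gt;The point of the loudly-named escape hatch is social, not technical: nobody casually types &lt;code&gt;AnalyticsMetadata_I_VERIFIED_THIS_IS_NOT_CODE_OR_FILEPATHS&lt;/code&gt; without thinking about what they're sending.&lt;/p&gt;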

&lt;p&gt;What bothers me:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;MCP server URLs and &lt;code&gt;ANTHROPIC_BASE_URL&lt;/code&gt; leak in plaintext. Internal infrastructure ends up in Anthropic's pipeline.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;DO_NOT_TRACK=1&lt;/code&gt; is silently ignored. Either support the standard or say you don't.&lt;/li&gt;
&lt;li&gt;The remote flag for enhanced tracking can change what gets collected without asking. Make it env-var-only.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;838 event types, 40 background LLM calls, and a remote flag — all in a tool that has full access to your source code. The tracking itself is well-designed: prompts redacted, file paths hashed, session analysis stays local, the kill switch works. But that's a lot of metadata about how you work, when you work, and how your team collaborates. I'd want to know about that. Now you do.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Add to ~/.zshrc if you want to opt out:&lt;/span&gt;
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;CLAUDE_CODE_DISABLE_NONESSENTIAL_TRAFFIC&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;p&gt;&lt;em&gt;Full technical report with all 838 event names and source code references: &lt;a href="https://github.com/r-ms/blog/blob/main/research/claude-code-telemetry-report-en.md" rel="noopener noreferrer"&gt;link&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>claude</category>
      <category>security</category>
      <category>privacy</category>
    </item>
    <item>
      <title>Airflow vs n8n: what's the difference in 2026?</title>
      <dc:creator>Michael Rakutko</dc:creator>
      <pubDate>Sat, 21 Mar 2026 15:26:42 +0000</pubDate>
      <link>https://dev.to/michael_rakutko/airflow-vs-n8n-what-to-choose-in-2026-11dd</link>
      <guid>https://dev.to/michael_rakutko/airflow-vs-n8n-what-to-choose-in-2026-11dd</guid>
      <description>&lt;p&gt;I can spin up an Airflow DAG with Claude Code in the same time it takes me to build an n8n workflow. Describe what I want in English, get working Python in minutes, deploy, done.&lt;/p&gt;

&lt;p&gt;So why do I still use both?&lt;/p&gt;

&lt;p&gt;Because the "visual vs code" framing is dead. AI killed it. The real question in 2026 is what each tool gives you &lt;em&gt;after&lt;/em&gt; the workflow is built — in production, at 3 AM, when Slack has silently changed its rate limits and your pipeline is on fire.&lt;/p&gt;

&lt;h2&gt;
  
  
  The tools, briefly
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;n8n&lt;/strong&gt; is open-source workflow automation with a visual canvas. 500+ integrations, self-hosted or cloud. In 2025, n8n pivoted hard into AI: LangChain nodes, MCP support, AI agent builder, human-in-the-loop approvals. The market responded — $2.5B valuation, 180K GitHub stars, 700K+ developers, 75% of customers using AI features. It's no longer "that Zapier alternative." It's a platform.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Apache Airflow&lt;/strong&gt; is code-first DAG orchestration in Python. The de facto standard for data engineering. Kubernetes executor, Celery workers, battle-tested at companies running millions of DAG executions daily. If your data team exists, they're probably using Airflow.&lt;/p&gt;

&lt;h2&gt;
  
  
  The 2026 twist: AI coding changed the equation
&lt;/h2&gt;

&lt;p&gt;In 2024, the comparison was simple:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Can your team write Python? → Airflow. Can't? → n8n.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;In 2026, that logic collapsed. Claude Code, Cursor, GitHub Copilot — they write Python for you. Here's my actual workflow: I open a terminal, describe a pipeline in plain English, and get deployable code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Create an Airflow DAG that runs daily at 6 AM UTC.
Pull new rows from our Postgres orders table since yesterday,
calculate revenue per region,
load into BigQuery,
send a Slack summary.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Three minutes later:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;airflow.decorators&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;dag&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;task&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;timedelta&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;airflow.providers.postgres.hooks.postgres&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;PostgresHook&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;airflow.providers.google.cloud.hooks.bigquery&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;BigQueryHook&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;airflow.providers.slack.hooks.slack_webhook&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;SlackWebhookHook&lt;/span&gt;

&lt;span class="nd"&gt;@dag&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;schedule&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;0 6 * * *&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;start_date&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nf"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2026&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;catchup&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;default_args&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;retries&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;retry_delay&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;timedelta&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;minutes&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;)},&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;daily_revenue_pipeline&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;

    &lt;span class="nd"&gt;@task&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;extract_orders&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
        &lt;span class="n"&gt;hook&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;PostgresHook&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;orders_db&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;hook&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_pandas_df&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;SELECT region, amount FROM orders &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;WHERE created_at &amp;gt;= NOW() - INTERVAL &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;1 day&lt;/span&gt;&lt;span class="sh"&gt;'"&lt;/span&gt;
        &lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;to_dict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;orient&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;records&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="nd"&gt;@task&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;transform&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;rows&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;collections&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;defaultdict&lt;/span&gt;
        &lt;span class="n"&gt;totals&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;defaultdict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;rows&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;totals&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;region&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]]&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;amount&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;totals&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="nd"&gt;@task&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;load_to_bigquery&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;revenue&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="nc"&gt;BigQueryHook&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;bigquery_conn&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;insert_rows&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;analytics.daily_revenue&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;region&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;revenue&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;date&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;today&lt;/span&gt;&lt;span class="p"&gt;())}&lt;/span&gt;
             &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;revenue&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;items&lt;/span&gt;&lt;span class="p"&gt;()])&lt;/span&gt;

    &lt;span class="nd"&gt;@task&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;notify_slack&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;revenue&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;summary&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;  &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;: $&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;,.&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;revenue&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;items&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
        &lt;span class="nc"&gt;SlackWebhookHook&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;slack_conn&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;send&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Daily revenue:&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;summary&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;extract_orders&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;transform&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;load_to_bigquery&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;notify_slack&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nf"&gt;daily_revenue_pipeline&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's real, deployable code. AI wrote it in minutes.&lt;/p&gt;

&lt;p&gt;But here's what I've learned running both tools in production: &lt;strong&gt;writing the code was never the hard part. Maintaining it was.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;A &lt;a href="https://metr.org/blog/2025-07-10-early-2025-ai-experienced-os-dev-study/" rel="noopener noreferrer"&gt;METR study&lt;/a&gt; found that experienced developers using AI tools actually took &lt;strong&gt;19% longer&lt;/strong&gt; on real-world tasks — despite believing they were faster. The bottleneck isn't writing code. It's everything around the code.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where n8n wins
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Managed auth for 500+ APIs
&lt;/h3&gt;

&lt;p&gt;This is n8n's deepest moat, and most people underestimate it.&lt;/p&gt;

&lt;p&gt;Every API has quirks. Slack requires bot scopes and socket mode tokens. Google expires refresh tokens after 7 days for apps in "testing" mode. Salesforce routes requests to instance-specific URLs. HubSpot deprecated API keys entirely in 2022, breaking thousands of integrations overnight.&lt;/p&gt;

&lt;p&gt;n8n handles all of this. Click "Connect," authenticate via OAuth, done. Token refresh, retry-on-401, scope management — built in for 500+ services.&lt;/p&gt;

&lt;p&gt;Claude Code generates a generic OAuth flow. It works on day one. It breaks on day eight when Google revokes your token. In my experience, maintaining auth logic for even 5 SaaS APIs is a part-time job.&lt;/p&gt;
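
&lt;p&gt;Here's what "maintaining auth logic" actually means. The endpoint and field names below are hypothetical, and the refresh itself is stubbed — because the stub is exactly where the per-provider pain lives:&lt;/p&gt;

```python
import time

# Sketch of the bookkeeping a hand-rolled OAuth integration needs.
# n8n's credential manager does all of this per provider.
class TokenManager:
    def __init__(self, access_token: str, expires_in: int):
        self.access_token = access_token
        self.expires_at = time.time() + expires_in

    def needs_refresh(self, margin: int = 60) -> bool:
        # Refresh slightly early so in-flight requests don't 401.
        return time.time() + margin >= self.expires_at

    def refresh(self):
        # Placeholder: POST the refresh_token to the provider's token
        # endpoint, handle invalid_grant (token revoked -> full re-auth),
        # persist the new pair atomically. Every provider differs here,
        # and this is the code that breaks on day eight.
        raise NotImplementedError

mgr = TokenManager("ya29.fake-token", expires_in=3600)
print(mgr.needs_refresh())  # False: token is fresh for an hour
```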

&lt;h3&gt;
  
  
  2. Visual debugging in production
&lt;/h3&gt;

&lt;p&gt;When step 7 of a 15-step workflow fails in n8n, you open the execution, see the exact node that failed, inspect the input data, inspect the output, and retry that single step. No redeployment. No re-running the entire pipeline.&lt;/p&gt;

&lt;p&gt;With Airflow: check the scheduler logs, find the task instance, read the log output, maybe add debug logging, commit, push, wait for the scheduler to pick up the new DAG, trigger a manual run, check logs again. It works — but it's 15 minutes where n8n takes 30 seconds.&lt;/p&gt;

&lt;p&gt;For engineers, this is acceptable overhead. For anyone else, it's a wall.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. The maintenance argument
&lt;/h3&gt;

&lt;p&gt;This is the one nobody talks about.&lt;/p&gt;

&lt;p&gt;AI writes code fast. But after the code exists, someone needs to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Deploy it&lt;/strong&gt; — to a server, with a scheduler, with health checks&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Monitor it&lt;/strong&gt; — set up alerting for failures&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Manage secrets&lt;/strong&gt; — store API keys, rotate credentials&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Update dependencies&lt;/strong&gt; — when a library releases a breaking change&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Fix it at 3 AM&lt;/strong&gt; — when the upstream API changed their response format&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;n8n abstracts all of this into the platform. Slack changed their API? n8n updates the node — your workflow keeps running. OAuth token expired? n8n rotates via credential manager. Workflow failed? Visual retry with one click.&lt;/p&gt;

&lt;p&gt;With AI-generated code, every single one of these is your problem.&lt;/p&gt;

&lt;p&gt;The analogy I keep coming back to: AI writes you a Dockerfile, but n8n is Heroku. Both get your app running. Only one of them handles ops.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. AI agent orchestration
&lt;/h3&gt;

&lt;p&gt;n8n's biggest bet — and it's paying off. In 2025-2026, n8n shipped:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;LangChain nodes&lt;/strong&gt; — connect any LLM as a first-class workflow step&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tool nodes&lt;/strong&gt; — any n8n workflow becomes a callable tool for an AI agent&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Human-in-the-loop&lt;/strong&gt; — pause execution, wait for human approval, resume&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Guardrails&lt;/strong&gt; — jailbreak detection, NSFW filtering, custom rules&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;MCP support&lt;/strong&gt; — the emerging standard for AI-tool integration&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Building an AI agent that reads emails, classifies intent, drafts a response, asks a human for approval, then sends — that's a 20-minute drag-and-drop job in n8n.&lt;/p&gt;

&lt;p&gt;In Airflow, you'd write custom operators, manage conversation state via XCom, and build your own approval mechanism. Possible? Yes. Worth it? Almost never.&lt;/p&gt;

&lt;h3&gt;
  
  
  5. Time-to-first-workflow
&lt;/h3&gt;

&lt;p&gt;10 minutes from signup to a working, deployed workflow. That's n8n cloud. Self-hosted: a single &lt;code&gt;docker run&lt;/code&gt; command.&lt;/p&gt;

&lt;p&gt;Airflow: install, configure metadata DB, set up connections, write a DAG file, place it in the dags folder, wait for the scheduler to parse it, test, fix, redeploy. Even with AI writing the code, the infrastructure overhead is real.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where Airflow wins
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Unlimited customization
&lt;/h3&gt;

&lt;p&gt;When you hit n8n's ceiling — and for complex data transforms, you will — there's no clean escape hatch. You can write JavaScript in a Function node or Python in a Code node, but you're still inside n8n's execution model.&lt;/p&gt;

&lt;p&gt;What I've found is that workflows that start simple in n8n tend to accumulate Code nodes until 60% of the logic is hand-written JavaScript. At that point, you've lost the visual advantage and you'd be better off in Airflow where the entire thing is code you can test, lint, and version-control properly.&lt;/p&gt;

&lt;p&gt;Airflow is Python. Custom operators, dynamic DAG generation, conditional branching, complex dependency graphs — no ceiling.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Scale
&lt;/h3&gt;

&lt;p&gt;n8n is a Node.js process. Even in queue mode with multiple workers, there's a limit. For TB-scale ETL, thousands of concurrent tasks, or long-running compute jobs — Airflow with the Kubernetes executor spins up isolated pods per task:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Process 1000 files in parallel, each in its own K8s pod
&lt;/span&gt;&lt;span class="nd"&gt;@task&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;executor_config&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;KubernetesExecutor&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;request_memory&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2Gi&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;request_cpu&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;}})&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;process_file&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;file_path&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="c1"&gt;# Heavy processing — isolated pod, dedicated resources
&lt;/span&gt;    &lt;span class="bp"&gt;...&lt;/span&gt;

&lt;span class="nd"&gt;@dag&lt;/span&gt;&lt;span class="p"&gt;(...)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;batch_pipeline&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;files&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;list_files&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;process_file&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;expand&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;file_path&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;files&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# Dynamic task mapping
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;n8n can't do this. If your pipeline processes terabytes, Airflow is the only serious option.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Data engineering ecosystem
&lt;/h3&gt;

&lt;p&gt;dbt, Spark, BigQuery, Snowflake, Databricks, Great Expectations — all have first-class Airflow providers. The &lt;code&gt;apache-airflow-providers-*&lt;/code&gt; ecosystem has 80+ packages.&lt;/p&gt;

&lt;p&gt;n8n has basic database nodes, but if your pipeline involves dbt model runs → data quality checks → Spark jobs → warehouse loading — Airflow is where that ecosystem lives.&lt;/p&gt;
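&lt;p&gt;A minimal sketch of what that chain looks like as a DAG, assuming the dbt Cloud and Spark provider packages are installed; the job ID, file path, and schedule are illustrative, so check the provider docs for your versions:&lt;/p&gt;

```python
# Illustrative only: a DAG chaining provider operators. job_id and the
# Spark application path are placeholders, not real resources.
from datetime import datetime

from airflow.decorators import dag
from airflow.models.baseoperator import chain
from airflow.providers.dbt.cloud.operators.dbt import DbtCloudRunJobOperator
from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator

@dag(schedule="@daily", start_date=datetime(2026, 1, 1), catchup=False)
def warehouse_pipeline():
    run_models = DbtCloudRunJobOperator(task_id="dbt_run", job_id=1234)
    score = SparkSubmitOperator(task_id="spark_score", application="/jobs/score.py")
    chain(run_models, score)  # dbt models first, then the Spark job

warehouse_pipeline()
```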

&lt;h3&gt;
  
  
  4. Production-grade reliability
&lt;/h3&gt;

&lt;p&gt;SLAs with automatic alerting. Task-level retries with configurable exponential backoff. Sensor patterns that wait for external conditions. XCom for cross-task data passing. Pool-based concurrency limits. Priority weights for scheduling.&lt;/p&gt;

&lt;p&gt;These matter when you're running hundreds of DAGs and need to explain to your VP of Finance exactly why Tuesday's revenue numbers were 4 hours late.&lt;/p&gt;
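&lt;p&gt;For a sense of how the retry settings behave: Airflow configures them declaratively on the task (the kwargs in the comment are the actual knobs), and the function below just computes the resulting wait schedule. Airflow's real backoff also adds jitter; this is the idealized shape:&lt;/p&gt;

```python
# Airflow configures retries on the task itself, roughly:
#   @task(retries=4, retry_delay=timedelta(seconds=30),
#         retry_exponential_backoff=True, max_retry_delay=timedelta(minutes=10))
# The idealized schedule of waits that produces:

def backoff_schedule(retries, base_s, cap_s):
    """Delay before each retry: base * 2**attempt, capped at cap_s."""
    return [min(base_s * 2 ** attempt, cap_s) for attempt in range(retries)]

print(backoff_schedule(4, 30, 600))  # [30, 60, 120, 240]
```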

&lt;h3&gt;
  
  
  5. No vendor lock-in
&lt;/h3&gt;

&lt;p&gt;Airflow DAGs are &lt;code&gt;.py&lt;/code&gt; files. Move them to any Airflow instance — self-hosted, Google Cloud Composer, AWS MWAA, Astronomer. Or strip out the decorators and run the logic as plain Python.&lt;/p&gt;

&lt;p&gt;n8n workflows are JSON tied to n8n's runtime. Exportable, sure. Portable? Only to another n8n instance.&lt;/p&gt;

&lt;h2&gt;
  
  
  The decision matrix
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Use case&lt;/th&gt;
&lt;th&gt;Choose&lt;/th&gt;
&lt;th&gt;Why&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Connect SaaS tools (Slack + Sheets + CRM)&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;n8n&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;500+ managed connectors with OAuth&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ETL pipeline (extract → transform → load)&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Airflow&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Python flexibility, scale, ecosystem&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;AI agent with human-in-the-loop&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;n8n&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Visual agent builder, guardrails, MCP&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ML pipeline (train → evaluate → deploy)&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Airflow&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Native Python, GPU scheduling, K8s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Business process automation&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;n8n&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Non-technical users, visual canvas&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Data quality monitoring&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Airflow&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Sensors, SLAs, Great Expectations&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Webhook-triggered actions&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;n8n&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Built-in webhook node, instant&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Batch processing at scale&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Airflow&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;K8s executor, dynamic task mapping&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Prototype / MVP&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;n8n&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;10 min to working workflow&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Mission-critical data pipeline&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Airflow&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Battle-tested, horizontal scaling&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  The plot twist: use both
&lt;/h2&gt;

&lt;p&gt;Here's what I actually run in production:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;n8n&lt;/strong&gt; handles event-driven work: SaaS integrations, AI agent chains, Slack bots, webhook receivers, anything that talks to external APIs with OAuth.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Airflow&lt;/strong&gt; handles data work: batch ETL, scheduled processing, anything that needs scale or touches the data warehouse.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;They connect trivially. n8n fires a webhook that triggers an Airflow DAG. Airflow calls n8n via HTTP when it needs to notify humans or interact with SaaS tools:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nd"&gt;@task&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;notify_via_n8n&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;
    &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://n8n.example.com/webhook/pipeline-complete&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;status&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;success&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rows_processed&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;count&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]},&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Not architecturally beautiful. But pragmatic — each tool does what it's best at.&lt;/p&gt;

&lt;h2&gt;
  
  
  What AI actually changes
&lt;/h2&gt;

&lt;p&gt;Let me be specific about what AI coding tools change in this equation:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What AI accelerates:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Writing DAG boilerplate (extract/transform/load patterns)&lt;/li&gt;
&lt;li&gt;Writing SQL transformations and dbt models&lt;/li&gt;
&lt;li&gt;Creating custom operators for new data sources&lt;/li&gt;
&lt;li&gt;Debugging failed tasks from log output&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;What AI doesn't help with:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Setting up infrastructure (servers, Docker, networking)&lt;/li&gt;
&lt;li&gt;Managing credentials and OAuth flows long-term&lt;/li&gt;
&lt;li&gt;Debugging intermittent production failures&lt;/li&gt;
&lt;li&gt;Tuning sensor timeouts when upstream data arrives late&lt;/li&gt;
&lt;li&gt;Capacity planning when your DAG count grows from 10 to 100&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;AI shrinks the &lt;em&gt;development&lt;/em&gt; cost of Airflow dramatically. But the &lt;em&gt;operational&lt;/em&gt; cost — the infra, the on-call, the credential rotation, the monitoring — stays the same.&lt;/p&gt;

&lt;p&gt;n8n's real value proposition in 2026 isn't "you don't need to code" (AI handles that). It's &lt;strong&gt;"you don't need to operate."&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The real question
&lt;/h2&gt;

&lt;p&gt;The question isn't "n8n or Airflow?" It's: &lt;strong&gt;who is operating this in production?&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Data engineer who lives in the terminal&lt;/strong&gt; → Airflow. You'll appreciate the control when you're debugging a sensor timeout at 3 AM.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Business user who needs automation&lt;/strong&gt; → n8n. They'll appreciate fixing things without filing a Jira ticket.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Developer prototyping an AI agent&lt;/strong&gt; → n8n first. Migrate to code if it outgrows the platform.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Team with mixed technical skills&lt;/strong&gt; → Both. Engineers own Airflow, business users own n8n.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;n8n's CEO Jan Oberhauser put it well: &lt;em&gt;"n8n allows you to combine humans, AI, and code."&lt;/em&gt; Airflow gives you full code and full control. Both are right — for different problems, for different teams.&lt;/p&gt;

&lt;p&gt;AI didn't make n8n obsolete. It didn't make Airflow unnecessary. What it did is kill "we can't write code" as a reason to choose n8n, and sharpen &lt;strong&gt;"we don't want to operate code"&lt;/strong&gt; as the real reason.&lt;/p&gt;

&lt;p&gt;Before you choose, ask one question: &lt;strong&gt;will this workflow be maintained by an engineer or a business user?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;That single question answers which tool you need.&lt;/p&gt;

</description>
      <category>airflow</category>
      <category>n8n</category>
      <category>etl</category>
      <category>claude</category>
    </item>
    <item>
      <title>How we cut MCP context by 95% and stopped wasting the team's time</title>
      <dc:creator>Michael Rakutko</dc:creator>
      <pubDate>Wed, 18 Mar 2026 16:31:20 +0000</pubDate>
      <link>https://dev.to/michael_rakutko/how-we-cut-mcp-context-by-95-and-stopped-wasting-the-teams-time-7hg</link>
      <guid>https://dev.to/michael_rakutko/how-we-cut-mcp-context-by-95-and-stopped-wasting-the-teams-time-7hg</guid>
      <description>&lt;p&gt;We've all been there. Claude hits an MCP error, tries a different approach, hits another one, tries again — and eventually figures it out. You wait a minute, maybe two, it's fine.&lt;/p&gt;

&lt;p&gt;But here's the thing nobody talks about: when a whole team uses Claude Code daily, those minutes stack. One person watches the agent spin through three wrong SQL dialects before landing on the right one. Another waits while it retries the same failed tool call four times. Someone else loses the thread of a complex session because query results flooded the context. Multiply that by four people, every day, and you're not talking about a minor inconvenience anymore — you're talking about hours of engineering time burned while Claude figures out things it should have known from the start.&lt;/p&gt;

&lt;p&gt;I started tracking it. Two weeks, 23+ of these incidents across my analytics team. And the worst part? The agent wasn't being stupid. It was being sabotaged by the server we gave it.&lt;/p&gt;




&lt;p&gt;We were using the official &lt;code&gt;aws-dataprocessing&lt;/code&gt; MCP server from AWS Labs. Good project, well-maintained, 34 tools covering Glue, EMR, Athena, IAM, S3. We needed Athena. That's it. Five tools out of thirty-four.&lt;/p&gt;

&lt;p&gt;I should've noticed sooner.&lt;/p&gt;

&lt;p&gt;The thing is, it worked — sometimes. When it worked it was great. But when it didn't, it failed in the most demoralizing way possible: the agent would try something, get an error, try a variation, get a different error, try again, get the first error back. You'd watch it spin for ten minutes on something a junior analyst could fix in thirty seconds.&lt;/p&gt;

&lt;p&gt;After a while I started looking at &lt;em&gt;why&lt;/em&gt; it failed. Not the specific errors — the underlying reasons. And it turned out there were five of them, all happening at once.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;The first thing I noticed was the metadata blindspot.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Athena is strict about types. If your column is &lt;code&gt;varchar&lt;/code&gt; and you write a query treating it like a &lt;code&gt;date&lt;/code&gt;, you get an error. Simple. But Claude didn't know our column types — there was no mechanism to tell it upfront. So it guessed. And when it guessed wrong and got an error, it... guessed again. I watched it try three different &lt;code&gt;CAST&lt;/code&gt; approaches on the same column, each one wrong in a different way, before I just typed &lt;code&gt;event_date is STRING&lt;/code&gt; in the chat and it immediately fixed everything. The information was always there — we just never gave it to the agent.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The second thing was the context bill.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;34 tools × ~600 tokens each = roughly 20K tokens just for tool descriptions. Before the agent runs a single query, a fifth of its context window is gone. On a normal session that's annoying. With parallel sub-agents — which we use constantly — it's a disaster. Each sub-agent gets the full 34-tool payload. When someone on the team was running parallel agents on a complex analysis task, each one was starting with almost no usable context. Dozens of failures in a single session — and now it made sense.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Third: dialect hallucinations.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Claude kept writing &lt;code&gt;TIMESTAMP_SUB&lt;/code&gt;. That's BigQuery syntax. Athena runs Presto/Trino, which uses &lt;code&gt;DATE_ADD('day', -N, CURRENT_DATE)&lt;/code&gt;. Every single time someone ran a date filter, the agent defaulted to what it knew best. Because nothing in the tool description said "hey, this isn't BigQuery."&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fourth: the DROP VIEW trap.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The server has a built-in SQL analyzer that blocks write operations. Makes sense as a safety feature. Except it classified &lt;code&gt;DROP VIEW&lt;/code&gt; as a write operation and blocked it — even though we had &lt;code&gt;--allow-write&lt;/code&gt; enabled. Drop a view before recreating it, like you do in any dbt workflow, and you get a permissions error that makes zero sense. Eight attempts. Eight times the same wall.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fifth: the data dump problem.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;No default row limit on query results. &lt;code&gt;SELECT *&lt;/code&gt; on a large table returns 1,000 rows of JSON into context. One query. Multiply by parallel agents running simultaneously, and the useful context just disappears.&lt;/p&gt;




&lt;p&gt;So. 160 lines of Python later, here's what we changed — and why each thing actually matters.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;We cut 33 tools.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This sounds obvious but it isn't. You can't just disable tools in the config. We wrote a new entry point, &lt;code&gt;server_athena.py&lt;/code&gt;, that imports the original handler but only registers one tool instead of thirty-four. Same codebase, same logic, different surface area. Context overhead dropped from ~20K tokens to ~1K. That single change made parallel sub-agents viable again.&lt;/p&gt;
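&lt;p&gt;The shape of the fix, reduced to a toy (these names are stand-ins, not the actual &lt;code&gt;server_athena.py&lt;/code&gt;, which imports the AWS Labs handler instead):&lt;/p&gt;

```python
# Toy version of the pattern: reuse the upstream handlers unchanged,
# but register only the tools you need. Names here are invented.

UPSTREAM_TOOLS = {f"tool_{i:02d}": object() for i in range(34)}  # 34 upstream handlers
KEEP = {"tool_00"}  # e.g. just the Athena query tool

def build_registry(upstream, keep):
    """New entry point: same handler code, smaller registered surface."""
    return {name: fn for name, fn in upstream.items() if name in keep}

print(len(build_registry(UPSTREAM_TOOLS, KEEP)))  # 1
```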

&lt;p&gt;&lt;strong&gt;We merged three calls into one.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The previous flow was: &lt;code&gt;start-query-execution&lt;/code&gt; → poll &lt;code&gt;get-query-execution&lt;/code&gt; until it finishes → &lt;code&gt;get-query-results&lt;/code&gt;. Three tool calls minimum, each one a decision point where the agent could drift. We added an &lt;code&gt;execute-query&lt;/code&gt; operation that handles all of it internally: 500 ms polling, a 30-second timeout, results returned directly. For 95% of queries, it's one call. For slow queries, it returns the execution ID so you can check back.&lt;/p&gt;

&lt;p&gt;The agent stops forgetting what it was doing mid-query.&lt;/p&gt;
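&lt;p&gt;A simplified sketch of that merged flow, written against the boto3 Athena client calls (&lt;code&gt;start_query_execution&lt;/code&gt;, &lt;code&gt;get_query_execution&lt;/code&gt;, &lt;code&gt;get_query_results&lt;/code&gt;); the limits and names are illustrative:&lt;/p&gt;

```python
import time

# Sketch of the merged execute-query flow. The Athena client is passed in,
# so this works against a real boto3 client or any stub with the same calls.

def execute_query(athena, sql, output_s3, poll_s=0.5, max_polls=60):
    qid = athena.start_query_execution(
        QueryString=sql,
        ResultConfiguration={"OutputLocation": output_s3},
    )["QueryExecutionId"]
    for _ in range(max_polls):  # poll_s * max_polls ≈ the 30 s budget
        state = athena.get_query_execution(QueryExecutionId=qid)[
            "QueryExecution"]["Status"]["State"]
        if state == "SUCCEEDED":
            return athena.get_query_results(QueryExecutionId=qid)
        if state in ("FAILED", "CANCELLED"):
            raise RuntimeError(f"query {qid} ended in {state}")
        time.sleep(poll_s)
    return {"QueryExecutionId": qid}  # still running: hand back the ID
```

&lt;p&gt;Passing the client in keeps the polling logic testable without AWS credentials.&lt;/p&gt;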

&lt;p&gt;&lt;strong&gt;We preloaded the schema.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;At startup, the server scans all databases and caches every table, every column, every type. It also downloads our dbt manifest from S3 — so it knows which model owns which table, who's responsible for it, how fresh the data is supposed to be. This all gets baked into the server's system instructions before Claude sees anything.&lt;/p&gt;

&lt;p&gt;Then we added &lt;code&gt;get-all-schemas&lt;/code&gt; with two modes: a cheap compact mode that returns table names and descriptions (so the agent can orient), and a deep mode where you ask for specific tables and get full column types plus dbt lineage. The agent orients cheaply, drills when it needs to.&lt;/p&gt;

&lt;p&gt;No more guessing column types. The metadata is just there.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;We made errors actually helpful.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This was the biggest change in practice. The old behavior: query fails, agent gets a raw error string, guesses what went wrong. New behavior: the server parses the error type and returns the specific context needed to fix it.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;TABLE_NOT_FOUND      → full list of available tables
COLUMN_NOT_FOUND     → all columns for tables in the query
TYPE_MISMATCH        → column types with correct CAST suggestion
PARTITION_MISMATCH   → partition keys and their actual types
QUERY_EXHAUSTED      → hints: add LIMIT, use APPROX_DISTINCT
SYNTAX_ERROR         → "this is Presto, use DATE_ADD not DATE_SUB"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Most errors now resolve in one retry. The spiraling stopped almost immediately.&lt;/p&gt;
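&lt;p&gt;In code, the idea is a lookup from error class to the schema context that resolves it. A toy version, with an invented schema (the real server pulls this from its startup cache):&lt;/p&gt;

```python
# Toy error-to-context mapping; schema and payload shapes are invented.

SCHEMA = {"events": {"event_date": "varchar", "user_id": "bigint"}}

def enrich_error(error_type, table=None):
    if error_type == "TABLE_NOT_FOUND":
        return {"hint": "available tables", "tables": sorted(SCHEMA)}
    if error_type == "COLUMN_NOT_FOUND":
        return {"hint": f"columns of {table}", "columns": SCHEMA[table]}
    if error_type == "SYNTAX_ERROR":
        return {"hint": "Athena is Presto/Trino: use DATE_ADD('day', -7, CURRENT_DATE)"}
    return {"hint": "raw error", "context": None}

print(enrich_error("COLUMN_NOT_FOUND", table="events")["columns"])
# {'event_date': 'varchar', 'user_id': 'bigint'}
```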

&lt;p&gt;&lt;strong&gt;We compressed the output.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;JSON query results are verbose. We switched to &lt;a href="https://github.com/toon-format/toon" rel="noopener noreferrer"&gt;TOON (Token-Oriented Object Notation)&lt;/a&gt; — pipe-separated tabular format. Same data, 74% fewer tokens on our actual queries. Sounds like micro-optimization. At 50 queries deep into a complex analysis session, it's the difference between the agent remembering what it was asked and losing the thread.&lt;/p&gt;
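&lt;p&gt;The core idea is easy to demonstrate. This is not the TOON spec, just the underlying principle of emitting the header once and then one row per line:&lt;/p&gt;

```python
import json

# Illustration of why tabular beats JSON for result sets: the keys are
# repeated per row in JSON but appear once in the tabular form.
rows = [{"day": f"2026-03-{d:02d}", "users": 100 + d} for d in range(1, 8)]

as_json = json.dumps(rows)
header = list(rows[0])
as_table = "\n".join(["|".join(header)] +
                     ["|".join(str(r[k]) for k in header) for r in rows])

saved = 1 - len(as_table) / len(as_json)
print(f"{saved:.0%} fewer characters")
```

&lt;p&gt;The savings grow with row count and column-name length; the repeated keys dominate the JSON payload.&lt;/p&gt;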




&lt;p&gt;The results:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;Before&lt;/th&gt;
&lt;th&gt;After&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Tools in context&lt;/td&gt;
&lt;td&gt;34&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Context overhead&lt;/td&gt;
&lt;td&gt;~20K tokens&lt;/td&gt;
&lt;td&gt;~1K tokens&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Calls per query&lt;/td&gt;
&lt;td&gt;3+&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Default row limit&lt;/td&gt;
&lt;td&gt;none&lt;/td&gt;
&lt;td&gt;100&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Error response&lt;/td&gt;
&lt;td&gt;raw string&lt;/td&gt;
&lt;td&gt;schema + hint&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Result format&lt;/td&gt;
&lt;td&gt;JSON&lt;/td&gt;
&lt;td&gt;TOON (-74%)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The wasted cycles stopped. The team runs parallel agents on complex tasks without babysitting them. The agent gets it right on the first or second try.&lt;/p&gt;




&lt;p&gt;The thing I keep coming back to is that none of this required a smarter model. Claude was fine the whole time. We handed it a server built for maximum coverage — 34 tools, every AWS data service, zero assumptions about your use case — and expected it to perform like a specialist. AWS Labs built that server for everyone. We needed it to work for us.&lt;/p&gt;

&lt;p&gt;If your agent keeps spinning in circles, before blaming the model, check what you're loading into its context:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;How many tools does it see? Do you actually use all of them?&lt;/li&gt;
&lt;li&gt;What does it know about your data model before the first call?&lt;/li&gt;
&lt;li&gt;When it gets an error, does it get a hint or a wall?&lt;/li&gt;
&lt;li&gt;How big are the responses coming back?&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;We write MCP servers as access layers. But they're also the agent's cognitive environment. Design them badly and even a great model will look stupid.&lt;/p&gt;




&lt;p&gt;If you're just starting to think about MCP servers — what they're for, when they make sense, how to structure them — I wrote about that &lt;a href="https://dev.to/michael_rakutko/why-build-mcp-4-levels-of-adoption-from-api-access-to-company-wide-semantic-layer-im6"&gt;here&lt;/a&gt;. This post is what happens two levels deeper, when the theory meets production.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>mcp</category>
      <category>claudecode</category>
    </item>
    <item>
      <title>Why Build MCP? 4 Levels of Adoption — From API Access to Company-Wide Semantic Layer</title>
      <dc:creator>Michael Rakutko</dc:creator>
      <pubDate>Mon, 16 Feb 2026 18:10:19 +0000</pubDate>
      <link>https://dev.to/michael_rakutko/why-build-mcp-4-levels-of-adoption-from-api-access-to-company-wide-semantic-layer-im6</link>
      <guid>https://dev.to/michael_rakutko/why-build-mcp-4-levels-of-adoption-from-api-access-to-company-wide-semantic-layer-im6</guid>
      <description>&lt;p&gt;Our team builds a lot of MCPs — for ourselves and for external users. Over time, recurring patterns have emerged. Here are the key use cases we see over and over again, organized by complexity.&lt;/p&gt;

&lt;h2&gt;
  
  
  Level 0. Give the agent access to APIs
&lt;/h2&gt;

&lt;p&gt;The simplest and most obvious use case. You ask the agent: "analyze the Telegram channel @llm_under_hood, identify topics and popular posts" — it calls the Telegram API, fetches posts, calculates metrics, and returns the analysis.&lt;/p&gt;

&lt;h2&gt;
  
  
  Level 1. Automate routine by raising abstraction
&lt;/h2&gt;

&lt;p&gt;AI frequently makes mistakes: it forgets where servers and data live and fumbles syntax, even when everything is spelled out in context. MCP solves this by raising the abstraction level.&lt;/p&gt;

&lt;p&gt;For example, I have 3 MCP servers written for a specific project. Each is 200-300 lines of TypeScript:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;infra&lt;/strong&gt; — &lt;code&gt;vm_health&lt;/code&gt; generates a health report (12+ threshold alerts), &lt;code&gt;container_logs&lt;/code&gt; returns logs, &lt;code&gt;redis_query&lt;/code&gt; runs queries.&lt;/p&gt;

&lt;p&gt;Sure, the agent can compose a long SSH command on its own, but it fails every other time. With MCP we remove the cognitive load:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;// Without MCP: agent composes this and often gets it wrong
ssh user@server "docker exec redis redis-cli -a $PASS INFO memory | grep used_memory_human"

// With MCP: one tool call
redis_query({ server: "audioserver", command: "INFO memory" })
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;deps&lt;/strong&gt; — &lt;code&gt;dep_versions&lt;/code&gt; across 5 repositories, &lt;code&gt;tag_api_types&lt;/code&gt;, &lt;code&gt;update_consumer&lt;/code&gt;. Checking dependency versions, syncing API types between services — scripted and automatic.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;s3&lt;/strong&gt; — S3 navigation: &lt;code&gt;s3_org_tree&lt;/code&gt;, &lt;code&gt;s3_device_files&lt;/code&gt;, &lt;code&gt;s3_cat&lt;/code&gt;. Instead of &lt;code&gt;aws s3 ls&lt;/code&gt; with endless paths — "show files for device X from yesterday".&lt;/p&gt;

&lt;h2&gt;
  
  
  Level 2. Semantic layer for data
&lt;/h2&gt;

&lt;p&gt;An MCP server can wrap more than an API: it can expose a whole semantic layer. The data is already prepared and labeled, so the agent doesn't need to know the database schema; it operates on business concepts.&lt;/p&gt;

&lt;p&gt;Yes, you can connect an MCP for GA4. But how do you account for all the custom tagging rules and complex logic of merging data from different sources?&lt;/p&gt;

&lt;p&gt;That's what ETL is for — it handles the processing. The MCP server wraps the result as a semantic layer, and then anyone in the company can ask:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;"show traffic insights for yesterday"&lt;/li&gt;
&lt;li&gt;"which ASNs should we block?"&lt;/li&gt;
&lt;li&gt;"which users generated the most revenue?"&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The agent doesn't need to know table names, join logic, or filtering rules. The MCP server encapsulates all of that.&lt;/p&gt;

&lt;p&gt;This changes who can use the tool. An analyst builds the semantic layer once — then the entire team uses it, including managers who don't know SQL.&lt;/p&gt;
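&lt;p&gt;A toy example of what such a tool looks like under the hood; the table, columns, and metric here are invented to show the shape:&lt;/p&gt;

```python
from datetime import date, timedelta

# Hypothetical semantic-layer tool: one business question, compiled to the
# SQL the team agreed on. Table and column names are invented.

def traffic_insights(day=None):
    """'Show traffic insights for yesterday' as a single parameterized query."""
    day = day or (date.today() - timedelta(days=1)).isoformat()
    return (
        "SELECT channel, COUNT(DISTINCT session_id) AS sessions "
        "FROM analytics.sessions_enriched "
        f"WHERE event_date = DATE '{day}' "
        "GROUP BY channel ORDER BY sessions DESC"
    )
```

&lt;p&gt;The join logic, tagging rules, and filters live in the function, not in the agent's head.&lt;/p&gt;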

&lt;h2&gt;
  
  
  Level 3. Shared authorization and access control
&lt;/h2&gt;

&lt;p&gt;One MCP server can serve the entire company.&lt;/p&gt;

&lt;p&gt;Example: Google Search Console. Instead of handing out credentials to everyone — one internal OAuth. Connect to the MCP server, authenticate via corporate SSO, get access based on your role.&lt;/p&gt;

&lt;p&gt;Or an MCP that gives some people access to yesterday's revenue while hiding it from others. Role-based access at the tool level.&lt;/p&gt;

&lt;p&gt;This is already the industry standard. Sentry, Stripe, GitHub, Atlassian — all offer remote MCP servers with OAuth. Zero-config for the user: add a URL, log in via browser, start working.&lt;/p&gt;

&lt;h2&gt;
  
  
  Building MCP servers: a skill with best practices
&lt;/h2&gt;

&lt;p&gt;We analyzed the source code and documentation of 50 production MCP servers from Stripe, Sentry, GitHub, Cloudflare, Supabase, Linear, Grafana, Playwright, AWS, Terraform, MongoDB, and others.&lt;/p&gt;

&lt;p&gt;Packaged it as a Claude Code skill — 23 sections covering:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Architecture&lt;/strong&gt;: transport choice (STDIO vs StreamableHTTP), deployment models, OAuth 2.1&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tool design&lt;/strong&gt;: naming conventions, writing descriptions for LLMs, managing tool count (1 to 1400+)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Implementation&lt;/strong&gt;: error handling, security, prompt injection protection, token optimization&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Operations&lt;/strong&gt;: debugging with MCP Inspector, LLM-based eval testing, Docker deployment&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Industry patterns&lt;/strong&gt;: top 35 patterns from production, pre-release checklist&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Drop it into your &lt;code&gt;.claude/skills/&lt;/code&gt; directory and run &lt;code&gt;/mcp-guide&lt;/code&gt;:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://gitlab.com/mskrx/skills/-/tree/main/claude/mcp-building-guide" rel="noopener noreferrer"&gt;MCP Building Guide Skill on GitLab&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The agent will use these best practices automatically when planning, developing, or reviewing MCP servers.&lt;/p&gt;

</description>
      <category>mcp</category>
      <category>claudecode</category>
      <category>ai</category>
    </item>
  </channel>
</rss>
