Pablo Nieto

Why we stopped using LLMs to parse CSVs: Building a Waterfall Architecture for AI Agents


Let's be honest: asking an LLM to parse a 10,000-row CSV or a complex financial PDF is a recipe for disaster.

If you've built AI agents recently, you know the "Data Tax" is real:

  1. Token Exhaustion: Sending raw text to a context window is like burning dollar bills to keep your house warm.
  2. Hallucinations: One misplaced comma in a financial trade history, and your $10,000 gain becomes a $100,000 error.
  3. Spatial Blindness: Standard libraries lose the visual structure of tables, leaving the AI guessing which value belongs to which header.

We decided to solve this by building ETL-D, a deterministic data middleware for the AI era.

The Waterfall Architecture (The "Secret Sauce")

We moved away from "LLM-first" parsing. Instead, we implemented a 3-Layer Waterfall that our MCP Server exposes to Claude (and any other agent):

Layer 1: Heuristic (0-shot Python)

We use strict Python logic (dateutil, regex, csv.Sniffer). If the pattern is known (like a standard bank statement or a CSV), we parse it instantly.
Hallucination Risk: 0%.

Layer 2: Semantic Routing

If headers are obfuscated or slightly non-standard, we use a lightweight semantic router to map columns to our strict Pydantic schemas.
Hallucination Risk: <1%.

Layer 3: LLM Fallback (The Last Resort)

Only when the data is pure, unstructured free-text noise do we trigger an LLM-powered extraction.
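The three layers compose into a simple dispatch loop: try each in order, and the first one that returns a result wins. A minimal sketch with trivial stand-in layers (the real predicates are obviously far more involved):

```python
from typing import Callable, Optional

# A layer takes raw text and returns parsed rows, or None to fall through
Layer = Callable[[str], Optional[list]]

def waterfall(raw: str, layers: list[Layer]) -> tuple[str, list]:
    """Try each layer in order; the first non-None result wins."""
    for layer in layers:
        result = layer(raw)
        if result is not None:
            return layer.__name__, result
    raise ValueError("all layers failed")

# Toy stand-ins for the three layers described above
def heuristic(raw):        # Layer 1: deterministic parse
    return [{"ok": True}] if raw.startswith("date,") else None
def semantic_router(raw):  # Layer 2: fuzzy header mapping
    return [{"ok": True}] if "," in raw else None
def llm_fallback(raw):     # Layer 3: last resort, always answers
    return [{"ok": True}]

layer_used, rows = waterfall("free text with no structure",
                             [heuristic, semantic_router, llm_fallback])
print(layer_used)  # llm_fallback
```

Ordering cheap, deterministic layers first means the expensive, probabilistic one only ever sees the data nothing else could handle.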

Moving to the Model Context Protocol (MCP)

To make this accessible to the world, we've just published the ETL-D MCP Server. By connecting this to Claude Desktop, you stop sending "noise" to the context window.

Instead of:

"Claude, here is 50 pages of raw text from a bank export. Please find the trades."

The flow is now:

  1. Claude identifies the file type.
  2. Claude routes the raw data to the etld-mcp-server tool.
  3. ETL-D "crushes" the data into a flattened, deterministic JSON array.
  4. Claude receives only the perfectly structured facts, ready for reasoning.
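The "crush" in step 3 can be pictured as deterministic flattening: nested export records become uniform, dot-keyed objects with a stable key order. A toy sketch (field names hypothetical, not ETL-D's actual output schema):

```python
import json

def crush(record: dict, prefix: str = "") -> dict:
    """Flatten nested dicts into dot-keyed scalars, in deterministic order."""
    flat = {}
    for key in sorted(record):          # sorted keys -> same output every run
        value = record[key]
        full = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(crush(value, full + "."))
        else:
            flat[full] = value
    return flat

raw = {"trade": {"symbol": "AAPL", "qty": 10}, "ts": "2024-02-01"}
print(json.dumps(crush(raw), sort_keys=True))
```

Flat, predictable keys are what make the result cheap for the model to reason over: no repeated headers, no layout to re-infer, just facts.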

The Result

We've seen an 85% reduction in token usage and 99.9% accuracy on legacy formats like Spanish Norma 43, SEC XBRL, and complex Broker Trade Histories.


🚀 Get Started (v3.2.x)

You can find ETL-D in the official Anthropic Registry or install it via NPM.

How are you handling messy data in your AI pipelines? Let's discuss in the comments!
