Why we stopped using LLMs to parse CSVs: Building a Waterfall Architecture for AI Agents
Let's be honest: asking an LLM to parse a 10,000-row CSV or a complex financial PDF is a recipe for disaster.
If you've built AI agents recently, you know the "Data Tax" is real:
- Token Exhaustion: Sending raw text to a context window is like burning dollar bills to keep your house warm.
- Hallucinations: One misplaced comma in a financial trade history, and your $10,000 gain becomes a $100,000 error.
- Spatial Blindness: Standard libraries lose the visual structure of tables, leaving the AI guessing which value belongs to which header.
We decided to solve this by building ETL-D, a deterministic data middleware for the AI era.
The Waterfall Architecture (The "Secret Sauce")
We moved away from "LLM-first" parsing. Instead, we implemented a 3-Layer Waterfall that our MCP Server exposes to Claude (and any other agent):
Layer 1: Heuristic (0-shot Python)
We use strict Python logic (dateutil, regex, csv.Sniffer). If the pattern is known (like a standard bank statement or a CSV), we parse it instantly.
Hallucination Risk: 0%.
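To make that concrete, here's a minimal sketch of what a Layer 1 parser can look like (illustrative only, not ETL-D's actual source): csv.Sniffer detects the dialect, dateutil normalizes dates, and a regex normalizes amounts. No model touches the data, so the output is fully deterministic.

```python
import csv
import io
import re
from dateutil import parser as date_parser

# Matches plain or thousands-separated decimal amounts, e.g. "1,234.56" or "-42".
AMOUNT_RE = re.compile(r"^-?(?:\d{1,3}(?:,\d{3})+|\d+)(?:\.\d+)?$")

def parse_csv_heuristically(raw: str) -> list[dict]:
    sample = raw[:4096]
    sniffer = csv.Sniffer()
    dialect = sniffer.sniff(sample)          # raises csv.Error if undetectable
    has_header = sniffer.has_header(sample)
    rows = list(csv.reader(io.StringIO(raw), dialect))
    headers = rows[0] if has_header else [f"col_{i}" for i in range(len(rows[0]))]
    records = []
    for row in rows[1:] if has_header else rows:
        record = dict(zip(headers, row))
        for key, value in record.items():
            if "date" in key.lower():
                # dateutil handles most real-world date spellings deterministically.
                record[key] = date_parser.parse(value).date().isoformat()
            elif AMOUNT_RE.match(value):
                record[key] = float(value.replace(",", ""))
        records.append(record)
    return records
```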
Layer 2: Semantic Routing
If headers are obfuscated or slightly non-standard, we use a lightweight semantic router to map columns to our strict Pydantic schemas.
Hallucination Risk: <1%.
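A toy version of that routing step, assuming a hypothetical Trade schema (the real router is semantic, e.g. embedding-based; fuzzy string matching stands in for it here):

```python
from difflib import get_close_matches
from pydantic import BaseModel

class Trade(BaseModel):  # hypothetical target schema
    trade_date: str
    ticker: str
    quantity: float
    amount: float

# Aliases seen in the wild, mapped to canonical schema fields.
ALIASES = {
    "trade_date": ["date", "fecha", "trade dt", "settlement date"],
    "ticker": ["symbol", "instrument", "security"],
    "quantity": ["qty", "shares", "units"],
    "amount": ["value", "importe", "total", "net amount"],
}

def route_column(header: str) -> str | None:
    """Map an obfuscated header to a canonical Trade field, or None."""
    header = header.strip().lower()
    for field, aliases in ALIASES.items():
        if header == field or get_close_matches(header, aliases, n=1, cutoff=0.8):
            return field
    return None  # unmapped column -> escalate to Layer 3

# A fully routed record is then validated strictly: Trade(**mapped_record)
```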
Layer 3: LLM Fallback (The Last Resort)
Only when the data is pure, unstructured free-text noise do we trigger an LLM-powered extraction.
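Putting the layers together, the dispatch is just a fall-through. This sketch reuses the functions from the snippets above and treats route_with_schema and extract_with_llm as hypothetical placeholders:

```python
def parse(raw: str) -> list[dict]:
    # Layer 1: deterministic parsing; csv.Error means the dialect was undetectable.
    try:
        return parse_csv_heuristically(raw)
    except (csv.Error, ValueError):
        pass
    # Layer 2: semantic routing of columns against the strict schema.
    records = route_with_schema(raw)   # hypothetical wrapper around route_column
    if records is not None:
        return records
    # Layer 3: last resort, the only place a model ever sees the raw data.
    return extract_with_llm(raw)       # hypothetical LLM extraction call
```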
Moving to the Model Context Protocol (MCP)
To make this accessible to the world, we've just published the ETL-D MCP Server. By connecting this to Claude Desktop, you stop sending "noise" to the context window.
Instead of:
"Claude, here is 50 pages of raw text from a bank export. Please find the trades."
The flow is now:
- Claude identifies the file type.
- Claude routes the raw data to the etld-mcp-server tool.
- ETL-D "crushes" the data into a flattened, deterministic JSON array.
- Claude receives only the perfectly structured facts, ready for reasoning.
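For a broker trade history, that final payload might look something like this (a hypothetical shape; the actual fields depend on which schema ETL-D matched):

```json
[
  { "trade_date": "2024-03-15", "ticker": "AAPL", "quantity": 100, "amount": 17250.00 },
  { "trade_date": "2024-03-18", "ticker": "MSFT", "quantity": -50, "amount": -20830.50 }
]
```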
The Result
We've seen an 85% reduction in token usage and 99.9% accuracy on legacy formats like Spanish Norma 43, SEC XBRL, and complex Broker Trade Histories.
🚀 Get Started (v3.2.x)
You can find ETL-D in the official Anthropic Registry or install it via NPM:
- Official Registry: ETL-D on MCP Registry
- NPM Package: npm install etld-mcp-server
- Python Core: pip install etld
- GitHub (Public Mirror): pablixnieto2/etld-mcp-server
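To wire the server into Claude Desktop, an entry along these lines in claude_desktop_config.json should do it (the "etld" key is an arbitrary label of your choosing; this follows the standard mcpServers convention, with npx fetching the package above):

```json
{
  "mcpServers": {
    "etld": {
      "command": "npx",
      "args": ["-y", "etld-mcp-server"]
    }
  }
}
```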
How are you handling messy data in your AI pipelines? Let's discuss in the comments!