Your AI agent is smart. It can write code, plan projects, debug errors. But ask it about your data — "how many orders did we get yesterday?" — and it falls apart.
The usual fix is RAG: chunk your data, embed it, stuff it into a vector store, pray the retrieval finds the right rows. It works 60% of the time. The other 40%, the agent hallucinates numbers, mixes up customers, invents trends that don't exist.
Here's the thing: your data is already in a database. In tables. With schemas. SQL exists. Why are we turning structured data into embeddings just so an LLM can guess at it?
## The real problem
AI agents don't have a data problem. They have an access problem.
Your agent can't just SELECT * FROM orders. It doesn't have a database connection. It doesn't know your schema. It doesn't have credentials. So you build a RAG pipeline — which is really just a worse, slower, less accurate version of SQL.
What if the agent could just... query the database?
## MCP changes everything
Model Context Protocol (MCP) lets AI agents call tools. Not "tools" in the prompt-engineering sense — actual tools. Functions with parameters and return values. Claude, Cursor, Windsurf — they all support it.
So we built MCP servers for our data platform. The agent connects, discovers your tables, and runs real SQL. No embeddings. No vector stores. No hallucinations.
```
You: "How did sales perform last week vs the week before?"

Agent: [MCP] list_tables()
       [MCP] describe_table("orders")
       [MCP] query("SELECT DATE_TRUNC('week', created_at) AS week,
                    SUM(total) AS revenue FROM orders
                    WHERE created_at >= CURRENT_DATE - 14
                    GROUP BY 1 ORDER BY 1")

Agent: "Sales were $312k last week, up 8% from $289k the week before.
        Electronics had the biggest jump at 15%."
```
Every number comes from a real SQL query. The agent can show its work.
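Under the hood this is just a discover-then-query tool loop. Here's a minimal, self-contained sketch of that loop, using Python's built-in `sqlite3` as a stand-in for DuckDB. The function names mirror the MCP tools in the transcript, but this is an illustration of the pattern, not the Lens implementation:

```python
import sqlite3

# Toy stand-in for the Lens MCP server's tools (illustration only;
# the real server runs DuckDB over Parquet-backed views).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, total REAL, created_at TEXT)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, 100.0, "2024-06-03"), (2, 250.0, "2024-06-10")],
)

def list_tables():
    """Discovery tool: which tables exist?"""
    rows = conn.execute("SELECT name FROM sqlite_master WHERE type = 'table'")
    return [name for (name,) in rows]

def describe_table(table):
    """Discovery tool: column names and types."""
    return [(col[1], col[2]) for col in conn.execute(f"PRAGMA table_info({table})")]

def query(sql):
    """Read-only tool: run SQL, return rows."""
    return conn.execute(sql).fetchall()

# The discover-then-query loop from the transcript above:
print(list_tables())                           # → ['orders']
print(describe_table("orders"))                # → [('id', 'INTEGER'), ('total', 'REAL'), ('created_at', 'TEXT')]
print(query("SELECT SUM(total) FROM orders"))  # → [(350.0,)]
```

Because every answer bottoms out in a `query(...)` call, the agent's reasoning is reproducible: re-run the SQL, get the same number.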
## How we built it
DataSpoc is an open-source data platform — three CLI tools connected by Parquet files in a cloud bucket. The two that matter for agents:
- **Pipe** — ingests data from 400+ sources (databases, APIs, SaaS) to Parquet in S3/GCS/Azure.
- **Lens** — mounts your cloud Parquet as DuckDB views. SQL shell, Jupyter notebooks, AI queries, local cache.
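Conceptually, "mounting" Parquet as views is plain DuckDB SQL. A rough sketch of what such a view looks like — the bucket path, view name, and `httpfs` setup here are illustrative assumptions, not Lens internals:

```sql
-- Illustrative only: roughly what a Parquet-backed view looks like in DuckDB.
INSTALL httpfs; LOAD httpfs;  -- enables s3:// (and GCS/Azure-compatible) paths

CREATE VIEW orders AS
SELECT * FROM read_parquet('s3://my-lake/raw/postgres/orders/*.parquet');

-- From here, agents query it like any ordinary table:
SELECT COUNT(*) FROM orders;
```

The view is just a pointer, so queries always see whatever Parquet files are currently in the bucket.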
Both expose MCP servers:
```shell
pip install "dataspoc-lens[mcp]" "dataspoc-pipe[mcp]"

dataspoc-lens mcp   # agents query your data
dataspoc-pipe mcp   # agents manage pipelines
```
## Setting it up (5 minutes)
### Step 1: Get data into your bucket
```shell
pip install dataspoc-pipe

dataspoc-pipe init
dataspoc-pipe add my-postgres
dataspoc-pipe run my-postgres
# → Parquet files in s3://my-lake/raw/postgres/orders/...
```
### Step 2: Point Lens at the bucket
```shell
pip install "dataspoc-lens[mcp]"

dataspoc-lens add-bucket s3://my-lake
dataspoc-lens catalog
# → orders (45,000 rows, 12 columns)
# → customers (12,000 rows, 8 columns)
```
### Step 3: Connect your AI agent
Add this to your Claude Desktop config file, `claude_desktop_config.json` (on macOS it lives under `~/Library/Application Support/Claude/`):
```json
{
  "mcpServers": {
    "dataspoc-lens": {
      "command": "dataspoc-lens",
      "args": ["mcp"]
    },
    "dataspoc-pipe": {
      "command": "dataspoc-pipe",
      "args": ["mcp"]
    }
  }
}
```
That's it. Your agent now has 14 tools:
**Lens tools (query):**

| Tool | What it does |
|---|---|
| `list_tables` | Discover all tables in your lake |
| `describe_table` | Get column names and types |
| `query` | Run SQL (read-only) |
| `ask` | Natural language → SQL → results |
| `cache_status` | Check data freshness |
| `cache_refresh` | Re-download stale tables |
| `cache_refresh_stale` | Refresh all stale tables at once |
**Pipe tools (ingestion):**

| Tool | What it does |
|---|---|
| `list_pipelines` | Show configured sources |
| `run_pipeline` | Trigger ingestion |
| `pipeline_status` | Check all pipelines |
| `pipeline_logs` | Read execution logs |
| `pipeline_config` | View pipeline settings |
| `show_manifest` | View data catalog |
| `validate_pipeline` | Test connectivity |
## Why this beats RAG for structured data
| | RAG | MCP + SQL |
|---|---|---|
| Accuracy | ~60% (retrieval + LLM interpretation) | ~99% (SQL is deterministic) |
| Speed | Seconds (embed → search → rerank → generate) | Milliseconds (SQL query) |
| Freshness | Stale until re-indexed | Real-time (query live data) |
| Auditability | Black box | Agent shows the SQL it ran |
| Schema changes | Re-embed everything | Automatic (reads live schema) |
| Cost | Embedding API + vector DB + LLM | Just the LLM (DuckDB is free) |
RAG is great for unstructured data — documents, emails, tickets. But for tables with rows and columns? SQL wins. Every time.
## The Python SDK (for agent frameworks)
If you're building with CrewAI, LangGraph, or AutoGen:
```python
from dataspoc_lens import LensClient

with LensClient() as lens:
    # Discover
    tables = lens.tables()
    schema = lens.schema("orders")

    # Query
    result = lens.query("SELECT status, COUNT(*) FROM orders GROUP BY 1")

    # Ask in natural language
    answer = lens.ask("which customers are at risk of churning?")

    # Ensure fresh data
    lens.cache_refresh_stale()
```

```python
from dataspoc_pipe import PipeClient

pipe = PipeClient()
pipe.run("my-postgres")          # trigger ingestion
status = pipe.status()           # check all pipelines
logs = pipe.logs("my-postgres")  # read last execution
```
No subprocess calls. No CLI parsing. Native Python.
## The AGENT.md pattern
Every DataSpoc repo ships with an AGENT.md — a skill file that documents every function, pattern, and constraint. AI agents read it and know exactly what they can do.
Think of it as a README, but for machines. Drop it into Claude Code, Cursor, or any agent workflow. No prompt engineering needed.
## Governed access (the security angle)
This isn't a free-for-all. The agent gets the same access as the human user:
- **Read-only by default.** The `query` MCP tool rejects DROP, DELETE, INSERT, UPDATE.
- **Cloud IAM enforced.** The agent uses your AWS/GCP/Azure credentials. No bucket access = no data.
- **No new credentials.** Same IAM permissions your analysts use.
- **Auditable.** Every query the agent runs is real SQL you can review.
No shadow data pipelines. No unaudited access. Your data governance stays intact.
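A read-only gate of this kind can be approximated with a statement allowlist. The sketch below is a generic illustration of the technique, not DataSpoc's actual guard:

```python
import re

# Hypothetical sketch of a read-only SQL gate; NOT DataSpoc's actual code.
ALLOWED_FIRST = {"SELECT", "WITH", "EXPLAIN", "DESCRIBE", "SHOW"}

def assert_read_only(sql: str) -> None:
    """Raise PermissionError unless every statement starts with a read-only verb."""
    # Strip comments first so '/* */ DROP ...' can't sneak past the check.
    cleaned = re.sub(r"--[^\n]*|/\*.*?\*/", " ", sql, flags=re.DOTALL)
    for statement in cleaned.split(";"):
        tokens = statement.strip().split()
        if tokens and tokens[0].upper() not in ALLOWED_FIRST:
            raise PermissionError(f"read-only: {tokens[0].upper()} is not allowed")

assert_read_only("SELECT * FROM orders")                  # passes silently
assert_read_only("WITH t AS (SELECT 1) SELECT * FROM t")  # passes silently
try:
    assert_read_only("DROP TABLE orders")
except PermissionError as err:
    print(err)  # → read-only: DROP is not allowed
```

A production guard would go further (e.g. parsing the statement rather than tokenizing it, since DuckDB's `COPY ... TO` can write files from an otherwise read-only session), but the allowlist shape is the core idea.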
## What this looks like in practice
**Data Engineer agent (via Pipe MCP):**

```
You: "The sales pipeline failed last night. What happened?"

Agent: [MCP] pipeline_logs("sales")
       [MCP] validate_pipeline("sales")

Agent: "The Postgres source returned a connection timeout.
        The DB host is unreachable. Once it's back, I can
        re-run with --full to backfill the missed data."
```
**Data Analyst agent (via Lens MCP):**

```
You: "Build me a weekly revenue report by product line."

Agent: [MCP] cache_refresh_stale()
       [MCP] list_tables()
       [MCP] describe_table("orders")
       [MCP] query("SELECT ...")

Agent: "Revenue this week: $312k. Electronics leads at $98k (+15%),
        followed by Software at $87k. Here's the CSV export."
```
**ML agent (via Lens ML commands):**

```
You: "Can we predict which customers will churn?"

Agent: [MCP] query("SELECT * FROM customers LIMIT 5")
       → ml train --target churn --from customers
       → ml explain --model churn

Agent: "Model trained (AUC 0.87). Top predictors: days_since_order,
        support_tickets, contract_type. 23 customers flagged as
        high risk. Predictions saved to ml/predictions/churn."
```
## Try it now
```shell
pip install dataspoc-pipe "dataspoc-lens[mcp]"

# Get some data in
dataspoc-pipe init
dataspoc-pipe add my-source
dataspoc-pipe run my-source

# Connect your agent
dataspoc-lens add-bucket s3://my-data
dataspoc-lens mcp
```
Or just query from the terminal first:
```shell
dataspoc-lens shell
dataspoc-lens ask "how many orders yesterday?"
```
## Honest limitations
- Batch only. No streaming. Pipe runs on cron, not in real-time.
- Single-node. DuckDB is fast but not distributed. Petabyte-scale needs Snowflake.
- Early stage (v0.2.0). API may evolve.
- ML is commercial. Pipe and Lens are Apache 2.0 — free forever.
## Links
- Website: dataspoc.com
- Pipe: github.com/dataspoclab/dataspoc-pipe
- Lens: github.com/dataspoclab/dataspoc-lens
- PyPI: dataspoc-pipe | dataspoc-lens
- Docs: dataspoc.com/getting-started/overview/
Your data is in tables. Let your agent query it like tables. Stop turning SQL into embeddings.
Built in Brazil. Open source. For humans and AI agents.