Michael San Martim

Posted on • Originally published at dataspoc.com

Your AI agent needs data. Here's how to feed it without RAG.

Your AI agent is smart. It can write code, plan projects, debug errors. But ask it about your data — "how many orders did we get yesterday?" — and it falls apart.

The usual fix is RAG: chunk your data, embed it, stuff it into a vector store, pray the retrieval finds the right rows. It works 60% of the time. The other 40%, the agent hallucinates numbers, mixes up customers, invents trends that don't exist.

Here's the thing: your data is already in a database. In tables. With schemas. SQL exists. Why are we turning structured data into embeddings just so an LLM can guess at it?

The real problem

AI agents don't have a data problem. They have an access problem.

Your agent can't just SELECT * FROM orders. It doesn't have a database connection. It doesn't know your schema. It doesn't have credentials. So you build a RAG pipeline — which is really just a worse, slower, less accurate version of SQL.

What if the agent could just... query the database?

MCP changes everything

Model Context Protocol (MCP) lets AI agents call tools. Not "tools" in the prompt-engineering sense — actual tools. Functions with parameters and return values. Claude, Cursor, Windsurf — they all support it.
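
If you've never written one, an MCP tool is just a decorated function. Here's a minimal sketch using the official Python SDK's FastMCP helper; the server name and the canned return value are placeholders, not DataSpoc's implementation:

# Minimal MCP server sketch using the official `mcp` Python SDK.
# The tool body is a toy stand-in, not DataSpoc's real implementation.
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("demo-data-server")

@mcp.tool()
def describe_table(table: str) -> str:
    """Return column names and types for a table."""
    # A real server would read the live schema here.
    return f"{table}: id BIGINT, created_at TIMESTAMP, total DOUBLE"

if __name__ == "__main__":
    mcp.run()  # serves MCP over stdio, so agents like Claude can connect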

So we built MCP servers for our data platform. The agent connects, discovers your tables, and runs real SQL. No embeddings. No vector stores. No hallucinations.

User: "How did sales perform last week vs the week before?"

Agent: [MCP] list_tables()
       [MCP] describe_table("orders")
       [MCP] query("SELECT DATE_TRUNC('week', created_at) AS week,
                    SUM(total) AS revenue FROM orders
                    WHERE created_at >= CURRENT_DATE - 14
                    GROUP BY 1 ORDER BY 1")

Agent: "Sales were $312k last week, up 8% from $289k the week before.
        Electronics had the biggest jump at 15%."

Every number comes from a real SQL query. The agent can show its work.

How we built it

DataSpoc is an open-source data platform — three CLI tools connected by Parquet files in a cloud bucket. The two that matter for agents:

Pipe — ingests data from 400+ sources (databases, APIs, SaaS) to Parquet in S3/GCS/Azure.

Lens — mounts your cloud Parquet as DuckDB views. SQL shell, Jupyter notebooks, AI queries, local cache.
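
Conceptually, mounting cloud Parquet as DuckDB views is simple. Here's a minimal sketch of the idea (the bucket path is illustrative; Lens adds discovery and caching on top):

import duckdb

con = duckdb.connect()
con.execute("INSTALL httpfs; LOAD httpfs;")  # enables s3:// paths
# Mount a Parquet dataset as a queryable view (path is illustrative).
# S3 credentials are assumed to be configured, e.g. via DuckDB's S3 secrets.
con.execute("""
    CREATE VIEW orders AS
    SELECT * FROM read_parquet('s3://my-lake/raw/postgres/orders/*.parquet')
""")
print(con.execute("SELECT COUNT(*) FROM orders").fetchone())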

Both expose MCP servers:

pip install dataspoc-lens[mcp] dataspoc-pipe[mcp]

dataspoc-lens mcp    # agents query your data
dataspoc-pipe mcp    # agents manage pipelines

Setting it up (5 minutes)

Step 1: Get data into your bucket

pip install dataspoc-pipe

dataspoc-pipe init
dataspoc-pipe add my-postgres
dataspoc-pipe run my-postgres
# → Parquet files in s3://my-lake/raw/postgres/orders/...

Step 2: Point Lens at the bucket

pip install dataspoc-lens[mcp]

dataspoc-lens add-bucket s3://my-lake
dataspoc-lens catalog
# → orders (45,000 rows, 12 columns)
# → customers (12,000 rows, 8 columns)

Step 3: Connect your AI agent

Add to Claude Desktop config (claude_desktop_config.json):

{
  "mcpServers": {
    "dataspoc-lens": {
      "command": "dataspoc-lens",
      "args": ["mcp"]
    },
    "dataspoc-pipe": {
      "command": "dataspoc-pipe",
      "args": ["mcp"]
    }
  }
}

That's it. Your agent now has 14 tools:

Lens tools (query):

Tool                   What it does
list_tables            Discover all tables in your lake
describe_table         Get column names and types
query                  Run SQL (read-only)
ask                    Natural language → SQL → results
cache_status           Check data freshness
cache_refresh          Re-download stale tables
cache_refresh_stale    Refresh all stale tables at once

Pipe tools (ingestion):

Tool                 What it does
list_pipelines       Show configured sources
run_pipeline         Trigger ingestion
pipeline_status      Check all pipelines
pipeline_logs        Read execution logs
pipeline_config      View pipeline settings
show_manifest        View data catalog
validate_pipeline    Test connectivity

Why this beats RAG for structured data

                 RAG                                           MCP + SQL
Accuracy         ~60% (retrieval + LLM interpretation)         ~99% (SQL is deterministic)
Speed            Seconds (embed → search → rerank → generate)  Milliseconds (SQL query)
Freshness        Stale until re-indexed                        Real-time (query live data)
Auditability     Black box                                     Agent shows the SQL it ran
Schema changes   Re-embed everything                           Automatic (reads live schema)
Cost             Embedding API + vector DB + LLM               Just the LLM (DuckDB is free)

RAG is great for unstructured data — documents, emails, tickets. But for tables with rows and columns? SQL wins. Every time.

The Python SDK (for agent frameworks)

If you're building with CrewAI, LangGraph, or AutoGen:

from dataspoc_lens import LensClient

with LensClient() as lens:
    # Discover
    tables = lens.tables()
    schema = lens.schema("orders")

    # Query
    result = lens.query("SELECT status, COUNT(*) FROM orders GROUP BY 1")

    # Ask in natural language
    answer = lens.ask("which customers are at risk of churning?")

    # Ensure fresh data
    lens.cache_refresh_stale()

from dataspoc_pipe import PipeClient

pipe = PipeClient()
pipe.run("my-postgres")           # trigger ingestion
status = pipe.status()            # check all pipelines
logs = pipe.logs("my-postgres")   # read last execution

No subprocess calls. No CLI parsing. Native Python.
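
For example, wrapping the client as a LangChain-style tool (which LangGraph agents consume) takes a few lines. This is a sketch, not an official integration, and `run_sql` is a name I made up:

from langchain_core.tools import tool
from dataspoc_lens import LensClient

@tool
def run_sql(sql: str) -> str:
    """Run a read-only SQL query against the data lake and return the result."""
    with LensClient() as lens:
        return str(lens.query(sql))

# Pass [run_sql] to your agent's tool list and the LLM can call it directly.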

The AGENT.md pattern

Every DataSpoc repo ships with an AGENT.md — a skill file that documents every function, pattern, and constraint. AI agents read it and know exactly what they can do.

Think of it as a README, but for machines. Drop it into Claude Code, Cursor, or any agent workflow. No prompt engineering needed.
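
There's no formal spec for what goes inside. A hypothetical excerpt (these rules are invented for illustration):

# AGENT.md (excerpt)
## Querying
- Call describe_table before writing SQL against an unfamiliar table.
- query() is read-only; write statements are rejected.
## Freshness
- Call cache_refresh_stale() before generating a report.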

Governed access (the security angle)

This isn't a free-for-all. The agent gets the same access as the human user:

  • Read-only by default. The query MCP tool rejects DROP, DELETE, INSERT, UPDATE (see the sketch after this list).
  • Cloud IAM enforced. The agent uses your AWS/GCP/Azure credentials. No bucket access = no data.
  • No new credentials. Same IAM permissions your analysts use.
  • Auditable. Every query the agent runs is real SQL you can review.
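
The read-only guard is easy to picture. A naive sketch of such a check (illustrative only, not DataSpoc's actual implementation):

import re

# Statements a read-only query tool refuses (naive, illustrative list).
WRITE_STATEMENTS = re.compile(
    r"\b(DROP|DELETE|INSERT|UPDATE|ALTER|TRUNCATE|CREATE)\b", re.IGNORECASE
)

def guard_query(sql: str) -> str:
    """Reject SQL containing write statements before it reaches the engine."""
    if WRITE_STATEMENTS.search(sql):
        raise PermissionError("read-only tool: write statements are rejected")
    return sql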

No shadow data pipelines. No unaudited access. Your data governance stays intact.

What this looks like in practice

Data Engineer agent (via Pipe MCP):

You: "The sales pipeline failed last night. What happened?"

Agent: [MCP] pipeline_logs("sales")
       [MCP] validate_pipeline("sales")

Agent: "The Postgres source returned a connection timeout.
        The DB host is unreachable. Once it's back, I can
        re-run with --full to backfill the missed data."

Data Analyst agent (via Lens MCP):

You: "Build me a weekly revenue report by product line."

Agent: [MCP] cache_refresh_stale()
       [MCP] list_tables()
       [MCP] describe_table("orders")
       [MCP] query("SELECT ...")

Agent: "Revenue this week: $312k. Electronics leads at $98k (+15%),
        followed by Software at $87k. Here's the CSV export."

ML agent (via Lens ML commands):

You: "Can we predict which customers will churn?"

Agent: [MCP] query("SELECT * FROM customers LIMIT 5")
       ml train --target churn --from customers
       ml explain --model churn

Agent: "Model trained (AUC 0.87). Top predictors: days_since_order,
        support_tickets, contract_type. 23 customers flagged as
        high risk. Predictions saved to ml/predictions/churn."

Try it now

pip install dataspoc-pipe dataspoc-lens[mcp]

# Get some data in
dataspoc-pipe init
dataspoc-pipe add my-source
dataspoc-pipe run my-source

# Connect your agent
dataspoc-lens add-bucket s3://my-data
dataspoc-lens mcp

Or just query from the terminal first:

dataspoc-lens shell
dataspoc-lens ask "how many orders yesterday?"

Honest limitations

  • Batch only. No streaming. Pipe runs on cron, not in real time.
  • Single-node. DuckDB is fast but not distributed. Petabyte-scale needs Snowflake.
  • Early stage (v0.2.0). API may evolve.
  • ML is commercial. Pipe and Lens are Apache 2.0 — free forever.

Your data is in tables. Let your agent query it as tables. Stop turning structured data into embeddings.

Built in Brazil. Open source. For humans and AI agents.
