Your AI agent is smart. It can write code, plan projects, debug errors. But ask it about your data — "how many orders did we get yesterday?" — and it falls apart.
The usual fix is RAG: chunk your data, embed it, stuff it into a vector store, pray the retrieval finds the right rows. It works 60% of the time. The other 40%, the agent hallucinates numbers, mixes up customers, invents trends that don't exist.
Here's the thing: your data is already in a database. In tables. With schemas. SQL exists. Why are we turning structured data into embeddings just so an LLM can guess at it?
## The real problem
AI agents don't have a data problem. They have an access problem.
Your agent can't just SELECT * FROM orders. It doesn't have a database connection. It doesn't know your schema. It doesn't have credentials. So you build a RAG pipeline — which is really just a worse, slower, less accurate version of SQL.
What if the agent could just... query the database?
## MCP changes everything
Model Context Protocol (MCP) lets AI agents call tools. Not "tools" in the prompt-engineering sense — actual tools. Functions with parameters and return values. Claude, Cursor, Windsurf — they all support it.
So we built MCP servers for our data platform. The agent connects, discovers your tables, and runs real SQL. No embeddings. No vector stores. No hallucinations.
```
You: "How did sales perform last week vs the week before?"

Agent: [MCP] list_tables()
       [MCP] describe_table("orders")
       [MCP] query("SELECT DATE_TRUNC('week', created_at) AS week,
                    SUM(total) AS revenue FROM orders
                    WHERE created_at >= CURRENT_DATE - 14
                    GROUP BY 1 ORDER BY 1")

Agent: "Sales were $312k last week, up 8% from $289k the week before.
        Electronics had the biggest jump at 15%."
```
Every number comes from a real SQL query. The agent can show its work.
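Under the hood this is just a discover-then-query tool loop. Here's a minimal, self-contained sketch of that loop, using Python's built-in `sqlite3` as a stand-in for DuckDB. The function names mirror the MCP tools in the transcript, but this is an illustration of the pattern, not the Lens implementation:

```python
import sqlite3

# Toy stand-in for the Lens MCP server's tools (illustration only;
# the real server runs DuckDB over Parquet-backed views).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, total REAL, created_at TEXT)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, 100.0, "2024-06-03"), (2, 250.0, "2024-06-10")],
)

def list_tables():
    """Discovery tool: which tables exist?"""
    rows = conn.execute("SELECT name FROM sqlite_master WHERE type = 'table'")
    return [name for (name,) in rows]

def describe_table(table):
    """Discovery tool: column names and types."""
    return [(col[1], col[2]) for col in conn.execute(f"PRAGMA table_info({table})")]

def query(sql):
    """Read-only tool: run SQL, return rows."""
    return conn.execute(sql).fetchall()

# The discover-then-query loop from the transcript above:
print(list_tables())                           # → ['orders']
print(describe_table("orders"))                # → [('id', 'INTEGER'), ('total', 'REAL'), ('created_at', 'TEXT')]
print(query("SELECT SUM(total) FROM orders"))  # → [(350.0,)]
```

Because every answer bottoms out in a `query(...)` call, the agent's reasoning is reproducible: re-run the SQL, get the same number.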
## How we built it
DataSpoc is an open-source data platform — three CLI tools connected by Parquet files in a cloud bucket. The two that matter for agents:
- **Pipe** — ingests data from 400+ sources (databases, APIs, SaaS) to Parquet in S3/GCS/Azure.
- **Lens** — mounts your cloud Parquet as DuckDB views. SQL shell, Jupyter notebooks, AI queries, local cache.
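Conceptually, "mounting" Parquet as views is plain DuckDB SQL. A rough sketch of what such a view looks like — the bucket path, view name, and `httpfs` setup here are illustrative assumptions, not Lens internals:

```sql
-- Illustrative only: roughly what a Parquet-backed view looks like in DuckDB.
INSTALL httpfs; LOAD httpfs;  -- enables s3:// (and GCS/Azure-compatible) paths

CREATE VIEW orders AS
SELECT * FROM read_parquet('s3://my-lake/raw/postgres/orders/*.parquet');

-- From here, agents query it like any ordinary table:
SELECT COUNT(*) FROM orders;
```

The view is just a pointer, so queries always see whatever Parquet files are currently in the bucket.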
Both expose MCP servers:
```shell
pip install "dataspoc-lens[mcp]" "dataspoc-pipe[mcp]"

dataspoc-lens mcp   # agents query your data
dataspoc-pipe mcp   # agents manage pipelines
```
## Setting it up (5 minutes)
### Step 1: Get data into your bucket
```shell
pip install dataspoc-pipe

dataspoc-pipe init
dataspoc-pipe add my-postgres
dataspoc-pipe run my-postgres
# → Parquet files in s3://my-lake/raw/postgres/orders/...
```
### Step 2: Point Lens at the bucket
```shell
pip install "dataspoc-lens[mcp]"

dataspoc-lens add-bucket s3://my-lake
dataspoc-lens catalog
# → orders (45,000 rows, 12 columns)
# → customers (12,000 rows, 8 columns)
```
### Step 3: Connect your AI agent
Add this to your Claude Desktop config file, `claude_desktop_config.json` (on macOS it lives under `~/Library/Application Support/Claude/`):
```json
{
  "mcpServers": {
    "dataspoc-lens": {
      "command": "dataspoc-lens",
      "args": ["mcp"]
    },
    "dataspoc-pipe": {
      "command": "dataspoc-pipe",
      "args": ["mcp"]
    }
  }
}
```
That's it. Your agent now has 14 tools:
**Lens tools (query):**

| Tool | What it does |
|---|---|
| `list_tables` | Discover all tables in your lake |
| `describe_table` | Get column names and types |
| `query` | Run SQL (read-only) |
| `ask` | Natural language → SQL → results |
| `cache_status` | Check data freshness |
| `cache_refresh` | Re-download stale tables |
| `cache_refresh_stale` | Refresh all stale tables at once |
**Pipe tools (ingestion):**

| Tool | What it does |
|---|---|
| `list_pipelines` | Show configured sources |
| `run_pipeline` | Trigger ingestion |
| `pipeline_status` | Check all pipelines |
| `pipeline_logs` | Read execution logs |
| `pipeline_config` | View pipeline settings |
| `show_manifest` | View data catalog |
| `validate_pipeline` | Test connectivity |
## Why this beats RAG for structured data
| | RAG | MCP + SQL |
|---|---|---|
| Accuracy | ~60% (retrieval + LLM interpretation) | ~99% (SQL is deterministic) |
| Speed | Seconds (embed → search → rerank → generate) | Milliseconds (SQL query) |
| Freshness | Stale until re-indexed | Real-time (query live data) |
| Auditability | Black box | Agent shows the SQL it ran |
| Schema changes | Re-embed everything | Automatic (reads live schema) |
| Cost | Embedding API + vector DB + LLM | Just the LLM (DuckDB is free) |
RAG is great for unstructured data — documents, emails, tickets. But for tables with rows and columns? SQL wins. Every time.
## The Python SDK (for agent frameworks)
If you're building with CrewAI, LangGraph, or AutoGen:
```python
from dataspoc_lens import LensClient

with LensClient() as lens:
    # Discover
    tables = lens.tables()
    schema = lens.schema("orders")

    # Query
    result = lens.query("SELECT status, COUNT(*) FROM orders GROUP BY 1")

    # Ask in natural language
    answer = lens.ask("which customers are at risk of churning?")

    # Ensure fresh data
    lens.cache_refresh_stale()
```

```python
from dataspoc_pipe import PipeClient

pipe = PipeClient()
pipe.run("my-postgres")          # trigger ingestion
status = pipe.status()           # check all pipelines
logs = pipe.logs("my-postgres")  # read last execution
```
No subprocess calls. No CLI parsing. Native Python.
## The AGENT.md pattern
Every DataSpoc repo ships with an AGENT.md — a skill file that documents every function, pattern, and constraint. AI agents read it and know exactly what they can do.
Think of it as a README, but for machines. Drop it into Claude Code, Cursor, or any agent workflow. No prompt engineering needed.
## Governed access (the security angle)
This isn't a free-for-all. The agent gets the same access as the human user:
- **Read-only by default.** The `query` MCP tool rejects DROP, DELETE, INSERT, UPDATE.
- **Cloud IAM enforced.** The agent uses your AWS/GCP/Azure credentials. No bucket access = no data.
- **No new credentials.** Same IAM permissions your analysts use.
- **Auditable.** Every query the agent runs is real SQL you can review.
No shadow data pipelines. No unaudited access. Your data governance stays intact.
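A read-only gate of this kind can be approximated with a statement allowlist. The sketch below is a generic illustration of the technique, not DataSpoc's actual guard:

```python
import re

# Hypothetical sketch of a read-only SQL gate; NOT DataSpoc's actual code.
ALLOWED_FIRST = {"SELECT", "WITH", "EXPLAIN", "DESCRIBE", "SHOW"}

def assert_read_only(sql: str) -> None:
    """Raise PermissionError unless every statement starts with a read-only verb."""
    # Strip comments first so '/* */ DROP ...' can't sneak past the check.
    cleaned = re.sub(r"--[^\n]*|/\*.*?\*/", " ", sql, flags=re.DOTALL)
    for statement in cleaned.split(";"):
        tokens = statement.strip().split()
        if tokens and tokens[0].upper() not in ALLOWED_FIRST:
            raise PermissionError(f"read-only: {tokens[0].upper()} is not allowed")

assert_read_only("SELECT * FROM orders")                  # passes silently
assert_read_only("WITH t AS (SELECT 1) SELECT * FROM t")  # passes silently
try:
    assert_read_only("DROP TABLE orders")
except PermissionError as err:
    print(err)  # → read-only: DROP is not allowed
```

A production guard would go further (e.g. parsing the statement rather than tokenizing it, since DuckDB's `COPY ... TO` can write files from an otherwise read-only session), but the allowlist shape is the core idea.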
## What this looks like in practice
**Data Engineer agent (via Pipe MCP):**

```
You: "The sales pipeline failed last night. What happened?"

Agent: [MCP] pipeline_logs("sales")
       [MCP] validate_pipeline("sales")

Agent: "The Postgres source returned a connection timeout.
        The DB host is unreachable. Once it's back, I can
        re-run with --full to backfill the missed data."
```
**Data Analyst agent (via Lens MCP):**

```
You: "Build me a weekly revenue report by product line."

Agent: [MCP] cache_refresh_stale()
       [MCP] list_tables()
       [MCP] describe_table("orders")
       [MCP] query("SELECT ...")

Agent: "Revenue this week: $312k. Electronics leads at $98k (+15%),
        followed by Software at $87k. Here's the CSV export."
```
**ML agent (via Lens ML commands):**

```
You: "Can we predict which customers will churn?"

Agent: [MCP] query("SELECT * FROM customers LIMIT 5")
       → ml train --target churn --from customers
       → ml explain --model churn

Agent: "Model trained (AUC 0.87). Top predictors: days_since_order,
        support_tickets, contract_type. 23 customers flagged as
        high risk. Predictions saved to ml/predictions/churn."
```
## Try it now
```shell
pip install dataspoc-pipe "dataspoc-lens[mcp]"

# Get some data in
dataspoc-pipe init
dataspoc-pipe add my-source
dataspoc-pipe run my-source

# Connect your agent
dataspoc-lens add-bucket s3://my-data
dataspoc-lens mcp
```
Or just query from the terminal first:
```shell
dataspoc-lens shell
dataspoc-lens ask "how many orders yesterday?"
```
## Honest limitations
- Batch only. No streaming. Pipe runs on cron, not in real-time.
- Single-node. DuckDB is fast but not distributed. Petabyte-scale needs Snowflake.
- Early stage (v0.2.0). API may evolve.
- ML is commercial. Pipe and Lens are Apache 2.0 — free forever.
## Links
- Website: dataspoc.com
- Pipe: github.com/dataspoclab/dataspoc-pipe
- Lens: github.com/dataspoclab/dataspoc-lens
- PyPI: dataspoc-pipe | dataspoc-lens
- Docs: dataspoc.com/getting-started/overview/
Your data is in tables. Let your agent query it like tables. Stop turning SQL into embeddings.
Built in Brazil. Open source. For humans and AI agents.