For the last six months, I've been using the official ClickHouse MCP every day to investigate sports data incidents at LSports. I use it primarily through Cursor (the AI-powered code editor), though it works with Claude Desktop and other MCP clients too.
The official MCP is excellent at what it does: executing SQL queries and providing schema introspection.
But every day, I hit the same wall: I still write SQL manually.
So I'm building the intelligence layer on top. In the open. Here's why.
The Official ClickHouse MCP is Good (Really)
ClickHouse Inc built a solid MCP server. It does exactly what a database connector should do:
- ✅ Executes SQL queries reliably
- ✅ Provides schema introspection (list databases, tables, columns)
- ✅ Proper security (readonly mode)
- ✅ Active maintenance and updates
For generic database access, it works great. I use it every day.
But it's missing the intelligence layer.
The Daily Pain Points
Let me show you what a typical investigation looks like:
Pain #1: Manual SQL Writing (Every. Single. Time.)
Me in Cursor: "Show me NBA incidents from yesterday"
Cursor (using official ClickHouse MCP):
-- Cursor writes this from scratch every time
SELECT * FROM incidents
WHERE sport = 'Basketball'
AND league = 'NBA'
AND timestamp >= yesterday()
LIMIT 100
What I actually want:
- Template already exists
- Just fill in parameters: sport=NBA, timeframe=yesterday
- Instant execution
Pain #2: No Safety Checks
Bad query generated:
SELECT * FROM incidents WHERE description LIKE '%timeout%'
Result: 💥 A leading-wildcard LIKE with no time filter and no LIMIT forces a full table scan. The query times out and wastes cluster resources.
What I need:
- Validate it filters on the table's ORDER BY (sorting key) columns, so ClickHouse can prune data
- Enforce LIMIT clause
- Check query won't cause full table scan
- Timeout enforcement
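The checks in that list can be sketched as a small pre-flight validator. This is a hypothetical sketch (all names and heuristics are mine), not the actual implementation:

```typescript
// Hypothetical pre-flight validator for LLM-generated ClickHouse SQL.
// Table metadata (sorting-key columns) would come from schema introspection
// via the official ClickHouse MCP.
interface TableMeta {
  sortingKey: string[]; // the table's ORDER BY columns
}

interface CheckResult {
  ok: boolean;
  problems: string[];
}

function validateQuery(sql: string, meta: TableMeta): CheckResult {
  const problems: string[] = [];

  // 1. Enforce a LIMIT clause so result sets are always bounded.
  if (!/\bLIMIT\s+\d+/i.test(sql)) {
    problems.push("missing LIMIT clause");
  }

  // 2. Require at least one sorting-key column in WHERE, so ClickHouse
  //    can prune granules instead of scanning the whole table.
  const whereMatch = sql.match(/\bWHERE\b([\s\S]*?)(\bORDER\b|\bLIMIT\b|$)/i);
  const whereClause = whereMatch ? whereMatch[1] : "";
  const usesSortingKey = meta.sortingKey.some((col) =>
    new RegExp(`\\b${col}\\b`, "i").test(whereClause)
  );
  if (!usesSortingKey) {
    problems.push("WHERE does not filter on any sorting-key column");
  }

  // 3. Flag leading-wildcard LIKE patterns, which defeat index use entirely.
  if (/LIKE\s+'%/i.test(sql)) {
    problems.push("leading-wildcard LIKE forces a full scan");
  }

  return { ok: problems.length === 0, problems };
}
```

Running the bad query from above through this catches all three problems before it ever reaches the cluster.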
Pain #3: No Business Context
The AI (whether in Cursor or Claude) doesn't understand my domain:
- What is an "incident" in my schema?
- How do I calculate "settlement quality"?
- Which tables should be JOINed for investigations?
- What are the common investigation patterns?
Every query starts from zero knowledge.
Pain #4: No Investigation Workflow
Complex questions require 5-10 queries:
- Query match events around specific time
- Query market changes
- Query incidents
- Manually correlate the results
- Ask follow-up questions
- Repeat
No context retention. No correlation. Manual synthesis every time.
What I'm Building
An intelligent investigation agent that sits on top of the official ClickHouse MCP.
Think of it this way:
Official ClickHouse MCP → My Investigation Agent
(database driver)          (intelligent engine)

The same relationship as:
- psycopg2 → Django ORM
- AWS SDK → Serverless Framework
- HTTP client → GraphQL client
Architecture (High-Level)
User Question (Natural Language)
↓
Intent Parser (LLM - Claude, GPT-4, or other)
↓
Semantic Layer (maps business concepts to ClickHouse)
↓
Query Template Selection (30-50 pre-optimized patterns)
↓
Parameter Extraction & Validation
↓
Official ClickHouse MCP (executes the SQL)
↓
ClickHouse Database
↓
Result Formatting & Insights
↓
Response (via Cursor, Claude Desktop, or any MCP client)
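Read as code, the flow above is just a chain of stages. Here's a synchronous sketch with invented type names; in the real agent the LLM and MCP steps would of course be async:

```typescript
// Sketch of the pipeline stages as plain function composition.
// Every name here is illustrative, not the actual implementation.
type Intent = { concept: string; params: Record<string, string> };
type Template = { id: string; sql: string };

interface Pipeline {
  parseIntent(question: string): Intent;      // LLM: natural language → intent
  selectTemplate(intent: Intent): Template;   // pick a pre-optimized pattern
  bind(t: Template, params: Record<string, string>): string; // fill + validate
  execute(sql: string): unknown[];            // official ClickHouse MCP
  format(rows: unknown[]): string;            // insights + formatting
}

function answer(p: Pipeline, question: string): string {
  const intent = p.parseIntent(question);
  const sql = p.bind(p.selectTemplate(intent), intent.params);
  return p.format(p.execute(sql));
}
```

The point of the shape: the official MCP stays a pure executor at the bottom, and every layer of intelligence is a separate, testable step in front of it.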
The Four Intelligence Layers
Layer 1: Semantic Understanding
- Maps "incidents" to specific tables/columns
- Knows business metrics and how to calculate them
- Understands investigation patterns
- Domain-specific knowledge baked in
Layer 2: Query Intelligence
- 30-50 pre-optimized ClickHouse query templates
- LLM picks the right template and fills parameters
- Only generates SQL from scratch for edge cases
- Learns from usage to create new templates
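Template selection doesn't need to start sophisticated: scored keyword matching can route the common cases and fall back to LLM generation otherwise. A naive sketch (names and threshold are my own):

```typescript
// Naive template matcher: scores each template's keywords against the
// question; the LLM only generates SQL when no template scores well.
interface QueryTemplate {
  id: string;
  keywords: string[];
}

function pickTemplate(
  question: string,
  templates: QueryTemplate[]
): QueryTemplate | null {
  const words = new Set(question.toLowerCase().split(/\W+/));
  let best: { t: QueryTemplate; score: number } | null = null;
  for (const t of templates) {
    // Fraction of this template's keywords that appear in the question.
    const score =
      t.keywords.filter((k) => words.has(k)).length / t.keywords.length;
    if (score >= 0.5 && (!best || score > best.score)) {
      best = { t, score };
    }
  }
  return best ? best.t : null; // null → fall back to LLM SQL generation
}
```

A production version would likely use embeddings or let the LLM itself pick from the template catalog, but the contract is the same: match first, generate only as a last resort.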
Layer 3: Safety & Performance
- Validates queries filter on ORDER BY (sorting key) columns so the primary index can prune data
- Enforces LIMIT clauses (prevents unbounded results)
- Checks query cost before execution
- Guarantees < 5 second execution time
- Timeout enforcement at 4.5 seconds
Layer 4: Investigation Features
- Multi-turn conversation state (remembers context)
- Automatic correlation between queries
- Proactive insights ("I noticed high incident rate at 23:42. Want me to check affected markets?")
- Rich result formatting
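The multi-turn state in Layer 4 can start as little more than a merged map of entities seen so far, so a follow-up like "check affected markets" can resolve against earlier answers. A minimal, hypothetical sketch:

```typescript
// Minimal multi-turn context: remembers entities surfaced by earlier
// turns so follow-up questions can resolve references to them.
interface TurnRecord {
  question: string;
  sql: string;
  entities: Record<string, string>; // e.g. { sport: "Basketball", peakTime: "23:42" }
}

class InvestigationContext {
  private turns: TurnRecord[] = [];

  addTurn(turn: TurnRecord): void {
    this.turns.push(turn);
  }

  // Shallow merge of all entities; later turns override earlier ones.
  resolve(): Record<string, string> {
    return this.turns.reduce(
      (acc, t) => ({ ...acc, ...t.entities }),
      {} as Record<string, string>
    );
  }
}
```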
Why ClickHouse Needs Special Handling
You can't just let an LLM generate ClickHouse queries naively:
Performance Gotchas:
- ❌ Wrong ORDER BY → 100x slower query
- ❌ Missing LIMIT → out of memory
- ❌ Bad JOIN → cluster-wide performance impact
- ❌ High-cardinality GROUP BY → resource exhaustion
ClickHouse-Specific Optimizations:
- Filter on ORDER BY (sorting key) columns in the WHERE clause so the primary index can prune data
- PREWHERE for heavy filtering
- Materialized views for aggregations
- Skip indexes for selective queries
- Different engine types (MergeTree, ReplacingMergeTree, etc.)
This is part of why production-grade ClickHouse agents are still scarce: without proper guardrails, the risk is too high.
Template-First Approach (Why It's Actually Perfect for Agents)
Despite the agent hype, I'm going template-first:
// Fast path (80% of queries): a pre-optimized template.
// Placeholders like {sport} are validated and quoted at bind time,
// never spliced in as raw strings.
const template = {
  pattern: "incidents for {sport} {timeframe}",
  sql: `
    SELECT
      timestamp, sport, league,
      incident_type, severity
    FROM incidents
    WHERE sport = {sport}
      AND timestamp >= {timeframe_start}
    ORDER BY timestamp DESC
    LIMIT 100
  `
};
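Binding parameters into a template is where the safety lives: every value should be allow-listed or escaped, never spliced in raw. A hypothetical binder for a template like the one above (in practice, ClickHouse's server-side `{param:String}` query parameters are the more robust option):

```typescript
// Hypothetical parameter binder: values are checked against an allow-list
// or escaped before they reach the SQL string. The allow-list here is
// invented for illustration.
const ALLOWED_SPORTS = new Set(["Basketball", "Football", "Tennis"]);

function bindIncidentTemplate(sport: string, timeframeStart: string): string {
  if (!ALLOWED_SPORTS.has(sport)) {
    throw new Error(`unknown sport: ${sport}`);
  }
  // Escape single quotes in string literals; a real implementation would
  // prefer ClickHouse's {name:Type} query parameters over string building.
  const esc = (s: string) => s.replace(/'/g, "\\'");
  return [
    "SELECT timestamp, sport, league, incident_type, severity",
    "FROM incidents",
    `WHERE sport = '${esc(sport)}'`,
    `AND timestamp >= '${esc(timeframeStart)}'`,
    "ORDER BY timestamp DESC",
    "LIMIT 100",
  ].join("\n");
}
```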
Why templates?
- ✅ Guaranteed ClickHouse optimization (pre-tuned)
- ✅ Predictable performance (< 5 sec guaranteed)
- ✅ Safe by default (no dangerous queries)
- ✅ Handles 80% of use cases
- ✅ Perfect for autonomous agents (stronger guardrails)
For the 20% edge cases: LLM generates SQL, but with heavy validation.
This isn't limiting—it's intelligent:
- Template = muscle memory for common patterns
- LLM = creativity for novel questions
- Best of both worlds
Why Build This?
Three reasons:
1. I Need It (Real Pain, Daily)
This isn't theoretical. I use ClickHouse + Claude for data investigation every day. The official MCP gets me 80% there. I'm building the missing 20%.
2. Market Gap
There is no production-grade intelligent agent for ClickHouse:
- Generic text-to-SQL tools (Vanna, Defog) don't handle ClickHouse optimization
- Other database agents (for Postgres/MySQL) don't transfer
- The official MCP provides execution, not intelligence
Someone needs to build this. Might as well be me.
3. Technical Challenge
ClickHouse + LLM is genuinely hard:
- Different SQL dialect (arrayJoin, PREWHERE, engine-specific syntax)
- Query optimization is critical (ORDER BY, partitions, indexes)
- Performance unpredictable without templates
- Safety requirements for production use
This is the kind of problem that's fun to solve.
Building in Public
I'm building this completely in the open:
- ✅ Open source from day 1 (Apache 2.0)
- ✅ Weekly blog posts on progress
- ✅ GitHub repo with full code
- ✅ Community feedback welcomed
- ✅ LSports as design partner (validates with real data)
Why?
- The problem is real and affects many teams
- Building in public gets better feedback faster
- Open source means others can contribute and improve it
- Someone needs to solve this—might as well share the solution
Not trying to:
- Compete with ClickHouse Inc (building on their foundation)
- Keep it proprietary
- Build in secret
The Roadmap
Week 1-2: POC (Proof of Concept)
- 10 investigation query templates
- Basic LLM integration
- Validation layer
- Integration with official ClickHouse MCP
- Success: Handles 40% of my daily investigations
Month 2: Core Engine
- 30-50 query templates (80% coverage)
- Complete semantic layer
- Multi-turn conversation state
- Performance monitoring
- Success: Replaces manual SQL for most use cases
Month 3: Launch
- Production-ready release
- MCP server implementation (works in Cursor, Claude Desktop, ChatGPT Apps)
- REST API + SDKs
- Documentation and examples
- Conference talk submissions
What I'm NOT Building
Important to be clear:
❌ Not replacing the official ClickHouse MCP
- The official MCP is great at execution
- I'm building the intelligence layer on top
- Complementary, not competitive
❌ Not a generic text-to-SQL tool
- Purpose-built for ClickHouse
- Optimized for real-time analytics and investigations
- Won't support other databases
❌ Not a consumer product (yet)
- Starting as open source
- LSports is design partner
- May evolve into hosted service later
Initial Tech Direction (Subject to Change)
I'm starting with these technologies, but may adjust as I build:
Agent Core:
- Language: TypeScript or Python (TBD based on MCP integration)
- LLM: Claude or GPT-4 (testing both)
- MCP Integration: Official ClickHouse MCP
ClickHouse Access:
- Primary: Official ClickHouse MCP
- Fallback: Direct client if needed
Semantic Layer:
- Configuration files (format TBD)
- Schema caching (Redis or in-memory)
Observability:
- Query logging and performance tracking
- Error monitoring
This will evolve as I learn what actually works. Building in public means sharing the journey, including what doesn't work.
How You Can Help
If you're using ClickHouse + AI agents:
1. What questions are you trying to answer with your data?
   - I want to understand common investigation patterns
   - What templates would be most useful?
2. What's broken in current solutions?
   - Official MCP pain points?
   - LLM SQL generation issues?
   - Performance problems?
3. What would make this 10x better for you?
   - Features you need?
   - Integration requirements?
   - Use cases I'm missing?
GitHub: [Coming this week - will update post]
Follow progress:
- Twitter: https://x.com/zevikgg
- LinkedIn: https://linkedin.com/in/zeevgerstner
Or just comment below. I read everything.
Where This is Heading
The Product Vision:
In 12 months, this should be the go-to solution for teams running ClickHouse + AI agents:
- Production-ready and battle-tested
- Handles 80%+ of investigation patterns out of the box
- Active community contributing templates
- Used by teams beyond just sports analytics
Personal Goals:
If I become known as a ClickHouse + AI expert along the way—great. If I end up speaking at conferences about this—even better.
But the real measure of success is: Did we solve a real problem that others have too?
Closing Thoughts
The official ClickHouse MCP is excellent at execution.
But execution alone isn't enough for production investigations. You need intelligence:
- Understanding of business context
- Query optimization for ClickHouse
- Safety guarantees
- Investigation workflows
That's what I'm building.
Standing on the shoulders of giants (the official MCP), building the intelligence layer that's missing.
Want to follow along? Drop a comment, star the repo (coming this week), or reach out directly.
Let's build something useful. Together. In the open.
Zevik - Group Manager, Data Intelligence @ LSports
Building ClickHouse investigation agents in public