Building an Intelligent ClickHouse Investigation Agent (The Official MCP Isn't Enough)

For the last six months, I've been using the official ClickHouse MCP every day to investigate sports data incidents at LSports. I use it primarily through Cursor (the AI-powered code editor), though it works with Claude Desktop and other MCP clients too.

The official MCP is excellent at what it does: executing SQL queries and providing schema introspection.

But every day, I hit the same wall: I still write SQL manually.

So I'm building the intelligence layer on top. In the open. Here's why.


The Official ClickHouse MCP is Good (Really)

ClickHouse Inc. built a solid MCP server. It does exactly what a database connector should do:

  • ✅ Executes SQL queries reliably
  • ✅ Provides schema introspection (list databases, tables, columns)
  • ✅ Proper security (readonly mode)
  • ✅ Active maintenance and updates

For generic database access, it works great. I use it every day.

But it's missing the intelligence layer.


The Daily Pain Points

Let me show you what a typical investigation looks like:

Pain #1: Manual SQL Writing (Every. Single. Time.)

Me in Cursor: "Show me NBA incidents from yesterday"

Cursor (using official ClickHouse MCP):

-- Cursor writes this from scratch every time
SELECT * FROM incidents 
WHERE sport = 'Basketball' 
  AND league = 'NBA' 
  AND timestamp >= yesterday()
LIMIT 100

What I actually want:

  • Template already exists
  • Just fill in parameters: sport=NBA, timeframe=yesterday
  • Instant execution
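As a rough sketch of that fast path (the template text and parameter names here are hypothetical, not the real implementation):

```typescript
// Hypothetical pre-optimized template: placeholders get filled, never regenerated.
const incidentTemplate =
  "SELECT * FROM incidents WHERE sport = '{sport}' " +
  "AND timestamp >= {timeframe_start} ORDER BY timestamp DESC LIMIT 100";

// Replace each {param} placeholder with a validated value.
function fillTemplate(sql: string, params: Record<string, string>): string {
  return sql.replace(/\{(\w+)\}/g, (_, name) => {
    const value = params[name];
    if (value === undefined) throw new Error(`Missing parameter: ${name}`);
    return value;
  });
}

const query = fillTemplate(incidentTemplate, {
  sport: "Basketball",
  timeframe_start: "yesterday()",
});
```

No SQL generation, no variance: the same question always produces the same pre-tuned query.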

Pain #2: No Safety Checks

Bad query generated:

SELECT * FROM incidents WHERE description LIKE '%timeout%'

Result: 💥 Full table scan, times out, wastes cluster resources

What I need:

  • Validate it filters on ORDER BY columns (so ClickHouse can use its primary index)
  • Enforce LIMIT clause
  • Check query won't cause full table scan
  • Timeout enforcement
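These checks can start as plain static validation before anything reaches the cluster. A minimal sketch (not the real implementation; the hardcoded `orderByColumns` stands in for a schema lookup):

```typescript
// Columns in the table's ORDER BY key; would come from schema introspection.
const orderByColumns = ["timestamp", "sport"];

// Return a list of problems; an empty list means the query may run.
function validateQuery(sql: string): string[] {
  const problems: string[] = [];
  if (!sql.toUpperCase().includes("LIMIT")) {
    problems.push("Missing LIMIT clause (unbounded result set)");
  }
  // The WHERE clause should touch at least one sorting-key column.
  const filtersOnSortKey = orderByColumns.some((col) =>
    new RegExp(`WHERE[\\s\\S]*\\b${col}\\b`, "i").test(sql)
  );
  if (!filtersOnSortKey) {
    problems.push("WHERE does not use ORDER BY columns (likely full table scan)");
  }
  // Leading-wildcard LIKE defeats index use entirely.
  if (/LIKE\s+'%/i.test(sql)) {
    problems.push("Leading-wildcard LIKE cannot use indexes");
  }
  return problems;
}
```

The bad query above fails all three checks; a templated query passes by construction.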

Pain #3: No Business Context

The AI (whether in Cursor or Claude) doesn't understand my domain:

  • What is an "incident" in my schema?
  • How do I calculate "settlement quality"?
  • Which tables should be JOINed for investigations?
  • What are the common investigation patterns?

Every query starts from zero knowledge.

Pain #4: No Investigation Workflow

Complex questions require 5-10 queries:

  1. Query match events around specific time
  2. Query market changes
  3. Query incidents
  4. Manually correlate the results
  5. Ask follow-up questions
  6. Repeat

No context retention. No correlation. Manual synthesis every time.


What I'm Building

An intelligent investigation agent that sits on top of the official ClickHouse MCP.

Think of it this way:

Official ClickHouse MCP  →  My Investigation Agent
(Database Driver)            (Intelligent Engine)
     ↓                             ↓
  psycopg2                    Django ORM
  AWS SDK                     Serverless Framework
  HTTP client                 GraphQL client

Architecture (High-Level)

User Question (Natural Language)
    ↓
Intent Parser (LLM - Claude, GPT-4, or other)
    ↓
Semantic Layer (maps business concepts to ClickHouse)
    ↓
Query Template Selection (30-50 pre-optimized patterns)
    ↓
Parameter Extraction & Validation
    ↓
Official ClickHouse MCP (executes the SQL)
    ↓
ClickHouse Database
    ↓
Result Formatting & Insights
    ↓
Response (via Cursor, Claude Desktop, or any MCP client)
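The flow above could be orchestrated roughly like this. Every stage is a stub standing in for the real component, and all names are hypothetical:

```typescript
type Intent = { concept: string; params: Record<string, string> };

// Intent parsing would be done by an LLM; here, a trivial keyword match.
function parseIntent(question: string): Intent {
  const sport = question.includes("NBA") ? "NBA" : "unknown";
  return { concept: "incidents", params: { sport } };
}

// Template selection from the pre-optimized library (hardcoded here).
const selectTemplate = (_intent: Intent): string =>
  "SELECT * FROM incidents WHERE sport = '{sport}' LIMIT 100";

const fillParams = (sql: string, params: Record<string, string>): string =>
  sql.replace(/\{(\w+)\}/g, (_, k) => params[k] ?? "");

// The official ClickHouse MCP would execute the SQL; stubbed here.
const execute = (sql: string): string[] => [`rows for: ${sql}`];

function investigate(question: string): string[] {
  const intent = parseIntent(question);
  const sql = fillParams(selectTemplate(intent), intent.params);
  return execute(sql);
}
```

The point of the layering: everything above the `execute` step is the intelligence layer; everything below it already exists in the official MCP.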

The Four Intelligence Layers

Layer 1: Semantic Understanding

  • Maps "incidents" to specific tables/columns
  • Knows business metrics and how to calculate them
  • Understands investigation patterns
  • Domain-specific knowledge baked in
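For illustration, the semantic layer might start as nothing fancier than structured configuration mapping business terms to physical schema (every table, column, and metric name below is made up):

```typescript
// Hypothetical semantic-layer entries: business concept -> physical schema.
const semanticLayer = {
  incidents: {
    table: "incidents",
    orderBy: ["timestamp", "sport"],
    joins: { matches: "incidents.match_id = matches.id" },
  },
  settlement_quality: {
    // A business metric expressed as a reusable SQL fragment.
    expression: "countIf(status = 'settled_correctly') / count()",
    table: "settlements",
  },
} as const;

// Resolve a business term to the table the agent should query.
function resolveConcept(term: keyof typeof semanticLayer): string {
  return semanticLayer[term].table;
}
```

The LLM never has to guess what an "incident" is; the mapping is data, versioned alongside the templates.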

Layer 2: Query Intelligence

  • 30-50 pre-optimized ClickHouse query templates
  • LLM picks the right template and fills parameters
  • Only generates SQL from scratch for edge cases
  • Learns from usage to create new templates

Layer 3: Safety & Performance

  • Validates that queries filter on ORDER BY (sorting-key) columns so ClickHouse can use its primary index
  • Enforces LIMIT clauses (prevents unbounded results)
  • Checks query cost before execution
  • Targets < 5 second execution time
  • Timeout enforcement at 4.5 seconds
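The timeout budget can be enforced with a simple race against the query. A sketch, with `runQuery` standing in for the call through the official MCP (and a much shorter budget so the example runs quickly):

```typescript
// Reject any query that exceeds its time budget (4.5s in production).
function withTimeout<T>(work: Promise<T>, ms: number): Promise<T> {
  return Promise.race([
    work,
    new Promise<T>((_, reject) =>
      setTimeout(() => reject(new Error(`Query exceeded ${ms}ms budget`)), ms)
    ),
  ]);
}

// Stand-in for executing SQL via the official ClickHouse MCP.
const runQuery = (sql: string): Promise<string> =>
  new Promise((resolve) => setTimeout(() => resolve(`rows for ${sql}`), 10));

async function demo() {
  const ok = await withTimeout(runQuery("SELECT 1"), 100);
  console.log(ok); // resolves well inside the budget
}
demo();
```

A production version would also cancel the query server-side (e.g. via ClickHouse's own `max_execution_time` setting) rather than just abandoning the promise.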

Layer 4: Investigation Features

  • Multi-turn conversation state (remembers context)
  • Automatic correlation between queries
  • Proactive insights ("I noticed high incident rate at 23:42. Want me to check affected markets?")
  • Rich result formatting
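Multi-turn state can start as nothing more than an append-only context the agent consults on follow-ups. A hypothetical shape:

```typescript
// Minimal investigation context: each turn's question and key findings.
interface Turn {
  question: string;
  findings: string[];
}

class InvestigationContext {
  private turns: Turn[] = [];

  record(question: string, findings: string[]): void {
    this.turns.push({ question, findings });
  }

  // Everything learned so far, fed into the prompt for the next query.
  summary(): string {
    return this.turns
      .map((t, i) => `Turn ${i + 1}: ${t.question} -> ${t.findings.join("; ")}`)
      .join("\n");
  }
}
```

When the user asks "what about the affected markets?", the agent prepends this summary so "affected" resolves to the 23:42 spike from the previous turn instead of starting from zero.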

Why ClickHouse Needs Special Handling

You can't just let an LLM generate ClickHouse queries naively:

Performance Gotchas:

  • ❌ Wrong ORDER BY → 100x slower query
  • ❌ Missing LIMIT → out of memory
  • ❌ Bad JOIN → cluster-wide performance impact
  • ❌ High-cardinality GROUP BY → resource exhaustion

ClickHouse-Specific Optimizations:

  • Filter on ORDER BY (sorting-key) columns in WHERE so granules can be pruned
  • PREWHERE for heavy filtering
  • Materialized views for aggregations
  • Skip indexes for selective queries
  • Different engine types (MergeTree, ReplacingMergeTree, etc.)
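To make the first two points concrete, here are the query shapes side by side (table and sorting key are hypothetical, matching the examples earlier in the post):

```typescript
// Slow: leading-wildcard LIKE on a non-key column, no LIMIT. Every granule
// is scanned and all columns are read for every candidate row.
const naive = `
  SELECT * FROM incidents
  WHERE description LIKE '%timeout%'`;

// Faster: the timestamp filter hits the ORDER BY (sorting) key so granules
// get pruned, and PREWHERE reads only 'description' before fetching the rest.
const optimized = `
  SELECT * FROM incidents
  PREWHERE description LIKE '%timeout%'
  WHERE timestamp >= yesterday()
  LIMIT 100`;
```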

This is a big part of why no one has shipped a production-grade ClickHouse agent yet: without proper guardrails, it's too risky.


Template-First Approach (Why It's Actually Perfect for Agents)

Despite the agent hype, I'm going template-first:

// Fast Path (80% of queries)
const template = {
  pattern: "incidents for {sport} {timeframe}",
  // {param} placeholders are validated and bound before execution
  sql: `
    SELECT 
      timestamp, sport, league, 
      incident_type, severity
    FROM incidents
    WHERE sport = {sport}
      AND timestamp >= {timeframe_start}
    ORDER BY timestamp DESC
    LIMIT 100
  `
};

Why templates?

  1. ✅ Guaranteed ClickHouse optimization (pre-tuned)
  2. ✅ Predictable performance (sub-5-second target)
  3. ✅ Safe by default (no dangerous queries)
  4. ✅ Handles 80% of use cases
  5. ✅ Perfect for autonomous agents (stronger guardrails)

For the 20% edge cases: LLM generates SQL, but with heavy validation.

This isn't limiting—it's intelligent:

  • Template = muscle memory for common patterns
  • LLM = creativity for novel questions
  • Best of both worlds
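The routing decision itself is simple. A sketch of the dispatcher, where the template table and the LLM fallback are both stand-ins:

```typescript
type Route = { source: "template" | "llm"; sql: string };

// Hypothetical template library: a pattern to match and pre-tuned SQL.
const templates = [
  {
    pattern: /incidents for (\w+)/i,
    sql: "SELECT * FROM incidents WHERE sport = '$1' LIMIT 100",
  },
];

// Stand-in for LLM SQL generation followed by heavy validation.
const generateWithLLM = (q: string): string => `/* validated LLM SQL for: ${q} */`;

// Templates first; LLM generation only when nothing matches.
function route(question: string): Route {
  for (const t of templates) {
    const m = question.match(t.pattern);
    if (m) return { source: "template", sql: t.sql.replace("$1", m[1]) };
  }
  return { source: "llm", sql: generateWithLLM(question) };
}
```

Common questions never touch the LLM's SQL generation at all; novel ones do, but only behind the validation layer.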

Why Build This?

Three reasons:

1. I Need It (Real Pain, Daily)

This isn't theoretical. I use ClickHouse + Claude for data investigation every day. The official MCP gets me 80% there. I'm building the missing 20%.

2. Market Gap

There is no production-grade intelligent agent for ClickHouse:

  • Generic text-to-SQL tools (Vanna, Defog) don't handle ClickHouse optimization
  • Other database agents (for Postgres/MySQL) don't transfer
  • The official MCP is execution only, not intelligence

Someone needs to build this. Might as well be me.

3. Technical Challenge

ClickHouse + LLM is genuinely hard:

  • Different SQL dialect (arrayJoin, PREWHERE, engine-specific syntax)
  • Query optimization is critical (ORDER BY, partitions, indexes)
  • Performance unpredictable without templates
  • Safety requirements for production use

This is the kind of problem that's fun to solve.


Building in Public

I'm building this completely in the open:

  • ✅ Open source from day 1 (Apache 2.0)
  • ✅ Weekly blog posts on progress
  • ✅ GitHub repo with full code
  • ✅ Community feedback welcomed
  • ✅ LSports as design partner (validates with real data)

Why?

  • The problem is real and affects many teams
  • Building in public gets better feedback faster
  • Open source means others can contribute and improve it
  • Someone needs to solve this—might as well share the solution

Not trying to:

  • Compete with ClickHouse Inc (building on their foundation)
  • Keep it proprietary
  • Build in secret

The Roadmap

Week 1-2: POC (Proof of Concept)

  • 10 investigation query templates
  • Basic LLM integration
  • Validation layer
  • Integration with official ClickHouse MCP
  • Success: Handles 40% of my daily investigations

Month 2: Core Engine

  • 30-50 query templates (80% coverage)
  • Complete semantic layer
  • Multi-turn conversation state
  • Performance monitoring
  • Success: Replaces manual SQL for most use cases

Month 3: Launch

  • Production-ready release
  • MCP server implementation (works in Cursor, Claude Desktop, ChatGPT Apps)
  • REST API + SDKs
  • Documentation and examples
  • Conference talk submissions

What I'm NOT Building

Important to be clear:

Not replacing the official ClickHouse MCP

  • The official MCP is great at execution
  • I'm building the intelligence layer on top
  • Complementary, not competitive

Not a generic text-to-SQL tool

  • Purpose-built for ClickHouse
  • Optimized for real-time analytics and investigations
  • Won't support other databases

Not a consumer product (yet)

  • Starting as open source
  • LSports is design partner
  • May evolve into hosted service later

Initial Tech Direction (Subject to Change)

I'm starting with these technologies, but may adjust as I build:

Agent Core:
- Language: TypeScript or Python (TBD based on MCP integration)
- LLM: Claude or GPT-4 (testing both)
- MCP Integration: Official ClickHouse MCP

ClickHouse Access:
- Primary: Official ClickHouse MCP
- Fallback: Direct client if needed

Semantic Layer:
- Configuration files (format TBD)
- Schema caching (Redis or in-memory)

Observability:
- Query logging and performance tracking
- Error monitoring

This will evolve as I learn what actually works. Building in public means sharing the journey, including what doesn't work.


How You Can Help

If you're using ClickHouse + AI agents:

  1. What questions are you trying to answer with your data?

    • I want to understand common investigation patterns
    • What templates would be most useful?
  2. What's broken in current solutions?

    • Official MCP pain points?
    • LLM SQL generation issues?
    • Performance problems?
  3. What would make this 10x better for you?

    • Features you need?
    • Integration requirements?
    • Use cases I'm missing?

GitHub: [Coming this week - will update post]

Or just comment below. I read everything.


Where This is Heading

The Product Vision:

In 12 months, this should be the go-to solution for teams running ClickHouse + AI agents:

  • Production-ready and battle-tested
  • Handles 80%+ of investigation patterns out of the box
  • Active community contributing templates
  • Used by teams beyond just sports analytics

Personal Goals:

If I become known as a ClickHouse + AI expert along the way—great. If I end up speaking at conferences about this—even better.

But the real measure of success is: Did we solve a real problem that others have too?


Closing Thoughts

The official ClickHouse MCP is excellent at execution.

But execution alone isn't enough for production investigations. You need intelligence:

  • Understanding of business context
  • Query optimization for ClickHouse
  • Safety guarantees
  • Investigation workflows

That's what I'm building.

Standing on the shoulders of giants (the official MCP), building the intelligence layer that's missing.

Want to follow along? Drop a comment, star the repo (coming this week), or reach out directly.

Let's build something useful. Together. In the open.


Zevik - Group Manager, Data Intelligence @ LSports
Building ClickHouse investigation agents in public
