Building an Intelligent ClickHouse Investigation Agent (The Official MCP Isn't Enough)

For the last six months, I've been using the official ClickHouse MCP every day to investigate sports data incidents at LSports. I use it primarily through Cursor (the AI-powered code editor), though it works with Claude Desktop and other MCP clients too.

The official MCP is excellent at what it does: executing SQL queries and providing schema introspection.

But every day, I hit the same wall: I still write SQL manually.

So I'm building the intelligence layer on top. In the open. Here's why.


The Official ClickHouse MCP is Good (Really)

ClickHouse Inc. built a solid MCP server. It does exactly what a database connector should do:

  • ✅ Executes SQL queries reliably
  • ✅ Provides schema introspection (list databases, tables, columns)
  • ✅ Proper security (readonly mode)
  • ✅ Active maintenance and updates

For generic database access, it works great. I use it every day.

But it's missing the intelligence layer.


The Daily Pain Points

Let me show you what a typical investigation looks like:

Pain #1: Manual SQL Writing (Every. Single. Time.)

Me in Cursor: "Show me NBA incidents from yesterday"

Cursor (using official ClickHouse MCP):

-- Cursor writes this from scratch every time
SELECT * FROM incidents 
WHERE sport = 'Basketball' 
  AND league = 'NBA' 
  AND timestamp >= yesterday()
LIMIT 100

What I actually want:

  • Template already exists
  • Just fill in parameters: sport=NBA, timeframe=yesterday
  • Instant execution
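As a rough sketch of that fast path (the template text and parameter names here are hypothetical, not the real implementation):

```typescript
// Hypothetical pre-optimized template: placeholders get filled, never regenerated.
const incidentTemplate =
  "SELECT * FROM incidents WHERE sport = '{sport}' " +
  "AND timestamp >= {timeframe_start} ORDER BY timestamp DESC LIMIT 100";

// Replace each {param} placeholder with a validated value.
function fillTemplate(sql: string, params: Record<string, string>): string {
  return sql.replace(/\{(\w+)\}/g, (_, name) => {
    const value = params[name];
    if (value === undefined) throw new Error(`Missing parameter: ${name}`);
    return value;
  });
}

const query = fillTemplate(incidentTemplate, {
  sport: "Basketball",
  timeframe_start: "yesterday()",
});
```

No SQL generation, no variance: the same question always produces the same pre-tuned query.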

Pain #2: No Safety Checks

Bad query generated:

SELECT * FROM incidents WHERE description LIKE '%timeout%'

Result: 💥 Full table scan, times out, wastes cluster resources

What I need:

  • Validate it filters on ORDER BY columns (so ClickHouse can use its primary index)
  • Enforce LIMIT clause
  • Check query won't cause full table scan
  • Timeout enforcement
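These checks can start as plain static validation before anything reaches the cluster. A minimal sketch (not the real implementation; the hardcoded `orderByColumns` stands in for a schema lookup):

```typescript
// Columns in the table's ORDER BY key; would come from schema introspection.
const orderByColumns = ["timestamp", "sport"];

// Return a list of problems; an empty list means the query may run.
function validateQuery(sql: string): string[] {
  const problems: string[] = [];
  if (!sql.toUpperCase().includes("LIMIT")) {
    problems.push("Missing LIMIT clause (unbounded result set)");
  }
  // The WHERE clause should touch at least one sorting-key column.
  const filtersOnSortKey = orderByColumns.some((col) =>
    new RegExp(`WHERE[\\s\\S]*\\b${col}\\b`, "i").test(sql)
  );
  if (!filtersOnSortKey) {
    problems.push("WHERE does not use ORDER BY columns (likely full table scan)");
  }
  // Leading-wildcard LIKE defeats index use entirely.
  if (/LIKE\s+'%/i.test(sql)) {
    problems.push("Leading-wildcard LIKE cannot use indexes");
  }
  return problems;
}
```

The bad query above fails all three checks; a templated query passes by construction.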

Pain #3: No Business Context

The AI (whether in Cursor or Claude) doesn't understand my domain:

  • What is an "incident" in my schema?
  • How do I calculate "settlement quality"?
  • Which tables should be JOINed for investigations?
  • What are the common investigation patterns?

Every query starts from zero knowledge.

Pain #4: No Investigation Workflow

Complex questions require 5-10 queries:

  1. Query match events around specific time
  2. Query market changes
  3. Query incidents
  4. Manually correlate the results
  5. Ask follow-up questions
  6. Repeat

No context retention. No correlation. Manual synthesis every time.


What I'm Building

An intelligent investigation agent that sits on top of the official ClickHouse MCP.

Think of it this way:

Official ClickHouse MCP  →  My Investigation Agent
(Database Driver)            (Intelligent Engine)
     ↓                             ↓
  psycopg2                    Django ORM
  AWS SDK                     Serverless Framework
  HTTP client                 GraphQL client

Architecture (High-Level)

User Question (Natural Language)
    ↓
Intent Parser (LLM - Claude, GPT-4, or other)
    ↓
Semantic Layer (maps business concepts to ClickHouse)
    ↓
Query Template Selection (30-50 pre-optimized patterns)
    ↓
Parameter Extraction & Validation
    ↓
Official ClickHouse MCP (executes the SQL)
    ↓
ClickHouse Database
    ↓
Result Formatting & Insights
    ↓
Response (via Cursor, Claude Desktop, or any MCP client)
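The flow above could be orchestrated roughly like this. Every stage is a stub standing in for the real component, and all names are hypothetical:

```typescript
type Intent = { concept: string; params: Record<string, string> };

// Intent parsing would be done by an LLM; here, a trivial keyword match.
function parseIntent(question: string): Intent {
  const sport = question.includes("NBA") ? "NBA" : "unknown";
  return { concept: "incidents", params: { sport } };
}

// Template selection from the pre-optimized library (hardcoded here).
const selectTemplate = (_intent: Intent): string =>
  "SELECT * FROM incidents WHERE sport = '{sport}' LIMIT 100";

const fillParams = (sql: string, params: Record<string, string>): string =>
  sql.replace(/\{(\w+)\}/g, (_, k) => params[k] ?? "");

// The official ClickHouse MCP would execute the SQL; stubbed here.
const execute = (sql: string): string[] => [`rows for: ${sql}`];

function investigate(question: string): string[] {
  const intent = parseIntent(question);
  const sql = fillParams(selectTemplate(intent), intent.params);
  return execute(sql);
}
```

The point of the layering: everything above the `execute` step is the intelligence layer; everything below it already exists in the official MCP.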

The Four Intelligence Layers

Layer 1: Semantic Understanding

  • Maps "incidents" to specific tables/columns
  • Knows business metrics and how to calculate them
  • Understands investigation patterns
  • Domain-specific knowledge baked in
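For illustration, the semantic layer might start as nothing fancier than structured configuration mapping business terms to physical schema (every table, column, and metric name below is made up):

```typescript
// Hypothetical semantic-layer entries: business concept -> physical schema.
const semanticLayer = {
  incidents: {
    table: "incidents",
    orderBy: ["timestamp", "sport"],
    joins: { matches: "incidents.match_id = matches.id" },
  },
  settlement_quality: {
    // A business metric expressed as a reusable SQL fragment.
    expression: "countIf(status = 'settled_correctly') / count()",
    table: "settlements",
  },
} as const;

// Resolve a business term to the table the agent should query.
function resolveConcept(term: keyof typeof semanticLayer): string {
  return semanticLayer[term].table;
}
```

The LLM never has to guess what an "incident" is; the mapping is data, versioned alongside the templates.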

Layer 2: Query Intelligence

  • 30-50 pre-optimized ClickHouse query templates
  • LLM picks the right template and fills parameters
  • Only generates SQL from scratch for edge cases
  • Learns from usage to create new templates

Layer 3: Safety & Performance

  • Validates that queries filter on ORDER BY (sorting-key) columns so ClickHouse can use its primary index
  • Enforces LIMIT clauses (prevents unbounded results)
  • Checks query cost before execution
  • Targets < 5 second execution time
  • Timeout enforcement at 4.5 seconds
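The timeout budget can be enforced with a simple race against the query. A sketch, with `runQuery` standing in for the call through the official MCP (and a much shorter budget so the example runs quickly):

```typescript
// Reject any query that exceeds its time budget (4.5s in production).
function withTimeout<T>(work: Promise<T>, ms: number): Promise<T> {
  return Promise.race([
    work,
    new Promise<T>((_, reject) =>
      setTimeout(() => reject(new Error(`Query exceeded ${ms}ms budget`)), ms)
    ),
  ]);
}

// Stand-in for executing SQL via the official ClickHouse MCP.
const runQuery = (sql: string): Promise<string> =>
  new Promise((resolve) => setTimeout(() => resolve(`rows for ${sql}`), 10));

async function demo() {
  const ok = await withTimeout(runQuery("SELECT 1"), 100);
  console.log(ok); // resolves well inside the budget
}
demo();
```

A production version would also cancel the query server-side (e.g. via ClickHouse's own `max_execution_time` setting) rather than just abandoning the promise.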

Layer 4: Investigation Features

  • Multi-turn conversation state (remembers context)
  • Automatic correlation between queries
  • Proactive insights ("I noticed high incident rate at 23:42. Want me to check affected markets?")
  • Rich result formatting
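Multi-turn state can start as nothing more than an append-only context the agent consults on follow-ups. A hypothetical shape:

```typescript
// Minimal investigation context: each turn's question and key findings.
interface Turn {
  question: string;
  findings: string[];
}

class InvestigationContext {
  private turns: Turn[] = [];

  record(question: string, findings: string[]): void {
    this.turns.push({ question, findings });
  }

  // Everything learned so far, fed into the prompt for the next query.
  summary(): string {
    return this.turns
      .map((t, i) => `Turn ${i + 1}: ${t.question} -> ${t.findings.join("; ")}`)
      .join("\n");
  }
}
```

When the user asks "what about the affected markets?", the agent prepends this summary so "affected" resolves to the 23:42 spike from the previous turn instead of starting from zero.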

Why ClickHouse Needs Special Handling

You can't just let an LLM generate ClickHouse queries naively:

Performance Gotchas:

  • ❌ Wrong ORDER BY → 100x slower query
  • ❌ Missing LIMIT → out of memory
  • ❌ Bad JOIN → cluster-wide performance impact
  • ❌ High-cardinality GROUP BY → resource exhaustion

ClickHouse-Specific Optimizations:

  • Filter on ORDER BY (sorting-key) columns in WHERE so granules can be pruned
  • PREWHERE for heavy filtering
  • Materialized views for aggregations
  • Skip indexes for selective queries
  • Different engine types (MergeTree, ReplacingMergeTree, etc.)
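To make the first two points concrete, here are the query shapes side by side (table and sorting key are hypothetical, matching the examples earlier in the post):

```typescript
// Slow: leading-wildcard LIKE on a non-key column, no LIMIT. Every granule
// is scanned and all columns are read for every candidate row.
const naive = `
  SELECT * FROM incidents
  WHERE description LIKE '%timeout%'`;

// Faster: the timestamp filter hits the ORDER BY (sorting) key so granules
// get pruned, and PREWHERE reads only 'description' before fetching the rest.
const optimized = `
  SELECT * FROM incidents
  PREWHERE description LIKE '%timeout%'
  WHERE timestamp >= yesterday()
  LIMIT 100`;
```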

This is a big part of why no one has shipped a production-grade ClickHouse agent yet: without proper guardrails, it's too risky.


Template-First Approach (Why It's Actually Perfect for Agents)

Despite the agent hype, I'm going template-first:

// Fast Path (80% of queries)
const template = {
  pattern: "incidents for {sport} {timeframe}",
  // {param} placeholders are validated and bound before execution
  sql: `
    SELECT 
      timestamp, sport, league, 
      incident_type, severity
    FROM incidents
    WHERE sport = {sport}
      AND timestamp >= {timeframe_start}
    ORDER BY timestamp DESC
    LIMIT 100
  `
};

Why templates?

  1. ✅ Guaranteed ClickHouse optimization (pre-tuned)
  2. ✅ Predictable performance (sub-5-second target)
  3. ✅ Safe by default (no dangerous queries)
  4. ✅ Handles 80% of use cases
  5. ✅ Perfect for autonomous agents (stronger guardrails)

For the 20% edge cases: LLM generates SQL, but with heavy validation.

This isn't limiting—it's intelligent:

  • Template = muscle memory for common patterns
  • LLM = creativity for novel questions
  • Best of both worlds
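The routing decision itself is simple. A sketch of the dispatcher, where the template table and the LLM fallback are both stand-ins:

```typescript
type Route = { source: "template" | "llm"; sql: string };

// Hypothetical template library: a pattern to match and pre-tuned SQL.
const templates = [
  {
    pattern: /incidents for (\w+)/i,
    sql: "SELECT * FROM incidents WHERE sport = '$1' LIMIT 100",
  },
];

// Stand-in for LLM SQL generation followed by heavy validation.
const generateWithLLM = (q: string): string => `/* validated LLM SQL for: ${q} */`;

// Templates first; LLM generation only when nothing matches.
function route(question: string): Route {
  for (const t of templates) {
    const m = question.match(t.pattern);
    if (m) return { source: "template", sql: t.sql.replace("$1", m[1]) };
  }
  return { source: "llm", sql: generateWithLLM(question) };
}
```

Common questions never touch the LLM's SQL generation at all; novel ones do, but only behind the validation layer.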

Why Build This?

Three reasons:

1. I Need It (Real Pain, Daily)

This isn't theoretical. I use ClickHouse + Claude for data investigation every day. The official MCP gets me 80% there. I'm building the missing 20%.

2. Market Gap

There is no production-grade intelligent agent for ClickHouse:

  • Generic text-to-SQL tools (Vanna, Defog) don't handle ClickHouse optimization
  • Other database agents (for Postgres/MySQL) don't transfer
  • The official MCP is execution only, not intelligence

Someone needs to build this. Might as well be me.

3. Technical Challenge

ClickHouse + LLM is genuinely hard:

  • Different SQL dialect (arrayJoin, PREWHERE, engine-specific syntax)
  • Query optimization is critical (ORDER BY, partitions, indexes)
  • Performance unpredictable without templates
  • Safety requirements for production use

This is the kind of problem that's fun to solve.


Building in Public

I'm building this completely in the open:

  • ✅ Open source from day 1 (Apache 2.0)
  • ✅ Weekly blog posts on progress
  • ✅ GitHub repo with full code
  • ✅ Community feedback welcomed
  • ✅ LSports as design partner (validates with real data)

Why?

  • The problem is real and affects many teams
  • Building in public gets better feedback faster
  • Open source means others can contribute and improve it
  • Someone needs to solve this—might as well share the solution

Not trying to:

  • Compete with ClickHouse Inc (building on their foundation)
  • Keep it proprietary
  • Build in secret

The Roadmap

Week 1-2: POC (Proof of Concept)

  • 10 investigation query templates
  • Basic LLM integration
  • Validation layer
  • Integration with official ClickHouse MCP
  • Success: Handles 40% of my daily investigations

Month 2: Core Engine

  • 30-50 query templates (80% coverage)
  • Complete semantic layer
  • Multi-turn conversation state
  • Performance monitoring
  • Success: Replaces manual SQL for most use cases

Month 3: Launch

  • Production-ready release
  • MCP server implementation (works in Cursor, Claude Desktop, ChatGPT Apps)
  • REST API + SDKs
  • Documentation and examples
  • Conference talk submissions

What I'm NOT Building

Important to be clear:

Not replacing the official ClickHouse MCP

  • The official MCP is great at execution
  • I'm building the intelligence layer on top
  • Complementary, not competitive

Not a generic text-to-SQL tool

  • Purpose-built for ClickHouse
  • Optimized for real-time analytics and investigations
  • Won't support other databases

Not a consumer product (yet)

  • Starting as open source
  • LSports is design partner
  • May evolve into hosted service later

Initial Tech Direction (Subject to Change)

I'm starting with these technologies, but may adjust as I build:

Agent Core:
- Language: TypeScript or Python (TBD based on MCP integration)
- LLM: Claude or GPT-4 (testing both)
- MCP Integration: Official ClickHouse MCP

ClickHouse Access:
- Primary: Official ClickHouse MCP
- Fallback: Direct client if needed

Semantic Layer:
- Configuration files (format TBD)
- Schema caching (Redis or in-memory)

Observability:
- Query logging and performance tracking
- Error monitoring

This will evolve as I learn what actually works. Building in public means sharing the journey, including what doesn't work.


How You Can Help

If you're using ClickHouse + AI agents:

  1. What questions are you trying to answer with your data?

    • I want to understand common investigation patterns
    • What templates would be most useful?
  2. What's broken in current solutions?

    • Official MCP pain points?
    • LLM SQL generation issues?
    • Performance problems?
  3. What would make this 10x better for you?

    • Features you need?
    • Integration requirements?
    • Use cases I'm missing?

GitHub: [Coming this week - will update post]

Or just comment below. I read everything.


Where This is Heading

The Product Vision:

In 12 months, this should be the go-to solution for teams running ClickHouse + AI agents:

  • Production-ready and battle-tested
  • Handles 80%+ of investigation patterns out of the box
  • Active community contributing templates
  • Used by teams beyond just sports analytics

Personal Goals:

If I become known as a ClickHouse + AI expert along the way—great. If I end up speaking at conferences about this—even better.

But the real measure of success is: Did we solve a real problem that others have too?


Closing Thoughts

The official ClickHouse MCP is excellent at execution.

But execution alone isn't enough for production investigations. You need intelligence:

  • Understanding of business context
  • Query optimization for ClickHouse
  • Safety guarantees
  • Investigation workflows

That's what I'm building.

Standing on the shoulders of giants (the official MCP), building the intelligence layer that's missing.

Want to follow along? Drop a comment, star the repo (coming this week), or reach out directly.

Let's build something useful. Together. In the open.


Zevik - Group Manager, Data Intelligence @ LSports
Building ClickHouse investigation agents in public
