DEV Community

RAKESH THERANI

Building a Full-Stack Agentic AI Data Platform on ClickHouse: A Complete Architecture Guide

A production-grade, end-to-end agentic AI platform — chat UI, self-hosted LLM, MCP server, LLM observability, medallion data architecture, security guardrails, HA, and cost analysis. Same stack ClickHouse uses internally (DWAINE: used by 250+ employees, ~70% of internal analytics use cases covered, 50-70% workload reduction for the data-warehouse team). Adapted for a crypto-derivatives exchange.


Why I'm writing this

The phrase "AI platform" gets thrown around to describe everything from a single chatbot to a set of disconnected APIs. An actual production AI platform requires far more than a model and an interface — it needs data plumbing, semantic grounding, query safety, observability, role-based access, high availability, and a real cost model.

This is the architecture I shipped for a crypto exchange — covering all of those layers. Every component is open-source and battle-tested. Nothing is theoretical:

  • Data layer: PostgreSQL → CDC (Debezium + Kafka) → ClickHouse with medallion architecture (Raw → Staging → Marts)
  • LLM layer: Self-hosted Qwen 2.5 72B (Apache 2.0, all data stays on-prem) + business glossary for domain grounding
  • Tool layer: ClickHouse's official MCP server with bearer-token auth, SSRF protection, schema discovery + safe SQL execution
  • UX layer: LibreChat (open-source, SSO, role-based) — same chat UI ClickHouse acquired and uses for DWAINE
  • Observability layer: Langfuse (open-source, runs on ClickHouse) — every query, response, latency, cost tracked
  • Operations layer: HA cluster, query timeouts, memory caps, RBAC, full audit trail
  • JOIN strategy: tiered approach (pre-joined marts, dictionaries, UNION ALL, runtime fallback) so the LLM never writes a slow query

This is the same Agentic Data Stack ClickHouse open-sourced and uses internally. I'm documenting how to deploy it on top of any production database (this guide uses a crypto exchange as the worked example, but the architecture transfers to retail, fintech, marketing, ops).

If you're building or evaluating an AI platform for your own data — and you want something more substantial than "GPT-4 with a database connection" — this is the deep-dive.

Note: Costs and table examples reference a crypto-derivatives platform (large trade tables, real-time wallet flows). The architectural patterns and effort estimates apply universally. Numbers are real; mileage will vary.


Table of Contents

  1. Executive Summary
  2. What We Are Building
  3. Why ClickHouse
  4. Platform Architecture
  5. Data Consolidation — All DBs Into ClickHouse
  6. How the AI Agent Works
  7. What Questions Can Be Asked — By Team
  8. How Multi-Table Queries Work
  9. ClickHouse MCP Server
  10. LLM Selection
  11. How the LLM Learns Our Tables (No Training Required)
  12. Chat Interface — LibreChat
  13. Observability — Langfuse
  14. Medallion Architecture — Data Organization
  15. Security and Governance
  16. High Availability Architecture
  17. Implementation Roadmap
  18. Cost Analysis
  19. Risk Assessment
  20. Success Metrics
  21. Reference Links

1. Executive Summary

Objective

Build a unified data platform on ClickHouse that consolidates tables from all the crypto exchange databases (trading, wallets, users, risk, compliance) into one place, then layer an AI agent on top so any team member can ask questions in plain English and get instant answers.

The Problem Today

  • Data lives in multiple PostgreSQL databases across different services
  • Getting answers requires writing SQL, knowing which DB to query, or filing a ticket to engineering
  • Analytical queries on production PostgreSQL compete with live trading workloads
  • No single place to query across trading, wallet, user, and compliance data together

The Solution

Any team member types: "Show me top 10 users by futures volume this week who also had withdrawals > $50K"

The AI agent:
1. Understands the question (via business glossary)
2. Generates SQL across trade_future + wallet_transactions tables (both in ClickHouse)
3. Executes via MCP server (read-only, safe)
4. Returns formatted answer with table/chart

This works because all data from different databases is consolidated into ClickHouse tables — the LLM queries structured tables directly, not embeddings or vectors.

What ClickHouse Already Provides

ClickHouse has built a complete open-source Agentic Data Stack with a ready-to-deploy GitHub repo.

We are not building from scratch. We are deploying ClickHouse's proven stack with our data.

Expected Outcomes

| Metric | Target |
|---|---|
| Analytical query speed | 10-100x faster than PostgreSQL for aggregations |
| Production DB load reduction | 60-80% fewer analytical queries hitting production |
| Self-service data access | Any team member can query data without SQL knowledge |
| Cross-database queries | Single query across trading, wallets, users, risk data |
| Time to insight | Seconds instead of hours |

2. What We Are Building

Before (Today)

Trading Ops needs a report:
  → Asks engineering → Engineer writes SQL → Queries prod DB → Formats result → Sends back
  → Time: hours to days
  → Impact: Analytical query slows down production

Compliance needs user transaction history:
  → Files ticket → Engineer joins data from 3 DBs → Manual export → Sends CSV
  → Time: 1-3 days

After (With This Platform)

Trading Ops opens LibreChat:
  → Types: "Top 10 markets by volume this week with liquidation count"
  → AI agent generates SQL, queries ClickHouse, returns answer in 5 seconds
  → No engineering involvement, no production DB impact

Compliance opens LibreChat:
  → Types: "All transactions for user X in last 90 days including spot, futures, and withdrawals"
  → AI agent queries across all consolidated tables, returns complete history
  → Instant, audited, repeatable

Who Uses It

| Team | What They Ask |
|---|---|
| Trading Ops | Volume, market metrics, liquidations, position summaries |
| Risk | Exposure, liquidation cascades, concentration risk |
| Compliance | User transaction history, suspicious patterns, audit trails |
| Finance | Fee revenue, P&L, trading volume trends |
| Product | User activity, feature usage, market popularity |
| Engineering | System metrics, query performance, data quality checks |

3. Why ClickHouse

Production Proof Points in Crypto

Coinhall — Real-time DEX aggregator across 23 blockchains:

  • Processes billions of trade events
  • 20ms candlestick chart queries on massive datasets
  • Replaced PostgreSQL for analytics
  • Similar data model to the crypto exchange (trades, orders, market data)

CryptoHouse — Public ClickHouse-powered crypto analytics:

  • Bitcoin/Ethereum blockchain data at scale

DWAINE — ClickHouse's own internal AI agent (production):

  • 250+ internal users querying data via natural language
  • >200 daily messages across 50-70 daily conversations
  • ~70% of internal analytics use cases covered by the agent
  • ~50-70% workload reduction on the 3-person DWH team
  • Source: ClickHouse — How we made our data warehouse AI-first
  • This is exactly what we are building for the crypto exchange

ClickHouse vs PostgreSQL for Analytics

| Capability | PostgreSQL | ClickHouse |
|---|---|---|
| Storage model | Row-based (OLTP) | Columnar (OLAP) |
| Compression | 3-5x | 10-30x |
| Aggregation speed | Seconds-minutes | Milliseconds-seconds |
| Concurrent analytics | Impacts live trading | Isolated, zero production impact |
| Partitioning | Manual (daily) | Automatic (MergeTree) |
| Materialized views | Periodic refresh | Incremental (real-time triggers) |
| Cross-DB queries | Not possible | All data in one place |

Key Advantage — All Data In One Place

Today, querying across trading DB + wallet DB + user DB requires application-level joins or manual exports. In ClickHouse, all tables from all databases are consolidated — the AI agent (or any analyst) can query across everything in a single SQL statement.

Our largest trading table runs to hundreds of GB in PostgreSQL. With ClickHouse's columnar compression (10-30x), this drops by an order of magnitude, and column-specific queries (e.g., SUM(quantity) WHERE market_name = 'BTCPFC') only read the relevant columns.


4. Platform Architecture

This is ClickHouse's Agentic Data Stack — three layers, all open-source:

                         +------------------+
                         |   ALL TEAMS      |
                         |  (Trading Ops,   |
                         |   Risk, Finance, |
                         |   Compliance,    |
                         |   Product)       |
                         +--------+---------+
                                  |
                  ================|================
                  |     LAYER 1: CHAT (LibreChat)  |
                  |     - Web-based chat UI         |
                  |     - SSO login                 |
                  |     - Role-based access          |
                  |     - Conversation history       |
                  =================|================
                                  |
                  ================|=================
                  |     LAYER 2: LLM + MCP          |
                  |                                  |
                  |  User question (plain English)   |
                  |         |                        |
                  |         v                        |
                  |  Qwen 2.5 LLM (self-hosted)     |
                  |  + Business Glossary             |
                  |         |                        |
                  |         v                        |
                  |  ClickHouse MCP Server           |
                  |  (schema discovery + SQL exec)   |
                  |         |                        |
                  ===========|======================
                             |
                  ===========|======================
                  |  LAYER 3: DATA (ClickHouse)     |
                  |                                  |
                  |  +---------------------------+   |
                  |  | Trading DB tables:        |   |
                  |  |  trade_future             |   |
                  |  |  trade_spot               |   |
                  |  |  orders                   |   |
                  |  +---------------------------+   |
                  |  | Wallet DB tables:         |   |
                  |  |  wallet_transactions      |   |
                  |  |  deposits / withdrawals   |   |
                  |  +---------------------------+   |
                  |  | User DB tables:           |   |
                  |  |  users, kyc_status        |   |
                  |  +---------------------------+   |
                  |  | Risk DB tables:           |   |
                  |  |  positions, margins       |   |
                  |  +---------------------------+   |
                  |  | Pre-aggregated views:     |   |
                  |  |  volume_daily             |   |
                  |  |  pnl_daily                |   |
                  |  |  liquidation_metrics      |   |
                  |  |  candlestick_1m/1h/1d     |   |
                  |  +---------------------------+   |
                  |                                  |
                  ==================================+
                             ^
                             |
                  ===========|======================
                  |  CDC PIPELINES (Debezium+Kafka) |
                  |  Real-time sync from all DBs    |
                  ==================================+
                        ^    ^    ^    ^
                        |    |    |    |
                   +-------+ +------+ +----+ +----+
                  |Trading| |Wallet| |User| |Risk|
                  |  DB   | |  DB  | | DB | | DB |
                  |  (PG) | | (PG) | |(PG)| |(PG)|
                  +-------+ +------+ +----+ +----+

Observability (Langfuse)

Langfuse runs alongside and tracks every LLM interaction:

  • What question was asked, by whom
  • What SQL was generated
  • Whether the answer was correct
  • Token usage and cost per query
  • Langfuse stores its data in ClickHouse itself — no extra DB needed

5. Data Consolidation — All DBs Into ClickHouse

The Core Idea

Every PostgreSQL database at the crypto exchange feeds its tables into ClickHouse via CDC (Change Data Capture). Once in ClickHouse, the AI agent can query across all of them in a single query.

CDC Pipeline

PostgreSQL (WAL) → Debezium Connector → Kafka → ClickHouse Kafka Engine → MergeTree Tables
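The Debezium leg of this pipeline is configured per source database. Here is a minimal sketch of a connector config for the Trading DB — hostnames, credentials, and table lists are illustrative placeholders, not the exchange's real values:

```python
import json

# Illustrative Debezium connector config (placeholder names throughout).
# POSTed to Kafka Connect's /connectors endpoint to start streaming
# PostgreSQL WAL changes into Kafka topics.
connector = {
    "name": "trading-db-cdc",
    "config": {
        "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
        "plugin.name": "pgoutput",                 # built-in logical decoding
        "database.hostname": "trading-db.internal",
        "database.port": "5432",
        "database.user": "cdc_reader",
        "database.dbname": "trading",
        "topic.prefix": "trading",                 # topics become trading.public.<table>
        "table.include.list": "public.trade_future,public.trade_spot,public.orders",
        "snapshot.mode": "initial",                # full snapshot first, then stream WAL
    },
}

print(json.dumps(connector, indent=2))
```

On the ClickHouse side, a Kafka Engine table consumes `trading.public.*` topics and a materialized view writes the rows into the ReplacingMergeTree tables below.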

Tables to Consolidate

| Source DB | Table | Est. Size | ClickHouse Engine | Sync |
|---|---|---|---|---|
| Trading DB | trade_future | hundreds of GB | ReplacingMergeTree | CDC real-time |
| | trade_spot | TBD | ReplacingMergeTree | CDC real-time |
| | orders | TBD | ReplacingMergeTree | CDC real-time |
| Wallet DB | wallet_transactions | TBD | ReplacingMergeTree | CDC real-time |
| | deposits | TBD | MergeTree | CDC real-time |
| | withdrawals | TBD | MergeTree | CDC real-time |
| User DB | users | Small | ReplacingMergeTree | CDC |
| | kyc_status | Small | ReplacingMergeTree | CDC |
| Risk DB | positions | TBD | ReplacingMergeTree | CDC real-time |
| | margins | TBD | ReplacingMergeTree | CDC real-time |
| Config | market_config | Small | ReplacingMergeTree | Daily batch |
| | fee_tiers | Small | ReplacingMergeTree | Daily batch |

Why ReplacingMergeTree

PostgreSQL rows get UPDATEd (e.g., order status changes, position updates). ClickHouse's ReplacingMergeTree deduplicates on primary key during background merges, handling CDC update events correctly.
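A toy model of that behavior, assuming a `_version` column derived from the CDC log position (an assumption for illustration; any monotonically increasing version column works):

```python
def replacing_merge(rows, key_cols=("order_id",), version_col="_version"):
    """Toy model of ReplacingMergeTree semantics: per sorting key, keep only
    the row with the highest version, as a background merge (or FINAL) would."""
    latest = {}
    for row in rows:
        key = tuple(row[c] for c in key_cols)
        if key not in latest or row[version_col] >= latest[key][version_col]:
            latest[key] = row
    return sorted(latest.values(), key=lambda r: tuple(r[c] for c in key_cols))

# An order is INSERTed then UPDATEd twice in PostgreSQL; CDC delivers three
# versions of order 1, and only the newest survives the merge.
cdc_events = [
    {"order_id": 1, "status": "open",    "_version": 100},
    {"order_id": 1, "status": "filled",  "_version": 101},
    {"order_id": 2, "status": "open",    "_version": 102},
    {"order_id": 1, "status": "settled", "_version": 103},
]
print(replacing_merge(cdc_events))
```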

Pre-Aggregated Materialized Views

ClickHouse materialized views are incremental triggers — they process each new row as it arrives (unlike PostgreSQL materialized views, which require a periodic REFRESH):

| Materialized View | Source | What It Computes |
|---|---|---|
| volume_daily | trade_future, trade_spot | Volume per market per day |
| pnl_daily | trades + positions | P&L per user per market per day |
| liquidation_metrics | trade_future (order_type=1003) | Liquidation count, volume, frequency |
| candlestick_1m | trade_future, trade_spot | OHLCV per market per minute |
| candlestick_1h | chart_db.candlesticks (CDC) | OHLCV per market per hour |
| candlestick_1d | chart_db.candlesticks (CDC) | OHLCV per market per day |
| fee_revenue | trades | Fee revenue per market per tier per day |
| withdrawal_summary | withdrawals | Daily withdrawal volume per user |
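To make the "incremental trigger" idea concrete, here is a toy Python model of how a view like volume_daily stays current: each inserted trade updates the aggregate immediately, with no refresh job. (Illustrative only — in ClickHouse this is a CREATE MATERIALIZED VIEW, not application code.)

```python
from collections import defaultdict
from datetime import date

# (market, day) -> notional volume; stands in for the volume_daily view.
volume_daily = defaultdict(float)

def on_insert(trade):
    """Fires once per inserted row, like an MV attached to the trades table."""
    key = (trade["market_name"], trade["day"])
    volume_daily[key] += trade["quantity"] * trade["price"]

for t in [
    {"market_name": "BTCPFC", "day": date(2024, 6, 1), "quantity": 2.0,  "price": 65000.0},
    {"market_name": "BTCPFC", "day": date(2024, 6, 1), "quantity": 0.5,  "price": 64000.0},
    {"market_name": "ETHPFC", "day": date(2024, 6, 1), "quantity": 10.0, "price": 3400.0},
]:
    on_insert(t)

print(dict(volume_daily))  # BTCPFC: 162000.0, ETHPFC: 34000.0
```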

Cross-Database Query Example

Something impossible today (requires joining across separate PostgreSQL databases):

-- "Show users with >$100K futures volume today who also withdrew >$10K"
SELECT
    t.username,
    SUM(t.quantity * t.price) as futures_notional,
    w.total_withdrawn
FROM trade_future t
JOIN (
    SELECT username, SUM(amount) as total_withdrawn
    FROM withdrawals
    WHERE created_at >= today()
    GROUP BY username
    HAVING SUM(amount) > 10000
) w ON t.username = w.username
WHERE t.transaction_time >= today()
GROUP BY t.username, w.total_withdrawn
HAVING futures_notional > 100000
ORDER BY futures_notional DESC

The AI agent generates this SQL automatically from plain English.


6. How the AI Agent Works

The Flow

1. User types question in LibreChat
     "Which users had the most liquidations today?"

2. LLM receives:
   - The question
   - Business glossary (liquidation = order_type 1003, etc.)
   - ClickHouse schema (via MCP list_tables)

3. LLM generates SQL:
     SELECT username, COUNT(*) as liq_count, SUM(quantity * price) as liq_volume
     FROM trade_future
     WHERE order_type = 1003
       AND transaction_time >= today()
       AND username NOT LIKE 'bot_%'
       AND username NOT LIKE 'market_maker_%'
       AND username NOT LIKE 'liquidity_%'
     GROUP BY username
     ORDER BY liq_count DESC
     LIMIT 20

4. MCP Server executes SQL (read-only) against ClickHouse

5. LLM formats result:
     "Today's top users by liquidation count:
      1. user_abc — 12 liquidations, $234K total
      2. user_xyz — 8 liquidations, $156K total
      ..."

6. Langfuse logs: question, SQL, result, latency, tokens used

Key Point — It's All Structured Table Queries

The LLM is not doing anything magical with embeddings or vectors. It is:

  1. Reading the ClickHouse table schema (column names, types)
  2. Using the business glossary to understand domain terms
  3. Writing standard SQL
  4. Executing via MCP server

All our data is in regular ClickHouse tables. The "intelligence" comes from the LLM knowing how to translate English into SQL for our specific schema.
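The six steps can be sketched as a short loop. The LLM and MCP calls are stubbed here, and every function name is illustrative — not LibreChat's or the MCP server's actual API:

```python
GLOSSARY = "liquidation = trade_future rows where order_type = 1003"
SCHEMA = ("trade_future(username String, order_type Int32, "
          "quantity Float64, price Float64, transaction_time DateTime)")

def fake_llm(prompt: str) -> str:
    """Stand-in for the self-hosted Qwen 2.5 call (step 3: English -> SQL)."""
    return ("SELECT username, COUNT(*) AS liq_count FROM trade_future "
            "WHERE order_type = 1003 AND transaction_time >= today() "
            "GROUP BY username ORDER BY liq_count DESC LIMIT 20")

def fake_mcp_run_query(sql: str):
    """Stand-in for the MCP server's run_query tool (step 4: read-only exec)."""
    return [{"username": "user_abc", "liq_count": 12}]

def answer(question: str):
    # Step 2: the prompt bundles the glossary and schema with the question.
    prompt = f"Glossary:\n{GLOSSARY}\n\nSchema:\n{SCHEMA}\n\nQuestion: {question}\nSQL:"
    sql = fake_llm(prompt)
    rows = fake_mcp_run_query(sql)
    return sql, rows            # step 5 would format rows into a chat answer

sql, rows = answer("Which users had the most liquidations today?")
print(rows)
```

In production the loop also logs the prompt, SQL, result, and latency to Langfuse (step 6).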


7. What Questions Can Be Asked — By Team

Trading Operations

| Question | Tables Used | Complexity |
|---|---|---|
| "What is BTC perpetual futures volume today?" | trade_future | Single table |
| "Show top 10 markets by volume this week" | trade_future, trade_spot | Single table each, union |
| "Compare BTC futures volume this week vs last week" | trade_future | Single table, time comparison |
| "Show 1-hour candlestick chart for ETHPFC last 24 hours" | candlestick_1h (view) | Pre-aggregated view |
| "Which markets had the highest spread in the last hour?" | orders | Single table |
| "Show order book depth for BTCPFC right now" | orders | Single table, latest snapshot |
| "Average fill time for limit orders by market today" | orders | Single table |
| "How does today's volume compare to 30-day average?" | volume_daily (view) | Pre-aggregated view |

Risk Management

| Question | Tables Used | Complexity |
|---|---|---|
| "Which users have the largest open positions?" | positions | Single table |
| "Show all liquidation events in the last hour" | trade_future | Single table (order_type=1003) |
| "What percentage of trades were liquidations by market today?" | trade_future | Single table, grouping |
| "List users with concentrated exposure (>80% in one market)" | positions | Single table, aggregation |
| "Show liquidation cascade events — markets where >5 liquidations happened within 1 minute" | trade_future | Single table, window function |
| "Which users traded within 1 minute of a liquidation event?" | trade_future (self-join) | Single table, self-join |
| "Show users with margin utilization above 90%" | positions, margins | Multi-table join |
| "Correlation between liquidation volume and price movement for BTCPFC this week" | trade_future, candlestick_1h | Multi-table join |

Compliance & Audit

| Question | Tables Used | Complexity |
|---|---|---|
| "Show all transactions for user X in the last 90 days" | trade_future, trade_spot, withdrawals, deposits | Multi-table union |
| "List accounts with withdrawal volume > $100K today" | withdrawals | Single table |
| "Find users who deposited and withdrew within 24 hours" | deposits, withdrawals | Multi-table join |
| "Show all KYC-pending users who traded this week" | users, kyc_status, trade_future | Multi-table join |
| "List accounts that received deposits from the same source address" | deposits | Single table, grouping |
| "Show transaction patterns for user X — frequency, volume, timing" | trade_future, trade_spot, withdrawals | Multi-table union + aggregation |
| "Which accounts had unusual withdrawal patterns this week?" | withdrawals | Single table, anomaly detection |
| "Full audit trail: every action by user X since account creation" | trade_future, trade_spot, orders, withdrawals, deposits | Multi-table union (5 tables) |

Finance & Revenue

| Question | Tables Used | Complexity |
|---|---|---|
| "What was total fee revenue yesterday?" | fee_revenue (view) | Pre-aggregated view |
| "Monthly fee revenue breakdown by market" | fee_revenue (view) | Pre-aggregated view |
| "Compare spot vs futures fee revenue this month" | fee_revenue (view) | Pre-aggregated view, grouping |
| "Daily P&L trend for the last 30 days" | pnl_daily (view) | Pre-aggregated view |
| "Which fee tier generates the most revenue?" | trade_future, trade_spot, fee_tiers | Multi-table join |
| "Revenue impact if we reduce maker fees by 10%" | trade_future, trade_spot, fee_tiers | Multi-table join + simulation |
| "Show monthly trading volume trend with MoM growth %" | volume_daily (view) | Pre-aggregated view, window function |

Product & Growth

| Question | Tables Used | Complexity |
|---|---|---|
| "How many unique users traded futures this week?" | trade_future | Single table, COUNT DISTINCT |
| "What is our DAU trend for the last 30 days?" | trade_future, trade_spot | Multi-table union |
| "Which markets have growing vs declining user counts?" | trade_future, trade_spot | Single table, time comparison |
| "New user retention — users who signed up last month, how many traded this month?" | users, trade_future | Multi-table join |
| "Top 10 users by total trading volume all-time" | trade_future, trade_spot | Multi-table union |
| "Average trades per user per day — trend over last 90 days" | trade_future, trade_spot | Single table, aggregation |
| "Geographic distribution of active traders this month" | users, trade_future | Multi-table join |

Question Complexity Summary

| Type | % of Questions | How LLM Handles |
|---|---|---|
| Single table | ~40% | Direct SELECT with WHERE/GROUP BY |
| Pre-aggregated view | ~20% | Query materialized view (fastest) |
| Multi-table JOIN | ~25% | LLM generates JOIN across tables |
| Multi-table UNION | ~10% | LLM unions results from multiple tables |
| Complex (window/self-join) | ~5% | LLM uses advanced SQL patterns |

8. How Multi-Table Queries Work (JOIN Problem & Solution)

The Challenge

Many real questions require data from multiple sources:

"Show me users who had futures liquidations today AND also made withdrawals"

This needs: trade_future (Trading DB) + withdrawals (Wallet DB) + users (User DB)

Problem: ClickHouse is not optimized for runtime JOINs. Unlike PostgreSQL, JOINs in ClickHouse are memory-heavy, single-threaded (hash build phase), and can be slow on large tables. This is a well-known ClickHouse limitation.

Solution: We avoid runtime JOINs with three strategies — the same approach DWAINE (ClickHouse's own AI agent) uses.

Strategy 1 — Pre-Joined Denormalized Marts (Primary Approach)

JOINs are computed at data load time, not at query time. Refreshable Materialized Views periodically run the JOINs in the background and write the results into flat, wide "mart" tables. The AI agent queries these pre-joined tables directly — no JOINs needed at query time.

How it works:

  Staging Layer (normalized tables):
    staging.trade_future  — raw trades from CDC
    staging.users         — user profiles from CDC
    staging.wallets       — wallet transactions from CDC

         | Refreshable Materialized View (runs every 1-5 minutes)
         | Executes the JOINs in background
         v

  Marts Layer (denormalized, pre-joined):
    mart_user_trading_activity
      — username, kyc_status, registered_at,     ← from users
        futures_volume_today, spot_volume_today,  ← from trades
        withdrawal_total_today, deposit_total,    ← from wallets
        liq_count_today, liq_volume_today         ← from trades (order_type=1003)

The AI agent only sees the Marts layer. When someone asks "Show users with >$50K volume who also withdrew >$10K", the LLM queries one flat table — no JOIN:

-- NO JOIN needed — all data is already in one table
SELECT username, futures_volume_today, withdrawal_total_today
FROM mart_user_trading_activity
WHERE futures_volume_today > 50000
  AND withdrawal_total_today > 10000
ORDER BY futures_volume_today DESC

This is ClickHouse's official recommendation for AI agents:

"LLMs should only be exposed to the Marts layer — which is already pre-joined and denormalized. The agent has no need to write JOINs."
How to set up ClickHouse for agentic analytics

Strategy 2 — Dictionaries (For Dimension Lookups)

Small, stable lookup tables (users, markets, fee_tiers) are loaded into in-memory Dictionaries. Instead of a JOIN, the LLM uses dictGet() — an O(1) key-value lookup that is 25x faster than a hash join.

-- INSTEAD OF THIS (slow JOIN):
SELECT t.username, t.quantity, u.kyc_status
FROM mart_trades_futures t
JOIN mart_users u ON t.username = u.username
WHERE t.transaction_time >= today()

-- THE LLM GENERATES THIS (fast dictionary lookup):
SELECT
    username,
    quantity,
    dictGet('users_dict', 'kyc_status', username) AS kyc_status
FROM mart_trades_futures
WHERE transaction_time >= today()

Performance comparison (from ClickHouse benchmarks):

| Method | Time | Memory |
|---|---|---|
| dictGet() with flat layout | 0.044 sec | 84 MB |
| dictGet() with hashed layout | 0.113 sec | 103 MB |
| Parallel Hash JOIN | 0.690 sec | 4.8 GB |
| Standard Hash JOIN | 1.133 sec | 4.4 GB |

Which tables become Dictionaries:

| Dictionary | Source | Size | Lookup Key |
|---|---|---|---|
| users_dict | users table | Small | username |
| markets_dict | market_config | Small | market_name |
| fee_tiers_dict | fee_tiers | Small | tier_id |
| kyc_dict | kyc_status | Small | username |

The business glossary teaches the LLM to use dictGet() instead of JOINs for these dimension tables.

Strategy 3 — Runtime JOINs (Fallback for Ad-Hoc Queries)

For truly ad-hoc questions that don't fit a pre-built mart table, ClickHouse does support runtime JOINs with optimizations:

| Optimization | What It Does |
|---|---|
| join_algorithm = 'auto' | ClickHouse picks the best algorithm automatically |
| Automatic table reordering | Puts the smaller table on the right side (since v24.12) |
| Global join reordering | Optimizes 3+ table JOIN chains (since v25.9) |
| ANY JOIN | Stops at first match — faster when you only need one row |
| Pre-filter with WHERE | Reduce table size before joining |

When this is acceptable:

  • Small-to-medium result sets (both sides < few million rows after filtering)
  • Ad-hoc questions that don't match any existing mart table
  • One-time analysis queries where latency is less critical

How Each Question Type Is Handled

| Question Type | Strategy | Example |
|---|---|---|
| Single-table query | Direct query on mart table | "Top 10 markets by volume today" |
| Pre-aggregated metric | Query materialized view | "Daily volume trend last 30 days" |
| Cross-domain (trading + wallets) | Pre-joined mart table | "Users with high volume AND large withdrawals" |
| Dimension lookup (user info) | Dictionary (dictGet) | "Show KYC status for top traders" |
| Full audit trail (multiple activity types) | UNION ALL (no JOIN needed) | "All activity for user X in 30 days" |
| Rare ad-hoc correlation | Runtime JOIN (fallback) | "Correlation between margin % and liquidation frequency" |

Pre-Joined Mart Tables We Build

| Mart Table | Pre-Joined Sources | Refreshes | Use Case |
|---|---|---|---|
| mart_user_trading_activity | trades + users + wallets | Every 5 min | "Users who traded AND withdrew" |
| mart_user_risk_profile | positions + margins + trades (liquidations) | Every 1 min | "Users at risk of liquidation" |
| mart_liquidation_detail | trades + users + positions | Every 1 min | "Liquidation events with user context" |
| mart_user_full_activity | trades + spot + orders + wallets | Every 5 min | "Complete activity for user X" |
| mart_market_daily_summary | trades + orders + market_config | Every 1 min | "Market performance today" |
| mart_compliance_view | trades + wallets + users + kyc | Every 5 min | "Compliance team queries" |

Example — How Pre-Joining Works End-to-End

Question: "Show KYC-verified users who had liquidations this week with their withdrawal history"

Without pre-joining (SLOW — 3-table runtime JOIN):

-- BAD: Runtime JOIN across 3 large tables
SELECT u.username, u.kyc_status, liq.liq_count, wd.withdrawal_total
FROM staging.users u
JOIN staging.trade_future liq ON u.username = liq.username
JOIN staging.wallets wd ON u.username = wd.username
WHERE ...
-- Slow, memory-heavy, may timeout

With pre-joining (FAST — single table scan):

-- GOOD: Query pre-joined mart table, no JOIN needed
SELECT username, kyc_status, liq_count_week, withdrawal_total_week
FROM mart_user_trading_activity
WHERE kyc_status = 'verified'
  AND liq_count_week > 0
ORDER BY liq_count_week DESC

The JOIN happened in the background when the Refreshable Materialized View populated mart_user_trading_activity. The AI agent never writes a JOIN.

Example — Full Audit Trail (UNION ALL, No JOIN)

For "show all activity for user X", no JOIN is needed — we use UNION ALL across tables, which ClickHouse handles efficiently since each branch scans one table independently:

SELECT * FROM (
    SELECT 'futures_trade' AS activity, transaction_time AS event_time,
           market_name, side, quantity, price, quantity * price AS value_usd
    FROM mart_trades_futures
    WHERE username = 'john_doe' AND transaction_time >= today() - 30

    UNION ALL

    SELECT 'spot_trade', transaction_time, market_name, side,
           quantity, price, quantity * price
    FROM mart_trades_spot
    WHERE username = 'john_doe' AND transaction_time >= today() - 30

    UNION ALL

    SELECT 'withdrawal', created_at, currency, 'out',
           amount, 0, amount
    FROM mart_wallets
    WHERE username = 'john_doe' AND transaction_type = 'withdrawal'
      AND created_at >= today() - 30

    UNION ALL

    SELECT 'deposit', created_at, currency, 'in',
           amount, 0, amount
    FROM mart_wallets
    WHERE username = 'john_doe' AND transaction_type = 'deposit'
      AND created_at >= today() - 30
)
ORDER BY event_time DESC

UNION ALL is not a JOIN — each branch runs independently and results are concatenated. This is fast in ClickHouse.

Summary — JOIN Avoidance Strategy

Question comes in from user
        |
        v
Does a pre-joined mart table cover this?
  YES → Query the mart table directly (no JOIN, fastest)
        |
        NO
        v
Does it need a small dimension lookup (user info, market name)?
  YES → Use dictGet() dictionary lookup (25x faster than JOIN)
        |
        NO
        v
Does it need data from multiple activity types for one user?
  YES → Use UNION ALL across tables (no JOIN, each branch independent)
        |
        NO
        v
Truly ad-hoc cross-table correlation?
  → Use runtime JOIN with auto algorithm (fallback, acceptable for small result sets)
  → Log in Langfuse — if this pattern repeats, build a new mart table for it
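The same routing flow, expressed as a function. In practice the business glossary encodes these rules for the LLM rather than application code; the classification labels here are illustrative:

```python
def route(question_kind: str) -> str:
    """Pick a query strategy for a coarsely classified question.
    Mirrors the decision flow above; labels are placeholders."""
    if question_kind == "covered_by_mart":
        return "query pre-joined mart table (no JOIN)"
    if question_kind == "dimension_lookup":
        return "dictGet() dictionary lookup"
    if question_kind == "multi_activity_single_user":
        return "UNION ALL across activity tables"
    # Fallback: truly ad-hoc cross-table correlation.
    return "runtime JOIN with join_algorithm='auto'; log pattern in Langfuse"

print(route("covered_by_mart"))
print(route("ad_hoc_correlation"))
```

The Langfuse log on the fallback branch is what drives the feedback loop: a repeated ad-hoc pattern becomes the next mart table.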

Performance Comparison

| Approach | Query Time | Memory | When to Use |
|---|---|---|---|
| Pre-joined mart table | Milliseconds | Minimal | ~60% of questions (cross-domain) |
| Dictionary (dictGet) | Milliseconds | Minimal | ~15% of questions (dimension lookup) |
| UNION ALL | 1-2 seconds | Low | ~15% of questions (audit trails) |
| Runtime JOIN (fallback) | 2-10 seconds | High | ~10% of questions (rare ad-hoc) |



9. ClickHouse MCP Server

What It Is

The MCP (Model Context Protocol) server is an open standard that lets LLMs interact with databases safely. ClickHouse has an official MCP server:

What It Does (and Does NOT Do)

| | Description |
|---|---|
| Does | Exposes ClickHouse schema to the LLM (tables, columns, types) |
| Does | Executes read-only SQL queries generated by the LLM |
| Does | Returns results in LLM-readable format |
| Does | Enforces read-only access (no INSERT/UPDATE/DELETE by default) |
| Does NOT | Convert natural language to SQL (that's the LLM's job) |
| Does NOT | Store data or state |

Tools Exposed

| MCP Tool | What It Does |
|---|---|
| list_databases | Lists all databases in ClickHouse |
| list_tables | Lists tables with columns, types, pagination |
| run_query | Executes read-only SQL, returns results |
| run_chdb_select_query | Embedded chDB queries against files/URLs |

Integration Flow

Qwen LLM ←— MCP Protocol (JSON-RPC) —→ ClickHouse MCP Server ←— Native Protocol —→ ClickHouse Tables

The MCP server is the controlled gateway — the LLM can discover schema and run queries, but cannot modify data or access restricted tables.
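As a defense-in-depth illustration, a pre-execution guard on LLM-generated SQL might look like the sketch below. The real enforcement is the MCP server's read-only mode plus a read-only ClickHouse user; this is not the MCP server's actual code:

```python
import re

# Statement keywords that must never appear in agent-generated SQL.
FORBIDDEN = re.compile(
    r"\b(INSERT|ALTER|DROP|TRUNCATE|DELETE|UPDATE|CREATE|GRANT|RENAME|SYSTEM)\b",
    re.IGNORECASE,
)

def is_read_only(sql: str) -> bool:
    """Conservative check: single statement, read-only verb, no DDL/DML keywords."""
    stripped = sql.strip().rstrip(";")
    if ";" in stripped:            # reject multi-statement payloads outright
        return False
    if FORBIDDEN.search(stripped):
        return False
    return stripped.upper().startswith(("SELECT", "WITH", "SHOW", "DESCRIBE", "EXPLAIN"))

print(is_read_only("SELECT count() FROM trade_future"))            # True
print(is_read_only("DROP TABLE trade_future"))                     # False
print(is_read_only("SELECT 1; DROP TABLE trade_future"))           # False
```

A keyword filter alone is bypassable, which is why it sits in front of, never instead of, database-level read-only grants.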

How the MCP Server Actually Works (Transport & Security)

Transport — Streamable HTTP (not stdio):
The MCP server runs as a persistent HTTP service. LibreChat sends HTTP POST requests to it when the LLM needs to call a tool (list_tables, run_query, etc.). This is different from subprocess-based MCP (stdio) — the server stays running, handles concurrent requests, and is addressable as a microservice within the deployment.

Authentication — Bearer Token:
LibreChat authenticates to the MCP server using a bearer token in the Authorization header. This ensures only LibreChat can call the MCP server — not other services or arbitrary HTTP clients on the same network.
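Concretely, a tool call over streamable HTTP is a JSON-RPC POST with the bearer token in the Authorization header. The endpoint, token, and SQL below are placeholders:

```python
import json

# Illustrative streamable-HTTP MCP tool call, as LibreChat would issue it.
MCP_URL = "http://mcp-clickhouse:8000/mcp"     # placeholder endpoint
TOKEN = "example-bearer-token"                 # placeholder secret

headers = {
    "Authorization": f"Bearer {TOKEN}",
    "Content-Type": "application/json",
}

payload = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/call",                    # MCP's JSON-RPC tool-invocation method
    "params": {
        "name": "run_query",
        "arguments": {"query": "SELECT count() FROM trade_future"},
    },
}

# e.g. with requests:  requests.post(MCP_URL, headers=headers, json=payload)
print(json.dumps(payload))
```

Requests without a valid bearer token are rejected before any tool runs.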

SSRF Protection — Domain Whitelist:
LibreChat enforces an allowlist of domains it will make outbound HTTP requests to. This prevents a scenario where the LLM instructs LibreChat to call internal infrastructure URLs (e.g., http://internal-service/admin). Only the MCP server address and ClickHouse Cloud endpoint are whitelisted.

chDB — Query Files and URLs Without ETL:
The run_chdb_select_query tool uses chDB (embedded ClickHouse engine) to query local files, S3 URLs, or Parquet files directly — without loading them into ClickHouse tables first. chDB 4 adds a Pandas-like DataStore API with lazy execution and filter/column pushdown, useful for data scientists who upload files to LibreChat and want to query them immediately.
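To make the chDB path concrete, here is a small helper that builds the kind of SQL `run_chdb_select_query` executes. The Parquet URL and filter below are made-up placeholders; in a real session the resulting string would be handed to chDB's embedded engine (e.g. `chdb.query(sql)`).

```python
def chdb_parquet_query(parquet_url: str, where: str = "1") -> str:
    """Build a ClickHouse SQL statement that reads a Parquet file directly
    via the url() table function -- no table creation or ETL step required."""
    return (
        f"SELECT * FROM url('{parquet_url}', 'Parquet') "
        f"WHERE {where} LIMIT 100"
    )

# Hypothetical uploaded file -- not a real dataset.
sql = chdb_parquet_query(
    "https://example.com/uploads/trades.parquet",
    where="side = 'sell'",
)
```

The same `url()` / `file()` table functions work for CSV and JSON, which is what makes the "upload a file and query it immediately" workflow possible.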

Remote MCP (ClickHouse Cloud):
For ClickHouse Cloud deployments, ClickHouse operates a hosted Remote MCP Server — agents connect directly without self-hosting the MCP service. For self-managed ClickHouse (our case), the MCP server runs as a container alongside ClickHouse.

Compatible Agent Frameworks

ClickHouse has tested its MCP server with 12 agent frameworks:
Agno, DSPy, LangChain, LlamaIndex, PydanticAI, Claude Agent SDK, OpenAI Agents SDK, CrewAI, Google ADK, Microsoft Agent Framework, mcp-agent, Upsonic


10. LLM Selection

Primary: Qwen 2.5 72B (Self-Hosted)

| Factor | Detail |
|---|---|
| License | Apache 2.0 (open source, no vendor lock-in) |
| SQL generation | Top-tier on text-to-SQL benchmarks |
| Multilingual | English, Chinese, Japanese |
| Size | 72B params — accurate enough, runs on a single GPU node |
| Data privacy | All data stays on our infrastructure |
| Cost | Fixed GPU cost, not per-token |

Infrastructure

| Component | Specification |
|---|---|
| GPU instance | g5.12xlarge (4x A10G, 96 GB VRAM) |
| VRAM needed | ~40 GB for 72B INT4 quantized |
| Framework | vLLM (high-throughput inference) |
| Throughput | 30-50 tokens/second |
| Fallback | Qwen 2.5 32B on a smaller GPU |

Alternatives

| Model | Size | Notes |
|---|---|---|
| Qwen 2.5 32B | 32B | Lighter, cheaper GPU, slightly less accurate |
| Llama 3.1 70B | 70B | Meta, strong general purpose |
| DeepSeek V3 | 671B MoE (~37B active params) | Strong on code/SQL |
| Claude Opus 4.7 | API | Best for complex multi-step analysis; use only when data is not restricted to internal infra |

Model selection rule: Use self-hosted Qwen for all queries involving the exchange's trading/user/wallet data (data stays on-prem, zero external API calls). Use Claude Opus 4.7 via API only for non-sensitive analytical work, development, or testing where data residency is not a constraint.
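The routing rule above can be sketched as a small dispatcher. The table prefixes and backend labels here are illustrative placeholders, not part of any framework API:

```python
# Any mart derived from exchange data routes to the self-hosted model.
# Prefixes and backend names are assumptions for this sketch.
SENSITIVE_PREFIXES = ("mart_trades", "mart_wallets", "mart_users", "mart_positions")

def pick_backend(tables_referenced: list[str]) -> str:
    """Route to self-hosted Qwen whenever exchange data is involved;
    an external API model is allowed only for non-sensitive work."""
    if any(t.startswith(SENSITIVE_PREFIXES) for t in tables_referenced):
        return "qwen-2.5-72b-selfhosted"
    return "external-api-model"
```

A check like this belongs in the gateway in front of the LLM, so the residency rule is enforced by code rather than by convention.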

AGENTS.md — Per-Environment Agent Instructions

ClickHouse Assistant and agent frameworks support an AGENTS.md file — a plain text file placed in the agent's working context that injects domain-specific instructions into every session. This is complementary to the business glossary:

| | Business Glossary | AGENTS.md |
|---|---|---|
| Purpose | Maps domain terms to SQL | Sets agent behavior rules |
| Example | "liquidation" = order_type 1003 | "Always exclude market makers from results" |
| Example | "volume" = SUM(quantity) | "Default time window is today() unless specified" |
| Format | YAML key-value pairs | Plain text instructions |
| Effect | Teaches SQL mappings | Shapes how the agent reasons and responds |

Both are injected into the LLM's system prompt at query time — no retraining needed.
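For concreteness, a minimal AGENTS.md might look like the following. Every rule here is an illustrative assumption for this deployment, not prescribed content:

```
# AGENTS.md (illustrative sketch -- adapt rules to your deployment)
You are a read-only SQL analyst for a crypto-derivatives exchange.
- Query only tables in the marts schema.
- Always exclude market makers (username LIKE 'bot_%', 'market_maker_%',
  'liquidity_%') unless the user explicitly asks to include them.
- Default time window is today() unless the user specifies one.
- Always add a LIMIT to exploratory queries.
- If a term is not in the glossary, ask for clarification instead of guessing.
```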

DWAINE Reference

ClickHouse's own DWAINE agent uses Claude via AWS Bedrock. We choose self-hosted Qwen for data privacy and cost control, but the architecture supports swapping LLM backends.


11. How the LLM Learns Our Tables (No Training Required)

Key Point — No Model Training Needed

The LLM does not need to be trained or fine-tuned to understand the exchange's tables. It learns everything it needs at query time through two mechanisms: automatic schema discovery and a business glossary.

Mechanism 1 — MCP Schema Discovery (Automatic)

Every time the LLM receives a question, it calls list_tables via the MCP server. ClickHouse returns the full schema automatically — table names, column names, data types:

MCP returns to LLM automatically:

```
Table: mart_trades_futures
  - username           String
  - market_name        String
  - transaction_time   DateTime
  - side               String
  - quantity           Float64
  - price              Float64
  - order_type         Int32        ← LLM sees this but doesn't know 1003 = liquidation

Table: mart_wallets
  - username           String
  - transaction_type   String       ('deposit', 'withdrawal')
  - amount             Float64
  - currency           String
  - created_at         DateTime

Table: mart_user_trading_activity
  - username           String
  - kyc_status         String
  - futures_volume_today   Float64
  - withdrawal_total_today Float64
  - liq_count_today    Int32
  ...
```

No configuration needed for this. When a new table is added to ClickHouse, the LLM sees it automatically on the next query.

Mechanism 2 — Business Glossary (You Configure This)

The schema alone is not enough. The LLM sees order_type Int32 but doesn't know what 1003 means. The business glossary is a configuration file (not model training) that gets injected into the LLM's prompt:

Business Glossary (business_glossary.yaml):

```
- "liquidation" = trade_future WHERE order_type = 1003
- "market maker" = username LIKE 'bot_%' OR 'market_maker_%' OR 'liquidity_%'
- "perpetual futures" = market_name LIKE '%PFC%'
- "volume" = SUM(quantity) from mart_trades_futures
- "notional" = SUM(quantity * price)
- "BTCPFC" = Bitcoin perpetual futures contract
- "whitelisted user" = user in internal_account_whitelist table
- "P&L" = pnl_daily materialized view
- "candlestick" = candlestick_1m/1h/1d materialized views
- "spot trade" = mart_trades_spot table
- "deposit" = mart_wallets WHERE transaction_type = 'deposit'
- "withdrawal" = mart_wallets WHERE transaction_type = 'withdrawal'
```

This is just a text file that gets added to the LLM's system prompt. You update it anytime — no retraining, no downtime.
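A minimal sketch of that injection step, assuming the glossary has already been parsed into a dict (in production you would load business_glossary.yaml) and the schema text has already come back from MCP's list_tables; names and wording here are illustrative:

```python
# Two illustrative glossary entries; the real file has hundreds.
GLOSSARY = {
    "liquidation": "trade_future WHERE order_type = 1003",
    "volume": "SUM(quantity) from mart_trades_futures",
}

def build_system_prompt(glossary: dict, schema_text: str) -> str:
    """Concatenate instructions, glossary mappings, and the live schema
    into one system prompt -- context injection, no retraining."""
    rules = "\n".join(f'- "{term}" = {sql}' for term, sql in glossary.items())
    return (
        "You are a SQL analyst. Generate read-only ClickHouse SQL.\n\n"
        f"Business glossary:\n{rules}\n\n"
        f"Available tables:\n{schema_text}"
    )

prompt = build_system_prompt(
    GLOSSARY,
    "mart_trades_futures(username, market_name, transaction_time, order_type, ...)",
)
```

Editing the glossary dict (or the YAML file behind it) changes the agent's behavior on the very next query.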

Full Glossary Entries

| Term | Definition | SQL Mapping |
|---|---|---|
| Liquidation | Forced position closure | order_type = 1003 in trade_future |
| Perpetual futures | Non-expiring futures | market_name LIKE '%PFC%' |
| Market maker | Internal trading bots | username LIKE 'bot_%' OR 'market_maker_%' OR 'liquidity_%' |
| Trading volume | Total quantity | SUM(quantity) on trade_future/trade_spot |
| Notional volume | USD value | SUM(quantity * price) |
| Whitelisted user | Excluded from checks | internal_account_whitelist table |
| Deposit | Fiat/crypto incoming | mart_wallets WHERE transaction_type = 'deposit' |
| Withdrawal | Fiat/crypto outgoing | mart_wallets WHERE transaction_type = 'withdrawal' |
| P&L | Realized profit/loss | pnl_daily materialized view |
| Candlestick | OHLCV price bars | candlestick_1m/1h/1d views |
| Spot trade | Immediate exchange | mart_trades_spot table |
| Futures trade | Derivatives trade | mart_trades_futures table |
| ADL | Auto-deleveraging | order_type = 1004 in trade_future |
| Funding fee | Periodic futures fee | mart_funding_payments table |
| Open interest | Total open positions | SUM(quantity) on mart_positions |

What Actually Happens Per Query

```
User types: "Show top liquidated users today"
                    |
                    v
LLM receives in its prompt:
  ┌─────────────────────────────────────────────────┐
  │ 1. System instructions                          │
  │    "You are a SQL analyst for the crypto        │
  │     exchange. Generate ClickHouse SQL.          │
  │     Read-only only."                            │
  │                                                 │
  │ 2. Business glossary (text file)                │
  │    "liquidation = order_type 1003"              │
  │    "market maker = bot_%, market_maker_%,       │
  │     liquidity_%"                                │
  │    ... (all glossary entries)                   │
  │                                                 │
  │ 3. ClickHouse schema (from MCP list_tables)     │
  │    mart_trades_futures: username, market_name,  │
  │    transaction_time, order_type, quantity, ...  │
  │    mart_wallets: username, amount, ...          │
  │    ... (all mart tables)                        │
  │                                                 │
  │ 4. The user's question                          │
  │    "Show top liquidated users today"            │
  └─────────────────────────────────────────────────┘
                    |
                    v
LLM generates SQL using all 3 context sources:
  SELECT username, COUNT(*) AS liq_count, SUM(quantity * price) AS liq_volume
  FROM mart_trades_futures
  WHERE order_type = 1003                    ← from glossary
    AND transaction_time >= today()          ← from schema (knows column type)
    AND username NOT LIKE 'bot_%'            ← from glossary (exclude market makers)
  GROUP BY username
  ORDER BY liq_count DESC LIMIT 20
```

Training vs Context Injection

| | Training / Fine-tuning | Context Injection (What We Do) |
|---|---|---|
| What it is | Modify the model's internal weights | Pass information in the prompt at query time |
| Time to update | Hours-days (retrain model) | Seconds (edit a text file) |
| Cost | Expensive (GPU hours) | Free |
| When schema changes | Must retrain | Automatic (MCP reads live schema) |
| When business terms change | Must retrain | Edit glossary file, instant effect |
| Risk | Can degrade model quality | None |
| What DWAINE uses | No training | Context injection (glossary + MCP) |

When Would You Fine-Tune? (Optional, Phase 5+)

Fine-tuning is not required but can be done later to push accuracy higher:

| Phase | Approach | Expected Coverage |
|---|---|---|
| Weeks 1-12 | MCP schema + business glossary only (no training) | Aim for DWAINE-class: agent handles ~70% of analytics use cases (the other 30% still go to a data engineer) |
| Post Week 12 (optional) | Fine-tune Qwen on successful query pairs from Langfuse logs | Higher coverage + lower per-query latency |

Fine-tuning would train the model on patterns like:

```
Input:  "top liquidated users today"
Output: "SELECT username, COUNT(*) FROM mart_trades_futures WHERE order_type = 1003
         AND transaction_time >= today() GROUP BY username ORDER BY COUNT(*) DESC LIMIT 10"
```

Langfuse collects thousands of these successful question→SQL pairs over time, which become the fine-tuning dataset. But this is an optimization, not a requirement.

How the Glossary Improves Over Time

Coverage and quality grow as the glossary fills in. The exact accuracy curve depends on workload — what we observed (and what ClickHouse reports for DWAINE) is that glossary growth correlates strongly with the share of questions the agent can answer end-to-end without analyst involvement:

```
Week 1:  Glossary has 50 terms   → narrow coverage; agent fields easy single-table questions
Week 4:  Glossary has 100 terms  → most common cross-domain questions handled
Week 8:  Glossary has 200 terms  → DWAINE-class — ~70% of analytics use cases covered
Week 12: Glossary has 300+ terms → covers long-tail domain language
Post-12: Fine-tune on Langfuse logs → tighter SQL, lower latency, higher first-pass success
```

The improvement cycle:

  1. User asks a question → LLM generates wrong SQL
  2. Langfuse logs the failure (question, wrong SQL, error)
  3. Team reviews and adds missing glossary entry
  4. Next time the same question is asked → correct SQL

ClickHouse's published takeaway for DWAINE is that business glossary quality is the #1 driver of agent usefulness — they reached the ~70% use-case-coverage milestone primarily through glossary expansion, not LLM fine-tuning.


12. Chat Interface — LibreChat

What It Is

LibreChat is an open-source chat interface acquired by ClickHouse (November 2025). It is the front-end of the Agentic Data Stack.

  • Self-hosted — no data leaves the exchange's infrastructure
  • Web-based — accessible from any browser
  • Connects to self-hosted Qwen LLM backend
  • MCP plugin connects LLM to ClickHouse
  • Conversation history and bookmarking
  • Supports tables and chart rendering in responses

Role-Based Access

| Role | Sees | Example Questions |
|---|---|---|
| Trading Ops | All trading data, metrics | "Top markets by volume this week" |
| Risk | Positions, liquidations, exposure | "Users with largest open positions" |
| Compliance | User transactions, audit trails | "All transactions for user X in Q1" |
| Finance | Revenue, fees, P&L | "Monthly fee revenue by market" |
| Product | User activity, market stats | "DAU trend for futures last 30 days" |

Supporting Services — Why Each Exists

LibreChat is not a standalone service — it depends on four supporting components:

MongoDB — Conversation History Store:
LibreChat is stateless by design. MongoDB stores everything that needs to persist: all conversation threads, message history, user accounts, bookmarks, and presets. Without MongoDB, every browser refresh loses the conversation. MongoDB is chosen because LibreChat's data model is document-oriented (conversations are nested JSON objects with variable structure), not relational.

Meilisearch — Conversation Search Index:
Meilisearch provides full-text search over conversation history. When a user types in the LibreChat search bar to find a past query ("find my liquidation analysis from last week"), Meilisearch returns results in milliseconds. It maintains a search index that mirrors the MongoDB conversations.

pgvector + RAG API — Document Query:
LibreChat supports document uploads (PDFs, CSVs, text files). The RAG API processes uploaded files: it chunks the text, generates embeddings using an embedding model, and stores them in pgvector (PostgreSQL with vector extension). When a user asks a question while a document is attached, the RAG API retrieves the most relevant chunks from pgvector and injects them into the LLM prompt — enabling "ask questions about this document" without fine-tuning.

Qwen Connection — OpenAI-Compatible API:
LibreChat connects to Qwen via the OpenAI-compatible API that vLLM exposes. vLLM serves Qwen 2.5 72B with an endpoint that speaks the same protocol as OpenAI's API (/v1/chat/completions). From LibreChat's perspective, Qwen is just another OpenAI-compatible provider — no custom integration needed. Switching to a different model (e.g., Llama, DeepSeek) only requires changing the endpoint URL.
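A hedged sketch of the request shape LibreChat sends to that endpoint. The host, port, and model identifier below are placeholders for this deployment; the payload structure follows the standard OpenAI chat-completions format that vLLM implements:

```python
import json

# Placeholder endpoint -- whatever host/port your vLLM server listens on.
VLLM_URL = "http://qwen-gpu-node:8000/v1/chat/completions"

payload = {
    "model": "Qwen/Qwen2.5-72B-Instruct",   # assumed model id served by vLLM
    "messages": [
        {"role": "system", "content": "You are a SQL analyst. Read-only only."},
        {"role": "user", "content": "Show top liquidated users today"},
    ],
    "temperature": 0.1,    # low temperature: more deterministic SQL
    "max_tokens": 1024,
}
body = json.dumps(payload)
# This body is POSTed to VLLM_URL with Content-Type: application/json.
# Swapping to Llama or DeepSeek only changes VLLM_URL and the "model" field.
```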

Live Demo

ClickHouse operates a public demo at https://llm.clickhouse.com (AgentHouse) showing this exact flow on 37 public datasets. Our deployment is the same architecture with the exchange's private data.


13. Observability — Langfuse

What It Is

Langfuse (acquired by ClickHouse, January 2026) is an LLM observability platform. It tracks every interaction between users and the AI agent.

What It Tracks

| Metric | Why It Matters |
|---|---|
| Every question asked | Know what teams need |
| SQL generated | Debug wrong answers |
| Latency per query | Monitor performance |
| Token usage | Track compute cost |
| Accuracy (correct/incorrect) | Measure improvement |
| User identity | Compliance audit trail |

Why This Matters

Without Langfuse, the AI agent is a black box. With it:

  • We know when answers are wrong and can fix the glossary
  • We measure accuracy improvement over time (target: 70% → 85%)
  • Compliance can audit every question and data access
  • We track cost per team/query type

How Traces Actually Flow (LibreChat → Langfuse → ClickHouse)

Langfuse has a built-in integration in LibreChat — no code changes or custom instrumentation needed. LibreChat automatically sends every LLM interaction to Langfuse via three environment variables (LANGFUSE_PUBLIC_KEY, LANGFUSE_SECRET_KEY, LANGFUSE_BASE_URL). Once set, every prompt and response is traced automatically.
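For reference, the wiring is just those three variables in LibreChat's environment (the key values below are placeholders; the variable names are the real ones):

```
# LibreChat .env -- values are placeholders
LANGFUSE_PUBLIC_KEY=pk-lf-...
LANGFUSE_SECRET_KEY=sk-lf-...
LANGFUSE_BASE_URL=http://langfuse-web:3000
```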

Async trace pipeline — why Redis is needed:
LibreChat does not write traces to ClickHouse directly. It writes trace events to a Redis queue. A separate Langfuse worker process reads from the Redis queue and writes the processed traces to ClickHouse. This decouples trace ingestion from query execution — a slow Langfuse write never adds latency to the user's chat response.

Why Langfuse uses ClickHouse as its own backend:
Langfuse stores trace events (prompts, responses, latency, token counts) in ClickHouse because traces are time-series append-only data — exactly the workload ClickHouse is optimized for. The Langfuse analytics dashboard (cost per team, accuracy trends, slow queries) runs SQL aggregations over these traces in milliseconds. Metadata (projects, users, API keys) is stored in PostgreSQL. Binary assets (uploaded files, large trace payloads) are stored in MinIO (S3-compatible object storage).

The Shared ClickHouse Instance

The same ClickHouse instance serves two roles simultaneously:

| Role | Database | What It Stores |
|---|---|---|
| Agent data warehouse | marts, staging, raw | Business data — trades, wallets, users, risk |
| Langfuse event store | langfuse | LLM traces — prompts, SQL, latency, token counts |

This is intentional. It means you can write SQL that crosses both roles — for example: "which teams generated the most incorrect SQL this week, and what business tables were they querying?" — joining Langfuse trace data with business context in a single query.
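A sketch of what such a cross-role query could look like. Langfuse's actual trace schema differs by version, so the table and column names below are illustrative placeholders, not the real schema:

```sql
-- Illustrative only: treat langfuse.traces and its columns as placeholders.
SELECT
    t.user_id,
    count() AS failed_queries
FROM langfuse.traces AS t
WHERE t.timestamp >= now() - INTERVAL 7 DAY
  AND t.metadata['sql_status'] = 'error'
GROUP BY t.user_id
ORDER BY failed_queries DESC
LIMIT 10
```

Because both databases live on the same ClickHouse instance, this runs as one ordinary query with no cross-system export step.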


14. Medallion Architecture — Data Organization

ClickHouse recommends a three-layer data organization for AI agents (from their setup guide):

```
+------------------+     +------------------+     +------------------+
|   RAW DATABASE   | --> | STAGING DATABASE | --> |  MARTS DATABASE  |
|                  |     |                  |     |                  |
| Untransformed    |     | Deduplicated     |     | Curated tables   |
| CDC events from  |     | Normalized       |     | Aggregated views |
| all PostgreSQL   |     | Cleaned          |     | AI agent queries |
| databases        |     |                  |     | THIS layer only  |
+------------------+     +------------------+     +------------------+
   (ingestion)            (transformation)           (consumption)
```

Why Three Layers

| Layer | Purpose | Who Accesses |
|---|---|---|
| Raw | Unmodified CDC events, full history | Data engineers only |
| Staging | Deduplicated, normalized, cleaned | Data engineers, advanced analysts |
| Marts | Curated, aggregated, business-ready | AI agent and all users |

The AI agent only accesses the Marts layer — curated, pre-validated tables with clear column names and business meaning. This reduces LLM errors and prevents access to raw/sensitive data.

Marts Tables (What the AI Agent Sees)

| Marts Table | Source | Description |
|---|---|---|
| mart_trades_futures | trade_future (staging) | Clean futures trades, no internal accounts |
| mart_trades_spot | trade_spot (staging) | Clean spot trades |
| mart_orders | orders (staging) | Order history with status |
| mart_wallets | wallet_transactions (staging) | Deposits, withdrawals, transfers |
| mart_users | users + kyc (staging) | User profiles (PII masked) |
| mart_positions | positions (staging) | Current open positions |
| mart_volume_daily | Materialized view | Daily volume by market |
| mart_pnl_daily | Materialized view | Daily P&L by user/market |
| mart_liquidations | Materialized view | Liquidation events and metrics |
| mart_candlesticks | Materialized view | OHLCV candles (1m/1h/1d) |
| mart_fee_revenue | Materialized view | Fee revenue by market/tier |

15. Security and Governance

Data Access Controls

| Control | Implementation |
|---|---|
| Authentication | SSO integration (existing corporate identity) |
| Authorization | Role-based — teams see only permitted mart tables |
| Query restrictions | MCP server enforces read-only (no mutations) |
| AI agent access | Marts layer only (no raw or staging data) |
| PII protection | User PII masked in marts tables |
| Audit trail | Every query logged in Langfuse with user identity |
| Network | ClickHouse in private VPC, no public access |

LLM Role Resource Guardrails

The LLM database user must have hard resource limits to prevent runaway agent queries from affecting the cluster. ClickHouse's own agentic analytics setup guide prescribes these exact values:

| Guardrail | Limit | Why |
|---|---|---|
| Query timeout | 30 seconds | Agent queries that scan too much abort cleanly |
| Memory per query | 2 GB | Prevents a single agent query from exhausting node memory |
| Max rows scanned | 100 million | Forces the LLM to use indexes and partitions — catches unfiltered full scans |
| Max bytes scanned | 5 GB | Secondary byte-level cap alongside the row limit |
| Max threads | 4 CPUs | Limits CPU contention with other workloads |
| Access type | SELECT-only on marts.* | No mutations, no access to raw or staging layers |

The LLM role gets SELECT access only on the marts schema — never on raw or staging. This is a hard access control, not just a glossary convention. If the LLM generates a query against a staging table, the database rejects it.
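These guardrails map to ordinary ClickHouse DDL. A sketch, with the numeric limits taken from the table above; the profile name, user name, and password are placeholders:

```sql
-- Sketch: settings profile + restricted user for the agent (names are placeholders).
CREATE SETTINGS PROFILE llm_agent_profile SETTINGS
    max_execution_time = 30,            -- 30 s query timeout
    max_memory_usage   = 2000000000,    -- ~2 GB per query
    max_rows_to_read   = 100000000,     -- 100 M row scan cap
    max_bytes_to_read  = 5000000000,    -- ~5 GB byte cap
    max_threads        = 4;             -- CPU contention limit

CREATE USER llm_agent IDENTIFIED BY '...'
    SETTINGS PROFILE 'llm_agent_profile';

GRANT SELECT ON marts.* TO llm_agent;   -- nothing on raw.* or staging.*, no mutations
```

With only this grant in place, any query the LLM emits against raw or staging fails at the database layer regardless of what the prompt says.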

LLM Data Safety

  • Self-hosted Qwen 2.5 — no data sent to external APIs
  • Business glossary contains schema mappings, not actual data
  • LLM sees query results only — no direct database access
  • All interactions logged and auditable via Langfuse

Compliance

  • Data residency: ClickHouse deployed in the same region as production OLTP databases
  • Retention: Configurable TTL per table
  • Right to erasure: CDC propagates DELETE events from PostgreSQL
  • Access logs: Full audit trail via Langfuse

16. High Availability Architecture

ClickHouse Cluster (Production Region)

```
                +-------------------+
                |   Load Balancer   |
                +----+---------+----+
                     |         |
            +--------v---+ +---v--------+
            | CH Node 1  | | CH Node 2  |
            | (Shard 1,  | | (Shard 1,  |
            |  Replica 1)| |  Replica 2)|
            +------------+ +------------+

            ClickHouse Keeper (3 nodes for consensus)
```

Deployment Options

| Option | Description | Est. Monthly Cost |
|---|---|---|
| ClickHouse Cloud | Fully managed, auto-scaling | ~$2,000-5,000/mo |
| Self-hosted on EC2 | 2x m7g.2xlarge (Graviton3) | ~$1,500-3,000/mo |

Recommendation: Start with ClickHouse Cloud for faster deployment. Migrate to self-hosted if cost optimization is needed later.


17. Implementation Roadmap

Phase 1 — Data Foundation (Weeks 1-3)

| Task | Details |
|---|---|
| Deploy ClickHouse | ClickHouse Cloud or 2-node self-hosted cluster |
| Identify all source tables | Catalog tables from all PostgreSQL databases |
| Set up CDC pipelines | Debezium + Kafka for each source database |
| Design medallion schema | Raw → Staging → Marts table definitions |
| Initial data load | Backfill historical data from all PostgreSQL databases |
| Validation | Row counts and checksum verification |

Phase 2 — Marts + LLM (Weeks 4-6)

| Task | Details |
|---|---|
| Build marts tables | Curated, aggregated, AI-ready tables |
| Materialized views | Volume, P&L, liquidations, candlesticks, revenue |
| Deploy Qwen 2.5 72B | Self-hosted on a GPU instance with vLLM |
| MCP server setup | Connect ClickHouse MCP to Qwen |
| Business glossary v1 | 50-100 domain-specific terms |
| Test SQL generation | Validate against known queries |

Phase 3 — Chat UI + Observability (Weeks 7-9)

| Task | Details |
|---|---|
| Deploy LibreChat | Self-hosted, SSO integration |
| Connect to Qwen backend | MCP plugin configuration |
| Langfuse setup | LLM tracing and accuracy monitoring |
| Role-based access | Team-specific table permissions |
| Beta launch | Trading Ops team as first users |
| Feedback loop | Review Langfuse failures, expand glossary |

Phase 4 — Production Hardening (Weeks 10-12)

| Task | Details |
|---|---|
| Glossary expansion | Target 200+ terms based on real usage |
| Accuracy tuning | Fix common failure patterns from Langfuse |
| Performance tuning | Query caching, materialized view optimization |
| Security audit | Access review, penetration testing |
| GA launch | Roll out to all teams |

Phase 5 — Advanced (Post Week 12)

  • Scheduled reports (daily P&L email, weekly risk summary)
  • Alert-driven queries (agent triggered by anomaly detection)
  • Multi-step analysis (agent chains multiple queries)
  • Fine-tune Qwen on the exchange's query patterns for higher accuracy
  • Add more source databases as needed

18. Cost Analysis

Monthly Infrastructure Costs

| Component | Est. Monthly Cost |
|---|---|
| ClickHouse Cloud | $3,000-5,000 |
| Kafka/CDC (Amazon MSK) | $500 |
| GPU for Qwen 2.5 (g5.12xlarge) | $7,000 (1-yr reserved) / $11,700 (on-demand) |
| LibreChat (t3.medium) | $100 |
| Langfuse | Included in ClickHouse |
| Total | ~$11,000-14,000 (reserved) / ~$15,000-19,000 (on-demand) |

Cost Optimization Path

| Optimization | Savings |
|---|---|
| Self-hosted ClickHouse instead of Cloud | -$1,500-3,000/mo |
| Qwen 2.5 32B on a smaller GPU | -$2,000/mo |
| Reserved instances (1-year) | -30% on GPU |
| Spot instances for dev/test | -60% |

ROI Justification

| Benefit | Value |
|---|---|
| Engineering time saved (ad-hoc queries) | 20-40 hours/month |
| Faster incident response | 50% reduction in MTTR |
| Production DB load reduction | Fewer analytical queries on the live trading DB |
| Self-service analytics | Teams don't need SQL knowledge |
| Cross-database insights | Previously impossible queries now instant |
| Compliance efficiency | Instant audit queries vs. manual extraction |

19. Risk Assessment

| Risk | Likelihood | Impact | Mitigation |
|---|---|---|---|
| LLM generates incorrect SQL | Medium | Medium | Business glossary, Langfuse monitoring, human review |
| CDC lag (stale data) | Low | Medium | Monitor Kafka consumer lag, alert if >5 min |
| ClickHouse cluster failure | Low | High | 2-replica HA, automated failover, daily backups |
| GPU instance unavailable | Low | Medium | Reserved capacity, fallback to smaller model |
| Low user adoption | Medium | Medium | Start with power users, iterate on feedback |
| Schema drift (PG changes not in CH) | Low | Medium | Automated schema sync in CDC pipeline |
| Wrong answers not caught | Medium | High | Langfuse accuracy tracking, glossary reviews |
| PostgreSQL replication slot bloat | Medium | High | If Debezium falls behind, PG WAL accumulates indefinitely and can crash the production DB. Monitor slot lag, set max_slot_wal_keep_size, alert if lag >10 GB |
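The replication-slot mitigation in the last row can be monitored with stock PostgreSQL features — pg_replication_slots, pg_wal_lsn_diff(), and max_slot_wal_keep_size (PostgreSQL 13+) are all standard. A monitoring sketch:

```sql
-- How much WAL each slot is forcing PostgreSQL to retain.
SELECT
    slot_name,
    active,
    pg_size_pretty(
        pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)
    ) AS retained_wal
FROM pg_replication_slots;

-- Cap retained WAL so a stalled Debezium slot cannot fill the disk
-- (the slot is invalidated instead of crashing production):
-- ALTER SYSTEM SET max_slot_wal_keep_size = '10GB';
```

Alert on retained_wal growth for the Debezium slot; a slot that is inactive and growing is the failure mode described above.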

20. Success Metrics

90-Day Targets

| Metric | Target | Measurement |
|---|---|---|
| Query accuracy | >70% correct SQL | Langfuse evaluation |
| Active users | 20+ weekly | LibreChat usage logs |
| Query volume | 500+ queries/week | Langfuse trace count |
| Response time | <10 seconds | Langfuse latency p50 |
| Source databases consolidated | 4+ databases in ClickHouse | Table count |
| Business glossary | 200+ terms | Glossary config |
| Production DB offload | 50%+ analytical queries moved | PG query log |

6-Month Targets

| Metric | Target |
|---|---|
| Query accuracy | >85% |
| Active users | 50+ |
| Self-service resolution | 70% of questions answered without engineering |
| Automated reports | 10+ scheduled reports |

21. Reference Links

ClickHouse Agentic Data Stack

| Resource | URL |
|---|---|
| Agentic Data Stack GitHub repo | https://github.com/ClickHouse/agentic-data-stack |
| MCP Server GitHub repo | https://github.com/ClickHouse/mcp-clickhouse |
| AgentHouse live demo | https://llm.clickhouse.com |

ClickHouse Blog Posts

| Post | URL |
|---|---|
| The Agentic Data Stack | https://clickhouse.com/blog/the-agentic-data-stack |
| How we made our data warehouse AI-first (DWAINE) | https://clickhouse.com/blog/ai-first-data-warehouse |
| How to set up ClickHouse for agentic analytics | https://clickhouse.com/blog/how-to-set-up-clickhouse-for-agentic-analytics |
| Building a data platform for agents | https://clickhouse.com/blog/building-a-data-platform-for-agents |
| Integrating with ClickHouse MCP (5 frameworks) | https://clickhouse.com/blog/integrating-clickhouse-mcp |
| 12 agent framework comparison | https://clickhouse.com/blog/how-to-build-ai-agents-mcp-12-frameworks |
| Introducing AgentHouse | https://clickhouse.com/blog/agenthouse-demo-clickhouse-llm-mcp |
| Ask AI agent & Remote MCP beta | https://clickhouse.com/blog/agentic-analytics-ask-ai-agent-and-remote-mcp-server-beta-launch |
| ClickHouse acquires LibreChat | https://clickhouse.com/blog/clickhouse-acquires-librechat |
| LibreChat + Agentic Data Stack | https://clickhouse.com/blog/librechat-open-source-agentic-data-stack |
| ClickHouse acquires Langfuse | https://clickhouse.com/blog/clickhouse-acquires-langfuse-open-source-llm-observability |
| Langfuse scaling with ClickHouse | https://clickhouse.com/blog/langfuse-llm-analytics |
| Agent-facing analytics | https://clickhouse.com/blog/agent-facing-analytics |
| LLM chat UIs that support MCP | https://clickhouse.com/blog/llm-chat-mcp-support |
| Wix Wild Moose — AI incident response on ClickHouse | https://clickhouse.com/blog/wix-wild-moose |
| Building an agentic coding app (retail demo) | https://clickhouse.com/blog/agentic-coding-app |
| Agentic coding at scale — CTO perspective | https://clickhouse.com/blog/agentic-coding |
| ClickHouse AI Policy (public) | https://github.com/ClickHouse/ClickHouse/blob/master/AI_POLICY.md |


Appendix A — Technology Stack Summary

| Component | Technology | Purpose |
|---|---|---|
| Analytical DB | ClickHouse | All exchange data consolidated; columnar storage, fast aggregations |
| CDC | Debezium + Kafka | Real-time sync from all PostgreSQL databases |
| LLM | Qwen 2.5 72B (self-hosted) | Natural language to SQL generation |
| MCP Server | ClickHouse MCP | Schema discovery + safe SQL execution |
| Chat UI | LibreChat (self-hosted) | Web-based chat interface for all teams |
| Observability | Langfuse | LLM tracing, accuracy tracking, cost monitoring |
| Source of Truth | PostgreSQL databases | Production OLTP (unchanged) |

Appendix B — DWAINE Reference (ClickHouse Internal Agent)

ClickHouse built DWAINE (Data Warehouse AI Natural Expert) for their own internal use:

| Metric | Value |
|---|---|
| Users | 250+ employees |
| Daily messages | 200+ |
| Daily conversations | 50-70 |
| Internal analytics use cases covered | ~70% |
| DWH team workload reduction | 50-70% |
| Stack | LibreChat + Claude (Bedrock) + ClickHouse MCP |
| Key learning | Business glossary is the #1 factor for accuracy |
| Source | ClickHouse — How we made our data warehouse AI-first |

Our platform follows the same architecture, with Qwen replacing Claude for data privacy.


Closing thoughts — the parts that actually matter

After shipping this platform, the things that determine success are not the parts you'd expect:

  1. The business glossary is everything. ClickHouse's internal post-mortem on DWAINE concluded the same: glossary quality is the #1 driver of query accuracy. You don't fine-tune the LLM — you write a YAML file that maps domain terms to SQL fragments, and the LLM uses it via in-context learning. Budget more time for the glossary than for the LLM.

  2. The Marts layer is non-negotiable. Exposing the LLM to staging or raw data is how you get hallucinated SQL. Marts are pre-joined, denormalized, semantically clean, and explicitly access-controlled. The LLM should only ever see Marts.

  3. JOIN avoidance dominates query performance. The 3-strategy approach — pre-joined marts, dictionaries (dictGet), UNION ALL — handles ~90% of cross-table questions without a single runtime JOIN. The other 10% gets ClickHouse's auto-algorithm fallback. This was the single biggest architectural decision.

  4. Langfuse observability is what makes the platform improvable over time. Without per-query traces, accuracy scores, and cost-per-question, you have no idea where to invest. With it, you can see exactly which question patterns fail and feed that back into the glossary.

  5. The PostgreSQL replication slot bloat risk is the one most teams forget. If Debezium falls behind, PostgreSQL accumulates WAL indefinitely and can crash production. Set max_slot_wal_keep_size and alert on slot lag. Don't learn this the hard way.

  6. Security guardrails matter more than people think. A 30-second query timeout, 2GB memory cap, 100M-row scan limit, and SELECT-only access on the Marts schema together mean the LLM cannot accidentally take down the cluster. ClickHouse's official agentic-analytics setup guide prescribes these exact values for a reason.

The full implementation — including the ClickHouse AI DBA MCP server (152 tools for cluster operations) referenced in this guide, knowledge base with 47 production-tested rules, and 16 recovery runbooks — lives in the companion repo on GitHub.

If you've shipped something similar (or are building it now), I'd love to compare notes. Drop a comment below or reach out on LinkedIn.

The companion piece — Replacing Elasticsearch with ClickHouse for OTEL Logs, Traces & Metrics: A 90% Cost-Reduction Migration — covers the observability layer of this same architecture in more depth.
