A production-grade, end-to-end agentic AI platform — chat UI, self-hosted LLM, MCP server, LLM observability, medallion data architecture, security guardrails, HA, and cost analysis. Same stack ClickHouse uses internally (DWAINE: 250+ employees, ~70% of internal analytics use cases covered, 50-70% workload reduction on the data-warehouse team). Adapted for a crypto-derivatives exchange.
Why I'm writing this
The phrase "AI platform" gets thrown around to describe everything from a single chatbot to a set of disconnected APIs. An actual production AI platform requires far more than a model and an interface — it needs data plumbing, semantic grounding, query safety, observability, role-based access, high availability, and a real cost model.
This is the architecture I shipped for a crypto exchange — covering all of those layers. Every component is open-source and battle-tested. Nothing is theoretical:
- Data layer: PostgreSQL → CDC (Debezium + Kafka) → ClickHouse with medallion architecture (Raw → Staging → Marts)
- LLM layer: Self-hosted Qwen 2.5 72B (Apache 2.0, all data stays on-prem) + business glossary for domain grounding
- Tool layer: ClickHouse's official MCP server with bearer-token auth, SSRF protection, schema discovery + safe SQL execution
- UX layer: LibreChat (open-source, SSO, role-based) — same chat UI ClickHouse acquired and uses for DWAINE
- Observability layer: Langfuse (open-source, runs on ClickHouse) — every query, response, latency, cost tracked
- Operations layer: HA cluster, query timeouts, memory caps, RBAC, full audit trail
- JOIN strategy: tiered approach (pre-joined marts, in-memory dictionaries, UNION ALL, and a runtime-JOIN fallback) so the LLM almost never writes a slow query
This is the same Agentic Data Stack ClickHouse open-sourced and uses internally. I'm documenting how to deploy it on top of any production database (this guide uses a crypto exchange as the worked example, but the architecture transfers to retail, fintech, marketing, ops).
If you're building or evaluating an AI platform for your own data — and you want something more substantial than "GPT-4 with a database connection" — this is the deep-dive.
Note: Costs and table examples reference a crypto-derivatives platform (large trade tables, real-time wallet flows). The architectural patterns and effort estimates apply universally. Numbers are real; mileage will vary.
Table of Contents
- Executive Summary
- What We Are Building
- Why ClickHouse
- Platform Architecture
- Data Consolidation — All DBs Into ClickHouse
- How the AI Agent Works
- What Questions Can Be Asked — By Team
- How Multi-Table Queries Work
- ClickHouse MCP Server
- LLM Selection
- How the LLM Learns Our Tables (No Training Required)
- Chat Interface — LibreChat
- Observability — Langfuse
- Medallion Architecture — Data Organization
- Security and Governance
- High Availability Architecture
- Implementation Roadmap
- Cost Analysis
- Risk Assessment
- Success Metrics
- Reference Links
1. Executive Summary
Objective
Build a unified data platform on ClickHouse that consolidates tables from all the crypto exchange databases (trading, wallets, users, risk, compliance) into one place, then layer an AI agent on top so any team member can ask questions in plain English and get instant answers.
The Problem Today
- Data lives in multiple PostgreSQL databases across different services
- Getting answers requires writing SQL, knowing which DB to query, or filing a ticket to engineering
- Analytical queries on production PostgreSQL compete with live trading workloads
- No single place to query across trading, wallet, user, and compliance data together
The Solution
Any team member types: "Show me top 10 users by futures volume this week who also had withdrawals > $50K"
The AI agent:
1. Understands the question (via business glossary)
2. Generates SQL across trade_future + wallet_transactions tables (both in ClickHouse)
3. Executes via MCP server (read-only, safe)
4. Returns formatted answer with table/chart
This works because all data from different databases is consolidated into ClickHouse tables — the LLM queries structured tables directly, not embeddings or vectors.
What ClickHouse Already Provides
ClickHouse has built a complete open-source Agentic Data Stack with a ready-to-deploy GitHub repo:
- GitHub: https://github.com/ClickHouse/agentic-data-stack
- Live demo: https://llm.clickhouse.com (AgentHouse — 37 public datasets)
- Internal proof: DWAINE — 250+ ClickHouse employees use it, >200 daily messages across 50-70 daily conversations, ~70% of internal analytics use cases covered, 50-70% workload reduction on the 3-person DWH team (source)
We are not building from scratch. We are deploying ClickHouse's proven stack with our data.
Expected Outcomes
| Metric | Target |
|---|---|
| Analytical query speed | 10-100x faster than PostgreSQL for aggregations |
| Production DB load reduction | 60-80% fewer analytical queries hitting production |
| Self-service data access | Any team member can query data without SQL knowledge |
| Cross-database queries | Single query across trading, wallets, users, risk data |
| Time to insight | Seconds instead of hours |
2. What We Are Building
Before (Today)
Trading Ops needs a report:
→ Asks engineering → Engineer writes SQL → Queries prod DB → Formats result → Sends back
→ Time: hours to days
→ Impact: Analytical query slows down production
Compliance needs user transaction history:
→ Files ticket → Engineer joins data from 3 DBs → Manual export → Sends CSV
→ Time: 1-3 days
After (With This Platform)
Trading Ops opens LibreChat:
→ Types: "Top 10 markets by volume this week with liquidation count"
→ AI agent generates SQL, queries ClickHouse, returns answer in 5 seconds
→ No engineering involvement, no production DB impact
Compliance opens LibreChat:
→ Types: "All transactions for user X in last 90 days including spot, futures, and withdrawals"
→ AI agent queries across all consolidated tables, returns complete history
→ Instant, audited, repeatable
Who Uses It
| Team | What They Ask |
|---|---|
| Trading Ops | Volume, market metrics, liquidations, position summaries |
| Risk | Exposure, liquidation cascades, concentration risk |
| Compliance | User transaction history, suspicious patterns, audit trails |
| Finance | Fee revenue, P&L, trading volume trends |
| Product | User activity, feature usage, market popularity |
| Engineering | System metrics, query performance, data quality checks |
3. Why ClickHouse
Production Proof Points in Crypto
Coinhall — Real-time DEX aggregator across 23 blockchains:
- Processes billions of trade events
- 20ms candlestick chart queries on massive datasets
- Replaced PostgreSQL for analytics
- Similar data model to the crypto exchange (trades, orders, market data)
CryptoHouse — Public ClickHouse-powered crypto analytics:
- Bitcoin/Ethereum blockchain data at scale
DWAINE — ClickHouse's own internal AI agent (production):
- 250+ internal users querying data via natural language
- >200 daily messages across 50-70 daily conversations
- ~70% of internal analytics use cases covered by the agent
- ~50-70% workload reduction on the 3-person DWH team
- Source: ClickHouse — How we made our data warehouse AI-first
- This is exactly what we are building for the crypto exchange
ClickHouse vs PostgreSQL for Analytics
| Capability | PostgreSQL | ClickHouse |
|---|---|---|
| Storage model | Row-based (OLTP) | Columnar (OLAP) |
| Compression | 3-5x | 10-30x |
| Aggregation speed | Seconds-minutes | Milliseconds-seconds |
| Concurrent analytics | Impacts live trading | Isolated, zero production impact |
| Partitioning | Manual (daily) | Automatic (MergeTree) |
| Materialized views | Periodic refresh | Incremental (real-time triggers) |
| Cross-DB queries | Requires FDW/dblink or app-level joins | All data in one place |
Key Advantage — All Data In One Place
Today, querying across trading DB + wallet DB + user DB requires application-level joins or manual exports. In ClickHouse, all tables from all databases are consolidated — the AI agent (or any analyst) can query across everything in a single SQL statement.
Our largest trading table runs to hundreds of GB in PostgreSQL. With ClickHouse's columnar compression (10-30x), this drops by an order of magnitude, and column-specific queries (e.g., SUM(quantity) WHERE market_name = 'BTCPFC') only read the relevant columns.
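The compression claim is easy to verify once a table is loaded. A sketch of a per-column compression check against ClickHouse's `system.columns` system table (the `trade_future` table name is this guide's running example):

```sql
-- Sketch: inspect per-column compression on the consolidated trades table.
SELECT
    name,
    formatReadableSize(data_compressed_bytes)   AS compressed,
    formatReadableSize(data_uncompressed_bytes) AS uncompressed,
    round(data_uncompressed_bytes / data_compressed_bytes, 1) AS ratio
FROM system.columns
WHERE database = currentDatabase() AND table = 'trade_future'
ORDER BY data_compressed_bytes DESC;
```

Low-cardinality columns (market names, order types) typically compress far better than the 10-30x table-level average, which is part of why column-specific scans are so cheap.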
4. Platform Architecture
This is ClickHouse's Agentic Data Stack — three layers, all open-source:
+------------------+
| ALL TEAMS |
| (Trading Ops, |
| Risk, Finance, |
| Compliance, |
| Product) |
+--------+---------+
|
================|================
| LAYER 1: CHAT (LibreChat) |
| - Web-based chat UI |
| - SSO login |
| - Role-based access |
| - Conversation history |
=================|================
|
================|=================
| LAYER 2: LLM + MCP |
| |
| User question (plain English) |
| | |
| v |
| Qwen 2.5 LLM (self-hosted) |
| + Business Glossary |
| | |
| v |
| ClickHouse MCP Server |
| (schema discovery + SQL exec) |
| | |
===========|======================
|
===========|======================
| LAYER 3: DATA (ClickHouse) |
| |
| +---------------------------+ |
| | Trading DB tables: | |
| | trade_future | |
| | trade_spot | |
| | orders | |
| +---------------------------+ |
| | Wallet DB tables: | |
| | wallet_transactions | |
| | deposits / withdrawals | |
| +---------------------------+ |
| | User DB tables: | |
| | users, kyc_status | |
| +---------------------------+ |
| | Risk DB tables: | |
| | positions, margins | |
| +---------------------------+ |
| | Pre-aggregated views: | |
| | volume_daily | |
| | pnl_daily | |
| | liquidation_metrics | |
| | candlestick_1m/1h/1d | |
| +---------------------------+ |
| |
==================================+
^
|
===========|======================
| CDC PIPELINES (Debezium+Kafka) |
| Real-time sync from all DBs |
==================================+
    ^         ^        ^       ^
    |         |        |       |
+-------+  +------+  +----+  +----+
|Trading|  |Wallet|  |User|  |Risk|
|  DB   |  |  DB  |  | DB |  | DB |
| (PG)  |  | (PG) |  |(PG)|  |(PG)|
+-------+  +------+  +----+  +----+
Observability (Langfuse)
Langfuse runs alongside and tracks every LLM interaction:
- What question was asked, by whom
- What SQL was generated
- Whether the answer was correct
- Token usage and cost per query
- Langfuse stores its data in ClickHouse itself — no extra DB needed
5. Data Consolidation — All DBs Into ClickHouse
The Core Idea
Every PostgreSQL database at the crypto exchange feeds its tables into ClickHouse via CDC (Change Data Capture). Once in ClickHouse, the AI agent can query across all of them in a single query.
CDC Pipeline
PostgreSQL (WAL) → Debezium Connector → Kafka → ClickHouse Kafka Engine → MergeTree Tables
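The last two hops of this pipeline can be sketched in ClickHouse DDL. The broker address, topic name, and column list below are illustrative assumptions (the real ones come from the Debezium connector config), and the target `trade_future` MergeTree table is assumed to exist already:

```sql
-- Sketch: Kafka engine table consumes the Debezium topic (names illustrative).
CREATE TABLE kafka_trade_future
(
    trade_id UInt64,
    username String,
    market_name String,
    quantity Decimal(38, 8),
    price Decimal(38, 8),
    transaction_time DateTime64(3),
    _version UInt64    -- source position from CDC, used for deduplication
)
ENGINE = Kafka
SETTINGS kafka_broker_list = 'kafka:9092',
         kafka_topic_list = 'pg.trading.trade_future',
         kafka_group_name = 'clickhouse_trade_future',
         kafka_format = 'JSONEachRow';

-- Materialized view moves each consumed batch into the queryable table.
CREATE MATERIALIZED VIEW trade_future_consumer_mv TO trade_future AS
SELECT * FROM kafka_trade_future;
```

The Kafka engine table itself is just a consumer; all queries run against the MergeTree target the materialized view writes to.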
Tables to Consolidate
| Source DB | Table | Est. Size | ClickHouse Engine | Sync |
|---|---|---|---|---|
| Trading DB | trade_future | hundreds of GB | ReplacingMergeTree | CDC real-time |
| Trading DB | trade_spot | TBD | ReplacingMergeTree | CDC real-time |
| Trading DB | orders | TBD | ReplacingMergeTree | CDC real-time |
| Wallet DB | wallet_transactions | TBD | ReplacingMergeTree | CDC real-time |
| Wallet DB | deposits | TBD | MergeTree | CDC real-time |
| Wallet DB | withdrawals | TBD | MergeTree | CDC real-time |
| User DB | users | Small | ReplacingMergeTree | CDC |
| User DB | kyc_status | Small | ReplacingMergeTree | CDC |
| Risk DB | positions | TBD | ReplacingMergeTree | CDC real-time |
| Risk DB | margins | TBD | ReplacingMergeTree | CDC real-time |
| Config | market_config | Small | ReplacingMergeTree | Daily batch |
| Config | fee_tiers | Small | ReplacingMergeTree | Daily batch |
Why ReplacingMergeTree
PostgreSQL rows get UPDATEd (e.g., order status changes, position updates). ClickHouse's ReplacingMergeTree deduplicates on primary key during background merges, handling CDC update events correctly.
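A minimal sketch of such a CDC target table, with illustrative columns and an assumed `_version` column (a monotonically increasing value carried from the CDC stream, e.g. the Postgres LSN):

```sql
-- Sketch: CDC target that collapses repeated UPDATEs for the same order.
CREATE TABLE orders
(
    order_id UInt64,
    username String,
    market_name String,
    status LowCardinality(String),
    updated_at DateTime,
    _version UInt64
)
ENGINE = ReplacingMergeTree(_version)
ORDER BY order_id;

-- Deduplication happens during background merges; to force it at read time,
-- query with FINAL:
SELECT status, count() FROM orders FINAL GROUP BY status;
```

Without `FINAL`, a query may briefly see both the old and new version of a row until the next merge, which is acceptable for most analytical questions.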
Pre-Aggregated Materialized Views
ClickHouse materialized views are incremental triggers — they process each new row as it arrives (unlike PostgreSQL which requires periodic REFRESH):
| Materialized View | Source | What It Computes |
|---|---|---|
| volume_daily | trade_future, trade_spot | Volume per market per day |
| pnl_daily | trades + positions | P&L per user per market per day |
| liquidation_metrics | trade_future (order_type=1003) | Liquidation count, volume, frequency |
| candlestick_1m | trade_future, trade_spot | OHLCV per market per minute |
| candlestick_1h | chart_db.candlesticks (CDC) | OHLCV per market per hour |
| candlestick_1d | chart_db.candlesticks (CDC) | OHLCV per market per day |
| fee_revenue | trades | Fee revenue per market per tier per day |
| withdrawal_summary | withdrawals | Daily withdrawal volume per user |
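As a sketch, the `volume_daily` view from the table above might look like this (column names follow the guide's examples; `SummingMergeTree` collapses partial sums during background merges, so readers should still aggregate with GROUP BY for exact results):

```sql
-- Sketch: target table plus incremental materialized view for daily volume.
CREATE TABLE volume_daily
(
    day Date,
    market_name String,
    notional_usd Decimal(38, 8),
    trade_count UInt64
)
ENGINE = SummingMergeTree
ORDER BY (day, market_name);

-- Each inserted block of trades is aggregated on arrival and appended.
CREATE MATERIALIZED VIEW volume_daily_mv TO volume_daily AS
SELECT
    toDate(transaction_time) AS day,
    market_name,
    sum(quantity * price) AS notional_usd,
    count() AS trade_count
FROM trade_future
GROUP BY day, market_name;
```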
Cross-Database Query Example
Something impossible today (requires joining across separate PostgreSQL databases):
-- "Show users with >$100K futures volume today who also withdrew >$10K"
SELECT
t.username,
SUM(t.quantity * t.price) as futures_notional,
w.total_withdrawn
FROM trade_future t
JOIN (
SELECT username, SUM(amount) as total_withdrawn
FROM withdrawals
WHERE created_at >= today()
GROUP BY username
HAVING SUM(amount) > 10000
) w ON t.username = w.username
WHERE t.transaction_time >= today()
GROUP BY t.username, w.total_withdrawn
HAVING futures_notional > 100000
ORDER BY futures_notional DESC
The AI agent generates this SQL automatically from plain English.
6. How the AI Agent Works
The Flow
1. User types question in LibreChat
"Which users had the most liquidations today?"
2. LLM receives:
- The question
- Business glossary (liquidation = order_type 1003, etc.)
- ClickHouse schema (via MCP list_tables)
3. LLM generates SQL:
SELECT username, COUNT(*) as liq_count, SUM(quantity * price) as liq_volume
FROM trade_future
WHERE order_type = 1003
AND transaction_time >= today()
AND username NOT LIKE 'bot_%'
AND username NOT LIKE 'market_maker_%'
AND username NOT LIKE 'liquidity_%'
GROUP BY username
ORDER BY liq_count DESC
LIMIT 20
4. MCP Server executes SQL (read-only) against ClickHouse
5. LLM formats result:
"Today's top users by liquidation count:
1. user_abc — 12 liquidations, $234K total
2. user_xyz — 8 liquidations, $156K total
..."
6. Langfuse logs: question, SQL, result, latency, tokens used
Key Point — It's All Structured Table Queries
The LLM is not doing anything magical with embeddings or vectors. It is:
- Reading the ClickHouse table schema (column names, types)
- Using the business glossary to understand domain terms
- Writing standard SQL
- Executing via MCP server
All our data is in regular ClickHouse tables. The "intelligence" comes from the LLM knowing how to translate English into SQL for our specific schema.
7. What Questions Can Be Asked — By Team
Trading Operations
| Question | Tables Used | Complexity |
|---|---|---|
| "What is BTC perpetual futures volume today?" | trade_future | Single table |
| "Show top 10 markets by volume this week" | trade_future, trade_spot | Single table each, union |
| "Compare BTC futures volume this week vs last week" | trade_future | Single table, time comparison |
| "Show 1-hour candlestick chart for ETHPFC last 24 hours" | candlestick_1h (view) | Pre-aggregated view |
| "Which markets had the highest spread in the last hour?" | orders | Single table |
| "Show order book depth for BTCPFC right now" | orders | Single table, latest snapshot |
| "Average fill time for limit orders by market today" | orders | Single table |
| "How does today's volume compare to 30-day average?" | volume_daily (view) | Pre-aggregated view |
Risk Management
| Question | Tables Used | Complexity |
|---|---|---|
| "Which users have the largest open positions?" | positions | Single table |
| "Show all liquidation events in the last hour" | trade_future | Single table (order_type=1003) |
| "What percentage of trades were liquidations by market today?" | trade_future | Single table, grouping |
| "List users with concentrated exposure (>80% in one market)" | positions | Single table, aggregation |
| "Show liquidation cascade events — markets where >5 liquidations happened within 1 minute" | trade_future | Single table, window function |
| "Which users traded within 1 minute of a liquidation event?" | trade_future (self-join) | Single table, self-join |
| "Show users with margin utilization above 90%" | positions, margins | Multi-table join |
| "Correlation between liquidation volume and price movement for BTCPFC this week" | trade_future, candlestick_1h | Multi-table join |
Compliance & Audit
| Question | Tables Used | Complexity |
|---|---|---|
| "Show all transactions for user X in the last 90 days" | trade_future, trade_spot, withdrawals, deposits | Multi-table union |
| "List accounts with withdrawal volume > $100K today" | withdrawals | Single table |
| "Find users who deposited and withdrew within 24 hours" | deposits, withdrawals | Multi-table join |
| "Show all KYC-pending users who traded this week" | users, kyc_status, trade_future | Multi-table join |
| "List accounts that received deposits from the same source address" | deposits | Single table, grouping |
| "Show transaction patterns for user X — frequency, volume, timing" | trade_future, trade_spot, withdrawals | Multi-table union + aggregation |
| "Which accounts had unusual withdrawal patterns this week?" | withdrawals | Single table, anomaly detection |
| "Full audit trail: every action by user X since account creation" | trade_future, trade_spot, orders, withdrawals, deposits | Multi-table union (5 tables) |
Finance & Revenue
| Question | Tables Used | Complexity |
|---|---|---|
| "What was total fee revenue yesterday?" | fee_revenue (view) | Pre-aggregated view |
| "Monthly fee revenue breakdown by market" | fee_revenue (view) | Pre-aggregated view |
| "Compare spot vs futures fee revenue this month" | fee_revenue (view) | Pre-aggregated view, grouping |
| "Daily P&L trend for the last 30 days" | pnl_daily (view) | Pre-aggregated view |
| "Which fee tier generates the most revenue?" | trade_future, trade_spot, fee_tiers | Multi-table join |
| "Revenue impact if we reduce maker fees by 10%" | trade_future, trade_spot, fee_tiers | Multi-table join + simulation |
| "Show monthly trading volume trend with MoM growth %" | volume_daily (view) | Pre-aggregated view, window function |
Product & Growth
| Question | Tables Used | Complexity |
|---|---|---|
| "How many unique users traded futures this week?" | trade_future | Single table, COUNT DISTINCT |
| "What is our DAU trend for the last 30 days?" | trade_future, trade_spot | Multi-table union |
| "Which markets have growing vs declining user counts?" | trade_future, trade_spot | Multi-table union, time comparison |
| "New user retention — users who signed up last month, how many traded this month?" | users, trade_future | Multi-table join |
| "Top 10 users by total trading volume all-time" | trade_future, trade_spot | Multi-table union |
| "Average trades per user per day — trend over last 90 days" | trade_future, trade_spot | Multi-table union, aggregation |
| "Geographic distribution of active traders this month" | users, trade_future | Multi-table join |
Question Complexity Summary
| Type | % of Questions | How LLM Handles |
|---|---|---|
| Single table | ~40% | Direct SELECT with WHERE/GROUP BY |
| Pre-aggregated view | ~20% | Query materialized view (fastest) |
| Multi-table JOIN | ~25% | LLM generates JOIN across tables |
| Multi-table UNION | ~10% | LLM unions results from multiple tables |
| Complex (window/self-join) | ~5% | LLM uses advanced SQL patterns |
8. How Multi-Table Queries Work (JOIN Problem & Solution)
The Challenge
Many real questions require data from multiple sources:
"Show me users who had futures liquidations today AND also made withdrawals"
This needs: trade_future (Trading DB) + withdrawals (Wallet DB) + users (User DB)
Problem: ClickHouse is not optimized for runtime JOINs. Unlike PostgreSQL, JOINs in ClickHouse are memory-heavy, single-threaded (hash build phase), and can be slow on large tables. This is a well-known ClickHouse limitation.
Solution: ClickHouse uses 3 strategies to avoid runtime JOINs. This is exactly how DWAINE (ClickHouse's own AI agent) handles it.
Strategy 1 — Pre-Joined Denormalized Marts (Primary Approach)
JOINs are computed at data load time, not at query time. Refreshable Materialized Views periodically run the JOINs in the background and write the results into flat, wide "mart" tables. The AI agent queries these pre-joined tables directly — no JOINs needed at query time.
How it works:
Staging Layer (normalized tables):
staging.trade_future — raw trades from CDC
staging.users — user profiles from CDC
staging.wallets — wallet transactions from CDC
| Refreshable Materialized View (runs every 1-5 minutes)
| Executes the JOINs in background
v
Marts Layer (denormalized, pre-joined):
mart_user_trading_activity
— username, kyc_status, registered_at, ← from users
futures_volume_today, spot_volume_today, ← from trades
withdrawal_total_today, deposit_total, ← from wallets
liq_count_today, liq_volume_today ← from trades (order_type=1003)
The AI agent only sees the Marts layer. When someone asks "Show users with >$50K volume who also withdrew >$10K", the LLM queries one flat table — no JOIN:
-- NO JOIN needed — all data is already in one table
SELECT username, futures_volume_today, withdrawal_total_today
FROM mart_user_trading_activity
WHERE futures_volume_today > 50000
AND withdrawal_total_today > 10000
ORDER BY futures_volume_today DESC
This is ClickHouse's official recommendation for AI agents:
"LLMs should only be exposed to the Marts layer — which is already pre-joined and denormalized. The agent has no need to write JOINs."
— How to set up ClickHouse for agentic analytics
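A hedged sketch of the background JOIN using a refreshable materialized view (the `REFRESH EVERY` clause requires a recent ClickHouse release; table and column names are this guide's examples, and both JOIN inputs are pre-aggregated per user so the joined row counts stay small):

```sql
-- Sketch: the JOIN runs every 5 minutes in the background, not at query time.
CREATE MATERIALIZED VIEW mart_user_trading_activity_rmv
REFRESH EVERY 5 MINUTE TO mart_user_trading_activity AS
SELECT
    u.username,
    u.kyc_status,
    t.futures_volume_today,
    t.liq_count_today,
    w.withdrawal_total_today
FROM staging.users AS u
LEFT JOIN
(
    -- Pre-aggregate trades per user before joining.
    SELECT username,
           sum(quantity * price)      AS futures_volume_today,
           countIf(order_type = 1003) AS liq_count_today
    FROM staging.trade_future
    WHERE transaction_time >= today()
    GROUP BY username
) AS t ON t.username = u.username
LEFT JOIN
(
    -- Pre-aggregate withdrawals per user before joining.
    SELECT username, sum(amount) AS withdrawal_total_today
    FROM staging.wallets
    WHERE transaction_type = 'withdrawal' AND created_at >= today()
    GROUP BY username
) AS w ON w.username = u.username;
```

Pre-aggregating each side to one row per user avoids the row-multiplication you would get by joining raw trades against raw wallet rows.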
Strategy 2 — Dictionaries (For Dimension Lookups)
Small, stable lookup tables (users, markets, fee_tiers) are loaded into in-memory Dictionaries. Instead of a JOIN, the LLM uses dictGet() — an O(1) key-value lookup that is 25x faster than a hash join.
-- INSTEAD OF THIS (slow JOIN):
SELECT t.username, t.quantity, u.kyc_status
FROM mart_trades_futures t
JOIN mart_users u ON t.username = u.username
WHERE t.transaction_time >= today()
-- THE LLM GENERATES THIS (fast dictionary lookup):
SELECT
username,
quantity,
dictGet('users_dict', 'kyc_status', username) AS kyc_status
FROM mart_trades_futures
WHERE transaction_time >= today()
Performance comparison (from ClickHouse benchmarks):
| Method | Time | Memory |
|---|---|---|
| dictGet() with flat layout | 0.044 sec | 84 MB |
| dictGet() with hashed layout | 0.113 sec | 103 MB |
| Parallel Hash JOIN | 0.690 sec | 4.8 GB |
| Standard Hash JOIN | 1.133 sec | 4.4 GB |
Which tables become Dictionaries:
| Dictionary | Source | Size | Lookup Key |
|---|---|---|---|
| users_dict | users table | Small | username |
| markets_dict | market_config | Small | market_name |
| fee_tiers_dict | fee_tiers | Small | tier_id |
| kyc_dict | kyc_status | Small | username |
The business glossary teaches the LLM to use dictGet() instead of JOINs for these dimension tables.
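A sketch of one such dictionary, assuming the consolidated `users` table from earlier. Because the lookup key is a String, a complex-key layout is needed (the FLAT and HASHED layouts require UInt64 keys), and `dictGet` then takes the key wrapped in `tuple()`:

```sql
-- Sketch: users dimension as an in-memory dictionary keyed by username.
CREATE DICTIONARY users_dict
(
    username String,
    kyc_status String,
    registered_at DateTime
)
PRIMARY KEY username
SOURCE(CLICKHOUSE(TABLE 'users'))
LAYOUT(COMPLEX_KEY_HASHED())
LIFETIME(MIN 300 MAX 360);   -- reload from the source roughly every 5 minutes

-- O(1) lookup instead of a JOIN:
SELECT
    username,
    dictGet('users_dict', 'kyc_status', tuple(username)) AS kyc_status
FROM mart_trades_futures
LIMIT 10;
```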
Strategy 3 — Runtime JOINs (Fallback for Ad-Hoc Queries)
For truly ad-hoc questions that don't fit a pre-built mart table, ClickHouse does support runtime JOINs with optimizations:
| Optimization | What It Does |
|---|---|
| join_algorithm = 'auto' | ClickHouse picks the best algorithm automatically |
| Automatic table reordering | Puts smaller table on the right side (since v24.12) |
| Global join reordering | Optimizes 3+ table JOIN chains (since v25.9) |
| ANY JOIN | Stops at first match — faster when you only need one row |
| Pre-filter with WHERE | Reduce table size before joining |
When this is acceptable:
- Small-to-medium result sets (both sides < few million rows after filtering)
- Ad-hoc questions that don't match any existing mart table
- One-time analysis queries where latency is less critical
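A sketch of what a guarded fallback query might look like, combining these optimizations (table names, the `snapshot_time` column, and the resource limits are illustrative):

```sql
-- Sketch: ad-hoc JOIN with both sides pre-filtered, algorithm left to the
-- planner, and execution capped so a bad plan cannot hurt the cluster.
SELECT
    t.username,
    m.margin_ratio,
    sum(t.quantity * t.price) AS notional
FROM trade_future AS t
ANY LEFT JOIN
(
    -- One row per user: ANY JOIN stops at the first (only) match.
    SELECT username, max(margin_ratio) AS margin_ratio
    FROM margins
    WHERE snapshot_time >= today()
    GROUP BY username
) AS m ON m.username = t.username
WHERE t.transaction_time >= today()
GROUP BY t.username, m.margin_ratio
SETTINGS join_algorithm = 'auto',
         max_execution_time = 30,           -- seconds
         max_memory_usage = 10000000000;    -- ~10 GB
```

The same timeout and memory caps described in the Operations layer apply here, so a runaway fallback JOIN fails fast instead of degrading the cluster.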
How Each Question Type Is Handled
| Question Type | Strategy | Example |
|---|---|---|
| Single-table query | Direct query on mart table | "Top 10 markets by volume today" |
| Pre-aggregated metric | Query materialized view | "Daily volume trend last 30 days" |
| Cross-domain (trading + wallets) | Pre-joined mart table | "Users with high volume AND large withdrawals" |
| Dimension lookup (user info) | Dictionary (dictGet) | "Show KYC status for top traders" |
| Full audit trail (multiple activity types) | UNION ALL (no JOIN needed) | "All activity for user X in 30 days" |
| Rare ad-hoc correlation | Runtime JOIN (fallback) | "Correlation between margin % and liquidation frequency" |
Pre-Joined Mart Tables We Build
| Mart Table | Pre-Joined Sources | Refreshes | Use Case |
|---|---|---|---|
| mart_user_trading_activity | trades + users + wallets | Every 5 min | "Users who traded AND withdrew" |
| mart_user_risk_profile | positions + margins + trades (liquidations) | Every 1 min | "Users at risk of liquidation" |
| mart_liquidation_detail | trades + users + positions | Every 1 min | "Liquidation events with user context" |
| mart_user_full_activity | trades + spot + orders + wallets | Every 5 min | "Complete activity for user X" |
| mart_market_daily_summary | trades + orders + market_config | Every 1 min | "Market performance today" |
| mart_compliance_view | trades + wallets + users + kyc | Every 5 min | "Compliance team queries" |
Example — How Pre-Joining Works End-to-End
Question: "Show KYC-verified users who had liquidations this week with their withdrawal history"
Without pre-joining (SLOW — 3-table runtime JOIN):
-- BAD: Runtime JOIN across 3 large tables
SELECT u.username, u.kyc_status, liq.liq_count, wd.withdrawal_total
FROM staging.users u
JOIN staging.trade_future liq ON u.username = liq.username
JOIN staging.wallets wd ON u.username = wd.username
WHERE ...
-- Slow, memory-heavy, may timeout
With pre-joining (FAST — single table scan):
-- GOOD: Query pre-joined mart table, no JOIN needed
SELECT username, kyc_status, liq_count_week, withdrawal_total_week
FROM mart_user_trading_activity
WHERE kyc_status = 'verified'
AND liq_count_week > 0
ORDER BY liq_count_week DESC
The JOIN happened in the background when the Refreshable Materialized View populated mart_user_trading_activity. The AI agent never writes a JOIN.
Example — Full Audit Trail (UNION ALL, No JOIN)
For "show all activity for user X", no JOIN is needed — we use UNION ALL across tables, which ClickHouse handles efficiently since each branch scans one table independently:
SELECT * FROM (
SELECT 'futures_trade' AS activity, transaction_time AS event_time,
market_name, side, quantity, price, quantity * price AS value_usd
FROM mart_trades_futures
WHERE username = 'john_doe' AND transaction_time >= today() - 30
UNION ALL
SELECT 'spot_trade', transaction_time, market_name, side,
quantity, price, quantity * price
FROM mart_trades_spot
WHERE username = 'john_doe' AND transaction_time >= today() - 30
UNION ALL
SELECT 'withdrawal', created_at, currency, 'out',
amount, 0, amount
FROM mart_wallets
WHERE username = 'john_doe' AND transaction_type = 'withdrawal'
AND created_at >= today() - 30
UNION ALL
SELECT 'deposit', created_at, currency, 'in',
amount, 0, amount
FROM mart_wallets
WHERE username = 'john_doe' AND transaction_type = 'deposit'
AND created_at >= today() - 30
)
ORDER BY event_time DESC
UNION ALL is not a JOIN — each branch runs independently and results are concatenated. This is fast in ClickHouse.
Summary — JOIN Avoidance Strategy
Question comes in from user
|
v
Does a pre-joined mart table cover this?
YES → Query the mart table directly (no JOIN, fastest)
|
NO
v
Does it need a small dimension lookup (user info, market name)?
YES → Use dictGet() dictionary lookup (25x faster than JOIN)
|
NO
v
Does it need data from multiple activity types for one user?
YES → Use UNION ALL across tables (no JOIN, each branch independent)
|
NO
v
Truly ad-hoc cross-table correlation?
→ Use runtime JOIN with auto algorithm (fallback, acceptable for small result sets)
→ Log in Langfuse — if this pattern repeats, build a new mart table for it
Performance Comparison
| Approach | Query Time | Memory | When to Use |
|---|---|---|---|
| Pre-joined mart table | Milliseconds | Minimal | ~60% of questions (cross-domain) |
| Dictionary (dictGet) | Milliseconds | Minimal | ~15% of questions (dimension lookup) |
| UNION ALL | 1-2 seconds | Low | ~15% of questions (audit trails) |
| Runtime JOIN (fallback) | 2-10 seconds | High | ~10% of questions (rare ad-hoc) |
Reference
| Resource | URL |
|---|---|
| Minimize and optimize JOINs | https://clickhouse.com/docs/best-practices/minimize-optimize-joins |
| Denormalizing Data | https://clickhouse.com/docs/data-modeling/denormalization |
| Using Dictionaries to Accelerate Queries | https://clickhouse.com/blog/faster-queries-dictionaries-clickhouse |
| Direct Join (25x faster) | https://clickhouse.com/blog/clickhouse-fully-supports-joins-direct-join-part4 |
| Choosing the Right Join Algorithm | https://clickhouse.com/blog/clickhouse-fully-supports-joins-how-to-choose-the-right-algorithm-part5 |
| How to set up ClickHouse for agentic analytics | https://clickhouse.com/blog/how-to-set-up-clickhouse-for-agentic-analytics |
| Postgres to ClickHouse: Data Modeling Tips | https://clickhouse.com/blog/postgres-to-clickhouse-data-modeling-tips-v2 |
9. ClickHouse MCP Server
What It Is
MCP (Model Context Protocol) is an open standard that lets LLMs interact with external systems safely. ClickHouse maintains an official MCP server for its database:
- GitHub: https://github.com/ClickHouse/mcp-clickhouse
- PyPI: mcp-clickhouse (220K+ downloads)
- License: Open source
What It Does (and Does NOT Do)
| Does / Does NOT | Description |
|---|---|
| Does | Exposes ClickHouse schema to the LLM (tables, columns, types) |
| Does | Executes read-only SQL queries generated by the LLM |
| Does | Returns results in an LLM-readable format |
| Does | Enforces read-only access (no INSERT/UPDATE/DELETE by default) |
| Does NOT | Convert natural language to SQL (that's the LLM's job) |
| Does NOT | Store data or state |
Tools Exposed
| MCP Tool | What It Does |
|---|---|
| list_databases | Lists all databases in ClickHouse |
| list_tables | Lists tables with columns, types, pagination |
| run_query | Executes read-only SQL, returns results |
| run_chdb_select_query | Embedded chDB queries against files/URLs |
Integration Flow
Qwen LLM ←— MCP Protocol (JSON-RPC) —→ ClickHouse MCP Server ←— Native Protocol —→ ClickHouse Tables
The MCP server is the controlled gateway — the LLM can discover schema and run queries, but cannot modify data or access restricted tables.
How the MCP Server Actually Works (Transport & Security)
Transport — Streamable HTTP (not stdio):
The MCP server runs as a persistent HTTP service. LibreChat sends HTTP POST requests to it when the LLM needs to call a tool (list_tables, run_query, etc.). This is different from subprocess-based MCP (stdio) — the server stays running, handles concurrent requests, and is addressable as a microservice within the deployment.
Authentication — Bearer Token:
LibreChat authenticates to the MCP server using a bearer token in the Authorization header. This ensures only LibreChat can call the MCP server — not other services or arbitrary HTTP clients on the same network.
SSRF Protection — Domain Whitelist:
LibreChat enforces an allowlist of domains it will make outbound HTTP requests to. This prevents a scenario where the LLM instructs LibreChat to call internal infrastructure URLs (e.g., http://internal-service/admin). Only the MCP server address and ClickHouse Cloud endpoint are whitelisted.
chDB — Query Files and URLs Without ETL:
The run_chdb_select_query tool uses chDB (embedded ClickHouse engine) to query local files, S3 URLs, or Parquet files directly — without loading them into ClickHouse tables first. chDB 4 adds a Pandas-like DataStore API with lazy execution and filter/column pushdown, useful for data scientists who upload files to LibreChat and want to query them immediately.
Remote MCP (ClickHouse Cloud):
For ClickHouse Cloud deployments, ClickHouse operates a hosted Remote MCP Server — agents connect directly without self-hosting the MCP service. For self-managed ClickHouse (our case), the MCP server runs as a container alongside ClickHouse.
Compatible Agent Frameworks
ClickHouse has tested its MCP server with 12 agent frameworks:
Agno, DSPy, LangChain, LlamaIndex, PydanticAI, Claude Agent SDK, OpenAI Agents SDK, CrewAI, Google ADK, Microsoft Agent Framework, mcp-agent, Upsonic
10. LLM Selection
Primary: Qwen 2.5 72B (Self-Hosted)
| Factor | Detail |
|---|---|
| License | Apache 2.0 (open source, no vendor lock-in) |
| SQL generation | Top-tier on text-to-SQL benchmarks |
| Multilingual | English, Chinese, Japanese |
| Size | 72B params — accurate enough, runs on a single GPU node |
| Data privacy | All data stays on our infrastructure |
| Cost | Fixed GPU cost, not per-token |
Infrastructure
| Component | Specification |
|---|---|
| GPU instance | g5.12xlarge (4x A10G, 96GB VRAM) |
| VRAM needed | ~40GB for 72B INT4 quantized |
| Framework | vLLM (high-throughput inference) |
| Throughput | 30-50 tokens/second |
| Fallback | Qwen 2.5 32B on smaller GPU |
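The ~40GB VRAM figure is a back-of-envelope estimate you can sanity-check; the 4GB overhead allowance for KV cache and activations is an assumption, and real headroom depends on batch size and context length:

```python
params = 72e9                        # 72B parameters
weights_gb = params * 0.5 / 1e9      # INT4 ~= 0.5 bytes/param -> 36 GB of weights
overhead_gb = 4                      # rough allowance for KV cache + activations (assumption)
total_gb = weights_gb + overhead_gb  # ~40 GB, well within 96 GB across 4x A10G
```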
Alternatives
| Model | Size | Notes |
|---|---|---|
| Qwen 2.5 32B | 32B | Lighter, cheaper GPU, slightly less accurate |
| Llama 3.1 70B | 70B | Meta, strong general purpose |
| DeepSeek V3 | 671B MoE (~37B active params) | Strong on code/SQL |
| Claude Opus 4.7 | API | Best for complex multi-step analysis — use only when data is not restricted to internal infra |
Model selection rule: Use self-hosted Qwen for all queries involving the crypto exchange trading/user/wallet data (data stays on-prem, zero external API calls). Use Claude Opus 4.7 via API only for non-sensitive analytical work, development, or testing where data residency is not a constraint.
AGENTS.md — Per-Environment Agent Instructions
ClickHouse Assistant and agent frameworks support an AGENTS.md file — a plain text file placed in the agent's working context that injects domain-specific instructions into every session. This is complementary to the business glossary:
| | Business Glossary | AGENTS.md |
|---|---|---|
| Purpose | Maps domain terms to SQL | Sets agent behavior rules |
| Example | "liquidation" = order_type 1003 | "Always exclude market makers from results" |
| Example | "volume" = SUM(quantity) | "Default time window is today() unless specified" |
| Format | YAML key-value pairs | Plain text instructions |
| Effect | Teaches SQL mappings | Shapes how the agent reasons and responds |
Both are injected into the LLM's system prompt at query time — no retraining needed.
DWAINE Reference
ClickHouse's own DWAINE agent uses Claude via AWS Bedrock. We choose self-hosted Qwen for data privacy and cost control, but the architecture supports swapping LLM backends.
11. How the LLM Learns Our Tables (No Training Required)
Key Point — No Model Training Needed
The LLM does not need to be trained or fine-tuned to understand the crypto exchange tables. It learns everything it needs at query time through two mechanisms: automatic schema discovery and a business glossary.
Mechanism 1 — MCP Schema Discovery (Automatic)
Every time the LLM receives a question, it calls list_tables via the MCP server. ClickHouse returns the full schema automatically — table names, column names, data types:
MCP returns to LLM automatically:
Table: mart_trades_futures
- username String
- market_name String
- transaction_time DateTime
- side String
- quantity Float64
- price Float64
- order_type Int32 ← LLM sees this but doesn't know 1003 = liquidation
Table: mart_wallets
- username String
- transaction_type String ('deposit', 'withdrawal')
- amount Float64
- currency String
- created_at DateTime
Table: mart_user_trading_activity
- username String
- kyc_status String
- futures_volume_today Float64
- withdrawal_total_today Float64
- liq_count_today Int32
...
No configuration needed for this. When a new table is added to ClickHouse, the LLM sees it automatically on the next query.
Mechanism 2 — Business Glossary (You Configure This)
The schema alone is not enough. The LLM sees order_type Int32 but doesn't know what 1003 means. The business glossary is a configuration file (not model training) that gets injected into the LLM's prompt:
Business Glossary (business_glossary.yaml):
- "liquidation" = trade_future WHERE order_type = 1003
- "market maker" = username LIKE 'bot_%' OR username LIKE 'market_maker_%' OR username LIKE 'liquidity_%'
- "perpetual futures" = market_name LIKE '%PFC%'
- "volume" = SUM(quantity) from mart_trades_futures
- "notional" = SUM(quantity * price)
- "BTCPFC" = Bitcoin perpetual futures contract
- "whitelisted user" = user in internal_account_whitelist table
- "P&L" = pnl_daily materialized view
- "candlestick" = candlestick_1m/1h/1d materialized views
- "spot trade" = mart_trades_spot table
- "deposit" = mart_wallets WHERE transaction_type = 'deposit'
- "withdrawal" = mart_wallets WHERE transaction_type = 'withdrawal'
This is just a text file that gets added to the LLM's system prompt. You update it anytime — no retraining, no downtime.
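A minimal sketch of how the injection works. The glossary entries are abridged from above; `build_system_prompt` is a hypothetical helper, not LibreChat's actual code:

```python
# Abridged glossary entries from business_glossary.yaml (illustrative subset)
GLOSSARY = {
    "liquidation": "trade_future WHERE order_type = 1003",
    "volume": "SUM(quantity) from mart_trades_futures",
    "withdrawal": "mart_wallets WHERE transaction_type = 'withdrawal'",
}
AGENT_RULES = "Always exclude market makers from results."

def build_system_prompt(schema_ddl: str) -> str:
    # Concatenate instructions + glossary + live schema into one system prompt
    glossary_lines = "\n".join(f'- "{term}" = {sql}' for term, sql in GLOSSARY.items())
    return (
        "You are a SQL analyst for the crypto exchange. "
        "Generate ClickHouse SQL. Read-only only.\n\n"
        f"Business glossary:\n{glossary_lines}\n\n"
        f"Agent rules:\n{AGENT_RULES}\n\n"
        f"Schema (from MCP list_tables):\n{schema_ddl}"
    )
```

Editing the glossary dict (or the YAML file behind it) changes the very next prompt; nothing is cached in model weights.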
Full Glossary Entries
| Term | Definition | SQL Mapping |
|---|---|---|
| Liquidation | Forced position closure | order_type = 1003 in trade_future |
| Perpetual futures | Non-expiring futures | market_name LIKE '%PFC%' |
| Market maker | Internal trading bots | username LIKE 'bot_%' OR username LIKE 'market_maker_%' OR username LIKE 'liquidity_%' |
| Trading volume | Total quantity | SUM(quantity) on trade_future/trade_spot |
| Notional volume | USD value | SUM(quantity * price) |
| Whitelisted user | Excluded from checks | internal_account_whitelist table |
| Deposit | Fiat/crypto incoming | mart_wallets WHERE transaction_type = 'deposit' |
| Withdrawal | Fiat/crypto outgoing | mart_wallets WHERE transaction_type = 'withdrawal' |
| P&L | Realized profit/loss | pnl_daily materialized view |
| Candlestick | OHLCV price bars | candlestick_1m/1h/1d views |
| Spot trade | Immediate exchange | mart_trades_spot table |
| Futures trade | Derivatives trade | mart_trades_futures table |
| ADL | Auto-deleveraging | order_type = 1004 in trade_future |
| Funding fee | Periodic futures fee | mart_funding_payments table |
| Open interest | Total open positions | SUM(quantity) on mart_positions |
What Actually Happens Per Query
User types: "Show top liquidated users today"
|
v
LLM receives in its prompt:
┌─────────────────────────────────────────────────┐
│ 1. System instructions │
│ "You are a SQL analyst for the crypto exchange. │
│ Generate ClickHouse SQL. Read-only only." │
│ │
│ 2. Business glossary (text file) │
│ "liquidation = order_type 1003" │
│ "market maker = bot_%, market_maker_%, liquidity_%" │
│ ... (all glossary entries) │
│ │
│ 3. ClickHouse schema (from MCP list_tables) │
│ mart_trades_futures: username, market_name, │
│ transaction_time, order_type, quantity, ... │
│ mart_wallets: username, amount, ... │
│ ... (all mart tables) │
│ │
│ 4. The user's question │
│ "Show top liquidated users today" │
└─────────────────────────────────────────────────┘
|
v
LLM generates SQL using all 3 context sources:
SELECT username, COUNT(*) as liq_count, SUM(quantity * price) as liq_volume
FROM mart_trades_futures
WHERE order_type = 1003 ← from glossary
AND transaction_time >= today() ← from schema (knows column type)
AND username NOT LIKE 'bot_%' ← from glossary (exclude market makers)
GROUP BY username
ORDER BY liq_count DESC LIMIT 20
Training vs Context Injection
| | Training / Fine-tuning | Context Injection (What We Do) |
|---|---|---|
| What it is | Modify the model's internal weights | Pass information in the prompt at query time |
| Time to update | Hours-days (retrain model) | Seconds (edit a text file) |
| Cost | Expensive (GPU hours) | Free |
| When schema changes | Must retrain | Automatic (MCP reads live schema) |
| When business terms change | Must retrain | Edit glossary file, instant effect |
| Risk | Can degrade model quality | None |
| What DWAINE uses | No training | Context injection (glossary + MCP) |
When Would You Fine-Tune? (Optional, Phase 5+)
Fine-tuning is not required but can be done later to push accuracy higher:
| Phase | Approach | Expected Coverage |
|---|---|---|
| Weeks 1-12 | MCP schema + business glossary only (no training) | Aim for DWAINE-class — agent handles ~70% of analytics use cases (the other 30% still go to a data engineer) |
| Post Week 12 (optional) | Fine-tune Qwen on successful query pairs from Langfuse logs | Higher coverage + lower per-query latency |
Fine-tuning would train the model on patterns like:
Input: "top liquidated users today"
Output: "SELECT username, COUNT(*) FROM mart_trades_futures WHERE order_type = 1003
AND transaction_time >= today() GROUP BY username ORDER BY COUNT(*) DESC LIMIT 10"
Langfuse collects thousands of these successful question→SQL pairs over time, which become the fine-tuning dataset. But this is an optimization, not a requirement.
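A sketch of what that dataset extraction could look like. The trace shape and field names here are illustrative, not Langfuse's actual export schema:

```python
# Hypothetical shape of exported Langfuse traces (field names are illustrative)
traces = [
    {"question": "top liquidated users today",
     "sql": "SELECT username FROM mart_trades_futures WHERE order_type = 1003",
     "correct": True},
    {"question": "volume by market",
     "sql": "SELECT market FROM mart_trades_futures",
     "correct": False},
]

def to_finetune_pairs(traces: list) -> list:
    # Only verified-correct question -> SQL pairs become training examples;
    # failures feed the glossary review loop instead.
    return [
        {"prompt": t["question"], "completion": t["sql"]}
        for t in traces
        if t["correct"]
    ]
```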
How the Glossary Improves Over Time
Coverage and quality grow as the glossary fills in. The exact accuracy curve depends on workload — what we observed (and what ClickHouse reports for DWAINE) is that glossary growth correlates strongly with the share of questions the agent can answer end-to-end without analyst involvement:
Week 1: Glossary has 50 terms → narrow coverage; agent fields easy single-table questions
Week 4: Glossary has 100 terms → most common cross-domain questions handled
Week 8: Glossary has 200 terms → DWAINE-class — ~70% of analytics use cases covered
Week 12: Glossary has 300+ terms → covers long-tail domain language
Post-12: Fine-tune on Langfuse logs → tighter SQL, lower latency, higher first-pass success
The improvement cycle:
- User asks a question → LLM generates wrong SQL
- Langfuse logs the failure (question, wrong SQL, error)
- Team reviews and adds missing glossary entry
- Next time the same question is asked → correct SQL
ClickHouse's published takeaway for DWAINE is that business glossary quality is the #1 driver of agent usefulness — they reached the ~70% use-case-coverage milestone primarily through glossary expansion, not LLM fine-tuning.
12. Chat Interface — LibreChat
What It Is
LibreChat is an open-source chat interface acquired by ClickHouse (November 2025). It is the front-end of the Agentic Data Stack.
- Self-hosted — no data leaves the crypto exchange infrastructure
- Web-based — accessible from any browser
- Connects to self-hosted Qwen LLM backend
- MCP plugin connects LLM to ClickHouse
- Conversation history and bookmarking
- Supports tables and chart rendering in responses
Role-Based Access
| Role | Sees | Example Questions |
|---|---|---|
| Trading Ops | All trading data, metrics | "Top markets by volume this week" |
| Risk | Positions, liquidations, exposure | "Users with largest open positions" |
| Compliance | User transactions, audit trails | "All transactions for user X in Q1" |
| Finance | Revenue, fees, P&L | "Monthly fee revenue by market" |
| Product | User activity, market stats | "DAU trend for futures last 30 days" |
Supporting Services — Why Each Exists
LibreChat is not a standalone service — it depends on four supporting components:
MongoDB — Conversation History Store:
LibreChat is stateless by design. MongoDB stores everything that needs to persist: all conversation threads, message history, user accounts, bookmarks, and presets. Without MongoDB, every browser refresh loses the conversation. MongoDB is chosen because LibreChat's data model is document-oriented (conversations are nested JSON objects with variable structure), not relational.
Meilisearch — Conversation Search Index:
Meilisearch provides full-text search over conversation history. When a user types in the LibreChat search bar to find a past query ("find my liquidation analysis from last week"), Meilisearch returns results in milliseconds. It maintains a search index that mirrors the MongoDB conversations.
pgvector + RAG API — Document Query:
LibreChat supports document uploads (PDFs, CSVs, text files). The RAG API processes uploaded files: it chunks the text, generates embeddings using an embedding model, and stores them in pgvector (PostgreSQL with vector extension). When a user asks a question while a document is attached, the RAG API retrieves the most relevant chunks from pgvector and injects them into the LLM prompt — enabling "ask questions about this document" without fine-tuning.
Qwen Connection — OpenAI-Compatible API:
LibreChat connects to Qwen via the OpenAI-compatible API that vLLM exposes. vLLM serves Qwen 2.5 72B with an endpoint that speaks the same protocol as OpenAI's API (/v1/chat/completions). From LibreChat's perspective, Qwen is just another OpenAI-compatible provider — no custom integration needed. Switching to a different model (e.g., Llama, DeepSeek) only requires changing the endpoint URL.
Live Demo
ClickHouse operates a public demo at https://llm.clickhouse.com (AgentHouse) showing this exact flow on 37 public datasets. Our deployment is the same architecture with the crypto exchange's private data.
13. Observability — Langfuse
What It Is
Langfuse (acquired by ClickHouse, January 2026) is an LLM observability platform. It tracks every interaction between users and the AI agent.
What It Tracks
| Metric | Why It Matters |
|---|---|
| Every question asked | Know what teams need |
| SQL generated | Debug wrong answers |
| Latency per query | Monitor performance |
| Token usage | Track compute cost |
| Accuracy (correct/incorrect) | Measure improvement |
| User identity | Compliance audit trail |
Why This Matters
Without Langfuse, the AI agent is a black box. With it:
- We know when answers are wrong and can fix the glossary
- We measure accuracy improvement over time (target: 70% → 85%)
- Compliance can audit every question and data access
- We track cost per team/query type
How Traces Actually Flow (LibreChat → Langfuse → ClickHouse)
Langfuse has a built-in integration in LibreChat — no code changes or custom instrumentation needed. LibreChat automatically sends every LLM interaction to Langfuse via three environment variables (LANGFUSE_PUBLIC_KEY, LANGFUSE_SECRET_KEY, LANGFUSE_BASE_URL). Once set, every prompt and response is traced automatically.
Async trace pipeline — why Redis is needed:
LibreChat does not write traces to ClickHouse directly. It writes trace events to a Redis queue. A separate Langfuse worker process reads from the Redis queue and writes the processed traces to ClickHouse. This decouples trace ingestion from query execution — a slow Langfuse write never adds latency to the user's chat response.
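The decoupling pattern, sketched with an in-process `queue.Queue` standing in for Redis (the real pipeline uses a Redis list/stream and a separate worker process):

```python
import json
import queue

trace_queue = queue.Queue()  # in-process stand-in for the Redis queue

def emit_trace(event: dict) -> None:
    # Chat path: enqueue and return immediately -- no ClickHouse write
    # in the hot path, so tracing never slows the user's response.
    trace_queue.put(json.dumps(event))

def drain_batch(max_events: int = 1000) -> list:
    # Worker path: drain events and batch-insert into ClickHouse (insert not shown)
    batch = []
    while not trace_queue.empty() and len(batch) < max_events:
        batch.append(json.loads(trace_queue.get()))
    return batch
```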
Why Langfuse uses ClickHouse as its own backend:
Langfuse stores trace events (prompts, responses, latency, token counts) in ClickHouse because traces are time-series append-only data — exactly the workload ClickHouse is optimized for. The Langfuse analytics dashboard (cost per team, accuracy trends, slow queries) runs SQL aggregations over these traces in milliseconds. Metadata (projects, users, API keys) is stored in PostgreSQL. Binary assets (uploaded files, large trace payloads) are stored in MinIO (S3-compatible object storage).
The Shared ClickHouse Instance
The same ClickHouse instance serves two roles simultaneously:
| Role | Database | What It Stores |
|---|---|---|
| Agent data warehouse |
marts, staging, raw
|
business data — trades, wallets, users, risk |
| Langfuse event store | langfuse |
LLM traces — prompts, SQL, latency, token counts |
This is intentional. It means you can write SQL that crosses both roles — for example: "which teams generated the most incorrect SQL this week, and what business tables were they querying?" — joining Langfuse trace data with business context in a single query.
14. Medallion Architecture — Data Organization
ClickHouse recommends a three-layer data organization for AI agents (from their setup guide):
+------------------+ +------------------+ +------------------+
| RAW DATABASE | --> | STAGING DATABASE | --> | MARTS DATABASE |
| | | | | |
| Untransformed | | Deduplicated | | Curated tables |
| CDC events from | | Normalized | | Aggregated views |
| all PostgreSQL | | Cleaned | | AI agent queries |
| databases | | | | THIS layer only |
+------------------+ +------------------+ +------------------+
(ingestion) (transformation) (consumption)
Why Three Layers
| Layer | Purpose | Who Accesses |
|---|---|---|
| Raw | Unmodified CDC events, full history | Data engineers only |
| Staging | Deduplicated, normalized, cleaned | Data engineers, advanced analysts |
| Marts | Curated, aggregated, business-ready | AI agent and all users |
The AI agent only accesses the Marts layer — curated, pre-validated tables with clear column names and business meaning. This reduces LLM errors and prevents access to raw/sensitive data.
Marts Tables (What the AI Agent Sees)
| Marts Table | Source | Description |
|---|---|---|
| mart_trades_futures | trade_future (staging) | Clean futures trades, no internal accounts |
| mart_trades_spot | trade_spot (staging) | Clean spot trades |
| mart_orders | orders (staging) | Order history with status |
| mart_wallets | wallet_transactions (staging) | Deposits, withdrawals, transfers |
| mart_users | users + kyc (staging) | User profiles (PII masked) |
| mart_positions | positions (staging) | Current open positions |
| mart_volume_daily | Materialized view | Daily volume by market |
| mart_pnl_daily | Materialized view | Daily P&L by user/market |
| mart_liquidations | Materialized view | Liquidation events and metrics |
| mart_candlesticks | Materialized view | OHLCV candles (1m/1h/1d) |
| mart_fee_revenue | Materialized view | Fee revenue by market/tier |
15. Security and Governance
Data Access Controls
| Control | Implementation |
|---|---|
| Authentication | SSO integration (existing corporate identity) |
| Authorization | Role-based — teams see only permitted mart tables |
| Query restrictions | MCP server enforces read-only (no mutations) |
| AI agent access | Marts layer only (no raw or staging data) |
| PII protection | User PII masked in marts tables |
| Audit trail | Every query logged in Langfuse with user identity |
| Network | ClickHouse in private VPC, no public access |
LLM Role Resource Guardrails
The LLM database user must have hard resource limits to prevent runaway agent queries from affecting the cluster. ClickHouse's own agentic analytics setup guide prescribes these exact values:
| Guardrail | Limit | Why |
|---|---|---|
| Query timeout | 30 seconds | Agent queries that scan too much will abort cleanly |
| Memory per query | 2 GB | Prevents a single agent query from exhausting node memory |
| Max rows scanned | 100 million | Forces the LLM to use indexes and partitions — catches unfiltered full scans |
| Max bytes scanned | 5 GB | Secondary byte-level cap alongside row limit |
| Max threads | 4 CPUs | Limits CPU contention with other workloads |
| Access type | SELECT-only on marts.* | No mutations, no access to raw or staging layers |
The LLM role gets SELECT access only on the marts schema — never on raw or staging. This is a hard access control, not just a glossary convention. If the LLM generates a query against a staging table, the database rejects it.
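The setting keys below are real ClickHouse server settings; the profile name and the DDL-building helper are placeholders for however you manage database roles. A sketch of the limits expressed as a settings profile:

```python
# Limits from the table above, as ClickHouse setting names (these are real
# ClickHouse settings; "llm_agent" and the helper are placeholders).
LLM_ROLE_SETTINGS = {
    "max_execution_time": 30,            # 30 s query timeout
    "max_memory_usage": 2 * 1024**3,     # 2 GB per query
    "max_rows_to_read": 100_000_000,     # 100 M row scan cap
    "max_bytes_to_read": 5 * 1024**3,    # 5 GB byte scan cap
    "max_threads": 4,                    # 4 CPUs
    "readonly": 1,                       # SELECT-only
}

def settings_profile_ddl(profile: str, settings: dict) -> str:
    # Render the limits as a CREATE SETTINGS PROFILE statement
    pairs = ", ".join(f"{k} = {v}" for k, v in settings.items())
    return f"CREATE SETTINGS PROFILE {profile} SETTINGS {pairs}"
```

Pair the profile with a `GRANT SELECT ON marts.*` to the LLM role so the access restriction is enforced by the database, not the prompt.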
LLM Data Safety
- Self-hosted Qwen 2.5 — no data sent to external APIs
- Business glossary contains schema mappings, not actual data
- LLM sees query results only — no direct database access
- All interactions logged and auditable via Langfuse
Compliance
- Data residency: ClickHouse deployed in the same region as production OLTP databases
- Retention: Configurable TTL per table
- Right to erasure: CDC propagates DELETE events from PostgreSQL
- Access logs: Full audit trail via Langfuse
16. High Availability Architecture
ClickHouse Cluster (Production Region)
+-------------------+
| Load Balancer |
+---+----------+----+
| |
+--------v--+ +---v--------+
| CH Node 1 | | CH Node 2 |
| (Shard 1 | | (Shard 1 |
| Replica 1)| | Replica 2)|
+------------+ +------------+
ClickHouse Keeper (3 nodes for consensus)
Deployment Options
| Option | Description | Est. Monthly Cost |
|---|---|---|
| ClickHouse Cloud | Fully managed, auto-scaling | ~$2,000-5,000/mo |
| Self-hosted on EC2 | 2x m7g.2xlarge (Graviton3) | ~$1,500-3,000/mo |
Recommendation: Start with ClickHouse Cloud for faster deployment. Migrate to self-hosted if cost optimization is needed later.
17. Implementation Roadmap
Phase 1 — Data Foundation (Weeks 1-3)
| Task | Details |
|---|---|
| Deploy ClickHouse | ClickHouse Cloud or 2-node self-hosted cluster |
| Identify all source tables | Catalog tables from all PostgreSQL databases |
| Set up CDC pipelines | Debezium + Kafka for each source database |
| Design medallion schema | Raw → Staging → Marts table definitions |
| Initial data load | Backfill historical data from all PostgreSQL databases |
| Validation | Row counts and checksum verification |
Phase 2 — Marts + LLM (Weeks 4-6)
| Task | Details |
|---|---|
| Build marts tables | Curated, aggregated, AI-ready tables |
| Materialized views | Volume, P&L, liquidations, candlesticks, revenue |
| Deploy Qwen 2.5 72B | Self-hosted on GPU instance with vLLM |
| MCP server setup | Connect ClickHouse MCP to Qwen |
| Business glossary v1 | 50-100 domain-specific terms |
| Test SQL generation | Validate against known queries |
Phase 3 — Chat UI + Observability (Weeks 7-9)
| Task | Details |
|---|---|
| Deploy LibreChat | Self-hosted, SSO integration |
| Connect to Qwen backend | MCP plugin configuration |
| Langfuse setup | LLM tracing and accuracy monitoring |
| Role-based access | Team-specific table permissions |
| Beta launch | Trading Ops team as first users |
| Feedback loop | Review Langfuse failures, expand glossary |
Phase 4 — Production Hardening (Weeks 10-12)
| Task | Details |
|---|---|
| Glossary expansion | Target 200+ terms based on real usage |
| Accuracy tuning | Fix common failure patterns from Langfuse |
| Performance tuning | Query caching, materialized view optimization |
| Security audit | Access review, penetration testing |
| GA launch | Roll out to all teams |
Phase 5 — Advanced (Post Week 12)
- Scheduled reports (daily P&L email, weekly risk summary)
- Alert-driven queries (agent triggered by anomaly detection)
- Multi-step analysis (agent chains multiple queries)
- Fine-tune Qwen on the crypto exchange query patterns for higher accuracy
- Add more source databases as needed
18. Cost Analysis
Monthly Infrastructure Costs
| Component | Est. Monthly Cost |
|---|---|
| ClickHouse Cloud | $3,000-5,000 |
| Kafka/CDC (Amazon MSK) | $500 |
| GPU for Qwen 2.5 (g5.12xlarge) | $7,000 (1-yr reserved) / $11,700 (on-demand) |
| LibreChat (t3.medium) | $100 |
| Langfuse | Included in ClickHouse |
| Total | ~$11,000-14,000 (reserved) / ~$15,000-19,000 (on-demand) |
Cost Optimization Path
| Optimization | Savings |
|---|---|
| Self-hosted ClickHouse instead of Cloud | -$1,500-3,000/mo |
| Qwen 2.5 32B on smaller GPU | -$2,000/mo |
| Reserved instances (1-year) | -30% on GPU |
| Spot instances for dev/test | -60% |
ROI Justification
| Benefit | Value |
|---|---|
| Engineering time saved (ad-hoc queries) | 20-40 hours/month |
| Faster incident response | 50% reduction in MTTR |
| Production DB load reduction | Fewer analytical queries on live trading DB |
| Self-service analytics | Teams don't need SQL knowledge |
| Cross-database insights | Previously impossible queries now instant |
| Compliance efficiency | Instant audit queries vs manual extraction |
19. Risk Assessment
| Risk | Likelihood | Impact | Mitigation |
|---|---|---|---|
| LLM generates incorrect SQL | Medium | Medium | Business glossary, Langfuse monitoring, human review |
| CDC lag (stale data) | Low | Medium | Monitor Kafka consumer lag, alert if >5 min |
| ClickHouse cluster failure | Low | High | 2-replica HA, automated failover, daily backups |
| GPU instance unavailable | Low | Medium | Reserved capacity, fallback to smaller model |
| Low user adoption | Medium | Medium | Start with power users, iterate on feedback |
| Schema drift (PG changes not in CH) | Low | Medium | Automated schema sync in CDC pipeline |
| Wrong answers not caught | Medium | High | Langfuse accuracy tracking, glossary reviews |
| PostgreSQL replication slot bloat | Medium | High | If Debezium falls behind, PG WAL accumulates indefinitely and can crash production DB — monitor slot lag, set max_slot_wal_keep_size, alert if lag > 10GB |
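The slot-lag check uses real PostgreSQL catalog functions (`pg_replication_slots`, `pg_current_wal_lsn`, `pg_wal_lsn_diff`); the 10GB threshold mirrors the mitigation in the table, and the alerting helper is a sketch:

```python
# Run against the source PostgreSQL to measure how far each replication
# slot (e.g. Debezium's) has fallen behind the current WAL position.
SLOT_LAG_SQL = """
SELECT slot_name,
       pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) AS lag_bytes
FROM pg_replication_slots
"""

ALERT_THRESHOLD_BYTES = 10 * 1024**3  # 10 GB, matching the mitigation above

def should_alert(lag_bytes: int) -> bool:
    # Page before max_slot_wal_keep_size forces the slot to be invalidated
    return lag_bytes > ALERT_THRESHOLD_BYTES
```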
20. Success Metrics
90-Day Targets
| Metric | Target | Measurement |
|---|---|---|
| Query accuracy | >70% correct SQL | Langfuse evaluation |
| Active users | 20+ weekly | LibreChat usage logs |
| Query volume | 500+ queries/week | Langfuse trace count |
| Response time | <10 seconds | Langfuse latency p50 |
| Source databases consolidated | 4+ databases in ClickHouse | Table count |
| Business glossary | 200+ terms | Glossary config |
| Production DB offload | 50%+ analytical queries moved | PG query log |
6-Month Targets
| Metric | Target |
|---|---|
| Query accuracy | >85% |
| Active users | 50+ |
| Self-service resolution | 70% of questions answered without engineering |
| Automated reports | 10+ scheduled reports |
21. Reference Links
ClickHouse Agentic Data Stack
| Resource | URL |
|---|---|
| Agentic Data Stack GitHub repo | https://github.com/ClickHouse/agentic-data-stack |
| MCP Server GitHub repo | https://github.com/ClickHouse/mcp-clickhouse |
| AgentHouse live demo | https://llm.clickhouse.com |
ClickHouse Blog Posts
ClickHouse Documentation
| Doc | URL |
|---|---|
| MCP setup guide | https://clickhouse.com/docs/use-cases/AI/MCP |
| Langfuse integration | https://clickhouse.com/docs/cloud/features/ai-ml/langfuse |
Appendix A — Technology Stack Summary
| Component | Technology | Purpose |
|---|---|---|
| Analytical DB | ClickHouse | All the crypto exchange data consolidated, columnar storage, fast aggregations |
| CDC | Debezium + Kafka | Real-time sync from all PostgreSQL databases |
| LLM | Qwen 2.5 72B (self-hosted) | Natural language to SQL generation |
| MCP Server | ClickHouse MCP | Schema discovery + safe SQL execution |
| Chat UI | LibreChat (self-hosted) | Web-based chat interface for all teams |
| Observability | Langfuse | LLM tracing, accuracy tracking, cost monitoring |
| Source of Truth | PostgreSQL databases | Production OLTP (unchanged) |
Appendix B — DWAINE Reference (ClickHouse Internal Agent)
ClickHouse built DWAINE (Data Warehouse AI Natural Expert) for their own internal use:
| Metric | Value |
|---|---|
| Users | 250+ employees |
| Daily messages | 200+ |
| Daily conversations | 50-70 |
| Internal analytics use cases covered | ~70% |
| DWH team workload reduction | 50-70% |
| Stack | LibreChat + Claude (Bedrock) + ClickHouse MCP |
| Key learning | Business glossary is the #1 factor for accuracy |
| Source | ClickHouse — How we made our data warehouse AI-first |
Our platform follows the same architecture, with Qwen replacing Claude for data privacy.
Closing thoughts — the parts that actually matter
After shipping this platform, the things that determine success are not the parts you'd expect:
The business glossary is everything. ClickHouse's internal post-mortem on DWAINE concluded the same: glossary quality is the #1 driver of query accuracy. You don't fine-tune the LLM — you write a YAML file that maps domain terms to SQL fragments, and the LLM uses it via in-context learning. Budget more time for the glossary than for the LLM.
The Marts layer is non-negotiable. Exposing the LLM to staging or raw data is how you get hallucinated SQL. Marts are pre-joined, denormalized, semantically clean, and explicitly access-controlled. The LLM should only ever see Marts.
JOIN avoidance dominates query performance. The 3-strategy approach — pre-joined marts, dictionaries (dictGet), UNION ALL — handles ~90% of cross-table questions without a single runtime JOIN. The other 10% gets ClickHouse's auto-algorithm fallback. This was the single biggest architectural decision.
Langfuse observability is what makes the platform improvable over time. Without per-query traces, accuracy scores, and cost-per-question, you have no idea where to invest. With it, you can see exactly which question patterns fail and feed that back into the glossary.
The PostgreSQL replication slot bloat risk is the one most teams forget. If Debezium falls behind, PostgreSQL accumulates WAL indefinitely and can crash production. Set max_slot_wal_keep_size and alert on slot lag. Don't learn this the hard way.
Security guardrails matter more than people think. A 30-second query timeout, 2GB memory cap, 100M-row scan limit, and SELECT-only access on the Marts schema together mean the LLM cannot accidentally take down the cluster. ClickHouse's official agentic-analytics setup guide prescribes these exact values for a reason.
The full implementation — including the ClickHouse AI DBA MCP server (152 tools for cluster operations) referenced in this guide, knowledge base with 47 production-tested rules, and 16 recovery runbooks — lives in the companion repo on GitHub.
If you've shipped something similar (or are building it now), I'd love to compare notes. Drop a comment below or reach out on LinkedIn.
The companion piece — Replacing Elasticsearch with ClickHouse for OTEL Logs, Traces & Metrics: A 90% Cost-Reduction Migration — covers the observability layer of this same architecture in more depth.