RAKESH THERANI

Four LLM Engines, One ClickHouse Cluster: An Agentic AI Architecture

We are building an agentic AI analytics platform for a crypto exchange where internal teams — Trading Ops, Risk, Compliance, Finance, Treasury, Product, Engineering — ask questions in plain English and get audited, citation-enforced answers.

It's built on five open-source components:

  • ClickHouse — data + vector + observability storage
  • Qwen 2.5 72B — self-hosted LLM via vLLM
  • Anthropic MCP — zero-trust tool layer
  • LibreChat — chat UI (acquired by ClickHouse, November 2025)
  • Langfuse — LLM observability (acquired by ClickHouse, January 2026)

The defining property is four execution engines on shared infrastructure, each tuned for a different question shape:

| Engine | Typical signals (semantic, not keyword) | What it does | Latency |
|---|---|---|---|
| Analytics | Data retrieval, aggregations, single-domain "what is" patterns | NL → SQL on pre-joined marts | <2s |
| Numeric Research | Match to one of 20 catalogued fitted-model patterns (regimes, probabilities, elasticities, etc.) | Templated lookup against cached model output | <1s |
| Optimization | Decision-making with constraints + objective + trade-off ("what should we do") | NL → structured math model → solver → reviewable plan | 30s–5min |
| Deep Research | Multi-source explanation requiring decomposition + reasoning ("why did this happen") | Planner → workers → critic → synthesizer loop | 60–90s |

Plus four continuous self-learning loops, a four-layer trust defense, and a 24-week build timeline. The rest of this post walks through each piece with concrete examples.


Why This Exists

The problem in current operations

  • Analyst bottleneck. Risk, Compliance, Finance teams wait days for ad-hoc SQL. Analytics is a ticket queue, not a self-service capability.
  • Investigations are slow. "Why did revenue drop 12%?" takes 2 days of analyst time and ships a 30-slide deck — often wrong on a key dimension that emerges weeks later.
  • Optimization is spreadsheet-based. Insurance fund allocation, fee tier choices, wallet rebalancing — done in Excel because actual OR tools require OR specialists.
  • Compliance is reactive. SAR drafts written from scratch each time. Wash trading detection runs only when someone happens to look. Regulator inquiries take a week.
  • No audit trail for ad-hoc analysis. When numbers ship to the CFO or regulator, you can't reproduce them six months later.

The fix

  • Self-service NL for non-engineers. Data team stops being a ticket queue.
  • Deep Research engine that produces cited investigations in 60–90 seconds vs analyst-hours.
  • Optimization layer that takes business problems in English, formulates math models, runs solvers, returns reviewable plans for human approval.
  • Citation enforcement so every claim traces back to specific SQL queries with parameter bindings — auditable, reproducible.
  • Self-learning loops so the platform gets better weekly without engineering intervention.

End-to-End Flow

┌─────────────────────────────────────────────────────────────────┐
│  USER (1 of 7 teams)                                            │
│  Types: "Why did Q2 fee revenue drop 12% vs Q1?"                │
└──────────────────────┬──────────────────────────────────────────┘
                       ▼
┌─────────────────────────────────────────────────────────────────┐
│  LIBRECHAT                                                      │
│  • SSO authentication; team role attached to session            │
│  • Session memory in MongoDB                                    │
│  • 128K context window for conversation history                 │
│  • Streams response back as it's generated                      │
└──────────────────────┬──────────────────────────────────────────┘
                       ▼
┌─────────────────────────────────────────────────────────────────┐
│  TRIAGE AGENT (Qwen 72B)                                        │
│  Semantic classification + structured output:                   │
│    {engine, confidence, case_id, extracted_params, reasoning}   │
│  Not keyword matching — full LLM intent inference               │
└──────────────────────┬──────────────────────────────────────────┘
                       ▼
┌──────────────────────┴──────────────────────────────────────────┐
│            ROUTED TO ONE OF FOUR EXECUTION ENGINES              │
└─────────────────────────────────────────────────────────────────┘
       │              │              │             │
       ▼              ▼              ▼             ▼
   ANALYTICS    NR LOOKUP    OPTIMIZATION   DEEP RESEARCH
       │            │              │             │
       ▼            ▼              ▼             ▼
  LLM writes   Templated     LLM emits      Planner →
  SQL grounded SELECT        structured     Workers →
  by schema +  from cache    JSON model     Critic →
  glossary     marts.numeric → template    Synthesizer
               _research_    generates      (5 rounds max)
               runs          CVXPY/CP-SAT
                             → solver
       │            │              │             │
       └────────────┴──────────────┴─────────────┘
                       │
                       ▼
       ┌─────────────────────────────────────────────┐
       │  MCP LAYER                                  │
       │  • ClickHouse MCP — read-only SELECT only   │
       │  • Solver MCP — solve_lp/mip/cp/convex      │
       │  • Numeric Research MCP — run_nr            │
       │  • Vector search MCP — similarity lookups   │
       │  • All tools allowlisted per agent          │
       │  • Bearer-token auth                        │
       └────────────────┬────────────────────────────┘
                        ▼
       ┌─────────────────────────────────────────────┐
       │  CLICKHOUSE                                 │
       │  • Raw → Staging → Marts (medallion)        │
       │  • HNSW vector indexes (25.8+ GA)           │
       │  • Refreshable MVs auto-update marts        │
       │  • Persistence tables:                      │
       │    – marts.research_runs (Deep Research)    │
       │    – marts.research_evidence (citations)    │
       │    – marts.optimization_runs (plans)        │
       │    – marts.numeric_research_runs (NR cache) │
       │    – marts.langfuse_flat (agent traces)     │
       └────────────────┬────────────────────────────┘
                        ▼
       ┌─────────────────────────────────────────────┐
       │  SYNTHESIZER (LLM)                          │
       │  • Renders prose answer from evidence       │
       │  • Citation regex enforcement:              │
       │    every claim must have [ev-N]             │
       │    → 3 retries; fallback to LOW confidence  │
       │  • Confidence label (HIGH/MEDIUM/LOW)       │
       │  • Cross-engine handoffs (DR → Opt for fix) │
       └────────────────┬────────────────────────────┘
                        ▼
       ┌─────────────────────────────────────────────┐
       │  LIBRECHAT renders answer to user           │
       │  Every step traced in Langfuse              │
       │  Trace ID returned for audit replay         │
       └─────────────────────────────────────────────┘
                        ▼
       ┌─────────────────────────────────────────────┐
       │  SELF-LEARNING LOOPS fire continuously      │
       │  • Glossary expansion (real-time + nightly) │
       │  • Mart recommendations (nightly)           │
       │  • LoRA fine-tuning (weekly)                │
       │  • Eval-set regression (weekly)             │
       └─────────────────────────────────────────────┘

Triage — How the Agent Chooses an Engine

This is the most important decision in the platform. Get it wrong and everything downstream is wrong.

It's semantic classification, not keyword matching

The triage agent is Qwen 72B doing intent inference — the LLM reads the full question and reasons about what the user is actually trying to do. The verb cues are hints in the system prompt, not the routing rule. A user asking "show me which markets to consolidate" routes to Optimization (not Analytics, despite the "show" verb) because the LLM infers that "which markets to consolidate" is a selection-with-objective decision. The classifier emits a structured JSON output with engine choice, confidence score, extracted parameters, and a one-sentence reasoning string.

What the triage prompt actually contains

You classify user questions into one of four engines.

ENGINES:

1. ANALYTICS  answers "what is happening" questions with a single SQL
   query against pre-joined marts. Best for: aggregations, counts,
   listings, filters over single domains.
   Sample questions:
     - "Top 10 markets by 24h volume"
     - "All withdrawals over $50K last week"
     - "Average fill time on BTC-PERP today"

2. NUMERIC_RESEARCH  looks up cached statistical models. Use when the
   question matches one of 20 catalogued NR cases (regimes, elasticity,
   probabilities, transition matrices, etc.). Lookup is sub-second;
   models are fitted weekly offline.
   Sample questions matched to NR cases:
     - "What market regime is BTC in?"           NR-01
     - "Probability IF covers a 15% shock?"      NR-04
     - "Forecast VIP tier distribution"          NR-08
     - "Maker fee elasticity for tier 2"         NR-07
     - "Almgren-Chriss impact on BTC-PERP"       NR-19
     [...20 cases total...]

3. OPTIMIZATION  formulates a math program and runs a solver.
   Best for: allocation, scheduling, sequencing, pricing optimization,
   trade-off problems where the user wants the *best* answer.
   Sample questions:
     - "Allocate $50M insurance fund across markets"
     - "Sequence today's liquidations to minimize impact"
     - "Choose maker fees to maximize revenue with 5% retention cap"

4. DEEP_RESEARCH  multi-step planner → workers → critic → synthesizer
   loop. Best for: "why" investigations, root-cause analyses,
   multi-source evidence gathering. Takes 60-90 seconds.
   Sample questions:
     - "Why did Q2 fee revenue drop 12%?"
     - "Investigate user X for wash trading"
     - "Liquidation cascade post-mortem"

RETURN STRUCTURED JSON:
{
  "engine": "ANALYTICS | NUMERIC_RESEARCH | OPTIMIZATION | DEEP_RESEARCH",
  "confidence": 0.0-1.0,
  "case_id": "<NR-XX if Numeric Research, else null>",
  "extracted_params": { ... },
  "fallback_engine": "<second-best engine>",
  "reasoning": "one-sentence justification"
}

The LLM does semantic matching against the engine descriptions and sample questions, then emits a confidence score. Keywords are mnemonics — not the routing logic.
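
Downstream code treats that JSON as a typed contract. A minimal sketch of how the structured output could be validated before routing (assuming Pydantic v2; field names mirror the prompt above):

from enum import Enum
from typing import Optional
from pydantic import BaseModel, Field

class Engine(str, Enum):
    ANALYTICS = "ANALYTICS"
    NUMERIC_RESEARCH = "NUMERIC_RESEARCH"
    OPTIMIZATION = "OPTIMIZATION"
    DEEP_RESEARCH = "DEEP_RESEARCH"

class TriageResult(BaseModel):
    engine: Engine
    confidence: float = Field(ge=0.0, le=1.0)
    case_id: Optional[str] = None        # "NR-XX" only when engine is NUMERIC_RESEARCH
    extracted_params: dict = Field(default_factory=dict)
    fallback_engine: Engine
    reasoning: str

def parse_triage(raw_json: str) -> TriageResult:
    # A validation failure is fed back to the LLM for one structured-output retry.
    return TriageResult.model_validate_json(raw_json)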

Real triage output for ambiguous cases

User: "Show me which 5 markets to consolidate this quarter"

{
  "engine": "OPTIMIZATION",
  "confidence": 0.84,
  "case_id": null,
  "extracted_params": { "n_markets": 5, "time_window": "quarter" },
  "fallback_engine": "ANALYTICS",
  "reasoning": "Despite 'show me' verb, 'which markets to consolidate' is a selection-with-objective decision — handled by optimization solver, not a SQL query."
}

User: "Why is marts.mart_user_trading_activity slow today?"

{
  "engine": "ANALYTICS",
  "confidence": 0.91,
  "case_id": null,
  "extracted_params": { "mart": "mart_user_trading_activity" },
  "fallback_engine": "DEEP_RESEARCH",
  "reasoning": "Despite 'why' verb, this is a single-target query latency question answerable from system.query_log — engineering analytics, not investigation."
}

When confidence is low

If confidence < 0.7, triage asks a one-sentence clarifier:

"I can interpret this two ways. Would you like a quick aggregation against current data (Analytics) or a deeper investigation tracing root causes (Deep Research)?"

User picks → routed. The question + chosen engine are logged → a self-learning loop may add a sample question to the triage prompt for better future disambiguation.


The Four Engines (In Detail)

ENGINE 1 — ANALYTICS (~80% of all questions)

The 80%-case engine. Most user questions are "show me X" or "count Y across Z" — those get answered here.

How the agent decides:

  1. Triage classifies as Analytics based on semantic intent (not just verbs)
  2. Schema injection — system prompt includes live list_tables() output + the top-5 marts retrieved by semantic similarity to the user question
  3. Glossary injection — relevant glossary terms loaded (liquidation = order_type IN (1003, 1004), whale = users WHERE 30d_volume > $10M, etc.)
  4. LLM writes SQL grounded by the schema + glossary
  5. Pre-execution validator (code, not LLM) checks SQL is SELECT-only, references existing tables, has LIMIT
  6. MCP executes against llm_role (read-only, mart-only access)
  7. Synthesizer renders as formatted table or summary

Why this is reliable: Pre-joined marts mean the LLM rarely writes JOINs. Most questions resolve to picking a single mart and filtering; the occasional join is a simple key lookup between two marts (as in example 1.1 below). The #1 NL→SQL failure mode, reconstructing deep multi-table JOINs over raw tables, is eliminated by design.
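
For illustration, here is a minimal sketch of the pre-execution validator from step 5 above (pure code, no LLM). The exact checks, and the fallback of appending a default LIMIT, are assumptions:

import re

def validate_sql(sql: str, known_tables: set[str], default_limit: int = 1000) -> str:
    """Reject anything that isn't a bounded, read-only SELECT over known tables."""
    stripped = sql.strip().rstrip(";")
    if not re.match(r"(?is)^\s*(SELECT|WITH)\b", stripped):
        raise ValueError("Only SELECT statements are allowed")
    if re.search(r"(?i)\b(INSERT|ALTER|DROP|DELETE|UPDATE|TRUNCATE|GRANT)\b", stripped):
        raise ValueError("Write/DDL keyword detected")
    # Every table referenced after FROM/JOIN must exist in the live schema.
    for tbl in re.findall(r"(?i)\b(?:FROM|JOIN)\s+([\w.]+)", stripped):
        if tbl not in known_tables:
            raise ValueError(f"Unknown table: {tbl}")
    # Enforce a LIMIT so a bad query can't scan unbounded data.
    if not re.search(r"(?i)\bLIMIT\s+\d+", stripped):
        stripped += f" LIMIT {default_limit}"
    return stripped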

Worked example 1.1 — Trading volume + liquidation roll-up

User (Risk Manager): "Top 10 markets by 24-hour volume with liquidation count and average liquidation size."

LLM generates:

SELECT v.market_name,
       v.volume_usd_24h,
       l.liq_count_24h,
       round(l.total_liq_volume_usd / nullIf(l.liq_count_24h, 0), 0) AS avg_liq_size_usd,
       round(100.0 * l.liq_count_24h / v.trade_count_24h, 2)         AS liq_rate_pct
FROM marts.mart_volume_daily v
LEFT JOIN marts.mart_liquidations_summary l ON v.market_name = l.market_name
WHERE v.trade_date = today()
ORDER BY v.volume_usd_24h DESC
LIMIT 10;

Output: 10-row table sorted by volume desc. Latency: ~500ms. Cost: ~$0.005.

Note the agent retrieved liquidation = order_type IN (1003, 1004) from the glossary deterministically — never guessed.

Worked example 1.2 — Compliance cross-table screening

User (Compliance Officer): "All withdrawals over $50K in the last 7 days where the user has a KYC level below 2."

LLM generates:

SELECT b.username,
       b.coin_type,
       b.amount,
       b.amount_usdt,
       b.address,
       u.kyc_level,
       u.fregistertime,
       b.create_date_time
FROM staging.balance_audit b
JOIN staging.f_user u USING (username)
WHERE b.in_out_type = 2          -- withdrawal direction (from glossary)
  AND b.amount_usdt > 50000
  AND b.create_date_time >= today() - 7
  AND u.kyc_level < 2
ORDER BY b.amount_usdt DESC;

Output: flagged-user list with KYC level. Latency: ~1s.

Worked example 1.3 — Trading bot performance breakdown

User (Product Manager): "How many trading bots are running on BTC-PERP by bot type, and what's the median ROI for each?"

LLM generates:

SELECT bot_type,
       count() AS running_bot_count,
       quantile(0.5)(roi) AS median_roi,
       round(quantile(0.5)(roi) * 100, 2) AS median_roi_pct,
       round(sum(total_investment_usdt), 0) AS total_capital_deployed
FROM marts.mart_trading_bot_performance
WHERE state = 'RUNNING' AND symbol = 'BTC-PERP'
GROUP BY bot_type
ORDER BY running_bot_count DESC;

Output: grouped table with quantile aggregation. Latency: ~800ms.

What happens when Analytics misses: If LLM-generated SQL references a non-existent column, the pre-execution validator rejects it. The agent retries with corrected schema. Repeated failures log to Langfuse → a learning loop may draft a glossary entry to handle the term differently next time.


ENGINE 2 — NUMERIC RESEARCH LOOKUP (cached fitted models)

This engine serves cached output from 20 statistical models that run offline (weekly or monthly). The lookup itself is a deterministic templated SELECT — sub-second. The actual model fitting is heavy Python (scikit-learn, statsmodels, scipy, lifelines, PyMC) done in batch jobs.

How the agent decides:

  1. Triage classifies as Numeric Research → matches one of 20 NR cases by semantic similarity to the user question (not by keyword)
  2. Templated lookup runs against marts.numeric_research_runs — the SQL skeleton is fixed; only case_id and filter params are extracted by LLM
  3. Branch on freshness:
    • Fresh (within case-specific window — varies from 24h for IF coverage to 30 days for VIP transitions) → parse predictions_json, return cached
    • Stale → call run_nr(case_id=...) MCP tool to trigger synchronous refit (~5–30s depending on case)
    • Missing (new market, new case) → honest "no model for this market yet" + log for catalogue recommendation
  4. Synthesizer renders with confidence label + last-fit timestamp
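
A sketch of the cache-vs-refit branch in step 3. The table and tool names follow the post; the client helpers (query_one, mcp.call) are assumptions standing in for the ClickHouse and MCP clients:

from datetime import datetime, timezone

def get_nr_result(case_id: str, params: dict, ch, mcp) -> dict:
    """Return cached model output; refit synchronously only when stale; be honest when missing."""
    row = ch.query_one(
        "SELECT predictions_json, created_at, next_run_at "
        "FROM marts.numeric_research_runs "
        "WHERE nr_case_id = %(case)s "
        "ORDER BY created_at DESC LIMIT 1",
        parameters={"case": case_id},
    )  # assumed helper: returns one row as a dict, or None

    if row is None:
        # New market or new case: log it so the catalogue-recommendation loop sees the gap.
        return {"status": "missing", "note": f"No fitted model for {case_id} yet"}

    if datetime.now(timezone.utc) > row["next_run_at"]:
        # Stale: trigger a synchronous refit via the Numeric Research MCP tool (~5-30s).
        return {"status": "refit", "predictions": mcp.call("run_nr", case_id=case_id, **params)}

    return {"status": "fresh", "predictions": row["predictions_json"], "fitted_at": row["created_at"]}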

Worked example 2.1 — Volatility regime (NR-01)

User (Trader): "What market regime is BTC-PERP in right now?"

Agent decision flow:

  • Triage matches NR-01 (Volatility Regime Classification) by semantic similarity to NR-01's catalogued sample question
  • Extracted params: {symbol: "BTC-PERP"}
  • Templated SELECT:
SELECT predictions_json, created_at, next_run_at
FROM marts.numeric_research_runs
WHERE nr_case_id = 'NR-01'
  AND JSONExtractString(inputs_json, 'symbol') = 'BTC-PERP'
ORDER BY created_at DESC LIMIT 1;
  • Returned predictions_json:
{
  "current_regime": 2,
  "regime_label": "high-vol bullish skew",
  "days_in_current_regime": 5,
  "avg_persistence_days": 8.2,
  "transition_probs": {
    "to_regime_0_low_vol":          0.08,
    "to_regime_1_neutral":          0.42,
    "to_regime_2_high_vol_bullish": 0.28,
    "to_regime_3_high_vol_bearish": 0.22
  },
  "silhouette_score": 0.61
}
  • Synthesizer renders: > "BTC-PERP is in regime 2 (high-vol bullish skew). Regimes persist ~8 days on average; this is day 5. Next-week transition probabilities: stay 28%, neutral 42%, bearish 22%, low-vol 8%. Last fit: 14h ago."

Latency: <1s. Cost: ~$0.001.

Worked example 2.2 — Insurance fund coverage (NR-04)

User (CRO): "What's our insurance fund coverage probability at a 15% shock?"

Agent decision flow:

  • Triage matches NR-04 (Insurance Fund Coverage Probability — Monte Carlo)
  • LLM extracts shock_pct = 0.15
  • Templated lookup returns Monte Carlo output (excerpt):
{
  "method": "monte_carlo_100k_trials",
  "current_if_usdt": 50000000,
  "results_by_shock": {
    "0.05": { "p_breach": 0.0024, "wilson_ci_95": [0.0021, 0.0027] },
    "0.10": { "p_breach": 0.0158, "wilson_ci_95": [0.0150, 0.0166] },
    "0.15": { "p_breach": 0.0792, "wilson_ci_95": [0.0775, 0.0809] },
    "0.20": { "p_breach": 0.1834, "wilson_ci_95": [0.1810, 0.1858] }
  },
  "top_tail_risk_contributors": [
    {"market": "BTC-PERP", "share_pct": 38},
    {"market": "ETH-PERP", "share_pct": 24},
    {"market": "SOL-PERP", "share_pct": 21}
  ]
}
  • Synthesizer renders: > "Insurance fund coverage probability at 15% shock: 92.1% (CI: 91.9–92.3%). At 20% shock: 81.7%. Top tail-risk contributors: BTC-PERP (38%), ETH-PERP (24%), SOL-PERP (21%). Last refit: 4h ago."

Latency: <1s. Cost: ~$0.001.

Worked example 2.3 — VIP tier transitions (NR-08)

User (CFO): "Forecast next month's VIP tier distribution. Are we losing VIP-3 users?"

Agent decision flow:

  • Triage matches NR-08 (VIP Tier Transition Probabilities — Markov chain)
  • Templated lookup returns full 7×7 transition matrix
  • Orchestrator (code, not LLM) does matrix multiplication: current distribution × transition matrix = next-month forecast
  • Synthesizer renders: > "Next-month forecast: VIP-3 will drop from 480 → 403 (-16%). Demotion rate (T3→T2) has spiked from 5.2% historical to 9.0% this month. Recommend investigating VIP-3 satisfaction. Last refit: 3 weeks ago."

Latency: <1s. Cost: ~$0.002.
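
The forecast step in that flow is a single matrix product. A sketch of what the orchestrator does with the cached matrix (shapes and numbers are illustrative, not the real NR-08 output):

import numpy as np

def forecast_next_month(current_counts, transition) -> np.ndarray:
    """current_counts: users per VIP tier today (length 7).
    transition: 7x7 row-stochastic matrix from the NR-08 predictions_json.
    Returns the expected users per tier next month."""
    return np.asarray(current_counts) @ np.asarray(transition)

# The VIP-3 drop in the answer above falls out of this product once the
# T3 -> T2 demotion probability spikes in the fitted transition matrix.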

Worked example 2.4 — Almgren-Chriss impact (NR-19) — machine-to-machine

Optimization engine (not a direct user): the liquidation sequencing solver needs per-market impact coefficients before solving.

Agent decision flow (machine-to-machine):

  • Solver agent issues templated lookup; returned per-market (α, β) for impact_bps = α × (size / depth)^β:
{
  "BTC-PERP": {"alpha": 1.24, "beta": 0.51, "r_squared": 0.78, "n_obs": 18420},
  "ETH-PERP": {"alpha": 1.55, "beta": 0.49, "r_squared": 0.74, "n_obs": 12100},
  "SOL-PERP": {"alpha": 2.18, "beta": 0.62, "r_squared": 0.66, "n_obs":  5240}
}
  • Solver uses these as priors in the CP-SAT scheduling model

Latency: <1s for the lookup. If stale, optimizer waits 5–15s for refit.

This is the NR feeding Optimization pattern. NR results aren't always user-facing; they often feed downstream engines.
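
Those coefficients plug straight into the catalogued impact form. A tiny illustration (the order size and book depth below are made up):

def impact_bps(size_usd: float, depth_usd: float, alpha: float, beta: float) -> float:
    """Expected market impact in basis points: alpha * (size / depth) ** beta."""
    return alpha * (size_usd / depth_usd) ** beta

# Illustrative only: a $5M liquidation slice against $60M of visible BTC-PERP depth
print(round(impact_bps(5_000_000, 60_000_000, alpha=1.24, beta=0.51), 2))  # ~0.35 bps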

How the 20 NR cases tile the domain

| Category | NR cases | Used by |
|---|---|---|
| Market structure | NR-01 (regimes), NR-15 (cap-hits), NR-16 (correlation) | Risk, Trading |
| Liquidity / impact | NR-03 (depth function), NR-19 (Almgren-Chriss) | Trading, Optimization |
| Funding mechanics | NR-02 (elasticity) | Risk, Finance, Optimization |
| User behavior | NR-06 (LTV survival), NR-08 (VIP transitions), NR-12 (retention curves) | Product, Finance, Growth |
| Compliance / AML | NR-05 (wash trade prevalence), NR-10 (withdrawal z-score), NR-11 (alert P/R) | Compliance |
| Risk modeling | NR-04 (IF coverage MC), NR-09 (HHI concentration) | Risk |
| Revenue / pricing | NR-07 (rebate elasticity), NR-14 (Gini concentration), NR-17 (affiliate payback) | Finance, Growth, Optimization |
| Options | NR-18 (Greeks aggregation) | Risk, Trading |
| Internal product | NR-20 (PT settlement KS-test) | Product, Risk |

Adding a new case is a configuration exercise: catalogue entry + fit script + cadence + sample question. The platform handles the rest.


ENGINE 3 — OPTIMIZATION (NL → math model → solver → reviewable plan)

The decision-making engine. Where Analytics answers "what is X?", Optimization answers "what should X be?" — formulating a math program from a business question, running it through a solver, returning a reviewable plan with assumptions, sensitivity, and alternatives.

How the agent decides:

  1. Triage classifies as Optimization based on intent — decision-making with constraints + objective, regardless of verb cues
  2. LLM emits structured JSON describing the math program (decision variables with bounds, objective with sense, constraints, parameters). The LLM does not write Python — it emits structured data.
  3. Deterministic Python template converts JSON → CVXPY / OR-Tools / Pyomo / HiGHS code
  4. Sandbox executes the model (docker + nsjail + cgroups; no network egress; 60s wall-clock; 2GB cap)
  5. Solver runs — backend chosen by model_type field:
    • LP → HiGHS
    • MIP → CBC, SCIP, optionally Gurobi or CPLEX
    • CP → OR-Tools CP-SAT
    • Convex (SOCP / SDP / NLP) → CVXPY with CLARABEL (default), ECOS / SCS fallback
  6. Plan persisted to marts.optimization_runs with full audit trail
  7. Synthesizer renders plan with binding constraints, sensitivity, alternative formulations

Critical: the platform produces a plan, not an action. Human approves before any operational effect. By design — never autonomous.

Worked example 3.1 — Insurance fund allocation (Convex, CVXPY)

User (CRO): "Allocate the $50M insurance fund across BTC/ETH/SOL to minimize worst-case 30-day loss with 5% per-market floor."

Structured JSON emitted by LLM:

{
  "model_type": "convex",
  "decision_variables": [
    {"name": "x_btc", "type": "continuous", "lb": 0},
    {"name": "x_eth", "type": "continuous", "lb": 0},
    {"name": "x_sol", "type": "continuous", "lb": 0}
  ],
  "parameters": {
    "cvar": [0.18, 0.31, 0.42],
    "budget":   50000000,
    "min_pct":  0.05,
    "max_pct":  0.40
  },
  "objective": {
    "sense": "minimize",
    "expression": "cvar_btc*x_btc + cvar_eth*x_eth + cvar_sol*x_sol"
  },
  "constraints": [
    "x_btc + x_eth + x_sol == budget",
    "x_i >= min_pct * budget",
    "x_i <= max_pct * budget"
  ]
}

Template generates CVXPY code; CLARABEL solver runs in sandbox. Plan output:

Plan opt-20260512-001 — Insurance Fund Allocation
Status: optimal · Solver: CVXPY/CLARABEL · Solve time: 47ms

Decision:
  BTC market: $20.0M (40% — at ceiling)
  ETH market: $20.0M (40% — at ceiling)
  SOL market: $10.0M (20%)

Weighted CVaR contribution: $14.0M
  (= 0.18×$20M + 0.31×$20M + 0.42×$10M)

Binding constraints: budget, BTC ceiling, ETH ceiling

Sensitivity:
  Cap relaxed to 50%  → loss drops to $12.5M (-10.5%)
  SOL drawdown ±10%   → allocation unchanged (cap binds)

Alternatives considered:
  Equal-weight ($16.7M each): $15.2M (+8.4%)
  Min-VaR (not CVaR):         $14.9M (+6.1%)

Assumptions:
  - 90-day drawdown distributions (sourced from NR-04)
  - Per-market floor 5% / ceiling 40% (from glossary)

Want to: relax the ceiling? change the lookback? add markets?

Latency: ~3s. Cost: ~$0.05. Decision human-approved before any IF movement.
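
To make the JSON → code step concrete, the deterministic template would expand that structured model into roughly the following CVXPY program (a sketch of the generated code, not of the template itself):

import cvxpy as cp

budget, min_pct, max_pct = 50_000_000, 0.05, 0.40
cvar = {"btc": 0.18, "eth": 0.31, "sol": 0.42}          # from "parameters"

x = {m: cp.Variable(nonneg=True, name=f"x_{m}") for m in cvar}

objective = cp.Minimize(sum(cvar[m] * x[m] for m in cvar))
constraints = [sum(x.values()) == budget]
constraints += [x[m] >= min_pct * budget for m in cvar]
constraints += [x[m] <= max_pct * budget for m in cvar]

problem = cp.Problem(objective, constraints)
problem.solve(solver=cp.CLARABEL)
print(problem.status, {m: round(v.value) for m, v in x.items()})

Solving it reproduces the plan above: BTC and ETH pinned at the 40% ceiling ($20M each), the remaining $10M in SOL, weighted CVaR contribution $14.0M.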

Worked example 3.2 — Compliance case routing (CP-SAT, OR-Tools)

User (Head of Compliance): "Assign today's 25 open cases (12 SAR drafts, 8 KYC reviews, 5 transaction audits) to the 8 investigators on shift. Minimize SLA breach risk. SAR drafts require senior investigators."

Math type: Mixed-integer assignment problem. Decision variable x[i,j] ∈ {0,1} = "case i assigned to investigator j". Objective: minimize weighted SLA breach risk. Constraints: each case to exactly one investigator; per-investigator capacity in hours; senior-investigator filter for SAR drafts.

Plan output:

Plan opt-20260512-002 — Compliance Case Routing
Status: optimal · Solver: OR-Tools CP-SAT · Solve time: 1.8s

Assignments (25 cases → 8 investigators):
  alice  : 3 cases  (1 SAR, 2 KYC)            load 7.0 / 8.0 hrs
  bob    : 3 cases  (1 SAR, 1 audit, 1 KYC)   load 7.0 / 8.0 hrs
  carol  : 3 cases  (1 SAR, 2 KYC)            load 7.0 / 8.0 hrs
  dan    : 3 cases  (1 SAR, 1 audit, 1 KYC)   load 7.0 / 8.0 hrs
  eve    : 3 cases  (2 KYC, 1 audit)          load 5.0 / 8.0 hrs   [senior, slack]
  frank  : 3 cases  (1 SAR, 2 KYC)            load 7.0 / 8.0 hrs
  grace  : 3 cases  (1 SAR, 2 KYC)            load 7.0 / 8.0 hrs
  henry  : 4 cases  (1 SAR, 2 audit, 1 KYC)   load 8.0 / 8.0 hrs

Total expected SLA breach risk score: 0.42  (vs 1.18 for FIFO)

Binding constraints:
  - Senior investigators (4) — fully loaded with SAR drafts
  - henry hours_available — at capacity

Risks flagged:
  - 1 SAR is unassignable today (no senior capacity)
    → Recommend: escalate to overflow team OR delay 1 lower-priority audit

Want to: see what changes with +1 senior? change SLA hours?

Latency: ~2s. Cost: ~$0.04.

Worked example 3.3 — Maker fee tier optimization (MIP with elasticity)

User (CFO): "Choose maker fee values for tiers 1–5 to maximize monthly fee revenue, subject to ≤5% volume retention loss. Use last 90 days as baseline."

Math type: MIP with discrete fee choices (0.5bp increments) and piecewise-linear elasticity from NR-07. Five integer decision variables. Constraint: total volume ≥ 95% of baseline.

Plan output:

Plan opt-20260512-003 — Maker Fee Tier Optimization
Status: optimal · Solver: HiGHS MIP · Solve time: 230ms

Recommended fees:
  Tier 1 (≤$10M/mo):     +2.0 bps  (current +2.5)
  Tier 2 ($10–50M):      +1.0 bps  (current +1.5)
  Tier 3 ($50–250M):      0.0 bps  (current +0.5)
  Tier 4 ($250M–1B):     -1.5 bps  (current -1.0)
  Tier 5 (>$1B):         -3.5 bps  (current -3.0)

Projected 30-day revenue:  $9.4M   (current $8.5M, +10.6%)
Projected 30-day volume:  $40.4B   (current $42.3B, -4.5%)

Volume loss 4.5% — within 5% glossary cap.

Binding constraints: max_volume_drop_pct = 0.05

Sensitivity:
  Elasticity 1.5× current → revenue gain drops to +6.2%
  Elasticity 0.5× current → revenue gain rises to +14.8%
  → Plan moderately sensitive to elasticity — recommend A/B test before rollout

Assumptions:
  - Linear elasticity 1.2% volume / bp (from glossary NR-07)
  - 90-day baseline volumes
  - Tier boundaries unchanged

Want to: vary elasticity? cap at 3%? add a 6th tier?

Latency: ~1s. Cost: ~$0.04.

What makes Optimization safe:

  1. LLM never writes Python — emits structured JSON; fixed template generates solver code. Successful prompt injection can't escape template structure.
  2. Sandbox containment — code runs with no network, 60s timeout, 2GB cap, ephemeral container.
  3. Reviewable plan not action — every plan ships with assumptions, binding constraints, sensitivity, alternatives. Human approves before any operational effect.

ENGINE 4 — DEEP RESEARCH (multi-step planner / workers / critic / synthesizer)

The highest-stakes engine. Used when a question requires reasoning across multiple data sources, where there's no fixed catalogue match, and where the answer needs a structured investigation report with citations.

How the agent decides:

  1. Triage classifies as Deep Research based on intent (multi-source explanation requiring decomposition + reasoning) — not just on "why" keyword
  2. Planner (LLM) decomposes the question into 7–15 sub-questions, each tagged with a tool hint (SQL / vector search / past plans / web fetch / solver)
  3. Worker pool (LLM-driven, code-coordinated) executes sub-questions in parallel via asyncio.Semaphore(6). Each worker:
    • Generates SQL or vector query
    • Executes via MCP
    • Returns evidence record tagged [ev-N]
  4. Critic (LLM) reads evidence ledger. Decides "sufficient" or "investigate further" (spawn 1–8 more sub-questions). Bounded to 5 rounds max.
  5. Synthesizer (LLM) produces structured report. Every claim must carry [ev-N]. Regex check + 3-retry fallback enforces this.
  6. Persistence — report → marts.research_runs; per-evidence record → marts.research_evidence. Both indexed by embedding for future similar-question retrieval (auto-memory).
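
A sketch of the worker-pool coordination from step 3. The concurrency control is plain asyncio; run_worker stands in for the per-sub-question LLM + MCP call and is an assumption:

import asyncio

MAX_CONCURRENT_WORKERS = 6
semaphore = asyncio.Semaphore(MAX_CONCURRENT_WORKERS)

async def execute_subquestion(sq: dict) -> dict:
    async with semaphore:  # at most 6 workers hit the LLM / MCP layer at once
        # Assumed: LLM generates the SQL or vector query, MCP executes it.
        finding = await run_worker(sq["question"], sq["tool_hint"])
        return {"sq_id": sq["id"], "evidence_tag": f"[ev-{sq['id'].split('-')[1]}]", **finding}

async def run_round(subquestions: list[dict]) -> list[dict]:
    # Fan out the whole round; the results form the evidence ledger the critic reads.
    return await asyncio.gather(*(execute_subquestion(sq) for sq in subquestions))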

Worked example 4.1 — Q2 fee revenue variance attribution

User (CFO): "Why did Q2 fee revenue drop 12% vs Q1?"

Planner output (7 initial sub-questions):

| sq | Tool | Question |
|---|---|---|
| sq-1 | SQL | Volume × rate decomposition Q1 vs Q2 |
| sq-2 | SQL | Drop by product (spot / perp / options) |
| sq-3 | SQL | Drop by region |
| sq-4 | SQL | Drop by VIP tier |
| sq-5 | Past plans | Was there a fee schedule change? |
| sq-6 | Vector search | Market regime shift Q2 vs Q1 |
| sq-7 | Web fetch | External regulatory events |

Workers execute in parallel (~30s):

| sq | Finding | Confidence |
|---|---|---|
| sq-1 | Vol Q1 $130B, Q2 $118B (−9.2%); rate 1.05 → 1.03 bps (−1.9%) | HIGH |
| sq-2 | Perpetuals −15%; spot flat; options +8% | HIGH |
| sq-3 | APAC −18%; EU −2%; Americas +3% | HIGH |
| sq-4 | VIP 4-5 tiers dropped most (−22%); retail flat | HIGH |
| sq-5 | Maker rebate widened 0.5 bp on Apr 14 (decision opt-20260414-007) | HIGH |
| sq-6 | Q2 had 35 low-vol days vs Q1's 12 (cosine 0.91 to 2024-Q3) | MEDIUM |
| sq-7 | No major regulatory events in Q2 | LOW |

Critic round 1: "VIP-tier × APAC intersection is biggest single factor (−22% in tier 4-5, −18% APAC). Need user-level breakdown to find concentration."

Spawn 2 more sub-questions:

| sq | Tool | Finding |
|---|---|---|
| sq-8 | SQL | Top 50 APAC VIPs by Q1 volume — 8 reduced volume >50%; 2 went to zero |
| sq-9 | SQL | Those 8 share 3 affiliate codes (aff_217, aff_223, aff_341) |

Critic round 2: "Sufficient. Synthesize."

Synthesizer output (citation-enforced):

Investigation rsrch-20260512-001 — Q2 fee revenue variance
Verdict: HIGH CONFIDENCE — VIP migration + secondary regime effect

Decomposition (−12.0%):
  Volume effect: −9.2%   [ev-1]
  Rate effect:   −1.9%   [ev-5]  (Apr 14 maker rebate widening)
  Mix effect:    −0.9%

Headline findings:
  1. 8 APAC VIPs reduced volume >50%; 2 went to zero          [ev-8]
  2. These 8 share 3 affiliate codes — likely competitor migration [ev-9]
  3. Q2 had 35 low-vol days vs Q1's 12 — secondary effect      [ev-6]
  4. Apr 14 rebate widening explains −1.9% rate drop           [ev-5]

Recommendations:
  HIGH:   VIP relations contact 8 APAC accounts directly
  HIGH:   Growth audit aff_217 / aff_223 / aff_341
  MEDIUM: Stress-test fee model under continued low-vol regime
          (handoff to optimization layer)

Confidence: HIGH on factual findings; MEDIUM on competitor-migration hypothesis
Investigation cost: $0.07 · 67s · 9 sub-queries · trace lf-finance-q2-001

Latency: 67s. Cost: $0.07. The same investigation took FP&A 2 days last quarter and got the affiliate angle wrong (discovered 3 weeks later).

Worked example 4.2 — Wash trading investigation

User (Compliance Officer): "Investigate user whale_42 for wash trading suspicion. Be thorough — this may become a SAR filing."

Planner output (9 sub-questions):

| sq | Tool | Question |
|---|---|---|
| sq-1 | Vector | Trade pattern self-similarity vs historical wash-trader templates |
| sq-2 | SQL | Round-trip frequency (buy/sell within N seconds) |
| sq-3 | SQL | Counterparty concentration |
| sq-4 | SQL | Device fingerprint overlap with other accounts |
| sq-5 | SQL | Withdrawal patterns vs deposit patterns (net P&L) |
| sq-6 | SQL | Affiliate / referral chain |
| sq-7 | Vector RAG | KYC document similarity to other accounts |
| sq-8 | Past investigations | Historical confirmed wash traders — pattern match |
| sq-9 | SQL | IP / geo-rotation patterns |

Workers execute (~40s):

| sq | Finding | Confidence |
|---|---|---|
| sq-1 | Trade pattern cosine 0.94 to confirmed wash-trader template (Jan 2025) | HIGH |
| sq-2 | 78% of trades are round-trips < 5 minutes | HIGH |
| sq-3 | 64% of volume against single counterparty acct_X | HIGH |
| sq-4 | 4 other accounts share device fingerprint hash | HIGH |
| sq-5 | Withdrawals match deposits within 0.1% — net-zero P&L | HIGH |
| sq-6 | All 5 accounts referred by same affiliate aff_991 | HIGH |
| sq-7 | KYC doc for whale_42 has 0.87 similarity to KYC of acct_X (different photo, same template) | MEDIUM |
| sq-8 | 2 prior cases referred by aff_991 confirmed wash trading | MEDIUM |
| sq-9 | All 5 accounts log in from rotating IPs in same /24 subnet | HIGH |

Critic round 1: "Strong evidence across 8 of 9 sub-questions. Recommend immediate compliance escalation."

Synthesizer output:

Investigation rsrch-20260512-002 — whale_42 wash-trading suspicion
Verdict: HIGH CONFIDENCE — wash trading + Sybil network

Hypotheses tested:
  H1  Coordinated wash-trade pattern    CONFIRMED  [ev-1, ev-2]
  H2  Sybil network (5 linked accounts) CONFIRMED  [ev-4, ev-6, ev-9]
  H3  Same-counterparty self-trading    CONFIRMED  [ev-3]
  H4  Affiliate-chain coordination      CONFIRMED  [ev-6, ev-8]
  H5  Inadvertent / unrelated activity  FALSIFIED  [ev-5 net-zero P&L]

Headline findings:
  1. Trade pattern matches confirmed wash-trader template (cosine 0.94) [ev-1]
  2. 5 accounts share device fingerprint, IP subnet, affiliate code   [ev-4, ev-6, ev-9]
  3. 78% round-trip trades < 5 min, 64% against single counterparty   [ev-2, ev-3]
  4. P&L net-zero — economic activity has no purpose other than volume [ev-5]
  5. Affiliate aff_991 has 2 prior confirmed wash-trading cases       [ev-8]

Recommendations:
  IMMEDIATE: Freeze 5 accounts; suspend aff_991
  IMMEDIATE: File SAR (handoff to SAR draft template)
  MEDIUM:    Review all aff_991-referred accounts (~40 users)

Investigation cost: $0.21 · 71s · 9 sub-queries · trace lf-compliance-whale42-001
Filing-ready evidence pack persisted to marts.research_evidence

Latency: 71s. Cost: $0.21.

The same investigation took an analyst 4–8 hours, produced a free-text Slack message with no evidence trail, and required a second analyst to reconstruct reasoning before SAR filing.

What makes Deep Research trustworthy:

  1. Citation enforcement — every claim must carry [ev-N]. Synthesizer output regex-checked; failures trigger up to 3 retries before falling back to LOW confidence with gaps flagged.
  2. Critic gate — synthesis cannot proceed until critic returns "sufficient". Multiple rounds catch premature conclusions.
  3. Persistence + replay — every investigation lands in marts.research_runs with full evidence ledger. Six months later you can replay the exact reasoning steps via the trace ID.
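
The citation gate in item 1 is a small piece of deterministic code wrapped around the synthesizer call, roughly like this (the sentence split is deliberately naive; the real heuristics are assumptions):

import re

CITATION = re.compile(r"\[ev-\d+\]")

def uncited_claims(report: str) -> list[str]:
    """Substantive sentences (more than 6 words) that carry no [ev-N] citation."""
    sentences = re.split(r"(?<=[.!?])\s+", report)
    return [s for s in sentences if len(s.split()) > 6 and not CITATION.search(s)]

def enforce_citations(synthesize, max_retries: int = 3) -> tuple[str, str]:
    """synthesize(feedback=None) -> report text; re-called with the offending sentences on failure."""
    report = synthesize()
    for _ in range(max_retries):
        missing = uncited_claims(report)
        if not missing:
            return report, "as-scored"
        report = synthesize(feedback="Every claim needs an [ev-N] citation. Fix: " + " | ".join(missing))
    # Still failing after 3 retries: ship it, but downgrade confidence and flag the gaps.
    return report, "LOW"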

Self-Learning — Four Continuous Feedback Loops

The platform improves weekly without engineering intervention. Each loop operates on a different cadence and improves a different surface.

Loop 1 — Glossary Expansion (1-day cycle)

Purpose: Unfamiliar domain terms become deterministic SQL fragments.

User asks question with unfamiliar term
            │
            ▼
Resolution hierarchy attempts to resolve:
   1. Live MCP list_tables — no exact match
   2. Glossary lookup — no entry for the term
   3. LLM disambiguation fallback — proposes interpretation
            │
            ▼
Agent returns answer with explicit disclaimer:
"I interpreted X as Y — confirm or refine?"
            │
            ├─ 👍 confirmed     → propose entry to Slack queue
            └─ 👎 corrected     → user-supplied definition queued
            │
            ▼
Data team approves entry in Slack (1-click)
            │
            ▼
Glossary updated — next user gets deterministic answer
in <1s without disambiguation

Cadence: real-time inline + nightly batch
Effect: glossary grows from ~50 terms at week 1 to ~400+ by month 6

Worked example: Day 1: User asks "show me VIP user churn this quarter". Glossary has no "churn" entry. Inline LLM proposes last_trade_at < now() - INTERVAL 30 DAY AND prior_30d_volume > 0. User clicks 👍. Slack posts: "Propose glossary entry — churn: [SQL]. Approve?" Data team approves day 2 morning. Day 3: Another user asks "how many churned users last week". Glossary lookup returns deterministic SQL. Answer in 800ms. No LLM disambiguation needed. Same question would have taken ~3s on day 1.

Loop 2 — Mart Recommendation (nightly batch)

Purpose: Frequent slow query patterns become permanent pre-joined marts.

NIGHTLY JOB
Mines marts.langfuse_flat for high-frequency staging queries:
  • > 50 times per week
  • > 5s average latency
  • similar query fingerprint via normalizeQuery()
            │
            ▼
Auto-drafts CREATE MATERIALIZED VIEW recommendation
with predicted speedup (e.g., 8s → <500ms)
            │
            ▼
Posts to #data-platform Slack
            │
            ▼
1-click approve → MV deployed
            │
            ▼
Next query hits the mart instead of computing the JOIN from scratch

Cadence: nightly
Effect: mart count grows from ~15 at week 1 → ~35 by month 6

Worked example: 50 users this week asked variations of "users approaching liquidation" — all answered via a 3-table JOIN taking 8 seconds. The recommendation job detects the pattern; auto-drafts marts.mart_near_liquidation MV definition; predicts speedup to <500ms; posts to Slack. Approved; deployed overnight. Within 24h, every related query hits the mart at sub-second latency.
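
The nightly mining step itself is one ClickHouse query over the trace table. A sketch of what it might look like: the column names on marts.langfuse_flat are assumptions, while normalizeQuery() and normalizedQueryHash() are ClickHouse's built-in query-fingerprinting functions:

MART_CANDIDATE_QUERY = """
SELECT
    normalizedQueryHash(query)        AS fingerprint,
    any(normalizeQuery(query))        AS example_query,
    count()                           AS runs_last_7d,
    avg(query_duration_ms) / 1000.0   AS avg_latency_s
FROM marts.langfuse_flat                -- column names here are assumptions
WHERE event_time >= now() - INTERVAL 7 DAY
  AND query ILIKE '%staging.%'          -- only queries still computing JOINs from staging
GROUP BY fingerprint
HAVING runs_last_7d > 50 AND avg_latency_s > 5
ORDER BY runs_last_7d * avg_latency_s DESC
"""
# Each surviving fingerprint becomes a drafted CREATE MATERIALIZED VIEW posted to Slack for 1-click approval.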

Loop 3 — LoRA Fine-Tuning (weekly)

Purpose: Per-team specialist accuracy improves from real corrections.

User clicks 👎 with corrected SQL — captured in Langfuse
            │
            ▼
WEEKLY JOB
Batches all corrections since last run:
  ~50–200 corrections per team per week
            │
            ▼
QLoRA training (PEFT + bitsandbytes, NF4 4-bit on Qwen 72B)
  • LoraConfig r=16
  • per-team adapter (e.g., risk_lora_v12)
  • trains overnight on 24GB GPU
            │
            ▼
A/B test harness:
  • Run eval set against current + candidate adapter
  • Route 10% of traffic to candidate for 48h
  • Promote if accuracy +2pp AND latency regression <10%
            │
            ▼
Per-team multi-LoRA serving via vLLM (--enable-lora --max-loras 8)
Each team's specialist gets their own adapter at no GPU cost

Cadence: weekly train, weekly A/B promote
Effect: specialist accuracy improves +2–5pp per quarter for first 6 months, then asymptotes

Worked example: Compliance team's specialist, trained on 6 months of corrections, has learned that SAR draft questions specifically need device-fingerprint joins (the agent kept missing this — corrected ~30 times). After v12 adapter promotion, accuracy on SAR-related questions jumps from 78% → 91%.
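
A sketch of the weekly fine-tuning setup under those settings. The library calls are standard PEFT / bitsandbytes / transformers; hyperparameters beyond r=16 and NF4, and the target modules, are assumptions:

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # NF4 4-bit base weights
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-72B-Instruct", quantization_config=bnb_config, device_map="auto"
)

lora_config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],   # assumption
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
# Train on the week's corrected (question, SQL) pairs, save as e.g. risk_lora_v12,
# then serve alongside the other team adapters via vLLM: --enable-lora --max-loras 8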

Loop 4 — Eval-Set Regression Daemon (weekly automated)

Purpose: Catch drift / regression in CI before production users notice.

SUNDAY 06:00 UTC — k8s CronJob fires
            │
            ▼
Runs against current production pipeline:
  • 30 known-answer deep-research investigations
  • 50 known-answer SQL questions
  • per-team specialist eval sets
            │
            ▼
Compare every result vs ground truth:
  • Did headline findings match?
  • Was confidence calibration correct?
  • Latency within SLA?
  • Cost within budget?
            │
            ▼
Persisted to marts.eval_set_runs
            │
            ▼
If HIGH-confidence accuracy drops >2pp from last week:
  → BLOCK next prompt or model promotion
  → Post diff to Slack with regression details

Cadence: weekly automated — never skip
Effect: prompt-change accidents caught within 24h instead of in production

Worked example: Engineering pushes a prompt change to the planner template Friday afternoon. Saturday a user reports vague "the bot seems off today." Sunday 06:30 UTC eval daemon detects HIGH-confidence accuracy dropped 3.4pp on the 30-investigation eval set. Slack alert. Promotion blocked. Engineering investigates Monday, finds the edit accidentally removed citation-enforcement instructions. Reverts. Eval recovers next Sunday.
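
The blocking rule at the end of the loop is deterministic. A sketch (the 2pp threshold is from the loop description above; the rest is illustrative):

REGRESSION_THRESHOLD_PP = 2.0  # block promotions if HIGH-confidence accuracy drops more than this

def promotion_allowed(this_week_acc: float, last_week_acc: float) -> bool:
    """Gate prompt/model promotions on week-over-week HIGH-confidence accuracy."""
    drop_pp = (last_week_acc - this_week_acc) * 100
    return drop_pp <= REGRESSION_THRESHOLD_PP

# The worked example above: a 3.4pp drop on the 30-investigation eval set means blocked.
print(promotion_allowed(this_week_acc=0.840, last_week_acc=0.874))  # False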

Plus six supporting intelligences

| # | Intelligence | What it learns | Cadence |
|---|---|---|---|
| 2 | Semantic Table Router | Top-K relevant tables per question class | Per-query (<100ms) |
| 3 | Vector Query Cache | Repeat question → reuse exact past answer | Real-time |
| 7 | Adaptive Mart Refresh | Tune MV refresh cadence based on query frequency | Daily |
| 8 | Schema Change Detection | New tables / columns → auto-draft glossary | Nightly |
| 9 | Multi-Turn Chain Learning | Pre-compute likely follow-up queries | Real-time |
| 10 | Slow Query Detector | Recommend skipping indexes for frequent patterns | Nightly |

The compound effect — 6-month trajectory

| Metric | Week 1 | Month 3 | Month 6 |
|---|---|---|---|
| Glossary size | ~50 | ~200 | ~400+ |
| % questions answered without analyst escalation | ~50% | ~75% | ~90% |
| HIGH-confidence accuracy | ~70% | ~85% | ~92% |
| Median latency | ~3s | ~1.5s | ~1s |
| Mart count | ~15 | ~25 | ~35 |
| Recurring failure rate | high | dropping | <2% |
| Eval-set regressions caught in CI per quarter | 0 | 2–3 | 4–6 |

Dynamic vs Deterministic — Where Each Lives

Different layers of the platform use different mixes of LLM intelligence and deterministic code. The rule: spend LLM where ambiguity exists; spend code where determinism is required for safety, speed, or auditability.

| Layer | Dynamic (LLM-driven) | Deterministic (code) |
|---|---|---|
| Input understanding | ✅ Classify engine + extract params | |
| Schema discovery | | ✅ Live MCP list_tables |
| Glossary lookup | | ✅ Templated YAML lookup |
| Schema disambiguation fallback | ✅ LLM proposes interpretation | |
| Analytics SQL generation | ✅ LLM writes grounded SQL | ✅ Pre-execution validator + read-only role |
| NR lookup SQL | | ✅ Templated SELECT against marts.numeric_research_runs |
| NR cache-vs-refit decision | | ✅ Orchestrator code on freshness threshold |
| Optimization JSON model | ✅ LLM emits structured model | ✅ Python template converts JSON → solver code |
| Deep Research planning | ✅ LLM decomposes into sub-questions | |
| Deep Research workers | ✅ LLM picks tool + generates SQL per sub-question | ✅ Concurrency via asyncio.Semaphore |
| Deep Research critic | ✅ LLM judges sufficiency | ✅ Max 5 rounds (budget enforced) |
| Citation enforcement | | ✅ Regex check + 3-retry fallback |
| Synthesizer rendering | ✅ LLM writes prose | ✅ Confidence label rules |
| Persistence | | ✅ Parameterized inserts |
| Budget tracking | | ✅ Per-call token/cost/wall-clock counters |
| Self-improvement loops | ✅ LLM proposes glossary/MV/etc | ✅ Human approval workflow + A/B promotion |

Trust Mechanisms — Four-Layer Defense

Every output passes through these defenses.

| Layer | What | How enforced |
|---|---|---|
| 1. Grounding | LLM can't reference what it can't see | Live schema via MCP list_tables; pre-joined marts eliminate JOINs; glossary terms deterministic; cross-encoder reranking |
| 2. Verification | Every claim is checkable | Citation enforcement ([ev-N] regex); critic agent (up to 5 rounds); hypothesis verdicts (CONFIRMED/FALSIFIED/UNTESTED); HIGH/MEDIUM/LOW confidence labels calibrated weekly |
| 3. Containment | If wrong, blast radius is zero | Database-layer read-only (readonly = 1); mart-only access; optimization output is a reviewable plan, never an action; sandboxed code execution with no network egress; per-team RBAC at DB role layer |
| 4. Auditability | Every answer is replayable | Langfuse traces every prompt / tool call / sub-query; source provenance per claim; persisted research / optimization / NR runs; 7-year audit retention for regulated jurisdictions |

By the Numbers

| Item | Value |
|---|---|
| Catalogued use cases | ~145 (75 team-based + ~70 product-domain) |
| Compound cross-schema queries | 13 (single-SQL, 4–7 table joins) |
| Numeric research fitted models | 20 (5 with full Python implementations) |
| Worked optimization examples | 5 (3 with full code) |
| Worked deep-research investigations | 3 (full traces) |
| MCP frameworks supported | 12 |
| Solvers supported | 9 (HiGHS, CBC, SCIP, OR-Tools CP-SAT, CVXPY/CLARABEL/ECOS/SCS, Pyomo, optional Gurobi/CPLEX) |
| Teams served | 7 |
| Self-improvement loops | 4 primary + 6 supporting |
| Deployment shapes | 2 (crypto exchange + hedge fund) |
| Build timeline | 24 weeks |
| Team size to build | 5–7 engineers |

The Tech Stack

| Layer | Component | Notes |
|---|---|---|
| Data + vector + observability storage | ClickHouse 25.8+ | Native HNSW vector indexes (GA); same DB stores trades + embeddings + Langfuse traces |
| Chat UI | LibreChat | Acquired by ClickHouse, November 2025; SSO; multi-LLM; 128K context |
| LLM | Qwen 2.5 72B via vLLM | Self-hosted; ~$0.01/query amortized; multi-LoRA serving |
| Triage | Qwen 2.5 72B | Or 32B in cost-sensitive mode |
| Tool layer | Anthropic MCP | 12-framework compatibility |
| Observability | Langfuse | Acquired by ClickHouse, January 2026; traces in ClickHouse |
| Reranker | BGE-reranker-v2-m3 | Cross-encoder, ~600M params, FP16 |
| Solvers | OR-Tools / CVXPY / Pyomo / HiGHS / CBC / SCIP | Open-source; optional Gurobi/CPLEX |
| Code sandbox | Docker + nsjail + cgroups (v1) | Upgrade path to gVisor / Firecracker |
| Fine-tuning | PEFT + bitsandbytes (QLoRA) | NF4 4-bit, 24GB GPU sufficient |
| Pipeline | Debezium + Kafka | CDC from PostgreSQL sources |

The Architectural Thesis (One Paragraph)

The platform combines four execution engines (Analytics, Numeric Research lookup, Optimization, Deep Research) on shared infrastructure (ClickHouse + Qwen + MCP + LibreChat + Langfuse) with citation-enforced trust mechanisms (regex-validated [ev-N] per claim, critic agent gating, HIGH/MEDIUM/LOW confidence calibration) and four continuous self-improvement loops (glossary expansion, mart recommendation, LoRA fine-tuning, eval-set regression). Triage is semantic LLM classification, not keyword matching. Engine choice is intent-driven, not verb-matched. The result is a platform that handles open-ended business questions in 1–90 seconds, never takes autonomous action, replays every answer end-to-end via Langfuse, and gets better weekly without engineering intervention.

That's the platform.
