There’s a peculiar bottleneck in most organisations: the people who have questions about data aren’t always the people who know how to query it. A marketing manager wants to understand customer behaviour patterns. A product lead needs to analyse feature adoption. A CEO wants a quick breakdown of quarterly trends. And all of them end up in a queue, waiting for someone who speaks SQL.
What if you could simply ask your database a question in plain English?
‘What are the tip patterns in New York taxis?’
And instead of a blank stare, you get a comprehensive analysis: average tip amounts, the most common payment methods, seasonal variations, and the curious insight that New Yorkers are most generous with their tips early in the morning.
This isn’t science fiction. You can build this today, in about thirty minutes, using open-source tools. Let me show you how.
Why ClickHouse Is the Ideal Backend for AI Analytics
Before we dive into the setup, it’s worth understanding why ClickHouse is particularly well-suited for AI-powered analytics.
When an AI analyst explores your data, it doesn’t write one perfect query. It writes many queries. It might scan available tables, sample data to understand distributions, run aggregations to find patterns, then drill down into specific segments. A single conversational question can trigger five, ten, or even twenty queries behind the scenes.
With a traditional row-based database, this exploratory pattern is painfully slow. Each query might take seconds or minutes. The conversation becomes stilted — you ask a question, wait, get a partial answer, wait again. The AI’s ‘thinking’ is bottlenecked by the database.
ClickHouse changes this entirely. As a columnar OLAP database, it’s engineered for exactly this use case: fast analytical queries over large datasets. Queries that would take thirty seconds in PostgreSQL complete in milliseconds. Aggregations across billions of rows return in under a second.
This speed transforms the AI analyst experience. The AI can afford to be curious. It can run exploratory queries, check its assumptions, and iterate on its analysis — all whilst maintaining conversational flow. When you ask about tip patterns, the AI can query by time of day, by payment method, by location, by trip distance, and synthesise everything into a coherent answer before you’ve finished reading the previous response.
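To make that concrete, here is the sort of single exploratory aggregation the AI might fire off against the nyc_taxi.trips table in the ClickHouse SQL playground (the dataset we'll use later). This is a hand-written sketch, not agent output:

-- A typical exploratory aggregation: monthly trip volume and average fare
SELECT
    toStartOfMonth(pickup_datetime) AS month,
    count() AS trips,
    round(avg(fare_amount), 2) AS avg_fare
FROM nyc_taxi.trips
GROUP BY month
ORDER BY month

An AI analyst might run a dozen variations of this while answering one question, which is exactly why sub-second aggregations matter.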
There’s another advantage: ClickHouse Cloud handles the infrastructure. You don’t need to provision servers, manage scaling, or worry about query optimisation. Point your data at ClickHouse — whether from S3, Kafka, PostgreSQL, or dozens of other sources — and the analytical layer is ready.
For serious analytical workloads, the database isn’t a commodity choice. ClickHouse’s speed is what makes conversational data analysis feel natural rather than laborious.
The Medallion Architecture: Structuring Data for AI
If you’re bringing your own data into ClickHouse, consider organising it using the medallion architecture. This three-layer approach — bronze, silver, gold — makes your data progressively more useful for analysis.
The Bronze layer holds raw data exactly as it arrives. No transformations, no cleaning. This is your source of truth, useful for debugging and reprocessing.
The Silver layer contains cleaned and normalised data. Duplicates removed, data types corrected, timestamps standardised. This is where you’d point most analytical queries.
The Gold layer holds aggregated, business-ready tables. Pre-computed metrics, dimensional models, KPIs. These are optimised for specific analytical questions.
When you configure your AI analyst, you’ll point it primarily at the gold layer for fast, reliable answers, with permission to explore silver for deeper investigation. This structure helps the AI deliver consistent results whilst maintaining flexibility for ad-hoc exploration.
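As a rough sketch of what the gold layer might look like in ClickHouse, a pre-aggregated table could be built from the silver layer like this (the silver.trips_clean and gold.daily_trip_summary names here are hypothetical, chosen only for illustration):

-- Hypothetical layer and table names for illustration
CREATE TABLE gold.daily_trip_summary
ENGINE = MergeTree
ORDER BY pickup_date
AS SELECT
    toDate(pickup_datetime) AS pickup_date,
    count() AS trips,
    round(sum(total_amount), 2) AS revenue
FROM silver.trips_clean
GROUP BY pickup_date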
The Tools We’ll Use
We’re going to wire together three components:
LibreChat is a popular open-source chat application that provides a ChatGPT-style interface. It gives us a polished chat experience without having to build one from scratch. Crucially, it includes an Agent Builder that lets us create specialised AI assistants with pre-loaded knowledge and connected tools.
ClickHouse MCP Server is the bridge between the AI and your database. MCP stands for Model Context Protocol — it’s a standardised way for AI models to interact with external tools and data sources. The ClickHouse MCP server specifically allows AI models to discover databases, explore table structures, and execute queries.
Docker ties everything together, letting us run the entire stack with a single command.
The beauty of this architecture is its modularity. You can swap out LibreChat for another interface. You can point the MCP server at your own ClickHouse instance instead of the demo playground. You can even switch between AI models — Claude, GPT-4, or others — depending on your preference and budget.
Choosing Your LLM: Premium vs Budget
One of the most important decisions is which language model to use. The good news: you have options spanning a wide range of capabilities and costs.
Premium tier: Claude Opus, Claude Sonnet, and GPT-5/GPT-4o represent the state of the art. These models excel at complex reasoning, handle ambiguous questions gracefully, and generate sophisticated SQL. For mission-critical analytics or complex datasets, they’re worth the cost. Claude Sonnet in particular hits a sweet spot — excellent analytical reasoning at a lower price point than Opus.
Budget tier: Here’s where things get interesting. Models like Qwen 2.5 (72B and 32B variants), Llama 3.3, and DeepSeek deliver surprisingly strong performance at a fraction of the cost. For straightforward analytical queries — ‘show me sales by region’, ‘what’s the trend over time’, ‘compare these two segments’ — these models perform nearly as well as their premium counterparts.
OpenRouter is the secret weapon here. Rather than managing API keys for multiple providers, OpenRouter gives you a single API endpoint that routes to dozens of models. You can experiment freely: try Qwen for a week, switch to Claude for complex projects, fall back to Llama for high-volume batch analysis. The flexibility is remarkable.
A practical strategy: start with a premium model whilst you’re building and testing your agent. Once it’s working reliably, experiment with budget models on the same queries. You might find that Qwen handles 80% of your use cases perfectly well — and that 80% just got significantly cheaper.
Setting Up Your Environment
Before we begin, make sure you have Docker running on your machine. That’s the only prerequisite.
First, we’ll clone the ClickHouse examples repository. This repository contains examples for many different use cases, but we’re particularly interested in the MCP integrations with various AI libraries.
git clone https://github.com/ClickHouse/examples.git
cd examples/AI/MCP/librechat
Next, clone the LibreChat project itself into this directory:
git clone https://github.com/danny-avila/LibreChat.git libra
You’ll now have a folder called ‘libra’ containing the LibreChat codebase.
Configuring Docker Compose
The ClickHouse examples repository includes a Docker Compose override file that does most of the heavy lifting. Let’s examine what it contains.
The override file defines two key services. The first is the API service, which mounts a custom librechat.yaml configuration file. This file tells LibreChat about our MCP server. The second service is the ClickHouse MCP server itself.
For the MCP server, the configuration looks something like this:
clickhouse-mcp-server:
  image: clickhouse/mcp-server
  container_name: clickhouse-mcp-server
  ports:
    - "8001:8000"
  environment:
    - CLICKHOUSE_HOST=sql-clickhouse.clickhouse.com
    - CLICKHOUSE_PORT=8443
    - CLICKHOUSE_USER=demo
    - CLICKHOUSE_PASSWORD=demo
    - CLICKHOUSE_SECURE=true
By default, this points to the ClickHouse SQL playground — a public demo environment with several interesting datasets. But here’s the important bit: if you want to point this at your own ClickHouse server, this is the place to do it. Simply replace the environment variables with your own connection details.
Copy the Docker Compose override file into the LibreChat folder:
cp docker-compose.override.yml libra/
cd libra
Setting Up Your API Key
LibreChat needs an API key to communicate with your chosen language model. Create an environment file from the provided example:
cp .env.example .env
Open the .env file in your preferred text editor. Search for the API key configuration for your chosen provider.
For Anthropic (Claude):
ANTHROPIC_API_KEY=sk-ant-your-key-here
For OpenAI (GPT models):
OPENAI_API_KEY=sk-your-key-here
For OpenRouter (access to Qwen, Llama, and dozens of others):
OPENROUTER_API_KEY=sk-or-your-key-here
You can configure multiple providers and switch between them in the interface. Save and close the file.
Configuring LibreChat to Use the MCP Server
This is where the magic happens. We need to tell LibreChat about our ClickHouse MCP server.
Create the LibreChat configuration file:
cp librechat.example.yaml librechat.yaml
Open librechat.yaml and scroll to the bottom. Add the MCP server configuration:
mcpServers:
  clickhouse-playground:
    url: http://clickhouse-mcp-server:8000
The name ‘clickhouse-playground’ is arbitrary — you can call it whatever makes sense for your setup. The URL points to the Docker container we defined earlier. Because both services run within Docker’s network, they can communicate using container names rather than localhost.
Launching the Stack
Everything is configured. Time to launch:
docker compose up
You’ll see a cascade of log messages as Docker pulls images and starts containers. Once the logs settle, open your browser and navigate to:
http://localhost:3080
LibreChat will prompt you to create a user account. Fill in your details, complete the registration, and log in.
Important: Don’t start chatting directly with the base interface. The raw chat works, but without an agent, you’re essentially onboarding a new analyst from scratch with every conversation. The AI has to discover tables, figure out what columns mean, and guess at business logic each time. Instead, head straight to the Agent Builder to create a proper AI analyst.
Building Your Analyst Agent (The Critical Step)
This is the most important part of the entire setup. Everything else is infrastructure — this is where you create actual intelligence.
Click on the sidebar in LibreChat and select Agents, then Create Agent.
You’ll see several fields to configure. The most important one is Instructions — this is where you embed domain knowledge directly into the agent’s behaviour. Think of it as writing an onboarding document for a new analyst, except this analyst has perfect recall and will follow your guidelines precisely.
The quality of your instructions directly determines the quality of your AI analyst. A generic instruction like ‘you are a helpful data analyst’ produces generic results. A detailed instruction set that explains your data, your business rules, and your analytical priorities produces an agent that behaves like a knowledgeable colleague.
Using AI to Write Your Agent Instructions
Here’s a powerful technique: use another AI to help you write the instructions for your analyst agent.
Take the schema of your gold layer tables — the datamarts and analytical tables your AI analyst will query — and give them to Claude, GPT-4, or whatever AI you prefer. Ask it to write comprehensive instructions for an agent that will use this data.
Here’s a prompt you can adapt:
I'm building an AI analyst agent that will query a ClickHouse database using
natural language. Below is the schema for my main analytical tables.
Please write detailed instructions for this agent that include:
1. A description of what each table contains and when to use it
2. What each column means in business terms (not just technical definitions)
3. Important business rules and caveats (data quality issues, special cases,
things that would trip up a naive analyst)
4. What analytical priorities to focus on when given open-ended questions
5. Query guidelines specific to ClickHouse (aggregation functions, time
handling, performance considerations)
6. How to present findings (lead with insights, mention limitations)
Here's my schema:
[Paste your CREATE TABLE statements or schema documentation here]
The AI will generate a first draft of instructions. Review it, refine it, add your own domain knowledge, and iterate. You might go back and forth several times — ‘add more detail about how we calculate churn’, ‘explain the difference between gross and net revenue’, ‘mention that Q1 2023 data is incomplete due to the migration’.
This approach works because you’re using AI where it excels (synthesising information, structuring documentation) to create instructions that make another AI excel at analysis. It’s AI helping AI help you.
Example: A New York Taxi Analyst
Here’s what a comprehensive instruction set looks like for the ClickHouse playground’s taxi dataset:
You are a senior data analyst specialising in New York City taxi operations.
You have access to the ClickHouse SQL playground via the MCP server.
PRIMARY DATASET
Your main focus is the trips table in the nyc_taxi database. This contains
millions of taxi trip records from New York City.
KEY COLUMNS AND THEIR MEANING
- pickup_datetime: When the passenger got in (use this for time-based analysis)
- dropoff_datetime: When the trip ended
- passenger_count: Number of passengers (1–6 typically; 0 means data error)
- trip_distance: Distance in miles (sanity check: over 100 miles is likely bad data)
- fare_amount: Base fare in USD (does not include tips or extras)
- tip_amount: Tip in USD (only recorded for card payments, not cash)
- total_amount: Everything combined – fare, tip, tolls, surcharges
- payment_type: 1 = Credit card, 2 = Cash, 3 = No charge, 4 = Dispute
- pickup_location_id: Zone where trip started (joins to taxi_zone_lookup)
- dropoff_location_id: Zone where trip ended
BUSINESS RULES AND CAVEATS
- Tips for cash payments are not recorded. When analysing tipping behaviour,
always filter to payment_type = 1 (credit card) and mention this limitation.
- Airport trips have different fare structures. Location IDs 132 (JFK) and
138 (LaGuardia) indicate airport pickups or dropoffs.
- Data quality varies by year. Pre-2015 records have more missing values.
- Rush hour is 7–9am and 5–7pm on weekdays.
- Trips with fare_amount <= 0 or trip_distance <= 0 should be excluded
from most analyses as they indicate cancelled trips or data errors.
ANALYSIS PRIORITIES
When asked open-ended questions, focus on:
1. Revenue patterns (fare trends, tip percentages, seasonal variation)
2. Operational efficiency (trip duration vs distance, busy periods)
3. Customer behaviour (tipping habits, popular routes, payment preferences)
QUERY GUIDELINES
- Always include a LIMIT clause (start with 1000, increase if needed)
- For time-based analysis, use toStartOfHour(), toStartOfDay(), or
toStartOfMonth() to aggregate appropriately
- When comparing periods, calculate percentage change, not just absolute difference
- Round monetary values to 2 decimal places in final output
- Use formatReadableQuantity() for large numbers to improve readability
OUTPUT STYLE
Present findings as a business analyst would: lead with the insight,
then provide supporting data. Mention any data quality issues that
might affect interpretation. If you're uncertain about something,
say so rather than guessing.
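To illustrate the query guidelines in practice, here is a hand-written sketch of the kind of query the agent might generate for a tipping question, assuming the schema and payment codes described above: credit-card trips only, bad records excluded, hourly buckets, rounded output, and a LIMIT.

SELECT
    toHour(pickup_datetime) AS hour_of_day,
    round(avg(tip_amount / fare_amount) * 100, 2) AS avg_tip_pct,
    formatReadableQuantity(count()) AS trips
FROM nyc_taxi.trips
WHERE payment_type = 1
  AND fare_amount > 0
  AND trip_distance > 0
GROUP BY hour_of_day
ORDER BY hour_of_day
LIMIT 1000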
Notice what this instruction set accomplishes:
- It tells the agent which database and table to focus on — no more wandering through irrelevant datasets.
- It explains what each column actually means in business terms, not just technical definitions.
- It flags data quality gotchas that would trip up a naive analyst.
- It establishes analytical priorities so open-ended questions get sensible answers.
- It sets guardrails for query efficiency and output formatting.
- It defines a communication style that prioritises insights over raw data dumps.
Connecting the Agent to ClickHouse
After writing your instructions, scroll down to the Actions section in the Agent Builder. This is where you connect the agent to your ClickHouse MCP server.
Select your clickhouse-playground (or whatever you named it) from the available MCP services. This gives the agent the ability to list databases, describe tables, and execute queries.
Choose a model — Claude Sonnet works well for analytical reasoning, or use a budget model via OpenRouter if you’re optimising for cost.
Save your agent.
Using Your AI Analyst
Now when you start a conversation with this agent, it already knows your data. You’re not chatting with a generic AI that needs to discover everything from scratch. You’re working with a specialist who understands your domain.
Ask ‘What affects tip amounts?’ and it immediately queries the right table, filters for credit card payments (because it knows cash tips aren’t recorded), and structures its analysis around the factors that actually matter.
Ask ‘Show me the busiest times’ and it knows to aggregate by hour, distinguish between weekdays and weekends, and present the results in a way that’s actionable.
Ask ‘Are there any data quality issues I should know about?’ and it can point to the caveats you built into its instructions, reinforcing that institutional knowledge with every conversation.
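Behind a question like ‘show me the busiest times’, for example, the agent might run something along these lines (again a sketch, assuming the playground’s nyc_taxi.trips schema):

SELECT
    if(toDayOfWeek(pickup_datetime) IN (6, 7), 'weekend', 'weekday') AS day_type,
    toHour(pickup_datetime) AS hour_of_day,
    count() AS trips
FROM nyc_taxi.trips
GROUP BY day_type, hour_of_day
ORDER BY day_type, trips DESC
LIMIT 1000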
This is the difference between a chat interface and a proper agent.
Adapting the Instructions for Your Data
The taxi example above is a template. For your own data, you’ll want to adapt each section:
Primary Dataset: Which database and tables should the agent focus on? If you have dozens of tables, explicitly list the three or four that matter most for analytical questions. Point it at your gold layer first.
Key Columns: Don’t just list column names. Explain what they mean in your business context. What does ‘status = 3’ actually represent? What are the valid ranges? Which columns are frequently null? What’s the relationship between customer_id and account_id?
Business Rules: Every dataset has quirks. Maybe your revenue numbers need to be multiplied by an exchange rate. Maybe certain customer segments should be excluded from churn analysis. Maybe dates before a system migration are unreliable. Maybe ‘active users’ has a specific definition that’s different from what someone might assume. Write these down.
Analysis Priorities: What questions does your business actually care about? Revenue, growth, efficiency, customer satisfaction, operational metrics? Tell the agent what to optimise for when given ambiguous requests.
The better your instructions, the more your agent behaves like a knowledgeable colleague rather than a generic chatbot with database access.
Pointing at Your Own Data
The demo is compelling, but the real value comes when you connect this to your own ClickHouse instance.
Modify the environment variables in your Docker Compose override file. Replace the playground credentials with your own:
environment:
  - CLICKHOUSE_HOST=your-clickhouse-server.com
  - CLICKHOUSE_PORT=8443
  - CLICKHOUSE_USER=your_username
  - CLICKHOUSE_PASSWORD=your_password
  - CLICKHOUSE_SECURE=true
Restart the containers, update your agent instructions to reflect your actual schema, and suddenly your AI analyst can explore your production data. Sales figures. User behaviour. Operational metrics. Whatever lives in your database becomes queryable through natural conversation.
A few considerations for production use:
Security matters. The MCP server will execute whatever queries the AI generates. Consider creating a read-only database user with access limited to specific tables. You probably don’t want the AI accidentally running expensive queries or accessing sensitive data.
Performance deserves attention. Large language models sometimes generate inefficient queries. ClickHouse is remarkably fast, but it’s still worth monitoring query execution times and setting appropriate timeouts.
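A minimal sketch of both ideas in ClickHouse SQL, using a hypothetical ai_analyst user and the layer names from earlier; adapt the grants and limits to your own schema and workload:

-- Hypothetical user and layer names; adjust grants to your own tables
CREATE USER ai_analyst IDENTIFIED BY 'use-a-strong-password';
GRANT SELECT ON gold.* TO ai_analyst;
GRANT SELECT ON silver.* TO ai_analyst;

-- Read-only access plus execution limits so exploratory queries stay cheap
CREATE SETTINGS PROFILE ai_analyst_limits
SETTINGS readonly = 1, max_execution_time = 30, max_result_rows = 1000000
TO ai_analyst;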
Iterate on your instructions. Your first version won’t be perfect. Pay attention to where the agent gets confused or produces unhelpful results. Each failure is a signal to add more detail to your instructions.
Beyond Querying
What we’ve built is a foundation. A conversational interface to structured data. But the possibilities extend further.
You could add scheduled reports — ‘Every Monday morning, summarise last week’s sales performance and email it to the team.’ You could create alerts — ‘Tell me if any metric drops more than 20% compared to the previous week.’ You could build dashboards that update themselves based on natural language instructions.
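As a sketch of how the alerting idea could work, the underlying check might be a single ClickHouse query, written here against the hypothetical gold.daily_trip_summary table from earlier:

-- Did last week's revenue drop more than 20% versus the week before?
SELECT
    sumIf(weekly_revenue, week = toStartOfWeek(now() - INTERVAL 7 DAY)) AS last_week,
    sumIf(weekly_revenue, week = toStartOfWeek(now() - INTERVAL 14 DAY)) AS prior_week,
    round((last_week - prior_week) / nullIf(prior_week, 0) * 100, 1) AS pct_change
FROM
(
    SELECT toStartOfWeek(pickup_date) AS week, sum(revenue) AS weekly_revenue
    FROM gold.daily_trip_summary
    GROUP BY week
)
HAVING pct_change < -20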
The MCP protocol is designed for extensibility. The ClickHouse examples repository contains integrations with other AI libraries beyond LibreChat. The same principles apply whether you’re building a Slack bot, a custom application, or an internal tool.
The Bigger Picture
We’re at an inflection point in how humans interact with data. For decades, there’s been a translation layer between questions and answers — someone who knows SQL, someone who can write the query, someone who can interpret the results. That layer is dissolving.
This doesn’t mean data analysts become obsolete. Quite the opposite. It means they can focus on the genuinely difficult problems: data quality, model design, causal inference, strategic interpretation. The mechanical work of translating ‘what were our sales last quarter?’ into SELECT statements becomes automated.
Building your own AI analyst isn’t just a technical exercise. It’s a glimpse of how data work will function in the near future. And now you have the tools to build it yourself.
The complete setup instructions are available in the ClickHouse documentation under ‘Using ClickHouse MCP Server with LibreChat’.