<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: ClickHouse</title>
    <description>The latest articles on DEV Community by ClickHouse (@clickhousedb).</description>
    <link>https://dev.to/clickhousedb</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3274176%2F331ef44b-8525-44a9-9828-7bf6c5449d39.png</url>
      <title>DEV Community: ClickHouse</title>
      <link>https://dev.to/clickhousedb</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/clickhousedb"/>
    <language>en</language>
    <item>
      <title>Join me if you can: ClickHouse vs. Databricks &amp; Snowflake - Part 1</title>
      <dc:creator>ClickHouse</dc:creator>
      <pubDate>Wed, 02 Jul 2025 15:00:00 +0000</pubDate>
      <link>https://dev.to/clickhouse/join-me-if-you-can-clickhouse-vs-databricks-snowflake-part-1-142l</link>
      <guid>https://dev.to/clickhouse/join-me-if-you-can-clickhouse-vs-databricks-snowflake-part-1-142l</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;TL;DR&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
We took a public benchmark that tests join-heavy SQL queries on Databricks and Snowflake and ran the exact same queries on ClickHouse Cloud.&lt;br&gt;&lt;br&gt;ClickHouse was &lt;strong&gt;faster and cheaper&lt;/strong&gt; at every scale, from 721 million to 7.2 billion rows.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  “ClickHouse can’t do joins.” Let’s test that
&lt;/h2&gt;

&lt;p&gt;Let’s be crystal clear upfront: &lt;strong&gt;this is not our benchmark&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Someone else designed a &lt;a href="https://github.com/JosueBogran/coffeeshopdatageneratorv2" rel="noopener noreferrer"&gt;coffee-shop-themed benchmark&lt;/a&gt; to &lt;a href="https://www.linkedin.com/pulse/databricks-vs-snowflake-gen-1-2-sql-performance-test-day-bogran-ddmhe/" rel="noopener noreferrer"&gt;compare&lt;/a&gt; both cost and performance when running join-heavy queries on Databricks and Snowflake, across different compute sizes. The benchmark author shared the full dataset and query suite publicly.&lt;/p&gt;

&lt;p&gt;Out of curiosity, we took that same benchmark, loaded the data into ClickHouse Cloud, used similar instance sizes, and ran the original 17 queries. Most queries involve joins, and we didn’t rewrite them (queries 6, 10, and 15 required minor syntax changes to work in the ClickHouse SQL dialect).&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;We did &lt;strong&gt;no tuning at all&lt;/strong&gt;, not for the queries, and not on the ClickHouse side (no changes to table schemas, indexes, settings, etc).&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Next time someone says “ClickHouse can’t do joins,” &lt;strong&gt;just send them this blog&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;We’ve spent the past 6 months &lt;a href="https://youtu.be/gd3OyQzB_Fc?si=Fso5oedhSHYYW16L&amp;amp;t=137" rel="noopener noreferrer"&gt;making join performance radically better in ClickHouse&lt;/a&gt;, and this post is your first look at how far we’ve come. (&lt;em&gt;Spoiler: it’s really fast. And really cheap. And we’re just getting started.&lt;/em&gt;)&lt;/p&gt;

&lt;p&gt;We’ll walk through how we ran the benchmark, how you can run it too, and then dive into the full results across three dataset sizes: 721 million, 1.4 billion, and 7.2 billion rows.&lt;/p&gt;

&lt;p&gt;Finally, we’ll wrap up with a simple takeaway: ClickHouse can do joins, and it can do them fast.&lt;/p&gt;

&lt;h2&gt;
  
  
  How to reproduce it
&lt;/h2&gt;

&lt;p&gt;You’ll find everything in this &lt;a href="https://github.com/sdairs/coffeeshop-benchmark/tree/main" rel="noopener noreferrer"&gt;GitHub repo&lt;/a&gt;, including all &lt;a href="https://github.com/sdairs/coffeeshop-benchmark/blob/main/clickhouse-cloud/queries.sql" rel="noopener noreferrer"&gt;17 queries&lt;/a&gt;, scripts, and &lt;a href="https://github.com/sdairs/coffeeshop-benchmark/tree/main/clickhouse-cloud#clickhouse-cloud-benchmark-runner" rel="noopener noreferrer"&gt;instructions&lt;/a&gt;. We’ve also published the &lt;a href="https://github.com/sdairs/coffeeshop-benchmark/tree/main?tab=readme-ov-file#get-the-data" rel="noopener noreferrer"&gt;full datasets in a public S3 bucket&lt;/a&gt;, so you can skip the generation step and jump straight to testing.&lt;/p&gt;

&lt;p&gt;The whole thing is &lt;a href="https://github.com/sdairs/coffeeshop-benchmark/tree/main/clickhouse-cloud#clickhouse-cloud-benchmark-runner" rel="noopener noreferrer"&gt;automated&lt;/a&gt;: spin up a ClickHouse Cloud service, set your credentials via environment variables, and run one command with your cluster specs and &lt;a href="https://clickhouse.com/pricing" rel="noopener noreferrer"&gt;price per compute unit&lt;/a&gt;.&lt;/p&gt;
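As a back-of-the-envelope sketch of how runtime and price per compute unit combine into a cost figure (the function and numbers below are illustrative assumptions, not the benchmark scripts’ actual accounting or real ClickHouse Cloud prices):

```python
# Illustrative cost estimate: runtime x active compute, priced per unit-hour.
# The price_per_unit_hour value is a placeholder, not an actual ClickHouse Cloud price.

def estimate_cost(runtime_seconds, nodes, units_per_node, price_per_unit_hour):
    """Return the compute cost in USD for one benchmark run."""
    unit_hours = (runtime_seconds / 3600.0) * nodes * units_per_node
    return unit_hours * price_per_unit_hour

# Example: a 120-second run on 2 nodes, 4 units each, at a hypothetical $0.30/unit-hour.
cost = estimate_cost(120, nodes=2, units_per_node=4, price_per_unit_hour=0.30)
print(round(cost, 4))  # 0.08
```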

&lt;p&gt;Click a button. Grab a coffee. And your results are ready.&lt;/p&gt;

&lt;p&gt;We run each query 5 times and report the fastest run, to reflect warm-cache performance fairly. &lt;a href="https://github.com/sdairs/coffeeshop-benchmark/tree/main/clickhouse-cloud/results" rel="noopener noreferrer"&gt;See full results.&lt;/a&gt;&lt;/p&gt;
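The best-of-5 logic itself is straightforward. A minimal sketch of the idea, timing a stand-in function here rather than a real ClickHouse client (which the actual runner scripts use):

```python
import time

def best_of(run_query, attempts=5):
    """Run a query function `attempts` times and return the fastest wall-clock time in seconds."""
    timings = []
    for _ in range(attempts):
        start = time.perf_counter()
        run_query()
        timings.append(time.perf_counter() - start)
    return min(timings)

# Stand-in workload; the real runner executes a benchmark query via a ClickHouse client here.
fastest = best_of(lambda: sum(range(100_000)))
print(f"fastest of 5 runs: {fastest:.6f}s")
```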

&lt;h2&gt;
  
  
  One dataset, many runs: how we benchmarked at scale
&lt;/h2&gt;

&lt;p&gt;To speed up the benchmarking process, we took advantage of ClickHouse Cloud &lt;a href="https://clickhouse.com/blog/introducing-warehouses-compute-compute-separation-in-clickhouse-cloud" rel="noopener noreferrer"&gt;Warehouses&lt;/a&gt;, a feature that lets you spin up multiple compute services over a single shared dataset.&lt;/p&gt;

&lt;p&gt;We ingested the data once, then spun up additional services in different sizes, &lt;strong&gt;varying the number of nodes, CPU cores, and RAM&lt;/strong&gt;, to benchmark different hardware and cost configurations.&lt;/p&gt;

&lt;p&gt;Because all services in a Warehouse share the same data, we could run the same benchmark across all configurations at once, without reloading anything.&lt;/p&gt;

&lt;p&gt;This also let us test ClickHouse Cloud’s &lt;a href="https://clickhouse.com/docs/deployment-guides/parallel-replicas" rel="noopener noreferrer"&gt;Parallel Replicas&lt;/a&gt; feature, where &lt;strong&gt;multiple compute nodes process a single query in parallel&lt;/strong&gt; for even faster results.&lt;/p&gt;

&lt;h2&gt;
  
  
  Benchmark structure
&lt;/h2&gt;

&lt;p&gt;The original benchmark was posted in two parts on LinkedIn (&lt;a href="https://www.linkedin.com/pulse/databricks-vs-snowflake-sql-performance-test-day-1-721m-bogran-lsboe/" rel="noopener noreferrer"&gt;part 1&lt;/a&gt; and &lt;a href="https://www.linkedin.com/pulse/databricks-vs-snowflake-gen-1-2-sql-performance-test-day-bogran-ddmhe/" rel="noopener noreferrer"&gt;part 2&lt;/a&gt;), using synthetic data that simulates orders at a national coffee chain. It tested three data scales for the main fact table:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Scale factor&lt;/th&gt;
&lt;th&gt;Total rows in fact table (Sales)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;500m&lt;/td&gt;
&lt;td&gt;721m&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;1b&lt;/td&gt;
&lt;td&gt;1.4b&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5b&lt;/td&gt;
&lt;td&gt;7.2b&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;All three data scales use the same schema with &lt;strong&gt;three tables&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Sales&lt;/strong&gt;: the main fact table (orders)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Products&lt;/strong&gt;: product dimension&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Locations&lt;/strong&gt;: store/location dimension&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The benchmark consists of &lt;a href="https://github.com/JosueBogran/coffeeshopdatageneratorv2/blob/2a99993b6bca94c0bc04fae7c695e86cd152add1/Performance%20Test%20Queries.sql" rel="noopener noreferrer"&gt;17 SQL queries&lt;/a&gt;, most involving &lt;strong&gt;joins&lt;/strong&gt; between the fact and one or both dimension tables. All queries were run &lt;strong&gt;sequentially&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Part 1 of the original benchmark covered the smaller 721 million row scale, and part 2 added results for the 1.4 billion and 7.2 billion row scales.&lt;/p&gt;

&lt;p&gt;This post mirrors the structure and layout of the original benchmark posts: same queries, same chart style, same order, just re-run on ClickHouse Cloud. For each scale, we report:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Total cost&lt;/strong&gt; (USD)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Total runtime&lt;/strong&gt; (seconds)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost per query&lt;/strong&gt; (excluding Q10 &amp;amp; Q16)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Seconds per query&lt;/strong&gt; (excluding Q10 &amp;amp; Q16)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost per query&lt;/strong&gt; (Q10 &amp;amp; Q16 only)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Seconds per query&lt;/strong&gt; (Q10 &amp;amp; Q16 only)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Queries 10 and 16 are significantly slower than the rest and would compress the chart scale for the other queries. That’s why the original posts listed them separately.&lt;/p&gt;

&lt;p&gt;The original benchmark included both “Clustered” and “Non-clustered” variants. (Here, “Clustered” means the data was physically sorted and co-located to improve query performance, especially on large tables.) For consistency, we report only the &lt;strong&gt;Clustered&lt;/strong&gt; results here.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Methodology &amp;amp; setup:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;In the original benchmark, results appear to be hot runs. We run each query 5 times and report the fastest run, to reflect warm-cache performance fairly.&lt;br&gt;&lt;br&gt;All results shown here are based on the ClickHouse Cloud &lt;a href="https://clickhouse.com/pricing?plan=scale&amp;amp;provider=aws&amp;amp;region=us-east-1&amp;amp;hours=8&amp;amp;storageCompressed=false" rel="noopener noreferrer"&gt;Scale tier&lt;/a&gt;, using services deployed on AWS in the us-east-2 region. The &lt;a href="https://github.com/sdairs/coffeeshop-benchmark/tree/main/clickhouse-cloud/results" rel="noopener noreferrer"&gt;full results&lt;/a&gt; also include Enterprise tier costs, which still compare favorably.&lt;br&gt;&lt;br&gt;All services were running ClickHouse 25.4.1, with parallel replicas enabled by default.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;You’ll find the full chart sets below, presented without further commentary. All context is in the original LinkedIn posts, and we’ll wrap up with a clear takeaway.&lt;/p&gt;

&lt;p&gt;In the charts below, ClickHouse Cloud results follow the label format &lt;code&gt;CH 2n_30c_120g&lt;/code&gt;, where:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;2n&lt;/code&gt; = number of compute nodes&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;30c&lt;/code&gt; = CPU cores per node&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;120g&lt;/code&gt; = RAM per node (in GB)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;All service configurations use 30 cores and 120 GB RAM per node, with 4 service sizes tested: 2, 4, 8, and 16 compute nodes.&lt;/p&gt;
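For illustration, a label like "CH 2n_30c_120g" can be unpacked mechanically (this small helper is ours, not part of the benchmark scripts):

```python
def parse_label(label):
    """Parse a service label like 'CH 2n_30c_120g' into its components."""
    _, spec = label.split(" ")
    nodes, cores, ram = spec.split("_")
    return {
        "nodes": int(nodes.rstrip("n")),
        "cores_per_node": int(cores.rstrip("c")),
        "ram_gb_per_node": int(ram.rstrip("g")),
    }

config = parse_label("CH 2n_30c_120g")
print(config)  # {'nodes': 2, 'cores_per_node': 30, 'ram_gb_per_node': 120}
```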

&lt;p&gt;The label meanings for Databricks (e.g. &lt;code&gt;DBX_S&lt;/code&gt;) and Snowflake (e.g. &lt;code&gt;SF_S_Gen2&lt;/code&gt;) are unchanged and documented in the original posts.&lt;/p&gt;

&lt;h2&gt;
  
  
  Results: 500m scale (721m rows)
&lt;/h2&gt;

&lt;p&gt;This scale uses a seed of 500m orders to generate a total of 721m rows.&lt;/p&gt;

&lt;h3&gt;
  
  
  Total cost
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fclickhouse.com%2Fuploads%2Ftotal_cost_500m_f80a9e8196.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fclickhouse.com%2Fuploads%2Ftotal_cost_500m_f80a9e8196.png" alt="total_cost_500m.png" width="800" height="553"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Each bar also shows, in parentheses, the total runtime for all 17 queries.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  Total runtime
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fclickhouse.com%2Fuploads%2Ftotal_perf_500m_e8089e6f24.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fclickhouse.com%2Fuploads%2Ftotal_perf_500m_e8089e6f24.png" alt="total_perf_500m.png" width="800" height="557"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Each bar also shows, in parentheses, the total cost for running all 17 queries.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  Cost per query (excluding Q10 &amp;amp; Q16)
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fclickhouse.com%2Fuploads%2Fcost_excl_q10_q16_500m_4b671d6b7d.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fclickhouse.com%2Fuploads%2Fcost_excl_q10_q16_500m_4b671d6b7d.png" alt="cost_excl_q10_q16_500m.png" width="800" height="427"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Runtime per query (excluding Q10 &amp;amp; Q16)
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fclickhouse.com%2Fuploads%2Fperf_excl_q10_q16_500m_1a59901ab6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fclickhouse.com%2Fuploads%2Fperf_excl_q10_q16_500m_1a59901ab6.png" alt="perf_excl_q10_q16_500m.png" width="800" height="432"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Cost per query (Q10 &amp;amp; Q16 only)
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F655s5bjf2j3u9pl99t86.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F655s5bjf2j3u9pl99t86.png" alt="cost_q10_q16_500m.png" width="800" height="634"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Runtime per query (Q10 &amp;amp; Q16 only)
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fswbs6iw3plqhi3e61gvr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fswbs6iw3plqhi3e61gvr.png" alt="perf_q10_q16_500m.png" width="800" height="636"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Results: 1b scale (1.4b rows)
&lt;/h2&gt;

&lt;p&gt;This scale uses a seed of 1b orders to generate a total of 1.4b rows.&lt;/p&gt;

&lt;h3&gt;
  
  
  Total cost
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fclickhouse.com%2Fuploads%2Ftotal_cost_1b_4f6f1ce578.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fclickhouse.com%2Fuploads%2Ftotal_cost_1b_4f6f1ce578.png" alt="total_cost_1b.png" width="800" height="554"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Each bar also shows, in parentheses, the total runtime for all 17 queries.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  Total runtime
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fclickhouse.com%2Fuploads%2Ftotal_perf_1b_5d6410220b.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fclickhouse.com%2Fuploads%2Ftotal_perf_1b_5d6410220b.png" alt="total_perf_1b.png" width="800" height="556"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Each bar also shows, in parentheses, the total cost for running all 17 queries.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  Cost per query (excluding Q10 &amp;amp; Q16)
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fclickhouse.com%2Fuploads%2Fcost_excl_q10_q16_1b_f7db403624.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fclickhouse.com%2Fuploads%2Fcost_excl_q10_q16_1b_f7db403624.png" alt="cost_excl_q10_q16_1b.png" width="800" height="429"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Runtime per query (excluding Q10 &amp;amp; Q16)
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fclickhouse.com%2Fuploads%2Fperf_excl_q10_q16_1b_82f8f0e2f9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fclickhouse.com%2Fuploads%2Fperf_excl_q10_q16_1b_82f8f0e2f9.png" alt="perf_excl_q10_q16_1b.png" width="800" height="432"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Cost per query (Q10 &amp;amp; Q16 only)
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fef0wuvtfdfrhpkyu2kab.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fef0wuvtfdfrhpkyu2kab.png" alt="cost_q10_q16_1b.png" width="800" height="638"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Runtime per query (Q10 &amp;amp; Q16 only)
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdgdfl7tukj8513w6wtb8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdgdfl7tukj8513w6wtb8.png" alt="perf_q10_q16_1b.png" width="800" height="636"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Results: 5b scale (7.2b rows)
&lt;/h2&gt;

&lt;p&gt;This scale uses a seed of 5b orders to generate a total of 7.2b rows.&lt;/p&gt;

&lt;h3&gt;
  
  
  Total cost
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fclickhouse.com%2Fuploads%2Ftotal_cost_5b_343416146f.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fclickhouse.com%2Fuploads%2Ftotal_cost_5b_343416146f.png" alt="total_cost_5b.png" width="800" height="553"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Each bar also shows, in parentheses, the total runtime for all 17 queries.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  Total runtime
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fclickhouse.com%2Fuploads%2Ftotal_perf_5b_149633e506.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fclickhouse.com%2Fuploads%2Ftotal_perf_5b_149633e506.png" alt="total_perf_5b.png" width="800" height="554"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Each bar also shows, in parentheses, the total cost for running all 17 queries.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  Cost per query (excluding Q10 &amp;amp; Q16)
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fclickhouse.com%2Fuploads%2Fcost_excl_q10_q16_5b_18f629315e.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fclickhouse.com%2Fuploads%2Fcost_excl_q10_q16_5b_18f629315e.png" alt="cost_excl_q10_q16_5b.png" width="800" height="429"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Runtime per query (excluding Q10 &amp;amp; Q16)
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fclickhouse.com%2Fuploads%2Fperf_excl_q10_q16_5b_2a25e25670.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fclickhouse.com%2Fuploads%2Fperf_excl_q10_q16_5b_2a25e25670.png" alt="perf_excl_q10_q16_5b.png" width="800" height="432"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Cost per query (Q10 &amp;amp; Q16 only)
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs5sa4a5fojyms9xwn30m.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs5sa4a5fojyms9xwn30m.png" alt="cost_q10_q16_5b.png" width="800" height="638"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Runtime per query (Q10 &amp;amp; Q16 only)
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjv5euwd9fg0d7sdlt058.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjv5euwd9fg0d7sdlt058.png" alt="perf_q10_q16_5b.png" width="800" height="636"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  What we learned and what’s next
&lt;/h2&gt;

&lt;p&gt;ClickHouse is fast with joins. Really fast, across all scales.&lt;/p&gt;

&lt;p&gt;The 17 queries in this benchmark focus on practical join workloads: 2–3 tables, no tuning, no rewrites. We ran them as-is to see how ClickHouse stacks up.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;At 500m scale&lt;/strong&gt;, most queries complete in &lt;strong&gt;under 1 second&lt;/strong&gt;, with ClickHouse consistently &lt;strong&gt;3–5× faster&lt;/strong&gt; than the alternatives. And cheaper.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;At 1b scale&lt;/strong&gt;, ClickHouse joins, aggregates, and sorts &lt;strong&gt;1.7 billion rows in &lt;a href="https://github.com/sdairs/coffeeshop-benchmark/blob/e1837073012e69da25845d45c90a21fa215c2c13/clickhouse-cloud/results/result_v25_4_1_1b_16n_30c_120g_20250620_141435.json#L132" rel="noopener noreferrer"&gt;just half a second&lt;/a&gt;&lt;/strong&gt;, while other systems need &lt;strong&gt;5 to 13 seconds&lt;/strong&gt;, and still cost more.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;At 5b scale&lt;/strong&gt;, even the heaviest queries finish in &lt;strong&gt;seconds, not minutes&lt;/strong&gt;, with ClickHouse staying the fastest and cheapest option overall.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We didn’t do anything special to get these results: no config tweaks, no ClickHouse-specific tricks. Just a clean run of the original benchmark.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;In &lt;a href="https://clickhouse.com/blog/join-me-if-you-can-clickhouse-vs-databricks-snowflake-part-2" rel="noopener noreferrer"&gt;part 2&lt;/a&gt;, we’ll show you how to make it truly fast, the ClickHouse way, with a few powerful tricks up our sleeve.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Behind the scenes, we’ve spent the last 6 months making joins in ClickHouse much faster and more scalable, from improved planning and memory efficiency to better execution strategies.&lt;/p&gt;

&lt;p&gt;And we’re not stopping here.&lt;/p&gt;

&lt;p&gt;Next, we’re turning up the difficulty: full &lt;a href="https://clickhouse.com/docs/getting-started/example-datasets/tpch" rel="noopener noreferrer"&gt;TPC-H&lt;/a&gt;, up to 8-way joins.&lt;/p&gt;

&lt;p&gt;Want to see how ClickHouse handles the most demanding joins? Stay tuned for our TPC-H results.&lt;/p&gt;

</description>
      <category>data</category>
      <category>database</category>
      <category>dataengineering</category>
      <category>opensource</category>
    </item>
    <item>
      <title>Integrating with ClickHouse MCP</title>
      <dc:creator>ClickHouse</dc:creator>
      <pubDate>Tue, 01 Jul 2025 11:03:08 +0000</pubDate>
      <link>https://dev.to/clickhouse/integrating-with-clickhouse-mcp-461e</link>
      <guid>https://dev.to/clickhouse/integrating-with-clickhouse-mcp-461e</guid>
      <description>&lt;p&gt;&lt;a href="https://www.anthropic.com/news/model-context-protocol" rel="noopener noreferrer"&gt;MCP&lt;/a&gt; is a protocol for connecting third-party services - databases, APIs, tools, etc. - to LLMs. Creating an MCP server defines how a client can interact with your service. An MCP client (like Claude Desktop, ChatGPT, Cursor, Windsurf, and more) connects to the server, and allows an LLM to interact with your service. MCP is quickly becoming the de-facto protocol, and we published the ClickHouse MCP server earlier in the year: &lt;a href="https://github.com/ClickHouse/mcp-clickhouse" rel="noopener noreferrer"&gt;mcp-clickhouse&lt;/a&gt;.&lt;/p&gt;



&lt;p&gt;Natural language interfaces are becoming popular across pretty much all domains, including the spaces where we find ClickHouse users. Software engineers, data engineers, analytics engineers, you name it. We're all starting to adopt natural language and agentic interfaces for parts of the job. It's making it easier than ever to work with data, whether you're comfortable with SQL or not. What we're seeing is that LLMs are helping to round out and expand people's skills - software engineers can do more with data, data engineers can do more with software, etc. There's never been a time when a wider audience could work with data.&lt;/p&gt;

&lt;p&gt;Universal across these users, domains, and interfaces is the expectation of speed and interactivity in the user experience. Users aren’t firing off a query on Friday afternoon, grabbing a delicious Bánh mì on the way home, and picking up a report on Monday morning. They’re having a collaborative, interactive conversation with an LLM, where responses are delivered in seconds, and there is a real back-and-forth. If we add third-party services into the mix, we can’t disrupt that experience. If a user wants to query their database this way, the database needs to deliver this kind of responsiveness.&lt;/p&gt;

&lt;p&gt;That's what makes ClickHouse the ideal database for agentic AI data workflows. ClickHouse is built to be the world's fastest analytical database, where no bits, bytes, or milliseconds are wasted. Even before the LLM and agentic era, ClickHouse aimed to support interactive analytics at scale. We didn't set out to be the best database for agentic AI - sometimes, happy accidents just happen.&lt;/p&gt;

&lt;h2&gt;
  
  
  Future use cases
&lt;/h2&gt;

&lt;p&gt;Popularity aside, it's still early days, and the tools, workflows, and use cases are evolving rapidly. We see a lot of people forgoing the traditional SQL interface and BI tooling, instead using chat interfaces like Claude Desktop or ChatGPT to talk to their data, skipping SQL entirely, and generating insights and visualizations. We also see developers without a traditional data background building user-facing applications that expose data to end users, relying on LLMs not just to generate front-ends, but to structure data and optimise queries for very high concurrency.&lt;/p&gt;

&lt;p&gt;With ClickHouse also becoming &lt;a href="https://clickhouse.com/blog/clickstack-a-high-performance-oss-observability-stack-on-clickhouse" rel="noopener noreferrer"&gt;the best choice for observability 2.0&lt;/a&gt;, we're seeing SREs and DevOps teams using LLMs to query their traces, metrics, and logs, blending full-text search and analytics without obscure query syntax. &lt;/p&gt;

&lt;p&gt;And we're imagining what might come next: perhaps we'll see LLMs able to use existing observability data to inform their thinking, perhaps making recommendations for architecture, performance enhancements, or bug fixes based on the data they can access without requiring users to prompt with specific errors or traces.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Soon, ClickHouse Cloud will offer a remote MCP server as a default interface. That means any MCP client could connect directly to your cloud instance without additional local setup.&lt;/p&gt;

&lt;p&gt;Want early access? &lt;a href="http://clickhouse.ai" rel="noopener noreferrer"&gt;Sign up for the AI features waitlist at clickhouse.ai&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  ClickHouse MCP Agent Examples
&lt;/h2&gt;

&lt;p&gt;To make it dead simple to get started, we’ve put together five practical examples showing how to integrate various libraries with the ClickHouse MCP server.&lt;/p&gt;

&lt;p&gt;You can do this today with the open-source &lt;a href="https://github.com/ClickHouse/mcp-clickhouse" rel="noopener noreferrer"&gt;mcp-clickhouse server&lt;/a&gt;. For more on how this fits into the bigger picture, check out &lt;a href="https://clickhouse.com/blog/agenthouse-demo-clickhouse-llm-mcp" rel="noopener noreferrer"&gt;this AgentHouse demo&lt;/a&gt; and our thoughts on &lt;a href="https://clickhouse.com/blog/agent-facing-analytics" rel="noopener noreferrer"&gt;agent-facing analytics&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;You can find all five examples in the &lt;a href="https://github.com/ClickHouse/examples/tree/main/ai/mcp" rel="noopener noreferrer"&gt;ClickHouse/examples repo&lt;/a&gt;. They are all configured to run against the &lt;a href="https://sql.clickhouse.com/" rel="noopener noreferrer"&gt;ClickHouse SQL Playground&lt;/a&gt;, using the following configuration:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;
env = {
    "CLICKHOUSE_HOST": "sql-clickhouse.clickhouse.com",
    "CLICKHOUSE_PORT": "8443",
    "CLICKHOUSE_USER": "demo",
    "CLICKHOUSE_PASSWORD": "",
    "CLICKHOUSE_SECURE": "true"
} 
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;We also use Anthropic models and have provided our API key via the &lt;code&gt;ANTHROPIC_API_KEY&lt;/code&gt; environment variable.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Agno
&lt;/h3&gt;

&lt;p&gt;Let’s start with &lt;a href="https://docs.agno.com/tools/mcp/mcp#multiple-mcp-servers" rel="noopener noreferrer"&gt;Agno&lt;/a&gt; (previously PhiData), a lightweight, high-performance library for building Agents.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;
async with MCPTools(command="uv run --with mcp-clickhouse --python 3.13 mcp-clickhouse", env=env, timeout_seconds=60) as mcp_tools:
    agent = Agent(
        model=Claude(id="claude-3-5-sonnet-20240620"),
        markdown=True, 
        tools = [mcp_tools]
    )
    await agent.aprint_response("What's the most starred project in 2025?", stream=True)
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;This one has a straightforward API. We initialize &lt;code&gt;MCPTools&lt;/code&gt; with the command to launch our local MCP Server, and all the tools become available via the &lt;code&gt;mcp_tools&lt;/code&gt; variable. We can then pass the tools into our agent before calling it on the last line.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/ClickHouse/examples/tree/main/ai/mcp/agno" rel="noopener noreferrer"&gt;View the full Agno example&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. DSPy
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://dspy.ai/" rel="noopener noreferrer"&gt;DSPy&lt;/a&gt; is a framework from Stanford for programming language models.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;
server_params = StdioServerParameters(
    command="uv",
    args=[
        'run',
        '--with', 'mcp-clickhouse',
        '--python', '3.13',
        'mcp-clickhouse'
    ],
    env=env
)

dspy.configure(lm=dspy.LM("anthropic/claude-sonnet-4-20250514"))

class DataAnalyst(dspy.Signature):
    """You are a data analyst. You'll be asked questions and you need to try to answer them using the tools you have access to. """

    user_request: str = dspy.InputField()
    process_result: str = dspy.OutputField(
        desc=(
            "Answer to the query"
        )
    )

async with stdio_client(server_params) as (read, write):
    async with ClientSession(read, write) as session:
        await session.initialize()
        tools = await session.list_tools()

        dspy_tools = []
        for tool in tools.tools:
            dspy_tools.append(dspy.Tool.from_mcp_tool(session, tool))

        print("Tools", dspy_tools)

        react = dspy.ReAct(DataAnalyst, tools=dspy_tools)
        result = await react.acall(user_request="What's the most popular Amazon product category")
        print(result)
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;This one is more complicated. We similarly initialize our MCP server, but rather than having a single command as a string, we need to split up the command and the arguments. &lt;/p&gt;

&lt;p&gt;DSPy also requires us to specify a &lt;code&gt;Signature&lt;/code&gt; class for each interaction, where we define input and output fields. We then provide that class when initializing our agent, which is done using the &lt;code&gt;ReAct&lt;/code&gt; class. &lt;/p&gt;

&lt;p&gt;&lt;code&gt;ReAct&lt;/code&gt; stands for "reasoning and acting," which asks the LLM to decide whether to call a tool or wrap up the process. If a tool is required, the LLM takes responsibility for deciding which tool to call and providing the appropriate arguments.&lt;/p&gt;
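&lt;p&gt;The cycle ReAct runs can be sketched in plain Python. The following is a toy illustration of the pattern with a stubbed model and tool - not DSPy's actual implementation:&lt;/p&gt;

```python
# Toy ReAct loop with a stubbed "LLM" - illustrates the decide/act/observe
# cycle only; DSPy's real implementation is more sophisticated.

def react_loop(llm, tools, question, max_steps=5):
    """Alternate between asking the model for a decision and running tools."""
    observations = []
    for _ in range(max_steps):
        decision = llm(question, observations)  # model picks a tool, or finishes
        if decision["action"] == "finish":
            return decision["answer"]
        tool = tools[decision["action"]]
        observations.append(tool(**decision["args"]))  # act, then observe
    return "gave up"

# Stubbed model: call the SQL tool once, then answer from the observation.
def stub_llm(question, observations):
    if not observations:
        return {"action": "run_sql", "args": {"query": "SELECT 42"}}
    return {"action": "finish", "answer": f"Result: {observations[-1]}"}

tools = {"run_sql": lambda query: f"[[42]] for {query!r}"}
print(react_loop(stub_llm, tools, "What is 6 * 7?"))
```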

&lt;p&gt;You’ll notice that we must iterate over our MCP tools and convert them to DSPy ones.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/ClickHouse/examples/tree/main/ai/mcp/dspy" rel="noopener noreferrer"&gt;View the full DSPy example&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. LangChain
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://github.com/langchain-ai/langchain-mcp-adapters" rel="noopener noreferrer"&gt;LangChain&lt;/a&gt; is a framework for building LLM-powered applications.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;
server_params = StdioServerParameters(
    command="uv", 
    args=[
        "run", 
        "--with", "mcp-clickhouse",
        "--python", "3.13", 
        "mcp-clickhouse"
    ],
    env=env
)
         
async with stdio_client(server_params) as (read, write):
    async with ClientSession(read, write) as session:
        await session.initialize()
        tools = await load_mcp_tools(session)
        agent = create_react_agent("anthropic:claude-sonnet-4-0", tools)
        
        handler = UltraCleanStreamHandler()        
        async for chunk in agent.astream_events(
            {"messages": [{"role": "user", "content": "Who's committed the most code to ClickHouse?"}]}, 
            version="v1"
        ):
            handler.handle_chunk(chunk)
            
        print("\n")
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;LangChain follows a similar approach to DSPy when initializing the MCP Server. Like DSPy, we need to invoke a ReAct function to create the agent, passing in our MCP tools. We (well, Claude!) wrote a custom bit of code (&lt;code&gt;UltraCleanStreamHandler&lt;/code&gt;) to render the output in a more user-friendly way.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/ClickHouse/examples/tree/main/ai/mcp/langchain" rel="noopener noreferrer"&gt;View the full LangChain example&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. LlamaIndex
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://docs.llamaindex.ai/en/stable/api_reference/tools/mcp/" rel="noopener noreferrer"&gt;LlamaIndex&lt;/a&gt; is a data framework for your LLM applications.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;
mcp_client = BasicMCPClient(
    "uv", 
    args=[
        "run", 
        "--with", "mcp-clickhouse",
        "--python", "3.13", 
        "mcp-clickhouse"
    ],
    env=env
)

mcp_tool_spec = McpToolSpec(
    client=mcp_client,
)

tools = await mcp_tool_spec.to_tool_list_async()

agent_worker = FunctionCallingAgentWorker.from_tools(
    tools=tools, 
    llm=llm, verbose=True, max_function_calls=10
)
agent = AgentRunner(agent_worker)

response = agent.query("What's the most popular repository?")
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;LlamaIndex follows the familiar approach of initializing the MCP server. We then initialize an agent with our tools and LLM. We found the default &lt;code&gt;max_function_calls&lt;/code&gt; value of 5 was too low and wasn’t enough to answer any questions, so we increased it to 10.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/ClickHouse/examples/tree/main/ai/mcp/llamaindex" rel="noopener noreferrer"&gt;View the full LlamaIndex example&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  5. PydanticAI
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://ai.pydantic.dev/mcp/run-python/#installation" rel="noopener noreferrer"&gt;PydanticAI&lt;/a&gt; is a Python agent framework designed to make it less painful to build production-grade applications with Generative AI.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;
server = MCPServerStdio(  
    'uv',
    args=[
        'run',
        '--with', 'mcp-clickhouse',
        '--python', '3.13',
        'mcp-clickhouse'
    ],
    env=env
)
agent = Agent('anthropic:claude-sonnet-4-0', mcp_servers=[server])

async with agent.run_mcp_servers():
    result = await agent.run("Who's done the most PRs for ClickHouse?")
    print(result.output)
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;PydanticAI has the simplest API. Again, we initialize our MCP server and pass it into the agent. We then run the server as an asynchronous context manager and ask the agent questions inside that block.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/ClickHouse/examples/tree/main/ai/mcp/pydanticai" rel="noopener noreferrer"&gt;View the full PydanticAI example&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Try It Out
&lt;/h2&gt;

&lt;p&gt;We’re just getting started with MCP and ClickHouse, and we’d love to hear about what you’re building and your experience using mcp-clickhouse. &lt;/p&gt;

&lt;p&gt;Try out the examples, build something cool, and let us know what you think. If you run into issues or have ideas, open a GitHub issue or &lt;a href="https://clickhouse.com/slack" rel="noopener noreferrer"&gt;chat with us in Slack&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>mcp</category>
      <category>database</category>
      <category>llm</category>
      <category>ai</category>
    </item>
    <item>
      <title>Scaling our Observability platform beyond 100 Petabytes</title>
      <dc:creator>ClickHouse</dc:creator>
      <pubDate>Thu, 26 Jun 2025 17:05:06 +0000</pubDate>
      <link>https://dev.to/clickhouse/scaling-our-observability-platform-beyond-100-petabytes-1kip</link>
      <guid>https://dev.to/clickhouse/scaling-our-observability-platform-beyond-100-petabytes-1kip</guid>
      <description>&lt;blockquote&gt;
&lt;h2&gt;TLDR&lt;/h2&gt;
&lt;p&gt; &lt;strong&gt;Observability at scale:&lt;/strong&gt; Our internal system grew from 19 PiB to 100 PB of uncompressed logs and from ~40 trillion to 500 trillion rows.&lt;/p&gt;
&lt;p&gt; &lt;strong&gt;Efficiency breakthrough:&lt;/strong&gt; We absorbed a 20× surge in event volume using under 10% of the CPU previously needed.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;OTel pitfalls:&lt;/strong&gt; The required parsing and marshalling of events in OpenTelemetry proved a bottleneck and didn’t scale - our custom pipeline addressed this.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Introducing HyperDX:&lt;/strong&gt; ClickHouse-native observability UI for seamless exploration, correlation, and root-cause analysis with Lucene-like syntax.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;About a year ago, we &lt;a href="https://clickhouse.com/blog/building-a-logging-platform-with-clickhouse-and-saving-millions-over-datadog" rel="noopener noreferrer"&gt;shared the story of LogHouse&lt;/a&gt; - our internal logging platform built to monitor ClickHouse Cloud. At the time, it managed what felt like a massive 19 PiB of data. More than just solving our observability challenges, LogHouse also saved us millions by replacing an increasingly unsustainable Datadog bill. The response to that post was overwhelming. It was clear our experience resonated with others facing similar struggles with traditional observability vendors and underscored just how critical effective data management is at scale.&lt;/p&gt;

&lt;p&gt;A year later, LogHouse has grown beyond anything we anticipated and is now storing over 100 petabytes of uncompressed data across nearly 500 trillion rows. That kind of scale forced a series of architectural changes, new tools, and hard-earned lessons that we felt were worth sharing - not least that OpenTelemetry (OTel) isn’t always the panacea of Observability (though we still love it), and that sometimes custom pipelines are essential. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;In our case, this shift enabled us to handle a 20x increase in event volume using less than 10% of the CPU for our most critical data source - a transformation with massive implications for cost and efficiency.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Other parts of our stack have also changed, not least due to the &lt;a href="https://clickhouse.com/blog/clickhouse-acquires-hyperdx-the-future-of-open-source-observability" rel="noopener noreferrer"&gt;ClickHouse acquisition of HyperDX&lt;/a&gt;. Not only did this give us a first-party ClickHouse-native UI, but it also led to the &lt;a href="https://clickhouse.com/blog/clickstack-a-high-performance-oss-observability-stack-on-clickhouse" rel="noopener noreferrer"&gt;creation of ClickStack&lt;/a&gt; - an opinionated, end-to-end observability stack built around ClickHouse. With HyperDX, we’ve started transitioning away from our Grafana-based custom UI, moving toward a more integrated experience for exploration, correlation, and root cause analysis.&lt;/p&gt;

&lt;p&gt;As more teams adopt ClickHouse for observability and realize just how much they can store and query affordably, we hope these insights prove as useful as our first post. If you’re curious about this journey, when and where OTel is appropriate, and how we scaled a log pipeline to 100 PB - read on.&lt;/p&gt;

&lt;h2&gt;
  
  
  Beyond general purpose: evolving observability at scale
&lt;/h2&gt;

&lt;p&gt;Over the past year, our approach to observability has undergone a significant transformation. We've continued to leverage OpenTelemetry to gather general-purpose logs, but as our systems have scaled, we began to reach its limits. While OTel remains a valuable part of our toolkit, it couldn't fully deliver the performance and precision we needed for our most demanding workloads. This prompted us to develop purpose-built tools tailored to our critical systems and rethink where generic solutions truly fit. Along the way, we've broadened the range of data we collect and revamped how we present insights to engineers.&lt;/p&gt;

&lt;h3&gt;
  
  
  A new frontier of scale
&lt;/h3&gt;

&lt;p&gt;When we last wrote about LogHouse, we were proud to handle 19 PiB of uncompressed data across 37 trillion rows. Today, those numbers feel like a distant memory. LogHouse now stores over 100 petabytes of uncompressed data, representing nearly 500 trillion rows.&lt;br&gt;
Here's a quick look at the breakdown:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;System&lt;/th&gt;
&lt;th&gt;Uncompressed Size&lt;/th&gt;
&lt;th&gt;Stored rows&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;SysEx&lt;/td&gt;
&lt;td&gt;93.6 PB&lt;/td&gt;
&lt;td&gt;431 Trillion&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;OTel&lt;/td&gt;
&lt;td&gt;14.5 PB&lt;/td&gt;
&lt;td&gt;16.7 Trillion&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;These numbers also tell a story. In our original post, 100% of our telemetry flowed through OpenTelemetry, with every log line collected via the same general-purpose pipeline. But as the scale and complexity of our data grew, so did the need for specialization.&lt;br&gt;
While our total volume has grown more than 5x, the breakdown reveals a deliberate shift in strategy: today, the vast majority of our data comes from “SysEx”, a new purpose-built exporter we developed to handle high-throughput, high-fidelity system logs from ClickHouse itself. This shift marks a turning point in how we think about observability pipelines - and brings us to our first key topic.&lt;/p&gt;

&lt;p&gt;We hope the following helps convey the scale at which LogHouse operates.&lt;/p&gt;


&lt;h3&gt;
  
  
  OpenTelemetry's efficiency challenges at extreme scale
&lt;/h3&gt;

&lt;p&gt;Initially, we used OpenTelemetry (OTel) for all log collection. It was a great starting point and an established industry standard which allowed us to quickly establish a baseline where every pod in our Kubernetes environment shipped logs to ClickHouse. However, as we scaled, we identified two key reasons to build a specialized tool for shipping our core ClickHouse server telemetry.&lt;/p&gt;

&lt;p&gt;First, while OTel capably captured the ClickHouse text log via stdout, this represents only a narrow slice of the telemetry ClickHouse exposes. Any ClickHouse expert knows that the real gold lies in its &lt;strong&gt;system tables&lt;/strong&gt; - a rich, structured collection of logs, metrics, and operational insights that go far beyond what’s printed to standard output. These tables capture everything from query execution details to disk I/O and background task states, and unlike ephemeral logs, they can be retained indefinitely within a cluster. For both real-time debugging and historical analysis, this data is invaluable. We wanted all of it in LogHouse.&lt;/p&gt;

&lt;p&gt;Second, the inefficiency of the OTel pipeline for this specific task became obvious as we scaled. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fclickhouse.com%2Fuploads%2Floghouse_041bc14329.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fclickhouse.com%2Fuploads%2Floghouse_041bc14329.png" alt="loghouse.png" width="800" height="455"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The data journey involved:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;A customer's ClickHouse instance writes logs as JSON to stdout.&lt;/li&gt;
&lt;li&gt;The kubelet persists these logs in &lt;code&gt;/var/log/…&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;An OTel collector collects these logs from the disk, parsing and marshalling the JSON into an in-memory representation.&lt;/li&gt;
&lt;li&gt;The collector transforms these into the OTel log format - again an in-memory representation.&lt;/li&gt;
&lt;li&gt;Finally, they are inserted back into another ClickHouse instance (LogHouse) over the native format (requiring another transformation within the ClickHouse Go client).&lt;/li&gt;
&lt;/ol&gt;

&lt;blockquote&gt;
&lt;p&gt;Note: The architecture described here is simplified. In reality, our OTel pipeline is more involved. Logs were first collected at the edge in JSON, converted into the OTel format, and sent over OTLP to a set of gateway instances. These gateways (also OTel collectors) performed additional processing before finally converting the data into ClickHouse’s native format for ingestion. Each step introduced overhead, latency, and further complexity.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;At our scale, this pipeline introduced two critical problems: inefficiency and data loss. First, we were burning substantial compute on repeated data transformations. Native ClickHouse types were being flattened into JSON, mapped into the OTel log format, and then re-ingested - only to be reinterpreted by ClickHouse on the other end. This not only wasted CPU cycles but also degraded the fidelity of the data.&lt;br&gt;
Even more importantly, we were hitting hard resource limits on the collectors themselves. Deployed as agents on each Kubernetes node, they were subject to strict CPU and memory constraints via standard Kubernetes limits. As traffic spiked, many collectors ran so hot they began dropping log lines outright - unable to keep up with the volume emitted by ClickHouse. We were losing data at the edge before it ever had a chance to reach LogHouse.&lt;br&gt;
We found ourselves at a crossroads: either dramatically scale up the resource footprint of our OTel agents (and gateways) or rethink the entire ingestion model. We chose the latter.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Note: To put the cost in perspective - handling 20 million rows per second through the OpenTelemetry pipeline without dropping events would require an estimated 8,000 CPU cores across agents and collectors. That’s an enormous footprint dedicated solely to log collection, making it clear that the general-purpose approach was unsustainable at our scale.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h3&gt;
  
  
  Building SysEx: A specialized system for ClickHouse-to-ClickHouse transfers
&lt;/h3&gt;

&lt;p&gt;Our solution was to develop the &lt;strong&gt;System Tables Exporter&lt;/strong&gt;, or &lt;strong&gt;SysEx&lt;/strong&gt;. This is a specialized tool designed to transfer data from one ClickHouse instance to another as efficiently as possible. We wanted to go directly from the system tables in a customer's pod to the tables in LogHouse, preserving native ClickHouse types and eliminating all intermediate conversions. This has the fantastic side benefit that any query our engineers use to troubleshoot a live instance can be trivially adapted to query historical data across our entire fleet in LogHouse, as the table schemas are identical, with the addition of some enrichment columns (such as the Pod Name, ClickHouse version, etc). &lt;/p&gt;

&lt;p&gt;Firstly we should emphasize that SysEx performs a &lt;strong&gt;literal byte-for-byte copy&lt;/strong&gt; of data from the source to the destination. This preserves full fidelity, eliminates unnecessary CPU overhead, and avoids the pitfalls of repeated marshalling.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fclickhouse.com%2Fuploads%2Floghouse_diagram_5e4e1a0646.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fclickhouse.com%2Fuploads%2Floghouse_diagram_5e4e1a0646.png" alt="loghouse-diagram.png" width="800" height="328"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The architecture is simple and powerful. We run a pool of SysEx scrapers connecting to our customer's ClickHouse instances. A hash ring assigns each customer pod to a specific scraper replica to distribute the load. These scrapers then run SELECT queries against the source pod's system tables and stream the data directly into LogHouse, without any deserialization. The scrapers simply coordinate and forward bytes between the source and destination.&lt;br&gt;
Scraping system tables requires careful handling to ensure no data is missed due to buffer flushes. Fortunately, nearly all system table data is inherently time-series in nature. SysEx leverages this by querying within a sliding time window, deliberately trailing real time by a small buffer - typically five minutes. This delay allows for any internal buffers to flush, ensuring that when a scraper queries a node, all relevant rows for that time window are present and complete. This strategy has proven reliable and meets our internal SLAs for timely and complete event delivery to LogHouse.&lt;/p&gt;
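&lt;p&gt;The assignment and windowing logic described above can be sketched as follows. This is an illustrative approximation, not SysEx's production code - the hash function and window width are assumptions:&lt;/p&gt;

```python
# Illustrative sketch of two SysEx ideas (not the production implementation):
# a hash ring that pins each customer pod to a scraper replica, and a scrape
# window that deliberately trails real time by five minutes.
import hashlib
from datetime import datetime, timedelta

def assign_scraper(pod_name: str, num_replicas: int) -> int:
    """Deterministically map a pod to one of the scraper replicas."""
    digest = hashlib.sha256(pod_name.encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_replicas

def scrape_window(now: datetime, lag=timedelta(minutes=5),
                  width=timedelta(minutes=1)):
    """Query a window that ends `lag` behind real time, so internal buffers
    have flushed and the window's rows are complete when scraped."""
    end = now - lag
    return end - width, end

# Every scraper computes the same assignment, so no coordination is needed.
replica = assign_scraper("clickhouse-pod-42", num_replicas=8)
start, end = scrape_window(datetime(2025, 6, 26, 12, 0, 0))
print(replica, start, end)
```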

&lt;p&gt;SysEx is written in Go, like most of our infrastructure components for ClickHouse Cloud. Naturally, this raises a question for anyone familiar with the Go ClickHouse client: how do we avoid the built-in marshalling and unmarshalling of data when reading from and writing to ClickHouse? By default, the client converts data into Go-native types, which would defeat the purpose of a byte-for-byte copy. To solve this, we &lt;a href="https://github.com/ClickHouse/clickhouse-go/pull/1233" rel="noopener noreferrer"&gt;contributed improvements&lt;/a&gt; to the Go client that allow us to &lt;strong&gt;bypass internal marshalling entirely&lt;/strong&gt;, enabling SysEx to stream data in its native format directly from the source cluster to LogHouse - without decoding, re-encoding, or allocating intermediary data structures. &lt;/p&gt;

&lt;p&gt;This approach is broadly equivalent to a simple bash command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-s&lt;/span&gt; &lt;span class="nt"&gt;-u&lt;/span&gt; &lt;span class="s1"&gt;'default:&amp;lt;password&amp;gt;'&lt;/span&gt; &lt;span class="s2"&gt;"https://sql-clickhouse.clickhouse.com:8443/?query=SELECT+*+FROM+system.query_log+FORMAT+Native"&lt;/span&gt; | curl &lt;span class="nt"&gt;-s&lt;/span&gt; &lt;span class="nt"&gt;-X&lt;/span&gt; POST &lt;span class="nt"&gt;--data-binary&lt;/span&gt; @- &lt;span class="s1"&gt;'http://localhost:8123/?query=INSERT+INTO+query_log+FORMAT+Native'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;An actual Go implementation, for the curious, can be found &lt;a href="https://pastila.nl/?00740eb5/722157a76fc25f54212d7097b805253b#OSfprbH3wM1dGgtyMAnmuw==" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Most importantly, SysEx doesn’t require the heavy buffering that OTel does, thanks to its pull-based model. Because scrapers query data at a steady, controlled rate, we don’t risk dropping logs when LogHouse is temporarily unavailable or when the source experiences a spike in telemetry. Instead, SysEx naturally handles backfill by scraping historical windows, ensuring reliable delivery without overloading the system or requiring complex retry buffers.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr1a0fw5ej54vwateu9nd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr1a0fw5ej54vwateu9nd.png" alt="pipeline.png" width="800" height="176"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Dynamic schema generation
&lt;/h3&gt;

&lt;p&gt;One of the key challenges with the SysEx approach is that it assumes the source and target schemas match. But in reality, as any ClickHouse user knows, system table schemas change frequently. Engineers continuously add new metrics and columns to support emerging features and accelerate issue diagnosis, which means the schema is a moving target. &lt;/p&gt;

&lt;p&gt;To handle this, we generate schemas dynamically. When SysEx encounters a system table, it inspects and hashes its schema to determine if a matching table already exists in LogHouse. If it does, the data is inserted there. If not, a new schema version is created for this system table e.g. &lt;code&gt;text_log_6&lt;/code&gt;.&lt;/p&gt;
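&lt;p&gt;The versioning idea can be sketched like this. The fingerprinting and naming details are illustrative - SysEx's actual implementation differs:&lt;/p&gt;

```python
# Hypothetical sketch of SysEx's schema-versioning idea: hash a table's
# ordered (name, type) column list and reuse or allocate a versioned target.
import hashlib

def schema_hash(columns) -> str:
    """Stable fingerprint of an ordered (column name, type) list."""
    canonical = ";".join(f"{name}:{type_}" for name, type_ in columns)
    return hashlib.sha256(canonical.encode()).hexdigest()[:16]

def target_table(base: str, columns, known: dict) -> str:
    """Return the existing versioned table for this schema, or allocate a new
    one, e.g. text_log_6 for the seventh distinct schema of system.text_log."""
    fingerprint = schema_hash(columns)
    if fingerprint not in known:
        known[fingerprint] = f"{base}_{len(known)}"
    return known[fingerprint]

known = {}
v1 = [("event_time", "DateTime"), ("message", "String")]
v2 = v1 + [("query_id", "String")]  # a column added in a newer release
print(target_table("text_log", v1, known))  # text_log_0
print(target_table("text_log", v2, known))  # text_log_1
print(target_table("text_log", v1, known))  # text_log_0 again
```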

&lt;p&gt;At query time, we use ClickHouse’s &lt;a href="https://clickhouse.com/blog/clickhouse-release-25-01#better-merge-tables" rel="noopener noreferrer"&gt;Merge table engine&lt;/a&gt; to unify all schema iterations into a single logical view. This allows us to query across multiple versions of a system table seamlessly. The engine automatically resolves schema differences by selecting only the columns that are compatible across tables, or by restricting the query to tables that contain the requested columns. This gives us forward compatibility as schemas evolve, without sacrificing query simplicity or requiring manual schema management.&lt;/p&gt;
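&lt;p&gt;As a sketch, such a unifying table can be declared with the Merge engine, which reads from every table in a database whose name matches a regular expression (the table names here are illustrative):&lt;/p&gt;

```sql
-- Unify all schema versions (text_log_0, text_log_1, ...) behind one
-- logical table; the AS clause borrows the newest version's schema.
CREATE TABLE text_log_merged AS text_log_6
ENGINE = Merge(currentDatabase(), '^text_log_');
```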

&lt;h3&gt;
  
  
  State snapshotting
&lt;/h3&gt;

&lt;p&gt;As we continued to scale and refine our observability capabilities, one of our primary focuses was capturing in-memory system tables, such as &lt;code&gt;system.processes&lt;/code&gt;. Unlike the time-series data we’ve been capturing, these tables provide a snapshot of the server’s state at a specific point in time. To handle this, we implemented a periodic snapshot process, capturing these in-memory tables and storing them in LogHouse.&lt;/p&gt;

&lt;p&gt;This approach not only allows us to capture the state of the cluster at any given moment, but also provides time-travel through critical details like table schemas and cluster settings. With this additional data, we can enhance our diagnostic capabilities by performing cluster-wide or ClickHouse Cloud-wide analyses, joining against service settings or query characteristics such as &lt;code&gt;used_functions&lt;/code&gt; to pinpoint anomalies, making it easier to identify the root causes of issues as they arise. By correlating queries with particular schemas, we have further improved our ability to proactively identify and resolve performance or reliability problems for our customers.&lt;/p&gt;

&lt;h3&gt;
  
  
  Fleet-wide queries
&lt;/h3&gt;

&lt;p&gt;One of the many powerful capabilities we've unlocked with SysEx is the ability to take the same &lt;a href="https://clickhouse.com/blog/common-issues-you-can-solve-using-advanced-monitoring-dashboards" rel="noopener noreferrer"&gt;Advanced Dashboard queries&lt;/a&gt;  that customers use to monitor their individual ClickHouse instances and run them across our entire fleet of customer instances simultaneously.&lt;/p&gt;

&lt;p&gt;For release analysis, we can now execute proven diagnostic queries before and after deployments to immediately identify behavioral changes across our entire fleet. This has been rolled into our comprehensive release analysis process. Queries that analyze query performance patterns, resource utilization trends, and error rates complete in real time, allowing us to quickly spot regressions or validate improvements at fleet scale. &lt;/p&gt;

&lt;p&gt;Secondly, our support dashboards can now embed the same deep diagnostic queries that customers rely on, but with enriched context from our centralized telemetry. When investigating customer issues, support engineers can run familiar Advanced Dashboard queries while simultaneously correlating with network logs, Kubernetes events, data and control plane events - all within the same interface.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fclickhouse.com%2Fuploads%2Fgrafana_zoom_67ef33f55d.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fclickhouse.com%2Fuploads%2Fgrafana_zoom_67ef33f55d.gif" alt="grafana zoom" width="120" height="60"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  20x more data, 90% less CPU: The numbers behind our rewrite
&lt;/h3&gt;

&lt;p&gt;The efficiency gains from SysEx are staggering. Consider these stats from LogHouse:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;OTel Collectors:&lt;/strong&gt; Use over 800 CPU cores to ship 2 million logs per second.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;LogHouse Scrapers (SysEx):&lt;/strong&gt; Use just 70 CPU cores to ship 37 million logs per second.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This specialized approach has allowed us to handle a 20x increase in event volume with less than 10 percent of the CPU footprint for our most important data source. Most importantly, it means we no longer drop events at the edge. To achieve this same level of reliability with our previous OTel-based pipeline, &lt;strong&gt;we would have needed over 8,000 CPU cores&lt;/strong&gt;. SysEx delivers it with a fraction of the resources, maintaining full fidelity and consistent delivery.&lt;/p&gt;
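&lt;p&gt;The per-core arithmetic implied by those figures works out as follows:&lt;/p&gt;

```python
# Per-core throughput implied by the figures above.
otel_rate, otel_cores = 2_000_000, 800      # OTel collectors
sysex_rate, sysex_cores = 37_000_000, 70    # SysEx scrapers

otel_per_core = otel_rate / otel_cores      # 2,500 logs/s per core
sysex_per_core = sysex_rate / sysex_cores   # ~528,571 logs/s per core
improvement = sysex_per_core / otel_per_core

print(f"{otel_per_core:,.0f} vs {sysex_per_core:,.0f} logs/s per core")
print(f"~{improvement:.0f}x per-core efficiency")  # roughly 211x
```

At the same ~2,500 logs/s per core, absorbing a 20 million rows-per-second peak is what drives the 8,000-core estimate for the OTel pipeline quoted earlier.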

&lt;h2&gt;
  
  
  When OpenTelemetry is the right choice
&lt;/h2&gt;

&lt;p&gt;If you’ve read this far, you might be wondering: when is OpenTelemetry still the right choice, and is it still useful?&lt;br&gt;
We firmly believe that it is. While our architecture has evolved to meet challenges at extreme scale, such as parsing and processing over 20 million log lines per second, OpenTelemetry remains a critical part of our stack. It offers a standardized, vendor-neutral format and provides an excellent onboarding experience for new users - and is hence the default choice for ClickStack. Unlike SysEx, which is tightly integrated with ClickHouse internals, OpenTelemetry decouples producers from consumers, which is a major architectural advantage, especially for users who want flexibility across observability platforms.&lt;/p&gt;

&lt;p&gt;It is also well suited for scenarios where SysEx cannot operate. SysEx is pull-based and relies on querying live system tables, which means the service must be healthy and responsive. If a service is crash-looping or down, SysEx is unable to scrape data because the necessary system tables are unavailable. OpenTelemetry, by contrast, operates in a passive fashion. It captures logs emitted to &lt;code&gt;stdout&lt;/code&gt; and &lt;code&gt;stderr&lt;/code&gt;, even when the service is in a failed state. This allows us to collect logs during incidents and perform root cause analysis even if the service never became fully healthy.&lt;br&gt;
For this reason, we continue to run OpenTelemetry across all ClickHouse services. The key difference is in what we collect. Previously, we ingested everything, including trace-level logs. Now, we collect only info-level and above. This significantly reduces the data volume and allows our OTel collectors and gateways to operate with far fewer resources. The result is a smaller, more focused pipeline that still accounts for the 2 million log lines per second referenced earlier.&lt;/p&gt;
&lt;h2&gt;
  
  
  HyperDX for better experience
&lt;/h2&gt;

&lt;p&gt;Collecting all this data is just the beginning. Making it usable and accessible is what really matters. In the first iteration of LogHouse, we built a highly customized observability experience on top of Grafana. It served us well, but as our internal data sources grew and diversified, particularly with the introduction of SysEx and wide-column telemetry, it became clear we needed something more deeply integrated with ClickHouse.&lt;/p&gt;

&lt;p&gt;This challenge was not unique to us. Many teams building observability solutions on ClickHouse have encountered the same issue. Getting data into ClickHouse was straightforward, but building a UI that fully unlocked its value required significant engineering effort. For smaller teams or companies without dedicated frontend resources, ClickHouse-powered observability was often out of reach.&lt;/p&gt;

&lt;p&gt;HyperDX changed that. It provided a first-party, ClickHouse-native UI that supports log and trace exploration, correlation, and analysis at scale. Its workflows are designed with ClickHouse in mind, optimizing queries and minimizing latency. When we evaluated HyperDX prior to the acquisition, it was already clear that it addressed many of the pain points we and others had experienced. The ability to query using Lucene syntax dramatically simplifies data exploration and is often sufficient. Importantly, it still allows us to query in SQL - something we find essential for more complex event analysis, as described in “When log search isn't enough: complex queries with SQL” below.&lt;/p&gt;

&lt;p&gt;A key reason HyperDX was such a compelling fit was the schema-agnostic &lt;a href="https://clickhouse.com/blog/clickstack-a-high-performance-oss-observability-stack-on-clickhouse" rel="noopener noreferrer"&gt;approach introduced in v2.0&lt;/a&gt;. It doesn't require log tables to conform to a single, rigid structure. This flexibility is critical for a system like LogHouse, which ingests data from numerous sources:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;It seamlessly handles the standardized, yet evolving, data format from our &lt;strong&gt;OpenTelemetry&lt;/strong&gt; pipeline.&lt;/li&gt;
&lt;li&gt;More importantly, it works out-of-the-box with the highly specialized, wide-column tables produced by &lt;strong&gt;SysEx&lt;/strong&gt; and our other custom exporters. It does this with no prior knowledge of the SysEx schemas and no complex &lt;em&gt;grok pattern&lt;/em&gt; specializations. It simply inspects the schemas behind the scenes and adapts to them.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This means our engineering teams can add new data sources with unique, optimal schemas to LogHouse without ever needing to worry about breaking or reconfiguring the user interface. By combining HyperDX's powerful UI and session replay capabilities with LogHouse's massive data repository, we have created a unified and adaptable observability experience for our engineers.&lt;/p&gt;
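&lt;p&gt;To illustrate the schema-agnostic idea, here is a toy heuristic in Python (a hypothetical sketch, not HyperDX's actual implementation): given the (name, type) pairs a &lt;code&gt;DESCRIBE TABLE&lt;/code&gt; would return, infer which columns play the timestamp and body roles:&lt;/p&gt;

```python
# Toy illustration of schema-agnostic source discovery (a hypothetical
# heuristic, not HyperDX's actual logic): given (name, type) pairs from
# a table description, guess the timestamp and body columns.
def infer_log_source(columns: list[tuple[str, str]]) -> dict:
    # First DateTime-family column is assumed to be the event timestamp.
    timestamp = next(
        (n for n, t in columns if t.startswith("DateTime")), None)
    # First String column whose name mentions "body" is the log body.
    body = next(
        (n for n, t in columns if t == "String" and "body" in n.lower()),
        None)
    return {"timestamp_column": timestamp, "body_column": body}

# An OTel-style schema and a custom SysEx-style schema (both invented
# for this example) resolve without any per-source configuration.
otel = [("Timestamp", "DateTime64(9)"), ("Body", "String"),
        ("ServiceName", "String")]
sysex = [("event_time", "DateTime"), ("query_id", "String"),
         ("log_body", "String")]
```

The point is that the mapping is derived from the schema itself, so adding a new table with a unique layout requires no UI changes.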

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdulrc3vjeewoxrquf5i5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdulrc3vjeewoxrquf5i5.png" alt="hypdx-1.png" width="800" height="446"&gt;&lt;/a&gt;&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftpcz3v5dfghezxqwevzf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftpcz3v5dfghezxqwevzf.png" alt="hyperdx-2.png" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;It is worth emphasizing that Grafana still has its place in our observability stack. Our internal Grafana-based application has some distinct advantages, particularly in how it handles routing and query scoping. Users are required to specify the namespace (effectively a customer service) they intend to query. Behind the scenes, the application knows exactly where data for each service resides and can route queries directly to the appropriate ClickHouse instance within LogHouse. This minimizes unnecessary query execution across unrelated services and helps keep resource usage efficient.&lt;/p&gt;

&lt;p&gt;This is especially important in our environment, where we operate LogHouse databases across many regions. As our previous blog post described, efficiently querying across these distributed systems is critical for performance and reliability. We’re currently exploring how we might push this routing logic to ClickHouse itself, allowing HyperDX to benefit from the same optimization - so stay tuned.&lt;/p&gt;

&lt;p&gt;In addition to its routing capabilities, Grafana remains the home for many of our long-standing dashboards and alerts, particularly those built on Prometheus metrics. These remain valuable, and migrating them is not currently a priority. For example, &lt;code&gt;kube_state_metrics&lt;/code&gt; has almost become a de facto standard for cluster health monitoring. These high-level metrics are well suited for alerting, even if they are not ideal for deep investigation. For now, they continue to serve their purpose effectively.&lt;/p&gt;

&lt;p&gt;For now, the two tools serve complementary purposes and coexist effectively within our observability stack.&lt;/p&gt;
&lt;h2&gt;
  
  
  Embracing high cardinality observability
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;Store everything, aggregate nothing&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The development of SysEx has brought more than just technical gains. It has driven a cultural shift in how we think about observability. By unlocking access to system tables that were previously unavailable, where only standard output logs had been captured, we have embraced a model centered on wide events and high cardinality data.&lt;/p&gt;

&lt;p&gt;Some refer to this as Observability 2.0. &lt;strong&gt;We simply call it LogHouse combined with ClickStack.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This approach replaces the traditional three-pillar model with something more powerful: a centralized warehouse that can store high-cardinality telemetry from many sources. Each row contains rich context - query identifiers, pod names, version metadata, network details - without needing to pre-aggregate or discard dimensions to fit within the limits of a metric store.&lt;/p&gt;

&lt;p&gt;As engineers, we have adapted to this new model, leaving behind outdated concerns about cardinality explosions. Instead of summarizing at ingest time, we store everything as is and push aggregation to query time. This approach allows for in-depth inspection and flexible exploration without sacrificing fidelity.&lt;/p&gt;

&lt;p&gt;One pattern we have found particularly impactful is logging wide events that include timeseries attributes in place of traditional metrics. For example, here is a log line from SysEx that tracks data pushed from a source ClickHouse instance to the LogHouse cluster:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"level"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"info"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"ts"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;1728437333.011701&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"caller"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"scrape/scrape.go:334"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"msg"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"pushed"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"podName"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"c-plum-qa-31-server-zkmrfei-0"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"podIP"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"10.247.29.9"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"spokenName"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"plum-qa-31"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"azTopoName"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"us-east1-c"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"srcTable"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"part_log"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"windowSize"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;120&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"insertDuration"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0.00525254&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"insertLag"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;300.011693981&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"startGTE"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1728436913&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"stopLT"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1728437033&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;At this point, you may be asking: &lt;strong&gt;how is this different from a traditional metrics store like Prometheus?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The key difference is that we store &lt;strong&gt;every single data point&lt;/strong&gt;. We do not pre-aggregate fields like &lt;code&gt;insertDuration&lt;/code&gt;; instead, we capture and retain each individual value alongside its full context.&lt;/p&gt;

&lt;p&gt;In contrast, a system like Prometheus typically stores either a gauge per series or, more commonly, pre-aggregates values into histograms to support efficient querying. This design introduces significant limitations. For example, storing time series for all label combinations in Prometheus would lead to a cardinality explosion. In our environment, with tens of thousands of unique pod names, each label combination would require its own timeseries just to preserve query-time flexibility. Pre-aggregating with histograms helps control resource usage but comes at the cost of fidelity. It makes certain questions impossible to answer, such as:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"Which exact insert is represented by this spike in insertDuration - down to the specific instance, table, and time window?"&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;With our approach, we avoid these trade-offs entirely. We log each event as a wide row that captures all relevant dimensions and metrics in full. This shifts aggregation and summarization to query time while preserving the ability to drill down into individual events when necessary.&lt;/p&gt;
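&lt;p&gt;As a small illustration of query-time aggregation (an in-memory stand-in for a SQL query over the wide-event table; the field names mirror the log line above), we can both aggregate and recover the exact raw event behind an outlier:&lt;/p&gt;

```python
# Sketch of query-time aggregation over wide events: because every raw
# value is retained, we can aggregate at any granularity *and* still
# point at the exact event behind an outlier. The events are invented
# for this example, with fields mirroring the SysEx log line above.
events = [
    {"podName": "pod-a", "srcTable": "part_log", "insertDuration": 0.004},
    {"podName": "pod-a", "srcTable": "part_log", "insertDuration": 0.005},
    {"podName": "pod-b", "srcTable": "text_log", "insertDuration": 0.210},
]

def max_duration_with_culprit(events):
    """Find the worst insert and return the full raw event behind it."""
    worst = max(events, key=lambda e: e["insertDuration"])
    return worst["insertDuration"], worst

value, culprit = max_duration_with_culprit(events)
# A histogram-based metric store could tell us *that* a spike happened;
# here the spike traces back to a single concrete event with all of its
# dimensions (pod, table, and so on) intact.
```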

&lt;p&gt;This model isn’t entirely new. Systems like Elasticsearch have long encouraged the ingestion of wide events and flexible document structures. The difference is that ClickHouse makes this approach operationally viable at scale. Its columnar design allows us to store high-cardinality, high-volume event data efficiently - without the runaway storage costs or query latency that traditionally limited this kind of event-centric approach.&lt;/p&gt;

&lt;h3&gt;
  
  
  Leveraging data science tools for observability analysis
&lt;/h3&gt;

&lt;p&gt;The power of this approach is in how we can use that single event to draw many different conclusions by visualizing its various characteristics, and we can always jump back to the raw logs from any given point on a chart.&lt;/p&gt;

&lt;p&gt;First, we can focus on a particular service and see its inserts line by line in series. This is the raw view of the data.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj7e2q39lcipa8ewi13wt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj7e2q39lcipa8ewi13wt.png" alt="rawevents.png" width="800" height="234"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We can visualize the insert lag for all tables for this individual instance trivially…&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7xx731hn8bd32ntq3mwz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7xx731hn8bd32ntq3mwz.png" alt="lag.png" width="800" height="336"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We can go a layer up and visualize the insert lag for all servers in a region whose lag exceeds the desired threshold.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F66jvlpua811uki6tl470.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F66jvlpua811uki6tl470.png" alt="lag-2.png" width="800" height="194"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;And, because Observability is &lt;em&gt;Just another Data Problem&lt;/em&gt;, we get to borrow all of the tooling in the data science space for our observability data, so we can visualize our logs in any tool of our choice with which &lt;a href="https://clickhouse.com/docs/integrations/data-visualization" rel="noopener noreferrer"&gt;ClickHouse integrates directly&lt;/a&gt; or &lt;a href="https://clickhouse.com/docs/integrations/language-clients" rel="noopener noreferrer"&gt;via a client library&lt;/a&gt;. For example, Plotly in a Jupyter notebook:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;plotly.express&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;px&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;clickhouse_connect&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;clickhouse_connect&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_client&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
&lt;span class="err"&gt;…&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;query&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
SELECT
    toInt64(toFloat64(LA[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;ts&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;])) AS ts,
    toInt64(LA[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;startGTE&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;]) AS start,
    toInt64(LA[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;stopLT&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;]) AS stop
FROM otel.generic_logs_0
WHERE (PodName LIKE &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;loghouse-scraper-%&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;
    AND Timestamp &amp;gt;= &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;2025-06-10 16:14:00&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;
    AND Timestamp &amp;lt;= &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;2025-06-10 18:35:00&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;
    AND EventDate = &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;2025-06-10&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;
    AND Body = &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;pushed&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;
    AND LA[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;srcTable&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;] = &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;text_log&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;
    AND LA[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;podName&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;] = &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;c-plum-qa-31-server-zzvuyka-0&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;
)
ORDER BY EventTime DESC
&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;query_df&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Convert the 'start' and 'stop' columns from Unix timestamps to datetime objects
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;start&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;to_datetime&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;start&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;unit&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;stop&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;to_datetime&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;stop&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;unit&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;fig&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;px&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;timeline&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;x_start&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;start&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;x_end&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;stop&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ts&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;fig&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;update_traces&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;width&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;40&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;fig&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;update_layout&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;bargap&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;fig&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;show&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffbv7s2nd41xh2ep1m1ic.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffbv7s2nd41xh2ep1m1ic.png" alt="scrapes.png" width="800" height="436"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The plot shows scrape time versus wall time, allowing us to inspect each event for duplication. With Plotly, I could size each rectangle to span the exact start/end times of its scrape window. The annotations highlight a window where duplicate scrapes occurred, confirming the presence of overlapping data in that range.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsm6rtrstu3jbamml5pwu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsm6rtrstu3jbamml5pwu.png" alt="insert-duration.png" width="800" height="574"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;This plot illustrates the varying insert duration for some tables collected by the LogHouse Scraper.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;While I tend to prefer Plotly, we recognize that others may favor more modern visualization libraries. Thanks to ClickHouse's broad integration support, our SREs can choose the best tools for their workflows. Whether it’s Hex, Bokeh, Evidence, or any other platform that supports SQL-driven analysis, they are free to work with the approach that suits them best.&lt;/p&gt;

&lt;p&gt;Here, we saw five views of the same event - demonstrating the flexibility we have to choose how we render at query time, using different charting tools, always with the ability to drill down into the raw line-by-line events.&lt;/p&gt;

&lt;h3&gt;
  
  
  When log search isn't enough: complex queries with SQL
&lt;/h3&gt;

&lt;p&gt;HyperDX offers a robust event search interface using Lucene syntax, ideal for quick lookups and filtering. However, answering more complex observability questions requires a more expressive query language. With ClickHouse as the engine behind LogHouse, we can always drop into full SQL.&lt;/p&gt;

&lt;p&gt;SQL allows us to express joins, time-based operations, and transformations that would be difficult or impossible to perform in typical log query tools. One example is identifying pod termination times by correlating Kubernetes event streams. The query below uses ASOF JOIN to align Killing and Created events for the same container, calculating the time between termination and restart:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;WITH&lt;/span&gt;
    &lt;span class="n"&gt;KE&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;
        &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;loghouse&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;kube_events&lt;/span&gt;
        &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;FirstTimestamp&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="s1"&gt;'2025-03-10 01:00:00'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;FirstTimestamp&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&lt;/span&gt; &lt;span class="s1"&gt;'2025-03-11 01:00:00'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Reason&lt;/span&gt; &lt;span class="k"&gt;IN&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;'Killing'&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;FieldPath&lt;/span&gt; &lt;span class="k"&gt;LIKE&lt;/span&gt; &lt;span class="s1"&gt;'spec.containers{c-%-server}'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;CE&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;
        &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;loghouse&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;kube_events&lt;/span&gt;
        &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;FirstTimestamp&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="s1"&gt;'2025-03-10 01:00:00'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;FirstTimestamp&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&lt;/span&gt; &lt;span class="s1"&gt;'2025-03-11 01:00:00'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Reason&lt;/span&gt; &lt;span class="k"&gt;IN&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;'Created'&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;FieldPath&lt;/span&gt; &lt;span class="k"&gt;LIKE&lt;/span&gt; &lt;span class="s1"&gt;'spec.containers{c-%-server}'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt;
    &lt;span class="n"&gt;Name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;KE&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;FirstTimestamp&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;killTime&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;CE&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;FirstTimestamp&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;createTime&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;createTime&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;killTime&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;delta&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;formatReadableTimeDelta&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;createTime&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;killTime&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;readableDelta&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;KE&lt;/span&gt;
&lt;span class="n"&gt;ASOF&lt;/span&gt; &lt;span class="k"&gt;LEFT&lt;/span&gt; &lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;CE&lt;/span&gt; &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;CE&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;KE&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Name&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;CE&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;FirstTimestamp&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="n"&gt;KE&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;FirstTimestamp&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;HAVING&lt;/span&gt; &lt;span class="n"&gt;createTime&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="s1"&gt;'1970-01-01 00:00:00'&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;delta&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;
&lt;span class="k"&gt;LIMIT&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;pre&gt;&lt;code&gt;┌─Name─────────────────────────────┬────────────killTime─┬──────────createTime─┬─delta─┬─readableDelta─────────────────────┐
│ c-emerald-tu-48-server-p0jw87g-0 │ 2025-03-10 19:01:39 │ 2025-03-10 20:15:59 │  4460 │ 1 hour, 14 minutes and 20 seconds │
│ c-azure-wb-13-server-648r93g-0   │ 2025-03-10 11:30:23 │ 2025-03-10 12:28:50 │  3507 │ 58 minutes and 27 seconds         │
│ c-azure-wb-13-server-3mjrr1g-0   │ 2025-03-10 11:30:23 │ 2025-03-10 12:28:47 │  3504 │ 58 minutes and 24 seconds         │
│ c-azure-wb-13-server-v31soea-0   │ 2025-03-10 11:30:23 │ 2025-03-10 12:28:46 │  3503 │ 58 minutes and 23 seconds         │
└──────────────────────────────────┴─────────────────────┴─────────────────────┴───────┴───────────────────────────────────┘

4 rows in set. Elapsed: 0.099 sec. Processed 17.78 million rows, 581.49 MB (180.05 million rows/s., 5.89 GB/s.)
Peak memory usage: 272.88 MiB.
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Sure, we could write a component to track this as a metric, but the power of ClickHouse is that we don’t need to. It’s sufficient to store a warehouse of wide events and derive the metric we need from them at query time. So when a colleague asks, "what’s the p95 replacement time for Pods after termination is requested?", we can simply find the relevant set of events instead of responding, "let me ship a new metric," and getting back to them with an answer after the next release goes out.&lt;/p&gt;
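
&lt;p&gt;As a concrete sketch of that idea, the p95 a colleague would ask about is a single aggregate at query time. The table name below is hypothetical; it stands in for the joined result produced by the query above:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;-- Illustrative: derive the p95 replacement time from stored events,
-- rather than shipping a dedicated metric.
WITH replacements AS
(
    SELECT createTime - killTime AS delta   -- seconds from kill to replacement
    FROM pod_replacement_events             -- hypothetical name for the join above
)
SELECT quantile(0.95)(delta) AS p95_replacement_seconds
FROM replacements
&lt;/code&gt;&lt;/pre&gt;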

&lt;h3&gt;
  
  
  Expanding the data universe
&lt;/h3&gt;

&lt;p&gt;Sold on the immense value of having deep, structured telemetry in a high-performance analytics engine, we've been busy adding more data sinks to LogHouse, mainly at the request of our engineering and support team, who love using LogHouse and want all critical data to live in the warehouse. This year, we've embraced a cultural shift towards high-cardinality, wide-event-based observability as shown above. &lt;/p&gt;

&lt;p&gt;Some of our new data sources, which adhere to this wide event philosophy, include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;kubenetmon:&lt;/strong&gt; Our open-source tool for monitoring Kubernetes networking, giving us deep insights into cluster traffic. &lt;code&gt;kubenetmon&lt;/code&gt; uses Linux's conntrack system to capture L3/L4 connection data with byte/packet counts. This provides three key capabilities: forensics (time-series connection records with per-minute bandwidth), attribution (mapping connections to specific workloads and pods), and metering (cost tracking for expensive data transfer like cross-region egress). The system processes millions of connection observations per minute, helping us identify costly cross-regional downloads, track cross-AZ traffic patterns, and correlate network usage with actual costs. You can find the project at &lt;a href="https://github.com/ClickHouse/kubenetmon" rel="noopener noreferrer"&gt;https://github.com/ClickHouse/kubenetmon&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Kubernetes Event Exporter:&lt;/strong&gt; We forked the popular exporter and added a native ClickHouse sink, allowing us to analyze Kubernetes API events at scale. You can find our fork &lt;a href="https://github.com/ClickHouse/kubernetes-event-exporter" rel="noopener noreferrer"&gt;here&lt;/a&gt;. This is hugely useful for understanding why things changed in K8s over time. We’re not stopping there, however! We’re already working on a plan to ingest not just the events, but the entire k8s object model into LogHouse, with snapshots at every change. This would allow us to model the full state of all clusters at any moment in time over the past six months, and step through all of the changes. Instead of just knowing "Pod X was terminated at 15:47," we’ll see the full cluster state before and after, understand dependencies, resource constraints, and the cascading effects of changes.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Control Plane Data:&lt;/strong&gt; We now collect all operational data from our Control Plane team, which had not previously been onboarded onto LogHouse.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Real User Monitoring (RUM):&lt;/strong&gt; In a project that is still a work in progress, we collect frontend performance metrics from our users' browsers, which are pushed via a public gateway into our OTel pipeline.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Istio Access Log:&lt;/strong&gt; We ingest HTTP-level traffic data from our Istio service mesh, capturing request/response patterns, latencies, and routing decisions. Combined with ClickHouse's &lt;code&gt;system.query_log&lt;/code&gt; and kubenetmon's network flows, this creates a powerful three-way correlation capability. When network usage spikes occur, our support team can trace the complete story: which specific SQL queries were executing, what HTTP requests triggered them, and the exact packet flow patterns. This cross-layer visibility transforms debugging from guesswork into precise root cause analysis - if we see unusual egress traffic, we can immediately identify whether it's from expensive cross-region queries, backup operations, or unexpected replication, making troubleshooting incredibly efficient for the support team.&lt;/li&gt;
&lt;/ul&gt;
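
<p>To give a flavor of the metering use case, a query in this spirit could rank workloads by cross-AZ egress. The schema below is hypothetical and only illustrates the access pattern, not kubenetmon's actual table layout:</p>

&lt;pre&gt;&lt;code&gt;-- Hypothetical schema: one row per connection observation per minute.
SELECT
    pod_namespace,
    pod_name,
    sum(bytes_sent) AS egress_bytes,
    formatReadableSize(egress_bytes) AS egress
FROM network_flows
WHERE src_az != dst_az                      -- cross-AZ traffic only
  AND timestamp &amp;gt;= now() - INTERVAL 1 DAY
GROUP BY pod_namespace, pod_name
ORDER BY egress_bytes DESC
LIMIT 10
&lt;/code&gt;&lt;/pre&gt;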

&lt;h2&gt;
  
  
  What’s next and the road ahead
&lt;/h2&gt;

&lt;p&gt;It’s been an incredible year of growth for LogHouse. By moving beyond a one-size-fits-all approach and embracing specialized, highly efficient tooling, we’ve scaled our observability platform to remarkable new heights while significantly enhancing our cost performance. Integrating HyperDX is a key part of that evolution, providing a flexible and powerful user experience on top of our petabyte-scale data warehouse. We're excited to see what the next year brings as we continue to build on this strong foundation.&lt;/p&gt;

&lt;h3&gt;
  
  
  Toward zero-impact scraping
&lt;/h3&gt;

&lt;p&gt;While SysEx is designed to be efficient and resource-conscious, customers occasionally notice our scrape queries in their logs and metrics. These queries are &lt;strong&gt;tightly constrained with strict memory limits&lt;/strong&gt;, but when they fail (as they sometimes do), it can cause concern. Although the actual resource impact is minimal, we recognize that even lightweight queries can create noise or confusion in sensitive environments.&lt;/p&gt;

&lt;p&gt;To address this, we’re exploring what we call &lt;strong&gt;zero-impact scraping&lt;/strong&gt; - the next evolution of SysEx. The goal is to eliminate all in-cluster query execution by entirely decoupling scraping from the live system. One promising direction involves leveraging &lt;strong&gt;&lt;a href="https://clickhouse.com/docs/operations/storing-data#s3-plain-rewritable-storage" rel="noopener noreferrer"&gt;plain rewritable disks on S3&lt;/a&gt;&lt;/strong&gt;, where ClickHouse already writes its service logs. In this model, a pool of SysEx workers would mount these disk-based log tables directly, bypassing the need to query the running ClickHouse instance. This design would deliver all the benefits of our current system - native format, high fidelity, minimal transformation - while removing even the perception of operational impact.&lt;/p&gt;

&lt;p&gt;OpenTelemetry remains a critical component of our platform, particularly for early-stage data capture before service tables are available. This is especially useful during crash loops, where structured logs may be unavailable. However, if our zero-impact scraping approach proves successful, it could reduce our reliance on OTel even further by providing a high-fidelity, low-disruption path for log ingestion throughout the lifecycle of a cluster.&lt;/p&gt;

&lt;p&gt;This effort is still in progress, and we’ll share more once we’ve validated the approach in production.&lt;/p&gt;

&lt;h3&gt;
  
  
  Migrating to JSON
&lt;/h3&gt;

&lt;p&gt;The JSON type has been available in ClickHouse for some time and &lt;a href="https://clickhouse.com/blog/clickhouse-release-25-03" rel="noopener noreferrer"&gt;recently reached GA in version 25.3&lt;/a&gt;. It offers a flexible and efficient way to store semi-structured data, dynamically creating columns with appropriate types as new fields appear. It even supports &lt;a href="https://clickhouse.com/blog/a-new-powerful-json-data-type-for-clickhouse" rel="noopener noreferrer"&gt;fields with multiple types&lt;/a&gt; and gracefully handles schema explosion.&lt;/p&gt;

&lt;p&gt;Despite these advantages, we’re still evaluating how well JSON fits common observability access patterns at scale. For example, querying a string across an entire JSON blob can effectively involve scanning thousands of columns. There are workarounds - such as also storing a raw string version of the JSON alongside the structured data - but we’re still developing best practices in this area. &lt;/p&gt;
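
&lt;p&gt;The raw-string workaround mentioned above can be sketched as a table that stores both representations side by side (an illustrative schema, not our production DDL):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;CREATE TABLE events
(
    timestamp DateTime,
    body_json JSON,     -- typed, dynamically created subcolumns
    body_raw  String    -- full blob for whole-payload string search
)
ENGINE = MergeTree
ORDER BY timestamp;

-- A substring search scans one String column instead of
-- potentially thousands of dynamic subcolumns:
SELECT count() FROM events WHERE body_raw LIKE '%timeout%';
&lt;/code&gt;&lt;/pre&gt;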

&lt;p&gt;Culturally, we have also come to recognize the practical limits of the Map type, which has served us well. Most of our log and resource attributes are small and stable enough that the Map continues to be the right fit. We have found that single-level JSON logs are often all you need, and for exceptions, tools like HyperDX automatically translate map access into &lt;a href="https://clickhouse.com/docs/sql-reference/functions/json-functions#jsonextract" rel="noopener noreferrer"&gt;JSONExtract&lt;/a&gt; functions. While we plan to adopt JSON more broadly, this is still a work in progress. Expect us to share more in a future update.&lt;/p&gt;
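
&lt;p&gt;For reference, the kind of translation described above maps a Map-style lookup onto a &lt;code&gt;JSONExtract&lt;/code&gt; call over a raw string column (the column names here are illustrative):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;-- Access on a Map(String, String) column:
SELECT LogAttributes['pod_name'] FROM logs;

-- Equivalent access against a raw JSON string column:
SELECT JSONExtractString(Body, 'pod_name') FROM logs;
&lt;/code&gt;&lt;/pre&gt;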

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Over the past year, LogHouse has evolved from an ambitious logging system into a foundational observability platform powering everything from performance analysis to real-time debugging across ClickHouse Cloud. What began as a cost-saving measure has become a catalyst for both cultural and technical transformation, shifting us toward high-fidelity, wide-event telemetry at massive scale. By combining specialized tools like SysEx with general-purpose frameworks like OpenTelemetry, and layering on flexible interfaces like HyperDX, we have built a system that not only keeps up with our growth but also unlocks entirely new workflows. The journey is far from over, but the lessons from scaling to 100PB and 500 trillion rows continue to shape how we think about observability as a core data problem we are solving at warehouse scale.&lt;/p&gt;

</description>
      <category>devops</category>
      <category>opensource</category>
      <category>database</category>
      <category>tooling</category>
    </item>
  </channel>
</rss>
