<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Aman Puri</title>
    <description>The latest articles on DEV Community by Aman Puri (@dataengineeringguide).</description>
    <link>https://dev.to/dataengineeringguide</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3740291%2F0f5e33c4-0b9c-4799-84dc-fc2ac636b55c.png</url>
      <title>DEV Community: Aman Puri</title>
      <link>https://dev.to/dataengineeringguide</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/dataengineeringguide"/>
    <language>en</language>
    <item>
      <title>How to build real-time customer-facing analytics on Postgres (without slowing down OLTP)</title>
      <dc:creator>Aman Puri</dc:creator>
      <pubDate>Fri, 26 Jun 2026 19:09:20 +0000</pubDate>
      <link>https://dev.to/dataengineeringguide/real-time-analytics-postgres-4o5e</link>
      <guid>https://dev.to/dataengineeringguide/real-time-analytics-postgres-4o5e</guid>
      <description>&lt;p&gt;SaaS applications rely on Postgres as the source of truth for user, product, and transactional data. But exposing that data back to users through real-time, in-product dashboards creates a problem: analytical queries frequently grind your transactional database to a halt.&lt;/p&gt;

&lt;p&gt;Customer-facing dashboards aren't like internal BI reports. They live inside your product, where every active user means another concurrent query. Your latency budget here is tight. Web performance research shows that &lt;a href="https://www.nngroup.com/articles/response-times-3-important-limits" rel="noopener noreferrer"&gt;users perceive responses under 100 milliseconds as instantaneous&lt;/a&gt;. Slower than that starts eroding the experience; anything in seconds feels broken.&lt;/p&gt;

&lt;p&gt;This guide walks through a two-stage approach for scaling Postgres-backed analytics. We'll cover how to push native Postgres to its limits using materialized views and advanced indexing, then define the exact triggers that tell you when to move on. From there, we'll detail how to implement a real-time OLAP architecture that eliminates the operational headache of DIY data pipelines. The goal: sub-100 ms dashboards without compromising your primary database.&lt;/p&gt;

&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Customer-facing analytics is a different workload from BI:&lt;/strong&gt; it needs &lt;strong&gt;sub-100 ms p95&lt;/strong&gt; on analytical queries, high concurrency, freshness within the SLA tier you commit to customers, strict tenant isolation, and predictable cost.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Stage 1 (Postgres-only):&lt;/strong&gt; use &lt;strong&gt;pre-aggregated tables/materialized views&lt;/strong&gt;, composite indexes starting with &lt;code&gt;tenant_id&lt;/code&gt;, careful RLS, and BRIN for time-series.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Graduate when 2+ breaking points hit:&lt;/strong&gt; buffer cache churn, million-row scans per request, high-cardinality index bloat, tighter freshness SLAs, or ad-hoc filtering needs.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Stage 2 (Postgres + CDC + real-time OLAP):&lt;/strong&gt; stream changes via CDC into a &lt;strong&gt;columnar OLAP database (e.g., ClickHouse)&lt;/strong&gt; and serve dashboards from OLAP, not replicas or warehouses.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Avoid DIY CDC outages:&lt;/strong&gt; replication slots/WAL retention can fill disks; prefer &lt;strong&gt;managed CDC connectors&lt;/strong&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  What customer-facing analytics requires (eight non-negotiables)
&lt;/h2&gt;

&lt;p&gt;Most database performance escalations happen because teams treat all data retrieval as one big "analytics" workload. When engineers complain that Postgres can't handle their analytics anymore, they're usually lumping together four completely different workloads.&lt;/p&gt;

&lt;p&gt;OLTP handles high-throughput, row-level reads and writes powering your core application. Internal BI runs heavy aggregations for executives and analysts, where queries can safely take minutes. Ad-hoc data science is unpredictable, exploratory work running in notebooks. Customer-facing analytics is the in-product dashboards and reporting exposed directly to your end users.&lt;/p&gt;

&lt;p&gt;Customer-facing analytics is by far the strictest of these four. It demands sub-100 ms latency like a hot OLTP query path combined with complex aggregations like BI. It requires strict tenant isolation to prevent data leaks. It must support concurrency that scales linearly with your active user base.&lt;/p&gt;

&lt;p&gt;Because of this profile, any solution needs to meet eight non-negotiable requirements. Use these as your benchmark for deciding when to optimize Postgres and when to move to a dedicated OLAP database.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Sub-100 ms p95 query latency
&lt;/h3&gt;

&lt;p&gt;For an API response that feels instant, your data store needs to return results in under 100 milliseconds at p95 on analytical queries. If your database takes 500 milliseconds to aggregate data, network overhead and front-end rendering will push total latency well past what feels sluggish.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Linear concurrency scaling
&lt;/h3&gt;

&lt;p&gt;In an internal BI tool, fifty concurrent queries is heavy load. In a customer-facing dashboard, fifty concurrent queries is a quiet Tuesday morning. Your data layer needs to handle hundreds or thousands of simultaneous requests without exhausting connection pools or queueing queries behind each other.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Tunable freshness per dashboard
&lt;/h3&gt;

&lt;p&gt;Different features need different freshness guarantees. A monthly billing rollup tolerates a 30-minute delay. A live marketplace inventory dashboard or security audit log needs seconds-level freshness. Your architecture needs to support streaming ingestion for those strict SLAs.&lt;/p&gt;

&lt;p&gt;Pick the freshness tier intentionally per dashboard, not as a single global setting. Each tier maps to a different architecture:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Freshness tier&lt;/th&gt;
&lt;th&gt;Recommended architecture&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;| 5 to 30 minutes | Scheduled rollup jobs against Postgres summary tables; standard cron or workers |&lt;br&gt;
| 1 to 5 minutes | Incremental rollup workers; hot summary tables refreshed per write batch |&lt;br&gt;
| 1 to 10 seconds | CDC stream from Postgres into a real-time OLAP database like ClickHouse |&lt;/p&gt;

&lt;p&gt;Most B2B SaaS customer-facing dashboards live in the 1 to 10 second tier. Tighter than that is rare and usually only matters for fraud signals or trading-style tooling.&lt;/p&gt;

&lt;p&gt;The looser tiers are not a free pass. As data volume grows, the rollup jobs themselves become expensive: minute-to-hour-long aggregations that consume primary database resources and contend with OLTP. The granularity of rollups you can afford shrinks as the underlying data grows. &lt;a href="https://clickhouse.com/blog/how-polymarket-scales-data-with-postgres-and-clickhouse" rel="noopener noreferrer"&gt;Polymarket hit this&lt;/a&gt; even though their customer-facing dashboards did not need second-level freshness.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Strict tenant isolation
&lt;/h3&gt;

&lt;p&gt;B2B SaaS applications are inherently multi-tenant. Every analytical query must be strictly bounded by a tenant identifier. Whether you implement this through API-layer authorization or database-level Row Level Security, cross-tenant data leakage is catastrophic. Your architecture must prevent it by design.&lt;/p&gt;

&lt;h3&gt;
  
  
  5. Pre-aggregations and materialized views
&lt;/h3&gt;

&lt;p&gt;You can't compute heavy metrics from raw event data on every page load. Your system needs to pre-aggregate data into summary tables or materialized views, absorbing compute cost at ingestion time rather than query time.&lt;/p&gt;

&lt;h3&gt;
  
  
  6. High-cardinality column support
&lt;/h3&gt;

&lt;p&gt;SaaS analytics frequently require filtering and grouping by high-cardinality dimensions: user IDs, API keys, session tokens, unique SKU numbers. Your database needs to index and query these columns efficiently without suffering from massive index bloat or degraded scan performance.&lt;/p&gt;

&lt;h3&gt;
  
  
  7. Time-series-friendly storage
&lt;/h3&gt;

&lt;p&gt;Dashboard queries are almost exclusively time-windowed: "Show me API requests over the last 7 days grouped by hour." Your storage engine needs to be optimized for time-series access patterns, physically ordering data by time so the query engine can skip irrelevant date ranges.&lt;/p&gt;

&lt;h3&gt;
  
  
  8. Predictable cost scaling
&lt;/h3&gt;

&lt;p&gt;As your user base grows, the cost of serving analytical queries needs to scale predictably. Architectures that couple storage and compute too tightly, or price based on scanned data volume per query, will produce explosive cloud bills as product adoption increases.&lt;/p&gt;

&lt;p&gt;A common failure mode is running raw analytical queries directly against your OLTP tables. Complex aggregations over the live transactional schema scan large volumes of data and evict hot operational data from Postgres shared buffers. As cache hit ratio drops, your transactional application slows down, lock contention spikes, and dashboards become unusable.&lt;/p&gt;

&lt;p&gt;The pattern that does work for early-stage Postgres analytics is pre-aggregation: absorb the compute cost at ingestion time into summary tables, and serve dashboards from those rather than the OLTP schema. That is exactly what Stage 1 is.&lt;/p&gt;

&lt;h2&gt;
  
  
  Stage 1: run customer-facing analytics on Postgres only
&lt;/h2&gt;

&lt;p&gt;Most B2B SaaS startups should start their analytics journey entirely within Postgres. The operational benefits of a small surface area (a single database, backup strategy, and security boundary) are worth more than the performance ceiling at this stage.&lt;/p&gt;

&lt;p&gt;The sweet spot for Stage 1: an early-stage product, moderate data volumes, and dashboards that can tolerate one to five minute data freshness (matching the rollup-tier of the freshness SLA matrix above).&lt;/p&gt;

&lt;h3&gt;
  
  
  How Stage 1 works
&lt;/h3&gt;

&lt;p&gt;Your application writes directly to normalized OLTP tables. Rather than querying these tables for dashboards, you implement scheduled or incremental background jobs to build pre-aggregated analytics tables. A cron job might aggregate raw user events into an &lt;code&gt;org_daily_metrics&lt;/code&gt; table. Your customer-facing API then reads strictly from these pre-aggregated tables, applying tenant filters for data isolation.&lt;/p&gt;

&lt;h3&gt;
  
  
  Postgres tactics that extend Stage 1
&lt;/h3&gt;

&lt;p&gt;To maximize Postgres for this workload, you need several specific database tactics.&lt;/p&gt;

&lt;p&gt;Composite indexes starting with the tenant identifier, like &lt;code&gt;(org_id, timestamp)&lt;/code&gt;, are mandatory. They ensure queries only scan data belonging to the requesting customer.&lt;/p&gt;

&lt;p&gt;If you implement Row Level Security (RLS) for tenant isolation, design your policies carefully. Naive RLS policies that join to permissions tables add tens to hundreds of milliseconds per query. Optimized RLS strategies use pre-calculated session variables to evaluate permissions, reducing overhead to under a millisecond.&lt;/p&gt;

&lt;p&gt;Materialized views offer a native way to pre-compute complex aggregations. But refreshing a standard materialized view acquires an exclusive lock, blocking all reads. To avoid this, use &lt;a href="https://www.postgresql.org/docs/current/sql-refreshmaterializedview.html" rel="noopener noreferrer"&gt;&lt;code&gt;REFRESH MATERIALIZED VIEW CONCURRENTLY&lt;/code&gt;&lt;/a&gt;. This allows concurrent reads, but requires the view to have at least one unique index.&lt;/p&gt;

&lt;h3&gt;
  
  
  When Stage 1 breaks
&lt;/h3&gt;

&lt;p&gt;For large time-series event tables, traditional B-tree indexes become massive and inefficient. &lt;a href="https://www.cybertec-postgresql.com/en/btree-vs-brin-2-options-for-indexing-in-postgresql-data-warehouses/" rel="noopener noreferrer"&gt;Block Range Indexes (BRIN)&lt;/a&gt; on timestamp columns can reduce index sizes by orders of magnitude. A BRIN index stores minimum and maximum values for physical block ranges, letting the query planner skip irrelevant blocks entirely during time-windowed queries.&lt;/p&gt;

&lt;p&gt;Even with these optimizations, Stage 1 has a ceiling. Postgres is row-oriented, built for transactional integrity rather than analytical scans.&lt;/p&gt;

&lt;p&gt;Define explicit triggers that tell you when to graduate from Stage 1 before your application falls over.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Technical breaking points:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Buffer cache churn:&lt;/strong&gt; Dashboard queries scan so much data they flush hot OLTP data from memory, causing production writes to slow down.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Row scan volume:&lt;/strong&gt; Your aggregations require scanning millions of rows per request, making it impossible to stay within 100 ms latency even with heavy indexing.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;High cardinality failure:&lt;/strong&gt; Analytics require filtering on highly cardinal dimensions like API keys or individual SKUs, leading to massive index bloat and degraded B-tree performance.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Business breaking points:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Freshness demands:&lt;/strong&gt; Customers start demanding sub-30-second data freshness at high volumes, making incremental rollups and concurrent materialized view refreshes too slow.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Ad-hoc filtering:&lt;/strong&gt; Users need to dynamically filter dashboards by custom attributes, breaking the utility of pre-calculated summary tables.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The rule is simple: when any two of these breaking points become true, you've outgrown native Postgres. At that point, evaluate offloading in Stage 2.&lt;/p&gt;

&lt;h2&gt;
  
  
  Stage 2: offload analytics with CDC to a real-time OLAP database
&lt;/h2&gt;

&lt;p&gt;When Postgres can no longer serve both high-throughput transactions and complex analytical reads, you need to physically separate the workloads.&lt;/p&gt;

&lt;h3&gt;
  
  
  How CDC-based offloading works
&lt;/h3&gt;

&lt;p&gt;Your application continues writing to Postgres for OLTP operations. Change Data Capture (CDC) streams committed row-level changes into a dedicated, real-time OLAP database. Your customer-facing dashboards then read from the OLAP database, enforcing &lt;code&gt;tenant_id&lt;/code&gt; filters at the API layer.&lt;/p&gt;

&lt;p&gt;This separation lets Postgres dedicate its memory and CPU entirely to serving the application, while the OLAP database handles analytical aggregations. The choice of OLAP database matters: not every analytical database is built for in-product latency.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why read replicas and cloud warehouses are the wrong fit for in-product dashboards
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Requirement&lt;/th&gt;
&lt;th&gt;Postgres read replica&lt;/th&gt;
&lt;th&gt;Cloud data warehouse (Snowflake/BigQuery)&lt;/th&gt;
&lt;th&gt;Real-time OLAP (ClickHouse)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Sub-100 ms p95 latency&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Limited (row-store scans cap throughput)&lt;/td&gt;
&lt;td&gt;Designed for batch BI (seconds to minutes)&lt;/td&gt;
&lt;td&gt;Yes (columnar store, sparse indexes)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Horizontal concurrency&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Limited (connection pool ceilings)&lt;/td&gt;
&lt;td&gt;Costly under high concurrency (queueing, scaling spend)&lt;/td&gt;
&lt;td&gt;Yes (scales linearly with nodes)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Freshness SLA&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Seconds (physical replication)&lt;/td&gt;
&lt;td&gt;Minutes to hours (batch-oriented)&lt;/td&gt;
&lt;td&gt;Seconds (CDC streaming)Most B2B SaaS customer-facing dashboards live in the 1 to 10 second tie&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;High-cardinality support&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Limited (index bloat on row-store)&lt;/td&gt;
&lt;td&gt;Yes (brute-force scan)&lt;/td&gt;
&lt;td&gt;Yes (designed for high-cardinality columns)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Predictable cost scaling&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Unpredictable for analytics (Postgres is built for OLTP; analytical scans do not scale linearly)&lt;/td&gt;
&lt;td&gt;Unpredictable for customer-facing concurrency&lt;/td&gt;
&lt;td&gt;Predictable (engine built for analytics; scales linearly with workload)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;A fourth option, headless analytics SaaS, sits on top of one of these stores rather than next to them. It is covered separately below.&lt;/p&gt;

&lt;p&gt;Postgres read replicas share the row-oriented limitations of the primary. They don't compress data effectively for analytics, and they still require heavy B-tree indexes that bloat memory. They make sense for offloading read-only traffic from the primary, not for high-concurrency analytical reads.&lt;/p&gt;

&lt;p&gt;Cloud data warehouses like Snowflake and BigQuery are traditional, batch-oriented engines built for internal BI and complex long-running queries. They are excellent at that workload.&lt;/p&gt;

&lt;p&gt;They were not designed for sub-100 ms in-product latency. Routing user-facing dashboards through a cloud data warehouse routinely produces queries taking several seconds, blowing past the 100 ms budget.&lt;/p&gt;

&lt;p&gt;Concurrency is also expensive. Cloud data warehouses are priced for batch jobs rather than customer-facing traffic: consumption-based compute, minimum-billed durations on resume, and horizontal scaling under concurrency &lt;a href="https://clickhouse.com/blog/how-cloud-data-warehouses-bill-you" rel="noopener noreferrer"&gt;combine to make bills hard to predict&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;A &lt;a href="https://clickhouse.com/" rel="noopener noreferrer"&gt;real-time OLAP database like ClickHouse&lt;/a&gt; or Apache Pinot is built for this profile. These systems are columnar, reading only the specific columns requested. They use sparse indexes to skip vast amounts of data without B-tree memory overhead, and they &lt;a href="https://clickhouse.com/resources/engineering/database-compression" rel="noopener noreferrer"&gt;compress data aggressively&lt;/a&gt; by grouping similar values together in column-oriented storage, which both reduces storage cost and lets queries scan less data per request.&lt;/p&gt;

&lt;h3&gt;
  
  
  Where headless analytics SaaS fits
&lt;/h3&gt;

&lt;p&gt;Headless analytics platforms like Cube, Hex embedded, Preset embedded, and GoodData sit on top of an OLAP database, not in place of one. They provide a semantic layer, an API surface, query caching, and an embed-friendly UI toolkit. That accelerates the front-end build, but it does not change the latency, concurrency, or freshness profile of the underlying store. Pair a headless analytics layer with a real-time OLAP database when you want a faster ship path on the UI side; do not treat it as a substitute for the OLAP database itself.&lt;/p&gt;

&lt;h3&gt;
  
  
  The operational risks of DIY CDC (replication slots and WAL growth)
&lt;/h3&gt;

&lt;p&gt;Moving to this architecture traditionally introduces a severe ETL Tax. Operating logical replication pipelines with tools like Debezium and Kafka turns a small engineering group into a distributed systems maintenance team.&lt;/p&gt;

&lt;p&gt;The primary risk centers on Postgres replication slots.&lt;/p&gt;

&lt;p&gt;By default, Postgres uses &lt;a href="https://postgresqlco.nf/doc/en/param/max_slot_wal_keep_size" rel="noopener noreferrer"&gt;&lt;code&gt;max_slot_wal_keep_size = -1&lt;/code&gt;&lt;/a&gt;, which lets replication slots retain unlimited Write-Ahead Log (WAL) files if a consumer falls behind or goes offline.&lt;/p&gt;

&lt;p&gt;If your self-hosted CDC consumer crashes silently on Friday evening, Postgres will dutifully retain every transaction log generated over the weekend. By Monday morning, hundreds of gigabytes of WAL files will have filled your primary disk, triggering emergency read-only mode and taking your entire SaaS application offline.&lt;/p&gt;

&lt;p&gt;Add schema drift and replication slot lag management, and the operational cost of DIY CDC quickly eclipses the value of the analytics dashboard.&lt;/p&gt;

&lt;p&gt;CDC operations are only one line item in the ETL Tax: pipeline engineering hours, warehouse compute, data staleness, schema drift incidents, vendor sprawl across DMS or Kinesis or Fivetran plus the warehouse plus monitoring. Hidden costs routinely run 2x to 4x the license fees. The right answer is not a better DIY pipeline; it is removing the pipeline from your ops surface entirely.&lt;/p&gt;

&lt;h3&gt;
  
  
  How managed CDC connectors reduce risk
&lt;/h3&gt;

&lt;p&gt;The solution is a managed, integrated CDC connector. &lt;a href="https://clickhouse.com/docs/cloud/managed-postgres" rel="noopener noreferrer"&gt;ClickHouse Managed Postgres&lt;/a&gt; is vanilla Postgres with open-source ClickHouse integration baked in. &lt;a href="https://clickhouse.com/cloud/clickpipes" rel="noopener noreferrer"&gt;ClickPipes&lt;/a&gt;, powered by PeerDB , is a single managed service that streams committed Postgres changes directly to ClickHouse, replacing what teams would otherwise build with Debezium, Kafka, and Kafka Connect.&lt;/p&gt;

&lt;p&gt;This turns a complex, high-risk data engineering problem into a simple configuration step. The platform handles WAL management, replication slot monitoring, and schema synchronization automatically, connecting OLTP source to OLAP destination without intermediate infrastructure.&lt;/p&gt;

&lt;h3&gt;
  
  
  A unified query layer with pg_clickhouse
&lt;/h3&gt;

&lt;p&gt;CDC moves the data. The second half of the integration story is how you query it. The &lt;a href="https://github.com/ClickHouse/pg_clickhouse" rel="noopener noreferrer"&gt;pg_clickhouse&lt;/a&gt; extension is an Apache 2.0 Postgres foreign data wrapper that exposes ClickHouse tables as foreign tables inside Postgres, with full pushdown of filters, &lt;code&gt;GROUP BY&lt;/code&gt;, &lt;code&gt;ORDER BY&lt;/code&gt;, &lt;code&gt;HAVING&lt;/code&gt;, and SEMI JOINs to ClickHouse's columnar engine.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;pg_clickhouse&lt;/code&gt; benefits every query path, customer-facing dashboards included. The API-layer pattern stays exactly as before: bounded endpoints, tenant_id enforced server-side, queries assembled by your service. What changes is the engine underneath. Queries against ClickHouse foreign tables get pushed down to the columnar engine automatically, so existing API code, ORMs, internal dashboards, and cron jobs all get columnar speedups without any rewrites or new clients.&lt;/p&gt;

&lt;h3&gt;
  
  
  How to model CDC data in ClickHouse
&lt;/h3&gt;

&lt;p&gt;Once data lands in ClickHouse, model it for the access pattern rather than copying the normalized Postgres schema unchanged.&lt;/p&gt;

&lt;p&gt;Denormalization into wide event tables is a strong default for the hottest customer-facing dashboards because it gives you the lowest possible query latency on a known access pattern.&lt;/p&gt;

&lt;p&gt;JOINs are also a first-class capability in ClickHouse, with all standard SQL join types (INNER, LEFT, RIGHT, FULL, CROSS, plus SEMI, ANTI, and ASOF), &lt;a href="https://clickhouse.com/docs/guides/joining-tables" rel="noopener noreferrer"&gt;multiple join algorithms&lt;/a&gt; (hash, parallel hash, grace hash, full sorting merge, partial merge, direct), automatic optimization with statistics, and continued performance gains in 2025 and 2026. Use JOINs where they keep your schema simpler and your data easier to manage; reach for denormalization where you need the absolute floor on query latency.&lt;/p&gt;

&lt;p&gt;For high-volume CDC and upsert flows, ClickHouse provides the &lt;a href="https://clickhouse.com/docs/engines/table-engines/mergetree-family/replacingmergetree" rel="noopener noreferrer"&gt;&lt;code&gt;ReplacingMergeTree&lt;/code&gt;&lt;/a&gt; table engine, which identifies rows with the same primary key and keeps the latest version. Background merges deduplicate asynchronously, and the &lt;code&gt;FINAL&lt;/code&gt; keyword in &lt;code&gt;SELECT&lt;/code&gt; guarantees deduplicated results at query time. &lt;code&gt;FINAL&lt;/code&gt; has been significantly optimized for production use, so reach for it whenever a dashboard query needs immediate consistency.&lt;/p&gt;

&lt;p&gt;Outside CDC, ClickHouse also supports &lt;a href="https://clickhouse.com/docs/updating-data/overview" rel="noopener noreferrer"&gt;standard SQL &lt;code&gt;UPDATE&lt;/code&gt; and &lt;code&gt;DELETE&lt;/code&gt;&lt;/a&gt; for individual or scattered row changes, and &lt;code&gt;ALTER TABLE&lt;/code&gt; operations or partition drops for bulk data management. Pick the mechanism that matches the shape of the change rather than treating any one engine as the only option.&lt;/p&gt;

&lt;p&gt;Where the source data is naturally immutable (event logs, request logs, telemetry), favor an append-only model. It keeps the schema simple and lets the merge process do less work.&lt;/p&gt;

&lt;h2&gt;
  
  
  Customer-facing analytics patterns (three real SaaS examples)
&lt;/h2&gt;

&lt;p&gt;Here's how this maturity journey maps to real-world SaaS environments. Three distinct business models, each handling the transition from Stage 1 Postgres constraints to Stage 2 real-time OLAP.&lt;/p&gt;

&lt;p&gt;This path is well-trodden. Polymarket &lt;a href="https://clickhouse.com/blog/how-polymarket-scales-data-with-postgres-and-clickhouse" rel="noopener noreferrer"&gt;scaled customer-facing prediction-market data by pairing Postgres with ClickHouse&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Pattern 1: product usage dashboards in SaaS
&lt;/h3&gt;

&lt;p&gt;A project management SaaS provides workspace admins with a dashboard tracking daily active sessions and feature adoption. In Stage 1, the team relies on an &lt;code&gt;org_daily_metrics&lt;/code&gt; rollup table in Postgres, refreshed every minute via a background worker.&lt;/p&gt;

&lt;p&gt;As the user base crosses millions of daily events, the rollup window starts overrunning and the materialized-view refresh begins locking the primary database during peak write hours. Customers also start asking for sub-30-second freshness on the live-session counter. They graduate to Stage 2, streaming raw application events via CDC directly into a wide event table in ClickHouse.&lt;/p&gt;

&lt;p&gt;Dashboards now query ClickHouse directly. Workspace admins can filter adoption metrics dynamically by user role and date range, with queries returning inside the dashboard's latency budget.&lt;/p&gt;

&lt;h3&gt;
  
  
  Pattern 2: marketplace seller dashboards
&lt;/h3&gt;

&lt;p&gt;An e-commerce marketplace provides merchants with a live view of gross merchandise value (GMV) and pending orders. In Stage 1, the team rolls up order metrics into per-merchant summary tables in Postgres, refreshed every couple of minutes by a background worker. That serves the long tail of small merchants comfortably, but power sellers with thousands of daily transactions cannot tolerate minute-scale staleness during a launch or a flash sale.&lt;/p&gt;

&lt;p&gt;They graduate to Stage 2 by implementing CDC to stream the order event log into ClickHouse. Using &lt;a href="https://clickhouse.com/docs/engines/table-engines/mergetree-family/aggregatingmergetree" rel="noopener noreferrer"&gt;&lt;code&gt;AggregatingMergeTree&lt;/code&gt;&lt;/a&gt;, they maintain real-time, incrementally updated materialized views of merchant GMV.&lt;/p&gt;

&lt;p&gt;The seller dashboard refreshes on a 30-second interval. By the time it does, the latest orders are already in ClickHouse, and queries return inside the dashboard's latency budget even during peak ordering windows.&lt;/p&gt;

&lt;h3&gt;
  
  
  Pattern 3: developer-facing usage analytics
&lt;/h3&gt;

&lt;p&gt;An API-as-a-Service platform needs to expose detailed usage analytics so developers can monitor latency, error rates, and cost forecasts. In Stage 1, the platform runs daily rollups grouped by &lt;code&gt;api_key_id&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;As developers demand real-time visibility into active incidents, daily rollups become obsolete. The platform starts streaming raw API request logs directly into ClickHouse, where months of request data sit efficiently behind sparse indexes and columnar compression.&lt;/p&gt;

&lt;p&gt;The developer dashboard now lets users drill into specific error codes over custom time windows, with queries returning inside the dashboard's latency budget.&lt;/p&gt;

&lt;h2&gt;
  
  
  Anti-patterns that break customer-facing analytics
&lt;/h2&gt;

&lt;p&gt;Scaling customer-facing analytics means avoiding several destructive architectural traps that engineering teams frequently fall into.&lt;/p&gt;

&lt;h3&gt;
  
  
  Anti-pattern: running OLAP queries on production Postgres
&lt;/h3&gt;

&lt;p&gt;This is the most common failure mode. Allowing internal BI tools or heavy customer-facing dashboards to execute large table scans on your primary Postgres database flushes your &lt;code&gt;shared_buffers&lt;/code&gt;. This evicts hot transactional data from memory, forcing your application to perform slow disk reads for routine operations. Your OLTP path chokes.&lt;/p&gt;

&lt;h3&gt;
  
  
  Anti-pattern: hand-rolled CDC pipelines without a data platform team
&lt;/h3&gt;

&lt;p&gt;Treating CDC infrastructure as a side project is a recipe for production outages. Unmonitored Kafka clusters, schema drift breaking downstream schemas, replication slot lag causing WAL disk exhaustion. These will consume your engineering team's cycles. If you can't staff a dedicated data platform team, rely on managed CDC connectors.&lt;/p&gt;

&lt;h3&gt;
  
  
  Anti-pattern: letting the front end send raw SQL
&lt;/h3&gt;

&lt;p&gt;Allowing the client application to generate and send raw SQL strings to your API is a security and maintenance hazard. Raw SQL bypasses semantic layers, makes query optimization impossible, and creates severe SQL injection risks.&lt;/p&gt;

&lt;p&gt;The right pattern is the opposite. Build stable, bounded endpoints (like &lt;code&gt;/api/analytics/overview&lt;/code&gt;) that securely resolve &lt;code&gt;tenant_id&lt;/code&gt; from the authentication token, enforce strict cardinality limits and date range boundaries, and accept only semantic parameters from the client. The API assembles optimized database queries server-side so malicious or buggy client code cannot issue unbounded analytical queries against your database.&lt;/p&gt;

&lt;h3&gt;
  
  
  Anti-pattern: treating materialized views as the long-term answer at scale
&lt;/h3&gt;

&lt;p&gt;Postgres materialized views are useful for Stage 1 freshness on the order of tens of seconds. They are not a permanent customer-facing analytics architecture. As event volume grows, refresh windows lengthen, &lt;code&gt;REFRESH MATERIALIZED VIEW CONCURRENTLY&lt;/code&gt; increasingly competes with primary writes for I/O, and the view's freshness SLA falls behind what customers see. Materialized views buy you headroom inside Stage 1; they do not let you skip the graduation to Stage 2.&lt;/p&gt;

&lt;h3&gt;
  
  
  Anti-pattern: row-level security as the only tenant isolation mechanism
&lt;/h3&gt;

&lt;p&gt;RLS is a useful safety net, but treating it as the sole isolation boundary couples tenant-scoped query performance to RLS policy execution. Naive RLS policies join to permissions tables and add tens to hundreds of milliseconds per query. Strong tenant isolation enforces &lt;code&gt;tenant_id&lt;/code&gt; at the API layer, partitions or indexes data on &lt;code&gt;tenant_id&lt;/code&gt;, and uses RLS as a last-line defense rather than the primary control.&lt;/p&gt;

&lt;h3&gt;
  
  
  Anti-pattern: measuring success on internal team load times
&lt;/h3&gt;

&lt;p&gt;Internal engineers running dashboards from a fast laptop on a fast network are a poor proxy for the customer experience. A SaaS customer's dashboard might be opened on a coffee-shop connection, halfway around the world, against a tenant with millions of events. Define your latency SLA on customer-side p95 and p99 over a representative population, not on internal team perception. Architectures that look fine internally regularly fail in production for the customers who matter most.&lt;/p&gt;

&lt;h2&gt;
  
  
  Build order: how to ship customer-facing analytics in six steps
&lt;/h2&gt;

&lt;p&gt;The architecture only works if you build it in the right order. Skipping a step routinely costs a quarter of engineering time recovering from a production fire.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;  &lt;strong&gt;Define dashboard metrics explicitly.&lt;/strong&gt; Resist "let users query anything" as a v1 goal. Pick the 8 to 15 metrics customers actually need, document the dimensions and date ranges they will filter on, and treat that document as the contract your API and storage will enforce.&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Ship Stage 1 with pre-aggregated summary tables in Postgres and indexed lookups.&lt;/strong&gt; Use composite indexes that lead with &lt;code&gt;tenant_id&lt;/code&gt;. Add &lt;code&gt;REFRESH MATERIALIZED VIEW CONCURRENTLY&lt;/code&gt; only after verifying the unique-index requirement. Use BRIN indexes on time columns for large append-only event tables.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Add an API-layer cache keyed on &lt;code&gt;(tenant_id, metric, date_range)&lt;/code&gt; with a 5 to 30 second TTL.&lt;/strong&gt; This single layer absorbs the bulk of dashboard concurrency before any database tuning becomes necessary, and it is straightforward to add or remove without touching the storage architecture.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Graduate to Stage 2 when two or more Stage 1 triggers fire.&lt;/strong&gt; Stream Postgres changes into a real-time OLAP database via CDC. With &lt;a href="https://clickhouse.com/" rel="noopener noreferrer"&gt;ClickHouse Managed Postgres&lt;/a&gt;, you don't migrate from Postgres; you activate ClickHouse alongside it from the same console. Your application keeps querying its Postgres connection string. pg_clickhouse transparently routes analytical queries to ClickHouse foreign tables, so the same API endpoints, ORMs, and dashboards get columnar speedups without rewrites or new clients.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Expose freshness in the UI and alert on freshness lag p95 the way you alert on query latency.&lt;/strong&gt; A "Last updated 8 seconds ago" indicator next to each chart is enough to make the SLA visible to customers.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This order keeps the Stage 1 surface area small, defers infrastructure complexity until the workload actually requires it, and turns the Stage 2 graduation into a known, planned move instead of an emergency.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion: when to stay on Postgres or move to real-time OLAP
&lt;/h2&gt;

&lt;p&gt;Real-time customer-facing analytics has a strict baseline. An architecture that cannot satisfy all five of these is not yet a customer-facing analytics architecture:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Analytical queries return in under 100 ms at p95.&lt;/li&gt;
&lt;li&gt;  API responses complete in under 300 ms at p95.&lt;/li&gt;
&lt;li&gt;  Data is fresh within the SLA tier you committed to your customers.&lt;/li&gt;
&lt;li&gt;  Tenant data is securely isolated.&lt;/li&gt;
&lt;li&gt;  The workload imposes zero load on your primary OLTP database.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you're hitting Postgres scaling limits now, moving to real-time OLAP doesn't require a re-platforming project. ClickHouse Managed Postgres is vanilla Postgres with open-source ClickHouse integration: the OLTP database, the columnar OLAP database, managed CDC via ClickPipes, and a unified query layer via pg_clickhouse, behind one console. You don't migrate from Postgres. You activate ClickHouse next to it, and the Stage 1 to Stage 2 graduation becomes a configuration change rather than a quarter of pipeline engineering.&lt;/p&gt;

&lt;h2&gt;
  
  
  Frequently asked questions
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Can Postgres handle customer-facing analytics dashboards?
&lt;/h3&gt;

&lt;p&gt;Yes at small scale. It stops scaling once concurrency and scan volume grow, because analytical queries evict OLTP cache and increase contention. Use Postgres-only with pre-aggregations early, then offload analytics when latency or stability degrades.&lt;/p&gt;

&lt;h3&gt;
  
  
  What latency should customer-facing analytics target?
&lt;/h3&gt;

&lt;p&gt;Aim for &lt;strong&gt;sub-100 ms p95 query latency&lt;/strong&gt; on analytical queries at the data store so the full API response can stay under a few hundred milliseconds and feel instantaneous in-product.&lt;/p&gt;

&lt;h3&gt;
  
  
  When should I move analytics off Postgres?
&lt;/h3&gt;

&lt;p&gt;When &lt;strong&gt;two or more&lt;/strong&gt; breaking points appear, for example buffer cache churn, million-row scans per request, high-cardinality index bloat, stricter freshness SLAs, or users demand ad-hoc filtering that rollups can't support.&lt;/p&gt;

&lt;h3&gt;
  
  
  Is a Postgres read replica enough for analytics?
&lt;/h3&gt;

&lt;p&gt;No. Replicas are still row-stores with the same scan/index limitations, and heavy analytics remain slow and expensive while adding operational complexity.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why not use Snowflake or BigQuery for in-product dashboards?
&lt;/h3&gt;

&lt;p&gt;They are traditional, batch-oriented engines optimized for internal BI. Customer-facing dashboards need sub-100 ms latency, which is outside the design point of those engines, and high concurrency on consumption-priced compute can produce unpredictable bills.&lt;/p&gt;

&lt;h3&gt;
  
  
  What is the recommended architecture for real-time analytics on Postgres data?
&lt;/h3&gt;

&lt;p&gt;Keep Postgres for OLTP, use &lt;strong&gt;CDC&lt;/strong&gt; to stream changes into a &lt;strong&gt;real-time OLAP&lt;/strong&gt; database, and serve dashboards from the OLAP system with strict &lt;code&gt;tenant_id&lt;/code&gt; filtering at the API layer.&lt;/p&gt;

&lt;h3&gt;
  
  
  What's the biggest operational risk with DIY CDC (Debezium/Kafka)?
&lt;/h3&gt;

&lt;p&gt;Replication slots and WAL retention. If a consumer falls behind, WAL can grow until disk fills and Postgres goes read-only. Managed CDC reduces this risk with monitoring and backpressure handling.&lt;/p&gt;

&lt;h3&gt;
  
  
  How do I keep tenant data isolated in analytics?
&lt;/h3&gt;

&lt;p&gt;Enforce &lt;code&gt;tenant_id&lt;/code&gt; scoping in the API layer (and/or RLS) and design indexes/partitions around &lt;code&gt;tenant_id&lt;/code&gt; so queries only scan a tenant's data.&lt;/p&gt;

&lt;h3&gt;
  
  
  Should I use materialized views in Postgres for dashboards?
&lt;/h3&gt;

&lt;p&gt;Yes for Stage 1, but refresh behavior matters. &lt;code&gt;REFRESH MATERIALIZED VIEW CONCURRENTLY&lt;/code&gt; avoids blocking reads but requires a unique index and won't meet tight freshness at scale.&lt;/p&gt;

&lt;h3&gt;
  
  
  Do I need to denormalize data when moving from Postgres to OLAP?
&lt;/h3&gt;

&lt;p&gt;Denormalization into wide event tables is a strong default for the hottest customer-facing dashboards because it gives you the lowest possible query latency. JOINs are also fully supported in ClickHouse, with all standard SQL join types, multiple algorithms, and automatic optimization, so you can keep a more normalized schema where it makes the data easier to manage. Pick per-dashboard rather than applying a blanket rule.&lt;/p&gt;

&lt;h3&gt;
  
  
  How do updates and deletes work in ClickHouse with CDC?
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;ReplacingMergeTree&lt;/code&gt; is the standard pattern for high-volume CDC: updates land as new rows and background merges deduplicate asynchronously. Use the &lt;code&gt;FINAL&lt;/code&gt; keyword in &lt;code&gt;SELECT&lt;/code&gt; to get deduplicated results at query time; &lt;code&gt;FINAL&lt;/code&gt; has been significantly optimized for production. Outside CDC, &lt;a href="https://clickhouse.com/docs/updating-data/overview" rel="noopener noreferrer"&gt;standard SQL &lt;code&gt;UPDATE&lt;/code&gt; and &lt;code&gt;DELETE&lt;/code&gt;&lt;/a&gt; are also available for scattered row changes, and &lt;code&gt;ALTER TABLE&lt;/code&gt; operations cover bulk data management.&lt;/p&gt;

</description>
      <category>postgres</category>
      <category>analytics</category>
      <category>database</category>
    </item>
    <item>
      <title>Best real-time analytics database for star schema and fast joins (2026 guide)</title>
      <dc:creator>Aman Puri</dc:creator>
      <pubDate>Fri, 26 Jun 2026 19:08:50 +0000</pubDate>
      <link>https://dev.to/dataengineeringguide/real-time-analytics-star-schema-joins-3ife</link>
      <guid>https://dev.to/dataengineeringguide/real-time-analytics-star-schema-joins-3ife</guid>
      <description>&lt;p&gt;Modern OLAP databases like ClickHouse now handle star schemas and complex cross-table joins at massive scale. That gives data engineering teams a real choice: flatten data into one big table when the workload calls for it, or keep a normalized fact and dimension model when flexibility, storage efficiency, or update patterns matter more. Both are valid, and the right answer depends on your use case.&lt;/p&gt;

&lt;p&gt;This guide gives you a practical framework for evaluating modern join capabilities, performance limits, and operational tradeoffs.&lt;/p&gt;

&lt;p&gt;The old rule that real-time analytics requires aggressive denormalization? It's obsolete in 2026. Vectorized query execution, memory-efficient join algorithms, and automatic query optimization have eliminated this constraint. You don't have to sacrifice schema flexibility for query speed anymore.&lt;/p&gt;

&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;  Modern real-time OLAP databases can run star schemas with fast cross-table joins in 2026. You don't always need "one big table."&lt;/li&gt;
&lt;li&gt;  ClickHouse is the best fit when you need sub-second joins + high concurrency on normalized fact/dimension models.&lt;/li&gt;
&lt;li&gt;  Evaluate engines on (1) memory-efficient join algorithms, (2) automatic join optimization + statistics-based reordering, (3) runtime filters/predicate pushdown, (4) concurrency isolation, (5) real-time updates/upserts for dimensions, (6) ingestion vs query isolation.&lt;/li&gt;
&lt;li&gt;  ClickHouse handles both single-table aggregations and multi-join analytical queries. Apache Druid/Pinot excel at single-table, high-QPS aggregations but are more limited when multi-table joins are required.&lt;/li&gt;
&lt;li&gt;  Denormalize only for edge cases like single-digit ms latency, very large dimensions, or expensive distributed shuffle joins you can't avoid.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Why denormalization is no longer required for real-time analytics
&lt;/h2&gt;

&lt;p&gt;For over a decade, denormalized wide tables were the default path to sub-second dashboard performance, especially when managing &lt;a href="https://dataanalyticsguide.substack.com/p/tail-latency-p99-real-time-analytics" rel="noopener noreferrer"&gt;tail latency (p99)&lt;/a&gt;. Older analytical databases struggled with multiple joins under high concurrency, so flattening data into a single table was often the most reliable way to hit strict latency SLAs.&lt;/p&gt;

&lt;p&gt;The 'one big table' approach is still a strong fit for many workloads, particularly when ingestion patterns are append-only and dimensions rarely change. At petabyte scale, though, it comes with tradeoffs around storage, update flexibility, and schema evolution that are worth weighing against a normalized alternative.&lt;/p&gt;

&lt;p&gt;Denormalization comes with three tradeoffs to weigh against its performance benefits.&lt;/p&gt;

&lt;p&gt;First, repeating dimension values like user demographics or product categories across fact rows &lt;a href="https://www.devoteam.com/expert-view/improve-bigquery-performance-normalisation-vs-denormalisation/" rel="noopener noreferrer"&gt;increases storage footprint, which can meaningfully affect cloud bills at billion-row scale&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Second, dimension updates become more expensive. Changing a single attribute may require rewriting historical partitions, which can introduce variance into streaming pipelines.&lt;/p&gt;

&lt;p&gt;Third, ad-hoc multi-dimensional analysis is harder to support without maintaining additional pre-joined tables for new query patterns.&lt;/p&gt;

&lt;p&gt;None of these are dealbreakers, and many teams accept them in exchange for the predictability of single-table scans. The right choice depends on your workload, update patterns, and operational preferences.&lt;/p&gt;

&lt;p&gt;Modern real-time analytics databases have expanded what's possible on the join side. Today's engines implement advanced join reordering, bloom filter pushdowns, and parallel execution strategies, dynamically pruning data before scanning.&lt;/p&gt;

&lt;p&gt;The result is that normalized fact and dimension tables, star schemas, and snowflake schemas are now viable options for sub-second customer-facing dashboards, alongside denormalized wide tables. You have a real choice between logical data modeling and pre-joined wide tables, and the best fit depends on your workload.&lt;/p&gt;

&lt;h2&gt;
  
  
  How to evaluate real-time analytics databases for star schema joins
&lt;/h2&gt;

&lt;p&gt;If you're evaluating real-time OLAP databases broadly, our &lt;a href="https://clickhouse.com/resources/engineering/how-to-choose-a-database-for-real-time-analytics-in-2026" rel="noopener noreferrer"&gt;complete evaluation framework&lt;/a&gt; covers all five criteria in depth: latency and concurrency, ingestion throughput, SQL flexibility, cost efficiency, and operational complexity.&lt;/p&gt;

&lt;p&gt;For star schema workloads specifically, three requirements matter most: join execution, dimension mutability, and query-ingest isolation. Here's how different architecture classes perform.&lt;/p&gt;

&lt;h3&gt;
  
  
  Requirement 1: Which join algorithms prevent out-of-memory failures?
&lt;/h3&gt;

&lt;p&gt;To avoid catastrophic query failures, a database needs modern algorithms like grace hash joins (which safely spill to disk when memory runs out) and parallel hash joins. Relying solely on basic broadcast joins will cause out-of-memory errors on large dimensions. It's just a matter of time.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Modern real-time OLAP (ClickHouse):&lt;/strong&gt; Built for relational modeling at speed. These systems offer vectorized execution and &lt;a href="https://clickhouse.com/docs/guides/joining-tables#choosing-a-join-algorithm" rel="noopener noreferrer"&gt;multiple join algorithms&lt;/a&gt; including hash joins, sort-merge joins, grace hash joins, and more, with the query planner selecting the best strategy based on table sizes and available memory. This makes them well-suited for complex star schema queries.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Append-only event stores (Apache Pinot, Apache Druid):&lt;/strong&gt; Fast for single-table aggregations, but their architectures are limited. They primarily support lookup and broadcast joins, which restricts them to simple dimension table enrichments. Complex multi-table analytical queries are either unsupported or operationally constrained.&lt;/p&gt;

&lt;h3&gt;
  
  
  Requirement 2: Does it support updates and upserts for slowly changing dimensions?
&lt;/h3&gt;

&lt;p&gt;Managing slowly changing dimensions within a star schema requires lightweight updates and upserts. You can't afford to rewrite large historical data partitions for every change.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Modern real-time OLAP (ClickHouse):&lt;/strong&gt; These databases natively support row-level updates and deletes. In ClickHouse, both operations use the same patch parts mechanic: a small mask or patch file applied immediately for instant consistency, then materialized during background merges. For managing slowly changing dimensions, the ReplacingMergeTree engine handles deduplication by version during merges, keeping only the latest row for each key. Ideal for managing mutating entities in a normalized schema.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Append-only event stores (Apache Pinot, Apache Druid):&lt;/strong&gt; Neither engine supports SQL standard UPDATE or DELETE statements. Their segment-based, append-only architecture makes real-time mutability operationally complex and inefficient. Simple updates typically require re-ingesting and rebuilding entire data segments.&lt;/p&gt;

&lt;h3&gt;
  
  
  Requirement 3: Can it isolate ingestion from join-heavy queries?
&lt;/h3&gt;

&lt;p&gt;A real-time system needs to stream thousands of events per second into large fact tables while simultaneously serving sub-second slice-and-dice queries. Without resource contention.&lt;/p&gt;

&lt;p&gt;Different systems take different paths to isolation. Apache Druid and Pinot bake it into the architecture with dedicated broker, router, and server roles, and Pinot adds workload-based resource isolation with named CPU and memory budgets. This works, but operators have to provision and manage multiple specialized components from day one.&lt;/p&gt;

&lt;p&gt;ClickHouse handles isolation through resource management and workload scheduling, letting teams specialize compute when it makes sense rather than mandating it upfront. ClickHouse Cloud goes further by separating storage and compute via SharedMergeTree, then layering specialized compute services on top: dedicated read-write services for ingestion, independently scaling read-only services for query traffic, and isolated compute pools that share the same underlying data. The result is near-perfect isolation for mixed workloads without managing discrete node tiers or duplicating data across clusters.&lt;/p&gt;

&lt;h3&gt;
  
  
  Summary: ClickHouse vs. Druid/Pinot for star schema joins
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Core requirement&lt;/th&gt;
&lt;th&gt;ClickHouse&lt;/th&gt;
&lt;th&gt;Apache Druid / Pinot&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Memory-efficient join algorithms&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Excellent (Grace Hash, Parallel Hash, Shuffle)&lt;/td&gt;
&lt;td&gt;Limited (Primarily lookup/broadcast joins)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Real-time data mutability&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Excellent (Row-level updates and deletes via patch parts)&lt;/td&gt;
&lt;td&gt;Limited (Segment-based, append-only model)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Ingestion vs. query isolation&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Excellent (Resource management, workload scheduling, plus dedicated read-only and read-write compute services in Cloud)&lt;/td&gt;
&lt;td&gt;Good (Dedicated broker/router nodes; Pinot supports workload-based resource isolation)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;For a full comparison across all evaluation dimensions, including concurrency, cost efficiency, and operational complexity, see our &lt;a href="https://clickhouse.com/resources/engineering/how-to-choose-a-database-for-real-time-analytics-in-2026" rel="noopener noreferrer"&gt;complete real-time OLAP evaluation guide&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  How ClickHouse executes fast star schema joins
&lt;/h2&gt;

&lt;p&gt;Before 2024, industry discussions often categorized ClickHouse as a single-table engine that struggled with complex schemas. That was historically grounded, but recent releases (&lt;a href="https://clickhouse.com/blog/clickhouse-release-25-09" rel="noopener noreferrer"&gt;25.9&lt;/a&gt; and beyond) have fundamentally changed ClickHouse's join execution capabilities. &lt;a href="https://clickhouse.com/blog/join-me-if-you-can-clickhouse-vs-databricks-snowflake-join-performance" rel="noopener noreferrer"&gt;Benchmarks running 17 join-heavy queries&lt;/a&gt; across fact and dimension tables showed ClickHouse Cloud was faster and cheaper than Snowflake and Databricks at every scale, from 721 million to 7.2 billion rows, with zero tuning.&lt;/p&gt;

&lt;p&gt;The numbers back this up. &lt;a href="https://clickhouse.com/blog/clickhouse-release-25-09#join-reordering" rel="noopener noreferrer"&gt;Automatic global join reordering&lt;/a&gt;, introduced in version 25.9, uses a greedy optimization algorithm with column statistics to determine optimal join order across multi-table queries. On a six-table TPC-H query (scale factor 100), enabling join reordering with column statistics improved execution from &lt;a href="https://clickhouse.com/blog/clickhouse-release-25-09#benchmarks-tpc-h-results" rel="noopener noreferrer"&gt;3,903 seconds to 2.7 seconds, a 1,450x speedup with 25x less memory&lt;/a&gt;. As of &lt;a href="https://clickhouse.com/blog/clickhouse-release-25-10#automatically-build-column-statistics-for-mergetree-tables" rel="noopener noreferrer"&gt;version 25.10&lt;/a&gt;, column statistics can be created automatically, making this optimization available without manual setup.&lt;/p&gt;

&lt;p&gt;These performance gains come from a series of optimizations across recent releases.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://clickhouse.com/blog/clickhouse-release-25-09#join-reordering" rel="noopener noreferrer"&gt;Global join reordering&lt;/a&gt; (25.9) evaluates table cardinality estimates to determine optimal build and probe sides for multi-table queries. &lt;a href="https://clickhouse.com/blog/clickhouse-release-25-10#bloom-filters-in-joins" rel="noopener noreferrer"&gt;Runtime bloom filters&lt;/a&gt; (25.10) generate a filter on the build side and apply it as a pre-filter on the probe side, skipping irrelevant rows before the join executes, delivering a &lt;a href="https://clickhouse.com/blog/clickhouse-release-25-10#running-the-query-with-and-without-runtime-filtering" rel="noopener noreferrer"&gt;2.1x speedup and 7x reduction in memory consumption&lt;/a&gt; on TPC-H benchmarks. &lt;a href="https://clickhouse.com/blog/clickhouse-release-25-10#lazy-columns-replication-in-joins" rel="noopener noreferrer"&gt;Lazy columns replication&lt;/a&gt; (25.10) avoids unnecessary copying of large string values during joins, achieving over 20x faster execution on self-join workloads.&lt;/p&gt;

&lt;p&gt;ClickHouse supports &lt;a href="https://clickhouse.com/docs/guides/joining-tables#choosing-a-join-algorithm" rel="noopener noreferrer"&gt;multiple join algorithms&lt;/a&gt; including parallel hash join, grace hash join, full sorting merge join, and more, with the query planner selecting the best strategy based on table sizes and available memory.&lt;/p&gt;

&lt;p&gt;To see how these optimizations perform on real join workloads, explore the &lt;a href="https://github.com/ClickHouse/coffeeshop-benchmark" rel="noopener noreferrer"&gt;coffeeshop benchmark&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Beyond standard relational joins, ClickHouse has eliminated the performance tax on semi-structured data. The &lt;a href="https://clickhouse.com/docs/sql-reference/data-types/newjson" rel="noopener noreferrer"&gt;native JSON type&lt;/a&gt; shreds JSON into dynamic sub-columns transparently with no upfront schema definition required. This matters because real-world data varies: some workloads have semi-structured data with a known set of keys (e.g., application event payloads), while others involve completely dynamic JSON with thousands of unpredictable keys (e.g., user-generated metadata or IoT telemetry). ClickHouse handles both cases efficiently using &lt;a href="https://clickhouse.com/blog/json-data-type-gets-even-better" rel="noopener noreferrer"&gt;advanced serialization techniques&lt;/a&gt; that store JSON paths as compressed columnar subcolumns.&lt;/p&gt;

&lt;p&gt;The result: no parsing bottleneck at query time, even when joining tables that contain complex JSON documents.&lt;/p&gt;

&lt;p&gt;The infrastructure architecture ensures these complex joins run without impacting ingestion. &lt;a href="https://clickhouse.com/cloud" rel="noopener noreferrer"&gt;ClickHouse Cloud&lt;/a&gt; separates storage and compute using the &lt;a href="https://clickhouse.com/docs/cloud/reference/shared-merge-tree" rel="noopener noreferrer"&gt;SharedMergeTree engine&lt;/a&gt;, so distributed queries scale on stateless compute nodes while background merges run independently. For ops teams, this means user-facing queries stay responsive during heavy ingestion. For developers, the same query patterns work identically whether running locally or on a production cluster.&lt;/p&gt;

&lt;h2&gt;
  
  
  When to normalize vs. denormalize for real-time analytics workloads
&lt;/h2&gt;

&lt;p&gt;Different analytical workloads need different data modeling strategies. Here's when to use a normalized star schema versus falling back to traditional denormalization.&lt;/p&gt;

&lt;h3&gt;
  
  
  Recipe: Star schema for customer-facing dashboards
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Decision logic:&lt;/strong&gt; This workload demands exploratory, ad-hoc multi-dimensional slice-and-dice capabilities from many concurrent external users. A star schema works well here because it gives analysts and customers flexibility to ask any question without being constrained by a rigid, pre-joined wide table. Storing dimensions separately simplifies upstream streaming ingestion pipelines and optimizes storage costs by preventing data duplication.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Recipe:&lt;/strong&gt; Keep dimension tables small and ensure they replicate across all compute nodes for fast broadcast joins. Use materialized views to pre-aggregate the heaviest fact table patterns, but perform final analytical joins against normalized dimension tables at query time for flexibility. Index your dimension tables properly to maximize runtime filter performance.&lt;/p&gt;

&lt;h3&gt;
  
  
  Recipe: Star schema for e-commerce analytics with slowly changing dimensions
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Decision logic:&lt;/strong&gt; Modern retail analytics requires joining real-time, high-volume clickstream logs with frequently updating dimension tables: fluctuating product inventory, dynamic pricing, evolving user profiles. These are classic Slowly Changing Dimensions. A star schema handles constant updates efficiently without re-processing billions of historical event rows, which would immediately cripple pipeline performance.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Recipe:&lt;/strong&gt; Use the &lt;a href="https://clickhouse.com/docs/en/engines/table-engines/mergetree-family/replacingmergetree/" rel="noopener noreferrer"&gt;ReplacingMergeTree engine&lt;/a&gt; for dimension tables that change frequently (users, products, pricing). ReplacingMergeTree handles deduplication by version or timestamp during background merges, keeping only the latest row for each key. For individual row corrections (e.g., fixing a single product price), use Lightweight Updates, which apply instantly via patch parts with minimal overhead. At query time, join these dimensions against append-only fact tables using ClickHouse's &lt;a href="https://clickhouse.com/docs/guides/joining-tables#choosing-a-join-algorithm" rel="noopener noreferrer"&gt;optimized join algorithms&lt;/a&gt;. Your dashboard reflects the latest data without pipeline lag.&lt;/p&gt;

&lt;h3&gt;
  
  
  Recipe: Hybrid schema for observability and telemetry (minimal joins)
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Decision logic:&lt;/strong&gt; Observability workloads prioritize raw ingestion speed over complex relational mapping. The primary goal is ingesting millions of log lines or trace spans per second while enabling simple, fast filtering on a single table. Cross-table joins are rare and typically isolated to low-cardinality metadata enrichment, like mapping an IP address to a geographic location. The sheer volume and velocity of machine-generated data makes query-time joins for every single query impractical.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Recipe:&lt;/strong&gt; Rely primarily on a denormalized wide-table schema or use the native JSON type to ingest raw, unstructured logs without strict schemas. For necessary metadata enrichment, skip traditional SQL joins. Use ClickHouse's optimized dictionary functions for point lookups instead. They're faster than hash joins and consume a fraction of the memory.&lt;/p&gt;

&lt;h3&gt;
  
  
  When a single wide table is still the right choice
&lt;/h3&gt;

&lt;p&gt;Modern engines have made star schemas viable for real-time workloads, but denormalization remains the right call for many use cases. Flatten data into a single table when:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Ultra-low latency requirements:&lt;/strong&gt; When sub-second rendering is too slow and your application demands single-digit millisecond response times, typical of real-time bidding or algorithmic ad tech.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Static, unchanging event streams:&lt;/strong&gt; If your event data is written once and never logically updated, the primary downside of denormalization (update anomalies) becomes irrelevant.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Large dimension tables:&lt;/strong&gt; When your dimension tables grow so large they rival the fact table in size, forcing a distributed network shuffle join that becomes prohibitively expensive.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In these edge cases, pre-joining data during ETL guarantees the database only performs rapid sequential scans, trading increased storage costs for maximum query throughput.&lt;/p&gt;

&lt;h2&gt;
  
  
  Operational considerations: Cost, concurrency, and updates at scale
&lt;/h2&gt;

&lt;p&gt;Beyond initial query benchmarks, the true test of a real-time analytics database lies in day-two operations. Evaluating these systems requires a rigorous look at compute costs, concurrency thresholds, and ecosystem integration.&lt;/p&gt;

&lt;h3&gt;
  
  
  Cost considerations: storage, compression, and scan efficiency
&lt;/h3&gt;

&lt;p&gt;The cost argument for star schemas is often understated. In a denormalized wide table, you repeat dimension values like product names, user demographics, and region strings across every fact row. At billion-row scale, that redundancy adds meaningful storage overhead, especially in cloud environments where storage costs scale linearly.&lt;/p&gt;

&lt;p&gt;A normalized star schema eliminates this duplication by design: each dimension value is stored once. ClickHouse amplifies the savings further because columnar storage groups values of the same type together, allowing compression algorithms like LZ4 and ZSTD to exploit patterns that are invisible in row-oriented layouts. The result: &lt;a href="https://clickhouse.com/resources/engineering/database-compression" rel="noopener noreferrer"&gt;10x to 20x compression&lt;/a&gt; is typical on the fact table alone, and dimension tables compress even more aggressively because they're smaller and more repetitive.&lt;/p&gt;

&lt;p&gt;Compute costs follow from this too, though the relationship depends on how your engine bills you. Some &lt;a href="https://clickhouse.com/blog/how-cloud-data-warehouses-bill-you" rel="noopener noreferrer"&gt;cloud warehouses charge directly per byte scanned&lt;/a&gt;, so a star schema join that reads a 2 GB fact table plus a 50 MB dimension is cheaper than scanning a 20 GB denormalized table with repeated dimension strings. Other engines bill on compute time, where scanning less data still helps because smaller scans generally mean faster queries and less compute consumed.&lt;/p&gt;

&lt;p&gt;But raw scan efficiency matters just as much as data volume. ClickHouse's vectorized execution and columnar compression let it scan large datasets at a fraction of the cost of traditional cloud warehouses. In practice, ClickHouse can often scan 20 GB faster and cheaper than a slower engine scans 2 GB. The real savings come from combining a normalized schema with an engine that scans efficiently in the first place.&lt;/p&gt;

&lt;h3&gt;
  
  
  Handling dimension updates at scale
&lt;/h3&gt;

&lt;p&gt;In a star schema, fact tables are append-only: events, transactions, clicks. They don't change. But dimension tables do: a product price gets corrected, a user upgrades their plan, a store changes its region mapping. These are classic slowly changing dimensions, and they're the operational pain point of star schemas.&lt;/p&gt;

&lt;p&gt;ClickHouse handles this split cleanly. For fact tables, use standard MergeTree with append-only inserts and no mutation overhead. For dimension tables, you have two complementary tools. ReplacingMergeTree handles deduplication by version during background merges, keeping only the latest row for each key, making it ideal for dimensions that update in bulk (e.g., a nightly product catalog refresh). For individual row corrections (e.g., fixing a single product price mid-day), Lightweight Updates apply instantly via patch parts: ClickHouse writes a compact patch containing only the changed values, applied immediately for instant consistency, then materialized during background merges. No table locks, no partition rewrites.&lt;/p&gt;

&lt;p&gt;This means your fact table stays fast and untouched while your dimension tables stay current, which is exactly the operational model a star schema requires.&lt;/p&gt;

&lt;h3&gt;
  
  
  Managing star schema transformations in production
&lt;/h3&gt;

&lt;p&gt;The operational overhead of a star schema, including defining dimension tables, orchestrating fact-dimension relationships, handling incremental loads, is only manageable with proper tooling.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://clickhouse.com/docs/integrations/dbt" rel="noopener noreferrer"&gt;dbt&lt;/a&gt; has become the standard here. The ClickHouse dbt adapter supports incremental materializations and full-refresh strategies, letting you model fact and dimension tables with different update patterns: append-only inserts for facts, full refresh or merge for dimensions. You define your dimension and fact models as SQL, version them in git, and dbt handles the orchestration.&lt;/p&gt;

&lt;p&gt;For a full list of ClickHouse's ecosystem integrations, including visualization, ingestion, orchestration, and language clients, see our &lt;a href="https://clickhouse.com/resources/engineering/how-to-choose-a-database-for-real-time-analytics-in-2026" rel="noopener noreferrer"&gt;complete evaluation guide&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion: Choosing a real-time analytics database for star schema joins
&lt;/h2&gt;

&lt;p&gt;The real-time data engineering landscape has shifted. Logical relational modeling and real-time performance are no longer mutually exclusive, which means star schemas, snowflake schemas, and denormalized wide tables are all viable strategies in 2026. The right choice depends on your workload, update patterns, latency targets, and operational preferences.&lt;/p&gt;

&lt;p&gt;Successfully evaluating a database architecture in 2026 means looking past outdated benchmarks. Data architects need to test actual memory-efficient join algorithms, verify optimizer intelligence, and stress-test concurrency limits under intense ingestion.&lt;/p&gt;

&lt;p&gt;If you're building customer-facing analytics applications, or if you're struggling to migrate off a slow, expensive cloud data warehouse, ClickHouse Cloud provides the query execution speed and schema flexibility required to power sub-second multi-dimensional analysis without the traditional operational burden. Many teams start by adding ClickHouse as a real-time serving layer alongside their existing data warehouse, and some later choose to consolidate entirely.&lt;/p&gt;

&lt;p&gt;Don't take these performance claims on faith. &lt;a href="https://clickhouse.com/cloud" rel="noopener noreferrer"&gt;Spin up a free ClickHouse Cloud trial&lt;/a&gt;, load as much of your own data as possible, and run your actual queries at realistic scale. That's the only way to know if it fits your workload. For evidence of how ClickHouse performs on standardized join workloads, explore the fully open and reproducible &lt;a href="https://github.com/ClickHouse/coffeeshop-benchmark" rel="noopener noreferrer"&gt;coffeeshop benchmark&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Frequently asked questions
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Which real-time analytics database is best for star schema and cross-table joins?
&lt;/h3&gt;

&lt;p&gt;In 2026, ClickHouse is a strong choice for star schemas because it supports modern join algorithms, automatic join optimization, and high concurrency. Druid/Pinot are typically better for single-table real-time analytics with limited joins.&lt;/p&gt;

&lt;h3&gt;
  
  
  Does ClickHouse support star schema (fact + dimension tables) in production?
&lt;/h3&gt;

&lt;p&gt;Yes. ClickHouse supports multi-table joins and is commonly used with normalized fact/dimension models when configured with appropriate join strategies and resource limits.&lt;/p&gt;

&lt;h3&gt;
  
  
  How many joins can ClickHouse (or an OLAP database) handle in one query?
&lt;/h3&gt;

&lt;p&gt;There's no fixed limit. It depends on memory, join strategy, and data sizes. Multiple joins to small dimensions are handled efficiently, and joins between larger tables can be optimized through join algorithm selection, runtime filtering, and resource management settings.&lt;/p&gt;

&lt;h3&gt;
  
  
  What join pattern is most likely to be slow or expensive?
&lt;/h3&gt;

&lt;p&gt;The most expensive pattern is usually a distributed shuffle join between two large tables. It forces heavy network data movement and large intermediate states.&lt;/p&gt;

&lt;h3&gt;
  
  
  How can I make star-schema joins faster in real-time analytics?
&lt;/h3&gt;

&lt;p&gt;Keep dimensions small and well-partitioned, prefer broadcast/replicated dimensions when possible, and avoid large shuffle joins. Use pre-aggregation (e.g., materialized views) for the heaviest fact-table patterns.&lt;/p&gt;

</description>
      <category>clickhouse</category>
      <category>analytics</category>
      <category>database</category>
    </item>
    <item>
      <title>OPTIMIZE TABLE ... FINAL in ClickHouse: when to use it, when to avoid it, and how merges work</title>
      <dc:creator>Aman Puri</dc:creator>
      <pubDate>Fri, 26 Jun 2026 19:08:22 +0000</pubDate>
      <link>https://dev.to/dataengineeringguide/clickhouse-optimize-table-final-48kc</link>
      <guid>https://dev.to/dataengineeringguide/clickhouse-optimize-table-final-48kc</guid>
      <description>&lt;p&gt;If you manage a ClickHouse cluster in production, you may have hit duplicate rows or the "too many parts" error. A search result, an older blog post, or a forum thread probably pointed you to &lt;a href="https://clickhouse.com/docs/optimize/avoidoptimizefinal" rel="noopener noreferrer"&gt;OPTIMIZE TABLE ... FINAL&lt;/a&gt;. You ran it, the immediate symptom vanished, and you scheduled it in a cron job.&lt;/p&gt;

&lt;p&gt;Weeks later, ClickHouse feels slower after OPTIMIZE, and your cluster's performance has degraded.&lt;/p&gt;

&lt;h2&gt;
  
  
  Key takeaways
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;  Don't blindly run &lt;code&gt;OPTIMIZE TABLE ... FINAL&lt;/code&gt; on a schedule. It forces a physical rewrite of active parts in the targeted partition or partitions, consuming significant I/O and CPU.&lt;/li&gt;
&lt;li&gt;  &lt;code&gt;OPTIMIZE ... FINAL&lt;/code&gt; bypasses the background merge scheduler's normal size guardrails, including &lt;a href="https://clickhouse.com/docs/operations/settings/merge-tree-settings#max_bytes_to_merge_at_max_space_in_pool" rel="noopener noreferrer"&gt;&lt;code&gt;max_bytes_to_merge_at_max_space_in_pool&lt;/code&gt;&lt;/a&gt;, which is typically around 150 GB.&lt;/li&gt;
&lt;li&gt;  On self-managed &lt;code&gt;ReplicatedMergeTree&lt;/code&gt;, &lt;code&gt;OPTIMIZE&lt;/code&gt; creates a replicated merge task. Depending on &lt;code&gt;alter_sync&lt;/code&gt;, ClickHouse waits for execution on the current replica or all replicas. Other replicas may merge locally or fetch the merged part, depending on settings and replica state, which can add merge, network, and replication-queue pressure.&lt;/li&gt;
&lt;li&gt;  For current-state queries on &lt;code&gt;ReplacingMergeTree&lt;/code&gt;, use &lt;code&gt;SELECT ... FINAL&lt;/code&gt; because background merges provide eventual deduplication, not immediate query correctness. Enable &lt;code&gt;do_not_merge_across_partitions_select_final=1&lt;/code&gt; only when the partition key guarantees all versions of the same logical row stay in the same partition.&lt;/li&gt;
&lt;li&gt;  To prevent "too many parts," fix ingestion: &lt;a href="https://clickhouse.com/docs/optimize/bulk-inserts" rel="noopener noreferrer"&gt;batch inserts&lt;/a&gt; into larger blocks, target roughly one insert per second, or use &lt;a href="https://clickhouse.com/docs/optimize/asynchronous-inserts" rel="noopener noreferrer"&gt;asynchronous inserts&lt;/a&gt; when client-side batching is not practical.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Why scheduled OPTIMIZE TABLE ... FINAL backfires
&lt;/h2&gt;

&lt;p&gt;You might reach for &lt;code&gt;OPTIMIZE TABLE ... FINAL&lt;/code&gt; in two common situations: when a &lt;a href="https://clickhouse.com/docs/engines/table-engines/mergetree-family/replacingmergetree" rel="noopener noreferrer"&gt;ReplacingMergeTree&lt;/a&gt; query returns duplicate versions before background merges finish, or when inserts fail with a &lt;code&gt;TOO_MANY_PARTS&lt;/code&gt; error because one partition has too many active parts.&lt;/p&gt;

&lt;p&gt;Both symptoms create pressure for immediate relief, so forcing a merge can feel like the obvious next step.&lt;/p&gt;

&lt;p&gt;The operation takes a while, but when it finishes, your part count drops and duplicates disappear. So you automate it to run nightly or hourly.&lt;/p&gt;

&lt;p&gt;At first, the cluster seems healthier: part count drops, and duplicate versions in &lt;code&gt;ReplacingMergeTree&lt;/code&gt; tables may disappear. But &lt;code&gt;OPTIMIZE ... FINAL&lt;/code&gt; can create parts larger than the background merge scheduler would normally select.&lt;/p&gt;

&lt;p&gt;New inserts still create new parts. If those new parts contain newer versions of rows already inside the oversized part, ClickHouse may not automatically merge them together. The result is that duplicates can accumulate again, while &lt;code&gt;SELECT ... FINAL&lt;/code&gt; has to resolve more of that state at query time.&lt;/p&gt;

&lt;p&gt;Very large parts can also make merge-dependent maintenance more expensive later. Lightweight deletes are reflected in query results when the delete completes, but ClickHouse physically removes the affected data later during background merges. If the affected rows live inside very large parts, that cleanup requires larger merge work. Heavier &lt;code&gt;ALTER TABLE ... UPDATE&lt;/code&gt; and &lt;code&gt;ALTER TABLE ... DELETE&lt;/code&gt; mutations can also compete with or wait behind merge activity, so they are best reserved for deliberate, large, often partition-aligned changes. During these rewrites, ClickHouse must keep old parts until the new parts are committed, so disk I/O and temporary disk usage can rise, and query latency can degrade.&lt;/p&gt;

&lt;p&gt;The instinct to manually "vacuum" or compact a database is understandable. Many storage engines benefit from periodic operator-driven compaction.&lt;/p&gt;

&lt;p&gt;In ClickHouse's MergeTree family, forcing large, unscheduled table rewrites usually works against the background merge scheduler. The scheduler's heuristics and safety limits are tuned to keep merges running as a resource-aware background process. Bypassing them shifts that work onto the same cluster serving live queries and ingestion.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is a data part in ClickHouse (MergeTree)
&lt;/h2&gt;

&lt;p&gt;In ClickHouse, an INSERT into a MergeTree-family table writes data into a new immutable unit on storage called a data part.&lt;/p&gt;

&lt;p&gt;A part is a collection of physical assets rather than a single file. In common wide-format parts, this includes compressed column files, mark files that map index granules to byte offsets, a sparse primary index, checksums, and metadata. Compact parts and newer storage formats can organize these files differently, but the core idea is the same: a part is a self-contained, immutable unit of sorted columnar data. You can read more about the structure in the &lt;a href="https://clickhouse.com/docs/parts" rel="noopener noreferrer"&gt;table parts documentation&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Parts are immutable. Once written, the engine does not update them in place. It creates new parts, merges existing parts, and later removes parts that are no longer active.&lt;/p&gt;

&lt;p&gt;Part names encode their history. The format is &lt;code&gt;partition_id_minimum_block_number_maximum_block_number_level_data_version&lt;/code&gt;, where &lt;code&gt;data_version&lt;/code&gt; is optional and is incremented when a mutation writes a new part. For example, a part named &lt;code&gt;all_1_4_1_6&lt;/code&gt; belongs to the all partition, covers the block-number range 1 through 4, has merge level 1, and carries data version 6. The &lt;code&gt;level&lt;/code&gt; field tracks merge history only: level 0 means the part has not been produced by a merge, while a higher level indicates one or more merge generations in its lineage. Mutations are tracked separately in &lt;code&gt;data_version&lt;/code&gt;, so a part can remain level 0 after being rewritten by a mutation.&lt;/p&gt;

&lt;p&gt;To see what your active parts look like, query &lt;a href="https://clickhouse.com/docs/operations/system-tables/parts" rel="noopener noreferrer"&gt;&lt;code&gt;system.parts&lt;/code&gt;&lt;/a&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt;
    &lt;span class="k"&gt;partition&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;level&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;rows&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;formatReadableSize&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;bytes_on_disk&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="k"&gt;size&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="k"&gt;system&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;parts&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="k"&gt;table&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'your_table_name'&lt;/span&gt;
  &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;active&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;bytes_on_disk&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Examining levels and sizes tells you whether ingestion is creating many fragmented, level-0 parts or whether background merges are consolidating them into larger, higher-level parts.&lt;/p&gt;

&lt;p&gt;Healthy ClickHouse part management means allowing the engine to maintain a balanced distribution of parts continuously, not forcing every partition into a single huge part.&lt;/p&gt;

&lt;h2&gt;
  
  
  How ClickHouse background merges work (and why partitions matter)
&lt;/h2&gt;

&lt;p&gt;ClickHouse does not wait for manual intervention to clean up data. The engine continuously evaluates active parts in the background, selects candidates, and schedules merges to reduce fragmentation.&lt;/p&gt;

&lt;p&gt;This process is scoped per partition. ClickHouse does not merge parts from two different partitions together.&lt;/p&gt;

&lt;p&gt;The background merge scheduler balances reducing part count against write amplification. To protect the cluster from resource starvation, it operates under safety limits such as &lt;a href="https://clickhouse.com/docs/operations/settings/merge-tree-settings#max_bytes_to_merge_at_max_space_in_pool" rel="noopener noreferrer"&gt;&lt;code&gt;max_bytes_to_merge_at_max_space_in_pool&lt;/code&gt;&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;This setting caps the total source-part size that an automatic background merge will select when enough resources are available. It roughly corresponds to the largest part size automatic merges normally create. The common default is around 150 GB.&lt;/p&gt;

&lt;p&gt;Merging data requires reading source parts, merging rows according to the table engine's rules, writing compressed columnar output, and committing a replacement part. On large parts, ClickHouse can use vertical merge algorithms that process columns in a more memory-efficient way, but the operation still consumes CPU, memory, and I/O.&lt;/p&gt;

&lt;p&gt;Merging multiple very large parts can create significant write amplification. That work competes with concurrent queries and ingestion for disk or object-storage bandwidth and can evict useful data from caches. The size limits keep normal background merging incremental and resource-aware.&lt;/p&gt;

&lt;h2&gt;
  
  
  OPTIMIZE TABLE in ClickHouse: what each variant does
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;OPTIMIZE TABLE.&lt;/strong&gt; Running the base command without modifiers asks ClickHouse to initiate an unscheduled merge. It may perform no merge if the table does not have useful candidates. By default, ClickHouse does not notify the client when no merge is performed unless &lt;a href="https://clickhouse.com/docs/operations/settings/settings#optimize_throw_if_noop" rel="noopener noreferrer"&gt;&lt;code&gt;optimize_throw_if_noop&lt;/code&gt;&lt;/a&gt; is enabled. On replicated tables, wait behavior depends on &lt;a href="https://clickhouse.com/docs/operations/settings/settings#alter_sync" rel="noopener noreferrer"&gt;&lt;code&gt;alter_sync&lt;/code&gt;&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;OPTIMIZE TABLE ... PARTITION&lt;/strong&gt;. This scopes the optimize request to a specific partition. It is most appropriate for bounded maintenance on a closed partition, such as a historical month of data that will not receive more writes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;OPTIMIZE TABLE ... FINAL.&lt;/strong&gt; This forces optimization even if the data is already in one part. In practice, it forces ClickHouse to merge active parts in each targeted partition down to a single part, bypassing the normal automatic-merge size guardrails. It should be treated as an infrequent, intentional administrative operation, not routine maintenance.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why OPTIMIZE TABLE ... FINAL is expensive
&lt;/h3&gt;

&lt;p&gt;ClickHouse must read the active source parts, apply the relevant MergeTree-family merge logic, write the merged output, and commit the new part. For large datasets, this is CPU- and I/O-intensive.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;OPTIMIZE TABLE ... FINAL&lt;/code&gt; also bypasses the &lt;code&gt;max_bytes_to_merge_at_max_space_in_pool&lt;/code&gt; safeguard. It can attempt merges that automatic background merging would not choose, limited primarily by available resources such as disk space.&lt;/p&gt;

&lt;p&gt;Once very large parts exist, the background scheduler may not be able to merge future incoming parts with them. It will continue to merge smaller new parts where it can, but it may no longer be able to consolidate the whole partition automatically. For &lt;code&gt;ReplacingMergeTree&lt;/code&gt;, this can mean duplicate versions accumulate again as new data arrives.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why OPTIMIZE can add replication pressure
&lt;/h3&gt;

&lt;p&gt;For self-managed &lt;code&gt;ReplicatedMergeTree&lt;/code&gt;, &lt;code&gt;OPTIMIZE&lt;/code&gt; creates a replicated merge task. Depending on &lt;code&gt;alter_sync&lt;/code&gt;, ClickHouse waits for execution on the current replica or all replicas.&lt;/p&gt;

&lt;p&gt;Other replicas may merge locally or fetch the merged part, depending on settings and replica state. Either way, a forced large merge can add substantial CPU, disk, network, and replication-queue work. During that window, replicas can have different part structures, so queries may use different execution plans depending on which replica serves the read.&lt;/p&gt;

&lt;p&gt;A cron-scheduled &lt;code&gt;OPTIMIZE ... FINAL&lt;/code&gt; makes this recurring. The cluster repeatedly absorbs large administrative merge work, and inserts can see higher latency or failures under pressure.&lt;/p&gt;

&lt;p&gt;ClickHouse Cloud uses SharedMergeTree with shared object storage and shared metadata, so it does not use the same explicit local-part replication model as self-managed &lt;code&gt;ReplicatedMergeTree&lt;/code&gt;. The same principle still applies: forced large merges consume resources and should not be scheduled as routine maintenance.&lt;/p&gt;

&lt;h2&gt;
  
  
  OPTIMIZE TABLE ... FINAL vs. SELECT ... FINAL: what FINAL means
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;OPTIMIZE TABLE ... FINAL&lt;/code&gt; is a storage operation. It rewrites physical data parts on disk or object storage.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;SELECT ... FINAL&lt;/code&gt; is a query-time modifier. It applies the table engine's merge logic when reading data and does not alter the underlying parts.&lt;/p&gt;

&lt;p&gt;Older guidance often recommended avoiding &lt;code&gt;SELECT ... FINAL&lt;/code&gt; because earlier implementations were much more expensive. ClickHouse has since optimized &lt;code&gt;FINAL&lt;/code&gt; significantly. The &lt;a href="https://clickhouse.com/blog/clickhouse-release-23-12" rel="noopener noreferrer"&gt;ClickHouse 23.12 release notes&lt;/a&gt; describe improvements that split data into intersecting and non-intersecting ranges so ClickHouse can process ranges that do not need finalization as normal reads. The &lt;a href="https://clickhouse.com/blog/clickhouse-release-24-01" rel="noopener noreferrer"&gt;24.1 release notes&lt;/a&gt; describe additional vertical processing optimizations for &lt;code&gt;FINAL&lt;/code&gt; with &lt;code&gt;ReplacingMergeTree&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;SELECT ... FINAL&lt;/code&gt; still has overhead, especially when a query must read many candidate rows that are not filtered by the sorting key. But it is the right tool when users need correct current-state results from unmerged &lt;code&gt;ReplacingMergeTree&lt;/code&gt; data. It gives query-time correctness without forcing a physical rewrite of the storage layer.&lt;/p&gt;

&lt;h2&gt;
  
  
  How to deduplicate rows and handle ReplacingMergeTree safely
&lt;/h2&gt;

&lt;p&gt;Use &lt;code&gt;ReplacingMergeTree&lt;/code&gt; for current-state deduplication when data arrives as immutable inserts. During background merges, ClickHouse identifies duplicates by the table's &lt;code&gt;ORDER BY&lt;/code&gt; columns. If a version column is configured, it keeps the row with the highest version. Without a version column, replacement is based on merge order, so a version column is the safer pattern for update-style workloads.&lt;/p&gt;

&lt;p&gt;Background merging is asynchronous, so you cannot rely on it for query-time correctness. Queries against unmerged data can still see multiple versions of the same logical row unless you use &lt;code&gt;FINAL&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;For better &lt;code&gt;FINAL&lt;/code&gt; performance on partitioned tables, use &lt;a href="https://clickhouse.com/docs/operations/settings/settings#do_not_merge_across_partitions_select_final" rel="noopener noreferrer"&gt;&lt;code&gt;do_not_merge_across_partitions_select_final&lt;/code&gt;&lt;/a&gt; when your schema guarantees that all versions of a logical row land in the same partition:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt;
    &lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;event_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;event_timestamp&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;events_table&lt;/span&gt; &lt;span class="k"&gt;FINAL&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;event_date&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;today&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;SETTINGS&lt;/span&gt; &lt;span class="n"&gt;do_not_merge_across_partitions_select_final&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This setting tells ClickHouse to process partitions independently during &lt;code&gt;SELECT ... FINAL&lt;/code&gt;, which can improve parallelism and reduce unnecessary cross-partition work. It is only correct when the partition key cooperates. If versions of the same row can land in different partitions, using this setting can produce incorrect current-state results.&lt;/p&gt;

&lt;p&gt;If a dataset is large enough that &lt;code&gt;SELECT ... FINAL&lt;/code&gt; is still too expensive for a hot query path, do not solve that by scheduling &lt;code&gt;OPTIMIZE ... FINAL&lt;/code&gt;. Instead, shift the latest-state computation to ingestion time.&lt;/p&gt;

&lt;p&gt;One common pattern is a materialized view into an &lt;code&gt;AggregatingMergeTree&lt;/code&gt; table using &lt;code&gt;argMaxState&lt;/code&gt; at ingest time and &lt;code&gt;argMaxMerge&lt;/code&gt; at query time. That keeps a precomputed latest-state representation for fast reads, while the source &lt;code&gt;ReplacingMergeTree&lt;/code&gt; can retain the raw immutable event stream and let background merges reclaim storage over time.&lt;/p&gt;

&lt;h2&gt;
  
  
  How to fix the "too many parts" error
&lt;/h2&gt;

&lt;p&gt;The "too many parts" error appears when the number of active parts in a partition exceeds the &lt;a href="https://clickhouse.com/docs/operations/settings/merge-tree-settings#parts_to_throw_insert" rel="noopener noreferrer"&gt;&lt;code&gt;parts_to_throw_insert&lt;/code&gt;&lt;/a&gt; limit. This protects the cluster when inserts create parts faster than background merges can consolidate them. Check &lt;code&gt;system.merge_tree_settings&lt;/code&gt; in your environment for the exact default in your ClickHouse version.&lt;/p&gt;

&lt;p&gt;This is an ingestion problem, not a merge engine failure.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;OPTIMIZE TABLE ... FINAL&lt;/code&gt; is not a fix. It rewrites a snapshot of the parts you already have, but it does not change the rate at which new parts arrive. The exception can return as soon as ingestion resumes, while your storage is left with very large parts that can make future operations harder.&lt;/p&gt;

&lt;h3&gt;
  
  
  Find the rapidly growing partition
&lt;/h3&gt;

&lt;p&gt;Run this diagnostic query to find where part growth is concentrated:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt;
    &lt;span class="k"&gt;partition&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;count&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;active_parts&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;max&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;level&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;highest_level&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;rows&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;total_rows&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;formatReadableSize&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;bytes_on_disk&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;total_size&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="k"&gt;system&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;parts&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="k"&gt;table&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'your_table_name'&lt;/span&gt;
  &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;active&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="k"&gt;partition&lt;/span&gt;
&lt;span class="k"&gt;HAVING&lt;/span&gt; &lt;span class="n"&gt;active_parts&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;150&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;active_parts&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;150&lt;/code&gt; threshold is a monitoring heuristic, not a ClickHouse limit. It gives an early warning before inserts are delayed or rejected.&lt;/p&gt;

&lt;h3&gt;
  
  
  Fix excessive part creation with batch inserts
&lt;/h3&gt;

&lt;p&gt;The most effective client-side fix is batching inserts. ClickHouse recommends fairly large batches: at least 1,000 rows per insert, and ideally 10,000 to 100,000 rows, while keeping insert query frequency around once per second when using synchronous inserts.&lt;/p&gt;

&lt;p&gt;Sending hundreds or thousands of tiny inserts per second creates too many parts. Every flush creates at least one new part per affected partition, and the merge scheduler then has to consolidate them.&lt;/p&gt;

&lt;h3&gt;
  
  
  Use asynchronous inserts to batch on the server
&lt;/h3&gt;

&lt;p&gt;If rewriting your upstream application to buffer data client-side is not practical, use server-side batching with &lt;a href="https://clickhouse.com/docs/optimize/asynchronous-inserts" rel="noopener noreferrer"&gt;asynchronous inserts&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Enable &lt;code&gt;async_insert = 1&lt;/code&gt; and use &lt;code&gt;wait_for_async_insert = 1&lt;/code&gt;, which is the recommended production mode. ClickHouse buffers incoming inserts in memory and flushes them to storage when configurable thresholds are met, such as buffer size, elapsed time, or number of queued queries.&lt;/p&gt;

&lt;p&gt;With &lt;code&gt;wait_for_async_insert = 1&lt;/code&gt;, the client receives acknowledgment only after data is flushed to storage. That preserves durability expectations and lets your application receive insertion errors instead of hiding them in server logs.&lt;/p&gt;

&lt;h3&gt;
  
  
  Avoid over-partitioning
&lt;/h3&gt;

&lt;p&gt;Check your schema for over-partitioning. A common mistake is using a granular daily &lt;code&gt;PARTITION BY&lt;/code&gt; key for low-volume data or for workloads that write across many dates at once.&lt;/p&gt;

&lt;p&gt;If each insert touches many partitions, ClickHouse creates at least one part per partition per flush. The problem multiplies quickly.&lt;/p&gt;

&lt;p&gt;Switching from daily &lt;code&gt;toYYYYMMDD(date)&lt;/code&gt; partitions to monthly &lt;code&gt;or yearly&lt;/code&gt; partitions can reduce part pressure when the workload does not require daily partition management. Monitor merge health with &lt;a href="https://clickhouse.com/docs/operations/system-tables/merges" rel="noopener noreferrer"&gt;&lt;code&gt;system.merges&lt;/code&gt;&lt;/a&gt; for in-flight merges and &lt;a href="https://clickhouse.com/docs/operations/system-tables/part_log" rel="noopener noreferrer"&gt;&lt;code&gt;system.part_log&lt;/code&gt;&lt;/a&gt; for historical merge activity.&lt;/p&gt;

&lt;h2&gt;
  
  
  When to use OPTIMIZE TABLE, SELECT ... FINAL, or do nothing
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Trust background merges as your default.&lt;/strong&gt; This is the right approach for most production workloads. The heuristics controlling merge selection and the limits protecting I/O bandwidth exist so operators do not have to intervene manually.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Use SELECT ... FINAL for reading deduplicated data.&lt;/strong&gt; When you need a correct current-state view of a &lt;code&gt;ReplacingMergeTree&lt;/code&gt;, apply &lt;code&gt;FINAL&lt;/code&gt; at query time because background merges provide eventual deduplication, not immediate query correctness. Use partition-aware settings only when your partitioning scheme keeps every version of the same logical row in the same partition.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Use OPTIMIZE TABLE without FINAL as a merge request, not a guarantee.&lt;/strong&gt; After a heavy backfill or maintenance event, a plain &lt;code&gt;OPTIMIZE TABLE&lt;/code&gt; can ask ClickHouse to initiate an unscheduled merge. It may no-op if there are no useful merge candidates.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Use OPTIMIZE TABLE ... PARTITION for targeted maintenance.&lt;/strong&gt; When an older partition is closed and will not receive more writes, optimizing that specific partition can be reasonable. It is more bounded than a table-wide forced rewrite.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Use OPTIMIZE TABLE ... FINAL only for bounded, one-time administrative operations.&lt;/strong&gt; Reserve forced full rewrites for specific, scoped situations, such as collapsing all versions in a small, closed ReplacingMergeTree partition that will &lt;strong&gt;never receive further writes&lt;/strong&gt;. Do not put it in a cron job or automated pipeline. In most cases, OPTIMIZE TABLE ... FINAL is probably the wrong lever to reach for. If in doubt, you should reach out to ClickHouse Support for guidance.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;ClickHouse is designed to manage its own storage. Background merges, immutable parts, and merge-size limits exist so operators do not have to run periodic forced rewrites.&lt;/p&gt;

&lt;p&gt;Use &lt;code&gt;SELECT ... FINAL&lt;/code&gt; for query-time correctness, fix ingestion when part counts climb, and reserve &lt;code&gt;OPTIMIZE TABLE ... FINAL&lt;/code&gt; for bounded administrative operations. Trust the part lifecycle, and avoid putting forced rewrites in a cron job.&lt;/p&gt;

&lt;p&gt;Whether you run ClickHouse yourself or use ClickHouse Cloud, the same principle applies. To evaluate ClickHouse Cloud, &lt;a href="https://clickhouse.cloud/signUp?glxid=eae36bed-8ea6-4152-860c-7c9392f47817&amp;amp;pagePath=%2Fresources%2Fengineering%2Fclickhouse-optimize-table-final&amp;amp;origPath=%2Fresources%2Fengineering%2Fclickhouse-optimize-table-final&amp;amp;utm_ga=GA1.1.1167642100.1782485334" rel="noopener noreferrer"&gt;spin up a free trial&lt;/a&gt;, load as much of your own data as practical, run an evaluation at realistic scale, and compare against your existing system.&lt;/p&gt;

&lt;h2&gt;
  
  
  Frequently asked questions
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Should I run OPTIMIZE TABLE ... FINAL regularly in ClickHouse?
&lt;/h3&gt;

&lt;p&gt;No. &lt;code&gt;OPTIMIZE ... FINAL&lt;/code&gt; forces a heavy rewrite of active parts in the targeted partition or partitions, bypasses automatic merge guardrails, and can add sustained CPU and I/O pressure.&lt;/p&gt;

&lt;h3&gt;
  
  
  What does OPTIMIZE TABLE ... FINAL actually do?
&lt;/h3&gt;

&lt;p&gt;It forces ClickHouse to optimize even when data is already in one part. For MergeTree-family tables, that commonly means merging active parts in each targeted partition down to a single part.&lt;/p&gt;

&lt;h3&gt;
  
  
  Is OPTIMIZE TABLE replicated across replicas?
&lt;/h3&gt;

&lt;p&gt;For self-managed &lt;code&gt;ReplicatedMergeTree&lt;/code&gt;, &lt;code&gt;OPTIMIZE&lt;/code&gt; creates a replicated merge task. Depending on &lt;code&gt;alter_sync&lt;/code&gt;, ClickHouse waits for execution on the current replica or all replicas. Other replicas may merge locally or fetch the merged part, depending on settings and replica state. In ClickHouse Cloud, SharedMergeTree uses shared storage and shared metadata, so the self-managed local-part replication model does not apply in the same way.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why can OPTIMIZE ... FINAL make performance worse over time?
&lt;/h3&gt;

&lt;p&gt;It can create very large parts and substantial write amplification. Those large parts may be outside the size range automatic merges normally select, so future inserts can create new parts that are harder to consolidate with the largest existing parts.&lt;/p&gt;

&lt;h3&gt;
  
  
  What should I use instead of OPTIMIZE ... FINAL to remove duplicates in ReplacingMergeTree?
&lt;/h3&gt;

&lt;p&gt;Use &lt;code&gt;SELECT ... FINAL&lt;/code&gt; when you need correct query-time deduplication. Background merges will clean up storage over time, but they are asynchronous and should not be relied on for immediate query correctness.&lt;/p&gt;

&lt;h3&gt;
  
  
  How do I speed up SELECT ... FINAL on partitioned tables?
&lt;/h3&gt;

&lt;p&gt;Use &lt;code&gt;do_not_merge_across_partitions_select_final = 1&lt;/code&gt; when all versions of the same logical row are guaranteed to stay within one partition. This lets ClickHouse process partitions independently during &lt;code&gt;FINAL&lt;/code&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  What causes the "too many parts" error?
&lt;/h3&gt;

&lt;p&gt;Inserts are creating parts faster than background merges can consolidate them, usually because of too many small inserts or inserts that touch too many partitions.&lt;/p&gt;

&lt;h3&gt;
  
  
  What batch size should I target to avoid excessive part creation?
&lt;/h3&gt;

&lt;p&gt;Use batches of at least 1,000 rows, ideally 10,000 to 100,000 rows, and keep synchronous insert frequency around once per second where possible.&lt;/p&gt;

&lt;h3&gt;
  
  
  Can ClickHouse batch small inserts automatically?
&lt;/h3&gt;

&lt;p&gt;Yes. Use asynchronous inserts with &lt;code&gt;async_insert = 1&lt;/code&gt; and &lt;code&gt;wait_for_async_insert = 1&lt;/code&gt; so ClickHouse buffers small inserts and flushes them as larger parts based on configurable thresholds.&lt;/p&gt;

&lt;h3&gt;
  
  
  When is OPTIMIZE TABLE ... PARTITION appropriate?
&lt;/h3&gt;

&lt;p&gt;It can be appropriate for bounded maintenance on a closed partition that will not receive more writes. It is safer than forcing a table-wide rewrite, but it should still be used deliberately.&lt;/p&gt;

</description>
      <category>clickhouse</category>
      <category>database</category>
      <category>performance</category>
    </item>
    <item>
      <title>ClickHouse JOINs Aren't Slow Anymore (You're Reading 2020's Docs)</title>
      <dc:creator>Aman Puri</dc:creator>
      <pubDate>Tue, 12 May 2026 18:41:27 +0000</pubDate>
      <link>https://dev.to/dataengineeringguide/clickhouse-joins-arent-slow-anymore-youre-reading-2020s-docs-15fe</link>
      <guid>https://dev.to/dataengineeringguide/clickhouse-joins-arent-slow-anymore-youre-reading-2020s-docs-15fe</guid>
      <description>&lt;p&gt;ClickHouse JOIN performance was rebuilt between 2022 and 2026. The "avoid JOINs in ClickHouse" advice from 2020 is still circulating:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"ClickHouse can't do JOINs."&lt;/p&gt;

&lt;p&gt;"Denormalize everything."&lt;/p&gt;

&lt;p&gt;"Only hash join is supported."&lt;/p&gt;

&lt;p&gt;"JOINs OOM on anything bigger than RAM."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;All four were accurate in 2020. None of them are accurate today.&lt;/p&gt;

&lt;p&gt;In 2020, ClickHouse had one join algorithm, no disk spilling, no cost-based optimizer, and join order followed query syntax. If the right table did not fit in memory, the query crashed.&lt;/p&gt;

&lt;p&gt;Between 2022 and early 2026, the join subsystem was rebuilt. Six algorithms ship by default. The optimizer reorders joins automatically using column statistics. Runtime bloom filters prune fact tables at the storage scan. Grace hash spills to disk instead of OOMing. None of it is behind feature flags.&lt;/p&gt;

&lt;p&gt;Here is what shipped.&lt;/p&gt;

&lt;h2&gt;
  
  
  How Many JOIN Algorithms Does ClickHouse Support?
&lt;/h2&gt;

&lt;p&gt;By the end of 2022 ClickHouse had five join algorithms. Today it has six, and the engine selects automatically:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Hash join.&lt;/strong&gt; The classic.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Parallel hash join.&lt;/strong&gt; Default since v24.12 (&lt;a href="https://github.com/ClickHouse/ClickHouse/pull/70788" rel="noopener noreferrer"&gt;PR #70788&lt;/a&gt;). Near-linear scaling across CPU cores. Zero configuration.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Grace hash join.&lt;/strong&gt; Disk-spilling for joins that exceed memory (&lt;a href="https://github.com/ClickHouse/ClickHouse/pull/38191" rel="noopener noreferrer"&gt;PR #38191&lt;/a&gt;). GA in v24.3. Closes the most upvoted join issue in ClickHouse history, open since June 2020.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Full sorting merge.&lt;/strong&gt; Memory-bounded sort-merge (&lt;a href="https://github.com/ClickHouse/ClickHouse/pull/35796" rel="noopener noreferrer"&gt;PR #35796&lt;/a&gt;). Faster than hash on pre-sorted data.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Direct join.&lt;/strong&gt; O(1) memory lookups against dictionaries and EmbeddedRocksDB. No hash table construction.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Paste join.&lt;/strong&gt; Positional joins.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;"ClickHouse only has hash join" has been false since November 2022.&lt;/p&gt;

&lt;h2&gt;
  
  
  ClickHouse JOIN OOM Crashes: Grace Hash and Disk Spilling
&lt;/h2&gt;

&lt;p&gt;Grace hash join partitions both inputs into buckets via a secondary hash, processes one bucket pair at a time, and spills inactive buckets to disk. Full sorting merge bounds memory by the sort buffer rather than right-side cardinality.&lt;/p&gt;

&lt;p&gt;Before &lt;a href="https://github.com/ClickHouse/ClickHouse/pull/38191" rel="noopener noreferrer"&gt;PR #38191&lt;/a&gt;, a join where the right table exceeded memory crashed with OOM. After it, the join completes. It just takes longer.&lt;/p&gt;

&lt;h2&gt;
  
  
  Does ClickHouse Have a Query Optimizer for JOINs?
&lt;/h2&gt;

&lt;p&gt;"ClickHouse has no query optimizer" was true. It has not been since 2024.&lt;/p&gt;

&lt;h3&gt;
  
  
  Equivalence-Set Predicate Pushdown (180× Speedup)
&lt;/h3&gt;

&lt;p&gt;Equivalence-set predicate pushdown (&lt;a href="https://github.com/ClickHouse/ClickHouse/pull/61216" rel="noopener noreferrer"&gt;PR #61216&lt;/a&gt;) is the highest-impact join optimization of the four-year arc. When tables join on &lt;code&gt;t1.id = t2.id&lt;/code&gt;, a filter &lt;code&gt;WHERE t1.id = 5&lt;/code&gt; is logically equivalent to &lt;code&gt;t2.id = 5&lt;/code&gt;. The optimizer recognizes this and pushes the filter to both sides before the join executes.&lt;/p&gt;

&lt;p&gt;Benchmarks: &lt;strong&gt;up to 180× speedup&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Extended in 2026 (&lt;a href="https://github.com/ClickHouse/ClickHouse/pull/96596" rel="noopener noreferrer"&gt;PR #96596&lt;/a&gt;) to propagate across chains of INNER JOINs using a Disjoint Set Union to track transitive equalities. Filter on &lt;code&gt;t1.id&lt;/code&gt;, get it pushed to &lt;code&gt;t2&lt;/code&gt;, &lt;code&gt;t3&lt;/code&gt;, and beyond.&lt;/p&gt;

&lt;h3&gt;
  
  
  Automatic OUTER to INNER Conversion
&lt;/h3&gt;

&lt;p&gt;Automatic OUTER to INNER conversion (&lt;a href="https://github.com/ClickHouse/ClickHouse/pull/62907" rel="noopener noreferrer"&gt;PR #62907&lt;/a&gt;) recognizes when post-join filters make outer semantics redundant. A &lt;code&gt;LEFT JOIN ... WHERE right_col IS NOT NULL&lt;/code&gt; becomes an INNER JOIN. One benchmarked query went from 32s to 0.006s.&lt;/p&gt;

&lt;p&gt;Converting to INNER first unlocks predicate pushdown and reordering that are structurally impossible for OUTER JOINs.&lt;/p&gt;

&lt;h3&gt;
  
  
  Right-Side and OR-Condition Pushdown
&lt;/h3&gt;

&lt;p&gt;Two further pushdown extensions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Right-side predicate pushdown (&lt;a href="https://github.com/ClickHouse/ClickHouse/pull/50532" rel="noopener noreferrer"&gt;PR #50532&lt;/a&gt;) with 27× improvement on applicable workloads.&lt;/li&gt;
&lt;li&gt;OR-condition pushdown (&lt;a href="https://github.com/ClickHouse/ClickHouse/pull/84735" rel="noopener noreferrer"&gt;PR #84735&lt;/a&gt;). Previously only AND conditions could be pushed.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;No query hints required.&lt;/p&gt;

&lt;h2&gt;
  
  
  Cost-Based JOIN Reordering in ClickHouse (v25.9+)
&lt;/h2&gt;

&lt;p&gt;Automatic join order optimization landed in v25.9.&lt;/p&gt;

&lt;h3&gt;
  
  
  Greedy JOIN Reordering with Column Statistics
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://github.com/ClickHouse/ClickHouse/pull/86822" rel="noopener noreferrer"&gt;PR #86822&lt;/a&gt; introduced global join reordering using a greedy algorithm with column statistics. For queries joining three or more tables, the optimizer evaluates estimated cardinalities and picks the order that minimizes intermediate result sizes.&lt;/p&gt;

&lt;p&gt;TPC-H SF100: &lt;strong&gt;1,450× speedup, 25× memory reduction&lt;/strong&gt; versus syntax-order execution.&lt;/p&gt;

&lt;h3&gt;
  
  
  DPsize Dynamic Programming JOIN Reordering
&lt;/h3&gt;

&lt;p&gt;v25.12 added a DPsize dynamic programming algorithm (&lt;a href="https://github.com/ClickHouse/ClickHouse/pull/91002" rel="noopener noreferrer"&gt;PR #91002&lt;/a&gt;) for exhaustive search on complex queries, with greedy fallback above the complexity threshold. PostgreSQL and Oracle planners work the same way.&lt;/p&gt;

&lt;h3&gt;
  
  
  Automatic Statistics Collection
&lt;/h3&gt;

&lt;p&gt;Statistics collection went automatic in v25.10 (&lt;a href="https://github.com/ClickHouse/ClickHouse/pull/89332" rel="noopener noreferrer"&gt;PR #89332&lt;/a&gt;). Runtime hash table sizes from previous executions feed back into future plans (&lt;a href="https://github.com/ClickHouse/ClickHouse/pull/93912" rel="noopener noreferrer"&gt;PR #93912&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;Hand-tuned join order is obsolete.&lt;/p&gt;

&lt;h2&gt;
  
  
  Runtime Bloom Filters for ClickHouse Star Schema JOINs
&lt;/h2&gt;

&lt;p&gt;Star and snowflake schemas were historically where ClickHouse joins hurt. The fact table is large. The dimensions are small. The fact table got scanned in full, then most rows were discarded.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/ClickHouse/ClickHouse/pull/89314" rel="noopener noreferrer"&gt;PR #89314&lt;/a&gt;, default since v25.10, builds bloom filters from build-side join keys during hash table construction and pushes them to the probe-side scan. Non-matching rows are discarded at the storage layer before reaching the join.&lt;/p&gt;

&lt;p&gt;v25.10 release blog: &lt;strong&gt;2.1× overall query speedup, 7× memory reduction&lt;/strong&gt; on star-schema workloads. Filters can be pushed into PREWHERE (&lt;a href="https://github.com/ClickHouse/ClickHouse/pull/95838" rel="noopener noreferrer"&gt;PR #95838&lt;/a&gt;) for further efficiency. An adaptive mechanism disables them at runtime when they stop filtering enough rows.&lt;/p&gt;

&lt;h2&gt;
  
  
  Does ClickHouse Support Correlated Subqueries?
&lt;/h2&gt;

&lt;p&gt;True until April 2025. &lt;a href="https://github.com/ClickHouse/ClickHouse/pull/76078" rel="noopener noreferrer"&gt;PR #76078&lt;/a&gt; shipped correlated subquery decorrelation. EXISTS, scalar, and projection-list subqueries auto-convert into joins.&lt;/p&gt;

&lt;p&gt;Beta with default enablement landed in August 2025 (&lt;a href="https://github.com/ClickHouse/ClickHouse/pull/85107" rel="noopener noreferrer"&gt;PR #85107&lt;/a&gt;), closing &lt;a href="https://github.com/ClickHouse/ClickHouse/issues/6697" rel="noopener noreferrer"&gt;issue #6697&lt;/a&gt;, open since 2019.&lt;/p&gt;

&lt;p&gt;Migration from PostgreSQL or Snowflake no longer requires manual rewriting of correlated subqueries.&lt;/p&gt;

&lt;h2&gt;
  
  
  ClickHouse JOIN Limitations and Trade-offs in 2026
&lt;/h2&gt;

&lt;p&gt;Four caveats remain:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Flat tables are still faster for sub-10ms p99 dashboards.&lt;/strong&gt; Denormalization wins on extreme latency targets.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The optimizer depends on statistics.&lt;/strong&gt; Stale or missing stats degrade plan quality. Monitor &lt;code&gt;system.statistics&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Correlated subqueries are beta.&lt;/strong&gt; Common patterns work. Edge cases need explicit join rewrites.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Grace hash trades speed for completion.&lt;/strong&gt; Constant spilling means the data model needs adjustment, not the join algorithm.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  When Should You Use JOINs in ClickHouse?
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Workload&lt;/th&gt;
&lt;th&gt;Verdict&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Star/snowflake analytics&lt;/td&gt;
&lt;td&gt;Yes. Runtime bloom filters target this exact pattern.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Multi-table reporting&lt;/td&gt;
&lt;td&gt;Yes. The optimizer reorders.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Joins larger than RAM&lt;/td&gt;
&lt;td&gt;Yes. Grace hash and full sorting merge guarantee completion.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Real-time dimension enrichment&lt;/td&gt;
&lt;td&gt;Yes. Direct join is O(1) memory.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ASOF time-series joins&lt;/td&gt;
&lt;td&gt;Yes. Full sorting merge variant is 2× faster, 2× less memory.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Sub-10ms p99 on complex joins&lt;/td&gt;
&lt;td&gt;Conditional. Denormalize if the storage budget allows.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;10+ table joins with no stats&lt;/td&gt;
&lt;td&gt;Enable &lt;code&gt;allow_statistics_optimize&lt;/code&gt; (default in v25.10+).&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Responding to "avoid JOINs in ClickHouse" in 2026
&lt;/h2&gt;

&lt;p&gt;Check the version.&lt;/p&gt;

&lt;p&gt;Anyone benchmarking against ClickHouse 23.x, or quoting blog posts written before parallel hash join became default, is not evaluating ClickHouse 2026. They are evaluating a system that no longer exists.&lt;/p&gt;

&lt;p&gt;The commit history is public. 50+ pull requests. Six algorithms. Cost-based optimization. Runtime filtering. Automatic algorithm selection, build-side selection, join reordering, predicate pushdown. None behind flags. All default.&lt;/p&gt;

&lt;p&gt;Test it on the workload. That is the only benchmark that matters.&lt;/p&gt;

&lt;h2&gt;
  
  
  ClickHouse JOIN FAQ
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Are ClickHouse JOINs production-ready in 2026?
&lt;/h3&gt;

&lt;p&gt;Yes. Six algorithms ship by default, the optimizer reorders joins using column statistics, runtime bloom filters prune fact tables at the storage scan, and grace hash spills to disk instead of OOMing. None of it requires configuration.&lt;/p&gt;

&lt;h3&gt;
  
  
  How many JOIN algorithms does ClickHouse support?
&lt;/h3&gt;

&lt;p&gt;Six: hash, parallel hash (default), grace hash (disk-spilling), full sorting merge, direct (O(1) key-value), and paste. The engine selects automatically.&lt;/p&gt;

&lt;h3&gt;
  
  
  Does ClickHouse still crash with OOM on large JOINs?
&lt;/h3&gt;

&lt;p&gt;No. Grace hash join (GA in v24.3) partitions inputs and spills to disk. Full sorting merge bounds memory by the sort buffer. Both algorithms guarantee completion regardless of right-side size.&lt;/p&gt;

&lt;h3&gt;
  
  
  Should I denormalize tables in ClickHouse instead of using JOINs?
&lt;/h3&gt;

&lt;p&gt;Conditional. For sub-10ms p99 dashboard queries, flat tables remain faster. For analytical workloads where query readability and storage efficiency matter, normalized star and snowflake schemas with JOINs perform well after the 2024 to 2026 optimizer work.&lt;/p&gt;

</description>
      <category>clickhouse</category>
      <category>database</category>
      <category>performance</category>
      <category>sql</category>
    </item>
  </channel>
</rss>
