starrocks and apache doris are the two engines most often quietly powering the "how is this dashboard sub-second?" moment in 2026 analytics stacks — and the two names a senior data engineer is now expected to place on a whiteboard between Snowflake and ClickHouse without flinching. They are the survivors of a quiet renaissance: MPP, the architecture the cloud warehouse era was supposed to have buried, came back vectorised, decoupled, and engineered for sub-second analytics on top of lakehouse storage.
This guide is the field map for new generation mpp — the starrocks vs doris lineage, the FE / BE query architecture, vectorized execution and SIMD batch processing, the primary key model versus aggregate and duplicate key models, materialized indexes and async materialized views, colocate joins, and the migration patterns from ClickHouse, Snowflake, BigQuery, Druid, and Pinot. Each section pairs a teaching block with a Solution-Tail interview answer — code, a step-by-step trace, an output table, then a concept-by-concept breakdown of why it works.
When you want hands-on reps immediately after reading, drill the SQL aggregation practice library →, rehearse on JOIN problems →, and stack the analytical muscles with window functions drills →.
On this page
- Why MPP came back in 2024
- The new-generation MPP landscape
- StarRocks vs Doris — same DNA, different paths
- Query architecture — vectorized + materialized indexes
- Migration & adoption patterns
- Cheat sheet — StarRocks / Doris recipes
- Frequently asked questions
- Practice on PipeCode
1. Why MPP came back in 2024
The cloud warehouse buried the MPP appliance — then vectorised execution and lakehouse storage dragged it back, leaner
The one-sentence invariant: massively parallel processing never died as an architecture; the on-prem MPP appliance died, and a new generation of vectorised, decoupled-storage MPP engines — StarRocks, Apache Doris, SelectDB — picked up its sub-second BI workloads where Snowflake and BigQuery were too slow or too expensive. Once you internalise that shift, the entire "why is everyone benchmarking against StarRocks?" conversation becomes a history lesson, not a fad.
The classic MPP story in three bullets.
- Greenplum, Vertica, Netezza, early Redshift were shared-nothing, columnar (mostly), and row-at-a-time in their execution loop. They were fast for their era but expensive to scale, painful to elastic-scale, and tightly coupled storage and compute on the same nodes.
- Cloud data warehouses — Snowflake, BigQuery, Redshift RA3, Databricks SQL — solved the elasticity and storage-coupling problem. Compute spun up on demand, storage lived on S3 / GCS, and the analyst experience went from "ticket the DBA" to "open a worksheet."
- The unsolved gap. Cloud warehouses optimise for throughput and elasticity. They do not optimise for the specific workload of "100 BI users hammering a dashboard with sub-second P95 latency expectations." That is where MPP came back.
The new MPP wave in three bullets.
- StarRocks — born in 2020 as a fork of Apache Doris (then called DorisDB), now backed commercially by CelerData. Doubles down on lakehouse acceleration via External Catalog (Iceberg, Hudi, Delta) and aggressive materialized-view rewrites.
- Apache Doris — the original. Donated to the Apache Software Foundation in 2018; governance is community-driven. Doris 2.x ships the same vectorised engine and broadly the same data models, with a different roadmap emphasis (real-time ingestion, MySQL protocol fidelity).
- SelectDB — the commercial entity behind Apache Doris, the way CelerData is behind StarRocks. Same engine, managed service.
Three things that are genuinely different now.
- Vectorised execution. Both engines process batches of columns, not row-by-row tuples. The execution loop runs on Apache-Arrow-shaped chunks and applies SIMD-friendly operators (compare-batch, hash-batch, filter-batch). The CPU stays warm; branch prediction wins; cache lines are used to their full width. That single architectural choice is most of the speed difference vs row-at-a-time MPP.
- Decoupled storage (optional). Modern StarRocks and Doris can run as either tightly-coupled BE storage (the classic MPP shape, for sub-second BI) or against external catalogs over Iceberg / Hudi / Delta in object storage (the lakehouse acceleration shape). One engine, two storage personas.
- Lakehouse-aware optimisation. External Catalog reads Iceberg / Hudi / Delta manifests, prunes by partition, projects only required columns, and rewrites queries through cached materialized views — all before fan-out to the BEs. ClickHouse on S3 can read the data; StarRocks/Doris turn that read into a planned, indexed, cached operation.
Where these engines fit vs the rest.
- vs ClickHouse. ClickHouse is brilliant at single-table aggregations over high-cardinality streams. StarRocks / Doris are stronger at multi-table JOIN, MV rewrite, and lakehouse catalogs.
- vs Druid / Pinot. Druid and Pinot specialise in pre-aggregated, low-latency time-series serving. StarRocks / Doris carry full SQL maturity (window functions, complex JOINs) the moment the workload outgrows pure rollup.
- vs Snowflake / BigQuery. Warehouses dominate elastic ad-hoc and large-scale ETL. New-gen MPP wins the "1k QPS dashboard" and "$/query on lakehouse" axes.
- vs Redshift / Databricks SQL. Both are catching up on vectorised execution; the new-gen MPP engines simply got there earlier and are still iterating faster.
The 2026 adoption signal. Tencent, Trip.com, JD.com, Airbnb, Pinterest, and Shopee have all publicly committed StarRocks deployments. Apache Doris ships in production at WeChat, Meituan, and Xiaomi. The talent market reflects it: "StarRocks experience" is a 2026 data-engineering JD line item, not a research-paper curiosity.
Worked example — the dashboard latency math that brings MPP back
Detailed explanation. A typical BI dashboard hits a fact table joined to three dimensions with a date filter, a group-by region, and a top-N sort. On Postgres that query takes 5 seconds. On Snowflake on a small warehouse it takes 2 seconds. On StarRocks with a sync materialized index it takes 200 ms. The maths is not magic — it is vectorisation × MV rewrite × shuffle elimination.
Question. Estimate the wall-clock latency of the same BI query on three engines (Postgres, Snowflake, StarRocks) given a fact table of 100M rows, three dim joins, and a daily roll-up. Show why StarRocks lands sub-second and why the others do not.
Input.
| Engine | Storage | Execution | Index / MV | Concurrency cost |
|---|---|---|---|---|
| Postgres | row + B-tree | row-at-a-time | partial idx | high lock cost |
| Snowflake small WH | columnar micro-partitions | vectorised, but spin-up + I/O | result cache only | warm WH needed |
| StarRocks | columnar BE + cache | vectorised + SIMD | sync materialized index | colocate group |
Code.
-- The BI query (same on all three engines, modulo dialect)
SELECT
d_date,
r.region,
SUM(f.amount) AS revenue,
COUNT(DISTINCT f.customer_id) AS uniq_customers
FROM fact_orders f
JOIN dim_date d ON f.date_id = d.date_id
JOIN dim_region r ON f.region_id = r.region_id
JOIN dim_product p ON f.product_id = p.product_id
WHERE d.d_date BETWEEN DATE '2026-06-01' AND DATE '2026-06-14'
GROUP BY d_date, r.region
ORDER BY revenue DESC
LIMIT 50;
Step-by-step explanation.
- On Postgres, the engine streams 100M fact rows row-by-row, follows pointers into three dim tables, hashes by group key, sorts the result. CPU spends most of its time on pointer chasing and TupleSlot bookkeeping. Wall-clock ~5 s.
- On Snowflake, the engine scans the relevant micro-partitions (pruned by
d_date), processes columns in batches, joins via hash, and emits the result. The query benefits from columnar I/O but pays a small warehouse warm-up tax and a partition-list parse. Wall-clock ~2 s on a small WH. - On StarRocks, the FE detects a sync materialized index matching the query shape — a pre-aggregated rollup by
(d_date, region_id)with the SUM and COUNT DISTINCT (HyperLogLog) materialized. The query is rewritten to read the MV directly. The BEs run a vectorised scan with SIMD comparisons, the colocate group eliminates the shuffle for thef.region_id = r.region_idjoin, and the result lands in ~200 ms. - The order-of-magnitude jump is not from "faster hardware." It is from MV rewrite (skip the 100M-row scan), SIMD batch processing (more rows per CPU cycle), and shuffle elimination (no cross-node data movement for the join).
Output.
| Engine | Latency P95 | Cost driver |
|---|---|---|
| Postgres (default) | ~5 s | row-at-a-time scan + pointer chasing |
| Snowflake small WH | ~2 s | columnar batch + warehouse warm-up |
| StarRocks (sync MV + colocate) | ~200 ms | MV rewrite + SIMD + no shuffle |
Rule of thumb. When a dashboard's P95 target is sub-second and the underlying fact table is 100M+ rows, the latency budget cannot afford a full scan. Either pre-aggregate (Druid / Pinot), MV-rewrite (StarRocks / Doris), or pay the warehouse warm-up tax (Snowflake / BigQuery). New-gen MPP is the option that keeps full SQL surface area while still landing sub-second.
Worked example — sub-second BI on a cloud warehouse, but at what cost?
Detailed explanation. Snowflake and BigQuery can hit sub-second on a dashboard, but typically only with a large always-on warehouse, a Search Optimisation Service, or aggressive result caching. The dollars per concurrent user climb fast. StarRocks / Doris hit sub-second on the same workload with a smaller compute footprint because they specialise.
Question. A finance team has 200 concurrent BI users on a single dashboard that must respond within 1 second P95. Compare the cost shape of "Snowflake LARGE warehouse always-on" vs "StarRocks 4-node cluster."
Input.
| Configuration | Compute units | Always-on? | Concurrency model |
|---|---|---|---|
| Snowflake LARGE WH | 8 nodes | yes (or 24/7 schedule) | multi-cluster scaling on demand |
| StarRocks 4-node BE | 4 BEs + 3 FEs | yes | shared cluster, MV-cached |
Code (StarRocks DDL sketch).
-- Sync materialized index — auto-maintained on the base table
CREATE MATERIALIZED VIEW mv_revenue_daily_region
DISTRIBUTED BY HASH(region_id) BUCKETS 32
AS
SELECT
d_date,
region_id,
SUM(amount) AS revenue,
HLL_UNION(hll_hash(customer_id)) AS uniq_hll
FROM fact_orders
GROUP BY d_date, region_id;
-- The BI query stays unchanged — FE rewrites it onto the MV
SELECT d_date, region_id, SUM(amount) AS revenue
FROM fact_orders
WHERE d_date >= DATE '2026-06-01'
GROUP BY d_date, region_id;
Step-by-step explanation.
- The Snowflake LARGE WH bill scales with on-time × cluster-count. At 200 concurrent users hitting the same dashboard the result cache absorbs most queries; the warehouse still has to be warm.
- The StarRocks cluster runs continuously at fixed cost (cluster size + storage). The MV is auto-maintained on every base-table commit, so the rewrite is always available without manual refresh schedules.
- The crossover point is workload-dependent. Below ~50 concurrent dashboard QPS, Snowflake on autosuspend often wins because there's no idle cost. Above ~200 concurrent QPS the StarRocks cluster typically wins on $/query because vectorised + MV-rewrite is denser per node.
- Both engines have escape hatches — Snowflake Search Optimisation + dedicated Snowpark; StarRocks External Catalog for lakehouse drag-and-drop. The decision rarely ends "always X" — it ends "X for the dashboard, Y for the ad-hoc analyst."
Output.
| Workload shape | Likely winner |
|---|---|
| Bursty ad-hoc, 5 QPS | Snowflake (idle cost) |
| 24/7 dashboard, 200 concurrent users | StarRocks (per-query cost) |
| Lakehouse query acceleration over Iceberg | StarRocks (External Catalog + MV) |
| Snowflake-native ETL + occasional dashboard | Snowflake (one less system) |
Rule of thumb. Pick by workload shape, not by benchmark spec sheet. A single 200-user dashboard is exactly the workload StarRocks / Doris are designed to specialise on; a 5-QPS ad-hoc analyst pool is not.
Worked example — the lakehouse acceleration pattern
Detailed explanation. Most large analytics teams already store fact data in Iceberg or Hudi tables in object storage. The "lakehouse acceleration" pattern adds StarRocks (or Doris) as a query layer on top of those tables — using External Catalog, async MVs, and FE planning to make queries land sub-second without copying the data into BE storage.
Question. Given an Iceberg fact table in S3 (50TB, partitioned by d_date), describe the StarRocks External Catalog pattern that gives a BI tool sub-second access.
Input.
| Layer | What lives here |
|---|---|
| S3 / Iceberg | fact_orders 50TB, partition by d_date |
| Iceberg catalog | Glue / REST catalog with manifests |
| StarRocks FE | parses query, plans, rewrites onto async MV |
| StarRocks BE | reads Iceberg manifests, scans only needed files |
Code.
-- 1) Mount the Iceberg catalog as an External Catalog
CREATE EXTERNAL CATALOG iceberg_glue
PROPERTIES (
"type" = "iceberg",
"iceberg.catalog.type" = "glue",
"aws.glue.region" = "us-east-1"
);
-- 2) Create an async MV that materialises the dashboard rollup
CREATE MATERIALIZED VIEW mv_daily_region_revenue
REFRESH ASYNC EVERY (INTERVAL 5 MINUTE)
AS
SELECT
d_date,
region_id,
SUM(amount) AS revenue
FROM iceberg_glue.warehouse.fact_orders
GROUP BY d_date, region_id;
-- 3) The BI query is unchanged — FE rewrites onto the MV
SELECT d_date, region_id, SUM(amount)
FROM iceberg_glue.warehouse.fact_orders
WHERE d_date >= CURRENT_DATE - INTERVAL '14' DAY
GROUP BY d_date, region_id;
Step-by-step explanation.
- The External Catalog declaration teaches the FE how to talk to Iceberg. No data is copied into StarRocks storage.
- The async MV materialises the dashboard rollup into StarRocks-managed storage, refreshed every five minutes. Queries that match the MV's shape land in tens of milliseconds; queries that don't fall through to a direct Iceberg read.
- Direct Iceberg reads still benefit from partition pruning (the FE prunes against
d_date), column projection (BE reads onlyregion_idandamount), and SIMD-vectorised scans. - The "5-minute freshness" trade-off is explicit — the team picks the refresh interval that matches their freshness SLA. For sub-minute freshness, a continuous materialized view (Doris feature) or a sync-MV on a smaller hot table is the upgrade path.
Output.
| Workload | Latency |
|---|---|
| Dashboard query matching the MV shape | tens of ms |
| Ad-hoc query missing the MV | direct Iceberg scan, seconds |
| Real-time ingestion into the base table | continues against Iceberg writer; MV refreshes async |
Rule of thumb. When the data already lives in Iceberg / Hudi / Delta, accelerate it with StarRocks External Catalog before considering a copy. The lakehouse acceleration pattern keeps a single source of truth and adds a sub-second serving layer without duplicating the bytes.
Data engineering interview question on MPP renaissance
A senior interviewer often opens with: "Why did the team move the BI dashboard off Snowflake onto StarRocks? Walk me through the latency target, the cost shape, and the architectural lever that flipped the decision." It tests whether the candidate can articulate the specific gap new-gen MPP fills and the trade-offs they pay.
Solution Using a structured latency / cost decision framework
1. Define the latency target (P95).
- Dashboard P95 ≤ 500 ms, 200 concurrent users.
2. Profile the current engine.
- Snowflake LARGE WH always-on: P95 ~1.4 s; cost ~$X/month.
3. Identify the dominant cost driver.
- Always-on warehouse compute, multi-cluster scaling on dashboard load.
4. Pick the lever.
- Sync materialized index + colocate group on the dashboard fact + dim.
- Move the dashboard workload (only) onto StarRocks.
5. Migrate one workload, keep the rest.
- Snowflake remains the ETL + ad-hoc home; StarRocks owns the dashboard.
6. Measure: P95, $/query, concurrent QPS, freshness SLA.
- Compare against the original baseline; expand only if numbers hold.
Step-by-step trace.
| Step | Decision | Why |
|---|---|---|
| 1 | Target P95 ≤ 500 ms | dashboard UX requirement |
| 2 | Current P95 = 1.4 s | profile from query history |
| 3 | Cost = WH compute | autosuspend hurts dashboard, always-on hurts wallet |
| 4 | StarRocks + MV + colocate | MV rewrite skips scan; colocate skips shuffle |
| 5 | Move only the dashboard | keep ETL on Snowflake; minimise blast radius |
| 6 | Measure for 30 days | confirm P95, cost, freshness before scaling |
The structured framework matters more than the engine name. A candidate who says "we picked StarRocks because StarRocks is fast" gets dinged; a candidate who walks the latency math, cost math, and migration scope wins the round.
Output:
| Metric | Before (Snowflake LARGE) | After (StarRocks 4-node + MV) |
|---|---|---|
| P95 dashboard latency | 1.4 s | 180 ms |
| Dashboard $/month | $X | $0.45 X |
| Concurrent users supported | 200 | 200 (with headroom) |
| Freshness SLA | minute | minute (sync MV) |
Why this works — concept by concept:
- Workload specialisation — new-gen MPP wins on the specific shape "many concurrent users, repeating dashboard, sub-second target." It does not displace the warehouse; it carves out the workload the warehouse is least efficient at.
- Materialized index rewrite — the FE intercepts the SQL and routes it onto a pre-aggregated index. The 100M-row scan becomes a 10k-row lookup.
- SIMD vectorisation — the BE processes batches of columns per CPU cycle instead of rows per tick. Cache lines run full, branch mispredictions drop, throughput climbs by an order of magnitude.
- Colocate join shuffles eliminated — when fact and dim share the same bucket key and bucket count, the join happens locally on each BE without exchanging data across the cluster. The shuffle cost — the dominant cost in many warehouse plans — goes to zero.
- Migration blast radius — moving only the dashboard onto StarRocks keeps the rest of the platform on Snowflake. The cost story is honest, the migration is reversible, and the team learns the engine on a single contained workload first.
- Cost — O(rows_in_MV) per query instead of O(rows_in_fact); cluster cost is fixed, not per-query — wins the more you query.
SQL
Topic — aggregation
SQL aggregation problems (analytics SQL)
2. The new-generation MPP landscape
Four families, two axes — latency target and storage model — and new-gen MPP fills the corner the others miss
The mental model in one line: classic MPP (Greenplum / Vertica / early Redshift) is row-by-row vectorised with coupled storage; cloud warehouses (Snowflake / BigQuery) are vectorised with decoupled storage but multi-second latency; real-time OLAP (ClickHouse / Druid / Pinot) is sub-second on streams but weaker on SQL; new-gen MPP (StarRocks / Doris) is the missing corner — sub-second, decoupled-aware, full SQL. Once you can place every engine on those two axes, the architecture interview surface shrinks dramatically.
The four families in one table.
| Family | Examples | Storage | Execution | Latency target | Sweet spot |
|---|---|---|---|---|---|
| Classic MPP | Greenplum, Vertica, Netezza, early Redshift | coupled | row-at-a-time | seconds | on-prem enterprise data warehouse |
| Cloud warehouse | Snowflake, BigQuery, Databricks SQL, Redshift RA3 | decoupled | vectorised | multi-second | elastic SQL on shared storage |
| Real-time OLAP | ClickHouse, Druid, Pinot | coupled or decoupled | vectorised + index | sub-second | streaming pre-aggregates, single-table |
| New-gen MPP | StarRocks, Apache Doris, SelectDB | coupled or decoupled (External Catalog) | vectorised + SIMD + MV rewrite | sub-second | multi-user BI, lakehouse acceleration |
The two axes — latency × storage model.
- Axis 1 — query latency. Sub-second (200 ms) vs multi-second (2-10 s) vs seconds-to-minutes (10 s+). The dashboard / ad-hoc / batch split.
- Axis 2 — storage model. Coupled (engine owns disks) vs decoupled (engine reads object storage) vs hybrid (both modes supported). The "do I own my storage or rent it" split.
Three historical evolution arrows.
- Classic MPP → Cloud warehouse. Storage decoupled from compute; elasticity unlocked; on-prem dies.
- Cloud warehouse → New-gen MPP. Latency optimised; vectorisation + MV rewrite layered on top of cheap object storage; the cost line of "100 dashboards on Snowflake" finds a competitor.
- Real-time OLAP → New-gen MPP. SQL surface matured (window functions, JOIN, MV); the team that started on Druid because nothing else was sub-second now has a full-SQL alternative.
Who's where.
- ClickHouse Cloud — Yandex-born; brilliant single-table aggregation; weakest on JOINs and on cross-table MV; lake-format catalog support catching up.
- Druid — Apache; segments + bitmap indexes; preaggregated time series; OK SQL surface; ingestion-pipeline-heavy.
- Pinot — Apache (LinkedIn-born); star-tree index; real-time + offline segments; great for funnel + filter; SQL maturing.
- StarRocks — vectorised MPP, External Catalog, sync MV + async MV, colocate join. Iceberg-first as of 2026.
- Doris — same DNA; Apache governance; MySQL protocol fidelity; aggressive primary-key model for upserts.
- Snowflake / BigQuery — the dominant cloud warehouses; result-cache + Search Optimisation Service for sub-second on hot data.
Where StarRocks + Doris fit.
- Between cloud warehouse and real-time OLAP. They carry the warehouse SQL surface (window functions, complex JOINs, ANSI compliance) and the OLAP latency target.
- As a query acceleration layer. Most teams keep Snowflake or BigQuery as the system of record and add StarRocks for the few workloads where latency matters.
- As a lakehouse query head. Iceberg / Hudi / Delta data does not need to be copied into StarRocks BE storage; External Catalog reads it in place, with MV-cached rollups on top.
Common interview probes on the landscape.
- "When would you pick ClickHouse over StarRocks?" — single-table high-cardinality streaming aggregations, where you do not need JOINs and you want the absolute fastest per-query throughput.
- "When would you pick StarRocks over Druid?" — when the workload includes JOINs, when SQL maturity matters for analyst self-serve, when the source data lives in Iceberg / Hudi.
- "When would you pick Snowflake over StarRocks?" — when the analyst pool is small and bursty, when freshness is hourly not minute, when the team does not want another system to operate.
- "What does decoupled storage buy you on a new-gen MPP?" — elasticity for the BE tier, separate storage scaling, and the ability to mount the same data through different engines simultaneously.
Worked example — placing your workload on the landscape
Detailed explanation. Pick the engine by placing the workload (not the team) on the two-axis landscape. A team can run multiple engines for multiple workloads; the failure mode is forcing every workload onto one engine because "we chose it last quarter."
Question. Given five common analytics workloads, place each on the latency × storage axes and recommend the best-fit family.
Input.
| Workload | Latency target | Storage | Best-fit family |
|---|---|---|---|
| Finance dashboard (200 users) | sub-second P95 | already on Iceberg | new-gen MPP |
| Streaming clickstream funnel | sub-second on stream | Kafka + segment store | real-time OLAP (Druid / Pinot) |
| Ad-hoc analyst notebook | seconds | warehouse-native | cloud warehouse |
| Quarterly financial close ETL | minutes | warehouse-native | cloud warehouse |
| Single-table 10B-row event count | sub-second on one table | own object store | real-time OLAP (ClickHouse) |
Code (decision sketch).
if latency_target <= 500 ms
and concurrency >= 100
and surface == "multi-table JOIN + window":
pick = "StarRocks / Doris"
elif latency_target <= 500 ms
and shape == "single-table streaming aggregate":
pick = "ClickHouse / Druid / Pinot"
elif latency_target <= 10 s
and shape == "elastic ad-hoc SQL":
pick = "Snowflake / BigQuery"
else:
pick = "warehouse + batch (Spark / Snowpark)"
Step-by-step explanation.
- The finance dashboard is sub-second × multi-table × already on Iceberg → new-gen MPP (StarRocks External Catalog).
- The streaming funnel is sub-second × streaming × single-table-heavy → real-time OLAP (Druid).
- The ad-hoc notebook is seconds × bursty × warehouse-native → cloud warehouse (Snowflake).
- The quarterly ETL is minutes × scheduled × warehouse-native → cloud warehouse.
- The 10B-row event count is sub-second × single-table → ClickHouse.
Output.
| Workload | Pick | Why |
|---|---|---|
| Finance dashboard | StarRocks | multi-table JOIN + Iceberg + sub-second |
| Streaming funnel | Druid | streaming + pre-aggregate |
| Ad-hoc notebook | Snowflake | elastic SQL + analyst tooling |
| Quarterly ETL | Snowflake / BigQuery | batch SQL + warehouse-native |
| 10B-row count | ClickHouse | single-table aggregate at scale |
Rule of thumb. Place the workload, not the team. A modern stack is plural — one warehouse, one MPP query head, one real-time OLAP — and the data engineer's job is routing each workload to the right engine. The interview red flag is the candidate who insists every workload belongs on whichever engine they happened to learn first.
Worked example — the latency budget by tier
Detailed explanation. Sub-second is not one number. It is a budget split across network, parse, plan, scan, compute, and serialisation. Each engine spends the budget differently, and an interviewer who probes "where does the 200 ms go?" rewards candidates who can break it down.
Question. Decompose a 200 ms StarRocks query into its budget components. What is the dominant cost, and where is the lever?
Input.
| Component | Typical budget (StarRocks MV-hit) |
|---|---|
| Client → FE network | 5 ms |
| FE parse + plan + MV rewrite | 10 ms |
| FE → BE dispatch | 5 ms |
| BE scan + SIMD execute | 100 ms |
| BE → FE merge | 30 ms |
| Result serialisation | 40 ms |
| Total | ~190 ms |
Code (instrumentation idiom).
-- Force a profile capture for the next query
SET enable_profile = true;
SELECT d_date, SUM(amount) FROM fact_orders WHERE d_date = '2026-06-14';
-- Read the profile (StarRocks)
SHOW PROFILELIST;
SHOW PROFILE FOR <id>;
Step-by-step explanation.
- The BE scan + SIMD execute is usually the dominant cost. SIMD batches push throughput up, but the budget still scales with rows visited.
- MV rewrite is the biggest lever — it changes the BE budget from "scan 100M rows" to "scan 10k MV rows." That alone is the difference between 5 s and 200 ms.
- Colocate join eliminates the FE → BE shuffle pass for the join, shaving a second pass through the data.
- Result serialisation and the merge phase are constant-ish; they bound how low the total can go once the BE work is small.
Output.
| Lever | Latency saving |
|---|---|
| MV rewrite | 90%+ on hit |
| Colocate join | ~30% on shuffle-heavy queries |
| SIMD batch | ~5-10x vs row-at-a-time |
| Cache (result + page) | repeat queries near-zero |
Rule of thumb. Before benchmarking a new-gen MPP engine, instrument the profile of your representative query. The "200 ms median" headline number is rarely the right comparison; the per-phase breakdown is.
Data engineering interview question on positioning new-gen MPP
A senior interviewer might ask: "You have to introduce StarRocks to a team currently on Snowflake + ClickHouse. How do you carve the workload split without firing either incumbent?" It tests whether the candidate can keep multiple engines coexisting.
Solution Using a workload-routing carve-out
1. Inventory the workloads.
- ETL → keep on Snowflake.
- Ad-hoc analyst → keep on Snowflake.
- Single-table click counters → keep on ClickHouse.
- Multi-table BI dashboards (sub-second) → move to StarRocks.
- Lakehouse rollups over Iceberg → move to StarRocks (External Catalog).
2. Define the migration unit.
- One dashboard at a time, behind a feature flag.
3. Hold the freshness SLA.
- Async MV refresh every 5 min ≈ existing Snowflake hourly refresh.
4. Measure P95 + $/month per workload before / after.
5. Phase the cut-over.
- 30-day overlap; rollback path on every step.
Step-by-step trace.
| Workload | Engine before | Engine after | Why |
|---|---|---|---|
| Nightly ETL | Snowflake | Snowflake | batch SQL + warehouse-native; no benefit moving |
| Ad-hoc analyst notebook | Snowflake | Snowflake | bursty + small concurrency |
| Clickstream count (single table) | ClickHouse | ClickHouse | streaming aggregate; already sub-second |
| Finance BI dashboard | Snowflake | StarRocks | sub-second target + 200 concurrent users |
| Iceberg rollup for marketing | Snowflake | StarRocks (External Catalog) | lakehouse acceleration without copy |
The carve-out keeps incumbents on what they are best at and adds StarRocks only on the workloads where new-gen MPP wins clearly. No engine gets fired; one engine gets focused.
Output:
| Phase | Workloads moved | Outcome |
|---|---|---|
| Phase 1 | Finance dashboard | P95 1.4 s → 180 ms |
| Phase 2 | Iceberg rollup | minute-scale → sub-second |
| Phase 3 | optional expansion | drive next decision by data |
Why this works — concept by concept:
- Carve-out, not replacement — moving one workload at a time keeps the blast radius small. Each migration ships with metrics + rollback. The platform becomes plural, not fragile.
- Workload routing as architecture — modern analytics stacks treat the engine layer as a routing decision. The data engineer's craft is matching workload shape to engine sweet spot, not enforcing a monoculture.
- Freshness SLA as constraint — the async-MV refresh interval has to clear the freshness bar the business already accepts. Picking the right MV cadence is half the migration design.
- P95 + $/month as the comparison — single-number latency averages hide the dashboard pain; P95 is what users feel. Cost is what finance sees. Both must improve to justify a new engine.
- Phase the cut-over — overlap windows + feature flags + rollback paths are the difference between a successful migration and an after-hours rollback ticket. New-gen MPP runs alongside Snowflake until the team is sure.
- Cost — per workload; the StarRocks cluster pays back only on workloads where it wins clearly. Run the math per workload, not per platform.
SQL
Topic — joins
JOIN problems (analytics SQL)
3. StarRocks vs Doris — same DNA, different paths
One engine, forked in 2020 — convergent execution, divergent governance and lakehouse story
The mental model in one line: StarRocks and Apache Doris share a common ancestor (DorisDB ~2020) and a common execution engine shape (vectorised, MPP, FE/BE topology), but diverged on governance (CelerData-led commercial vs Apache Software Foundation) and on lakehouse-catalog emphasis (StarRocks pushed Iceberg / Hudi / Delta hardest). Once you can recite that fork in two sentences, every "StarRocks vs Doris" interview probe falls into one of two buckets — technical (mostly converged) or organisational (still very different).
The lineage in one paragraph. Apache Doris began at Baidu in 2008 as the Palo project, was open-sourced as Apache Doris, and donated to the Apache Software Foundation in 2018. In 2020 a fork created DorisDB (later renamed StarRocks), backed commercially by CelerData. Doris 2.x (the ASF line) and StarRocks 3.x (the CelerData line) both carry the original vectorised engine but have evolved with different priorities: Doris emphasises real-time ingestion and MySQL-protocol fidelity; StarRocks emphasises lakehouse catalogs and materialized-view rewrite.
What stayed the same.
- Vectorised execution engine. Both process batches of columns; both use SIMD-friendly operators.
- MPP FE / BE topology. Both have a frontend coordinator tier (parser, planner, dispatcher, catalog) and a backend worker tier (storage, scan, execute).
- Columnar storage. Both store rows column-by-column with per-column compression.
- Data models. Both support Aggregate Key, Duplicate Key, and Primary Key models.
- MySQL protocol. Both speak the MySQL wire protocol — drop-in driver compatibility on the client side.
What diverged.
- Governance. Doris is Apache Software Foundation; PMC + committers + release cadence on the ASF charter. StarRocks is CelerData-led; community is open-source-friendly but the roadmap is set by the commercial entity.
- Catalog integrations. StarRocks added Iceberg / Hudi / Delta / Hive Metastore / JDBC catalogs ahead of Doris's equivalent push. Doris caught up on most of these in 2.x but StarRocks remains the Iceberg-first answer in 2026.
- Materialized view rewrite. Both ship sync MVs and async MVs. StarRocks pushed harder on view-matching for arbitrary queries (recognises a query that almost matches a known MV and rewrites it).
- Real-time ingestion. Doris emphasises Routine Load + Stream Load for sub-minute freshness; the primary-key model handles upserts efficiently.
- Commercial support. CelerData (StarRocks) and SelectDB (Doris) each sell managed cloud + enterprise support. The choice often comes down to "which vendor relationship makes sense."
Versions to know.
- Apache Doris 2.0 / 2.1 — vectorised engine, primary key model, lakehouse catalogs, sub-minute freshness.
- StarRocks 3.x — External Catalog (Iceberg, Hudi, Delta), async + sync MV rewrite, shared-data architecture (storage-compute separation in a single cluster).
Why interviews now ask "StarRocks vs Doris."
- Both engines show up in production; a candidate who only knows one half is half-equipped.
- The technical core is converged, so the interviewer can test depth on either engine and probe the organisational lens (governance, vendor) to surface architectural maturity.
- The lakehouse catalog story differentiates StarRocks today; that is a senior probe.
Worked example — translating a DDL between StarRocks and Doris
Detailed explanation. A team has a StarRocks Primary Key table and needs the Doris equivalent because a sister team standardised on Doris. The DDL is almost identical — both engines inherit the model from the shared ancestor — but a few keywords differ.
Question. Translate the following StarRocks Primary Key DDL into Doris. Highlight every keyword that differs.
Input.
| Field | Type |
|---|---|
| user_id | BIGINT |
| VARCHAR(128) | |
| signup_ts | DATETIME |
| status | VARCHAR(16) |
Code.
-- StarRocks 3.x Primary Key
CREATE TABLE users_sr (
user_id BIGINT NOT NULL,
email VARCHAR(128) NOT NULL,
signup_ts DATETIME NOT NULL,
status VARCHAR(16)
)
PRIMARY KEY(user_id)
DISTRIBUTED BY HASH(user_id) BUCKETS 16
PROPERTIES (
"replication_num" = "3",
"enable_persistent_index" = "true"
);
-- Apache Doris 2.x — UNIQUE KEY is the equivalent contract;
-- a Doris UNIQUE table in MERGE_ON_WRITE mode behaves like StarRocks PK.
CREATE TABLE users_doris (
user_id BIGINT NOT NULL,
email VARCHAR(128) NOT NULL,
signup_ts DATETIME NOT NULL,
status VARCHAR(16)
)
UNIQUE KEY(user_id)
DISTRIBUTED BY HASH(user_id) BUCKETS 16
PROPERTIES (
"replication_num" = "3",
"enable_unique_key_merge_on_write" = "true"
);
Step-by-step explanation.
-
PRIMARY KEY(user_id)in StarRocks ↔UNIQUE KEY(user_id)in Doris. Same semantics: per-key upsert + delete; the latest version wins. - The
enable_persistent_indexproperty in StarRocks is the equivalent ofenable_unique_key_merge_on_writein Doris; both control how the engine maintains the in-memory + on-disk key index for fast upserts. -
DISTRIBUTED BY HASH(...) BUCKETS Nis identical between engines. The bucket key choice determines colocate join opportunities and shuffle behaviour. -
replication_numis the same property name on both — the number of BE replicas for the tablet.
Output.
| Concept | StarRocks | Doris |
|---|---|---|
| Upsert table | PRIMARY KEY(...) |
UNIQUE KEY(...) |
| Fast-upsert toggle | enable_persistent_index |
enable_unique_key_merge_on_write |
| Bucket layout | DISTRIBUTED BY HASH BUCKETS |
DISTRIBUTED BY HASH BUCKETS |
| Replicas | replication_num |
replication_num |
Rule of thumb. When porting between StarRocks and Doris, expect 90% of the DDL to copy-paste cleanly. The 10% that differs is the upsert-table keyword (PRIMARY KEY vs UNIQUE KEY) and the index-mode toggle name. Memorise that pair and most migrations become mechanical.
Worked example — choosing the engine by governance + lakehouse needs
Detailed explanation. Most "StarRocks vs Doris" decisions are not settled by raw benchmarks — they are settled by governance preference and lakehouse-catalog needs. A team that wants Apache Software Foundation oversight picks Doris; a team that wants Iceberg-first plus commercial support picks StarRocks.
Question. A platform team is choosing between StarRocks and Doris for their new analytics engine. Build a decision matrix that captures the non-benchmark axes.
Input.
| Axis | Question |
|---|---|
| Governance | ASF preferred? |
| Catalog needs | Iceberg / Hudi / Delta first day? |
| Commercial support | Managed service or self-hosted? |
| Real-time ingest | Sub-minute freshness on upserts? |
| Existing skill set | MySQL-protocol familiarity? |
Code (decision sketch in pseudo-SQL).
WITH decision AS (
SELECT
CASE
WHEN governance_pref = 'ASF'
OR ingest_freshness = 'sub_minute'
THEN 'Doris'
WHEN catalog_first IN ('iceberg', 'hudi', 'delta')
OR vendor_relationship = 'CelerData'
THEN 'StarRocks'
ELSE 'either — pick by team familiarity'
END AS recommend
)
SELECT recommend FROM decision;
Step-by-step explanation.
- If governance preference is ASF-first, the answer is Doris regardless of other axes. Some organisations require ASF-licensed projects for compliance.
- If the data lives in Iceberg / Hudi / Delta from day one and the team needs External Catalog now, StarRocks is the lower-friction answer in 2026.
- If real-time ingest with sub-minute upserts is the primary requirement, Doris's Routine Load + primary-key model is the slightly more polished path.
- If neither axis is decisive, pick by team familiarity. Both engines run the same workloads in production at scale.
Output.
| Profile | Pick |
|---|---|
| Compliance team prefers ASF | Doris |
| Iceberg lakehouse + CelerData support | StarRocks |
| MySQL-shop with sub-minute upserts | Doris |
| Multi-source ingest (Hudi + Iceberg + Hive) | StarRocks |
| Greenfield no constraints | either — pick by team |
Rule of thumb. Decide on the governance + lakehouse axes first. Once those constraints are settled, the remaining technical choice is small and largely reversible (DDL ports in a sprint).
Worked example — when the answer is "both, behind a façade"
Detailed explanation. A few large platforms run both StarRocks and Doris in production, behind a SQL-routing façade. The shape works because both engines speak the MySQL protocol and share most of the data-model surface — the façade can route by workload signature without leaking engine choice to analysts.
Question. Sketch a façade design that routes BI dashboard queries to StarRocks (for MV rewrite + Iceberg catalog) and real-time upsert queries to Doris (for sub-minute freshness).
Input.
| Workload signature | Engine |
|---|---|
| SELECT with sub-second SLA + Iceberg catalog | StarRocks |
| INSERT / UPDATE / DELETE with sub-minute SLA | Doris |
| Internal CTAS for nightly rollups | either |
Code (routing pseudo-code).
def route(sql: str) -> str:
sig = parse_signature(sql)
if sig.is_select and sig.touches_iceberg:
return "starrocks"
if sig.is_dml and sig.freshness_sla_seconds <= 60:
return "doris"
if sig.is_ctas and sig.is_internal:
return least_loaded_engine()
return "starrocks" # default
Step-by-step explanation.
- The façade parses each SQL into a workload signature (read vs write, lakehouse vs internal, freshness SLA).
- Reads against Iceberg / Hudi / Delta route to StarRocks for External Catalog + MV rewrite.
- Writes with sub-minute freshness route to Doris for Primary Key (UNIQUE KEY) upsert speed.
- Internal CTAS jobs can go to either; load-balance.
- The default route is StarRocks for the "I do not know" case — most platforms tune this to whichever engine they introduced first.
Output.
| Signature | Destination |
|---|---|
| Dashboard read over Iceberg | StarRocks |
| Upsert into users table | Doris |
| Nightly rollup CTAS | round-robin |
| Untagged query | default StarRocks |
Rule of thumb. "Both, behind a façade" is overkill for most teams. Pick one engine and move on. The two-engine pattern is justified only when the workload split is large, the team is large, and the operational overhead of two engines is genuinely smaller than the engineering cost of forcing one to do both jobs.
Data engineering interview question on the StarRocks / Doris fork
A senior interviewer often opens with: "Why did StarRocks fork from Doris in 2020, and how does that history affect how you'd pick between them today?" It tests whether the candidate can pair the technical lineage with the organisational consequences.
Solution Using a paired technical + organisational answer
Technical lineage:
- 2008: Baidu Palo project.
- ~2017: open-sourced as Apache Doris (Incubator).
- 2018: Doris donated to the Apache Software Foundation.
- 2020: DorisDB fork (commercial), later renamed StarRocks; CelerData formed.
- 2026: Apache Doris 2.x and StarRocks 3.x are the production lines.
Organisational consequence:
- Apache Doris == ASF governance; community-driven roadmap.
- StarRocks == CelerData-led; commercial backing + accelerated catalog roadmap.
Picking today:
- Compliance prefers ASF -> Doris.
- Lakehouse + commercial support -> StarRocks.
- Otherwise -> pick by team familiarity; both run the same workloads.
Step-by-step trace.
| Year | Event | Why it matters |
|---|---|---|
| 2008 | Baidu Palo | shared ancestor of both engines |
| 2018 | Doris → ASF | governance becomes ASF |
| 2020 | StarRocks fork | commercial backing diverges |
| 2026 | both production-grade | technical surface converged again |
The trace lets a candidate hand-wave less and show the lineage as a sequence of decisions, not a vibe.
Output:
| Profile | Recommendation |
|---|---|
| Strict ASF preference | Apache Doris |
| Iceberg-first lakehouse | StarRocks |
| Existing MySQL-shop, real-time upserts | Apache Doris |
| Vendor support + managed service | StarRocks (CelerData) or Doris (SelectDB) |
Why this works — concept by concept:
- Shared ancestor — both engines inherit the vectorised MPP core from the 2008-2020 Palo / Doris evolution. Technical depth on one transfers cleanly to the other.
- Governance fork — the 2020 split is not a technical schism. It is an organisational one. Knowing the year and the reason is senior signalling.
- Convergent capabilities — both engines ship vectorised execution, MPP FE/BE, columnar storage, and three data models. The 80% the candidate cares about is the same.
- Divergent emphasis — StarRocks pushed lakehouse catalogs first; Doris pushed real-time upserts and ASF governance. Both have caught up partially on each other, but the order of emphasis still shows in the default UX.
- Pick by axis, not benchmark — the right decision is governance + catalog needs first, benchmarks last. Benchmarks shift quarterly; governance and catalog priorities do not.
- Cost — same operational footprint per engine; the cost difference is the vendor relationship and the support SLA, not raw compute.
SQL
Topic — window functions
Window function problems (analytics SQL)
4. Query architecture — vectorized + materialized indexes
FE plans + rewrites; BE scans + SIMDs; materialized index rewrites the query; colocate join eliminates the shuffle — that is the whole architecture
The mental model in one line: the FE is the brain (parser, planner, MV rewriter, dispatcher) and the BE is the muscle (columnar storage, vectorised SIMD execution, colocate-aware joins). A query becomes sub-second when the FE rewrites it onto a materialized index and the BE runs the scan with no shuffle. Once you can recite that loop, you can defend any architectural probe on StarRocks or Doris.
The FE / BE topology.
- FE (Frontend). Stateless query coordinator. Parses SQL, builds the logical plan, applies MV rewrite, builds the distributed physical plan, dispatches fragments to BEs, merges results, returns to the client. Runs the metadata catalog (table schemas, MV definitions, partitions). Typically 3 or 5 nodes, odd count for quorum.
- BE (Backend). Stateful worker. Owns tablets (the unit of distribution), runs the vectorised execution engine, applies SIMD-friendly operators, services external catalog reads, returns intermediate results. Typically scales horizontally; each tablet replicates ×3 by default.
- CN (Compute Node). Optional stateless compute layer for shared-data deployments — separates compute scaling from storage scaling.
Vectorised execution in five bullets.
-
Columnar batches. Operators consume
Chunkobjects — Apache-Arrow-shaped buffers of columns. - SIMD operators. Compare, hash, filter, project, aggregate — every operator is implemented to use CPU vector intrinsics when available.
- Branch-light hot loop. Each operator's tight loop runs over fixed-width column buffers with predictable access patterns. Branch prediction wins; cache hit rate climbs.
- Pipeline execution. Operators connect via lockless ring buffers; the engine runs multiple stages in parallel within a single BE.
- Late materialisation. The engine reads only the columns it needs, when it needs them; columns absent from filters or projections are never decompressed.
The three data models.
- Duplicate Key model. Like a classic append-only log. Every row is kept; sorted by a key for scan locality. Best for raw events and append-only fact tables.
- Aggregate Key model. Pre-aggregates on write. The key columns identify the bucket; non-key columns carry aggregate functions (SUM, MAX, MIN, REPLACE, BITMAP_UNION, HLL_UNION). Same key rows collapse on flush. Best for rollup tables.
- Primary Key model (StarRocks) / Unique Key model (Doris). Per-key upsert; the latest write wins. The engine maintains a key index for fast point lookups and updates. Best for dimension tables and any table that needs real-time upserts.
Materialized index (sync MV).
- Definition. A sync MV is a pre-computed projection / aggregation on the same base table, maintained synchronously on every write. The FE rewrites matching queries to read the MV instead of the base table.
-
Use case. A dashboard rolls up
fact_ordersto(d_date, region_id). Define a sync MV with that grouping; the FE auto-rewrites the dashboard query. - Cost. Extra storage proportional to the MV row count; extra write amplification on the base table (proportional to MV size).
Async materialized views.
- Definition. A pre-computed result of an arbitrary query, refreshed on a schedule (every N minutes) or on a base-table change trigger. Lives in its own storage; the FE rewrites matching queries.
- Use case. A lakehouse rollup over Iceberg — refresh every 5 minutes; queries hit the cached MV at sub-second speed.
- Cost. Compute on each refresh; storage for the MV result.
Colocate join + bucket join.
- Colocate group. Two or more tables in the same group share the same bucket key and bucket count, and their tablets are placed on the same BEs. Joins on the bucket key happen locally — no cross-BE data movement.
- Bucket join. When two tables share the same bucket layout but are not in a colocate group, the engine can still run a bucket-level join with reduced shuffle.
- Shuffle elimination is the lever. Many dashboard queries become sub-second only when the shuffle phase is eliminated.
Common interview probes on architecture.
- "Walk through what happens when a SELECT lands on the FE." — parse, plan, MV-rewrite check, physical plan, dispatch, BE vectorised execute, merge.
- "When should a table use Primary Key vs Aggregate Key?" — Primary Key for upsert-heavy dimension tables; Aggregate Key for write-then-rollup fact tables.
- "What is the difference between a sync MV and an async MV?" — sync MV maintained synchronously on every write (small enough to keep hot); async MV refreshed on schedule (good for lakehouse rollups).
- "How does colocate join eliminate the shuffle?" — both tables share the same bucket layout and placement, so the join runs per-BE without cross-network data movement.
Worked example — defining a sync materialized index
Detailed explanation. The classic BI workload is "show daily revenue per region for the last 14 days." Without an MV, every dashboard refresh scans the full fact table. A sync materialized index pre-aggregates by (d_date, region_id) and the FE auto-rewrites the query.
Question. Given the fact_orders table, define a sync MV that accelerates the daily revenue-per-region rollup and show what query the FE rewrites onto it.
Input.
| Column | Type |
|---|---|
| order_id | BIGINT |
| d_date | DATE |
| region_id | INT |
| customer_id | BIGINT |
| amount | DECIMAL(18,2) |
Code.
-- Base table — Duplicate Key, sorted by date + region for scan locality
CREATE TABLE fact_orders (
order_id BIGINT NOT NULL,
d_date DATE NOT NULL,
region_id INT NOT NULL,
customer_id BIGINT NOT NULL,
amount DECIMAL(18,2)
)
DUPLICATE KEY(d_date, region_id, order_id)
PARTITION BY RANGE(d_date) ()
DISTRIBUTED BY HASH(region_id) BUCKETS 32;
-- Sync materialized index — maintained on every write
CREATE MATERIALIZED VIEW mv_revenue_daily_region AS
SELECT
d_date,
region_id,
SUM(amount) AS revenue
FROM fact_orders
GROUP BY d_date, region_id;
-- The BI query — FE rewrites this to read the MV
SELECT
d_date,
region_id,
SUM(amount) AS revenue
FROM fact_orders
WHERE d_date BETWEEN DATE '2026-06-01' AND DATE '2026-06-14'
GROUP BY d_date, region_id
ORDER BY revenue DESC
LIMIT 20;
Step-by-step explanation.
- The base table is Duplicate Key — every row preserved, sorted by
(d_date, region_id, order_id)for scan locality on the most common filters. - The sync MV pre-aggregates revenue by
(d_date, region_id). The engine maintains it on every base-table write; for a busy table this adds a small write amplification but pays back massively on read. - The FE detects that the BI query's GROUP BY matches the MV's GROUP BY (modulo a WHERE on
d_date) and rewrites the plan to read the MV instead offact_orders. - The rewritten plan scans
mv_revenue_daily_region— a much smaller table — applies thed_datefilter, sorts, limits. Wall-clock drops by an order of magnitude.
Output.
| Query phase | Plan |
|---|---|
| Without MV | scan 100M rows of fact_orders, GROUP BY, sort, limit |
| With MV rewrite | scan 14×N rows of mv_revenue_daily_region, filter, sort, limit |
| Latency before | ~3 s |
| Latency after | ~120 ms |
Rule of thumb. Every dashboard with a fixed rollup shape deserves a sync MV. The write amplification is bounded; the read latency win is order-of-magnitude. Audit your top-10 dashboard queries and define MVs for the recurring GROUP BY shapes.
Worked example — colocate group eliminates the join shuffle
Detailed explanation. When fact_orders and dim_region share a colocate group on region_id, the engine places their tablets on the same BEs and the join runs locally — no cross-network shuffle. For a 100M-row fact joined to a 1k-row dim, the shuffle elimination can be the difference between 800 ms and 150 ms.
Question. Define a colocate group between fact_orders and dim_region on region_id, and explain why the join no longer shuffles.
Input.
| Table | Distribution | Bucket count |
|---|---|---|
| fact_orders | HASH(region_id) | 32 |
| dim_region | HASH(region_id) | 32 |
Code.
CREATE TABLE dim_region (
region_id INT NOT NULL,
name VARCHAR(64),
continent VARCHAR(32)
)
PRIMARY KEY(region_id)
DISTRIBUTED BY HASH(region_id) BUCKETS 32
PROPERTIES (
"colocate_with" = "g_region"
);
CREATE TABLE fact_orders (
order_id BIGINT NOT NULL,
d_date DATE NOT NULL,
region_id INT NOT NULL,
customer_id BIGINT NOT NULL,
amount DECIMAL(18,2)
)
DUPLICATE KEY(d_date, region_id, order_id)
PARTITION BY RANGE(d_date) ()
DISTRIBUTED BY HASH(region_id) BUCKETS 32
PROPERTIES (
"colocate_with" = "g_region"
);
-- Joins on region_id now run colocated — no shuffle
SELECT
r.continent,
SUM(f.amount) AS revenue
FROM fact_orders f
JOIN dim_region r ON f.region_id = r.region_id
WHERE f.d_date >= DATE '2026-06-01'
GROUP BY r.continent;
Step-by-step explanation.
- Both tables share
colocate_with = "g_region"and bucket layoutHASH(region_id) BUCKETS 32. The cluster places matching tablets on the same BEs. - For each BE, the engine has both the
fact_orderstablet for some hash range ofregion_idand the correspondingdim_regiontablet. The join runs locally on the BE. - Without the colocate group, the cluster would shuffle either
fact_orders(huge) ordim_region(small but still a network round-trip) before joining. With colocate, neither moves. - The cost: every table in the group must keep the same bucket count and key. Adding or rebalancing buckets requires coordination across the group.
Output.
| Plan | Shuffle | Latency on 100M-row join |
|---|---|---|
| no colocate | shuffle smaller side | ~800 ms |
| colocate group | none | ~150 ms |
Rule of thumb. For every fact-to-dim join that runs on the dashboard hot path, put both tables in a colocate group on the join key. The DDL discipline (matching bucket layouts) is the price; the shuffle elimination is the payout.
Worked example — Primary Key model for fast upserts
Detailed explanation. A users dimension is upserted from a CDC stream. A Duplicate Key model would accumulate every version; an Aggregate Key model could pick MAX of each column but is awkward for strings. The Primary Key model (StarRocks) / Unique Key model (Doris) gives "latest version wins per key" with millisecond upserts.
Question. Build a Primary Key table for a CDC-fed users dimension and show the upsert path.
Input.
| user_id | name | status | op | ts | |
|---|---|---|---|---|---|
| 1 | a@x.com | Alice | active | INSERT | T0 |
| 2 | b@x.com | Bob | active | INSERT | T0 |
| 1 | a@x.com | Alice | inactive | UPDATE | T1 |
Code.
CREATE TABLE users_dim (
user_id BIGINT NOT NULL,
email VARCHAR(128) NOT NULL,
name VARCHAR(64) NOT NULL,
status VARCHAR(16) NOT NULL,
updated_ts DATETIME
)
PRIMARY KEY(user_id)
DISTRIBUTED BY HASH(user_id) BUCKETS 16
PROPERTIES (
"replication_num" = "3",
"enable_persistent_index" = "true"
);
-- Routine Load from Kafka — Primary Key applies upserts in place
CREATE ROUTINE LOAD users_cdc_load ON users_dim
COLUMNS(user_id, email, name, status, updated_ts)
PROPERTIES (
"format" = "json",
"jsonpaths" = "[\"$.user_id\",\"$.email\",\"$.name\",\"$.status\",\"$.updated_ts\"]"
)
FROM KAFKA (
"kafka_broker_list" = "broker:9092",
"kafka_topic" = "users_cdc"
);
Step-by-step explanation.
- The Primary Key model maintains a per-tablet key index. On every write, the engine looks up the key, marks the previous version as deleted, and writes the new version.
- Reads see only the latest version. The index makes point lookups (
WHERE user_id = ?) fast and upserts O(1)-ish per key. - The Routine Load streams JSON from Kafka; for each record the engine applies the upsert. No staging table, no MERGE statement, no compaction job to schedule.
- The Aggregate Key model would handle this with REPLACE on every non-key column, but loses the index. The Primary Key model is the right shape for "I will read individual rows by key" workloads.
Output.
| Read query | Result |
|---|---|
| SELECT * FROM users_dim WHERE user_id = 1 | (user_id=1, ..., status='inactive', updated_ts=T1) |
| COUNT(*) FROM users_dim | 2 |
Rule of thumb. Use Primary Key (or Doris Unique Key) for any table that is upserted from a CDC stream or that needs fast point lookups. Use Duplicate Key for raw event logs. Use Aggregate Key when the rollup is the only read shape you care about.
Worked example — async MV for lakehouse rollup
Detailed explanation. Fact data lives in Iceberg. A nightly dashboard wants the last 30 days of daily revenue per region. An async materialized view refreshed every 5 minutes pre-computes the rollup against the Iceberg table; the dashboard reads from the MV at sub-second speed.
Question. Define an async MV over an Iceberg fact table and show the query that the FE rewrites onto it.
Input.
| Layer | Object |
|---|---|
| Iceberg catalog | iceberg_glue.warehouse.fact_orders |
| External Catalog mount | iceberg_glue |
| Internal MV storage | mv_iceberg_daily_revenue (managed by StarRocks) |
Code.
CREATE MATERIALIZED VIEW mv_iceberg_daily_revenue
REFRESH ASYNC EVERY (INTERVAL 5 MINUTE)
PROPERTIES (
"replication_num" = "3"
)
AS
SELECT
d_date,
region_id,
SUM(amount) AS revenue
FROM iceberg_glue.warehouse.fact_orders
WHERE d_date >= CURRENT_DATE - INTERVAL '30' DAY
GROUP BY d_date, region_id;
-- Dashboard query — FE rewrites onto the MV
SELECT
d_date,
region_id,
SUM(amount) AS revenue
FROM iceberg_glue.warehouse.fact_orders
WHERE d_date BETWEEN CURRENT_DATE - INTERVAL '14' DAY AND CURRENT_DATE
GROUP BY d_date, region_id;
Step-by-step explanation.
- The async MV is defined against the External Catalog's Iceberg table. StarRocks stores the materialised result inside its own BE storage, replicated 3 times.
- Every 5 minutes the engine refreshes the MV by re-running the underlying query and replacing the contents. The 5-minute freshness is explicit; staleness is bounded.
- When the dashboard query lands, the FE checks the MV catalogue, recognises the matching shape (with a tighter date filter), and rewrites the plan onto the MV.
- The dashboard latency drops from "Iceberg manifest read + 30-day scan" (~3 s) to "MV scan with 14-day filter" (~80 ms).
Output.
| Without MV | With async MV |
|---|---|
| ~3 s P95 | ~80 ms P95 |
Rule of thumb. For every recurring dashboard query over a lakehouse table, define an async MV with the refresh cadence that clears the freshness SLA. The MV is cheap to define, transparent to the analyst, and the rewrite is automatic.
Data engineering interview question on architecture
A senior interviewer might frame: "Walk me through every architectural lever StarRocks uses to make a dashboard query land at 200 ms instead of 5 seconds." The answer should cover FE rewrite, vectorisation, MV, colocate, and the data model choice.
Solution Using the full FE → BE pipeline trace
1. FE receives SQL.
2. FE parses + optimises + checks MV catalogue.
- If a sync MV matches -> rewrite to read MV.
- Else if an async MV matches -> rewrite to read async MV.
3. FE builds physical plan with colocate detection.
- Same bucket key + count + colocate group => local join.
4. FE dispatches fragments to BEs.
5. Each BE runs vectorised SIMD execution.
- Reads only the projected columns.
- Applies filters in vectorised batches.
- Hash join locally (colocate) or via shuffle.
6. BEs return intermediate results to the FE.
7. FE merges + serialises + returns.
Step-by-step trace.
| Phase | Lever | Latency saving |
|---|---|---|
| FE rewrite | sync MV | 90%+ on hit |
| FE rewrite | async MV | 90%+ on hit, with freshness tax |
| Physical plan | colocate | ~30% on shuffle-heavy joins |
| BE execute | SIMD vectorised | 5-10× vs row-at-a-time |
| BE execute | late materialisation | columns not read are free |
| BE execute | Primary Key index | sub-millisecond point lookup |
The trace gives the interviewer a complete narrative — not "StarRocks is fast" but a sequence of measurable optimisations, each with its own owner and lever.
Output:
| Lever | Levers when to use |
|---|---|
| Sync MV | recurring rollup, sub-minute freshness |
| Async MV | recurring lakehouse rollup, ≤ minute freshness |
| Colocate group | fact↔dim join on hot path |
| Primary Key model | upsert-heavy dimension |
| Aggregate Key model | write-then-rollup fact |
| Duplicate Key model | raw event log |
Why this works — concept by concept:
- FE rewrite is the biggest single lever — the moment the FE recognises a matching MV, the query plan shrinks from "scan 100M rows" to "scan 10k rows." Most sub-second dashboards are MV-rewritten dashboards.
- Vectorised SIMD execution — the BE processes batches of columns per CPU cycle. Branch-light hot loops, cache-friendly access, full-width SIMD intrinsics. Order-of-magnitude over row-at-a-time MPP.
- Colocate group eliminates the shuffle — same bucket key + count + placement means the join runs per-BE. The shuffle, often the dominant cost of distributed JOIN, drops to zero.
- Data model choice matches workload shape — Primary Key for upserts, Aggregate Key for rollup-on-write, Duplicate Key for raw events. The wrong model multiplies storage cost and slows reads.
- Late materialisation — columns not in the filter or projection are not decompressed. The "columnar" win is wider than it sounds; many queries touch fewer than a third of a wide table's columns.
- Cost — each lever has its own price: MVs cost storage + write amplification; colocate constrains bucket layouts; Primary Key costs index memory. The architectural craft is picking which lever to spend on which workload.
SQL
Topic — optimization
Query optimization problems
5. Migration & adoption patterns
Migrate one workload, not the platform — and route by source system, not by team preference
The mental model in one line: the right migration is the one workload at a time pattern. ClickHouse migrations lift-and-shift on MergeTree-style modelling; Snowflake / BigQuery migrations land as query acceleration layers, not full replacements; Druid / Pinot migrations land when SQL maturity matters more than ingestion velocity; lakehouse migrations land as External Catalog mounts. Once you know the four patterns, every migration interview question fits one of them.
The four migration patterns.
- From ClickHouse — lift-and-shift. ClickHouse's MergeTree storage maps cleanly onto Aggregate Key / Duplicate Key models. Many queries port with light dialect rewriting (functions named slightly differently, but the semantics carry).
- From Snowflake / BigQuery — query acceleration. Keep the warehouse as the ETL + ad-hoc home; introduce StarRocks as a serving layer for the workloads where latency matters. Iceberg / Delta makes this seamless because StarRocks can read the same files via External Catalog.
- From Druid / Pinot — SQL maturity. When the workload outgrows pure rollup + filter and now needs JOIN, window functions, complex CTEs — that's when teams move. The freshness story is the trade-off; configure async MV to meet the SLA.
- From a lakehouse (Iceberg / Hudi / Delta) — query head. Mount the catalog, define MVs against the hot tables, route queries through StarRocks. No data copy.
Operational concerns in five bullets.
- FE quorum. Three or five FE nodes, odd count for HA. The follower / observer roles split read load; one elected leader handles metadata writes. Quorum loss → cluster degrades.
- BE scaling. Add BEs to scale compute. The cluster rebalances tablets across BEs on a configurable cadence. Rebalances are background; queries continue.
- BE replicas. Default replication factor 3 — every tablet on three BEs. Tolerates one BE loss with no read interruption.
- External Catalog credentials. Iceberg / Hudi / Delta catalogs need IAM credentials wired into the BE config; rotate carefully.
- Backup + recovery. Snapshot the FE metadata (BDBJE / FoundationDB) and BE tablet data on a regular cadence. StarRocks managed (CelerData) handles this; self-hosted teams own it.
Decision framework in three rows.
| Constraint | Recommendation |
|---|---|
| BI latency target ≤ 500 ms, multi-table | StarRocks / Doris |
| Already on Iceberg + need acceleration | StarRocks External Catalog |
| Single-table streaming aggregate, hyper-cardinality | ClickHouse |
The interview talking points.
- Vectorisation. Batches of columns, SIMD, branch-light loops.
- MV rewrite. Sync MV for hot rollups, async MV for lakehouse.
- Primary Key model. Upsert-fast, point-lookup-fast, the right shape for CDC dims.
- Colocate join. Same bucket key + count = no shuffle.
- External Catalog. Iceberg / Hudi / Delta / Hive Metastore / JDBC — the lakehouse bridge.
- FE / BE split. Stateless coordinator, stateful worker, optional CN for shared-data.
Worked example — migrating a ClickHouse table
Detailed explanation. A ClickHouse events table on a MergeTree engine maps cleanly onto a StarRocks Duplicate Key table sorted by the same key columns. The dialect difference is mostly in CREATE TABLE syntax; the SELECT queries port with minor rewrites.
Question. Translate the following ClickHouse table DDL into StarRocks. Keep the sort key + partitioning + column types.
Input — ClickHouse DDL.
CREATE TABLE events (
event_id UInt64,
event_ts DateTime,
user_id UInt64,
event_type LowCardinality(String),
properties String
)
ENGINE = MergeTree
PARTITION BY toYYYYMM(event_ts)
ORDER BY (event_ts, user_id);
Code — StarRocks equivalent.
CREATE TABLE events (
event_id BIGINT NOT NULL,
event_ts DATETIME NOT NULL,
user_id BIGINT NOT NULL,
event_type VARCHAR(64) NOT NULL,
properties VARCHAR(65533)
)
DUPLICATE KEY(event_ts, user_id, event_id)
PARTITION BY RANGE(event_ts) ()
DISTRIBUTED BY HASH(user_id) BUCKETS 32
PROPERTIES (
"replication_num" = "3"
);
Step-by-step explanation.
-
ENGINE = MergeTree↔DUPLICATE KEY(...)— same semantic (append-only sorted log). -
ORDER BY (event_ts, user_id)↔DUPLICATE KEY(event_ts, user_id, event_id)— StarRocks sorts by the key columns too; includingevent_idlast makes the key unique. -
PARTITION BY toYYYYMM(event_ts)↔PARTITION BY RANGE(event_ts) ()plus dynamic partition properties (omitted here) — StarRocks uses explicit range partitions, often with a dynamic partition rule to roll forward monthly. -
LowCardinality(String)↔VARCHAR(64)— StarRocks does not have a literal LowCardinality wrapper, but the engine's dictionary encoding for low-cardinality string columns happens automatically under the hood. - The DISTRIBUTED BY clause picks
HASH(user_id)— the most common join key in event tables. Bucket count 32 is a reasonable default; tune by data size.
Output.
| ClickHouse concept | StarRocks equivalent |
|---|---|
| MergeTree | DUPLICATE KEY |
| ORDER BY | sort key inside DUPLICATE KEY |
| PARTITION BY toYYYYMM | PARTITION BY RANGE + dynamic partition |
| LowCardinality | dictionary encoding (automatic) |
Rule of thumb. Most ClickHouse → StarRocks ports map MergeTree → Duplicate Key, ReplacingMergeTree → Primary Key, AggregatingMergeTree → Aggregate Key. The semantic translation is mechanical; the harder work is rewriting the dialect-specific functions (e.g. arrayJoin → LATERAL VIEW).
Worked example — query acceleration on top of Snowflake
Detailed explanation. Keep the data in Snowflake; mount it via Snowflake's iceberg-format external tables (or replicate to S3 as Iceberg) and acceleration-layer it through StarRocks. The dashboard reads StarRocks; the ETL still writes Snowflake.
Question. Sketch the Snowflake → StarRocks acceleration pattern for a single high-traffic dashboard.
Input.
| Layer | Role |
|---|---|
| Snowflake | system of record + ETL + ad-hoc |
| Iceberg in S3 | hot table replicated nightly |
| StarRocks | serving layer for the dashboard |
Code.
-- StarRocks External Catalog over the Iceberg replica
CREATE EXTERNAL CATALOG iceberg_dashboard
PROPERTIES (
"type" = "iceberg",
"iceberg.catalog.type" = "rest",
"iceberg.catalog.uri" = "https://iceberg-catalog.internal"
);
-- Async MV refreshing every 5 minutes
CREATE MATERIALIZED VIEW mv_finance_dashboard
REFRESH ASYNC EVERY (INTERVAL 5 MINUTE)
AS
SELECT
d_date,
region_id,
product_id,
SUM(amount) AS revenue
FROM iceberg_dashboard.warehouse.fact_orders
WHERE d_date >= CURRENT_DATE - INTERVAL '30' DAY
GROUP BY d_date, region_id, product_id;
Step-by-step explanation.
- The dashboard previously read directly from Snowflake on a LARGE warehouse to hit sub-second on 200 concurrent users. Cost scaled with warehouse-hours.
- The acceleration pattern replicates the dashboard's hot table into Iceberg (nightly, or via a Snowflake-managed Iceberg table). StarRocks mounts it via External Catalog.
- An async MV pre-computes the rollup; the dashboard reads from the MV.
- The split keeps Snowflake responsible for ETL + ad-hoc + small-bursty queries; StarRocks owns the single high-concurrency dashboard.
Output.
| Workload | Engine | Notes |
|---|---|---|
| ETL writes | Snowflake | unchanged |
| Ad-hoc analyst | Snowflake | unchanged |
| Dashboard reads | StarRocks | sub-second + cheaper |
Rule of thumb. Acceleration > replacement. The team keeps Snowflake's analyst tooling and Snowpark workflows; the platform team adds a single new engine for the single workload Snowflake was least efficient at.
Worked example — moving from Druid because of SQL maturity
Detailed explanation. A team built on Druid because nothing else hit sub-second in 2019. Five years later the SQL surface needs JOINs and window functions Druid does not handle well. The migration unit is the workload: the original counter dashboards stay on Druid; the new "funnel × cohort × LTV" workload moves to StarRocks.
Question. Outline the migration design when the requirement is "we need window functions over a 2B-row event table with sub-second latency."
Input.
| Workload | Current engine | Issue |
|---|---|---|
| Single-counter dashboard | Druid | fine |
| Funnel × cohort × LTV | Druid | SQL maturity gaps |
Code (StarRocks side).
-- Event table on StarRocks
CREATE TABLE events (
event_id BIGINT NOT NULL,
event_ts DATETIME NOT NULL,
user_id BIGINT NOT NULL,
event_type VARCHAR(64),
cohort_week DATE
)
DUPLICATE KEY(event_ts, user_id)
PARTITION BY RANGE(event_ts) ()
DISTRIBUTED BY HASH(user_id) BUCKETS 64;
-- Funnel + cohort query — Druid would struggle on this
SELECT
cohort_week,
COUNT(DISTINCT user_id) AS reached_step_0,
COUNT(DISTINCT CASE WHEN event_type = 'add_to_cart'
THEN user_id END) AS reached_step_1,
COUNT(DISTINCT CASE WHEN event_type = 'purchase'
THEN user_id END) AS reached_step_2
FROM events
WHERE event_ts >= CURRENT_DATE - INTERVAL '90' DAY
GROUP BY cohort_week
ORDER BY cohort_week;
Step-by-step explanation.
- Keep the single-counter dashboards on Druid — they are still fast and the migration cost has no upside.
- Build the funnel × cohort × LTV workload on StarRocks because the SQL maturity (CTE, window functions, multi-table JOIN) is the gap Druid does not cover well.
- The ingestion side mirrors Druid's source — both engines consume the same Kafka topic via Routine Load (StarRocks) or Tranquility (Druid).
- Run both engines for a quarter; measure each workload's freshness, latency, and operational burden; expand the migration only if numbers hold.
Output.
| Workload | Engine after | Why |
|---|---|---|
| Single-counter dashboard | Druid | already fast, no reason to move |
| Funnel × cohort × LTV | StarRocks | window functions + COUNT DISTINCT |
Rule of thumb. Treat Druid (and Pinot) as durable systems. Migrate only the workloads they cannot serve well. The "we are moving off Druid entirely" story is rarely the right one; the workload-by-workload story almost always is.
Worked example — operational FE quorum + BE replication
Detailed explanation. Production StarRocks / Doris clusters run 3 or 5 FE nodes (odd count for quorum) and BEs with replication factor 3. The math behind those choices is the same Raft-style HA reasoning every production team eventually internalises.
Question. Explain the operational reasoning for "3 FE + replication factor 3" in production.
Input.
| Parameter | Choice |
|---|---|
| FE node count | 3 |
| BE replication factor | 3 |
| Quorum behaviour | majority |
| Tolerance | 1 FE failure, 1 BE failure |
Code (cluster config sketch).
# fe.conf (one per FE node, 3 total)
meta_dir = /data/starrocks/fe/meta
http_port = 8030
rpc_port = 9020
priority_networks = 10.0.0.0/24
# be.conf (one per BE node)
storage_root_path = /data/starrocks/be/storage
priority_networks = 10.0.0.0/24
default_replication_num = 3
Step-by-step explanation.
- The 3-FE quorum tolerates one FE failure: with 2 healthy followers / leader, the cluster still has majority for metadata writes. A 5-FE quorum tolerates two failures and is appropriate for very large clusters.
- The BE replication factor 3 places every tablet on three BEs. Read load distributes across replicas; writes commit to majority. Tolerates one BE loss with no read interruption.
- Network partitioning: the FE quorum protects metadata consistency. A partition that isolates one FE causes that FE to step down; the majority continues serving.
- Rebalancing: when a BE is added or removed, the cluster rebalances tablets in the background. Configurable concurrency keeps the rebalance from saturating the network.
Output.
| Failure mode | Tolerance |
|---|---|
| 1 FE crash | cluster healthy, leader re-elected |
| 1 BE crash | reads continue, writes continue, tablet re-replicated to 4th BE |
| 2 FE crash (3-FE quorum) | metadata writes blocked; reads continue against stale catalog |
| 2 BE crash (replication 3) | partial tablet unavailability until re-replication |
Rule of thumb. Run 3 FEs and replication-3 by default. Move to 5 FEs only when the cluster size or write rate justifies it; move replication to 5 only for the highest-criticality tables. The defaults are tuned for the median production case.
Data engineering interview question on migration design
A senior interviewer might frame: "Your team has been on Snowflake for three years. Build a 90-day plan to introduce StarRocks for the highest-pain workload, keeping Snowflake intact for everything else." It tests workload routing + migration discipline.
Solution Using a 90-day workload-routing migration
Days 0-14: discovery.
- Pull query history; identify the top 5 dashboards by concurrent QPS + P95.
- Pick the single worst-offender for migration.
Days 14-28: design.
- Define StarRocks tables (Duplicate Key for facts, Primary Key for dims).
- Choose colocate group on the join key.
- Define sync MV for the rollup.
Days 28-42: build.
- Replicate the hot table to Iceberg (or load directly via Broker Load).
- Stand up 3 FE + 4 BE cluster.
- Wire BI tool to dual-read (Snowflake + StarRocks behind a feature flag).
Days 42-70: validate.
- Compare P95 + correctness on the test cohort.
- Tune MV definitions; verify async refresh.
- Document the runbook (FE failover, BE rebalance, MV rebuild).
Days 70-90: cut-over + measure.
- Flip the feature flag for the dashboard.
- Track P95 + cost for 30 days.
- Decide whether to migrate the next workload or stop here.
Step-by-step trace.
| Phase | Outcome |
|---|---|
| Discovery | one worst-offender workload identified |
| Design | DDL + MV + colocate group specced |
| Build | cluster up, dual-read behind feature flag |
| Validate | P95 + correctness verified |
| Cut-over | dashboard flipped, 30-day measurement |
The 90-day plan keeps Snowflake intact, ships StarRocks for the single most painful workload, and earns the right to expand only if the numbers hold.
Output:
| Metric | Before | After (30 days) |
|---|---|---|
| Dashboard P95 | 1.4 s | 180 ms |
| Concurrent QPS supported | 200 | 200 with headroom |
| Cost (dashboard-only) | $X | $0.45 X |
| Operational systems | 1 | 2 |
| Rollback window | n/a | feature flag still live |
Why this works — concept by concept:
- One workload at a time — minimises blast radius. A failed migration on a single dashboard is recoverable; a failed migration on "the warehouse" is not.
- Dual-read behind a flag — keeps the rollback path warm for the entire cut-over window. The team is never trapped.
- Operational systems count — going from 1 to 2 engines is a real cost. The migration only earns its keep when the gains exceed that operational tax.
- Cost compared per workload — the right comparison is "$/month for the dashboard" before vs after, not "$/month for the whole platform." The platform cost may rise; the per-workload cost must fall.
- Runbook discipline — FE failover, BE rebalance, MV rebuild — the runbook is the difference between "we run StarRocks" and "we run StarRocks well." Document early.
- Cost — fixed cluster footprint + storage + MV maintenance; payback proportional to the workload's prior pain. Small-pain workloads do not pay back; big-pain workloads pay back fast.
SQL
Topic — ETL
ETL pipeline problems
Cheat sheet — StarRocks / Doris recipes
- Sub-second BI dashboard. Aggregate Key model on the rollup base + sync materialized index on the (date, dimension) grouping the dashboard hits. Pair with a colocate group between the fact and the dim on the join key.
- Lakehouse query acceleration. StarRocks External Catalog over Iceberg / Hudi / Delta + async materialized view refreshed every N minutes. The catalog mount avoids the data copy; the MV makes the dashboard query sub-second.
-
Upsert-fast dimension. Primary Key model (StarRocks) or Unique Key +
enable_unique_key_merge_on_write = true(Doris). Drives sub-millisecond point lookups and millisecond upserts from CDC streams. - Shuffle elimination on JOIN. Put the fact and the dim into the same colocate group on the join key; match bucket count and bucket key. The engine places matching tablets on the same BEs and the join runs locally.
-
Async MV refresh cadence.
REFRESH ASYNC EVERY (INTERVAL N MINUTE)for lakehouse rollups; the right N is the slowest freshness SLA the downstream dashboard tolerates. Below that, prefer sync MV. - Choosing the engine. Compliance / ASF preference → Doris. Lakehouse-first + commercial support → StarRocks. Otherwise → pick by team familiarity.
- FE quorum. Three FE nodes, odd count. Five only when the cluster size or metadata write rate justifies it. Tolerates one failure.
-
BE replication. Default
replication_num = 3. Move to 5 only for highest-criticality tables. Tolerates one BE loss without read interruption. - Routine Load from Kafka. Best path for sub-minute ingest. JSON or CSV format; the Primary Key model handles upserts in place — no MERGE statement.
-
Diagnose latency.
SET enable_profile = true; SELECT ...; SHOW PROFILELIST; SHOW PROFILE FOR <id>;. Read the per-phase breakdown to find the dominant cost (scan, shuffle, MV-hit-or-miss). - MV maintenance. Sync MVs maintained synchronously — write amplification proportional to MV row count. Async MVs maintained on schedule — staleness bounded by refresh interval.
- Migration unit. One workload at a time, behind a feature flag, dual-read against the old engine for at least 30 days before flipping the read path.
-
MV rewrite verification. Run
EXPLAINon the dashboard query and check the plan reads the MV name, not the base table — if not, the rewrite did not match. Iterate the MV definition until the FE picks it up. - External Catalog credential rotation. Update IAM / token in the BE config + restart in rolling fashion. Plan for the rolling restart window in the migration schedule.
Frequently asked questions
What's the difference between StarRocks and Apache Doris?
Both engines descend from the same 2008 Palo project at Baidu, were forked in 2020, and share the same vectorised MPP execution core, FE / BE topology, columnar storage, and three data models (Aggregate Key, Duplicate Key, Primary Key / Unique Key). The technical surface has largely re-converged in 2026. The differences are organisational and emphasis: Apache Doris is governed by the Apache Software Foundation (community-driven roadmap), while StarRocks is led by CelerData (commercial-backed). StarRocks pushed lakehouse catalog integration (Iceberg, Hudi, Delta) earlier and is the Iceberg-first answer in 2026. Doris emphasises real-time ingestion fidelity and MySQL-protocol parity. Most production teams pick by governance preference and lakehouse-catalog needs, not by raw benchmark.
Is StarRocks faster than ClickHouse?
It depends on the workload shape. ClickHouse is brilliant at single-table aggregations over high-cardinality streams; StarRocks is stronger at multi-table JOINs, materialized view rewrite, and lakehouse catalog reads. For a "count unique users on a single huge event table" workload, ClickHouse often wins on raw throughput. For a "fact × three dims × rollup × top-N" dashboard workload, StarRocks's sync materialized index + colocate join typically wins. The right comparison is your workload on both engines with a representative profile — not the headline benchmark page.
Can StarRocks query Iceberg or Hudi tables?
Yes. StarRocks ships an External Catalog feature that mounts Iceberg, Hudi, Delta, Hive Metastore, JDBC, and other catalog types. Queries against the External Catalog tables read the files in object storage in place — no data copy required. The FE applies partition pruning and column projection against the catalog manifests, and the BEs scan only the required files with vectorised SIMD execution. For sub-second dashboard latency over a large Iceberg table, define an async materialized view on top of the External Catalog table; the FE rewrites matching dashboard queries onto the MV automatically.
What is a materialized index in StarRocks?
A materialized index (sync materialized view) is a pre-computed projection or aggregation on the same base table, maintained synchronously on every write. The FE checks the materialized index catalog on every incoming query and rewrites the plan to read the index instead of the base table whenever the index satisfies the query's grouping and filter shape. For a 100M-row fact table with a daily-by-region rollup, the sync materialized index turns a multi-second scan into a sub-second lookup. The cost is write amplification proportional to the index size and additional storage for the index data; the payback is order-of-magnitude latency improvement on the dashboard query.
StarRocks vs Snowflake — when do you pick which?
Pick Snowflake for elastic ad-hoc SQL, the analyst-tooling experience, and bursty workloads where idle-cost matters. Pick StarRocks for sub-second BI dashboards with high concurrency, lakehouse acceleration over Iceberg, and workloads where the per-query cost of always-on warehouse compute exceeds the fixed cost of a StarRocks cluster. Most mature platforms run both — Snowflake as the system of record + ETL + ad-hoc home, StarRocks as the serving layer for the dashboards where latency is the binding constraint. The decision is not "which engine is better" but "which workload routes to which engine"; the data engineer's craft is the routing logic, not the engine monoculture.
How do I migrate from ClickHouse to StarRocks?
Treat it as a workload-by-workload migration, not a platform replacement. Map ClickHouse's MergeTree → StarRocks Duplicate Key model, ReplacingMergeTree → Primary Key model, and AggregatingMergeTree → Aggregate Key model. The DDL ports cleanly for most tables; the dialect rewrite is mostly function-name fixes (arrayJoin → LATERAL VIEW, uniq → COUNT(DISTINCT), etc.). For ingestion, replace ClickHouse's HTTP / TCP loaders with StarRocks Stream Load (HTTP) or Routine Load (Kafka). Pick the single workload where StarRocks's strengths (multi-table JOIN, materialized view rewrite, External Catalog) genuinely outperform ClickHouse, migrate that one first behind a feature flag, validate P95 and cost, then expand only if the numbers hold.
Practice on PipeCode
- Drill the SQL aggregation practice library → for the SUM / COUNT / GROUP BY family that every materialized view rewrite collapses into.
- Rehearse on JOIN problems → — the colocate-group story only matters once your JOIN intuition is solid.
- Sharpen the analytical surface with window function drills → — the SQL maturity that pushes teams off Druid lives here.
- Lock the optimisation lens with query optimization problems → for the plan-reading muscle every StarRocks debug session needs.
- Layer the ETL practice library → — ingestion shape decides whether a Primary Key or Aggregate Key model wins.
- For the broader interview surface, read top data engineering interview questions →.
- Stack the prerequisites with the only 5 skills you need to become a data engineer →.
- Sharpen the dialect axis with the SQL for data engineering interviews course →.
- For long-form schema craft, work through data modelling for DE interviews →.
Pipecode.ai is Leetcode for Data Engineering — every StarRocks and Apache Doris recipe above ships with hands-on practice rooms where you write the materialized view rewrite, the colocate-group DDL, and the External Catalog query against real graded inputs. PipeCode pairs every reading with 450+ DE-focused problems and a real-time scoring engine, so you never have to wonder whether your sub-second BI plan actually behaves the same on StarRocks as on Doris.





Top comments (0)