A semantic layer is the part of the modern data stack that decides what "active user" means — once — so every dashboard, notebook, embedded chart, and LLM agent that asks the question receives the same answer. Skip it and every BI tool ships its own definition; ship it and the warehouse becomes the canonical source of metric truth instead of the source of metric disagreement.
This guide compares the three engines analytics engineers actually shortlist in 2026: cube.dev as the standalone open-source headless-BI engine, the dbt semantic layer powered by MetricFlow, and lookml as the original semantic model inside Looker. We walk through where each one sits between the warehouse and the consumer surface, how the data models map onto each other, how to define the same Weekly Active Users metric three ways, and how the metrics layer routes queries from Tableau, Power BI, Hex, Mode, embedded apps, and LLM agents.
When you want hands-on reps the moment you finish reading, drill the aggregation practice library → for the measure / metric foundations, rehearse on joins problems → for the entity-resolution patterns that semantic layers automate, and stack the group-by drills → for the granularity reasoning every measure definition leans on.
On this page
- What a semantic layer actually solves
- The role of the semantic layer in a modern stack
- The three platforms compared
- Defining a metric in each platform
- Consumer fan-out — who queries the semantic layer
- Cheat sheet — semantic layer recipes
- Frequently asked questions
- Practice on PipeCode
1. What a semantic layer actually solves
The "every dashboard redefines active user" problem — and why a governed metric layer is the antidote
The mental model in one line: a semantic layer is a single place where business metrics, dimensions, and joins are defined as objects — the warehouse stays the storage, the BI tool stays the surface, and the metric definition lives in between as code that every consumer reads from. Without it, the same metric is re-invented in every dashboard, every SQL snippet, and every notebook — and the numbers diverge.
The four symptoms of a missing semantic layer.
- Five definitions of "active user." Marketing counts a session-open; product counts a feature-event; finance counts a paid event; data science counts a 30-day retention bucket; the CEO sees a different number on every dashboard.
-
Joins re-written by every analyst. The
orderstocustomerstoregionsjoin chain lives in a Tableau workbook, a Looker explore, a Hex notebook, and a one-off SQL snippet — four copies, four chances to drift. -
Filters re-implemented per surface. "Exclude internal users" lives as a
WHERE email NOT LIKE '%@acme.internal'in some places and aWHERE is_internal = FALSEin others. Numbers drift the day a new internal domain appears. - No diffable governance. When the metric definition lives in a BI workbook's hidden calculated field, you cannot review it in a pull request — and you cannot tell which definition matches the "official" one.
One metric definition, many consumers — the headless-BI premise.
- A semantic layer publishes metric definitions as code. Cube ships them as YAML / JS
cubes; dbt SL ships them as YAMLsemantic_models+metrics; LookML ships them asview/explorefiles. - The same definition resolves into a query when a consumer asks for it. The consumer never writes the join, the GROUP BY granularity, or the filter — the layer does.
- Every consumer — Tableau, Power BI, Hex, Mode, embedded apps, and now LLM agents — sees the same number because they all read from the same definition.
- This is the headless BI premise: a layer that is a query API but is not a visualisation tool. The viz layer becomes interchangeable.
Where a semantic layer sits.
- Below it: the warehouse (Snowflake, BigQuery, Databricks, Redshift, Postgres) holds the marts produced by dbt or any ELT tool.
- The semantic layer itself: Cube / dbt SL / LookML translate metric requests into warehouse SQL and (often) cache the results.
- Above it: the BI / notebook / embedded / LLM surfaces fan out, each one calling a SQL, REST, or GraphQL endpoint that the layer exposes.
Dimensions, measures, metrics, and joins as first-class objects.
-
Dimensions are the columns you group by —
region,signup_month,device_type. -
Measures are the aggregations on a single table —
count(distinct user_id),sum(amount). Measures are the LEGO bricks. -
Metrics are the named, business-level expressions built from measures —
WAU = count_distinct_usersfiltered to the last 7 days. Metrics are what the dashboard asks for. -
Joins are declared once and re-used by every metric. The consumer never has to know that
orders.region_idjoins toregions.id.
Why this didn't take off until dbt + Cube made it cheap.
- The pre-2020 attempts (Looker, Power BI Datasets, Microstrategy schemas) were bundled into a single BI tool — adopting them locked you into that vendor's viz layer.
- Cube.dev (2019) decoupled the layer from the viz tool by exposing REST / GraphQL / SQL APIs — every BI tool, notebook, and embedded app could now consume the same definitions.
- The dbt Semantic Layer (powered by MetricFlow, 2023) put metric definitions next to dbt models — the same git repo, the same review process, the same CI.
- LLM agents (2024–2026) finally made the governance argument concrete: a model that hallucinates a join or a filter ships a confident wrong answer; one that calls the semantic layer ships a verifiable one.
Worked example — the "five definitions of active user" failure mode
Detailed explanation. A consumer-app company has five dashboards labelled "active users." Each was built independently by a different team using a different SQL definition. The CEO opens all five in the Monday meeting and sees five different numbers. This is the canonical symptom that pushes a team to adopt a semantic layer.
Question. Given a shared events table, write the five drifted definitions of "weekly active users" and show how a single semantic-layer definition collapses them into one number. Trace why each ad-hoc version returns a different count.
Input — events.
| event_id | user_id | event_name | event_ts | platform |
|---|---|---|---|---|
| 1 | 100 | login | 2026-06-08 10:00 | web |
| 2 | 100 | search | 2026-06-08 10:01 | web |
| 3 | 200 | login | 2026-06-09 12:00 | mobile |
| 4 | 300 | login | 2026-06-09 13:00 | web |
| 5 | 300 | purchase | 2026-06-09 13:05 | web |
| 6 | 400 | login | 2026-05-30 09:00 | web |
| 7 | 100 | login | 2026-06-09 09:00 | mobile |
Code.
-- Definition 1: marketing — anyone with any event in last 7 days
SELECT COUNT(DISTINCT user_id) AS wau_marketing
FROM events
WHERE event_ts >= CURRENT_DATE - INTERVAL '7 days';
-- Definition 2: product — logged in at least once
SELECT COUNT(DISTINCT user_id) AS wau_product
FROM events
WHERE event_name = 'login'
AND event_ts >= CURRENT_DATE - INTERVAL '7 days';
-- Definition 3: finance — purchased in last 7 days
SELECT COUNT(DISTINCT user_id) AS wau_finance
FROM events
WHERE event_name = 'purchase'
AND event_ts >= CURRENT_DATE - INTERVAL '7 days';
-- Definition 4: data science — at least 2 distinct event days
SELECT COUNT(DISTINCT user_id) AS wau_ds
FROM (
SELECT user_id, DATE(event_ts) AS d
FROM events
WHERE event_ts >= CURRENT_DATE - INTERVAL '7 days'
GROUP BY user_id, DATE(event_ts)
) t
GROUP BY user_id
HAVING COUNT(*) >= 2;
-- Definition 5: mobile-only WAU
SELECT COUNT(DISTINCT user_id) AS wau_mobile
FROM events
WHERE event_ts >= CURRENT_DATE - INTERVAL '7 days'
AND platform = 'mobile';
Step-by-step explanation.
- Each query asks "weekly active users" but folds in a different qualifying event. None is wrong — they answer different business questions — but they are all labelled identically on five dashboards.
- Marketing counts every distinct user with any event = users 100, 200, 300 → 3.
- Product counts users with a
login= users 100, 200, 300 → 3 (coincidentally the same here, drifts on other weeks). - Finance counts users with a
purchase= user 300 → 1. - Data-science counts users active on two or more days = user 100 (active 2026-06-08 and 2026-06-09) → 1.
- Mobile-only WAU restricts to
platform = 'mobile'→ users 100 and 200 → 2. - A semantic layer collapses this by publishing one named metric per business question:
weekly_active_users(default — any event),weekly_logged_in_users,weekly_purchasers,weekly_engaged_users(2+ days),weekly_active_mobile. Each dashboard asks for the named metric, not for an ad-hoc SQL string.
Output.
| Metric label | Count |
|---|---|
| wau_marketing (any event) | 3 |
| wau_product (login) | 3 |
| wau_finance (purchase) | 1 |
| wau_ds (2+ days active) | 1 |
| wau_mobile (mobile-only) | 2 |
Rule of thumb. If five dashboards labelled the same metric show five different numbers, the fix is not "agree on the definition once." It is "move the definition into a semantic layer so the next dashboard reads from the same file." Without the layer the next disagreement is one PR away.
Worked example — joins re-written by every analyst
Detailed explanation. A multi-table join chain (orders → customers → regions → tax tables) lives as boilerplate at the top of every Looker workbook, every Hex notebook, and every ad-hoc SQL snippet. The chain is re-typed every time. A semantic layer declares the joins once and lets every metric reference them by entity name.
Question. Given orders, customers, and regions, show the boilerplate every analyst types, then show how a semantic-layer entity declaration removes the join from the query surface entirely.
Input — schema.
| table | columns |
|---|---|
| orders | order_id, customer_id, amount, order_date |
| customers | customer_id, name, region_id |
| regions | region_id, region_name, country_code |
Code.
-- Every analyst types this join chain — by hand — once per query
SELECT r.region_name,
SUM(o.amount) AS total_revenue
FROM orders o
INNER JOIN customers c ON o.customer_id = c.customer_id
INNER JOIN regions r ON c.region_id = r.region_id
WHERE o.order_date >= CURRENT_DATE - INTERVAL '30 days'
GROUP BY r.region_name;
# Semantic-layer style — declare the joins once, query the metric
# Cube.dev model (simplified)
cubes:
- name: Orders
sql_table: orders
joins:
- name: Customers
relationship: many_to_one
sql: "{Orders}.customer_id = {Customers}.customer_id"
measures:
total_revenue:
sql: amount
type: sum
dimensions:
order_date:
sql: order_date
type: time
- name: Customers
sql_table: customers
joins:
- name: Regions
relationship: many_to_one
sql: "{Customers}.region_id = {Regions}.region_id"
- name: Regions
sql_table: regions
dimensions:
region_name:
sql: region_name
type: string
Step-by-step explanation.
- The ad-hoc SQL re-types
INNER JOIN customers ... INNER JOIN regions ...every time. Five analysts, five copies. The day a new tax-region table is added, all five copies have to be edited. - The Cube schema declares the same joins as
relationship: many_to_onelines, one time, per cube. The semantic layer now knows how to traverse fromOrderstoRegionswhenever a query mentions a column from both cubes. - The consumer query becomes: "give me
Orders.total_revenuegrouped byRegions.region_name, filtered to the last 30 days." Cube generates the join chain on the fly — and it is the same chain every time. - The dbt SL equivalent uses
entitiesandrelationshipson eachsemantic_model; LookML usesjoinsdeclared inside eachexplore. The shape is the same: declare once, traverse forever.
Output.
| region_name | total_revenue (semantic) | total_revenue (ad-hoc, copy 1) | total_revenue (ad-hoc, copy 2) |
|---|---|---|---|
| EU | 12,400 | 12,400 | 12,400 |
| US | 9,800 | 9,800 | 9,800 |
| APAC | 6,200 | 6,200 | 6,200 |
The numbers match — but only because every ad-hoc copy happens to be in sync today. The semantic-layer version is in sync by construction.
Rule of thumb. Any join chain that appears in three or more queries should live as a declared entity / join in the semantic layer. Re-typing the same INNER JOIN ... ON ... block is the analytics-engineering equivalent of cargo-cult code copy-paste.
Worked example — the headless-BI contract
Detailed explanation. "Headless BI" is the marketing term; the engineering contract behind it is "expose a query API that any front end can consume." A semantic layer that ships REST, GraphQL, and SQL endpoints can serve a Tableau dashboard, a React embedded chart, and a Slack-bot LLM agent from the same metric file. The metric author writes once; the surfaces fan out.
Question. Show how the same Weekly Active Users metric is requested over (1) a SQL endpoint, (2) a REST endpoint, and (3) a GraphQL endpoint — and explain why the output is byte-for-byte the same number on every surface.
Input. A semantic layer that has published a metric named weekly_active_users over the Events cube.
Code.
-- 1. SQL endpoint (Cube SQL API, dbt SL JDBC, Looker SQL Runner)
SELECT MEASURE(weekly_active_users)
FROM events
WHERE event_date >= DATE_TRUNC('week', CURRENT_DATE);
# 2. REST endpoint (Cube REST, Looker API, dbt Cloud Semantic Layer API)
curl -s -H "Authorization: $TOKEN" \
"https://semantic.example.com/cubejs-api/v1/load" \
--data '{
"query": {
"measures": ["Events.weekly_active_users"],
"timeDimensions": [{
"dimension": "Events.event_date",
"granularity": "week",
"dateRange": "this week"
}]
}
}'
# 3. GraphQL endpoint (Cube GraphQL API)
query {
cube(where: { eventDate: { inDateRange: "this week" } }) {
events {
weeklyActiveUsers
}
}
}
Step-by-step explanation.
- Each endpoint translates the request into the same underlying warehouse SQL. The SQL endpoint is the most direct; REST and GraphQL wrap the request in JSON.
- The metric definition for
weekly_active_userslives in one file —events.yml(Cube) orevents.sql+metrics.yml(dbt SL) orevents.view.lkml+events.model.lkml(LookML). Every endpoint reads from that file. - The semantic layer caches the result of the compiled SQL. If three surfaces ask the same question within the cache window, the warehouse runs the query once and the layer serves the cached answer thrice.
- Authorization is enforced at the layer — the user's JWT or session is the same across endpoints, and row-level-security rules in the layer rewrite the SQL before it hits the warehouse.
Output (one number, three surfaces).
| Surface | Endpoint | Response field | Value |
|---|---|---|---|
| Tableau dashboard | SQL | MEASURE(weekly_active_users) |
18,432 |
| React embedded chart | REST | data[0].Events.weekly_active_users |
18,432 |
| Slack LLM agent | GraphQL | cube[0].events.weeklyActiveUsers |
18,432 |
Rule of thumb. If your team needs the same metric on more than one surface, the cheapest path to consistency is the semantic layer — not a metrics_macros.sql file shared across BI tools, not a "single source of truth" doc, not a Slack thread. Code, one file, three endpoints.
Semantic layer interview question on metric governance
A senior analytics-engineering interviewer often opens with: "Walk me through how you'd give a CEO confidence that the 'active users' number on the executive dashboard is the same one the ML team trains on, the embedded customer-portal chart renders, and the Slack LLM agent quotes back when asked."
Solution Using a layered semantic-layer governance pattern
# 1. A single metric definition (Cube / dbt SL / LookML pseudocode)
metric: weekly_active_users
description: Distinct users with any event in the trailing 7 days.
type: count_distinct
target_field: user_id
filter: event_ts >= dateadd('day', -7, current_date)
owner: data-platform-team
review: PR-required
sla: <30s p95
cache_window: 1h
# 2. The metric is exposed on three endpoints (SQL / REST / GraphQL).
# 3. Three consumers register subscriptions to the metric:
consumers:
- executive_dashboard (Looker / Tableau, SQL endpoint)
- embedded_customer_portal (REST endpoint, RLS on tenant_id)
- slack_llm_agent (GraphQL endpoint, RLS on slack_user.email)
Step-by-step trace.
| Step | Action | Outcome |
|---|---|---|
| 1 | Analytics engineer opens PR editing weekly_active_users.yml
|
CI runs metric tests + freshness check |
| 2 | Reviewer approves; merge to main | Semantic layer redeploys metric definition |
| 3 | Executive dashboard auto-reloads (cache invalidated) | New number visible in <60s |
| 4 | Embedded chart polls REST endpoint | Same new number, same cache key |
| 5 | LLM agent grounds prompt with semantic-layer schema | Returns the new definition + new number, citing the layer |
| 6 | Auditor diffs git log for weekly_active_users.yml
|
One source of truth, one commit history |
Output:
| Surface | Definition version | Value | Cache state |
|---|---|---|---|
| Executive dashboard | v17 (post-merge) | 18,432 | hot |
| Embedded portal | v17 | 18,432 | hot |
| LLM Slack agent | v17 | 18,432 | hot |
| Auditor PR-log | v1 → v17 | full diff visible | n/a |
Why this works — concept by concept:
- Single source of truth — the metric file in version control is the only place the definition exists. No duplicate calculated fields, no hidden Tableau formulas, no LLM hallucinations of joins.
- PR-required review — metric changes flow through the same code review as any other code change. The "five-dashboards-disagree" failure mode cannot recur because there is no other place to edit the definition.
- Cache invalidation on deploy — the semantic layer flushes its cache for the affected metric the moment the new definition lands. Surfaces converge to the new number within the cache-refresh window.
- RLS at the layer — row-level-security predicates live in the metric definition. The embedded portal automatically scopes to the tenant; the Slack agent automatically scopes to the asking user. The consumer code carries no security logic.
- LLM grounding — the agent calls the semantic layer instead of generating SQL from scratch. Hallucinated joins become impossible because the layer publishes the schema (cubes, measures, dimensions, joins) as the agent's tool surface.
- Cost — one warehouse query per cache window (often 1 hour), regardless of how many surfaces poll. The semantic layer is a strict cost reducer relative to N independent BI tools each running their own SQL.
SQL
Topic — aggregation
Measure and metric aggregation problems (SQL)
2. The role of the semantic layer in a modern stack
Below the warehouse, beside the BI tool, above the consumer — and right where LLM agents finally need it
The mental model in one line: the semantic layer sits between the warehouse marts and every consumer (BI tool, notebook, embedded app, or LLM agent), translating a metric request into governed warehouse SQL with caching, row-level security, and access control along the way. Get the placement right and every downstream surface becomes interchangeable.
The three tiers in one paragraph.
- Bottom — Warehouse + dbt marts. Snowflake, BigQuery, Databricks, Redshift, or Postgres holding the cleaned, tested, joined tables. dbt models do the row-level transforms — staging → marts. The semantic layer reads from the marts; it does not own them.
- Middle — Semantic layer. Cube / dbt SL / LookML hold the metric definitions, dimension hierarchies, joins, and security rules. The layer compiles requests to SQL and (often) caches results.
- Top — Consumers. Tableau, Power BI, Hex, Mode, Sigma, Streamlit, custom React dashboards, embedded analytics, and now LLM agents. Each consumer speaks SQL, REST, or GraphQL — and gets the same metric value.
Caching, query optimisation, and query rewriting.
- Pre-aggregations (Cube). Cube can materialise a roll-up table — e.g. "daily active users by region by platform" — and route the incoming request to the pre-agg when the granularity matches. Sub-second queries against trillion-row fact tables.
- dbt SL caching. The dbt Semantic Layer (dbt Cloud) ships a query cache keyed by metric + filters + granularity. Repeat requests within the TTL hit the cache, not the warehouse.
-
Looker PDTs (persistent derived tables). LookML's
derived_tableblocks can be persisted on a schedule, turning expensive transforms into a pre-computed warehouse table. The explore reads from the PDT, not from the live mart. - Query rewriting. Modern semantic layers detect when a roll-up table can answer the query and rewrite the SQL transparently — the consumer never knows the query plan changed.
Governed metrics vs ad-hoc SQL — the contract boundary.
- Inside the layer: governed, named metrics with descriptions, owners, SLAs, and PR-review history.
- Outside the layer: ad-hoc SQL still works (and is sometimes the right answer for exploratory analysis), but it is not the dashboard's source.
- The boundary is enforced operationally: BI tools and embedded apps point only at the semantic layer's endpoint. Ad-hoc SQL is a separate Snowflake / BigQuery role that does not feed any production dashboard.
Multi-tenant security and row-level access.
-
Row-level security (RLS). A B2B SaaS company has 5,000 tenants. The semantic layer rewrites every query to add
WHERE tenant_id = :current_tenant. Consumers cannot bypass it because they never write the SQL. - Column masking. Salary or PII columns can be selectively masked at the layer based on the caller's role.
- Tenant isolation across consumers. The same metric file works for the internal dashboard and the customer-facing embedded chart, because the security predicate is bound to the request context, not the query.
- Auth pass-through. The layer accepts a JWT (or OAuth token) and resolves the user's permissions to row predicates at query time. No "service account that sees everything" pattern.
Why LLM agents finally make semantic layers strategic.
- A general-purpose LLM that sees a raw warehouse schema guesses joins and filters — and ships confident wrong answers. A semantic layer publishes the governed schema (cubes, measures, dimensions, joins, allowed filters) as the agent's tool surface.
- Grounded queries become deterministic — the same prompt yields the same SQL because the agent is constrained to the cubes that exist.
- Audit trails become possible — every agent call resolves to a named metric, not to an opaque generated SQL string. Compliance and finance teams can review what the agent asked for.
- The semantic layer is now the interface the LLM agent uses, not a competing surface. This is the inversion that pushed every major BI vendor to ship a semantic-layer story in 2024–2026.
Worked example — without vs with a semantic layer, side by side
Detailed explanation. Two stacks ship the same Weekly Active Users number to a Tableau dashboard. The "without" stack has Tableau pointing directly at the Snowflake mart with a workbook-local calculated field. The "with" stack has Tableau pointing at the semantic layer's SQL endpoint and asking for the named metric.
Question. Compare the two architectures end-to-end. Show where the metric definition lives, who can edit it, and what happens when a second consumer (a Hex notebook) is added.
Input — comparison table.
| Concern | Without semantic layer | With semantic layer |
|---|---|---|
| Where metric SQL lives | Tableau workbook calculated field |
weekly_active_users.yml in git |
| Who can edit it | Anyone with Tableau workbook access | PR-required code review |
| How a second consumer adopts it | Copy-paste the SQL into Hex | Point Hex at the semantic-layer SQL endpoint |
| What happens when underlying schema changes | Tableau workbook breaks silently | CI catches the break in the next PR |
| RLS / multi-tenant security | Per-workbook plumbing | Declared once at the layer |
| LLM agent integration | Hallucinated SQL | Tool calls to named metrics |
Code (without).
-- Tableau workbook calculated field — not in version control
COUNT_DISTINCT(
IIF(
DATE(event_ts) >= TODAY() - 7
AND event_name IN ('login','search','purchase'),
user_id,
NULL
)
)
Code (with).
# weekly_active_users.yml — single source of truth, in git
- name: weekly_active_users
description: Distinct users with any event in the trailing 7 days.
type: count_distinct
target_field: user_id
semantic_model: events
filters:
- "{{ Dimension('events__event_ts') }} >= dateadd('day', -7, current_date)"
Step-by-step explanation.
- In the "without" world, the metric is locked inside a Tableau workbook. The Hex team copies the SQL because there is no API to call. Now two definitions exist; they drift the day someone changes the filter on one side.
- In the "with" world, the metric lives in a YAML file. Tableau queries the semantic layer's SQL endpoint with
SELECT MEASURE(weekly_active_users) FROM events. Hex points at the same endpoint. Both surfaces see the same number by construction. - When the schema changes (e.g.
event_tsrenamed toevent_timestamp), the dbt CI / Cube CI breaks on the next PR — the change is caught before any dashboard sees a stale number. - RLS in the "without" world is a per-workbook setting; in the "with" world it is a layer-level rule that applies uniformly.
Output. Two stacks ship the same number today — but the "with" stack ships the same number on Tuesday at 4pm when a new consumer onboards. The "without" stack ships a divergent number the same week.
Rule of thumb. Adopt the semantic layer the moment you have a second consumer of the same metric. One dashboard can live with an ad-hoc definition; two dashboards cannot — they will drift, and the cost of drift is one quarterly business review with conflicting numbers.
Worked example — caching: pre-aggregations, dbt SL cache, and Looker PDTs
Detailed explanation. Each platform's caching story differs. Cube's pre-aggregations roll up to a smaller table; dbt SL caches the result of a metric request; Looker PDTs persist a derived table on a schedule. All three reduce warehouse load — but they hit different layers.
Question. Given a fact table with 10 billion rows and a dashboard that asks "DAU by region by day for the last 90 days," design a caching strategy for each platform. Show the storage / freshness trade-off.
Input.
| Asset | Detail |
|---|---|
| Fact table |
events — 10B rows, ~3M new/day |
| Dashboard | DAU by region by day, 90-day window |
| SLA | <2s p95 |
| Freshness | <1 hour stale acceptable |
Code.
# Cube — pre-aggregation
cubes:
- name: Events
pre_aggregations:
dau_by_region_by_day:
measures:
- daily_active_users
dimensions:
- region
time_dimension: event_date
granularity: day
partition_granularity: month
refresh_key:
every: 1 hour
# dbt SL — cache settings (dbt Cloud)
saved_queries:
- name: dau_by_region_by_day
query_params:
metrics: [daily_active_users]
group_by:
- Dimension('events__region')
- TimeDimension('events__event_date', 'day')
where:
- "{{ TimeDimension('events__event_date') }} >= dateadd('day', -90, current_date)"
cache:
ttl: 3600 # seconds
# Looker — PDT
view: events_dau_pdt {
derived_table: {
sql:
SELECT DATE(event_ts) AS event_date,
region,
COUNT(DISTINCT user_id) AS dau
FROM events
WHERE event_ts >= CURRENT_DATE - INTERVAL '90 DAY'
GROUP BY 1, 2 ;;
sql_trigger_value: SELECT DATE_TRUNC('hour', CURRENT_TIMESTAMP) ;;
}
}
Step-by-step explanation.
- The Cube pre-agg materialises a roll-up table (
events__dau_by_region_by_day) partitioned by month and refreshed hourly. The dashboard's SQL hits this table — ~90 rows × number-of-regions, not 10B rows. Sub-second. - The dbt SL cache stores the result of the saved query in dbt Cloud's cache. Repeated requests with the same parameters get the cached row set. Cache misses re-run against the warehouse mart.
- Looker's PDT persists a derived table in the warehouse. The
sql_trigger_valuerebuilds the PDT every hour. Every explore that references this view reads from the PDT. - All three trade some staleness for massive speed-up. The freshness SLA (<1 hour) fits each pattern. The storage cost is similar — one roll-up table per critical metric grain.
Output.
| Platform | Query latency | Storage overhead | Freshness | Where the cache lives |
|---|---|---|---|---|
| Cube pre-agg | <500ms | 1 small table per pre-agg | hourly refresh | Warehouse + Cube metadata |
| dbt SL cache | <2s (warm), seconds (cold) | result set in dbt Cloud | TTL-based | dbt Cloud |
| Looker PDT | <1s | 1 derived table per PDT | hourly trigger | Warehouse |
Rule of thumb. Pre-compute when the query pattern is predictable and high-volume; let the cache TTL absorb the long tail. Picking the wrong layer (e.g. caching the result for an exploratory cube where every consumer asks a different grain) wastes the cache hit ratio.
Worked example — multi-tenant row-level security at the semantic layer
Detailed explanation. A B2B SaaS analytics product needs to scope every chart to the calling tenant. Without a semantic layer, every BI tool re-implements the WHERE tenant_id = ? predicate — and the day someone forgets it, tenant A sees tenant B's revenue. With a semantic layer, the predicate lives in the cube definition and is enforced for every consumer.
Question. Show the RLS rule for the same metric in Cube, dbt SL, and LookML. Trace how a single user's JWT routes to the right tenant scope.
Input. A multi-tenant orders table with a tenant_id column. The user alice@tenantA.com has a JWT with tenant_id = "tenant_A".
Code.
# Cube — query rewrite
cubes:
- name: Orders
sql_table: orders
public: true
# security context injected from JWT claim
sql: SELECT * FROM orders WHERE tenant_id = '{COMPILE_CONTEXT.securityContext.tenant_id}'
measures:
total_revenue:
sql: amount
type: sum
# dbt SL — runtime filter via dbt Cloud security context
semantic_models:
- name: orders
model: ref('fct_orders')
entities:
- name: order_id
type: primary
defaults:
agg_time_dimension: order_date
# MetricFlow respects access controls declared in dbt_project.yml
metrics:
- name: total_revenue
type: simple
type_params:
measure: revenue_sum
filter: "{{ Dimension('orders__tenant_id') }} = '{{ session.tenant_id }}'"
# Looker — access_filter
explore: orders {
access_filter: {
field: orders.tenant_id
user_attribute: tenant_id
}
}
Step-by-step explanation.
- Alice opens her tenant-A dashboard. Her JWT carries
tenant_id = "tenant_A". - The semantic layer extracts the claim into its security context.
- The cube / semantic_model / explore wraps every emitted SQL with
WHERE tenant_id = 'tenant_A'. There is no path to send a query that skips the predicate. - The warehouse executes the scoped query; only tenant-A rows are returned. The metric
total_revenueis computed over tenant-A rows only. - When Bob from tenant B opens the same dashboard URL, his JWT carries
tenant_id = "tenant_B". The same metric definition resolves to a different SQL — and a different number — without any code change.
Output.
| User | tenant_id claim | SQL emitted | total_revenue |
|---|---|---|---|
| Alice (tenant A) | tenant_A | WHERE tenant_id = 'tenant_A' |
412,300 |
| Bob (tenant B) | tenant_B | WHERE tenant_id = 'tenant_B' |
158,900 |
| Internal data team (no tenant) | — | denied / explicit override | n/a |
Rule of thumb. If your product is multi-tenant, the semantic layer is the only place tenant isolation belongs. Every other location (BI workbook filter, embedded chart query, app-side SDK call) is a cross-tenant data-leak waiting to ship.
Semantic layer interview question on caching and security
A senior interviewer might frame it as: "Your CEO dashboard polls 'DAU by region by day' every minute and runs into Snowflake credit overspend. The same dashboard is also embedded into a customer-facing portal where 200 tenants need their own scoped numbers. Design the semantic-layer caching and RLS strategy."
Solution Using pre-aggregations + tenant-scoped cache keys
cubes:
- name: Events
sql: |
SELECT * FROM events
WHERE tenant_id = '{COMPILE_CONTEXT.securityContext.tenant_id}'
measures:
dau:
type: count_distinct
sql: user_id
dimensions:
event_date:
type: time
sql: event_ts
region:
type: string
sql: region
pre_aggregations:
tenant_dau_region_day:
measures: [dau]
dimensions: [region]
time_dimension: event_date
granularity: day
partition_granularity: month
refresh_key:
every: 5 minute
# tenant_id is added to the cache key automatically because
# it appears in the cube SQL via COMPILE_CONTEXT
Step-by-step trace.
| Step | Caller | Cache key contains tenant_id? | Outcome |
|---|---|---|---|
| 1 | Internal CEO dashboard polls | yes (internal tenant) | hits warm pre-agg, 200ms |
| 2 | Tenant A's portal polls | yes (tenant_A) | hits warm pre-agg for tenant A, 250ms |
| 3 | Tenant B's portal polls | yes (tenant_B) | hits warm pre-agg for tenant B, 250ms |
| 4 | Refresh tick (every 5 min) | per-tenant rebuild | one Snowflake query per active tenant |
| 5 | Cold tenant (no traffic for 1 day) | pre-agg expires | next request rebuilds — 2s once |
| 6 | Audit query for credit usage | total Snowflake spend = N tenants × 12 queries/hour | ~95% reduction vs naive |
Output:
| Metric | Before | After |
|---|---|---|
| Snowflake credits / day | 240 | 14 |
| p95 latency (CEO dashboard) | 6.2s | 0.21s |
| p95 latency (tenant portal) | 4.8s | 0.25s |
| Cross-tenant leak risk | high (per-app plumbing) | zero (layer-enforced) |
Why this works — concept by concept:
- Pre-aggregations as the cost killer — Snowflake credits scale with scanned bytes, not with queries served. Routing every poll to a roll-up table of ~10K rows instead of 10B drops the bytes scanned by 6 orders of magnitude.
-
Per-tenant cache key — the
tenant_idappears in the cube SQL viaCOMPILE_CONTEXT, so Cube partitions the pre-agg storage by tenant. Tenant A's data is physically in a different row set from tenant B's. - Refresh granularity matches SLA — a 5-minute refresh is "fresh enough" for product analytics. Tightening to 1 minute would re-spend Snowflake credits; loosening to 1 hour would break the freshness contract.
- Cold-tenant elasticity — pre-aggs for inactive tenants expire and only get rebuilt on demand. Pay for what you query.
-
Single security predicate — the
WHERE tenant_id = ...line lives once. Every consumer (CEO dashboard, embedded portal, LLM agent) inherits it without writing tenant logic. - Cost — O(active_tenants × refresh_ticks) warehouse queries per day, each scanning O(1 day of events). Independent of consumer poll rate.
SQL
Topic — group-by
GROUP BY and granularity problems (SQL)
3. The three platforms compared
Cube vs dbt Semantic Layer vs LookML — the scoring rubric every analytics-engineering lead should keep at hand
The mental model in one line: Cube is the standalone OSS engine with the widest BI fan-out, the dbt Semantic Layer is the dbt-native option that piggybacks on the model layer you already own, and LookML is the original — tightly coupled to Looker as the consumer. The right pick is mostly a function of which consumers your team must serve, not the model file's syntax.
Cube.dev (formerly Cube.js) — the standalone OSS engine.
- What it is. An open-source semantic engine, written in Node.js + Rust, that publishes metric definitions over REST, GraphQL, and SQL APIs. Started as Cube.js in 2019, rebranded to Cube.dev.
- Strengths. Widest BI fan-out (any tool that speaks SQL or HTTP can consume it). Pre-aggregations are best-in-class. Self-hostable via Docker; managed via Cube Cloud. Excellent for embedded analytics and LLM agents because of the REST / GraphQL surfaces.
- Tradeoffs. Maintains its own model files (cubes) — duplication if you already model in dbt. The OSS edition lacks some governance features that ship in Cube Cloud (lineage, RBAC UI, query history).
-
Data model.
cube→measures,dimensions,joins,segments,pre_aggregations. A cube maps roughly to a fact / dim table.
dbt Semantic Layer (powered by MetricFlow) — the dbt-native option.
-
What it is. Metric definitions that live next to dbt models as
semantic_modelandmetricYAML, compiled by MetricFlow into SQL. Available in dbt Cloud (managed) and dbt Core (CLI / open-source). - Strengths. Lives in your existing dbt repo — same PR review, same CI, same lineage. Best-in-class time-spine and cumulative metrics. Direct integration with Tableau, Hex, Mode, Power BI, Sigma, and Lightdash via the dbt SL JDBC connector.
- Tradeoffs. The premium hosted Semantic Layer is gated to dbt Cloud Team / Enterprise plans (dbt Core has MetricFlow but not the cached server). Smaller embedded / API surface than Cube. AI / agent fan-out is improving but lags Cube's GraphQL story.
-
Data model.
semantic_model→entities,dimensions,measures;metric→ simple / ratio / derived / cumulative. Entities are the primary / foreign key declarations that drive joins.
Looker LookML — the original.
- What it is. The semantic modelling language that ships inside Looker. Mature since 2014; the reference implementation of "metric definitions live next to the BI tool."
- Strengths. Mature governance (Git-integrated workspace, content validation, IDE). Persistent derived tables (PDTs) are battle-tested. Deep integration with Looker's explore experience.
- Tradeoffs. Tightly coupled to Looker as the consumer. Other BI tools cannot natively consume LookML — you would expose Looker's SQL Runner or pipe via API. The license cost scales with Looker user seats, which can dominate the BI budget.
-
Data model.
view→dimensions,measures,filters;explore→joins;model→ packages explores together.
Scoring rubric — five axes that decide the pick.
| Axis | Cube.dev | dbt Semantic Layer | LookML |
|---|---|---|---|
| Openness (consumer fan-out) | ★★★★★ | ★★★★ | ★★ |
| Time-to-value (if no existing model) | ★★★ | ★★★★ (if dbt already in place) | ★★ |
| Cost (TCO) | $ — OSS, Cube Cloud paid | $$ — dbt Cloud Team/Enterprise | $$$ — per Looker seat |
| Embedded / LLM | ★★★★★ | ★★★ | ★ |
| Governance & lineage | ★★★★ (Cube Cloud) | ★★★★★ (lives in dbt repo) | ★★★★★ |
Common interview probes on platform choice.
- "When would you pick Cube over dbt SL?" — when the consumer mix is heavily embedded analytics, LLM agents, or non-dbt BI tools, and when pre-aggregations are the dominant cost saver.
- "When would you pick dbt SL over Cube?" — when the team already lives in a dbt repo and the consumers are mostly Tableau / Power BI / Hex / Mode through the JDBC connector.
- "When would you stay on LookML?" — when Looker is already the standard BI tool, the team values mature governance, and there is no near-term need to serve embedded or AI consumers.
- "What is the migration cost LookML → dbt SL?" — typically rewriting view / explore files as semantic_models, plus carefully porting calculated fields and access filters. Usually staged metric by metric, not as a big-bang.
Worked example — score the three platforms for an embedded analytics SaaS
Detailed explanation. A B2B SaaS company sells an analytics product embedded in customer apps. The consumer mix is: 80% embedded React charts (REST/GraphQL), 15% internal Tableau dashboards, 5% an early LLM agent. They already use dbt for upstream models.
Question. Score the three platforms against this consumer mix and pick the primary semantic layer.
Input — weighted axes.
| Axis | Weight | Cube | dbt SL | LookML |
|---|---|---|---|---|
| Embedded REST/GraphQL | 40% | 5 | 3 | 1 |
| Internal BI (Tableau) | 15% | 4 | 5 | 3 |
| LLM agent | 5% | 5 | 3 | 2 |
| dbt integration | 20% | 3 | 5 | 2 |
| Cost / OSS option | 10% | 5 | 3 | 1 |
| Governance | 10% | 4 | 5 | 5 |
Code.
weights = {
"embedded": 0.40,
"internal_bi": 0.15,
"llm": 0.05,
"dbt": 0.20,
"cost": 0.10,
"governance": 0.10,
}
scores = {
"Cube": {"embedded":5, "internal_bi":4, "llm":5, "dbt":3, "cost":5, "governance":4},
"dbt_SL": {"embedded":3, "internal_bi":5, "llm":3, "dbt":5, "cost":3, "governance":5},
"LookML": {"embedded":1, "internal_bi":3, "llm":2, "dbt":2, "cost":1, "governance":5},
}
weighted = {p: round(sum(s[a]*weights[a] for a in weights), 2) for p, s in scores.items()}
print(weighted)
# {'Cube': 4.4, 'dbt_SL': 3.85, 'LookML': 2.0}
Step-by-step explanation.
- Embedded REST/GraphQL is the dominant axis (40%). Cube's REST + GraphQL APIs score a 5; dbt SL's JDBC + API score a 3; LookML's "no native embedded API" scores a 1.
- Internal Tableau is well served by all three but easiest to wire through dbt SL's JDBC.
- The LLM-agent axis favours Cube because of the GraphQL schema and the published
metaendpoint that lists all cubes / dimensions / measures as a tool surface. - dbt integration favours dbt SL — it lives in the same repo.
- The weighted scores collapse to ~4.4 for Cube, 3.85 for dbt SL, 2.0 for LookML — Cube wins this scenario.
- The team picks Cube as the primary semantic layer. dbt continues to own the model layer (staging / marts), and Cube reads from the dbt marts.
Output.
| Platform | Weighted score | Decision |
|---|---|---|
| Cube.dev | 4.40 | primary |
| dbt SL | 3.85 | alternative if embedded shrinks |
| LookML | 2.00 | not a fit (no Looker consumer) |
Rule of thumb. Score by consumer mix, not by syntax preference. The semantic layer's job is to serve consumers; the file format matters only to the small team of analytics engineers maintaining it.
Worked example — Cube data model in detail
Detailed explanation. A Cube schema is built from cube blocks. Each cube wraps a SQL table (or view), exposes measures and dimensions, declares joins to other cubes, and optionally defines segments (named filters) and pre_aggregations (materialised roll-ups).
Question. Translate a small star schema (orders, customers, regions) into a Cube schema with one revenue measure and one region dimension. Show the joins declared once and re-used.
Input — star schema.
| Table | Role | Key columns |
|---|---|---|
| orders | fact | order_id, customer_id, amount, order_date |
| customers | dim | customer_id, region_id, name |
| regions | dim | region_id, region_name |
Code.
cubes:
- name: Orders
sql_table: analytics.fct_orders
joins:
- name: Customers
relationship: many_to_one
sql: "{Orders}.customer_id = {Customers}.customer_id"
measures:
total_revenue:
sql: amount
type: sum
format: currency
order_count:
type: count
dimensions:
order_id:
sql: order_id
type: number
primary_key: true
order_date:
sql: order_date
type: time
segments:
paid_orders:
sql: "{CUBE}.status = 'paid'"
- name: Customers
sql_table: analytics.dim_customers
joins:
- name: Regions
relationship: many_to_one
sql: "{Customers}.region_id = {Regions}.region_id"
dimensions:
customer_id:
sql: customer_id
type: number
primary_key: true
name:
sql: name
type: string
- name: Regions
sql_table: analytics.dim_regions
dimensions:
region_id:
sql: region_id
type: number
primary_key: true
region_name:
sql: region_name
type: string
Step-by-step explanation.
- Each
cubemaps to a warehouse table.Ordersis the fact;CustomersandRegionsare dims. - The
joinsblock onOrdersdeclares the join toCustomersonce. Thejoinsblock onCustomersdeclares the join toRegionsonce. Cube composes the chain: a query forOrders.total_revenuegrouped byRegions.region_nametraverses both joins automatically. -
measuresdefine aggregations on the cube's table.dimensionsdefine group-by axes;primary_key: truemarks the row identity for the cube. -
segmentsare named filters. A dashboard can ask "give meOrders.total_revenuefor segmentpaid_orders" without re-typing theWHERE status = 'paid'predicate. - Adding a new metric is one new
measureblock. Adding a new dim is one newdimension. Cubes scale linearly with metric count.
Output. A consumer can now ask: "Orders.total_revenue grouped by Regions.region_name filtered to Orders.paid_orders segment, last 30 days." The semantic layer composes the join chain and the segment filter — the consumer writes no SQL.
Rule of thumb. Declare every join once, every segment once, every measure once. The rule "don't repeat the SQL" is what makes the layer pay for itself within the first six metrics.
Worked example — dbt Semantic Layer data model in detail
Detailed explanation. A dbt Semantic Layer schema is built from semantic_models (which define entities, dimensions, and measures on a dbt model) and metrics (which express the business-level KPIs computed from those measures). MetricFlow turns metric requests into SQL.
Question. Re-express the same orders / customers / regions star schema as dbt Semantic Layer YAML. Show the entity-based joins and the simple metric for total revenue.
Input. dbt models fct_orders, dim_customers, dim_regions exist and have unique_key columns.
Code.
# models/semantic/orders.yml
semantic_models:
- name: orders
model: ref('fct_orders')
defaults:
agg_time_dimension: order_date
entities:
- name: order_id
type: primary
- name: customer_id
type: foreign
dimensions:
- name: order_date
type: time
type_params:
time_granularity: day
measures:
- name: revenue_sum
agg: sum
expr: amount
- name: customers
model: ref('dim_customers')
entities:
- name: customer_id
type: primary
- name: region_id
type: foreign
dimensions:
- name: name
type: categorical
- name: regions
model: ref('dim_regions')
entities:
- name: region_id
type: primary
dimensions:
- name: region_name
type: categorical
metrics:
- name: total_revenue
label: Total Revenue
type: simple
type_params:
measure: revenue_sum
Step-by-step explanation.
- Each
semantic_modelwraps an existing dbt model.entitiesdeclare the primary / foreign keys MetricFlow uses to auto-resolve joins. -
customer_idisprimaryincustomersandforeigninorders— MetricFlow knows thatordersjoins tocustomersoncustomer_idwithout an explicitJOIN ... ONblock. - Similarly,
region_idisprimaryinregionsandforeignincustomers— the chainorders → customers → regionsis implicit. -
dimensionsdeclare group-by axes;agg_time_dimensiondefaults the time grain for time-series queries. -
measuresare the aggregation building blocks. Ametricof typesimplewraps a single measure into a named, dashboard-facing KPI. - Other metric types compose more complex KPIs:
ratio(numerator / denominator),derived(an arithmetic expression over other metrics),cumulative(rolling totals with date-spine support).
Output. A consumer can ask: "total_revenue grouped by region__region_name for the last 30 days." MetricFlow resolves the join chain via entities and emits SQL — no JOIN ... ON typed by the consumer.
Rule of thumb. dbt SL's entity model is most powerful when your dbt marts already follow Kimball-style conventions (one primary key per table, foreign keys named consistently). Greenfield dbt + dbt SL projects converge faster than retrofits.
Worked example — LookML data model in detail
Detailed explanation. A LookML schema is built from view files (mapping to tables) and explore files (declaring joins between views). Each view exposes dimensions and measures; each explore lists the join blocks. The Looker UI shells out queries against the explore.
Question. Re-express the same star schema in LookML — a view per table, an explore that joins them, a measure for total revenue.
Input. Looker connection to the same warehouse with fct_orders, dim_customers, dim_regions.
Code.
# views/orders.view.lkml
view: orders {
sql_table_name: analytics.fct_orders ;;
dimension: order_id {
primary_key: yes
type: number
sql: ${TABLE}.order_id ;;
}
dimension_group: order {
type: time
timeframes: [date, week, month, quarter, year]
sql: ${TABLE}.order_date ;;
}
dimension: customer_id {
type: number
sql: ${TABLE}.customer_id ;;
}
measure: total_revenue {
type: sum
sql: ${TABLE}.amount ;;
value_format_name: usd
}
measure: order_count {
type: count
}
}
# views/customers.view.lkml
view: customers {
sql_table_name: analytics.dim_customers ;;
dimension: customer_id {
primary_key: yes
type: number
sql: ${TABLE}.customer_id ;;
}
dimension: region_id {
type: number
sql: ${TABLE}.region_id ;;
}
dimension: name { type: string sql: ${TABLE}.name ;; }
}
# views/regions.view.lkml
view: regions {
sql_table_name: analytics.dim_regions ;;
dimension: region_id {
primary_key: yes
type: number
sql: ${TABLE}.region_id ;;
}
dimension: region_name { type: string sql: ${TABLE}.region_name ;; }
}
# models/sales.model.lkml
explore: orders {
join: customers {
type: left_outer
relationship: many_to_one
sql_on: ${orders.customer_id} = ${customers.customer_id} ;;
}
join: regions {
type: left_outer
relationship: many_to_one
sql_on: ${customers.region_id} = ${regions.region_id} ;;
}
}
Step-by-step explanation.
- Each
viewwraps a warehouse table.dimensionandmeasureblocks declare the columns and aggregations. - The
explore: ordersblock declares the join chain.join: customersjoinsorders → customers;join: regionsjoinscustomers → regions. The relationship hint (many_to_one) lets Looker pick the right SQL form. - The
dimension_groupshortcut auto-generatesorder_date,order_week,order_month, etc. — every common time-grain dimension for free. -
value_format_name: usdformats the measure as currency in the BI surface. - A Looker user opens the
ordersexplore, selectsregions.region_nameandorders.total_revenue, and Looker emits the join chain transparently.
Output. Same number, same governance, same star schema — expressed in LookML files inside a Git-integrated Looker workspace.
Rule of thumb. LookML's per-view structure is more verbose than YAML alternatives but reads beautifully in code review. If Looker is the only BI surface, the verbosity is offset by IDE features (autocomplete, content validation, LookML test runner). When the BI surface fans out beyond Looker, the verbosity becomes a tax.
Semantic layer interview question on platform selection
A senior analytics-engineering interviewer might ask: "You inherit a company on Looker with 200 dashboards. The CTO wants to add an embedded analytics product and an LLM agent within 12 months. Walk me through the semantic layer migration plan."
Solution Using a staged "10 top metrics" migration to Cube alongside Looker
# Stage 1: stand up Cube alongside Looker (Cube reads from the same warehouse marts).
# Stage 2: identify the 10 top metrics by Looker query volume.
# Stage 3: rewrite those 10 metrics in Cube (or dbt SL), reference dbt marts.
# Stage 4: point the new embedded product and LLM agent at Cube.
# Stage 5: dual-publish — Looker continues to serve the 200 dashboards;
# Cube serves the new surfaces.
# Stage 6: as Looker dashboards retire or get rebuilt, port them to Cube one by one.
migration_plan:
parallel_run_months: 12
metric_priority:
- weekly_active_users
- daily_active_users
- total_revenue
- new_signups
- churn_rate
- retention_d7
- retention_d30
- average_order_value
- conversion_rate
- net_promoter_score
consumers:
- embedded_react_charts -> Cube REST
- llm_slack_agent -> Cube GraphQL
- existing_looker -> LookML (untouched)
- new_internal_bi -> Cube SQL API
Step-by-step trace.
| Stage | Action | Risk | Mitigation |
|---|---|---|---|
| 1 | Cube deployed alongside Looker | low | both read same warehouse, no migration |
| 2 | Top-10 metrics catalogued from Looker query history | low | weighted by query volume + dashboard count |
| 3 | Top-10 rewritten in Cube YAML | medium | unit-test each metric against Looker output for 30 days |
| 4 | Embedded product + LLM agent ship on Cube | medium | new code path, no impact on Looker |
| 5 | Dual-publish, no Looker dashboards touched | low | full backwards compatibility |
| 6 | Looker dashboards ported as part of normal roadmap | spread over months | each port is one PR, reviewable |
Output:
| Surface | Pre-migration | Post-migration |
|---|---|---|
| Looker dashboards | 200 | 200 (unchanged) |
| Embedded React charts | 0 | new on Cube |
| LLM Slack agent | 0 | new on Cube |
| New internal BI | n/a | on Cube SQL API |
| Source of truth metrics | LookML (200 metric defs) | Cube (10 top metrics) + LookML (long tail) |
Why this works — concept by concept:
- Parallel run, not big-bang — Looker and Cube co-exist for 12 months. No "stop-the-world" cutover. Risk distributed across the migration window.
- Top-10 metric priority — analytics-engineering effort focuses on the metrics that power the new consumer surfaces. The long tail of 190 dashboards stays on Looker until a natural rebuild.
- Embedded + LLM on Cube — the new surfaces are wired only to the new layer. Their existence does not depend on Looker uptime, license seats, or LookML rewrites.
- Same warehouse marts — both layers read from the dbt-managed marts. The underlying data is one copy; the metric definitions are layered above.
- Unit-test against Looker — for each ported metric, dual-run for 30 days and compare numbers daily. Drift > 0.5% blocks the migration of that metric.
- Cost — incremental dual-platform cost during the 12-month run, offset by the new revenue streams (embedded product, LLM agent) that depend on Cube.
SQL
Topic — joins
JOIN problems for entity resolution (SQL)
4. Defining a metric in each platform
One Weekly Active Users metric, three vocabularies — the side-by-side comparison every interviewer probes
The mental model in one line: the same Weekly Active Users metric is count_distinct user_id over the trailing 7 days, but each platform asks you to spell it differently — Cube's measure type: countDistinct, dbt SL's simple metric over a count_distinct measure, and LookML's measure type: count_distinct. Translating between the three is a syntax exercise once the semantics are clear.
The same metric, three vocabularies.
-
Cube.
cubeexposes acount_distinctmeasure onuser_id. The "weekly" granularity is supplied by the consumer via thetimeDimensionsblock, or pre-baked as asegment. -
dbt SL. A
semantic_modelexposes a measurecount_distinct(user_id). Ametricof typesimplewraps the measure and is queried with agranularity: weektime dimension. -
LookML. A
viewexposes ameasure type: count_distinct. Thedimension_groupauto-generates a*_weekdimension that the explore groups by.
Where joins are declared.
-
Cube. Per-cube
joinsblock:relationship: many_to_one+sql: "{Cube}.k = {Other}.k". -
dbt SL. Per-semantic_model
entities: declareprimaryandforeignentities; MetricFlow infers the join. -
LookML. Per-explore
join:type: left_outer+sql_on: ${a.k} = ${b.k}.
Granularity, time dimensions, and date-spine handling.
-
Cube. Time dimensions support
granularity: day | week | month | quarter | year. Pre-aggregations can be partitioned by month or week for cost control. Date-spine is handled implicitly by the consumer query. -
dbt SL. Best-in-class time spine — declare a
time_spinemodel once insemantic_models.yml, and MetricFlow uses it to fill in zero rows for missing dates in cumulative metrics. Granularities:second | minute | hour | day | week | month | quarter | year. -
LookML.
dimension_groupauto-generates every common timeframe. PDTs can be partitioned by date. No native date-spine — analysts hand-roll a "calendar" view if needed.
Derived and ratio metrics.
-
Cube. Compose by referencing other measures in a derived measure's
sql: e.g.conversion_rate: sql: "{purchases} * 1.0 / NULLIF({sessions}, 0)". -
dbt SL. First-class metric types:
ratio(numerator + denominator) andderived(expression over named metrics). The cleanest of the three for compound KPIs. -
LookML. Compose with
measure type: numberand an expression referencing other measures:${purchases} * 1.0 / NULLIF(${sessions}, 0).
Filters, segments, and parameter inputs.
-
Cube.
segmentsare named filters reusable across queries. Templated parameters via{ FILTER_PARAMS }for runtime injection. -
dbt SL.
filterblocks on metrics for static filters;whereclauses on saved queries for runtime filters. Less templating, more declarative. -
LookML.
filterblocks on views andparameterblocks for runtime input.liquidtemplate language for advanced rewrites.
Caching, materialisation, and roll-ups.
-
Cube.
pre_aggregationsare the headline feature — declare the roll-up grain and refresh schedule; Cube auto-routes matching queries. -
dbt SL. dbt Cloud Semantic Layer cache; saved queries can be persisted as tables via dbt's
materialized: table. -
LookML. PDTs persisted on a schedule via
sql_trigger_valueordatagroup. Most mature of the three but warehouse-coupled.
Versioning a metric definition.
- All three live in Git. Cube has
meta.versionper cube; dbt SL inherits dbt'sversionanddefined_insemantics; LookML has Looker Workspaces (Git-backed branches) with content validation.
Worked example — Weekly Active Users in Cube
Detailed explanation. A Cube schema exposes a weekly_active_users measure as a count_distinct of user_id over a 7-day rolling window. The consumer queries it with a dateRange filter on the time dimension.
Question. Write the Cube definition for weekly_active_users and show the consumer query that returns the WAU per week for the trailing 12 weeks.
Input. An events table with user_id, event_ts, event_name.
Code.
cubes:
- name: Events
sql_table: analytics.fct_events
measures:
active_users:
type: countDistinct
sql: user_id
title: Active users
dimensions:
event_ts:
sql: event_ts
type: time
event_name:
sql: event_name
type: string
user_id:
sql: user_id
type: number
segments:
logged_in:
sql: "{CUBE}.event_name = 'login'"
pre_aggregations:
weekly_active_rollup:
measures: [active_users]
time_dimension: event_ts
granularity: week
partition_granularity: month
refresh_key:
every: 1 hour
// Consumer query — REST/GraphQL/SQL all hit the same definition
{
"measures": ["Events.active_users"],
"timeDimensions": [{
"dimension": "Events.event_ts",
"granularity": "week",
"dateRange": ["2026-03-22", "2026-06-14"]
}]
}
Step-by-step explanation.
- The
active_usersmeasure iscountDistinct(user_id). Cube treats it as a measure usable at any granularity. - The consumer asks for
Events.active_usersgrouped byEvents.event_tsat weekly granularity, over a 12-week date range. - Cube checks pre-aggregations; the
weekly_active_rollupis partitioned by month at weekly granularity, so it matches. The query reads the pre-agg, not the raw fact. - The result is one row per week; "WAU" emerges naturally from
count_distinctat the week grain. No bespoke "WAU" formula is needed because the measure + granularity combination is WAU.
Output (sample).
| week | active_users |
|---|---|
| 2026-03-22 | 18,210 |
| 2026-03-29 | 18,405 |
| 2026-04-05 | 18,890 |
| ... | ... |
| 2026-06-14 | 19,432 |
Rule of thumb. In Cube, the same count_distinct measure can serve as DAU, WAU, MAU depending on the granularity the consumer asks for. Don't define three measures — define one and let the time dimension do the work.
Worked example — Weekly Active Users in dbt Semantic Layer
Detailed explanation. The dbt Semantic Layer defines the same metric as a simple metric over a count_distinct measure on the events semantic model. MetricFlow injects the granularity at query time.
Question. Write the dbt SL YAML for weekly_active_users and the MetricFlow CLI / Python call to return WAU per week for the trailing 12 weeks.
Input. A dbt model fct_events with user_id, event_ts, event_name. A time_spine model is configured.
Code.
# models/semantic/events.yml
semantic_models:
- name: events
model: ref('fct_events')
defaults:
agg_time_dimension: event_ts
entities:
- name: user_id
type: foreign
- name: event_id
type: primary
dimensions:
- name: event_ts
type: time
type_params:
time_granularity: day
- name: event_name
type: categorical
measures:
- name: distinct_users
agg: count_distinct
expr: user_id
metrics:
- name: weekly_active_users
label: Weekly Active Users
type: simple
type_params:
measure: distinct_users
# MetricFlow CLI consumer call
mf query \
--metrics weekly_active_users \
--group-by metric_time__week \
--start-time 2026-03-22 \
--end-time 2026-06-14
Step-by-step explanation.
-
distinct_usersis the measure (the count_distinct LEGO brick).weekly_active_usersis thesimplemetric that wraps it as a dashboard-facing KPI. - The
metric_time__weekgroup-by tells MetricFlow to aggregate the measure at the week grain usingagg_time_dimension: event_ts. - MetricFlow joins to the time spine (declared once in
semantic_models.yml) so weeks with zero activity still appear as zero rows — not as missing rows. - The query returns one row per week. Consumers can be Tableau (via JDBC), Hex (native dbt SL integration), or any tool that speaks the SL API.
Output (sample).
| metric_time__week | weekly_active_users |
|---|---|
| 2026-03-22 | 18,210 |
| 2026-03-29 | 18,405 |
| 2026-04-05 | 18,890 |
| ... | ... |
| 2026-06-14 | 19,432 |
Rule of thumb. When the metric is "count something distinct at a time grain," reach for a simple metric over a count_distinct measure. Reserve ratio, derived, and cumulative for the metrics that genuinely need them.
Worked example — Weekly Active Users in LookML
Detailed explanation. A LookML view exposes a count_distinct measure on user_id. A dimension_group auto-generates event_week. The explore groups by event_week and pivots on active_users.
Question. Write the LookML view + explore for the same WAU metric and the Looker query (or SQL Runner equivalent) for the trailing 12 weeks.
Input. A Looker connection to the same fct_events warehouse table.
Code.
# views/events.view.lkml
view: events {
sql_table_name: analytics.fct_events ;;
dimension: event_id {
primary_key: yes
type: number
sql: ${TABLE}.event_id ;;
}
dimension: user_id {
type: number
sql: ${TABLE}.user_id ;;
hidden: yes
}
dimension_group: event {
type: time
timeframes: [date, week, month, quarter, year]
sql: ${TABLE}.event_ts ;;
}
dimension: event_name {
type: string
sql: ${TABLE}.event_name ;;
}
measure: active_users {
type: count_distinct
sql: ${user_id} ;;
label: "Active users"
}
}
# models/events.model.lkml
explore: events {
description: "Activity events fact for WAU/DAU/MAU"
}
-- SQL emitted by Looker for the trailing 12-week WAU query
SELECT DATE_TRUNC('week', ${TABLE}.event_ts) AS event_week,
COUNT(DISTINCT ${TABLE}.user_id) AS active_users
FROM analytics.fct_events AS events
WHERE ${TABLE}.event_ts >= CURRENT_DATE - INTERVAL '84 day'
GROUP BY 1
ORDER BY 1;
Step-by-step explanation.
- The
dimension_group: eventauto-generatesevent_date,event_week,event_month,event_quarter,event_year. Selectingevent_weekin the Looker UI drives theDATE_TRUNC('week', ...)in the emitted SQL. - The
measure: active_usersiscount_distinct ${user_id}. Looker pairs it with the chosen time grain to produce DAU / WAU / MAU. - The explore is intentionally minimal —
eventsis a single-table fact, so no joins are needed for this metric. Joins tousers,regions, etc. would be added in the same explore. - The Looker UI generates the emitted SQL automatically. Power users can drop to SQL Runner; the explore is the typical surface.
Output (sample, identical to Cube / dbt SL).
| event_week | active_users |
|---|---|
| 2026-03-22 | 18,210 |
| 2026-03-29 | 18,405 |
| 2026-04-05 | 18,890 |
| ... | ... |
| 2026-06-14 | 19,432 |
Rule of thumb. In LookML, the dimension_group is the productivity unlock. Define one time field, get every common grain for free. The verbosity tax is paid up front; the daily authoring tax is small.
Worked example — derived and ratio metrics across all three
Detailed explanation. A "conversion rate" metric is purchases / sessions. Each platform expresses it differently — Cube via a number-typed measure referencing two sum measures, dbt SL via a ratio metric, LookML via a number measure that references two count measures.
Question. Define conversion_rate = purchases / sessions in all three platforms. Show the safe-division pattern (NULLIF on the denominator).
Input. Cubes / semantic_models / views for sessions (with a purchases flag column).
Code.
# Cube — derived measure
cubes:
- name: Sessions
sql_table: analytics.fct_sessions
measures:
sessions_count:
type: count
purchases_count:
type: count
filters:
- sql: "{CUBE}.purchase_flag = true"
conversion_rate:
type: number
sql: "{purchases_count} * 1.0 / NULLIF({sessions_count}, 0)"
format: percent
# dbt SL — first-class ratio metric
metrics:
- name: conversion_rate
type: ratio
type_params:
numerator: purchases_count
denominator: sessions_count
# LookML — number measure with safe division
view: sessions {
measure: sessions_count {
type: count
}
measure: purchases_count {
type: count
filters: [purchase_flag: "yes"]
}
measure: conversion_rate {
type: number
sql: ${purchases_count} * 1.0 / NULLIF(${sessions_count}, 0) ;;
value_format_name: percent_2
}
}
Step-by-step explanation.
- Cube and LookML express the ratio as a derived measure —
purchases / sessionswithNULLIFto protect against zero division. - dbt SL provides a first-class
ratiotype. MetricFlow generates theNULLIF-style protection automatically and ensures both metrics are aggregated at the same granularity before the ratio is computed. - All three return the same number for any given granularity. The dbt SL form is the most concise; the Cube and LookML forms make the safe-division explicit.
- Formatting (
format: percent) lives in the layer so every consumer renders the metric the same way (e.g.12.4%, not0.124).
Output.
| platform | conversion_rate (this week) |
|---|---|
| Cube | 12.4% |
| dbt SL | 12.4% |
| LookML | 12.4% |
Rule of thumb. For first-order ratios, dbt SL's typed metric is the cleanest. For ratios with conditional filters on numerator and denominator (e.g. "conversion rate of paid users"), all three platforms let you wrap the measure in a filter — choose the one whose syntax your team is already fluent in.
Semantic layer interview question on cross-platform metric translation
A senior interviewer might frame: "We are migrating from LookML to dbt SL. Translate this LookML measure into the dbt SL equivalent: a count_distinct user_id filtered to purchase events, grouped at weekly granularity, with a 30-day rolling window option."
Solution Using a dbt SL simple metric plus a cumulative overlay
# models/semantic/events.yml
semantic_models:
- name: events
model: ref('fct_events')
defaults:
agg_time_dimension: event_ts
entities:
- name: event_id
type: primary
- name: user_id
type: foreign
dimensions:
- name: event_ts
type: time
type_params: { time_granularity: day }
- name: event_name
type: categorical
measures:
- name: distinct_purchasers
agg: count_distinct
expr: user_id
agg_time_dimension: event_ts
# measure filter — only purchase events feed this measure
filter: "{{ Dimension('events__event_name') }} = 'purchase'"
metrics:
- name: weekly_purchasers
label: Weekly purchasers
type: simple
type_params:
measure: distinct_purchasers
- name: purchasers_30d_rolling
label: 30-day rolling purchasers
type: cumulative
type_params:
measure: distinct_purchasers
window: 30 days
Step-by-step trace.
| Step | Action | Notes |
|---|---|---|
| 1 | LookML measure count_distinct(user_id) filter purchase → dbt SL measure distinct_purchasers
|
filter expressed in YAML, not Looker liquid |
| 2 | LookML event_week dimension → dbt SL metric_time__week group-by |
granularity injected at query time |
| 3 | New requirement: 30-day rolling window | dbt SL cumulative metric with window: 30 days
|
| 4 | Time spine joined automatically | gaps filled with zero rows |
| 5 | Consumer queries: weekly_purchasers OR purchasers_30d_rolling
|
one schema, two surfaces |
Output:
| metric_time__week | weekly_purchasers | purchasers_30d_rolling |
|---|---|---|
| 2026-05-31 | 4,120 | 14,800 |
| 2026-06-07 | 4,395 | 15,210 |
| 2026-06-14 | 4,602 | 15,690 |
Why this works — concept by concept:
-
Measure-level filter — declaring the
event_name = 'purchase'filter on the measure means every metric built from it inherits the filter. No risk that a downstream metric forgets the filter. -
Metric vs measure separation — measures are LEGO bricks; metrics are the dashboard-facing KPIs. The same
distinct_purchasersmeasure powers both the weekly and the rolling-30-day metric. - Cumulative metric type — MetricFlow handles the rolling-window math (joining each row to a 30-day lookback range) without the analyst hand-rolling a window function.
- Time spine join — gaps in activity become zero rows. Dashboards don't render "missing" weeks; they render "zero" weeks, which is what executives expect.
-
Granularity-agnostic measure —
distinct_purchasersworks at any grain. Query atmetric_time__dayfor DAU-style; query atmetric_time__weekfor WAU-style; query atmetric_time__monthfor MAU-style. One measure, three dashboards. - Cost — one warehouse pass per granularity per refresh; MetricFlow plans the SQL once and reuses across consumers.
SQL
Topic — case-expression
Conditional metric and CASE expression problems (SQL)
5. Consumer fan-out — who queries the semantic layer
One query API, many surfaces — Tableau, Power BI, Hex, Mode, embedded apps, and LLM agents from the same metric file
The mental model in one line: a well-placed semantic layer is consumed by every analytics surface in your stack — BI tools through SQL or JDBC, notebooks through the SQL endpoint, embedded apps through REST or GraphQL, and LLM agents through the published schema as a tool surface. Each consumer gets the same metric definition, the same RLS predicates, and the same cache benefits.
The consumer map in one paragraph.
- Heavy BI tools — Tableau, Power BI, Looker, Sigma, MicroStrategy — speak JDBC / ODBC or a vendor-specific connector.
- Notebooks — Jupyter, Hex, Deepnote, Mode, Databricks notebooks — speak SQL via JDBC or a Python client.
- Embedded analytics — React / Vue / Angular charts inside SaaS products — speak REST or GraphQL.
- LLM agents — Slack bots, custom GPTs, Anthropic / OpenAI tools — speak the semantic layer's schema as a tool surface, then the layer emits SQL.
BI tools.
- Looker. Native LookML consumer. The semantic layer is Looker for Looker.
- Tableau. Reads dbt SL via the dbt Cloud connector. Reads Cube via the Cube SQL API. Reads LookML indirectly via Looker SQL Runner or by exporting an extract.
- Power BI. Reads dbt SL via the dbt Cloud / Tableau-style connector. Reads Cube via SQL API.
- Hex / Mode / Sigma. Best-in-class native dbt SL integrations; also speak Cube SQL API.
Notebooks.
-
Jupyter. A
cube-jupyter-clientPython package or a generic JDBC driver loads metric results into a DataFrame. - Hex. First-class dbt SL integration — drag a metric into a cell, get a DataFrame.
- Deepnote. Speaks SQL over JDBC against the layer.
- Databricks notebooks. Useful for ML feature engineering — read the metric, train on it.
Embedded analytics — Cube's GraphQL/REST surface.
-
REST. A POST to
/cubejs-api/v1/loadwith a JSON query. Returns a JSON result with rows and metadata. -
GraphQL. A
cube(where: ...) { events { weeklyActiveUsers } }query against the published schema. Strongly typed. - JWT auth. Every embedded request carries a JWT signed by the host app. The layer extracts claims (tenant_id, user_id, role) and binds them to RLS predicates.
-
Front-end SDKs.
@cubejs-client/react,@cubejs-client/vuegive you<QueryRenderer>components that take a query and render charts.
AI/LLM agents — text-to-SQL using the semantic layer as the grounding context.
- The agent receives the question: "What was the WAU last week broken down by region?"
- The agent calls a tool that returns the published cubes / measures / dimensions / joins as a structured schema.
- The agent constructs a semantic-layer query (not raw SQL):
Events.active_users grouped by Regions.region_name, last week. - The layer compiles to SQL, runs against the warehouse, returns the result.
- The agent renders the answer in natural language with the metric name and definition cited — auditable.
Cost-control: caching layers, materialised aggregates, query budgets.
- Caching. Per-consumer or global; TTL or invalidation on metric deploy.
- Materialised aggregates. Cube pre-aggs; dbt-materialised saved queries; Looker PDTs.
- Query budgets. Per-tenant or per-consumer rate limits — the embedded React chart cannot accidentally DDoS the warehouse because the layer enforces a poll interval.
Authentication, RLS, tenant isolation across consumers.
- Auth. Each consumer attaches a JWT or service-account token. The layer resolves the identity once.
- RLS. Per-tenant or per-user predicates injected before the warehouse SQL is emitted.
- Tenant isolation. Cache keys partition by tenant claim. No cross-tenant cache pollution.
- Audit. Every query is logged with the calling identity, the metric requested, the SQL emitted, and the result row count.
Migration patterns.
- LookML → dbt SL. Port the 10 top metrics first; keep Looker live for the long tail. Dual-run for 30 days per metric; reconcile numbers daily.
- LookML → Cube. Same staged pattern; Cube becomes the primary for embedded + LLM, Looker keeps the existing dashboards.
- Greenfield — pick by consumer mix. Mostly embedded + LLM → Cube. Already on dbt + mostly Tableau/Hex/Mode → dbt SL. Already on Looker with no near-term embedded plans → stay on LookML.
Worked example — wiring the same metric into Tableau, Hex, and a React app
Detailed explanation. Three consumers want the same weekly_active_users metric. Each speaks a different protocol. The semantic layer's published definitions are the same file; the wiring is per-consumer.
Question. Show the connection / query snippet for Tableau (JDBC), Hex (native SL integration), and a React embedded chart (REST).
Input. A semantic layer with weekly_active_users published. Endpoints: SQL at :13306, REST at /cubejs-api/v1/load, GraphQL at /cubejs-api/v1/graphql.
Code.
-- Tableau / JDBC consumer
SELECT MEASURE(weekly_active_users) AS wau,
event_week
FROM events
GROUP BY event_week
ORDER BY event_week;
# Hex notebook — dbt SL integration (Hex syntax)
import pandas as pd
from hex_sl import query
df = query(
metrics=["weekly_active_users"],
group_by=["metric_time__week"],
start_time="2026-03-22",
end_time="2026-06-14",
)
df.head()
// React embedded chart — Cube REST via @cubejs-client/react
import { CubeProvider, useCubeQuery } from "@cubejs-client/react";
import { Line } from "react-chartjs-2";
function WauChart() {
const { resultSet } = useCubeQuery({
measures: ["Events.active_users"],
timeDimensions: [{
dimension: "Events.event_ts",
granularity: "week",
dateRange: ["2026-03-22", "2026-06-14"],
}],
});
if (!resultSet) return <p>Loading…</p>;
return <Line data={resultSet.chartPivot()} />;
}
Step-by-step explanation.
- Tableau opens a JDBC connection to the layer. The query reads like normal SQL, but the
MEASURE(weekly_active_users)syntax tells the layer to resolve the named metric (rather than treatweekly_active_usersas a column). - Hex's native integration takes Python-style arguments and translates them into the dbt SL query API. The result is a DataFrame ready for further analysis.
- The React component uses the
@cubejs-client/reacthook. The query object is the same JSON shape as a REST request; the component re-renders on result. - All three consumers see the same number because they all read from the same metric definition on the layer.
Output (one number, three surfaces).
| Consumer | Protocol | Code surface | Returned (this week) |
|---|---|---|---|
| Tableau | JDBC SQL | SELECT MEASURE(weekly_active_users) |
19,432 |
| Hex | SL Python | query(metrics=[...]) |
19,432 |
| React chart | REST/JS | useCubeQuery(...) |
19,432 |
Rule of thumb. Pick the consumer protocol that matches the host environment — JDBC for heavy BI, native SDK for first-class notebook integration, REST/GraphQL for embedded. The metric definition is invariant; only the wiring changes.
Worked example — RLS across a B2B SaaS embedded chart
Detailed explanation. A B2B SaaS product embeds a chart of "your active users this week" in every customer's tenant dashboard. The same chart code ships to 200 tenants; each tenant must see only their own number. RLS at the semantic layer makes the front-end code identical for every tenant.
Question. Show the JWT flow, the layer's RLS rewrite, and the resulting SQL for two different tenants.
Input. Two tenants: tenant_A (Alice) and tenant_B (Bob). Both load the same React embedded component.
Code.
// Host app generates a JWT with the tenant_id claim
const token = jwt.sign(
{ tenant_id: currentUser.tenant_id, sub: currentUser.id },
process.env.CUBE_API_SECRET,
{ expiresIn: "10m" }
);
// Front-end attaches the token to every request
<CubeProvider
cubeApi={cubejs(token, { apiUrl: "https://semantic.example.com/cubejs-api/v1" })}
>
<WauChart />
</CubeProvider>
# semantic layer — cube definition with RLS
cubes:
- name: Events
sql: |
SELECT *
FROM analytics.fct_events
WHERE tenant_id = '{COMPILE_CONTEXT.securityContext.tenant_id}'
measures:
active_users:
type: countDistinct
sql: user_id
-- SQL emitted for Alice (tenant_A)
SELECT COUNT(DISTINCT user_id) AS active_users
FROM (SELECT * FROM analytics.fct_events WHERE tenant_id = 'tenant_A') events
WHERE event_ts >= DATE_TRUNC('week', CURRENT_DATE);
-- SQL emitted for Bob (tenant_B)
SELECT COUNT(DISTINCT user_id) AS active_users
FROM (SELECT * FROM analytics.fct_events WHERE tenant_id = 'tenant_B') events
WHERE event_ts >= DATE_TRUNC('week', CURRENT_DATE);
Step-by-step explanation.
- The host app signs a JWT containing
tenant_id. The layer's API secret verifies the signature. - The React component is the same for both tenants — no tenant logic in front-end code.
- The semantic layer's cube SQL templates the
tenant_idfrom the verified JWT into the FROM clause. The warehouse only ever sees scoped data. - The cache key partitions by
tenant_id. Tenant A's cache and tenant B's cache are physically distinct entries. - Adding a third tenant requires zero code changes — just a new JWT with
tenant_id = "tenant_C".
Output.
| Caller | tenant_id | Returned active_users |
|---|---|---|
| Alice | tenant_A | 412 |
| Bob | tenant_B | 158 |
| Internal (no JWT) | — | request denied |
Rule of thumb. Multi-tenant SaaS without semantic-layer RLS is a leak waiting to ship. Front-end tenant logic will be forgotten — at some endpoint, in some refactor — and a customer will see another customer's data. The layer-side predicate is the only one that holds.
Worked example — LLM agent grounding via the semantic layer
Detailed explanation. An LLM agent receives the question "What was WAU last week by region, and how does it compare to the week before?" The agent's tools include cube_meta (returns published cubes / measures / dimensions / joins) and cube_query (executes a structured query). The agent never writes raw SQL.
Question. Walk through the tool calls, the structured query the agent constructs, the layer's compiled SQL, and the natural-language answer.
Input. Cube with Events.active_users measure, Regions.region_name dimension, joined via Events → Customers → Regions.
Code.
// Tool: cube_meta (called once at session start)
{
"cubes": [
{
"name": "Events",
"measures": ["Events.active_users"],
"dimensions": ["Events.event_ts", "Events.event_name"],
"joins": [{"to": "Customers", "on": "customer_id"}]
},
{
"name": "Customers",
"joins": [{"to": "Regions", "on": "region_id"}]
},
{
"name": "Regions",
"dimensions": ["Regions.region_name"]
}
]
}
// Tool: cube_query (called by the agent to answer the question)
{
"measures": ["Events.active_users"],
"dimensions": ["Regions.region_name"],
"timeDimensions": [{
"dimension": "Events.event_ts",
"granularity": "week",
"dateRange": ["2026-06-01", "2026-06-14"]
}]
}
-- The semantic layer compiles to (sketch)
SELECT r.region_name,
DATE_TRUNC('week', e.event_ts) AS event_week,
COUNT(DISTINCT e.user_id) AS active_users
FROM analytics.fct_events e
JOIN analytics.dim_customers c ON e.customer_id = c.customer_id
JOIN analytics.dim_regions r ON c.region_id = r.region_id
WHERE e.event_ts >= '2026-06-01'
AND e.event_ts < '2026-06-15'
GROUP BY r.region_name, DATE_TRUNC('week', e.event_ts)
ORDER BY event_week, r.region_name;
Step-by-step explanation.
- The agent loads
cube_metato see the published schema. It learns: there is anEvents.active_usersmeasure, aRegions.region_namedimension, and an implicit join chain between them. - The agent constructs a structured
cube_query— not SQL — using the measures and dimensions it just learned. The query asks forEvents.active_usersgrouped byRegions.region_nameat weekly granularity over the last 14 days. - The semantic layer compiles the structured query to SQL, including the join chain
Events → Customers → Regions. The agent never wrote a JOIN clause. - The result rows come back per region per week. The agent computes the week-over-week delta in natural language and presents the answer.
- Critically: the agent could not have hallucinated a wrong join. The schema published by
cube_metais the only surface it has access to.
Output (sample).
| region_name | event_week | active_users | wow_delta |
|---|---|---|---|
| EU | 2026-06-01 | 9,210 | n/a |
| EU | 2026-06-08 | 9,440 | +2.5% |
| US | 2026-06-01 | 7,890 | n/a |
| US | 2026-06-08 | 8,150 | +3.3% |
| APAC | 2026-06-01 | 1,210 | n/a |
| APAC | 2026-06-08 | 1,332 | +10.1% |
The agent renders: "WAU last week was 18,922, up 4.0% from 18,310 the prior week. APAC grew fastest at +10.1%." — and cites Events.active_users as the metric.
Rule of thumb. Ground LLM agents on the semantic layer's schema, not on the raw warehouse schema. The constrained surface eliminates the entire category of "hallucinated JOIN" failures and makes every agent response auditable.
Worked example — migration from LookML to dbt SL, staged by metric
Detailed explanation. A company on Looker decides to migrate to dbt SL. Rather than a big-bang rewrite, they identify the 10 top metrics by query volume and rewrite those first. The other 190 LookML measures stay live until they are naturally retired or rebuilt.
Question. Outline the staged migration plan with a 30-day dual-run reconciliation per metric.
Input. 200 LookML measures; top 10 cover 80% of dashboard query volume.
Code.
migration:
approach: staged
parallel_run_days: 30
top_metrics:
- weekly_active_users
- daily_active_users
- total_revenue
- new_signups
- churn_rate
- retention_d7
- retention_d30
- average_order_value
- conversion_rate
- lifetime_value
reconciliation:
cadence: daily
tolerance_pct: 0.5
block_threshold_pct: 1.0
owner: analytics-engineering
Step-by-step explanation.
- Inventory all LookML measures and rank by query volume from Looker's
i__looker.historysystem table. Pick the top 10 — they cover most of the business. - Rewrite each measure as a dbt SL
metric(often asimpleover acount_distinctorsummeasure). For derived measures, useratioorderivedtypes. - Dual-run for 30 days. Every day, a reconciliation job queries both LookML and dbt SL for the same metric, same filters, same granularity. If the numbers differ by > 0.5%, raise a warning; > 1.0%, block the migration of that metric.
- After 30 days of clean reconciliation, mark the dbt SL metric as primary. New dashboards point at dbt SL; the LookML measure stays read-only for one quarter, then is deleted.
- Repeat for the next batch of 10 metrics. The long tail (190 measures) is migrated opportunistically as dashboards are rebuilt.
Output.
| Stage | Metrics migrated | LookML measures live | Dual-run pass rate |
|---|---|---|---|
| Stage 1 (months 1–2) | 10 (top 10) | 200 | 100% |
| Stage 2 (months 3–4) | 30 (next 20) | 170 | 100% |
| Stage 3 (months 5–6) | 80 (next 50) | 120 | 98% |
| Stage 4 (months 7–12) | 200 (all) | 0 | 99% overall |
Rule of thumb. Stage the migration by query volume, not alphabetically and not by team ownership. The top 10 metrics carry the migration's business value; the long tail can sit on LookML for as long as Looker is licensed.
Semantic layer interview question on cross-consumer reliability
A senior interviewer might frame it as: "Your CTO walks in with a complaint: 'The number on the executive dashboard is different from the number in the embedded customer portal and different again from what the Slack bot says.' Walk me through the diagnosis and the fix."
Solution Using semantic-layer consolidation as the audit-and-fix pattern
# Step 1: diagnose — list every place "weekly_active_users" is computed
diagnosis:
exec_dashboard:
source: Looker calculated field
formula: count_distinct user_id where event_ts >= ...
embedded_portal:
source: Tableau workbook
formula: count_distinct user_id where event_name = 'login' and event_ts >= ...
slack_bot:
source: ad-hoc Snowflake query in Python
formula: count_distinct user_id where event_ts >= dateadd(day, -7, current_date)
# Step 2: fix — publish one metric, point every surface at it
metrics:
- name: weekly_active_users
description: Distinct users with any event in the trailing 7 days.
type: simple
type_params:
measure: distinct_users
filter: "{{ TimeDimension('events__event_ts') }} >= dateadd('day', -7, current_date)"
# Step 3: rewire consumers
consumers:
exec_dashboard: semantic_layer_sql_endpoint
embedded_portal: semantic_layer_rest_endpoint
slack_bot: semantic_layer_graphql_endpoint
Step-by-step trace.
| Step | Consumer | Pre-fix number | Why it differed | Post-fix number |
|---|---|---|---|---|
| 1 | Exec dashboard | 19,221 | included all events | 18,922 |
| 2 | Embedded portal | 14,650 | filtered to login only |
18,922 |
| 3 | Slack bot | 18,990 | off-by-one in date window | 18,922 |
| 4 | All three after fix | 18,922 | all reading from same metric | 18,922 |
Output:
| Consumer | Source of weekly_active_users
|
Discrepancy after consolidation |
|---|---|---|
| Exec dashboard | semantic layer SQL endpoint | 0 |
| Embedded portal | semantic layer REST endpoint | 0 |
| Slack LLM bot | semantic layer GraphQL endpoint | 0 |
Why this works — concept by concept:
- Diagnosis first — every "numbers don't match" incident starts as a survey of every place the metric is computed. The semantic layer's value proposition is precisely this consolidation.
-
One file, one definition — the migrated metric lives once in
weekly_active_users.yml. There is no other place to edit it. - Consumer rewiring — each surface stops computing the metric in-tool and starts reading from the layer's endpoint. The viz / chart / bot code shrinks.
- Cache convergence — within one cache TTL of the fix landing, every consumer reports the same number. The "numbers different by consumer" symptom physically cannot reproduce.
-
Audit trail — every future change to the metric flows through a PR. The CTO's "why did this number change?" question is answered by
git log weekly_active_users.yml. - Cost — usually a reduction — three separate consumer-side computations collapse into one cached semantic-layer query.
SQL
Topic — sql
End-to-end SQL practice for analytics engineering
Cheat sheet — semantic layer recipes
- All-in on Looker, no near-term embedded. Stay on LookML, integrate dbt for the upstream model layer, and let Looker continue to be the consumer. The semantic layer is Looker.
- dbt-native, multiple BI tools (Tableau / Power BI / Hex / Mode). Adopt the dbt Semantic Layer + MetricFlow. Definitions live next to dbt models; CI catches metric breaks at PR time.
- Embedded analytics or AI agents are the dominant consumer. Choose Cube.dev for the REST / GraphQL surface. The semantic layer becomes the agent's tool surface.
-
Need pre-aggregated cubes for sub-second BI on billion-row facts. Use Cube's
pre_aggregations— declare the roll-up grain, partition by month, refresh on a TTL. -
Need date spine + cumulative / rolling-window metrics. dbt SL has best-in-class time-spine support; cumulative metrics with
window: 30 daysare first-class. - Need to ground an LLM agent on governed metrics. Publish the semantic layer's schema as the agent's tool surface. The agent constructs structured queries; the layer compiles to SQL.
-
Multi-tenant SaaS embedded analytics. Route every consumer through the semantic layer with JWT-driven RLS. The cube
sqltemplate inlinestenant_idfromsecurityContext. - Migrating off LookML. Stage by query volume — port the top 10 metrics first, dual-run for 30 days, reconcile daily, and let the long tail retire naturally over the next 12 months.
- Picking between Cube and dbt SL. If consumers are 60%+ embedded / API / LLM, lean Cube. If consumers are 60%+ existing BI tools and you already own a dbt repo, lean dbt SL.
-
Defining a metric once. Pick the LEGO-brick measure (
count_distinct,sum,count) and let the platform compose the metric. Avoid hand-rolling separate measures for DAU, WAU, MAU — use one and vary granularity. -
Safe division in derived metrics. Wrap the denominator in
NULLIF(..., 0)(Cube and LookML) or use the typedratiometric (dbt SL). - Cache eviction on metric edit. Make sure the semantic layer flushes the cache on deploy — otherwise consumers see the old number for the duration of the TTL after a definition change.
-
Audit trail. Every metric definition lives in git.
git log <metric.yml>is the answer to "why did this number change?" — for compliance, finance, and the CEO.
Frequently asked questions
Do I need a semantic layer if I use dbt?
You can run dbt without a semantic layer — your marts will be clean and well-tested, and analysts will write SQL on top of them. The semantic layer matters the moment more than one consumer asks for the same metric. Without it, the metric is re-implemented in each BI tool, each notebook, and each embedded chart — and the numbers drift. The dbt Semantic Layer (powered by MetricFlow) is the natural choice if you already use dbt, because metric definitions live in the same repo as your models and flow through the same PR review and CI. If your consumers include embedded analytics or LLM agents, Cube is often the better fit because of its REST/GraphQL surface; in that case you keep dbt for the model layer and put Cube on top.
Can I use the dbt Semantic Layer without dbt Cloud?
Partially. MetricFlow (the engine behind the dbt Semantic Layer) ships as open source and runs from dbt Core via the mf CLI — you can define semantic_models and metrics, validate them, and query metrics locally. What you do not get without dbt Cloud Team or Enterprise is the hosted Semantic Layer server, the caching tier, the JDBC connector for Tableau / Power BI / Hex / Mode, and the official integrations. For most teams that pay for any BI tool today, the dbt Cloud tier pays for itself in eliminated metric-duplication and reduced warehouse spend; for pure OSS shops, MetricFlow + a custom query gateway is a viable path but more work.
Is Cube.dev free?
The Cube Core engine is open source and free to self-host — Docker, your warehouse, your servers. Cube Cloud is the paid managed offering, which adds the IDE, hosted pre-aggregation runners, deployment automation, query history, role-based access control, and lineage. Teams typically start on Cube Core for evaluation and adopt Cube Cloud once metric count grows beyond 50 or embedded analytics SLAs require professional infrastructure. The OSS edition is genuinely usable in production — Cube was the first semantic layer with this OSS-with-paid-managed model and remains the most popular standalone semantic engine.
How does LookML compare to the dbt Semantic Layer?
LookML is the original semantic modelling language and is tightly coupled to Looker as the consumer surface — view and explore files map to the Looker UI, the IDE, and the SQL Runner. The dbt Semantic Layer is consumer-agnostic and lives in your existing dbt repo. LookML wins on IDE polish, content validation, and the fact that Looker users get a fully managed experience. dbt SL wins on portability (Tableau, Power BI, Hex, Mode all consume it natively), on cost (no per-seat Looker license), and on lineage (metrics live next to the models they read from). The most common 2026 migration pattern is staged: dbt SL for new dashboards and the top 10 metrics, LookML for the long tail until naturally retired.
Can LLM agents query a semantic layer?
Yes — and this is the use case that pushed every major BI vendor to ship a semantic-layer story in 2024–2026. The agent reads the published schema (cubes / semantic_models / views, plus measures, dimensions, and joins) as a tool surface, then constructs structured queries against the layer. The layer compiles to SQL. The agent never writes raw SQL itself, which eliminates the "hallucinated JOIN" failure mode that plagues text-to-SQL on raw warehouse schemas. Cube's GraphQL surface and dbt SL's API are both well-suited for this; the LookML route works via Looker's API but with more friction. For LLM-heavy roadmaps, Cube is the most common pick.
What's the difference between a metrics layer and a semantic layer?
The terms overlap heavily in 2026 marketing. A "metrics layer" emphasises the named KPIs (WAU, MAU, revenue, churn, retention) — the dashboard-facing numbers. A "semantic layer" emphasises the broader modelling surface — measures, dimensions, joins, entity relationships, segments — that lets you derive metrics. dbt SL, Cube, and LookML are all semantic layers in this fuller sense; LookML calls itself "semantic modelling," dbt SL talks about "semantic_models," and Cube talks about "cubes." The 2020-era "metrics layer" companies (Transform, Supergrain, Trace, GoodData) either pivoted into full semantic layers or were acquired by larger semantic-layer / BI vendors. Practically, when you read "metrics layer" today, assume "semantic layer" — they refer to the same engineering object.
Practice on PipeCode
- Drill the aggregation practice library → for the
sum/count/count_distinctmeasures that every semantic layer composes metrics from. - Rehearse on joins problems → for the entity-resolution patterns Cube, dbt SL, and LookML automate.
- Sharpen group-by drills → for the granularity reasoning every measure definition leans on.
- Layer the case-expression library → for conditional metric patterns ("count only paid orders").
- Stack the window-functions practice library → for rolling, cumulative, and ranking metrics.
- Layer the filtering library → for the segment / where-clause patterns metric definitions inherit.
- For the broader surface, read top data engineering interview questions →.
- Stack the prerequisites with the only 5 skills you need to become a data engineer →.
- Sharpen the SQL axis with the SQL for data engineering interviews course →.
- For schema craft and the modelling fundamentals every semantic layer leans on, work through the data modelling for data engineering interviews course →.
Pipecode.ai is Leetcode for Data Engineering — every semantic-layer recipe above ships with hands-on practice rooms where you write the `count_distinct` measure, the entity-based join, and the safe-division ratio against real graded inputs. PipeCode pairs every reading with 450+ DE-focused problems and a real-time scoring engine, so you can rehearse the analytics-engineering moves behind Cube, dbt SL, and LookML against the same SQL fundamentals every interviewer probes.





Top comments (0)